Five short links

Photo by Naír la jefa

From Social Data Mining to Forecasting Socio-Economic Crises – An academic manifesto for Big Data. I’m surprised to find myself on the skeptical end of the spectrum, since the promises feel a bit too techno-utopian, but this is a great overview. It does cover ‘problematic issues’ in one section, and hits a lot of my concerns, e.g. that ’08 really was “the first financial crisis sparked by Big Data”.

The Billion Prices Project – A valuable, practical example of what we can do now that data’s cheap and easy to process on a massive scale. I’d love to see how this stacks up against survey-based methods for measuring inflation over the next few years.

The Mendeley Data Challenge – Science was the original pioneer of open data, but business models have ossified and restricted the flow of valuable information. Happily, I learned at SciFoo that there’s a ragtag band of researchers and organizations looking to create a much more flexible ecosystem. Mendeley are making available this tasty usage data on millions of scientific articles, to help innovators create new applications, since nothing else like it is available outside of the big publishers.

Mapping America: Every City, Every Block – Damn you New York Times for making such kick-ass map visualizations! Fast, clean and inspirational for my own work.

Don’t buy that internet company – Big company management is mostly about avoiding risk and blame. It’s very hard for anything outside of the corporation’s original DNA to flourish in that atmosphere, since almost everything an upstart is doing will have no precedent internally, and so hit a wall of opposition.

Brown bag lunches on MapReduce/Hadoop

Photo by Barbie Desoto

At Apple, the engineers regularly organized 'brown bag lunches'. These were informal midday presentations where one of the team stood up and taught everyone a subject or tool they were passionate about. I've been missing that, so as an experiment I spent an hour at Get Satisfaction's offices today, giving their folks an introduction to MapReduce and Hadoop. I talked through when it's useful, and then we all got out our laptops and coded up a simple MapReduce job flow in Python, runnable from the command line and ready to be dropped into Hadoop.
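For anyone curious what that kind of job flow looks like, here's a minimal sketch in the same spirit: a single Python script that works as either the mapper or the reducer, so it can be tested with shell pipes and handed to Hadoop Streaming or Elastic MapReduce unchanged. The word-count task and file name are illustrative, not the exact exercise we worked through.

```python
#!/usr/bin/env python
# wordcount.py - acts as a Streaming mapper or reducer depending on its argument.
import sys

def mapper():
    # Emit "word<TAB>1" for every word on stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

def reducer():
    # Hadoop sorts the mapper output by key, so equal words arrive together
    # and a running total is all we need.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

You can test the whole flow from the shell with 'cat input.txt | python wordcount.py map | sort | python wordcount.py reduce', and the same script plugs into Hadoop Streaming through its -mapper and -reducer options.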

I enjoyed it, and the team seemed to get a lot out of it too, so I've decided to do a few more. It's aimed at any company with Big Data problems and traditional database expertise, and will leave your team with the ability to run simple Hadoop jobs on Amazon's Elastic MapReduce service, or just from the command-line, all using Python.

If this sounds like something you need, and you're in the Bay Area, drop me an email. The deal is that you feed me, give me schwag, and invite me to vacation on your yacht once you're rich and famous. I only have a couple of slots open before I fly off to Europe in the new year though, so do jump in quick.

Visualize your PageRank

Screenshot: the link graph for The Atlantic's website

I've just launched PageRankGraph.com – so what on earth is it?

A few days ago I sat down with the team at Blekko, and as we were geeking out together I realized they'd done something quite revolutionary with their search engine. They'd made the information they use to calculate their rankings public. This may sound arcane, but for the last decade SEO experts have desperately tried to figure out what affects their position in search results, using almost no hard data. Blekko reveals a lot of the raw information that goes into these PageRank-like calculations, especially which sites are linking to any domain, how often, and what the rank of each of those sites is. While the source data and algorithm they're using won't be identical to those of Google and the other search engines, it's close enough to at least draw some rough conclusions.

They provide their own view of this information if you pass in the /seo tag when you search, but I wanted something that told me more about where my Google Juice was coming from. For example, they list Twitter in top place for my site, but I doubt it's helping my ranking very much, since Twitter's high rank will be heavily diluted by how many links it's shared between. Blekko don't offer an API yet, but they did seem relaxed about well-behaved scrapers, so I was able to pull down the HTML of their results pages and extract the information I need. The basic ranking formula is something like

myRank = ((siteARank*#linksFromAToMe)/#linksFromAToEveryone)+((siteBRank*…

I want to know who's making the biggest difference to my site, and Blekko gives me the siteARank and #linksFromAToMe variables I need. The only one that's missing is #linksFromAToEveryone, but they do show the total number of pages on each domain, which works as a rough approximation of the number of links there. Pulling in that information for the top 25 sites they list then gives me enough data to calculate which of those are making the greatest contribution.
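Here's that calculation as a minimal Python sketch, with invented numbers standing in for the scraped data; in practice the rank, link count and page count come from Blekko's /seo results pages.

```python
# Estimate how much each linking site contributes to my rank, following the
# formula above, with the domain's total page count approximating the
# "links from that site to everyone" denominator. The sample data is made up.
def link_contributions(linking_sites):
    scored = []
    for site in linking_sites:
        contribution = (site["rank"] * site["links_to_me"]) / float(max(site["total_pages"], 1))
        scored.append((site["domain"], contribution))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

sites = [
    {"domain": "twitter.com", "rank": 9.0, "links_to_me": 120, "total_pages": 50000000},
    {"domain": "oreilly.com", "rank": 7.5, "links_to_me": 30, "total_pages": 200000},
    {"domain": "example.org", "rank": 4.0, "links_to_me": 5, "total_pages": 1500},
]
for domain, score in link_contributions(sites):
    print("%s\t%.6f" % (domain, score))
```

Even with toy numbers you can see the Twitter effect described above: a huge site's high rank gets spread across so many outbound links that a small, well-ranked site linking to you a handful of times can matter more.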

Just go to PageRankGraph.com, enter the URL of the site you're interested in, wait a few seconds and you'll see a network graph showing the web of links that are contributing to that site's prominence in search results.

So, why is this interesting? Take a look at the graph for the Atlantic website above. The White House website looks like it's making a significant contribution to the magazine's ranking in the search results. I'm actually a long-time subscriber, but there's something a little odd about a news outlet's prominence on the internet being affected by the government. Fox News gets much less of a boost, though intriguingly the EPA shows up as a contributor. Should all government sites use rel="nofollow" for external URLs? Don't grab your pitchforks and march to Washington just yet; this data is only suggestive and Google may be doing things differently to Blekko in this case, but it's the sort of question that we couldn't get any good data on before now.

Just think of the recent New York Times article on DecorMyEyes.com – if they'd been able to research where the links were actually coming from, they could have built a much more detailed picture of what was actually happening to drive the site up the rankings.

The code itself is fully open-source on GitHub, including a jQuery HTML5/Canvas network graph rendering plugin that I had fun hacking together. Let me know if you come up with any interesting uses for it, and I'd love ideas on how to improve the quality of the information.

[Update – it looks like there are some major differences between the way Blekko handles nofollow links and the way Google and the other search engines do. This will mean larger differences in the results than I'd hoped for, so take the numbers with an even bigger pinch of salt.]

Five short links

Photo by Leo Reynolds

The biology of sloppy code – "A good chunk of code running on servers right now exists just to translate the grunts of one software animal to the chirps that another understands". A compelling analysis of why we can get away with writing extremely inefficient code on modern systems.

What does compound interest have to do with evolving APIs? – In contrast, Kavika demonstrates how the costs of coding shortcuts explode as the technical debt mounts. One pattern that makes it worse is the accumulation of ever-more requirements as the system evolves. Part of Steve Jobs' genius is his ruthlessness in dropping support for older software and hardware. It's not much fun as a user, but it helps Apple do a lot with comparatively few engineers.

Gold Farming – An in-depth analysis of the transaction patterns and network characteristics of gold farmers in EQ2. There are some interesting similarities to other clandestine groups like drug dealers.

Infinite City – I'm eagerly anticipating the arrival of this 'atlas of the imagination' for my new home. Thanks to Drew for the recommendation.

Whites more upwardly mobile than blacks – A disturbing snippet of the state of income mobility. I'm left wanting to dig into the actual data behind it though, hopefully the upcoming site they mention will follow the World Bank's lead and have full data sets available for download.

Invites Done Right

Photo by Zaknitwij

Tonight I launched InvitesDoneRight. Here's why.

Despite being pulled away from my original focus on email, I'm still obsessed by how much valuable information is sitting neglected in our inboxes. Now that both Yahoo and Gmail support OAuth, I decided to release an application that's been on my mind for years.

If you're running a consumer web service, one of your most important distribution channels is your users sharing with their friends. Unfortunately there's never been an easy way to encourage this. Facebook might seem promising, but for good reason the service has made it hard for applications to broadcast indiscriminately to their users' social networks. Many services use contact importing, but address books are both notoriously incomplete and full of people you met once at a trade show. Without extra information, you're stuck presenting the user with a space-shuttle control panel full of checkboxes, and asking them to wade through and figure out who to send invites to. If you get it wrong, not only are your invites ineffective, but they'll be marked as spam by the recipients, making it very hard to reach even your existing users!

What's the answer? I think it's getting users' permission to scan message headers and pull out a shortlist of people they actually exchange emails with. The user gets a nice experience, with only a few people to pick from. The web service gets a better-targeted set of recipients, which means higher conversions and fewer spam reports.
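As a rough sketch of the idea, here's how you might build that shortlist from headers alone using Python's standard imaplib; the mailbox name, message window and scoring are assumptions for illustration, and the real service talks to Gmail and Yahoo over OAuth rather than taking a password.

```python
# Count the addresses on the To/Cc lines of recently sent mail: people you
# write to are better invite candidates than people who merely write to you.
# Only headers are fetched, never message bodies.
import imaplib
import email
from email.utils import getaddresses
from collections import Counter

def recent_contacts(user, password, limit=10):
    conn = imaplib.IMAP4_SSL("imap.gmail.com")
    conn.login(user, password)
    conn.select('"[Gmail]/Sent Mail"', readonly=True)   # mailbox name varies by provider
    _, data = conn.search(None, "ALL")
    recent_ids = data[0].split()[-500:]                 # only look at recent messages
    counts = Counter()
    for msg_id in recent_ids:
        _, parts = conn.fetch(msg_id, "(BODY.PEEK[HEADER.FIELDS (TO CC)])")
        headers = email.message_from_bytes(parts[0][1])
        for _, address in getaddresses(headers.get_all("To", []) + headers.get_all("Cc", [])):
            if address:
                counts[address.lower()] += 1
    conn.logout()
    return [address for address, _ in counts.most_common(limit)]
```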

To implement this approach I've just launched the InvitesDoneRight service. If you have a website that signs up new users, just add an extra step that directs them to the service, I'll ask them for permission to figure out a shortlist of contacts, and then I'll call back a URL you provide with ten contacts that they've been in touch with recently. It couldn't be much simpler to integrate.
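For the site-side half, the integration can be as small as one callback handler. This is a hypothetical sketch: the payload format and field names are assumptions rather than documented parameters, and Flask is used only to keep it short.

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/invites/callback", methods=["POST"])
def invites_callback():
    # InvitesDoneRight posts back the shortlist of contacts for the new user;
    # here we simply queue an invite email to each address.
    data = request.get_json(force=True)
    for address in data.get("contacts", []):    # field names are assumptions
        send_invite(data.get("user_id"), address)
    return "ok"

def send_invite(user_id, address):
    print("queueing invite from user %s to %s" % (user_id, address))

if __name__ == "__main__":
    app.run()
```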

What about privacy? I'm still thinking hard about how this service could be abused, but I'm rigorous in removing all user data the instant they leave my site. It acts purely as a middle-man between the external service and the mail provider, and only passes the short contact list back to that external service. No other information is passed, stored, or shared anywhere. No email content at all is fetched, just who the recipients are.

I think this could be a powerful tool for websites, as well as improving users' experiences, but this launch is an experiment. Is it too creepy? Are there problems I'm missing that will render it ineffective? Tell me what you think in the comments, or email me.

Why user permissions don’t work

Photo by Andrew Currie

Tom Scott is shocked [Bad way of putting it, sorry, see the comments] that OAuth granted permission to applications to read users' direct messages. That's obvious to me as a developer. Everything else a non-private user does on the service is freely available through the search API and streams, so there's no need to ask for permission. The important point is that Tom's confusion shows how little even sophisticated users understand Twitter's comparatively simple security model. What made my heart sink was his suggestion that Twitter go down the Facebook path and offer more fine-grained permissions.

If users don't have the time to understand the current model, does he really think they'll spend time tweaking a set of checkboxes? When Internet Explorer throws up dialogs complaining about a mix of secure and insecure content on a page, does any user know what that means? The Facebook/Microsoft approach is a bunch of legalistic ass-covering by people keen to avoid blame, not a good way of getting informed consent from users.

Just look at this exhaustive study of the effect of contract disclosures on internet users' habits. Hardly anybody reads them, and the few who do don't change their behavior at all! The prevailing model of user security permissions hits exactly the same problem. We've trained our users to click through screens of gobbledegook without reading or caring.

So what's the answer? No checkboxes. No space-shuttle control panel of permissions. The only model that people understand is completely binary: public or private, open or closed. Look back to the phone book: everyone understood going ex-directory, and that's all you needed to know. In Twitter's case, this means redesigning the OAuth process so that it's just a 'give me access to your DMs' dialog, since for almost all users that's what it really means. It's not a technology problem, it's a user-experience one.

Messy beats tidy

Painting by Mark Chadwick

I was pleased to see James Clark admitting JSON has overtaken XML, at least for the web world, and William Vambenepe pointing out that RPC over HTTP is giving True REST a good kicking. They both imposed a lot of up-front demands on developers and promised a lot of automated benefits once the world caught up. Neither delivered.

They failed for web developers because they require planning and design to use effectively. Most of us barely have an idea of who our customers are when we start a project, let alone a strong set of requirements. The strength of our world is that our systems are malleable enough that we can make it up as we go along. I spent most of my career in embedded systems and on the desktop, and it was an absolute delight to discover how simple it is to produce workable non-trivial applications on the web.

XML asks you to figure out a schema before you start. Proper REST APIs require providers to plan a set of URL locations and verbs to apply to those resources, and clients to figure out their conventions. Enthusiasts counter that you should be doing this sort of planning anyway to build a robust system, but that's a preference, not a law of nature. Doing a shoddy job quickly and learning from it often beats long development cycles.

Technologies like JSON, or APIs using HTTP for transport but with a domain-specific payload, fit into this chaotic development model. There's a similar dynamic with Hadoop: one of its biggest advantages is that you never have to pick a database schema; you just run jobs on the raw source files like logs or other data dumps.
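To make the contrast concrete, here's the kind of throwaway call that style encourages, sketched against an invented endpoint: no schema and no resource modeling decided up front, just a domain-specific JSON payload posted over HTTP.

```python
# A minimal sketch of "just post some JSON over HTTP". The endpoint and field
# names are invented for illustration; nothing about them was designed in advance.
import json
from urllib import request

payload = {"action": "resize_image", "source": "http://example.com/cat.jpg", "width": 200}
req = request.Request(
    "http://api.example.com/do",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with request.urlopen(req) as response:
    print(json.loads(response.read()))
```

If tomorrow the payload needs an extra field, you just add it; there's no schema to update and redistribute first.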

The painful thing for any adult supervisor watching this sort of development is that they know how much technical debt is being accumulated. The reason it makes sense is that the debt almost never comes due. We re-write whole systems as the requirements change out from under us, projects fail (almost never for technical reasons), or we switch to a well-designed open source framework someone else has built. It's no free ride; you can see how hard it was for Twitter to retrofit their systems to cope with massive scaling, but the successful startups have been the ones able to move fast. We're in an era where punk rockers beat chamber orchestras.

Five short links

Photo by Elod Horvath

An academic wanders into Washington D.C. – Arvind went to attend an IETF workshop on privacy, presenting a one-pager we co-authored (though his contribution was far greater than mine). His experiences of dealing with government people ring very true, and I think he's absolutely right to advocate we all get more involved. As hackers we tend to assume we can code our way around government restrictions, but as everything from Napster to Wikileaks shows, if they get riled enough they can shut services down very effectively.

Glu – LinkedIn have open-sourced their application deployment framework. Not much material here for a sexy demo, but this addresses a problem I see a lot of data companies with large clusters struggling with. A battle-hardened system like this should be a big time-saver for the whole community.

The history of selling personal data – We all know somebody's making money from our personal information, so why not cut out the middle-man and sell it direct? A great run-down of some experiments in this area, including a tantalizing tale of a UK local council contemplating selling data and offering a tax cut in return (sadly with a dead link to the story).

Why Rosetta Stone's attack on Google's keyword advertising system should be rejected – This is one of the reasons technologists should be more involved with the world of government. Rosetta is attempting to prevent Google from showing competitors' ads when users search for "Rosetta Stone". I have issues with how much power Google wields in the search space, but this is an obvious attempt by the language software company to limit user choice for their own benefit. Trademarks exist to prevent confusion; nobody will be tricked into buying a competing product just because their ads appear in this context.

A concise and brilliant peer-reviewed article on writer's block – I've never suffered from writer's block, presumably because my standards are so lax.

“Strata Rejects” lightning talks

Photo by Jennifer Rouse

I've met several people in the last few days who had interesting-sounding proposals that didn't make it through the review process for the Strata conference. After the last conversation, I was struck by the idea of an unofficial get-together the evening before the main conference. So (with absolutely no implied connection to Strata®!) I'm running a 'Rejects' series of lightning talks on Monday. Anyone who had a talk rejected, or didn't get it in before the deadline, gets five minutes to give the PDQ version.

It will be starting at 7pm on Monday January 31st, in an undisclosed location near the hotel in Santa Clara. Munchies, beer and transport will be provided. Email me with your name, and talk title if you'd like to speak, and I'll get back to you with the full details. Bonus points will be given for starting with 'They laughed at my ideas, but I'll show those fools, muhahahahah!'

Map your leads with ForceMapper

I've had a lot of sales professionals using OpenHeatMap to visualize their customers, and they've often asked if I could make it even easier. It doesn't get much simpler than an integrated Salesforce version, so just in time for Dreamforce, here's an early preview of ForceMapper:

https://forcemapper.com/

Just log in with your Salesforce ID and you'll get a dashboard showing your leads and accounts by state, country and city. It's only available for Enterprise-level Salesforce users currently, but once it's been accepted into AppExchange it should be usable by Partner customers too. We're pleased to offer this completely free for the next 30 days, and early users will be rewarded for their help once the premium version is rolled out.

It's not just a visualization tool – load the site up on your iPhone and it will use your current location to suggest nearby customers to visit, complete with directions.

Have questions? Call us free on 1 800 408 6046