Five short links

Picture by Mike Lay

So THIS Is How Bloomberg Gets Earnings Reports Hours Before They’re Publicly Released… – Remember, kids: anything you post to a web server is accessible, even before you link to it. I’m still aghast that Apache’s default behavior is to serve up directory listings for folders with no index, which makes this sort of thing even easier.
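
If you're running Apache yourself, the usual fix is a one-line configuration change. Here's a minimal sketch, assuming a stock setup with /var/www/html as the document root:

# Turn off Apache's automatic directory listings for folders that
# have no index file. /var/www/html is an assumed document root;
# the Options line on its own also works in an .htaccess file,
# as long as overrides are allowed there.
<Directory /var/www/html>
    Options -Indexes
</Directory>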

Law and the Multiverse – Real lawyers deal with the implications of imaginary superheroes. The beauty of this site is how much depth and rigor the participants bring to the problems.

MapEveryBit – Early-stage but interesting tool for mapping your social network across Twitter and Facebook.

Visualizations to show causality – I like this exploration of graphing techniques to explain cause-and-effect. Animated and interactive graphs can be a lot more than toys or eye-candy.

Lenana – Tales from the front lines of education in Tanzania. The statistics are powerful (only a third of the kids are in school now, but that’s up from just 12% a few years ago), but I was really affected by the stories of kids like Nawasa. The practical steps they’re taking to solve their problems are heartening, like the Empowered Girls Club they’ve set up.

Bad writing is good


Last night I ended up at Shotwell's with Mike Melanson, and we spent quite a lot of our time talking about journalism. He's a professionally trained reporter with a master's degree, but the sheer pace of blogging at ReadWriteWeb means a lot of that education is not directly applicable. I'm not saying his standards have lowered, but producing twenty articles a week requires a whole different approach to writing than a traditional US newspaper article.

The discussion reminded me of an article defending Stieg Larsson's books against the critics complaining that they've crowded out high literature. Laura Miller makes the point that literary books are demanding to read; they require us to put in effort to understand unfamiliar ways of expressing ideas and emotions. That effort is rewarded by the revelations and sense of wonder you can only get from challenging works, but sometimes we don't have the energy left to tackle a tough read. Bad writing is often more enjoyable, because clichés, genre conventions and predictable plots all help a book 'flow' more smoothly. They demand a lot less from the reader. That got me thinking about how I approach producing five or six posts a week.

American journalism is built on the assumption that reporters are providing a public service, and the top priority is communicating important truths to their readership. In turn, the readers are expected to be engaged and curious, willing to put in some effort to understand a complex story. This is a worthy goal, but leads to some painfully dry writing.

In contrast, the only value British newspapers hold sacred is entertainment. Even the serious newspapers go out of their way to avoid boring their readers, and the tabloids are full-blown three-ring circuses of populism, happy to publish blatant lies or fan prejudice in the pursuit of higher circulation numbers. I'm sure that sounds like a nightmare to American reporters, but somehow it works, producing a better-informed readership than the US model.

That background leaves me very comfortable with the blogging approach to news. We still need traditional in-depth newspaper articles, but the popularity of blog-like news sites, with their off-the-cuff writing styles, liberal use of clichés and willingness to publish before all the facts are in, shows that there was an unmet demand for digestible stories. I'm not saying we should emulate the dark side of the British tabloids, but we need to understand that journalism is writing for a purpose, and it sometimes requires embracing the tools that bad writers rely on.

Don't expect the public to read you because what you're writing is important; grab them by the throat using every cheap trick at your disposal, from sensational teaser headlines to hyperbole and synthesized conflict within the article. If the story is worth telling, you'll be doing more good than harm by reaching more readers.

Correlation, Causation and Thor’s Raincoat


My dog Thor hates getting wet, but even when there's rain lashing against the windows he still starts off dancing in circles when it's time for his walk. It's only when I pull out his yellow rain jacket that he slumps and stares at me mournfully. He seems convinced that if I just left the jacket off, the rain would go away.

Much as I try to convince him of the error in his logic, he's unmoved, and it's hard to blame him. Humans will happily swallow studies that use the weasel word 'link' to claim that something associated with an outcome is its cause. Does obesity spread through your friendships? No, you just share the same risk factors as your friends.

As the Big Data revolution gives us more and more data to play with, we'll find many more suggestive correlations like these appearing. Our whole mental architecture is about seeing meaningful patterns, even if we're staring at random phenomena like clouds in the sky. How much these mirages matter depends on how we want to apply them:

Prediction

Sometimes you don't care whether something causes an outcome, you just want some early warning of what that outcome will be. Thor knows it will be wet when he sees the raincoat. The main danger is that the two variables aren't actually dependent in any way; they just happen to have been moving in an apparently synchronized way recently. The more variables you have to compare, the more likely these sorts of false correlations are, so expect a lot of them with Big Data.
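
To see how cheaply these mirages can be manufactured, here's a minimal Python sketch (all the names and sizes are just illustrative): it generates thousands of purely random series, hunts for the one that best tracks an equally random 'outcome', and reliably turns up a strong-looking correlation that means nothing.

import random

def correlation(xs, ys):
    # Plain Pearson correlation coefficient.
    n = float(len(xs))
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / ((var_x * var_y) ** 0.5)

random.seed(42)
outcome = [random.gauss(0, 1) for _ in range(20)]

# 5,000 candidate variables, all pure noise, with no connection
# to the outcome at all.
best_r, best_index = 0.0, None
for i in range(5000):
    candidate = [random.gauss(0, 1) for _ in range(20)]
    r = correlation(candidate, outcome)
    if abs(r) > abs(best_r):
        best_r, best_index = r, i

print('Best correlation found: %.2f (variable %d)' % (best_r, best_index))
# With only 20 samples to match, expect a coefficient above 0.7,
# 'strong' by most standards and produced by nothing but chance.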

If you're going to rely on a correlation to predict outcomes, you need at least a plausible story for the mechanism behind the correlation, and ideally multiple independent data sets that back it up.

Reaction

If I notice I'm struggling to get Thor out of the front door, then maybe I'll hide the rain jacket until we're in the porch. Thor's resistance means that he can no longer use the raincoat as a reliable signal of rain. The only reason to make a prediction is to take some action, and those actions may destroy the correlation. This is a painfully common problem in economics, and is usually expressed as Goodhart's Law: "once a social or economic indicator or other surrogate measure is made a target for the purpose of conducting social or economic policy, then it will lose the information content that would qualify it to play such a role".

This means that even if you've found a correlation with predictive power, you have to constantly measure its effectiveness, since the very act of relying on it as a guide may degrade its usefulness.

Control

I half-expect to get up one morning and discover that Thor's eaten the raincoat, in the hope of bringing back the sun. Once we notice a correlation, it's easy to convince ourselves and others that it's actually causing the outcome. Humans love stories, and stories have their own rules. If X happens, followed by Y, narrative logic requires that X caused Y. That makes it simple to persuade other people that they're seeing cause and effect rather than correlation, without really having to prove it.

The only reliable way I know to figure out whether you really can affect outcomes is by experiment, so before you put time or money behind an attempt, require a prototype. If the guy or girl trying to persuade you to take action can't show a small-scale proof-of-concept, then they might as well be trying to sell you a bridge in Brooklyn. Even if they show compelling results, the Hawthorne Effect may be kicking in, but at least you've got some weak proof.

If you want to be effective in a world awash with data, it pays to be skeptical of correlations, since you'll be seeing a lot more of them over the next few years.

Five short links

Photo by Naír la jefa

From Social Data Mining to Forecasting Socio-Economic Crises – An academic manifesto for Big Data. I’m surprised to find myself on the skeptical end of the spectrum, since the promises feel a bit too techno-utopian, but this is a great overview. It does cover ‘problematic issues’ in one section, and hits a lot of my concerns, e.g. that ’08 really was “the first financial crisis sparked by Big Data”.

The Billion Prices Project – A valuable, practical example of what we can do now that data’s cheap and easy to process on a massive scale. I’d love to see how this stacks up against survey-based methods for measuring inflation over the next few years.

The Mendeley Data Challenge – Science was the original pioneer of open data, but business models have ossified and restricted the flow of valuable information. Happily, I learned at SciFoo that there’s a ragtag band of researchers and organizations looking to create a much more flexible ecosystem. Mendeley are making available this tasty usage data on millions of scientific articles, to help innovators create new applications, since nothing else like it is available outside of the big publishers.

Mapping America: Every City, Every Block – Damn you New York Times for making such kick-ass map visualizations! Fast, clean and inspirational for my own work.

Don’t buy that internet company – Big company management is mostly about avoiding risk and blame. It’s very hard for anything outside of the corporation’s original DNA to flourish in that atmosphere, since almost everything an upstart is doing will have no precedent internally, and so hit a wall of opposition.

Brown bag lunches on MapReduce/Hadoop

Photo by Barbie Desoto

At Apple, the engineers regularly organized 'brown bag lunches'. These were informal midday presentations where one of the team stood up and taught everyone about a subject or tool they were passionate about. I've been missing that, so as an experiment I spent an hour at Get Satisfaction's offices today, giving their folks an introduction to MapReduce and Hadoop. I talked through when it's useful, and then we all got out our laptops and coded up a simple MapReduce job flow in Python, runnable from the command line and ready to be dropped into Hadoop.
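
For anyone curious what that looks like, here's a minimal word-count sketch in the same spirit (not the exact code from the session). It follows the Hadoop Streaming convention: both phases read lines on stdin and write tab-separated key/value pairs on stdout, which is why the same script runs from the shell or inside Hadoop.

#!/usr/bin/env python
# wordcount.py - runs the map or the reduce phase depending on the
# first command-line argument. To test locally, let a shell sort
# stand in for Hadoop's shuffle phase:
#   cat input.txt | python wordcount.py map | sort | python wordcount.py reduce
import sys

def mapper():
    # Emit a count of 1 for every word we see.
    for line in sys.stdin:
        for word in line.strip().split():
            print('%s\t1' % word.lower())

def reducer():
    # Input arrives sorted by key, so each word's counts form a
    # contiguous run that we can total up as we stream through.
    current_word, total = None, 0
    for line in sys.stdin:
        word, count = line.strip().split('\t')
        if word != current_word and current_word is not None:
            print('%s\t%d' % (current_word, total))
            total = 0
        current_word = word
        total += int(count)
    if current_word is not None:
        print('%s\t%d' % (current_word, total))

if __name__ == '__main__':
    if sys.argv[1] == 'map':
        mapper()
    else:
        reducer()

The same two commands then become the mapper and reducer of a streaming step when you hand the script to Hadoop.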

I enjoyed it, and the team seemed to get a lot out of it too, so I've decided to do a few more. It's aimed at any company with Big Data problems and traditional database expertise, and will leave your team with the ability to run simple Hadoop jobs on Amazon's Elastic MapReduce service, or just from the command line, all using Python.

If this sounds like something you need, and you're in the Bay Area, drop me an email. The deal is that you feed me, give me schwag, and invite me to vacation on your yacht once you're rich and famous. I only have a couple of slots open before I fly off to Europe in the new year though, so do jump in quick.

Visualize your PageRank

Screenshot: the PageRankGraph.com network graph for The Atlantic

I've just launched PageRankGraph.com – so what on earth is it?

A few days ago I sat down with the team at Blekko, and as we were geeking out together I realized they'd done something quite revolutionary with their search engine. They'd made the information they use to calculate their rankings public. This may sound arcane, but for the last decade SEO experts have desperately tried to figure out what affects their position in search results, using almost no hard data. Blekko reveals a lot of the raw information that goes into these PageRank-like calculations, especially which sites are linking to any domain, how often, and what the rank of each of those sites is. While the source data and algorithm they're using won't be identical to what Google and the other search engines use, it's close enough to at least draw some rough conclusions.

They provide their own view of this information if you pass in the /seo tag when you search, but I wanted something that told me more about where my Google Juice was coming from. For example, they list Twitter in top place for my site, but I doubt it's helping my ranking very much, since Twitter's high rank will be heavily diluted by how many links it's shared between. Blekko don't offer an API yet, but they did seem relaxed about well-behaved scrapers, so I was able to pull down the HTML of their results pages and extract the information I need. The basic ranking formula is something like:

myRank = ((siteARank * #linksFromAToMe) / #linksFromAToEveryone)
       + ((siteBRank * #linksFromBToMe) / #linksFromBToEveryone)
       + …

I want to know who's making the biggest difference to my site, and Blekko gives me the siteARank and #linksFromAToMe variables I need. The only one that's missing is #linksFromAToEveryone, but they do show the total number of pages on each domain, which works as a rough approximation of the number of links there. Pulling in that information for the top 25 sites they list then gives me enough data to calculate which of those are making the greatest contribution.
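
As a minimal sketch of that calculation (the field names and sample numbers here are hypothetical stand-ins for the scraped values):

# Work out how much each inbound domain contributes to my rank.
def rank_contributions(linking_sites):
    contributions = {}
    for site in linking_sites:
        # A site's rank is diluted across everything it links to, so
        # weight it by the fraction of its links that point at me,
        # using total page count as a stand-in for total link count.
        share = float(site['links_to_me']) / site['total_pages']
        contributions[site['domain']] = site['rank'] * share
    return contributions

# Hypothetical scraped values: a huge high-rank site can end up
# contributing less than a small one, because its rank is shared
# across so many pages.
linking_sites = [
    {'domain': 'twitter.com', 'rank': 9.1, 'links_to_me': 120, 'total_pages': 5000000},
    {'domain': 'smallblog.example.com', 'rank': 4.2, 'links_to_me': 15, 'total_pages': 800},
]
for domain, score in sorted(rank_contributions(linking_sites).items(),
                            key=lambda pair: -pair[1]):
    print('%s: %.6f' % (domain, score))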

Just go to PageRankGraph.com, enter the URL of the site you're interested in, wait a few seconds and you'll see a network graph showing the web of links that are contributing to that site's prominence in search results.

So, why is this interesting? Take a look at the graph for the Atlantic website above. The White House website looks like it's making a significant contribution to the magazine's ranking in the search results. I'm actually a long-time subscriber, but there's something a little odd about a news outlet's prominence on the internet being affected by the government. Fox News gets much less of a boost, though intriguingly the EPA shows up as a contributor. Should all government sites use rel="nofollow" for external URLs? Don't grab your pitchforks and march on Washington just yet; this data is only suggestive, and Google may be doing things differently to Blekko in this case, but it's the sort of question we couldn't get any good data on before now.

Just think of the recent New York Times article on DecorMyEyes.com – if they'd been able to research where the links were coming from, they could have built a much more detailed picture of what was actually happening to drive the site up the rankings.

The code itself is fully open-source on GitHub, including a jQuery HTML5/Canvas network graph rendering plugin that I had fun hacking together. Let me know if you come up with any interesting uses for it, and I'd love ideas on how to improve the quality of the information.

[Update – it looks like there are some major differences between the way Blekko and Google handle nofollow links. This will mean larger differences in the results than I'd hoped for, so take the numbers with an even bigger pinch of salt.]

Five short links

Photo by Leo Reynolds

The biology of sloppy code – "A good chunk of code running on servers right now exists just to translate the grunts of one software animal to the chirps that another understands". A compelling analysis of why we can get away with writing extremely inefficient code on modern systems.

What does compound interest have to do with evolving APIs? – In contrast, Kavika demonstrates how the costs of coding shortcuts explode as the technical debt mounts. One pattern that makes it worse is the accumulation of ever-more requirements as the system evolves. Part of Steve Jobs' genius is his ruthlessness in dropping support for older software and hardware. It's not much fun as a user, but it helps Apple do a lot with comparatively few engineers.

Gold Farming – An in-depth analysis of the transaction patterns and network characteristics of gold farmers in EQ2. There are some interesting similarities to other clandestine groups like drug dealers.

Infinite City – I'm eagerly anticipating the arrival of this 'atlas of the imagination' for my new home. Thanks to Drew for the recommendation.

Whites more upwardly mobile than blacks – A disturbing snapshot of the state of income mobility. I'm left wanting to dig into the actual data behind it, though; hopefully the upcoming site they mention will follow the World Bank's lead and make full data sets available for download.