OpenHeatMap now supports states and provinces worldwide

[Image: example map of Afghan provinces]
One of the most frequent requests for OpenHeatMap has been better support for provinces/states outside of the ones I already offer. The bottleneck has been finding the data in a form I can use; with a lot of help from locals I found a handful of usable maps for India, Mexico and Canada, but it was a slow process. All that changed when I discovered the public domain Natural Earth data set. Taking one map containing top-level administrative districts for every country worldwide, I was able to extract the states and provinces for hundreds of nations. This means you can now upload a spreadsheet containing province names for almost any country and get a detailed map.

This is the map you get when you upload Afghan provinces, and below is a complete list of the examples for the countries I support. I'm very excited to see how people are able to use this, so let me know how you get on.
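If it helps to picture the input, here's a rough sketch of the kind of two-column spreadsheet I mean, written out with Python's csv module. The column headers and the numbers are purely illustrative placeholders, not the definitive format; check the upload page for the exact headers OpenHeatMap expects.

```python
# Illustrative only: write a tiny CSV of province names and values to upload.
# The header names and the numbers here are placeholders for this sketch.
import csv

example_rows = [
    ("Kabul", 10),
    ("Herat", 7),
    ("Kandahar", 4),
]

with open("afghan_provinces.csv", "w", newline="") as output:
    writer = csv.writer(output)
    writer.writerow(["province", "value"])
    for name, value in example_rows:
        writer.writerow([name, value])
```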

[Image: example map of Greek provinces]

Afghanistan
Angola
Åland Islands
Albania
United Arab Emirates
Argentina
Armenia
Australia
Austria
Azerbaijan
Burundi
Belgium
Benin
Burkina Faso
Bangladesh
Bulgaria
Bosnia-Herzegovina
Belarus
Belize
Bolivia
Brazil
Barbados
Brunei
Bhutan
Botswana
Central African Republic
Canada
Cocos (Keeling) Islands
Switzerland
Chile
China
Cote D'Ivoire
Cameroon
The Democratic Republic of the Congo
Congo
Colombia
Costa Rica
Cuba
Christmas Island
Northern Cyprus
Cyprus
Czech Republic
Germany
Djibouti
Dominica
Denmark
Dominican Republic
Algeria
Ecuador
Egypt
Eritrea
Spain
Estonia
Ethiopia
Finland
Fiji
France
Faroe Islands
Gabon
United Kingdom
Georgia
Guernsey
Ghana
Gibraltar
Guinea
The Gambia
Guinea-Bissau
Equatorial Guinea
Greece
Greenland
Guatemala
French Guiana
Guyana
Heard Island And McDonald Islands
Honduras
Croatia
Haiti
Hungary
Indonesia
India
Ireland
Iran
Iraq
Iceland
Israel
Italy
Jamaica
Jordan
Japan
Kazakhstan
Kenya
Kyrgyz Republic
Cambodia
Kosovo
South Korea
Kuwait
Lao People's Democratic Republic
Lebanon
Liberia
Libyan Arab Jamahiriya
Sri Lanka
Lesotho
Lithuania
Luxembourg
Latvia
Morocco
Moldova, Republic Of
Madagascar
Mexico
The Former Yugoslav Republic of Macedonia
Mali
Myanmar
Montenegro
Mongolia
Mozambique
Mauritania
Martinique
Malawi
Malaysia
Namibia
New Caledonia
Niger
Norfolk Island
Nigeria
Nicaragua
Netherlands
Norway
Nepal
New Zealand
Oman
Pakistan
Panama
Peru
Philippines
Papua New Guinea
Poland
Democratic People's Republic of Korea
Portugal
Paraguay
Qatar
Romania
Russia
Rwanda
Saudi Arabia
Sudan
Senegal
Svalbard And Jan Mayen
Solomon Islands
Sierra Leone
El Salvador
Somalia
Serbia
Suriname
Slovakia
Slovenia
Sweden
Swaziland
Syria
Chad
Togo
Thailand
Tajikistan
Turkmenistan
Timor-Leste
Trinidad & Tobago
Tunisia
Turkey
Taiwan
Tanzania
Uganda
Ukraine
Uruguay
United States
Uzbekistan
Venezuela
Vietnam
Vanuatu
Yemen
South Africa
Zambia
Zimbabwe

Five short links

Photo by Joseph Robertson

OmniMark – Old-school but impressive tool for turning arbitrary semi-structured data into XML. I’ll be trying to learn from this as I look to improve my ETL process – via Kevin Marshall

The rise and fall of Swivel – So many lessons for any data startup in here. Swivel took several million dollars in funding before they had a plan of where they were going, and built a generic platform instead of a focused application targeted at users who would get some clear benefit from it. That they had fewer than ten paying customers, despite tens of thousands of registered users, is a good reminder of the work you have to put in to create revenue; you don’t just get a fixed percentage of active users upgrading – via Joe Parry

The Obese Surfer Problem – Russell explores a compelling visualization that serious surfers are willing to pay money for. I like the idea of ‘predictive models’ as a more general category for what I often talk about as recommendations. Showing you what could happen is a lot more valuable than just a rear-view mirror showing the history – via Russell Jurney

HexFiend – “A fast and clever open source hex editor for Mac OS X.” Does exactly what it says on the tin; I’ve been searching for a good hex editor since Codewright died, and so far it’s been great

Benoit Mandelbrot is gone, but he shouldn’t be forgotten – A strong reminder to everyone: don’t assume a Gaussian for your probabilities if your events don’t follow that distribution. We have so much faith in numbers as summaries of reality, but like spherical cows, unrealistic assumptions can lurk behind the most solidly calculated figure – via Behavior Gap

Visualization myths around Snow’s cholera map

[Image: John Snow's cholera map]

Thanks largely to Tufte's evangelization, John Snow's map of the 1854 cholera outbreak in Soho has become the classic example of the power of visualizations. I've just finished Steven Johnson's The Ghost Map, which tells the story behind the graphic, and it's surprisingly different from the simplified explanation that usually accompanies the picture.

The map wasn't that innovative

Snow wasn't the first person to draw these kinds of maps, he wasn't the first to draw them to track disease, and in fact he wasn't even the first person to map this particular outbreak! The Sewer Commission produced a very detailed map showing the death locations. The power of Snow's version came from his decision to leave out a lot of details (sewer locations, old grave sites, etc) that cluttered up the Commission's version. Their map was so muddled that it didn't tell a story, but Snow's was stripped down to show exactly what he needed to bolster his theory that the epidemic spread from the water pump.

The only technical innovation that Johnson identifies is Snow's use of boundary lines to mark the areas that were closest to particular pumps by walking distance, to demonstrate that many of the cases nearer to other water sources as the crow flies were actually in the catchment area of the Broad Street pump. Unfortunately that version of the map is rarely shown, and Tufte himself dismisses it as "Voronoi baloney"!

Theory came first

From the popular account it's easy to imagine that Snow plotted the deaths on his map, then the pump locations, and that triggered a revelation. In fact he'd been fighting for a decade to prove that cholera was a waterborne disease, not spread atmospherically as the miasma theory claimed. He'd already gathered a lot of evidence from the differing rates of the disease amongst neighbors using piped water from different suppliers. The map was a tool for "hypothesis testing", not "hypothesis generating".

Data gathering was the key

Together with Henry Whitehead and local doctors, Snow spent weeks going door-to-door gathering detailed information from area residents. He was then able to present that data as evidence for his theory in a variety of forms, including anecdotal case histories, numerical analyses and his maps. The key was that this hands-on experience with the raw data gave him the story he wanted to tell, and then he was able to make his argument using a variety of different presentation tools.

These two ideas are essential points for my work: a lot of the recent approaches to visualization assume that you can give ordinary people simple map or graph creation tools, and they'll be inspired to create powerful graphics. With OpenHeatMap I've concentrated on people who already have a story to tell: journalists, activists and other people who are highly motivated to make an argument. It's about empowering people who are looking for a solution, not hoping that we'll turn passive observers into active participants just by handing them the tools.

The map became marketing

The actual story and evidence behind Snow's work is complex and hard to explain. As his theory became widely accepted as a massive historical advance, the map came to stand as shorthand for the story behind it. After that, it was easy to imagine that the graphic was the central evidence of his report on the outbreak. In fact it was just one piece of evidence, but it was so accessible and easy to use as an illustration that it spread slowly but virally through different publications. As Johnson puts it in his book, "the map was a triumph of marketing as much as empirical science".

This is something I've seen in my own work too. Visualizations are fantastic at engaging people; everyone loves maps. When it comes down to detailed analysis though, a spreadsheet or other list-based interface is almost always better. Maps and other visualizations tell stories so well because of how much they leave out, but textual representations still rule when it comes to actually working with the full data. Think of your visualizations as powerful marketing tools, as bait to get people in the door, but expect to offer them something deeper when they want to work with that data.

There's a lot more to the story than I can cover here, so if you've got any involvement in data analysis or visualization you should pick up The Ghost Map; it's full of so many lessons and is a gripping read on top. I also recommend this short academic paper, "Essential, Illustrative, or . . . Just Propaganda?", which argues for a different perspective on Snow's work than both the traditional popular account and Johnson's revised approach.

How to turn data into money

Photo by Jerry Swantek (fascinating tradition behind it)

The most important unsolved question for Big Data startups is how to make money. I consider myself somewhat of an expert on this, having discovered a thousand ways not to do it over the last two years. Here's my hierarchy showing the stages from raw data to cold, hard cash:

Data

You have a bunch of files containing information you've gathered, way too much for any human to ever read. You know there's a lot of useful stuff in there, but you can talk until you're blue in the face and the people with the checkbooks will keep them closed. The data itself, no matter how unique, is low value, since it will take somebody else a lot of effort to turn it into something they can use to make money. It's like trying to sell raw mining ore on a street corner; the buyer will have to invest so much time and effort processing it, they'd much prefer to buy a more finished version even if it's a lot more expensive.

Down the road there will definitely be a need for data marketplaces, common platforms where producers and consumers of large information sets can connect, just as there are for other commodities. The big question is how long it will take for the market to mature: to standardize on formats and develop the processing capabilities on the data consumer side. Companies like InfoChimps are smart to keep their flag planted in that space; it will be a big segment someday, but they're also moving up the value chain for near-term revenue opportunities.

Charts

You take that massive deluge of data and turn it into some summary tables and simple graphs. You want to give an unbiased overview of the information, so the tables and graphs are quite detailed. This now makes a bit more sense to the potential end-users: they can at least understand what it is you have, and start to imagine ways they could use it. The inclusion of all the relevant information still leaves them staring at a space shuttle control panel though, and only the most dogged people will invest enough time to understand how to use it.

Reports

You're finally getting a feel for what your customers actually want, and you now process your data into a pretty minimal report. You focus on a few key metrics (e.g. unique site visitors per day, time on site, conversion rate) and present them clearly in tables and graphs. You're now providing answers to the informational questions the customers are asking: "Is my website doing what I want it to?", "What areas are most popular?", "What are people saying about my brand on Twitter?". There's good money to be had here, and this is the point many successful data-driven startups are at.

The biggest trouble is that it can be very hard to defend this position. Unless you have exclusive access to a data source, the barriers to entry are low and you'll be competing against a lot of other teams. If all you're doing is presenting information, that's pretty easy to copy, and that has caused a race to the bottom in prices in spaces like 'social listening platforms'/'brand monitoring' and website analytics.

Recommendations

Now you know your customers really well, and you truly understand what they need. You're able to take the raw data and magically turn it into recommendations for actions they should take. You tell them which keywords they should spend more AdWords money on. You point out the bloggers and Twitter users they should be wooing to gain the PR they're after. You're offering them direct ways to meet their business goals, which is incredibly valuable. This is the Nirvana of data startups: you've turned into an essential business tool that your customers know is helping them make money, so they're willing to pay a lot. To get here you also have to have absorbed a tremendous amount of non-obvious detail about the customer's requirements, which is a big barrier to anyone copying you. Without the same level of background knowledge they'll deliver something that fails to meet the customer's need, even if it looks the same on the surface.

This is why Radian6 has flourished and been able to afford to buy out struggling 'social listening platforms' for a song. They know their customers and give them recommendations, not mere information. If this sounds like a consultancy approach, it's definitely approaching that, though hopefully with enough automation that finding skilled employees isn't your bottleneck.

Of course the line between the last two stages is not clear-cut (Radian6 is still very dashboard-centric for example), and it does all sound a bit like the horrible use of 'solution' as a buzz-word for tools back in the 90's, but I still find it very helpful when I'm thinking about how to move forward. More actionable means more valuable!

Is ingestion the Achilles Heel of Big Data?

Photo by Jon Appleyard

Drew Bruenig asked me a very worthwhile question via email:

"Outside of a handful of few predictable cases (website analytics, social exchange, finance) big data piles are each incredibly unique. In the smaller data sets of consumer feedback (that are still much larger than our typical sets) it’s more efficient for me to craft an ever expanding library of scripts to deal with each set. I have yet to have a set that doesn’t require writing a new routine (save for exact reruns of surveys).

So the question is: can big data ever become big business, or are the variables too varied to allow a scalable industry"

This gets to the heart of the biggest practical problem with Big Data right now. Processing the data keeps getting easier and cheaper, but the job of transforming your source material into a usable form remains as hard as it's ever been. As Hilary Mason put it, are we stuck using grep and awk?

A lot of the hype around Big Data assumes that it will be a growth industry as ordinary folks learn to analyze these massive data sets, but if the barrier is the need to craft custom input transformations for each new situation, it will always be a bespoke process, a cottage industry populated solely by geeks hand-rolling scripts.

Part of the hope is that new tools, techniques and standards will emerge that remove some of the need for that sort of boiler-plate code. activitystrea.ms/ is a good example of that in the social network space; maybe if there were more consistent ways of specifying the data in other domains we wouldn't need as many custom scripts? That's an open question; even the Activity Streams standard hasn't removed the need to ingest all the custom data formats from Twitter, etc.

Another big hope is that we'll do a better job of generalizing about the sort of data transformations we commonly need to do, and so build tools and libraries that let us specify the operations in a much more high-level way. I know there's a lot of repetition in my input handling scripts, and I'm searching for the right abstraction to use to simplify the process of creating them.
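To make that concrete, here's a minimal sketch of the sort of abstraction I mean: the generic machinery is written once, and each new data source only needs a short recipe of small, reusable steps. Every function, column and file name below is invented for illustration; it's not an existing library.

```python
# A sketch of a declarative ingestion pipeline: generic machinery plus
# per-source recipes. All names here are illustrative, not a real library.
import csv
import re

def lowercase_keys(row):
    """Normalize column names so recipes don't care about capitalization."""
    return {key.strip().lower(): value for key, value in row.items()}

def rename(mapping):
    """Return a step that renames columns according to the given mapping."""
    def step(row):
        return {mapping.get(key, key): value for key, value in row.items()}
    return step

def parse_number(field):
    """Return a step that strips junk characters and parses a numeric field."""
    def step(row):
        row[field] = float(re.sub(r"[^0-9.\-]", "", row[field]) or 0)
        return row
    return step

def run_pipeline(rows, steps):
    """Apply each step to each row in turn."""
    for row in rows:
        for step in steps:
            row = step(row)
        yield row

# A new source then becomes a short recipe rather than another hand-rolled script.
census_recipe = [
    lowercase_keys,
    rename({"local authority": "region", "persons": "value"}),
    parse_number("value"),
]

with open("source.csv") as source:
    for row in run_pipeline(csv.DictReader(source), census_recipe):
        print(row)
```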

I also think we should be learning from the folks who have been dealing with Big Data for decades: enterprise database engineers. There's a cornucopia of tools for the Extract, Transform, Load stage of database processing, including some nifty open-source visual toolkits like Talend. Maybe these don't do exactly what we need, but there has to be a lot of accumulated wisdom we can build on. The commercial world does tend to be a blind spot for folks like me from a more academic/research background, so I'll be making an effort to learn more from their existing practices. On the other hand, the fact that ETL is still a specialized discipline in its own right is a sign that ingestion is still an unsolved problem even after decades of investment, so maybe our hopes shouldn't get too high!

Five short links

Photos by the_moment

Optimizing conversion rates with qualitative tests – First in a series, this post does a great job of walking through the steps that you can take to figure out simple ways to improve your site. It alerted me to some services I wasn’t aware of, like feedbackarmy.com and fivesecondtest.com – via Healy Jones

Orange – An interesting node-based graphical environment for building data-mining pipelines – via Dániel Molnár

Lies, Damned Lies and Medical Science – A compelling portrait of a ‘meta-researcher’ who has made a career out of proving how bogus most medical research is. Everyone involved in data analysis should read this; as a culture we have an irrational respect for charts and tables, when in fact they’re just useful ways of telling stories. Just like normal prose, those stories are only as good as the evidence behind them, and should be treated just as sceptically – via Alexis Madrigal

Scrapy – Solid, simple and mature; so far this framework for building web crawlers in Python looks very useful, and I’ll be using it on some upcoming projects. I’m still not convinced that XPath is flexible enough for the sort of content extraction I need to do, but I’ll see how far I can get with it and whether alternative methods are easy to bolt on – via Alex Dong

The ignorance of what is possible – Growing up, my highest ambition was to work in an office, since that meant getting to sit down in a private space; everybody I knew had jobs that involved standing up and dealing with customers. Reading this article reminded me of how limited my horizons were when I was young; it was only when I moved to the US that I realized how much more was possible. There must be so much potential wasted because kids don’t see how wide the world can be, and limit their ambitions without even knowing what they’re losing.

What rules should govern robots?

[Image: Asimov]

Image by Andre D'Macedo

Shion Deysarker of 80legs recently laid out his thoughts on the rules that should govern web crawling. The status quo is a free-for-all with robots.txt providing only bare-bones guidelines for crawlers to follow. Traditionally this hasn't mattered, because only a few corporations could afford the infrastructure required to crawl on a large scale. These big players have reputations and business relationships to lose, so any conflicts not covered by the minimalist system of rules can be amicably resolved through gentleman's agreements.

These days any punk with a thousand bucks can build a crawler capable of scanning hundreds of millions of pages. Startups like mine have no cosy business relationships to restrain them, so when we're doing something entirely new we're left scratching our heads about how it fits into the old rules. There are several popular approaches:

Scofflaws

None of the major sites I've looked at and talked to have any defense or even monitoring of crawlers, so as long as you stay below denial-of-service levels they'll probably never even notice your crawling. They rely on the legal force of robots.txt to squash you like a bug if you publicize your work, but there's clearly a black market developing where shady marketers will happily buy data, no questions asked, much like the trade in email lists for spammers.

An extension of this approach is crawling while logged in to a site, getting access to non-public information. This WSJ article is a great illustration of how damaging that can be, and Shion is right to single it out as unacceptable. I'd actually go further and say that any new rules should build on and emphasize the authority of robots.txt. It has accumulated a strong set of legal precedents to give it force, and it's an interface webmasters understand.

Everything not forbidden is permitted

If your gathering obeys robots.txt, then the resulting data is yours to do with as you see fit. You can analyze it to reveal information that the sources thought they'd concealed, publish derivative works, or even publish the underlying data itself if it isn't copyrightable. This was my naive understanding of the landscape when I first began crawling, since it makes perfect logical sense. What's missing is the fact that all of the actions I list above, while morally defensible, really piss website owners off. That matters because the guys with the interesting data also have lots of money and lawyers, and whatever the legal merits of the situation, they can tie you up in knots for longer than you can keep paying your lawyer.

Hands off my data!

To the folks running large sites, robots.txt is there to control what shows up in Google. The idea that it's opening up their data to all comers would strike them as bizarre. They let Google crawl them so they'll get search traffic, why would they want random companies copying the information they've worked so hard to accumulate?

It's this sense of ownership that's the biggest obstacle to the growth of independent crawler startups. Shion mentions the server and bandwidth costs, but since most crawlers only pull the HTML without any images or other large files, these are negligible. What really freaks site owners out is the loss of control.

Over the next few years, 'wildcatter' crawlers like mine will become far more common. As site owners become more aware of us, they'll be looking for ways to control how their data is used. Unless we think of a better alternative, they'll do what Facebook did and switch to a whitelist containing a handful of the big search engines, since they're the only significant drivers of traffic. This would be a tragedy for innovation, since it would block startups off from massive areas of the Internet and give the existing players in search a huge structural advantage.

To prevent this, we need to figure out a simple way of giving site owners more control that won't block innovative startups. Robots.txt needs to communicate the owner's intent more clearly, with new directives similar to 'no-archive' that lay out acceptable usage in much more detail. I'm not the only one to realize this, and I'm hopeful we'll have a more detailed proposal ironed out soon.
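As a sketch of what that could look like in practice, the snippet below reads a site's robots.txt and collects a few hypothetical usage directives alongside the standard ones. The directive names ('Usage-policy', 'Allow-derivatives', 'Contact') are invented purely for illustration; no such standard exists yet, which is exactly the gap a proposal needs to fill.

```python
# Hypothetical sketch: pull made-up extended directives out of robots.txt.
# 'Usage-policy', 'Allow-derivatives' and 'Contact' are illustrations of the
# kind of intent a site owner might want to express, not real standards.
import urllib.request

EXTENDED_DIRECTIVES = ("usage-policy", "allow-derivatives", "contact")

def read_extended_robots(domain):
    """Fetch robots.txt for a domain and collect any extended directives."""
    policy = {}
    url = "http://%s/robots.txt" % domain
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8", "replace")
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key.lower() in EXTENDED_DIRECTIVES:
            policy.setdefault(key.lower(), []).append(value)
    return policy

# A polite crawler would then check, say, policy.get("allow-derivatives")
# before republishing anything built from the pages it fetched.
```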

At the same time, sites need to take stock of what information they are exposing to the outside world, since the 'scofflaw' crawlers will continue happily ignoring robots.txt. Any security audit should include a breakdown of exactly what they're handing over to scofflaw crawlers – I bet they'd be unpleasantly surprised!

A Wisconsin century

We’re visiting Liz’s mom in her hometown of Hayward, WI this week and thought we’d take a day off from work and try to bike over a hundred miles! We’ve both been doing a lot more road-riding this summer in Boulder, and with the breathing boost from going to near sea level we thought this would be a great chance to manage our first ‘century’. It turned out to still be quite a workout even though the terrain looks flat when you’re driving, with the turn-around of the out-and-back almost 1,000 feet lower than the start and lots of hills in between. We made it though, riding 110 miles all the way to Lake Superior and back.

It wasn’t exactly a route custom-built for biking; it was all along the edge of highways, but they weren’t too busy and in most places there was a wide, clear shoulder. The drivers that passed us were courteous and we felt pretty safe, especially compared to biking in Los Angeles. Highlights included the tiny store advertising “Minnows, Movie Rentals, Tanning and Turkey Registration” in the misleadingly named Grand View, and a John Deere dealership. I have to confess I was more excited by the rival Kubota dealership down the road, thanks to fond memories of the Kubotas we drive on Santa Cruz Island when we’re helping out the rangers.

Anyway, I have no idea if anyone else will ever want to ride this route, but the map’s up above and we had a wonderful, exhausting time. Who knows, maybe we can give Hayward a biking rival to the Birkie?

How to fetch URLs in parallel using Python

Photo by FunkyMonkey

Here's my class to run web requests in parallel in Python

One of the most common patterns in my work is fetching large numbers of web resources, whether it's crawling sites directly or through REST APIs. Often the majority of the execution time is spent waiting for the HTTP request to make the round trip across the internet and return with the results. The obvious way to speed things up is to change from a synchronous 'call-wait-process' model where each call has to wait for the previous one to finish, to an asynchronous one where multiple calls can be in flight at once.

Unfortunately that's hard to do in most scripting languages, despite being a common idiom in Javascript thanks to Ajax. Threads are too heavyweight in both resources and programming complexity, since we don't actually need any user-side code to run in parallel, just the wait on the network. In most languages the raw functions you need to build this are available through libcurl, but its multi_curl interface is nightmarishly obscure.

In PHP I ended up writing my own ParallelCurl class that provides a simple interface on top of that multi_curl complexity, letting you specify how many fetches to run in parallel and then just feed it URLs and callback functions. Recently though I've been moving to Python for longer-lived offline processing jobs, and since I couldn't find an equivalent to ParallelCurl I ported my PHP code over.

This is the result. You'll need to easy_install pycurl to use it, and I'm a Python newbie so I'm sure there's ugliness in the code, but I'm really excited that one of the big barriers to migrating more of my code is now gone. Let me know how you get on with it.
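If you just want the shape of the approach, here's a stripped-down sketch of the same idea built on pycurl's multi interface. It's a simplification for illustration only; the real class linked above has more options and proper error handling, and none of the names below are its actual API.

```python
# A minimal sketch of parallel fetching with pycurl's multi interface.
# This illustrates the general technique, not the ParallelCurl API itself.
import pycurl
from io import BytesIO

def fetch_in_parallel(urls, on_result, max_in_flight=10):
    """Fetch urls with up to max_in_flight requests active at once.
    Calls on_result(url, http_code, body) for each transfer; on failure
    http_code is None and body is the error message."""
    multi = pycurl.CurlMulti()
    queue = list(urls)
    in_flight = {}  # maps each active Curl handle to (url, buffer)

    def start_requests():
        while queue and len(in_flight) < max_in_flight:
            url = queue.pop(0)
            buffer = BytesIO()
            handle = pycurl.Curl()
            handle.setopt(pycurl.URL, url)
            handle.setopt(pycurl.WRITEFUNCTION, buffer.write)
            handle.setopt(pycurl.FOLLOWLOCATION, 1)
            multi.add_handle(handle)
            in_flight[handle] = (url, buffer)

    start_requests()
    while in_flight:
        # Let libcurl push all active transfers forward.
        while multi.perform()[0] == pycurl.E_CALL_MULTI_PERFORM:
            pass
        # Hand completed transfers to the callback and start new ones.
        _, finished, failed = multi.info_read()
        for handle in finished:
            url, buffer = in_flight.pop(handle)
            on_result(url, handle.getinfo(pycurl.HTTP_CODE), buffer.getvalue())
            multi.remove_handle(handle)
            handle.close()
        for handle, error_code, error_message in failed:
            url, buffer = in_flight.pop(handle)
            on_result(url, None, error_message)
            multi.remove_handle(handle)
            handle.close()
        start_requests()
        # Wait until the network is ready for more work.
        multi.select(1.0)

# Example use:
# fetch_in_parallel(["http://example.com/"] * 5,
#                   lambda url, code, body: print(url, code, len(body)))
```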

Five short links

Photo by Cayusa

Large Scale Social Media Analysis with Hadoop – A great introduction to the power of MapReduce on massive social network data, courtesy of Jake Hofman

Junk Conferences – A guide to spotting bogus conferences and journals, aimed at the scientific world but equally applicable to startup scammers – via Felix Salmon

Honoring Cadavers – Growing up in a family of nurses I heard plenty of tales of doctors' lack of human empathy, so I love the idea of connecting them with the people behind the corpses they start with. I'd like to see every technical profession forced to deal meaningfully with the people at the receiving end of their work. At Apple, most Pro Apps engineers would volunteer to spend a shift demo-ing our software to users at the NAB show every year, and I got an amazing amount of insight and motivation from that experience.

Gary, Indiana's unbroken spirit – It seems like Gary is in even more dire straits than Detroit. I spent five years in Dundee just after the jute mills had closed, and it felt a lot like this. The European approach is to try to retain the residents and bring jobs to them, but during my time in Scotland that seemed to result in a middle class almost entirely employed by the government, and very few private companies.

The world is full of interesting things – A Google labs roundup of over a hundred cool sites on the internet, with Fan Page Analytics tucked away on page 86.