Five short links

Photo by Mitchell Gerskup

Kaggle – A site dedicated to improving our data analysis algorithms by running frequent Netflix-style contests. Both the data providers and the scientists win. I think this is an excellent idea, and I’m pleased to see there’s been a lot of uptake – via Anthony Goldbloom

Storytelling, statistics and other grave insults – “Statistics is often too dry and too abstract for us to understand intuitively, to generate that comfortable internal feeling of understanding”. There’s a lot of truth in this analysis of how we crave, and believe, narratives. Stories persuade in a way that numbers don’t.

The Illogicality of Stock-Brokers – In a similar vein, this study shows how even experts let their intuition override basic logic. The authors experimented by posing trading questions to experienced stockbrokers, and found that the plausibility of the answer strongly determined whether they chose it, even if applying simple logic would clearly show it was wrong. More evidence that we rely more on our pattern-matching skills than our rationality when making decisions – via FreeExchange

myWorld Demo – The web opens up so many ways of putting advanced geo tools like this into the hands of everyday people – via Peter Batty

Datavis Tumblr – Beautiful examples of data visualization – via Julian Green

How to use the new Salesforce REST API from PHP

Photo by Napanee Gal

Salesforce just released the REST version of their API, and while there's a Java example, there's no sample code for other languages. Since I'll be calling it from PHP, I used their documentation to build my own sample code. The source is available at and you can see a live version running at The code demonstrates how to authenticate, get an access token and then call the API to grab information about the sales accounts for the current user.

To use the API at all, you'll need a server set up with an SSL certificate and https, since the OAuth 2.0 authentication requires a secure connection. I found this guide from Ubuntu useful in getting that set up, and bought my certificate from GoDaddy.

With that sorted out, go to to create a Developer Edition Salesforce account. You'll also want to sign up for the REST API preview beta program (though they're currently experiencing a few technical hiccups with the process).

Next, navigate to the Setup link in the top-right corner of the page, then click on Develop, then Remote Access. Pick an application name, and add the location where you'll be uploading the example index.php file as the callback URL. I also checked the No user approval required box even though I'm not sure exactly what it does! After you've saved you should see a screen giving you your access credentials. Copy the Consumer Key, Consumer Secret and callback URL values into the start of your copy of the index.php sample code and then upload it to the server.

Now point your web browser at the address you uploaded the sample code to. The first time through it should redirect you to a login page on the Salesforce site, ask you whether you want to let the application access your data, and then send you back to the sample code location. If all goes well, you should see a list of your sales accounts:

Congratulations, you've just written your first Salesforce application!
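
For anyone not working in PHP, the three steps the sample walks through (redirect to the login page, swap the returned code for an access token, then query the accounts) can be sketched roughly as below. This is a Python illustration, not the PHP sample itself. The login host, the `v20.0` API version, the `Bearer` header scheme and every function name here are assumptions for illustration, so check them against the current Salesforce documentation before relying on them.

```python
import json
import urllib.parse
import urllib.request

# Assumed login host; sandbox or custom-domain orgs would differ.
LOGIN_HOST = "https://login.salesforce.com"


def build_authorize_url(consumer_key, callback_url):
    """Step 1: the URL the browser is sent to so the user can log in and approve."""
    params = {
        "response_type": "code",
        "client_id": consumer_key,
        "redirect_uri": callback_url,
    }
    return LOGIN_HOST + "/services/oauth2/authorize?" + urllib.parse.urlencode(params)


def build_token_request(code, consumer_key, consumer_secret, callback_url):
    """Step 2: endpoint and form fields for swapping the callback code for a token."""
    fields = {
        "grant_type": "authorization_code",
        "code": code,
        "client_id": consumer_key,
        "client_secret": consumer_secret,
        "redirect_uri": callback_url,
    }
    return LOGIN_HOST + "/services/oauth2/token", fields


def build_query_request(instance_url, access_token, soql, api_version="v20.0"):
    """Step 3: URL and headers for running a query against the REST API."""
    query = urllib.parse.urlencode({"q": soql})
    url = "{}/services/data/{}/query?{}".format(
        instance_url.rstrip("/"), api_version, query
    )
    return url, {"Authorization": "Bearer " + access_token}


def fetch_json(url, headers=None, form_fields=None):
    """Make the HTTPS call (a POST when form_fields is given) and decode the JSON."""
    data = None
    if form_fields is not None:
        data = urllib.parse.urlencode(form_fields).encode("ascii")
    request = urllib.request.Request(url, data=data, headers=headers or {})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

In use, the callback handler would pass the `code` query parameter to `build_token_request`, post the fields with `fetch_json`, and then feed the `access_token` and `instance_url` from the response into `build_query_request` with a query such as `SELECT Name FROM Account`.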

The dark side of entrepreneurship, continued

My last post led to a flood of comments and emails. There's so much sobering insight packed into the reactions, so many personal stories that stand out, I'll highlight a few of them below.

First though, I want to talk a little about the connection between my work and the failure of the relationship. As I said, there wasn't a direct, clear link. I fought hard to carve out time to spend together, since I knew that was the classic mistake. It was more subtle – when at some level I started to sense deeper problems with our relationship, I'd try to spend more and more time together as a fix instead of talking about the issues and confronting them. I had my work to soak up any frustration and bring me some measure of satisfaction, so I let things linger. Looking back, I should have known what was going wrong, but it was easier to hide in the world I could control than face painful conversations. I shut myself off, and left her facing our problems alone. When she finally had the courage to confront me with them, she felt it was too late for a fix. It wasn't that work kept me physically away, it's that it gave me a place to hide from our problems.

123 wrote:

"I know how it is to be on the other side of the story. It's hard for us too. I'm really sorry."

That's what I'm going to regret for the rest of my life, the pain I caused someone I love.

Alex Dong paints a picture of where I could be in 50 years' time if I'm not careful:

"When I met Frank, he was 97. Just came back from his first accident falling off from stairs. A visit to his house was like a tour of the car history museum. He was there when Ford was just starting up. Frank invented a shitload of gadgets and mechanical devices that made the model-T possible. Most bleeding edge technologies they created have long gone. Today we don't even have a chance to see them anymore. Like a reflective mirror on top of the front light that can tell the driver whether the light is on or not. He was so lonely by then that he took us for a ride in his still brand new Ford Model A and refused to stop and let us go home.

Frank's personal life was a complete failure. He was so passionate about changing the world that he had a workshop at home, with lathe and tons of great manual tools. His wife became an alcoholic because Frank completely ignored her. His two sons hated him so much that none of them came to visit him in the last 5 years.

A month ago, I heard that Frank finally moved into a nursery home. He sold his house before he left. My friend Walter was there when one of his son hired a dumpster to take away all Frank's tools. The whole workshop was thrown away. Nobody, even a museum, wanted his stuff. All the great books, models, designs, gadgets. All into the dumpster."

Jud Valeski talks about the impossible balancing act:

"Thanks for the honesty. I'm sorry this happened. The apparent out-of-the-blue nature scares me. Things seem "fine" in my world. I fear coming home one day to an empty house and a note on the table.

There are so many conflicting desired behaviors between work and non-work relationships. In order to make my company succeed, I have to pour everything I have into it. In order to make my marriage and parenthood succeed, I have to pour everything I have into it. What the!?!

Over the past six months, since taking on the CEO position at work, my world has shifted. Even more than before, as a founder, my time is absorbed by work. I've never been good at balance, but I like to think I'm doing ok with it. I get support from my spouse (tremendous support), but I don't know how real it is. Not because she's potentially dishonest, but because I don't know if these are the kinds of things that even she can know in the moment (can any of us?). I can't ignore the frustration in her voice when talking to her on the phone on a work trip asking "how many more nights are you gone?"

For better or worse will be determined in time, but we've split the balance up at a unit level. We've curiously fallen straight back into the 1950's; I bring home the bacon, and she runs the house and kids. I try to be a good father by showing my children what hard work means. What it means to dedicate yourself. What it means to be passionate. What it means to run fast. What it means to pick yourself up after you face-plant. What it means to work and love. What it means to love work. Those are the examples I can give. That is my life. That is how I can contribute to our child rearing as a parent. It's not necessarily my preference, but it's what I can do, and do well, at the moment."

Adrian Ashton on the hidden costs behind work we admire:

"I occassionally get to guest lecture to enterprenuers clubs in colleges and universities and always end with an image of Edward Much's 'Scream' to illustrate the point that the piece of art has changed the world, touched and changed countless lives… and all it took was for the artist to have a breakdown."

Mitch Fillet on a time limit for startups:

"It is very hard to navigate the desire for achiievement, the demands of your employer aas they heap on both responsibility and compensation and the needs of your family.
Wrap those demands woth some aging parents and a few chldren and it becomes an almost impossible cycle to shoulder for more then a year or two. That is why we speak about a 36 month exit of some sort. This, of course, does not mean an IPO. It just means a recognition that a combination of delegation, infrastructure build-out and possibly the inclusion of other stakeholders preserves the sanity and family of the founder group."

Nicholas Napp on the sacrifices you have to make:

"We talked and agreed that there needed to be more ground rules other than simply keeping to the truth. If you're not willing to sacrifice your relationships (something I am no longer willing to do) you have to make other compromises. That's why I now have a startup and a consulting business. Yes, VC's hate the idea, but they're not the people I want in my personal life.

For me, there has to be more balance. You can't stay happy and walk away from your passion and a desire to build things, but that passion can easily blow up the rest of your life. Being hyper focused on work is the natural tendency for an entrepreneur, but for most of us, I don't believe it's effective. You lose perspective, miss opportunities and make mistakes. Not that you don't make mistakes otherwise, but at least if my work life sinks, I have a rich personal life to anchor me. That gives me the chance to reset and try again."

Kin Lane on sharing the obsession:

"Definitely an area that young entrepreneurs do not consider. My marriage eventually ended after 10 years due to my chronic entrpreneurialism.

I still suffer from it, but manage it better these days. Also found an equally geeky, obsessive GF….and our obsessive world is shared. "

The dark side of entrepreneurship

My work has never been better, but my personal life is in tatters – an eight-year relationship just ended. I probably shouldn't even be discussing this here, but writing it down helps as I try to make sense of what went wrong – what I did wrong.

There's no direct line I can draw between what happened and my startup life, but I have to wonder. Something people seldom talk about with entrepreneurship is how corrosive it can be to relationships. Founders are driven, and that doesn't make for a comfortable world for them or those around them. I always tried to make my home life the top priority, but my obsession is a part of me.

There are so many characteristics of entrepreneurs that make us hard to live with, from a constantly uncertain future to long working hours and a monomaniacal focus on our projects, but these tend to get lost in the classic Romantic mythology that is perpetuated by our willingness to lie about the reality of startup life. There are plenty of counter-examples that show it's possible to be a founder and a good spouse or parent, but the 'craziness' that drives us to build castles in the sky is not always a benign force.

Please, take a minute to think about the people you love, and be certain you're giving them everything they deserve, that you're truly there for them. I have lots of time to reflect on that, now that it's too late.

OpenHeatMap now supports states and provinces worldwide

One of the most frequent requests for OpenHeatMap has been better support for provinces/states outside of the ones I already offer. The bottleneck's been finding the data in a form I can use. With a lot of help from locals I found a handful of usable maps for India, Mexico and Canada, but it was a slow process. All that changed when I discovered the public domain Natural Earth data set. Taking one map containing top-level administrative districts for every country worldwide, I was able to extract the states and provinces for hundreds of nations. This means you can now upload a spreadsheet containing province names for almost any country and get a detailed map.

This is the map you get when you upload Afghan provinces, and below is a complete list of the examples for the countries I support. I'm very excited to see how people are able to use this, so let me know how you get on.


Åland Islands
United Arab Emirates
Burkina Faso
Central African Republic
Cocos (Keeling) Islands
Cote D'Ivoire
The Democratic Republic of the Congo
Costa Rica
Christmas Island
Czech Republic
Dominican Republic
Faroe Islands
United Kingdom
The Gambia
Equatorial Guinea
French Guiana
Heard Island And McDonald Islands
Kyrgyz Republic
South Korea
Lao People's Democratic Republic
Libyan Arab Jamahiriya
Sri Lanka
Moldova, Republic Of
The Former Yugoslav Republic of Macedonia
New Caledonia
Norfolk Island
New Zealand
Papua New Guinea
Democratic People's Republic of Korea
Saudi Arabia
Svalbard And Jan Mayen
Solomon Islands
Sierra Leone
El Salvador
Serbia And Montenegro
Trinidad & Tobago
South Africa

Five short links

Photo by Joseph Robertson

OmniMark – Old-school but impressive tool for turning arbitrary semi-structured data into XML. I’ll be trying to learn from this as I look to improve my ETL process – via Kevin Marshall

The rise and fall of Swivel – So many lessons for any data startup in here. Swivel took several million dollars in funding before they had a plan of where they were going, and built a generic platform instead of a focused application targeted at users who would get some benefit from it. That they had fewer than ten paying customers, despite tens of thousands of registered users, is a good reminder of the work you have to put in to create revenue; you don’t just get a fixed percentage of active users upgrading – via Joe Parry

The Obese Surfer Problem – Russell explores a compelling visualization that serious surfers are willing to pay money for. I like the idea of ‘predictive models’ as a more general category for what I often talk about as recommendations. Showing you what could happen is a lot more valuable than just a rear-view mirror showing the history – via Russell Jurney

HexFiend – “A fast and clever open source hex editor for Mac OS X.” Does exactly what it says on the tin. I’ve been searching for a good hex editor since Codewright died, and so far it’s been great

Benoit Mandelbrot is gone, but he shouldn’t be forgotten – A strong reminder to everyone: don’t assume a Gaussian for your probabilities if your events don’t follow that distribution. We have so much faith in numbers as summaries of reality, but like spherical cows, unrealistic assumptions can lurk behind the most solidly calculated figure – via Behavior Gap

Visualization myths around Snow’s cholera map


Thanks largely to Tufte's evangelization, John Snow's map of the 1854 cholera outbreak in Soho has become the classic example of the power of visualizations. I've just finished Steven Johnson's The Ghost Map, which tells the story behind the graphic, and it's surprisingly different from the simplified explanation that usually accompanies the picture.

The map wasn't that innovative

Snow wasn't the first person to draw these kinds of maps, he wasn't the first to draw them to track disease, and in fact he wasn't even the first person to map this particular outbreak! The Sewer Commission produced a very detailed map showing the death locations. The power of Snow's version came from his decision to leave out a lot of details (sewer locations, old grave sites, etc) that cluttered up the Commission's version. Their map was so muddled that it didn't tell a story, but Snow's was stripped down to show exactly what he needed to bolster his theory that the epidemic spread from the water pump.

The only technical innovation that Johnson identifies was his use of boundary lines to mark the areas that were closest to particular pumps by walking distance, to demonstrate that many of the cases nearer to other water sources as the crow flies were actually in the catchment area of the Broad Street pump. Unfortunately that version of the map is rarely shown, and Tufte himself dismisses it as "Voronoi baloney"!
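
To make the crow-flies versus walking-distance point concrete, here's a minimal Python sketch of how the same house can fall in a different pump's catchment depending on the distance measure. The coordinates are invented and the square-grid distance is only a crude stand-in for walking routes (Soho's streets are not a grid), so this illustrates the idea rather than reconstructing Snow's method.

```python
import math


def straight_line(a, b):
    """Distance as the crow flies between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def street_grid(a, b):
    """Crude stand-in for walking distance: travel only along a square grid."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def closest_pump(house, pumps, distance):
    """Name of the pump that minimises the given distance from the house."""
    return min(pumps, key=lambda name: distance(house, pumps[name]))


# Invented coordinates, purely for illustration.
PUMPS = {"Broad Street": (5, 0), "rival pump": (3, 3)}
HOUSE = (0, 0)

print(closest_pump(HOUSE, PUMPS, straight_line))  # rival pump
print(closest_pump(HOUSE, PUMPS, street_grid))    # Broad Street
```

Assigning every house by `straight_line` gives the Voronoi-style partition; swapping in a walking-distance function redraws the boundaries, which is the effect the rarely shown version of the map was meant to demonstrate.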

Theory came first

From the popular account it's easy to imagine that Snow plotted the deaths on his map, then the pump locations, and that triggered a revelation. In fact he'd been fighting for a decade to prove that cholera was a waterborne disease, not spread atmospherically as the miasma theory claimed. He'd already gathered a lot of evidence from the differing rates of the disease amongst neighbors using piped water from different suppliers. The map was a tool for "hypothesis testing", not "hypothesis generating".

Data gathering was the key

Together with Henry Whitehead and local doctors, Snow spent weeks going door-to-door gathering detailed information from area residents. He was then able to present that data as evidence for his theory in a variety of forms, including anecdotal case histories, numerical analyses and his maps. The key was that this hands-on experience with the raw data gave him the story he wanted to tell, and then he was able to make his argument using a variety of different presentation tools.

These two ideas are essential points for my work. A lot of the recent approaches to visualization assume that you can give ordinary people simple map or graph creation tools, and they'll be inspired to create powerful graphics. With OpenHeatMap I've concentrated on people who already have a story to tell: journalists, activists and other people who are highly motivated to make an argument. It's about empowering people who are looking for a solution, not hoping that we'll turn passive observers into active participants just by handing them the tools.

The map became marketing

The actual story and evidence behind Snow's work are complex and hard to explain. As his theory became widely accepted as a massive historical advance, the map came to stand as shorthand for the story behind it. After that, it was easy to imagine that the graphic was the central evidence of his report on the outbreak. In fact it was just one piece of evidence, but it was so accessible and easy to use as an illustration that it spread slowly but virally through different publications. As Johnson puts it in his book, "the map was a triumph of marketing as much as empirical science".

This is something I've seen in my own work too. Visualizations are fantastic at engaging people; everyone loves maps. When it comes down to detailed analysis though, a spreadsheet or other list-based interface is almost always better. Maps and other visualizations tell stories so well because of how much they leave out, but textual representations still rule when it comes to actually working with the full data. Think of your visualizations as powerful marketing tools, as bait to get people in the door, but expect to offer them something deeper when they want to work with that data.

There's a lot more to the story than I can cover here, so if you've got any involvement in data analysis or visualization you should pick up The Ghost Map, it's full of so many lessons and is a gripping read on top. I also recommend this short academic paper "Essential, Illustrative, or . . . Just Propaganda?" that argues for a different perspective on Snow's work than both the traditional popular account, and Johnson's revised approach.

How to turn data into money

Photo by Jerry Swantek (fascinating tradition behind it)

The most important unsolved question for Big Data startups is how to make money. I consider myself somewhat of an expert on this, having discovered a thousand ways not to do it over the last two years. Here's my hierarchy showing the stages from raw data to cold, hard cash:


You have a bunch of files containing information you've gathered, way too much for any human to ever read. You know there's a lot of useful stuff in there, but you can talk until you're blue in the face and the people with the checkbooks will keep them closed. The data itself, no matter how unique, is low value, since it will take somebody else a lot of effort to turn it into something they can use to make money. It's like trying to sell raw mining ore on a street corner; the buyer will have to invest so much time and effort processing it, they'd much prefer to buy a more finished version even if it's a lot more expensive.

Down the road there will definitely be a need for data marketplaces, common platforms where producers and consumers of large information sets can connect, just as there are for other commodities. The big question is how long it will take for the market to mature; to standardize on formats and develop the processing capabilities on the data consumer side. Companies like InfoChimps are smart to keep their flag planted in that space, it will be a big segment someday, but they're also moving up the value chain for near-term revenue opportunities.


You take that massive deluge of data and turn it into some summary tables and simple graphs. You want to give an unbiased overview of the information, so the tables and graphs are quite detailed. This now makes a bit more sense to the potential end-users, they can at least understand what it is you have, and start to imagine ways they could use it. The inclusion of all the relevant information still leaves them staring at a space shuttle control panel though, and only the most dogged people will invest enough time to understand how to use it.


You're finally getting a feel for what your customers actually want, and you now process your data into a pretty minimal report. You focus on a few key metrics (eg unique site visitors per-day, time on site, conversion rate) and present them clearly in tables and graphs. You're now providing answers to informational questions the customers are asking; "Is my website doing what I want it to?", "What areas are most popular?", "What are people saying about my brand on Twitter?". There's good money to be had here, and this is the point many successful data-driven startups are at.

The biggest trouble is that it can be very hard to defend this position. Unless you have exclusive access to a data source, the barriers to entry are low and you'll be competing against a lot of other teams. If all you're doing is presenting information, that's pretty easy to copy, and it has caused a race to the bottom in prices in spaces like 'social listening platforms'/'brand monitoring' and website analytics.


Now you know your customers really well, and you truly understand what they need. You're able to take the raw data and magically turn it into recommendations for actions they should take. You tell them which keywords they should spend more AdWords money on. You point out the bloggers and Twitter users they should be wooing to gain the PR they're after. You're offering them direct ways to meet their business goals, which is incredibly valuable. This is the Nirvana of data startups, you've turned into an essential business tool that your customers know is helping them make money, so they're willing to pay a lot. To get here you also have to have absorbed a tremendous amount of non-obvious detail about the customer's requirements, which is a big barrier to anyone copying you. Without the same level of background knowledge they'll deliver something that fails to meet the customer's need, even if it looks the same on the surface.

This is why Radian6 has flourished and been able to afford to buy out struggling 'social listening platforms' for a song. They know their customers and give them recommendations, not mere information. If this sounds like a consultancy approach, it's definitely approaching that, though hopefully with enough automation that finding skilled employees isn't your bottleneck.

Of course the line between the last two stages is not clear-cut (Radian6 is still very dashboard-centric for example), and it does all sound a bit like the horrible use of 'solution' as a buzz-word for tools back in the 90's, but I still find it very helpful when I'm thinking about how to move forward. More actionable means more valuable!

Is ingestion the Achilles Heel of Big Data?

Photo by Jon Appleyard

Drew Bruenig asked me a very worthwhile question via email:

"Outside of a handful of few predictable cases (website analytics, social exchange, finance) big data piles are each incredibly unique. In the smaller data sets of consumer feedback (that are still much larger than our typical sets) it’s more efficient for me to craft an ever expanding library of scripts to deal with each set. I have yet to have a set that doesn’t require writing a new routine (save for exact reruns of surveys).

So the question is: can big data ever become big business, or are the variables too varied to allow a scalable industry"

This gets to the heart of the biggest practical problem with Big Data right now. Processing the data keeps getting easier and cheaper, but the job of transforming your source material into a usable form remains as hard as it's ever been. As Hilary Mason put it, are we stuck using grep and awk?

A lot of the hype around Big Data assumes that it will be a growth industry as ordinary folks learn to analyze these massive data sets, but if the barrier is the need to craft custom input transformations for each new situation, it will always be a bespoke process, a cottage industry populated solely by geeks hand-rolling scripts.

Part of the hope is that new tools, techniques and standards will emerge that remove some of the need for that sort of boiler-plate code. Activity Streams is a good example of that in the social network space; maybe if there were more consistent ways of specifying the data in other domains we wouldn't need as many custom scripts? That's an open question; even the Activity Streams standard hasn't removed the need to ingest all the custom data formats from Twitter, etc.

Another big hope is that we'll do a better job of generalizing about the sort of data transformations we commonly need to do, and so build tools and libraries that let us specify the operations in a much more high-level way. I know there's a lot of repetition in my input handling scripts, and I'm searching for the right abstraction to use to simplify the process of creating them.
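
As one possible shape for that kind of abstraction, here's a minimal Python sketch in which each data set is described by a small table of column names and converters, and a single generic routine does the reading. Everything here (the `ingest` function, the `SURVEY_MAPPING` table, the column headings) is hypothetical and invented for illustration; it's a sketch of the idea, not a description of any existing tool.

```python
import csv
import io


def to_int(text):
    """Parse an integer that may carry thousands separators or stray spaces."""
    return int(text.replace(",", "").strip())


def ingest(source, mapping):
    """Yield one cleaned dict per CSV row, driven entirely by the mapping.

    mapping: {output_field: (source_column, converter)}
    """
    for row in csv.DictReader(source):
        yield {
            field: convert(row[column])
            for field, (column, convert) in mapping.items()
        }


# A new data set needs a new table like this, not a new script.
SURVEY_MAPPING = {
    "respondent": ("Respondent ID", str.strip),
    "score": ("Score (1-10)", int),
    "visits": ("Visits", to_int),
}

sample = io.StringIO('Respondent ID,Score (1-10),Visits\n a17 ,8,"1,204"\n')
print(list(ingest(sample, SURVEY_MAPPING)))
# [{'respondent': 'a17', 'score': 8, 'visits': 1204}]
```

The repetition moves out of the per-source scripts and into the shared `ingest` routine, though awkward sources will still need their own converters.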

I also think we should be learning from the folks who have been dealing with Big Data for decades; enterprise database engineers. There's a cornucopia of tools for the Extract, Transform, Load stage of database processing, including some nifty open-source visual toolkits like Talend. Maybe these don't do exactly what we need, but there has to be a lot of accumulated wisdom we can build on. The commercial world does tend to be a blindspot for folks like me from a more academic/research background, so I'll be making an effort to learn more from their existing practices. On the other hand the fact that ETL is still a specialized discipline in its own right is a sign that ingestion is still an unsolved problem even after decades of investment, so maybe our hopes shouldn't get too high!

Five short links

Photos by the_moment

Optimizing conversion rates with qualitative tests – First in a series, this post does a great job of walking through the steps that you can take to figure out simple ways to improve your site. It alerted me to some services I wasn’t aware of like and fivesecondtest.com – via Healy Jones

Orange – An interesting node-based graphical environment for building data-mining pipelines – via Dániel Molnár

Lies, Damned Lies and Medical Science – A compelling portrait of a ‘meta-researcher’ who has made a career out of proving how bogus most medical research is. Everyone involved in data analysis should read this; as a culture we have an irrational respect for charts and tables, when in fact they’re just useful ways of telling stories. Just like normal prose those stories are only as good as the evidence behind them, and should be treated just as sceptically. via Alexis Madrigal

Scrapy – Solid, simple and mature, so far this framework for building web crawlers in Python looks very useful and I’ll be using it on some upcoming projects. I’m still not convinced that XPath is flexible enough for the sort of content extraction I need to do, but I’ll see how far I can get with it and if alternative methods are easy to bolt on. via Alex Dong

The ignorance of what is possible – Growing up, my highest ambition was to work in an office, since that let you sit down in a private space and everybody I knew had jobs that involved standing up and dealing with customers. Reading this article reminded me of how limited my horizons were when I was young; it was only when I moved to the US that I realized how much more was possible. There must be so much potential wasted because kids don’t see how wide the world can be and limit their ambitions without even knowing what they’re losing.