Why you should blog

Photo by Anne Bowerman

Yesterday I talked about why startups should blog, and was pretty negative. I stand by everything I wrote there – Brad's recent post about intrinsic motivation captures why I don't believe counting up the external benefits will be enough to get you blogging, as does Bukowski's poem So you want to be a writer?

I'm a strong believer in the power of counting my blessings though, so here are some of the things that blogging has given me:

The people

Many work days the only person I'll talk to face-to-face is Liz, which is a big change from my previous career spent sitting at a desk surrounded by my team. Blogging is a way for me to hold the sort of water-cooler conversations I miss, and build relationships with interesting people. I'm excited every time I see a comment, and amazed when I blog about an event or get-together and people actually show up. It's led to many off-line conversations that give me whole new perspectives on the problems I'm dealing with.

The ideas

The best way to understand something completely is to explain it to someone else. Blogging forces me to think through my ideas in a disciplined way and expose them to criticism from some very smart people. Just the practice of spending an hour a day truly thinking hard and in depth about an issue has helped me immensely.

The communication skills

I spent a decade building my engineering chops, but at the end of it I was still constantly frustrated by my inability to explain my ideas and persuade people. Writing hundreds of blog posts has left me with a much stronger ability to get across my thoughts in a pithy and effective way, both in writing and general conversation. It helps that I can often cast my mind back to old posts for arguments and evidence, but the sheer mental workout of writing accurate articles quickly helps me think on my feet.

The publicity

Having a thousand people willing to spend their time checking out what I'm up to is incredibly powerful. That's usually enough to get meaningful feedback about what's working and what's failing, and to start a viral spread if it's really a winner. I'm also hopeful you'll all be a good pool of potential customers once I roll out something that actually generates revenue!

The credibility

It's not who you know, it's who knows you. Having someone in the room who's actually heard of your work changes the whole tenor of a meeting. I don't hear from the folks who read my blog and think I'm a clown of course, but some of my strongest and most rewarding business relationships have come about thanks to this blog.

Why should startups blog?

Photo by Anne Bowerman

A friend just asked me a simple but important question – why should a startup have a blog? What are the practical, concrete benefits?

The short answer is, you probably shouldn't have one! It takes a lot of time and mental energy to keep an active blog going, and if you spent that time on external consulting instead and channeled the revenue into AdWords you'd almost certainly get more traffic for your effort. If you don't have a burning desire to say something, you'll get dispirited by the initial lack of interest and give up.

It's a bit like choosing to do a startup instead of sticking with a salary job – if you stare at the cold probabilities it's an irrational choice. You hear about the successes, but not about the hundreds of other blogs that get ignored despite great content.

So why am I blogging? For the same reason I'm building my startup: it's an itch I have to scratch. It's had some great personal benefits, from improving my ability to communicate complex ideas, to making friends I'd never have discovered, but those came slowly. In the first year I had posts which literally nobody read, and it took several months of daily posts before I got my first RSS subscriber! The only things that kept me going were my pig-headed stubbornness and the pleasure of looking at a completed post.

After over 800 posts and five years of blogging, I now actually have an audience: 1,600 RSS subscribers and 30,000 web visitors a month. This brings all sorts of bonuses. I love people trusting me enough to invest their time checking out my projects, and meeting readers in person and having conversations that flow from my articles. The constant practice has made me a competent enough writer to pop up as a guest author on some of my favorite sites, which is both a thrill and generates some nice publicity for my startup work. I'm incredibly glad I'm in this position, but it took an obsession bordering on the insane to get here.

So don't look for hard-nosed practical reasons to blog. Unless there's something that you just have to shout from the rooftops, it won't make sense.

Fighting the drug war with statistics


I only ran across Diego Valle's blog because he sent me a bug report for OpenHeatMap, but reading through his posts I'm amazed he's not famous for his work. He's calmly telling his story through statistics, but the results are powerful, shocking, and surprising. An epic example is his Statistical Analysis and Visualization of the Drug War in Mexico. With just Benford's law and some comparison data sets, he's able to demonstrate how the police are systematically hiding over a thousand murders a year in a single state, and that's just in one small part of the article.
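To give a flavor of the technique, here's a minimal sketch of the kind of first-digit test Benford's law makes possible. This is my own illustration, not Diego's actual code: it compares the leading-digit frequencies of a set of counts against the logarithmic distribution that naturally occurring data tends to follow, so that fabricated or under-reported figures stand out.

```python
import math
from collections import Counter

def benford_expected(digit):
    """Expected frequency of a leading digit (1-9) under Benford's law."""
    return math.log10(1.0 + 1.0 / digit)

def leading_digit_frequencies(counts):
    """Observed frequency of each leading digit in a list of counts."""
    digits = [int(str(abs(n))[0]) for n in counts if n != 0]
    totals = Counter(digits)
    return {d: totals.get(d, 0) / float(len(digits)) for d in range(1, 10)}

def benford_deviation(counts):
    """Sum of absolute gaps between observed and expected frequencies.
    Values near zero suggest naturally occurring data; large values hint
    at manipulated or systematically under-reported figures."""
    observed = leading_digit_frequencies(counts)
    return sum(abs(observed[d] - benford_expected(d)) for d in range(1, 10))
```

Run over something like per-municipality death counts, a suspiciously large deviation is the statistical smoke that prompts a closer look at the records themselves.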

There's been a lot of discussion about data journalism, but few examples have been more than diverting eye candy. Diego shows us how it should be done. If you're like me you'll have read other stories about the drug war down south, and all that sticks in your mind is photos of bodies and the gargantuan number of deaths. With these posts he's given me a fresh and appalling perspective on the details of what that actually means, from a criminal justice system that can't even record the murders effectively, to how military interventions correlate with increased death rates, to the impact on Mexico of the US ending its assault weapons ban. Equally important, he also shows how some of my preconceptions are wrong; Mexico continues to have a far lower homicide rate than Brazil for example, and the violence tends to be highly concentrated in areas the cartels are disputing.

Diego's work is important because he's using hard evidence to tell the truth about a vitally important story, and the data hints at how we can get out of this mess. We need people like him working in the tradition of John Snow, applying their analytical skills to illuminate life-or-death problems instead of just Twitter trends.

Why we lie on stage (and what we can do about it)

Photo by Tim Johnson

Eric Ries has an impassioned plea to entrepreneurs to stop lying on stage, and he's spot on: this is a massive problem for first-time founders. Mistaken impressions of how startups worked helped me make terrible decisions and misjudge progress. Almost every startup history I thought I knew turned out to be completely cooked, once I talked privately to people who'd actually been there.

I don't think 'stop lying' is going to be a very effective solution though. If you read diaries by almost anyone living through great historical events, whether they're presidents or foot-soldiers, it's just one damn thing after another. The over-arching narrative gets pulled together from that raw material, first by journalists and later by historians. Most of what actually happened gets left out in the telling, because it doesn't fit into the pattern of cause-and-effect that stories require. Really good historians will rescue some of this material by building new and more nuanced descriptions of events, but there's always so much going on and so little space to tell it that inevitably key information gets lost in the compression process.

We're hard-wired to respond to coherent stories, so this sort of lying by omission is never going to go away. The Lean Startup movement itself is built around the story that any company can reach success by mechanically applying some simple techniques to their product development. If you look at the details of their writings, this isn't actually what Eric or Steve say; there's a lot more complexity and space for more traditional approaches, but that basic story is what sticks in people's minds. There are endless debates about how much of a cargo cult the movement is, because that over-simplified version has taken hold of both sides.

The story is so popular because it's an antidote to the traditional 'auteur theory' approach to explaining the success of a startup; pick a charismatic individual and ascribe everything to their brilliant strokes of genius. Customer development's disparagement of visionaries and exaltation of hard, repetitive work is an appealingly puritanical backlash against this classically Romantic picture.

Since we can never truly transfer the totality of our experiences to anyone else, we need to take our role as storytellers seriously. If we're going to reach people, we have to build more truthful stories that win out over the bogus ones. Part of that is encouraging blogging by entrepreneurs like Tim Bull. Seeing it unfold before your eyes in real time makes it both a lot closer to the truth and more compelling. I'm depressed that I don't know more startup founders with active blogs; I worry we're all too concerned about projecting a confident image and afraid to display how imperfect and accidental the real path of most startups is.

Another step is to seriously think about how to incorporate failures into our 'act'. I've usually managed to get a laugh by titling my recent data-processing talks 'How to get sued by Facebook'. I start out my introductions to new business contacts by talking about the 'fruitful failures' of the past two years, all the thousand ways not to build a light-bulb I've discovered and how much that's taught me.

The fundamental thing to recognize, though, is that we need that Romantic narrative of startup success, or no one would ever persist in trying to do something as crazy as building a company from scratch. As Tom Evslin says, "nothing great has ever been accomplished without irrational exuberance". There's plenty of great raw material that's both exciting and true in every startup's story, so let's learn to be better storytellers and spin that into a gripping tale.

Information wants to be paid

Photo by Swamibu

I want to pay for API access. That probably sounds nuts coming from a starving entrepreneur, but I don’t want to be treated as a charity case by the services I rely on.

Part of my job at Apple was third-party developer support, and even before the iPhone made it so high-profile, the company was brutally self-interested in its relationship with outside developers. With that in my background I was wary of the siren call of becoming a third-party developer when I entered the web world. In the short-term the distribution advantages are hard to resist, but the service provider always has a gun to your head. Look at the arc of both the Facebook and Twitter ecosystems. Is FbFund even still alive after all the restrictions on apps within Facebook? Tweetie’s acquisition was a big win for the team, but also made it clear that Twitter is happy to expand at the expense of other external developers. You can counter with Zynga, but I’d argue that their history shows you need to grow large enough to change the power relationship, and even then you’re at a high risk of being cut off.

Fundamentally the problem is that the relationship between developers and API providers is all take and no give. The big guys have no incentive to keep their APIs open and stable. They love the free R&D, but as soon as something looks like it might make money, the temptation to bring it in-house is irresistible.

What I’m looking for in a relationship is reciprocity. The oldest and most successful API on the web is search-engine crawling. This works because providers have a strong incentive to allow Google to index their sites – in return for handing over their content, Google sends them visitors. In the real world, it’s not normal for this sort of business relationship to work through this sort of bartering. In most cases if company A makes money and depends on company B, some of A’s revenue ends up in B’s pocket.

I want to know where I stand relative to the business model of any company I depend on. If API access and the third-party ecosystem makes them money, then I feel a lot more comfortable that I’ll retain access over the long term. If it’s a drain on their resources, then I’ll assume they’re doing it for free R&D and may yank the plug at any point. It doesn’t stop me experimenting, but I’d never build a business that relied on them.

So, I’m basically stuck with Salesforce as my only option, until I can persuade Twitter or Facebook to take my cash!

“On the one hand information wants to be expensive, because it’s so valuable. The right information in the right place just changes your life. On the other hand, information wants to be free, because the cost of getting it out is getting lower and lower all the time. So you have these two fighting against each other.” – Stewart Brand

Thanks to Jud and Rob for contributing a lot to my thoughts around this.

The Boulder/Denver Hadoop meetup is tomorrow!

Photo by Steve Jurvetson

Tomorrow night (Wednesday October 6th) we'll be holding our monthly Hadoop meetup, this time in the Gnip offices in downtown Boulder. There's always a great mix of folks, including Return Path's Jacob at the helm. This month's theme is sorting out each other's HBase issues, but it's always very informal and we're never quite sure where the discussion will end up until the beer is flowing! If you're using or thinking about using Hadoop, or just have an interest in big data processing, come and geek out with us.

Do you want a map of your blog?


http://www.openheatmap.com/view.html?map=HawthornsUncontrollablePilaf

One of my goals with OpenHeatMap is building a better interface to access and explore the firehose of data we’re bombarded with. I’ve always wanted a way to navigate content by location and so I’m experimenting with mapping blogs and other media sites. Above is a visualization I built of a few hundred posts from the Sunsurfer Tumblr blog. You can see at a glance the locations that the site covers, and then by mousing over you’ll see a selection of photos from that place.

I’m looking at turning this into a self-serve interface where anyone can enter a URL and receive a visualization of that whole site, but I need testers. If you have a blog or site that features a lot of locations, whether it’s pictures from around the world, news stories or anything you’re keen to see visualized like this, drop an email to pete@mailana.com and I’ll get on the case!

Geodict – an open-source tool for extracting locations from text

Photo by Mukumbura

One of my big takeaways from the Strata pre-conference meetup was the lack of standard tools (beyond grep and awk) for data scientists. With OpenHeatMap I often need to pull location information from natural-language text, so I decided to pull together a releasable version of the code I use for this. Behold, geodict!

It's a GPL-ed Python library and app that takes in a stream of text and outputs information about any locations it finds. Here's the command-line tool in action:
./geodict.py < testinput.txt

That should produce something like this:
Spain
Italy
Bulgaria
New Zealand
Barcelona, Spain
Wellington New Zealand
Alabama
Wisconsin

For more detailed information, including the lat/lon positions of each place it finds, you can specify JSON or CSV output instead of just the names, e.g.

./geodict.py -f csv < testinput.txt
location,type,lat,lon
Spain,country,40.0,-4.0
Italy,country,42.8333,12.8333
Bulgaria,country,43.0,25.0
New Zealand,country,-41.0,174.0
"Barcelona, Spain",city,41.3833,2.18333
Wellington New Zealand,city,-41.3,174.783
Alabama,region,32.799,-86.8073
Wisconsin,region,44.2563,-89.6385
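If you're consuming that CSV output from another script, Python's standard csv module handles it directly, including the quoted fields that comma-containing names like "Barcelona, Spain" produce. Here's a small sketch (the parse_geodict_csv name is my own, not part of geodict):

```python
import csv
import io

# A sample of geodict's CSV output, as shown above.
sample = """location,type,lat,lon
Spain,country,40.0,-4.0
"Barcelona, Spain",city,41.3833,2.18333
Wisconsin,region,44.2563,-89.6385
"""

def parse_geodict_csv(text):
    """Parse geodict CSV output into a list of location dicts,
    converting the lat/lon columns to floats."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {"location": row["location"], "type": row["type"],
         "lat": float(row["lat"]), "lon": float(row["lon"])}
        for row in reader
    ]
```

Using DictReader rather than splitting on commas is the important part; a naive split would break on the quoted city names.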

For more of a real-world test, try feeding in the front page of the New York Times:
curl -L "http://newyorktimes.com/" | ./geodict.py
Georgia
Brazil
United States
Iraq
China
Brazil
Pakistan
Afghanistan
Erlanger, Ky
Japan
China
India
India
Ecuador
Ireland
Washington
Iraq
Guatemala

The tool just treats its input as plain text, so in production you'd want to use something like Beautiful Soup to strip the tags out of the HTML, but even with messy input like that it works reasonably well. You will need to do a bit of setup before you run it, primarily running populate_database.py to load information on over 2 million locations into your MySQL server.
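For the tag-stripping step, Beautiful Soup's get_text() is the usual answer, but here's a dependency-free sketch using only the standard library's html.parser, which also drops script and style content that would otherwise pollute the text:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text content, skipping <script> and <style> blocks."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)

def strip_tags(html):
    """Return the plain text of an HTML document, ready to pipe to geodict."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(chunk.strip() for chunk in parser.chunks if chunk.strip())
```

Feed the output of strip_tags into geodict instead of the raw HTML and you'll avoid false matches from markup and embedded JavaScript.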

There are some alternative technologies out there like Yahoo's Placemaker API, or general semantic APIs like OpenCalais, Zemanta or Alchemy, but I've found nothing open-source. This is important to me on a very practical level, because I can often get far better results if I tweak the algorithms to the known characteristics of my input files. For example, if I'm analyzing a blog which often mentions newspapers, then I want to ignore anything that looks like "New York Times" or "Washington Post"; they're not meaningful locations. Placemaker will return a generous helping of locations based on those sorts of mentions, adding a lot of noise to my results, but with geodict I can filter them out with some simple code changes.
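To illustrate the kind of tweak I mean, here's a sketch of a masthead filter you could layer on top of any geo-parser's output. The function name and the (name, offset) result format are my own illustration rather than geodict's actual interface:

```python
# Phrases that look like locations but are really publication names.
# A hypothetical blacklist; extend it to suit your input.
MASTHEAD_PHRASES = ["new york times", "washington post", "boston globe"]

def filter_mastheads(text, locations):
    """Drop any found location whose mention is part of a newspaper name.
    `locations` is a list of (name, start_offset) pairs, as a geo-parser
    might return them."""
    lowered = text.lower()
    kept = []
    for name, start in locations:
        # Look at a window of text around the match for a masthead phrase.
        window = lowered[max(0, start - 20):start + len(name) + 20]
        if any(phrase in window for phrase in MASTHEAD_PHRASES):
            continue
        kept.append((name, start))
    return kept
```

It's crude, but for a known input source a twenty-line blacklist like this can remove most of the noise that a general-purpose API would force you to live with.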

Happily, MaxMind has made a rich collection of location information freely available, so I was able to combine that with some data I'd gathered myself on countries and US states to make a simple-minded but effective geo-parser for English-language text. I'm looking forward to improving it with more data and recognized types of locations, but also to seeing what you can do with it, so let me know if you do get some use out of it too!

Hack the planet?

Photo by Alby Headrick

I've just finished reading Hack the Planet, and I highly recommend it to anyone with an opinion on climate change (which really should be everyone). Eli Kintisch has written a comprehensive guide to the debates around geo-engineering, but it's also a strong argument for reducing our CO2 emissions. Like me, he's an unabashed believer in the scientific method as the best process for answering questions of fact, and he's not afraid to challenge his subjects' assertions. It's fundamentally a documentary approach, not a polemic, but all the more powerful for how he tells the stories of those involved in the debate.

The core of the book is the idea that we may need to perform engineering on a massive scale to mitigate the climate changes caused by increases in greenhouse gases. There's a wide variety of schemes, from seeding the upper atmosphere with sulphur to reduce incoming radiation, to spraying salt water into clouds to increase the amount of light reflected, to scrubbing CO2 directly from the atmosphere or sequestering it underground as we're generating it. His conclusion is that the potentially cheap options, like injecting new material into the atmosphere, are risky and unproven, while the safer ones, like capturing CO2 as it's generated, are way too expensive to be practical.

The risks mostly come from our extremely poor understanding of how the planet actually works. We've had a few natural experiments with volcanic gases emitted by eruptions which demonstrate that cooling does occur, but also that rainfall and other patterns may be radically altered. Research may help quantify the risks, but the large-scale field trials needed face widespread public opposition. That also highlights a long-term political risk; if war or other disruption stops the engineering effort then the climate could change suddenly and catastrophically. On the other hand, it's also vital that we understand as much as possible about the techniques. If we end up with a 'long tail' event causing far more severe climate change than we expect, we may need to rapidly implement something as an emergency measure.

That all makes me hope that we can bring down the costs of sequestration technologies. The outlook there is uncertain; despite a lot of smart folks attacking the problem, the costs of capturing one ton of carbon are still far too high for commercial deployment. I'm optimistic this will change, but it's sobering to see the history of high hopes and broken promises from the proponents.

After his detailed view of the realities of the different geo-engineering approaches, reducing emissions emerges as a far more attractive option. As an engineer I'm inherently drawn to technological solutions, and if I was an investor I'd be betting on some form of carbon capture working out in the end, but in software terms this book paints the planet as the most convoluted legacy system you could imagine. We'll never be quite sure what will happen when we monkey with its spaghetti code, so let's hope we don't have to.

How I ended up using S3 as my database

Photo by Longhorn Dave

I've spent a lot of the last two years wrestling with different database technologies from vanilla relational systems to exotic key/value stores, but for OpenHeatMap I'm storing all data and settings in S3. To most people that sounds insane, but I've actually been very happy with that decision. How did I get to this point?

Like most people, I started by using MySQL. This worked pretty well for small data sets, though I did have to waste more time than I'd like on housekeeping tasks. The server or process would crash, or I'd change machines, or I'd run out of space on the drive, or a file would be corrupted, and I'd have to mess around getting it running again.

As I started to accumulate larger data sets (e.g. millions of Twitter updates) MySQL started to require more and more work to keep running well. Indexing is great for medium-scale data sets, but once the index itself grew too large, lots of hard-to-debug performance problems popped up. By the time that I was recompiling the source code and instrumenting it, I'd realized that its abstraction model was now more of a hindrance than a help. If you need to craft your SQL around the details of your database's storage and query optimization algorithms, then you might as well use a more direct low-level interface.

That led me to my dalliance with key-value stores, and my first love was Tokyo Cabinet/Tyrant. Its brutally minimal interface was delightfully easy to get predictable performance from. Unfortunately it was very high maintenance, and English-language support was very hard to find, so after a couple of projects using it I moved on. I still found the key/value interface the right level of abstraction for my work; its essential property was the guarantee that any operation would take a known amount of time, regardless of how large my data grows.

So I put Redis and MongoDB through their paces. My biggest issue was their poor handling of large data loads, and I submitted patches to implement Unix file sockets as a faster alternative to TCP/IP through localhost for that sort of upload. Mongo's support team are superb, and their responsiveness made Mongo the winner in my mind. Still, I realized I was finding myself wasting too much time on the same mundane maintenance chores that frustrated me back in the MySQL days, which led me to look into databases-as-a-service.

The most well-known of these is Google's AppEngine datastore, but they don't have any way of loading large data sets, and I wasn't going to be able to run all my code on the platform. Amazon's SimpleDB was extremely alluring on the surface, so I spent a lot of time digging into it. They didn't have a good way of loading large data sets either, so I set myself the goal of building my own tool on top of their API. I failed. Their manual sharding requirements, extremely complex programming interface and mysterious threading problems made an apparently straightforward job into a death march.

While I was doing all this, I had a revelation. Amazon already offered a very simple and widely used key/value database: S3. I'm used to thinking of it as a file system, and anyone who's been around databases for a while knows that file systems make attractive small-scale stores that become problematic with large data sets. What I realized was that S3 was actually a massively scalable key/value store dressed up to look like a file system, and so it didn't suffer from the 'too many files in a directory' sort of scaling problems. Here are the advantages it offers:

Widely used. I can't emphasize how important this is for me, especially after spending so much time on more obscure systems. There's all sorts of beneficial effects that flow from using a tool that lots of others also use, from copious online discussions to the reassurance that it won't be discontinued.

Reliable. We have very high expectations of up-time for file systems, and S3 has had to meet these. It's not perfect, but backups are easy as pie, and with so many people relying on it there's a lot of pressure to keep it online.

Simple interface. Everything works through basic HTTP calls, and even external client code (e.g. JavaScript via AJAX) can access public parts of the database without even touching your server.

Zero maintenance. I've never had to reboot my S3 server or repair a corrupted table. Enough said.

Distributed and scalable. I can throw whatever I want at S3, and access it from anywhere else. The system hides all the details from me, so it's easy to have a whole army of servers and clients all hammering the store without it affecting performance.

Of course there's a whole shed-load of features missing, most obviously the fact that you can't run any kind of query. The thing is, I couldn't run arbitrary queries on massive datasets anyway, no matter what system I used. At least with S3 I can fire up Elastic MapReduce and feed my data through a Hadoop pipeline to pull out analytics.

So that's where I've ended up, storing all of the data generated by OpenHeatMap as JSON files within both private and public S3 buckets. I'll eventually need to pull in a more complex system like MongoDB as my concurrency and flexibility requirements grow, but it's amazing how far a pseudo-file system can get you.
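To make the pattern concrete, here's a sketch of the shape my storage code takes. The class and method names are my own illustration; in production the backend's get and put would be HTTP GET and PUT calls against an S3 bucket (via a library like boto), but an in-memory dict stands in here so the key/value-of-JSON idea is clear:

```python
import json

class JsonKeyValueStore:
    """Settings and data stored as JSON blobs under string keys.
    `backend` is anything dict-like; in production it would wrap
    S3 GET/PUT calls on a bucket."""
    def __init__(self, backend=None):
        self.backend = backend if backend is not None else {}

    def put(self, key, value):
        # In production: an HTTP PUT of this string to s3://bucket/key
        self.backend[key] = json.dumps(value)

    def get(self, key, default=None):
        # In production: an HTTP GET from s3://bucket/key
        raw = self.backend.get(key)
        return default if raw is None else json.loads(raw)
```

Everything the app needs is a get or a put against a known key, which is exactly the guarantee of predictable, known-cost operations that drew me to key/value stores in the first place.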