Geodict – an open-source tool for extracting locations from text

Photo by Mukumbura

One of my big takeaways from the Strata pre-conference meetup was the lack of standard tools (beyond grep and awk) for data scientists. With OpenHeatMap I often need to pull location information from natural-language text, so I decided to pull together a releasable version of the code I use for this. Behold, geodict!

It's a GPL-ed Python library and app that takes in a stream of text and outputs information about any locations it finds. Here's the command-line tool in action:
./geodict.py < testinput.txt

That should produce something like this:
Spain
Italy
Bulgaria
New Zealand
Barcelona, Spain
Wellington New Zealand
Alabama
Wisconsin

For more detailed information, including the lat/lon positions of each place it finds, you can specify JSON or CSV output instead of just the names, eg

./geodict.py -f csv < testinput.txt
location,type,lat,lon
Spain,country,40.0,-4.0
Italy,country,42.8333,12.8333
Bulgaria,country,43.0,25.0
New Zealand,country,-41.0,174.0
"Barcelona, Spain",city,41.3833,2.18333
Wellington New Zealand,city,-41.3,174.783
Alabama,region,32.799,-86.8073
Wisconsin,region,44.2563,-89.6385
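
If you want to consume the CSV output from another script rather than read it by eye, Python's standard csv module handles it. Here's a minimal sketch, assuming the output above was saved to a hypothetical places.csv:

import csv

# Parse geodict's CSV output, e.g. ./geodict.py -f csv < testinput.txt > places.csv
with open('places.csv') as f:
    for row in csv.DictReader(f):
        print('%s (%s) at %s,%s' % (row['location'], row['type'], row['lat'], row['lon']))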

For more of a real-world test, try feeding in the front page of the New York Times:
curl -L "http://newyorktimes.com/" | ./geodict.py
Georgia
Brazil
United States
Iraq
China
Brazil
Pakistan
Afghanistan
Erlanger, Ky
Japan
China
India
India
Ecuador
Ireland
Washington
Iraq
Guatemala

The tool just treats its input as plain text, so in production you'd want to use something like Beautiful Soup to strip the tags out of the HTML, but even with messy input like that it works reasonably well. You will need to do a bit of setup before you run it, primarily running populate_database.py to load information on over 2 million locations into your MySQL server.
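
A rough sketch of that pre-processing step (stripping the HTML before it reaches geodict) might look like the following. This wrapper isn't part of geodict, and the URL is just the example above:

import subprocess
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Fetch a page, strip the markup down to plain text with Beautiful Soup,
# and hand the result to geodict on stdin.
html = urlopen('http://newyorktimes.com/').read()
plain_text = BeautifulSoup(html, 'html.parser').get_text()

result = subprocess.run(['./geodict.py'], input=plain_text,
                        capture_output=True, text=True)
print(result.stdout)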

There are some alternative technologies out there, like Yahoo's Placemaker API or general semantic APIs like OpenCalais, Zemanta or Alchemy, but I've found nothing open-source. This is important to me on a very practical level, because I can often get far better results if I tweak the algorithms to the known characteristics of my input files. For example, if I'm analyzing a blog that often mentions newspapers, then I want to ignore anything that looks like "New York Times" or "Washington Post"; they're not meaningful locations. Placemaker will return a generous helping of locations based on those sorts of mentions, adding a lot of noise to my results, but with geodict I can filter them out with some simple code changes.
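
As a trivial illustration of the kind of tweak I mean, you could strip known newspaper names out of the text before geodict ever sees them. This is a sketch of the idea rather than code from geodict itself:

import re
import sys

# Remove phrases that look like locations but aren't (e.g. newspaper names)
# before the text reaches geodict, so 'New York Times' never gets picked up
# as the location 'New York'.
NEWSPAPER_NAMES = ['New York Times', 'Washington Post']
pattern = re.compile('|'.join(re.escape(name) for name in NEWSPAPER_NAMES))

text = sys.stdin.read()
sys.stdout.write(pattern.sub('', text))

You'd then run something like curl -L "http://newyorktimes.com/" | python strip_newspapers.py | ./geodict.py, where strip_newspapers.py is just a made-up name for the script above.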

Happily MaxMind made a rich collection of location information freely available, so I was able to combine that with some data I'd gathered myself on countries and US states to make a simple-minded but effective geo-parser for English-language text. I'm looking forward to improving it with more data and recognized types of locations, but also to seeing what you can do with it, so let me know if you do get some use out of it too!

Hack the planet?

Photo by Alby Headrick

I've just finished reading Hack the Planet, and I highly recommend it to anyone with an opinion on climate change (which really should be everyone). Eli Kintisch has written a comprehensive guide to the debates around geo-engineering, but it's also a strong argument for reducing our CO2 emissions. Like me, he's an unabashed believer in the scientific method as the best process for answering questions of fact, and he's not afraid to challenge his subjects' assertions. It's fundamentally a documentary approach, not a polemic, but all the more powerful for how he tells the stories of those involved in the debate.

The core of the book is the idea that we may need to perform engineering on a massive scale to mitigate the climate changes caused by increases in greenhouse gases. There's a wide variety of schemes, from seeding the upper atmosphere with sulphur to reduce incoming radiation, to spraying salt water into clouds to increase the amount of light reflected, to scrubbing CO2 directly from the atmosphere or sequestering it underground as we generate it. His conclusion is that the potentially cheap options, like injecting new material into the atmosphere, are risky and unproven, while the safer ones, like capturing CO2 as it's generated, are way too expensive to be practical.

The risks mostly come from our extremely poor understanding of how the planet actually works. We've had a few natural experiments with volcanic gases emitted by eruptions which demonstrate that cooling does occur, but also that rainfall and other patterns may be radically altered. Research may help quantify the risks, but the large-scale field trials needed face widespread public opposition. That also highlights a long-term political risk; if war or other disruption stops the engineering effort then the climate could change suddenly and catastrophically. On the other hand, it's also vital that we understand as much as possible about the techniques. If we end up with a 'long tail' event causing far more severe climate change than we expect, we may need to rapidly implement something as an emergency measure.

That all makes me hope that we can bring down the costs of sequestration technologies. The outlook there is uncertain; despite a lot of smart folks attacking the problem, the cost of capturing a ton of carbon is still far too high for commercial deployment. I'm optimistic this will change, but it's sobering to see the history of high hopes and broken promises from the proponents.

After his detailed view of the realities of the different geo-engineering approaches, reducing emissions emerges as a far more attractive option. As an engineer I'm inherently drawn to technological solutions, and if I were an investor I'd be betting on some form of carbon capture working out in the end, but in software terms this book paints the planet as the most convoluted legacy system you could imagine. We'll never be quite sure what will happen when we monkey with its spaghetti code, so let's hope we don't have to.

How I ended up using S3 as my database

Photo by Longhorn Dave

I've spent a lot of the last two years wrestling with different database technologies from vanilla relational systems to exotic key/value stores, but for OpenHeatMap I'm storing all data and settings in S3. To most people that sounds insane, but I've actually been very happy with that decision. How did I get to this point?

Like most people, I started by using MySQL. This worked pretty well for small data sets, though I did have to waste more time than I'd like on housekeeping tasks. The server or process would crash, or I'd change machines, or I'd run out of space on the drive, or a file would be corrupted, and I'd have to mess around getting it running again.

As I started to accumulate larger data sets (eg millions of Twitter updates), MySQL started to require more and more work to keep running well. Indexing is great for medium-scale data sets, but once the index itself grew too large, lots of hard-to-debug performance problems popped up. By the time I was recompiling the source code and instrumenting it, I'd realized that its abstraction model was now more of a hindrance than a help. If you need to craft your SQL around the details of your database's storage and query optimization algorithms, then you might as well use a more direct low-level interface.

That led me to my dalliance with key-value stores, and my first love was Tokyo Cabinet/Tyrant. Its brutally minimal interface was delightfully easy to get predictable performance from. Unfortunately it was very high maintenance, and English-language support was very hard to find, so after a couple of projects using it I moved on. I still found the key/value interface the right level of abstraction for my work; its essential property was the guarantee that any operation would take a known amount of time, regardless of how large my data grew.

So I put Redis and MongoDB through their paces. My biggest issue was their poor handling of large data loads, and I submitted patches to implement Unix file sockets as a faster alternative to TCP/IP through localhost for that sort of upload. Mongo's support team are superb, and their responsiveness made Mongo the winner in my mind. Still, I realized I was wasting too much time on the same mundane maintenance chores that frustrated me back in the MySQL days, which led me to look into databases-as-a-service.

The most well-known of these is Google's AppEngine datastore, but they don't have any way of loading large data sets, and I wasn't going to be able to run all my code on the platform. Amazon's SimpleDB was extremely alluring on the surface, so I spent a lot of time digging into it. They didn't have a good way of loading large data sets either, so I set myself the goal of building my own tool on top of their API. I failed. Their manual sharding requirements, extremely complex programming interface and mysterious threading problems made an apparently straightforward job into a death-march.

While I was doing all this, I had a revelation. Amazon already offered a very simple and widely used key/value database: S3. I'm used to thinking of it as a file system, and anyone who's been around databases for a while knows that file systems make attractive small-scale stores that become problematic with large data sets. What I realized was that S3 is actually a massive key/value store dressed up to look like a file system, so it doesn't suffer from the 'too many files in a directory' sort of scaling problems. Here are the advantages it offers:

Widely used. I can't emphasize enough how important this is for me, especially after spending so much time on more obscure systems. There are all sorts of beneficial effects that flow from using a tool that lots of others also use, from copious online discussions to the reassurance that it won't be discontinued.

Reliable. We have very high expectations of up-time for file systems, and S3 has had to meet these. It's not perfect, but backups are easy as pie, and with so many people relying on it there's a lot of pressure to keep it online.

Simple interface. Everything works through basic HTTP calls, and even external client code (eg Javascript via AJAX) can access public parts of the database without even touching your server.

Zero maintenance. I've never had to reboot my S3 server or repair a corrupted table. Enough said.

Distributed and scalable. I can throw whatever I want at S3, and access it from anywhere else. The system hides all the details from me, so it's easy to have a whole army of servers and clients all hammering the store without it affecting performance.

Of course there's a whole shed-load of features missing, most obviously the fact that you can't run any kind of query. The thing is, I couldn't run arbitrary queries on massive datasets anyway, no matter what system I used. At least with S3 I can fire up Elastic MapReduce and feed my data through a Hadoop pipeline to pull out analytics.

So that's where I've ended up, storing all of the data generated by OpenHeatMap as JSON files within both private and public S3 buckets. I'll eventually need to pull in a more complex system like MongoDB as my concurrency and flexibility requirements grow, but it's amazing how far a pseudo-file system can get you.
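
The pattern is roughly this. Here's a minimal sketch using the boto library, with a made-up bucket name and key layout rather than OpenHeatMap's actual schema:

import json

import boto
from boto.s3.key import Key

# Treat S3 as a key/value store for JSON records. The bucket name and key
# naming scheme here are hypothetical examples.
conn = boto.connect_s3()  # picks up AWS credentials from the environment
bucket = conn.get_bucket('example-openheatmap-data')

def put_record(key_name, record):
    """Serialize a dict to JSON and store it under key_name."""
    k = Key(bucket)
    k.key = key_name
    k.set_contents_from_string(json.dumps(record))

def get_record(key_name):
    """Fetch a JSON record by key, returning None if it doesn't exist."""
    k = bucket.get_key(key_name)
    return json.loads(k.get_contents_as_string()) if k else None

put_record('maps/12345/settings.json', {'title': 'CO2 emissions', 'public': True})
print(get_record('maps/12345/settings.json'))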

Is ‘journalizing’ the future of journalism?

Photo by MacWagen

Stowe Boyd just posted about a new program specializing in 'Entrepreneurial Journalism' at CUNY, and I got excited because I thought it was going to focus on strange folks like me who combine building companies and writing stories. I was disappointed to see that it's actually a traditional course for producing professional journalists, with a bit of lipstick added. So what's my big idea?

I spend a lot of my time working on articles that look awfully like newspaper or magazine stories. I've had those articles excerpted and discussed on places like the New York Times site, NPR and The Atlantic. I do original research, including picking up the phone and calling people. Despite all this I don't consider myself a journalist, nor do I particularly want to be called one. Everything I'm doing is driven by a combination of my own curiosity and the benefits that publicity brings to my startup work.

In the twentieth century, journalism became professionalized and turned into something you are instead of something you do. This had a lot of benefits in an age where the means of distribution were concentrated in a handful of editors' hands – notably it enforced norms of behavior that restrained reporters from abusing the great power they possessed.

These days there's much more of a continuum of people who produce articles that look, smell and feel like journalism to their readers. Are these part-time creators journalists? That term's the cause of far too many arguments, as the professionals feel offended at the devaluing of their own hard-earned credentials and the part-timers at the denial that their work is worthwhile. I think it's much more productive to return to talking about journalism as something you do, not something you are. Since I'm practically American now (two years until I can convert my green card into a passport!) I felt I should coin a neologism to join 'deplane' and other horrid barbarities, so from now on I'll be talking about my journalizing.

If journalizing is seen as much more of a skill that ordinary people can learn to varying levels, then maybe we can escape the loggerheads of journalists-versus-bloggers arguments. Do paramedics bitch and moan about people learning first aid and devaluing their jobs? Hell no, my most recent class was even taught by one volunteering in his spare time!

In the last few years that I've spent blogging, I've gained so much respect for the people who can produce streams of well-written articles for a deadline, year after year. Maybe if more people had hands-on experience of the work that it takes, it would actually be a better world for professional journalists?

Five short links – Turbo Encabulator edition

Photo by Joachim S Mueller

EXHIBIT – This lightweight library for creating visualizations by harvesting data embedded in pages looks like a great way of encouraging people to add semantic structure to their HTML. Which is good for me, since it makes my crawling a lot easier. Via Carole Goble

Outwit – I can't find much out about these folks, but it looks like a step towards the simple unstructured data parsing tool I was just dreaming about. Via Conrad Quilty-Harper

ScraperWiki – Another find from my search for better scraping tools, what impresses me about this site is their active community. There's been a lot of attempts at bringing tools like this to the masses, but I'm hopeful the time is right for this one to succeed. Via Dan Armstrong

AddToIt – An enterprise take on converting unstructured text into useful information. There's a whole quiet world of commercial companies offering scraping services, which is partly what gives the field a shady reputation. Their promise to handle cases where "The data you would like to scrape is protected" certainly adds to that impression. That's a shame, because I bet there's a lot of interesting technology behind these approaches that will never see the light of day.

Turbo Encabulator – This is probably what I sound like when I talk to normal people about my day at work. A games artist friend would always mutter 'glib-glob.cpp' when I started to get too jargonified, after a source code file name I mentioned once that caused him to go into hysterics. Hey, it's a GameLIBrary-GLOBals C Plus Plus module, made sense to me! After they caused most of our early Motion bugs, 'Pbuffers' became a shorthand codeword for technical nonsense talk in our team at Apple, since they sounded so made up but were the engineer's explanation for everything that went wrong.

The missing tool for data scientists?

Photo by Noel C Hankamer

I'm flying back from a preparatory meeting for Strata, and as always with the O'Reilly events it left my mental cogs whirring. One of the biggest revelations for me was that every single person was using the equivalent of string and chicken wire to build their analysis pipeline, from OKCupid analysts to folks helping the New York Times. As Hilary Mason put it, are we stuck just using grep and awk? This was actually a massive relief for me, as I've always been highly embarrassed by the hacky code behind most of my data pipelines and assumed that I was just ignorant of the right tools! Though R got an honorable mention from several attendees, almost everyone just picked their favorite scripting language, then cut-and-pasted code and shell commands to get the job done.

The fundamental reason for this is the aptly named 'suffering' phase of data munging, where we transform raw HTML pages, JSON, text, CSV, XML, legacy binary or any other semi-structured format into strictly arranged data that we can then run through standard analysis tools. It turns out that (apart from the binary formats) this means writing regular expressions with a thin veneer of logic gluing them together. You can see an example of that in my crawler that gathers school ranking data from the LA Times. There's a single massive regex at the top of the file that contains a group for each piece of data I want to gather. Using PHP to pull the HTML, I match against the raw text, grab out the contents of each group and write them out as a row to a CSV file. My Buzz profile crawler uses the same idea, but with a much more complex set of regexes. To convert unemployment data files I use a similar approach on the strange text file format the BLS uses.
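
To make the shape of that approach concrete, here's the same pattern sketched in Python rather than PHP. The URL, regex and field names are invented for illustration, not taken from the actual crawlers:

import csv
import re
from urllib.request import urlopen

# One big regular expression with a named group per field, applied to the raw
# HTML, with each match written out as a row in a CSV file.
ROW_PATTERN = re.compile(
    r'<td class="school">(?P<school>[^<]+)</td>\s*'
    r'<td class="rank">(?P<rank>\d+)</td>',
    re.IGNORECASE)

html = urlopen('http://example.com/school-rankings.html').read().decode('utf-8', 'ignore')

with open('schools.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['school', 'rank'])
    for match in ROW_PATTERN.finditer(html):
        writer.writerow([match.group('school'), match.group('rank')])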

After spending so long building all this throwaway code, I think I know the tool I want. It would let me load a large data set into an interactive editor. I'd then highlight a series of 'records' that would each be transformed into a row in the output CSV file. It would automagically figure out a good regex that matched all the records but none of the intervening filler text. Then you'd select the text that represented the values for each record, the numbers or strings you'd expect to see in each column of the CSV output. By selecting equivalent values in three or four records, the tool could build nested regexes or groups that extracted them from each record. As I was doing all this, I'd see a sample of the results show up in a sidebar pane. I'd be able to hand-tweak the underlying regular expressions driving each part too, and see how the matches change.

This sounds a lot like something you could build into Emacs (but then anything sounds a lot like something you could build into Emacs). I have no experience creating extensions though, so I'd love to know if there's anything already out there, in any environment? It doesn't have to do everything I'm after; even an interactive grep a little more advanced than control-S in Emacs would be an interesting step forward compared to running it all from the command line. Is anybody aware of anything like that? Is there a completely different approach I'm missing?
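
In the meantime, even a crude loop like the one below gets a little of the way there: type a regex, see how many matches it finds and a sample of them, refine, repeat. This is a sketch rather than an existing tool.

import re
import sys

# A bare-bones 'interactive grep': read a regex from the prompt, show a count
# and the first few matches from a file, and loop. A crude stand-in for the
# tool described above, not an existing project.
text = open(sys.argv[1]).read()

while True:
    try:
        pattern = input('regex> ')
    except EOFError:
        break
    try:
        matches = re.findall(pattern, text)
    except re.error as e:
        print('bad regex: %s' % e)
        continue
    print('%d matches' % len(matches))
    for match in matches[:5]:
        print('  %r' % (match,))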

Is London the greenest place in England?

http://www.openheatmap.com/view.html?map=IncisoryPetromyzontSphaeriaceae

I was helping Tony Hirst get up-and-running with OpenHeatMap when I first ran across the Guardian's analysis of data on CO2 emissions across the UK. What leapt out at me was how little London was pumping out. Almost every borough is producing around 4 or 5 tonnes per person, and nowhere else in the country other than the Highlands can beat that. Northumberland manages to produce 18 tonnes for every inhabitant!

There are some anomalies: the City of London and Westminster are off the charts because so few people count as living there despite a massive daily influx, and I wonder how much of Kent's output is related to supplying London's needs. The overall picture is clear though – if you want to reduce your carbon emissions, the city is the place to go. I vaguely knew this already, but the difference here was eye-popping. It turns out that encouraging denser development to reduce the impact on the environment is gaining ground as an idea, but seeing the difference so starkly makes me think twice about my choices.

Here's an Excel spreadsheet with the raw data, ready to upload into OpenHeatMap if you want to see the full details. I'm no environmental scientist, so what do you think? Am I missing something, or should we all flee to the cities?

Netezza shows there’s more than one way to handle Big Data

Photo by Nick Dimmock

As you may have noticed, I'm a strong advocate of MapReduce, Hadoop and NoSQL, but I'm not blind to their limits. They're perfect for my needs primarily because they're dirt-cheap. There's no way on God's earth I could build my projects on enterprise systems like Oracle; the prices are just too high. The trade-off I make is that I spend a lot of my own time on learning and maintenance, whereas a lot of the cost of commercial systems reflects the support companies receive, so they don't need their own employees to geek out like that.

Another thread has been the relatively poor price/performance ratio of standard SQL databases when it comes to terabytes of data. That's where Netezza has been interesting, and today's announcement that they were being acquired by IBM highlighted how much success they've had with their unique approach.

The best way to describe Netezza's architecture is that they built the equivalent of graphics cards, but instead of being focused on rendering they worked at the disk controller level, implementing common SQL operations as the data was being loaded from the underlying files, before it even reached the CPU. As a practical example, instead of passing every row and column from a table up to main system memory, their FPGA-based hardware has enough smarts to weed out the unneeded rows and cells and send a much smaller set up for the CPU to do more complex operations on. For more information, check out Curt Monash's much more in-depth analysis.
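
In software terms it's analogous to pushing the filter and the column projection down into the read loop, instead of materializing the whole table and filtering afterwards. Here's a conceptual sketch of that difference, nothing to do with Netezza's actual implementation:

import csv

def scan(path, wanted_columns, predicate):
    # Yield only the needed columns of only the rows that pass the predicate,
    # while streaming records off disk, rather than loading the full table first.
    with open(path) as f:
        for row in csv.DictReader(f):
            if predicate(row):
                yield {col: row[col] for col in wanted_columns}

# Example: only two columns of the matching rows ever reach the caller;
# the file name and fields are made up for illustration.
for row in scan('sales.csv', ['region', 'total'], lambda r: r['year'] == '2010'):
    print(row)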

Why does this matter? This completely flies in the face of the trend towards throwing a Mongolian horde of cheap commodity servers at Big Data problems in the way that Google popularized. It also suggests that there might be another turn of the Wheel of Reincarnation about to get started. These happen when a common processing problem's requirements outstrip a typical CPU's horsepower, and the task is important enough that it becomes worthwhile to build specialized co-processors to handle it. CPU power then increases faster than the problem's requirements, and so eventually the hardware capability gets folded back into the main processor. The classic example from my generation is the Amiga-era graphics blitters for 2D work which got overwhelmed by software renderers in the 90's, but were then reincarnated as 3D graphics cards.

At the moment we're in the 'Everything on the CPU with standard hardware' trough of the wheel for Big Data processing, but with Google hinting at limits to its MapReduce usage, maybe we'll be specializing more over the next few years?

OpenHeatMap expands to Ireland


Thanks to some dedicated work from Steve White and Ben Raue's Creative Commons-licensed release of the boundaries, you can now visualize counties and constituencies in the Republic of Ireland. Steve's already dived into mapping both general election results and the Lisbon referendum, and I have the 2012 electoral boundaries prepared too.

Would you like your country added to OpenHeatMap? Drop me an email and I'll get on the case!

Calling all Boston and New York data geeks

Photo by Marcos Vasconcelos

I'm flying out to the East Coast next weekend and have some gaps in my schedule. I'm attending a planning meeting for the upcoming Strata O'Reilly conference, and I would love to get thoughts on sessions from other data geeks. So if you're free in downtown Boston at 2pm on Friday September 24th, or in Manhattan at 3pm on Monday September 27th, and you're interested in Big Data and visualizations, drop an email to pete@mailana.com and I'll send you the meetup details.