My last post was a quick rant about the need for a decent open geocoder, but what's wrong with the ones we have? I've created a command-line tool to explore their quality: https://github.com/petewarden/geocodetest.
As a first pass, I pulled together a list of six addresses, some from my past and a few from spreadsheets users have uploaded to OpenHeatMap. The tool runs through the list (or any file of addresses you give it), geocodes each address through DSTK and Nominatim, and returns a CSV showing whether each location is within 100 m of Google's result. Run the script with -h to see all the options. Here are the results, produced by running ./geocodetest.rb -i testinput.txt:
Y,Y,Y,2543 Graystone Place, Simi Valley, CA 93065
Y,Y,N,400 Duboce Ave, #208, San Francisco CA 94117
Y,Y,N,11 Meadow Lane, Over, Cambridge, CB24 5NF, UK
N,Y,N,VIC 3184, Australia
N,Y,Y,Lindsay Crescent, Cape Town, South Africa
N,Y,N,3875 wilshire blvd los angeles CA
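The "within 100 m of Google's result" check behind each Y/N flag can be sketched roughly as below. This is a minimal Ruby sketch under stated assumptions, not the code in geocodetest.rb: the names haversine_m and match_flag are hypothetical, each geocoder's answer is assumed to be already reduced to a [latitude, longitude] pair in degrees, and a geocoder that returns nothing is assumed to count as a miss.

```ruby
# Hypothetical helpers for illustration; not taken from geocodetest.rb.
EARTH_RADIUS_M = 6_371_000.0 # mean Earth radius in metres

# Great-circle (haversine) distance in metres between two
# [latitude, longitude] pairs given in degrees.
def haversine_m(a, b)
  lat1, lon1, lat2, lon2 = (a + b).map { |deg| deg * Math::PI / 180.0 }
  dlat = lat2 - lat1
  dlon = lon2 - lon1
  h = Math.sin(dlat / 2)**2 +
      Math.cos(lat1) * Math.cos(lat2) * Math.sin(dlon / 2)**2
  # Clamp to 1.0 so rounding error can never push asin out of its domain.
  2 * EARTH_RADIUS_M * Math.asin(Math.sqrt([h, 1.0].min))
end

# 'Y' if the candidate lies within threshold_m of the reference, else 'N'.
# A missing result (nil) on either side is treated as 'N'.
def match_flag(candidate, reference, threshold_m = 100.0)
  return 'N' if candidate.nil? || reference.nil?
  haversine_m(candidate, reference) <= threshold_m ? 'Y' : 'N'
end

# Made-up coordinates, for illustration only.
reference = [37.7694, -122.4290]
puts match_flag([37.7698, -122.4290], reference) # roughly 44 m to the north
puts match_flag(nil, reference)                  # geocoder returned no result
```

A fixed radius like this is crude, since it penalises a geocoder equally for landing 150 m away and for landing in the wrong country, but it keeps the output easy to scan.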
The first three are standard test cases for me, so it's not a massive surprise that my DSTK (based on Schuyler Erle and GeoIQ's original work) works better than Nominatim for two of them. It does highlight one of the reasons I've struggled to use Nominatim though: it's not good at coping with alternative address forms. This makes it quite brittle, especially for places like the UK where there are multiple common permutations of village, city, and county names. Nominatim doesn't return any results for #2 or #3 at all, when I'd hope for at least a town-level approximation.
Nominatim places the Australian postal code about 30 km from Google's result, whereas the open GeoNames data in the DSTK gets me to within 400 m of Google. Nominatim does much better on the South African address, since I haven't imported OSM data into the DSTK for anywhere but the UK. I did have to correct the original user-entered spelling of 'Cresent' first though, and I'd love to see an open geocoder that was robust to this sort of common mistake. The last address is another sloppy one, but we should be able to cope with that one too!
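To make the 'Cresent' problem concrete, here is one possible way a geocoder could tolerate small typos: compare the user's text against known names by edit distance and accept the closest one if it is only a couple of edits away. This is a sketch of a general technique (Levenshtein distance), not a description of how DSTK or Nominatim work; the function names and the list of known street names are made up for illustration.

```ruby
# Levenshtein edit distance via dynamic programming, case-insensitive.
# Counts the single-character inserts, deletes and substitutions needed
# to turn string a into string b.
def edit_distance(a, b)
  a = a.downcase
  b = b.downcase
  prev = (0..b.length).to_a # distances from the empty prefix of a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1,           # delete from a
               curr[j - 1] + 1,       # insert into a
               prev[j - 1] + cost].min # substitute (or match)
    end
    prev = curr
  end
  prev.last
end

# Returns the known name closest to the query, or nil if even the best
# candidate is more than max_edits away.
def closest_name(query, known_names, max_edits = 2)
  best = known_names.min_by { |name| edit_distance(query, name) }
  best if best && edit_distance(query, best) <= max_edits
end

# Hypothetical gazetteer entries, for illustration only.
streets = ['Lindsay Crescent', 'Lindsay Road', 'Linden Close']
puts closest_name('Lindsay Cresent', streets)
```

Scanning every known name like this would be far too slow against a real street database, so a practical version would need an index of some kind, but the matching rule itself stays the same.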
Part of the reason there hasn't been more progress on open geocoders is that the problems are not very visible. I hope having an easy test harness changes that, and while this first pass is far from scientific, it's already inspired me to put in several fixes to my own code. I'm a big fan of the effort that's been put into the Nominatim project (I'm using their OSM loading code myself); I'm just disappointed that the results haven't been good enough to build services like OpenHeatMap on top of. I'll be expanding this tool to cover more addresses and so build a better 'map' of how we're doing, and what remains to be done. I'm excited by the opportunities to make progress here. I'll be busy working more on my own efforts, and I can't wait to hear other folks' thoughts too.