Five short links

Photo by Doug88888

Stanford’s Wrangler – A promising data-wrangling tool, with a lot of the interactive workflow that I think is crucial.

Open Knowledge Conference – They’ve gathered an astonishing selection of speakers. I’m really hoping I can make it out to Berlin to join them.

The Privacy Challenge in Online Prize Contests – It’s good to see my friend Arvind getting his voice heard in the debate around privacy.

The Profile Engine – A site that indexes Facebook profiles and pages, with their permission.

Acunu – I met up with this team in London, and they’re doing some amazing work at the kernel level to speed up distributed key/value stores, thanks to some innovative data structures.

Kindle Profiles are so close to being wonderful

"Propose to an Englishman any … instrument, however admirable, and you will observe that the whole effort of the English mind is directed to find a difficulty, defect, or an impossibility in it. If you speak to him of a machine for peeling a potato, he will pronounce it impossible; if you peel a potato with it before his eyes, he will declare it useless, because it will not slice a pineapple"

I'd completely forgotten about this deliciously bitter quote from Charles Babbage in The Philosophical Breakfast Club, but thanks to Amazon's Kindle profiles site, I rediscovered it listed in my highlights. I was very excited when I stumbled across this social feature, since I've been looking for an automatic way to share my reading list with friends. I've even experimented with scripts to scrape the reading history from my account, but never got anything complete enough to use. My dream is a simple blog widget showing what I'm reading, but without the maintenance involved in updating GoodReads with my status. I'm often reading on a plane or in bed at night, so the only way I'll have something up to date is if it uses information directly from my Kindle. I looked at the highlights page, and it looked like exactly what I was after, a chronological list of notes and the books I'd been reading recently:

[Screenshot of my Kindle highlights page]

Now all I needed to do was figure out how to make that page public. First, I had to go through all 160 books and manually tick two checkboxes next to each of them: one making the book public, and another making my notes on it available. That was a bit of a grind (and something I guess I'll need to do for every book as I read it), but worth it if I could easily publish my highlights. After that, though, I realized there was nothing like a 'blog' page for my notes that was available to anyone else. The closest is this one for my public notes:

https://kindle.amazon.com/profile/Peter-C–Warden/11996/public_notes

It just has covers for the five books I most recently altered the state of, whether or not they have any notes or highlights, and you have to click through to find any actual notes. The "Your Highlights" section that only I can access is perfect: its simplicity is beautiful, and it's exactly what I would like to share with people. Short of posting my account name and password here, does anyone have any thoughts on how I could get it out there? Anybody at Amazon I can beg?

Facebook and Twitter logins aren’t enough

Photo by Karen Horton

A couple of months ago I claimed "These days it doesn't make much sense to build a consumer site with its own private account system" and released a Ruby template that showed how to rely on just Facebook and Twitter for logins. It turns out I was wrong! I always knew there would be some markets that didn't have enough adoption of those two services, but thought that the tide of history would make them less and less relevant. What I hadn't counted on was kids.

My Wordlings custom word cloud service has seen a lot of interest from teachers who want to use it with their students, but especially amongst pre-teens, there's little chance they're on either Facebook or Twitter. They may not even have an email address to use! Since that's not likely to change, I added a new "Sign in for Kids" option that just requires a name, not even a password. It has the disadvantage that once you log out, you can't edit any of your creations, but that seems a small price to pay to make the service more accessible.

Using Hadoop with external API calls

Photo by Joe Penniston

I've been helping a friend who has a startup that relies on processing large amounts of data. He's using Hadoop for the calculation portions of his pipeline, but has a home-brewed system of queues and servers for handling other parts like web crawling and calls to external API providers. I've been advising him to switch almost all of his pipeline to run as streaming jobs within Hadoop, but since there's not much out there on using it for those sorts of problems, it's worth covering why I've found it makes sense and what you have to watch out for.

If you have a traditional "Run through a list of items and transform them" job, you can write that as a streaming job with a map step that calls the API or does another high-latency operation, and then use a pass-through reduce stage.
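To make that concrete, here's a minimal sketch (in Python) of what the two streaming scripts might look like. The API endpoint is a made-up placeholder, the input is assumed to be one item per line, and a real job would want retries and smarter error handling.

#!/usr/bin/env python
# mapper.py - map step that calls an external API for each input line.
# The endpoint below is a placeholder, not a real service.
import json
import sys
import urllib.parse
import urllib.request

API_URL = "https://api.example.com/lookup?q="

for line in sys.stdin:
    item = line.strip()
    if not item:
        continue
    try:
        with urllib.request.urlopen(API_URL + urllib.parse.quote(item), timeout=30) as response:
            result = json.load(response)
        # Hadoop Streaming expects tab-separated key/value pairs on stdout.
        print("%s\t%s" % (item, json.dumps(result)))
    except Exception as error:
        # Failures go to stderr so they end up in the task logs, not the job output.
        sys.stderr.write("failed %s: %s\n" % (item, error))

#!/usr/bin/env python
# reducer.py - pass-through reduce stage that copies its input straight to the output.
import sys

for line in sys.stdin:
    sys.stdout.write(line)

You'd launch the pair with the standard hadoop-streaming jar, passing -mapper mapper.py, -reducer reducer.py and the usual -input and -output paths.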

The key advantage is management. There's a rich ecosystem of tools like ZooKeeper, MRJob, ClusterChef and Cascading that let you define, run and debug complex Hadoop jobs. It's actually comparatively easy to build your own custom system to execute data-processing operations, but in reality you'll spend most of your engineering time maintaining and debugging your pipeline. Having tools available to make that side of it more efficient lets you build new features much faster, and spend much more time on the product and business logic instead of the plumbing. It will also help as you hire new engineers, as they may well be familiar with Hadoop already.
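As an illustration of what that tooling buys you, here's a rough sketch of the same kind of job written against MRJob; the endpoint is still a placeholder. The payoff is in the running and debugging: the identical class can be launched locally with python api_lookup.py -r inline input.txt while you're tracking down problems, or pointed at a real cluster with -r hadoop or -r emr.

# api_lookup.py - the API-calling job expressed as an MRJob class (sketch only).
import json
import urllib.parse
import urllib.request

from mrjob.job import MRJob

API_URL = "https://api.example.com/lookup?q="  # placeholder endpoint


class ApiLookupJob(MRJob):
    def mapper(self, _, line):
        item = line.strip()
        if not item:
            return
        with urllib.request.urlopen(API_URL + urllib.parse.quote(item), timeout=30) as response:
            yield item, json.load(response)

    def reducer(self, key, values):
        # Pass-through reduce stage: re-emit whatever the mappers produced.
        for value in values:
            yield key, value


if __name__ == "__main__":
    ApiLookupJob.run()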

The stumbling block for many people when they think about running a web crawler or external API access as a MapReduce job is the picture of an army of servers hitting the external world far too frequently. In practice, you can mostly avoid this by using a single-machine cluster tuned to run a single job at a time, which serializes the access to the resource you're concerned about. If you need finer control, a pattern I've often seen is a gatekeeper server that all access to a particular API has to go through. The MapReduce scripts then call that server instead of going directly to the third party's endpoint, so that the gatekeeper can throttle the frequency to stay within limits, back off when there are 50x errors, and so on.
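Here's a bare-bones sketch of what such a gatekeeper might look like, written with Flask and the requests library (my choice of tools for the example, not anything in particular my friend is running). It serializes upstream calls behind a lock, enforces an assumed one-call-per-second limit, and turns upstream 50x responses into a 503 so the calling scripts know to back off and retry.

# gatekeeper.py - a minimal throttling proxy for a third-party API (sketch only).
import threading
import time

import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

UPSTREAM_URL = "https://api.example.com/lookup"  # placeholder third-party endpoint
MIN_INTERVAL = 1.0  # assumed limit: at most one upstream call per second

lock = threading.Lock()
last_call_time = [0.0]


@app.route("/lookup")
def lookup():
    # Serialize access so the upstream service never sees calls closer together
    # than MIN_INTERVAL, no matter how many mappers are hitting this gatekeeper.
    with lock:
        wait = MIN_INTERVAL - (time.time() - last_call_time[0])
        if wait > 0:
            time.sleep(wait)
        upstream = requests.get(UPSTREAM_URL, params=request.args.to_dict(), timeout=30)
        last_call_time[0] = time.time()
    if upstream.status_code >= 500:
        # Signal the MapReduce scripts to back off and retry later.
        return jsonify({"error": "upstream unavailable"}), 503
    return upstream.content, upstream.status_code, {"Content-Type": "application/json"}


if __name__ == "__main__":
    app.run(port=8080)

The streaming scripts would then fetch something like http://gatekeeper:8080/lookup?q=... instead of the third-party URL.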

So, if you are building a new data pipeline or trying to refactor an existing one, take a good look at Hadoop. It almost certainly won't be as snug a fit as your custom code (it's like using Lego bricks instead of hand-carving), but I bet it will be faster and easier to build your product with. I'll be interested to hear from anyone who has other opinions or suggestions too, of course!

The Best Park in San Francisco

All photos © Heather Champ, with permission

My biggest worry when I moved to San Francisco in December was that my dog Thor would find urban life tough, without the wide-open spaces we'd got used to in Colorado. I found an apartment with a wide, sunny window-sill for him to lie on, right next to the great off-leash Duboce dog park, but what I didn't realize until I started exploring was that there was another gem just half a mile away. The first signs I saw of Buena Vista Park were the trees at its peak looming over the neighborhood, their tops wrapped in fog. Following Duboce Avenue uphill to its end, and then meandering through 37 acres of woodland, I found myself 575 feet high, looking out over the city as the clouds cleared.

It's now become our morning walk, and we can make it to the peak and back in 45 minutes if I'm in a hurry. That's not often though, because catching up with the other regulars has become part of the pleasure. In the peculiar way of the dog-walking world, I often feel like I know the canines before I've properly met the owners, especially since I can't compete with Thor's natural charms. A great case in point is Bug and Chieka's owner, or as she's better known in the tech world, Heather Champ, the pioneering former Flickr community manager who furnished these photos.

Though it's not on the same scale as Golden Gate or Presidio, it's actually the oldest official park in the city, and is full of corners and history to explore. It was only when I was chatting to one of the gardeners that I realized the marble chunks lining some of the paths were actually fragments of gravestones from the cemetery that was razed by WPA workers in the 30's. At night its nooks can make it less welcoming, with hoboes camping out and casual hookups, but in the morning it's a slice of heaven.


If you live anywhere near the Lower Haight, Upper Castro or Noe Valley areas, you really should check out this urban garden. There's nothing like wandering through groves of Eucalyptus and Redwoods, watching the fog blow through the canopy, to refresh your soul after working through painfully obtuse YouTube comments (just to pick a random example, ahem). And if you see a cute Chihuahua mix with a snaggle-tooth, be sure to make a fuss of him.


Location Tracking as Art

[Living Brushstroke trace from the Chiditarod]

I just discovered Maria Scileppi’s Living Brushstroke project, which uses iPhone location tracking to create artistic views of people’s movements. The picture above is from the Chiditarod, “Probably the world’s largest pub crawl/food drive”. Almost everyone’s first reaction to seeing the trace of their movements retrieved from the iPhone logs is the same as ours: “Cool!”. If we can give users control over their own data (asking permission to record, as Maria’s app does), there are so many amazing projects like this we can build.

Anthem from Maria Scileppi on Vimeo.

Five Short iPhone Tracking Links

Photo by Evil Science Chick

Some of you may have noticed a weekend hack I put together with Alasdair Allan for visualizing location data on your iPhone. Here are some random links related to the project:

Tell-All Telephone – View a German politician's life as a visualization, after he agreed to have detailed recording software added to his phone. [Update: I misread the story; the information was actually gathered without his knowledge, but he agreed to share it afterwards to raise awareness. As commenter Seve says, "The movement data of the German politician Malte Spitz were collected by law and not with his agreement. The movement and call data of everybody connected to a German mobile phone network were stored for six months. Finally the German supreme court stopped that law and the data were deleted."]

Geoloqi – A fascinating system that lets you volunteer to track your own movements, and share certain aspects of them with people you trust.

Location tracking on Android too – Detailed and fair technical analysis of the way that Android monitors your location, and how it differs from Apple's approach.

A Cryptographic Approach to Location Privacy – There are ways to get a lot of the benefits of location services without recording or revealing your position. Arvind's proposal shows how one application could work in a secure way.

22 Free Tools for Data Visualization and Analysis – The navigation is tough without a table of contents, but this is a good overview of a lot of the tools you can use to turn your data into a visual story.

Data Science Toolkit 0.35 released

Photo by Wonderlane

I've just completed deploying a new version of the toolkit. It contains quite a few bug fixes and improvements, along with two new features:

UK Support 

You can now enter British postal addresses into the street2coordinates geocoder, and it will return the coordinates. With post codes included, it's normally accurate to within a couple of hundred feet. You can then run those positions through coordinates2politics to get the parliamentary constituency, county, council district and ward, NHS area and post code.
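Here's a rough Python sketch of chaining the two calls together, using the requests library; treat the field names as illustrative and check them against the live responses if anything looks off.

import urllib.parse

import requests

DSTK = "http://www.datasciencetoolkit.org"
address = "10 Downing Street, London SW1A 2AA"

# Geocode the UK address to a latitude/longitude pair.
coords = requests.get(DSTK + "/street2coordinates/" + urllib.parse.quote(address)).json()
location = list(coords.values())[0]
print(location["latitude"], location["longitude"])

# Look up the political areas containing that point.
politics = requests.get(
    "%s/coordinates2politics/%s,%s" % (DSTK, location["latitude"], location["longitude"])
).json()
for area in politics[0]["politics"] or []:
    print(area["friendly_type"], "-", area["name"])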

Time and date extraction

The new text2times method will scan through the text that you pass it, and pull out any strings that it can understand as times or dates. These include formal date/time combinations like '10/28/01' as well as informal descriptions like 'next Friday'.
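Here's a minimal sketch of calling it over HTTP from Python with the requests library; I'm just dumping the raw JSON rather than picking out particular fields.

import json

import requests

text = "Let's meet next Friday, since 10/28/01 didn't work out."
response = requests.post("http://www.datasciencetoolkit.org/text2times", data=text)
print(json.dumps(response.json(), indent=2))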

You can try it out at http://www.datasciencetoolkit.org/; the command-line tools are at http://www.datasciencetoolkit.org/python_tools.zip; the new AMI is ami-f6e11d9f; and there's a new VM at http://static.openheatmap.com/dstk_v0.35.vmwarevm.tar.bz2

You should still be able to use your existing code with no changes; I've done my best to ensure everything's backwards-compatible, but let me know if anything breaks.

Five Short Links

Photo by Nicolas Suzor

Newscoop – A content-management system designed by and for journalists. It’s been used in conjunction with Ushahidi to interesting effect.

Crisismappers extends UN capacity in Libya – As the power of crowd-mapping becomes more obvious, there will be pressure to use it in more ambiguous situations than natural disasters or the toppling of tyrants. Mapping military airbases in Libya gives a hint of what crowd-sourced warfare could look like.

StarCluster – A simple way to run and manage clusters of EC2 machines for scientific computing, complete with AMIs pre-loaded with useful software and sensible defaults. If this is your thing, you should check out Infochimps’ impressive ClusterChef too.

Fathom – A design firm with a portfolio of clear, crisp and beautiful infographics.

Trunkler – Link curation for your iPhone, built on top of the powerful Trunk.ly service. Shows why having a third-party API can pay off, even in the early days.

The cock-up theory of technology news

Photo by Joshua Mellin

A few times over the last few weeks I've been talking to friends about big tech company news, and one of the hardest struggles I have is to convince them that the latest Twitter or Google happening could just be a random cock-up, rather than a sign of hidden plans by the company's management. I could well be wrong, but Google releasing a group-messaging app that doesn't run on Android reminds me of the time I was involved in launching a GPU image processing API at Apple that competed against a similar interface created by another team at almost the same time. We mostly managed to keep that sort of thing from reaching the outside world, but every team and department in a large corporation is competing against all the other teams for a slice of the budget pie, and can have very different goals. Good management will keep the worst excesses in check, but any large organization is a massively distributed system where the communication overhead of keeping everyone totally in line would be crippling.

I was sorry to learn that large dinosaurs are no longer thought to have had an extra brain in their buttocks, as I'd memorably learned as a child from the Dinosaur Club. It's still a pretty good image for the decision-making apparatus in big companies though. Upper-management's most precious resource is time, and so attention has to be rationed. Especially at a company like Google that prizes experimentation, that means lower-level people can release projects into the wild that don't fit into the grander strategy. As another example from Apple, I know several teams that still hadn't ported their code over to the Cocoa framework even by the time I left in '08. Eating your own dog food is a fine goal, but if it comes down to that or shipping, sometimes pragmatism wins.

Think of my approach as the cock-up theory of technology news. The next time one of the big firms does something that makes no sense at all, consider taking it at face value. As General Nasser said in the 50's – "The genius of you Americans is that you never make clear-cut stupid moves, only complicated stupid moves which make the rest of us wonder at the possibility that we might be missing something".