Five short links

Chain
Photo by Wink

Tribalytic – My friends Alex and Tim have been doing some fascinating work applying statistical analysis to Twitter conversations. Their 'space shuttle control panel' interface can be a bit off-putting, but if you dig down you can see insights into the Chirp talks that prompted the most traffic and spot Binh from Klout looking forward to the after-party.

Breadcrumbs – Kate McKinley's created a great demonstration of how many different ways there are to store data about a user, so advertisers can identify them as they move around the web. She's also produced a good paper covering the details, which reminds me of Arvind's work on 'super-cookies'.

Spinn3r have announced in their email newsletter that they're now crawling public Facebook pages and making the results available as a feed for their commercial subscribers. "We're indexing all Facebook public pages, which do not require login, including public fan pages and their wall posts, videos, albums and pictures. We also index Facebook public groups including topics and the comments responding to these topics. The current volume is in excess of 50k permalinks and 30k comments per hour."

Sendgrid have received $5m in VC funding. I went through Techstars this summer with Isaac and Jose, and loved their quiet focus on solving a vital problem, helping companies reach their email subscribers without ending up in the spam box. They have been kicking ass and earning revenue, and this injection of cash will help them reach even higher.

"Plates of Spaghetti" graphs – Almost all visualizations are terrible at communicating information, but are often fantastic marketing devices, drawing people into looking into the source data. I like the quote "The “data visualizations of the year” really are impressive if you think
of them as super-cool illustrations (replacements for the usual photos or drawings that might accompany a newspaper or magazine article) rather than as visual displays of quantitative information". I've long been mulling a post entitled "Most visualizations are useless" (including mine!).

Five short links

Sausages
Photo by Tammy Green

I'm flying off to visit my family in the UK today, but I have a backlog of interesting URLs I wanted to blog about, so I'm temporarily stealing Nat Torkington's Four Short Links format. However, since I go up to 11, my version has five.

Thoughts from the Man Who Would Sell the World, Nicely – I've long been a fan of 80leg's service, they're democratizing crawling. How will the crawled companies react to this now that literally anyone can download millions of profiles from services like LinkedIn and MySpace, with no licensing or terms-of-service restrictions?

Fetch Technologies – On the same topic, I don't know that much about Fetch but they seem to be a sophisticated and well-funded company based on crawling the public web to gather information for commercial purposes.

The World Bank Bares All – I was very excited to discover that the World Bank offers over 1,000 different measures for countries for free. Not only that, but you can download a CSV file of all the data, instead of being restricted to an API. I'm now using this for an upcoming project, and I hope more providers consider data dumps in addition to APIs, since they open up so many more uses.

IndieMapper.com – A well-produced service for visualizing geographic data on the web. It's great to see more GIS tools migrating online, since it opens up the results to a much larger audience.

Never hire job hoppers. Never. They make terrible employees – Mark's since walked this article back a bit. It reminded me of the evidence that we all have a bias to hire people exactly like ourselves, and Bob Sutton's take on it: "Interviews are strange in that people have excessive confidence in them, especially in their own abilities to pick winners and losers — when in fact the real explanation is that most of us have poor and extremely self-serving memories."

How to look up locations from IP addresses for free

Youarehere
Photo by Mag3737

I'm working on a project where I need to convert large numbers of IP addresses to latitude/longitude positions, and I was pretty depressed looking at Quova's rates starting at $8 per thousand queries. I was happy to lose a bit of quality for a cheaper rate, so I was overjoyed to come across MaxMind's free database of city-level IP lookups. Even better, I could install it on my own server rather than making remote API calls, which makes dealing with large amounts of lookups a lot quicker.

There was some example PHP code available, but it had PEAR dependencies I'd rather avoid, so I made some alterations and uploaded my sample code to github.com/petewarden/geoip_example with a live demo running at web.mailana.com/labs/geoip_example/

Before you can run it on your own server you'll need to install the data files, either using the one included in the package or downloading the latest from http://www.maxmind.com/app/geolitecity

Once you have the GeoLiteCity.dat file downloaded and unzipped, copy it to /usr/local/share/GeoIP, or update the code to reflect the location you've actually installed it in.
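To give a feel for why a local database makes bulk lookups so quick, here's a minimal Python sketch of the general idea: turn each IP address into an integer and binary-search a sorted table of address ranges. This is not MaxMind's file format or API, and it isn't the PHP sample code. The table rows, coordinates and function names are all made up for illustration, using reserved documentation address blocks.

```python
import bisect
import ipaddress

# Hypothetical sample rows: (first IP, last IP, latitude, longitude).
# The addresses are reserved documentation blocks and the coordinates
# are invented; a real city-level database has millions of ranges.
SAMPLE_RANGES = [
    ("192.0.2.0", "192.0.2.255", 40.015, -105.27),
    ("198.51.100.0", "198.51.100.255", 51.507, -0.128),
    ("203.0.113.0", "203.0.113.255", 37.775, -122.419),
]


def ip_to_int(address):
    """Convert a dotted-quad IPv4 string to its integer value."""
    return int(ipaddress.IPv4Address(address))


def build_index(rows):
    """Sort the ranges by start address and return (starts, index)."""
    index = sorted(
        (ip_to_int(start), ip_to_int(end), lat, lon)
        for start, end, lat, lon in rows
    )
    starts = [row[0] for row in index]
    return starts, index


def lookup(address, starts, index):
    """Return (latitude, longitude) for an address, or None if unknown."""
    value = ip_to_int(address)
    # Find the last range whose start is <= the address.
    position = bisect.bisect_right(starts, value) - 1
    if position < 0:
        return None
    _, end, lat, lon = index[position]
    if value > end:
        return None  # The address falls in a gap between ranges.
    return (lat, lon)


STARTS, INDEX = build_index(SAMPLE_RANGES)
print(lookup("198.51.100.42", STARTS, INDEX))
```

Because each query is just an in-memory binary search, there's no network round trip per address, which is what makes the difference when you have a large batch to process.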

Big thanks to the MaxMind folks for making this available under the LGPL, they'll definitely be getting my business next time I need a paid geo-location service.

Is a phone book for the internet emerging?

Glasgowphonebook
Photo by Martin Deutsch

A programmer's basic instinct is to automate any manual task you find yourself doing repeatedly. That's why I'm amazed we haven't built better solutions for finding people online. Most people go through the following steps when they want to know more about someone they've just met:

1 – Type a name into Google, LinkedIn or Facebook and see what public profiles appear.

2 – Figure out which profiles are for the right person based on what else you know about them, either a rough location, their job, friends you have in common or other sites they list.

As more and more information is published about individual users step two gets easier, because you can cross-check across multiple accounts. Maybe my LinkedIn profile doesn't give enough details to be sure that I'm the Pete Warden you met, but it links to my Twitter account where I'm rambling about my upcoming UK visit, and that fits with the funny accent you remember.

What's missing is a good set of tools to assist the second step. It's silly to have people wasting time doing this sort of detective work manually, when some simple automation would speed up the whole process. The data on Twitter, LinkedIn and other public profiles has some structure, it just requires some smarter indexing on the search engine side to make use of it. My Twitter profile lists data in hCard format so it's easy to figure out that http://twitter.com/petewarden is about a person called "Pete Warden" based in Boulder, CO. My LinkedIn profile also uses hCard and describes a person called "Pete Warden" in the Greater Denver Area. Why not make a wild guess and present all the profiles that are close matches like that together in the search results? Sure, the grouping will be wrong sometimes, but most of the time it will cut out a lot of messing around on the user's part to do the same process manually.
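As a rough illustration of that 'wild guess' grouping, here's a small Python sketch that buckets profiles together when the name matches and the locations resolve to the same region. It assumes the name and location have already been pulled out of each page's hCard. The region alias table, the field names and the two example.com-style URLs are hypothetical, and a real search engine would need much fuzzier matching than this.

```python
from collections import defaultdict

# Hypothetical mapping from a city to the broader region another
# profile might list instead.
REGION_ALIASES = {
    "boulder, co": "greater denver area",
    "denver, co": "greater denver area",
}


def normalize_name(name):
    """Lowercase the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def normalize_location(location):
    """Normalize a location and map known cities to their region."""
    key = " ".join(location.lower().split())
    return REGION_ALIASES.get(key, key)


def group_profiles(profiles):
    """Group profile URLs that share a normalized name and region."""
    groups = defaultdict(list)
    for profile in profiles:
        key = (
            normalize_name(profile["name"]),
            normalize_location(profile["location"]),
        )
        groups[key].append(profile["url"])
    return dict(groups)


# Sample records; only the Twitter URL comes from the post itself.
PROFILES = [
    {"url": "http://twitter.com/petewarden",
     "name": "Pete Warden", "location": "Boulder, CO"},
    {"url": "https://linkedin.example/petewarden",
     "name": "Pete Warden", "location": "Greater Denver Area"},
    {"url": "https://profiles.example/pete-warden-uk",
     "name": "Pete Warden", "location": "London, UK"},
]

for key, urls in group_profiles(PROFILES).items():
    print(key, urls)
```

The grouping will be wrong sometimes, for instance when two people with the same name live in the same area, but it does the first pass of detective work that users currently do by hand.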

Google's Profiles would be great holders for that sort of information, but they require users to fill out yet another set of forms. Sites like 123people.com try to automate the whole process, but frankly don't do a good job and are packed with off-putting ads. 

It's the spread of services like Gist, Xobni and Rapportive that gives me hope that change is on the horizon. Data flows into them from either their own customer base or providers like Rapleaf, and they're starting to build unified pictures of people online. Just like a phone book in the old days, you should be able to enter someone's name and get whatever information they've chosen to publish about themselves.

Sheep, sex and Nazis

Sheep
Photo by Gisela Giardino

Maybe it's growing up with British tabloid headlines, but I wish Sam Apple had chosen "Sheep, sex and Nazis" as a title instead of "Schlepping through the Alps". The phrase is his own description of the world he chronicles, and is closer to the spirit of the book than the actual title which had me expecting a light-hearted travelogue in the vein of "Round Ireland with a Fridge".

Sam's the editor of The Faster Times and he got in touch to offer his sympathies after my run-in with Facebook. Since anybody who likes my blog obviously has excellent judgment, I googled him and was intrigued to see he'd written about his experiences as a young American Jew following a wandering Yiddish-singing shepherd around Austria.

I started reading unsure what to expect, but what struck me was the honesty of his descriptions. That quality sounds unremarkable, but is amazingly hard to achieve because people are so contradictory. Whenever you're telling a story rough edges get rounded off to make it flow, even if it's just omitting certain details. He manages to capture an intimate portrait of a few people he got close to, and through them of a whole country with a lingering dark past, without simplifying details to make answers to their dilemmas seem easier than they truly are.

He flies to Austria to learn more about Hans Breuer, the last shepherd to wander through the Austrian Alps with his flock. Hans' father is a non-practicing Jew, and Hans himself has become obsessed with the rich Yiddish culture from pre-war Europe, taking it on himself to memorize and perform the old songs wherever he can. Sam's family came to the US from the old world before the Nazis came to power, and he's aghast at Austria's post-war response to the Holocaust. The heart of the book is his attempt to pin down that collective failure by understanding individuals, and his honesty forces him to acknowledge the less noble sides of his own quest, Hans' faults and even the human side of the single open anti-semite he tracks down. These nuances mean you can't help but see parts of yourself in all the characters, and realize that people much like us committed and covered up horrors that are hard to imagine.

If you've ever enjoyed Orwell, I recommend you pick up Schlepping. As he put it, "To see what is in front of one's nose needs a constant struggle", and Sam's obvious struggle to be true to the reality of his subjects brings the book into the same league as "The Road to Wigan Pier" in its insights on a crucial topic.

How to download a list of your Facebook fans

Windturbines
Photo by Tochis

A friend recently pointed me to this New York Times story on web coupons and privacy. Aside from the implications of tying in your social accounts to your buying habits, what struck me was Jonathan Treiber's assertion that "when someone joins a fan club, the user’s Facebook ID becomes visible to the merchandiser".

One of the biggest complaints I've heard from companies involved in Facebook fan pages is that they don't know who their fans are. In traditional direct marketing they have a list of customer names and their postal or email addresses, but with fan pages they only get permission to contact their fans indirectly. Everything has to go through Facebook, and at any point the network could decide to cut them off or charge them to talk to their own customers. Presumably that control has a lot of potential value to Facebook, so they don't offer an official API for page owners to get a list of their fans, but Jonathan's quote got me wondering if there was another way.

Googling around, I ran across this post from Gist's very own Adam Loving explaining how he'd managed to download all the fans for his page. The process is a bit technical, and may well violate the ToS so use it at your own discretion, but it sounds like it's already in widespread usage by page owners.

The story itself makes me think that Beacon 2.0 is likely to be run by third-party companies on top of Facebook and other social network data. There's now widespread access to public profiles across all the networks, and it's going to be hard to stuff that genie back into the bottle.

Is making public data more accessible a threatening act?

Megaphone
Photo by Altemark

One of the most interesting questions to come out of the Facebook debate was about making public data more easily accessible. Everything I was looking at releasing was available through a Google search and through many other commercial companies, so in a simplistic view it was already completely public and releasing it in a convenient form made no difference. However, that doesn't match our intuitive reactions: we are a lot more relaxed when data is theoretically available to anyone but hard to get to than when there's an easy way to access it.

One of my favorite researchers in this area, Arvind Narayanan, recently started a series of articles that try to turn this gut reaction into a usable model. I also spent a very productive lunch with Jud Valeski, Josh Fraser and Jon Fox hashing out the implications of the coming wave of accessibility, so here's a few highlights from that discussion.

Prop 8. Information about donors to political campaigns has always been public, but traditionally required a visit to city hall to dig through piles of paper. Suddenly the donors behind Prop 8 in California found themselves listed on a map anyone could access on the internet. While predictions of violence or boycotts didn't materialize, Scott Eckern ended up resigning from his job once his donation became widely known. I'm pretty certain he wasn't aware that his donation would be public knowledge, and it's a clear case where the distribution channel made the information much more powerful.

InfoUSA. Imagine a thought experiment where I downloaded the income, charitable donations, pets and military service information for all 89,000 Boulder residents listed in InfoUSA's marketing database, and put that information up in a public web page. That's obviously pretty freaky, but absolutely anyone with $7,000 to spare can grab exactly the same information! That intuitive reaction is very hard to model. Is it because at the moment someone has to make more of an effort to get that information? Do we actually prefer that our information is for sale, rather than free? Or are we just comfortable with a 'privacy through obscurity' regime?

So what's my conclusion? On the one hand, the web has created so many amazing innovations because it's a fantastic way to make information more available, and initial privacy concerns have faded into the background as people become more used to services. On the other, the jury's not back on how the revolution will end. Is everyone really going to be their own public broadcaster on Twitter, or are we going to retreat into more private forums in the wake of future freakouts? I don't know the answer, but everyone working in this area needs to be thinking about more than the technical aspects of data accessibility.