Five short links

Photo by Wink

Tribalytic – My friends Alex and Tim have been doing some fascinating work applying statistical analysis to Twitter conversations. Their 'space shuttle control panel' interface can be a bit off-putting, but if you dig down you can see insights into the Chirp talks that prompted the most traffic and spot Binh from Klout looking forward to the after-party.

Breadcrumbs – Kate McKinley's created a great demonstration of how many different ways there are to store data about a user, so advertisers can identify them as they move around the web. She's also produced a good paper covering the details, which reminds me of Arvind's work on 'super-cookies'.

Spinn3r have announced in their email newsletter that they're now crawling public Facebook pages and making the results available as a feed for their commercial subscribers. "We're indexing all Facebook public pages, which do not require login, including public fan pages and their wall posts, videos, albums and pictures. We also index Facebook public groups including topics and the comments responding to these topics. The current volume is in excess of 50k permalinks and 30k comments per hour."

Sendgrid have received $5m in VC funding. I went through Techstars this summer with Isaac and Jose, and loved their quiet focus on solving a vital problem, helping companies reach their email subscribers without ending up in the spam box. They have been kicking ass and earning revenue, and this injection of cash will help them reach even higher.

"Plates of Spaghetti" graphs – Almost all visualizations are terrible at communicating information, but are often fantastic marketing devices, drawing people into looking into the source data. I like the quote "The “data visualizations of the year” really are impressive if you think
of them as super-cool illustrations (replacements for the usual photos
or drawings that might accompany a newspaper or magazine article) rather
than as visual displays of quantitative information
". I've long been mulling a post entitled "Most visualizations are useless" (including mine!)

Five short links

Photo by Tammy Green

I'm flying off to visit my family in the UK today, but I have a backlog of interesting URLs I wanted to blog about, so I'm temporarily stealing Nat Torkington's Four Short Links format. However, since I go up to 11, my version has five.

Thoughts from the Man Who Would Sell the World, Nicely – I've long been a fan of 80legs' service; they're democratizing crawling. How will the crawled companies react now that literally anyone can download millions of profiles from services like LinkedIn and MySpace, with no licensing or terms-of-service restrictions?

Fetch Technologies – On the same topic, I don't know that much about Fetch but they seem to be a sophisticated and well-funded company based on crawling the public web to gather information for commercial purposes.

The World Bank Bares All – I was very excited to discover that the World Bank offers over 1,000 different measures for countries for free. Not only that, but you can download a CSV file of all the data, instead of being restricted to an API. I'm now using this for an upcoming project. I hope more providers consider data dumps in addition to APIs; they open up so many more uses.

IndieMapper.com – A well-produced service for visualizing geographic data on the web. It's great to see more GIS tools migrating online; it opens up the results to a much larger audience.

Never hire job hoppers. Never. They make terrible employees – Mark's since walked this article back a bit. It reminded me of the evidence that we all have a bias to hire people exactly like ourselves, and Bob Sutton's take on it: "Interviews are strange in that people have excessive confidence in them, especially in their own abilities to pick winners and losers — when in fact the real explanation is that most of us have poor and extremely self-serving memories."

How to look up locations from IP addresses for free

Photo by Mag3737

I'm working on a project where I need to convert large numbers of IP addresses to latitude/longitude positions, and I was pretty depressed looking at Quova's rates starting at $8 per thousand queries. I was happy to lose a bit of quality for a cheaper rate, so I was overjoyed to come across MaxMind's free database of city-level IP lookups. Even better, I could install it on my own server rather than making remote API calls, which makes dealing with large numbers of lookups a lot quicker.

There was some example PHP code available, but it had PEAR dependencies I'd rather avoid, so I made some alterations and uploaded my sample code to github.com/petewarden/geoip_example with a live demo running at web.mailana.com/labs/geoip_example/

Before you can run it on your own server you'll need to install the data files, either using the one included in the package or downloading the latest from http://www.maxmind.com/app/geolitecity

Once you have the GeoLiteCity.dat file downloaded and unzipped, copy it to /usr/local/share/GeoIP, or update the code to reflect the location you've actually installed it in.
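If you just want to see the shape of a lookup call, here's a minimal sketch using the function names from MaxMind's legacy GeoIP PHP library (geoip.inc/geoipcity.inc). It isn't the code in my repository, so treat it as an illustration and check it against the current library before relying on it:

<?php
// Minimal sketch of a city-level lookup with MaxMind's legacy PHP library.
// Assumes geoip.inc and geoipcity.inc are on the include path, and that
// GeoLiteCity.dat has been copied to /usr/local/share/GeoIP as described above.
require_once 'geoip.inc';
require_once 'geoipcity.inc';

$handle = geoip_open('/usr/local/share/GeoIP/GeoLiteCity.dat', GEOIP_STANDARD);

$record = geoip_record_by_addr($handle, '67.164.0.1'); // any IPv4 address
if ($record) {
    printf("%s, %s, %s: %f, %f\n",
        $record->city,
        $record->region,
        $record->country_code,
        $record->latitude,
        $record->longitude);
} else {
    echo "No location found for that address\n";
}

geoip_close($handle);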

Big thanks to the MaxMind folks for making this available under the LGPL, they'll definitely be getting my business next time I need a paid geo-location service.

Is a phone book for the internet emerging?

Photo by Martin Deutsch

A programmer's basic instinct is to automate any manual task you find yourself doing repeatedly. That's why I'm amazed we haven't built better solutions for finding people online. Most people go through the following steps when they want to know more about someone they've just met:

1 – Type a name into Google, LinkedIn or Facebook and see what public profiles appear.

2 – Figure out which profiles are for the right person based on what else you know about them, whether a rough location, their job, friends you have in common or other sites they list.

As more and more information is published about individual users, step two gets easier, because you can cross-check across multiple accounts. Maybe my LinkedIn profile doesn't give enough details to be sure that I'm the Pete Warden you met, but it links to my Twitter account where I'm rambling about my upcoming UK visit, and that fits with the funny accent you remember.

What's missing is a good set of tools to assist with the second step. It's silly to have people wasting time doing this sort of detective work manually, when some simple automation would speed up the whole process. The data on Twitter, LinkedIn and other public profiles has some structure; it just requires some smarter indexing on the search engine side to make use of it. My Twitter profile lists data in hCard format, so it's easy to figure out that http://twitter.com/petewarden is about a person called "Pete Warden" based in Boulder, CO. My LinkedIn profile also uses hCard and describes a person called "Pete Warden" in the Greater Denver Area. Why not make a wild guess and present all the profiles that are close matches like that together in the search results? Sure, the grouping will be wrong sometimes, but most of the time it will cut out a lot of the messing around users currently have to do by hand.
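To make that concrete, here's a rough sketch (not anything I've shipped) of pulling the hCard 'fn' and 'locality' fields out of a profile page with PHP's built-in DOM extension. The class names are standard hCard, but every site's markup differs, so a real implementation would need more robust selectors and error handling:

<?php
// Rough sketch: pull the hCard 'fn' (full name) and 'locality' fields out of
// a public profile page. Real-world markup varies a lot, so a production
// crawler would need better selectors, caching and error handling.
function extract_hcard_fields($url) {
    $html = @file_get_contents($url);
    if ($html === false) {
        return null;
    }
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from messy real-world HTML
    $xpath = new DOMXPath($doc);
    $result = array();
    foreach (array('fn', 'locality') as $field) {
        $query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' $field ')]";
        $nodes = $xpath->query($query);
        if ($nodes->length > 0) {
            $result[$field] = trim($nodes->item(0)->textContent);
        }
    }
    return $result;
}

print_r(extract_hcard_fields('http://twitter.com/petewarden'));

Run the same extraction over a handful of candidate profiles and you can group the ones whose names and locations line up, which is exactly the cross-checking a human does by eye.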

Google Profiles would be a great home for that sort of information, but they require users to fill out yet another set of forms. Sites like 123people.com try to automate the whole process, but frankly they don't do a good job and are packed with off-putting ads.

It's the spread of services like Gist, Xobni and Rapportive that gives me hope that change is on the horizon. Data flows into them from either their own customer base or providers like Rapleaf, and they're starting to build unified pictures of people online. Just like a phone book in the old days, you should be able to enter someone's name and get whatever information they've chosen to publish about themselves.

Sheep, sex and Nazis

Photo by Gisela Giardino

Maybe it's growing up with British tabloid headlines, but I wish Sam Apple had chosen "Sheep, sex and Nazis" as a title instead of "Schlepping through the Alps". The phrase is his own description of the world he chronicles, and is closer to the spirit of the book than the actual title which had me expecting a light-hearted travelogue in the vein of "Round Ireland with a Fridge".

Sam's the editor of The Faster Times and he got in touch to offer his sympathies after my run-in with Facebook. Since anybody who likes my blog obviously has excellent judgment, I googled him and was intrigued to see he'd written about his experiences as a young American Jew following a wandering Yiddish-singing shepherd around Austria.

I started reading unsure what to expect, but what struck me was the honesty of his descriptions. That quality sounds unremarkable, but it's amazingly hard to achieve because people are so contradictory. Whenever you're telling a story, rough edges get rounded off to make it flow, even if it's just by omitting certain details. He manages to capture an intimate portrait of a few people he got close to, and through them of a whole country with a lingering dark past, without simplifying details to make the answers to their dilemmas seem easier than they truly are.

He flies to Austria to learn more about Hans Breuer, the last shepherd to wander through the Austrian Alps with his flock. Hans' father is a non-practicing Jew, and Hans himself has become obsessed with the rich Yiddish culture from pre-war Europe, taking it on himself to memorize and perform the old songs wherever he can. Sam's family came to the US from the old world before the Nazis came to power, and he's aghast at Austria's post-war response to the Holocaust. The heart of the book is his attempt to pin down that collective failure by understanding individuals, and his honesty forces him to acknowledge the less noble sides of his own quest, Hans' faults and even the human side of the single open anti-semite he tracks down. These nuances mean you can't help but see parts of yourself in all the characters, and realize that people much like us committed and covered up horrors that are hard to imagine.

If you've ever enjoyed Orwell, I recommend you pick up Schlepping. As he put it, "To see what is in front of one's nose needs a constant struggle", and Sam's obvious struggle to be true to the reality of his subjects brings the book into the same league as "The Road to Wigan Pier" in its insights on a crucial topic.

How to download a list of your Facebook fans

Photo by Tochis

A friend recently pointed me to this New York Times story on web coupons and privacy. Aside from the implications of tying in your social accounts to your buying habits, what struck me was Jonathan Treiber's assertion that "when someone joins a fan club, the user’s Facebook ID becomes visible to the merchandiser".

One of the biggest complaints I've heard from companies running Facebook fan pages is that they don't know who their fans are. In traditional direct marketing they have a list of customer names and their postal or email addresses, but with fan pages they only get permission to contact their fans indirectly. Everything has to go through Facebook, and at any point the network could decide to cut them off or charge them to talk to their own customers. Presumably that control has a lot of potential value to Facebook; they don't offer an official API for page owners to get a list of their fans. Jonathan's quote got me wondering if there was another way.

Googling around, I ran across this post from Gist's very own Adam Loving explaining how he'd managed to download all the fans for his page. The process is a bit technical, and may well violate the ToS, so use it at your own discretion, but it sounds like it's already in widespread use by page owners.

The story itself makes me think that Beacon 2.0 is likely to be run by third-party companies on top of Facebook and other social network data. There's now widespread access to public profiles across all the networks; it's going to be hard to stuff that genie back into the bottle.

Is making public data more accessible a threatening act?

Photo by Altemark

One of the most interesting questions to come out of the Facebook debate was about making public data more easily accessible. Everything I was looking at releasing was available through a Google search and through many other commercial companies, so in a simplistic view it was already completely public and releasing it in a convenient form made no difference. However, that doesn't match our intuitive reactions: we're a lot more relaxed when data is theoretically available to anyone but hard to get to than when there's an easy way to access it.

One of my favorite researchers in this area, Arvind Narayanan, recently started a series of articles that try to turn this gut reaction into a usable model. I also spent a very productive lunch with Jud Valeski, Josh Fraser and Jon Fox hashing out the implications of the coming wave of accessibility, so here are a few highlights from that discussion.

Prop 8. Information about donors to political campaigns has always been public, but traditionally required a visit to city hall to dig through piles of paper. Suddenly the donors behind Prop 8 in California found themselves listed on a map anyone could access on the internet. While predictions of violence or boycotts didn't materialize, Scott Eckern ended up resigning from his job once his donation became widely known. I'm pretty certain he wasn't aware that his donation would become public knowledge; it's a clear case where the distribution channel made the information much more powerful.

InfoUSA. Imagine a thought experiment where I downloaded the income, charitable donations, pets and military service information for all 89,000 Boulder residents listed in InfoUSA's marketing database, and put that information up in a public web page. That's obviously pretty freaky, but absolutely anyone with $7,000 to spare can grab exactly the same information! That intuitive reaction is very hard to model. Is it because at the moment someone has to make more of an effort to get that information? Do we actually prefer that our information is for sale, rather than free? Or are we just comfortable with a 'privacy through obscurity' regime?

So what's my conclusion? On the one hand, the web has created so many amazing innovations because it's a fantastic way to make information more available, and initial privacy concerns have faded into the background as people have become used to the services. On the other, the jury's still out on how the revolution will end. Is everyone really going to be their own public broadcaster on Twitter, or are we going to retreat into more private forums in the wake of future freakouts? I don't know the answer, but everyone working in this area needs to be thinking about more than the technical aspects of data accessibility.

Why your search results are getting worse, and what you can do about it

Photo by Larry & Flo

Have you noticed more useless or deceptive links showing up in your Google search results? I have, and it looks like I'm not alone. More and more often when I'm doing a technical query, I click on pages that, from the summary, look like they might have the answer, but they turn out to be bait-and-switch: usually a small amount of text captured from somewhere else, surrounded by a mass of related ads.

Why is this happening? The short answer is that publishers are getting better at tricking Google into ranking low-quality articles highly. There's always been a battle between content producers who want their content to appear in search results and search engines trying to send their users to relevant articles. In the '90s it was enough to repeat popular keywords hundreds of times at the bottom of the page, but Google killed such simple approaches using a combination of PageRank and algorithms to assess the quality of the content.

Unfortunately, truly evaluating the quality of a text article is an AI-complete problem. Instead, Google has relied on statistical tests to spot repetition, copied content and obvious nonsense, but publishers have figured out the limits of that approach and are busy churning out cheap, low-quality content that squeaks through the tests.
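As an illustration of the kind of statistical test involved (not Google's actual algorithm, which isn't public), here's a toy PHP sketch that breaks two documents into overlapping word 'shingles' and measures how much the sets overlap. Near-copies score high even when a scraper shuffles the sentences around:

<?php
// Illustration only: one crude statistical test for copied content is to break
// each document into overlapping word 'shingles' and measure the overlap
// between the two sets (Jaccard similarity).
function shingles($text, $size = 4) {
    $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $result = array();
    for ($i = 0; $i + $size <= count($words); $i++) {
        $result[implode(' ', array_slice($words, $i, $size))] = true;
    }
    return $result;
}

function jaccard_similarity($textA, $textB) {
    $shinglesA = shingles($textA);
    $shinglesB = shingles($textB);
    $intersection = count(array_intersect_key($shinglesA, $shinglesB));
    $union = count($shinglesA + $shinglesB); // '+' unions the arrays by key
    return ($union > 0) ? ($intersection / $union) : 0.0;
}

$original = 'How to move from Houston to Canada in ten easy steps with your pets';
$suspect  = 'Ten easy steps on how to move from Houston to Canada with your pets';
printf("Similarity: %.2f\n", jaccard_similarity($original, $suspect));

The publishers' trick is producing content that's just original enough to slip past tests like this, while still being cheap enough to churn out by the thousand.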

Demand Media takes the Mechanical Turk route and pays people a tiny amount of money to create articles based on popular searches on sites like eHow. As you might expect, the articles tend to be pretty shallow and I doubt they help many people, but they at least pay lip service to creating original content.

Mahalo began as a reputable startup using professional editors to create good answers to common search queries. Over the last couple of years they've apparently switched to outsourcing instead, and recently asked for 17 volunteer interns. Most recently, they've been credibly accused of scraping content created by other people, without permission, to blatantly game Google's rankings. Any content that's a blatant copy of other material should be spotted and down-ranked, but for some reason it isn't.

So what can you do to improve your search results? You can report sites you consider spam to Google, but even if they agree it may be a while before they're removed from the listings. As a short-term measure you can add the following operators to your searches to remove, for example, eHow:

-site:ehow.com -site:ehow.co.uk how to move from houston to canada

In the end the only thing that can really remove low-quality content from search results is having some human judgment in the ranking process. Google are pushing hard to use social information to build results based around what they know of your friends, and I expect that they will expand this to use information about the sites you and your friends actually visit. It's much more likely that I'm interested in news.ycombinator.com discussions than TechCrunch, because I frequently visit HN. In an ideal world links from there should show high in my search results, even though they might not for other people. Until that beautiful day dawns, I'll just pull down 100 results a page and let my eyeballs do the sorting.

How to convert from BLS to FIPS county codes

Photo by J Giacomoni

Don't be deceived by the whole Facebook brouhaha, that was an anomaly. I normally lead an extremely boring life, and quite like it that way. If you're a new reader, expect less drama and more code.

To illustrate that point, I just spent two hours of my life wrestling with conflicting federal standards for identifying US counties. The BLS has statistics on local unemployment for the last 20 years, and I want to visualize them. The Census provides some nice outlines of all the counties, which allows me to plot them on a map. So far so good. There's even a federal standard for assigning codes to each county, called FIPS, which the Census data uses and which, from a quick inspection, I thought the BLS unemployment data used too.

That would be far too easy.

Instead, the BLS mostly uses FIPS, except when they don't. For example, they identify my old home of Ventura County, CA as 06122, whereas the correct FIPS code is 06111. The BLS aren't using 06111 for anything else; they've just decided they'd like to express their creativity by randomly shuffling the IDs around. They list the codes they're using here, but don't include a translation to FIPS. To handle the conversion, I had to write a script to read in the two files and try to match the names given to all the counties.

Doesn't sound too hard, right?

For starters, the BLS lists the areas as 'Ventura County, CA', but FIPS has 'Ventura, CA'. That's easy to fix, but then there's 'DeKalb, GA' versus 'De Kalb, GA', 'Miami-Dade, FL' versus 'Dade, FL', ad nauseam. Each inconsistency requires some fixup code, so the clock ticks by. Finally, I ironed out enough wrinkles to produce a decent-looking translation table. To save anyone else from the same sort of trouble, here's the result: blatofips.csv, and here's the script that produced it: createblatofips.php
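If you're curious about the general approach without digging through the real script, here's a stripped-down sketch of the idea: normalize both sets of names, patch the known aliases, then join on the cleaned-up strings. The file names and two-column CSV layout below are simplifications made up for the example; the actual BLS and Census files need more cleanup than this shows:

<?php
// Sketch of the matching approach, not the actual createblatofips.php script.
// Assumes two simplified CSV files (hypothetical names and layout):
// bls_areas.csv with lines like "06122,Ventura County, CA" and
// fips_counties.csv with lines like "06111,Ventura, CA".
function normalize_county($name) {
    $name = strtolower(trim($name));
    $name = str_replace(' county,', ',', $name);   // 'Ventura County, CA' -> 'ventura, ca'
    $name = str_replace('de kalb', 'dekalb', $name);
    $name = str_replace('miami-dade', 'dade', $name);
    return preg_replace('/\s+/', ' ', $name);
}

function load_table($filename) {
    $table = array();
    foreach (file($filename, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        $parts = explode(',', $line, 2);           // code, then everything else as the name
        if (count($parts) < 2) {
            continue;
        }
        $table[normalize_county($parts[1])] = $parts[0];
    }
    return $table;
}

$bls  = load_table('bls_areas.csv');
$fips = load_table('fips_counties.csv');

foreach ($bls as $name => $blsCode) {
    if (isset($fips[$name])) {
        echo "$blsCode,{$fips[$name]}\n";          // BLS code -> FIPS code
    } else {
        fwrite(STDERR, "No FIPS match for '$name'\n");
    }
}

Every run turns up a few more names in the 'no match' list, and each one becomes another line in normalize_county() until the list is empty.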

This wasn't exactly the programming you see on TV, all enormous fonts, spinning 3D models and enhancing photos beyond belief. It's the part I love though, taking on a big ugly problem that's sitting between me and somewhere I want to go. I get to lose myself in a world of simple rules for a couple of hours, solve some puzzles, and emerge having made some tangible progress towards my goal.

So be warned there will be serious geekery ahead, but I hope you'll stay with me as I share my fascination with building these castles in the sky.

How I got sued by Facebook

Photo by Afsart

There's information about my Facebook data set scattered around multiple news articles, as well as posts in this blog, but here's the full story of how it all came down.

I'm a software engineer; my last job was at Apple, but for the last two years I've been working on my own startup called Mailana. The name comes from 'Mail Analysis', and my goal has been to use the data sitting around in all our inboxes to help us in our day-to-day lives. I spent the first year trying (and failing) to get a toe-hold in the enterprise market. Last year I moved to Boulder to go through the Techstars startup program, where I met Antony Brydon, the former CEO of Visible Path. He described the immense difficulties they'd faced with the enterprise market, which persuaded me to re-focus on the consumer side.

I'd already applied the same technology to Twitter to produce graphs showing who people talked to, and how their friends were clustered into groups. I set out to build that into a fully-fledged service, analyzing people's Twitter, Facebook and webmail communications to understand and help maintain their social networks. It offered features like identifying your inner circle so you could read a stream of just their updates, reminding you when you were falling out of touch with people you'd previously talked to a lot, and giving you information about people you'd just met.

It was the last feature that led me to crawl Facebook. When I meet someone for the first time, I'll often Google their name to find their Twitter and LinkedIn accounts, and maybe Facebook too if it's a social contact rather than business. I wanted to automate that Googling process, so for every new person I started communicating with, I could easily follow or friend them on LinkedIn, Twitter and Facebook. My first thought was to use one of the search engine APIs, but I quickly discovered that they only offer very limited results compared to their web interfaces.

I scratched my head a bit and thought "well, how hard can it be to build my own search engine?". As it turned out, it was very easy. Checking Facebook's robots.txt, they welcome the web crawlers that search engines use to gather their data, so I wrote my own in PHP (very similar to this Google Profile crawler I open-sourced) and left it running for about 6 months. Initially all I wanted to gather was people's names and locations so I could search on those to find public profiles. Talking to a few other startups, I found they needed the same sort of service, so I started looking into either exposing a search API or sharing that sort of 'phone book for the internet' information with them.
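For anyone wondering what "checking robots.txt" means in practice, here's a toy sketch of the polite-crawler idea, not my actual crawler (that's the open-sourced one linked above): download robots.txt once, collect the Disallow rules for the wildcard user-agent, and skip any path they cover. A real crawler also needs per-agent rules, crawl delays, rate limiting and a proper fetch queue:

<?php
// Toy sketch of the polite-crawler idea: fetch robots.txt once, collect the
// Disallow rules that apply to all user-agents ('*'), and skip any path they
// cover.
function disallowed_prefixes($host) {
    $robots = @file_get_contents("http://$host/robots.txt");
    if ($robots === false) {
        return array();
    }
    $prefixes = array();
    $appliesToUs = false;
    foreach (explode("\n", $robots) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // strip comments
        if (stripos($line, 'User-agent:') === 0) {
            $appliesToUs = (trim(substr($line, 11)) === '*');
        } elseif ($appliesToUs && stripos($line, 'Disallow:') === 0) {
            $path = trim(substr($line, 9));
            if ($path !== '') {
                $prefixes[] = $path;
            }
        }
    }
    return $prefixes;
}

function can_crawl($path, $prefixes) {
    foreach ($prefixes as $prefix) {
        if (strpos($path, $prefix) === 0) {
            return false;   // path falls under a Disallow rule
        }
    }
    return true;
}

$prefixes = disallowed_prefixes('www.example.com');
var_dump(can_crawl('/some/public/page.html', $prefixes));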

I noticed Facebook were offering some other interesting information too, like which pages people were fans of and links to a few of their friends. I was curious what sort of patterns would emerge if I analyzed these relationships, so as a side project I set up fanpageanalytics.com to allow people to explore the data. I was getting more people asking about the data I was using, so before that went live I emailed Dave Morin at Facebook to give him a heads-up and check it was all kosher. We'd chatted a little previously, but I didn't get a reply, and he left the company a month later so my email probably got lost in the chaos.

I had commercial hopes for fanpageanalytics; I felt like there was demand for a compete.com for Facebook pages, but I was also just fascinated by how much the data could tell us about ourselves. Out of pure curiosity I created an interactive map showing how different countries, US states and cities were connected to each other and released it. Crickets chirped, tumbleweed blew past and nobody even replied to or retweeted my announcement. Only 5 or 6 people a day were visiting the site.

That weekend I was avoiding my real work but stuck for ideas on a blog post, and I'd been meaning to check out how good the online Photoshop competitors were. I'd also been chatting to Eric Kirby, a local marketing wizard, who had been explaining how effective catchy labels like 'soccer moms' were for communicating complex polling data. With that in mind, I took a screenshot of my city analysis, grabbed SumoPaint and started sketching in the patterns I'd noticed. After drawing those in, I spent a few more minutes coming up with silly names for the different areas and wrote up some commentary on them. I was a bit embarrassed by the shallowness of my analysis, and I was keen to see what professional researchers could do with the same information, so I added a postscript offering them an anonymized version of my source data. Once the post was done, I submitted it to news.ycombinator.com as I often do, then went back to coding and forgot about it.

On Sunday around 25,000 people read the article, via YCombinator and Reddit. After that a whole bunch of mainstream news sites picked it up, and over 150,000 people visited it on Monday. On Tuesday I was hanging out with my friends at Gnip trying to make sense of it all when my cell phone rang. It was Facebook's attorney.

He was with the head of their security team, who I knew slightly because I'd reported several security holes to Facebook over the years. The attorney said that they were just about to sue me into oblivion, but in light of my previous good relationship with their security team, they'd give me one chance to stop the process. They asked for and received a verbal assurance from me that I wouldn't publish the data, and sent me a letter to sign confirming that. Their contention was that robots.txt had no legal force and that they could sue anyone for accessing their site even if they scrupulously obeyed the instructions it contained. The only legal way to access any web site with a crawler was to obtain prior written permission.

Obviously this isn't the way the web has worked for the last 16 years since robots.txt was introduced, but my lawyer advised me that it had never been tested in court, and the legal costs alone of being a test case would bankrupt me. With that in mind, I spent the next few weeks negotiating a final agreement with their attorney. They were quite accommodating on the details, such as allowing my blog post to remain up, and initially I was hopeful that they were interested in a supervised release of the data set with privacy safeguards. Unfortunately it became clear towards the end that they wanted the whole set destroyed. That meant I had to persuade the other startups I'd shared samples with to remove their copies, but finally in mid-March I was able to sign the final agreement.

I'm just glad that the whole process is over. I'm bummed that Facebook are taking a legal position that would cripple the web if it was adopted (how many people would Google need to hire to write letters to every single website they crawled?), and a bit frustrated that people don't understand that the data I was planning to release is already in the hands of lots of commercial marketing firms, but mostly I'm just looking forward to leaving the massive distraction of a legal threat behind and getting on with building my startup. I really appreciate everyone's support; stay tuned for my next project!