Is making public data more accessible a threatening act?

Megaphone
Photo by Altemark

One of the most interesting questions to come out of the Facebook debate was about making public data more easily accessible. Everything I was looking at releasing was available through a Google search and through many other commercial companies, so in a simplistic view it was already completely public, and releasing it in a convenient form made no difference. However, that doesn't match our intuitive reactions: we are a lot more relaxed when data is theoretically available to anyone but hard to get to than when there's an easy way to access it.

One of my favorite researchers in this area, Arvind Narayanan, recently started a series of articles that try to turn this gut reaction into a usable model. I also spent a very productive lunch with Jud Valeski, Josh Fraser and Jon Fox hashing out the implications of the coming wave of accessibility, so here are a few highlights from that discussion.

Prop 8. Information about donors to political campaigns has always been public, but traditionally required a visit to city hall to dig through piles of paper. Suddenly the donors behind Prop 8 in California found themselves listed on a map anyone could access on the internet. While predictions of violence or boycotts didn't materialize, Scott Eckern ended up resigning from his job once his donation became widely known. I'm pretty certain he wasn't aware that his donation would be public knowledge; it's a clear case where the distribution channel made the information much more powerful.

InfoUSA. Imagine a thought experiment where I downloaded the income, charitable donations, pets and military service information for all 89,000 Boulder residents listed in InfoUSA's marketing database, and put that information up in a public web page. That's obviously pretty freaky, but absolutely anyone with $7,000 to spare can grab exactly the same information! That intuitive reaction is very hard to model. Is it because at the moment someone has to make more of an effort to get that information? Do we actually prefer that our information is for sale, rather than free? Or are we just comfortable with a 'privacy through obscurity' regime?

So what's my conclusion? On the one hand, the web has created so many amazing innovations because it's a fantastic way to make information more available, and initial privacy concerns have faded into the background as people become more used to services. On the other, the jury's still out on how the revolution will end. Is everyone really going to be their own public broadcaster on Twitter, or are we going to retreat into more private forums in the wake of future freakouts? I don't know the answer, but everyone working in this area needs to be thinking about more than the technical aspects of data accessibility.

Why your search results are getting worse, and what you can do about it

Spamposter
Photo by Larry & Flo

Have you noticed more useless or deceptive links showing up in your Google search results? I have, and it looks like I'm not alone. More and more often when I'm doing a technical query, I click on pages whose summaries look like they might have the answer, but they turn out to be bait-and-switch. They're usually a small amount of text captured from somewhere else, surrounded by a mass of related ads.

Why is this happening? The short answer is that publishers are getting better at tricking Google into ranking low-quality articles highly. There's always been a battle between content producers who want their content to appear in search results, and search engines trying to send their users to relevant articles. In the '90s it was enough to repeat popular keywords hundreds of times at the bottom of the page, but Google killed such simple approaches using a combination of PageRank and algorithms to assess the quality of the content.

Unfortunately, truly evaluating the quality of a text article is an AI-complete problem. Instead Google has relied on statistical tests to spot repetitions, copied content and obvious nonsense, but publishers have figured out the limits of that approach, and are busy churning out cheap low-quality content that squeaks through the tests.

Demand Media takes the Mechanical Turk route and pays people a tiny amount of money to create articles based on popular searches on sites like eHow. As you might expect, the articles tend to be pretty shallow and I doubt they help many people, but they at least pay lip service to creating original content.

Mahalo began as a reputable startup using professional editors to create good answers to common search queries. Over the last couple of years they've apparently switched to outsourcing instead, most recently asking for 17 volunteer interns. Now they've been credibly accused of scraping content created by other people without permission to blatantly game Google's rankings. Any content that's a blatant copy of other material should be spotted and down-ranked, but for some reason it isn't.

So what can you do to improve your search results? You can report sites you consider spam to Google, but even if they agree it may be a while before they're removed from the listings. As a short-term measure you can add the following operators to your searches to exclude eHow, for example:

-site:ehow.com -site:ehow.co.uk how to move from houston to canada

In the end the only thing that can really remove low-quality content from search results is having some human judgment in the ranking process. Google are pushing hard to use social information to build results based around what they know of your friends, and I expect that they will expand this to use information about the sites you and your friends actually visit. It's much more likely that I'm interested in news.ycombinator.com discussions than TechCrunch, because I frequently visit HN. In an ideal world links from there would rank highly in my search results, even though they might not for other people. Until that beautiful day dawns, I'll just pull down 100 results a page and let my eyeballs do the sorting.

How to convert from BLS to FIPS county codes

Meatgrinder
Photo by J Giacomoni

Don't be deceived by the whole Facebook brouhaha, that was an anomaly. I normally lead an extremely boring life, and quite like it that way. If you're a new reader, expect less drama and more code.

To illustrate that point, I just spent two hours of my life wrestling with conflicting federal standards for identifying US counties. The BLS has statistics on local unemployment for the last 20 years, and I want to visualize them. The Census provides some nice outlines of all the counties, which allows me to plot them on a map. So far so good. There's even a federal standard for assigning codes to each county called FIPS, which the Census data uses, and I thought the BLS unemployment data also used from a quick inspection.

That would be far too easy.

Instead, the BLS mostly uses FIPS, except when they don't. For example, they identify my old home of Ventura County, CA as 06122, whereas the correct FIPS code is 06111. The BLS aren't using 06111 for anything else; they've just decided they'd like to express their creativity by randomly shuffling the IDs around. They list the codes they're using here, but don't include a translation to FIPS. To handle the conversion, I had to write a script to read in the two files and try to match the names given to all the counties.

Doesn't sound too hard, right?

For starters, the BLS lists the areas as 'Ventura County, CA', but FIPS is 'Ventura, CA'. That's easy to fix, but then there's 'DeKalb, GA' versus 'De Kalb, GA', 'Miami-Dade, FL' versus 'Dade, FL', ad nauseam. Each inconsistency requires some fixup code, so the clock ticks by. Finally, I ironed out enough wrinkles to produce a decent-looking translation table. To save anyone else from the same sort of trouble, here's the result: blatofips.csv, and here's the script that produced it: createblatofips.php
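My script was PHP, but the matching idea is simple enough to sketch in a few lines of Python. The normalization rules and alias table below are illustrative stand-ins for the many fixups the real script needed:

```python
# Sketch of matching BLS area names to FIPS county codes.
# The cleanup rules here are examples; the real data needs many more.

def normalize(name):
    """Reduce a county name to a canonical form for matching."""
    name = name.lower()
    # BLS writes 'Ventura County, CA' where FIPS has 'Ventura, CA'
    name = name.replace(" county,", ",")
    # 'De Kalb' vs 'DeKalb': drop spaces and hyphens entirely
    name = name.replace(" ", "").replace("-", "")
    return name

# Hand-maintained aliases for names no mechanical rule can reconcile
ALIASES = {"dade,fl": "miamidade,fl"}

def build_fips_index(fips_entries):
    """fips_entries: (name, code) pairs from the Census FIPS list."""
    return {normalize(name): code for name, code in fips_entries}

def bls_to_fips(bls_name, fips_index):
    key = normalize(bls_name)
    key = ALIASES.get(key, key)
    # None means another fixup rule is still needed
    return fips_index.get(key)

index = build_fips_index([("Ventura, CA", "06111"),
                          ("Miami-Dade, FL", "12086")])
print(bls_to_fips("Ventura County, CA", index))  # 06111
print(bls_to_fips("Dade County, FL", index))     # 12086
```

The real work is growing that alias table: run the matcher, look at the names that come back as None, add a rule or an alias, and repeat until everything resolves.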

This wasn't exactly the programming you see on TV, all enormous fonts, spinning 3D models and enhancing photos beyond belief. It's the part I love though, taking on a big ugly problem that's sitting between me and somewhere I want to go. I get to lose myself in a world of simple rules for a couple of hours, solve some puzzles, and emerge having made some tangible progress towards my goal.

So be warned there will be serious geekery ahead, but I hope you'll stay with me as I share my fascination with building these castles in the sky.

How I got sued by Facebook

Gavel
Photo by Afsart

There's information about my Facebook data set scattered around multiple news articles, as well as posts in this blog, but here's the full story of how it all came down.

I'm a software engineer, my last job was at Apple but for the last two years I've been working on my own startup called Mailana. The name comes from 'Mail Analysis', and my goal has been to use the data sitting around in all our inboxes to help us in our day-to-day lives. I spent the first year trying (and failing) to get a toe-hold in the enterprise market. Last year I moved to Boulder to go through the Techstars startup program, where I met Antony Brydon, the former CEO of Visible Path. He described the immense difficulties they'd faced with the enterprise market, which persuaded me to re-focus on the consumer side.

I'd already applied the same technology to Twitter to produce graphs showing who people talked to, and how their friends were clustered into groups. I set out to build that into a fully-fledged service, analyzing people's Twitter, Facebook and webmail communications to understand and help maintain their social networks. It offered features like identifying your inner circle so you could read a stream of just their updates, reminding you when you were falling out of touch with people you'd previously talked to a lot, and giving you information about people you'd just met.

It was the last feature that led me to crawl Facebook. When I meet someone for the first time, I'll often Google their name to find their Twitter and LinkedIn accounts, and maybe Facebook too if it's a social contact rather than business. I wanted to automate that Googling process, so for every new person I started communicating with, I could easily follow or friend them on LinkedIn, Twitter and Facebook. My first thought was to use one of the search engine APIs, but I quickly discovered that they only offer very limited results compared to their web interfaces.

I scratched my head a bit and thought "well, how hard can it be to build my own search engine?". As it turned out, it was very easy. Facebook's robots.txt welcomes the web crawlers that search engines use to gather their data, so I wrote my own in PHP (very similar to this Google Profile crawler I open-sourced) and left it running for about 6 months. Initially all I wanted to gather was people's names and locations so I could search on those to find public profiles. Talking to a few other startups, I found they also needed the same sort of service, so I started looking into either exposing a search API or sharing that sort of 'phone book for the internet' information with them.
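Checking what a site's robots.txt permits is easy to automate; my crawler was PHP, but here's a minimal sketch using Python's standard library (the rules and URLs are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# A toy robots.txt: profile pages are open to all crawlers, /private/ isn't
ROBOTS_TXT = """\
User-agent: *
Allow: /profile/
Disallow: /private/
"""

def can_crawl(robots_lines, user_agent, page_url):
    """Return True if the given robots.txt rules permit fetching page_url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, page_url)

rules = ROBOTS_TXT.splitlines()
print(can_crawl(rules, "MyCrawler", "http://example.com/profile/alice"))  # True
print(can_crawl(rules, "MyCrawler", "http://example.com/private/stuff"))  # False
```

A polite crawler runs a check like this before every fetch, and also rate-limits its requests so it doesn't hammer the site.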

I noticed Facebook were offering some other interesting information too, like which pages people were fans of and links to a few of their friends. I was curious what sort of patterns would emerge if I analyzed these relationships, so as a side project I set up fanpageanalytics.com to allow people to explore the data. I was getting more people asking about the data I was using, so before that went live I emailed Dave Morin at Facebook to give him a heads-up and check it was all kosher. We'd chatted a little previously, but I didn't get a reply, and he left the company a month later so my email probably got lost in the chaos.

I had commercial hopes for fanpageanalytics, I felt like there was demand for a compete.com for Facebook pages, but I was also just fascinated by how much the data could tell us about ourselves. Out of pure curiosity I created an interactive map showing how different countries, US states and cities were connected to each other and released it. Crickets chirped, tumbleweed blew past and nobody even replied to or retweeted my announcement. Only 5 or 6 people a day were visiting the site.

That weekend I was avoiding my real work but stuck for ideas on a blog post, and I'd been meaning to check out how good the online Photoshop competitors were. I'd also been chatting to Eric Kirby, a local marketing wizard, who had been explaining how effective catchy labels like 'soccer moms' are for communicating complex polling data. With that in mind, I took a screenshot of my city analysis, grabbed SumoPaint and started sketching in the patterns I'd noticed. After drawing those in, I spent a few more minutes coming up with silly names for the different areas and wrote up some commentary on them. I was a bit embarrassed by the shallowness of my analysis, and I was keen to see what professional researchers could do with the same information, so I added a postscript offering them an anonymized version of my source data. Once the post was done, I submitted it to news.ycombinator.com as I often do, then went back to coding and forgot about it.

On Sunday around 25,000 people read the article, via YCombinator and Reddit. After that a whole bunch of mainstream news sites picked it up, and over 150,000 people visited it on Monday. On Tuesday I was hanging out with my friends at Gnip trying to make sense of it all when my cell phone rang. It was Facebook's attorney.

He was with the head of their security team, who I knew slightly because I'd reported several security holes to Facebook over the years. The attorney said that they were just about to sue me into oblivion, but in light of my previous good relationship with their security team, they'd give me one chance to stop the process. They asked for and received a verbal assurance from me that I wouldn't publish the data, and sent me a letter to sign confirming that. Their contention was that robots.txt had no legal force and they could sue anyone for accessing their site even if they scrupulously obeyed the instructions it contained. The only legal way to access any web site with a crawler was to obtain prior written permission.

Obviously this isn't the way the web has worked for the last 16 years since robots.txt was introduced, but my lawyer advised me that it had never been tested in court, and the legal costs alone of being a test case would bankrupt me. With that in mind, I spent the next few weeks negotiating a final agreement with their attorney. They were quite accommodating on the details, such as allowing my blog post to remain up, and initially I was hopeful that they were interested in a supervised release of the data set with privacy safeguards. Unfortunately it became clear towards the end that they wanted the whole set destroyed. That meant I had to persuade the other startups I'd shared samples with to remove their copies, but finally in mid-March I was able to sign the final agreement.

I'm just glad that the whole process is over. I'm bummed that Facebook are taking a legal position that would cripple the web if it was adopted (how many people would Google need to hire to write letters to every single website they crawled?), and a bit frustrated
that people don't understand that the data I was planning to release is already in the hands of lots of commercial marketing firms, but mostly I'm just looking forward to leaving the massive distraction of a legal threat behind and getting on with building my startup. I really appreciate everyone's support, stay tuned for my next project!

The unknown marketing databases that know everything about you

Ispy
Photo by Jovike

I'm amazed at how much information is available in marketing databases. InfoUSA will sell anyone data on 210 million consumers; that's pretty much every adult in the country. What kind of information? Name, address, age, gender, occupation, income, mail order history, charitable donations, pets, whether there's a grandparent in their household and even whether they've served in the military!

Screenshot of InfoUSA's list of available data fields
What really interests me is that this has been going on for decades, with no apparent public concern. I wonder if part of it is because people don't realize how much is available to marketers? Or do they not care about the privacy of this information? Either way, it's something to think about as we try to figure out the rules for similar information on the internet.

Class, and why I left Britain

Toffsandtoughs
Licensed from Getty Images

I recently read a meditation on this iconic photo, and it got me thinking about how the British class system affected my life.

For most of my childhood, I wasn't aware of class at all. My mum was a nurse on night shifts, and my dad worked in a chemical factory but, by getting a degree through night school while we were growing up, rose to a position as a trained chemist. Looking back I'm amazed at how my parents raised three kids on their income, and while I noticed some of my school friends had bigger houses or their parents had two cars, I never felt any gap between us because of that.

The change came when I was 16 and started at Hills Road Sixth Form College. The British system gives kids a choice of where to do the last two years of high school, and Hills Road boasted amazing exam scores. I was incredibly geeky even then, so I leapt at the chance for more challenging lessons. By the end of the two years, I'd come to dread the place.

I started in 'Double maths', which was an advanced course that squeezed two A-levels' worth of material into the normal time allotted for one. Looking around the class of 20, it soon became apparent that 15 of them already knew each other, and had already been taught a lot of the material. I soon realized that they'd come from the posh Perse private school, and that Hills Road was attractive because spending their last two years there let them qualify as state school pupils for the admission quotas for Oxford and Cambridge universities.

Not only were they blowing me out of the water academically, they were bursting with self-confidence and always had a put-down to hand. I was already going through a stormy adolescence, and spending time in their presence left me feeling awful, like the 'oik' they considered me. I dropped out of double maths after a couple of terms, but never felt like I truly fitted in, in any of my classes.

It left me resentful, and I started to notice class a lot more. A neighbor who'd attended a private school got a well-paying job 'in the City' at 18, despite terrible exam results. From my perspective stacking shelves at Tesco supermarket, that didn't seem right. After I got my degree and started working in the games industry, I noticed how even at such high-tech companies management often came from posh backgrounds. Managing seemed more like a class privilege than something you earned through experience. I looked around at people on TV, politicians and journalists, and noticed how many of them seemed to come from the same class.

I ended up working closely with a few private school kids and got to know them pretty well. They seemed twisted by the system too, forced to show an outward confidence that was fragile, and left with a fear that their achievements might be more to do with connections than ability.

It affected me too; I had a chip on my shoulder, and I didn't like the sort of person I was becoming. I didn't want to spend the rest of my life either nursing a grudge or climbing the social ladder. I was learning to be happy with who I was, and I just wanted to go through life with people accepting me for my abilities, not judging me by my accent.

In the summer between finishing Hills Road and starting university, I spent three months in a tree-house in Alaska. That's a whole different story, but what stuck with me was how people judged me. They were a lot more interested in what I could do than who my parents were. It was like a breath of fresh air being asked to honestly prove my abilities, and the memory of that summer stuck with me as I toiled back in the UK. I pushed myself hard to get the sort of specialized experience that was in demand in America.

Finally, after 5 years of learning everything I could about game console graphics, I took a job in the US. I told myself I'd just try it for 6 months, but pretty quickly I knew I couldn't go back; there were opportunities I could never dream of at home. I'm not naive, there are plenty of 'old boys' networks' over here (the Greek system springs to mind), but they're fragmented and in competition with each other. There's nothing as dominant or closed as the British system, so they all have to be open to newcomers or they quickly lose relevance.

I didn't leave Britain because I hated toffs, but because I hated dealing with the issue at all. Both alternatives, grudge-holding or social-climbing, meant burning massive amounts of energy on something completely unproductive, and I wouldn't like the person either would make me become. I love America because it's given me the freedom to get things done without wasting my time on class.

All the cool kids are using the Rapleaf API

Graffititree
Photo by Georgios Karamanis

I spend a lot of time worrying about the privacy implications of the new wave of information about people that's becoming available, but I'm also fascinated by the beneficial possibilities. Rapportive and Etacts are great examples of that, using public profile data in innovative ways to solve everyday problems.

What's less well-known is that they're both built on top of the Rapleaf API. Rapleaf has traditionally been focused on B2B applications, and any firm selling personal information to other companies is going to suffer from an 'ick' factor, but the new startups demonstrate that a 'phone book for the internet' can offer some practical benefits to users too.

I sat down with Auren and Dayo from Rapleaf on Friday and had a wide-ranging discussion about this world. They're very careful not to steal any of their partners' thunder by trumpeting the connection, are in the habit of keeping a low profile generally and so probably wouldn't want me to blog about this, but I think their API is a massively under-used resource in the startup world. If you're doing anything with sets of email addresses, you can offer your users much richer views of the people behind those addresses using Rapleaf. It's not perfectly accurate in the connections it finds, but it does a pretty good job and if you need an example of how to implement it, you can find one here in my FindByEmail project.
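If you're curious what that sort of integration looks like, here's a hypothetical sketch in Python. The `lookup_profiles` stub stands in for whatever lookup service you plug in; Rapleaf's actual API has its own endpoints, keys and terms of use, and my FindByEmail project shows a real implementation:

```python
# Hypothetical sketch: enriching an email contact list with public
# profile data. lookup_profiles() is a stand-in for a real lookup
# service, not any actual API.

def lookup_profiles(email):
    """Placeholder: return known public profile URLs for an address."""
    fake_directory = {
        "alice@example.com": {"twitter": "http://twitter.com/alice"},
    }
    return fake_directory.get(email, {})

def enrich_contacts(emails):
    """Attach whatever profiles we can find to each address."""
    return {email: lookup_profiles(email) for email in emails}

contacts = enrich_contacts(["alice@example.com", "bob@example.com"])
print(contacts["alice@example.com"])  # {'twitter': 'http://twitter.com/alice'}
print(contacts["bob@example.com"])    # {} - no match, so degrade gracefully
```

The important design point is the graceful fallback: no lookup service is perfectly accurate, so your UI should work just as well when nothing comes back.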

Just remember, this is personal information about real people you're dealing with, so use it for the forces of good, not evil! And if you want to remove your information entirely from Rapleaf, you can do that here.

Business plans for public data

Dollarsign
Photo by Leo Reynolds

Social networks are making more information about their users public, and the tools to work with massive data sets are getting cheaper. A lot of companies are trying to figure out ways to make money from these two trends, so I wanted to give an overview of some practical revenue streams that either potential customers have asked me about, or that I’ve seen competitors in this space using.

I’ll focus on cataloging what I’ve seen, rather than digging into the ethical debates that some of the applications raise. It’s important to understand what’s possible and happening right now so we can have a meaningful argument about what the rules should be.

Improved search results

I got started working in this area when one of my products needed to match up email contacts with their social network accounts. I wanted to automate the process of Googling a person’s name when you first exchange emails with them, so my first thought was to use one of the existing search engines’ APIs. Unfortunately Google actually blocks most Facebook results from their API, and Bing and Yahoo have very spotty coverage, missing a lot of users. That led me to write my own simple crawler to catalog Facebook profiles myself, just to do those name/location lookups.

I later realized how much other fascinating information was available in those public profiles, but I ran across several smaller search startups willing to pay for just the information matching a name and location to a profile. It’s definitely not a massive market, but there’s money to be made, and since it’s identical to Google’s functionality it doesn’t raise many ethical questions.

Examples: 123people.com, pipl.com

Better targeting for direct email marketing

This is one of the least known but most lucrative uses for public profile data. A company with a large email list will run all the addresses through a lookup service that gives them a list of their customers’ social network accounts. That knowledge can then be used in all sorts of ways to target customers, from sending special Twitter offers only to people you know are on the service, to pulling detailed location information for localized campaigns. I’ve even heard rumors of a Vegas casino that upgrades guests to suites if it spots they have a lot of Twitter followers! If you want to see something like this in action, Flowtown offers a lot of these features thanks to the Rapleaf API.

Any business-to-business use of our personal information is inherently a bit creepy, but direct marketing firms have been doing similar analysis for decades using traditional data sources like magazine subscriber surveys, so this seems fairly uncontroversial.

Examples: Flowtown, Rapleaf

Hedge funds

The most direct link between information and money is in the financial world. For example, if you can detect that a brand is becoming popular before anyone else, you can buy shares in that firm and benefit from the price rise when that success shows up in their profits.

Hedge funds have been using non-traditional metrics for years, doing things like running their own focus groups and opinion polls, but recently there’s been a lot of interest in the flood of information flowing through social networks. Twitter is the most obvious example of a data source, but the audience is both small and heavily skewed towards geeks, making it hard to pull out meaningful information. My feeling is that this will only become really useful once mass-market data is more available. Imagine being able to spot companies where a lot of employees have recently updated their LinkedIn profiles, for an early warning of firms in trouble.

One challenge to this approach is that you need some kind of historical baseline to compare current figures against, to tell if they represent something real or are just noise. That’s a barrier because it means you need to have been collecting the data for some time before it starts to become valuable to hedge funds.
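That baseline comparison is simple statistics once you have the history. Here’s a hedged sketch of the idea; the three-sigma threshold and the toy numbers are arbitrary illustrations, not anything a real fund would use:

```python
import statistics

def is_anomalous(history, current, z_threshold=3.0):
    """Flag a metric as a real signal if it sits far outside its history.

    history: past values of the metric (e.g. daily brand mentions)
    current: the latest value
    """
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # A perfectly flat history: any change at all is notable
        return current != mean
    z_score = (current - mean) / stdev
    return abs(z_score) > z_threshold

# Months of stable chatter, then a spike
baseline = [100, 103, 98, 101, 99, 102, 100, 97]
print(is_anomalous(baseline, 101))  # False: within normal variation
print(is_anomalous(baseline, 180))  # True: worth a closer look
```

The catch is hidden in the `history` argument: you can’t compute that mean and deviation until you’ve been collecting the data for long enough.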

Again this seems to be an extension of existing processes, just slotting in public profiles as a new data source, so it’s hard to see what new ethical ground is being broken.

Examples: YouGov

General marketing intelligence

Marketing managers for big brands constantly have to make decisions about how to allocate their resources and craft their messages, and they need the right information to make good choices. My FanPageAnalytics project was aimed at those people, giving them unique information about who their and their competitors’ fans were, what else they were fans of and where they lived.

There’s definitely money to be made in this area, but brand managers are busy and non-technical, so they require something very targeted to their needs and don’t seek out new solutions. My feeling is that this makes leaders like Radian6 hard to beat even as the technology changes, because they have built relationships with most brand managers that give them a defensible distribution channel.

Examples: Radian6, Scout Labs

Reaching influencers for PR purposes

Public relations people want to persuade influential people to write about their clients. One problem is that they may not know who the influential people are in a given area, or they may know but be unable to reach them effectively.  Ever since I did my Twitter visualization, I’ve been asked about this use case repeatedly. The holy grail is being able to enter a topic, see who the most influential people are *and* who they are influenced by. Very often there are lesser-known specialists who are read by more popular writers for story ideas, and those sources may be an easier route to getting your stories to those mainstream influencers than approaching them directly.

This is one of the few areas where Twitter’s comparatively small user base is not an issue, since most people who broadcast to an audience are using the service as another channel. Using information from other networks to reach them can feel like stalking though, so I expect that the increasing availability of public data will be countered by celebrities locking down their privacy settings.

Examples: Klout

Recruitment targeting

Weak relationships, people you met once at a trade show, are surprisingly effective when it comes to getting a job. Recruiters contribute a massive chunk of LinkedIn’s revenue, and people are largely happy to see their resumes and connections shared for job-hunting purposes. It’s a pretty sweet position for LinkedIn, since it makes them the only customer-facing business that’s able to sell their users’ private data to other companies without fear of a backlash. It’s an area that could be helped by the new flood of public profile data too, especially if you can get some information about people’s connections. I’ve run across two different firms who’ve tapped into their employees’ friends networks on Facebook and Twitter to help fill positions, and I imagine there has to be a lot more innovation coming in this area.

Examples: LinkedIn

How to create a job in Elastic MapReduce

I’m on a crusade to spread the word about the potential of Elastic MapReduce to revolutionize data processing for startups (a 100-machine cluster for $10 an hour!), so I’ve produced a 7 minute screencast showing exactly how to create a new job. I’ve embedded the YouTube version above, or you can find a higher-quality version here.

I’m thinking about rolling out a series of these, taking you all the way from gathering the source data to visualizing the results, so please let me know what you do and don’t like about this version.

How to set a custom screen resolution in OS X

Pixelated
Photo by AMagill

I just lost an hour of my life trying to figure out how to set my MacBook Pro to 1280×720 on the main display, so to save anyone else from banging their head against the desk, here are the steps that finally worked for me:

1 – Download and install SwitchResX

2 – Go to System Preferences and click on the SwitchResX icon

3 – Click on Color LCD on the left side

4 – Choose the Custom Resolutions tab

5 – Click on the plus icon to create a new resolution

6 – Choose Scaled Resolution from the top drop-down menu

7 – Enter the resolution you want in the two boxes below

8 – Click OK

9 – Check to make sure the resolution that now shows up in the list is correct. I've found it will sometimes forget one of the values and set it to zero! If it does that, go back in and re-edit and save it until it does appear correctly.

10 – The Type column should read Scaled, and the Status should be Uninstalled. Now press Command-S to save your changes.

11 – You should now see Needs to Reboot in the Status column. As you may have guessed, this means you need to reboot your machine, so choose Restart from the main system menu.

12 – Once the system has restarted, go to the normal display preferences and you should see your new resolution listed there.

If this does fix your problem, please buy a copy of SwitchResX for 14 Euros to support its development.

Wondering why I needed a custom resolution of 1280×720? I'm working on building some more professional screencasts that are going to be run through a 720p video production pipeline, so I need to capture my whole screen at that resolution. I expected it to be fairly simple to set up, but everywhere I turned I hit baffling UI. It makes no sense that you can't set a custom resolution within Apple's preferences in the first place, and then I spent a lot of time and $20 on DisplayConfigX with no luck, before I figured out how to get SwitchResX working.