Robots and Stalking 2.0

Photo by Peyri

Web crawlers used to be restricted to big companies, purely because the cost was so prohibitive. Now anyone with a few thousand dollars and an Amazon Web Services account can crawl and analyze hundreds of millions of pages. The old honor system, which worked when only Google, Yahoo and Microsoft had access to that data, won't cut it.

My approach has been to avoid dealing with companies that seem spammy or scammy, and to try to work out in the open. I'm no saint; I'd like to make a living from the insights I can gather from public profiles, but I'm also very aware that most people don't know how much they're exposing.

The problem is that information that was made available to help search engines can also be fed into sophisticated analysis pipelines to produce much deeper and potentially more invasive data sets. What does that mean in practice? Just using public profile data, you could use gender-guessing and portrait images to produce HotOrNot 2.0. Getting even creepier, it would be possible to match up interests, locations and even friends in common using that same data to produce a great tool for stalkers and perverts. Intuitively that all seems very wrong, but since it's technically straightforward I'm certain somebody out there is already working on it.

So how can we respond to this new world?

Expand robots.txt

If there's personal information on a page, make it clear that there are privacy implications for handling it, and lay out some rules. I don't know exactly what this would look like, but a noanalyze or indexonly meta tag that worked like noarchive might be a good start. It would be a polite request that the crawler only use the information for serving direct user searches. Like robots.txt directives, it's not enforceable, but it would give clear guidance and give networks a stick to beat violators with.
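To make the idea concrete, here is a minimal sketch of how a well-behaved crawler might honor such a tag. The directive names noanalyze and indexonly are only the hypothetical ones floated above and are not part of any existing standard. The helper name may_analyze is also made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical directive names from the proposal above; no crawler
# standard defines them today.
ANALYSIS_OPT_OUTS = {"noanalyze", "indexonly"}


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # A valueless attribute arrives as None, so normalize it to "".
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def may_analyze(page_html: str) -> bool:
    """Return False if the page asks crawlers not to analyze its content."""
    parser = RobotsMetaParser()
    parser.feed(page_html)
    return not (parser.directives & ANALYSIS_OPT_OUTS)
```

A crawler would call may_analyze on each fetched page and, when it returns False, keep the page for search indexing only and skip any deeper processing. Nothing forces a crawler to run this check, which is the same limitation noarchive already has.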

Look backwards

We've lived with our names and addresses in public phone books for over a century, despite the potential for abuse by time-traveling robot hitmen. We mitigated the risks by adopting some simple tricks that might also work in the internet world. How about just an initial for your first name to limit gender identification? There was also a very clear process for 'hide me', going ex-directory, that was standardized and easy to understand, not complex and constantly changing like the space shuttle control panel that most privacy dialogs now resemble.
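As a rough illustration of the phone-book approach, the sketch below builds a public listing that shows only a first initial and honors a single opt-out flag. The function name public_listing and the ex_directory parameter are assumptions made up for this example, not an existing API.

```python
from typing import Optional


def public_listing(first_name: str, last_name: str,
                   ex_directory: bool = False) -> Optional[str]:
    """Return a phone-book style name, or None if the user opted out."""
    if ex_directory:
        # One simple, well-understood switch: the user is not listed at all.
        return None
    first = first_name.strip()
    # Only the initial is exposed, which limits gender guessing from names.
    initial = f"{first[0].upper()}. " if first else ""
    return f"{initial}{last_name.strip()}"
```

For example, public_listing("Sarah", "Connor") gives "S. Connor", and the same call with ex_directory=True gives None.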

Obfuscate sensitive information

Keeping email addresses as images is an old trick, but if there's information you'd like to show to humans on public profiles but not have stored by robots, why not use the same technique for that too? It's far from perfect, but it makes grabbing that data much tougher and slower. You can also use JavaScript to make it harder for a crawler to pull the information, but still leave it as text.
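One way the JavaScript variant could work is sketched below, assuming the page is rendered server-side in Python. The value is shipped base64-encoded and only turned back into text by a small script in the browser, so a crawler that doesn't execute scripts never sees the plain string. The function name obfuscated_span is hypothetical, and the sketch assumes element_id is a simple identifier chosen by the site rather than user input.

```python
import base64
import html
import json


def obfuscated_span(text: str, element_id: str) -> str:
    """Return an HTML snippet that reveals `text` only after JS runs."""
    # Base64 keeps the raw string out of the page source. This is
    # obfuscation, not encryption: a determined crawler can decode it.
    encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
    safe_id = html.escape(element_id, quote=True)
    script = (
        "(function(){"
        f"var b=atob({json.dumps(encoded)});"
        # Convert the binary string to bytes so UTF-8 text decodes correctly.
        "var bytes=Uint8Array.from(b,function(c){return c.charCodeAt(0);});"
        f"document.getElementById({json.dumps(element_id)}).textContent="
        "new TextDecoder().decode(bytes);"
        "})();"
    )
    return f'<span id="{safe_id}"></span><script>{script}</script>'
```

The trade-off is the same as with images: visitors with scripting disabled see nothing, and a crawler that runs a headless browser gets the text anyway. It only raises the cost of bulk collection.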

Keep changing links and ids

There's no reason that the id for a public profile has to have anything in common with the actual user id, or that the portrait image URL can't be a redirect that changes every two weeks. Keeping the public and private worlds unconnected makes it much harder to subvert the privacy constraints.
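A minimal sketch of a rotating public identifier follows, assuming the site holds a server-side secret and derives the token from the internal user id plus the current two-week period. The names rotating_token and ROTATION_SECONDS are made up for illustration. A real site would also need some way to map the current token back to the user, which is not shown here.

```python
import hashlib
import hmac

ROTATION_SECONDS = 14 * 24 * 60 * 60  # two weeks, as suggested above


def rotating_token(secret: bytes, user_id: int, now: float) -> str:
    """Derive a public token that changes every ROTATION_SECONDS."""
    # Every timestamp inside the same two-week window maps to one period.
    period = int(now // ROTATION_SECONDS)
    message = f"{user_id}:{period}".encode("ascii")
    # HMAC with a server-side secret: without the secret, the token cannot
    # be tied back to the internal user id or to earlier tokens.
    return hmac.new(secret, message, hashlib.sha256).hexdigest()[:32]
```

A profile or portrait URL built from this token, using time.time() for now, stays stable within a period and then changes, so links harvested by a crawler go stale instead of serving as a permanent key into the private data.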
