Web crawling used to be restricted to big companies, purely because the cost was prohibitive. Now anyone with a few thousand dollars and an Amazon Web Services account can crawl and analyze hundreds of millions of pages. The old honor system that worked when only Google, Yahoo and Microsoft had access to that data won't cut it anymore.
My approach has been to avoid dealing with companies that seem spammy or scammy, and to work out in the open. I'm no saint: I'd like to make a living from the insights I can gather from public profiles, but I'm also very aware that most people don't know how much they're exposing.
The problem is that information made available to help search engines can also be fed into sophisticated analysis pipelines to produce much deeper and potentially more invasive data sets. What does that mean in practice? Using only public profile data, you could combine gender-guessing and portrait images to produce HotOrNot 2.0. Getting even creepier, it would be possible to match up interests, locations and even friends in common using that same data to produce a great tool for stalkers and perverts. Intuitively that all seems very wrong, but since it's technically straightforward, I'm certain somebody out there is already working on it.
So how can we respond to this new world?
If there's personal information on a page, make it clear that there are privacy implications for handling it, and lay out some rules. I don't know exactly what this would look like, but a noanalyze or indexonly meta tag that worked like noarchive might be a good start. It would be a polite request that the crawler only use the information for serving direct user searches. Like all robots.txt directives it's not enforceable, but it would give clear guidance and give networks a stick to beat violators with.
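Here's a minimal sketch of how a well-behaved crawler might honor such a request. The noanalyze and indexonly names come straight from the suggestion above and are hypothetical: as far as I know no crawler recognizes them today. The `RobotsMetaParser` class and `may_analyze` function are also names made up for this illustration. The code uses only Python's standard-library HTML parser.

```python
from html.parser import HTMLParser

# Hypothetical directive names from the proposal above; not an existing standard.
ANALYSIS_OPT_OUT = {"noanalyze", "indexonly"}


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        # Directives are comma-separated, as with noarchive and noindex.
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def may_analyze(html):
    """Return False if the page asks crawlers not to analyze its contents."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return not (parser.directives & ANALYSIS_OPT_OUT)
```

A pipeline would call `may_analyze(page_html)` before passing a page to anything beyond the search index, and skip the page if it returns False. Nothing forces a crawler to make that check, which is exactly the limitation described above.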
Obfuscate sensitive information
We've lived with our names and addresses in public phone books for over a century, despite the potential for abuse by time-traveling robot hitmen. We mitigated the risks by adopting some simple tricks that might also work in the internet world. How about just an initial for your first name to limit gender identification? There was also a very clear process for 'hide me', going ex-directory, that was standardized and easy to understand, not complex and constantly changing like the space shuttle control panel that most privacy dialogs now resemble.
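The phone-book trick of showing only a first initial is easy to sketch. The function name and the two-field name model below are assumptions made for illustration, and real names are messier than this (single names, multiple given names, different name orders).

```python
def public_display_name(first_name, last_name):
    """Return a phone-book style name: first initial plus full last name."""
    first = first_name.strip()
    last = last_name.strip()
    if not first:
        # Nothing to abbreviate, so show the last name alone.
        return last
    return f"{first[0].upper()}. {last}"


print(public_display_name("Jane", "Doe"))  # J. Doe
```

The public profile page would render this value, while the full first name stays on the private side of the site.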
Keep changing links and ids
There's no reason that the id for a public profile has to have anything in common with the actual user id, or that the portrait image URL can't be a redirect that changes every two weeks. Keeping the public and private worlds unconnected makes it much harder to subvert the privacy constraints.
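One way this could work is to derive public identifiers from a keyed hash, so they reveal nothing about the internal user id, and to fold a two-week time period into the portrait token so the URL rotates on its own. This is only a sketch under those assumptions: the function names, the `/portrait/` path and the 16-character token length are all invented for the example.

```python
import hashlib
import hmac

ROTATION_SECONDS = 14 * 24 * 60 * 60  # two weeks, as suggested above


def _token(secret_key, message):
    """Keyed hash of a message, truncated to a short URL-friendly token."""
    digest = hmac.new(secret_key, message.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def public_profile_id(secret_key, user_id):
    """Stable public id that cannot be mapped back to user_id without the key."""
    return _token(secret_key, f"profile:{user_id}")


def portrait_path(secret_key, user_id, now):
    """Portrait URL path whose token changes every ROTATION_SECONDS.

    `now` is a Unix timestamp in seconds, e.g. time.time().
    """
    period = int(now // ROTATION_SECONDS)
    return "/portrait/" + _token(secret_key, f"portrait:{user_id}:{period}")
```

Because the hash isn't reversible, the server would need its own lookup from token back to user (a database column, or recomputing tokens for the current period), and the secret key has to stay private. A crawler that saved last month's portrait URL would find it no longer resolves.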