I just returned from a panel at UC Berkeley’s DataEdge conference on “How surveillants think”. I was the unofficial spokesman for corporate surveillance, since not many startup people are willing to talk about how we’re using the flood of new data that people are broadcasting about themselves. I was happy to stand up there because the only way I can be comfortable working in the field is if I’m able to be open about what I’m doing. Blogging and speaking are my ways of getting a reality check from the rest of the world on the ethics of my work.
One of the most interesting parts was an argument between Vivek Wadhwa and Gilman Louie, former head of In-Q-Tel, the venture capital arm of the US intelligence services. Apologies in advance to both of them for butchering their positions, but I’ll try to do them justice. We all agreed that retaining privacy in the internet age was a massive problem. Vivek said that the amount of data available about us and the technology for extracting meaning from it were advancing so fast that social norms had no hope of catching up. The solution was a system where we all own our data. Gilman countered with a slew of examples from the public sector, talking about approaches like the “Do not call” registry that solved tough privacy problems.
A few years ago I would have agreed with Vivek. As a programmer there’s something intuitively appealing about data ownership. We build our security models around the concepts of permissions, and it’s fun to imagine a database that stores the source of every value it contains, allowing all sorts of provenance-based access. At any point, you could force someone to run a “DELETE FROM corp_data WHERE source='Pete Warden';”, and your information would vanish. This is actually how a lot of existing data protection laws work, especially in the EU. The problem is that the approach completely falls over once you move beyond explicitly-entered personal information. Here are a few reasons why.
Data is invisible
The first problem is that there’s no way to tell what data’s been collected on you. Facebook used to have a rule that any information third parties pulled from their API had to be deleted after 24 hours. I don’t know how many developers obeyed that rule, and neither does anyone else. Another example is Twitter’s streaming API; if somebody deletes a tweet after it’s been broadcast, users of the API are supposed to delete the message from their archives too, but again it’s opaque how often that’s honored. Collections of private, sensitive information are impossible to detect unless they’re exposed publicly. They can even be used as the inputs to all sorts of algorithms, from ad targeting to loan approvals, and we’d still never know. You can’t enforce ownership if you don’t know someone else has your data.
Data is odorless
Do I know that you like dogs from a pet store purchase, or from a photo you posted privately on Facebook, from an online survey you filled out, from a blog post you wrote, from a charitable donation you made, or from a political campaign you gave money to? It’s the same fact, but if you don’t give permission to Facebook or the pet store to sell your information, and you discover another company has it, how do you tell what the chain of ownership was? You could require the provenance-tagging approach; I know intelligence agencies have systems like that to ensure every word of every sentence of a briefing can be traced back to its sources, but it’s both a massive engineering effort and easy to fake. Just pretend that you have the world’s most awesome dog-loving prediction algorithm from other public data, and say that’s the source. With no practical way to tell where a fact came from, you can’t assert ownership of it.
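To make the provenance problem concrete, here is a minimal sketch in Python using the standard sqlite3 module. The table name corp_data comes from the DELETE statement earlier; the columns, helper functions, and source labels are all made up for illustration and don't describe any real system. It stores a self-reported source next to each fact, then shows that a deletion request keyed on that source only removes rows that were labeled honestly.

```python
import sqlite3


def build_store():
    """Create an in-memory table where every fact carries a self-reported source."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE corp_data (person TEXT, fact TEXT, source TEXT)")
    return conn


def record_fact(conn, person, fact, source):
    """Store a fact. Nothing verifies that `source` is the true origin."""
    conn.execute(
        "INSERT INTO corp_data (person, fact, source) VALUES (?, ?, ?)",
        (person, fact, source),
    )


def delete_by_source(conn, person, source):
    """Honor a deletion request for one person and one claimed source.

    Returns the number of rows removed.
    """
    cur = conn.execute(
        "DELETE FROM corp_data WHERE person = ? AND source = ?",
        (person, source),
    )
    return cur.rowcount


def facts_about(conn, person):
    """List the (fact, source) pairs still held about a person."""
    rows = conn.execute(
        "SELECT fact, source FROM corp_data WHERE person = ? ORDER BY rowid",
        (person,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = build_store()
    # Same fact, same real origin, but the second row is relabeled as the
    # output of a prediction algorithm run over public data.
    record_fact(conn, "Pete Warden", "likes dogs", "pet_store_purchase")
    record_fact(conn, "Pete Warden", "likes dogs", "inferred_from_public_data")

    removed = delete_by_source(conn, "Pete Warden", "pet_store_purchase")
    print("rows removed:", removed)
    print("still held:", facts_about(conn, "Pete Warden"))
```

The deletion works exactly as designed, and the fact is still in the table. The source column is only as trustworthy as whoever filled it in.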
All data is PII
Gilman talked about how government departments spend a lot of time figuring out how to safely handle personally-identifiable information. One approach to making a data ownership regime more practical is to have it focus on PII, since that feels like a more manageable amount of data. The problem is that deanonymization works on almost any data set that has enough dimensions. You can be identified by your gait, by noise in your camera’s sensor, by accelerometer inconsistencies, by your taste in movies. It turns out we’re all pretty unique! That means that almost any ‘data exhaust’ that might appear innocuous could be used to derive sensitive, personal information. The example I threw out was that Jetpac has the ability to spot unofficial gay bars in repressive places like Tehran, just from the content of public Instagram photos. We try hard to avoid exposing people to harm, and don’t release that sort of information, but anyone who wanted to could do a similar analysis. When the world is instrumented, with gargantuan amounts of sensor data sloshing around, figuring out what could be sensitive is almost impossible, so putting a subset of data under ownership won’t work.
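Here is a toy illustration of why extra dimensions make people identifiable, written as a small Python sketch. The records, field names, and the unique_fraction helper are invented for this example and are not real data or a real deanonymization method; the point is only that each innocuous-looking column shrinks the crowd a person can hide in.

```python
from collections import Counter

# Invented records: no single column identifies anyone.
RECORDS = [
    {"city": "SF", "phone": "A", "favorite_genre": "scifi"},
    {"city": "SF", "phone": "A", "favorite_genre": "drama"},
    {"city": "SF", "phone": "B", "favorite_genre": "scifi"},
    {"city": "Oakland", "phone": "A", "favorite_genre": "scifi"},
    {"city": "Oakland", "phone": "B", "favorite_genre": "drama"},
    {"city": "Oakland", "phone": "B", "favorite_genre": "scifi"},
]


def unique_fraction(records, fields):
    """Fraction of records whose combination of `fields` appears exactly once."""
    if not records:
        return 0.0
    keys = [tuple(record[field] for field in fields) for record in records]
    counts = Counter(keys)
    singled_out = sum(1 for key in keys if counts[key] == 1)
    return singled_out / len(records)


if __name__ == "__main__":
    for fields in (
        ["city"],
        ["city", "phone"],
        ["city", "phone", "favorite_genre"],
    ):
        share = unique_fraction(RECORDS, fields)
        print(f"{fields}: {share:.0%} of people singled out")
```

With this made-up table, city alone singles out nobody, city plus phone singles out a third of the people, and all three columns together single out everyone. Real data sets have far more rows, but also far more columns.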
Nobody wants to own their data
The most depressing thing I’ve discovered over the years is that it’s very hard to get people interested in what’s happening to their data behind closed doors. People have been filling out surveys in magazines for decades, building up databases at massive companies like Acxiom long before the internet came along. For a price, anyone can download detailed information on people, including their salary, kids, medical conditions, military service, political beliefs, and charitable donations. The person in the street just doesn’t care. As long as it’s not causing them problems, nobody’s bothered. It matters when it affects credit scores or other outcomes, but as long as it’s just changing the mix of junk mail they receive, there’s no desire to take any action. Physical and intellectual property laws work because they build on an existing intuitive feeling of ownership. If nobody cares about ownership of their data, we’ll never be able to pass or enforce legislation around the concept.
Privacy needs politics
I’ve been picking on Vivek’s data ownership phrase as an example, and he didn’t have a chance to outline what he truly meant by that, but in my experience every solution that relies on constraining data inputs has similar problems. We’re instrumenting our lives, we’re making the information from our sensors public, and organizations are going to exploit that data. The only way forward I see is to focus on cases where the outcomes from that data analysis are offensive. It’s what people care about after all: the abuses, the actual harms that occur because of things like redlining. The good news is that we have a whole set of social systems set up to digest new problems, come up with rules, and ensure people follow them. Vivek made the point that social mores are lagging far behind the technology, which is true. Legislators, lawyers, and journalists, the people who drive those social systems, don’t understand the new world of data we’re building as technologists. I think where we differ is that I believe it’s possible to get those folks up to speed before it’s too late. It will be a messy, painful, and always incomplete process, but I see signs of it already.
Before anything else can happen, we need journalists to explain what’s going on to the general public. One of the most promising developments I’ve seen is the idea of reporters covering algorithms as a beat, just like they cover crime or finance. As black boxes make an increasing number of decisions about our lives, we need watchdogs who can keep an eye on them. Despite their internal complexity, you can still apply traditional investigative skills to the results. I was pleased to see a similar idea pop up in the recent White House report on Big Data too – “The increasing use of algorithms to make eligibility decisions must be carefully monitored for potential discriminatory outcomes for disadvantaged groups, even absent discriminatory intent.” Once we’ve spotted things going wrong, then we need well-crafted legislation to stop the abuse, and like Gilman I’d point to “Do not call” as a great example of how that can work.
The engineering community is generally very reluctant to get involved in traditional politics, which is why technical solutions like data ownership are so appealing to us. The trouble is we’re now at the point where the mainstream world knows that the new world of data is a big threat to privacy, and they’re going to put laws in place whether we’re involved or not. If we’re not part of the process, and if we haven’t educated the participants to a reasonable level, those are going to be ineffective and even counter-productive laws. I’m trying to do what I can by writing and talking about the realities of our new world, and through volunteering with political campaigns. I don’t have all the answers, but I truly believe the best way for us to tackle this is through the boring footwork of civil society.