The wonderful thing about being a data scientist is that I get all of the credibility of genuine science, with none of the irritating peer review or reproducibility worries. My first taste of this was my Facebook friends connection map. The underlying data was sound, derived from 220m public profiles. The network visualization of drawing lines between the top ten links for each city had issues, but was defensible. The clustering was produced by me squinting at all the lines, coloring in some areas that seemed more connected in a paint program, and picking silly names for the areas. I thought I was publishing an entertaining view of some data I’d extracted, but it was treated like a scientific study. A New York Times columnist used it as evidence that the US was perilously divided. White supremacists dug into the tool to show that Juan was more popular than John in Texan border towns, and so the country was on the verge of being swamped by Hispanics. What was worse, I couldn’t even get my data into the hands of reputable sociologists, thanks to concerns from Facebook.
I’ve enjoyed publishing a lot of data-driven stories since then, but I’ve never ceased to be disturbed at how the inclusion of numbers and the mention of large data sets numbs criticism. The articles live in a strange purgatory between journalism, which most readers have a healthy skepticism towards, and science, where we sub-contract verification to other scientists and so trust the public output far more. If a sociologist tells you that people in Utah only have friends in Utah, you can follow a web of references and peer review to understand if she’s believable. If I, or somebody at a large tech company, tells you the same, there’s no way to check. The source data is proprietary, and in a lot of cases may not even exist any more in the same exact form as databases turn over, and users delete or update their information. Even other data scientists outside the team won’t be able to verify the results. The data scientists I know are honest people, but there’s no external checks in the system to keep them that way. The best you can hope for is blog and Twitter feedback, but without access to the data, or even a full paper on the techniques, you can’t dig very deeply.
Why are data scientists getting all the attention? I blame the real scientists! There’s a mass of fascinating information buried in all the data we’re collecting on ourselves, and traditional scientists have been painfully slow to take advantage of it. There are all sorts of barriers, ranging from the proprietary nature of the source data, the lack of familiarity with methods able to handle the information at scale, and a cultural distance between the academic and startup worlds. None of these should be insurmountable though. There’s great work being done with confidential IRS and US Census data, so the protocols exist to both do real science and preserve secrecy. I’ve seen the size of the fiber-optic bundles at CERN, so physicists at least know how to deal with crazy data rates. Most of the big startups had their roots in universities, so the cultural gap should be bridgeable.
What am I doing about it? I love efforts by teams like OpenPaths to break data out from proprietary silos so actual scientists can use them, and I do what I can to help any I run across. I popularize techniques that are common at startups, but lesser-known in academia. I’m excited when I see folks like Cameron Marlow at Facebook collaborating with academics to produce peer-reviewed research. I keep banging the drum about how shifty and feckless we data scientists really are, in the hope of damping down the starry-eyed credulity that greets our every pronouncement.
What should you do? If you’re a social scientist, don’t let us run away with all the publicity, jump in and figure out how to work with all these new sources. If you’re in a startup, figure out if you have data that tells a story, and see if there’s any academics you can reach. If you’re a reader, heckle data scientists when we make a sage pronouncement, keep us honest!