Why you should never trust a data scientist

July 18, 2013 By Pete Warden in Uncategorized 44 Comments

The wonderful thing about being a data scientist is that I get all of the credibility of genuine science, with none of the irritating peer review or reproducibility worries. My first taste of this was my Facebook friends connection map. The underlying data was sound, derived from 220m public profiles. The network visualization of drawing lines between the top ten links for each city had issues, but was defensible. The clustering was produced by me squinting at all the lines, coloring in some areas that seemed more connected in a paint program, and picking silly names for the areas. I thought I was publishing an entertaining view of some data I’d extracted, but it was treated like a scientific study. A New York Times columnist used it as evidence that the US was perilously divided. White supremacists dug into the tool to show that Juan was more popular than John in Texan border towns, and so the country was on the verge of being swamped by Hispanics. What was worse, I couldn’t even get my data into the hands of reputable sociologists, thanks to concerns from Facebook.

I’ve enjoyed publishing a lot of data-driven stories since then, but I’ve never ceased to be disturbed at how the inclusion of numbers and the mention of large data sets numbs criticism. The articles live in a strange purgatory between journalism, which most readers have a healthy skepticism towards, and science, where we sub-contract verification to other scientists and so trust the public output far more. If a sociologist tells you that people in Utah only have friends in Utah, you can follow a web of references and peer review to understand if she’s believable. If I, or somebody at a large tech company, tells you the same, there’s no way to check. The source data is proprietary, and in a lot of cases it may no longer exist in exactly the same form, as databases turn over and users delete or update their information. Even other data scientists outside the team won’t be able to verify the results. The data scientists I know are honest people, but there are no external checks in the system to keep them that way. The best you can hope for is blog and Twitter feedback, but without access to the data, or even a full paper on the techniques, you can’t dig very deeply.

Why are data scientists getting all the attention? I blame the real scientists! There’s a mass of fascinating information buried in all the data we’re collecting on ourselves, and traditional scientists have been painfully slow to take advantage of it. There are all sorts of barriers, ranging from the proprietary nature of the source data, through the lack of familiarity with methods able to handle the information at scale, to a cultural distance between the academic and startup worlds. None of these should be insurmountable though. There’s great work being done with confidential IRS and US Census data, so the protocols exist to both do real science and preserve secrecy. I’ve seen the size of the fiber-optic bundles at CERN, so physicists at least know how to deal with crazy data rates. Most of the big startups had their roots in universities, so the cultural gap should be bridgeable.

What am I doing about it? I love efforts by teams like OpenPaths to break data out from proprietary silos so actual scientists can use them, and I do what I can to help any I run across. I popularize techniques that are common at startups, but lesser-known in academia. I’m excited when I see folks like Cameron Marlow at Facebook collaborating with academics to produce peer-reviewed research. I keep banging the drum about how shifty and feckless we data scientists really are, in the hope of damping down the starry-eyed credulity that greets our every pronouncement.

What should you do? If you’re a social scientist, don’t let us run away with all the publicity; jump in and figure out how to work with all these new sources. If you’re in a startup, figure out if you have data that tells a story, and see if there are any academics you can reach. If you’re a reader, heckle data scientists when we make a sage pronouncement, and keep us honest!

44 responses

Mahesh CR says:

July 18, 2013 at 1:12 pm

Fantastic points. This goes to the heart of what data analysis should be ideally.

People do not have tools to judge or interpret metrics. We have no in-built BS detector for numbers, as we do for ideas/opinions expressed via speech and text. Numerical literacy is a must to assess what the numbers say.

Just as “findability” is key for content, “reproducibility” should be made essential for any “data science” efforts to be taken seriously. If it is not peer reviewed then results should be received with sufficient skepticism.

Points around the proprietary nature of data, how it constantly evolves, security constraints etc can all be dealt with, if the citizenry choose to question what is being said to them.

Outsourcing thinking and opinion forming to a 3rd party, especially the media, leaves us where we deserve to be!

Reply
- Riccardo Scalco (@eidogram) says:
  
  July 20, 2013 at 5:29 pm
  
  The results should be received with sufficient skepticism even if peer reviewed (even if by Nature or Science); this is part of the scientific approach. Vice versa, a scientific work does not necessarily need peer review; it is sufficient that it is described well enough to be reproducible (“reproducibility” is the keyword, as mentioned).
  If the work is interesting then someone will reproduce it.
  In this sense, as far as data scientists are concerned, I think that the mathematical methods applied should be clearly described (as well as the way the data was collected), without hiding anything; this is honest and sufficient. Then, what people say or think about the analysis is not the responsibility of the author.
  
  Reply
  - N Sukumar says:
    
    August 9, 2013 at 12:51 pm
    
    Hmmm! I’m not very hopeful about that, since no more than 10% of published science seems to be reproducible, even with peer review: http://blog.jove.com/2012/05/03/studies-show-only-10-of-published-science-articles-are-reproducible-what-is-happening
Joe McCarthy says:

July 18, 2013 at 2:27 pm

Thanks for sharing your experience, insights and recommendations. Scientific skepticism is increasingly essential in an era where stunning visual representations can activate other areas of our brains.

I found myself thinking about the PhD Comics classic Science News Cycle cartoon, and how visualization of large data sets seems to shorten that cycle.

Reply
Pingback: Why You Should Never Trust a Data Scientist | Symposium Magazine
Neil says:

July 18, 2013 at 9:42 pm

Geo-Information Scientists have been working away analysing large data sets (for the day), and prising open data silos for decades. Look at the work of Stan Openshaw for example; physicists used to complain he used up too much time on the Cray!

It seems to me that much of what is now being termed Big Data, at least on the Visualisation side, is analytical mapping going mainstream. Not that, as a GI Scientist, I don’t think that is great. There’s plenty of data to go round!

The trouble is that it compounds the problem as to why science is slow to exploit all this data. Too often Data Scientists were once ‘real’ scientists, but got sick of being paid a half or even a third of what their private sector colleagues receive, while constantly having to find funding to extend their contracts.

In the 1980s it was the physicists who left to become “quants” in the banking industry. The next decade will, I fear, see a drain from GIS and Geography departments. Mea culpa, I’m 50% out the door myself.

Reply
AstroBio says:

July 18, 2013 at 11:16 pm

Last I checked, what you described above is not science. If this disturbs you, please just refer to yourself as an analyst.

Reply
- AstroBio says:
  
  July 18, 2013 at 11:29 pm
  
  But thanks for the link to OpenPaths. Cool. I arrived at your blog from Crooked Timber and have added you to my reader.
  
  Reply
- inknzvl says:
  
  September 18, 2013 at 2:10 am
  
  Exactly what I was going to say! Unfortunately most people read “science” and they believe everything without questioning.
  
  Reply
Pingback: When Can You Trust a Data Scientist? | Symposium Magazine
kwalitisme says:

July 22, 2013 at 2:24 pm

Reblogged this on kwalitisme.

Reply
Pingback: Reactions to “Why you should never trust a data scientist” « Pete Warden's blog
Edd Grant says:

July 24, 2013 at 8:06 am

Interesting article. While the work you are doing is great stuff, I would caution against calling it ‘science’. The “irritating peer review or reproducibility worries” that you mention are key principles which distinguish between what you’re doing here, which I think AstroBio correctly terms ‘analysis’, and real Science. I’m not saying this to belittle what you’re doing here but simply because it’s important that Science is celebrated for using techniques such as peer analysis and systematic reviews rather than berated for it. These things are what make science a trustworthy guide to the “best current understanding of how stuff works”.

Do keep up the writing though, it’s really interesting stuff.

Reply
- Pete Warden says:
  
  July 24, 2013 at 5:02 pm
  
  Thanks for the kind words Edd! I have struggled with the ‘science’ part of the Data Science term too, but I have a qualified defense of it here:
  http://strata.oreilly.com/2011/05/data-science-terminology.html
  
  I do think analysis is a useful and under-used term in our field though, it doesn’t have nearly the cachet.
  
  Reply
Pingback: Data Driven News
Pingback: Why you should never trust a data visualisation - rss news
Pingback: Big Data Observations: The Science of Asking Questions | What's The Big Data?
Pingback: Why you Should Never Trust a Data Visualization | Michael Sandberg's Data Visualization Blog
Pingback: Data Viz News [17] | Visual Loop
Pingback: 데이터 과학자들을 그대로 믿어서는 안되는 이유 | NewsPeppermint
Pingback: Why you should never trust a data visualisation | Big Data News
Pingback: Data Recap | Zec Blog
Thomas Speidel says:

July 31, 2013 at 9:18 pm

That’s where certification might help as a first step. To the same extent a tax professional is bound to a code of conduct, or a physician is bound to an oath and a code of conduct, or an engineer will be in trouble when the bridge he designed falls, so too data scientists should have a body that both protects them and disciplines them. Admittedly, it would have provided little help in stopping the spread of your story.

The other side of the coin is that what you call “real scientists” have far more trouble in getting their research out: it takes very competitive grant applications, lots of research, writing rigorous manuscripts, submitting them for publication, dealing with the peer-review criticism and sometimes having to pay money just to get it into the open access world. It is a process that easily takes many years for a single project. A clear benefit is the higher standard or higher baseline, at the expense of killing potentially promising research projects.

I think what we need is a compromise where the scientist can make contributions to projects where, for example the data is already available and of good quality, while the non-scientist can have their work openly and constructively criticized by more professional figures.

Reply
Pingback: Varför "Big Data" bara är lurendrejeri
A Real Scientist says:

August 6, 2013 at 8:16 am

I’m sure that there is many a “real” scientist who would be more than happy to use all of this data. The problems are twofold: 1. We have to convince funding agencies (and ourselves) that it’s worth pursuing the research given that there are countless potential lines of inquiry. 2. If we published all of the connections that we looked for and didn’t find (or all of them that we thought we saw but turned out we didn’t), we would quickly need metadata scientists to be able to sort through all the crap. Please don’t play “blame the scientist”. That game is already played far too much. The situation is far more complicated and involves far more groups than just “real scientists”.

Reply
Urbie Watrous says:

August 7, 2013 at 1:45 pm

Proprietary data hoarding isn’t just for corporations, either. For years, Michael Mann, the climatologist responsible for the infamous “hockey stick” graph used to show an alarming rate of warming over the past several decades, refused to release his source data for other scientists to review. “I don’t have time,” was his explanation. Finally, he released the data — and the hockey stick was promptly challenged by a couple of Canadian researchers who couldn’t reproduce his results! You don’t hear much about that, because Mann was able to control the conversation for so long. Similarly, James Hansen, NASA’s leading climate scientist, did a study that he hoped would show a similarly alarming/unprecedented warming in recent years. But when he crunched the numbers, it turned out that the hottest year of the 20th century was actually… 1934! So he changed the algorithm to make a more recent year look equally hot. This stuff happens all the time, and journalists eat it up. Don’t believe executive summaries — demand to see the source data!

Reply
Pingback: 빅데이터를 향한 비판적 사고 | lodworld
Pingback: stoykonet :: info
medicalquackblog says:

September 3, 2013 at 7:18 pm

Nice article. Do you and Cathy O’Neil collaborate? You should :) Her points are very similar to what you say here as well. Quit creating models that lie…I feature her a lot on my blog to reach the layman in healthcare issues, as I’m a reformed medical record programmer myself, the next level down the ladder. Also I keep about 4 videos in the footer of my blog so folks can stumble on them and maybe get curious…and I have a page dedicated to folks smarter than me who take time to educate with their videos, which I have coined “The Algo Duping 101 Page”…and the result of that is the Attack of the Killer Algorithms…of which I wrote a series of posts showing everyday events of when they bite :)

http://www.ducknet.net/attack-of-the-killer-algorithms/

Reply
Pingback: IZARDNET | Why you should never trust a data visualisation
Pingback: Data Scientists and Shooting the Messenger | Guest Blogs
Pingback: How many people read my posts? « Pete Warden's blog
Pingback: Data-scepticisme | Data denkers
Pingback: Data-scepticisme | Sargasso
Pingback: Data-scepticisme - Analyse - De Nieuwe Reporter - Journalistiek & Nieuwe Media
Shaona Mukherjee says:

October 21, 2013 at 6:00 am

Very well mentioned points. It is true that data scientists have been overhyped, and that’s why I feel one should have at least some basic knowledge about the subject. A few workshops might come in handy so that one is not fooled!

Reply
Pingback: What does Jetpac measure? « Pete Warden's blog
Pingback: 71 useful articles on online behavior change in 2013 you can choose not to read | Mats Stafseng Einarsen
Pingback: #IC14NL: Selfiecity | Masters of Media
Malarkey says:

April 29, 2014 at 7:28 am

You know these are good points … but a lot of (not all of) network analysis in biomedical research is just eye-candy too – regardless of peer review. So don’t feel too bad.

Reply
Pingback: Six Things Facebook (Thinks It) Knows About Your Love Life | Nagg
Pingback: Week 7: Assessment 1 – Activity – Resource list for essay | reannahurley
Pingback: Happy, Healthy, Hungry. Mapping San Francisco Restaurant Cleanliness | Open Data Science
Pingback: Six Things Facebook (Thinks It) Knows About Your Love Life – Flash News


Pete Warden's blog
