Five short links

tally

Photo by ahojohnebrause

Telepath logger – An open source keylogger, with a lot of fun extras like webcam snapshots, light-level, and music-monitoring.

A failure of authorship and peer review – An inadvertent #overlyhonestmethods moment in a chemistry paper.

Datahacker – Some beautiful experiments with 3d maps in the browser.

Black holes, lifeloggers, and space brain – Adam Becker’s been doing some great stories for New Scientist, one of my favorite mags, and it’s well worth following his blog too.

Comments on the Mill CPU – I’m a hopeless romantic when it comes to off-beat CPU architectures (I still pine for the clockless ARM Amulet, despite knowing about the problems that have emerged over the decades), so I was excited to read the original announcement of the Mill approach. It’s good to see some in-depth analysis from an expert, explaining a bit more of the history and challenges the Mill architecture faces, especially from dog-slow memory.

 

TechRaking – mining the news

techraking

One of my favorite conferences is TechRaking, a gathering of journalists and technologists hosted by Google. Unfortunately it’s invite-only, but they seem to gather a good selection of reporters interested in using technology, and engineers interested in creating tools for journalists. In the past I’ve met some fascinating people like Lyra McKee for the first time, and caught up with long-time friends like Pete Skomoroch, Somini Sengupta and Burt Herman.

This time, I’m going to live blog the talks, though there’s also an excellent set of notes as a Google Doc here.

Michael Corey from CIR started with an introduction. Tech + muck-raking is what it’s all about, hence the name of the conference. We’re focused on “mining the news” this time, which means turning bits and bytes into stories. We’re going to try to raise the bar on what we can do in this area, by bringing people together.

Phil Bronstein, chair of the CIR. Held up a t-shirt – “I can’t tell you how long it’s been since journalists got to have swag!”. Why is the CIR important? What distinguishes us is the idea of innovation. Journalists suffer from “the higher calling disease”, we don’t want to interact with the messy public. We’ve been discovering that technology can enable us to reconnect with our audience.

Technologists want to create and build something that people use. Journalists have traditionally stopped at “here’s the story, see you later”. The CIR is looking for results from their stories. They do everything from live theater to poetry to get at the emotional truth of the stories, so it’s more than just technology. I’m hoping that this event will be even better than the first one, because we ended up with engineers in one room and reporters in another, just because we’re not familiar with each other.

Richard Gingras, Google. “I have been around since the age of steam-powered modems”. “Why are we here? We’re all here because we care about the relationship between citizens and media, and particularly citizens and journalism.” Our colonial predecessors had a very limited interaction with media, most of their daily lives were dominated by personal interactions. Over the last hundred years, mass media became vastly more important, and now social media has exploded. It’s not just an increase in volume, the nature has changed too. We’re all able to be producers now, we’re no longer the passive audience we were. We’re able to pick the voices we listen to, share what we like, talk back to the authors. Media is no longer an adjunct, it’s a central part of our lives, many of our most important relationships are exclusively online.

How can we integrate quantitative thinking into the daily workflow of journalism, and make it the rule, not the exception? We’ve made great strides in the last few years. The Internet Archive’s closed-caption analysis is exciting, useful, and compelling. ProPublica built a dialysis tracker that finds nearby centers for patients, and estimates how good they are. These kinds of evergreen investigative projects are the way reporting should be going.

The Google Fellowship program is aiming to produce both quant-friendly journalists and journalist-friendly quants. Our society’s need for credible journalism knowledge is more urgent than ever. Can we find truth in the numbers? I suspect we can.

Now we shift to a panel discussion. The members are Richard Gingras, Google; Rodney Gibbs, Texas Tribune; Robyn Tomlin, Digital First Media; Vanessa Schneider, Google Maps; and Sree Balakrishnan, Google Fusion.

“What is your fantasy data set?”. Robyn, a database of gun deaths across the US. Vanessa, a magical database of shape files at all the admin levels. Power outage data would be extremely useful too. Sree is very interested in local data that shows, by ZIP code, exactly what happens to your property tax. Rodney wants property taxes too. There are 900 tax assessors in Texas, and some do well, but a lot require you to appear in person to get the data.

“Everyone knows the present is mobile, mobile, mobile. What does mobile do today for storytelling?”. Vanessa is excited by how readers can contribute in real time from their perspective. Robyn loves being able to tune into the interests of people depending on where they are. She had someone build a database of pet licenses, and was able to serve up geo-located data on what the most popular pet names and types were nearby. Rodney also loves the ability to personalize and make the results location-based, so you can look at average salaries of public servants and school rankings nearby. It is a problem to set up the CMS to make it mobile friendly though. Sree sees a lot of requests from people who want to use Fusion Tables from mobile devices, and that’s an important use case. Local is very compelling to a lot of people, and if we could geo-tag stories it would really help. Richard says over half of their traffic is coming from mobile, so a lot of new projects are mobile first.

“How do smaller organizations deal with this new world of data?” Robyn deals with over 75 newsrooms, so she’s tried to create a community of data so that all of those teams can turn to a hub of expertise. They also work with non-profits like the CIR and ProPublica to get help from their specialists. Every newsroom should have somebody who’s dabbling and can reach out to others! Rodney has had a lot of success encouraging reporters to play with the new tools. One example is a youth prison that had a high rate of injuries in their data, and it turned out that the staff were running fight clubs. That persuaded a lot of people that the data was worth digging into. We’re training reporters in CSS and JavaScript too. 90% of the journalists are excited by it. Vanessa is really interested in how journalists want to use the tools.

“What is the future of Fusion Tables? What’s the next big thing?” – Sree is looking at bringing more of the structured data in the world into Fusion Tables.

“How do we revolt against the tyranny of government data?”. Sree – We’re looking at getting governments to adopt standards, working with organizations like OKFN. The work journalists do in finding data, cleaning it up, and putting it back out there is incredibly useful. Robyn – news organizations have been fighting for a long time to get data. Commonly there are format problems, such as PDF versions of Excel files. Rodney says that Texas is actually very progressive on making data available. He’s been working with school districts, who have been helpful, but for example they were just given a spreadsheet with 100,000 rows and 105,000 columns! They’ve developed skills in dealing with messy data, to the point that the state comptroller wanted to pay them to deal with some of their internal data. Once governments see the potential, they become a lot more willing.

“What are the challenges standing in the way of data journalism?” Richard – it’s getting the proofs of concept. Once we have examples of how powerful these stories can be, people will understand how sexy it is. Robyn says it’s also leadership, news organizations need to emphasize how important it is. Sree mentions a couple of examples: a map of gun ownership, and a map the Nottingham Post in the UK created of where the most parking tickets are issued.

“What about the blowback from that gun story?” The paper posted a map of all the gun owners in the area, and lots of states passed laws making the ownership data private. Robyn says that a lot of journalists were critical, because there wasn’t a good story behind it, they just rushed to get it up there. Sree says that Google has the conversation about the balances all the time. He says the only way forward is to focus on the benefits, and to guard against abuses. Rodney tells a story where his prisoner database had incorrectly marked a few hundred prisoners as child abusers, but it turned out the state had incorrectly entered a code in their data!

Five short links

chalkfive

Photo by Leo Reynolds

San Francisco aerial views from 1938 – A startlingly high-resolution zoomable map of the city from 75 years ago. The detail is amazing, as are the sand dunes across what’s now the Sunset neighborhood. It’s an incredible resource for anyone interested in our local history.

One weird kernel trick – Ad spam for ML researchers – “the secret to learning any problem with just ten training samples”!

Encoding messages in public SSH keys – Pimp your public RSA key with a custom message!

Why we need a compiler with a long-term support release – Dependencies are the devil, and compilers are at the root of most dependency trees. I can’t even imagine how much wrangling it takes to keep a big software platform compiling across dozens of architectures, but this heartfelt plea from the trenches gives me a flavor of the suffering involved. In a few decades we’ll be stuck searching for complete VMs to run old software, since most of the frameworks they depend on will have rotted off the internet.

Archaeology and Github – Liberating data from hard-to-OCR tables at the end of papers, and into a simple shared environment. I don’t know if Github is the right space for this, but like wikis, programmers are the early adopters of collaboration tools, so it’s a good place to start.

 

Five short links

fivefoxes

Photo by Art.Crazed

Flush and reload cache attack – An attack on a common encryption program, using only cache manipulation to figure out exactly which code paths are being executed, and so what the secret inputs are. There are so many obscure ways for information to leak out from any complex system.

How to build effective heatmaps – There’s no objective way to build a heatmap. Every way of turning discrete points into gradients on a map emphasizes different features of the data, and this discussion has a great overview of the most common techniques.

Overly-honest social science – Worth quoting in depth: “Good social scientific practice should be about acknowledging the weaknesses of the methods used: not to reward sloppiness, but to highlight what really goes on; to reassure novice researchers that real world research often is messy; or to improve on current methods and find ways to do things better.  Better in this case does not necessarily mean less subjective, but it does mean more transparent, and usually more rigorous.  The publication of mistakes is a necessary part of this process.”

Visualizing historical map data – I’ve been very impressed with CartoDB, and this is a great example of how powerful it can be. The author uses uncommon names to track New Yorkers’ movements between the 1839 and 1854 censuses.

America’s artificial heartland – The prose is purpler than an eggplant, and a bit hard to swallow, but there’s something thought-provoking about this analysis. He paints a world where we industrialize farming beyond recognition, and then compensate with a lot of expensive theater at the places where we buy our food.

 

How to easily resize and cache images for the mobile web

darkroom

Photo by Trugiaz

Mobile web pages crash when they run out of memory. Apple devices are particularly painful – they’ll die hard without even giving you a chance to record the crash! Even if they don’t crash, performance will be sluggish if you’re pushing memory limits. Large images are the usual cause, but since we display a lot of photos from social networks, I don’t have much control over their size outside of the default dimensions the different services supply. What I really needed was a zero-maintenance way to pull arbitrary sizes of thumbnails from external images, very fast. Happily, I found a way!

The short version is that I set up a server running the excellent ImageProxy open-source project, and then I placed a CloudFront CDN in front of it to cache the results. The ImageProxy server listens for requests containing a source image location as a parameter, along with a set of requested operations also baked into the URL. The code forks a call to ImageMagick’s command-line tools under the hood, which isn’t the most elegant solution, but does provide a lot of flexibility. With this sort of execution model, I do recommend running the server on a separate box from your other machines, for both performance and security reasons.

There are a few wrinkles to getting the code running, so if you’re on EC2, I put together a public AMI you can use in the us-west-1 region as ami-3c85ad79. Otherwise, here are the steps I took, and the gotchas I hit:

– I started with clean Ubuntu 12.04, since this is the modern distribution I’m most comfortable with.

– ImageProxy requires Ruby 1.9.x, and installing it instead of the default 1.8.7 using apt-get proved surprisingly hard! I eventually found these instructions, but you might be better off using RVM.

– I mostly followed the ImageProxy EC2 setup recipe, but one missing piece was that the default AllowEncodedSlashes caused Apache to give 404 errors on some requests, so I had to make a fiddly change to my configuration.

I now had an image proxy server up and running! To check if yours worked, go to the /selftest page on your own server. Pay close attention to the “Resize (CloudFront-compatible URL format)” image, since this is the one that was broken by Apache’s default AllowEncodedSlashes configuration, and is the one you’ll need for the next CDN step.

I’m an Amazon junkie, so when I needed a flexible and fast caching solution, CloudFront was my first stop. I’d never used a CDN before, so I was pleasantly surprised by how easy it was. I clicked on “Create New Distribution” in the console, specified the URL of the ImageProxy server I’d set up as the source, and within a few minutes I had a new CloudFront domain to use. It handles distributing its copies of the cached files to locations near each requesting user, and if a URL’s not yet cached, it will pull it from the ImageProxy server the first time it’s requested, and then offer super-fast access for subsequent calls. It all just worked!

Now, my code targeting mobile devices is able to automagically transform image URLs into their resized equivalents, e.g. from http://eahanson.s3.amazonaws.com/imageproxy/sample.png to http://imageproxy.heroku.com/convert/resize/100x100/source/http%3A%2F%2Feahanson.s3.amazonaws.com%2Fimageproxy%2Fsample.png. There is a slight delay the first time a URL is accessed while the resizing is performed, but as long as you have a finite set of images you’re loading, this should go away pretty quickly as the results get cached by CloudFront.
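If you want to do the same transformation in your own code, here’s a minimal Python sketch of it, based only on the URL format in the example above. The function name and the default base URL are my own placeholders, and I’m assuming the dimensions are separated by a plain ASCII “x”. The important detail is percent-encoding the whole source URL, slashes included, so it travels as a single path segment.

```python
from urllib.parse import quote

# Placeholder base URL taken from the example above; swap in your own
# ImageProxy server or the CloudFront domain sitting in front of it.
PROXY_BASE = "http://imageproxy.heroku.com"


def resized_url(source_url, width, height, proxy_base=PROXY_BASE):
    """Build a resize URL in the /convert/resize/WxH/source/<encoded> format."""
    # safe="" makes quote() encode ':' and '/' too, so the source URL
    # becomes one opaque path segment.
    encoded = quote(source_url, safe="")
    return f"{proxy_base}/convert/resize/{width}x{height}/source/{encoded}"


if __name__ == "__main__":
    print(resized_url(
        "http://eahanson.s3.amazonaws.com/imageproxy/sample.png", 100, 100))
```

Those encoded slashes in the path are what Apache’s default AllowEncodedSlashes setting rejects, which is why the configuration change matters.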

The only other issue to think about is limiting the access to your server, since by default any other site could potentially start leeching off your proxy image processing. There are plenty of other ways of doing the same thing, so I wouldn’t stay awake at night worrying about it, but ImageProxy does allow you to require a signature, and restrict the source URLs to particular domains, both of which would help prevent that sort of usage.
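To give a feel for what those two protections involve, here’s a generic Python sketch of a source-domain allowlist and an HMAC signature check. This is not ImageProxy’s actual signature scheme or configuration, which I haven’t reproduced here. The secret, the allowlist contents, and the function names are all hypothetical, so check the project’s own documentation for the real format.

```python
import hashlib
import hmac
from urllib.parse import urlparse

# Hypothetical values for illustration only.
ALLOWED_DOMAINS = {"eahanson.s3.amazonaws.com"}
SECRET = b"change-me"


def source_allowed(source_url, allowed=ALLOWED_DOMAINS):
    """Return True only if the source image lives on an allowlisted host."""
    host = urlparse(source_url).hostname  # None when there is no host part
    return host is not None and host in allowed


def sign(path, secret=SECRET):
    """Compute a hex HMAC-SHA256 over the request path."""
    return hmac.new(secret, path.encode("utf-8"), hashlib.sha256).hexdigest()


def signature_valid(path, signature, secret=SECRET):
    """Compare signatures in constant time to avoid timing leaks."""
    return hmac.compare_digest(sign(path, secret), signature)
```

The idea is that only code holding the secret can mint valid URLs, so another site can’t point arbitrary images at your server even if it knows the URL format.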

Big thanks to Erik Hanson for building ImageProxy. I’ve been putting off doing this kind of caching for months, but his project made it far simpler than I’d hoped. It’s working great, improving performance and preventing crashes, so if you’re struggling with unwieldy images, I recommend you take a look too!

[Update – I forgot to mention that the ultimate easy solution is a paid service like embed.ly or imgix! After looking into them, I decided the overhead of depending on yet another third-party service outweighed the convenience of avoiding setup hassles. It wasn’t an easy choice though, and I could see myself switching to one of them in the future if the server maintenance becomes a pain – I know that’s what happened with my friend Xavier at Storify!]

Five short links

starfish

Photo by Dezz

Hillbilly Tracking of Low Earth Orbit Satellites – A member of the Southern Appalachian Space Agency gives a rundown of his custom satellite tracking hardware and software, all built from off-the-shelf components. Pure hacking in its highest form, done for the hell of it.

Introduction to data archaeology – I’m fascinated by traditional archaeology, which has taught me a lot about uncovering information from seemingly intractable sources. Tomasz picks a nice concrete problem that shows how decoding an unknown file format from a few examples might not be as hard as you think!

Adversarial Stylometry – If you know your writing will be analyzed by experts in stylometry, can you obfuscate your own style to throw them off, or even mimic someone else’s to frame them?

The man behind the Dickens and Dostoevsky hoax – Though I struggle with the credibility problems of data science, I’m constantly reminded that no field is immune. For years biographers reproduced the story of a meeting that never happened between the British and Russian novelists. The hoax’s author was driven by paranoia about his rejection by academia, and set out to prove that bad work under someone else’s name would be accepted by journals that rejected his regular submissions.

Dedupe – A luscious little Python library for intelligently de-duplicating entities in data. This task normally takes up more time than anything in the loading stage of data processing pipelines, and the loading stage itself always seems to be the most work, so this is a big deal for my projects.

Why you should never trust a data scientist

lieslieslies

Photo by Jesse Means

The wonderful thing about being a data scientist is that I get all of the credibility of genuine science, with none of the irritating peer review or reproducibility worries. My first taste of this was my Facebook friends connection map. The underlying data was sound, derived from 220m public profiles. The network visualization of drawing lines between the top ten links for each city had issues, but was defensible. The clustering was produced by me squinting at all the lines, coloring in some areas that seemed more connected in a paint program, and picking silly names for the areas. I thought I was publishing an entertaining view of some data I’d extracted, but it was treated like a scientific study. A New York Times columnist used it as evidence that the US was perilously divided. White supremacists dug into the tool to show that Juan was more popular than John in Texan border towns, and so the country was on the verge of being swamped by Hispanics. What was worse, I couldn’t even get my data into the hands of reputable sociologists, thanks to concerns from Facebook.

I’ve enjoyed publishing a lot of data-driven stories since then, but I’ve never ceased to be disturbed at how the inclusion of numbers and the mention of large data sets numbs criticism. The articles live in a strange purgatory between journalism, which most readers have a healthy skepticism towards, and science, where we sub-contract verification to other scientists and so trust the public output far more. If a sociologist tells you that people in Utah only have friends in Utah, you can follow a web of references and peer review to understand if she’s believable. If I, or somebody at a large tech company, tells you the same, there’s no way to check. The source data is proprietary, and in a lot of cases may not even exist any more in the same exact form as databases turn over, and users delete or update their information. Even other data scientists outside the team won’t be able to verify the results. The data scientists I know are honest people, but there are no external checks in the system to keep them that way. The best you can hope for is blog and Twitter feedback, but without access to the data, or even a full paper on the techniques, you can’t dig very deeply.

Why are data scientists getting all the attention? I blame the real scientists! There’s a mass of fascinating information buried in all the data we’re collecting on ourselves, and traditional scientists have been painfully slow to take advantage of it. There are all sorts of barriers, ranging from the proprietary nature of the source data, to the lack of familiarity with methods able to handle the information at scale, to a cultural distance between the academic and startup worlds. None of these should be insurmountable though. There’s great work being done with confidential IRS and US Census data, so the protocols exist to both do real science and preserve secrecy. I’ve seen the size of the fiber-optic bundles at CERN, so physicists at least know how to deal with crazy data rates. Most of the big startups had their roots in universities, so the cultural gap should be bridgeable.

What am I doing about it? I love efforts by teams like OpenPaths to break data out from proprietary silos so actual scientists can use them, and I do what I can to help any I run across. I popularize techniques that are common at startups, but lesser-known in academia. I’m excited when I see folks like Cameron Marlow at Facebook collaborating with academics to produce peer-reviewed research. I keep banging the drum about how shifty and feckless we data scientists really are, in the hope of damping down the starry-eyed credulity that greets our every pronouncement.

What should you do? If you’re a social scientist, don’t let us run away with all the publicity. Jump in and figure out how to work with all these new sources. If you’re in a startup, figure out if you have data that tells a story, and see if there are any academics you can reach. If you’re a reader, heckle data scientists when we make a sage pronouncement, and keep us honest!

Five short links

chainlinkfive

Photo by Trevor Pritchard

Network dynamic temporal visualization – Skye Bender-deMoll’s blog is the perfect example of why I had to find an alternative RSS service when Google Reader shut down. He only posts once every few months, but I’d hate to miss any of them! Here he covers a delicious little R tool that makes visualizing complex networks over time easy.

PressureNet live API – We’re all carrying little networked laboratories in our pockets. You see a photo. I see millions of light-sensor readings at an exact coordinate on the earth’s surface with a time resolution down to the millisecond. The future is combining all these signals into new ways of understanding the world, like this real-time stream of atmospheric measurements.

Inferring the origin locations of tweets – You can guess where a Twitter message came from with a surprising degree of accuracy, just based on the unstructured text of the tweet, and the user’s profile.

What every web developer should know about URL encoding – Nobody gets URL encoding right, not even me! Most of the time your application won’t need to handle the obscurer cases (path parameters anyone?) but it’s good to know the corners you’re cutting.

Global Name Data – Adam Hyland alerted me to the awesome OpenGenderTracking project. Not only does this have data on names from the US and the UK, but it includes the scripts to download them from the source sites. Hurrah for reproducibility! I’ve also just been alerted to a Dutch source of first name popularity data.

 

Switching to WordPress from Typepad.com

Photo by Dhaval Shah

I’ve been on the domain petewarden.typepad.com since I started blogging in 2006. At the time I knew that the sensible thing to do was to set up a custom domain that pointed to the typepad.com one, but since I didn’t expect I’d keep it up, it didn’t seem that important. A thousand posts later, I’ve had plenty of time to regret that shortcut!

Plenty of friends have tried to stage interventions, especially those with any design sense. I would half-heartedly mutter “If it ain’t broke, don’t fix it” and plead lack of time, but the truth is my blog was broken. I post because I want people to read what I write, and the aging design and structure drove readers away. When Matt Mullenweg called me out on it, I knew I had to upgrade.

The obvious choice was WordPress.com. Much as I love open source, I hate dealing with upgrades, dependencies, and spam, and I’m happy to pay a few dollars a month not to worry about any of those. Here’s what I had to do:

Purchased a premium WordPress domain at https://petewarden.wordpress.com. I wanted paid hosting, just like I had at Typepad, partly for the extra features (custom domains in particular) but also because I want to pay for anything that’s this important to me.

Bought the Elemin design template created by Automattic. I wanted something clean, and not too common, and I haven’t seen this minimal design in too many places, unlike a lot of other themes.

Exported my old posts as a text file, and then uploaded them to my new WordPress blog. The biggest downside to this is that none of the images are transferred, so I’ll need to keep the old typepad.com blog up until I can figure out some scripted way to move those too. The good news is that all of the posts seem to have transferred without any problems.

Set up a custom domain in the WordPress settings to point to petewarden.com. This involved changing my nameservers through my DNS provider, and then using a custom settings language to duplicate the MX records. This was the scariest part, since I’m using Google Apps for my email and I really didn’t want to mess their settings up. The only hitch I hit was trying to duplicate the www CNAME. I couldn’t get an entry working until I realized that it was handled automatically, so I just had to leave it out! With that all set up, I made petewarden.com my primary domain, so that petewarden.wordpress.com links would redirect to it.

Updated my Feedburner settings to point to the WordPress RSS feed. I was worried about this step too, but other than duplicating the last ten posts when I checked in my RSS reader, it just seemed to work.

Added an RSS redirect in the head section of my old Typepad blog. This is a JavaScript hack, so none of the Google Juice from any popular posts will transfer to the new domain (and it may even ding search rankings), but at least readers should now see the modern version. Unfortunately there’s no proper 30x redirect support in Typepad, so this is the best I can do.

Matt kindly offered to help me with the transfer, but so far I haven’t hit anything I can’t figure out for myself. Using a modern CMS after wrestling with Typepad’s neglected interface for years is like slipping into a warm bath. I’m so glad I took the leap – thanks to everyone who pushed me!