Five short links


Photo by Art.Crazed

Flush and reload cache attack – An attack on a common encryption program, using only cache manipulation to figure out exactly which code paths are being executed, and so what the secret inputs are. There are so many obscure ways for information to leak out from any complex system.

How to build effective heatmaps – There’s no objective way to build a heatmap. Every way of turning discrete points into gradients on a map emphasizes different features of the data, and this discussion has a great overview of the most common techniques.

Overly-honest social science – Worth quoting in depth: “Good social scientific practice should be about acknowledging the weaknesses of the methods used: not to reward sloppiness, but to highlight what really goes on; to reassure novice researchers that real world research often is messy; or to improve on current methods and find ways to do things better.  Better in this case does not necessarily mean less subjective, but it does mean more transparent, and usually more rigorous.  The publication of mistakes is a necessary part of this process.”

Visualizing historical map data – I’ve been very impressed with CartoDB, and this is a great example of how powerful it can be. The author uses uncommon names to track New Yorkers’ movements between the 1839 and 1854 censuses.

America’s artificial heartland – The prose is purpler than an eggplant, and a bit hard to swallow, but there’s something thought-provoking about this analysis. He paints a world where we industrialize farming beyond recognition, and then compensate with a lot of expensive theater at the places where we buy our food.

 

How to easily resize and cache images for the mobile web


Photo by Trugiaz

Mobile web pages crash when they run out of memory. Apple devices are particularly painful – they’ll die hard without even giving you a chance to record the crash! Even if they don’t crash, performance will be sluggish if you’re pushing memory limits. Large images are the usual cause, but since we display a lot of photos from social networks, I don’t have much control over their size outside of the default dimensions the different services supply. What I really needed was a zero-maintenance way to pull arbitrary sizes of thumbnails from external images, very fast. Happily, I found a way!

The short version is that I set up a server running the excellent ImageProxy open-source project, and then I placed a CloudFront CDN in front of it to cache the results. The ImageProxy server listens for requests containing a source image location as a parameter, along with a set of requested operations also baked into the URL. The code forks a call to ImageMagick’s command-line tools under the hood, which isn’t the most elegant solution, but does provide a lot of flexibility. With this sort of execution model, I do recommend running the server on a separate box from your other machines, for both performance and security reasons.
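For anyone curious what that execution model looks like in practice, here’s a minimal sketch of the flow – in Python, purely illustrative, and not ImageProxy’s actual Ruby code: fetch the source image named in the request, then shell out to ImageMagick’s convert tool to produce the resized version. It assumes the convert binary is installed and on the path.

    import subprocess
    import tempfile
    import urllib.request

    def resize_via_imagemagick(source_url, width, height):
        # Download the original image to a temporary file.
        src = tempfile.NamedTemporaryFile(suffix=".img", delete=False)
        with urllib.request.urlopen(source_url) as response:
            src.write(response.read())
        src.close()

        # Shell out to ImageMagick, the way the proxy does under the hood;
        # the real server builds these arguments from the operations
        # encoded in the request URL.
        resized_path = src.name + "_resized.png"
        subprocess.check_call([
            "convert", src.name,
            "-resize", "%dx%d" % (width, height),
            resized_path,
        ])
        return resized_path  # the file to serve back to the client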

There are a few wrinkles to getting the code running, so if you’re on EC2, I put together a public AMI you can use on US-West1 as ami-3c85ad79. Otherwise, here are the steps I took, and the gotchas I hit:

– I started with a clean Ubuntu 12.04 install, since that’s the modern distribution I’m most comfortable with.

– ImageProxy requires Ruby 1.9.x, and installing it instead of the default 1.8.7 using apt-get proved surprisingly hard! I eventually found these instructions, but you might be better off using RVM.

– I mostly followed the ImageProxy EC2 setup recipe, but one missing piece was that the default AllowEncodedSlashes caused Apache to give 404 errors on some requests, so I had to make a fiddly change to my configuration.
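For reference, the directive at issue is Apache’s AllowEncodedSlashes, which is Off by default and so rejects the %2F-encoded slashes that ImageProxy’s source URLs rely on. The change is typically a single line in the site’s VirtualHost, roughly like the snippet below – the hostname is just a placeholder, and the NoDecode value needs Apache 2.2.18 or later.

    <VirtualHost *:80>
        # "images.example.com" is a placeholder - use your own hostname
        ServerName images.example.com
        # Pass the %2F sequences in proxied source URLs through untouched
        AllowEncodedSlashes NoDecode
        # ... the rest of the ImageProxy site configuration goes here ...
    </VirtualHost>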

I now had an image proxy server up and running! To check that yours is working, go to the /selftest page on your own server. Pay close attention to the “Resize (CloudFront-compatible URL format)” image, since that’s the one broken by Apache’s default AllowEncodedSlashes configuration, and it’s the one you’ll need for the next CDN step.

I’m an Amazon junkie, so when I needed a flexible and fast caching solution, CloudFront was my first stop. I’d never used a CDN before, so I was pleasantly surprised by how easy it was. I clicked on “Create New Distribution” in the console, specified the URL of the ImageProxy server I’d set up as the origin, and within a few minutes I had a new CloudFront domain to use. It handles distributing cached copies of the files to edge locations near each requesting user, and if a URL’s not yet cached, it pulls the result from the ImageProxy server the first time it’s requested, and then offers super-fast access for subsequent calls. It all just worked!

Now, my code targeting mobile devices is able to automagically transform image URLs into their resized equivalents, e.g. from http://eahanson.s3.amazonaws.com/imageproxy/sample.png to http://imageproxy.heroku.com/convert/resize/100x100/source/http%3A%2F%2Feahanson.s3.amazonaws.com%2Fimageproxy%2Fsample.png. There is a slight delay the first time a URL is accessed while the resizing is performed, but as long as you have a finite set of images you’re loading, this should go away pretty quickly as the results get cached by CloudFront.
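If you want to build these URLs in your own code, the only real trick is percent-encoding the entire source address so it survives as a single path segment. Here’s a rough sketch in Python – the proxy host is a placeholder you’d swap for your own CloudFront domain (or ImageProxy server), and the path just follows the convert/resize format shown above.

    from urllib.parse import quote

    # Placeholder host - point this at your own CloudFront distribution
    # (or directly at your ImageProxy server).
    PROXY_HOST = "http://your-cloudfront-domain.example.net"

    def thumbnail_url(source_url, width, height):
        # Encode everything, including ':' and '/', so the source URL
        # becomes one opaque path segment in the proxy request.
        encoded = quote(source_url, safe="")
        return "%s/convert/resize/%dx%d/source/%s" % (
            PROXY_HOST, width, height, encoded)

    print(thumbnail_url(
        "http://eahanson.s3.amazonaws.com/imageproxy/sample.png", 100, 100))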

The only other issue to think about is limiting access to your server, since by default any other site could potentially start leeching off your proxy’s image processing. There are plenty of other ways of doing the same thing, so I wouldn’t stay awake at night worrying about it, but ImageProxy does allow you to require a signature, and to restrict the source URLs to particular domains, both of which would help prevent that sort of usage.

Big thanks to Erik Hanson for building ImageProxy. I’d been putting off doing this kind of caching for months, but his project made it far simpler than I’d hoped. It’s working great, improving performance and preventing crashes, so if you’re struggling with unwieldy images, I recommend you take a look too!

[Update – I forgot to mention that the ultimate easy solution is a paid service like embed.ly or imgix! After looking into them, I decided the overhead of depending on yet another third-party service outweighed the convenience of avoiding setup hassles. It wasn’t an easy choice though, and I could see myself switching to one of them in the future if the server maintenance becomes a pain – I know that’s what happened with my friend Xavier at Storify!]

Five short links


Photo by Dezz

Hillbilly Tracking of Low Earth Orbit Satellites – A member of the Southern Appalachian Space Agency gives a rundown of his custom satellite tracking hardware and software, all built from off-the-shelf components. Pure hacking in its highest form, done for the hell of it.

Introduction to data archaeology – I’m fascinated by traditional archaeology, which has taught me a lot about uncovering information from seemingly intractable sources. Tomasz picks a nice concrete problem that shows how decoding an unknown file format from a few examples might not be as hard as you think!

Adversarial Stylometry – If you know your writing will be analyzed by experts in stylometry, can you obfuscate your own style to throw them off, or even mimic someone else’s to frame them?

The man behind the Dickens and Dostoevsky hoax – Though I struggle with the credibility problems of data science, I’m constantly reminded that no field is immune. For years biographers reproduced the story of a meeting that never happened between the British and Russian novelists. The hoax’s author was driven by paranoia about his rejection by academia, and set out to prove that bad work under someone else’s name would be accepted by journals that rejected his regular submissions.

Dedupe – A luscious little Python library for intelligently de-duplicating entities in data. This task normally takes up more time than anything else in the loading stage of data processing pipelines, and the loading stage itself always seems to be the most work, so this is a big deal for my projects.

Why you should never trust a data scientist


Photo by Jesse Means

The wonderful thing about being a data scientist is that I get all of the credibility of genuine science, with none of the irritating peer review or reproducibility worries. My first taste of this was my Facebook friends connection map. The underlying data was sound, derived from 220m public profiles. The network visualization of drawing lines between the top ten links for each city had issues, but was defensible. The clustering was produced by me squinting at all the lines, coloring in areas that seemed more connected using a paint program, and picking silly names for them. I thought I was publishing an entertaining view of some data I’d extracted, but it was treated like a scientific study. A New York Times columnist used it as evidence that the US was perilously divided. White supremacists dug into the tool to show that Juan was more popular than John in Texan border towns, and so the country was on the verge of being swamped by Hispanics. What was worse, I couldn’t even get my data into the hands of reputable sociologists, thanks to concerns from Facebook.

I’ve enjoyed publishing a lot of data-driven stories since then, but I’ve never ceased to be disturbed at how the inclusion of numbers and the mention of large data sets numbs criticism. The articles live in a strange purgatory between journalism, which most readers have a healthy skepticism towards, and science, where we sub-contract verification to other scientists and so trust the public output far more. If a sociologist tells you that people in Utah only have friends in Utah, you can follow a web of references and peer review to understand if she’s believable. If I, or somebody at a large tech company, tells you the same, there’s no way to check. The source data is proprietary, and in a lot of cases may not even exist any more in exactly the same form, as databases turn over and users delete or update their information. Even other data scientists outside the team won’t be able to verify the results. The data scientists I know are honest people, but there are no external checks in the system to keep them that way. The best you can hope for is blog and Twitter feedback, but without access to the data, or even a full paper on the techniques, you can’t dig very deeply.

Why are data scientists getting all the attention? I blame the real scientists! There’s a mass of fascinating information buried in all the data we’re collecting on ourselves, and traditional scientists have been painfully slow to take advantage of it. There are all sorts of barriers: the proprietary nature of the source data, a lack of familiarity with methods that can handle the information at scale, and the cultural distance between the academic and startup worlds. None of these should be insurmountable though. There’s great work being done with confidential IRS and US Census data, so the protocols exist to both do real science and preserve secrecy. I’ve seen the size of the fiber-optic bundles at CERN, so physicists at least know how to deal with crazy data rates. Most of the big startups had their roots in universities, so the cultural gap should be bridgeable.

What am I doing about it? I love efforts by teams like OpenPaths to break data out from proprietary silos so actual scientists can use them, and I do what I can to help any I run across. I popularize techniques that are common at startups, but lesser-known in academia. I’m excited when I see folks like Cameron Marlow at Facebook collaborating with academics to produce peer-reviewed research. I keep banging the drum about how shifty and feckless we data scientists really are, in the hope of damping down the starry-eyed credulity that greets our every pronouncement.

What should you do? If you’re a social scientist, don’t let us run away with all the publicity; jump in and figure out how to work with all these new sources. If you’re in a startup, figure out whether you have data that tells a story, and see if there are any academics you can reach. If you’re a reader, heckle data scientists when we make a sage pronouncement, and keep us honest!

Five short links


Photo by Trevor Pritchard

Network dynamic temporal visualization – Skye Bender-deMoll’s blog is the perfect example of why I had to find an alternative RSS service when Google Reader shut down. He only posts once every few months, but I’d hate to miss any of them! Here he covers a delicious little R tool that makes visualizing complex networks over time easy.

PressureNet live API – We’re all carrying little networked laboratories in our pockets. You see a photo. I see millions of light-sensor readings at an exact coordinate on the earth’s surface with a time resolution down to the millisecond. The future is combining all these signals into new ways of understanding the world, like this real-time stream of atmospheric measurements.

Inferring the origin locations of tweets – You can guess where a Twitter message came from with a surprising degree of accuracy, just based on the unstructured text of the tweet, and the user’s profile.

What every web developer should know about URL encoding – Nobody gets URL encoding right, not even me! Most of the time your application won’t need to handle the obscurer cases (path parameters anyone?) but it’s good to know the corners you’re cutting.

Global Name Data – Adam Hyland alerted me to the awesome OpenGenderTracking project. Not only does this have data on names from the US and the UK, but it includes the scripts to download them from the source sites. Hurrah for reproducibility! I’ve also just been alerted to a Dutch source of first name popularity data.

 

Switching to WordPress from Typepad.com


Photo by Dhaval Shah

I’ve been on the domain petewarden.typepad.com since I started blogging in 2006. At the time I knew that the sensible thing to do was to set up a custom domain that pointed to the typepad.com one, but since I didn’t expect I’d keep it up, it didn’t seem that important. A thousand posts later, I’ve had plenty of time to regret that shortcut!

Plenty of friends have tried to stage interventions, especially those with any design sense. I would half-heartedly mutter “If it ain’t broke, don’t fix it” and plead lack of time, but the truth is my blog was broken. I post because I want people to read what I write, and the aging design and structure drove readers away. When Matt Mullenweg called me out on it, I knew I had to upgrade.

The obvious choice was WordPress.com. Much as I love open source, I hate dealing with upgrades, dependencies, and spam, and I’m happy to pay a few dollars a month not to worry about any of those. Here’s what I had to do:

Purchased a premium WordPress domain at https://petewarden.wordpress.com. I wanted paid hosting, just like I had at Typepad, partly for the extra features (custom domains in particular) but also because I want to pay for anything that’s this important to me.

Bought the Elemin design template created by Automattic. I wanted something clean and not too common, and unlike a lot of other themes, I haven’t seen this minimal design in many places.

Exported my old posts as a text file, and then uploaded them to my new WordPress blog. The biggest downside to this is that none of the images are transferred, so I’ll need to keep the old typepad.com blog up until I can figure out some scripted way to move those too. The good news is that all of the posts seem to have transferred without any problems.

Set up a custom domain in the WordPress settings to point to petewarden.com. This involved changing my nameservers through my DNS provider, and then using a custom settings language to duplicate the MX records. This was the scariest part, since I’m using Google Apps for my email and I really didn’t want to mess their settings up. The only hitch was the www CNAME: I couldn’t get an entry working until I realized it was handled automatically, so I just had to leave it out! With that all set up, I made petewarden.com my primary domain, so that petewarden.wordpress.com links would redirect to it.

Updated my Feedburner settings to point to the WordPress RSS feed. I was worried about this step too, but other than duplicating the last ten posts when I checked in my RSS reader, it just seemed to work.

Added an RSS redirect in the head section of my old Typepad blog. This is a JavaScript hack, so none of the Google Juice from any popular posts will transfer to the new domain (and it may even ding search rankings), but at least readers should now see the modern version. Unfortunately there’s no proper 30x redirect support in Typepad, so this is the best I can do.

Matt kindly offered to help me with the transfer, but so far I haven’t hit anything I can’t figure out for myself. Using a modern CMS after wrestling with Typepad’s neglected interface for years is like slipping into a warm bath. I’m so glad I took the leap – thanks to everyone who pushed me!