Five short links

Fivedrives
Photo by Daryl Mitchell

DynamoDB – I can't tell you how excited I am to see Amazon's new hosted NoSQL service. I so wanted SimpleDB to be usable, but it was too hard to work with thanks to the way it required clients to deal with implementation details like servers and sharding. I ended up using S3 for some projects, but if I was starting Jetpac today I'd seriously consider going with Dynamo rather than self-hosted Cassandra. The main drawback for our requirements is that we couldn't run Pig scripts across the data, but I'd imagine analytics will come.

Unsupervised Decomposition of a Document into Authorial Components – Uses machine-learning techniques to figure out the authorship of books from the Bible, in a way that matches the results of traditional biblical scholars. I love applications of computer science to the humanities, and I think there's a lot of ground for cross-fertilization.

The state of NoSQL in 2012 – An insightful look at the past and future of the new wave of database systems, from someone who's been in the trenches working with them for years.

Auto scaling in the Amazon cloud – Describes how Netflix keeps up with demand by automatically creating and destroying instances based on usage. The CPU utilization graphs show how effective they are at managing the process, but they also hint at the work that's required to build AMIs for all services that can be spun up automatically.

Google Plus Scraper – There's no API yet to most of the interesting bits of Google+'s content, but a lot of it is available publicly to web crawlers, and the information is delivered as a giant JSON array embedded within the page, so it's surprisingly easy to decode. The biggest pain is the space-saving variant of JSON that Google use, which leaves out explicit null values in arrays, so you get ['a',,'b',,,'c'] instead of ['a',null,'b',null,null,'c']
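To make the decoding step concrete, here's a minimal sketch of one way to turn that space-saving variant back into standard JSON before parsing it. It isn't taken from the scraper itself: the function name fill_elided_nulls is made up for illustration, it assumes strings are double-quoted as in regular JSON, and treating a trailing comma before a closing bracket as one more missing null is my own guess about the format.

```python
import json


def fill_elided_nulls(text: str) -> str:
    """Insert explicit nulls where an array element has been left out.

    Hypothetical helper: walks the text once, tracking whether we are
    inside a double-quoted string so commas in strings are left alone.
    """
    out = []
    in_string = False
    escaped = False
    prev = ""  # last non-whitespace character seen outside a string
    for ch in text:
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
                prev = '"'
            continue
        if ch == '"':
            in_string = True
            out.append(ch)
            continue
        # A comma straight after "[" or "," means a value was skipped.
        if ch == "," and prev in ("[", ","):
            out.append("null")
        # Assumption: a comma straight before "]" also hides a null.
        elif ch == "]" and prev == ",":
            out.append("null")
        out.append(ch)
        if not ch.isspace():
            prev = ch
    return "".join(out)


if __name__ == "__main__":
    raw = '["a",,"b",,,"c"]'
    print(json.loads(fill_elided_nulls(raw)))
```

Once the gaps are filled in, the result can go straight into any ordinary JSON parser.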

Give me your Tumblr URL and I’ll give you a globe

Globepreview

One of the toughest challenges with Jetpac is persuading people to sign up with Facebook. It has a lot to offer if you make it over that hurdle, but some of the most thoughtful people I know never will, for reasons I respect and understand.

The nice thing about using unstructured text as our location information is that we can go anywhere there are good photo captions. We started on Facebook just because there's so much data available (the average user has had over 200,000 photos shared with them by their friends) but now I'm finally getting a chance to address other sources. First up is Tumblr, so if you have your own or are a fan of one that mentions locations in any of the posts, you should be able to get a quick visualization of it, no login required, by going to:

https://www.jetpac.com/globe/create

Here's an example of what you'll get:

http://blog.jetpac.com/post/16028230197/everything-everywhere-travel-photography

We take all your photos, and build an HTML5/WebGL globe to help you explore them by place. Type in the URL, get a globe, it's as simple as that!

Just for fun I pointed the service at The Economist's Tumblr, to see how it coped with posts that definitely weren't travel photos. It didn't turn out half bad!

https://www.jetpac.com/globe/theeconomist.tumblr.com

Five short links

Pentagonallight
Photo by Phillip Chapman-Bell

Zipscribble map of Italy – Patterns leap out of post code data when you connect adjacent codes with lines, and color them according to the most significant digits. Shows how useful pictures can be when we need to make sense of complex data. 
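The core idea is simple enough to sketch. This is not the code behind the linked map, just a minimal illustration under my own assumptions: each post code comes with a latitude and longitude, "adjacent" means next to each other once the codes are sorted, and the function name and sample coordinates are made up (the coordinates are rough, illustrative values).

```python
from typing import List, Tuple

# (code, latitude, longitude); codes are strings so leading zeros survive
PostCode = Tuple[str, float, float]
# (start point, end point, colour group)
Segment = Tuple[Tuple[float, float], Tuple[float, float], str]


def zipscribble_segments(codes: List[PostCode], digits: int = 1) -> List[Segment]:
    """Join consecutive post codes with line segments.

    Each segment is tagged with the leading `digits` characters of the
    code it starts from, which a plotting step could map to a colour.
    """
    ordered = sorted(codes, key=lambda c: c[0])
    segments = []
    for (code_a, lat_a, lon_a), (_, lat_b, lon_b) in zip(ordered, ordered[1:]):
        segments.append(((lat_a, lon_a), (lat_b, lon_b), code_a[:digits]))
    return segments


if __name__ == "__main__":
    sample = [
        ("20100", 45.46, 9.19),
        ("00100", 41.90, 12.50),
        ("10100", 45.07, 7.69),
    ]
    for start, end, group in zipscribble_segments(sample):
        print(group, start, "->", end)
```

Feeding the segments to any line-drawing library, with one colour per group, should be enough to see whether neighbouring codes cluster geographically.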

The Gentleman Hacker's 1903 lulz – Innovators never enjoy pesky outsiders pointing out flaws with their technology. What's interesting is that I couldn't discover any practical exploits against early radio signals, despite how obvious the flaws were in the wake of these demonstrations. I did discover there's a world of Morse code software I never imagined existed though!

ARM instruction guide – Back in 1990, I learnt ARM assembler as my second language, after Basic, and though I haven't used it since then, I enjoyed this companion to a workshop on hacking the processor's security model because it's actually a concise, useful guide to its important features.

A revolution in mathematics? – An approachable look at the underappreciated changes in maths at the end of the 19th century. The short version is that the process of creating proofs became highly formalized, which sounds dry as dust but actually opened up a world of new possibilities. It's worth quoting at length:

"Well-optimized modern definitions have unexpected advantages. They give access to material that is not (as far as we know) reflected in the physical world. A really “good” definition often has logical consequences that are unanticipated or counterintuitive. A great deal of modern mathematics is built on these unexpected bonuses, but they would have been rejected in the old, more scientific approach. Finally, modern definitions are more accessible to new users. Intuitions can be developed by working directly with definitions, and this is faster and more reliable than trying to contrive a link to physical experience…rank-and-file mathematicians can use the new methods confidently and effectively, while success with older methods was mostly limited to the elite"

Stepping away from "common sense" and relying on the logical outcomes of an abstract system that doesn't provide intuitive reasons sounds a lot like where we're headed in the data world. The question "Why does that work?" will often come up when you're making choices based on A/B testing, and often the honest answer is "I don't know, but it does!".

Spark – A framework implementing a higher-level approach to writing distributed algorithms, with a more readable statement of the problem than standard MapReduce produces.

Five short links

Fiveaces
Photo by RHiNO NEAL

Your ideal performance/consistency tradeoff – It's unclear what the right number of nodes and level of redundancy for a Cassandra cluster are for any particular performance requirements, so most of us experiment until we have something that vaguely seems to work. Thanks to the folks at Berkeley, there's now a better way to figure it out via an interactive tool. Interestingly, they ended up using a Monte Carlo simulation rather than a formula, which shows how complex the problem is.

Why is finance so complex? – One of the most interesting articles I've read in a long time. It posits that finance is effectively a benign con trick, and relies on a lack of transparency to encourage people to take risks they wouldn't if they fully understood what they were getting into. The idea is that it's a collective action problem that only works if everyone jumps on board, and so the opacity helps persuade people to do that and achieve a better overall result than if they made an individually-rational choice. The model seems like it might explain other odd features of our social world.

Run a MapReduce job across five billion web pages for 25 cents – I have a massive data-crush on Common Crawl, and this is a fantastic practical demonstration of why I'm so excited. 

Clickjacking – The web's security model is more like Windows' than Unix's. It's been grafted onto an underlying system that was designed without any security foundations, and there are lots of gaps where different components interact in exploitable ways. This page explains how there's no reliable way to prevent malicious sites from hosting your site as an invisible frame and tricking users into taking actions by unknowingly clicking on it. Luckily we're in a world where software can be frequently updated, unlike '90s desktop software, so at least if this becomes widespread we might quickly see some fixes.

Muse – A noble experiment in mining useful data from your own email archives. It's still a bit too buggy to really get a feel for how interesting the results could be though.

Five short links

Fiveleaves
Photo by Let Ideas Compete

Rust – A trap to ensnare unwary web crawlers, by Tim McNamara. It creates pathological patterns of input data that will slow down naive robots by the sheer volume of processing required, whilst using minimal resources on the server thanks to elegant event-driven code. It's effectively a reversed denial-of-service attack, designed to overwhelm malicious or thoughtless crawlers of your site. Well-written and robust robot scripts will cope with malformed input of course, but the odds are that any crawler that's bringing your site to its knees with an unreasonable number of requests won't be a masterpiece of engineering!

Seeing like a database – Written by another fan of Seeing like a State, this has a great quote from Jay Owens at the end, noting "the asymmetry of personal data, open for the 99% & deep analytics for the 1%".

HttpBin – Echoes back information about HTTP requests you send it, including things like headers, data, and forced result codes. I'm just thankful it introduced me to the 418 (I'm a teapot) status code; I can't believe I've been writing web code for so long without checking for that possibility.

Drone landscapes, intelligent geotextiles, geographic countermeasures – I'd never realized how deeply adding processing to landscape structures could change our world. This is a compelling exploration of some of the possibilities, and I'm especially struck by the possibilities for a robot-readable world.

An end to bad heir days – The copyright on James Joyce's work finally expired! The enforcement process became a poster child for how the combination of insanely-long copyright terms and ornery heirs can derail the enjoyment and exploration of an artist's work. Thankfully scholars are now free to quote Joyce's work and letters, and I've just downloaded A Portrait of the Artist as a Young Man to re-read in celebration.

Five short links

Tally

Photo by Richard Paterson

The Ugliest Map in the World – Such an eyewatering color scheme, you'd think I'd designed it. The swimming-pool bottom caustics for the ocean areas really clinch it.

The Life of a Typeahead Query – An exploration of how hard it is to make an easy interface. Great to see a practical example of how someone architected a real-world system with messy requirements.

Ending the Infographic Plague – Visualizations are an excellent hack for getting publicity, which inevitably leads to pollution by bad actors.

The Mess that is NPM – I really, really want to use Node.js, but the library ecosystem isn't quite mature enough for me to use in production. There's a lot of non-technical community hacking that you need to do to create a strong set of modules, and responsible maintainership isn't something I'm perfect at with all my projects, so I know how hard it is.

Brain Grain – Tasty little HTML5 visualization of world-wide migration. It's pretty simple, but has some innovations I've not seen elsewhere and uses animation effectively.

And last but not least, Jetpac is now rounding out a fundraising round, so if you're on AngelList any comments or recommendations would be very welcome.