Five short links


Photo by Leo Reynolds

Fingerprinting Cameras through Sensor Noise – Arvind shows just how unique the noise pattern that your camera embeds in every photo really is. It's as identifiable as a thumbprint, and with high-resolution pictures associated with your real name available from many sources, it's theoretically possible to match them against photos you've tried to take anonymously. The one saving grace is that compression removes a lot of that unseen uniqueness, but it looks like you'd need to go down to 60 quality on JPEG to be sure.

Multiplicative Microdata Noise for Privacy – Noise isn't always bad though. I'm fascinated by the idea of adding it to prevent reverse-engineering of sensitive information from data releases. I bet it's tough to get right though. I'd love to set Arvind or some of the Kaggle team loose to see how solid the protection really is.
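
As a rough illustration of the idea, here's a minimal Ruby sketch of multiplicative noise: each value is scaled by a random factor close to 1, so individual records are blurred while the overall shape of the data survives. The helper name, the uniform noise distribution and the 10% spread are all assumptions made for this example, not the scheme from the linked paper.

```ruby
# Hypothetical helper: multiply each value by a random factor drawn
# uniformly from [1 - spread, 1 + spread). A real release would need a
# carefully chosen noise distribution; uniform is only for illustration.
def add_multiplicative_noise(values, spread, rng = Random.new)
  values.map do |value|
    factor = 1.0 + (rng.rand * 2.0 - 1.0) * spread
    value * factor
  end
end

# Made-up incomes, perturbed by up to 10% each.
incomes = [21_000.0, 48_500.0, 52_000.0, 310_000.0]
noisy = add_multiplicative_noise(incomes, 0.1, Random.new(42))

incomes.zip(noisy).each do |original, perturbed|
  puts format('%.2f -> %.2f', original, perturbed)
end
```

Because the noise scales with the value, large outliers like the last income get blurred by a larger absolute amount than small values, which is part of the appeal over simply adding a fixed-size jitter.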

Underscore – The hardest thing about switching to JavaScript from other languages is its primitive support for iteration. I'm excited by the promise of this library that Tyler Gillies pointed out to me. It looks like a great solution for my grumbles, unless there's a catch I'm missing.

requirejs – He also showed me this solution for proper module loading in JS. The syntax is a bit funky, and I know we'll still want to compile and minify for production, but it's still a good tool to have in your kit.

TeleHash – An intriguingly generic way of connecting machines together in a peer-to-peer way using distributed hash tables. Maybe it's just my exposure to them as I do more Cassandra work, but DHTs seem to be popping up all over the place.


Five short links

Photo by Ronaldo F Cabuhat

CryptDB – A Postgres modification that stores data in a securely encrypted way but still allows efficient queries. I love the idea and can't wait to see the software.

Perceived density – We very rarely go back and look at what the numbers we're using actually mean. This author noticed that the standard population density figures for cities didn't match up with anyone's real experiences with those places. He then reasoned about why that was, created a different way of measuring and built figures that seem a lot closer to reality. New York really is denser than Los Angeles!
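
To make the gap between the two kinds of figure concrete, here's a minimal Ruby sketch. It assumes the alternative measure is a population-weighted density, where each neighborhood's density counts in proportion to how many people live there; that is an assumption for illustration, as are the tract numbers, and the linked post may define its measure differently.

```ruby
# Hypothetical data structure: one entry per neighborhood (tract).
Tract = Struct.new(:population, :area_sq_miles)

# Standard figure: total people divided by total land area.
def standard_density(tracts)
  tracts.sum(&:population) / tracts.sum(&:area_sq_miles).to_f
end

# Population-weighted figure: the density the average resident lives at.
def weighted_density(tracts)
  total_population = tracts.sum(&:population).to_f
  tracts.sum do |tract|
    local_density = tract.population / tract.area_sq_miles.to_f
    local_density * (tract.population / total_population)
  end
end

# Made-up city: a packed core plus a large, nearly empty fringe.
city = [Tract.new(9_000, 1.0), Tract.new(1_000, 9.0)]

puts standard_density(city)          # people per square mile overall
puts weighted_density(city).round(1) # density as experienced by residents
```

In this toy city the standard figure is 1,000 people per square mile, while the weighted figure is a little over 8,000, because nine out of ten residents live in the packed core.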

Reverse Social Engineering (pdf) – If attackers can persuade their targets to approach them, the attack is more likely to succeed. The friend recommendation systems of social networks like Facebook can be gamed to manipulate those targets into friending attackers. Via Prasanna.

If This Then That – Wonderfully simple scripting system for joining together web services like Flickr, Facebook and Twitter.

Hector – A battle-tested Java interface to Cassandra with a strong community behind it.

Thoughts on a Cassandra issue?

I'm running into consistent problems when storing values larger than 15MB into Cassandra, and I was hoping for some help on tracking down what's going wrong. I've emailed the Cassandra list, but I thought my long-suffering readers might like a crack too. I promise I'll get another five short links post up there if I can crack this bug!

From the FAQ it seems like what I'm trying to do is possible, so I assume I'm messing something up with my configuration. Below is a minimal set of code to reproduce the issue, which I've run on the DataStax 0.8.1 AMI I'm using in production (ami-9996c4dc).


# To set up the test data structure on Cassandra:
connect localhost/9160;
create keyspace TestKeyspace with 
  placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy' and 
  strategy_options = [{replication_factor:3}];
use TestKeyspace;
create column family TestFamily with
  comparator = UTF8Type and
  column_metadata =
    [{column_name: test_column, validation_class: UTF8Type}];

# From bash on the same machine, with Ruby and the Cassandra gem installed:
require 'rubygems'
require 'cassandra/0.8'
client = Cassandra.new('TestKeyspace', 'localhost:9160', :retries => 5, :connect_timeout => 5, :timeout => 10)

# With data this size, the call works
column_value = 'a' * (14*1024*1024)
row_value = { 'column_name' => column_value }
client.insert(:TestFamily, 'SomeKey', row_value)
# With data this size, the call fails with the exception below
column_value = 'a' * (15*1024*1024)
row_value = { 'column_name' => column_value }
client.insert(:TestFamily, 'SomeKey', row_value)

# Results:
The first call, with a 14MB chunk of data, succeeds, but the second one fails with this exception:

CassandraThrift::Cassandra::Client::TransportException: CassandraThrift::Cassandra::Client::TransportException
from /usr/lib/ruby/gems/1.8/gems/thrift-0.7.0/lib/thrift/transport/socket.rb:53:in `open'
from /usr/lib/ruby/gems/1.8/gems/thrift-0.7.0/lib/thrift/transport/framed_transport.rb:37:in `open'
from /usr/lib/ruby/gems/1.8/gems/thrift_client-0.7.1/lib/thrift_client/connection/socket.rb:11:in `connect!'
from /usr/lib/ruby/gems/1.8/gems/thrift_client-0.7.1/lib/thrift_client/abstract_thrift_client.rb:105:in `connect!'
from /usr/lib/ruby/gems/1.8/gems/thrift_client-0.7.1/lib/thrift_client/abstract_thrift_client.rb:144:in `handled_proxy'
from /usr/lib/ruby/gems/1.8/gems/thrift_client-0.7.1/lib/thrift_client/abstract_thrift_client.rb:60:in `batch_mutate'
from /usr/lib/ruby/gems/1.8/gems/cassandra-0.12.1/lib/cassandra/protocol.rb:7:in `_mutate'
from /usr/lib/ruby/gems/1.8/gems/cassandra-0.12.1/lib/cassandra/cassandra.rb:459:in `insert'
from (irb):6
from :0

Any suggestions on how to dig deeper? I'll be reaching out to the Cassandra gem folks too, of course.

[Update – Thanks to Matthew Russell and Daniel Lundin for pointing me towards the solution. The cassandra.yaml file defines a maximum frame size for Thrift API communication, which defaults to 15MB. Upping that and the max message length solved it for me.]
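
For anyone hitting the same wall, the two settings in question look roughly like this in a 0.8-era cassandra.yaml. The 32 and 33 values are illustrative assumptions rather than the ones actually used for the fix; pick a frame size that fits your largest value and keep the message length above it.

```yaml
# Maximum Thrift frame size. The shipped default is 15, which matches
# the point where the 15MB insert above starts failing.
thrift_framed_transport_size_in_mb: 32   # illustrative value

# Maximum length of a whole Thrift message, including all fields and
# Thrift overhead. Keep this larger than the frame size.
thrift_max_message_length_in_mb: 33      # illustrative value
```

The node needs a restart for changes to this file to take effect.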


Five signs you’ve become American

Photo by Beverly and Pack

Ten years ago, I stepped off a plane at LAX and began a new life in America. I was definitely English at that point. A few months back, an NPR interviewer confronted me with a question – "Should we describe you as a British or American researcher?". I'd never had to answer that before, so it took me a moment to think, but only one answer felt right – "I'm American". How did I know?

Being Good

The first change came in the way I talk. I'm terrible at accents, so I still sound funny, but my phrasing changed radically. It was a natural process: people only expect to hear certain responses, and so I ended up being trained to avoid puzzled looks. I knew I'd crossed a threshold when my parents started mocking me. "How are you Peter?" – "I'm good" – "We know you are! Teheheheehehe". The correct British response is "I'm well", "Ok", or a grunt/mumble.

Enjoying Fake Pubs

I still miss mum's roast dinners, Indian restaurants and chip butties, but the stereotype of terrible British food is generally true. We're known for our drinking though, so most cities have a few theme pubs. There's apparently a legal requirement that they feature a traditional phone booth outside (though usually without the improvised restroom quality that would really mark them as authentic). When I first arrived I hated these places. I had a horror of being an 'ex-pat' who sat around watching soccer and reading The Sun, never truly engaging with the culture of the place I lived. They always seemed like shabby emotional props, over the top stage sets with enormous portraits of Churchill and the Queen, so many clichés it felt like walking into an Austin Powers movie.

After a few years though, something happened. The terrible taste no longer seemed important, I loosened up and began to look forward to my occasional visits to American pubs. I had developed a new sense – the ability to detect and appreciate quaintness. Growing up in the English countryside I was surrounded by so much of it that I had no awareness, in the same way that I guess fish don't think about the water they swim in. My normal was now strip malls, sushi and bars, so anything British felt foreign and interesting, and the authenticity no longer felt so important.

A Fascination with the Old Country

As that change progressed, I realized my reading habits were skewing towards British history and literature. I'd always been a voracious reader with traditional classics in the mix, but I felt a need to fill in the gaps in my knowledge of the British past, both non-fiction and culture. The typical American obsession with ancestry always looked ridiculous from the other side of the Atlantic, but now I understood. It's not a rejection of the US national identity, it's an integral part of it. Because there are people from such a wide range of different backgrounds having to live together, history and ancestry provide a safe way to talk about our differences. It's non-threatening because the stories all end up with us here in America. Talking so much about the variations implicitly acknowledges how much we have in common.

Atrophied Irony and Sarcasm

I inherited an ability to come up with a cutting remark for any occasion from my mum, and I still find her a riot. It served me well in the UK, but I rapidly discovered that it generated confused and concerned looks over here. It dawned on me that sarcasm relies on a deep shared understanding between the speaker and listeners, or it won't be clear that you actually meant the opposite of what you just said. There just isn't enough agreement about what's normal here to deploy sarcasm for anything but the simplest situations. You can find Americans who will believe almost anything, and be willing to say it! I also realized I was hiding behind my negativity. I was afraid of stating what I really believed and desired explicitly for fear of rejection or mockery, so being ironic gave me plausible deniability. I'd still hope that people would agree, but was protected if they didn't.

This timidity had to go. You can't persuade people to help you get interesting things done unless you clearly show you believe in them yourself. My big secret was that I've always been painfully earnest, and moving to the US gave me the chance to come out of that closet. I get to talk about the crazy things I dream of building without immediately hearing a million reasons why they're bound to fail. I still love a sprinkling of snark, but as a little spice, not the main dish. One of the hardest things about going back to Britain is biting my tongue so I don't sound arrogant, because I'm not hedging what I'm saying enough.

The Passport

This is the last piece of the puzzle. I have my green card, and I'm just over a year away from qualifying to become a naturalized citizen. I'm going to apply as soon as I can, even though there's almost no practical difference between my status with a passport and being a permanent resident. America has given me the chance to live some amazing dreams, make wonderful friends, to create things and have experiences that would have been denied to me in the UK. It feels like home. I want to make that official.

Is the Swiss national bank investing in EC2 spot instances?

I'm a big fan of using EC2 spot instances to help reduce costs, but the pricing behavior can make managing them a pain. There's a very bimodal distribution, where the price will wobble along at around 16 cents an hour almost all the time, with infrequent but sudden spikes up to 50 cents or so, above the non-auction cost. I don't know for sure why things are so erratic, but I can think of a couple of possible causes.

It could be that there's a very non-linear distribution of maximum bids, where almost everybody is willing to pay up to the normal price for an instance, and so when capacity is reached and machines need to be shut down, the price has to shoot up before any significant number of resources are freed up.
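
That first theory can be sketched in a few lines of Ruby. This is a toy model under stated assumptions, not how Amazon actually sets the price: it assumes the spot price is the lowest price at which the number of bidders still willing to pay fits within spare capacity. The bid amounts, including the 0.34 stand-in for the normal price, are made up.

```ruby
# Hypothetical model: return the lowest candidate price at which the
# bidders still willing to pay fit into the spare capacity. Returns nil
# if even the highest bidders don't all fit.
def clearing_price(bids, capacity, floor_price)
  candidates = ([floor_price] + bids).uniq.sort
  candidates.find do |price|
    bids.count { |bid| bid >= price } <= capacity
  end
end

# Made-up bids: almost everybody caps at the normal price (0.34 here),
# and a handful are willing to go up to 0.50.
bids = Array.new(90, 0.34) + Array.new(5, 0.50)

puts clearing_price(bids, 100, 0.16) # room for everyone: price sits at the floor
puts clearing_price(bids, 94, 0.16)  # one machine short: price jumps to the top bids
```

With room for all 95 bidders the price stays at the 16 cent floor, but losing a single machine of capacity pushes it straight to 50 cents, because nobody drops out until the price passes the level where almost all the bids are clustered.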

There could also be some very heavy hitters who occasionally demand very large numbers of machines, causing the price to spike. What was interesting this evening was that I was still able to manually start up a couple of instances at the regular price, which was below the auction price. I also find it surprising that the two data centers seem to be so highly correlated. Unless there are sophisticated users who are quickly switching their requests between availability zones, I'd expect a lot more independence as the spare capacity in each varies.

Chart via Felix Salmon

Of course, there could be something deeper at work. Werner Vogels was born in Europe, and the Swiss National Bank is trying to find somewhere to sink all its money to prevent the currency from appreciating. Coincidence? Mind you, I talk a bit funny too, so it's hard to trust anything I say. Unless I'm narrating a documentary about meerkats, in which case the accent helps.

Photo by Keven Law

Five short links


Photo by Stew Dean

The Good Judgment Project – Almost all of us are terrible at making predictions, even professional pundits, and worse, we're unaware that we're so bad at it. That makes me excited to see this academic project to objectively analyze which techniques and people make the most accurate forecasts. Even just a minor improvement in our prediction skills could make a world of difference in the quality of our decisions, so I'm looking forward to seeing what results come out of the study.

Thesis on phone geotag analysis – A strong overview from a UK undergraduate, covering a lot of different ways that location can be determined from iPhones and other smart phones.

Identifiability of de-identified data – Summary of a researcher's work demonstrating how flawed most data anonymization is, even for sensitive medical information.

Shape optimization of gridded surfaces – I recently had a good conversation with Avik Das, the author of this work, and it made me nostalgic for my time in computer graphics. This video shows some of the research he's doing to 'relax' complex geometric shapes into a more natural arrangement, and is beautiful in a very geeky way.

Bulk loading in Cassandra – Yet more useful material from DataStax, on a problem I've been looking into a lot recently. Since I've had no luck getting Mumakil running, I'll see how this approach works.