Five signs you’ve become American

Photo by Beverly and Pack

Ten years ago, I stepped off a plane at LAX and began a new life in America. I was definitely English at that point. A few months back, an NPR interviewer confronted me with a question – "Should we describe you as a British or American researcher?" I'd never had to answer that before, so it took me a moment to think, but only one answer felt right – "I'm American". How did I know?

Being Good

The first change came in the way I talk. I'm terrible at accents, so I still sound funny, but my phrasing changed radically. It was a natural process: people only expect to hear certain responses, so I ended up being trained to avoid puzzled looks. I knew I'd crossed a threshold when my parents started mocking me. "How are you, Peter?" – "I'm good" – "We know you are! Teheheheehehe". The correct British response is "I'm well", "Ok", or a grunt/mumble.

Enjoying Fake Pubs

I still miss mum's roast dinners, Indian restaurants and chip butties, but the stereotype of terrible British food is generally true. We're known for our drinking though, so most cities have a few theme pubs. There's apparently a legal requirement that they feature a traditional phone booth outside (though usually without the improvised restroom quality that would really mark them as authentic). When I first arrived I hated these places. I had a horror of being an 'ex-pat' who sat around watching soccer and reading The Sun, never truly engaging with the culture of the place I lived. They always seemed like shabby emotional props, over the top stage sets with enormous portraits of Churchill and the Queen, so many clichés it felt like walking into an Austin Powers movie.

After a few years though, something happened. The terrible taste no longer seemed important. I loosened up and began to look forward to my occasional visits to American pubs. I had developed a new sense – the ability to detect and appreciate quaintness. Growing up in the English countryside I was surrounded by so much of it that I had no awareness, in the same way that I guess fish don't think about the water they swim in. My normal was now strip malls, sushi and bars, so anything British felt foreign and interesting, and the authenticity no longer felt so important.

A Fascination with the Old Country

As that change progressed, I realized my reading habits were skewing towards British history and literature. I'd always been a voracious reader with traditional classics in the mix, but I felt a need to fill in the gaps in my knowledge of the British past, both non-fiction and culture. The typical American obsession with ancestry always looked ridiculous from the other side of the Atlantic, but now I understood. It's not a rejection of the US national identity; it's an integral part of it. Because there are people from such a wide range of different backgrounds having to live together, history and ancestry provide a safe way to talk about our differences. It's non-threatening because the stories all end up with us here in America. Talking so much about the variations implicitly acknowledges how much we have in common.

Atrophied Irony and Sarcasm

I inherited an ability to come up with a cutting remark for any occasion from my mum, and I still find her a riot. It served me well in the UK, but I rapidly discovered that it generated confused and concerned looks over here. It dawned on me that sarcasm relies on a deep shared understanding between the speaker and listeners, or it won't be clear that you actually meant the opposite of what you just said. There just isn't enough agreement about what's normal here to deploy sarcasm for anything but the simplest situations. You can find Americans who will believe almost anything, and be willing to say it! I also realized I was hiding behind my negativity. I was afraid of stating what I really believed and desired explicitly for fear of rejection or mockery, so being ironic gave me plausible deniability. I'd still hope that people would agree, but was protected if they didn't.

This timidity had to go. You can't persuade people to help you get interesting things done unless you clearly show you believe in them yourself. My big secret was that I've always been painfully earnest, and moving to the US gave me the chance to come out of that closet. I get to talk about the crazy things I dream of building without immediately hearing a million reasons why they're bound to fail. I still love a sprinkling of snark, but as a little spice, not the main dish. One of the hardest things about going back to Britain is biting my tongue so I don't sound arrogant, because I'm not hedging what I'm saying enough.

The Passport

This is the last piece of the puzzle. I have my green card, and I'm just over a year away from qualifying to become a naturalized citizen. I'm going to apply as soon as I can, even though there's almost no practical difference between my status with a passport and being a permanent resident. America has given me the chance to live some amazing dreams, make wonderful friends, create things and have experiences that would have been denied to me in the UK. It feels like home. I want to make that official.

Is the Swiss national bank investing in EC2 spot instances?

I'm a big fan of using EC2 spot instances to help reduce costs, but the pricing behavior can make managing them a pain. There's a very bimodal distribution, where the price will wobble along at around 16 cents an hour almost all the time, with infrequent but sudden spikes up to 50 cents or so, above the non-auction cost. I don't know for sure why things are so erratic, but I can think of a couple of possible causes.

It could be that there's a very non-linear distribution of maximum bids, where almost everybody is willing to pay up to the normal price for an instance, and so when capacity is reached and machines need to be shut down, the price has to shoot up before any significant number of resources are freed up.
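A toy sketch of that first hypothesis, in Python. It assumes a simple uniform-price model in which the price climbs through the distinct bid levels until the bidders still willing to pay fit within the available capacity. That model is an illustration only, not Amazon's documented mechanism, and the bid figures (including a 34 cent regular price) are made up.

```python
def clearing_price(max_bids_cents, capacity, floor_cents):
    """Lowest price at which the bidders still willing to pay fit in capacity.

    Toy uniform-price model (an assumption, not Amazon's actual algorithm):
    everyone whose maximum bid is at or above the price keeps running.
    """
    # Candidate prices: the floor plus every distinct bid level above it.
    candidates = sorted(
        {floor_cents} | {bid for bid in max_bids_cents if bid > floor_cents}
    )
    for price in candidates:
        demand = sum(1 for bid in max_bids_cents if bid >= price)
        if demand <= capacity:
            return price
    raise ValueError("capacity is smaller than the number of top bidders")


# Hypothetical maximum bids in cents per hour: most people cap at the
# (assumed) regular price of 34 cents, a few bid higher or lower.
bids = [50] * 5 + [34] * 90 + [20] * 5

for capacity in (100, 95, 94):
    print(capacity, clearing_price(bids, capacity, floor_cents=16))
```

With these made-up numbers, losing a single machine's worth of capacity (95 down to 94) pushes the price from 34 to 50 cents and evicts 90 instances at once, because nothing is freed until the price clears the level where almost everyone has bid.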

There could also be some very heavy hitters who occasionally demand very large numbers of machines, causing the price to spike. What was interesting this evening was that I was still able to manually start up a couple of instances at the regular price, which was below the auction price. I also find it surprising that the two data centers seem to be so highly correlated. Unless there are sophisticated users who are quickly switching their requests between availability zones, I'd expect a lot more independence as the spare capacity in each varies.

Swiss franc chart via Felix Salmon

Of course, there could be something deeper at work. Werner Vogels was born in Europe, and the Swiss National Bank is trying to find somewhere to sink all its money to prevent the currency from appreciating. Coincidence? Mind you, I talk a bit funny too, so it's hard to trust anything I say. Unless I'm narrating a documentary about meerkats, in which case the accent helps.

Photo by Keven Law

Five short links

Photo by Stew Dean

The Good Judgment Project - Almost all of us are terrible at making predictions, even professional pundits, and worse, we're unaware that we're so bad at it. That makes me excited to see this academic project to objectively analyze the techniques and people who make the most accurate forecasts. Even just a minor improvement in our prediction skills could make a world of difference in the quality of our decisions, so I'm looking forward to seeing what results come out of the study.

Thesis on phone geotag analysis – A strong overview from a UK undergraduate, covering a lot of different ways that location can be determined from iPhones and other smart phones.

Identifiability of de-identified data – Summary of a researcher whose work demonstrates how flawed most data anonymization is, even for sensitive medical information.

Shape optimization of gridded surfaces – I recently had a good conversation with Avik Das, the author of this work, and it made me nostalgic for my time in computer graphics. This video shows some of the research he's doing to 'relax' complex geometric shapes into a more natural arrangement, and is beautiful in a very geeky way.

Bulk loading in Cassandra – Yet more useful material from DataStax, on a problem I've been looking into a lot recently. Since I've had no luck getting Mumakil running, I'll see how this approach works.

Using encrypted DMGs to store sensitive data on OS X

It has been years since I used a desktop machine for development, but working on a laptop does make physical security harder. If someone steals your machine, how much do they have access to? If you're on a Mac, here's a tip Apple taught me for securing your data.

You're probably familiar with .DMG files from software downloads, but not many people know that it's easy to create your own, and they can be writable and encrypted with a password. They work a lot like external drives, but are just stored as a file on your main machine. The advantage is that you have to enter a password before you mount them, otherwise they're just meaningless random data, even if someone has taken your machine and changed the root password. They're not a magic bullet, but they're a useful general-purpose tool to use as part of your security strategy. Apple required us to keep any source code we had on our laptops in them, for example. Here's how you build your own:

– Open up the 'Disk Utility' application, under Applications/Utilities.

– Select 'New Image' from the top toolbar.

– Choose the name and size you want (the size is hard to change afterwards), and under encryption pick 256-bit AES. Everything else you can leave at the defaults.

– On the next screen pick a strong password, and very importantly, uncheck 'Remember password in my keychain'! We want to ensure that anyone who wants access to the data has to enter our original password, not get access to it via a reset user password, so we don't want it stored in the keychain.
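For anyone who would rather script the steps above, here is a rough Python sketch that drives OS X's `hdiutil` command-line tool instead of Disk Utility. It is an alternative to the GUI route described in this post, not something the post itself uses. The function names, the image path and the volume name are hypothetical, and it assumes `hdiutil` is on the path.

```python
import subprocess


def hdiutil_create_args(image_path, size, volume_name):
    """Build the hdiutil arguments for a blank, encrypted, writable image."""
    return [
        "hdiutil", "create",
        "-size", size,             # e.g. "100m"; hard to change afterwards
        "-encryption", "AES-256",  # same strength as the 256-bit AES option
        "-stdinpass",              # read the passphrase from stdin, no prompt
        "-fs", "HFS+",
        "-volname", volume_name,
        image_path,
    ]


def create_encrypted_dmg(image_path, size, volume_name, password):
    """Run hdiutil (OS X only), passing a null-terminated passphrase on stdin."""
    subprocess.run(
        hdiutil_create_args(image_path, size, volume_name),
        input=password.encode("utf-8") + b"\0",
        check=True,
    )
```

A call such as `create_encrypted_dmg("ssh.dmg", "100m", "ssh", getpass.getpass())` would then create the image. Reading the password with `getpass` rather than hard-coding it keeps it out of your scripts and shell history.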

You'll now have a DMG file on your drive, and it will be automatically mounted when you create it. To test it's working, go to the Finder and hit the eject icon next to where it appears in the sidebar. After that's complete, double-click on the DMG file and you should be prompted to enter your password. Again, make sure you don't check the remember password option. You should see it appear in the sidebar again.

The volume will remain mounted for as long as you're logged in, so you need to make sure you have a password set on your screensaver. With that in place, an attacker will need to reboot your machine to reset account passwords, and so would need to re-enter the password to get access to the data on the disk image. 

I use this a lot for things like SSH credentials, so I'll usually create a symbolic link by running something like:

ln -s /Volumes/ssh/.ssh /Users/petewarden/.ssh

There are alternatives of course, such as using FileVault, TrueCrypt or an encrypted USB key, but I've found this a simple and straightforward way to help secure your data.

Five short links

Photo by Brian Fuller

Visualizing Jane Austen – How do the names of key characters recur in Austen's novels? Matthew Hurst has put together some simple but interesting visualizations.

Elasticache – You know how I was just asking for a no-hassle, pay-as-you-go, true database-as-a-service? It isn't persistent, but Amazon's new memcached on-demand is a big step forward. It's a shame you still have to worry about nodes, and I hope it doesn't turn out to have hidden flaws (like Amazon's SimpleDB's reliance on manual sharding) but I'm excited to try it out.

Mumakil – So, I can't actually get this to run, but in theory this should be a great way of doing bulk loads and dumps from Cassandra using Hadoop. I'll also be digging into Brisk over the next few weeks, but I like the idea of something like Mumakil that's laser-focused on data transfer, as a complement to DataStax's more general tools.

Junar – An Argentinean data-marketplace startup. As a data consumer, it's great to see a thousand flowers blossom in this area, and it will be interesting to see how their offerings start to specialize.

Map Tile URL formats – A machine-readable collection describing the format of many different public web services that offer map image tiles (though keep in mind that most of them have lots of conditions attached).

Five short links

Photo by Susanna S.

Data Patterns – A pithy, useful and opinionated (in a good way) collection of advice and techniques for dealing with common data problems, from parsing HTML and threading scrapers to the joy of CSV for data storage. It's early days and there's lots more to be filled out, but what's there is great.

The Guild of Silicon Valley – This article makes me want to grow a chin beard. One funny thing about the 'new wave' of data technologies like Hadoop, Lucene and Cassandra is that they're written in Java, a language most startup web developers avoid like the plague. The painful thing about Java and C++ is that they force you to think hard up front about what you're building before you dive in. The insight of agile programming is that for smaller projects that's a waste, but these show you still need it for industrial-grade frameworks. Or maybe it's just that Doug Cutting's a force of nature and it happens to be his favorite language, since he's responsible for two of the three projects above?

WeoGeo – The interface is mind-boggling, but if you persevere, there's a rich set of free and commercial geographic data sets available. I discovered a compendium of cell tower locations from the FCC I was unaware of, amongst other goodies.

Scaling Up Machine Learning – Solid advice from people who've obviously been fighting in the trenches.

Xeround – I'm tired of spending my time dealing with database housekeeping for uninteresting transactional data problems, so I love the idea of a relational database that just works, a turnkey service that I don't have to set up but that can still scale. I haven't used it or similar services like ScaleDB, so I'm sure there are caveats, but it's a problem that needs solving. Today it feels like I have to build my own power plant just to get electricity. I'd much rather pay somebody else to deal with a lot of the solved database issues so I can focus on the more interesting problems.

Porting Flash/Flex 3’s Matrix, Point and Rectangle classes to Javascript

Photo by Free Wildebeest

I started off writing the OpenHeatMap renderer for Flash using Flex 3, and then ported the code to vanilla Javascript to support HTML5. There were many things that felt poorly designed in Flex, but the 2D geometry support was a pleasure to use. To minimize the differences between code for the two renderers I ended up rewriting the bulk of the Matrix, Point and Rectangle classes in Javascript. Today I needed to reuse some of my OpenHeatMap functions in another project, so it seemed like a good chance to split off the classes and relicense them as BSD.

Why should you care? You almost certainly don't, unless you're somebody who's porting a big project from Flash to Javascript. In that case you're probably sobbing in a corner, rocking back-and-forth and clutching your knees, thanks to all the other painful issues you're dealing with. If you emerge from your fugue state long enough to notice, you'll be happy though, trust me.

The code's up at https://github.com/petewarden/flxjs.

Securing Cassandra on EC2

Photo by Edward Ross

Over the last couple of months I've been creating a large-scale data processing pipeline for my new startup. I've used all of the technologies involved before, but never all together or in an environment where the processing is so user-driven. The main ingredients are a Ruby/Sinatra frontend, a Postgres database for small-scale transactional information like user accounts, a Cassandra cluster for big data, and Hadoop for processing, all hosted on EC2. I've learned lots of lessons about integration, but one of the ones I found the least guidance on was security. I'll be talking about Hadoop at some point, but here's what I discovered about Cassandra:

– Most people use it on machines that are completely inaccessible from the outside world, so security just means keeping attackers outside your firewall. Since with EC2 your machines have to be minimally-accessible from outside the data center, it isn't straightforward to implement this strategy.

– I love the DataStax material on Cassandra, but their guide to setting up on EC2 suggests that you allow port 9160 to be reached from any address. This allows anyone who discovers the address of the machine to connect and look through your data. I don't want to beat up on them; that's actually a good way to get started with minimal hassle when you're experimenting, but it's worth calling out the implications.

– There's password authentication built into Cassandra but it's not very mature. As one of the commenters on this thread says, "I am not aware of anyone using the security features of the SimpleAuthenticator anywhere in production", and my research showed a lot of fiddly things to get wrong, so I'm not ready to rely on it to protect my users' data.

So, what did I end up doing? I set up strict firewall rules using Amazon's security groups feature to block every port but 22 for ssh on my Cassandra cluster. I then added some exceptions, to allow any other machines in the 'Cassandra' security group to access port 7000 for internal cluster communications, and machines in the 'Frontend' and 'Hadoop' groups to call 9160, the external interface to the data. These machines themselves are in EC2, locked down behind their own firewall rules.
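The rule set described above can be written down as a small Python table, which makes it easy to check who can reach what. This is a plain-Python model for illustration, not AWS API calls. The group names come from the text, while the "anywhere" source for port 22 and the "Internet" caller in the usage note are assumptions.

```python
# (destination group, port) -> sources allowed to connect.
# "anywhere" stands for any address; every other source is a security group.
RULES = {
    ("Cassandra", 22): {"anywhere"},              # ssh (assumed open to all)
    ("Cassandra", 7000): {"Cassandra"},           # internal cluster traffic
    ("Cassandra", 9160): {"Frontend", "Hadoop"},  # external data interface
}


def is_allowed(source_group, dest_group, port, rules=RULES):
    """Return True if a machine in source_group may reach dest_group on port.

    Anything without an explicit rule is blocked, mirroring a
    default-deny firewall.
    """
    allowed = rules.get((dest_group, port), set())
    return "anywhere" in allowed or source_group in allowed
```

For example, `is_allowed("Frontend", "Cassandra", 9160)` is true, while `is_allowed("Internet", "Cassandra", 9160)` and `is_allowed("Frontend", "Cassandra", 7000)` are both false.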

This makes the security problem very similar to the standard Cassandra setup within an intranet, where the goal is to keep attackers outside. It means I have to use ssh tunneling or similar techniques if I want to develop on my local machine connecting to the cluster, but that's not too much of an inconvenience.

Five short links

Photo by somethingstartedcrazy

Airport security using your online profile – Most people now have multiple public profiles on different services like Facebook and LinkedIn. Unlike traditional self-supplied information, these are hard to fake because they require significant numbers of other people to implicitly supply references by friending you. You can imagine creating a large number of fake accounts all friending each other, but that structure will stick out in a social graph like a sore thumb. Then there's semi-public data like credit reports on top of that, which requires either years of preparation or cooperation from multiple private companies to fake. This means that if you can verify that a person is who they say they are, you can be very sure about whether that identity is a real or made-up one. I first heard about this as a problem that spies in foreign countries now face when building new identities, but this article indicates that airline security in the US may rely on similar data as a signal when screening travelers.

Mapbox's Wax – I only recently discovered Mapbox, but I've been blown away by the quality of their work. Wax is their Javascript library that makes it easy to use a whole bunch of different map technologies through a common interface.

Girls go geek again – It's eye-opening to see how female-friendly computing used to be, and depressing to see how much ground we've lost since the early 80's. Don't dismiss this as a hippy political-correctness problem, just think about all of the kick-ass github projects that don't exist because the authors didn't go into our field.

Open-source data journalism with BuzzData – I've been excited to see Peter Forde's vision of a socially-focused data site become a reality. It ties in with one of my big dreams, of seeing every journalistic story that references data make the raw numbers available for follow-ups and responses, just like scientific papers.

HMS Pinafore – The Pirates of Penzance came up on my iTunes shuffle a few days ago, and I decided to see if there were any Gilbert and Sullivan performances coming up. As luck would have it, Lamplighters were finishing up a run of HMS Pinafore at Mountain View, so a few of us ventured down in some trepidation (South Bay isn't normally where I head for my culture). I was amazed; it was by far the best G&S production I've ever seen. The singing of the main players and the chorus was clear, rich and powerful, the choreography was crisp but still full of life, and the orchestra was note-perfect. The acting made the show though, with Robby Stafford stealing scenes left and right as Dick Deadeye. I'm going to be following the Lamplighters' schedule closely from now on. That was one of the best shows I've seen all year and I'm looking forward to catching more.