Duboce Triangle excitement and business card inspiration

Derailedmuni

If you were wondering why your Muni ride home took a lot longer tonight, this is the explanation. The rear wheels of the N car I was on decided to turn off down Church Street, while the rest carried on straight along Duboce. There were a lot of sparks overhead as the power connector sheared off, followed by a nasty crunch, but we were going slowly and nobody was hurt. Happily it happened just outside my apartment, so I was able to hop off without too much disruption, and it sounds like service should be back to normal tomorrow.

Liamstone

I also had to share this business card I came across. All I can say is that I'll be passing along some suggestions to my co-founder and CEO Julian in the morning.

Five short links

Purpleflower

Photo by Sami Sieranoja

JSON Pointer – For one crazy moment I thought this was an attempt to squeeze C's memory management into JavaScript. It's actually a very useful effort to standardize how we describe parts of JSON structures, a bit like XPath does for XML.

Microsoft and Hadoop – Even the Beast of Redmond digs Hadoop these days. Do I need to be a code hipster and find something more obscure to evangelize now?

Pygmalion – I've been knee-deep in Pig and Cassandra internals for the last week, trying to build an approachable analytics solution for a massive, dynamic data set. It has been something of a struggle, thanks to the combination of my unfamiliarity with both Pig and Cassandra, and the scarcity of other users. I've had some fantastic help from the community though, especially from Jeremy Hanna and Brandon Williams, and I recommend checking out Jeremy's library and talks if you're also wandering into this area.

SMS Corpus – The National University of Singapore has made around 60,000 voluntarily collected text messages in English and Chinese available as a research data set. There's precious little like this available for academic researchers, so asking for contributions is an interesting solution to the privacy problem.

Bill Nguyen – I met Bill briefly at the Color offices, and he is startlingly charismatic. This profile includes some thoughtful quotes from Paul Kedrosky and Eric Ries, but the one that rang most true was the old Hollywood saying that "nobody knows anything". I'm lousy at predicting which companies will go on to success; I have my own mental anti-portfolio of fantastic startups I could have got more deeply involved in. The only way to keep my sanity is to work on products I'm proud of, and hope everything else works out.

Blue Angels 2011

Blueangels0

When my friend Bruno Bowden told me that Fleet Week was one of his favorite events in San Francisco, I raised an eyebrow. It sounded like something straight out of a 40's musical. Then he invited me and Joanne along to a party on his rooftop to see the airshow, and I was intrigued, but imagined we'd be staring at little dots flying over Sausalito through binoculars. Boy, was I wrong!

There's no way I could capture the full experience of the jets flying a few hundred feet directly over our heads, or watching them seemingly weave between the buildings, flying lower than we were. They were so hard to film because they flew over so fast! We did manage to capture a couple of the larger planes in the show, but please excuse the swearing in the second one. I just couldn't believe they could do that with a full-sized jetliner over a major city!

Big thanks to Bruno for inviting us along; it truly was one of the most unique events I've ever been to. I'm so happy they allow such an amazing show in the middle of San Francisco. Looking around, I could see the rooftops were packed, especially with kids, and I know the pilots left behind memories that will last a lifetime.

Blueangels1

How British cheese and a Turing test converted me to Google Plus

Shropshireblue

Photo by Ulterior Epicure

I'll admit it, I had written off Google's latest effort to be social. I have a tidy little theory that their focus on metrics leads them to local maximums but prevents them leaping across gaps to islands of fun. I expected Google Plus to be a flash in the pan as people signed up, poked around and never returned. That left me surprised to hear a continuing murmur of interest from people I trust like Marshall Kirkpatrick. I almost hate to publicize his secret, but he has managed to notch up a series of great stories by using the service for research.

On Thursday night I'd just finished off a links post and wanted to do my normal tweet about it, but Twitter was down. When it was still unavailable after an hour, I decided this was a sign I should check out Google Plus myself. I went to the site and put together a short post explaining that I was experimenting with the service.

Speed

My first impression was of how fast the interface was. It didn't have the lag that I thought was inevitable with web applications. It was so quick I was convinced they must have pre-populated the page with hidden content so it was instantly available. When I looked at the network activity, it turned out I was wrong: most of their content is dynamically loaded by Ajax calls. What makes the difference is how fast their response times are, often under 100ms for me. I'm definitely a snob about this sort of thing after my time at Apple, but the snappiness and some thoughtful work on the interaction model made a profound difference to my experience. The fit and finish make me want to spend time exploring the service, in contrast to Twitter's web interface, which just feels a lot clunkier and messier.

Length

Being able to write at a natural length was a surprising relief. I still stayed fairly short, within the same three-sentence rule that I try to use for emails, but I was over 140 characters. I don't think the service would work unless we'd already been taught brevity by Twitter, but being able to expand beyond those limits when I need to feels very liberating.

Audience

There were a surprising number of responses to my initial post. One of them was from DeWitt Clinton, in what I wrongly suspected was an auto-generated welcome message. I knew him through the Buzz and WebFinger developer mailing lists, so I thought he might be the Tom Anderson of Google Plus, the first friend for every new user. He convinced me he was human, so either the employees are very engaged with the service, or Google's code can now pass the Turing Test. It seems like there are quite a few regular people watching the streams too, which is confidence-inspiring this long after the launch.

Conversations about cheese

The comment thread grew into a couple of conversations, giving me a chance to reconnect with several friends. In particular Audrey Watters, Edd Dumbill and I started waxing nostalgic about British food, and the struggle to find decent cheese. Edd was inspired to create a short public post, and the whole experience convinced me that the service makes some great conversations possible. It left me wanting to return regularly, which gives me hope the service is sustainable.

I'm happy that I was proven wrong about Google Plus. I won't be abandoning Twitter, but I will be spending more time on Google's service because it offers a different experience, one that I was surprised to find myself enjoying. If you're curious, join me there too.

Five short links

Flakeyfive

Photo by Holeymoon

SourceTree – I don't often recommend commercial software, largely because my personal stack is mostly open source these days. I've fallen in love with this tool for exploring my git repositories though. GitHub's new Mac app is fantastic too, but focused on 'doing things'. I've found SourceTree a wonderful way to explore and understand my code. I just discovered they've been acquired by Atlassian, so I guess I'm not the only fan!

The Sketchbook Project – I never lose my sense of wonder at how many ways the web can be used to drive creativity. By offering to scan people's sketchbooks they've motivated a community of artists from all over the world, and given me a vast set of material to browse through when my imagination needs a jump-start.

Smart Meter surveillance – How your electricity meter can reveal what TV channel you're watching. My German's not good enough to follow the main paper, but the abstract sounds very plausible. From my perspective not something to freak out over, but a good example of all the unexpected ways we leak information about our lives. We measure more and more things to improve efficiency, but the by-product is that the same data can be used for many unintended purposes too.

The Brown Revolution – An unfortunate name, but a compelling idea for sustainable grazing. I'm normally skeptical of agricultural 'silver bullets' like this, but I know from my experience maintaining trails how effective thoughtful drainage can be. When water's compressed into a narrow stream by a gully it will cut through even packed soil like a plasma torch, but keep it spread out in a wide sheet using a shallow 'rolling dip' and you'll have a surface that can survive years of storms.

Apple insiders remember Steve Jobs – I'm very sad we've lost Steve. He always seemed more like a superhero than a mortal to me. At the Guardian's request I contributed a few thoughts about my time working at Apple, and how he was a constant presence even though I barely met him. I'll be thinking of his family.

Five short links

Defacedlincoln
Photo by Nick P

Is teaching MapReduce healthy? – The working conditions inside Hadoop are terrible, but it's rapidly becoming the default for large-scale data processing. Does that mean students should learn the MapReduce approach? It feels a lot like the debates over teaching ugly, confusing, widely-used C/C++ versus beautiful, elegant but niche functional languages, and which should come first in the curriculum. This article gave me some fantastic glimpses into the wider world of distributed frameworks and techniques, and left me itching to try Bloom.

Amazon comments on Spot price spikes – A refreshingly detailed and open response from a large company. I'm disappointed that the prices have suddenly become so volatile though, since that severely limits the places I can use them.

PhantomJS – A WebKit-based headless browser, lighter-weight than Selenium and driven by JavaScript. I hope I get a chance to use this, and that they can sever the vestigial dependency on X Windows soon. My main use case would be generating screenshots.

Teaching data to speak humanely – Looking at the Facebook timeline, and how old metaphors of interaction disappear as the interface gets closer to the content.

Pictures of the Big Bang – Gorgeous snapshots of our universe's very first moments, courtesy of a computer simulation:

Bigbangshot

How I saved $1,000 on my monthly EC2 costs

burningmoney

Photo by Paul Hohmann

If you’ve been a user of my Data Science Toolkit or OpenHeatMap sites, you may have noticed they’ve been a bit flakey recently. The back story is that I’ve started a new company (of which more soon) and I had to cut back on how much I was spending on my own servers. I was paying over $1,200 a month for all the different systems I’d set up over the last three years! This was mostly because it was quicker and easier to set up a new instance than worry about jamming it onto an existing one, and I never got around to cleaning things up. I doubt there are many people who’ve been as lazy about this as me, but if you’re looking at cutting costs, I would start by figuring out if there are servers you can merge.

By shutting down several sites that were no longer being updated or heavily used, like twitter.mailana.com and fanpageanalytics.com, I was able to cut that bill in half, but I then reached two sites that have decent traffic. I started by merging DSTK and OHM onto one large server, which mostly worked but caused some hiccups. That got me down to around $400 a month, including another small instance for some legacy Mailana labs sites.

I then decided to switch to ‘spot instances’, Amazon’s auction model for buying spare server capacity cheaply. Most of the time it’s only about 12 cents an hour, a third of the normal price. I switched, but then started to experience some serious price spikes that kept taking the server down, and required manual intervention to set everything up again. At some points, the price went to $15 an hour, so there are obviously capacity limits being hit. That’s very different from my experiences just a few months ago when prices seemed a lot more stable. At this point I’d never recommend spot instances for user-facing servers, because the downtime seems too high. They’re still a great deal for things like MapReduce backend processing.

I need 64-bit support for a lot of the frameworks DSTK relies on, so I couldn’t go down to a small instance, but I did realize that micro instances were x86_64, so my next step was to try running both sites off one of those. Rather predictably, the processing requirements and lack of memory crippled the tiny instance, so the site was extremely flakey. It would have been only $14 a month though, so I spent some time trying to fix the issues, for example by configuring swap space. My conclusion was that micros are too limited for anything but light web serving.

Today I finally bit the bullet and bought a one-year reserved m1.large instance, which cost me about $900 up front and will cost another $1,100 in running costs over the next year. I also rolled my web.mailana.com small instance into the same server, so I’m down to about $170 a month! This is around $100 less than the unreserved cost, so I’d seriously look at reserving servers as a way of managing your costs, if you can stomach the up-front deposit.
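
As a rough check on that monthly figure, here is a minimal Ruby sketch of the arithmetic. It uses only the numbers quoted in this post; the method name is made up for illustration, and it assumes the up-front fee and the running costs together cover exactly twelve months.

```ruby
# Hypothetical helper: spread a reservation's total cost evenly over its term.
def amortized_monthly(upfront_fee, running_costs, months)
  (upfront_fee + running_costs) / months
end

# Figures quoted in the post (all in dollars).
reserved_monthly = amortized_monthly(900.0, 1100.0, 12)
previous_monthly = 1200.0 # the "over $1,200 a month" starting point

puts format('Reserved instance, amortized: $%.2f a month', reserved_monthly)
puts format('Saving against the old bill:  $%.2f a month', previous_monthly - reserved_monthly)
```

The amortized figure lands just under $170, and the difference from the original bill is where the $1,000 in the title comes from.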

I’ve also added a link to my O’Reilly books to the sites, in the hope I’ll cover some of my costs:

Improve your data skills (and keep this server running!) by buying my guides:

I’m happy I’ve found a solution that should allow me to keep offering the DSTK and OpenHeatMap services without breaking my bank account. Apologies to all the users who have suffered through the transition, but things should be a lot more stable from now on.

Five short links

Phonefive

Photo by Leo Reynolds

Fingerprinting Cameras through Sensor Noise – Arvind shows how unique the noise pattern that your camera embeds in every photo you take really is. It's as identifiable as a thumb-print, and with high-resolution pictures associated with your real name available from many sources, that makes it theoretically possible to match up photos you've tried to take anonymously. The one saving grace is that compression removes a lot of that unseen uniqueness, but it looks like you'd need to go down to 60 quality on JPEG to be sure.

Multiplicative Microdata Noise for Privacy – Noise isn't always bad though. I'm fascinated by the idea of adding it to prevent reverse-engineering of sensitive information from data releases. I bet it's tough to get right though, and I'd love to set Arvind or some of the Kaggle team loose to see how solid the protection really is.

Underscore – The hardest thing about switching to JavaScript from other languages is its primitive support for iteration. I'm excited by the promise of this library that Tyler Gillies pointed out to me. It looks like a great solution for my grumbles, unless there's a catch I'm missing.

requirejs – He also showed me this solution for proper module loading in JS. The syntax is a bit funky, and I know we'll still want to compile and minify for production, but it's still a good tool to have in your kit.

TeleHash – An intriguingly generic way of connecting machines together in a peer-to-peer way using distributed hash tables. Maybe it's just my exposure to them as I do more Cassandra work, but DHTs seem to be popping up all over the place.

 

Five short links

Saturnfive
Photo by Ronaldo F Cabuhat

CryptDB – A Postgres modification that stores data in a securely encrypted way, but still allows efficient queries. I love the idea, and can't wait to see the software.

Perceived density – We very rarely go back and look at what the numbers we're using actually mean. This author noticed that the standard population density figures for cities didn't match up with anyone's real experiences with those places. He then reasoned about why that was, created a different way of measuring and built figures that seem a lot closer to reality. New York really is denser than Los Angeles!

Reverse Social Engineering (pdf) – If attackers can persuade their targets to approach them, the attack is more likely to succeed. The friend recommendation systems of social networks like Facebook can be gamed to manipulate those targets into friending attackers. Via Prasanna.

If This Then That – Wonderfully simple scripting system for joining together web services like Flickr, Facebook and Twitter.

Hector – A battle-tested Java interface to Cassandra with a strong community behind it.

Thoughts on a Cassandra issue?

I'm running into consistent problems when storing values larger than 15MB into Cassandra, and I was hoping for some help on tracking down what's going wrong. I've emailed the Cassandra list, but I thought my long-suffering readers might like a crack too. I promise I'll get another five short links post up there if I can crack this bug!

From the FAQ it seems like what I'm trying to do is possible, so I assume I'm messing something up with my configuration. I have a minimal set of code to reproduce the issue below, which I've run on the DataStax 0.8.1 AMI I'm using in production (ami-9996c4dc).

 

# To set up the test data structure on Cassandra:
cassandra-cli
connect localhost/9160;
create keyspace TestKeyspace with 
  placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy' and 
  strategy_options = [{replication_factor:3}];
use TestKeyspace;
create column family TestFamily with
  comparator = UTF8Type and
  column_metadata =
  [
    {column_name: test_column, validation_class: UTF8Type},
  ];
# From bash on the same machine, with Ruby and the Cassandra gem installed:
irb
require 'rubygems'
require 'cassandra/0.8'
client = Cassandra.new('TestKeyspace', 'localhost:9160', :retries => 5, :connect_timeout => 5, :timeout => 10)

# With data this size, the call works
column_value = 'a' * (14*1024*1024)
row_value = { 'column_name' => column_value }
client.insert(:TestFamily, 'SomeKey', row_value)
# With data this size, the call fails with the exception below
column_value = 'a' * (15*1024*1024)
row_value = { 'column_name' => column_value }
client.insert(:TestFamily, 'SomeKey', row_value)
# Results:
This first call with a 14MB chunk of data succeeds, but the second one fails with this exception:
CassandraThrift::Cassandra::Client::TransportException: CassandraThrift::Cassandra::Client::TransportException
from /usr/lib/ruby/gems/1.8/gems/thrift-0.7.0/lib/thrift/transport/socket.rb:53:in `open'
from /usr/lib/ruby/gems/1.8/gems/thrift-0.7.0/lib/thrift/transport/framed_transport.rb:37:in `open'
from /usr/lib/ruby/gems/1.8/gems/thrift_client-0.7.1/lib/thrift_client/connection/socket.rb:11:in `connect!'
from /usr/lib/ruby/gems/1.8/gems/thrift_client-0.7.1/lib/thrift_client/abstract_thrift_client.rb:105:in `connect!'
from /usr/lib/ruby/gems/1.8/gems/thrift_client-0.7.1/lib/thrift_client/abstract_thrift_client.rb:144:in `handled_proxy'
from /usr/lib/ruby/gems/1.8/gems/thrift_client-0.7.1/lib/thrift_client/abstract_thrift_client.rb:60:in `batch_mutate'
from /usr/lib/ruby/gems/1.8/gems/cassandra-0.12.1/lib/cassandra/protocol.rb:7:in `_mutate'
from /usr/lib/ruby/gems/1.8/gems/cassandra-0.12.1/lib/cassandra/cassandra.rb:459:in `insert'
from (irb):6
from :0
Any suggestions on how to dig deeper? I'll be reaching out to the Cassandra gem folks too, of course.
[Update – Thanks to Matthew Russell and Daniel Lundin for pointing me towards the solution. cassandra.yaml defines a maximum frame size for Thrift API communication, which defaults to 15MB. Raising that and the max message length solved it for me.]
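
To make the failure mode concrete, here is a small Ruby sketch that checks a value against the frame limit before attempting an insert. It is only an illustration: the constant and method name are invented here, the 15MB figure is the default mentioned in the update, and the check ignores the exact Thrift overhead (row key, column name, timestamp, framing), which is why it treats a value of exactly 15MB as too big. I believe the 0.8-era cassandra.yaml settings involved are thrift_framed_transport_size_in_mb and thrift_max_message_length_in_mb, but treat those names as an assumption and check your own config file.

```ruby
# Assumed default frame limit from the update above, in bytes.
FRAME_LIMIT_BYTES = 15 * 1024 * 1024

# Hypothetical pre-flight check: true only when the value is strictly smaller
# than the frame limit, leaving at least some room for Thrift's own overhead.
def fits_in_frame?(value, limit = FRAME_LIMIT_BYTES)
  value.bytesize < limit
end

small_value = 'a' * (14 * 1024 * 1024) # the size that worked in the repro
large_value = 'a' * (15 * 1024 * 1024) # the size that failed in the repro

puts "14MB value fits: #{fits_in_frame?(small_value)}"
puts "15MB value fits: #{fits_in_frame?(large_value)}"
```

A check like this would not replace raising the server-side limits, but it could turn an opaque TransportException into a clearer error on the client.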