Data Science Toolkit security fix

Photo by Lee Haywood

Just a quick note and apology to users of the DSTK EC2 AMI. The default public key that Amazon adds to ~/.ssh/authorized_keys wasn't being removed automatically during the AMI creation process as I expected, so I had unknowingly been given login access to any unmodified instances created from a DSTK AMI. Happily, Amazon's audit procedures spotted the problem, so I've now gone ahead and built a new version with my public key removed. Apologies to everyone; that was my mistake. To be clear, the worst case was that I could have logged in to a server you'd created; it didn't give anyone else access.

I've updated the docs and released ami-a971b7c0 as the current version of DSTK 0.35. I recommend users of the current AMI either switch to the new one, or just edit ~/.ssh/authorized_keys to remove the line containing my key; it's the one that begins ssh-rsa AAAAB3Nza… Be careful not to delete your own login credentials, or you'll be unable to log in to the box yourself!

To be sure I wasn't missing anything else, I've been studying the Amazon guide to creating shared AMIs, and I use the following commands just before I build the image to wipe sensitive data like website visitor logs, command-line histories, and the default server SSH keys:

# Clear the web server logs, which record visitor activity
sudo rm -rf /var/log/apache2/*

# Remove the authorized SSH public keys, including mine
sudo rm -f ~/.ssh/authorized_keys

# Delete the host SSH keys so each new instance generates fresh ones
sudo rm -f /etc/ssh/ssh_host_dsa_key /etc/ssh/ssh_host_dsa_key.pub
sudo rm -f /etc/ssh/ssh_host_rsa_key /etc/ssh/ssh_host_rsa_key.pub

# Wipe the current shell's command history
history -c

Five short links

Photo by Billie Hara

I'm frantically coding for an upcoming launch, so apologies if you're waiting for an email reply. I'm looking forward to showing the world what we're working on though, and to posting the lessons I've learned about using Cassandra and Hadoop in production.

TinkerPop – The equivalent of the LAMP stack for graph data processing, pulling together the best open-source tools to help build a turnkey pipeline. The developers' avatars on the right of the page scare me though.

Gridded Population of the World – I've been looking for something like this for a long time. It's a breakdown of the surface of the earth as a grid, with an estimate of how many people live in each square. Why is this useful? Almost every geographic density map you produce will be dominated by where people actually live, with the signal you really care about drowned out by the fact that most people are packed into a few urban areas. This data set could be used to correct for that, at least as a first approximation, so you can tell, for example, whether more people are visiting your site from a particular area than its raw population would predict.
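As a toy illustration, here's a minimal Python sketch of that correction; the cell names and counts are made up, and in a real job you'd join your logs against the actual population grid:

# Hypothetical per-grid-cell counts: visits to your site, plus residents
# taken from a population grid like Gridded Population of the World.
visits = {"urban_cell": 900, "rural_cell": 120}
population = {"urban_cell": 1000000, "rural_cell": 20000}

# Dividing by population turns raw counts into per-capita rates, so
# dense urban cells no longer dominate the picture.
for cell in visits:
    print(cell, visits[cell] / population[cell])

# urban_cell 0.0009, rural_cell 0.006: the rural cell is sending far more
# visitors per resident, even though its raw count is lower.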

Data Alchemists – I like Ben's idea, resigned as I am to the dominance of the term data scientist.

ISPs are hijacking search queries – This is a hard-to-explain but important story. ISPs are using a third-party service to capture and redirect their users' searches. The redirection is painful, but handing over everything their users are searching for to a random bunch of marketing companies with no disclosure is a really bad idea.

Three.js – I'm in love with this library, and with WebGL in general. The combination of the ease of use of Javascript, the power of OpenGL and the fact that the latest versions of Chrome, Firefox and Safari all support it means that we'll be seeing a ton of beautiful WebGL pages over the next few months. Check out some of the examples to get inspired.

Goodbye Thor

Chomp!

Just a little personal note, since he's shown up on this blog pretty frequently. I'm sad to say that my dog Thor passed away yesterday after a short but severe illness. The picture above is from just a couple of weeks ago, so you can see he was active right up until the end. He was a fantastic little guy, and I feel lucky to have had him in my life for the time I did. If you knew him and want to remember him, feel free to make a small donation to the ASPCA in his memory. I'm obviously very sad, and will miss him a lot, but I'm doing ok.


Cassandra initial tokens table

When you're setting up a Cassandra cluster with random partitioning, you need to choose balanced initial tokens by splitting the 0 to 2^127 token range evenly, so node i gets i times 2^127 divided by the number of nodes. I found a script here, but being a lazy bugger, I just wanted a table for common numbers of nodes. I couldn't find one by googling, so here's my generated version, and here's the script as a file.
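If you just want to regenerate the numbers yourself, here's a minimal Python sketch of the calculation (my own version, not the original script; it assumes floor division, which matches the table below):

# Balanced initial tokens for Cassandra's RandomPartitioner,
# whose token space runs from 0 to 2^127.
def initial_tokens(num_nodes):
    return [i * (2 ** 127) // num_nodes for i in range(num_nodes)]

# Regenerate the table below for one to ten nodes.
for n in range(1, 11):
    print("%d node(s):" % n)
    for i, token in enumerate(initial_tokens(n)):
        print("node %d: %d" % (i, token))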

One Node
node 0: 0

Two Nodes
node 0: 0
node 1: 85070591730234615865843651857942052864

Three Nodes
node 0: 0
node 1: 56713727820156410577229101238628035242
node 2: 113427455640312821154458202477256070485

Four Nodes
node 0: 0
node 1: 42535295865117307932921825928971026432
node 2: 85070591730234615865843651857942052864
node 3: 127605887595351923798765477786913079296

Five Nodes
node 0: 0
node 1: 34028236692093846346337460743176821145
node 2: 68056473384187692692674921486353642291
node 3: 102084710076281539039012382229530463436
node 4: 136112946768375385385349842972707284582

Six Nodes
node 0: 0
node 1: 28356863910078205288614550619314017621
node 2: 56713727820156410577229101238628035242
node 3: 85070591730234615865843651857942052864
node 4: 113427455640312821154458202477256070485
node 5: 141784319550391026443072753096570088106

Seven Nodes
node 0: 0
node 1: 24305883351495604533098186245126300818
node 2: 48611766702991209066196372490252601636
node 3: 72917650054486813599294558735378902454
node 4: 97223533405982418132392744980505203273
node 5: 121529416757478022665490931225631504091
node 6: 145835300108973627198589117470757804909

Eight Nodes
node 0: 0
node 1: 21267647932558653966460912964485513216
node 2: 42535295865117307932921825928971026432
node 3: 63802943797675961899382738893456539648
node 4: 85070591730234615865843651857942052864
node 5: 106338239662793269832304564822427566080
node 6: 127605887595351923798765477786913079296
node 7: 148873535527910577765226390751398592512

Nine Nodes
node 0: 0
node 1: 18904575940052136859076367079542678414
node 2: 37809151880104273718152734159085356828
node 3: 56713727820156410577229101238628035242
node 4: 75618303760208547436305468318170713656
node 5: 94522879700260684295381835397713392071
node 6: 113427455640312821154458202477256070485
node 7: 132332031580364958013534569556798748899
node 8: 151236607520417094872610936636341427313

Ten Nodes
node 0: 0
node 1: 17014118346046923173168730371588410572
node 2: 34028236692093846346337460743176821145
node 3: 51042355038140769519506191114765231718
node 4: 68056473384187692692674921486353642291
node 5: 85070591730234615865843651857942052864
node 6: 102084710076281539039012382229530463436
node 7: 119098828422328462212181112601118874009
node 8: 136112946768375385385349842972707284582
node 9: 153127065114422308558518573344295695155

Five short links

Photo by Chris in Plymouth

Datacatalogs.org – A catalog of data catalogs from governments around the world. The really hard problem with all 'open data' is making the connection between a developer's immediate problem and an available data set or API, but at least sites like this are building a foundation for solving that.

A proposal for making Ajax crawlable – I didn't realize the hashbang syntax was actually backed by an informal standard for making the same content available to crawlers through a traditional URL. This is much better than completely opaque Javascript-driven pages, but I'm left wondering how tough it is to maintain two separate content-delivery paths in the code.
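For reference, here's a minimal Python sketch of how I read the URL mapping (my own illustration, not code from the proposal; the example URL is made up):

# Under the AJAX-crawling proposal, a crawler rewrites the #! fragment
# into an _escaped_fragment_ query parameter and fetches that URL,
# which the server answers with a static HTML snapshot.
import urllib.parse

def crawlable_url(url):
    base, sep, fragment = url.partition("#!")
    if not sep:
        return url  # no hashbang, nothing to rewrite
    joiner = "&" if "?" in base else "?"
    return base + joiner + "_escaped_fragment_=" + urllib.parse.quote(fragment)

print(crawlable_url("http://example.com/page#!state=photos"))
# http://example.com/page?_escaped_fragment_=state%3Dphotos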

Disruptor – Much as I dislike queues as a general-purpose primitive for data processing (I see them as a necessary evil when you're dealing with the subset of problems that require streaming solutions) I am impressed by this high-performance framework. A recurring theme in many of my optimization investigations over the last few years has been the painful cost of locking, so I bet their focus on lockless parallelization will be very powerful.

Adventures with venture capital – Chasing investment is both time-consuming and uncertain. A cautionary tale from Tim on how the process can go wrong, which unfortunately happens more often than you'd think.

How much compute power do you need for next-gen sequencing? – Bioinformatics tasks are much larger than most web problems, but this analysis of their computing needs has some useful parallels. In my data jobs, CPU has never been the bottleneck; it's always been memory or IO. I don't think I'll be moving to a 1TB RAM machine any time soon though!

My San Francisco food highlights


Photo from the Lazy Bear

Since moving to San Francisco, I've fallen in with a bad crowd. I've found myself hanging out with foodies, which has resulted in a lighter wallet and a heavier exercise schedule. Here are some of the more memorable places I've been in the last six months. Most of these aren't particularly high-end, but they all left an impression.

Blue – Spicy mac'n'cheese! Tater tots! A tiny diner on Market near Castro, full of lovingly executed comfort food.

La Mar – Strange but nice mashed-potato sculptures, served a bit like sushi. An expensive Peruvian restaurant at the Embarcadero, with high-octane pisco sours.

Chow – American traditional, old-style pasta dishes like spaghetti and meatballs. There are multiple locations, but the one at Church and Market has the best food and service.

L'Ardoise – Classy but unpretentious French food, coq-au-vin done right. Staffed by actual French people, this small place near Noe and Market is one of my favorite restaurants in the world.

Sushi Raw – Everyday California sushi, think spider rolls and other staples. Near Haight and Steiner in Lower Haight, this is a solid neighborhood sushi bar with a good online ordering system for take-out. It's my go-to place for a quick and healthy weekday meal.

Axum Cafe – Ethiopian food, lots of vegetarian offerings. This was my first exposure to Ethiopian food, and as a fan of both bread and eating with my hands, I was a natural. I recommend getting a platter with a sampler of all five of their dishes.

Five belated links

Photo by Dave Knapnik

If you've been wondering about my radio silence, the last few weeks have been jam-packed with change. I've joined up with a couple of experienced co-founders I've known for a while, and we're creating a consumer-focused business based on a lot of the ideas I've been working on for the last few years. I can't tell you how productive it is to be working in a team again; I'm even going into an office every day! We're in the middle of raising a seed round, so if you're an investor who wants to know more, drop me an email: pete@petewarden.com.

Anyway, the new venture gives me so much material and so little time, but I'll still be blogging when I can catch my breath. It's the only way I know to really understand my own thinking.

Gitalytics – A statistical analysis service for Github by Sameer Al-Sakran. It shows how powerful remixing publicly-available data about people can be, and I'm not just saying that because he gave me a high score! I already use github activity as a signal in hiring, but this gives me a much better summary than I can get by browsing the main website.

Is data open if it can't be crawled? – As we see more websites adopt the #! (hashbang) javascript-driven approach to rendering pages, does it close off automated access? In this case it appears that Oregon has an alternative structure available, but as content gets embedded as part of a script-driven process, it will be harder and harder for developers to access it. This is one reason I'm betting that client-side capturing will be the long-term answer to open data.

To lose weight, forget the details – Dieters only given access to their approximate weight were more successful than those who could see an exact version. As much as I love metrics, this matches what I know about my own emotional response to numbers.

Eatfoo – I met the chef behind the Lazy Bear 'underground restaurant' at a recent party, and I can't wait to attend the next one. I've fallen in with a foodie crowd here in San Francisco; they're even starting to shame me out of my occasional weakness for Taco Bell.

“Researchers make their reputations on discovery, not de-discovery” – Laura J Snyder applies her historical perspective to the current practice of science and finds it wanting. I do wonder what the long-term consequences of the flood of bogus studies will be. It's no wonder there's a lot of skepticism about real phenomena like global warming when many well-publicized studies on other topics are never exposed to active debunking efforts.

Five short links

Picture by Phillip Chapman-Bell

Truthy – A research project that tracks how memes spread on the internet. Could be a big help in understanding how we can combat bogus ideas on the web.

A rough but intriguing method for assessing a school principal – I love rules of thumb, especially subtle ones that are hard to game and tied to something you care about; they're so useful in any data work. PageRank is at heart a statistical hack like most rules of thumb, and it's the most lucrative algorithm in history. Bob Sutton unearths an offline example, and I can believe that how many pupils' names a principal knows has a strong relationship to their diligence and involvement in the job.

Liquimap Demo – Steve Souza has been doing some thought-provoking work on new ways of presenting data as raster images, essentially hacking our vision systems to help us spot patterns in chaotic information.

Data Without Borders – I'm a believer in the power of data analysis to provide new insights on old problems, especially in the non-profit world, and Jake Porway's new effort seems like a great way for us to use our powers for good.

HBase vs Cassandra – I'm deciding which of the two to use for a new project, and this article has been a big help. Judging by other startups' choices, Cassandra seems to have won, but since I'll be doing a lot of Hadoop processing, I'm trying to figure out whether Brisk on Cassandra would work, or if HBase still has advantages there.

Green Tea Kit Kats


A few weeks ago I received a lovely gift from Ken Cukier, who writes for the Economist out of Tokyo and was responsible for the Deluge of Data special report last year. Who knew Nestlé had the mad food science skills to create a green tea Kit Kat? I actually hesitated about eating it because I wanted to keep it as a conversation piece, but in the end I couldn't resist. It was unusual but delicious, a bit like white chocolate but with a distinct tea flavor. I'm posting this both as a thank-you to Ken and in the hope that it will encourage other friends to ply me with exotic candy from around the world; it's a trend I like!

Am I wrong about queues being Satan’s little helpers?

Photo by Lori La Tortuga

I either did a bad job explaining myself, or my last post was wrong, judging by the reaction from a Twitter engineer and other comments by email. The point I was trying to get across was that queue-based systems look temptingly simple on the surface but require a lot of work to get right. Is it possible to build a robust pipeline based on queues? Yes. Can your team do it? Even if they can, is it worth the time compared to an off-the-shelf batch solution?

I've seen enough cases to believe that there's a queue anti-pattern for building data processing pipelines. Building streaming pipelines is still a research problem, with promising projects like S4, Storm, and Google's Caffeine, but they're all very young or proprietary. It's a tempting approach because it's such an obvious next step in data processing, and it's so easy to get started stringing queues together. That's the wrong choice for most of us mere mortals though, as we'll get sucked into dealing with all the unexpected problems I described, instead of adding features to our product.

I'm wary of using queues for data processing for the same reason I'm wary of using threads for parallelizing code. Experts can create wonderful concurrent systems using threads, but I keep shooting myself in the foot when I use them. They just aren't the right abstraction for most problems. In the same way, when you're designing a pipeline and find yourself thinking in terms of queues, take a step back and ask yourself whether you could achieve your goals with a mature batch-based system like Hadoop instead.

Queue-based data pipelines are hard, but they look easy at first. That's why I believe they're so dangerous.