Data Science Toolkit security fix

Photo by Lee Haywood

Just a quick note and apology to users of the DSTK EC2 AMI. The default public key that Amazon adds to ~/.ssh/authorized_keys wasn't being removed automatically during the AMI creation process as I expected, so I had unknowingly been given login access to any unmodified instances created from a DSTK AMI. Happily, Amazon's audit procedures spotted the problem, so I've now gone ahead and built a new version with my public key removed. Apologies to everyone; that was my mistake. To be clear, the worst case was that I could have logged in to a server you'd created; it didn't give anyone else access.

I've updated the docs and released ami-a971b7c0 as the current version of DSTK 0.35. I recommend users of the current AMI either switch to the new one, or just edit ~/.ssh/authorized_keys to remove the line containing my key; it's the one that begins ssh-rsa AAAAB3Nza… Be careful not to delete your own login credentials, or you'll be unable to log in to the box yourself!

To be sure I wasn't missing anything else, I've been studying the Amazon guide to creating shared AMIs, and I use the following commands just before I build the image to wipe sensitive data like website visitor logs, command-line histories, and the default server SSH keys:

# Clear the web server logs, which record visitor activity
sudo rm -rf /var/log/apache2/*

# Remove the authorized SSH public keys, including mine
sudo rm -f ~/.ssh/authorized_keys

# Delete the host SSH keys so each new instance generates fresh ones
sudo rm -f /etc/ssh/ssh_host_dsa_key /etc/ssh/ssh_host_dsa_key.pub
sudo rm -f /etc/ssh/ssh_host_rsa_key /etc/ssh/ssh_host_rsa_key.pub

# Wipe the current shell's command history
history -c

Five short links

Photo by Billie Hara

I'm frantically coding for an upcoming launch, so apologies if you're waiting for an email reply. I'm looking forward to showing the world what we're working on though, and to posting the lessons I've learned about using Cassandra and Hadoop in production.

TinkerPop – The equivalent of the LAMP stack for graph data processing, pulling together the best open-source tools to help build a turnkey pipeline. The developers' avatars on the right of the page scare me though.

Gridded Population of the World – I've been looking for something like this for a long time. It's a breakdown of the surface of the earth as a grid, with an estimate of how many people live in each square. Why is this useful? Almost every geographic density map you produce will be dominated by where people actually live, with the signal you really care about drowned out by the fact that most people are packed into a few urban areas. This data set could be used to correct for that, at least as a first approximation, so you can tell, for example, whether more people are visiting your site from a particular area than its raw population would predict.
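As a toy illustration, here's a minimal Python sketch of that correction; the cell names and counts are made up, and in a real job you'd join your logs against the actual population grid:

# Hypothetical per-grid-cell counts: visits to your site, plus residents
# taken from a population grid like Gridded Population of the World.
visits = {"urban_cell": 900, "rural_cell": 120}
population = {"urban_cell": 1000000, "rural_cell": 20000}

# Dividing by population turns raw counts into per-capita rates, so
# dense urban cells no longer dominate the picture.
for cell in visits:
    print(cell, visits[cell] / population[cell])

# urban_cell 0.0009, rural_cell 0.006: the rural cell is sending far more
# visitors per resident, even though its raw count is lower.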

Data Alchemists – I like Ben's idea, resigned as I am to the dominance of the term data scientist.

ISPs are hijacking search queries – This is a hard-to-explain but important story. ISPs are using a third-party service to capture and redirect their users' searches. The redirection is painful, but handing over everything their users are searching for to a random bunch of marketing companies with no disclosure is a really bad idea.

Three.js – I'm in love with this library, and with WebGL in general. The combination of the ease of use of Javascript, the power of OpenGL and the fact that the latest versions of Chrome, Firefox and Safari all support it means that we'll be seeing a ton of beautiful WebGL pages over the next few months. Check out some of the examples to get inspired.

Goodbye Thor

Chomp!

Just a little personal note, since he's shown up on this blog pretty frequently. I'm sad to say that my dog Thor passed away yesterday after a short but severe illness. The picture above is from just a couple of weeks ago, so you can see he was active right up until the end. He was a fantastic little guy, and I feel lucky to have had him in my life for the time I did. If you knew him and want to remember him, feel free to make a small donation to the ASPCA in his memory. I'm obviously very sad, and will miss him a lot, but I'm doing ok.


Cassandra initial tokens table

When you're setting up a Cassandra cluster with random partitioning, you need to choose balanced initial tokens by splitting the 0 to 2^127 token range evenly, so node i gets i times 2^127 divided by the number of nodes. I found a script here, but being a lazy bugger, I just wanted a table for common numbers of nodes. I couldn't find one by googling, so here's my generated version, and here's the script as a file.
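If you just want to regenerate the numbers yourself, here's a minimal Python sketch of the calculation (my own version, not the original script; it assumes floor division, which matches the table below):

# Balanced initial tokens for Cassandra's RandomPartitioner,
# whose token space runs from 0 to 2^127.
def initial_tokens(num_nodes):
    return [i * (2 ** 127) // num_nodes for i in range(num_nodes)]

# Regenerate the table below for one to ten nodes.
for n in range(1, 11):
    print("%d node(s):" % n)
    for i, token in enumerate(initial_tokens(n)):
        print("node %d: %d" % (i, token))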

One Node
node 0: 0

Two Nodes
node 0: 0
node 1: 85070591730234615865843651857942052864

Three Nodes
node 0: 0
node 1: 56713727820156410577229101238628035242
node 2: 113427455640312821154458202477256070485

Four Nodes
node 0: 0
node 1: 42535295865117307932921825928971026432
node 2: 85070591730234615865843651857942052864
node 3: 127605887595351923798765477786913079296

Five Nodes
node 0: 0
node 1: 34028236692093846346337460743176821145
node 2: 68056473384187692692674921486353642291
node 3: 102084710076281539039012382229530463436
node 4: 136112946768375385385349842972707284582

Six Nodes
node 0: 0
node 1: 28356863910078205288614550619314017621
node 2: 56713727820156410577229101238628035242
node 3: 85070591730234615865843651857942052864
node 4: 113427455640312821154458202477256070485
node 5: 141784319550391026443072753096570088106

Seven Nodes
node 0: 0
node 1: 24305883351495604533098186245126300818
node 2: 48611766702991209066196372490252601636
node 3: 72917650054486813599294558735378902454
node 4: 97223533405982418132392744980505203273
node 5: 121529416757478022665490931225631504091
node 6: 145835300108973627198589117470757804909

Eight Nodes
node 0: 0
node 1: 21267647932558653966460912964485513216
node 2: 42535295865117307932921825928971026432
node 3: 63802943797675961899382738893456539648
node 4: 85070591730234615865843651857942052864
node 5: 106338239662793269832304564822427566080
node 6: 127605887595351923798765477786913079296
node 7: 148873535527910577765226390751398592512

Nine Nodes
node 0: 0
node 1: 18904575940052136859076367079542678414
node 2: 37809151880104273718152734159085356828
node 3: 56713727820156410577229101238628035242
node 4: 75618303760208547436305468318170713656
node 5: 94522879700260684295381835397713392071
node 6: 113427455640312821154458202477256070485
node 7: 132332031580364958013534569556798748899
node 8: 151236607520417094872610936636341427313

Ten Nodes
node 0: 0
node 1: 17014118346046923173168730371588410572
node 2: 34028236692093846346337460743176821145
node 3: 51042355038140769519506191114765231718
node 4: 68056473384187692692674921486353642291
node 5: 85070591730234615865843651857942052864
node 6: 102084710076281539039012382229530463436
node 7: 119098828422328462212181112601118874009
node 8: 136112946768375385385349842972707284582
node 9: 153127065114422308558518573344295695155

Five short links

Photo by Chris in Plymouth

Datacatalogs.org – A catalog of data catalogs from governments around the world. The really hard problem with all 'open data' is making the connection between a developer's immediate problem and an available data set or API, but at least sites like this are building a foundation for solving that.

A proposal for making Ajax crawlable – I didn't realize the hashbang syntax was actually backed by an informal standard for making the same content available to crawlers through a traditional URL. This is much better than completely opaque Javascript-driven pages, but I'm left wondering how tough it is to maintain two separate content-delivery paths in the code.
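For reference, here's a minimal Python sketch of how I read the URL mapping (my own illustration, not code from the proposal; the example URL is made up):

# Under the AJAX-crawling proposal, a crawler rewrites the #! fragment
# into an _escaped_fragment_ query parameter and fetches that URL,
# which the server answers with a static HTML snapshot.
import urllib.parse

def crawlable_url(url):
    base, sep, fragment = url.partition("#!")
    if not sep:
        return url  # no hashbang, nothing to rewrite
    joiner = "&" if "?" in base else "?"
    return base + joiner + "_escaped_fragment_=" + urllib.parse.quote(fragment)

print(crawlable_url("http://example.com/page#!state=photos"))
# http://example.com/page?_escaped_fragment_=state%3Dphotos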

Disruptor – Much as I dislike queues as a general-purpose primitive for data processing (I see them as a necessary evil when you're dealing with the subset of problems that require streaming solutions) I am impressed by this high-performance framework. A recurring theme in many of my optimization investigations over the last few years has been the painful cost of locking, so I bet their focus on lockless parallelization will be very powerful.

Adventures with venture capital – Chasing investment is both time-consuming and uncertain. A cautionary tale from Tim on how the process can go wrong, which unfortunately happens more often than you'd think.

How much compute power do you need for next-gen sequencing? – Bioinformatics tasks are much larger than most web problems, but this analysis of their computing needs has some useful parallels. In my data jobs, CPU has never been the bottleneck; it's always been memory or IO. I don't think I'll be moving to a 1TB RAM machine any time soon though!

My San Francisco food highlights


Photo from the Lazy Bear

Since moving to San Francisco, I've fallen in with a bad crowd. I've found myself hanging out with foodies, which has resulted in a lighter wallet and a heavier exercise schedule. Here are some of the more memorable places I've been in the last six months. Most of these aren't particularly high-end, but they all left an impression.

Blue – Spicy mac'n'cheese! Tater tots! A tiny diner on Market near Castro, full of lovingly executed comfort food.

La Mar – Strange but nice mashed-potato sculptures, served a bit like sushi. An expensive Peruvian restaurant at the Embarcadero, with high-octane pisco sours.

Chow – American traditional, old-style pasta dishes like spaghetti and meatballs. There are multiple locations, but the one at Church and Market has the best food and service.

L'Ardoise – Classy but unpretentious French food, coq-au-vin done right. Staffed by actual French people, this small place near Noe and Market is one of my favorite restaurants in the world.

Sushi Raw – Everyday California sushi, think spider rolls and other staples. Near Haight and Steiner in Lower Haight, this is a solid neighborhood sushi bar with a good online ordering system for take-out. It's my go-to place for a quick and healthy weekday meal.

Axum Cafe – Ethiopian food, lots of vegetarian offerings. This was my first exposure to Ethiopian food, and as a fan of both bread and eating with my hands, I was a natural. I recommend getting a platter with a sampler of all five of their dishes.

Five belated links

Photo by Dave Knapnik

If you've been wondering about my radio silence, the last few weeks have been jam-packed with change. I've joined up with a couple of experienced co-founders I've known for a while, and we're creating a consumer-focused business based on a lot of the ideas I've been working on for the last few years. I can't tell you how productive it is to be working in a team again; I'm even going into an office every day! We're in the middle of raising a seed round, so if you're an investor who wants to know more, drop me an email: pete@petewarden.com.

Anyway, the new venture gives me so much material and so little time, but I'll still be blogging when I can catch my breath. It's the only way I know to really understand my own thinking.

Gitalytics – A statistical analysis service for Github by Sameer Al-Sakran. It shows how powerful remixing publicly-available data about people can be, and I'm not just saying that because he gave me a high score! I already use github activity as a signal in hiring, but this gives me a much better summary than I can get by browsing the main website.

Is data open if it can't be crawled? – As we see more websites adopt the #! (hashbang) javascript-driven approach to rendering pages, does it close off automated access? In this case it appears that Oregon has an alternative structure available, but as content gets embedded as part of a script-driven process, it will be harder and harder for developers to access it. This is one reason I'm betting that client-side capturing will be the long-term answer to open data.

To lose weight, forget the details – Dieters only given access to their approximate weight were more successful than those who could see an exact version. As much as I love metrics, this matches what I know about my own emotional response to numbers.

Eatfoo – I met the chef behind the Lazy Bear 'underground restaurant' at a recent party, and I can't wait to attend the next one. I've fallen in with a foodie crowd here in San Francisco; they're even starting to shame me out of my occasional weakness for Taco Bell.

“Researchers make their reputations on discovery, not de-discovery” – Laura J Snyder applies her historical perspective to the current practice of science and finds it wanting. I do wonder what the long-term consequences of the flood of bogus studies will be. It's no wonder there's a lot of skepticism about real phenomena like global warming when many well-publicized studies on other topics are never exposed to active debunking efforts.

Five short links

Picture by Phillip Chapman-Bell

Truthy – A research project that tracks how memes spread on the internet. Could be a big help in understanding how we can combat bogus ideas on the web.

A rough but intriguing method for assessing a school principal – I love rules of thumb, especially subtle ones that are hard to game and tied to something you care about; they're so useful in any data work. PageRank is at heart a statistical hack like most rules of thumb, and it's the most lucrative algorithm in history. Bob Sutton unearths an offline example, and I can believe that how many pupils' names a principal knows has a strong relationship to their diligence and involvement in the job.

Liquimap Demo – Steve Souza has been doing some thought-provoking work on new ways of presenting data as raster images, essentially hacking our vision systems to help us spot patterns in chaotic information.

Data Without Borders – I'm a believer in the power of data analysis to provide new insights on old problems, especially in the non-profit world, and Jake Porway's new effort seems like a great way for us to use our powers for good.

HBase vs Cassandra – I'm deciding which of the two to use for a new project, and this article has been a big help. Judging by other startups' choices, Cassandra seems to have won, but since I'll be doing a lot of Hadoop processing, I'm trying to figure out whether Brisk on Cassandra would work, or if HBase still has advantages there.

Green Tea Kit Kats


A few weeks ago I received a lovely gift from Ken Cukier, who writes for the Economist out of Tokyo and was responsible for the Deluge of Data special report last year. Who knew Nestlé had the mad food science skills to create a green tea Kit Kat? I actually hesitated about eating it because I wanted to keep it as a conversation piece, but in the end I couldn't resist. It was unusual but delicious, a bit like white chocolate but with a distinct tea flavor. I'm posting this both as a thank-you to Ken and in the hope that it will encourage other friends to ply me with exotic candy from around the world; it's a trend I like!

Am I wrong about queues being Satan’s little helpers?

Photo by Lori La Tortuga

I either did a bad job explaining myself, or my last post was wrong, judging by the reaction from a Twitter engineer and other comments by email. The point I was trying to get across was that queue-based systems look temptingly simple on the surface but require a lot of work to get right. Is it possible to build a robust pipeline based on queues? Yes. Can your team do it? Even if they can, is it worth the time compared to an off-the-shelf batch solution?

I've seen enough cases to believe that there's a queue anti-pattern for building data processing pipelines. Building streaming pipelines is still a research problem, with promising projects like S4, Storm, and Google's Caffeine, but they're all very young or proprietary. It's a tempting approach because it's such an obvious next step in data processing, and it's so easy to get started stringing queues together. That's the wrong choice for most of us mere mortals though, as we'll get sucked into dealing with all the unexpected problems I described, instead of adding features to our product.

I'm wary of using queues for data processing for the same reason I'm wary of using threads for parallelizing code. Experts can create wonderful concurrent systems using threads, but I keep shooting myself in the foot when I use them. They just aren't the right abstraction for most problems. In the same way, when you're designing a pipeline and find yourself thinking in terms of queues, take a step back and ask yourself whether you could achieve your goals with a mature batch-based system like Hadoop instead.

Queue-based data pipelines are hard, but they look easy at first. That's why I believe they're so dangerous.