Twelve steps to running your Ruby code across five billion web pages

Photo by Andrew Ferguson

Common Crawl is one of those projects where I rant and rave about how world-changing it will be, and often all I get in response is a quizzical look. It's an actively-updated and programmatically-accessible archive of public web pages, with over five billion crawled so far. So what, you say? This is going to be the foundation of a whole family of applications that have never been possible outside of the largest corporations. It's mega-scale web-crawling for the masses, and will enable startups and hackers to innovate around ideas like a dictionary built from the web, reverse-engineering postal codes, or any other application that can benefit from huge amounts of real-world content.

Rather than grabbing each of you by the lapels individually and ranting, I thought it would be more productive to give you a simple example of how you can run your own code across the archived pages. It's currently released as an Amazon Public Data Set, which means you don't pay for access from Amazon servers, so I'll show you how on their Elastic MapReduce service.

I'm grateful to Ben Nagy for the original Ruby code I'm basing this on. I've made minimal changes to his original code, and built a step-by-step guide describing exactly how to run it. If you're interested in the Java equivalent, I recommend this alternative five-minute guide.

1 – Fetch the example code from github

You'll need git to get the example source code. If you don't already have it, there's a good guide to installing it here:

http://help.github.com/mac-set-up-git/

From a terminal prompt, you'll need to run the following command to pull it from my github project:

git clone git://github.com/petewarden/common_crawl_types.git

2 – Add your Amazon keys

If you don't already have an Amazon account, go to this page and sign up:

https://aws-portal.amazon.com/gp/aws/developer/registration/index.html

Your keys should be accessible here:

https://aws-portal.amazon.com/gp/aws/securityCredentials

To access the data set, you need to supply the public and secret keys. Open up extension_map.rb in your editor and just below the CHANGEME comment add your own keys (it's currently around line 61).
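I'm not going to reproduce Ben Nagy's code here, so treat this as a rough sketch of what the section just below the CHANGEME comment looks like once you've filled it in; the constant names are made up for illustration and may not match the real script:

# Near the CHANGEME comment in extension_map.rb (hypothetical names):
AWS_ACCESS_KEY_ID     = 'your-public-access-key'
AWS_SECRET_ACCESS_KEY = 'your-secret-access-key'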

3 – Sign in to the EC2 web console

To control the Amazon web services you'll need to run the code, you need to be signed in on this page:

http://console.aws.amazon.com

4 - Create four buckets on S3


Buckets are a bit like top-level folders in Amazon's S3 storage system. They need to have globally-unique names which don't clash with any other Amazon user's buckets, so when you see me using com.petewarden as a prefix, replace that with something else unique, like your own domain name. Click on the S3 tab at the top of the page and then click the Create Bucket button at the top of the left pane, and enter com.petewarden.commoncrawl01input for the first bucket. Repeat with the following three other buckets:

com.petewarden.commoncrawl01output

com.petewarden.commoncrawl01scripts

com.petewarden.commoncrawl01logging

The last part of their names is meant to indicate what they'll be used for. 'scripts' will hold the source code for your job, 'input' the files that are fed into the code, 'output' will hold the results of the job, and 'logging' will have any error messages it generates.
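If you'd rather script this step than click through the console, here's a rough sketch using the aws-sdk-s3 Ruby gem (nothing in the walkthrough depends on it, so treat it as an optional aside), with your own unique prefix swapped in for mine:

require 'aws-sdk-s3'

# Creates the four buckets used in this walkthrough. Credentials are picked up
# from the usual AWS environment variables.
s3 = Aws::S3::Client.new(region: 'us-east-1')
%w[input output scripts logging].each do |purpose|
  s3.create_bucket(bucket: "com.petewarden.commoncrawl01#{purpose}")
end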

5 – Upload files to your buckets


Select your 'scripts' bucket in the left-hand pane, and click the Upload button in the center pane. Select extension_map.rb, extension_reduce.rb, and setup.sh from the folder on your local machine where you cloned the git project. Click Start Upload, and it should only take a few seconds. Do the same steps for the 'input' bucket and the example_input.txt file.
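The same caveat applies if you'd rather script the uploads with the aws-sdk-s3 gem; run something like this from the folder where you cloned the project:

require 'aws-sdk-s3'

s3 = Aws::S3::Client.new(region: 'us-east-1')

# Job code and bootstrap script go into the 'scripts' bucket...
%w[extension_map.rb extension_reduce.rb setup.sh].each do |name|
  s3.put_object(bucket: 'com.petewarden.commoncrawl01scripts',
                key: name, body: File.read(name))
end

# ...and the list of crawl files to process goes into the 'input' bucket.
s3.put_object(bucket: 'com.petewarden.commoncrawl01input',
              key: 'example_input.txt', body: File.read('example_input.txt'))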

6 – Create the Elastic MapReduce job

The EMR service actually creates a Hadoop cluster for you and runs your code on it, but the details are mostly hidden behind their user interface. Click on the Elastic MapReduce tab at the top, and then the Create New Job Flow button to get started.

7 – Describe the job


The Job Flow Name is only used for display purposes, so I normally put something that will remind me of what I'm doing, with an informal version number at the end. Leave the Create a Job Flow radio button on Run your own application, but choose Streaming from the drop-down menu.

8 – Tell it where your code and data are


This is probably the trickiest stage of the job setup. You need to put in the S3 URL (the bucket name prefixed with s3://) for the inputs and outputs of your job. Input Location should be the root folder of the bucket where you put the example_input.txt file, in my case 's3://com.petewarden.commoncrawl01input'. Note that this one is a folder, not a single file, and it will read whichever files are in that bucket below that location.

The Output Location is also going to be a folder, but the job itself will create it, so it mustn't already exist (you'll get an error if it does). This even applies to the root folder on the bucket, so you must have a non-existent folder suffix. In this example I'm using 's3://com.petewarden.commoncrawl01output/01/'.

The Mapper and Reducer fields should point at the source code files you uploaded to your 'scripts' bucket: 's3://com.petewarden.commoncrawl01scripts/extension_map.rb' for the mapper and 's3://com.petewarden.commoncrawl01scripts/extension_reduce.rb' for the reducer. You can leave the Extra Args field blank, and click Continue.

9 – Choose how many machines you'll run on


The defaults on this screen should be fine, with m1.small instance types everywhere, two instances in the core group, and zero in the task group. Once you get more advanced, you can experiment with different types and larger numbers, but I've kept the inputs to this example very small, so it should only take twenty minutes on the default three-machine cluster, which will cost you less than 30 cents. Click Continue.

10 – Set up logging


Hadoop can be a hard beast to debug, so I always ask Elastic MapReduce to write out copies of the log files to a bucket so I can use them to figure out what went wrong. On this screen, leave everything else at the defaults but put the location of your 'logging' bucket for the Amazon S3 Log Path, in this case 's3://com.petewarden.commoncrawl01logging'. A new folder with a unique name will be created for every job you run, so you can specify the root of your bucket. Click Continue.

11 – Specify a boot script


The default virtual machine images Amazon supplies are a bit old, so we need to run a script when we start each machine to install missing software. We do this by selecting the Configure your Bootstrap Actions button, choosing Custom Action for the Action Type, and then putting in the location of the setup.sh file we uploaded, eg 's3://com.petewarden.commoncrawl01scripts/setup.sh'. After you've done that, click Continue.

12 – Run your job


The last screen shows the settings you chose, so take a quick look to spot any typos, and then click Create Job Flow. The main screen should now contain a new job, with the status 'Starting' next to it. After a couple of minutes, that should change to 'Bootstrapping', which takes around ten minutes, and then the job itself will run, which only takes another two or three minutes.

Debugging all the possible errors is beyond the scope of this post, but a good start is poking around the contents of the logging bucket, and looking at any description the web UI gives you.


Once the job has successfully run, you should see a few files beginning 'part-' inside the folder you specified on the output bucket. If you open one of these up, you'll see the results of the job.
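If you'd rather fetch the results from a script than click through the S3 console, a sketch along these lines should work, again assuming the aws-sdk-s3 gem and the output folder I chose above:

require 'aws-sdk-s3'

s3 = Aws::S3::Client.new(region: 'us-east-1')

# List everything under the job's output folder and print the 'part-' files.
resp = s3.list_objects_v2(bucket: 'com.petewarden.commoncrawl01output', prefix: '01/')
resp.contents.each do |object|
  next unless object.key.include?('part-')
  puts "== #{object.key} =="
  puts s3.get_object(bucket: 'com.petewarden.commoncrawl01output', key: object.key).body.read
end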


This job is just a 'Hello World' program for walking the Common Crawl data set in Ruby, and simply counts the frequency of mime types and URL suffixes, and I've only pointed it at a small subset of the data. What's important is that this gives you a starting point to write your own Ruby algorithms to analyse the wealth of information that's buried in this archive. Take a look at the last few lines of extension_map.rb to see where you can add your own code, and edit example_input.txt to add more of the data set once you're ready to sink your teeth in.
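For orientation, here's the bare-bones shape of a Hadoop Streaming mapper in Ruby. This isn't Ben Nagy's code (his mapper does the real work of pulling the crawl files from S3), but the contract is the same: read lines from stdin, write tab-separated key/value pairs to stdout, and let Hadoop handle the sorting in between.

# A stripped-down mapper: emit a count of 1 for each URL suffix seen on stdin.
STDIN.each_line do |line|
  url = line.strip.split('?').first.to_s
  suffix = File.extname(url).downcase
  puts "#{suffix}\t1" unless suffix.empty?
end

And the matching reducer, which relies on Streaming handing it input sorted by key:

# A stripped-down reducer: total up each run of identical keys.
current_key, total = nil, 0
STDIN.each_line do |line|
  key, value = line.chomp.split("\t")
  if key != current_key
    puts "#{current_key}\t#{total}" if current_key
    current_key, total = key, 0
  end
  total += value.to_i
end
puts "#{current_key}\t#{total}" if current_key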

Big thanks again to Ben Nagy for putting the code together, and if you're interested in understanding Hadoop and Elastic MapReduce in more detail, I created a video training session that might be helpful. I can't wait to see all the applications that come out of the Common Crawl data set, so get coding!

Unpaid work, sexism, and racism

 


Photo by Wayan Vota

You may have been wondering why I haven't been blogging for over a week. I've got the generic excuse of being busy, but truthfully it's because I've had a draft of this post staring back at me for most of that time. God knows I'm not normally one to shy away from controversy, but I also know how tough it is to talk about racism and sexism without generating more heat than light. After two more head-slapping examples of our problem appeared just in the last few days, I couldn't hold off any longer. I'm not a good person to talk about explicit discrimination in the tech industry (I'd turn to somebody like Kristina Chodorow for that), but I have been struck by one of the more subtle reasons we discourage a lot of potential engineers from joining the profession.

I don't get paid for most of the things I spend my time on. I do my blogging, open source coding, and speak at conferences for free, my books provide beer money, and I've only been able to pay myself a small salary for the last few months, after four years of working on startups. This isn't a plea for sympathy, I love doing what I do and see it all as a great investment in the future. I saved up money during my time at Apple precisely so I'd have the luxury of doing all these things.

I was thinking about this when I read Rebecca Murphey's post about the Fluent conference. Her complaints were mostly about things that seemed intrinsic to commercial conferences to me, but I was struck by her observation that the lack of expenses for speakers hits diversity.

I think it goes beyond conferences though (and I've actually found O'Reilly to be far better at paying contributors than most organizers, and they work very hard on discrimination problems). The media industry relies on unpaid internships as a gateway to journalism careers, which excludes a lot of people. Our tech community chooses its high-flyers from people who have enough money and confidence to spend significant amounts of time on unpaid work. Isn't this likely to exclude a lot of people too?

And yes, we do have a diversity problem. I'm not wringing my hands about this out of a vague concern for 'political correctness', I'm deeply frustrated that I have so much trouble hiring good engineers. I look around at careers that require similar skills, like actuaries, and they include a lot more women and minorities. I desperately need more good people on my team, and the statistics tell me that as a community we're failing to attract or keep a lot of the potential candidates.

We're a meritocracy. Writing, speaking, or coding for free helps talented people get noticed, and it's hard to picture our industry functioning without that process at its heart. We have to think hard about how we can preserve the aspects we need, but open up the system to people we're missing right now. Maybe that means setting up scholarships, having a norm that internships should all be paid, setting aside time for training as part of the job, or even doing a better job of reaching larval engineers earlier in education? Is part of it just talking about the career path more explicitly, so that people understand how crucial spending your weekends coding on open source, etc, can be for your career?

I don't know exactly what to do, but when I look around at yet another room packed with white guys in black t-shirts, I know we're screwing up.

Five short links

Photo by Bitzi

Geotagging poses security risks – An impressively level-headed look at how the quiet embedding of locations within photos can cause security issues, especially for the service members it's aimed at.

I can't stop looking at tiny homes – I was so happy to discover I'm not the only one obsessed with houses the size of dog kennels. If you're a fellow sufferer, avoid this site at all costs.

From CMS to DMS – Are we moving into an era of Data Management Systems, that play the same interface role for our data that CMS's do for our content?

Drug data reveals sneaky side effects – Drew Breunig pointed me at this example of how bulk data is more than the sum of its parts. By combining a large number of adverse reaction reports, researchers discovered new side effects caused by mixing drugs.

Gisgraphy – An intriguing open-source LGPL project that offers geocoding services based on OpenStreetMap and Geonames information. I look forward to checking this out and having a play.

Keep the web weird

Photo by Jeremy Brooks

I'm doing a short talk at SXSW tomorrow, as part of a panel on Creating the Internet of Entities. Preparing is tough because I don't believe it's possible, and even if it was I wouldn't like it. Opposing better semantic tagging feels like hating on Girl Scout cookies, but I've realized that I like an internet full of messy, redundant, ambiguous data.

The stated goal of an Internet of Entities is a web where "real-world people, places, and things can be referenced unambiguously". We already have that. Most pages give enough context and attributes for a person to figure out which real world entity it's talking about. What the definition is trying to get at is a reference that a machine can understand.

The implicit goal of this and similar initiatives like Stephen Wolfram's .data proposal is to make a web that's more computable. Right now, the pages that make up the web are a soup of human-readable text, a long way from the structured numbers and canonical identifiers that programs need to calculate with. I often feel frustrated as I try to divine answers from chaotic, unstructured text, but I've also learned to appreciate the advantages of the current state of things.

Producers should focus on producing

The web is written for humans to read, and anything that requires the writers to stop and add extra tagging will reduce how much content they create. The original idea of the Semantic Web was that we'd somehow persuade the people who create websites to add invisible structure, but we had no way to motivate them to do it. I've given up on that idea. If we as developers want to create something, we should do the work of dealing with whatever form they've published their information in, and not expect them to jump through hoops for our benefit.

I also don't trust what creators tell me when they give me tags. Even if they're honest, there's no feedback for whether they've picked the right entity code or not. The only ways I've seen anything like this work are social bookmarking services like the late, lamented del.icio.us, or more modern approaches like Mendeley, where picking the right category tag gives the user something useful in return, so they have an incentive both to take the action and to do it right.

Ambiguity is preserved

The example I'm using in my talk is the location field on a Twitter profile. It's free-form text, and it's been my nemesis for years. I often want to plot users by location on a map, and that has meant taking those arbitrary strings and trying to figure out what they actually mean. By contrast, Facebook forces users to pick from a whitelist of city names, so there's only a small number of exact strings to deal with, and they even handily supply coordinates for each.

You'd think I'd be much happier with this approach, but actually it has made the data a lot less useful. Twitter users will often get creative, putting in neighborhood, region, or other place names, and those let me answer a lot of questions that Facebook's more strait-laced places can't. Neighborhoods are a fascinating example. There's no standard for their boundaries or names, they're a true folksonomy. My San Francisco apartment has been described as being in the Lower Haight, Duboce Triangle, or Upper Castro, depending on who you ask, and the Twitter location field gives me insights into the natural voting process that drives this sort of naming.

There are many other examples I could use of how powerful free-form text is, like how the prevalence of "bicoastal" and "flyover countries" as descriptions changes over time, but the key point is that they're only possible because ambiguous descriptions are allowed. A strict reference scheme like Facebook's makes those applications impossible.

Redundancy is powerful

When we're describing something to someone else, we'll always give a lot more information than is strictly needed. Most postal addresses could be expressed as just a long zip code and a house number, but when we're mailing letters we include street, city and state names. When we're talking about someone, we'll say something like "John Phillips, the lawyer friend of Val's, with the green hair, lives in the Tenderloin", when half that information would be enough to uniquely identify the person we mean.

We do this because we're communicating with unreliable receivers: we don't know what will get lost in transmission as the postie drops your envelope in a puddle, or exactly what information will ring a bell as you're describing someone. All that extra information is manna from heaven for someone doing information processing though. For example I've been experimenting with a completely free map of zip code boundaries, based on the fact that I can find latitude/longitude coordinates for most postal addresses using just the street number, name, and city, which gives me a cluster of points for each zip. The same approach works for extra terms used in conjunction with people or places – there must be a high correlation between the phrases "dashingly handsome man about town" and "Pete Warden" on pages around the web. I'm practically certain. Probably.
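To make that zip experiment a bit more concrete, here's a toy sketch of the clustering step; the input rows are invented for the example, and the real pipeline obviously needs a geocoding stage in front of it:

# Each row is [zip, latitude, longitude] for one geocoded street address.
points = [
  ['80304', 40.027, -105.281],
  ['80304', 40.031, -105.270],
  ['94103', 37.772, -122.410],
]

# Group the points by zip and boil each cluster down to a centroid and
# bounding box as a crude stand-in for a boundary.
points.group_by { |zip, _, _| zip }.each do |zip, rows|
  lats = rows.map { |_, lat, _| lat }
  lons = rows.map { |_, _, lon| lon }
  centroid = [lats.sum / lats.size, lons.sum / lons.size]
  bbox = [lats.min, lons.min, lats.max, lons.max]
  puts "#{zip}: centroid #{centroid.inspect}, bbox #{bbox.inspect}"
end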

Canonical schemes are very brittle in response to errors. If you pick the wrong code for a person or place, it's very hard to recover. Natural language descriptions are much harder for computers to deal with, but not only are they far more error-resistant, the redundant information they include often has powerful applications. The only reason Jetpac can pick good travel photos from your friends is that the 'junk' words used in the captions turned out to be strong predictors of picture quality.

Fighting the good fight

I'm looking forward to the panel tomorrow, because all of the participants are doing work I find fascinating and useful. Despite everything I've said, we do desperately need better standards for identifying entities, and I'm going to do what I can to help. I just see this as a problem we need to tackle more with engineering than evangelism. I think our energy is best spent on building smarter algorithms to handle a fallen world, and designing interchange formats for the data we do salvage from the chaos.

The web is literature; sprawling, ambiguous, contradictory, and weird. Let's preserve those as virtues, and write better code to cope with the resulting mess.

Five short links

Photo by Phillip Hay

Kartograph – An open-source web component for rendering beautiful interactive maps using SVG. Fantastic work by Gregor Aisch.

Hard science, soft science, hardware, software – I have a blog crush on John D. Cook's site, it's full of thought-provoking articles like this. As someone who's learned a lot from the humanities, I think he gets the distinction between the sciences exactly right. Disciplines that don't have elegant theoretical frameworks and clear-cut analytical tools for answering questions do take a lot more work to arrive at usable truths.

Don't fear the web – A good overview of moral panics on the internet, and how we should react to the dangers of new technology.

Using regression isolation to decimate bug time-to-fix – Once you're dealing with massive, interdependent software systems, there's a whole different world of problems. This takes me back to my days of working with multi-million-line code bases, where automating testing and bug reporting becomes essential.

Humanitarian OpenStreetMap Team – I knew OSM did wonderful work around the world, but I wasn't aware of HOT until now, great to see it all collected in one place.

Five short links

Photo by Jody Morgan

Open Data Handbook Launched – I love what the Open Knowledge Foundation are doing with their manuals. Documentation is hard and unglamorous work, but has an amazing impact. I'm looking forward to their upcoming title on data journalism.

My first poofer Workshop – This one's already gone, but I'm hoping there will be another soon. I can't think of a better way to spend an afternoon than learning to build your very own ornamental flamethrower.

Using photo networks to reveal your home town – Very few people understand how the sheer volume of data that we're producing makes it possible to produce scarily accurate guesses from seemingly sparse fragments of information. When you look at a single piece in isolation it looks harmless, but pull enough together and the result becomes very revealing.

Introducing SenseiDB – Another intriguing open-source data project from LinkedIn. There's a strong focus on the bulk loading process, which in my experience is the hardest part to engineer. Reading the documentation leaves me wanting more information on their internal DataBus protocol, I bet that includes some interesting tricks.

IPUMS and NHGIS – As someone who recently spent far too long trying to match the BLS's proprietary codes for counties with the US Census's FIPS standard, I know how painful the process of making statistics usable can be. There's a world of difference between file dumps in obscure formats with incompatible time periods and units, and a clean set that you can perform calculations on. I was excited to discover the work being done at the University of Minnesota to create unified data sets that cover a long period of time, and much of the world.

Data scientists came out of the closet at Strata

Photo by Sarah Ackerman

Roger Magoulas asked me an interesting question during Strata – what was the biggest theme that emerged from this year's gathering? It took a bit of thought, but I realized that I was seeing a lot of people from all kinds of professions and organizations becoming conscious and open about their identity as data scientists.

The term itself has received a lot of criticism and there are always worries about 'big-data-washing', but what became clear from dozens of conversations was that it's describing something very real and innovative. The people I talked to came from professions as diverse as insurance, physics, marketing, geology, quantitative finance, biology, and web development, and they were all excited about the same new tools and ways of thinking. Kaggle is concrete proof that the same machine-learning skills can be applied across a lot of different domains to produce better results than traditional approaches, and the same is being proved for all sorts of other techniques from NoSQL databases to Hadoop.

A year ago, your manager would probably roll her eyes if you were in a traditional sector and she caught you experimenting with the standard data science tools. These days, there's an awareness and acceptance that they have some true advantages over the old approaches, and so people have been able to make an official case for using them within their jobs. There's also been a massive amount of cross-fertilization, as it's become clear how transferable across domains the best practices are.

This year thousands of people across the world have realized they have problems and skills in common with others they would never have imagined talking to. It's been a real pleasure seeing so much knowledge being shared across boundaries, as people realize that 'data scientist' is a useful label for helping them connect with other people and resources that can help with their problems. We're starting to develop a community, and a surprising amount of the growth is from those who are announcing their professional identity as data scientists for the first time.

Five short links

Picture by Don O'Brien

DepthCam – An open-source Kinect hack that streams live depth information to a browser using WebSockets for transport and WebGL for display. If you pick the right time of day, you'll see the researcher sipping his tea and tapping at the keyboard, in depth form!

OpenGeocoder – Steve Coast is at it again, this time with a wiki-esque approach to geocoding. You type in a query string, and if it's not found you can define it yourself. I'm obsessed with the need for an open-source geocoder, and this is a fascinating take on the problem. By doing a simple string match, rather than trying to decompose and normalize the words, a lot of the complexity is removed. This is either madness or genius, but I'm hoping the latter. The tradeoff will be completely worthwhile if it makes it more likely that people will contribute.

A beautiful algorithm – I spent many hours as a larval programmer implementing different versions of Conway's Game of Life. As I read about new approaches, I was impressed by how much difference in speed there could be between my obvious brute force implementation, and those that used insights to avoid a lot of the unnecessary work. It's been two decades since I followed the area, so I was delighted to see how far it has come. In the old days, it would take a noticeable amount of time for a large grid to go through a single generation. Nowadays "it takes a second or so for Bill Gosper’s HashLife algorithm to leap one hundred and forty-three quadrillion generations into the future". There truly is something deeply inspiring about the effort that's gone into that progress, for a problem that's never had any commercial application.

BerkeleyDB's architecture – This long-form analysis of the evolution of a database's architecture rings very true. Read the embedded design lesson boxes even if you don't have time for the whole article, they're opinionated but thoughtful and backed up with evidence in the main text.

"View naming and style inconsistencies as some programmers investing time and effort to lie to the other programmers, and vice versa. Failing to follow house coding conventions is a firing offense".

"There is rarely such thing as an unimportant bug. Sure, there's a typo now and then, but usually a bug implies somebody didn't fully understand what they were doing and implemented the wrong thing. When you fix a bug, don't look for the symptom: look for the underlying cause, the misunderstanding"

Content Creep – There's a lot to think about in this exploration of media's response to a changing world. Using the abstract word "content" instead of talking concretely about stories, articles, or blog posts seems to go along with a distant relationship with the output your organization is creating. Thinking in terms of content simplifies problems too much, so that the value of one particular piece over another is forgotten.

Why Facebook’s data will change our world


When I told a friend about my work at Jetpac he nodded sagely and said "You just can't resist Facebook data can you? Like a dog returning to its own vomit". He's right, I'm completely entranced by the information we're pouring into the service. All my privacy investigations were by-products of my obsessive quest for data. So with Facebook's IPO looming, why do I think research using its data will be so world-changing?

Population

Everyone is on Facebook. I know, you're not, but most organizations can treat you like someone without a phone or TV twenty years ago. The medium is so prevalent that if you're not on it, it's commercially viable to ignore you. This broad coverage also makes it possible to answer questions with the data that are impossible with other sources.

It's intriguing to know which phrases are trending on Twitter, but with only a small proportion of the population on the service, it's hard to know how much that reflects the country as a whole. The small and biased sample immediately makes every conclusion you draw suspect. There's plenty of other ways to mess up your study of course, but if you have two-thirds of a three-hundred-million-person population in your data, a lot of hard problems become solvable.

Coverage

Love, friendship, family, cooking, travel, play, partying, sickness, entertainment, study, work: We leave traces of almost everything we care about on Facebook. We've never had records like this, outside of personal diaries. Blogs, government records, school transcripts, nothing captures such a rich slice of our lives.

The range of activities on Facebook not only lets us investigate poorly-understood areas of our behavior, it allows us to tie together many more factors than are available from any other source. How does travel affect our chances of getting sick? Are people who are close to their family different in how they date from those who are more distant?

Frequency

The majority of my friends on Facebook update at least once a day, with quite a few doing multiple updates. We've found the average Jetpac user has had over 200,000 photos shared with them by their friends! This continuous and sustained instrumentation of our lives is unlike anything we've ever seen before; we generate dozens or hundreds of nuggets of information about what we're doing every week. This frequency means it's possible to follow changes over time in a way that few other sources can match.

Accessibility

It's at least theoretically possible for researchers to get their hands on Facebook's data in bulk. A large and increasing amount of activity on the site happens in communal spaces where people know casual friends will see it. Expectations of privacy are a fiercely fought-over issue, but the service is fundamentally about sharing in a much wider way than emails or phone calls allow.

This background means that it's technically feasible to access large amounts of data in a way that's not true for the fragmented and siloed world of email stores, and definitely isn't true for the old-school storage of phone records. The different privacy expectations also allow researchers to at least make a case for analyses like the Politico Facebook project. It's incredibly controversial, for good reason, but I expect to see some rough consensus emerge about how much we trade off privacy for the fruits of research.

Connections

I left this until last because I think it's the least distinctive part of Facebook's data. It's nice to have the explicit friendships, but every communication network can derive much better information on relationships based on the implicit signals of who talks to who. There are some advantages to recording the weak ties that most Facebook friendships represent, and it saves an extra analysis step, but even most social networks internally rely on implicit signals for recommendations and other applications that rely on identifying real relationships.

The Future

This is the first time in history that most people are creating a detailed record of their lives in a shared space. We've always relied on one-time, narrow surveys of a small number of people to understand ourselves. With Facebook's data we have an incredible source that's so different from existing data we can gather that it makes it possible to answer questions we've never been able to before.

We can already see glimmers of this as hackers machete their way through a jungle of technical and privacy problems, but once the working conditions improve we'll see a flood of established researchers enter the field. They've honed their skills on meagre traditional information sources, and I'll be excited when I see their results on far broader collections of data. The insights into ourselves that their research gives us will change our world radically.

Five short links

Photo by Vikas Rana

Dr Data's Blog – I love discovering new blogs, and this one's a gem. The State of Data posts are especially useful, with a lot of intriguing resources like csvkit.

TempoDB – Dealing with time series data at scale is a real pain, so I was pleased to run across this Techstars graduate. It's a database-as-a-service optimized for massive sets of time series data, behind a simple and modern REST/JSON API. We're generating so many streams of data from sensors and logs that the world needs something like this, as evidenced by the customers they're signing up, and I'm excited to follow their progress.

Foodborne Outbreaks – Unappetizing they may be, but this collection of food-poisoning cases is crying out to be visualized. (via Joe Adler)

Scalding – Another creation of the Cambrian Explosion of data tools, this Scala API for Cascading looks like it's informed by a lot of experience in the trenches at Twitter.

How to create a visualization – In a post on the O'Reilly blog I lay out how I tackle building a new visualization from scratch.