How to speed up data loads to SimpleDB

January 18, 2010 By Pete Warden in Uncategorized Leave a comment

I'm really keen to use Amazon's SimpleDB service to store my data, but the upload process is just too damn slow. A naive implementation of a loader lets me upload about 20 rows a second, and since I've got over 200 million rows, that would take around 6 months! Sid kindly shared his experiences with Netflix's massive data transfer to SimpleDB over at practicalcloudcomputing.com, and he achieved rates of over 10,000 items a second. He's been very generous with advice, but obviously can't share any proprietary code, so I've set out to implement an open-source data loader in Java to implement his suggestions.

The code is up at:
http://github.com/petewarden/simpledb_loader

It uploads 10,000 generated rows using these optimizations:
– Calling BatchPutAttributes() to upload 20 rows at a time
– Multiple threads to run requests in parallel
– Leaving Replace as false for the overwrite behavior

Despite that, I'm still only seeing around 140 items a second, which is a long way off Sid's results. I'm going to be doing some more work on this, but I'd love it if anyone from Amazon could jump in and help put together an example that implements all their best practices. Judging from the forums there's a lot of people stuck on exactly this problem and it would making porting over existing services a lot easier.

The missing control panel for SimpleDB

January 16, 2010 By Pete Warden in Uncategorized Leave a comment

I've been trying to upload around 210 million items to Amazon's SimpleDB service, which has been quite an adventure! Sid Anand's advice has been invaluable (he's done an even larger migration of data for Netflix), and I'll be blogging in more depth on the details, but one of the early problems I hit was the lack of any easy way to interact with the store. With MySQL you at least get a console you can use to sanity check your results, but SimpleDB was a black box.

Eventually I discovered a handy solution, SimpleDB Explorer. It's a commercial product, but comes with a free 30 day trial and only costs $35. I loathe Java for GUIs, and it does have some quirks like over-enthusiastic dialogs that pop up willy-nilly, but it does run on Windows, Linux and OS X. It's got the functionality you'd expect, you can edit the overall structure of the store, run queries or just browse the data to make sure it looks reasonable. It's saved me a lot of time, if you're doing any serious work with SimpleDB I'd highly recommend buying it.

How to find user information from an email address

January 8, 2010 By Pete Warden in Uncategorized 1 Comment

Photo by Mzelle Biscotte

I’ve had a lot of people ask me about the FindByEmail service I set up, so I’ve decided to release the code as open-source. You pass it an email address, and it queries 11 different public APIs to discover what information those services have on the user with that email address. Give it a try for yourself by entering an email address below:

Email address:

The code is under the 2-clause BSD license, to make it easy for commercial reuse. It’s all in PHP, and you’ll need to add your own API keys for some of the services to config.php before you can use it yourself. It’s up on github at

http://github.com/petewarden/findbyemail

If you do find more services that offer an email-to-user mapping, either let me know and I’ll add them, or fork the project and I’ll merge your changes back in. The module currently supports these services:

Gravatar
Yahoo
43things
Vimeo
Amazon
Brightkite
AIM
Friendfeed
Google Social Graph
Rapleaf
DandyID

The last four conglomerate information for multiple services, so it can sometimes retrieve Twitter, LinkedIn and Facebook account data. There’s also some code for querying Skype, but since that involves setting up a Skype client instance running inside a headless X-Window session, I’ve commented that code out for now.

C Hashmap

January 7, 2010 By Pete Warden in Uncategorized 2 Comments

Photo by crazybarefootpoet

I still remember my excitement when I discovered Google after years of struggling with awful search engines like Altavista, but every now and again it really doesn't find what I'm looking for.

I was just going to bed on Tuesday night when I remembered I had to start a job processing 500 GB of data, or it would never be done in time for a deadline. This process (merging adjacent lines of data into a single record) was painfully slow in the scripting languages I tried, so I'd written a tool in plain C to handle it. Unfortunately I'd never tried it on this size of data, and I quickly discovered an O(n^2) performance bug that ground progress to a halt. To fix it, I needed a hashmap of strings to speed up a lookup, so I googled 'c hashmap' to grab an implementation. I was surprised at the sparseness of the results, the top hit appeared to be a learning project by Eliot Back.

Before I go any further, if you need a good plain C hashmap that's been battle-tested and generally rocks, use libjudy. Don't do what I did, trying to build your own is a silly use of anyone's time! My only excuse is that I thought it would be quicker to grab something simpler than libjudy, and I'd had a martini…

I stayed up until 2am trying to get the hash map integrated, discovering some nasty performance bugs in the implementation as I did. For instance, the original code actually tried to completely fill the hash map before it reallocated, which means for a large map it often searches linearly through most of the entries if the key isn't present, since it only stopped when it found a gap. I also removed the thread primitives, and converted it over to use strings as keys, with a CRC32 hashing function.

I don't make any claims for the strength of the resulting code, but at least this version has a unit test and I've used it in anger. Thanks to Eliot for the original, here's my updated source:

http://github.com/petewarden/c_hashmap

Hopefully this will help out any other late-night coders like me searching for 'C hashmap'!

Is it time to use page-views as loan collateral?

January 3, 2010 By Pete Warden in Uncategorized Leave a comment

Photo by Joshua De Laughter

I recently finished The Big Rich, a history of Texas oil-men by the author of Barbarians at the Gate. It was striking how similar the early days of Texas oil felt to the current web startup world, full of skeptical old companies, a few new-born giants and a crowd of wild-catters convinced they were just one lucky strike away from riches.

One detail that really struck me was an innovation in financing that enabled the independent operators to build their businesses. Bankers in Houston began giving out loans with the collateral based on the estimated reserves underneath a wildcatter's oil wells. This was unheard of, but it made perfect commercial sense. As long as the banks could rely on a trustworthy geological report, the reserves represented a steady stream of cash to guarantee any loan. In return, the independents were able to re-invest in the gear and labor needed to sink new wells and expand.

This got me wondering if this is a better model than the current angel/VC equity standard for web financing? If you have a pretty reliable income stream from advertising on a site, are there banks comfortable enough scrutinizing audited visitor reports to lend you money against that? Nothing I'm working on fits that description, but I'm genuinely curious if we're at a stage of maturity in the industry where this sort of thing makes sense.

I see a lot of businesses out there that are never going to be the next Google but could be decent money spinners with some reasonable financing. The VC model relies on hitting for the fences, so most of the solid prospects I see end up either boot-strapping painfully slowly, getting angels and disappointing them with comparatively unexciting growth, or just hitting the end of the runway.

How to speed up massive data set analysis by eliminating disk seeks

January 1, 2010 By Pete Warden in Uncategorized 1 Comment

Photo by Pchweat

Building fanpageanalytics.com means analyzing with billions of pieces of information about hundreds of millions of users. At this sort of scale not only do traditional relational databases become impractical for my needs (even loading a few tens-of-millions of rows into a mysql table and then creating an index can take days), key-value stores also fail.

Why do they fail? Let's walk through a typical data-flow example for my application. I have an input text file containing new information about a user, so I want to update that user's record in the database. Even with a key-value store that means moving the disk head to the right location to write that new information, since user records are scattered arbitrarily across the drive. That typically takes around 10ms, giving an effective limit of around 100 users per second. Even a million users will take over two hours to process at that rate, with almost all the time spent tapping our toes waiting for the hard drive.

Stores like Mongo and Redis try to work around this by caching as much as they can in RAM, and using delayed writes of large sectors to disk so that updates don't block on disk seeks. This works well until the data set is too large to fit in RAM. Since my access locations are essentially random, the system ends up thrashing as it constantly swaps large chunks in and out of main memory, and we're back to being limited by disk seek speed.

So what's the solution? SSD drives don't have the massive seek bottleneck of traditional disks, but I'm still waiting for them to show up as an option on EC2. Instead, I've re-engineered my analysis pipeline to avoid seeks at all costs.

The solution I've built is surprisingly low-tech, based entirely on text files and the unix sort command-line tool. For the user record example I run through my source data files and output a text file with line for each update, beginning each line with the user id, eg:

…
193839: { fanof:['cheese', 'beer'] }
…

I then run sort on these individual files, which since the command is very efficient and the individual files are only a couple of gigabytes in size, only takes a few seconds each. I can then take several hundred of these sorted sub-files and use the -m option on sort to very quickly merge them into an uber-file that's sorted, which avoids the thrashing you get when it tries to sort files larger than RAM.

What does this buy me? Within this uber-file, all the information related to a given user id is now in adjacent lines, eg:

…
193839: { fanof:['cheese', 'beer'] }
193839: { fanof:['hockey', 'ice fishing'] }
193839: { location:'Wisconsin' }
193839: { name:'Sven Hurgessoon' }
…

It's now pretty simple to write a script that runs through the uber-file and can output complete records containing all of a user's information from multiple source files without having to do any seeking, since you're just outputting each user to a new row or file, and all the source data is also local.

This same technique can be applied to any attribute you want to index in your source data. You can use the fan page name as the key in the first part of each line instead, which is how I'm assembling the data on each topic.

So in summary, I'm using sort to pre-order my data before processing to avoid seeks. I'm sure I'm not the only person to discover this, but it's not something that I've run across before, and it's enabled me to cope with orders-of-magnitude larger data sets than my pipeline could handle before.

How to guess gender from a first name in PHP

December 31, 2009 By Pete Warden in Uncategorized Leave a comment

Photo by Davezilla

If you've got someone's first name, it's possible to make a pretty accurate guess what their gender is. Obviously there's plenty of exceptions, Sean and Francis spring to mind, but for lots of applications you don't need 100% accuracy or coverage. In my case I want a better understanding the demographics of my users, so a figure that's within a few percent is fine.

There's a great Perl module called Text::GenderFromName that implements this idea, with accumulated wisdom dating all the way back to a 1991 awk script! I haven't found anything that fits well into my PHP projects though, so I finally bit the bullet and ported that Perl script to PHP. The result is up at

http://web.mailana.com/labs/genderfromname/

and you can get the source at

http://github.com/petewarden/genderfromname

Thanks to Eamon Daly and Jon Orwant for the original code, and apologies for the mechanical translation of Perl code to PHP. It's now painfully non-idiomatic, but it does work!

For best results you should also install the doublemetaphone PHP module, though it will function without it.

Introducing Fan Page Analytics

December 30, 2009 By Pete Warden in Uncategorized Leave a comment

Fan Page Analytics is a new project I've just launched to help answer questions about Facebook pages. Here's some examples:

Which parts of the world have the most fans of God? In the US the map pretty clears shows the traditional Bible Belt, but looking worldwide the Phillipines is pretty god-fearing too.

How does ReadWriteWeb's fan base compare to TechCrunch's? From the map, RWW is much more broadly based, whereas TC's readership is heavily concentrated in the traditional US tech centers of California, Washington and Massachusetts. I only see one venture capital fan page in RWW's top 20 most related pages, but I count 8 in TC. On the other hand there's a couple of HR related pages for RWW, and none for TC, which suggests a less geeky audience.

That's all fascinating, but what problem does it solve? Suppose I'm planning the next DEMO conference. Glancing at the related pages shows that Charlene Li and Fred Wilson are people my audience care about, so they should be top of my list to attend and spread the word. ReadWriteWeb and GigaOm fans are more likely to be fans of DEMO than Techcrunch readers, so I might get more bang for my buck buying ad space on those sites. Looking at the locations, CA, WA and MA are way ahead, so I can craft some Facebook ads targeted only at those areas and tied in to some of the other related interests. I can even look at some examples of users in particular locations or with shared interests to understand if they're really my target market. These appear in the right side-bar when you click on an area or a location.

This is an initial release, so expect a few bugs, and it's not yet got complete coverage of fan pages, so apologies if yours isn't there yet. Hopefully you can still have some fun uncovering things like Glenn Beck's fan base in Outer Mongolia.

What can I find out about you if I know your email address?

December 30, 2009 By Pete Warden in Uncategorized Leave a comment

Photo by HerzogBR

One of the least-understood developments of the last few years is the growth of databases of personal information linked to email addresses. Rapleaf is probably the leader in this field, but even Flickr lets companies search their API for users based on an email address. I wrote a service that queries all the data sources I could find to demonstrate how much is out there:

http://web.mailana.com/labs/findbyemail/

Give it a try for yourself, you might be surprised by how much companies can discover about you once they know your email address! Many services give out at least your full name and a location, which is often enough to get your address and a phone number from a service like whitepages.com.

How to make your intensity maps interactive

December 21, 2009 By Pete Warden in Uncategorized Leave a comment

Google’s Intensity Map charts are a really easy and clean way to show heat maps of geographic data. Unfortunately, there’s no way to take the next step in the user experience, and let people mouse over and click on the maps to see additional information about a particular area.

To solve that problem, I’ve written a PHP script that extracts the country and state boundaries from the maps and constructs a Javascript function to return the state code for any point on the maps. You can see an example of this in practice at http://web.mailana.com/labs/mapclicker/ or you can download the complete source, including the boundary extractor.

The example page will show the state or country code for any point you move the mouse over on either of the maps, and if you click, will display an alert showing the name. It assumes your image has the maximum 440×220 dimensions, but you can apply scaling to the coordinates if you are using something smaller.

	Moonshine Voice v2 v… on Announcing Moonshine Voice
	Pete Warden on Launching a free, open-source,…
	riddelln on Launching a free, open-source,…
	I see dead people. Y… on Announcing Moonshine Voice
	Pete Warden: Announc… on Announcing Moonshine Voice

Pete Warden's blog

Ever tried. Ever failed. No matter. Try Again. Fail again. Fail better.