Why I abandoned Tokyo Tyrant

Gamera

Photo by TCM Hitchhiker

First off, I want to say how impressed I've been by the work Mikio Hirabayashi has done on Tokyo Cabinet and Tyrant. I like it a lot and really, really wanted to use it as the data store for twitter.mailana.com. Since I decided it wouldn't work for me after trying it out, it seems worthwhile documenting my experiences. There are obviously plenty of people using it in production, so don't let me put you off trying it. I just want to leave a trail of breadcrumbs for anyone else hitting similar issues.

As I explained in my initial tutorial, getting started with Tokyo was straightforward. Once I'd got a basic Tyrant server running, I exported a 2 million row table from MySQL and wrote a PHP test-harness to load that into Tyrant. I chose an on-disk hash table for the store, since that promised an in-memory index for super-fast lookups from a key, and used the default localhost:1978 port.

For my first attempt I went with the HTTP interface. It took a bit of jiggery-pokery to persuade PHP's CURL to properly handle a custom PUT method, but I implemented it and ran the test. After a few thousand rows, it suddenly started taking almost exactly 2 seconds for each HTTP put. Looking at the timing I output to the error log, I'd see 1.998s, 2.001s, etc. This was very suspicious, since it smelt like a timeout, though the CURL library was set to time out after 30 seconds, not 2.

I guessed this must be something wrong with CURL, so to get around it I wrote a pure socket-based version of the code, which wasn't too hard since both the requests and responses were very simple. Running it again, I started to see a PHP notice "errno=11 Resource temporarily unavailable" from fwrite after a few thousand insertions. Digging around the internet, this seemed like a pretty generic error, but it might be related to running out of socket resources. I was closing the file handle to the socket every time, and looking through the network tools on my Red Hat Fedora 8 system, I couldn't see a leak.
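The original PHP isn't shown here, so the following is only a rough sketch (in Python) of what a hand-rolled PUT over a plain socket can look like. The function names are made up for illustration, and the request format is just standard HTTP/1.1 rather than anything specific to Tyrant.

```python
import socket
from urllib.parse import quote


def build_put_request(host: str, port: int, key: str, value: bytes) -> bytes:
    """Build a minimal HTTP/1.1 PUT request that stores `value` under `key`."""
    headers = (
        f"PUT /{quote(key, safe='')} HTTP/1.1\r\n"
        f"Host: {host}:{port}\r\n"
        f"Content-Length: {len(value)}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    return headers.encode("ascii") + value


def put_over_socket(host: str, port: int, key: str, value: bytes,
                    timeout: float = 30.0) -> str:
    """Send the PUT over a plain TCP socket and return the response status line."""
    request = build_put_request(host, port, key, value)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # sendall keeps writing until the whole buffer has gone out,
        # so partial writes don't need to be handled by hand.
        sock.sendall(request)
        response = b""
        while chunk := sock.recv(4096):
            response += chunk
    return response.split(b"\r\n", 1)[0].decode("ascii", "replace")
```

With a server listening on the default port, `put_over_socket("localhost", 1978, "one", b"first")` would send the request and hand back the status line. Opening a fresh connection per call mirrors the "close the handle every time" approach described above.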

Since this seemed like it might be related to my use of TCP/IP, I switched Tokyo and my code over to using a Unix file-system socket instead. I still saw the same error.

To see if this was only a problem with the HTTP interface, I pulled down the Net_TokyoTyrant PHP library that uses the raw binary socket interface and integrated that. This also allowed me to keep a single socket connection for my whole import, to avoid any resource leaking. Still no dice: I was seeing the same error.

At this point I was getting suspicious of PHP's socket handling. Both my original code and Net_TokyoTyrant were using the newer fsockopen() style stream-based sockets rather than the older socket_create() family of functions. I rewrote Net_TokyoTyrant to use the older socket_* functions instead, and reran my test. Instead of hanging and printing a warning notice, it now just hung after a few thousand rows.

Looking for something else to try, I opened and closed the socket connection for every transaction, rather than keeping it open for the whole import. This actually worked! I left it running, and noticed that any of my values that were larger than 128,000 bytes (interestingly not 128*1024) were being truncated. Since only a few of my rows had BLOBs that large, I left the test running. When I came back to check on it a while later, I discovered that it had crashed. Looking into why, I found that the Tyrant log files had filled up the drive that held all my data. Every socket open and close was logged, and there were enough of them to fill gigabytes of drive space.
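Silent truncation like this is easy to miss during a bulk import. One cheap guard is to read each value back after writing it and compare. The sketch below is a Python illustration of that idea, not the code from the import. `TruncatingStore` is a hypothetical stand-in that mimics the 128,000-byte cut-off, so the check can be tried without a server.

```python
class TruncatingStore:
    """Hypothetical stand-in for a client that silently cuts values at a fixed size."""

    def __init__(self, limit):
        self.limit = limit
        self.data = {}

    def put(self, key, value):
        self.data[key] = value[: self.limit]

    def get(self, key):
        return self.data.get(key)


def put_verified(store, key, value):
    """Write a value, read it back, and raise if the stored copy differs."""
    store.put(key, value)
    stored = store.get(key)
    if stored != value:
        stored_len = 0 if stored is None else len(stored)
        raise ValueError(
            f"{key!r}: wrote {len(value)} bytes, read back {stored_len}"
        )
```

The extra read doubles the round trips, so it probably only makes sense for a test run or for values above some size threshold.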

That was the point I decided to stop my evaluation, since I'd run into more issues than I felt comfortable with. If there had been just a couple, I could have justified investing some debugging time and submitted patches, or at least good reports. Instead I was obviously well off the beaten path, and getting Tokyo reliably working in my PHP/Fedora environment felt like it might take several more days.

It definitely wasn't wasted work though. As I'd been researching some of the issues I'd run across an intriguing note in the memcached FAQ, stating that the InnoDB engine for MySQL supported very fast primary key lookups. Since the main reason I was moving away from MySQL was painfully slow primary key fetches when using the default MyISAM engine on massive tables, this offered a solution. I kept all the other code I'd written to translate my more complex queries and analysis into a key/value store, and replaced the Tokyo get/put/out primitive functions with implementations that ran against an InnoDB table in MySQL.
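To make the "swap out the lowest layer" idea concrete, here is a minimal sketch of get/put/out primitives backed by a single SQL table. It is written in Python and uses the built-in sqlite3 module purely as a stand-in so it runs anywhere. The real setup described above was PHP against an InnoDB table in MySQL, and the table and column names here are invented.

```python
import sqlite3


class SqlKeyValueStore:
    """get/put/out on top of one two-column table with a primary-key lookup."""

    def __init__(self, connection):
        self.db = connection
        # Assumption: on MySQL this would be a CREATE TABLE ... ENGINE=InnoDB.
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB NOT NULL)"
        )

    def put(self, key, value):
        # SQLite spelling of an upsert; MySQL has REPLACE INTO for the same job.
        self.db.execute(
            "INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value)
        )
        self.db.commit()

    def get(self, key):
        row = self.db.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        return None if row is None else row[0]

    def out(self, key):
        """Delete a key; returns True if something was actually removed."""
        cursor = self.db.execute("DELETE FROM kv WHERE k = ?", (key,))
        self.db.commit()
        return cursor.rowcount > 0
```

Because the rest of the code only ever calls these three methods, a Tyrant-backed class with the same interface could be dropped in later without touching the query-translation layer.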

This worked out great, and I was able to speed up twitter.mailana.com a lot. Even better, I'm still in a good position to use Tokyo in the future if I am able to get it running reliably. I'll just need to swap out the lowest layer.

[Update – I've now switched back to Tokyo Tyrant after fixing the truncation bug in the PHP wrapper: http://petewarden.typepad.com/searchbrowser/2009/06/how-to-get-tokyo-tyrant-working-in-php.html ]

Visualize your networking with PeopleMaps

[PeopleMaps screenshot]

A friend recently pointed me towards PeopleMaps, a new service that's currently in open beta. It's a tool for exploring your social network aimed at business users, focused on helping you make connections to avoid cold sales calls and get introductions. They use path-finding algorithms to navigate all the possible chains between you and your target, trying to find the route with the fewest hops, and displaying the possibilities as a simple tree.
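PeopleMaps doesn't say which path-finding algorithm they use, so take this as a guess at the general idea rather than their method: on an unweighted contact graph, a breadth-first search finds a chain with the fewest hops. A small Python sketch, with a made-up adjacency-list format:

```python
from collections import deque


def shortest_intro_chain(contacts, start, target):
    """Return a fewest-hops chain of people from start to target, or None."""
    if start == target:
        return [start]
    previous = {start: None}  # who introduced us to each person reached so far
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for friend in contacts.get(person, ()):
            if friend in previous:
                continue  # already reached by a route at least as short
            previous[friend] = person
            if friend == target:
                # Walk the introductions backwards to rebuild the chain.
                chain = [friend]
                while previous[chain[-1]] is not None:
                    chain.append(previous[chain[-1]])
                return chain[::-1]
            queue.append(friend)
    return None
```

Calling it with a dictionary such as `{"me": ["ann"], "ann": ["me", "cat"], "cat": ["ann"]}` and the endpoints `"me"` and `"cat"` gives the chain through `ann`.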

I really like their emphasis on a painful problem for a very specific market, sales professionals. It's not easy to figure out how to get decent introductions for friends-of-friends through LinkedIn, so I can see this being a very popular service. It looks like a great tool for my "You guys should talk" crusade.

One thing I'm not sure about is where they're getting their relationship data from. They have a great set of tools to import your contacts from Gmail, LinkedIn and Outlook, but they promise to only use that information for your own private network. Companies like Spoke did something similar in the past, aggregating lots of individual contacts from their Outlook plugin to build a global graph, but PeopleMaps' privacy policy appears to rule this out.

Anyway, I love what they're up to. It seems very well thought-out and they've got an impressive team, so I look forward to seeing more.

Reblog: Finding your customers

Supermarket

Photo by Steve Crane

As I was checking out some of my new Twitter followers, I ran across Keith William's blog. It's new and so pretty sparse, but his post on finding your customers stood out. As you may be able to tell from my blogging, getting customers has been top of my priority list recently, and it's something I'm a novice at, so I'm hungry for any advice I can get.

He lays out some practical steps to test any startup idea against the real world by trying to win customers. It reminds me of Mike Maple's Customer Development talk, but with a very concrete 9-stage program that will hone your startup ideas at the same time as you build a base of influential customers. It's obviously hard-won wisdom from his time in the trenches, and even if I don't follow the exact recipe it's got me thinking about what I can steal from it. Read the full post, but here are the steps:

1) Figure out what you think your product or technology is, who
would be your actual customer, and generally how you think your
business model would work.

2) Envision a complete start to finish incarnation of this ‘thing’
and find some general supporting evidence that there is a market. 
Don’t get married to this, it’s only a starting point.

3) Find a Trade Show, presentation, or gathering where you can find
many of the target customers.  This can also just mean going out and
talking with individuals.

4) Go there and talk to as many people as possible.  Pitch
everyone.  If you’re getting funny looks keep adjusting your pitch
until they ‘get it’ or you ‘get theirs’ instead.

5) Find at least one champion at an organization that would be a
target customer, but also has clout within your industry to sway
others. If you’ve polished your pitch quickly enough, you most likely
will find one at whatever gathering you’re at.

6) Keep pitching to your champion and honing your message until it
begins to generate traction both with your champion and also throughout
the rest of your audience.

7) Start submitting abstract proposals to present at other industry
gatherings, create a corporate blog, generate supporting data to show
your position as valid and publish it as a whitepaper.

8) Wash, rinse, repeat. Keep going around and around steps 3 through 7.
This will create the market churn and you’ll find one contact feeding
off of any small bit of positive news from the other… soon this growth
becomes somewhat organic.

9) Now, step back and look what you have: A cogent message for your
market, a pitch that has turned into the definition of your product or
service, and real customers to reach out, touch, AND QUANTIFY for your
business planning pleasure.

Tokyo Tyrant Tutorial

Godzillavsbuddha

Photo by Olivander

As the number of messages on twitter.mailana.com has approached 200 million, the response speed has been dropping. In fact, as @travisspencer observed, it's slow as sin. The back end is all MySQL, and I've spent a lot of time denormalizing my data, indexing and trying other optimizations to speed it up, but I've reached the point where just doing a simple SELECT * WHERE primarykey=something; can take up to a minute. I'm sure a real guru could wave a dead chicken over my table structure whilst murmuring incantations and get the performance I need, but I've reached the point where it's easier to move to a simpler system with fewer moving parts.

That's where Tokyo Cabinet comes in. It's a much more primitive system than MySQL, just a key->value store rather than a relational database. That means I'll have to implement things like sorting and grouping in the client code, but my key requirement is that it fetch modest numbers of rows from massive datasets very quickly. The performance numbers quoted are very impressive, with millions of fetches and inserts possible per second, so I'm evaluating it now.
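As an illustration of what "sorting and grouping in the client code" means in practice, here is a small Python sketch that does by hand what a GROUP BY with an ORDER BY would do in SQL. The row layout (a dictionary with a `sender` field) and the function name are assumptions made up for the example; any object with a `get(key)` method can play the part of the store.

```python
from collections import Counter


def top_senders(store, message_keys, limit=3):
    """Emulate GROUP BY sender ORDER BY COUNT(*) DESC LIMIT n on the client."""
    counts = Counter()
    for key in message_keys:
        row = store.get(key)  # one key/value fetch per message
        if row is not None:   # skip keys that are missing from the store
            counts[row["sender"]] += 1
    # Highest count first; ties broken alphabetically so the order is stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:limit]
```

The trade-off is that every row has to cross the wire before it can be counted, which is only acceptable because each query touches a modest number of rows.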

If you're interested in trying it yourself, first read this article on the background behind Tokyo Cabinet and Tyrant, and then go through the 30 slide introduction by the author. The main documentation is at http://tokyocabinet.sourceforge.net/spex-en.html and http://tokyocabinet.sourceforge.net/tyrantdoc/, and it's pretty good, though at first the number of functions and their naming is overwhelming (e.g. tcadbputkeep2!). They do talk you through the installation process, but it's a bit scattered, so here are the steps I went through to get it running on my Red Hat Fedora 8 system:

To set up the underlying Tokyo Cabinet database engine:

curl "http://tokyocabinet.sourceforge.net/tokyocabinet-1.4.9.tar.gz" > tokyocabinet-1.4.9.tar.gz
gunzip tokyocabinet-1.4.9.tar.gz
tar -xf tokyocabinet-1.4.9.tar
cd tokyocabinet-1.4.9
yum install zlib-devel    # as root
yum install bzip2-devel   # as root
./configure
make
make check                # awesomely geeky text scrolls past for 10 mins
make install              # as root

To install the Tokyo Tyrant server that provides remote access to the database:

curl "http://tokyocabinet.sourceforge.net/tyrantpkg/tokyotyrant-1.1.16.tar.gz" > tokyotyrant-1.1.16.tar.gz
gunzip tokyotyrant-1.1.16.tar.gz
tar -xf tokyotyrant-1.1.16.tar
cd tokyotyrant-1.1.16
./configure
make
make install              # as root
ttserver                  # to check it installed OK; it runs in the foreground, so stop it with Ctrl-C

emacs /etc/rc.local       # add '/usr/local/sbin/ttservctl start' to the end of the file

ttservctl start

You should now have a server running. To test it out, push a value into the store and then retrieve it using the HTTP interface:

curl -X PUT -d 'first' 'http://localhost:1978/one'
curl 'http://localhost:1978/one'

Apple’s best-kept secret

Companystore

Photo by Ondra Soukup

Jeff Nolan just reminded me of one of my favorite parts of visiting the Apple campus in Cupertino – the Company Store. Despite the name it's not restricted to employees, and they have an amazing range of Apple-branded clothes, mugs and mouse-mats, none of it sold anywhere else in the world. It's been my secret weapon for the last few Christmases, since there's something for everyone. As Jeff's lovely model demonstrates, they even have the cutest baby-wear:

If you have a loved one who's joined the Apple Borg, I guarantee some serious brownie points!

How to find Twitter communities from keywords

[Screenshot: the social network formed around Gluecon]

Twitter.mailana.com started by visualizing your circle of friends, but I want to show networks formed around places or ideas too. Now if you pick a keyword you can see how all the people who've talked about that term are connected. For example, the graph above is a screenshot from the social network that's formed around Gluecon.
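The site's own code isn't shown, but the basic idea can be sketched in a few lines of Python: pick out everyone who has mentioned the keyword, then keep only the conversations that run between those people. The message format (`sender`, `recipient`, `text`) is an assumption invented for this example.

```python
from collections import defaultdict


def keyword_network(messages, keyword):
    """Link people who mentioned the keyword and have talked to each other."""
    needle = keyword.lower()
    talkers = {m["sender"] for m in messages if needle in m["text"].lower()}
    edges = defaultdict(int)
    for m in messages:
        a, b = m["sender"], m["recipient"]
        if a in talkers and b in talkers and a != b:
            # Undirected edge, weighted by how many messages were exchanged.
            edges[tuple(sorted((a, b)))] += 1
    return dict(edges)
```

The result maps each pair of connected people to a message count, which is enough to feed a graph layout or a ranking step.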

So what's this good for? My favorite use is discovery, finding new people in an area I'm interested in. For example I've been learning Flash over the last few months, so I searched for the community defined by 'Actionscript'. Immediately I saw an interesting central cluster:

[Screenshot: the central cluster of the 'Actionscript' network]

Looking up some of these folks, @mesh is Mike Chambers, a Flash expert at Adobe, @mdowney is Mike Downey, a Flex/AIR expert, @ddura and @leebrimelow are Adobe Flash Evangelists, @flashchemist is Vipin Chandran, an expert Flex developer.

What's important about this list is that you know these are well-connected, active people in the Flash community. In a way their position in the graph is like PageRank, with each conversation a vote for their importance. Most other measures like follower count or update totals can be gamed, but someone's place in the overall network is much tougher to fake. Looking at the whole community around a keyword is a great way of discovering the most interesting people, something no other search technique can match.
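The comparison with PageRank is only an analogy in the text, and nothing here claims the site actually computes it. Still, to show what "each conversation is a vote" could mean, here is a small Python sketch of textbook PageRank by power iteration over (speaker, addressee) pairs. The damping factor of 0.85 and the 50 iterations are conventional defaults, not values taken from the site.

```python
def rank_people(edges, damping=0.85, iterations=50):
    """Plain PageRank by power iteration over (speaker, addressee) pairs."""
    people = sorted({person for edge in edges for person in edge})
    if not people:
        return []
    outgoing = {person: [] for person in people}
    for speaker, addressee in edges:
        outgoing[speaker].append(addressee)
    n = len(people)
    score = {person: 1.0 / n for person in people}
    for _ in range(iterations):
        # People who address nobody spread their score evenly over everyone.
        dangling = sum(score[p] for p in people if not outgoing[p])
        new = {p: (1 - damping) / n + damping * dangling / n for p in people}
        for p in people:
            for q in outgoing[p]:
                new[q] += damping * score[p] / len(outgoing[p])
        score = new
    return sorted(score.items(), key=lambda item: -item[1])
```

Someone addressed by many people, who are themselves addressed by others, ends up near the top, which is harder to fake than a raw follower count.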