An example of Tokyo Tyrant in PHP


Photo by TCM Hitchhiker

To make some progress in my attempts to use Tokyo Tyrant from PHP, I finally bit the bullet and created a standalone script with no dependencies that stress-tests the interface. It also works as an expanded example of how to use Tokyo from within PHP. You can download tokyotest.php here.

It includes both the Net_TokyoTyrant raw socket interface and my work-alike clone Net_HttpTyrant that uses HTTP instead. The main tokyo_test() function stores a large number of values, retrieves them and checks they're correct, and then deletes them, timing performance. Below is a rough sketch of that pattern, and after it my findings from my own experiments.
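The $tyrant object and its put/get/out method names here are illustrative stand-ins for whichever interface class the script is exercising, not the exact code in tokyotest.php:

function stress_test($tyrant, $numberofkeys, $valuelength) {
    // Build a test value of the requested length
    $value = str_repeat('x', $valuelength);

    // Time the puts
    $start = microtime(true);
    for ($i = 0; $i < $numberofkeys; $i += 1)
        $tyrant->put("testkey$i", $value);
    $puttime = microtime(true) - $start;

    // Time the gets, checking each value survived the round trip
    $start = microtime(true);
    for ($i = 0; $i < $numberofkeys; $i += 1) {
        if ($tyrant->get("testkey$i") !== $value)
            print "Mismatch on testkey$i\n";
    }
    $gettime = microtime(true) - $start;

    // Time the deletes
    $start = microtime(true);
    for ($i = 0; $i < $numberofkeys; $i += 1)
        $tyrant->out("testkey$i");
    $outtime = microtime(true) - $start;

    printf("put: %.2fs, get: %.2fs, out: %.2fs\n", $puttime, $gettime, $outtime);
}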

TURN OFF ULOG! I wish the <blink> tag still
worked, it's that important. I only just discovered this as the root
cause of log files eating my hard drive. The default ttservctl
script has the innocuous-sounding ulog option turned on by default.
This update log holds details of every transaction you make with the
server. This is great if you're replicating or need to do restores, but
these files grow rapidly and can easily fill up your hard drive! Disabling it also helped performance quite a bit in my tests. Obviously if you need this kind of backup you can't turn it off, but then you'll need some strategy to avoid running out of space.

Keep connections open. The initial problem I hit was caused by opening and closing a lot of sockets rapidly. After a few thousand, either PHP or the system throws an error. Try these URLs on the script to test for the problem on your system:
tokyotest.php?interface=http&numberofkeys=40000&valuelength=8
tokyotest.php?interface=raw&numberofkeys=40000&valuelength=8&closeeveryop=true
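The workaround is simply to open one socket and reuse it for every operation, only closing it when you're finished. Here's a minimal sketch of the difference, assuming a hypothetical do_operation() helper that writes a request and reads the reply:

// Fragile: a fresh socket for every operation eventually exhausts resources
for ($i = 0; $i < $numberofkeys; $i += 1) {
    $socket = fsockopen('localhost', 1978);
    do_operation($socket, "testkey$i", $value);
    fclose($socket);
}

// Better: open once, reuse the same connection, close at the end
$socket = fsockopen('localhost', 1978);
for ($i = 0; $i < $numberofkeys; $i += 1) {
    do_operation($socket, "testkey$i", $value);
}
fclose($socket);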

Don't use the HTTP interface. HTTP is great for quick hacks, but it is tough to avoid opening and closing a lot of connections in PHP. I notice one of the Perl example scripts uses keep-alive to maintain a persistent connection; doing the same in PHP might help a lot. HTTP is a lot more verbose than the raw sockets though, so if you're going to that much trouble it's probably simpler to use Net_TokyoTyrant instead.
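If you do stay with HTTP from PHP, one way to approximate a persistent connection is to create a single CURL handle and reuse it for every request, since libcurl will keep the underlying connection open between calls on the same handle. A rough sketch of the pattern, assuming the default localhost:1978 server:

// One handle for the whole batch, so the TCP connection can be reused
$curl = curl_init();
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

for ($i = 0; $i < $numberofkeys; $i += 1) {
    curl_setopt($curl, CURLOPT_URL, "http://localhost:1978/testkey$i");
    $value = curl_exec($curl); // a plain GET returns the stored value
}

curl_close($curl);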

Net_TokyoTyrant will truncate long values. An issue I haven't solved yet is that after a certain point, the current raw socket interface code will fail to store the end of long values. For example, if you store a 20,000 character string you'll only get back about 16,000 when you retrieve it. This isn't a Cabinet problem, since doing the same operation through HTTP works as expected. Here's a way to reproduce the problem:
tokyotest.php?interface=raw&numberofkeys=10&valuelength=100000

The same operation using the HTTP interface gives the expected results:
tokyotest.php?interface=http&numberofkeys=10&valuelength=100000
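My best guess at the moment is a classic short-read problem: fread() on a socket can return fewer bytes than you asked for, so a single fread() call for a long value will silently drop the tail. Here's a sketch of the kind of read loop that guards against that; this is my own illustration, not the actual Net_TokyoTyrant code:

// Keep reading until all $length bytes have arrived, since a single
// fread() on a socket may return less than was requested
function read_exactly($socket, $length) {
    $result = '';
    while (strlen($result) < $length) {
        $chunk = fread($socket, $length - strlen($result));
        if ($chunk === false || $chunk === '')
            break; // connection closed or error
        $result .= $chunk;
    }
    return $result;
}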

[Update- I've now tracked down what's going wrong, and have a fix for the PHP wrapper: http://petewarden.typepad.com/searchbrowser/2009/06/how-to-get-tokyo-tyrant-working-in-php.html ]

What is Mailana?


Photo by Charles Chan

I was putting together an email this morning introducing what I'm up to with Mailana, and I realized it would make a good blog post too. I've talked about a lot of this stuff in different articles, but never gathered it all together in one place. I'm also preparing for my talk at the Boulder/Denver NewTech meetup on March 23rd, so it's good practice for that as well.

For motivation, my unofficial tagline is "You guys should talk".
I'm driven by the last 5 years I spent at Apple, which was chock full
of smart people but had no system for connecting them to solve
problems. I want to be able to do things like find internal experts
based on email analysis and locate people with contacts at external
companies by building an opt-in company directory of skills. Since it
can be a tough sell to persuade companies to hand over their email data
to a startup, I've created http://twitter.mailana.com/
as a shop window for the technology. It's still early days but it lets
you visualize the actual patterns of conversations in Twitter in some
different ways.

Here's the technical background on Mailana: It's a system for
grabbing emails from all sorts of sources and doing server-side
analysis on the large data-sets you end up with. To grab the data I
have IMAP, Exchange, Outlook PST and Twitter import components. To
serve up the results I have a Facebook-style mini-app API with
HTML-based presentation components running through the browser, as
native Outlook tools, and in Sharepoint.

What do I need? I'm looking for progressive organizations interested in solving the sort of problems I'm tackling. I want to expand beyond my initial proof-of-concept pilots pulling data from Exchange and start tailoring the technology to address people's pressing needs.

Adding authentication to the SPIURL permanent Twitter portrait project


Photo by S~Revenge

Whenever a user changes their picture on Twitter the URL changes. This is a massive pain for applications like twitter.mailana.com that show users' images, since it requires a lot of code to check and update the links. In an ideal world Twitter would offer a permanent URL for every user's portrait. That's on their roadmap, but until they update their API, Shannon Whitley's SPIURL project offers the next best thing.

You can either download the Python code and host it on your own free AppSpot account, or use Shannon's public http://purl.org/net/spiurl/ link. Josh Fraser extended the code to support large portraits and added some other useful tweaks like a content-type for browser viewing.

I've been happily using my own copy of SPIURL for the last couple of weeks, but a few days ago I started noticing broken image links again. After a bit of investigating, I found I was hitting a limit of 100 requests per hour. This never used to happen, so I assume something changed on the Twitter side. To fix this I've added authentication to the API call (along with some more error reporting). Here's the main change:


import base64

        #Enter your own account details here
        #(encodestring() appends a newline, so strip it before using it as a header value)
        authString = "Basic " + base64.encodestring("yourusername:yourtwitterpassword").strip()
        response = urlfetch.fetch("http://twitter.com/users/show/" + _screen_name + ".xml", payload=None, method=urlfetch.GET, headers={"AUTHORIZATION" : authString}, allow_truncated=False, follow_redirects=False)

You can download the full code here. You'll need to change the authorization details to your own account, and ensure the account is white-listed. I'm still waiting for my rate limit to be bumped, so I'm not totally certain it's working; I'll update this when I am.

Why I abandoned Tokyo Tyrant


Photo by TCM Hitchhiker

First off, I want to say how impressed I've been by the work Mikio Hirabayashi has done on Tokyo Cabinet and Tyrant. I like it a lot and really, really wanted to use it as the data store for twitter.mailana.com. Since I decided it wouldn't work for me after trying it out, it seems worthwhile documenting my experiences. There are obviously plenty of people using it in production, so don't let me put you off trying it; I just want to leave a trail of bread-crumbs for anyone else hitting similar issues.

As I explained in my initial tutorial, getting started with Tokyo was straightforward. Once I'd got a basic Tyrant server running, I exported a 2 million row table from MySQL and wrote a PHP test-harness to load that into Tyrant. I chose an on-disk hash table for the store, since that promised an in-memory index for super-fast lookups from a key, and used the default localhost:1978 port.

For my first attempt I went with the HTTP interface. It took a bit of jiggery-pokery to persuade PHP's CURL to properly handle a custom PUT method, but I implemented it and ran the test. After a few thousand rows, it suddenly started taking almost exactly 2 seconds for each HTTP put. Looking at the timing I output to the error log, I'd see 1.998s, 2.001s, etc. This was very suspicious, since it smelt like a timeout, though the CURL library was set to timeout after 30 seconds, not 2.
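For the record, the jiggery-pokery mostly came down to forcing a custom request method and passing the value as the body. A simplified sketch of the sort of call I mean (not my exact test-harness code):

// Store 'somevalue' under the key 'somekey' through Tyrant's HTTP interface
$curl = curl_init('http://localhost:1978/somekey');
curl_setopt($curl, CURLOPT_CUSTOMREQUEST, 'PUT');
curl_setopt($curl, CURLOPT_POSTFIELDS, 'somevalue');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($curl);
curl_close($curl);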

I guessed this must be something wrong with CURL, so to get around it I wrote a pure socket-based version of the code, which wasn't too hard since both the requests and responses were very simple. Running it again, I started to see a PHP notice "errno=11 Resource temporarily unavailable" from fwrite after a few thousand insertions. Digging around the internet, this seemed like a pretty generic error, but it might be related to running out of socket resources. I was closing the file handle to the socket every time, and looking through the network tools on my Red Hat Fedora 8 system, I couldn't see a leak.

Since this seemed like it might be related to my use of TCP/IP, I switched Tokyo and my code over to using a Unix file-system socket instead. I still saw the same error.

To see if this was only a problem with the HTTP interface, I pulled down the Net_TokyoTyrant PHP library that uses the raw binary socket interface and integrated that. This also allowed me to keep a single socket connection open for my whole import, to avoid any resource leaking. Still no dice; I was seeing the same error.

At this point I was getting suspicious of PHP's socket handling. Both my original code and Net_TokyoTyrant were using the newer fsockopen() style stream-based sockets rather than the older socket_create() family of functions. I rewrote Net_TokyoTyrant to use the deprecated socket_* functions instead, and reran my test. Instead of hanging and printing a warning notice, it now just hung after a few thousand rows.

Looking for something else to try, I opened and closed the socket connection for every transaction, rather than keeping it open for the whole import. This actually worked! I left it running, and noticed that any of my values that were larger than 128,000 bytes (interestingly not 128*1024) were being truncated. Since only a few of my rows had BLOBs that large, I left the test running. When I came back to check on it a while later, I discovered that it had crashed. Looking into why, I found that the Tyrant log files had filled up the drive that held all my data. Every socket open and close was logged, and there were enough of them to fill gigabytes of drive space.

That was the point I decided to stop my evaluation; I'd run into more issues than I felt comfortable with. If there were just a couple I could have justified investing some debugging time and submitted patches or at least good reports. Instead I was obviously well off the beaten path, and getting Tokyo reliably working in my PHP/Fedora environment felt like it might take several more days.

It definitely wasn't wasted work though. As I'd been researching some of the issues I'd run across an intriguing note in the memcached FAQ, stating that the InnoDB engine for MySQL supported very fast primary key lookups. Since the main reason I was moving away from MySQL was painfully slow primary key fetches when using the default MyISAM engine on massive tables, this offered a solution. I kept all the other code I'd written to translate my more complex queries and analysis into a key/value store, and replaced the Tokyo get/put/out primitive functions with implementations that ran against an InnoDB table in MySQL.
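To make that swap concrete, the replacement layer only needs the same three primitives backed by an InnoDB table keyed on the primary key. Here's a rough sketch of the idea; the table and function names are made up for illustration, and error handling is left out:

// Assumes a table like:
//   CREATE TABLE keyvalue (thekey VARCHAR(255) PRIMARY KEY, thevalue LONGBLOB) ENGINE=InnoDB;

function kv_put($mysql, $key, $value) {
    $key = mysql_real_escape_string($key, $mysql);
    $value = mysql_real_escape_string($value, $mysql);
    mysql_query("REPLACE INTO keyvalue (thekey, thevalue) VALUES ('$key', '$value')", $mysql);
}

function kv_get($mysql, $key) {
    $key = mysql_real_escape_string($key, $mysql);
    $result = mysql_query("SELECT thevalue FROM keyvalue WHERE thekey='$key'", $mysql);
    $row = mysql_fetch_row($result);
    return $row ? $row[0] : null;
}

function kv_out($mysql, $key) {
    $key = mysql_real_escape_string($key, $mysql);
    mysql_query("DELETE FROM keyvalue WHERE thekey='$key'", $mysql);
}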

This worked out great; I was able to speed up twitter.mailana.com a lot. Even better, I'm still in a good position to use Tokyo in the future if I can get it running reliably; I'll just need to swap out the lowest layer.

[Update – I've now switched back to Tokyo Tyrant after fixing the truncation bug in the PHP wrapper: http://petewarden.typepad.com/searchbrowser/2009/06/how-to-get-tokyo-tyrant-working-in-php.html ]

Visualize your networking with PeopleMaps


A friend recently pointed me towards PeopleMaps, a new service that's currently in open beta. It's a tool for exploring your social network aimed at business users, focused on helping you make connections to avoid cold sales calls and get introductions. They use path-finding algorithms to navigate all the possible chains between you and your target, trying to find the path with the fewest hops, and displaying the possibilities as a simple tree.

I really like their emphasis on a painful problem for a very specific market, sales professionals. It's not easy to figure out how to get decent introductions for friends-of-friends through LinkedIn, so I can see this being a very popular service. It looks like a great tool for my "You guys should talk" crusade.

One thing I'm not sure about is where they're getting their relationship data from. They have a great set of tools to import your contacts from Gmail, LinkedIn and Outlook, but they promise to only use that information for your own private network. Companies like Spoke did something similar in the past, aggregating lots of individual contacts from their Outlook plugin to build a global graph, but PeopleMaps' privacy policy appears to rule this out.

Anyway, I love what they're up to; it seems very well thought-out, they've got an impressive team, and I look forward to seeing more.

Reblog: Finding your customers


Photo by Steve Crane

As I was checking out some of my new Twitter followers, I ran across Keith William's blog. It's new and so pretty sparse, but his post on finding your customers stood out. As you may be able to tell from my blogging, getting customers has been top of my priority list recently, and it's something I'm a novice at, so I'm hungry for any advice I can get.

He lays out some practical steps to test any startup idea against the real world by trying to win customers. It reminds me of Mike Maple's Customer Development talk, but with a very concrete 9-stage program that will hone your startup ideas at the same time as you build a base of influential customers. It's obviously hard-won wisdom from his time in the trenches, and even if I don't follow the exact recipe it's got me thinking about what I can steal from it. Read the full post, but here are the steps:

1) Figure out what you think your product or technology is, who
would be your actual customer, and generally how you think your
business model would work.

2) Envision a complete start to finish incarnation of this ‘thing’
and find some general supporting evidence that there is a market. 
Don’t get married to this, it’s only a starting point.

3) Find a Trade Show, presentation, or gathering where you can find
many of the target customers.  This can also just mean going out and
talking with individuals.

4) Go there and talk to as many people as possible.  Pitch
everyone.  If you’re getting funny looks keep adjusting your pitch
until they ‘get it’ or you ‘get theirs’ instead.

5) Find at least one champion at an organization that would be a
target customer, but also has clout within your industry to sway
others. If you’ve polished your pitch quickly enough, you most likely
will find one at whatever gathering you’re at.

6) Keep pitching to your champion and honing your message until it
begins to generate traction both with your champion and also throughout
the rest of your audience.

7) Start submitting abstract proposals to present at other industry
gatherings, create a corporate blog, generate supporting data to show
your position as valid and publish it as a whitepaper.

8) Wash, rinse, repeat.  Keep going around and around steps 3 through 7.
This will create the market churn and you’ll find one contact feeding
off of any small bit of positive news from the other…..soon this growth
becomes somewhat organic.

9) Now, step back and look at what you have: a cogent message for your
market, a pitch that has turned into the definition of your product or
service, and real customers to reach out to, touch, AND QUANTIFY for your
business planning pleasure.

Tokyo Tyrant Tutorial


Photo by Olivander

As the number of messages on twitter.mailana.com has approached 200 million, the response speed has been dropping. In fact, as @travisspencer observed, it's slow as sin. The back end is all MySQL, and I've spent a lot of time denormalizing my data, indexing and trying other optimizations to speed it up, but I've reached the point where just doing a simple SELECT * WHERE primarykey=something; can take up to a minute. I'm sure a real guru could wave a dead chicken over my table structure whilst murmuring incantations and get the performance I need, but I've reached the point where it's easier to move to a simpler system with fewer moving parts.

That's where Tokyo Cabinet comes in. It's a much more primitive system than MySQL, just a key->value store rather than a relational database. That means I'll have to implement things like sorting and grouping in the client code, but my key requirement is that it fetch modest numbers of rows from massive datasets very quickly. The performance numbers quoted are very impressive, with millions of fetches and inserts possible per second, so I'm evaluating it now.
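As a trivial example of what pushing sorting into the client looks like, you fetch the values you need by key and order them in PHP rather than relying on ORDER BY; here $store->get() stands in for whichever fetch call ends up being used:

// Fetch a handful of values by key, then sort them in the client
$rows = array();
foreach ($keys as $key) {
    $rows[$key] = $store->get($key);
}
asort($rows); // the ordering MySQL's ORDER BY used to provide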

If you're interested in trying it yourself, first read this article on the background behind Tokyo Cabinet and Tyrant, and then go through the 30-slide introduction by the author. The main documentation is at http://tokyocabinet.sourceforge.net/spex-en.html and http://tokyocabinet.sourceforge.net/tyrantdoc/, and it's pretty good, though at first the number of functions and their naming is overwhelming (e.g. tcadbputkeep2!). They do talk you through the installation process, but it's a bit scattered, so here are the steps I went through to get it running on my Red Hat Fedora 8 system:

To set up the underlying Tokyo Cabinet database engine:

curl "http://tokyocabinet.sourceforge.net/tokyocabinet-1.4.9.tar.gz&quot; > tokyocabinet-1.4.9.tar.gz
gunzip tokyocabinet-1.4.9.tar.gz
tar -xf tokyocabinet-1.4.9.tar
cd tokyocabinet-1.4.9
yum install zlib-devel
yum install bzip2-devel
./configure
make
make check
(awesomely geeky text scrolls past for 10 mins)
make install (as root)

To install the Tokyo Tyrant server that provides remote access to the database:

curl "http://tokyocabinet.sourceforge.net/tyrantpkg/tokyotyrant-1.1.16.tar.gz&quot; > tokyotyrant-1.1.16.tar.gz
gunzip tokyotyrant-1.1.16.tar.gz
tar -xf tokyotyrant-1.1.16.tar
cd tokyotyrant-1.1.16
./configure
make
make install
ttserver
(to check it installed ok)

emacs /etc/rc.local
  Add '/usr/local/sbin/ttservctl start' to the end of the file

ttservctl start

You should now have a server running. To test it out, push a value into the store and then retrieve it using the HTTP interface:

curl -X PUT -d 'first' 'http://localhost:1978/one'
curl 'http://localhost:1978/one'
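
The same smoke test from PHP is only a few lines, using a stream context to send the PUT; a minimal sketch, again assuming the default localhost:1978:

// Store a value with an HTTP PUT via a stream context
$context = stream_context_create(array('http' => array(
    'method'  => 'PUT',
    'content' => 'first',
)));
file_get_contents('http://localhost:1978/one', false, $context);

// Read it back with a plain GET
echo file_get_contents('http://localhost:1978/one');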