Photo by Jef Poskanzer
Last week I suddenly noticed the number of imported people on http://twitter.mailana.com/ increase dramatically. This was suspicious since it typically takes a minute or two to import a single person's messages from the Twitter API, so seeing 50,000 added in less than a day rang alarm bells. Simultaneously I got a small flood of emails from users whose profiles were showing up completely blank. Since this is usually the result of a failed import I checked the server logs and there were indeed lots of errors.
Unfortunately I was in the middle of moving house, so I had no time to investigate and fix the problem. Instead I recorded the names of everyone who contacted me in my bug database, sent them notes so they'd know I was on the case, and then turned off all imports completely. That at least meant no further profiles would be corrupted before I could solve the problem.
Yesterday we'd finally completed the drive and got the internet running at our new place, so I could sit down and figure out what was going wrong. The immediate cause was this change to the Twitter API on April 9th. Previously I'd been able to use POST for all my calls, but now some would only work with GET. This limitation was always in the documentation but never enforced, so I hadn't spotted it.
That wasn't the true problem though; that sort of change happens all the time, and it shouldn't cause corrupted data and empty profiles. In the spirit of Eric's Five Whys, here's a root-cause analysis:
1. Why did people's profiles show up blank? The import failed and output bogus data when Twitter's API changed.
2. Why did the import fail and output bogus data? The errors weren't detected and handled correctly.
3. Why weren't the errors handled? The import code wasn't tested thoroughly enough.
4. Why wasn't it tested? There was no easy way to run a test.
5. Why was there no easy test? I'd never expected the Twitter import to be so heavily used. It was quickly written code reused from another project.
With that in mind, I worked backwards through the list today, from 5 to 1, trying to address each layer of the problem in turn. Going in that direction is important because you want to leave the immediate cause until last, so you can verify that the deeper fixes actually do catch that problem.
5. This is a priority issue. I've been trying to juggle my work on email and Twitter simultaneously. That's given me too many top priorities, which really means I have no priorities. To fix that, I'm formally pausing my Exchange work for the next few months. Twitter has become a great platform to showcase my ideas. I still believe passionately that email is a killer application for this, but Twitter is a fantastic way to sell people on what I'm building. Once I have people convinced, it will be a lot easier to persuade them to invest the time and trust needed to install my email version. This decision will let me give the Twitter code the resources it needs to shine.
4. I built a new unit test into the Twitter import script.
3. That test is now part of my routine whenever that code is changed.
2. I implemented entirely new error-catching code. It now correctly halts the script whenever the API returns a fatal error, so no bad data is ever stored in the database. As a bonus, I also now catch temporary errors caused by server overload and the like, waiting 10 seconds before each retry and giving up after a fixed number of attempts. Watching the logs, it's surprising how many 502 errors I see!
1. Finally, I switched the API call from POST to GET, and got the import process rolling again.
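The error handling described in step 2 can be sketched roughly as follows. This is a minimal illustration in Python, not the actual import script: the `fetch` callable, the `FatalImportError` name, the set of retryable status codes, and the limit of five attempts are all assumptions made for the example. Only the 10-second wait and the 502 errors come from the description above.

```python
import time

# Assumed set of "temporary" server errors worth retrying (502 is the one seen in the logs).
RETRYABLE_STATUS = {500, 502, 503, 504}


class FatalImportError(Exception):
    """Hypothetical error type: the import must stop before anything is stored."""


def fetch_with_retry(fetch, max_attempts=5, delay_seconds=10, sleep=time.sleep):
    """Call fetch() until it succeeds, retrying only on temporary errors.

    fetch is a hypothetical callable returning (status_code, body).
    """
    for attempt in range(1, max_attempts + 1):
        status, body = fetch()
        if status == 200:
            return body
        if status not in RETRYABLE_STATUS:
            # Fatal error (for example a POST the API now rejects): halt immediately.
            raise FatalImportError(f"API returned {status}; aborting import")
        if attempt < max_attempts:
            sleep(delay_seconds)  # temporary overload: wait, then try again
    raise FatalImportError(f"still failing after {max_attempts} attempts")
```

Because the caller only ever receives a body from a successful response, or an exception, there is no path on which a half-failed request gets written to the database. Passing `sleep` in as a parameter is just a convenience so a unit test can run without actually waiting.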
That wasn't the end of it though. I still had a database with several thousand corrupted profiles. I'd implemented a manual method to force a reimport for users I knew about, but I had to switch to something automatic to handle that number of problems. Another issue was that there was no easy way to identify all the affected users, thanks to the way I'm storing the data.
I settled on detecting when a blank profile is loaded, then displaying an error message and forcing a full reimport. This is far from ideal, but with the reimport bumped to the top of the queue, it should only take a few minutes. If your profile was previously showing up blank, please give it another try. Hopefully this will fix the problem for you.
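As a rough sketch of that repair-on-load idea, again in Python and under stated assumptions: here a profile counts as blank when it has no stored messages, `store` is a plain dict standing in for the database, and `import_queue` is a deque of usernames. None of these names or structures come from the real site.

```python
from collections import deque


def load_profile(username, store, import_queue):
    """Return (profile, message) for a username.

    store and import_queue are hypothetical stand-ins for the real
    database and import queue.
    """
    profile = store.get(username)
    if profile:
        return profile, None  # healthy profile, nothing to repair

    # Blank or missing profile: move the user to the front of the queue.
    if username in import_queue:
        import_queue.remove(username)
    import_queue.appendleft(username)
    return None, "Your profile is being re-imported. Please try again in a few minutes."
```

The appeal of this approach is that it never needs a list of affected users up front: each damaged profile identifies itself the first time someone asks for it.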
Thanks to everyone who helped me with bug reports on this one, and sorry to those caught with empty graphs for the last week. As always, please let me know about any other issues you're hitting.