Scrape your call history with Selenium

Floorscrapers
Photo by WallyG

There's a lot of interesting data out on the web that's locked up in web pages, with no API access to make it machine-readable. I'm particularly interested in phone records; just like emails, IMs and tweets they form a detailed shadow of your social network. To tackle automatically grabbing my phone call history from the AT&T site I turned to Selenium, originally built as a testing tool but also well-suited to screen-scraping on sites with complex login procedures.

To get started you can install the Selenium IDE in Firefox and record the steps you'd manually take to log in and get to the screen you're interested in. Selenium turns those actions into a script you can edit and replay. In my case I needed to add some 'type' commands to enter the phone number and password, since those weren't captured. Here's the resulting script; once you've added your own details, you should be able to run it against your own account to download your call details as a csv file:

Download Attdownload

What's really handy is that you can use Selenium Remote Control to then re-run that same script from your server, using PHP or other popular languages. It's a bit of a hack because it still requires windowing capabilities so it can run within Firefox and a proxy server process to insert the needed code into external pages, but once it's running it's an incredibly flexible way to deal with constantly changing websites.
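As a rough sketch of what that looks like in practice, here's one way to replay a recorded suite from the command line with the Selenium RC server. The suite file name, base URL and jar location are illustrative assumptions, not the actual script above:

```shell
# Build the command line that replays a recorded Selenium IDE suite.
# -htmlSuite launches a real Firefox instance, runs the suite against the
# base URL, and writes an HTML report when it finishes.
selenium_replay_cmd() {
  local base_url="$1" suite="$2" report="$3"
  echo java -jar selenium-server.jar -htmlSuite '*firefox' "$base_url" "$suite" "$report"
}

# Print (or wrap in $(...) to actually run) the replay command:
# selenium_replay_cmd "https://www.att.com" attdownload-suite.html results.html
```

Because the server is just a jar plus a proxy, the same invocation works from a cron job or a PHP exec() call on your own machine.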

Move fast and break stuff

Breakglass
Photo by mpires

I recently talked to someone at a very innovative large web company (under Frie-NDA) who described their official engineering motto as "Move fast and break stuff". I love that philosophy because it ties in to research showing that really successful people get there by trying a lot more approaches than average folks. They fail faster, cheaper and more often than ordinary people.

The key to making that work is that the cost of the total failures must be less than the value of the cumulative successes. This is a hard problem, because the default for most organizations is "managing to avoid blame". Their implicit motto is "Reward success and inaction, punish failure", which ends up making inaction the most appealing course. "Move fast and break stuff" encourages a different mentality, "Reward success and failure, punish inaction".

So how do you get that mindset in your organization? The most important step is to de-stigmatize failure. The web company I mentioned makes it clear to their engineers that they will not be punished if they break the site, even if it costs millions of dollars in lost revenue. I didn't get to dig deeper on that topic but I'd imagine there are some serious post-mortem procedures to understand why things go wrong and build tools to prevent a recurrence, like the Five Whys.

Can you help me shape Mailana?

Sculptor

I've got some important and tricky decisions to make about the future direction of Mailana. To make those choices I need to better understand the problems that people are facing, so I've designed a short 8-question survey. If you are interested in the work I'm doing, it will help me a lot if you're able to take a few minutes to fill it out. It also gives you the chance to sign up for early previews of new features before they're publicly released. Thanks!

10 ways to kill my startup

Poisonlabel

Planning is overwhelming; it's hard to know where to begin. One solution I've picked up is 'anti-planning': write out all the actions you'd take if you wanted to ensure failure. It's far easier to remember the background to past disasters than to understand why things succeeded. With those fresh in your mind you'll find drawing up an actual plan much simpler. It's also great to keep pinned to your notice-board, to remind yourself when you do start wandering towards one of those seductive traps.

Here's how I'd sabotage my startup, in 10 easy steps:

  1. Get distracted by every shiny new idea and forget what my big goal is
  2. Leave my product to sell itself; build it and they will come, right?
  3. Have no idea who my customers are
  4. When I describe what I'm building, focus on the technology
  5. Worry about my grand strategy, not the logistics of executing
  6. Spend more time meeting with investors than customers
  7. Build features customers don't want
  8. Focus on minor bug fixes
  9. Ignore people who want to help
  10. Rely on my intuition to tell me what's working, not dull metrics

An alternative Gmail API

Opendoor
Photo by Funky64

Gabor Cselle, formerly a Gmail engineer and now a founder of the YCombinator startup Remail, has been doing some really interesting work in the email field recently. Their main Remail product takes the normal approach of asking for your Gmail username and password and then fetching all your messages through IMAP. As far as I knew this was the only way of accessing your inbox, but it is horrible for security since it requires users to hand over their Google passwords to a third-party website.

That meant I was intrigued to see that one of their experimental projects uses OAuth to access users' inboxes. This is a massive improvement, since the third party never sees the original password, but I didn't know that any of the mail APIs supported this. Trying to figure out how he did it, I discovered it's possible to grab an RSS feed of your messages. Here are a few command-line examples you can try for yourself, replacing username and password with your Gmail credentials:

curl "https://username:password@mail.google.com/mail/feed/atom/unread#all"
Shows unread emails from all your folders

curl "https://username:password@mail.google.com/mail/feed/atom/inbox"
Shows unread emails in your inbox

curl "https://username:password@mail.google.com/mail/feed/atom/spam"
Shows all your unread spam emails

These all use basic HTTP authentication but web applications can call the same URLs after authenticating with OAuth, giving users a much more secure experience.

There are some pretty serious limitations though. These feeds only show unread emails, and are limited to 20 messages at most. That rules out applications that need a lot of email to analyze, but I'm sure there are other interesting tools that could be built within the restrictions. I'd be curious to know if any other developers are using this and if there are any ways around the limitations. In the meantime I'll keep debugging my IMAP code!

How to use Yahoo’s Placemaker API to extract places from documents

Oldmap

Today I was lucky enough to hear Greg Cohn walk us through all the goodies Yahoo offers developers. I'm a big fan and heavy user of their Geoplanet geocoding API, so I was stoked to hear they'd just launched a service to recognize placenames in arbitrary HTML and XML documents. Why is this so interesting? Look at what Just Landed have done by searching for the words "Just landed in" in Twitter messages and then geocoding and visualizing the placenames. Placemaker makes it a lot simpler to build tools like this with anything that can be expressed as XML or HTML. That covers web pages, REST APIs like Twitter's, and even RSS feeds, so you can see why I'm excited!

I've put together a simple example that shows off how to use it as a bash script, tested on OS X. You can download it as geturlplaces.zip here, or I've included the source below. To use it, pass a web page address as the first argument, e.g. ./geturlplaces http://news.bbc.co.uk/

For production code you'll want a real XML parser rather than the regexs used below.

#!/bin/bash

# enter your Yahoo geo app id here – to obtain one go to http://developer.yahoo.com/wsregapp/index.php and register
# (interestingly as of May 20th 2009 it works with a bogus id!)
APPID=XXXXX

if [ $# -ne 1 ]
then
  echo "Extract a list of all the recognized place names from a web page using Yahoo's Placemaker API"
  echo "Usage: `basename $0` <web page url>"
  exit 65
fi

curl --silent -d "documentURL=$1&documentType=text/html&outputType=xml&appid=$APPID" "http://wherein.yahooapis.com/v1/document" | grep '<text><\!\[CDATA\[' | sed 's/<text><\!\[CDATA\[//; \
s/\]\]><\/text>//' | sort | uniq

Privacy’s vanishing; how screwed are we?

Veil
Photo by Matanya

The whole theory behind Mailana is that people's attitudes to privacy are changing; there's a younger generation willing to open up private information as long as they get something useful in return and retain control. I've written about this before, but a recent post by Marc Hedlund brought some of my thoughts into focus.

He's a self-confessed "privacy freak" but concedes that he's on the losing side of the battle. Selfishly speaking that's a great validation of the bet I'm making on my business, but what's interesting is his motivation. He says that people are blasé about privacy online because they've never been stalked or the victim of identity theft. Once you go through that hell, like he has, you realize how useful all those old-fashioned notions really are.

That makes a lot of sense to me, those are black swan events; statistically speaking pretty unlikely to happen to you but devastatingly bad when they do. What's worse is that easy-going attitudes towards privacy create an environment where criminals will thrive, actually making it more likely you'll be attacked in the future. By handing over personal information and even passwords we're all picking up pennies in front of a steam-roller right now.

I'm still a fan of people's new freedom to trade some privacy for something they want more, but I'm acutely aware that people are care-free about that bargain because they've never been stung. A lot of people are going to get hurt before we reach a new equilibrium, with widely understood ground rules for what's acceptable and safe.

Wild and wacky ways to access email

Catmail
Photo by Stephanie Booth

When I designed the Mailana architecture I built my data pipeline around an XML format capturing the message information I need. That meant I could support a wide variety of sources by just writing a single import component for each that translated the native format into my XML. That's worked out really well, letting me pull in data directly from Exchange servers, Outlook PST files, Gmail and other IMAP services, and of course from Twitter.
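As a rough illustration, each imported message boils down to a small record along these lines (the element names here are made up for this sketch, not Mailana's actual schema):

```xml
<message>
  <source>gmail</source>
  <from>alice@example.com</from>
  <to>bob@example.com</to>
  <date>2009-05-20T10:15:00Z</date>
  <subject>Lunch on Thursday?</subject>
</message>
```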

I've been having some fascinating chats with Pete Sheinbaum, and one thing he's been enthusiastic about is tapping into the mass market by grabbing communications data that isn't easily accessible. In practice that means screen-scraping and other unconventional techniques, all of which are immensely appealing to my subversive geeky streak (see my old GoogleHotKeys project) and would be easy to integrate into Mailana as an import component. Here are some of my favorite approaches to grabbing email:

Yahoo IMAP Spoofing

Normally you can only get IMAP or POP access to your Yahoo inbox if you upgrade to a premium account. Last year they introduced their Zimbra desktop client, which works even with free accounts, and it wasn't long before some enterprising coders discovered it was using a slightly modified version of IMAP. To programmatically access any Yahoo email account, all you need to do is send one non-standard command!
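From what those coders published, the trick reportedly boiled down to sending a non-standard ID command before logging in, something like the session below. The server name, port and exact command are drawn from those reports and may well have changed since, so treat this as a historical sketch:

```shell
# Build the IMAP command sequence for the reported Yahoo hack. The magic
# line is the non-standard ID command sent before LOGIN; everything after
# it is plain IMAP. Credentials here are placeholders.
yahoo_imap_session() {
  printf 'a1 ID ("GUID" "1")\r\n'
  printf 'a2 LOGIN "%s" "%s"\r\n' "$1" "$2"
  printf 'a3 SELECT "Inbox"\r\n'
  printf 'a4 LOGOUT\r\n'
}

# Pipe it into the server (address taken from the published reports):
# yahoo_imap_session you@yahoo.com yourpassword | nc imap.next.mail.yahoo.com 143
```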

Outlook Web Access Screen Scraping

The Substandard Evil Genius has a nice little snippet for logging into OWA and grabbing the HTML for an email inbox. Parsing that would give you the headers for a page of emails, and then you could grab the links to download each message's content. It's definitely tougher than using a real API, but with some time and care it's very feasible to pull down everything from an Outlook account.
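As a rough sketch of the idea (the server name and URL layout are assumptions; OWA deployments vary, though many front mailboxes at an /exchange/ path with basic authentication):

```shell
# Build the OWA inbox URL for a given server and mailbox. Many Exchange
# 2000/2003 installs expose mailboxes under /exchange/<user>/Inbox/.
owa_inbox_url() {
  printf 'https://%s/exchange/%s/Inbox/' "$1" "$2"
}

# Fetch the inbox HTML with basic authentication, ready for parsing:
fetch_owa_inbox() {
  local server="$1" user="$2" pass="$3"
  curl --silent --user "$user:$pass" "$(owa_inbox_url "$server" "$user")"
}

# Usage (all placeholders): fetch_owa_inbox mail.example.com alice secret > inbox.html
```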

TrueSwitch's Uber Screen Scraper

I hadn't run across TrueSwitch until recently, but they're a fascinating company. Their purpose in life is to let people transfer everything from old to new email accounts, including all messages and contacts. What rocks is that they use screen scraping to support all the main email services, even those like Hotmail that only have a webmail interface. Amusingly they're also used by all the main email providers to make it easy for users to switch to them, even those like Yahoo and MSN that deliberately don't offer an API, presumably to make it harder for users to switch away!

TrueSwitch doesn't offer a public API themselves, but what they do demonstrate is that it's possible to access almost anyone's email by screen scraping if you're willing to invest the time and effort.

Should you ignore the data?

Wrongway
Photo by EranB

I was lucky enough to hear Scott Petry, Brad Feld and Ryan McIntyre telling the story of Postini, the company Scott founded, and Brad and Ryan funded. One phrase from Scott really resonated:

Data is necessary, but not sufficient, to make good decisions

I love and live by metrics, and I'm a big fan of the Customer Development philosophy, but I think there are dangers in taking it too far, and Scott crystallized the problem. Customer Development rocks because it forces startups to gather data, something most would otherwise neglect. It sucks if it prevents you from following paths that cause your metrics to dip in the short term in return for a long-term win.

One fascinating example of this was profitability. Scott explained how reaching profitability early became a curse, because it became immensely important every quarter after that to avoid dipping into a loss, even when spending more money would be a long-term win. Irrational primates that we are, humans get a lot more worried by a small drop that causes the company to go from profit to loss than they do by the same drop when you're already making a loss, so the statistic becomes their master, not their servant.

I asked the guys to talk more about the metrics they used, and they talked about their in-depth tools for measuring customer satisfaction which helped them to an astonishing 96% customer renewal rate! But interestingly Brad and Scott both stressed the need to sometimes ignore or go against the evidence. It all reminded me of Tom Evslin's motto "nothing great has ever been accomplished without irrational exuberance" – the trick is knowing when to take that leap of faith. To do that, you have to know that you're taking a leap in the dark, which requires knowing the evidence in the first place and making a conscious decision to override it.

You should never ignore the data, but sometimes you have to listen and make a deliberate choice to go against it.