How to extract and categorize email addresses

Photo by SlightlyLessRandom

It’s possible to extract some interesting information from someone’s email address, such as which organization they represent, what type of organization it is, and whether it’s a work or personal account. This is very useful if you want to do automatic contact location in a Spoke-like way (e.g., who do I know at company X?), and for the statistical analysis of large email stores, as in my own Mailana.
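To make the "who do I know at company X" idea concrete, here is a minimal Python sketch that pulls the domain out of each address and groups contacts by it. The function names and sample addresses are hypothetical, chosen only for illustration, and the parsing is a simple split on the last "@" rather than full address validation.

```python
from collections import defaultdict


def extract_domain(address):
    """Return the lowercased domain of an address, or None if it has no usable one."""
    # Split on the last "@" so the domain is whatever follows it.
    local, sep, domain = address.strip().rpartition("@")
    if not sep or not local or not domain:
        return None
    return domain.lower()


def group_by_domain(addresses):
    """Map each domain to the list of addresses that use it."""
    groups = defaultdict(list)
    for address in addresses:
        domain = extract_domain(address)
        if domain is not None:
            groups[domain].append(address)
    return dict(groups)


# Hypothetical sample data, for illustration only.
contacts = [
    "alice@example.com",
    "Bob@Example.COM",
    "carol@example.edu",
    "not-an-address",
]
by_domain = group_by_domain(contacts)
```

Looking up `by_domain["example.com"]` then answers the question for that one organization. Lowercasing the domain keeps differently capitalized addresses in the same group.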

The key is the 80/20 rule: roughly 80% of emails come from 20% of organizations. That makes it feasible to create a whitelist that covers the most common US companies, colleges and ISPs, noting their type and giving the organization’s full name. With Liz’s help, I’ve put together an initial list of 2,200 organizations. Here’s a demonstration of it in practice, or you can enter some addresses into the box below:

You can also download the source and list at http://mailana.com/labs/addresscategorizer.zip

It’s definitely not infallible, but it’s good enough to be useful for my purposes. The more organizations that get added, the more accurate it gets, so to add your own, edit the domaininformation.txt file. There’s one line for each organization, in this format:

organization domain|display name|type
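As a rough sketch of how a file in this format could be used, the Python below parses pipe-separated lines into a lookup table and categorizes an address by its domain. This is not the code from the download: the function names, the sample lines, and the type values ("company", "college", "isp") are assumptions made for illustration, and the fallback from a subdomain to its parent domain is my own addition.

```python
def parse_domain_info(lines):
    """Build {domain: (display_name, org_type)} from 'domain|display name|type' lines."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        parts = line.split("|")
        if len(parts) != 3:
            continue  # skip lines that do not match the three-field format
        domain, display_name, org_type = (part.strip() for part in parts)
        table[domain.lower()] = (display_name, org_type)
    return table


def categorize(address, table):
    """Return (display_name, org_type) for an address, or None if the domain is unknown."""
    _, sep, domain = address.strip().rpartition("@")
    if not sep or not domain:
        return None
    labels = domain.lower().split(".")
    # Assumption: try the full domain first, then drop leading labels
    # (mail.example.com -> example.com), never testing the bare top-level label.
    for start in range(len(labels) - 1):
        candidate = ".".join(labels[start:])
        if candidate in table:
            return table[candidate]
    return None


# Hypothetical entries in the documented format, not taken from the real list.
sample_lines = [
    "example.com|Example Corporation|company",
    "example.edu|Example University|college",
    "example.net|Example Internet|isp",
]
table = parse_domain_info(sample_lines)
result = categorize("jane@mail.example.com", table)
```

To try it against the real list, something like `with open("domaininformation.txt") as f: table = parse_domain_info(f)` should work, assuming the file is plain text with one entry per line as described.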

Let me know if you do generate a larger list you’re willing to share, and I’ll update the example. Thanks to Christine DeMello for compiling her directory of colleges.
