Mining information from text using regular expressions


I had so much fun playing with the regular expressions for this one that I ended up building a fairly elaborate testbed and missed my usual morning-cup-of-tea posting deadline. The page demonstrates using REs to pull phone numbers, email addresses, URLs, dates, times and prices out of unstructured text, and uses a JS library I’m making freely available to search and replace within an HTML document. There’s some sample text to show it off, and you can put it through its paces by entering your own strings.

If you want to see the power of this approach, try grabbing some of your own emails and pasting them into the custom box (it’s all client-side so I’ll never see any of them). You’ll be surprised at how much these expressions will pick up. Imagine how much useful information you could get from a whole company’s mailstore.

Here are the exact REs I’m using. You may need to escape the backslashes depending on your language:

Phone numbers (10-digit US format, with any separators)
([0-9]{3})[^0-9]*([0-9]{3})[^0-9]*([0-9]{4})
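As a rough sketch of what the three capture groups are good for, here is the phone expression in Python, with each match rebuilt into one consistent format. The function name `find_phones` and the dash-separated output are my own choices for illustration, not part of the JS library.

```python
import re

# The phone pattern from the post, unchanged. A raw string avoids extra escaping.
PHONE_RE = re.compile(r"([0-9]{3})[^0-9]*([0-9]{3})[^0-9]*([0-9]{4})")

def find_phones(text):
    """Return every match rebuilt as AAA-PPP-NNNN from the three capture groups."""
    return ["-".join(m.groups()) for m in PHONE_RE.finditer(text)]

sample = "Call (415) 555-0123 or 650.555.0199 before noon."
print(find_phones(sample))  # ['415-555-0123', '650-555-0199']
```

Because the separators are unbounded runs of non-digits, the same pattern will also join digit groups that are far apart, so "123 and 456 then 7890" comes back as one number.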

Email addresses
[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}

URLs with protocols
https?://[a-z0-9\./%+_\-\?=&#]+

Naked URLs (this will also pick up non-URLs like document.body)
[a-z0-9\./%+_-]+\.[a-z]{2,4}[a-z0-9\./%+_\-\?=&#]*

Dollar amounts
\$\s?[0-9,]+(\.[0-9]{1,2})?(\s(thousand|m[^a-z]|mm[^a-z]|million|b[^a-z]|billion))?
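Matching a price is only half the job if you want to do arithmetic with it, so here is a hedged Python sketch that turns each match into a number. The multiplier table is an assumption on my part: it reads both "m" and "mm" as millions and "b" as billions. The name `parse_dollars` is made up for the example, and the pattern is compiled case-insensitively.

```python
import re
from decimal import Decimal

# The dollar pattern from the post. ASCII keeps case folding limited to plain a-z.
DOLLAR_RE = re.compile(
    r"\$\s?[0-9,]+(\.[0-9]{1,2})?"
    r"(\s(thousand|m[^a-z]|mm[^a-z]|million|b[^a-z]|billion))?",
    re.IGNORECASE | re.ASCII,
)

# Assumed meanings of the unit words; "m" and "mm" as millions is a guess.
MULTIPLIERS = {
    "thousand": 10**3, "m": 10**6, "mm": 10**6,
    "million": 10**6, "b": 10**9, "billion": 10**9,
}

def parse_dollars(text):
    """Return each matched amount as a Decimal, scaled by its unit word if any."""
    values = []
    for m in DOLLAR_RE.finditer(text):
        # Cut the optional unit (group 2) off the end, leaving e.g. "$1,200.50".
        end = m.start(2) if m.group(2) is not None else m.end()
        number = text[m.start():end].lstrip("$").strip().replace(",", "")
        if not any(ch.isdigit() for ch in number):
            continue  # "[0-9,]+" can match commas alone
        unit = (m.group(3) or "").lower()
        # "m", "mm" and "b" are captured with one trailing non-letter; drop it.
        key = unit if unit in MULTIPLIERS else unit[:-1]
        values.append(Decimal(number) * MULTIPLIERS.get(key, 1))
    return values

sample = "We raised $2.5 million, spent $1,200.50 on servers and $3 mm on hiring."
print(parse_dollars(sample))
```

One thing this makes visible: the short units only match when another character follows them, so "$2 m" at the very end of a string is read as a plain "$2".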

Numerical times
[012]?[0-9]:[0-5][0-9]((\.|:)[0-5][0-9])?(\s?(a|p)m)?

Dates
(January|Jan|February|Feb|March|Mar|April|Apr|May|June|Jun|July|Jul|August|Aug|September|Sept|October|Oct|November|Nov|December|Dec)[^0-9a-z]+([0-9]{1,2})(st|nd|rd|th)?[^0-9a-z]+((19|20)?[0-9]{2})
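To show how the groups in the date expression map onto a real date, here is a small Python sketch. Two things in it are assumptions rather than anything the expression itself decides: two-digit years are expanded around a pivot of 70 (so 07 becomes 2007 and 85 becomes 1985), and matches that are not valid calendar dates are silently dropped. `find_dates` is a hypothetical helper name.

```python
import re
from datetime import date

# The date pattern from the post, compiled case-insensitively (ASCII-only folding).
DATE_RE = re.compile(
    r"(January|Jan|February|Feb|March|Mar|April|Apr|May|June|Jun|July|Jul|"
    r"August|Aug|September|Sept|October|Oct|November|Nov|December|Dec)"
    r"[^0-9a-z]+([0-9]{1,2})(st|nd|rd|th)?[^0-9a-z]+((19|20)?[0-9]{2})",
    re.IGNORECASE | re.ASCII,
)

# The first three letters of any month name in the pattern identify the month.
MONTH_KEYS = ["jan", "feb", "mar", "apr", "may", "jun",
              "jul", "aug", "sep", "oct", "nov", "dec"]
MONTHS = {key: number for number, key in enumerate(MONTH_KEYS, start=1)}

def find_dates(text, pivot=70):
    """Return matched dates as datetime.date objects, skipping impossible ones."""
    found = []
    for m in DATE_RE.finditer(text):
        month = MONTHS[m.group(1)[:3].lower()]
        day = int(m.group(2))
        year = int(m.group(4))
        if len(m.group(4)) == 2:
            # Assumed pivot: two-digit years below it land in the 2000s.
            year += 1900 if year >= pivot else 2000
        try:
            found.append(date(year, month, day))
        except ValueError:
            continue  # e.g. "Feb 31 2008" matches the pattern but is not a date
    return found

sample = "Launch is set for March 3rd, 2008; the review was on Sept 12 07."
print(find_dates(sample))
```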

Numerical dates
([0-9]{1,2})[/-]([0-9]{1,2})[/-]((19|20)?[0-9]{1,2})
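If you would rather try these outside the browser, a subset of the expressions can be run together with a few lines of Python. This is only a sketch: the dictionary keys and the `extract` function are names I made up for the example. Since the character classes are lowercase-only, everything is compiled case-insensitively.

```python
import re

# A subset of the patterns from the post, as raw strings so the backslashes
# need no extra escaping.
PATTERNS = {
    "phone": r"([0-9]{3})[^0-9]*([0-9]{3})[^0-9]*([0-9]{4})",
    "email": r"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}",
    "url": r"https?://[a-z0-9\./%+_\-\?=&#]+",
    "time": r"[012]?[0-9]:[0-5][0-9]((\.|:)[0-5][0-9])?(\s?(a|p)m)?",
    "numeric_date": r"([0-9]{1,2})[/-]([0-9]{1,2})[/-]((19|20)?[0-9]{1,2})",
}

# The patterns only list lowercase letters, so match case-insensitively.
COMPILED = {name: re.compile(p, re.IGNORECASE) for name, p in PATTERNS.items()}

def extract(text):
    """Return the full text of every match, grouped by pattern name."""
    return {
        name: [m.group(0) for m in rx.finditer(text)]
        for name, rx in COMPILED.items()
    }

sample = ("Call 415-555-0123, email Pete@Example.com or see "
          "http://example.com/demo?id=7 before 9:30 am on 3/14/2008.")
for name, matches in extract(sample).items():
    print(name, matches)
```

Each pattern is scanned independently here, so a piece of text can show up under more than one heading. Deciding which match wins when they overlap is left out of the sketch.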

These handle the common cases I’m interested in at the moment. There’s no end to how elaborate you could make them to handle every possible format, but these cover a lot of ground. Now I have to resist the temptation to build these into a Firefox extension. One caveat: IE’s RE engine sometimes overmatches with the email expression, pulling in characters past the end of the address. That looks like an implementation quirk, since I don’t see it in other environments such as Firefox, Safari or grep.

2 responses

  1. Thanks Josh, that’s a very good point. I forgot because I’m testing all of these REs with the i modifier to make them case-insensitive, but it’s better to specify them the way you do.
