Mining information from text using regular expressions


I had so much fun playing with the regular expressions for this one that I ended up building a fairly elaborate testbed and missed my usual morning-cup-of-tea posting deadline. The page demonstrates using REs to pull out phone numbers, emails, URLs, dates, times and prices from unstructured text, and uses a JS library I’m making freely available to search and replace within an HTML document. There’s some sample text to show it off, and you can put it through its paces by entering your own strings.

If you want to see the power of this approach, try grabbing some of your own emails and pasting them into the custom box (it’s all client-side so I’ll never see any of them). You’ll be surprised at how much these expressions will pick up. Imagine how much useful information you could get from a whole company’s mailstore.

Here are the exact REs I’m using. You may need to escape the backslashes, depending on your language:

Phone numbers (10-digit US format, with any separators)
([0-9]{3})[^0-9]*([0-9]{3})[^0-9]*([0-9]{4})
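
As a rough sketch of how the three capture groups might be used in JavaScript, the snippet below normalizes each match to NNN-NNN-NNNN. The helper name findPhones and the sample numbers are made up for illustration and are not part of the library.

```javascript
// Phone pattern from the post; the g flag lets matchAll find every occurrence.
const phoneRe = /([0-9]{3})[^0-9]*([0-9]{3})[^0-9]*([0-9]{4})/g;

// Hypothetical helper: rebuilds each match from its three captured digit groups.
function findPhones(text) {
  return Array.from(text.matchAll(phoneRe), (m) => `${m[1]}-${m[2]}-${m[3]}`);
}

const phones = findPhones("Call (415) 555-0132 or 212.555.0198 today.");
console.log(phones); // [ '415-555-0132', '212-555-0198' ]
```

Because [^0-9]* accepts any run of non-digit characters, the pattern can also stitch together unrelated numbers that happen to be separated by ordinary words, so the results are worth a sanity check.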

Email addresses
[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}
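
A minimal JavaScript sketch of the email pattern in use, assuming the i modifier is applied so that upper-case letters match too. The helper name findEmails and the addresses are invented for the example.

```javascript
// Email pattern from the post; g finds every match, i ignores case.
const emailRe = /[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}/gi;

// Hypothetical helper: always returns an array, even when nothing matches.
function findEmails(text) {
  return text.match(emailRe) || [];
}

const emails = findEmails("Mail Jane.Doe@example.com or sales@example.co.uk.");
console.log(emails); // [ 'Jane.Doe@example.com', 'sales@example.co.uk' ]
```

Note that the {2,4} limit means a top-level domain longer than four letters gets cut short rather than matched in full.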

URLs with protocols
https?://[a-z0-9\./%+_\-\?=&#]+
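
Here is one way the URL pattern might look as a JavaScript regex literal. The forward slashes are escaped because / delimits the literal; the helper name findUrls and the sample text are assumptions for illustration.

```javascript
// URL pattern from the post, with each / escaped for use in a regex literal.
const urlRe = /https?:\/\/[a-z0-9\.\/%+_\-\?=&#]+/gi;

// Hypothetical helper: returns every protocol-prefixed URL found in the text.
function findUrls(text) {
  return text.match(urlRe) || [];
}

const urls = findUrls(
  "See http://petewarden.com/2008/02/ and https://example.com/search?q=regex&page=2#top, then reply."
);
console.log(urls);
// [ 'http://petewarden.com/2008/02/',
//   'https://example.com/search?q=regex&page=2#top' ]
```

Since the character class includes the period, a URL that ends a sentence will carry the trailing full stop with it; the class also omits the colon, so a match stops before any port number.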

Naked URLs (this will also pick up non-URLs like document.body)
[a-z0-9\./%+_-]+\.[a-z]{2,4}[a-z0-9\./%+_\-\?=&#]*
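
The sketch below runs the naked-URL pattern in JavaScript, written with single backslashes as a regex literal expects, and shows the false positive mentioned above. The helper name findNakedUrls and the sample sentence are made up for the example.

```javascript
// Naked-URL pattern with single backslashes, and / escaped for the literal.
const nakedUrlRe = /[a-z0-9\.\/%+_-]+\.[a-z]{2,4}[a-z0-9\.\/%+_\-\?=&#]*/gi;

// Hypothetical helper: returns anything shaped like name.tld plus an optional path.
function findNakedUrls(text) {
  return text.match(nakedUrlRe) || [];
}

const candidates = findNakedUrls(
  "Visit petewarden.com/blog or check document.body in the console."
);
console.log(candidates); // [ 'petewarden.com/blog', 'document.body' ]
```

One possible follow-up, not shown here, would be to filter the candidates against a list of known top-level domains to weed out matches like document.body.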

Dollar amounts
\$\s?[0-9,]+(\.[0-9]{1,2})?(\s(thousand|m[^a-z]|mm[^a-z]|million|b[^a-z]|billion))?
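
A short JavaScript sketch of the dollar-amount pattern, assuming the i modifier. The helper name findPrices and the figures in the sample text are invented.

```javascript
// Dollar-amount pattern from the post, including the optional scale word.
const priceRe =
  /\$\s?[0-9,]+(\.[0-9]{1,2})?(\s(thousand|m[^a-z]|mm[^a-z]|million|b[^a-z]|billion))?/gi;

// Hypothetical helper: returns each amount exactly as it appears in the text.
function findPrices(text) {
  return text.match(priceRe) || [];
}

const prices = findPrices(
  "We raised $2.5 million, spent $1,200.50 and owe $40 thousand."
);
console.log(prices); // [ '$2.5 million', '$1,200.50', '$40 thousand' ]
```

The m[^a-z] and b[^a-z] branches consume the character that follows the letter, so a match on text like "$3 m." will include the trailing period.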

Numerical times
[012]?[0-9]:[0-5][0-9]((\.|:)[0-5][0-9])?(\s?(a|p)m)?
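
For illustration, here is the time pattern applied in JavaScript with the i modifier. The helper name findTimes and the sample schedule are assumptions.

```javascript
// Time pattern from the post: hours, minutes, optional seconds, optional am/pm.
const timeRe = /[012]?[0-9]:[0-5][0-9]((\.|:)[0-5][0-9])?(\s?(a|p)m)?/gi;

// Hypothetical helper: returns each time string as written in the text.
function findTimes(text) {
  return text.match(timeRe) || [];
}

const times = findTimes(
  "Standup at 9:30am, lunch at 12:15 pm, deploy at 23:59:59."
);
console.log(times); // [ '9:30am', '12:15 pm', '23:59:59' ]
```

The hour part is deliberately loose: [012]?[0-9] also accepts values such as 29, so a range check on the parsed hour may be worthwhile if you need valid times only.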

Dates
(January|Jan|February|Feb|March|Mar|April|Apr|May|June|Jun|July|Jul|August|Aug|September|Sept|October|Oct|November|Nov|December|Dec)[^0-9a-z]+([0-9]{1,2})(st|nd|rd|th)?[^0-9a-z]+((19|20)?[0-9]{2})
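
The capture groups in the date pattern make it easy to pull out the parts separately. This is a sketch only, assuming the i modifier; the helper name findDates and the returned object shape are invented for the example.

```javascript
// Date pattern from the post. Group 1 is the month name, group 2 the day,
// group 3 the ordinal suffix, group 4 the year, group 5 the optional century.
const dateRe =
  /(January|Jan|February|Feb|March|Mar|April|Apr|May|June|Jun|July|Jul|August|Aug|September|Sept|October|Oct|November|Nov|December|Dec)[^0-9a-z]+([0-9]{1,2})(st|nd|rd|th)?[^0-9a-z]+((19|20)?[0-9]{2})/gi;

// Hypothetical helper: returns the month, day and year of each match as strings.
function findDates(text) {
  return Array.from(text.matchAll(dateRe), (m) => ({
    month: m[1],
    day: m[2],
    year: m[4],
  }));
}

const dates = findDates("Posted February 2, 2008; follow-up due Sept 15th 08.");
console.log(dates);
// [ { month: 'February', day: '2', year: '2008' },
//   { month: 'Sept', day: '15', year: '08' } ]
```

The month list abbreviates September only as Sept, so text using the three-letter form Sep will not match unless you add it to the alternation.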

Numerical dates
([0-9]{1,2})[/-]([0-9]{1,2})[/-]((19|20)?[0-9]{1,2})
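
A minimal JavaScript sketch of the numerical-date pattern. The pattern itself does not say which field is the month and which is the day, so the snippet just returns the three captured fields in order; the helper name findNumericDates and the sample text are made up.

```javascript
// Numerical-date pattern from the post, with / escaped for the regex literal.
const numDateRe = /([0-9]{1,2})[\/-]([0-9]{1,2})[\/-]((19|20)?[0-9]{1,2})/g;

// Hypothetical helper: returns [first, second, year] for each match, as strings.
function findNumericDates(text) {
  return Array.from(text.matchAll(numDateRe), (m) => [m[1], m[2], m[3]]);
}

const numericDates = findNumericDates("Invoice dated 2/14/2008, due 03-01-08.");
console.log(numericDates); // [ [ '2', '14', '2008' ], [ '03', '01', '08' ] ]
```

Deciding between month-first and day-first order is left to the caller, since that depends on where the text came from.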

These handle the common cases I’m interested in at the moment. You could make them endlessly more elaborate to cover every possible format, but they already cover a lot of ground. Now I have to resist the temptation to build these into a Firefox extension. One caveat: IE’s RE engine seems to overmatch with the email expression, sometimes pulling in characters past the end of the address. That looks like an implementation quirk, since I don’t see it in other environments such as Firefox, Safari or grep.
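
To show several patterns working together, and what the earlier note about escaping backslashes means in practice, here is a rough JavaScript sketch that keeps three of the expressions as strings and compiles them with new RegExp. Inside a string literal every backslash has to be doubled. The names patterns and extractAll, and the sample sentence, are invented for illustration and are not taken from the library.

```javascript
// Patterns stored as strings, so each backslash from the post is doubled.
const patterns = {
  email: "[a-z0-9._%+-]+@[a-z0-9.-]+\\.[a-z]{2,4}",
  price:
    "\\$\\s?[0-9,]+(\\.[0-9]{1,2})?(\\s(thousand|m[^a-z]|mm[^a-z]|million|b[^a-z]|billion))?",
  time: "[012]?[0-9]:[0-5][0-9]((\\.|:)[0-5][0-9])?(\\s?(a|p)m)?",
};

// Hypothetical helper: maps each pattern name to the strings it matched.
function extractAll(text) {
  const found = {};
  for (const [name, source] of Object.entries(patterns)) {
    // g collects every match, i makes the match case-insensitive.
    const re = new RegExp(source, "gi");
    found[name] = text.match(re) || [];
  }
  return found;
}

const result = extractAll(
  "Email bob@example.org by 5:00pm about the $99.95 fee."
);
console.log(result);
// { email: [ 'bob@example.org' ], price: [ '$99.95' ], time: [ '5:00pm' ] }
```

Keeping the patterns in a plain object like this makes it simple to add the remaining expressions, or to swap in stricter versions later.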

2 responses

Josh Fraser says:

February 2, 2008 at 1:55 pm

Pete,
Great post. I love what you’re doing.
I took your email regexp and modified it to make it case-insensitive:
[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}

Pete Warden says:

February 2, 2008 at 2:21 pm

Thanks Josh, that’s a very good point. I forgot because I’m testing all of these REs with the i modifier to make them case-insensitive, but it’s better to specify them the way you do.
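
For readers following along, the difference between the two approaches can be sketched in JavaScript as below. The variable names and the sample address are made up, and the dot before the top-level domain is escaped in all three versions.

```javascript
// Original pattern with the i modifier supplying case-insensitivity.
const withFlag = /[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}/i;
// Explicit version: upper-case ranges written into every character class.
const explicit = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}/;
// Original pattern with no modifier at all, for comparison.
const lowerOnly = /[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}/;

// Hypothetical helper: returns the first match as a string, or null.
function firstMatch(re, text) {
  const m = text.match(re);
  return m ? m[0] : null;
}

const sample = "Contact Pete.Warden@Example.COM";
console.log(firstMatch(withFlag, sample));  // Pete.Warden@Example.COM
console.log(firstMatch(explicit, sample));  // Pete.Warden@Example.COM
console.log(firstMatch(lowerOnly, sample)); // null
```

The explicit form gives the same result in tools where a case-insensitive modifier is unavailable or easy to forget.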

Pete Warden's blog