More resources on mining information from plain text

In my previous post, I presented some regular expressions you can use to spot dates, times, prices, email addresses and web links, along with a test page to see them in practice. REs can be pretty daunting when you’re first working with them, so I wanted to recommend a few resources that have helped me in the past.

The best overall guide on the web is regular-expressions.info, and I used some of Jan’s suggestions for email address matching. He has also written a very clever regular expressions assistant that breaks down any cryptic RE into a human-readable description. I also liked this Python tutorial on REs; it focuses on a good practical example and shows how you’d build up the expression step by step.

As I mentioned yesterday, to demonstrate the power of regular expressions on a web page I had to write my own JavaScript library for handling search and replace in the page. This is a surprisingly tricky problem to solve. First you have to get the text for the web page, which involves walking the DOM, extracting all the text nodes and concatenating their contents. That lets you search on the full text, but if you want to change anything you have to remember which parts came from which elements. Then, since only part of an element’s text may match, and the matching text may spread across several DOM elements, you have to do some awkward node splitting and reparenting to get nodes that contain just the match.
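To make the bookkeeping concrete, here is a minimal sketch of the first two steps: concatenating the text nodes while recording where each one starts, then mapping a match back to the nodes it overlaps. The names collectText and findNodesForRange are made up for this illustration and are not taken from the library, and the sample tree is built from plain objects standing in for DOM nodes so the snippet runs outside a browser.

```javascript
// Sketch only: collectText and findNodesForRange are hypothetical names,
// not functions from the library described in this post.

var TEXT_NODE = 3; // same value as Node.TEXT_NODE in the DOM

// Walk the tree depth-first, recording each text node together with the
// offset at which its text starts in the concatenated string.
function collectText(root) {
  var segments = [];
  var fullText = "";
  (function walk(node) {
    if (node.nodeType === TEXT_NODE) {
      segments.push({ node: node, start: fullText.length });
      fullText += node.nodeValue;
      return;
    }
    var children = node.childNodes || [];
    for (var i = 0; i < children.length; i++) {
      walk(children[i]);
    }
  })(root);
  return { fullText: fullText, segments: segments };
}

// Map a [matchStart, matchEnd) range in the full text back to the text
// nodes it overlaps, with offsets local to each node.
function findNodesForRange(segments, matchStart, matchEnd) {
  var pieces = [];
  for (var i = 0; i < segments.length; i++) {
    var seg = segments[i];
    var segEnd = seg.start + seg.node.nodeValue.length;
    if (segEnd <= matchStart || seg.start >= matchEnd) continue;
    pieces.push({
      node: seg.node,
      from: Math.max(matchStart, seg.start) - seg.start,
      to: Math.min(matchEnd, segEnd) - seg.start
    });
  }
  return pieces;
}

// Plain objects standing in for <p>Call <b>555</b>-1234 now</p>
var sampleTree = {
  nodeType: 1,
  childNodes: [
    { nodeType: 3, nodeValue: "Call " },
    { nodeType: 1, childNodes: [{ nodeType: 3, nodeValue: "555" }] },
    { nodeType: 3, nodeValue: "-1234 now" }
  ]
};

var collected = collectText(sampleTree);
var match = /\d{3}-\d{4}/.exec(collected.fullText);
var pieces = findNodesForRange(
  collected.segments, match.index, match.index + match[0].length);
```

Here the match "555-1234" straddles two text nodes, so pieces has two entries: all of "555" and the first five characters of "-1234 now". In a real page, those local offsets are where you would split the text nodes before reparenting the matching parts.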

I’ve included some documentation in the library as comments, but the main entry point is the searchAndProcess() function. This takes three arguments: a regular expression to search for, a callback function you supply to create a new node to be the parent of the element that contains the matching text, and a cookie value that’s passed to the callback function so you can customize its behavior easily.

The callback function itself receives three arguments: the current document so it can create a new element, the results of the RE match, and the client-supplied cookie. The RE results are the most interesting part, since they’re in the same format that the JS RegExp.exec() function returns. They’re an array where the first entry is the full text matched by the expression, and each subsequent entry contains the text matched by one parenthesized sub-expression. This means I can use the second, third and fourth array entries in the phone number callback to create a number that excludes any spaces or separator characters. The cookie is used to pass in the protocol to use for phone number links, usually ‘callto:’. Here’s an example of that in practice from the test page; view the entire page’s source to see more examples of how to use it.

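function makePhoneElement(currentDoc, matchResults, cookie)
{
    // New element that will become the parent of the matched text
    var anchor = currentDoc.createElement("a");

    // cookie holds the link protocol; entries 1 to 3 are the captured digit groups
    anchor.href = cookie + matchResults[1] + matchResults[2] + matchResults[3];

    return anchor;
}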
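To see what the callback receives in matchResults, here is a small standalone sketch of RegExp.exec() with three capture groups. The phone number pattern below is an assumption made up for illustration; the expression used on the test page may well differ.

```javascript
// Hypothetical pattern for numbers like "(415) 555-1234" or "415.555.1234";
// an assumption for this sketch, not the expression from the test page.
var phoneRE = /\(?(\d{3})\)?[ .-]?(\d{3})[ .-]?(\d{4})/;

var results = phoneRE.exec("Call me on (415) 555-1234 tomorrow");

// results[0] is the whole match, separators included: "(415) 555-1234"
// results[1], results[2], results[3] are the three parenthesized groups.
var digitsOnly = results[1] + results[2] + results[3]; // "4155551234"

// Prefixing the protocol (the role the cookie plays in the callback)
// gives the link target.
var href = "callto:" + digitsOnly; // "callto:4155551234"
```

Because the separators sit outside the parentheses, they appear in results[0] but never in the groups, which is why joining entries 1 to 3 yields a clean number.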
