More resources on mining information from plain text

In my previous post, I presented some regular expressions you can use to spot dates, times, prices, email addresses and web links, along with a test page to see them in practice. REs can be pretty daunting when you’re first working with them, so I wanted to recommend a few resources that have helped me in the past.

The best overall guide on the web is regular-expressions.info, and I used some of Jan’s suggestions for email address matching. He has also written a very clever regular expression assistant that breaks down any cryptic RE into a human-readable description. I also liked this Python tutorial on REs; it’s focused on a good practical example and shows how you’d build up the expression step by step.

As I mentioned yesterday, to demonstrate the power of regular expressions I had to write my own JavaScript library for handling search and replace on a web page. This is a surprisingly tricky problem to solve. First you have to actually get the text for the page, which involves walking the DOM, extracting all the text nodes and concatenating them back together. That lets you search the full text, but if you want to change anything you have to remember which parts came from which elements. Then, since only part of an element’s text may match, and the matching text may spread across several DOM elements, you have to do some awkward node splitting and reparenting to end up with nodes that contain just the match.
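To sketch that first step (illustrative only — `collectTextNodes` and `getFullText` are hypothetical helpers, not the actual library code): recursively gather the text nodes, concatenate their contents, and record each node’s starting offset so a match in the full string can be traced back to the node it came from. The mock tree below stands in for real DOM nodes so you can try this outside a browser.

```javascript
// Illustrative sketch, not the actual library: gather text nodes and
// remember where each one starts in the concatenated string.
function collectTextNodes(node, pieces) {
  if (node.nodeType === 3) { // 3 == TEXT_NODE
    pieces.push(node);
  } else {
    var children = node.childNodes || [];
    for (var i = 0; i < children.length; i += 1) {
      collectTextNodes(children[i], pieces);
    }
  }
  return pieces;
}

function getFullText(root) {
  var nodes = collectTextNodes(root, []);
  var offsets = []; // offsets[i] is where nodes[i] starts in fullText
  var fullText = "";
  for (var i = 0; i < nodes.length; i += 1) {
    offsets.push(fullText.length);
    fullText += nodes[i].nodeValue;
  }
  return { fullText: fullText, nodes: nodes, offsets: offsets };
}

// A mock tree standing in for real DOM nodes:
var tree = { nodeType: 1, childNodes: [
  { nodeType: 3, nodeValue: "Call " },
  { nodeType: 1, childNodes: [{ nodeType: 3, nodeValue: "805 277 3606" }] },
  { nodeType: 3, nodeValue: " today" }
] };

var result = getFullText(tree);
// result.fullText is "Call 805 277 3606 today", and result.offsets
// ([0, 5, 17]) tells you the phone number lives in the second text node.
```

With the offsets in hand, a match at some position in the full string can be mapped back to the node (or span of nodes) that needs splitting.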

I’ve included some documentation in the library as comments, but the main entry point is the searchAndProcess() function. This takes three arguments: a regular expression to search for, a callback function you supply to create a new node that will become the parent of the element containing the matching text, and a cookie value that’s passed on to the callback so you can easily customize its behavior.

The callback function itself receives three arguments: the current document so it can create a new element, the results of the RE match, and the client-supplied cookie. The RE results are the most interesting part, since they’re in the same format returned by the JS RegExp.exec() function: an array where the first entry is the full text matched by the expression, and subsequent entries contain the text matched by each parenthesized sub-expression. This means I can use the second, third and fourth array entries in the phone number callback to create a number that excludes any spaces or separator characters. Here’s an example of that in practice from the test page; view the entire page’s source to see more examples of how to use it. The cookie is used to pass in the protocol to use for phone number links, usually ‘callto:’.

function makePhoneElement(currentDoc, matchResults, cookie)
{
  var anchor = currentDoc.createElement("a");
  anchor.href = cookie + matchResults[1] + matchResults[2] + matchResults[3];
  return anchor;
}

Mining information from text using regular expressions


I had so much fun playing with the regular expressions for this one that I ended up building a fairly elaborate testbed and missed my usual morning-cup-of-tea posting deadline. The page demonstrates using REs to pull out phone numbers, emails, URLs, dates, times and prices from unstructured text, and uses a JS library I’m making freely available to search and replace within an HTML document. There’s some sample text to show it off, and you can put it through its paces by inputting your own strings to really test it.

If you want to see the power of this approach, try grabbing some of your own emails and pasting them into the custom box (it’s all client-side so I’ll never see any of them). You’ll be surprised at how much these expressions will pick up. Imagine how much useful information you could get from a whole company’s mailstore.

Here are the exact REs I’m using; you may need to escape the backslashes depending on your language:

Phone numbers (10 digit US format, with any separators)
([0-9]{3})[^0-9]*([0-9]{3})[^0-9]*([0-9]{4})

Email addresses
[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}

URLs with protocols
https?://[a-z0-9\./%+_\-\?=&#]+

Naked URLs (this will also pick up non-URLs like document.body)
[a-z0-9\./%+_-]+\.[a-z]{2,4}[a-z0-9\./%+_\-\?=&#]*

Dollar amounts
\$\s?[0-9,]+(\.[0-9]{1,2})?(\s(thousand|m[^a-z]|mm[^a-z]|million|b[^a-z]|billion))?

Numerical times
[012]?[0-9]:[0-5][0-9]((\.|:)[0-5][0-9])?(\s?(a|p)m)?

Dates
(January|Jan|February|Feb|March|Mar|April|Apr|May|June|Jun|July|Jul|August|Aug|September|Sept|October|Oct|November|Nov|December|Dec)[^0-9a-z]+([0-9]{1,2})(st|nd|rd|th)?[^0-9a-z]+((19|20)?[0-9]{2})

Numerical dates
([0-9]{1,2})[/-]([0-9]{1,2})[/-]((19|20)?[0-9]{1,2})
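As a quick sanity check, here are a few of these expressions run from JavaScript (remember the case-insensitive flag where needed, since the character classes above only list lower-case letters):

```javascript
var text = "Email pete@petewarden.com or call 805 277 3606 " +
           "before 10:30 am on June 1st, 2008 - it'll cost $10 million.";

// Email addresses (needs the i flag for mixed-case input):
var emailRE = /[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}/gi;

// Phone numbers (10 digit US format, with any separators):
var phoneRE = /([0-9]{3})[^0-9]*([0-9]{3})[^0-9]*([0-9]{4})/g;

// Dollar amounts:
var priceRE = /\$\s?[0-9,]+(\.[0-9]{1,2})?(\s(thousand|m[^a-z]|mm[^a-z]|million|b[^a-z]|billion))?/g;

text.match(emailRE); // ["pete@petewarden.com"]
text.match(phoneRE); // ["805 277 3606"]
text.match(priceRE); // ["$10 million"]
```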

These handle the common cases I’m interested in at the moment. There’s no end to how elaborate you could make them to handle all the possible different formats, but these cover a lot of ground. Now I have to resist the temptation to build these into a Firefox extension. IE’s RE engine does seem to want to overmatch with the email expression, sometimes pulling in characters past the end, but that seems to be an implementation quirk since I don’t notice that in the other environments like Firefox, Safari or grep.

A few easy ways to spot dates, times, phone numbers and prices


As I mentioned in my review of ManagedQ, you can do some really interesting things with regular expressions. Roger Matus talked about IBM’s Personal Email Search tool back in December, and the core of that appears to be using REs to recognise phone numbers, email addresses and URLs in the body of messages. Skype and other companies have been working pretty intensively on phone number recognition, taking things a bit further with knowledge about possible dialing codes to help them reformat the numbers in a standard way. I won’t be taking things that far, but in the next article I’ll be showing you the expressions you need to recognize simple dates, times, phone numbers, email addresses, prices and URLs from a text document. You’ll be able to pick out all of these examples using a few simple expressions:

805 277 3606, 8052773606, 805 277-3606
pete@petewarden.com
http://foo.com , foo.com, http://www.foo.com
$10, $10.99, $10 million
10:30
June 1st, 2008, 6/1/08, 6/1/2008

What I’ve learnt from being a trail crew boss


Last week I had a record turnout to a volunteer trail maintenance day I was organizing, with 28 people. It was exhausting but fun, and we ended up getting a lot of the Guadalasca trail fixed up. It also got me thinking about the lessons I’ve learnt over the last few years of being a crew leader. Some of them came from the classroom training I received from Frank Padilla and Kurt Loheit, but most of what I know is from watching experienced leaders. I’ve learnt under some great bosses like Frank, Kurt, Rich Pinder, Hans Keifer, Jerry Mitcham, Ron Webster, and too many others to mention here.

You’re there to make decisions. People want to know what they should do, there’s no need to feel bad about telling them what you want. They usually don’t want to know the detailed reasoning behind it, but be ready to briefly explain if they do. Be confident, but give time to any objections. Act as if you’re right, but listen like you’re wrong.

Everyone’s there to have fun. Everybody on the crew is a volunteer, there because they choose to be. You’re a leader because they consent to being led, and you’ve essentially no power over them. You can’t dock their pay or threaten to fire them; the only tools you have are praise, persuasion and self-confidence. Luckily, pretty much by definition every volunteer is self-motivated, so those will take you a long way.

The Sandwich Principle. This is Frank’s phrase, and something that I’ve never seen anyone use as skillfully as Rich Pinder. If you’re going to correct somebody, sandwich it in-between two pieces of praise. As in ‘wow, that tread is looking great! Could you try not to kill those endangered woodpeckers with your pick-ax? You’ve sure shaped that drain nicely’. Sounds corny, but it really seems to change people’s behavior without leaving them feeling like they’re being pushed around.

Be prepared. I forgot to bring along my flags when I rode the trail to scout out the work in preparation for last week. This meant that I was reduced to verbally describing a section about a mile up the trail that I wanted a group of experienced volunteers to walk ahead and tackle. Unfortunately my description wasn’t good enough, and they hiked past it and ended up with a gruelling uphill climb. That was my responsibility, as a leader it’s impossible to organize that many people and plan at the same time. That makes it vital to have a plan both thought-out and well-marked for people to follow before the event.

Surround yourself with good people. Even though there were 28 volunteers, many of whom had never done trailwork before, I was able to cope because several of them were SMMTC or CORBA regulars with plenty of experience under their belts. I split people into 5 groups, each with an experienced leader, showed them their work sites and what I was hoping to get done at each, and then was able to leave the details to them. I was still busy answering plenty of questions and giving further directions, but the supervisors took on the bulk of the individual management.

Your main job is to help others work, not to do the work yourself. When you’re experienced, it’s hard to stand back and watch someone else struggling without wanting to step in and take over. The trouble is, there’s almost certainly someone who needs direction who you’re neglecting while you do that, and the person you’re taking over from will learn a lot more if you train them and let them do it than from just watching you. If you’re lucky, you’ll reach a point where everyone’s working away and you can sneak in a drain or two yourself, but your most important contribution is to get everyone else doing the right tasks safely and effectively.

Thanks again to all the volunteers who made it out to Guadalasca, we got some great work done. Liz has got some of her photos up on the Trails Council site, it really was a fantastic day!

Why ManagedQ’s in-page searching is so useful


After stumbling across ManagedQ last week and giving them an unplanned launch, I wasn’t expecting a warm reception from the team. Thankfully it turned out that I already knew one of the founders, which explained why they’d appeared in my visitor logs. They were even kind enough to invite me onto their beta program!

I’m a long-time advocate of unbundling search engines and presentation, so I’m naturally pretty excited about how they overlay a deeply interactive UI on top of Google search results. There are a lot of features I could talk about, but I’ll focus on one of the most novel: the in-page searching.

In ManagedQ, search results show up as a grid of images, each showing a snapshot of the page. Unlike other thumbnail search engines, these are live HTML frames, not just pre-canned images. The power of this is pretty obvious once you start trying to narrow down your search. To start with, you can just start typing a word anywhere on the page and all occurrences of that word will show up within each thumbnail.

For example, if you do a search on "Peter Thiel", and then want to narrow it to results that talk about PayPal, you type in the term and the thumbnails instantly either show you where the word is in the page:

[Screenshot: thumbnail highlighting where "PayPal" appears in the page]

or indicate that the term isn’t there:

[Screenshot: thumbnail showing "PayPal" doesn’t appear in the page]

As it stands, this is powerful stuff. I rely heavily on the summaries Google shows below every result to understand what’s on each page, now I can create custom summaries to find out more about a whole set of results at once. The in-page query stays active as you move through the results, so you can power-search by rapidly browsing through all the pages.

Where it gets even more interesting is when regular expressions are added to the mix. REs are the building blocks of most text processing languages, and offer a very flexible way of describing patterns of letters and numbers. For example, you can describe text that contains a dollar sign, followed by a number, followed by a word, with /\$\d+ \w+/.

If you type that as your in-page search for Peter Thiel, you’ll get results that look like this:

[Screenshot: result thumbnails with dollar-amount matches highlighted]

In detail, each thumbnail now shows every place that a dollar amount is followed by a word, which pulls out all of the fund figures that are mentioned in connection with Peter.

[Screenshot: close-up of a thumbnail with a dollar-amount match highlighted]

This is very useful if you’re doing heavy research. By crafting different REs you can match all sorts of useful patterns, like C function calls with /\w*\(/ , or find a gene in a particular context. Since regular expressions just look like a cat walked across your keyboard to most of the world, the team is planning on offering shortcuts for common queries like dollar amounts.
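As a rough illustration of those shortcuts (plain JavaScript, not ManagedQ’s actual implementation; I’ve used \w+ rather than \w* here so a bare parenthesis isn’t matched on its own):

```javascript
// Function-call shortcut: one or more word characters followed by "(".
var callRE = /\w+\(/g;
var code = "printf(format); exit(0); /* (not a call) */";
code.match(callRE); // ["printf(", "exit("]

// Dollar-amount-plus-word shortcut, in the spirit of the example above:
var dollarRE = /\$\d+ \w+/g;
var sentence = "The fund raised $40 million and then another $500 thousand.";
sentence.match(dollarRE); // ["$40 million", "$500 thousand"]
```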

To my mind, the big advance here is in the workflow. Traditionally you do a search and then click through to the results pages, eyeballing each one for the information you want. If the results aren’t good enough, you’ll go back and refine your query, doing a complete new search. With ManagedQ, you’ve suddenly got an interactive refinement stage that lets you poke and prod the result set and easily get a lot more information. You can instantly narrow your search by ignoring bad results that don’t contain terms you want, without throwing away all the others that could be interesting. You can get a quick feel for whether the results are worth exploring by throwing in good indicator terms that are likely to be in the ones you want. And as I mentioned at the start, you’ve suddenly got the ability to pull out your own summaries rather than relying on Google’s.

Expect to hear more from me on ManagedQ as I dig into its feature set. The concept of breaking out search presentation from the indexing engine has a lot of promise. Even this early version is a powerful demonstration of how far that approach can take you.

A lovely language visualization


In case you missed it on ReadWriteWeb, researchers at MIT and NYU have created a fascinating visual map of nouns. They’re pulling the word relationships from WordNet, a venerable data set that maps the relationships between over 150,000 words. I’ll be studying their paper to understand exactly how they grouped the nouns, since they seem to have done a great job of clustering them in a meaningful way across a 2D surface. That could be very useful for keyword similarity measurements and email grouping.

It also reminds me of a film-strip visualization that I saw a couple of years back, but which I can’t find the reference for, unfortunately. A frame was taken every 5 seconds from a movie and shrunk to a few pixels, and then the sequence of images was arranged in a grid. You could get information about the different scenes and moods in the whole movie at once, just from the color of each section. It wasn’t much, but it was enough to let your brain’s visual processing machinery comprehend the structure.

In the same way, this map is a good way of presenting the whole space of noun categories in a way that’s much easier to navigate than a hierarchical tree or table. A common trick for memorizing arbitrary data like long random numbers is to associate each part with physical locations, because we evolved to be really good at remembering exactly where all the fruit (and leopards!) were in the local jungle. It’s easy to find and return to a given noun in this setup because you’re using the same skills.

The silent rise of Sharepoint


According to a new report, over half of companies that use an Exchange mail server also use Sharepoint. This backs up my personal experience. For example, Liz works for a fairly conservative large company but even they are heavy Sharepoint users.

This is a big technological change, but it tends to slip under a lot of people’s radar because it’s a closed-source, me-too technology with a very traditional business model. It’s successful because it’s stable, uses a familiar UI, is easy to deploy, often comes for free with Office and overall works remarkably well.

Microsoft are providing a ready-made distribution channel for getting your technology in front of employees. They’re training massive numbers of people to create and consume user-generated content on the company’s intranet. The great thing is that they leave plenty of room for third-party products to take advantage of this. They have some Exchange/Sharepoint integration, and no doubt will be increasing that in the future, but there’s a fantastic opportunity to present all sorts of interesting mail-derived information in a place people are already looking. A good example of this would be automatically populating each employee’s homepage with links to her most frequent internal and external contacts, or adding email-driven keywords there to be found by a ‘FindaYoda’ style search.

I’m so convinced this is an important direction, I have my own Sharepoint site I’m using as a testbed, hosted with Front Page Web Services [Update- They’re now FPWeb.net, at http://www.fpweb.net/sharepoint-hosting/ ]. I’ll be posting more about the integration opportunities as I dig deeper, as well as using it when I need to collaborate.

[Update- Eric did a great Sharepoint post on Friday too, with some interesting points on the way collaboration with Sharepoint is heavily grass-roots driven at the moment, which will mean a strong drive for the IT department to catch up]

[Second update –