More resources on mining information from plain text

In my previous post, I presented some regular expressions you can use to spot dates, times, prices, email addresses and web links, along with a test page to see them in practice. REs can be pretty daunting when you’re first working with them, so I wanted to recommend a few resources that have helped me in the past.

The best overall guide on the web is regular-expressions.info, and I used some of Jan's suggestions for email address matching. He has also written a very clever regular expressions assistant that breaks down any cryptic RE into a human-readable description. I also liked this Python tutorial on REs; it's focused on a good practical example and shows how you'd build up an expression step by step.

As I mentioned yesterday, to demonstrate the power of regular expressions on a web page I had to write my own JavaScript library for handling search and replace within an HTML document. This is a surprisingly tricky problem to solve. First you have to actually get the text for the page, which involves walking the DOM, extracting all the text nodes and concatenating them back together. That lets you search the full text, but if you want to change anything you have to remember which parts came from which elements. Then, since only part of an element's text may match, and the matching text may spread across several DOM elements, you have to do some awkward node splitting and reparenting to get nodes that contain just the match.
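
To make this concrete, here's a minimal sketch of the text-gathering step (my illustration, not the library's actual code): walk the DOM recursively, append each text node's contents to one string, and record where each node's text starts so a match in the full text can be traced back to its nodes.

// Rough sketch of the text-gathering step. For each text node we
// remember its offset into the concatenated string, so a match found
// in the full text can be mapped back to the nodes it came from.
function gatherText(node, result) {
    if (node.nodeType === 3) { // a text node
        result.nodes.push({ node: node, start: result.text.length });
        result.text += node.nodeValue;
    } else {
        for (var child = node.firstChild; child; child = child.nextSibling) {
            gatherText(child, result);
        }
    }
    return result;
}

var gathered = gatherText(document.body, { nodes: [], text: "" });
// gathered.text now holds the page's full text, ready for RE searching.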

I've included some documentation in the library as comments, but the main entry point is the searchAndProcess() function. This takes three arguments: a regular expression to search for, a callback function you supply to create a new node that will become the parent of the element containing the matching text, and a cookie value that's passed to the callback so you can customize its behavior easily.

The callback function itself receives three arguments: the current document (so it can create a new element), the results of the RE match, and the client-supplied cookie. The RE results are the most interesting part, since they're in the same format returned by the JS RegExp.exec() function: an array where the first entry is the full text matched by the expression, and subsequent entries contain the text matched by each parenthesized sub-expression. This means I can use the second, third and fourth array entries in the phone number callback to build a number that excludes any spaces or separator characters. Here's an example of that in practice from the test page; view the entire page's source to see more examples of how to use it. The cookie is used to pass in the protocol for phone number links, usually 'callto:'.

function makePhoneElement(currentDoc, matchResults, cookie)
{
    // Create the anchor that the library will reparent the matching
    // text under.
    var anchor = currentDoc.createElement("a");

    // Entries 1-3 are the parenthesized digit groups, so concatenating
    // them gives the number with any separators stripped out.
    anchor.href = cookie + matchResults[1] + matchResults[2] + matchResults[3];

    return anchor;
}
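
Putting the two together, a call might look like this (hypothetical usage, based on the argument order described above, with the phone number RE from the test page):

// Wrap every phone number in the document in a link, passing 'callto:'
// through as the cookie so the callback can build the href.
var phoneRE = /([0-9]{3})[^0-9]*([0-9]{3})[^0-9]*([0-9]{4})/;
searchAndProcess(phoneRE, makePhoneElement, "callto:");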

Mining information from text using regular expressions


I had so much fun playing with the regular expressions for this one that I ended up building a fairly elaborate testbed and missed my usual morning-cup-of-tea posting deadline. The page demonstrates using REs to pull out phone numbers, emails, URLs, dates, times and prices from unstructured text, and uses a JS library I'm making freely available to search and replace within an HTML document. There's some sample text to show it off, and you can put it through its paces by inputting your own strings to really test it.

If you want to see the power of this approach, try grabbing some of your own emails and pasting them into the custom box (it’s all client-side so I’ll never see any of them). You’ll be surprised at how much these expressions will pick up. Imagine how much useful information you could get from a whole company’s mailstore.

Here are the exact REs I'm using; you may need to escape the backslashes depending on your language:

Phone numbers (10 digit US format, with any separators)
([0-9]{3})[^0-9]*([0-9]{3})[^0-9]*([0-9]{4})

Email addresses
[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}

URLs with protocols
https?://[a-z0-9\./%+_\-\?=&#]+

Naked URLs (this will also pick up non-URLs like document.body)
[a-z0-9\./%+_-]+\.[a-z]{2,4}[a-z0-9\./%+_\-\?=&#]*

Dollar amounts
\$\s?[0-9,]+(\.[0-9]{1,2})?(\s(thousand|m[^a-z]|mm[^a-z]|million|b[^a-z]|billion))?

Numerical times
[012]?[0-9]:[0-5][0-9]((\.|:)[0-5][0-9])?(\s?(a|p)m)?

Dates
(January|Jan|February|Feb|March|Mar|April|Apr|May|June|Jun|July|Jul|August|Aug|September|Sept|October|Oct|November|Nov|December|Dec)[^0-9a-z]+([0-9]{1,2})(st|nd|rd|th)?[^0-9a-z]+((19|20)?[0-9]{2})

Numerical dates
([0-9]{1,2})[/-]([0-9]{1,2})[/-]((19|20)?[0-9]{1,2})

These handle the common cases I'm interested in at the moment. There's no end to how elaborate you could make them to handle all the possible formats, but these cover a lot of ground. Now I have to resist the temptation to build them into a Firefox extension. IE's RE engine does seem to overmatch with the email expression, sometimes pulling in characters past the end, but that looks like an implementation quirk since I don't see it in other environments like Firefox, Safari or grep.
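
If you want to try one of these outside the test page, here's a quick sanity check you can paste into any browser console (my example, using the phone number expression from the list above):

// exec() returns an array: the full matched text first, then one
// entry per parenthesized group.
var phoneRE = /([0-9]{3})[^0-9]*([0-9]{3})[^0-9]*([0-9]{4})/;
var result = phoneRE.exec("Call me at 805 277-3606 tomorrow");
// result[0] is "805 277-3606"
// result[1], result[2], result[3] are "805", "277", "3606"
var normalized = result[1] + result[2] + result[3]; // "8052773606"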

A few easy ways to spot dates, times, phone numbers and prices I


As I mentioned in my review of ManagedQ, you can do some really interesting things with regular expressions. Roger Matus talked about IBM's Personal Email Search tool back in December, and the core of that appears to be using REs to recognize phone numbers, email addresses and URLs in the body of messages. Skype and other companies have been working pretty intensively on phone number recognition, taking things a bit further with knowledge about possible dialing codes to help them reformat the numbers in a standard way. I won't be taking things that far, but in the next article I'll show you the expressions you need to recognize simple dates, times, phone numbers, email addresses, prices and URLs in a text document. You'll be able to pick out all of these examples using a few simple expressions:

805 277 3606, 8052773606, 805 277-3606
pete@petewarden.com
http://foo.com, foo.com, http://www.foo.com
$10, $10.99, $10 million
10:30
June 1st, 2008, 6/1/08, 6/1/2008

What I’ve learnt from being a trail crew boss


Last week I had a record turnout to a volunteer trail maintenance day I was organizing, with 28 people. It was exhausting but fun, and we ended up getting a lot of the Guadalasca trail fixed up. It also got me thinking about the lessons I’ve learnt over the last few years of being a crew leader. Some of them came from the classroom training I received from Frank Padilla and Kurt Loheit, but most of what I know is from watching experienced leaders. I’ve learnt under some great bosses like Frank, Kurt, Rich Pinder, Hans Keifer, Jerry Mitcham, Ron Webster, and too many others to mention here.

You’re there to make decisions. People want to know what they should do, there’s no need to feel bad about telling them what you want. They usually don’t want to know the detailed reasoning behind it, but be ready to briefly explain if they do. Be confident, but give time to any objections. Act as if you’re right, but listen like you’re wrong.

Everyone's there to have fun. Everybody on the crew is a volunteer, there because they choose to be. You're a leader because they consent to being led, and you've essentially no power over them. You can't dock their pay or threaten to fire them; the only tools you have are praise, persuasion and self-confidence. Luckily, pretty much by definition every volunteer is self-motivated, so those will take you a long way.

The Sandwich Principle. This is Frank’s phrase, and something that I’ve never seen anyone use as skillfully as Rich Pinder. If you’re going to correct somebody, sandwich it in-between two pieces of praise. As in ‘wow, that tread is looking great! Could you try not to kill those endangered woodpeckers with your pick-ax? You’ve sure shaped that drain nicely’. Sounds corny, but it really seems to change people’s behavior without leaving them feeling like they’re being pushed around.

Be prepared. I forgot to bring along my flags when I rode the trail to scout out the work in preparation for last week. This meant I was reduced to verbally describing a section about a mile up the trail that I wanted a group of experienced volunteers to walk ahead and tackle. Unfortunately my description wasn't good enough, and they hiked past it and ended up with a gruelling uphill climb. That was my responsibility; as a leader it's impossible to organize that many people and plan at the same time, which makes it vital to have a plan that's both thought-out and well-marked for people to follow before the event.

Surround yourself with good people. Even though there were 28 volunteers, many of whom had never done trailwork before, I was able to cope because several of them were SMMTC or CORBA regulars with plenty of experience under their belts. I split people into 5 groups, each with an experienced leader, showed them their work sites and what I was hoping to get done at each, and then was able to leave the details to them. I was still busy answering plenty of questions and giving further directions, but the supervisors took on the bulk of the individual management.

Your main job is to help others work, not do the work yourself. When you're experienced, it's hard to stand back and watch someone else struggling without wanting to step in and take over. The trouble is, there's almost certainly someone who needs direction who you're neglecting while you do that, and the person you're taking over from will learn a lot more if you train them and let them do it than from just watching you. If you're lucky, you'll reach a point where everyone's working away and you can sneak in a drain or two yourself, but your most important contribution is to get everyone else doing the right tasks safely and effectively.

Thanks again to all the volunteers who made it out to Guadalasca; we got some great work done. Liz has some of her photos up on the Trails Council site. It really was a fantastic day!

Why ManagedQ’s in-page searching is so useful


After stumbling across ManagedQ last week and giving them an unplanned launch, I wasn’t expecting a warm reception from the team. Thankfully it turned out that I already knew one of the founders, which explained why they’d appeared in my visitor logs. They were even kind enough to invite me onto their beta program!

I'm a long-time advocate of unbundling search engines and presentation, so I'm naturally pretty excited about how they overlay a deeply interactive UI on top of Google search results. There are a lot of features I could talk about, but I'll focus on one of the most novel: in-page searching.

In ManagedQ, search results show up as a grid of images, each showing a snapshot of the page. Unlike other thumbnail search engines, these are live HTML frames, not just pre-canned images. The power of this is pretty obvious once you start trying to narrow down your search. To start with, you can simply type a word anywhere on the page and all occurrences of that word will show up within each thumbnail.

For example, if you do a search on "Peter Thiel", and then want to narrow it to results that talk about PayPal, you type in the term and the thumbnails instantly either show you where the word is in the page:

[Screenshot: thumbnails highlighting where "PayPal" appears in each page]

or indicate that the term isn't there:

[Screenshot: a thumbnail indicating "PayPal" doesn't appear on the page]

As it stands, this is powerful stuff. I rely heavily on the summaries Google shows below every result to understand what's on each page; now I can create custom summaries to find out more about a whole set of results at once. The in-page query stays active as you move through the results, so you can power-search by rapidly browsing through all the pages.

Where it gets even more interesting is when regular expressions are added to the mix. REs are the building blocks of most text processing languages, and offer a very flexible way of describing patterns of letters and numbers. For example, you can describe some text that contains a dollar sign, followed by a number, followed by a whole word, with /\$\d+ \w+/.

If you type that as your in-page search for Peter Thiel, you’ll get results that look like this:

[Screenshot: the result grid with dollar amounts highlighted in each thumbnail]

In detail, each thumbnail now shows every place that a dollar amount is followed by a word, which pulls out all of the fund figures that are mentioned in connection with Peter.

[Screenshot: a close-up of one thumbnail's matched dollar amounts]

This is very useful if you're doing heavy research. By crafting different REs you can match all sorts of useful patterns, like C function calls with /\w+\(/, or a gene in a particular context. Since regular expressions just look like a cat walked across your keyboard to most of the world, the team is planning on offering shortcuts for common queries like dollar amounts.

To my mind, the big advance here is in the workflow. Traditionally you do a search and then click through to the results pages, eyeballing each one for the information you want. If the results aren’t good enough, you’ll go back and refine your query, doing a complete new search. With ManagedQ, you’ve suddenly got an interactive refinement stage that lets you poke and prod the result set and easily get a lot more information. You can instantly narrow your search by ignoring bad results that don’t contain terms you want, without throwing away all the others that could be interesting. You can get a quick feel for whether the results are worth exploring by throwing in good indicator terms that are likely to be in the ones you want. And as I mentioned at the start, you’ve suddenly got the ability to pull out your own summaries rather than relying on Google’s.

Expect to hear more from me on ManagedQ as I dig into its feature set. The concept of breaking out search presentation from the indexing engine has a lot of promise. Even this early version is a powerful demonstration of how far that approach can take you.

A lovely language visualization


In case you missed it on ReadWriteWeb, researchers at MIT and NYU have created a fascinating visual map of nouns. They’re pulling the word relationships from WordNet, a venerable data set that maps the relationships between over 150,000 words. I’ll be studying their paper to understand exactly how they grouped the nouns, since they seem to have done a great job of clustering them in a meaningful way across a 2D surface. That could be very useful for keyword similarity measurements and email grouping.

It also reminds me of a film-strip visualization I saw a couple of years back, though unfortunately I can't find the reference for it. A frame was taken every 5 seconds from a movie and shrunk to a few pixels, and then the sequence of images was arranged in a grid. You could get information about the different scenes and moods in the whole movie at once, just from the color of each section. It wasn't much, but it was enough to let your brain's visual processing machinery comprehend the structure.

In the same way, this map is a good way of presenting the whole space of noun categories in a way that’s much easier to navigate than a hierarchical tree or table. A common trick for memorizing arbitrary data like long random numbers is to associate each part with physical locations, because we evolved to be really good at remembering exactly where all the fruit (and leopards!) were in the local jungle. It’s easy to find and return to a given noun in this setup because you’re using the same skills.

The silent rise of Sharepoint


According to a new report, over half of companies that use an Exchange mail server also use Sharepoint. This backs up my personal experience. For example, Liz works for a fairly conservative large company but even they are heavy Sharepoint users.

This is a big technological change, but it tends to slip under a lot of people's radar because it's a closed-source, me-too technology with a very traditional business model. It's successful because it's stable, uses a familiar UI, is easy to deploy, often comes for free with Office and overall works remarkably well.

Microsoft are providing a ready-made distribution channel for getting your technology in front of employees. They’re training massive numbers of people to create and consume user-generated content on the company’s intranet. The great thing is that they leave plenty of room for third-party products to take advantage of this. They have some Exchange/Sharepoint integration, and no doubt will be increasing that in the future, but there’s a fantastic opportunity to present all sorts of interesting mail-derived information in a place people are already looking. A good example of this would be automatically populating each employee’s homepage with links to her most frequent internal and external contacts, or adding email-driven keywords there to be found by a ‘FindaYoda’ style search.

I'm so convinced this is an important direction that I have my own Sharepoint site I'm using as a testbed, hosted with Front Page Web Services [Update- They're now FPWeb.net, at http://www.fpweb.net/sharepoint-hosting/ ]. I'll be posting more about the integration opportunities as I dig deeper, as well as using it when I need to collaborate.

[Update- Eric did a great Sharepoint post on Friday too, with some interesting points on the way collaboration with Sharepoint is heavily grass-roots driven at the moment, which will mean a strong drive for the IT department to catch up]


Two ways you can easily find interesting phrases from an email

Maybe it was my weekly D&D game last night, but probability is on my mind. One thing I’ve learnt from working in games is that accuracy is overrated in AI. Most problems in that domain have no perfect solution. The trick is to find a technique that’s right often enough to be useful, and then make it part of a workflow that makes coping with the incorrect guesses painless for the user.

A lot of Amazon's algorithms work like this. They recommend other books based on rough statistical measures that bring up mostly uninteresting items, but they're right often enough to justify me spending a few seconds looking at what they found. The same goes for their statistically improbable phrases (SIPs). They're odd and random most of the time, but usually one or two of them do give me an insight into the book's contents.

This is interesting for email because when I’m searching through a lot of messages I need a quick way to understand something about what they contain without reading the whole text. One of the key features of Google’s search results is the summary they extract surrounding the keywords for each hit. This gives you a pretty good idea of what the page is actually discussing. In a similar way I want to present some key phrases from an email that very quickly give you a sense of what it’s about.

The main approach I'm using is vanilla SIPs, but there are a couple of other interesting heuristics (which sounds so much more technical than 'ways of guessing'). The first is looking for capitalized phrases within sentences. These are usually proper nouns, so you'll get a rough idea of which people or places are discussed in a document. The second is finding sentences that end with a question mark, so you can see what questions are asked in an email.
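
Here's a rough sketch of both heuristics in JavaScript (my own illustration, assuming plain-text English input, not production code):

// Capitalized phrases that follow a lower-case word, so we skip the
// capital that starts every sentence. A rough proper-noun detector.
function findCapitalizedPhrases(text) {
    var matches = text.match(/[a-z][,;]? (?:[A-Z][a-z]+ ?)+/g) || [];
    return matches.map(function (m) {
        return m.replace(/^[a-z][,;]? /, "").replace(/ $/, "");
    });
}

// Any run of text ending in a question mark counts as a question.
function findQuestions(text) {
    return text.match(/[^.!?]+\?/g) || [];
}

findCapitalizedPhrases("I met Peter Warden in Los Angeles.");
// => ["Peter Warden", "Los Angeles"]
findQuestions("Can you send the report? Thanks.");
// => ["Can you send the report?"]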

These are fun because they both rely on easily-parsed quirks of the language, rather than deep semantic processing. This means they're quick and easy to implement. It also means they're not very portable to other languages (German capitalizes all nouns, for example), but one problem at a time!

How to use corporate data to identify experts


Nick over at the Disruptor Monkey blog talks about how their FindaYoda feature has proved a surprise hit. This is a way of seeing who else has a lot of material with a keyword you’re looking for, and its success backs up one of the hunches that’s driving my work. I know from my own experience of working in a large tech company that there’s an immense amount of wheel-reinventing going on just because it’s so hard to find the right person to talk to.

As a practical example I know of at least four different image comparison tools that were written by different teams for use with automated testing, with pretty much identical requirements. One of the biggest ways I helped productivity was simply by being curious about what other people were working on and making connections when I heard about overlap.

One of the tools I'd love to have is a way to map keywords to people. It's one of the selling points of Krugle's enterprise code search engine: once you can easily search the whole company's code, you can see who else has worked with an API or algorithm. Trampoline Systems aims to do something similar using a whole company's email store; they describe it as letting you discover knowledge assets. I'm working along the same lines with my automatic tag generation for email.

It's not only useful for the people at the coalface; it's also a benefit that seems to resonate with managers. The amount and cost of the redundant effort is often clearer to them than to the folks doing the work. Since the executives are the ones who make the purchasing decisions, that should help the sales process.

How to write a socket server VI


Once you’ve got a socket server running locally on a web-hosting machine, you need to expose it to the outside world. Luckily this is quite easy using PHP. Sockets are well-supported and already heavily used to communicate with MySQL.

Take the source from the previous article, and build the server part on the machine you’ll be hosting the service on. You’ll want this service to run even if you’re logged off, so for now use the command

nohup ./wcserver /tmp/myservicesocket &

This will keep the server process running even if you exit the current terminal window. On a production system you’ll actually want to start the server when the system reboots, but this approach is simpler for testing purposes.

The next hurdle to overcome is making sure that your Apache httpd process, which runs PHP, has the right permissions to access that socket file. This will depend on the user setup on your machine, but typically you’ll have a special apache user account that you’ll need to add to the file access list. For testing purposes you can always grant everyone on the machine access to the socket file, though it would be preferable for security reasons to be a bit more picky in production. Run this command to grant everyone permission to access it:

chmod a+rw /tmp/myservicesocket

Now that the server is running and available, you need to write a PHP equivalent of the command-line client. Here's the source; I'll go over the details below.

$fp = stream_socket_client("unix:///tmp/myservicesocket",
    $errno, $errstr, 30);

PHP takes care of a lot of the socket setup code for you. The only bit I found tricky was specifying a local file socket; it turns out you do that using the special 'unix' protocol specification in the URL, followed by the file path.

fwrite($fp, "Some Message\n");
while (!feof($fp)) {
    echo fgets($fp, 1024);
}
fclose($fp);

The connect call returns a file handle that you can use with the standard file access functions. As you can see, it looks pretty similar to the C version of the same code. You can see the results running on one of my servers here.
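
For reference, here are those pieces assembled into a single standalone script (my own consolidation, with a basic error check added):

<?php
// Connect to the local service over its Unix domain socket, send one
// message, and echo everything the server sends back.
$fp = stream_socket_client("unix:///tmp/myservicesocket",
    $errno, $errstr, 30);
if (!$fp) {
    die("Connection failed: $errstr ($errno)\n");
}
fwrite($fp, "Some Message\n");
while (!feof($fp)) {
    echo fgets($fp, 1024);
}
fclose($fp);
?>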

The example service isn't doing very much so far. What's really exciting about this approach is that it offers a completely language-independent way for any interesting computational service to be integrated into the standard LAMP stack. A lot of my work involves processor-heavy computation on hard problems like laying out large graphs and statistical analysis. This should let me move that work off the client and turn it into online services.