Now you can try ManagedQ for yourself

My anonymous friends over at ManagedQ have left their private beta and opened their search service to everyone. I already covered how helpful their regular expression in-page searching can be, and they have a lot more to offer too, like their entity extraction and the most accurate thumbnails I’ve seen. You can see more reviews on AltSearchEngines and thenextweb.com.

I’ve been having some fun using the regular expressions I posted a few days ago with ManagedQ. To see their power, follow these steps:

1) Go to managedq.com and enter your main search terms (eg pete warden)
2) On the results page, start typing to bring up the in-page search box
3) Delete anything that’s already in there and enter the following regular expression:
/([0-9]{3})[^0-9a-z]*([0-9]{3})[^0-9a-z]*([0-9]{4})/

This should highlight any phone numbers in the results pages. I made the expression a bit more restrictive than my previous version to exclude letters as phone number separators.
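If you want to play with the same pattern outside the browser, here's a minimal PHP sketch of what the in-page highlighting is doing. The sample text is made up, and I've added the i flag so the letter exclusion covers both cases:

// Sample text standing in for a results page
$text = "Call us on 310-555-0123 or (800) 555 0199 for details.";

// The same pattern as above: three runs of digits separated by
// anything that isn't a letter or a digit
$phonere = "/([0-9]{3})[^0-9a-z]*([0-9]{3})[^0-9a-z]*([0-9]{4})/i";

preg_match_all($phonere, $text, $matches, PREG_SET_ORDER);
foreach ($matches as $match)
  echo $match[1]."-".$match[2]."-".$match[3]."\n";
// Prints 310-555-0123 and 800-555-0199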

Why is web search so popular and mail mining so rare?

Looked at from a high level, they both take unstructured data and try to understand its meaning. A big practical difference is that web search tools are designed for the masses to use, whereas email mining is only used by a small number of professionals either doing litigation discovery or business intelligence work. Why is this?

There’s no obvious painful problem. With web search, the problem is "I need to find authoritative information on X". With mail, the question is more like "I need to find the discussion I was involved in on X", which can be solved locally by searching your inbox; that doesn’t need mining, just a plain search of your drive or personal webmail repository.

Email is private. Whilst technically your work email belongs to the company and they’re free to do whatever they like with it, a lot of people have sensitive personal information or discussions over their work account. Even leaving aside the ethical issues, you won’t get adoption unless employees feel comfortable about their privacy. A mass-use mining system needs to have privacy policies built-in from the start, which is a tricky balancing act because you also want to make as much available as possible.

Messages have no hyperlinks to each other. PageRank works because there’s a network of links between web pages. The closest equivalent for mail is the graph of who emails whom, and how often and quickly an email is replied to or forwarded. This is still a research topic though; it’s not a widely used or understood metric.
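As a concrete illustration, here's a minimal sketch of that graph, assuming you've already reduced each message to a from/to pair. The addresses and messages are made up:

// Hypothetical message list; in practice these would come from
// parsing mail headers
$messages = array(
  array("from" => "alice@example.com", "to" => "bob@example.com"),
  array("from" => "bob@example.com",   "to" => "alice@example.com"),
  array("from" => "alice@example.com", "to" => "bob@example.com"),
);

// Count how many messages flow along each directed edge
$edgeweights = array();
foreach ($messages as $message)
{
  $edge = $message["from"]." -> ".$message["to"];
  if (!isset($edgeweights[$edge]))
    $edgeweights[$edge] = 0;
  $edgeweights[$edge] += 1;
}

print_r($edgeweights);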

This all sounds fairly downbeat, but what really excites me is that I think there are plenty of painful problems that can be solved with mail mining (eg find an expert, find contacts, collaboration); they’re just not as obvious. There are a lot of smart ideas from web search that can be applied to mail too. I also think there are some big advantages to email.

You know who your users are. Inside a company something like Active Directory gives you a wealth of information about who everyone is, what their formal relationship is, and allows you to easily authenticate identity to control access. The web is struggling towards this, but it’s still a long way off. Even for people outside the company, an email address is a good proxy for identity and usually comes with an alternate readable name too. Knowing about your users ahead of time also opens the door to doing a lot of pre-processing before they even try the service, so you can present them with useful information immediately, for example pre-building their social graph.

Time. Another great feature of email is that you’ve got data from a whole range of time, not just a snapshot of how the content looks right now. This opens the door to a lot of time-based analysis techniques, such as measuring how metrics change over a year. The web has the Wayback Machine, which is an amazing feat but still a long way from the depth of mail.
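Here's a minimal sketch of one such technique, bucketing message dates by month to watch how volume changes. The dates are hypothetical stand-ins for real message headers:

// Hypothetical message dates pulled from mail headers
$messagedates = array("2007-01-15", "2007-01-28", "2007-02-03",
  "2007-02-17", "2007-02-21", "2007-04-09");

// Tally message volume per "YYYY-MM" month bucket
$volumebymonth = array();
foreach ($messagedates as $date)
{
  $month = substr($date, 0, 7);
  if (!isset($volumebymonth[$month]))
    $volumebymonth[$month] = 0;
  $volumebymonth[$month] += 1;
}

ksort($volumebymonth);
print_r($volumebymonth);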

See what Google thinks your site is about with a search cloud

If you want to know which search terms are most likely to find your site, I’ve uploaded a PHP library that creates search clouds from your logs. To use it, include searchcloud.php and call create_search_cloud(), passing in the location of your log file, the name of your site, the number of tags to produce, and the min/max font sizes in percentages. You’ll be returned a string containing the HTML for the cloud. Here’s an example:

echo create_search_cloud("visitlogs_petewarden.txt", "petewarden.com", 50, 50, 250);

You can see it working on this example page based on statistics from my old open-source image processing site, which I’ve also included with the library for testing purposes.
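I won't reproduce the library here, but as a rough sketch of the core idea (the function name and details below are mine, not the library's), you tally how often each term shows up and then interpolate font sizes between the min and max:

function sketch_search_cloud($terms, $tagcount, $minsize, $maxsize)
{
  // Tally how often each term was searched for, and keep the
  // most frequent $tagcount of them
  $counts = array_count_values($terms);
  arsort($counts);
  $counts = array_slice($counts, 0, $tagcount, true);

  $mincount = min($counts);
  $maxcount = max($counts);
  $range = max(1, $maxcount - $mincount);

  $result = "";
  foreach ($counts as $term => $count)
  {
    // Linearly interpolate the font size from the term's frequency
    $size = $minsize + (($count - $mincount) * ($maxsize - $minsize)) / $range;
    $result .= '<span style="font-size:'.round($size).'%">'.
      htmlspecialchars($term).'</span> ';
  }
  return $result;
}

Pass it a flat array of search terms, like the one returned by the extract_search_terms() function in the next post below, and echo the result into your page.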

Based on the examples I’ve tried, my hypothesis that the most frequent search terms are a good approximation for the meaning of the site holds up. If you take the top 8 terms from the petewarden.com cloud, you get "after effects", "plugins", "effects", "after", "how to", "install", "how to install" and "petes plugins". 4 of them would make good tags or taxonomy categories for the content, and on inspection, more sophisticated rejection of duplicates and stop words would increase that ratio. I’ll be interested to hear how this works on some of your sites.

How to get search terms from referer logs

To give search clouds a try, I need to extract the terms that visitors used to find my site from the referer logs. Luckily that’s another place where regular expressions come in handy. Here are the REs for the URLs of the three major search engines, MSN Live, Yahoo and Google, along with regional variants like google.co.uk. You should run them with case sensitivity disabled.

google(\.[a-z]{2,6})?\.[a-z]{2,4}\/search\?[a-z0-9&=+%]*q=([a-z 0-9=+%]+)
yahoo(\.[a-z]{2,6})?\.[a-z]{2,4}\/search\?[a-z0-9&=+%]*p=([a-z 0-9=+%]+)
live(\.[a-z]{2,6})?\.[a-z]{2,4}\/results\.aspx\?[a-z0-9&=+%]*q=([a-z 0-9=+%]+)

You’ll end up with the search terms in the form "term1+term2+term3" in the second parenthesized group. If you want them as plain text, run urldecode() or equivalent on the string. Here’s a PHP function that takes the location of a log file and returns an array of all the search terms listed in the referer URLs:

function extract_search_terms($filename)
{
  $logcontents = file_get_contents($filename);

  // One expression per engine; the second parenthesized group
  // captures the URL-encoded search terms
  $searchrelist = array(
    "/google(\.[a-z]{2,6})?\.[a-z]{2,4}\/".
      "search\?[a-z0-9&=+%]*q=([a-z 0-9=+%]+)/i",
    "/yahoo(\.[a-z]{2,6})?\.[a-z]{2,4}\/".
      "search\?[a-z0-9&=+%]*p=([a-z 0-9=+%]+)/i",
    "/live(\.[a-z]{2,6})?\.[a-z]{2,4}\/".
      "results\.aspx\?[a-z0-9&=+%]*q=([a-z 0-9=+%]+)/i",
  );

  $result = array();
  foreach ($searchrelist as $searchre)
  {
    // Find every referer URL in the log that matches this engine
    preg_match_all($searchre, $logcontents, $logmatches,
      PREG_PATTERN_ORDER);

    // Decode the "+" and "%xx" escapes to get plain-text terms
    foreach ($logmatches[2] as $currentmatch)
      array_push($result, urldecode($currentmatch));
  }

  return $result;
}
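To turn the raw list into something useful, you can count how often each term shows up. For example (the log file name is just a placeholder):

$terms = extract_search_terms("visitlogs_petewarden.txt");
$counts = array_count_values($terms);
arsort($counts);

// Show the ten most common searches and how often they appear
print_r(array_slice($counts, 0, 10, true));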

A happy anniversary

Exactly 6 years ago I met a beautiful young girl called Liz. We were both working as volunteers fixing up the Musch Trail, and our shared love of the mountains brought us together. Since then she’s been a constant source of love, strength and happiness, and has never stopped surprising me. I’m looking forward to many more adventures together.

If you are a local hiker you should check out the Santa Monica Mountains Trails Council site that she’s been working on. I especially like her Plant of the Month feature, and the hundreds of trailwork pictures she’s collected.

You can turn search on its head

A search engine is designed to take some keywords, and return web pages that match them. What fascinates me is that the mapping of words to pages could easily be done in reverse: given a particular web page, tell me what keywords are most likely to find it. My hunch is that this set of words, maybe presented as a tag cloud, would give a pretty good summary of what the page is about.

The closest example I’ve found out there is this blog entry. It’s got what appears at first to be a fairly random list of keywords, but digging into them, it looks like Darrin is a Vancouver-based Titanic fan who’s posted about the beautiful agony art project and has done a lot of wedding posts.

What’s really interesting about this is that the search terms that show up aren’t just based on textual frequency within the site, they’re also the product of how often people search for particular words at all. Essentially it’s giving a lot more weight to terms people actually care about, rather than just all terms that are statistically improbable.

At the moment the only way to implement this is to process an individual site’s visitor logs to pull out the frequency of keyword searches that lead to a visit. However, search engines know the historical frequency of particular query terms up front, so it would be possible for them to take an arbitrary new page and simulate which searches would be likely to land on it. You could do something similar for a mail message; essentially you’d be filtering statistically improbable phrases to get statistically improbable and interesting phrases instead.
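Here's a rough sketch of that weighting idea. Both arrays are hypothetical stand-ins: a real search engine would have the actual query frequencies, and the page counts would come from indexing the page itself:

// How often each phrase appears on the page (made-up numbers)
$pagetermcounts = array(
  "after effects" => 42,
  "plugins"       => 30,
  "the"           => 500,
);

// How often people actually search for each phrase (made-up numbers)
$querypopularity = array(
  "after effects" => 0.8,
  "plugins"       => 0.5,
  "the"           => 0.001,
);

// Weight on-page frequency by query popularity
$scores = array();
foreach ($pagetermcounts as $term => $count)
{
  $popularity = isset($querypopularity[$term]) ?
    $querypopularity[$term] : 0.0;
  $scores[$term] = $count * $popularity;
}

arsort($scores);
print_r($scores);
// "after effects" and "plugins" come out ahead of "the", even
// though "the" is far more frequent on the page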