How to get search terms from referer logs


To give search clouds a try, I need to extract the terms that visitors used to find my site from the referer logs. Luckily, that’s another place where regular expressions come in handy. Here are the REs for the URLs of three major search engines, MSN Live, Yahoo and Google, including regional variants like google.co.uk. You should run them with case sensitivity disabled.

google(\.[a-z]{2,6})?\.[a-z]{2,4}\/search\?[a-z0-9&=+%]*q=([a-z 0-9=+%]+)
yahoo(\.[a-z]{2,6})?\.[a-z]{2,4}\/search\?[a-z0-9&=+%]*p=([a-z 0-9=+%]+)
live(\.[a-z]{2,6})?\.[a-z]{2,4}\/results\.aspx\?[a-z0-9&=+%]*q=([a-z 0-9=+%]+)
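
To see what one of these patterns captures, here is a minimal sketch that runs the Google RE against a single referer URL. The URL is made up for illustration, and the variable names are my own; the pattern itself is the one listed above, wrapped in / delimiters with the i flag to turn off case sensitivity.

```php
<?php
// Hypothetical referer URL, invented for this example.
$referer = 'http://www.google.co.uk/search?hl=en&q=regular+expressions+php';

// The Google pattern from above; the trailing i makes it case-insensitive.
$googlere = '/google(\.[a-z]{2,6})?\.[a-z]{2,4}\/' .
  'search\?[a-z0-9&=+%]*q=([a-z 0-9=+%]+)/i';

$terms = null;
if (preg_match($googlere, $referer, $matches))
{
  // $matches[1] is the optional regional part of the host (".co" here),
  // $matches[2] is the raw, still-encoded search string.
  $terms = $matches[2];
  echo $terms, "\n";
}
```

The first parenthesized group only exists to allow hosts like google.co.uk, which is why the search string lands in the second one.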

You’ll end up with the search terms in the form "term1+term2+term3" in the second parenthesized group of each match. If you want them as plain text, run urldecode() or an equivalent on the string. Here’s a PHP function that takes the location of a log file and returns an array of all the search terms found in the referer URLs:

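function extract_search_terms($filename)
{
  // Read the whole log into memory; return an empty list if that fails.
  $logcontents = file_get_contents($filename);
  if ($logcontents === false)
    return array();

  // One pattern per search engine. The second parenthesized group holds
  // the still-encoded search terms; /i turns off case sensitivity.
  $searchrelist = array(
    "/google(\.[a-z]{2,6})?\.[a-z]{2,4}\/".
      "search\?[a-z0-9&=+%]*q=([a-z 0-9=+%]+)/i",
    "/yahoo(\.[a-z]{2,6})?\.[a-z]{2,4}\/".
      "search\?[a-z0-9&=+%]*p=([a-z 0-9=+%]+)/i",
    "/live(\.[a-z]{2,6})?\.[a-z]{2,4}\/".
      "results\.aspx\?[a-z0-9&=+%]*q=([a-z 0-9=+%]+)/i",
  );

  $result = array();
  foreach ($searchrelist as $searchre)
  {
    preg_match_all($searchre, $logcontents, $logmatches,
      PREG_PATTERN_ORDER);

    // With PREG_PATTERN_ORDER, $logmatches[2] lists every match of the
    // second group, so decode each one into plain text.
    foreach ($logmatches[2] as $currentmatch)
      $result[] = urldecode($currentmatch);
  }

  return $result;
}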
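
A search cloud needs to know how often each term shows up, not just the raw list. As a rough sketch of that next step, the helper below tallies an array of decoded terms. The name count_search_terms() and the sample data are hypothetical and not part of the function above; it only assumes its input has the same shape as the array that extract_search_terms() returns.

```php
<?php
// Hypothetical helper: counts how often each search term occurs so the
// totals can be used to size the entries in a search cloud.
function count_search_terms(array $terms)
{
  $normalized = array();
  foreach ($terms as $term)
  {
    // Treat "PHP Regex" and "php regex " as the same term.
    $term = strtolower(trim($term));
    if ($term !== '')
      $normalized[] = $term;
  }

  // Map each distinct term to its number of occurrences.
  $counts = array_count_values($normalized);

  // Sort by count, highest first, keeping the term => count pairs.
  arsort($counts);

  return $counts;
}

// Made-up sample input, shaped like the output of extract_search_terms().
$sample = array('php regex', 'PHP Regex', 'search cloud', 'php regex ');
print_r(count_search_terms($sample));
```

Lowercasing and trimming before counting is a judgment call; skip it if you want differently capitalized searches to count as separate terms.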
