To give search clouds a try, I need to extract the terms that visitors used to find my site from the referer logs. Luckily, that’s another place where regular expressions come in handy. Here are the REs for the URLs of the three major search engines, MSN Live, Yahoo and Google, including regional variants like google.co.uk. You should run them with case sensitivity disabled.
google(\.[a-z]{2,6})?\.[a-z]{2,4}\/search\?[a-z0-9&=+%]*q=([a-z 0-9=+%]+)
yahoo(\.[a-z]{2,6})?\.[a-z]{2,4}\/search\?[a-z0-9&=+%]*p=([a-z 0-9=+%]+)
live(\.[a-z]{2,6})?\.[a-z]{2,4}\/results\.aspx\?[a-z0-9&=+%]*q=([a-z 0-9=+%]+)
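As a quick sanity check, here is a minimal sketch of the Google pattern applied to a single referer URL. The URL is made up for illustration, and the pattern is wrapped in / delimiters with the i flag, which is one way to disable case sensitivity in PHP's preg functions.

```php
<?php
// Hypothetical referer URL, invented for this example.
$referer = 'http://www.google.co.uk/search?hl=en&q=regular+expressions';

// The Google pattern from above, with / delimiters and the i flag
// for case-insensitive matching.
$pattern = '/google(\.[a-z]{2,6})?\.[a-z]{2,4}\/search\?[a-z0-9&=+%]*q=([a-z 0-9=+%]+)/i';

if (preg_match($pattern, $referer, $matches)) {
    // $matches[1] holds the optional regional part of the domain;
    // $matches[2] holds the raw, still URL-encoded search string.
    echo $matches[2], "\n";
}
```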
You’ll end up with the search terms in the form "term1+term2+term3" in the second parenthesized group. If you want them as plain text, run urldecode() or equivalent on the string. Here’s a PHP function that takes the location of a log file and returns an array of all the search terms found in the referer URLs:
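function extract_search_terms($filename)
{
    // Read the whole log into memory; fine for modest log sizes.
    $logcontents = file_get_contents($filename);

    // One case-insensitive pattern per search engine. The second
    // parenthesized group captures the still-encoded search string.
    $searchrelist = array(
        "/google(\.[a-z]{2,6})?\.[a-z]{2,4}\/".
        "search\?[a-z0-9&=+%]*q=([a-z 0-9=+%]+)/i",
        "/yahoo(\.[a-z]{2,6})?\.[a-z]{2,4}\/".
        "search\?[a-z0-9&=+%]*p=([a-z 0-9=+%]+)/i",
        "/live(\.[a-z]{2,6})?\.[a-z]{2,4}\/".
        "results\.aspx\?[a-z0-9&=+%]*q=([a-z 0-9=+%]+)/i",
    );

    $result = array();
    foreach ($searchrelist as $searchre)
    {
        // With PREG_PATTERN_ORDER, $logmatches[2] lists every capture
        // of the second group across the whole log.
        preg_match_all($searchre, $logcontents, $logmatches,
            PREG_PATTERN_ORDER);
        foreach ($logmatches[2] as $currentmatch)
        {
            // Decode "term1+term2" into plain text before storing it.
            array_push($result, urldecode($currentmatch));
        }
    }
    return $result;
}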
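For a search cloud you would probably want to know how often each term shows up, not just the raw list. Here is a rough sketch of one way to do that tally. count_search_terms() is a hypothetical helper of my own naming, and the sample array is made-up data standing in for the output of extract_search_terms(). Lowercasing and trimming the terms first is an assumption about what counts as "the same" search.

```php
<?php
// Hypothetical helper: turn a flat list of search terms into
// term => count pairs, most frequent first.
function count_search_terms(array $terms)
{
    // Normalize case and surrounding whitespace so that
    // "Search Clouds" and "search clouds " are counted together.
    $normalized = array_map(function ($term) {
        return strtolower(trim($term));
    }, $terms);

    // array_count_values() maps each distinct string to its number
    // of occurrences.
    $counts = array_count_values($normalized);

    // Sort by count, highest first, keeping the term => count pairing.
    arsort($counts);
    return $counts;
}

// Made-up sample input, standing in for the array that
// extract_search_terms() would return for a real log file.
$sample = array('search clouds', 'php regex', 'Search Clouds');
print_r(count_search_terms($sample));
```

The counts can then drive the font size of each term when you render the cloud.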