A search engine is designed to take some keywords and return web pages that match them. What fascinates me is that the mapping of words to pages could easily be done in reverse: given a particular web page, tell me which keywords are most likely to find it. My hunch is that this set of words, maybe presented as a tag cloud, would give a pretty good summary of what the page is about.
The closest example I’ve found out there is this blog entry. It’s got what appears at first to be a fairly random list of keywords, but digging into them, it looks like Darrin is a Vancouver-based Titanic fan who’s posted about the beautiful agony art project and has done a lot of wedding posts.
What’s really interesting about this is that the search terms that show up aren’t just based on textual frequency within the site; they’re also the product of how often people search for particular words at all. Essentially it’s giving a lot more weight to terms people actually care about, rather than to all terms that are statistically improbable.
At the moment the only way to implement this is to process an individual site’s visitor logs to pull out the frequency of keyword searches that lead to a visit. However, search engines know the historical frequency of particular query terms up front, so it would be possible for them to take an arbitrary new page and simulate which searches would be likely to land on it. You could do something similar for a mail message: essentially you’d be filtering statistically improbable phrases to get statistically improbable and interesting phrases instead.
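The log-processing step can be sketched in a few lines. This is a minimal illustration, assuming Google-style referrer URLs that carry the search phrase in a `q` parameter; the referrer strings below are made up for the example.

```python
import re
from collections import Counter
from urllib.parse import urlparse, parse_qs

def search_terms_from_referrers(referrers):
    """Count the search terms that led visitors to a site.

    `referrers` is an iterable of referrer URLs pulled from a
    visitor log; we assume search-engine URLs that put the
    query phrase in a `q` parameter.
    """
    counts = Counter()
    for ref in referrers:
        query = parse_qs(urlparse(ref).query)
        for phrase in query.get("q", []):
            for word in re.findall(r"\w+", phrase.lower()):
                counts[word] += 1
    return counts

# Hypothetical referrer lines from a log:
refs = [
    "http://www.google.com/search?q=st+regis+hotel+room",
    "http://www.google.com/search?q=hotel+room+rates",
]
print(search_terms_from_referrers(refs).most_common(3))
```

The resulting counts are exactly the raw material for a tag cloud; a search engine could skip the log entirely and weight each term by its known global query frequency instead.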
Utterly fascinating idea. Our only concern would be additional unrelated but statistically probable bulk in the keywords.
Imagine taking each word in a keyword and projecting its semantic value onto an axis of abstractness. Generally, there is quite a variance in search keywords, with some words being very abstract and others quite specific.
We argue the utility of a tag cloud is directly proportional to the overlap between the abstractness of the words in the tag cloud and the user’s desired degree of abstractness.
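A toy sketch of that overlap idea: score each candidate tag by how close its abstractness is to the level the user wants. The 0-to-1 abstractness values here are invented purely for illustration; in practice they might come from the depth of a term in a semantic hierarchy.

```python
# Hypothetical abstractness scores (0 = very specific, 1 = very general).
ABSTRACTNESS = {
    "hotel": 0.9,
    "room service": 0.6,
    "presidential suite": 0.3,
    "st regis": 0.2,
}

def rank_tags(tags, desired_abstractness):
    """Rank tags by how closely they match the desired abstractness."""
    return sorted(
        tags,
        key=lambda t: abs(ABSTRACTNESS[t] - desired_abstractness),
    )

# A visitor already on the St Regis site wants specific terms:
print(rank_tags(list(ABSTRACTNESS), 0.0))
# A broad web search benefits from more abstract terms:
print(rank_tags(list(ABSTRACTNESS), 1.0))
```

The same tag set ranks in opposite orders depending on the target level, which is the overlap argument in miniature.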
For instance, people searching for the St Regis Hotel probably search “St Regis Hotel Room” frequently. But once someone is on the St Regis site, the term “hotel room” is probably quite useless as a navigation term.
The problem, we think, is one of asking which portion of the semantic hierarchy is worth exploring. “Hotel” or “Room Service” is probably not terribly valuable for in-site navigation once on the St Regis site. But it could be highly valuable in a broader search on Google for distinguishing between the person St Regis and the hotel chain.
Hence, reverse-indexing data provides useful insight into a number of these topics, but if used wholesale it will probably result in many terribly obvious keywords. How many people still append keywords like “web site”, for instance, when searching?
But with a little fiddling, we bet this data could become highly useful.
Very true: the top keyword shown on Darren’s cloud is ‘the’! It feels like looking at the whole search phrase might help too, but then, as you say, there are a lot of variations.
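One version of the “little fiddling” is to discard stopwords and downweight terms by their global popularity, so filler like “the” or “web site” stops dominating the cloud. The global frequencies and site counts below are invented numbers for the sketch, loosely echoing Darren’s cloud.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "web", "site"}

# Hypothetical global query frequencies (higher = more common overall).
GLOBAL_FREQ = {"the": 1_000_000, "hotel": 50_000, "titanic": 5_000,
               "vancouver": 8_000, "wedding": 20_000}

def interesting_terms(site_counts):
    """Score site keywords by local count relative to global popularity."""
    scores = {}
    for term, count in site_counts.items():
        if term in STOPWORDS:
            continue
        # Globally rarer terms get a boost, like a rough IDF weight.
        scores[term] = count / math.log(GLOBAL_FREQ.get(term, 1_000) + 1)
    return sorted(scores, key=scores.get, reverse=True)

site = Counter({"the": 900, "titanic": 40, "wedding": 60, "hotel": 30})
print(interesting_terms(site))
```

Even with ‘the’ appearing far more often than anything else in the raw counts, it never reaches the cloud, and the remaining terms reorder by how distinctive they are rather than by raw volume.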
What this needs is some practical experimentation. I’ve just installed statcounter here since I can’t download referrer logs from Typepad’s default stats panel.