PageRank has to be one of the most successful algorithms ever. I’m wary of stretching the implicit web definition until it breaks, but it shares a lot of similarities with the algorithms we need to use.
Unintended information. It processes data for a radically different purpose than the content’s creators had in mind. Links were meant simply as a way of referencing related material; nobody thought of them as indicators of authority. For me, this is the definition of implicit data: it’s the information you get from reading between the lines of the explicit content.
Completely automatic. No manual intervention means it can scale up to massive sets of data without a corresponding increase in the number of users or employees you need. This means that it’s easy to be comprehensive, covering everything.
Hard to fake. When someone links to another page, they’re putting a small part of their reputation on the line. If the reader is disappointed in the destination, their opinion of the referrer drops, and this natural cost keeps the measure correlated with authority. This makes the measure very robust against manipulation.
Unreliable. PageRank is only a very crude measure of authority, and I’d imagine that a human-based system would come up with different rankings for a lot of sites.
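To make the "completely automatic" point concrete, here is a minimal sketch of the textbook power-iteration form of PageRank. It is not Google's production implementation. The tiny link graph, the function name `pagerank`, and the iteration count are made up for illustration, and the damping factor of 0.85 is just the commonly cited default.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank (illustrative sketch only).

    links maps each page to the list of pages it links to.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with rank spread evenly across all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages link to "c", and "c" links to "a".
links = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Nothing in the input says "this page is authoritative". The ranking falls out of links that were created for a different reason, which is why "c" (linked to by three pages) ends up on top, and "a" outranks "b" and "d" only because "c" links to it.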
As a contrast, consider the recipe behind a social site like Digg, which aims to rank content in order of interest.
Explicit information. Every Digg vote is done in the knowledge that it will be used to rank stories on the site.
Human-driven. It relies completely on users rating the content.
Easy to fake. The voting itself is simple to game, so account creation and other measures are required to weed out bad players.
Reliable. The stories at the top of its rankings are generally ones a lot of people have found interesting. It seems good at avoiding boring content, though of course there’s plenty that doesn’t match my tastes.
A lot of work seems to be fixated on reliability, but this is short-sighted. Most implicit data algorithms can only ever produce a partial match between the output and the quality you’re trying to measure. Where they shine is their comprehensiveness and robustness. PageRank shows you can design your system around fuzzy reliability and reap the benefits of fully automatic, hard-to-fake measures.