Jeremy Liew recently posted some hints, tips and cheats to better datamining. The main thrust, based on Anand Rajaraman’s class at Stanford, is that finding more data is a surer way to improve your results than tweaking the algorithm. This matches both my own experience trying to do datamining, and what I’ve seen with other company’s technologies. Engineers have a bias towards making algorithms more complex, because that’s the direction that gets the most respect from your peers and offers the most intellectual challenge.
That makes us blind to the advantages of what the Jargon File calls wall-followers, after the Harvey Wallbanger robot that simply kept one hand on a wall at all times to complete a maze, and gave far better results than the sophisticated competitors using complex route-finding. Google’s PageRank is a great example, almost zen in its simplicity, with no semantic knowledge at all.
One hidden advantage to this simple-mindedness is very predictable behavior, since the simplicity means there’s a lot fewer variables that affect the outcome. This is why machine-learning is so scary a change for Google, there’s no guarantee that some untested combination of inputs won’t result in very broken results.
Another obvious benefit is speed of both development and processing. This lets you get up and running very fast, and get through a lot of data. This gives you a lot more coverage. Yahoo’s directory wasn’t beaten because Google ranked pages more accurately than humans, but because Yahoo could only cover a tiny fraction of what was out there.
On the face of it, this doesn’t sound good for a lot of the alternative search engines that are competing out there. If it’s hard to beat a simple ranking algorithm, should they all pack up and go home? No, I think there’s an immense amount that can be improved both on the presentation side and by gathering novel data sets. For example why can’t I pull a personal page rank based on my friends and friends-of-friends preferences for sites? What about looking at my and their clickstreams to get an idea of those preferences?