Looked at from a high level, they both take unstructured data and try to understand its meaning. A big practical difference is that web search tools are designed for the masses to use, whereas email mining is only used by a small number of professionals doing either litigation discovery or business intelligence work. Why is this?
There’s no obvious painful problem. With web search, the problem is "I need to find authoritative information on X". With mail, the question is more like "I need to find the discussion I was involved in on X", which can be solved locally by searching your inbox. This doesn’t need mining, just a search on your drive or personal webmail repository.
Email is private. Whilst technically your work email belongs to the company and they’re free to do whatever they like with it, a lot of people keep sensitive personal information or hold private discussions on their work account. Even leaving aside the ethical issues, you won’t get adoption unless employees feel comfortable about their privacy. A mass-use mining system needs to have privacy policies built in from the start, which is a tricky balancing act because you also want to make as much available as possible.
Messages have no hyperlinks to each other. PageRank works because there’s a network of links between web pages. The closest equivalent to this for mail is the graph of who emails whom, and how often and how quickly an email is replied to or forwarded. This is still a research topic, though; it’s not a widely used or understood metric.
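To make the "who emails whom" graph a bit more concrete, here is a minimal sketch in Python. The message records, field layout and the `build_graph` helper are all made up for illustration; a real system would pull the sender, recipients, timestamp and In-Reply-To values from message headers. It counts messages per sender-recipient pair and records the median reply delay, which are the raw ingredients such a metric might start from, not an established ranking algorithm.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Hypothetical records: (message_id, sender, recipients, sent_at, in_reply_to)
messages = [
    ("m1", "alice@example.com", ["bob@example.com"], datetime(2024, 1, 8, 9, 0), None),
    ("m2", "bob@example.com", ["alice@example.com"], datetime(2024, 1, 8, 9, 30), "m1"),
    ("m3", "alice@example.com", ["carol@example.com"], datetime(2024, 1, 8, 10, 0), None),
    ("m4", "carol@example.com", ["alice@example.com"], datetime(2024, 1, 9, 10, 0), "m3"),
    ("m5", "alice@example.com", ["bob@example.com"], datetime(2024, 1, 9, 11, 0), None),
]

def build_graph(messages):
    """Return (edge_counts, median_reply_minutes) keyed by (from, to) pairs."""
    by_id = {m[0]: m for m in messages}
    edge_counts = defaultdict(int)    # how often A mails B
    reply_delays = defaultdict(list)  # minutes A took to reply to B

    for msg_id, sender, recipients, sent_at, in_reply_to in messages:
        for recipient in recipients:
            edge_counts[(sender, recipient)] += 1
        # Look up the message being replied to, if we have it.
        parent = by_id.get(in_reply_to)
        if parent is not None:
            delay_minutes = (sent_at - parent[3]).total_seconds() / 60
            reply_delays[(sender, parent[1])].append(delay_minutes)

    median_reply = {pair: median(delays) for pair, delays in reply_delays.items()}
    return dict(edge_counts), median_reply

edge_counts, median_reply = build_graph(messages)
```

In this toy data Bob answers Alice within half an hour while Carol takes a day, which is the kind of signal that could be used to weight edges in the graph.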
This all sounds fairly downbeat, but what really excites me is that I think there are plenty of painful problems that can be solved with mail mining (e.g. finding an expert, finding contacts, collaboration); they’re just not as obvious. There are a lot of smart ideas from web search that can be applied to mail too. I also think there are some big advantages to email.
You know who your users are. Inside a company something like Active Directory gives you a wealth of information about who everyone is, what their formal relationship is, and allows you to easily authenticate identity to control access. The web is struggling towards this, but it’s still a long way off. Even for people outside the company, an email address is a good proxy for identity and usually comes with an alternate readable name too. Knowing about your users ahead of time also opens the door to doing a lot of pre-processing before they even try the service, so you can present them with useful information immediately, for example pre-building their social graph.
Time. Another great feature of email is that you’ve got data from a whole range of time, not just a snapshot of how the content looks right now. This opens up the door to a lot of time-based analysis techniques, such as measuring how metrics change over a year. The web has the Wayback Machine, which is an amazing feat but still a long way from the depth of mail.
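As a rough illustration of the simplest kind of time-based analysis, the sketch below buckets message timestamps by month and compares the first and last months in the range. The timestamps and the `monthly_volume` helper are invented for the example; in practice the dates would come from each message’s Date header, and the metric could be anything you can compute per period.

```python
from collections import Counter
from datetime import datetime

# Hypothetical send times for one mailbox over a year.
sent_times = [
    datetime(2023, 1, 5), datetime(2023, 1, 20),
    datetime(2023, 2, 3),
    datetime(2023, 12, 1), datetime(2023, 12, 15), datetime(2023, 12, 28),
]

def monthly_volume(timestamps):
    """Count messages per (year, month) bucket."""
    return Counter((t.year, t.month) for t in timestamps)

volume = monthly_volume(sent_times)

# Compare the earliest and latest months that contain any mail.
first_month, last_month = min(volume), max(volume)
change = volume[last_month] - volume[first_month]
```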