A friend recently asked if there was a good way to detect just the added text in an email reply. This would let users reply directly to emails that show things like Facebook messages, and have the reply show up in a decent form on that other service. Spotting just the new content is fairly tricky: not only do you have the quoted text of the original message to deal with, but different email programs also add their own decorations to attribute the quotations, e.g.:
------ Original Message -----
On Tue, Mar 4, 2008 at 8:15 PM, Pete Warden <pete@petewarden.com> wrote:
From: Pete Warden
Sent: Wednesday, March 04, 2008 8:17 PM
To: Pete Warden
Subject: Testing 2
The solution he is looking at for removing this boilerplate is to collect a library of examples and work out some regular expressions that match them. The decorations are fairly distinctive, so it should be possible to spot them pretty accurately. The main problem is that there are so many different mail programs out there, and they all seem to add slightly different decorations.
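As a rough illustration of the regular expression idea, here is a minimal Python sketch that covers only the three decoration styles shown above. The pattern list and the function names (`is_attribution_line`, `strip_attribution_lines`) are my own assumptions for the example, not anything from an existing library, and a real collection would need far more patterns gathered from actual mail clients.

```python
import re

# Hypothetical patterns, one per decoration style in the examples above.
ATTRIBUTION_PATTERNS = [
    # "------ Original Message -----"
    re.compile(r"^-+\s*Original Message\s*-+$", re.IGNORECASE),
    # "On Tue, Mar 4, 2008 at 8:15 PM, ... wrote:"
    re.compile(r"^On .+ wrote:$"),
    # "From: ...", "Sent: ...", "To: ...", "Subject: ..."
    re.compile(r"^(From|Sent|To|Subject):\s.*$"),
]

def is_attribution_line(line):
    """Return True if the line looks like mail-client boilerplate."""
    stripped = line.strip()
    return any(p.match(stripped) for p in ATTRIBUTION_PATTERNS)

def strip_attribution_lines(text):
    """Drop every line that matches one of the boilerplate patterns."""
    return "\n".join(
        line for line in text.splitlines() if not is_attribution_line(line)
    )
```

The header pattern is deliberately loose, so it will also catch an ordinary sentence that happens to start with something like "To: ". That kind of false positive is the price of simple patterns, and a good reason to test against a big pile of real replies.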
Detecting the quoted text is more of an algorithmic problem, and comes down to doing a fuzzy string search to work out whether some text roughly matches the contents of the original mail. Another approach would be to look for >’s at the start of a line, which would work reasonably well if it weren’t for Outlook.
For once, there’s actually a helpful patent that describes how Google does this in Gmail. I really hate software patents, but at least this one contains some non-obvious parts, is not insanely broad, and explains the implementation behind it reasonably well. It doesn’t say much about handling the boilerplate decoration, apart from mentioning that they look for common headers like "From:". For the quotations, it looks like they do some magic with hash calculations to spot small sections of matching text between the two documents, and then try to merge them into larger blocks.
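To give a feel for the hash-matching idea, here is a small Python sketch. It is not the method from the patent, just one simple way to do fuzzy matching under my own assumptions: hash every run of `k` consecutive words in the original mail, then drop any reply line whose word runs mostly appear in that set. It also drops lines starting with ">" and skips the merging step entirely. The names `shingles` and `new_lines` and the `k` and `threshold` defaults are made up for the example.

```python
import re

def shingles(text, k=3):
    """Hash every run of k consecutive words, ignoring case and punctuation."""
    words = re.findall(r"\w+", text.lower())
    return {hash(tuple(words[i:i + k])) for i in range(len(words) - k + 1)}

def new_lines(reply, original, k=3, threshold=0.5):
    """Return the reply lines that do not look like quoted original text."""
    seen = shingles(original, k)
    kept = []
    for line in reply.splitlines():
        # Conventional quoting: the line starts with ">".
        if line.lstrip().startswith(">"):
            continue
        line_shingles = shingles(line, k)
        # Lines with fewer than k words have no shingles and are always kept.
        if line_shingles:
            overlap = len(line_shingles & seen) / len(line_shingles)
            if overlap >= threshold:
                continue
        kept.append(line)
    return kept
```

A quick usage example, with a quoted line that has been re-wrapped and carries no ">" prefix:

```python
original = "Are we still meeting for lunch on Friday at noon?"
reply = "Yes, noon works for me.\nAre we still meeting for\nlunch on Friday at noon?"
print(new_lines(reply, original))  # ['Yes, noon works for me.']
```

Working on word runs rather than whole lines is what lets this survive re-wrapping, since each fragment still shares most of its runs with the original.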