Can a computer generate a good (enough) summary?

A description of a document is really useful if you’re dealing with large amounts of information. If I’m searching through emails, or even when they’re coming in, I can usually decide whether a particular message is worth reading based on a short description. Unfortunately, creating a full human-quality description is an AI-complete problem, since it requires an understanding of an email’s meaning.

Automatic tag generation is a promising and practical way of creating a short-hand overview of a text, with a few unusual words pulled out. It’s somewhat natural, because people do seem to classify objects mentally using a handful of subject headings, even if they wouldn’t express a description that way in conversation.

If you asked someone what a particular email was about, she’d probably reply with a few complete sentences: "John was asking about the positronic generator specs. He was concerned they wouldn’t be ready in time, and asked you to give him an estimate." This sort of summary also requires a full AI, but it is possible to at least mimic the general form of this type of description, even if the content won’t be as high-quality.

The most common place you encounter this is on Google’s search results page:
[Screenshot of a Google search results page]
The summary is generated by finding one or two sentences in the page that contain the terms you’re looking for. If there are multiple occurrences, the sentences earliest in the text are usually favored, along with ones that contain the most terms closest together. It’s not a very natural-looking summary, but it does a good job of picking out the quotations relevant to what you’re looking for, and gives a good idea of whether the page actually covers what you want to know.
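As a rough illustration, here’s a minimal Python sketch of that kind of query-biased snippet extraction. The scoring weights and the crude sentence-splitting are invented for illustration; this isn’t any real search engine’s algorithm.

```python
import re

def snippet(text, query_terms, max_sentences=2):
    """Pick the sentences that best match the query terms."""
    sentences = re.split(r'(?<=[.!?])\s+', text)
    scored = []
    for position, sentence in enumerate(sentences):
        words = re.findall(r"[a-z']+", sentence.lower())
        hits = [i for i, w in enumerate(words) if w in query_terms]
        if not hits:
            continue
        score = len(set(words) & query_terms)       # distinct query terms matched
        score += 1.0 / (1 + position)               # favor sentences early in the text
        if len(hits) > 1:
            score += 1.0 / (hits[-1] - hits[0])     # favor query terms close together
        scored.append((score, position, sentence))
    best = sorted(scored, reverse=True)[:max_sentences]
    # Re-assemble the chosen sentences in their original order.
    return ' ... '.join(s for _, _, s in sorted(best, key=lambda t: t[1]))

print(snippet("The positronic generator specs are late. John asked for an estimate. "
              "The weather was fine.", {"positronic", "estimate"}))
```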

Amazon’s statistically improbable phrases for books are an interesting approach: they try to identify combinations of words that are distinctive to a particular book. These are almost more like tags, and are found by a method similar to statistics-based automatic tagging, by spotting combinations that are frequent in a particular book but not as common in a background of related items. They don’t act as a very good description in practice; they’re more useful as a tool for discovering distinctive content. I also discovered they’ve introduced capitalized phrases, which serve a similar purpose. That’s an intriguing hack on the English language to discover proper nouns, and I may need to copy that approach.

The final, and most natural, type of summary is created by picking out key sentences from the text, and possibly shortening them. Microsoft Word’s implementation is the most widely used, and it isn’t very good. There’s also an online summarizer you can experiment with that suffers a lot of the same problems.

There are two big barriers to getting good summaries with this method. First, it’s hard to identify which parts of the document are actually important. Most methods use location (whether a sentence is a heading or the start of a paragraph) and the statistical frequency of unusual words, but these aren’t very good predictors. Second, even once you’ve picked the sentences you want to use, there’s very little guarantee that they will make any sense when strung together outside the context of the full document, so you often end up with a very confusing narrative. Even Microsoft, in their description of their auto-summary tool, acknowledge that at best it produces a starting point that you’ll need to edit.
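To make that first barrier concrete, here’s a small Python sketch of the location-plus-frequency scoring described above. The weights and the tiny stop list are made up; this shows the shape of the approach, not any particular product’s algorithm.

```python
import re
from collections import Counter

# A tiny stand-in stop list; a real one would be much longer.
STOP = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that",
        "you", "for", "on", "with", "was", "be", "this", "are", "as", "at"}

def summarize(text, max_sentences=3):
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # Frequency of non-stop words across the whole document.
    freq = Counter(w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP)
    scored = []
    for position, sentence in enumerate(sentences):
        tokens = [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP]
        if not tokens:
            continue
        frequency_score = sum(freq[w] for w in tokens) / len(tokens)
        location_score = 2.0 if position == 0 else 1.0 / (1 + position)  # favor openings
        scored.append((frequency_score + location_score, position, sentence))
    best = sorted(scored, reverse=True)[:max_sentences]
    # Keep the chosen sentences in document order; coherence is still not guaranteed.
    return ' '.join(s for _, _, s in sorted(best, key=lambda t: t[1]))
```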

Overall, for my purposes displaying something like Google or Amazon’s summaries for an email might be useful, though I’ll have to see if it’s any better than just showing the first sentence or two of a message. It doesn’t look like the other approaches to producing a more natural summary are good enough to be worth using.

What puzzles can company-wide email solve?

My intuition is that a company’s collection of email messages is a rich source of useful information, and people will pay for a service that gives them access to it. What could users do in practice though?

Discover experts.
By analyzing each person’s sent messages, it’s possible to figure out some good tags to describe them. These would need to be approved and tweaked by the subject before being published, but then you’d have a deep company directory that anyone could query (there’s a rough code sketch of this at the end of this post). So many times I’ve ended up reinventing the wheel because I didn’t know that somebody in another department had already tackled a particular problem.

Uncover expertise. Email is the most heavily used content-generation system, hands-down. There’s lots of valuable information in messages that never makes it to a wiki or internal blog. The trouble is that this information quickly vanishes; emails are ephemeral. Any mail that’s sent to an internally public mailing list should be automatically included on an intranet page that’s searchable by keyword, or by person or team. You should also have a button in Outlook that lets you publish any mail thread on that same page. Those published messages produce something very like a blog for each person, effortlessly.

Work together.
People collaborate by emailing each other attachments. Rather than trying to change that, put in a tool that by default uploads the attachment to SharePoint, accessible only by the email recipients, and rewrites the message so it links to that instead. You’ll need a safety valve that allows people to override that if they really do need it as an attachment, but this method should retain most of the advantages of email collaboration (clear access control, ease of use) and add the benefits of change tracking and a single version of the file.
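For the ‘discover experts’ idea above, here’s a rough Python sketch of how such a directory might be assembled. The message format and the extract_keywords function are hypothetical placeholders for whatever mail access and tagging method ends up being used.

```python
from collections import defaultdict

def build_expert_directory(messages, extract_keywords):
    """messages: iterable of (sender, body) pairs.
    extract_keywords: any function that turns a blob of text into a list of tags."""
    sent_by = defaultdict(list)
    for sender, body in messages:
        sent_by[sender].append(body)

    directory = defaultdict(set)   # tag -> people who write about it
    for sender, bodies in sent_by.items():
        for tag in extract_keywords(' '.join(bodies)):
            directory[tag].add(sender)
    return directory

# Usage (hypothetical): directory = build_expert_directory(all_sent_mail, keywords)
# directory['positronic generator'] -> the set of people to ask about it.
# In practice, each person would review and tweak their tags before publication.
```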

Can you automatically generate good tags?

One interesting feature of Disruptor Monkey’s Unifyr is their automatic generation of tags from web pages. Good tags are the basis of a folksonomy, and with them I could do some very useful classification and organization of data. With an organization’s email, I’d be able to show people’s areas of expertise if I knew which subjects they sent messages about. This could be the answer to the painful problem of ‘Who can I ask about X?’.

Creating true human-quality tags would require an AI that could understand the content to the same level a human could, so any automatic process will fall short of that. Is there anything out there that will produce good enough results to use?

There are two main approaches to this problem, which is sometimes known as keyword extraction because it’s so similar to the search-engine task. The first is to use statistical analysis to work out which words are significant, with no knowledge of what the words actually mean; this is fundamentally how Google’s search works. The second is to use rules about language, plus information about the meanings of words, to pick out the right terms; for example, knowing that notebook means the same as laptop, so both words count toward the same concept. Powerset is going to use this approach to search. Danny Sullivan has a thought-provoking piece on why he doesn’t think the method will ever live up to its promise.

KEA is an open-source package for keyword extraction, and is towards the rules-based end of the spectrum, though it sets up those rules using training and some standard thesauruses, rather than manually. I was initially very interested, because it’s designed to do exactly what I need, pulling descriptive keywords from a text. Unfortunately, I’d still have to set up a thesaurus and some manually tagged documents for the system to learn from before running it on any information. I would like to start off with something completely unsupervised, so it can be deployed without a skilled operator or any involved setup.

The other alternative is using statistical analysis to identify words that are uncommon in most texts, but common in the particular one you’re looking at. The simplest example I’ve seen is the PHP automatic keyword generation class. You’ll need to register to see the code, but all it does is exclude stop words and then return the remaining words, plus two- and three-word phrases, in descending order of frequency. The results are a long way from human tagging, but just good enough to make me think it’s worth expanding.
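Here’s a Python sketch of that same basic idea (not the PHP class itself): drop stop words, then rank the remaining words and short phrases by frequency. The stop list is a tiny stand-in for a real one.

```python
import re
from collections import Counter

# A tiny stand-in stop list; a real one would be much longer.
STOP = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that",
        "you", "for", "on", "with", "was", "be", "this", "are", "as", "at"}

def keywords(text, top_n=10):
    # Drop stop words, then count what's left.
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP]
    counts = Counter(words)
    # Also count two- and three-word phrases built from the filtered stream.
    for n in (2, 3):
        for i in range(len(words) - n + 1):
            counts[' '.join(words[i:i + n])] += 1
    # Return terms in descending order of frequency.
    return [term for term, _ in counts.most_common(top_n)]
```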

An obvious next step is to expand the stop word concept, and keep track of the general frequency of a lot more words, so you can exclude other common terms, and focus on the unusual ones. The standard way to do this is to take the frequencies from a large corpus of text, often a general one like the Brown corpus that includes hundreds of articles from a variety of sources. For my purposes, it would also be interesting to use the organization’s overall email store as the corpus, and identify the words a particular employee uses that most others in the company don’t. This would prevent things like the company name from appearing too often.
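A rough sketch of that next step might look something like this, with the background counts built once from whatever corpus is chosen (the Brown corpus, or the company’s whole mail store). The add-one smoothing and ratio scoring are my own illustration, not a standard, tuned formula.

```python
import re
from collections import Counter

def word_counts(text):
    return Counter(re.findall(r"[a-z']+", text.lower()))

def distinctive_words(text, background_counts, background_total, top_n=10):
    """Score each word by how much more often it appears here than in the background."""
    counts = word_counts(text)
    total = sum(counts.values())
    scored = {}
    for word, count in counts.items():
        document_rate = count / total
        # Add-one smoothing so words missing from the background don't divide by zero.
        background_rate = (background_counts[word] + 1) / (background_total + 1)
        scored[word] = document_rate / background_rate
    return sorted(scored, key=scored.get, reverse=True)[:top_n]

# Usage (hypothetical): background_counts = word_counts(whole_mail_store_text), then run
# distinctive_words() over one employee's sent mail. Terms everyone uses, like the
# company name, score low because they're just as common in the background.
```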

You’ll never get human-grade tags from this sort of system, but you can get keywords that are good enough for some tasks. I hope it will be good enough to identify subject-matter experts within a company, but only battle-testing will answer that.

Email data mining by Spoke and Contact Networks

I’ve been thinking hard about painful problems that email analysis could solve, and one of them is the use of a corporation’s email store to discover internal colleagues who have existing relationships with an external company or person you want to talk to. For example, if you want to sell to IBM, maybe there’s someone in your team who’s already talking to someone there. Or internally, you might want an introduction to someone in another department to discuss a problem, and it would be good to know who in your team had contacts there already.

I was discussing these thoughts with George Eberstadt, co-founder of nTag, and he pointed me to a couple of successful companies who are already mining email to do this, Spoke and Contact Networks.

Spoke are interesting because they’re entirely client-based, rather than running on an organization’s whole message store. They work by taking data from everybody who’s running their Outlook add-on, along with information pulled from publicly available sources, and feeding it into their own database. You can then search that yourself to find information on people you’re interested in contacting, and people you know who’ve already been in touch with them. It’s effectively creating a global social network largely based on the email patterns of everybody who belongs.

Technically, it sounds like they’re doing some interesting things, such as exchanging special emails to confirm that two people really do know each other, but when I tried my own name, I didn’t get any useful information. I’m also surprised that companies would allow the export of their employees’ email relationships to a third party. It may just be that this is happening under the radar, but it seems like the sort of thing that a lot of companies would want some safeguards on. The service encourages individual employees to install the software themselves, without any warning that they might be opening up the organization’s data to third-party analysis. I know a lot of the companies I deal with would frown on this, to say the least.

Contact Networks seem much more focused on selling to corporations as a whole, rather than to individual employees. They build a social graph from several sources internal to the company, including email and calendars, CRM, marketing databases, HR and billing systems. They use this to identify colleagues who know a particular individual, which is a succinct description of a ‘painful problem’ that companies would be willing to pay money to solve. They seem to have been very successful, with lots of big-name clients, and they were just bought out by Thomson.

It’s good to see how well Contact Networks have done; it’s proof there’s demand for the sort of services I’m thinking of, even if they have already solved the immediate problems I was considering.

Defrag has arrived!

I flew into Denver this morning, and even though Defrag doesn’t officially start until tomorrow, I’ve already had a couple of early meet-ups with some of the local folks. It was fun seeing Rob and Josh from EventVue in the flesh for the first time, and hearing about all their hard work. They’ve been running at full steam since May; I hope they get a chance for a break soon.

I also made an interesting discovery; Denver has two Hyatt hotels just a couple of blocks from each other, the Grand Hyatt and the Hyatt Regency. I only found this out after I’d dropped my car at the Grand’s valet parking and tried to check in! Luckily I was able to make it to the right one without further misadventure.

Funhouse Photo User Count: 2,097 total, 111 active. Ticking up gradually, with some good weekend active numbers.

Event Connector User Count: 109 total, 4 active.

Beautiful data

With the mass of raw data I’m getting from a couple of years of my own email, I’m looking around for a good way to turn that into information. A simple ranking of my closest contacts is a good start, but I want to also see how much of the real-life groupings between others can be revealed. I’m working on a basic force-directed graph implementation, but that still leaves a lot of display choices.
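As a starting point, here’s a bare-bones Python sketch of a force-directed (spring-embedder) layout for an undirected contact graph. The constants, fixed step size, and lack of cooling are deliberate simplifications; a real implementation needs more care to converge nicely.

```python
import math
import random

def force_layout(nodes, edges, iterations=200, width=1.0):
    """nodes: list of hashable ids; edges: list of (node, node) pairs."""
    pos = {n: [random.uniform(0, width), random.uniform(0, width)] for n in nodes}
    k = width / math.sqrt(len(nodes))              # rough ideal edge length
    for _ in range(iterations):
        force = {n: [0.0, 0.0] for n in nodes}
        # Every pair of nodes repels, so unconnected people drift apart.
        for a in nodes:
            for b in nodes:
                if a == b:
                    continue
                dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-6
                push = k * k / dist
                force[a][0] += push * dx / dist
                force[a][1] += push * dy / dist
        # Connected nodes attract, pulling real-life groups into clusters.
        for a, b in edges:
            dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
            dist = math.hypot(dx, dy) or 1e-6
            pull = dist * dist / k
            force[a][0] -= pull * dx / dist
            force[a][1] -= pull * dy / dist
            force[b][0] += pull * dx / dist
            force[b][1] += pull * dy / dist
        # Take a small step along the net force on each node.
        for n in nodes:
            pos[n][0] += 0.01 * force[n][0]
            pos[n][1] += 0.01 * force[n][1]
    return pos
```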

VisualComplexity.com is one of my favorite places to find inspiration. They’ve done a great job collecting some of the most striking methods of presenting graph data visually. I also enjoy the Data Mining blog. Matthew’s a great resource and he’s good at reminding me to focus on getting something useful from my visualizations, not just pretty pictures. He’s headed to Defrag, so I hope I’ll get a chance to say hello.

Funhouse Photo User Count: 2,042 total, 52 active. Steady growth, but a low active count.

Event Connector User Count: 106 total, 13 active. A miniature growth spurt over the last day or two, with a comparatively large number of engaged users.

Take a walk on the client-side

I’ve been following Mozilla’s Prism launch with a lot of interest. One comment that caught my eye was from Ryan Stewart, who believes "that the desktop isn’t dead, and that a hybrid approach is a successful way to go". There are a lot of opportunities for really useful services that need a client-side component, and that’s key to my strategy. It’s the reason I’ve worked on Java Applets, Firefox and Internet Explorer extensions, and now an Outlook plugin.

Web-based apps have a lot of advantages over traditional desktop apps:
No install needed! It’s an incredible advantage to be able to go to a URL and instantly start using the service. The installation process is a big barrier to people even trying your app, since they’ve been trained to expect it to take minutes, involve answering questions they may not know the answers to, and mean worrying about their configuration.
Complete safety. There are no worries about viruses or spyware; the only information the app has access to is what you type in.
Use anywhere. You can use the app from any machine in the world, regardless of OS or configuration.
Up-to-date and comprehensive information. Since the information you’re accessing is pulled live from a server, it can be updated instantly, along with the application itself.
Easy to develop. You have a lot fewer variables to worry about with a server-based app. You know exactly what hardware and software the main code is running on. Keeping the output HTML cross-platform is a couple of orders-of-magnitude easier than doing the same for executable code.

There are some obstacles that I think will prevent pure web-based services from taking over the app world:
A limiting sandbox. To achieve security, web pages can only pull data from their own servers, and can’t use any of the information already on the user’s machine, or any other web services the user is signed into, without an explicit agreement with the provider. This effectively stovepipes all information, and is the reason I’ve entered my list of friends on at least a dozen different services. I don’t see this changing, because it would require a majority of the browser vendors to implement a more subtle security policy than the current blanket same-origin policy, and loosening security like that doesn’t seem likely.
Poor UI. Google has done an astonishing job with their mail interface, but for power users the experience still doesn’t match Outlook or Apple Mail. Running within a web-browser makes it tough to offer a rich interface, and you can’t use the standard OS metaphors. This is where Prism is interesting, because XUL actually offers some of the UI that HTML is missing, like sliders.

What I think is really interesting is the idea of combining the strengths of the two approaches. To start off, there needs to be a pure web interface to build up enough interest and trust for people to download a client extension that offers more features. That extension can then offer some tools to the app, like cross-domain access and native UI, while still keeping the bulk of the application running on the server. That retains the big advantages of SaaS, such as ease of development and online access to information, but allows a whole range of applications that aren’t possible now.

Funhouse Photo User Count: 1,975 total, 77 active.

Event Connector User Count: 89 total, 1 active.

Implicit web philately

With Defrag fast approaching, I’ve been spending some cycles thinking about what the Implicit Web actually is, and where it’s going. When I’m staring at this sort of problem, a technique I find really useful is "stamp collecting". Gather as many examples as possible, list their important properties, group them into clusters and look at what patterns emerge.

Here’s my current list of Implicit Web services out there, with a couple that are on the borderline of the term. I’ve not got enough to meaningfully group them, so they’re alphabetical:

Adaptive Blue – More of a semantic web application, but they do offer a Firefox extension. Being client-based is a distinguishing feature of implicit web apps, since that’s the only way I know to get access to the user data needed.

Amazon – Their recommendation system is the grand-daddy of a lot of the apps that take raw information on users’ behavior, run some magic algorithms, and return something useful back to the customer. It’s a hard trick for most startups to repeat, since almost nobody has access to Amazon’s breadth of data. This is why client-based solutions that can track behavior across many sites seem like the only practical solution.

last.fm – A true implicit web app: they have client-based tracking of the user’s behavior, piggy-back on other people’s applications to gather their data, and use that to return Amazon-style recommendations. It does make me wonder about the ‘web’ part of the term though, since that seems to imply web browsing. Maybe ‘implicit internet’ would be more appropriate?

me.dium – Another app that fully fits the term. A unique feature is that they use the social graph to combine information from multiple users, which I think is a very promising area for implicit web applications. Being able to pool data from your friends is a great way of discovering relevant new content.

MySportsNet.ca – This is one I came across relatively recently. It’s a client-side app that monitors your browsing and tailors a sports portal site to match your interests based on that data. What’s really interesting is that it’s aimed at a mainstream audience of sports fans, rather than geeky early adopters. I know from my game career that the sports audience is massive, and willing to pay for something that ties into their passion, so I’ll be following its progress closely. The only audience I know that’s similar is music, and it’s relevant that the most successful implicit app so far, last.fm, tapped into that demand.

tape failure – This is a service I’ve only read about, but unfortunately their site seems to be down at the moment. They’re not an implicit web app at all, but it does seem like they have a good solution to the browsing data collection problem.

Let me know if you think I’m missing any. I may put together a page tracking new services, since I think we’re going to be getting a lot more over the next year.

Funhouse Photo User Count: 1,916 total, 92 active. The proportion of profile-box adds was a bit higher this time, which is promising because it scales a bit more virally than the product directory.

Event Connector User Count: 84 total, 9 active. Not much happening on this front.

Games, UI, and the implicit web

I was a console programmer for six years. Games are the only pieces of software that people use purely for the joy of interacting with a computer. There’s no reason to play, except to have fun.

This means that the user interface is crucial. With other software, people will put up with the pain of a bad UI because they’re trying to accomplish some real-world task. If a consumer picks up a video game and it doesn’t let them have fun within a minute or two, they will give up on it. The interface has to be easy and fun. It can still be deep, but that complexity must be intuitive and discoverable, and not presented like the Space Shuttle’s control panel.

What really excites me about the implicit web is the promise of using the gathered data to turbo-charge everyday interfaces. A simple example is Firefox’s address bar; it remembers the URLs I visit, and when I start typing a new one, the suggestions are in most-visited order. By contrast, I wouldn’t class Google Suggest in the search box as an implicit service, since it doesn’t customize the suggestions based on my behavior, and it’s a lot less useful for me.
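Here’s a toy Python sketch of that kind of implicit ranking: match what’s been typed against previously visited URLs and order the suggestions by visit count. The class and method names are invented, and the real Firefox behavior is more sophisticated.

```python
from collections import Counter

class AddressBar:
    """Remembers visits and suggests matching URLs in most-visited order."""
    def __init__(self):
        self.visits = Counter()

    def visit(self, url):
        self.visits[url] += 1

    def suggest(self, typed, limit=5):
        matches = [url for url in self.visits if typed in url]
        return sorted(matches, key=self.visits.get, reverse=True)[:limit]

# bar = AddressBar(); repeated bar.visit(...) calls build up the history,
# then bar.suggest("exa") returns the most-visited URLs containing "exa".
```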

When I was working with Nintendo, the holy grail was the ‘one button game’. Think Mario 64, where you managed complex interactions with a 3D world mostly with the joystick and a single button to jump. StumbleUpon is the web service that’s closest to this; I’ve heard it described as the ‘Forward button‘ for the web, and it really delivers a lot of value with very little input needed from the user. Google Hot Keys is my attempt to move searching in that direction, though there’s no implicit component.

One of the parts I’m most anticipating about Defrag is seeing all of the innovative interfaces that the teams will be showing off. There are so many possibilities for improving the user experience, and I can’t wait to see what people come up with!

Funhouse Photo User Count: 1,817 total, 93 active. Still growing steadily, but slowly, with most of the additions coming through the product directory.

Event Connector User Count: 77 total, 4 active. Still working with conference organizers, not much to show yet though.

Slinky companies and public transport

Yesterday, Brad posted an article talking about bubble times in Boulder, and quoted a great line from Bill Perry about how they spawned ‘slinky companies’ that "aren’t very useful but they are fun to watch as they tumble down the stairs".

Rick Segal had a post about why he took the train to work, and how people-watching there was a great reality check to a lot of the grand technology ideas he was presented with.

And via Execupundit, I came across a column discussing whether people were really dissatisfied with their jobs, or just liked to gripe and fantasize. One employee who’d been involved in two start-ups that didn’t take off said "Most dreams aren’t market researched."

These all seemed to speak to the tough balance between keeping your feet on the ground and your eyes on the stars. As Tom Evslin’s tagline goes, "Nothing great has ever been accomplished without irrational exuberance." I’ve been wrestling with how to avoid creating a slinky with technology that sounds neat enough to be funded, but will never amount to anything. To do that, I’ve focused on solving a painful problem, and on validating both the widespread existence of that problem and that people like my solution.

I’ve turned my ideas into concrete services, and got them into the wild as quickly as possible. Google Hot Keys has proved that it’s possible to robustly extract data from screen-scraping within both Firefox and IE, but its slow take-up suggests there isn’t a massive demand for a swankier search interface. Defrag Connector shows that being able to connect with friends before a conference is really popular, but the lack of interest so far in Event Connector from conference promoters I’ve contacted shows me it won’t just sell itself. Funhouse Photo’s lack of viral growth tells me that I need to provide a compelling reason for people to contact their friends about the app, and not just rely on offering them tools to do so.

I really believe in all of these projects, but I want to know how to take them forward by testing them against the real world. All my career, I’ve avoided grand projects that take years before they show results. I’ve been lucky enough that all of the dozen or so major applications I’ve worked on have shipped; none were cancelled. Part of that is down to my choice of working on services that have tangible benefits to users, and can be prototyped and iteratively tested against that user need from an early stage. Whether it’s formal market research, watching people on trains, or just releasing an early version and seeing what happens, you have to test against reality.

I’m happy to take the risk of failing; there are a lot of factors I can’t control. What I can control is the risk of creating something useless!

Funhouse Photo User Count: 1,746 total, 70 active. Much the same as before, I haven’t made any changes yet.

Event Connector User Count: 73 total, 9 active. Still no conference takeup. I did experiment with a post to PodCamp Boston’s forum to see if I could reach guests directly, but I think the only way to get good distribution is through the organizers.