How to use corporate data to identify experts


Nick over at the Disruptor Monkey blog talks about how their FindaYoda feature has proved a surprise hit. This is a way of seeing who else has a lot of material with a keyword you’re looking for, and its success backs up one of the hunches that’s driving my work. I know from my own experience of working in a large tech company that there’s an immense amount of wheel-reinventing going on just because it’s so hard to find the right person to talk to.

As a practical example I know of at least four different image comparison tools that were written by different teams for use with automated testing, with pretty much identical requirements. One of the biggest ways I helped productivity was simply by being curious about what other people were working on and making connections when I heard about overlap.

One of the tools I’d love to have is a way to map keywords to people. It’s one of the selling points of Krugle’s enterprise code search engine: once you can easily search the whole company’s code, you can see who else has worked with an API or algorithm. Trampoline Systems aims to do something similar using a whole company’s email store; they describe it as letting you discover knowledge assets. I’m trying to do something similar with my automatic tag generation for email.
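As a toy sketch of that keyword-to-people mapping, here's roughly how an inverted index from words to authors might work (the data and names are invented for illustration, not from any real product):

```python
from collections import defaultdict

def build_expert_index(docs_by_author):
    """Map each keyword to the people who use it most.

    docs_by_author: dict of author -> list of document strings.
    Returns dict of keyword -> list of (author, count), most prolific first.
    """
    counts = defaultdict(lambda: defaultdict(int))  # keyword -> author -> count
    for author, docs in docs_by_author.items():
        for doc in docs:
            for word in doc.lower().split():
                counts[word][author] += 1
    return {
        word: sorted(authors.items(), key=lambda kv: -kv[1])
        for word, authors in counts.items()
    }

# Hypothetical corpus: who has worked with "dilate"?
index = build_expert_index({
    "alice": ["erode dilate erode", "image diff tool"],
    "bob": ["sales report q3"],
})
print(index["dilate"][0][0])  # alice
```

A real version would obviously need tokenization, stop-word removal and scoring that accounts for corpus size, but the core lookup is just this simple.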

It’s not only useful for the people at the coal face, it’s also a benefit that seems to resonate with managers. The amount and cost of the redundant effort is often clearer to them than to the folks doing the work. Since the executives are the ones who make the purchasing decisions, that should help the sales process.

How can you solve organizational problems with visualizations?


Valdis Krebs and his Orgnet consultancy have probably been looking at practical uses of network analysis longer than anyone. They have applied their InFlow software to hundreds of different cases, with a focus on problem solving within commercial organizations, but also looking at identifying terrorists, and the role of networks in science, medicine, politics and even sport.

I am especially interested in their work helping companies solve communication and organizational issues. I’ve had plenty of personal experience with merged teams that failed to integrate properly, wasted a lot of time reinventing wheels because we didn’t know a problem had already been solved within the company, and been stuck in badly configured hierarchies that got in the way of doing the job. To the people at the coal-face the problems were usually clear, but network visualizations are a very powerful tool that could have shown management the reality of what was happening. In their case studies, that seems to be exactly how they’ve used their work: as a navigational tool for upper management to get a better grasp on what’s happening in the field, and to suggest possible solutions.

Orgnet’s approach is also interesting because they are solving a series of specialized problems with a bespoke, boutique service, whereas most people analyzing companies’ data are trying to design mass-market tools that will solve a large problem like spam or litigation discovery with little hand-holding from the creators of the software. That gives them unique experience exploring areas that may lead to really innovative solutions to larger problems in the future.

You should check out the Network Weaving blog, written by Valdis, Jack Ricchiuto and June Holley. Another great thing about their work is that their background is in management and organizations, rather than being technical. That seems to help them avoid the common problem of having a technical solution that’s looking for a real-world problem to solve!

Ratchetsoft responds

Joe Labbe of Ratchetsoft sent a very thoughtful reply to the previous article, here’s an extract that makes a good point:

The bottom line is the semantic problem becomes a bit more manageable when you break it down into its two base components: access and meaning. At RatchetSoft, I think we’ve gone a long way in solving the access issue by creating a user-focused method for accessing content by leveraging established accessibility standards (MSAA and MUIA).

To your point, the meaning issue is a bit more challenging. On that front, we shift the semantic coding responsibility to the entity that actually reaps the benefit of supplying the semantic metadata. So, if you are a user that wants to add new features to existing application screens, you have a vested interest in supplying metadata about those screens so they can be processed by external services. If you are a publisher who has a financial interest in exposing data in new ways to increase consumption of data, you have a strong motivation to semantically code your information.

That fairer match between the person who puts in the work to mark up the semantic information and the person who benefits from it feels like the key to making progress.

Can the semantic web evolve from the primordial soup of screen-scraping?


The promise of the semantic web is that it will allow your computer to understand the data on a web page, so you can search, analyze and display it in different forms. The top-down approach is to ask web-site creators to add information about the data on a page. I can’t see this ever working: it takes too much time for almost no reward to the publisher.

That leaves only two alternatives: the status quo, where data remains locked in silos, or some method of understanding it without help from the publisher.

A generic term for reconstituting the underlying data from a user interface is screen-scraping, from the days when legacy data stores had to be converted by capturing their terminal output and parsing the text. Modern screen-scraping is a lot trickier now that user interfaces are more complex since there’s far more uninteresting visual presentation information that has to be waded through to get to the data you’re after.

In theory, screen-scraping gives you access to any data a person can see. In practice, it’s tricky and time-consuming to write a reliable and complete scraper because of the complexity and changeability of user interfaces. To produce the end-goal of an open, semantic web where data flows seamlessly from service to service, every application and site would need a dedicated scraper, and it’s hard to see where the engineering resources to do that would come from.

Where it does get interesting is that there could be a ratchet effect if a particular screen-scraping service became popular. Other sites might want to benefit from the extra users or features that it offered, and so start to conform to the general layout, or particular cues in the mark-up, that it uses to parse its supported sites. In turn, those might evolve towards de-facto standards, moving towards the end-goal of the top-down approach but with incremental benefits at every stage for the actors involved. This seems more feasible than the unrealistic expectation that people will expend effort on unproven standards in the eventual hope of seeing somebody do something with them.

Talking of ratchets leads me to a very neat piece of software called Ratchet-X. Though they never mention the words anywhere, they’re a platform for building screen-scrapers for both desktop and web apps. They have tools to help parse both Windows interfaces and HTML, and quite a few pre-built plugins for popular services like Salesforce. Screen-scrapers are defined using XML to specify the location and meaning of data within an interface, which holds out the promise that non-technical users could create their own for applications they use. This could be a big step in the evolution of scrapers.
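Ratchet-X’s actual XML definition format isn’t reproduced here, but the general idea of a scraper defined as data rather than code can be sketched in Python. The rule table and field names below are invented purely for illustration:

```python
from html.parser import HTMLParser

class DeclarativeScraper(HTMLParser):
    """Pull named fields out of HTML using a declarative rule table,
    in the spirit of defining a scraper as data rather than code.
    The rule format here maps a field name to the id attribute of the
    element that holds it -- a deliberately simplified stand-in."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules      # field name -> element id to match
        self.fields = {}        # extracted field name -> text
        self._current = None    # field we're inside, if any

    def handle_starttag(self, tag, attrs):
        elem_id = dict(attrs).get("id")
        for field, wanted_id in self.rules.items():
            if elem_id == wanted_id:
                self._current = field

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] = data.strip()

    def handle_endtag(self, tag):
        self._current = None

# Hypothetical screen definition and page fragment:
rules = {"customer": "cust-name", "total": "order-total"}
scraper = DeclarativeScraper(rules)
scraper.feed('<div id="cust-name">Acme Corp</div>'
             '<span id="order-total">$42</span>')
print(scraper.fields)  # {'customer': 'Acme Corp', 'total': '$42'}
```

The appeal of this shape is exactly what the paragraph above suggests: a non-programmer could plausibly maintain the rule table without touching the extraction engine.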

I’m aware of how tricky writing a good scraper can be from my work parsing search results pages for Google Hot Keys, but I’m impressed by the work Ratchet have done to build a platform and SDK, rather than just a closed set of tools. I’ll be digging into it more deeply and hopefully chatting to the developers about how they see this moving forward. As always, stay tuned.

What’s the secret to Amazon’s SIPs algorithm?


The statistically improbable phrases that Amazon generates from a book’s contents seem like they’d be useful to have for a lot of other text content, such as emails or web pages. In particular, it seems like you could do some crude but useful automatic tagging.

There’s no technical information available on the algorithm they use, just a vague description of the results it’s trying to achieve. They define a SIP as "a phrase that occurs a large number of times in a particular book relative to all Search Inside! books".

The obvious implementation of this for a word or series of words in a candidate text is

  • Calculate how frequently the word or phrase occurs in the current text, by dividing the number of occurrences by the total number of words in the text. Call this Candidate Frequency.
  • Calculate the frequency of the same word or phrase in a larger reference set of texts, to get the average frequency that you’d expect it to have in a typical text. Call this Usual Frequency.
  • To get the Unusualness Score for how unusual a word or phrase is, divide the Candidate Frequency by the Usual Frequency.

In practical terms, if a word appears often in the candidate text, but appears rarely in the reference texts, it will have a high value for Candidate Frequency and a low Usual Frequency, giving a high overall Unusualness Score.
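The three steps above can be sketched in a few lines. The smoothing floor and the minimum-count filter are my own practical tweaks, not part of Amazon’s description:

```python
from collections import Counter

def unusualness_scores(candidate_text, reference_text, min_count=2):
    """Score words by how over-represented they are in the candidate text.

    Candidate Frequency / Usual Frequency, as described above. Words
    missing from the reference get a small floor frequency so we never
    divide by zero -- a crude answer to the near-zero expected
    frequencies problem. Words seen fewer than min_count times are
    dropped, which filters out a lot of one-off junk tokens.
    """
    cand = Counter(candidate_text.lower().split())
    ref = Counter(reference_text.lower().split())
    cand_total = sum(cand.values())
    ref_total = sum(ref.values())
    floor = 1.0 / (ref_total + 1)   # assumed smoothing floor
    scores = {}
    for word, n in cand.items():
        if n < min_count:
            continue
        candidate_freq = n / cand_total
        usual_freq = ref.get(word, 0) / ref_total
        if usual_freq == 0:
            usual_freq = floor
        scores[word] = candidate_freq / usual_freq  # Unusualness Score
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Tiny invented example: 'trails' is frequent here but absent from the reference.
ranked = unusualness_scores(
    "trails trails crew drive the the a",
    "the the the a a of to in it crew")
print(ranked[0][0])  # 'trails'
```

Real reference corpora are of course far larger, and phrase (n-gram) counting works the same way with tuples of adjacent words as the keys.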

This isn’t too hard to implement, so I’ve been experimenting using Outlook Graph. I take my entire collection of emails as a reference corpus, and then for every sender I apply this algorithm to the text of their emails to obtain the top-scoring improbable phrases. Interestingly, the results aren’t as compelling as Amazon’s: a lot of words that intuitively aren’t very helpful show up near the top.

I have found a few discussions online from people who’ve attempted something similar. Most useful were Mark Liberman’s initial thoughts on how we pick out key phrases, where he discusses using "simple ratios of observed frequencies to general expectations", and how they will fail because "such tests will pick out far too many words and phrases whose expected frequency over the span of text in question is nearly zero". This sounds like a plausible explanation for some of the quality problems in the results I’m seeing.

In a later post, he analyzes Amazon’s SIP results, to try and understand what it’s doing under the hood. The key thing he seems to uncover is that "Amazon is limiting SIPs to things that are plausibly phrases in a linguistic sense". In other words, they’re not just applying a simplistic statistical model to pick out SIPs, they’re doing some other sorting to determine what combinations of words are acceptable as likely phrases. I’m trying to avoid that sort of linguistic analysis, since once you get into trying to understand the meaning of a text in any way, you’re suddenly looking at a mountain of hairy unsolved AI problems, and at the very least a lot of engineering effort.

As a counter-example, S Anand applied the same approach I’m using to Calvin and Hobbes, and got respectable-looking results for both single words and phrases, though he too believes that "clearly Amazon’s gotten much further with their system".

There are some other explanations for the quality of the results I’m getting so far. Email is a very informal and unstructured medium compared to books. There’s a lot more bumpf: stuff like header information that creeps into the main text and isn’t intended for humans to read. Emails can also be a lot less focused on describing a particular subject or set of concepts, a lot closer to natural speech with content-free filler such as ‘hello’ and ‘with regards’. It’s possible too that trying to pull out keywords from all of a particular person’s sent emails is not a solvable problem, that there’s too much variance in what any one person discusses.

One tweak I found that really improved the quality was discarding any word that only occurs once in the candidate text. That seems to remove some of the noise of junk words, since the repetition of a token usually means it’s a genuine word and not just some random characters that have crept in.

Another possible source of error is the reference text I’m comparing against. Using all emails has a certain elegance, since it’s both easily available in this context, and will give personalized results for every user, based on what’s usual in their world. As an alternative, whilst looking at a paper on Automatically Discovering Word Senses, I came across the MiniPAR project, which includes a word frequency list generated from AP news stories. It will be interesting to try both this and the large Google corpus as the reference instead, and see what difference that makes.

I’m having a lot of fun trying to wrestle this into a usable tool, it feels very promising, and surprisingly neglected. One way of looking at what I’m trying to do is as the inverse of the search problem. Instead of asking ‘Which documents match the terms I’m searching for?’, I’m trying to answer ‘Which terms would find the document I’m looking at in a search?’. This brings up a lot of interesting avenues with search in general, such as suggesting other searches you might try based on the contents of results that seem related to what you’re after. Right now though, it feels like I’m not too far from having something useful for tagging emails.

As a final note, here’s an example of the top single-word results I’m getting for an old trailworking friend of mine:

The anti-immigration one is surprising, I don’t remember that ever coming up, but the others are mostly places or objects that have some relevance to our emails.

One thing I always find incredibly useful, and the reason I created Outlook Graph in the first place, is transforming large data sets into something you can see. For the SIPs problem, the input variables we’ve got to play with are the candidate and reference frequencies of words. Essentially, I’m trying to find a pattern I can exploit, some correlation between how interesting a word is and the values it has for those two. The best way of spotting those sorts of correlations is to draw your data as a 2D scatter graph and see what emerges. In this case, I’m plotting all of the words from a sender’s emails over the main graph, with the horizontal axis showing a word’s frequency in the current emails, and the vertical axis showing how often it appears in all emails.


You can see there’s a big log jam of words in the bottom left that are rare in both the candidate text, and the background. Towards the top-right corner are the words that are frequent in both, like ‘this’. The interesting ones are towards the bottom right, which represents words frequent in the current text, but infrequent in the reference. These are things like ‘trails’, ‘work’ or ‘drive’ that are distinctive to this person’s emails.
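That quadrant reading of the scatter graph can be written down as a trivial classifier. The 0.01 threshold is an arbitrary illustration, not a tuned value:

```python
def quadrant(cand_freq, ref_freq, threshold=0.01):
    """Classify a word by where it would land on the scatter graph:
    horizontal axis = frequency in the candidate emails,
    vertical axis = frequency in the whole reference corpus."""
    frequent_here = cand_freq >= threshold
    frequent_everywhere = ref_freq >= threshold
    if frequent_here and not frequent_everywhere:
        return "distinctive"   # bottom right: words like 'trails', 'drive'
    if frequent_here and frequent_everywhere:
        return "common"        # top right: words like 'this'
    if not frequent_here and not frequent_everywhere:
        return "rare"          # bottom-left log jam
    return "oddly absent"      # top left: common words this sender avoids

print(quadrant(0.05, 0.0001))  # distinctive
print(quadrant(0.04, 0.03))    # common
```

The interesting candidates for tags are exactly the "distinctive" bucket, which is what the Unusualness Score ratio picks out numerically.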

Should you cross the chasm or avoid it?


I recently came across a white paper covering Ten Reasons High-Tech Companies Fail. I’m not sure that I agree with all of them, but the discussion of continuous versus discontinuous innovation really rang true.

Crossing the Chasm is a classic bible for technology marketers, focused on how to move from early adopters to the early majority in terms of the technology adoption lifecycle. It describes the gap between them as a chasm because what you need to do to sell to the mainstream is often wildly different than what it takes to get it adopted by customers who are more open to change.

What the white paper highlights is that this ‘valley of death’ in the adoption cycle only happens when the technology requires a change of behavior by the customer; in its terms, the innovation is discontinuous. Innovations that don’t require such a change are continuous. They don’t face the same chasm between innovators and the majority, because the perceived cost of changing behavior is a large part of the mainstream’s resistance to new technology.

This articulates one of my instincts I’ve been trying to understand for a while. I was very uncomfortable during one of the Defrag open sessions on adopting collaboration tools, because everyone but me seemed to be in the mode of ‘How do we get these damn stubborn users to see how great our wikis, etc. are?’. They took it as a given that the answer to getting adoption was figuring out some way to change users’ behavior. My experience is that changing people’s behavior is extremely costly and likely to fail, and most of the time, if you spend enough time thinking about the problem, you can find a way to deliver 80% of the benefits of the technology through a familiar interface.

This is one of the things I really like about Unifyr, they take the file system interface and add the benefits of document management and tagging. It’s the idea behind Google Hot Keys too, letting people keep searching as they always have done, but with some extra functionality. It’s also why I think there’s a big opportunity in email, there’s so much interesting data being entered through that interface and nobody’s doing much with it. Imagine a seamless bridge between a document management system like Documentum or Sharepoint and all of the informal emails that are the majority of a company’s information flow.

Of course, there are some downsides to a continuous strategy. It’s harder to get early adopters excited enough to try a product that on the surface looks very similar to what they’re already using. They’re novelty junkies, they really want to see something obviously new. You also often end up integrating into someone else’s product, which is always a precarious position to be in.

Another important complication is that I don’t think interface changes are always discontinuous. A classic example is the game Command and Conquer. I believe a lot of their success was based on inventing a new UI that people felt like they already knew. Clicking on a unit and then clicking on something else and having them perform a sensible action based on context like moving or attacking just felt very natural. It didn’t feel like a change at all, which drove the game’s massive popularity.

I hope to be able to discuss a more modern example of an innovative interface that feels like you already know it, as soon as some friends leave stealth mode!

Krugle’s approach to the enterprise


I’ve been interested in Krugle ever since I heard Steve Larsen speak at Defrag. They’re a specialized search company, focused on returning results that are useful for software developers, including code and technical documentation. What caught my interest was that they had a product that solved a painful problem I know well from my own career; where’s our damn code for X? Large companies accumulate a lot of source code over the years. The holy grail of software development has always been reuse, but with the standard enterprise systems it can be more work to find old code that solves a problem than to rewrite it from scratch.

Their main public face is their open-source code search site. Here you can run a search on both source code and code-related public web pages. I did a quick test, looking for some reasonably tricky terms from image processing: erode and dilate. Here’s what Krugle finds, and for comparison here’s the same query on Google’s code search. Krugle’s results are better in several ways. First, they seem to understand something about the code structure, so they’ve focused on source that has the keywords in the function name and shows the definition of the function. Most of Google’s hits are constants or variable names, which are a lot less likely to be useful. Krugle also shows more relevant results for the documentation, under the tech pages tab. A general Google web search for the same terms throws up a lot more pages that aren’t useful to developers. Finally, Krugle knows about projects, so you can easily find out more about the context of a piece of source code, rather than just viewing it as an isolated file as you do with Google’s code search.

Krugle have also teamed up with some big names like IBM and Sourceforge, to offer specialized search for the large public repositories of code that they control. Unfortunately, I wasn’t able to find the Krugle interface directly through Sourceforge’s site, and their default code search engine seems fairly poor, producing only two irrelevant results for erode/dilate. Using the Krugle Sourceforge interface produces a lot more, it seems like a shame that Sourceforge don’t replace theirs with Krugle.

So, they have a very strong solution for searching public source code. Where it gets interesting is that the same problem exists within organizations. Their solution is a network appliance server that you install inside your intranet and point at your source repositories, and it provides a similar search interface to your proprietary code. I find the appliance approach very interesting; Inboxer takes a similar approach for their email search product, and of course there’s the Google search appliance.

It seems like a lot of developers must be searching for a solution for the code-finding problem, because it is so painful. It also seems like an easy sell to management, since they’ll understand the benefits of reusing code rather than rewriting it. I wonder how the typical sale is made though? I’d imagine it would have to be an engineering initiative, and typically engineering doesn’t have a discretionary budget for items like this. They do seem to have a strong presence at technical conferences like AjaxWorld, which must be good for marketing to the development folks at software businesses.

Overall, it seems like a great tool. I think there’s a lot to learn from their model for anyone who’s trying to turn specialized search software into a product that businesses will buy.

If blog comments are dark matter, then what’s the dark energy?

Brad described blog comments as the dark matter of the net. They’re really hard to search, and so there’s a lot of useful information that’s effectively lost to the world. What’s driving a lot of my work is my belief that email is the dark energy.

Dark energy makes up 74% of the universe, versus 22% for dark matter. There are an estimated 200 billion emails sent every day, whereas the number of active blogs is in the low millions. I’m wandering dangerously close to Chinese math, but even assuming the vast majority of emails are low in information content, that’s a lot of untapped data that people are entering into computers.

The reason nobody’s taking advantage of this is that emails are a very personal and private medium, not intended for public consumption, unlike blog posts or comments which are explicitly published to the world. My hypothesis is that there’s a category of people for whom exposing partial information about their email, possibly to a limited audience, will solve some painful problems. JP Rangaswami is my poster child; he opened up his inbox to all his direct reports, as a way of mentoring and sharing information with them, as well as ensuring he doesn’t hear much complaining about each other! I wouldn’t go that far, but I do wish I could easily expose all of my technical discussion email threads to the rest of my team.

There are practical steps that can be taken within a business setting to make a lot more information available, since that’s one place where you have access to a whole set of interacting email messages. I want to find subject matter experts within the organization, or people who have been in contact with an external group or person you want information on. Doing social graph analysis on an Exchange server full of messages will help with that, as will statistical analysis for picking out keywords. I’m excited to see what tools I can build on these foundations. Stay tuned…
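As a minimal sketch of the social-graph side, assuming messages have already been pulled out of the mail store as (sender, recipients) pairs (the extraction step itself is elided here):

```python
from collections import Counter

def contact_counts(messages):
    """Tally who emails whom -- the edge weights of the social graph.

    messages: iterable of (sender, [recipients]) pairs, the simplified
    shape you might get from walking a mailbox. Returns a Counter of
    (sender, recipient) -> message count.
    """
    edges = Counter()
    for sender, recipients in messages:
        for recipient in recipients:
            edges[(sender, recipient)] += 1
    return edges

def best_contact_for(edges, person):
    """Who inside the org corresponds with this person the most?"""
    candidates = {s: n for (s, r), n in edges.items() if r == person}
    return max(candidates, key=candidates.get) if candidates else None

# Invented example: find who knows the external vendor best.
edges = contact_counts([
    ("alice", ["vendor@example.com", "bob"]),
    ("alice", ["vendor@example.com"]),
    ("bob", ["vendor@example.com"]),
])
print(best_contact_for(edges, "vendor@example.com"))  # alice
```

Combining these edge weights with the per-sender keyword scores is what would let you answer "who in the company should I ask about X?".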

Inboxer – An easy way to spy on your employees’ emails?

I first ran across Inboxer through their excellent Enron email exploration site. They offer a server appliance that sits inside a company’s firewall, analyzes all internal email, and offers a GUI for exploring the messages. There are sophisticated tools for surfacing the kinds of emails management would be interested in, such as objectionable content, recruitment-related messages, or mail to particular external contacts. They also let you set up alerts and triggers when certain conditions are met, such as unauthorized employees emailing messages that appear to contain contracts to external addresses. You can experiment with their UI through the Enron site; it seems to be pretty well laid out, and simple enough for non-technical people to use.


They offer graphs of important statistics over time.


There’s a set of pre-packaged searches for things management are commonly concerned about. You can drag and drop any of them onto the main pane, and you’ll get a view of all the relevant emails.

They’ve done a great job technically with Inboxer, and it seems like a well-rounded service. I’m a bit disturbed that this is what the market is demanding, though. Despite it being pretty clear from a legal standpoint that the company has no duty of privacy, most people don’t treat their work emails as public documents. Some of the searches, such as those for recruitment terms, are clearly aimed at catching employees doing something they don’t want management to know about, but that isn’t aimed at harming the company. I worry that it would be incredibly tempting to use this as a technical fix for a management problem: instead of keeping employees from job-hunting by keeping them happy, just try to punish anyone who makes the mistake of using the company system in their search.

I believe the Inboxer team has done their homework, they’ve clearly tried a lot of different tools, and this is the one that seems most successful. There’s a lot of legitimate uses, especially in regulated industries and government organizations, where there’s liability issues that require some email controls. I just wish that a less command-and-control, top-down approach was more popular. If Inboxer also offered a client-side version, I’d much rather work for a company that required that. It could make it clear which emails would be flagged and looked at, before they were sent, and help employees understand how public their work emails really are.

Roger Matus, the CEO of Inboxer, has collected a lot of useful email and messaging news in his blog, Death by Email. I’d recommend a visit if you’re interested in their work.

How to access the Enron data painlessly


Yesterday I gave an overview of the Enron email corpus, but since then I’ve discovered a lot more resources. A whole academic ecosystem has grown up around it, and it’s led me to some really interesting research projects. Even better, the raw data has been put up online in several easy to use formats.

The best place to start is William Cohen’s page, which has a direct download link for the text of the messages as a tarball, as well as a brief history of the data and links to some of the projects using it. Another great resource is a MySQL database containing a cleaned-up version of the complete set, which could be very handy for a web-based demo.

Berkeley has done a lot of interesting work using the emails. Enronic is an email graph viewer, similar in concept to Outlook Graph but with a lot of interesting search and timeline view features. Jeffrey Heer has produced a lot of other interesting visualization work too, including several toolkits and some compelling work on collaborating through visualization, like the demographic viewer and annotator.

Equally interesting was this paper on automatically categorizing emails based on their content, comparing some of the popular techniques with the categorization reflected in the email folders that the recipients had used to organize them. Ron Bekkerman has some other interesting papers too, like this one on building a social network from a user’s mailbox, and then expanding it by locating the member’s home pages on the web.