Email data mining by Spoke and Contact Networks


I’ve been thinking hard about painful problems that email analysis could solve, and one of them is the use of a corporation’s email store to discover internal colleagues who have existing relationships with an external company or person you want to talk to. For example, if you want to sell to IBM, maybe there’s someone in your team who’s already talking to someone there. Or internally, you might want an introduction to someone in another department to discuss a problem, and it would be good to know who in your team had contacts there already.
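To make the idea concrete, here is a minimal sketch of the lookup I have in mind. It assumes the message store has already been boiled down to (sender, recipients) pairs of plain addresses. The function names and the example.com and ibm.com addresses are hypothetical, and this is not how any of the products discussed here actually work.

```python
from collections import defaultdict


def domain_of(address):
    """Return the lower-cased domain part of an email address."""
    return address.rsplit("@", 1)[-1].lower()


def find_introducers(messages, internal_domain, target_domain):
    """Rank (colleague, contact) pairs by how many messages they exchanged.

    messages is an iterable of (sender, [recipients]) tuples. A pair is
    counted whichever side sent the message, as long as one address is
    internal and the other belongs to the target domain.
    """
    counts = defaultdict(int)
    for sender, recipients in messages:
        for recipient in recipients:
            pair = (sender.lower(), recipient.lower())
            # Check both orientations: colleague wrote out, or contact wrote in.
            for colleague, contact in (pair, pair[::-1]):
                if (domain_of(colleague) == internal_domain
                        and domain_of(contact) == target_domain):
                    counts[(colleague, contact)] += 1
    # Most-exchanged relationships first.
    return sorted(counts.items(), key=lambda item: -item[1])
```

Raw message counts are a crude measure of a relationship, so a real tool would want something better, but even this much answers the 'who here already talks to IBM?' question.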

I was discussing these thoughts with George Eberstadt, co-founder of nTag, and he pointed me to a couple of successful companies who are already mining email to do this, Spoke and Contact Networks.

Spoke are interesting because they’re entirely client-based, rather than running on an organization’s whole message store. They work by taking data from everybody who’s running their Outlook add-on, along with information pulled from publicly available sources, and feeding into their own database. You can then search that yourself to find information on people you’re interested in contacting, and people you know who’ve already been in touch with them. It’s effectively creating a global social network largely based on the email patterns of everybody who belongs.

Technically, it sounds like they’re doing some interesting things, such as exchanging special emails to confirm that two people really do know each other, but when I tried searching for my own name, I didn’t get any useful information. I’m also surprised that companies would allow the export of their employees’ email relationships to a third party. It may just be that this is happening under the radar, but it seems like the sort of thing that a lot of companies would want some safeguards on. The service encourages individual employees to install the software themselves, without any warning that they might be opening up the organization’s data to third-party analysis. I know a lot of the companies I deal with would frown on this, to say the least.

Contact Networks seem much more focused on selling to corporations as a whole, rather than individual employees. They build a social graph pulling from several sources internal to the company including email and calendars, CRM, marketing databases, HR and billing systems. They use this to identify colleagues who know a particular individual, which is a succinct description of a ‘painful problem’ that companies would be willing to pay money to solve. They seem to have been very successful, with lots of big-name clients and they were just bought out by Thomson.

It’s good to see how well Contact Networks have done. It’s proof there’s demand for the sort of services I’m thinking of, even if they have already solved the immediate problems I was considering.

Seriosity – How much is your email worth?


I first ran across Seriosity in a recent Wall Street Journal article. It was discussing companies dealing with ‘colleague spam’ or the problem of bacn in the workplace, where unimportant messages crowd out significant ones in your inbox. Unfortunately the article itself is subscription only, but after mentioning ClearContext and Xobni, they talk about Seriosity’s approach.

The heart of their method is a virtual currency, Serios, that you can spend to mark a message you send as important. Essentially, it’s a way of making the existing ‘important’ flag more useful, by making it a scarce resource. The current flag doesn’t mean very much, because there’s no natural regulation of its use, so some people can mark all their messages as important, whilst others don’t use it at all. The Serios spent on a message determine how prominently it appears in the recipients’ mailboxes.

The currency system is based on studies of online games, with the recipients of messages receiving the Serios the sender spent on them, and everybody having a limited store of points to use. They’ve already done a test deployment of their product for eBay’s internal email, with some positive quotes from the company on its usefulness but also reports of performance problems with the installed client, apparently fixed in a newer version.
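Here’s how I picture the mechanics, as a toy sketch only. The Message and SeriosLedger classes are names I made up, the starting balances are arbitrary, and I have no idea how Seriosity actually implement it. The rules I’m assuming are the ones described above: you can’t spend more than you hold, the recipient pockets whatever you attach, and the inbox is ordered by the amount spent.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    sender: str
    recipient: str
    subject: str
    serios: int = 0  # currency the sender chose to attach


@dataclass
class SeriosLedger:
    """Hypothetical ledger: tracks balances and per-user inboxes."""
    balances: dict = field(default_factory=dict)
    inboxes: dict = field(default_factory=dict)

    def send(self, message):
        balance = self.balances.get(message.sender, 0)
        if message.serios < 0 or message.serios > balance:
            raise ValueError("sender cannot afford this many Serios")
        # The currency moves from sender to recipient, so it stays scarce.
        self.balances[message.sender] = balance - message.serios
        self.balances[message.recipient] = (
            self.balances.get(message.recipient, 0) + message.serios
        )
        self.inboxes.setdefault(message.recipient, []).append(message)

    def inbox(self, user):
        # Most expensive messages first; ties keep arrival order.
        return sorted(self.inboxes.get(user, []), key=lambda m: -m.serios)
```

Even in a model this small you can see the design questions: who tops up the balances, and what happens to people who send far more than they receive.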

I’m very happy to see something so innovative in its approach to this problem, but I do think they’ve got some significant hurdles to overcome.

For starters, I’m not convinced that colleague spam is amenable to any algorithmic solution. In the article, they use the example of a department email sent out announcing brownies in the kitchen. I like brownies! I wouldn’t want to miss out on that message when it came in, but I might not care if the same email was for carrot-cake. These sorts of messages require some understanding of the content and its significance to the recipient to process. I would end up scanning the subject lines of all emails as they came in to make sure I didn’t lose a message I cared about.

On a practical level, game economies are really hard to design well, and the constraints there are only about making the experience enjoyable. For a work tool, there’s a lot of additional requirements. I’d imagine that the CEO wouldn’t want her occasional all-company messages to appear in people’s spam folders, so that would either require giving her a massive store of Serios (which would encourage inflation, and by extension require Serios to be allocated by seniority) or give her an opt-out like the existing important flag, which others would also want to use and abuse. There’s also a very unbalanced pattern of communication for certain workers who need to send out a lot of informative emails, without necessarily getting many back. For example, an office manager probably sends out a lot of all-building emails, some of which might be urgent, but which will be hard to allocate the right amount of currency to.

They do also mention the analysis of the currency flow as a way of charting how the organization actually works. That’s an idea that’s close to my own heart, since it lets you see the strength of ties between people who exchange a message, something I’m trying to do by analysing which emails are actually replied to.
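As a rough illustration of the reply-based approach, here is a sketch that counts how often each pair of people reply to one another. It assumes each message has been reduced to a dict with hypothetical "id", "sender" and "in_reply_to" keys (in real mail the last would come from the In-Reply-To header), and it makes no attempt to weight by recency or thread length.

```python
from collections import Counter


def reply_tie_strength(messages):
    """Count replies between each unordered pair of people.

    A message only contributes if its parent is present in the same
    collection, so replies to unknown or external mail are ignored.
    """
    by_id = {m["id"]: m for m in messages}
    ties = Counter()
    for m in messages:
        parent = by_id.get(m.get("in_reply_to"))
        if parent is None:
            continue  # not a reply, or the original is not in our store
        if m["sender"] != parent["sender"]:
            # frozenset makes ann->bob and bob->ann the same tie.
            ties[frozenset((m["sender"], parent["sender"]))] += 1
    return ties
```

A message that never gets a reply adds nothing to the score, which is the point: it’s the replies that suggest a real working relationship.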

Seriosity have obviously been working hard on this for the last couple of years, and it looks like they’re getting close to a public release of their Attent product. I look forward to playing with this in practice, since it appears they have an open beta program I can apply for. I would link to the demo, but unfortunately the supplied http://www.seriosity.com/demo.html url seems to be broken, though you can click on the ‘view demo’ link on this page to get to it.

Disruptor Monkey’s Unifyr

Disruptor Monkey are a North Carolina based startup, and they recently popped up on my radar through an article on Brad’s blog. There’s not much information available about their product, Unifyr, but they do have this well-produced teaser video. Based on that, I’d describe it as a way of accessing a lot of different data sources from a single interface.

The demo shows external web pages, email messages, CRM databases, and internal documents (both stored locally and on the network) all appearing in a unified file structure view. It looks like there’s a way to search, sort and organize all of these sources of data from the same interface.

The workflow of using their product appears very streamlined and intuitive. It’s using metaphors people already know, a button in a browser toolbar and a folders and files view. This opens up the product to a lot of people who’d be put off by something geekier. They also have obviously thought about making it as hands-off and automatic as possible, with features like automatic tag generation, and making folder sharing very easy. I’d like to use it myself, based on what I’ve seen, and as Nick put it in an email I like their thinking and am "well down the road drinking the kind of coolaid we enjoy".

I’ll look forward to hearing about how they position their product. It seems like it could help out with a lot of business processes, but they’re not discussing any focus yet. I’m especially interested in how they approach this, since I’m struggling with finding a painful enough problem to apply my own ideas to. Check out Nick Napp’s blog to keep up to date with this very promising company.


Does email analysis invade privacy?

At Defrag, JP Rangaswami was talking over lunch about how he’d opened up his inbox to all his direct reports. This fascinated me, both for the behavior of his subordinates (they were most interested in his sent items, as a way of understanding what he was thinking) and because it was a very logical idea, but one I’d never heard even discussed before.

One of the great strengths of email as a communications tool is that it has a very clear security model. You explicitly list the people you want to receive the email by name, and they’re the only ones who get it. There are plenty of ways information can be leaked, such as forwarding on to third parties, but these all require somebody to actively make a decision to do so. By contrast, it’s much harder to know who can see an internal wiki page.

A lot of people focused on collaboration seem to believe that worrying about access is just oldthink that needs to be eradicated. The new collaboration tools will enable a new world where information is freely shared across the corporation, vaulting over traditional communication barriers. In my experience, there’s a lot of non-technical reasons why people care about access.

Fundamentally, most leaders are rewarded for the results their team or department produces, since that’s a lot easier to measure than the contribution their employees have made across the whole company. This means that it’s hard to justify spending resources collaborating with other internal teams, even though looked at holistically that might be best for the company. Taken to an extreme, this stovepiping can be crippling, but it’s an inherent emergent feature of hierarchical organizations, so I don’t see it disappearing anytime soon.

On an individual level, knowledge is power. People may have invested a lot of time building relationships within the company, and one reward is access to information others don’t have. This may make them reluctant to share these sources, both for selfish reasons and so they can act as a filter for information requests, to make sure the source isn’t overwhelmed with inappropriate ones.

There’s also a lot of sensitive personnel-related communications that can go on. Even the knowledge that two people have ever emailed each other can be sensitive if it’s a subordinate bypassing her boss and contacting human resources.

The sort of email analysis I’m interested in takes all of an organization’s messages and does a global analysis to reveal useful relationships and information, especially about the social graph of the employees. This is not information that people were expecting to reveal when they sent their emails. There’s nothing illegal about doing this, since the emails all belong to the company if they’re sent on company accounts, but it does break the security model that people trust. That’s both ethically uncomfortable and likely to be a barrier to adoption.

The solution has to be keeping the ownership and sharing of information within the user’s control. One way of doing that is by default only allowing anonymous information to be publicly reported, which could include such things as how many people read or forwarded an email you sent. You could also designate certain internal mailing lists as publicly accessible across the organization. There’s already an understanding that lists with open membership policies are not private, so this isn’t changing the mental access model that people trust. Going a step further, you can give people tools to share certain emails, the way a lot of people share calendars at the moment. This would work particularly well tied into SharePoint, since documents there have their own access model. In particular, it might be useful to add a special email address that adds the email to the public intranet and makes it visible to email analysis tools.
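The anonymous-by-default reporting could be as simple as the sketch below. It assumes a hypothetical log of (message_id, user, action) events and reports only how many distinct people took each action, never who they were. One caveat I’m sure of: very small counts can still give someone away (a count of one on a message sent to one person is not anonymous), so a real system would probably need a minimum threshold as well.

```python
from collections import defaultdict


def anonymous_stats(events):
    """Summarize events as distinct-user counts per message and action.

    events is an iterable of (message_id, user, action) tuples. The
    returned structure contains no user names at all.
    """
    people = defaultdict(set)
    for message_id, user, action in events:
        # A set means repeat reads by the same person count once.
        people[(message_id, action)].add(user)

    stats = defaultdict(dict)
    for (message_id, action), users in people.items():
        stats[message_id][action] = len(users)
    return dict(stats)
```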

It should be possible to overcome users’ concerns about access and email analysis, but it will require some careful design. I can certainly understand why most existing services focus on either client-side tools or global analysis designed to give top management or forensic analysts an unrestricted view of all emails, since both of those sidestep these issues.

Due diligence for startups


Here are some of my favorite articles on due diligence, all obviously written from painful experience.

Rick Segal writes about The Due Diligence Dipsey Doodle. He’s great at providing concrete examples of what he’s talking about, and emphasizing ‘obvious’ points (like documenting everything) that everyone theoretically knows, but are still commonly ignored in practice. He’s also got an older post covering the actual list his firm goes through at Kicking the tires: A due diligence checklist.

Nivi and Naval at VentureHacks try to answer How much diligence should we do before we sign a term sheet? I like their relentless focus on giving entrepreneurs tools to put together good deals, with good coverage of what can go wrong to back up their advice.

Suzanne Dingwall Williams is a lawyer with a lot of experience working with startups. She gives some cautionary advice on what can go wrong with the disclosure process in Selling the startup: Due diligence disasters.

Furqan Narziri has some useful tips on the process in Due diligence: What to expect. I like his suggestion of using Sharepoint to manage the documents being worked on, especially since that ties in with a lot of what I’ve been working on.

Finally, Mark Peter Davis has an ambitious series covering the whole process of raising capital, and is just starting on the due diligence phase in Sharing your cap table. I’m really interested in following the whole series; he’s doing a great job.

Ethnography and software development


Ethnography, which literally means people-writing, is one way anthropologists study communities. They write very detailed and uninterpreted descriptions of what the people living in them do and say. These raw and unstructured accounts can be almost like diaries.

Most computer services are doomed to fail because they’re based on wishful thinking about the way their users should behave, not how they actually work. This is where ethnography comes in. Just like good user statistics, it forces you to stare the gloriously illogical humanity of people’s behavior square in the face. Only after you’ve got a feel for that can you create something that might fit into their lives.

Defrag was full of visionaries, academics and executives, people used to creating new realities, and changing the way people work. One of the most useful open spaces I attended was a session on how to bring web 2.0 tools into big companies. What amazed me was their visceral loathing of the imperfect tools already being used. My suggestion that there might be some valid reasons to email a document as an attachment rather than collaborating on a wiki was met with a lot of resistance. One such reason is the fine-grained control and push model for distribution that email offers. Senders know exactly who they’re passing it to, and though those recipients can send it on to others, there’s a clear chain of responsibility. With a wiki, it’s hard to know who has access, which is tricky in a political environment (i.e. any firm with more than two people).

Digging deeper, especially with Andrew McAfee, it felt like most of the participants had encountered these arguments as smokescreens deployed by stick-in-the-muds who disliked any change, which explained a lot of their hostility to them. It felt to me like the reason they are so effective as vetoes to change is that they contain some truth.

This smelt like an interesting opportunity, with a chance to take useful technology currently packaged in a form only early adopters could love (where’s the wiki save file menu and key command?) and turn it into something a lot more accessible to the masses, requiring few habit changes.

As a foundation for thinking about that, here are some pseudo-ethnographic observations on how I’ve collaborated on written documents. They’re written from memory, and I’ve structured them into a rough time-line, and conglomerated all my experiences into a general description. This makes it not quite as raw as real ethnography, but it’s still useful for me to organize my thoughts.

  • Someone realizes we need a document. This can come down as a request from management, or it can be something that happens internally to the team. The first step is to anoint someone as leading the document creation. This often happens informally if it’s a technical document, and the leader often ends up being the person who identified the need. If it’s a politically charged document, the leader is usually senior, and often someone with primarily management duties.
  • Then, they figure out if it’s something that can be handled by one person, or if it really needs several people’s inputs. There’s a difference between documents that are shared for genuine collaboration, and those which are passed around for politeness’s sake, without the expectation of changes being made. Assuming it’s the first kind, there will be a small number of people in the core group who need to work on it, very seldom more than four.
  • If it’s documenting something existing, somebody will usually prepare a first draft, and then comments are requested from that small, limited group.
  • If controversial technical decisions or discretion are involved, then the leader will often do informal, water-cooler chats to get a feel for what people are thinking, followed by a white-board meeting with the core group. An outline of the document is agreed, with notes taken on somebody’s laptop, and often emailed around afterwards.
  • If someone’s trying to sell the group on an idea, they may create a background document first. This is usually sent as an email, with either content in the message, or a wiki link.
  • In most cases, the leader’s document is emailed around to the core group for comments. No reply within a day or two is taken as assent, unless the leader has particular concerns, and follows up missing responses in person.
  • For very formal or technical documents, changes will be made in the document itself. More often the comments will be made in an email thread, and the leader will revise the document herself with agreed changes, or argue against them by email, or by talking directly to the person.
  • Documents being collaborated on are rarely on the wiki. Word with change tracking enabled is the usual format. Standing policy and status documents are two big exceptions. They’re almost always on the wiki, but may not appear there until they’re agreed on.
  • Final distribution is usually done by email. For upwards distribution to management, this will be as a document or email message. For important ones, this is often only a short time before or even after a personal presentation to the manager who’s the audience, to manage the interpretation.
  • For ‘sideways’ delivery to colleagues, a wiki link may be used, though the message is still sent by email, and might be backed up with an in-person meeting.

This is just a brief example, but looking through these raw notes, a few interesting things leap out at me. Who sees the document at each stage is important to the participants. It’s possible to argue that this is a bad thing, but it’s part of the culture. We aren’t using our wiki for most of our document collaboration. It’s still going through email and Word, which might be partly connected to this.

It’s a useful process, a way of looking at the world that helps me see past a lot of my own preconceived ideas of the way things work, through to something closer to reality. Give it a shot with your problems.

Thanksgiving hiatus

My parents have just arrived from the UK, and we’re all flying off to visit Liz’s family in Wisconsin tomorrow, so I’ll be trying to stay away from blogging for the next week. If you’re a new reader, the Defrag coverage should keep you busy until I return. If you’re technically inclined, there’s also a series on the technical details of accessing Outlook emails, and a step-by-step guide to writing an Internet Explorer extension, based on my experiences developing Google Hot Keys.

And finally for outdoors folk, check out some of my posts on hiking, biking and camping in LA, complete with tips on combining martinis and backpacking!

Google’s latest mail API


As Brad spotted, my previous post strong-armed Google into introducing a new mail migration API. Well, there was correlation even if I’m not so sure on the causation. Looking through Google’s latest offering, it’s clearly aimed at one-way migration from other systems to Google Apps, rather than being a two-way interoperability standard that would allow a mix of Exchange and Gmail use within the same system.

To quote from the announcement, they introduced it because "some customers are reluctant to step into the future without bringing along the email from their past". I’d imagine there are some customers who are ‘reluctant to step into the future’ if it’s a one-way trip for all their email data too, locking them into Google’s OS going forward. Email, calendars and contacts are crying out for a nice open integration layer. The information you need is comparatively well-defined and bounded, and there are already supported standards for the components of the problem, like IMAP, vCard and iCalendar.

Microsoft has always had a strategy with strong developer support as a priority. This is great for third-party vendors but arguably was a factor in a lot of their security and usability issues. Google doesn’t feel the same need to look after external developers, as shown by the removal of their search API. They’d much rather simplify the engineering and user-experience by avoiding the clutter of hosting third-party code within their apps.

Even though it’s ugly and COM-tastic, it’s possible with enough effort to dig deep into Exchange’s data stores and build deeply integrated tools. Moving to Google Apps (or most other SaaS apps I’ve seen) you’re losing that level of access. My hunch is that in a few years’ time we’ll see the same customer pressure that drove MS to open their enterprise tools to customization pushing SaaS companies to either offer APIs or lose business.

Colorado trip


Liz joined me in Denver on the last day of Defrag, and we spent the rest of the week exploring Colorado together. We started off that evening with a visit to Pints Pub. When I first came to the US, I felt very odd visiting simulacrums of British tea-shops and pubs, they always felt like such an exaggerated, "Mary Poppins" version of the old country. I’ve been over here long enough now that I’m very happy to find even a half-decent Bangers and Mash, and the other details are no longer jarring.

Pints was actually a great place, they had an amazing selection of scotch, some really impressive draught beers, and good food. They’d got the right atmosphere too, there were the obligatory pictures of Churchill and policemen, but the furnishings, fittings and lighting were all very pub-like.

We then spent three days up in the mountains, doing some early-season snowboarding at Loveland and Keystone, staying at the Inn at Keystone. It was so early in the season that there were only a few runs open, and they were very icy, so it was a pretty challenging experience. Liz ended up getting pretty bruised and battered from falls, but we both found the martinis on offer at the Inn’s bar very soothing.

We spent the last two nights in Boulder. The hotels were packed, so we ended up at the slightly tattered Golden Buff Lodge. It wasn’t a bad location, with a nice Indian restaurant nearby, but the heater sounded like a helicopter taking off when it started, there was no cable for the internet, and no sound insulation in the ceiling, so just the upstairs neighbors walking around was enough to wake us up. We still had a great time in Boulder though. We hiked from the Rangers Cottage up the Chautauqua trail on the first afternoon, and then did a big loop from NCAR up to Bear Peak and back on the Saturday.

The Bear Peak loop was tough, with about 2,300 feet of elevation gain over about four miles and starting at over 6,000 feet. The gain was concentrated in the climb up the peak itself, and the final half mile was a really steep uphill scramble. Here’s Liz coming down from the peak:
Lizbearpeak

We took the Fern Valley trail back to NCAR, and that was shorter than the Bear Canyon route we took up, but involved about a mile and a half of extremely steep and slippy downhill that would have been a lot easier if we’d brought our hiking poles. The view from the top of Bear Peak was incredible though, both looking out towards the mountains and back towards Denver it was beautiful.

It was fun people-watching on the trail too; everybody looked like they could plausibly be part of the university faculty or students, and there was at least a dog for every person we saw, which made us think again about getting one ourselves. Even better, Boulder has a scheme where you can have your dog off-leash on the trails if you have a special tag that proves it’s under your sight and voice control. I feel sorry for the dogs here in California; they never get to have any fun out in the mountains like that.

Google, Yahoo and MSN Mail APIs


Whilst there’s no official Gmail API, there is an unsupported but widely used de facto standard: the functions used by mobile phones to access Google mail. Luckily for me, there’s been a lot of work already done to figure out the format and protocol, and probably the best documentation is the source of the libgmailer PHP project. The downside of it being unofficial is that it keeps getting broken by Google’s changes, but there’s an active community using it who seem to patch it up again very quickly.

Yahoo actually has an official mail API, but it suffers from a couple of serious flaws. First there’s this language at the start of the documentation: "You may not use the Yahoo! Mail Web Service API to mine or scrape user data from the user’s Yahoo! account." Umm, so I can access the data but can’t ‘mine’ or ‘scrape’ it, whatever that means? Does that include creating a social graph from their mailbox? It certainly sounds like it.

Secondly, some basic functions like GetMessage to grab information about an individual email are only available to premium accounts. I’d imagine that would instantly cut down the potential audience by an order of magnitude.

MSN/Hotmail used to have a nice undocumented API through WebDAV/HttpMail. Unfortunately they shut down access to non-premium customers in apparent response to spammers. There are reports (bottom of article) that it’s still possible to use it to download messages, just not send them, but I haven’t tested that. It looks like the only alternative is screen-scraping.

This is a great example of the ‘separate data silos with unusable content’ problem that Doc Searls discussed in his Defrag talk. The user could gain a lot from allowing other services access to their mail, for example decent external mail integration onto Facebook, but it’s not in the interest of any of the companies that physically hold their data to allow that.