What analyzing digital communications misses

Closenessdiagram

Greg Berry just posted a very interesting comment, touching on a question I've wrestled with.

"…lots of business and life happens off the internet (hard to believe, I
know), but even within the digital confines, there are so many
different planes of communications to track."

Probably the best example of this is your significant other or business partner. If you're often in the same room as them, you probably won't send them as many emails as a direct report who's in another office. If you rely on communication frequency for measuring closeness, you'll underrate those relationships. So how do you work around this problem?

Design your algorithms around the blind spot. Google's search results are nowhere near as good as a dedicated human researcher could produce, but that doesn't matter. They narrow it down to a couple of dozen sites you can manually check. A few bogus results or dubious rankings don't matter because they can easily be spotted and ignored. The equivalent for tools based on automated relationship analysis is giving users the option to edit the strength of relationships to correct the occasional mistake, and always giving people a chance to eyeball any decision before any action is taken by the system.

Pick the right problem domain. I'm fascinated by applications in the business world because the relationships I needed the most help with are right in the sweet spot for email. I've sketched the graph above to show roughly the communication frequencies I've experienced. For different industries and generations the lines will shift and scale, but between Bob in accounting and your boss there are probably a lot of people you exchange a lot of mail with. Stick to problems related to those folks, and email frequency will be a good approximation to closeness.

Be realistic about the results. I think the Boulder Twits communication map is the best guide to the relationships in the local tech scene, but that's mostly because it's the only one. As Gregg says, different styles of communications heavily affect the results, even if you forget about the channels it's missing. Heavy Twitter users are far more likely to end up in the center of the graph than less prolific twits. Chris Wand is entirely missing because he's not on Twitter, even though he's heavily involved in the community. As we pull in more and more channels we'll be able to produce far better analysis, and do a lot of useful things, but we'll never capture all the fractal richness of relationships within our primate packs.

Javascript, the ginger-haired stepchild of the language family

Redhead

Photo by Gold Sardine

Liz asked me yesterday what language Mailana is written in. It took me a while to think about it, but the list is C (low-level Exchange interfacing), C++ (speed-critical string processing), C# (Outlook plugin), PHP (most of the server architecture), SQL (database querying), Actionscript (Flash components) and Javascript (rich Ajaxesque browser functionality). It got me wondering why the last one gets so little credit. Out of all of them, it's probably my favorite to use.

I found that Douglas Crockford's explanations of why it's the world's most popular and most misunderstood language rang very true, but what really caught my eye were some demos written as pure scripts:

http://www.monstropolis.org/intro8.html
http://www.monstropolis.org/intro1.html
http://www.monstropolis.org/intro3.html
http://www.monstropolis.org/intro7.html
http://www.uselesspickles.com/triangles/demo.html (There's something deeply twisted about rendering 3D triangles using CSS style tricks, but I just can't look away)

I don't know when Javascript will be welcome in polite society, but dismiss it at your peril. It's now everywhere and there's a whole generation of self-taught programmers headed your way who know nothing else.

How I built the Boulder Twits graphs

Clockmechanism

Photo by Pierre J.

I knew I wanted to build a map of how people were connected in the Boulder tech scene. The first step was accessing the raw data, in this case all the Twitter messages from the first 60 local people I'd identified. I already had a system set up to rapidly analyze large numbers of email messages for my Mailana startup. It's modular, with different import components that access mail APIs like Exchange's MAPI/RPC, Gmail's IMAP and Outlook's Object Model, all outputting a stream of messages in standard XML form. Using Twitter's API it was pretty easy to build an importer. The only wrinkle was that I had to search for @someone in the message body, and add that to the recipients field in the XML.
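As a rough illustration of that wrinkle, here is a minimal Python sketch of pulling @mentions out of a message body and treating them as recipients. It is not the actual importer: the function names and the `from`/`to`/`body` field names are made up for the example, and the real system emitted XML rather than a dictionary.

```python
import re

# An "@" followed by word characters, but not when the "@" sits inside
# something like an email address (a word character directly before it).
MENTION_PATTERN = re.compile(r"(?<!\w)@(\w+)")


def extract_recipients(body):
    """Return the unique @mentions in a message body, in order of appearance."""
    recipients = []
    for name in MENTION_PATTERN.findall(body):
        name = name.lower()  # treat @Bob and @bob as the same person
        if name not in recipients:
            recipients.append(name)
    return recipients


def to_message_record(sender, body):
    """Build a message record; the field names here are hypothetical."""
    return {"from": sender.lower(), "to": extract_recipients(body), "body": body}
```

Calling `to_message_record("Alice", "hi @Bob")` gives a record whose `to` field is `["bob"]`, which is the shape the later sent/received counting needs.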

That whirred away for a while, pulling the complete message histories into my database, with indices keyed on the recipients, as well as lots of other values. Sitting on top of that database I've got a Facebook App-style REST API that lets me run queries like "Tell me who sent messages to who within this group of people". Running that on the Twitter messages gave me a list that conceptually looked like this:

Alice to Bob: 10 messages sent, 3 messages received
Alice to Charles: 4 messages sent, 7 messages received

What I actually wanted was a single number for any relationship, a measure of how strongly Alice and Bob are connected. My choice was the lower of the sent or received counts, so in the above case

Alice to Bob: Strength 3
Alice to Charles: Strength 4

I like this method for mail because it excludes bots like Facebook notification addresses that you never reply to, and penalises other sorts of unequal relationships, e.g. famous people you might have emailed who ignore you. Not that that ever happens to me of course.
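The "lower of the sent or received counts" rule is small enough to sketch in a few lines of Python. This is only an illustration of the rule, not the Mailana code: the input format (one `(sender, recipient)` pair per message) and the function name are assumptions made for the example.

```python
from collections import Counter


def relationship_strengths(messages):
    """Map each pair of people to min(messages sent, messages received).

    messages: iterable of (sender, recipient) pairs, one per message.
    Returns a dict keyed by frozenset({a, b}), so direction no longer matters.
    """
    sent = Counter(messages)
    strengths = {}
    for (a, b), count_ab in sent.items():
        count_ba = sent.get((b, a), 0)
        strength = min(count_ab, count_ba)
        # One-way traffic (bots, ignored fan mail) has strength 0 and is dropped.
        if strength > 0 and a != b:
            strengths[frozenset((a, b))] = strength
    return strengths
```

Fed the counts from the example above, this yields 3 for Alice and Bob and 4 for Alice and Charles, while a notification address that Alice never replies to produces no entry at all.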

Now that I had a list of all the relationships in the community, I needed to display them. I wanted something that could be interacted with inside the browser, so I built a Flash component. I'd never written any Actionscript before, but Mark Shepherd's Springgraph example was a great starting point. After a few days of wrestling with the wonders of Flex I had something working.

I then wrote a PHP script that accessed the Mailana API to produce the link information, and then output it in an XML form my component could read in. I based it on the format Daniel Mclaren used for his handy Constellation Roamer plugin, since I'd used that before.

For the Boulder Twits site I didn't want to re-run the query every time to generate the XML. Though it only takes a fraction of a second to create, the system's still pre-alpha so I didn't want a production site depending on it. Instead I saved off several versions and pointed the component directly at the cached XML files. I also didn't want to require every viewer to rerun the force-directed layout, so I let each version arrange itself on my machine, saved the positions and paused the simulation by default. If you want to see the simulation running, try clicking the small play icon in the top left and drag a few people around to see the graph compensate.
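For anyone curious what the force-directed simulation is doing while it runs, here is a toy Python version. It is not the Flex Springgraph code the site uses: the constants, the cooling schedule and the function name are all assumptions picked for the example. Every pair of nodes pushes apart, every link pulls together in proportion to its strength, and the positions it returns are the kind of thing you could save off and reuse.

```python
import math
import random


def force_directed_layout(nodes, edges, iterations=300, seed=0):
    """Return {node: [x, y]} with strongly linked nodes placed closer together.

    nodes: list of node names. edges: dict mapping (a, b) to a positive strength.
    """
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for n in nodes}
    repulsion, step = 0.1, 0.02  # illustrative constants, not tuned values
    for i in range(iterations):
        force = {n: [0.0, 0.0] for n in nodes}
        # Every pair of nodes repels, which stops the graph collapsing to a point.
        for index, a in enumerate(nodes):
            for b in nodes[index + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = max(math.hypot(dx, dy), 1e-6)
                push = repulsion / (dist * dist)
                force[a][0] += push * dx / dist
                force[a][1] += push * dy / dist
                force[b][0] -= push * dx / dist
                force[b][1] -= push * dy / dist
        # Each edge is a spring whose pull grows with the relationship strength.
        for (a, b), strength in edges.items():
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            force[a][0] += strength * dx
            force[a][1] += strength * dy
            force[b][0] -= strength * dx
            force[b][1] -= strength * dy
        # Move each node along its net force, capped by a limit that shrinks
        # every iteration so the layout settles down.
        max_move = 0.1 * (1.0 - i / iterations)
        for n in nodes:
            fx, fy = force[n]
            move = math.hypot(fx * step, fy * step)
            scale = step if move <= max_move else step * max_move / move
            pos[n][0] += fx * scale
            pos[n][1] += fy * scale
    return pos
```

With three people where one link has strength 5 and the other strength 1, the strongly linked pair ends up noticeably closer together. Very large strengths would probably need scaling down, or a smaller step, to keep a simple simulation like this stable.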

I had a lot of fun putting this together. To be honest I was looking for a nice cozy code-womb to crawl into for a couple of weeks after draining my extrovert batteries through Defrag and lots of followup travel and meetings. This was just the ticket. Now I'm recharged and looking forward to meeting all the people I've discovered through compiling the list!

How a graph can find missing members

Missingtwits

Social network analysis often devolves into pure eye-candy, but I wanted my graph to be more than just a pretty face. I already talked about some of the patterns it reveals, but one of my goals was to uncover all the interesting folks I knew I must be missing off the list. How can a graph help with that?

I'd already analyzed who everyone I had listed talked to, to build the initial network. Next I had to analyze all of their recipients' tweets, to see if they'd replied, and how strong the connection was. By picking the most strongly connected outsiders, and placing them in the same graph, I could see where they fit in the network. In the example, their names are underlined so they stand out.

As you can see, I uncovered quite a few people like z3rr0, w1redone and technosailor who are part of the central group, as well as a lot of other well-connected twits. Once I've weeded out the non-locals, I'll update the list.

This technique is general enough to apply to any group with a partial membership you're trying to complete. For a simple case imagine finding new people to follow by analyzing strangers your friends talk to a lot. Stay tuned for more fun with this.
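Here is one way the idea could be sketched in Python. The scoring is an assumption: it ranks each outsider by the sum of their connection strengths to people already on the list, which is only one reasonable reading of "most strongly connected". The function name and the `frozenset`-keyed input format are also invented for the example.

```python
def suggest_missing_members(members, strengths, top_n=10):
    """Rank outsiders by their total connection strength to known members.

    members: iterable of names already on the list.
    strengths: dict mapping frozenset({a, b}) to a connection strength.
    Returns a list of (outsider, score) pairs, strongest first.
    """
    members = set(members)
    scores = {}
    for pair, strength in strengths.items():
        inside = pair & members
        outside = pair - members
        # Only links with exactly one end inside the group count.
        if len(inside) == 1 and len(outside) == 1:
            (outsider,) = outside
            scores[outsider] = scores.get(outsider, 0) + strength
    # Sort by score descending, then by name so ties come out in a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]
```

The same function covers the "new people to follow" case: pass your friends as `members` and the top of the returned list is the strangers they talk to most.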

Will privacy through obscurity work?

Hidingdog

Photo by Angel Shark

Jud's latest post on the generation chasm in attitudes to privacy got me thinking. I'm basing my business on the theory that people will trade privacy for utility in the right conditions. Looking around, users of everything from Facebook to Twitter to location-aware services like Brightkite are publicly posting all sorts of personal information. My parents get a neighbor to take the mail in when they're away so burglars won't know the house is empty. Now hundreds of thousands of people tell the world when and where they're on vacation. How come we're not all robbed?

Part of the answer is we're in the honeymoon period for the technology. Remember when every email you got was exciting because it was from a real person, not a robot responder or spam, and you could open attachments without worry? Once services go mainstream, malicious people will abuse them and the media will whip up moral panics.

Another part is the expectation that even though your information is technically public, nobody will bother tracking it down. For example, it's not easy to see conversations between people you don't know on the default Twitter interface. People's attitudes will change once the tools for mining that data improve. It's the same with company email. Everybody knows that their boss or IT admin could be reading their email, but it would be so time-consuming that most people have an expectation of privacy. This is the equivalent of the old Microsoft approach of security through obscurity. Though it has a bad reputation, it worked for a long, long time.

My prediction is we'll keep muddling through as always. There will be backlashes against the complete openness of the current web services as stalkers and spammers attack, but being lost in the crowd isn't a terrible strategy. There will be new social conventions, as we figure out a consensus on what's safe to put online, and new access controls, hopefully based on implicit information like who you've communicated with.

What does the BoulderTwits graph mean?

I just added a new feature to my directory of Boulder tech Twitter users: a map of the social network.

It shows who talks to who. A line between two people means that they've sent public replies to each other. The thicker the line, the more often they've exchanged tweets.

To turn this into a map, I use an automatic technique called force-directed layout to pull people with strong connections to each other closer. That means that groups of people who talk amongst themselves a lot will form clusters. So what does this map show?

Notorioushardcore

There's a central cluster of people who have a lot of connections with other people in Boulder. Many of these people I've met, and they do all seem influential in the community. If you wanted to get the ball rolling on a local project, these would be the folks to talk to.

Close teams

Imulusgraph

There's a noticeable side-group outside the main cluster. These are all Imulus employees, and it's clear they're using Twitter to talk with each other, but only Bruce and George have strong connections with other Boulder twitterers. The graph algorithm doesn't know anything about their employment. What's cool is that it automatically spots they're a team based purely on their communications.

Hubs

Davetaylorgraph

Dave Taylor is a great example of a network hub. He has conversations with a lot of people from different groups who don't know each other well. For example he talks to both brettfromtibet and bruce, but they don't talk to each other. People like Dave are vital in social networks because they link otherwise unconnected groups. If you need an introduction to someone you don't know, Dave is a good person to help, because he's got such a diverse range of friends.
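I picked Dave out by eye, but a rough score for this kind of hub is easy to sketch in Python. The measure here is an assumption chosen for illustration: for each person, count how many pairs of their contacts don't talk to each other directly. The function name and input format are made up too.

```python
from itertools import combinations


def bridging_scores(edges):
    """Count, for each person, the pairs of their contacts who aren't linked.

    edges: iterable of (a, b) pairs for people who talk to each other.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    scores = {}
    for person, contacts in neighbors.items():
        scores[person] = sum(
            1
            for x, y in combinations(sorted(contacts), 2)
            if y not in neighbors[x]
        )
    return scores
```

Someone who talks to both brettfromtibet and bruce, when those two don't talk to each other, scores a point for that pair. A person whose contacts all know each other scores zero, however many contacts they have.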

In my next post, I'll show how I uncovered some new people I'll be adding to the Boulder Twits list, thanks to graph analysis. If you're interested in learning more about the fun you can have with these sorts of networks, check out Valdis Krebs' awesome gallery of case studies.

How to grow a karass

Frogs

Photo by Thomas Hawk

In Cat's Cradle, Kurt Vonnegut invents a couple of terms I really like. A karass is a community of people without formal links but who work together to get things done, whereas a granfalloon is a grouping who imagine they have something in common, but the association is actually meaningless and unproductive.

As David Cohen put it to me, a big company's marketing department is a granfalloon, your personal network is a karass. This resonated with me because that's how we made things happen at Apple. My term at the time was "a conspiracy of engineers", but the idea was to discover curious and motivated people outside my immediate team (and sometimes even in suppliers like ATI or NVidia) who wanted to see Apple achieve some goal. We'd informally talk, figure out an approach that might work, often code up a prototype, and then approach our respective managers with a joint proposal.

This is the only way I've seen innovative things get done in big firms, but it's immensely difficult to create those informal networks. It took me years of water-cooler chats, lunches, popping into people's offices and general nosiness to get as far as I did. As I thought about the expertise and external contact location technology I'm working on, I realized that Mailana is all about building tools to enable karasses. I want somebody in my old position to be able to find collaborators far more easily, and so help companies get a lot more done with the same resources.

Tools alone won't ensure these informal groups emerge. They can't be ordered into existence; they have to grow organically. What the technology can do is provide an environment they can thrive in.

Yahoo’s mail API

R2d2mailbox

Photo by A Hermida

Despite their rocky year, Yahoo still have a massive email user base, so I was very interested when they announced a new API for plugging into their web client. I looked through the documentation, and unfortunately this release is pretty limited, though there's more to it than Gmail's latest interface. Google essentially just lets you embed normal iGoogle widgets into the mail side bar, with no way to interact with the user's mail. Yahoo does let you trigger UI actions like bringing up a search window, populating an add event popup or composing a message, but there's no way to access any data on the messages, or perform any modifications without user involvement.

I'm sure these are all precautions to protect users from malicious plugins reading private data, as is keeping it in a limited beta restricted to a small group of developers. While I'm disappointed I won't get the chance to do all the interesting data analysis I'd like to offer, this is a step in the right direction. I expect that seeing the demand for the applications you can build even with this limited functionality will push the industry towards more open interfaces. Then I can really have some fun!

How Nepomuk plans to file your email

Tagdrawer

Photo by Indie Wench

Peter Mucha just pointed me in the direction of the NEPOMUK project. It's an EU-funded semantic research effort, with a code name worthy of a Bond villain. It's very much part of the OWL/RDF approach to cracking the semantic problem, so it took me a while to dig down through the more abstract goals and understand what it actually did. It was worth the effort though, since it seems like their Soprano search engine for semantic data has enabled some interesting email functionality within the latest KDE release.

They seem to recognize that email is the ginger-haired stepchild of information management, left behind by the tools we take for granted for searching the web and files. A lot of their examples of using meta-data are related to messages; "Tell me which message a saved attachment file came from", "Find all the attachments related to these contacts", virtual mail folders, and even integrating calendar information with "Find all messages from people I've had meetings with in the last two weeks".

The downside is that while a lot of the underlying architecture to implement these cool features is provided by NEPOMUK, there's still the big and messy job of updating all the client applications to generate and use the meta-data. There's some work moving forward with the Akonadi mail client, and basic search support within KDE, but a lot of the really cool uses are still on the drawing board. There's no automatic tagging of emails or files based on content for example.

It's great to see them trying to tackle the challenge of email in innovative ways. They definitely have their eye on an interesting set of problems.

Why I love the US

Sneer

Photo by Zombizi

I ran across this UK newspaper column by Paul Carr on the Le Web conference. It's great, funny writing, but it also reminds me why America's the best place in the world to be an entrepreneur. Paul takes great glee in tearing down everyone involved, including the startups:

"… entrepreneurs from around the world each pay €1,500 to meet their
peers, demo their startups and generally try to pretend that their
businesses aren't completely and totally doomed."

Bob Sutton gathered a brilliant summary of the academic evidence that being negative about other people increases your status. One of the studies concluded “Only pessimism sounds profound. Optimism sounds superficial.”

What cynics like Paul don't get is that most of us in the startup world are well aware that the odds are against us, but we think it's worth doing anyway! US culture celebrates that risk-taking, but Britons tend to shake their heads and tell themselves it will all end in tears. Most of the time they're right. The trouble is, Google and Microsoft were crazy ideas in their time, and would never have made it without a lot of people supporting them despite the risk. By sneering, Britain guarantees it will never build a world-class tech company.