How I built the Boulder Twits graphs

Clockmechanism

Photo by Pierre J.

I knew I wanted to build a map of how people were connected in the Boulder tech scene. The first step was accessing the raw data, in this case all the Twitter messages from the first 60 local people I'd identified. I already had a system set up to rapidly analyze large numbers of email messages for my Mailana startup. It's modular, with different import components that access mail APIs like Exchange's MAPI/RPC, Gmail's IMAP and Outlook's Object Model, all outputting a stream of messages in standard XML form. Using Twitter's API it was pretty easy to build an importer. The only wrinkle was that I had to search for @someone in the message body, and add that to the recipients field in the XML.
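The @someone wrinkle can be sketched in a few lines of Python. This is only an illustration, not the Mailana importer itself: the function name, the lowercasing of usernames and the exact pattern are all assumptions made for the example.

```python
import re

# Assumption: usernames are letters, digits and underscores. The lookbehind
# skips the "@" in the middle of an email address such as pete@example.com.
MENTION = re.compile(r"(?<![A-Za-z0-9_])@([A-Za-z0-9_]+)")

def recipients_from_body(body):
    """Return the unique @usernames in a message, in order of appearance."""
    seen = []
    for name in MENTION.findall(body):
        lowered = name.lower()  # treat @Bob and @bob as the same recipient
        if lowered not in seen:
            seen.append(lowered)
    return seen
```

The returned list is what would be written into the recipients field of the XML message.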

That whirred away for a while pulling the complete message histories into my database, with indices created keyed on the recipients, as well as lots of other values. Sitting on top of that database I've got a Facebook App-style REST API that lets me run queries like "Tell me who sent messages to who within this group of people". Running that on the Twitter messages gave me a list that conceptually looked like this:

Alice to Bob: 10 messages sent, 3 messages received
Alice to Charles: 4 messages sent, 7 messages received

What I actually wanted was a single number for any relationship, a measure of how strongly Alice and Bob are connected. My choice was the lower of the sent and received counts, so in the above case:

Alice to Bob: Strength 3
Alice to Charles: Strength 4

I like this method for mail because it excludes bots like Facebook notification addresses that you never reply to, and penalises other sorts of unequal relationships, e.g. famous people you might have emailed who ignore you. Not that that ever happens to me of course.
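As a quick sketch of that rule in Python (the dictionary layout is just an assumption for the example, not how the database stores it):

```python
def strength(sent, received):
    """Connection strength is the lower of the two message counts."""
    return min(sent, received)

# (sent, received) counts from the Alice example above.
counts = {
    ("Alice", "Bob"): (10, 3),
    ("Alice", "Charles"): (4, 7),
}

strengths = {pair: strength(sent, received)
             for pair, (sent, received) in counts.items()}
```

A notification bot that sends you 25 messages and never gets a reply ends up with `strength(25, 0)`, which is zero, so it drops out of the graph.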

Now that I had a list of all the relationships in the community, I needed to display them. I wanted something that could be interacted with inside the browser, so I built a Flash component. I'd never written any ActionScript before, but Mark Shepherd's SpringGraph example was a great starting point. After a few days of wrestling with the wonders of Flex I had something working.

I then wrote a PHP script that accessed the Mailana API to produce the link information, and then output it in an XML form my component could read in. I based it on the format Daniel Mclaren used for his handy Constellation Roamer plugin, since I'd used that before.
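The real script was PHP and followed the Constellation Roamer format, neither of which is reproduced here. Purely to show the shape of the step, here is a Python sketch that turns a table of strengths into XML. The element and attribute names (`graph`, `node`, `edge`, `source`, `target`, `strength`) are invented for the example.

```python
import xml.etree.ElementTree as ET

def links_to_xml(strengths):
    """strengths maps (person_a, person_b) to a connection strength."""
    root = ET.Element("graph")  # hypothetical root element name
    names = sorted({name for pair in strengths for name in pair})
    for name in names:
        ET.SubElement(root, "node", id=name)
    for (a, b), value in sorted(strengths.items()):
        if value > 0:  # a zero-strength pair gets no line on the map
            ET.SubElement(root, "edge", source=a, target=b,
                          strength=str(value))
    return ET.tostring(root, encoding="unicode")
```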

For the Boulder Twits site I didn't want to re-run the query every time to generate the XML. Though it only takes a fraction of a second to create, the system's still pre-alpha so I didn't want a production site depending on it. Instead I saved off several versions and pointed the component directly at the cached XML files. I also didn't want to require every viewer to rerun the force-directed layout, so I let each version arrange itself on my machine, saved the positions and paused the simulation by default. If you want to see the simulation running, try clicking the small play icon in the top left and drag a few people around to see the graph compensate.
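For anyone curious what the simulation is doing, here is a bare-bones force-directed layout in Python. It is not the SpringGraph code. The constants, the step clamp and the function name are assumptions picked to keep the example small and stable.

```python
import math
import random

def force_layout(nodes, edges, steps=300, seed=0):
    """nodes: list of names. edges: {(a, b): strength}. Returns {name: (x, y)}."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for n in nodes}
    for _ in range(steps):
        force = {n: [0.0, 0.0] for n in nodes}
        # Every pair repels with an inverse-square force, so unrelated
        # people drift apart.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-6
                push = 0.01 / dist ** 3
                force[a][0] += push * dx
                force[a][1] += push * dy
                force[b][0] -= push * dx
                force[b][1] -= push * dy
        # Each link is a spring, and stronger links pull harder.
        for (a, b), strength in edges.items():
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            pull = 0.05 * strength
            force[a][0] += pull * dx
            force[a][1] += pull * dy
            force[b][0] -= pull * dx
            force[b][1] -= pull * dy
        # Take a small, clamped step so the simulation settles down.
        for n in nodes:
            for axis in (0, 1):
                step = 0.1 * force[n][axis]
                pos[n][axis] += max(-0.1, min(0.1, step))
    return {n: (p[0], p[1]) for n, p in pos.items()}
```

Running this once and saving the returned positions is the equivalent of letting the graph arrange itself ahead of time, so viewers load a finished layout instead of re-running the simulation.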

I had a lot of fun putting this together. To be honest I was looking for a nice cozy code-womb to crawl into for a couple of weeks after draining my extrovert batteries through Defrag and lots of followup travel and meetings. This was just the ticket. Now I'm recharged and looking forward to meeting all the people I've discovered through compiling the list!

How a graph can find missing members

Missingtwits

Social network analysis often devolves into pure eye-candy, but I wanted my graph to be more than just a pretty face. I already talked about some of the patterns it reveals, but one of my goals was to uncover all the interesting folks I knew I must be missing off the list. How can a graph help with that?

I'd already analyzed who everyone I had listed talked to, to build the initial network. Next I had to analyze all of their recipients' tweets, to see if they'd replied, and how strong the connection was. By picking the most strongly connected outsiders, and placing them in the same graph, I could see where they fit in the network. In the example, their names are underlined so they stand out.

As you can see, I uncovered quite a few people like z3rr0, w1redone and technosailor who are part of the central group, as well as a lot of other well-connected twits. Once I've weeded out the non-locals, I'll update the list.

This technique is general enough to apply to any group with a partial membership you're trying to complete. For a simple case imagine finding new people to follow by analyzing strangers your friends talk to a lot. Stay tuned for more fun with this.
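The outsider-ranking step can be sketched like this in Python. It assumes the strengths are already computed for every pair that involves at least one list member. The function name and the choice to sum an outsider's strengths across members are assumptions for the example.

```python
from collections import defaultdict

def rank_outsiders(members, strengths, top=5):
    """Rank non-members by their total connection strength to members.

    strengths maps (person_a, person_b) to a connection strength.
    Returns a list of (outsider, total_strength), strongest first.
    """
    members = set(members)
    totals = defaultdict(int)
    for (a, b), value in strengths.items():
        if a in members and b not in members:
            totals[b] += value
        elif b in members and a not in members:
            totals[a] += value
    # Sort by descending strength, then by name so ties are stable.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top]
```

The top few names are the candidates to drop into the graph and check by eye.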

Will privacy through obscurity work?

Hidingdog

Photo by Angel Shark

Jud's latest post on the generation chasm in attitudes to privacy got me thinking. I'm basing my business on the theory that people will trade privacy for utility in the right conditions. Looking around, everyone from Facebook to Twitter to location-aware services like Brightkite is publicly posting all sorts of personal information. My parents get a neighbor to take the mail in when they're away so burglars won't know the house is empty. Now hundreds of thousands of people tell the world when and where they're on vacation. How come we're not all robbed?

Part of the answer is we're in the honeymoon period for the technology. Remember when every email you got was exciting because it was from a real person, not a robot responder or spam, and you could open attachments without worry? Once services go mainstream, malicious people will abuse them and the media will whip up moral panics.

Another part is the expectation that even though your information is technically public, nobody will bother tracking it down. For example, it's not easy to see conversations between people you don't know on the default Twitter interface. People's attitudes will change once the tools for mining that data improve. It's the same with company email. Everybody knows that their boss or IT admin could be reading their email, but it would be so time-consuming that most people have an expectation of privacy. This is the equivalent of the old Microsoft approach of security through obscurity. Though it has a bad reputation, it worked for a long, long time.

My prediction is we'll keep muddling through as always. There will be backlashes against the complete openness of the current web services as stalkers and spammers attack, but being lost in the crowd isn't a terrible strategy. There will be new social conventions, as we figure out a consensus on what's safe to put online, and new access controls, hopefully based on implicit information like who you've communicated with.

What does the BoulderTwits graph mean?

I just added a new feature to my directory of Boulder tech Twitter users: a map of the social network.

It shows who talks to who. A line between two people means that they've sent public replies to each other. The thicker the line, the more often they've exchanged tweets.

To turn this into a map, I use an automatic technique called force-directed layout to pull people with strong connections to each other closer. That means that groups of people who talk amongst themselves a lot will form clusters. So what does this map show?

Notorioushardcore

There's a central cluster of people who have a lot of connections with other people in Boulder. Many of these people I've met, and they do all seem influential in the community. If you wanted to get the ball rolling on a local project, these would be the folks to talk to.

Close teams

Imulusgraph

There's a noticeable side-group outside the main cluster. These are all Imulus employees, and it's clear they're using Twitter to talk with each other, but only Bruce and George have strong connections with other Boulder twitterers. The graph algorithm doesn't know anything about their employment; what's cool is that it automatically spots they're a team just based on their communications.

Hubs

Davetaylorgraph

Dave Taylor is a great example of a network hub. He has conversations with a lot of people from different groups who don't know each other well. For example he talks to both brettfromtibet and bruce, but they don't talk to each other. People like Dave are vital in social networks because they link otherwise unconnected groups. If you need an introduction to someone you don't know, Dave is a good person to help, because he's got such a diverse range of friends.
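One rough way to put a number on "hub-ness" is to count the pairs of someone's contacts who don't talk to each other. The Python sketch below does that. The function name and the data layout (each link as a frozenset of two names) are assumptions for the example, and this is not a standard centrality measure.

```python
from itertools import combinations

def bridged_pairs(person, links):
    """Return pairs of person's contacts who have no link of their own.

    links is a set of frozenset({a, b}) entries, one per conversation pair.
    """
    contacts = sorted({other
                       for link in links if person in link
                       for other in link if other != person})
    return [(a, b) for a, b in combinations(contacts, 2)
            if frozenset((a, b)) not in links]
```

The longer the list for a given person, the more otherwise unconnected people they link together.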

In my next post, I'll show how I uncovered some new people I'll be adding to the Boulder Twits list, thanks to graph analysis. If you're interested in learning more about the fun you can have with these sorts of networks, check out Valdis Krebs' awesome gallery of case studies.

How to grow a karass

Frogs

Photo by Thomas Hawk

In Cat's Cradle, Kurt Vonnegut invents a couple of terms I really like. A karass is a community of people without formal links but who work together to get things done, whereas a granfalloon is a grouping who imagine they have something in common, but the association is actually meaningless and unproductive.

As David Cohen put it to me, a big company's marketing department is a granfalloon, your personal network is a karass. This resonated with me because that's how we made things happen at Apple. My term at the time was "a conspiracy of engineers", but the idea was to discover curious and motivated people outside my immediate team (and sometimes even in suppliers like ATI or NVidia) who wanted to see Apple achieve some goal. We'd informally talk, figure out an approach that might work, often code up a prototype, and then approach our respective managers with a joint proposal.

This is the only way I've seen innovative things get done in big firms, but it's immensely difficult to create those informal networks. It took me years of water-cooler chats, lunches, popping into people's offices and general nosiness to get as far as I did. As I thought about the expertise and external contact location technology I'm working on, I realized that Mailana is all about building tools to enable karasses. I want somebody in my old position to be able to find collaborators far more easily, and so help companies get a lot more done with the same resources.

Tools alone won't ensure these informal groups emerge. They can't be ordered into existence, they have to grow organically. What the technology can do is provide an environment they can thrive in.

Yahoo’s mail API

R2d2mailbox

Photo by A Hermida

Despite their rocky year, Yahoo still have a massive email user base, so I was very interested when they announced a new API for plugging in to their web client. I looked through the documentation, and unfortunately this release is pretty limited, though there's more to it than Gmail's latest interface. Google essentially just lets you embed normal iGoogle widgets into the mail side bar; there's no way to interact with the user's mail. Yahoo does let you trigger UI actions like bringing up a search window, populating an add event popup or composing a message, but there's no way to access any data on the messages, or perform any modifications without user involvement.

I'm sure these are all precautions to protect users from malicious plugins reading private data, as is keeping it in a limited beta restricted to a small group of developers. While I'm disappointed I won't get the chance to do all the interesting data analysis I'd like to offer, this is a step in the right direction. I expect that seeing the demand for the applications you can build even with this limited functionality will push the industry towards more open interfaces. Then I can really have some fun!

How Nepomuk plans to file your email

Tagdrawer

Photo by Indie Wench

Peter Mucha just pointed me in the direction of the NEPOMUK project. It's an EU-funded semantic research effort, with a code name worthy of a Bond villain. It's very much part of the OWL/RDF approach to cracking the semantic problem, so it took me a while to dig down through the more abstract goals and understand what it actually did. It was worth the effort though, since it seems like their Soprano search engine for semantic data has enabled some interesting email functionality within the latest KDE release.

They seem to recognize that email is the ginger-haired stepchild of information management, left behind by the tools we take for granted for searching the web and files. A lot of their examples of using meta-data are related to messages; "Tell me which message a saved attachment file came from", "Find all the attachments related to these contacts", virtual mail folders, and even integrating calendar information with "Find all messages from people I've had meetings with in the last two weeks".

The downside is that while a lot of the underlying architecture to implement these cool features is provided by NEPOMUK, there's still the big and messy job of updating all the client applications to generate and use the meta-data. There's some work moving forward with the Akonadi mail client, and basic search support within KDE, but a lot of the really cool uses are still on the drawing board. There's no automatic tagging of emails or files based on content for example.

It's great to see them trying to tackle the challenge of email in innovative ways. They definitely have their eye on an interesting set of problems.

Why I love the US

Sneer

Photo by Zombizi

I ran across this UK newspaper column by Paul Carr on the Le Web conference. It's great, funny writing, but it also reminds me why America's the best place in the world to be an entrepreneur. Paul takes great glee in tearing down everyone involved, including the startups:

"… entrepreneurs from around the world each pay €1,500 to meet their
peers, demo their startups and generally try to pretend that their
businesses aren't completely and totally doomed."

Bob Sutton gathered a brilliant summary of the academic evidence that being negative about other people increases your status. One of the studies concluded “Only pessimism sounds profound. Optimism sounds superficial.”

What cynics like Paul don't get is that most of us in the startup world are well aware that the odds are against us, but we think it's worth doing anyway! US culture celebrates that risk-taking, but Britons tend to shake their heads and tell themselves it will all end in tears. Most of the time they're right. The trouble is, Google and Microsoft were crazy ideas in their time, and would have never made it without a lot of people supporting them despite the risk. By sneering, Britain guarantees they'll never build a world-class tech company.

How to secure your web service

Dublincastle

Photo by Karl Randay

If you're including third-party content in your web pages, you can't stop a determined attacker. Browsers weren't designed with that scenario in mind, so by default any HTML you place on your pages has access to your site and cookies. The usual workaround for this is to scrub the external HTML on the server side to remove any Javascript, before passing it to the client.

The good news is this works pretty well, with platforms like Facebook and Myspace relying on it heavily. The bad news is it's practically impossible to make it perfect, because there are so many different ways of hiding scripts inside HTML. When I was implementing my own scrubber for Google Hot Keys, I relied on the Cross-site Scripting (XSS) Cheatsheet to find cunning examples to test it against. I was dismayed when I later realized that Facebook's scrubber was still vulnerable to some of these attacks.
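To make the idea concrete, here is a toy scrubber in Python. It takes the allowlist approach: rather than hunting for scripts to remove, it rebuilds the HTML from a short list of permitted tags and attributes and drops everything else. This is not the Google Hot Keys scrubber or any platform's real one, and the tag list, attribute list and URL schemes are assumptions for the example. Given how hard it is to get this perfect, treat it as an illustration rather than something to rely on.

```python
from html import escape
from html.parser import HTMLParser

# Assumed allowlists for the example; a real one needs careful review.
ALLOWED_TAGS = {"a", "b", "em", "i", "p", "strong"}
ALLOWED_ATTRS = {"a": {"href"}}
SAFE_SCHEMES = ("http://", "https://")

class Scrubber(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # greater than zero while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            # Keep an attribute only if it is allowlisted for this tag and
            # its value is a plain http(s) URL, which rules out javascript:.
            if (name in ALLOWED_ATTRS.get(tag, ()) and value
                    and value.strip().lower().startswith(SAFE_SCHEMES)):
                kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data))  # re-escape all text content

def scrub(html_text):
    parser = Scrubber()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Event handlers like `onclick`, `<script>` bodies and `javascript:` links never make it into the output because nothing outside the allowlist is ever copied across.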

Google itself has struggled with XSS issues, though they've been quick with fixes, so I was very pleased to see they've just published their internal security handbook. They've got the best explanation I've seen of all the rules like the same-origin policy that are designed to safeguard users from malicious scripts. There's also a great cookbook on how to build your own content scrubber. Even better, they lay out suggestions for how to truly secure the environment with future browser features.

This cheers me up a lot. I often feel like a Cassandra when I'm pointing out how insecure the status quo is, but it reminds me a lot of the early days of Windows when security was considered a low priority, and we're still watching the repercussions of that mistake. I don't want the public to lose trust in our services because of constant exploits once we start moving more valuable data into the reach of malicious third-parties.

It looks like the evolutionary model that's served web standards so well before may come to the rescue with smart ideas like fine-grained script blocking in the browser and content security policies. Until then, learn Google's handbook by heart and keep a constant eye out for new exploits!

Look at your mail from a whole new angle with Unblab

Unblab

David Cohen recently gave the heads-up on Unblab, an intriguing new web service giving you a new interface to your email. One of the first innovations you see is their newspaper-style layout:

Unblabscreenshot

I like this way of displaying message summaries. It’s a blend of the popular preview pane and the more concise list view. I could see this translating to a search interface very well too, looking something like ManagedQ’s web service.

They’re still in private beta, so I’m not clear on how everything works, but they’re trying out some fascinating innovations, like machine learning approaches for automatically organizing your mail, “Umm-brella” which forces senders to keep it brief, and access to parts of your account for assistants.

They support a wide variety of the most popular mail services, including some like Hotmail and Yahoo that don’t offer IMAP or POP access by default. I’ll be interested to see how they’ve managed this, and how well it works. I’ve run across screen-scraping tools like Web Mail Retriever, and Yahoo now uses a nerfed version of IMAP to support the Zimbra client, but there’s been no reliable and easy way to access all the main services.

I look forward to trying Unblab out once I get my invite. It looks like a very thoughtful and daring new approach to email interfaces. If you want to follow their progress, check them out on Twitter.