Who owns implicit data?

Sasquatch
Photo by Kurtz

Barney Moran recently posted a comment expressing his concerns about Lijit’s use of the statistical data it could gather from blogs that installed its widget. I checked out Lijit’s privacy policy and it’s pretty much what I expected: they’ll guard any personal information carefully, but reserve the right to do what they want with anonymous information on usage patterns. They’ve also pledged to stick to the Attention Trust’s standards for that usage data.

Barney is organizing a publishers union of bloggers, and he seems not so much concerned about privacy as he is about who’s profiting from the data. I’m biased since I’m a developer building tools to do interesting things with implicit data, but I assumed the trade I’m making when I install a widget is that the developers will get some value from the statistics, and I’ll get some extra functionality for my blog. Since they’re not putting ads in their content, the only other revenue stream I can see is aggregating my data with lots of other people’s, generating useful information and selling it on. This doesn’t hurt me and I’m not losing anything, so it feels like a win-win situation.

Josh Kopelman captured this idea in Converting data exhaust into data value. There’s a lot of ‘wasted’ data flying around the web that nobody’s doing anything with, and it seems like progress to capture some of it and turn it into insights. The trouble is I would say that. I’m a wide-eyed optimist who’s excited about the positive possibilities. Barney represents a strand of thinking that we’re not really running into yet, because no implicit data gathering service has a high enough profile to register with the public at large. I’d expect there will be more people asking questions as we get more successful and better known.

So far most of the bad publicity in this area has come when people feel their privacy has been violated, as with Facebook’s Beacon program, but the ownership question hasn’t really come up. After all, we expect big companies like Google and Amazon to make use of their usage statistics to do things like offer recommendations, so why should an external service be more controversial? I’m not sure it will turn out to be an issue, but the Beacon controversy shows we need to have an answer ready on who we think owns that information, and explain the bargain that people are making when they use our services.

Animate your social graph with SoNIA

Networkanimation

I was lucky enough to get some time with Professor Peter Mucha of UNC this week. He’s a goldmine of information on the academic side of network visualization and analysis, and one of the projects he clued me into was SoNIA.

One of the most exciting areas for visualization is animating over time. It’s an incredibly powerful way to demonstrate changes in an easy to understand way, but it’s also very hard to do. Building the tools is tough because dealing with time massively multiplies the amount of computation you need to do, and is a very tricky user-interface challenge too. SoNIA is an ambitious attempt to provide an open-source professional tool for animating network data.

It’s sponsored by Stanford, and developed by Skye Bender-deMoll and Dan McFarland. It’s designed to take data files that describe graphs at different states in time, and give you the control to lay out and animate those networks. It’s already been used for an impressive series of studies, and you should check out the included movies there if you want to get an idea of what it’s capable of. One of the best known is the study illustrated above, which used the software to demonstrate how your social network and obesity were correlated.

It’s freely downloadable if you want to give it a try yourself, and I’d recommend starting with Dan’s quick start guide to understand how to use it. It offers a lot of control over the underlying algorithms, but don’t be daunted by the space shuttle control-panel style of the UI; it’s possible to create some interesting results using many of the default settings. I’m looking forward to applying this to some of the communication data I’m generating from email, since animation is a great way to present the time data inherent in that information.
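To give a feel for what "graphs at different states in time" could mean for email, here is a minimal Python sketch that bins timestamped messages into weekly frames. It does not produce SoNIA's input format; the function name, the record layout and the sample data are all made up for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def slice_edges(messages, start, window):
    """Group (timestamp, sender, recipient) records into consecutive time windows.

    Returns {slice index: {(sender, recipient): message count}}.
    """
    slices = defaultdict(lambda: defaultdict(int))
    for sent_at, sender, recipient in messages:
        if sent_at < start:
            continue  # ignore anything before the first frame
        index = (sent_at - start) // window  # timedelta // timedelta is an int
        slices[index][(sender, recipient)] += 1
    return {i: dict(edges) for i, edges in sorted(slices.items())}


# Hypothetical sample data, not taken from any real mailbox
messages = [
    (datetime(2008, 1, 1, 9), "alice", "bob"),
    (datetime(2008, 1, 2, 14), "alice", "bob"),
    (datetime(2008, 1, 9, 11), "bob", "carol"),
    (datetime(2008, 1, 20, 16), "carol", "alice"),
]
frames = slice_edges(messages, start=datetime(2008, 1, 1), window=timedelta(days=7))
```

Each entry in `frames` is one snapshot of the network, so an animation tool would only need to lay out one frame after another.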

How to find new ways to visualize information

Semafore

It’s hard to imagine a more codified visualization than a calendar. You have a grid of cells, with each row representing a week, and that’s pretty much it. That’s what my good friend Kent Oberheu set out to change. Over the past 5 years he’s produced 60 monthly calendars that show time contorting in strange paths, all embedded in his art. The regularity of a traditional calendar is lost; instead you really see the flow of time. It helps me remember that every day is unique, just like his visualizations.

I don’t think these are going to replace standard calendars, but it’s a demonstration that looking at even the most mundane information in unusual ways can start you thinking in new directions. Kent’s better known as Semafore in the design world, and after years as a designer with Apple he’s just moved to a new position at Industrial Light and Magic. All the time I’ve known him he’s been pushing into new visual territory, finding magic combinations that tickle your aesthetic nerve and manage to get something across at the same time. He’s the first person I turn to when I need to talk about designing visualizations, and the combination of my engineering skills and his design direction has worked very well.

As I’ve gone deeper into the subterranean world of information that’s held on companies’ Exchange servers, it’s obvious that part of the toolkit for navigating has to be new ways of looking at that data. Some of that can be borrowed from the recent explosion of web visualizations, such as animated tag clouds for your email content, but some of the challenges are unique. How can you show a decent view of the different versions of your attachments over time, for example?

That’s where the Information Aesthetics blog comes in. It was the first place Kent sent me when I started to ask him for inspiration, and it’s full of the latest and most beautiful visualizations. Some of my recent finds from there include wordle, a Java applet that lets you create very good-looking word clouds, the Information Design Patterns Cookbook and the Mount Fear physical sculptures representing London crime statistics. I don’t know if I’m going to cut up cardboard to build an art piece from the Enron emails, but all of this starts trains of thought on how to adapt these innovations to my problem space. If you’re looking for inspiration too, I’d highly recommend subscribing.

Why can’t you create a calendar from your email?

Calendar
Photo by Churl

I’m very excited about the potential for email as a data source. I’m so passionate about it, it’s hard to remember that for most people it’s a new idea, and it’s not obvious how it could be useful. To explain, I usually point out existing companies like Spoke or Contact Networks that pull out basic contact information, or Microsoft’s Knowledge Network experiment that automatically locates experts. But what truly gets my heart racing are all the applications that haven’t been possible without easy access to email data.

One very promising area is extracting events from messages. Dates are one of the easier entities to spot in unstructured language. PHP has a built-in function, strtotime(), that can convert most English time strings into an absolute value, even fuzzy ones like "next Thursday" or "now". Getting the rest of the information like the name of an event is tougher, but imagine a calendar view that just shows the subject line of each email at any time that’s mentioned in the body of the message. You could restrict the view to only genuine contacts (people you’d replied to at a minimum) and then with a single click transfer any true events to your appointments calendar.
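Here is a rough Python sketch of that calendar view. Python has no built-in equivalent of strtotime(), so `find_dates` is a hypothetical stand-in that only understands "today", "tomorrow" and weekday names, and it assumes "next Thursday" means the first Thursday strictly after the day the message was sent. The sample emails are invented.

```python
import re
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]
PATTERN = re.compile(
    r"\b(today|tomorrow|(?:next\s+)?(" + "|".join(WEEKDAYS) + r"))\b",
    re.IGNORECASE,
)


def find_dates(body, sent_on):
    """Return the dates mentioned in body, resolved relative to sent_on."""
    found = []
    for match in PATTERN.finditer(body):
        phrase = match.group(1).lower()
        if phrase == "today":
            found.append(sent_on)
        elif phrase == "tomorrow":
            found.append(sent_on + timedelta(days=1))
        else:
            target = WEEKDAYS.index(match.group(2).lower())
            # Days until the first matching weekday strictly after sent_on
            ahead = (target - sent_on.weekday() - 1) % 7 + 1
            found.append(sent_on + timedelta(days=ahead))
    return found


def calendar_entries(emails):
    """Map each mentioned date to the subject lines of the emails mentioning it."""
    entries = {}
    for subject, body, sent_on in emails:
        for day in find_dates(body, sent_on):
            entries.setdefault(day, []).append(subject)
    return entries


# Invented messages: (subject, body, date sent)
emails = [
    ("Team lunch", "Are you free next Thursday?", date(2008, 6, 2)),
    ("Release plan", "We should ship tomorrow.", date(2008, 6, 2)),
]
entries = calendar_entries(emails)
```

A real version would need a far richer parser, plus the filter for genuine contacts, but the shape of the data is the same: a date on one side and a list of subject lines on the other.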

So why isn’t this already implemented? Gmail has got something similar for its Gcal integration, but it’s very limited in the formats it will recognize. There are articles out there like Learned Automatic Recognition of Appointments from Email by Lauren Paone at CMU, but as a quote from the paper puts it, "Although email is ubiquitous, large and realistic email corpora are rarely available for research purposes." Lauren faced serious obstacles even running realistic tests because he didn’t have enough email to work with.

What’s stopping progress is the mind-numbing pain of first getting data to prototype with (though the Enron corpus makes that somewhat easier) and then, even worse, trying to integrate with an email service like Outlook/Exchange or through IMAP. Innovation is rapid in the web world because anybody can spider the public internet and offer the results through a website. A few large companies like Google and Yahoo have access to their users’ emails and so can create email-based tools, and if you can persuade users to install a desktop plugin you can do the same, but the only way to move things forward is to open up access to a lot more developers. I’ll be posting on some of my efforts in that direction soon.

How social networks control your company

Chat
Photo by Belinketeneghe

Brokerage and Closure by Ronald Burt is a must-read for anyone interested in innovation and social networks. He’s a sociologist with the Chicago Graduate School of Business who’s spent years mapping and analyzing the patterns of relationships in large companies like Raytheon. This book describes how new ideas, trust and power flow directly from these networks.

The title refers to the two forces that shape who you talk to. Closure is the technical term for how insular a group of people are, measured by the strength of relationships between all the insiders, and the weakness of ties with outsiders. If you draw a graph of the communications within a group with high closure, you see a lot of lines between the members, and few contacts with others:
Closure
In everyday language, a cluster of people with high closure would be called a clique. They form because they have some big advantages. It’s a lot easier to trust someone you’ve no experience with if you share mutual friends, because the risk to their reputation will be severe if they let you down. The dense pattern of communications also makes sure that practices and beliefs get spread and standardized quickly throughout the group.

Large organizations are made up of many of these self-contained teams, each with their own shared experiences, ideas and ways of doing things. Brokerage is the act of bridging the gaps, or structural holes, between these groups in the network. People who have connections with multiple groups that would be otherwise unconnected are known as brokers or bridges.

Broker

They play an important role in innovation because they have the chance to introduce good ideas from one team into another, or combine partial insights from multiple groups into a new approach to a problem. They also have political advantages because they have more information about the motivations and goals of other teams, and can use that knowledge to help steer decision-making to avoid conflicts and gain support for initiatives.
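A first pass at spotting candidate brokers can be sketched in Python by listing, for each person, the other teams they have ties into. This is a crude illustration with invented names and teams; it does not check whether those teams would be unconnected without that person, and it is not the analysis Burt uses.

```python
def find_bridges(ties, team_of):
    """Return {person: sorted list of other teams that person has ties into}."""
    reach = {}
    for a, b in ties:
        if team_of[a] != team_of[b]:
            reach.setdefault(a, set()).add(team_of[b])
            reach.setdefault(b, set()).add(team_of[a])
    return {person: sorted(teams) for person, teams in reach.items()}


# Invented organization
team_of = {"ann": "sales", "bob": "sales", "cat": "eng", "dan": "eng", "eve": "ops"}
ties = [("ann", "bob"), ("cat", "dan"), ("bob", "cat"), ("bob", "eve")]

bridges = find_bridges(ties, team_of)
# People reaching the most outside teams are the most interesting candidates
ranked = sorted(bridges, key=lambda person: len(bridges[person]), reverse=True)
```

Here bob is the only person with ties into two other teams, so he comes out at the top of the list.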

Where Burt really shines is the application of this general model to the wealth of data from sociological studies within companies, together with his own personal experiences of working with large businesses. He sets out to prove 4 ‘stylized facts’ about how brokerage and closure work in practice:

Brokers do better. He uses network analysis together with personnel records to show that people who have strong connections outside their immediate team get paid more, and promoted faster.

Brokers have better ideas. Analyzing the ranking of improvements for a supply-chain management department together with the connectedness of the people suggesting them, he builds a case that brokers do better because of the quality of the ideas they come up with.

Brokerage is useless without closure. This is less of a slam-dunk, but he gathers evidence that brokers don’t help when the teams themselves are fragmented and poorly coordinated. Intuitively this makes sense: groups who can’t communicate internally won’t be able to execute even given the best ideas.

The echo chamber amplifies closure. Treating networks as information circuits ignores the primate biases that actually guide our social behavior. In particular, etiquette demands that we avoid contradicting a conversation partner when possible. This and similar habits mean that reputations are exaggerated in a feedback loop through gossip, since people you talk to will tend to agree with your assessment of someone, even if they don’t hold the same opinion. This gives the illusion of corroborating evidence for your views, and tends to tighten the bonds that bind a group together and more strongly exclude outsiders. This is a tough one to tease out from the data, but he shows that the more mutual contacts you share with someone, the stronger your opinion of them, even if that opinion disagrees with the assessments of your shared contacts.

This is vital reading for anyone dealing with social networks because of the applications of these theories to the design of our tools. At the start he talks about the delusion that having lots of contacts in a network adds value, when instead the really valuable connections are those outside your immediate group, and how this is where businesses like LinkedIn and Tacit should be focusing their efforts.

I’m particularly interested because most of my work has been aimed at making brokerage easier and faster. Defrag Connector was about establishing initial trust between conference attendees by revealing mutual friends. I’m analyzing email to reveal the existing communication networks, and identify good candidates for brokerage contacts because they’re experts in a helpful area, or have external contacts that would be useful. Most of his data comes from self-reported surveys of who people talk to; I’d love to run some of his work against my large company email data sets. He mentions Valdis Krebs in the foreword, but I was disappointed I didn’t see any references to his work deriving networks from implicit communication data.
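As a rough illustration of deriving a network from email rather than surveys, this Python sketch builds contact sets from sender and recipient lists and then looks up the contacts two people share. It is a toy under my own assumptions (any single message counts as a tie), not how Defrag Connector or my email analysis actually works, and the names are invented.

```python
from collections import defaultdict


def build_contacts(messages):
    """Build {person: set of people they have exchanged mail with}."""
    contacts = defaultdict(set)
    for sender, recipients in messages:
        for recipient in recipients:
            if recipient != sender:
                contacts[sender].add(recipient)
                contacts[recipient].add(sender)
    return contacts


def mutual_contacts(contacts, a, b):
    """Return the people who are in contact with both a and b."""
    return (contacts.get(a, set()) & contacts.get(b, set())) - {a, b}


# Invented messages: (sender, list of recipients)
messages = [
    ("ann", ["bob", "cat"]),
    ("dan", ["bob"]),
    ("dan", ["cat"]),
]
contacts = build_contacts(messages)
shared = mutual_contacts(contacts, "ann", "dan")
```

In this example ann and dan have never written to each other, but bob and cat could vouch for both of them.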

Burt is writing for an academic audience, so he presents a lot of the primary data backing up his arguments, which can make it a tough read for generalists like me. He’s got a readable style though, and I love some of the anecdotes that pop up throughout, such as the quote from a manager explaining that when analyzing improvement ideas "that were either too local in nature, incomprehensible, vague or too whiny, I didn’t rate them."

Death of a startup

Graveyard
Photo by Auchinoon

My old roommate Dave taught me snowboarding, and one thing he said stuck with me: "If you don’t fall down at least once every day, you’re not pushing yourself hard enough". (He also comforted me with the claim that "chicks dig scars" after I impaled my leg on a fencepost on my first day out.) One of the things I’ve found liberating here in the US compared to England is that it’s possible to fail without being labeled as a failure. On that topic Bob Sutton has a post on why "Am I a success or a failure?" is the wrong question to ask.

I’ve never been through the death-throes of a startup, but Visual Sciences, a games startup I worked at for four years, collapsed in a painful bankruptcy throwing a lot of good friends out of work. Andrew Hyde laments the sense of shame that still comes when you’re involved in a failed business, and like me wishes there were more post-mortems out there to help us all learn. Nick Napp, founder of the promising Disruptor Monkey, has taken up that challenge with a post explaining what happened to the company. It’s tough because it’s an emotionally charged topic, and there are always details that have to remain private, but he’s done a great job covering what he’s learnt. Now I guess it’s up to me to pick one of my own professional failures and return the favor.

Easily create gorgeous graphs with the Google Charts API

I’ve looked at a lot of ways to create graphs dynamically on the web. PHP/SWF charts are fantastic if you want beautiful results, a lot of options, and interactivity, but they require Flash, which both limits the platforms that can use them, and can result in slower loading. For better compatibility, you need something that generates images on the fly.

I’d investigated using jpgraph, but the results looked really ugly and it takes up precious cycles on your own server. Then I discovered a free Google web service that generates images on the fly for you, the Charts API. The pictures above are examples of the high-quality results it produces, with clean fonts, nice 3D and most importantly antialiasing. The API is incredibly simple to use: you just pass in the data and options as parameters to the URL. You don’t even need to register or get a key. Here are the URLs for the two images:

http://chart.apis.google.com/chart?cht=p3&chs=480x200&chd=s:Hellob&chl=May|June|July|August|September|October

http://chart.apis.google.com/chart?cht=lc&chd=s:pqokeYONOMEBAKPOQVTXZdecaZcglprqxuux393ztpoonkeggjp&chco=676767&chls=4.0,3.0,0.0&chs=480x200&chxt=x,y&chxl=0:|1|2|3|4|5|1:|0|50|100&chf=c,lg,90,76A4FB,0.5,ffffff,0|bg,s,EFEFEF
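If you'd rather generate these URLs from code, here is a small Python sketch that rebuilds the pie chart URL. The `s:` prefix in chd is the API's simple encoding, where each data point becomes one character standing for a value from 0 to 61 (A-Z, then a-z, then 0-9). The helper names are my own, and the sketch only builds the string without contacting the service.

```python
from urllib.parse import urlencode

BASE_URL = "http://chart.apis.google.com/chart"
# Simple encoding alphabet: A-Z is 0-25, a-z is 26-51, 0-9 is 52-61
SIMPLE = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"


def simple_encode(values, max_value=None):
    """Scale values onto 0..61 and map each one to a single character."""
    top = max_value if max_value is not None else max(values)
    if top <= 0:
        raise ValueError("max_value must be positive")
    chars = []
    for value in values:
        index = round(value * 61 / top)
        chars.append(SIMPLE[min(61, max(0, index))])  # clamp out-of-range values
    return "s:" + "".join(chars)


def pie_chart_url(values, labels, size="480x200", max_value=None):
    """Build a 3D pie chart URL using the parameters seen in the example above."""
    params = {
        "cht": "p3",
        "chs": size,
        "chd": simple_encode(values, max_value),
        "chl": "|".join(labels),
    }
    # Leave ':' and '|' unescaped so the result matches the hand-written form
    return BASE_URL + "?" + urlencode(params, safe=":|,")


months = ["May", "June", "July", "August", "September", "October"]
url = pie_chart_url([7, 30, 37, 37, 40, 27], months, max_value=61)
```

With those six values and a maximum of 61, the data string comes out as `s:Hellob`, the same as in the first URL.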

While it’s easy to get started with this style, it does have some downsides. Since the data is encoded as part of the URL, there’s a hard limit on how many points you can have, because some systems choke on URLs over 2000 characters long. The API also doesn’t support as many styles or options as PHP/SWF, and no animation is possible.
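One way to stay under that limit is to thin out the data before encoding it. This Python sketch assumes the one-character-per-point encoding used in the URLs above and takes the 2000 character figure as its default; the function name and the approach of keeping every n-th point are my own choices.

```python
def thin_to_fit(points, base_length, limit=2000):
    """Keep every n-th point so the URL stays within limit characters.

    base_length is the length of the URL without any data characters.
    """
    room = limit - base_length
    if room <= 0:
        raise ValueError("the URL is already too long without any data")
    step = -(-len(points) // room)  # ceiling division
    return points[::max(step, 1)]


# 5000 made-up readings squeezed into a URL with 100 characters of other parameters
kept = thin_to_fit(list(range(5000)), base_length=100)
```

Dropping points like this loses detail, so averaging each run of points instead might suit noisy data better.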

Despite those disclaimers, this is an amazing tool, and I’ll be having a lot of fun with it. One of my favorite features is the map graph type, which lets you easily specify just colors and states or countries, and it generates an image showing that on a simple map. It would be insanely easy to create some geographic data visualizations using it if you’ve got interesting data. Here’s an example of the results:

http://chart.apis.google.com/chart?chco=f5f5f5,edf0d4,6c9642,365e24,13390a&chd=s:fSGBDQBQBBAGABCBDAKLCDGFCLBBEBBEPASDKJBDD9BHHEAACAC&chf=bg,s,eaf7fe&chtm=usa&chld=NYPATNWVNVNJNHVAHIVTNMNCNDNELASDDCDEFLWAKSWIORKYMEOHIAIDCTWYUTINILAKTXCOMDMAALMOMNCAOKMIGAAZMTMSSCRIAR&chs=440x220&cht=t
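The map URL follows the same pattern: chld is the two-letter state codes run together, and chd carries one encoded character per state in the same order. Here is a hedged Python sketch that assembles it from a dictionary. The function name is hypothetical, the levels must already be integers from 0 to 61, and the colors are simply passed through as the chco list.

```python
from urllib.parse import urlencode

BASE_URL = "http://chart.apis.google.com/chart"
# Simple encoding alphabet: A-Z is 0-25, a-z is 26-51, 0-9 is 52-61
SIMPLE = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"


def us_map_url(levels, colors, size="440x220", background="eaf7fe"):
    """Build a US map chart URL from {state code: level between 0 and 61}."""
    for code, level in levels.items():
        if len(code) != 2 or not 0 <= level <= 61:
            raise ValueError("bad entry: %r -> %r" % (code, level))
    params = {
        "cht": "t",
        "chtm": "usa",
        "chs": size,
        "chld": "".join(levels),  # state codes, in the same order as the data
        "chd": "s:" + "".join(SIMPLE[level] for level in levels.values()),
        "chco": ",".join(colors),
        "chf": "bg,s," + background,
    }
    return BASE_URL + "?" + urlencode(params, safe=":,")


# The first three states from the example URL, with the levels its data decodes to
map_url = us_map_url({"NY": 31, "PA": 18, "TN": 6}, ["f5f5f5", "edf0d4", "13390a"])
```

Feeding it a dictionary straight out of a database query would be enough to get a first geographic view of your data.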