New version of the Constellation Roamer Flash network visualizer

Constellation
Photo by Don McCrady

I came across Constellation Roamer a few days ago, and was impressed with what Daniel Mclaren had produced. His responsiveness since then has been great; he immediately addressed some of the minor points in my review, like the missing price on the site and the limited documentation. Now he’s released a new version with a lot of extra features. The JavaScript support he offers could be very useful for some of my work; it will allow the display of extra information in the main page whenever the user selects a node, for example.
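I haven’t dug into the new JavaScript hooks yet, so take this as a guess at the shape of the glue code rather than Constellation’s real API: Flash components typically reach the page by calling a named JavaScript function you register, and something like the entirely hypothetical onNodeSelected below is what I have in mind for updating the main page when a node is picked.

```typescript
// Hypothetical glue code: the Flash movie would call a global function
// (registered here) whenever the user selects a node, and we update a
// plain HTML panel elsewhere on the page with the node's details.

interface SelectedNode {
  id: string;      // hypothetical field names, not Constellation's real API
  label: string;
  url?: string;
}

function onNodeSelected(node: SelectedNode): void {
  const panel = document.getElementById('node-details');
  if (!panel) {
    return;
  }
  // Show extra information about the selected node in the main page.
  panel.innerHTML = '';
  const title = document.createElement('h3');
  title.textContent = node.label;
  panel.appendChild(title);
  if (node.url) {
    const link = document.createElement('a');
    link.href = node.url;
    link.textContent = 'More about this node';
    panel.appendChild(link);
  }
}

// Expose the callback under a well-known name so the Flash component
// can reach it from ActionScript.
(window as any).onNodeSelected = onNodeSelected;
```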

It’s great to see him offering this Flash component, and the level of service he’s shown so far makes me very confident you’ll get strong support from him after the purchase. I’m looking forward to using it myself, and I’m sure it will be useful for a lot of other sites too.

How to visualize networks through Flash

Constellation

I recently came across Constellation Roamer by Asterisq, written by Daniel Mclaren. It’s a very interesting Flash-based tool for visualizing connected graphs within a browser. There’s a demo here if you’re interested in checking it out yourself.

I’ve not used it in any real-world tests so far, but from my exploration it seems pretty good for getting a visualization up and running. It obviously doesn’t compare with the hard-core scientific graph-layout packages, and it probably won’t scale to thousands of connections, but as long as you keep the data sets limited it performs well. There doesn’t seem to be much customization possible; the focus is on simplicity, and I couldn’t find any reference documentation. For example, I couldn’t see a way to alter individual edge weights, or tweak the overall simulation parameters like friction and time-steps. That means I can’t use it for the massive simulations of Outlook Graph, but I am able to display a small social network painlessly.
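For comparison, these are the kinds of knobs I’m used to having in my own layout code, none of which Constellation seems to expose. This is a generic force-directed sketch of my own, nothing to do with Constellation’s internals:

```typescript
// Generic force-directed layout step, just to illustrate the parameters
// (edge weights, friction, time-step) I couldn't find a way to tweak.

interface GraphNode { x: number; y: number; vx: number; vy: number; }
interface GraphEdge { from: number; to: number; weight: number; }

function layoutStep(
  nodes: GraphNode[],
  edges: GraphEdge[],
  friction = 0.9,    // velocity damping per step
  timeStep = 0.02,   // simulation time-step
  springLength = 100,
  springStrength = 0.1,
  repulsion = 10000
): void {
  const fx = new Array(nodes.length).fill(0);
  const fy = new Array(nodes.length).fill(0);

  // Pairwise repulsion keeps unconnected nodes apart.
  for (let i = 0; i < nodes.length; i++) {
    for (let j = i + 1; j < nodes.length; j++) {
      const dx = nodes[i].x - nodes[j].x;
      const dy = nodes[i].y - nodes[j].y;
      const distSq = Math.max(dx * dx + dy * dy, 0.01);
      const dist = Math.sqrt(distSq);
      const force = repulsion / distSq;
      fx[i] += (dx / dist) * force; fy[i] += (dy / dist) * force;
      fx[j] -= (dx / dist) * force; fy[j] -= (dy / dist) * force;
    }
  }

  // Weighted springs pull connected nodes toward their rest length.
  for (const edge of edges) {
    const a = nodes[edge.from], b = nodes[edge.to];
    const dx = b.x - a.x, dy = b.y - a.y;
    const dist = Math.max(Math.sqrt(dx * dx + dy * dy), 0.01);
    const force = springStrength * edge.weight * (dist - springLength);
    fx[edge.from] += (dx / dist) * force; fy[edge.from] += (dy / dist) * force;
    fx[edge.to] -= (dx / dist) * force;   fy[edge.to] -= (dy / dist) * force;
  }

  // Integrate with friction so the simulation settles down.
  for (let i = 0; i < nodes.length; i++) {
    nodes[i].vx = (nodes[i].vx + fx[i] * timeStep) * friction;
    nodes[i].vy = (nodes[i].vy + fy[i] * timeStep) * friction;
    nodes[i].x += nodes[i].vx;
    nodes[i].y += nodes[i].vy;
  }
}
```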

The Asterisq site was a bit vague on the total cost for a single-site license, so I had to go through most of the purchase steps to find out the $550 total. All in all, it could be a real time-saver if you want to build a small graph visualization into your site.

You can create beautiful charts with PHP/SWF

Swfchart1
Swfchart2

Swfchart3

I’m getting to the point where I need to visually display some of the information I’m analysing, as part of a web service. I’d looked at jpgraph, which seems to be the best-known server-side graph creation tool, and its gallery of example charts made me feel very sad. Here’s an example:
Jpgraph

I knew there had to be other alternatives out there, and I was over the moon when I discovered PHP/SWF Charts. It’s a Flash-based charting system, and it’s got antialiasing, good-looking fonts, transparency, 3D, shadows, animation, interactivity and a very simple API. You can see three examples from their gallery above. It’s letting me create visuals that I can be proud of. The only downside is that it requires a browser capable of running Flash, which excludes my iPhone. It’s free for the basic version, and you can get a single-domain license for $45 which lets you do a bit more customization. It’s designed for PHP, but internally it converts the PHP array arguments to XML before handing them to the Flash movie, so you can use it with any scripting language.
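Since the Flash movie ultimately just consumes XML, here’s a rough sketch of generating a chart description from another language. The element names are illustrative placeholders from memory, so check the charts documentation for the real schema before relying on them:

```typescript
// Rough sketch: building a chart description as XML from any language.
// The element names below are placeholders, not a guaranteed match for the
// library's real schema.

type Row = Array<string | number>;

function chartToXml(rows: Row[]): string {
  const escapeXml = (s: string) =>
    s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');

  const rowXml = rows
    .map(row =>
      '<row>' +
      row
        .map(cell =>
          typeof cell === 'number'
            ? `<number>${cell}</number>`
            : `<string>${escapeXml(cell)}</string>`
        )
        .join('') +
      '</row>'
    )
    .join('\n    ');

  return `<chart>\n  <chart_data>\n    ${rowXml}\n  </chart_data>\n</chart>`;
}

// Example: the same nested-array shape the PHP wrapper takes.
const xml = chartToXml([
  ['', 'Jan', 'Feb', 'Mar'],
  ['Visitors', 1200, 1900, 3000],
]);
console.log(xml); // serve this from whatever endpoint the Flash movie points at
```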

Do your taxes with implicit data

Turboscreenshot

Quicken’s TurboTax is the slickest and deepest online app I’ve used. I’ve been a fan since 2003, and they’ve just kept getting better. One thing that stood out this year was the unobtrusive but clear integration of their help forums into every page you’re working with. There’s a sidebar showing the most popular questions for the current section, ordered by view popularity. It’s applying Web 2.0-ish techniques, using page views to rank user-generated content, but for once it’s solving a painful problem. Maybe I’m just old, but I feel sad when I see all the great work teams are doing to solve mild consumer itches like photo organization that are already over-served, while my doctor’s practice still runs on DOS.
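The mechanism behind that sidebar is simple enough to sketch; the data shapes here are my own invention, not anything from Intuit:

```typescript
// Sketch of the basic idea: rank user-generated questions for the current
// section by view count, so the sidebar surfaces what people actually read.

interface HelpQuestion { id: number; section: string; title: string; views: number; }

function topQuestions(all: HelpQuestion[], section: string, limit = 5): HelpQuestion[] {
  return all
    .filter(q => q.section === section)   // only the section being worked on
    .sort((a, b) => b.views - a.views)    // most-viewed first
    .slice(0, limit);
}
```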

It was fascinating to read John Doerr’s thoughts on how Intuit was built, from his introduction to Inside Intuit. I’ve never managed to computerize my household finances (Liz has an amazing Excel setup that has to be seen to be believed), but their focus on customers has shone through all my encounters with them. It’s great to see they keep looking for ways to use the new techniques to improve their services; Microsoft could learn a lot from them. I know they sent someone to Defrag last year, so maybe I’ll see some more implicit web techniques when I do my ’08 taxes?

Turboanswer

How X1 approaches enterprise search

Skyscrapers
Photo by 2Create

X1 are best known for their desktop search tool, but they offer an enterprise-wide solution that tries to integrate a lot of different data sources to allow searches that cover all of a company’s information. It mostly sounds very similar to Google’s search appliance, but they do have an interesting architecture that includes an Exchange component. It uses server-side MAPI, which limits it to Exchange 2003 and earlier unless you download the optional MAPI components for 2007. There’s also no mention of hooking into the event model, so I’d be curious to know how much of a lag there is between a message arriving and it being indexed. For my email search I’m working on Exchange Web Services support, since that’s the supported 2007 API that replaces MAPI, and trying to get real-time access to the data by hooking into the Exchange event model.
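To make the lag question concrete, here’s the shape of what I’m aiming for with my own indexer: the mail source pushes an event as soon as a message arrives, so the lag is just the handler cost rather than a polling interval. Everything here is hypothetical pseudo-architecture of mine, not X1’s design or the real Exchange Web Services calls:

```typescript
// Hypothetical shape of an event-driven indexer, to make the "lag between
// a message arriving and it being indexed" question concrete. None of the
// names here come from X1 or Exchange Web Services.

interface Message { id: string; receivedAt: Date; subject: string; body: string; }

interface MailSource {
  // A polling source would instead expose something like
  // fetchSince(t: Date): Promise<Message[]>, and the worst-case lag
  // becomes the polling interval plus indexing time.
  onNewMessage(handler: (msg: Message) => void): void;
}

class Indexer {
  private lags: number[] = [];

  constructor(source: MailSource) {
    source.onNewMessage(msg => this.index(msg));
  }

  private index(msg: Message): void {
    // ...add the message to the search index here...
    const lagMs = Date.now() - msg.receivedAt.getTime();
    this.lags.push(lagMs);
  }

  averageLagMs(): number {
    return this.lags.length
      ? this.lags.reduce((a, b) => a + b, 0) / this.lags.length
      : 0;
  }
}
```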

It sounds as if they’re focusing more on the enterprise side of the world, after a recent change of management and a switch to a paid model for their desktop client. Back in November they mentioned signing up 60 large companies as customers for their enterprise service, which sounds promising, especially alongside their 40,000 desktop downloads at $50 each.

Visualizing the banking crisis

Bankvisual

The web gives us an amazing opportunity to use animation in visualizations. Showing change over time graphically, and letting users absorb and interact by pausing and scrubbing through the timeline, conveys a lot more information than a static image can. You can show an animation on TV, but that doesn’t give viewers a chance to pause, rewind and really understand what’s happening. Of course, just as designing a good 2D picture to show information is a lot tougher than outputting a textual list, working out how to get information across in animation takes a lot of skill. That’s why I’m so impressed by these visualizations of banks’ mortgage liabilities.

The two charts show how much trouble the main banks’ mortgages are in: how many are over 90 days delinquent on payments (the usual cutoff for the start of the foreclosure process), and the charged-off (aka written-down) value on all their mortgages. What’s fascinating is seeing the sudden explosion in both measures of trouble in the last few quarters as you play back the animation. It makes the magnitude of the shock very clear, and explains why so many financial folks have been freaking out far better than a static graph of the same figures could. Overall it does a good job of communicating some complex information in a very compact form.
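If you’re curious what the pause-and-scrub mechanics boil down to, here’s a generic browser sketch of my own; it has nothing to do with how the Flex charts are actually implemented:

```typescript
// Minimal pause/scrub mechanics: a slider picks a quarter, a timer advances it
// while playing, and the chart is simply redrawn for the current frame.

interface QuarterFrame { label: string; values: Record<string, number>; }

function attachTimeline(
  frames: QuarterFrame[],
  slider: HTMLInputElement,
  draw: (frame: QuarterFrame) => void
): { play: () => void; pause: () => void } {
  let timer: number | undefined;

  const pause = () => {
    if (timer !== undefined) {
      window.clearInterval(timer);
      timer = undefined;
    }
  };
  const render = () => draw(frames[Number(slider.value)]);

  slider.min = '0';
  slider.max = String(frames.length - 1);
  slider.value = '0';
  slider.addEventListener('input', () => {
    pause();   // scrubbing takes over from playback
    render();
  });

  const play = () => {
    pause();
    timer = window.setInterval(() => {
      slider.value = String((Number(slider.value) + 1) % frames.length);
      render();
    }, 500);   // half a second per quarter
  };

  render();
  return { play, pause };
}
```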

The graphs themselves are written in Flex, and are examples of the Boomerang data visualization technology that the OSG group has developed for internal business intelligence applications. On the main site they have some slightly more complex and flexible versions of the same charts. They’re doing very interesting work with projects like Savant and Hardtack to break down the barriers between the data silos that exist within most businesses. They seem to be approaching the problems with very modern techniques, using RSS and other tools that allow easy mashing-up of data from legacy systems. I’ll be interested to hear if they’ve looked at using email as a source too.

If you’re interested in more of the financial-nerd details of the mortgage meltdown, my favorite source is the Calculated Risk blog. Their analysis of the primary data on housing is invaluable.

Why play is the killer app for implicit data

Playing
Photo by RoyChoi

I recently ran across PMOG, and damn, I wish I’d thought of that! It’s a "Passive Multiplayer Online Game" that you play by surfing the web with their extension installed. You get points for each site you visit, and you can use those points to leave either gifts or traps for other players. There are also user-created missions, or paths, that involve visiting other sites.

Why is this so interesting? It’s a fantastic way to get permission to do interesting things with people’s browsing information. Players get in-game rewards for sharing, so there’s real reciprocity; you’re not just an evil corporation harvesting click-streams. Games are a great way to get people involved with a process too, with the instant rewards and status hierarchies that they generate. Even better, they’re free for the provider; all you have to do is provide fun and compelling rules. This means it’s a lot more likely that players will be willing to provide detailed information about the sites they’re visiting.

What could this mean in practice? Here are a few possibilities (with a rough scoring sketch below):

Site descriptions. You could get points for writing a short website description. It could be structured so that useful descriptions earn more points.

Comments. Contributing to a discussion attached to a website, through the extension interface, could also earn you points.

Rating. Simply giving a Stumbleupon-style thumbs up or down to a site could earn you a few points too.

Profile information. By interacting with a site, whether it’s leaving a surprise or adding meta-information, you’re making a connection with it. You could mark that connection on a profile page, and build up a rich set of favorites.
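Here’s the rough scoring sketch I promised above. The actions and point values are made up for illustration, not PMOG’s real rules:

```typescript
// Made-up scoring rules for the ideas above -- these are not PMOG's actual
// actions or point values, just a sketch of how the reciprocity could work.

type Action = 'visit' | 'rate' | 'comment' | 'describe';

interface Contribution { player: string; url: string; action: Action; text?: string; }

const POINTS: Record<Action, number> = {
  visit: 1,      // just surfing with the extension installed
  rate: 2,       // a thumbs up or down
  comment: 5,    // joining a discussion attached to a site
  describe: 10,  // writing a short site description
};

function scorePlayers(contributions: Contribution[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const c of contributions) {
    // Useful (here: longer) descriptions earn a bonus, as suggested above.
    const bonus = c.action === 'describe' && c.text && c.text.length > 80 ? 5 : 0;
    totals.set(c.player, (totals.get(c.player) ?? 0) + POINTS[c.action] + bonus);
  }
  return totals;
}
```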

This is the first compelling application I’ve seen that could persuade large numbers of users to happily share their browsing habits with other people. There are only 4,500 users right now, but I’ll be surprised if that doesn’t grow. Games are an incredibly powerful motivator; if you can tap into the human instinct for play, you’ll be amazed at how much work people will put into achieving the goals set by the system.

World War II and the Implicit Web

Spitfire

Traffic analysis is a field of espionage focused on learning about the enemy by looking at their communication patterns, without having to understand the content. Here are some examples from the Wikipedia entry:

  1. Frequent communications — can denote planning.
  2. Rapid, short communications — can denote negotiations.
  3. A lack of communication — can indicate a lack of activity, or completion of a finalized plan.
  4. Frequent communication to specific stations from a central station — can highlight the chain of command.
  5. Who talks to whom — can indicate which stations are ‘in charge’ or the ‘control station’ of a particular network. This further implies something about the personnel associated with each station.
  6. Who talks when — can indicate which stations are active in connection with events, which implies something about the information being passed and perhaps something about the personnel/access of those associated with some stations.
  7. Who changes from station to station, or medium to medium — can indicate movement, fear of interception.

Some of these might sound familiar to anyone interested in analysing implicit data. Number 4 sure sounds a lot like PageRank. The others can all be applied to any communications where you know the time, sender and recipients. Email content isn’t encrypted, but since computers can’t fully understand natural language it might as well be, so anything we can gather from the external characteristics is invaluable. There’s obviously a lot we could learn from the work that’s been done over the years.
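To make that concrete, here’s a toy version of the kind of external-characteristics analysis I mean, computed purely from senders, recipients and timestamps: message volume per pair (point 1), and a crude fan-in score for the chain-of-command idea (point 4). It’s my own toy code, not any agency technique:

```typescript
// Toy traffic analysis over email headers only: who talks to whom, how often,
// and who looks like a hub -- no message content needed.

interface MessageMeta { from: string; to: string[]; sentAt: Date; }

function trafficStats(messages: MessageMeta[]) {
  const pairCounts = new Map<string, number>();   // "alice->bob" -> count
  const inbound = new Map<string, number>();      // messages received per person

  for (const msg of messages) {
    for (const recipient of msg.to) {
      const key = `${msg.from}->${recipient}`;
      pairCounts.set(key, (pairCounts.get(key) ?? 0) + 1);
      inbound.set(recipient, (inbound.get(recipient) ?? 0) + 1);
    }
  }

  // Crude "central station" score: people that many distinct senders write to.
  const distinctSenders = new Map<string, Set<string>>();
  for (const msg of messages) {
    for (const recipient of msg.to) {
      if (!distinctSenders.has(recipient)) distinctSenders.set(recipient, new Set());
      distinctSenders.get(recipient)!.add(msg.from);
    }
  }
  const hubs = Array.from(distinctSenders.entries())
    .map(([person, senders]) => ({ person, fanIn: senders.size }))
    .sort((a, b) => b.fanIn - a.fanIn);

  return { pairCounts, inbound, hubs };
}
```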

Unfortunately it’s been exclusively the territory of government intelligence services, and they don’t publish too many papers. Some of the most useful work I’ve found has been declassified World War II reports, but even there cryptanalysis tends to get the most coverage. Probably the most fascinating one I found was the post-mortem report produced on the British TA work with German signals. It’s not very enlightening about the techniques they used, but the management recommendations they make are startlingly relevant for a modern tech company, once you get past the formal language:

"The policy of recruiting personnel for T.A. should emphasize the quality of personnel, not the quantity. Conforming to the usual pattern of history, in the beginning of such an undertaking as T.A., there is available only a very limited number of experienced people. Therefore, from the whole field of possible T .A. functions only the most useful and urgent should be undertaken. As the exploitation of these functions progresses, other possible functions will be recognised by the small but able original staff. Their suggestions for organisational changes and expansion should be encouraged and taken seriously. Only from operational experience can possible new functions be properly evaluated in the first instance. Once operational opinion is agreed that certain specific undertakings should be added, the additional personnel should be, as far as possible, chosen and trained by those who have the operational responsibility. … A wholesale creation of a T.A. staff with the a priori idea of providing a sufficient quantity of people to exhaust the field of T.A. … is wasteful and operationally inefficient."

History shows that small motivated teams usually beat the lumbering man-month monstrosities that large companies keep trying to assemble. I guess that’s a lesson they learnt back in 1942!

If implicit data’s so great, why did DirectHit die?

Tombstone

DirectHit was a technology that aimed to improve search results by promoting links that people both clicked on, and spent time looking through. These days we’d probably describe it as an attention data algorithm, which places it firmly in the implicit web universe. It was launched to great excitement in the late 90’s, but it never achieved its promise. There was some talk of it lingering on in Ask’s technology, but if so it’s a very minor and unpromoted part. If the implicit web is the wave of the future, why did DirectHit fail?

Feedback loops. People will click on the top result three or four times more often than the second one. That means that even a minor difference in the original ranking between the top result and the rest will be massively exaggerated if you weight by clicks. This is a case where the click rate is driven by the external factor of result ranking, rather than the content quality you’re hoping to rate. It’s a systematic error that’s common whenever you present the user with an ordered list of choices; for example, I’d bet that people at the top of a list of Facebook friends in a drop-down menu are more likely to be chosen than those further down. Unless you randomize the order in which you show lists, which is pretty user-unfriendly, it’s hard to avoid this problem.
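One partial fix, short of randomizing the order: estimate how much of each click is just position, and divide it out before using clicks as a quality signal. Here’s a toy version; the bias numbers are invented for illustration:

```typescript
// Toy position-bias correction: divide observed clicks by the expected
// click-through rate for the slot the result was shown in, so a result
// isn't rewarded just for having been ranked first.

const POSITION_BIAS = [1.0, 0.3, 0.2, 0.15, 0.1]; // rough relative CTR by rank; invented numbers

interface Impression { url: string; position: number; clicked: boolean; }

function adjustedClickScores(log: Impression[]): Map<string, number> {
  const scores = new Map<string, number>();
  for (const imp of log) {
    if (!imp.clicked) continue;
    const bias = POSITION_BIAS[Math.min(imp.position, POSITION_BIAS.length - 1)];
    // A click from further down the page counts for more than a click on the top slot.
    scores.set(imp.url, (scores.get(imp.url) ?? 0) + 1 / bias);
  }
  return scores;
}
```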

Click fraud. Anonymous user actions are easy to fake. There’s an underground industry devoted to clever ways of pretending to be a user clicking on an ad, and the same technology (random IP addresses, spoofed user agents) could easily be redirected to create faked attention data. In my mind, the only way to avoid this is to have some kind of trusted user identification associated with the attention data. That’s why Amazon’s recommendations are so hard to fake; you need to not only be logged in securely but also spend money to influence them. It’s the same reason that Facebook are pushing so hard for their Beacon project: they’re able to generate attention data that’s linked to a verified person.

It’s a bad predictor of quality. Related to the feedback loop problem, whether someone clicks on a result link and how much time they spend there don’t have a strong enough relationship to whether the page is relevant. I’ll often spend a lot of time scrolling down through the many screens of ads on Experts Exchange on the off-chance they have something relevant (though at least they no longer serve up different results to Google). If I do that first and fail to get anything, and then immediately find the information I need on the next result link I click, should the time spent there be seen as a sign of quality, or just of deliberately poor page design? This is something to keep in mind when evaluating attention data algorithms everywhere: you want to use unreliable data as an indicator and helper (e.g. in this case you could show a small bar next to results displaying the attention score, rather than letting it affect the ranking), not as the primary controlling metric.
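As a concrete example of the indicator-not-ranking idea, this is roughly what I mean: keep the original result order and just render a small attention bar alongside each result. A quick sketch:

```typescript
// Sketch: show the attention score as a small bar next to each result,
// without letting it change the ranking itself.

interface SearchResult { url: string; title: string; attention: number; } // attention in [0, 1]

function renderResults(results: SearchResult[], container: HTMLElement): void {
  container.innerHTML = '';
  for (const result of results) {            // original order preserved
    const row = document.createElement('div');

    const link = document.createElement('a');
    link.href = result.url;
    link.textContent = result.title;
    row.appendChild(link);

    const bar = document.createElement('div');
    bar.style.height = '4px';
    bar.style.background = '#6a9fd8';
    bar.style.width = `${Math.round(result.attention * 100)}px`;
    bar.title = `Attention score: ${result.attention.toFixed(2)}`;
    row.appendChild(bar);

    container.appendChild(row);
  }
}
```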

SEO Theory has an in-depth article on the state of click management that I’d recommend if you’re interested in the details of the fraud that went on when DirectHit was still alive.

Want to see a fresh approach to automated social analysis?

Hannes

I recently discovered Johannes Landstorfer’s blog after he linked to some of my articles. He’s a European researcher working on a thesis on "socially aware computers", exploring the new realms that are opened up once you have automated analysis of your social relationships based on your communications. There are some fascinating finds, like using phone bills to visualize your graph, or reflecting the uncertainty in the results of all our analysis by using deliberately vaguely-posed avatars. His own work is intriguing too; he’s got a very visual approach to the field, which generates some interesting user-interface ideas. I’m looking forward to seeing more of what he’s up to.