Trying to sell a release with no new features

"There’s really nothing to it. There’s no story, so it’s really hard to say anything."

This video of a friend-of-a-friend desperately trying to find something good to say about the latest release of his software brought back memories of trade shows past. When the powers-that-be want to bump up the version number, but don’t synchronize that with any actual development schedules, you end up trying to find something, anything, to demonstrate.

Visualizing the banking crisis

[Image: bank mortgage visualization]

The web gives us an amazing opportunity to use animation in visualizations. Showing change over time graphically, and letting users absorb and interact by pausing and scrubbing the timeline, conveys a lot more information than a static image can. You can show an animation on TV, but that doesn't give viewers a chance to pause, rewind and really understand what's happening. Of course, just as designing a good 2D picture to show information is a lot tougher than outputting a textual list, working out how to get information across in animation takes a lot of skill. That's why I'm so impressed by these visualizations of banks' mortgage liabilities.

The two charts show how many of the main banks' mortgages are in trouble, measured either as loans over 90 days delinquent on payments (the usual cutoff for the start of the foreclosure process) or as the charged-off (aka written-down) value across all their mortgages. What's fascinating is seeing the sudden explosion in both measures of trouble over the last few quarters as you play back the animation. It makes the magnitude of the shock very clear, and explains why so many financial folks have been freaking out, far better than seeing the same figures in a static graph. Overall it does a good job of communicating some complex information in a very compact form.

The graphs themselves are written in Flex, and are examples of the Boomerang data visualization technology that the OSG group has developed for internal business intelligence applications. On the main site they have some slightly more complex and flexible versions of the same charts. They're doing very interesting work with projects like Savant and Hardtack to break down the barriers between the data silos that exist within most businesses. They seem to be approaching the problems with very modern techniques, using RSS and other tools that allow easy mashing-up of data from legacy systems. I'll be interested to hear if they've looked at using email as a source too.

If you’re interested in more of the financial-nerd details of the mortgage meltdown, my favorite source is the Calculated Risk blog. Their analysis of the primary data on housing is invaluable.

Search massive XML datasets with MarkLogic

[Image: pile of books. Photo by GeorgMayer]

After my last post on the MarkMail project, I heard back from MarkLogic's Jason Hunter with some more information about the underlying implementation. Almost all the capabilities are provided by the MarkLogic database server, which seems to offer an impressive set of features. I initially had some trouble finding the technical information on their main site, since it's mostly geared towards the high-end content publishers who are the main users of the system, but then I came across their developer center.

What is MarkLogic Server? is probably the best place to start for an overview of what they offer. Essentially they differ from a standard relational database by accepting comparatively unstructured data, without a rigid schema, and by focusing on the search and retrieval performance you need for any publishing system. As they've demonstrated with MarkMail, this makes it a good fit for large collections of email too. The technology has been battle-tested in deployments handling terabytes of data, and it can run as a distributed cluster so you can scale performance to cope with heavy loads.

As well as the MarkMail demonstration, they also have the Executive Pay Check site that lets you do a live search on 14A filings to see the salaries of leaders at public companies. This is interesting mostly because it’s doing a good job coping with some theoretically structured, but in practice quite messy, source data, with inconsistent naming and formatting for the tables holding the filings. It would require some heavy massaging to get this into a traditional relational database, but MarkLogic seems to be a lot better at handling that sort of problem.

There's a free community version of the engine available for download, so I'll be experimenting with it when I have the chance. An active developer community has grown up around the product over the last few years, with lots of documentation that I'm absorbing to improve my understanding. I'm surprised that it isn't better known in the search world; it seems to offer some unique features that would let you build an interesting search engine for all kinds of rich content.

Snow in Boulder

[Image: snow in Boulder]

I flew into Denver last night, and caught the last snow storm of the year (according to the locals' guess). I'm loving it, even though it made the drive to Boulder an adventure. I met some friends for dinner at The Kitchen, somewhere I'd wanted to visit after hearing about its slow food philosophy. It must be tough to get local produce at this time of year, but the potato fennel soup and herb gnocchi I had were great winter fuel. Even better were the garlic fries, which seemed like they might show a British influence, since they were the closest approximation to chips I've found over here. Thick, with a crispy surface, but a soft and fluffy interior like mashed potatoes, they really hit the spot.

This morning was beautiful, with snow making the trees look more like coral. Josh and Rob from Eventvue joined me for breakfast at Burnt Toast. Colorado has the best breakfast places; me and Liz still speak in awed tones of the meal we had last year at Dozens. Burnt Toast didn't disappoint, with a light and fluffy breakfast burrito and very friendly service. I got to hear all about Josh and Rob's startup adventures. Their determination to fight past all the inevitable problems was truly impressive, and now they've got a completed first release to show for it.

See where they filmed M*A*S*H

[Image: rusting M*A*S*H army truck]

I spent today leading a crew fixing up a trail near the old film set for M*A*S*H. Inside what’s now Malibu Creek State Park, anybody who’s seen the show will instantly recognise the chaparral-covered hillsides that doubled as Korea in the opening credits. Even better, there’s actually a couple of army vehicles left over from the filming! A wildfire swept through during the original series and there wasn’t enough time to get them out, so now they’re part of the landscape.

If you want to see it yourself, it's pretty easy to hike or bike to. It's around 2.5 miles in, with a few hundred feet of elevation gain. Here's a map showing how to get there:

[Embedded Google Map]

You start off at Malibu Creek State Park. Take the Las Virgenes Road exit south from the 101, or from the PCH head north up Malibu Canyon Road, which turns into Las Virgenes. The entrance is well signed, and it costs around $5 to park. I recommend the lower lot as it’s usually less crowded, and is closer to the trail head.

Take the Crags Road trail from the bottom of the parking lot. Stay on the fire road for around half a mile, and you'll find yourself near the visitor center. The road continues, with the steepest uphill section of the route. You'll pass Century Lake on your left once you reach the summit. In the summer it's a relief to detour to the water's edge, and walk along the lake's shore. If you look at the far side you'll see some trees that look out of place. Those redwoods and pines were planted when the area was a privately-owned resort, and their shade is a good shelter from the heat if you make it that far around the lake on the Forest Trail.

To get to the M*A*S*H site itself, keep straight on along Crags Road for 1.5 miles. You shouldn't have any trouble spotting the location; the rusting Jeep and ambulance are right beside the trail. If you're biking in and want more of a challenge, you can continue a quarter of a mile and take the strenuous Bulldog trail that branches off to the left. Be prepared for some relentless uphill climbing for several miles.

Hikers can check out the Lost Cabin trail we worked on today. It heads south off the road a little before the set. Thanks to a great crew of volunteers rounded up by REI, we cleared a lot of the overgrowth, and repaired the tread where it had washed out, so it should be a lot more pleasant than it used to be. It dead-ends after around a mile, but it goes through some great scenery. Here’s a shot from where we stopped for lunch:

[Image: landscape near the M*A*S*H site]

What’s the best way to search large amounts of email?

[Image: MarkMail screenshot]

MarkMail is a really interesting demonstration site for MarkLogic’s technology. They host archives of a number of development mailing lists for projects like Apache and Perl. You can search within each list, and the results are presented in a three panel format.

[Image: MarkMail results screenshot]

The left panel shows you the frequency of the search terms over time, and suggests some different ways to narrow your search by focusing on subsets of the list or particular contributors who mention the term frequently. The middle panel is more like a conventional results page, listing links to all the matching messages. It also offers the ability to reorder the results by date instead of relevance. The right panel shows you the content of the message, and other matching messages from around the same time.

I like this interface a lot; it's the best presentation of time in search results that I've seen, combining the information offered by Google Trends with all the facilities of a normal search. I'm a big believer in using a horizontal split for previews too.

Beyond presentation, they also offer a lot of advantages over a web search engine in their understanding of mail messages. They allow you to search on subjects, on authors, or for unquoted text, and they can ignore boilerplate material like disclaimer sigs and checkin notices. Much like Krugle's focus on function names, they can also use their knowledge of the structure to offer more relevant general results, by giving more weight to the subject line than to text in the body when working out the relevance of a result. This gives them an advantage over Google searching the same content as a web archive, since Google has no idea of the significance or importance of any of the parts of each page. Anyone who's ever tried to do a mailing list search for "thread" through Google will know that it can be hard if the archive interface includes any elements that use thread to refer to topic-browsing, such as "Next in thread". As an example, here's a Google search on the postgresql archives for thread where two of the top three results are for thread interface references. By contrast, all of the MarkMail results for the same search cover discussion of threads in the body of the message.
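To make that weighting idea concrete, here's a toy PHP sketch of field-weighted scoring. This is purely my own illustration of the general technique, not MarkMail's actual algorithm; the function name and the weights are invented for the example.

<?php
// Toy field-weighted relevance scoring: a hit in the subject line counts
// for far more than a hit in the body. The weights are made up for
// illustration; a real engine would tune them against test queries.
function scoreMessage($term, $subject, $body)
{
    $subjectWeight = 10.0;
    $bodyWeight = 1.0;

    $subjectHits = substr_count(strtolower($subject), strtolower($term));
    $bodyHits = substr_count(strtolower($body), strtolower($term));

    return ($subjectHits * $subjectWeight) + ($bodyHits * $bodyWeight);
}

// "thread" once in the subject scores 10, while a page that only says
// "Next in thread" in its navigation boilerplate would score 1
echo scoreMessage("thread", "Deadlock in the thread pool", "We see a hang...") . "\n";
?>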

Under the hood they’re using an interesting mix of technology. On their blog, Jason Hunter posted a presentation covering the nuts and bolts of how they’ve built their search engine. Like me, they’ve gone the route of defining an XML format to store the messages in.
[Image: MarkMail presentation slide 1]

[Image: MarkMail presentation slide 3]

I'm currently using XML as an interchange format, but was going down the standard relational/MySQL route for my database. MarkMail is instead completely powered by the XQuery database language, operating directly on the data stored as XML rather than converting it to some processed database format. I couldn't find any information on the technology they use to implement this (Saxon?), but it would be a lot simpler to do a single conversion to XML and then operate on that, rather than doing input and output conversions from MySQL. Fascinating stuff; I'll have to see if I can get any more information from the team.
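As a rough sketch of what storing a message this way might look like (my own guess at a schema; the element names are invented, not MarkMail's actual format), each message becomes one XML document, with structure like quoting and signatures marked up explicitly. That markup is what makes features like "unquoted text only" searches and ignoring disclaimer sigs cheap to support:

<!-- Hypothetical schema, for illustration only -->
<message id="12345">
  <headers>
    <from>jane@example.com</from>
    <subject>Re: Thread safety in the backend</subject>
    <date>2008-03-14T09:30:00Z</date>
    <list>dev@example.org</list>
  </headers>
  <body>
    <quoted>On Tuesday, Bob wrote: I'm seeing crashes in the cache.</quoted>
    <text>I think we need a mutex around the cache update.</text>
    <sig>-- Jane, standard corporate disclaimer</sig>
  </body>
</message>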

How you can parse XML with PHP

[Image: text. Photo by Dean Terry]

I love XML, not because it's an inherently beautiful format (it's inelegant in a lot of ways, like why do we have both attributes and character data?) but because for once we have a sensible and widely supported standard in the computing world. The power of this shows when you want to parse an XML file in PHP. Support is built in by default, powered by the Expat library. For small files you can use the SimpleXML wrapper that creates an object from the XML, but I need to parse large amounts of XML, so I didn't want to keep all of that information in memory. Instead I'm hooking directly into Expat's event interface, which calls back into your code when tags and other data objects are encountered, and requires the caller to retain and assemble any information it wants to extract.
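For a sense of the contrast, here's a minimal SimpleXML sketch, assuming the same example.xml file used by the main example below. One call loads the whole document into memory as an object tree, which is convenient but rules it out for very large files:

<?php
// SimpleXML parses the entire document at once into an object tree
$xml = simplexml_load_file("example.xml");
echo $xml->getName() . "\n"; // prints the root element's name
?>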

I’ve included the code below, and here’s a zip file of the example code together with a test XML file. It’s an expanded version of the example from the PHP manual, with the addition of character data handling and the storage of some data during the parsing. It takes the input XML file and outputs an indented version of all tags, showing any character data associated with each tag.

<?php
$file = "example.xml";

// Parsing state. The original PHP manual example indexes these arrays by
// $parser, but parser handles don't work reliably as array keys (in
// modern PHP they're objects), so for a single parser plain globals are
// simpler and more portable.
$depth = 0;
$tagnames = array();  // tag name for each currently-open depth
$tagvalues = array(); // accumulated character data for each open depth

function onStartElement($parser, $name, $attrs)
{
    global $depth, $tagnames, $tagvalues;

    // Print the opening tag, indented to show its nesting level
    for ($i = 0; $i < $depth; $i++) {
        echo "  ";
    }
    echo "$name\n";
    $depth++;

    // Remember the tag's name, and start accumulating its character data
    $tagnames[$depth] = $name;
    $tagvalues[$depth] = "";
}

function onEndElement($parser, $name)
{
    global $depth, $tagnames, $tagvalues;

    $storedname = $tagnames[$depth];
    $storedvalue = $tagvalues[$depth];

    for ($i = 0; $i < $depth; $i++) {
        echo "  ";
    }
    echo $storedname;
    if ($storedvalue != "")
        echo " = " . $storedvalue;
    echo "\n";

    $depth--;
}

function onCharacterData($parser, $data)
{
    global $depth, $tagvalues;

    if ($depth == 0)
        return; // ignore character data outside of any tag

    // ignore new lines
    $data = str_replace("\n", "", $data);
    $data = str_replace("\r", "", $data);

    $tagvalues[$depth] .= $data;
}

$xml_parser = xml_parser_create();
xml_set_element_handler($xml_parser, "onStartElement", "onEndElement");
xml_set_character_data_handler($xml_parser, "onCharacterData");
if (!($fp = fopen($file, "r"))) {
    die("could not open XML input");
}
?>
<html>
<head><title>PHP XML Parsing Example</title></head>
<body><pre>
<?php

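// Feed the file to the parser in 4KB chunks; the third argument to
// xml_parse() tells it when we've hit the end of the input, so it can
// report errors for any unclosed tags.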
while ($data = fread($fp, 4096)) {
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        die(sprintf("XML error: %s at line %d",
                    xml_error_string(xml_get_error_code($xml_parser)),
                    xml_get_current_line_number($xml_parser)));
    }
}
xml_parser_free($xml_parser);
?>
</pre></body>
</html>

What should you look for in predictive SF?

[Image: future. Photo by Dark Matter]

Brad Feld just mentioned that he's reading and watching as much science fiction as he can, to open his mind to the potential future. I'm a sci-fi addict, and the same thought was on my mind after seeing Rick Segal's post on the machines being built to scan books. Rainbows End from a few years ago largely centered on the competition to scan as many paper libraries as possible, with Google and other companies in a race to grab all that juicy data as quickly as they could. They resort to putting the books through a giant tree-shredder, photographing the pieces as they fly through the air, and then using algorithms much like those used to reconstruct the shredded Stasi files to piece the complete text back together. It makes a lot of engineering sense, but it also makes the book-lover in me squirm.

Vernor Vinge is one of the best authors if you're looking for plausible future technology. As a computer science professor, his extrapolation is grounded in a deep knowledge of the present. For instance, I love his passing mention of the nanobots in A Deepness in the Sky having a software stack with Unix buried deep at the bottom. I'd never thought it through before, but it makes total sense that we'll always keep adding more layers, and over thousands of years you'll need code archaeologists to understand the depths. He also has some sobering characters who were once the equivalent of today's programming hotshots, but ended up like buggy-whip makers or steelworkers when the technology left them behind. A good reminder to keep up your 401k.

What should you look for in SF if you want an insight into the future? My top criterion is that everyone should have a job. There's a dividing line between the utopian fiction of ideas, and stories where people have the same problems we do today. Putting the technology in the hands of ordinary folks with jobs and families forces the author to tackle questions of practicality and usability. Philip K. Dick, Vinge and William Gibson all focus on this everyday world, and you get to see technology that's both useful and plausible. I love Iain M. Banks, but his characters mostly live in a nerd-rapture utopia where everything is free and people only work for fun. If there are no constraints, you end up with cool but never-in-our-lifetime technology like the knife missiles.

Another tip is to look for short story collections. There's a much higher idea-to-words ratio in the short form, and it allows writers to focus on a sketch of one corner of a world, rather than getting lost in the details of how the whole system works. For a regular dose I recommend a subscription to Interzone, probably the best SF magazine around.

Hike Big Sky Trail in Simi Valley

[Image: Simi Valley peak. Photo by vw_huntsinger]

After growing up somewhere flatter than Kansas, I love any terrain that rises above head-height. The Simi Hills aren't glamorous or well-known, but they're full of fascinating canyons and offer some great views. They're also very off the beaten track, with miles and miles of wild land in the triangle between Fillmore, Simi Valley and Santa Clarita.

A small part of that has just been turned into a housing development. Thankfully that led to the dedication of a new park and the installation of a trail system, mostly along some old fire roads. Big Sky Trail is a four-mile loop that me, Liz and Thor took for the first time last week. Here's a map:

[Embedded map]

The upper section, which we took first, heads steeply uphill, and then follows the ridgeline for a couple of miles. You end up with some great cliff-top spots to enjoy the view out over the valley, all the way to Boney Mountain in the Santa Monicas if it's clear. The way back winds alongside a stream, through some beautiful oak groves. There are small parking lots next to most of the spots where the trail crosses the small streets of the development. Me and Liz hiked from a small one off Erringer. It looks like the trail might extend west past Erringer too, but we didn't check out that side.

Now I need to figure out how to climb deeper into the hills; I can't leave that much unexplored territory sitting on my doorstep!

Why play is the killer app for implicit data

[Image: playing. Photo by RoyChoi]

I recently ran across PMOG, and damn, I wish I'd thought of that! It's a "Passive Multiplayer Online Game" that you play by surfing the web with their extension installed. You get points for each site you visit, and you can use those points to leave either gifts or traps for other players. There are also user-created missions, or paths, that involve visiting other sites.

Why is this so interesting? It's a fantastic way to get permission to do interesting things with people's browsing information. Players get in-game rewards for sharing, so there's real reciprocity; you're not just an evil corporation harvesting click-streams. Games are a great way to get people involved with a process too, with the instant rewards and status hierarchies that they generate. Even better, those incentives are free for the provider; all you have to do is come up with fun and compelling rules. All this makes it a lot more likely that players will be willing to provide detailed information about the sites they're visiting.

What could this mean in practice?

Site descriptions. You could get points for writing a short website description. It could be structured so that useful descriptions earn more points.

Comments. Contributing to a discussion attached to a website, through the extension interface, could also earn you points.

Rating. Simply giving a StumbleUpon-style thumbs up or down to a site could earn you a small number of points too.

Provide information about yourself. By interacting with a site, whether it’s leaving a surprise or adding meta-information, you’re making a connection with it. You can mark that connection in a profile page, and build up a rich set of favorites.

This is the first compelling application I've seen that could persuade large numbers of users to happily share their browsing habits with other people. There are only 4,500 users right now, but I'll be surprised if that doesn't grow. Games are an incredibly powerful motivator; if you can tap into the human instinct for play, you'll be amazed at how much work people will put into achieving the goals set by the system.