Heading back to Blighty

747
Photo by Caribb

Tomorrow Liz and I are flying to Heathrow, and for the first couple of days we’ll be exploring London. During the day I’ll be introducing her to Kew Gardens, one of my favorite places on the globe, full of the most wonderful plants the British Empire could plunder. In the evenings we’ll be returning to London Walks; we’ve now done pretty much every tour they offer on previous visits, so we’ll be trying some for a second time. If you have a ghoulish streak, I highly recommend the Jack the Ripper walk; it’s pretty chilling to visit the sites of the murders, and even the pubs the victims were picked up from.

After that I’ll be in a happy whirlwind of family gatherings in my Cambridgeshire village, followed by a week in a cottage on the south-west coast of Ireland. Our main goal is to see some rain, after the last 6 months of SoCal heat. I just want to see some green grass that doesn’t rely on sprinklers.

I’ll be checking my email when I can, and probably squeezing in some development while we’re traveling, but there won’t be many updates here for a couple of weeks. My Twitter account may get a bit more use if I’m able to SMS, but we’ll see if that’s too much of a technical challenge for me.

Come to Defrag!

Defrag08header_01

I’ve had a lot of new visitors recently who are interested in my browsing history experiments. If you’re seriously excited about the possibilities of this sort of implicit data analysis, you really need to join me at the Defrag conference at the start of November. I’ve already blogged about how much I got out of it last year, but it’s the only place I’ve found where everyone just gets the possibilities of this stuff. You’ll be rubbing shoulders with everyone from technologists and journalists to potential customers and investors, all intent on figuring out where we can take these ideas.

Eric has just extended the early bird pricing until the end of August, and with my ‘pete1’ code you’ll get an extra $100 off on top [now an extra $200 thanks to my speaker’s code, and with no time limit; thanks Eric!]. I hope to see you there.

How to pull browsing history from the image cache

Tracks
Photo by PigDump

I was trying to think of ways to make the browser history hack more useful. One of its limitations is that you can only tell if a user has been to an exact URL. So you can tell if someone’s recently been to the main New York Times page at http://nytimes.com/, but that won’t match if they went directly to http://nytimes.com/somestory.html. You can partially work around this by testing a lot of popular internal links (e.g. all the stories from the front page), but this is a lot harder.

That got me wondering if there was some common property that all the pages on a site are likely to share, something that leaves a trace I can test for. Most websites have a logo image that’s used on most of their pages, and I realized that if I could tell if an image was cached by the browser, I’d have proof that the user had visited some page there recently. How could I tell if an image was cached? Well, if it is in the cache, it should take a lot less time to create it than if it has to be fetched from the network. I gave this idea a quick test, and found that cached images were indeed created synchronously in Javascript, whereas uncached ones took some time. Rather than doing any complex callbacks, I checked the .complete property of each image immediately after creation, and rather to my surprise, this seemed reliable. Here’s an example of it in action, checking for a few common sites:

You can download the full example from http://funhousepicture.com/imagecachetest/imagecachetest.html, but here’s the heart of the test:

function isImageLoaded(image)
{
    // An image that's already in the cache gets its dimensions and its
    // complete flag set synchronously, so all three should be valid straight away.
    return !(image.naturalHeight == 0 || image.naturalWidth == 0 || image.complete == false);
}

function isImageInCache(url)
{
    // Create the image and check it immediately - if it has to come from
    // the network, it won't be loaded yet at this point.
    var image = new Image();
    image.src = url;
    return isImageLoaded(image);
}
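
To give a flavor of how the demo page uses this, here’s a rough sketch of the kind of loop that drives it. The site names and logo URLs below are placeholders I’ve made up purely for illustration; real logo paths change with every redesign, so you’d need to find current ones yourself:

// Illustrative only: these logo URLs are made-up placeholders, since any
// real site's image paths are likely to change over time.
var sitesToTest = [
    { name: 'Example News', logo: 'http://www.example.com/images/logo.gif' },
    { name: 'Example Shop', logo: 'http://shop.example.com/img/header.png' }
];

for (var i = 0; i < sitesToTest.length; i += 1)
{
    var site = sitesToTest[i];
    if (isImageInCache(site.logo))
        document.write('Looks like you have visited ' + site.name + ' recently<br>');
}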

There are plenty of limitations to this approach. For one thing, the test itself pollutes the cache by loading all the images it’s testing, so you can only reliably run it once. All subsequent reloads will show every tested site as having been visited, until you clear your cache. I think I could fix this using cookies to hold the results after the first time, but I haven’t implemented that yet (there’s a rough sketch of the idea below). You also have to identify a common image across the range of pages you’re testing, and with redesigns that URL is likely to change every few months at least. It’s also highly dependent on how long an image remains in the cache.
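
For what it’s worth, here’s an untested sketch of how that cookie fix might look. The cookie name and format are just things I’ve made up for illustration; the idea is simply to save the first run’s results and read them back on later visits instead of re-running the test:

// Untested sketch: remember the first run's results so later visits aren't
// fooled by the images this test loaded into the cache itself.
function loadSavedResults()
{
    var match = document.cookie.match(/imageCacheResults=([^;]*)/);
    return match ? decodeURIComponent(match[1]).split(',') : null;
}

function saveResults(visitedNames)
{
    // Keep them for 30 days - after that the image cache has probably expired anyway.
    var expires = new Date();
    expires.setDate(expires.getDate() + 30);
    document.cookie = 'imageCacheResults=' + encodeURIComponent(visitedNames.join(',')) +
        '; expires=' + expires.toUTCString() + '; path=/';
}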

It’s exciting to be able to pull out this sort of history information; it’s a good complement to the link-style checking, and brings some of the possibilities of the implicit web a little closer to realization.

Santa Monica Mountain trailheads now on Google Maps

Trailheadmap

It took her several days, but Liz has just finished off her map of the trailheads in the Santa Monica mountains. There are descriptions for each of the locations, covering the trails they lead to, how much parking there is, nearby campsites, which agency owns the land, and whether bikes or horses are allowed. This was originally going to be just so she could easily link to the meeting points for trailwork from the SMMTC website, but it’s turned into a great resource for anyone who’s interested in getting out into the mountains.

I’m really proud of what she’s accomplished, and it demonstrates how Google’s map-building application opens the door to anyone building rich maps, in a way that just wasn’t possible before. Maybe this will help a few more people discover the beautiful wilderness we have on our doorstep here in LA.

How to speed up the history testing hack

Speedometer
Photo by Abed Dodokh

The original browser history Javascript ran very slowly in Internet Explorer. When it needed to check thousands of sites, as in the gender test or my tag cloud, it could take several minutes. If it was going to be generally useful, I needed to speed it up a lot. The first thing I did was move the test link creation over to the server side, so there was a prebaked HTML div containing all the links, rather than building it on the fly. This didn’t make much difference though, so I started poking at the testing code. What I found was that switching from indexing into the collection of links to walking through them with each element’s next sibling made a massive difference. I’ve included the function below, and it now only takes a couple of seconds to check thousands of URLs:

function getVisitedSites()
{
    var iframe = document.getElementById('linktestframe');

    var visited = [];

    // currentStyle only exists in Internet Explorer, so use it to pick a code path.
    var isIE = iframe.currentStyle;
    if (isIE)
    {
        // Walk the child nodes with nextSibling rather than indexing into an
        // array of links - this is what makes the IE path so much faster.
        var currentNode = iframe.firstChild;
        while (currentNode!=null)
        {
            if (currentNode.nodeType==1)
            {
                var displayValue = currentNode.currentStyle["display"];
                if (displayValue != "none")
                    visited.push(currentNode.innerHTML);
            }
            currentNode = currentNode.nextSibling;
        }
    }
    else
    {
        var defaultView = document.defaultView;
        var functionGetStyle = defaultView.getComputedStyle;

        var currentNode = iframe.firstChild;
        while (currentNode!=null)
        {
            if (currentNode.nodeType==1)
            {
                var displayValue = functionGetStyle(currentNode,null).getPropertyValue("display");
                if (displayValue != "none")
                    visited.push(currentNode.innerHTML);
            }
            currentNode = currentNode.nextSibling;
        }
    }

    return visited;
}
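
For anyone who hasn’t seen the original hack, a bit of context on what this function expects: the page contains a container full of plain links, with a stylesheet that hides unvisited links and shows visited ones, so reading back each link’s computed display style reveals the history. The real pages bake that markup in on the server, but here’s a made-up client-side equivalent, just to show the structure, using the same ‘linktestframe’ id the function looks for:

// Illustrative setup only - the production version writes this markup out
// server-side as a prebaked div, which is part of what made it faster.
// It relies on CSS along these lines to expose the history:
//   #linktestframe a         { display: none; }
//   #linktestframe a:visited { display: inline; }
function buildTestLinks(urls)
{
    var container = document.getElementById('linktestframe');
    for (var i = 0; i < urls.length; i += 1)
    {
        var link = document.createElement('a');
        link.href = urls[i];
        link.innerHTML = urls[i];
        container.appendChild(link);
    }
}

buildTestLinks(['http://nytimes.com/', 'http://www.flickr.com/']);
var visited = getVisitedSites();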

Where to go if you want startup inspiration

Startuptweetlogo
I’m a comparative late-comer to Twitter, but I’ve started to get hooked. One of the things that pleasantly surprised me is how useful it can be. You can ask questions, or respond to them, and generally do the flea-picking off each other’s backs that’s required to keep relationships alive, all through a very zen interface.

As someone who reads the back of cereal packets if there’s nothing else to hand, I try to direct my reading addiction into useful channels, mostly towards sources of startup advice and inspiration over the last few years. This has meant personal blogs like Brad’s, Fred’s, Don’s, or topic-based ones like VentureHacks or AskTheVC. The trouble is blog posts are time-consuming, which means there’s a big barrier to passing on a quick link, so posts only happen occasionally. That’s where Sam Huleatt has stepped in, with a use for Twitter I’d never thought of.

His new startuptweet stream is collecting a massive number of videos, stories and blog posts on things that startups care about, like a Stanford introduction to the VC process or Paul Graham discussing how to motivate great hackers. He’s already posted a large number of high-quality resources in just a few days, and I’m hopeful that the ease of posting will make it possible for him to keep up the pace. Check out the full site, and start following!

The insanity of retention policies

Crazyface

Photo by 0range Country Girl

I was doing some more research into other companies doing enterprise document analysis, and the combination of staring at this page from PSS Systems and having just finished Bleak House made me step back and realize what a fundamentally dumb idea retention policies for legal reasons are.

As Dickens describes it:

The one great principle of the English law is to make business for itself.  There is no other principle distinctly, certainly, and consistently maintained through all its narrow turnings.  Viewed by this light it becomes a coherent scheme and not the monstrous maze the laity are apt to think it.  Let them but once clearly perceive that its grand principle is to make business for itself at their expense, and surely they will cease to grumble.

Retention policy is a euphemism for deletion policy. Emails over a certain age are deleted, even from backups, usually after 6 or 12 months. The sole reason for this is so that if you’re sued, you aren’t able to hand over older documents, and there’s no question that you deleted them specifically out of a guilty conscience; it’s just your blanket policy. As one of Dickens’s lawyers says:

Being in the law, I have learnt the habit of not committing myself in writing.

There’s no good technical reason for deleting old emails. You’ve already made those backup tapes; it’s actually more work to make sure that old ones are destroyed. You also have to make sure you do keep any messages that relate to currently active lawsuits, which is where PSS Systems comes in, by semantically analyzing documents to spot those that might be needed in discovery.

Email is the collective memory of an organization, and removing old emails is deliberate corporate amnesia. It’s needed because so many recent court cases have hinged on ‘incriminating’ memos, and with thousands of messages written every day, it’s almost certain that somebody’s dry sarcasm could be painted as deadly serious in front of a jury.

Why does this matter? You’re losing the history of the company. Unless you have explicitly copied them, all those old conversations and attachments you might need to refer back to one day are gone. It’s like putting a back-hoe through an archaeological site; you can never get that information back. Just as in archaeology, I’m convinced that there will be new techniques in the future that can pull more information out of that data than we can today. Old email should be an asset, not a liability. Unfortunately, as long as the legal climate keeps companies terrified of losing the litigation lottery, they’ll keep deleting.

Just a good little pointless thing?

Lavalamp
Photo by Wahj

Robert posted a comment on my BrainCloud post saying that "its a good little pointless thing thats always fun". That’s a pretty fair description of what it does right now; it’s basically a lava lamp for the internet. So why am I so interested in the technology behind it?

The promise of the implicit web is based on knowing information about your users without requiring them to manually enter it. It seems silly that you have to type in all your friends to Facebook when your email inbox makes it pretty clear who most of them are. If I knew which products you’d bought, or which sites you’d visited, I could figure out which to recommend in the future.

There’s a pretty wide consensus that there are lots of interesting applications we could write based on data like that. The trouble is that security concerns make it almost impossible to gather unless you’re the owner of a well-used site. Amazon can offer recommendations because they have information on all their customers’ buying habits. No startup can build that application, or anything like it, without the data, so there’s a barrier to entry that favors the big incumbents.

One approach to get over the barrier is breaking out of the security sandbox with a browser extension. Medium is taking that route, and offering some interesting new search tools thanks to all the data they can gather. It’s really, really hard to get people to install anything though, which makes it a time-consuming and expensive route to follow.

That’s why my eyes lit up when I saw Mike’s social history hack. For the first time, there’s a way of gathering some implicit data without either being a big site owner or requiring installation. There isn’t a killer app for it yet, but I’m hopeful once we all poke at the technique’s limitations, we can figure out some compelling uses.