I just received an update on my green card application, which is now over 4 months overdue. It appears I’m stuck in the background check backlog. As I’ve been a legal resident here for over 6 years, and the only difference the green card would make is that I could change jobs and travel a lot more easily, it seems unlikely that this backlog is helping national security. Apparently over 100,000 people are stuck with delays of over a year, so I seem to be at the end of a very long queue.
Author Archives: Pete Warden
What’s the secret to Amazon’s SIPs algorithm?

The statistically improbable phrases that Amazon generates from a book’s contents seem like they’d be useful to have for a lot of other text content, such as emails or web pages. In particular, it seems like you could do some crude but useful automatic tagging.
There’s no technical information available on the algorithm they use, just a vague description of the results it’s trying to achieve. They define a SIP as "a phrase that occurs a large number of times in a particular book relative to all Search Inside! books".
The obvious implementation of this for a word or series of words in a candidate text is:
- Calculate how frequently the word or phrase occurs in the current text, by dividing the number of occurrences by the total number of words in the text. Call this Candidate Frequency.
- Calculate the frequency of the same word or phrase in a larger reference set of texts, to get the average frequency that you’d expect it to appear in a typical text. Call this Usual Frequency.
- To get the Unusualness Score for how unusual a word or phrase is, divide the Candidate Frequency by the Usual Frequency.
In practical terms, if a word appears often in the candidate text, but appears rarely in the reference texts, it will have a high value for Candidate Frequency and a low Usual Frequency, giving a high overall Unusualness Score.
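Amazon hasn’t published their algorithm, but the three steps above are easy to sketch. Here’s a minimal Python version; the add-one smoothing for words missing from the reference is my own choice, to avoid dividing by zero, not something from Amazon’s description:

```python
from collections import Counter

def unusualness_scores(candidate_text, reference_text):
    """Score each word in candidate_text by how much more frequent it is
    there than in the reference corpus (Candidate Frequency / Usual Frequency)."""
    cand_words = candidate_text.lower().split()
    ref_words = reference_text.lower().split()
    cand_counts = Counter(cand_words)
    ref_counts = Counter(ref_words)
    scores = {}
    for word, count in cand_counts.items():
        candidate_freq = count / len(cand_words)
        # Add-one smoothing so words absent from the reference
        # don't cause a division by zero.
        usual_freq = (ref_counts[word] + 1) / (len(ref_words) + 1)
        scores[word] = candidate_freq / usual_freq
    return scores

# Toy corpora, purely for illustration:
scores = unusualness_scores(
    "trails trails work drive the the a",
    "the the the a a of to work in it")
top = max(scores, key=scores.get)  # 'trails' scores highest
```

As Liberman’s point about near-zero expected frequencies suggests, the smoothing constant matters a lot for rare words, which is exactly where this naive version goes wrong.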
This isn’t too hard to implement, so I’ve been experimenting using Outlook Graph. I take my entire collection of emails as a reference corpus, and then for every sender I apply this algorithm to the text of their emails to obtain the top-scoring improbable phrases. Interestingly, the results aren’t as compelling as Amazon’s: a lot of words that intuitively aren’t very helpful show up near the top.
I have found a few discussions online from people who’ve attempted something similar. Most useful were Mark Liberman’s initial thoughts on how we pick out key phrases, where he discusses using "simple ratios of observed frequencies to general expectations", and how they will fail because "such tests will pick out far too many words and phrases whose expected frequency over the span of text in question is nearly zero". This sounds like a plausible explanation for the poor quality of some of the results I’m seeing.
In a later post, he analyzes Amazon’s SIP results, to try and understand what it’s doing under the hood. The key thing he seems to uncover is that "Amazon is limiting SIPs to things that are plausibly phrases in a linguistic sense". In other words, they’re not just applying a simplistic statistical model to pick out SIPs, they’re doing some other sorting to determine what combinations of words are acceptable as likely phrases. I’m trying to avoid that sort of linguistic analysis, since once you get into trying to understand the meaning of a text in any way, you’re suddenly looking at a mountain of hairy unsolved AI problems, and at the very least a lot of engineering effort.
As a counter-example, S Anand applied the same approach I’m using to Calvin and Hobbes, and got respectable-looking results for both single words and phrases, though he too believes that "clearly Amazon’s gotten much further with their system".
There are some other explanations for the quality of the results I’m getting so far. Email is a very informal and unstructured medium compared to books. There’s a lot more bumpf: stuff like header information, not intended for human readers, that creeps into the main text. Emails can also be a lot less focused on describing a particular subject or set of concepts, and a lot closer to natural speech, with content-free filler such as ‘hello’ and ‘with regards’. It’s possible too that trying to pull out keywords from all of a particular person’s sent emails is not a solvable problem, and that there’s too much variance in what any one person discusses.
One tweak I found that really improved the quality was discarding any word that only occurs once in the candidate text. That seems to remove some of the noise of junk words, since the repetition of a token usually means it’s a genuine word and not just some random characters that have crept in.
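That tweak is cheap to implement; a sketch, with made-up tokens for illustration:

```python
from collections import Counter

words = "trails trails xyzzy work work drive".split()
counts = Counter(words)
# Discard anything that only occurs once in the candidate text;
# repetition is weak evidence the token is a genuine word
# rather than stray characters that crept in.
keepers = {word for word, n in counts.items() if n >= 2}
```

Here `keepers` ends up holding just the repeated words, dropping the singletons `xyzzy` and `drive` before any scoring happens.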
Another possible source of error is the reference text I’m comparing against. Using all emails has a certain elegance, since it’s both easily available in this context, and will give personalized results for every user, based on what’s usual in their world. As an alternative, whilst looking at a paper on Automatically Discovering Word Senses, I came across the MiniPAR project, which includes a word frequency list generated from AP news stories. It will be interesting to try both this and the large Google corpus as the reference instead, and see what difference that makes.
I’m having a lot of fun trying to wrestle this into a usable tool, it feels very promising, and surprisingly neglected. One way of looking at what I’m trying to do is as the inverse of the search problem. Instead of asking ‘Which documents match the terms I’m searching for?’, I’m trying to answer ‘Which terms would find the document I’m looking at in a search?’. This brings up a lot of interesting avenues with search in general, such as suggesting other searches you might try based on the contents of results that seem related to what you’re after. Right now though, it feels like I’m not too far from having something useful for tagging emails.
As a final note, here’s an example of the top single-word results I’m getting for an old trailworking friend of mine:
The anti-immigration one is surprising, I don’t remember that ever coming up, but the others are mostly places or objects that have some relevance to our emails.
One thing I always find incredibly useful, and the reason I created Outlook Graph in the first place, is transforming large data sets into something you can see. For the SIPs problem, the input variables we’ve got to play with are the candidate and reference frequencies of words. Essentially, I’m trying to find a pattern I can exploit, some correlation between how interesting a word is and the values it has for those two. The best way of spotting those sort of correlations is to draw your data as a 2D scatter graph and see what emerges. In this case, I’m plotting all of the words from a sender’s emails over the main graph, with the horizontal axis showing the frequency in the current emails, and the vertical axis representing how often a word shows up in all emails.
You can see there’s a big log jam of words in the bottom left that are rare in both the candidate text, and the background. Towards the top-right corner are the words that are frequent in both, like ‘this’. The interesting ones are towards the bottom right, which represents words frequent in the current text, but infrequent in the reference. These are things like ‘trails’, ‘work’ or ‘drive’ that are distinctive to this person’s emails.
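The same quadrant reading can be done numerically. Here’s a toy classifier matching the regions described above; the 0.01 cutoff is an arbitrary illustration, not a tuned value:

```python
def quadrant(candidate_freq, usual_freq, threshold=0.01):
    """Classify a word by where it falls on the scatter graph:
    horizontal axis is frequency in the candidate text, vertical
    axis is frequency in the reference corpus."""
    frequent_here = candidate_freq >= threshold
    frequent_everywhere = usual_freq >= threshold
    if frequent_here and not frequent_everywhere:
        return "distinctive"   # bottom right: words like 'trails'
    if frequent_here and frequent_everywhere:
        return "common"        # top right: words like 'this'
    if not frequent_here and not frequent_everywhere:
        return "rare"          # bottom left: the log jam
    return "background"        # frequent in the reference only

label = quadrant(0.05, 0.001)  # a 'trails'-like word
```

In practice a single straight threshold is too crude, which is part of why eyeballing the scatter graph is so useful before committing to a scoring rule.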
Should you cross the chasm or avoid it?

I recently came across a white paper covering Ten Reasons High-Tech Companies Fail. I’m not sure that I agree with all of them, but the discussion of continuous versus discontinuous innovation really rang true.
Crossing the Chasm is a classic bible for technology marketers, focused on how to move from early adopters to the early majority in terms of the technology adoption lifecycle. It describes the gap between them as a chasm because what you need to do to sell to the mainstream is often wildly different than what it takes to get it adopted by customers who are more open to change.
What the white paper highlights is that this ‘valley of death’ in the adoption cycle only happens when the technology requires a change of behavior by the customer, and is, in his terms, discontinuous. Innovations that don’t require such a change are continuous. They don’t have such a chasm between innovators and the majority, because the perceived cost of changing behavior is a large part of the mainstream’s resistance to new technology.
This articulates one of my instincts I’ve been trying to understand for a while. I was very uncomfortable during one of the Defrag open sessions on adopting collaboration tools, because everyone but me seemed to be in the mode of ‘How do we get these damn stubborn users to see how great our wikis, etc are?’. They took it as a given that the answer to getting adoption was figuring out some way to change users’ behavior. My experience is that changing people’s behavior is extremely costly and likely to fail, and most of the time if you spend enough time thinking about the problem, you can find a way to deliver 80% of the benefits of the technology through a familiar interface.
This is one of the things I really like about Unifyr, they take the file system interface and add the benefits of document management and tagging. It’s the idea behind Google Hot Keys too, letting people keep searching as they always have done, but with some extra functionality. It’s also why I think there’s a big opportunity in email, there’s so much interesting data being entered through that interface and nobody’s doing much with it. Imagine a seamless bridge between a document management system like Documentum or Sharepoint and all of the informal emails that are the majority of a company’s information flow.
Of course, there are some downsides to a continuous strategy. It’s harder to get early adopters excited enough to try a product that on the surface looks very similar to what they’re already using. They’re novelty junkies, they really want to see something obviously new. You also often end up integrating into someone else’s product, which is always a precarious position to be in.
Another important complication is that I don’t think interface changes are always discontinuous. A classic example is the game Command and Conquer. I believe a lot of their success was based on inventing a new UI that people felt like they already knew. Clicking on a unit and then clicking on something else and having them perform a sensible action based on context like moving or attacking just felt very natural. It didn’t feel like a change at all, which drove the game’s massive popularity.
I hope to be able to discuss a more modern example of an innovative interface that feels like you already know it, as soon as some friends leave stealth mode!
Krugle’s approach to the enterprise
I’ve been interested in Krugle ever since I heard Steve Larsen speak at Defrag. They’re a specialized search company, focused on returning results that are useful for software developers, including code and technical documentation. What caught my interest was that they had a product that solved a painful problem I know well from my own career: where’s our damn code for X? Large companies accumulate a lot of source code over the years. The holy grail of software development has always been reuse, but with the standard enterprise systems it can be more work to find old code that solves a problem than to rewrite it from scratch.
Their main public face is the open-source search site at krugle.org. Here you can run a search on both source code and code-related public web pages. I did a quick test, looking for some reasonably tricky terms from image processing, erode and dilate. Here’s what Krugle finds, and for comparison here’s the same query on Google’s code search. Krugle’s results are better in several ways. First, they seem to understand something about the code structure, so they’ve focused on source that has the keywords in the function name and shows definition of the function. Most of Google’s hits are constants or variable names, which are a lot less likely to be useful. Krugle also shows more relevant results for the documentation, under the tech pages tab. A general Google web search for the same terms throws up a lot more pages that aren’t useful to developers. Finally, Krugle knows about projects, so you can easily find out more about the context of a piece of source code, rather than just viewing it as an isolated file as you do with Google’s code search.
Krugle have also teamed up with some big names like IBM and Sourceforge, to offer specialized search for the large public repositories of code that they control. Unfortunately, I wasn’t able to find the Krugle interface directly through Sourceforge’s site, and their default code search engine seems fairly poor, producing only two irrelevant results for erode/dilate. Using the Krugle Sourceforge interface produces a lot more; it seems a shame that Sourceforge don’t replace their own engine with Krugle’s.
So, they have a very strong solution for searching public source code. Where it gets interesting is that the same problem exists within organizations. Their solution is a network appliance server that you install inside your intranet, tell it where your source repositories are, and it provides a similar search interface to your proprietary code. I find the appliance approach very interesting, Inboxer take a similar approach for their email search product, and of course there’s the Google search appliance.
It seems like a lot of developers must be searching for a solution for the code-finding problem, because it is so painful. It also seems like an easy sell to management, since they’ll understand the benefits of reusing code rather than rewriting it. I wonder how the typical sale is made though? I’d imagine it would have to be an engineering initiative, and typically engineering doesn’t have a discretionary budget for items like this. They do seem to have a strong presence at technical conferences like AjaxWorld, which must be good for marketing to the development folks at software businesses.
Overall, it seems like a great tool. I think there’s a lot to learn from their model for anyone who’s trying to turn specialized search software into a product that businesses will buy.
An easy way to install your Firefox extension
Firefox’s biggest selling point is its security. Unfortunately for third-party developers, this means that users have to do several awkward steps before they can install a Firefox extension from an internet site, to protect them against malicious code. The best way to avoid this is to get your extension on the main add-ons site, addons.mozilla.org, since that’s trusted by default and your users won’t have to navigate any tricky security dialogs. There are some issues with this though. Since it requires a vetting process it can take weeks to months to get an extension added. It’s also possible that your extension doesn’t meet the criteria for inclusion if it’s specific to a particular product or niche market, rather than something that’s appropriate for the general public.
If you do need to install from your own site instead, you’ll need a way of guiding your users through the security process, and I’ll cover a technique I’ve found effective. Firefox extensions are packaged in .xpi files, which under the hood are just zip files with a special layout. To start installation, you just need to create a link to the .xpi file on your site, and Firefox will recognize the type when the user clicks on it. Because the site won’t have the right security privileges, the first thing the user sees will be this security warning at the top of their window, and installation will be blocked:
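As an aside on the packaging step above: since an .xpi is just a renamed zip, you need no special tooling to build one. Here’s a sketch in Python; install.rdf is the standard extension manifest, and the file contents below are stand-ins, not from a real extension:

```python
import os
import tempfile
import zipfile

def make_xpi(xpi_path, files):
    """Package extension files into an .xpi, which is just a zip
    archive with the extension's layout inside."""
    with zipfile.ZipFile(xpi_path, "w", zipfile.ZIP_DEFLATED) as xpi:
        for archive_name, disk_path in files.items():
            xpi.write(disk_path, archive_name)

# Quick demonstration with a stand-in manifest:
workdir = tempfile.mkdtemp()
manifest = os.path.join(workdir, "install.rdf")
with open(manifest, "w") as f:
    f.write("<RDF/>")  # a real manifest describes the extension
xpi_path = os.path.join(workdir, "petesearch.xpi")
make_xpi(xpi_path, {"install.rdf": manifest})
is_valid_zip = zipfile.is_zipfile(xpi_path)
```

Once the .xpi is on your server, the link-and-click flow described here takes over.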
To restart installation, the user has to click on the ‘Edit options’ button, which brings up this dialog:
They then have to click on ‘Allow’, and click on the install link again once the dialog has closed. As you can imagine, it’s easy to lose users along the way with this multi-step process. I’ve found that providing a visual aid to guide them through it seems to help, using Javascript to draw an arrow pointing to the ‘Edit options’ button and providing brief instructions next to it:
I can’t claim credit for the idea, I first saw it with me.dium’s extension, but I ended up writing my own version for GoogleHotKeys before it was accepted onto the official Mozilla site. It works by intercepting the install link mouse-click, revealing the guide at the top of the page and then trying to install the extension through scripting, which brings up the security warning it points to. Here’s a link to an example page showing the code in action (you’ll need an image like this for it too), and I’ve included the code below. You’re free to reuse this for your own projects.
<html><head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8"><title>PeteSearch</title></head><body bgcolor="#eeeeee">
<script type="application/x-javascript">
<!--
// First click: reveal the guide arrow, then attempt the install,
// which triggers the security warning the arrow points at.
function installInitialTry(aEvent)
{
    showInstallEnable();
    return attemptInstall(aEvent);
}

// Ask Firefox to install the .xpi linked from the clicked element.
function attemptInstall(aEvent)
{
    var params = {
        "PeteSearch": {
            URL: aEvent.target.href,
            IconURL: aEvent.target.getAttribute("iconURL"),
            toString: function () { return this.URL; }
        }
    };
    InstallTrigger.install(params);
    return false;
}

// Inject the visual guide into the placeholder div, once only.
function showInstallEnable()
{
    if ((document == null) || (document.getElementById == null))
        return;
    var content = document.getElementById("click_here_content");
    if (content != null)
        return;
    var placeholder = document.getElementById("click_here_placeholder");
    placeholder.innerHTML =
        "<table align=\"center\" bgcolor=\"#ffffff\" border=\"0\" width=\"100%\" id=\"click_here_content\"><tbody>"+
        "<tr>"+
        "<td align=\"right\"><p><font size=\"+2\"><b>Click here to enable installation<br>"+
        "and then click <a href=\"http://petesearch.com/petesearch.xpi\" iconurl=\"iconsmall.png\" onclick=\"return attemptInstall(event);\">here</a> to install"+
        "</b></font>"+
        "</p></td>"+
        "<td width=\"116\" align=\"right\">"+
        "<img width=\"116\" height=\"165\" src=\"clickhere.png\"></td>"+
        "</tr>"+
        "</tbody></table>";
}
-->
</script>
<div id="click_here_placeholder">
</div>
<div align="center">
<a href="http://petesearch.com/petesearch.xpi" iconurl="iconsmall.png" onclick="return installInitialTry(event);">
install
</a>
</div>
</body></html>
Killer whales in Los Angeles?
On the way back from a night camping on Santa Cruz, I was lucky enough to see a pod of Orca. I’m always amazed by the natural wonders you can find on your doorstep here in LA.
IslandPackers are running special whale-watching half-day trips until April. Orcas are unusual, but there are often grey whales, and I’ve almost never stepped off the boat without seeing a good crop of dolphins and sea-lions. The crew are always a lot of fun, especially when they go into great depth on the fauna’s sex-lives. Dolphins mate up to ten times a day, apparently! Do make sure to dress up warm. Even in the summer when it’s in the 80s on land, I’ve gotten very chilly on the boat.
AFK – Christmas
Christmas looms, with a short camping trip to Santa Cruz Island the next day, so I’ll be away from keyboard until the end of the week. Look forward to some posts on Firefox extension installation, Krugle and how to hike to the highest point in the Santa Monicas. Until then, bake some cookies!
Here’s a quick way to organize Outlook attachments
Outlook Attachment Processor from MAPILab lets you save out all your email attachments to disk, and replaces them with links in the messages. I find it a lot easier to search and organize documents as objects on the file system than when they’re embedded in emails, and this add-in makes it painless to move them over. It’s got a large array of options, but they’re well-explained and have good defaults, so it doesn’t feel too much like the space shuttle control panel.
The most important options cover which messages and attachments are converted to local files, and where they end up. I like the way this addin focuses on solving a single painful problem, but with a lot of flexibility and depth for customizing that solution. It’s obviously been heavily driven by user feedback.
Of course, there’s a downside to saving all your attachments like this; the links break when you move to a different machine. There’s an ‘Update Links’ tool to change them to a new location to solve this problem, but it shows that separating your attachments from the source PST does add some complications. You can try the add-in free for 30 days with a fully-functional trial version, and it costs $24 for a single-user license.
MAPILab offer a range of other Microsoft plugins, including a couple of tools for Exchange. They employ 25 people, which shows the engineering effort that solid plugins like these require, and that there’s market demand for their solutions. They explicitly spell out their strategy as targeting narrow problems, leaving larger companies to "focus on the creation of platforms and technological foundations".
One of the problems I’m interested in solving is making document collaboration through email less painful. Attachment Processor and some of their other tools like File Fetch and File Send Automatically are solving parts of what makes it so awkward. What I’d like to see is a more comprehensive system that offers the advantages of a wiki without having to force people away from sharing documents through email. It seems like an Exchange extension that turned attachments into links to Sharepoint documents, like Attachment Processor does for the local filesystem, would be an interesting direction to go down.
Fancy a trail with oil bubbling from the ground?
Towsley Canyon is a park on the edge of Santa Clarita, just off the 5 freeway. It’s a lovely place, full of beautiful flowers in the springtime, but my favorite features are the small oil seeps scattered amongst the hills. The area was once a commercial oil field; now there’s just a few rusty derricks and some natural springs of oil slowly bubbling to the surface. I never believed the title sequence of the Beverly Hillbillies could be real, but you can see it here! There’s also a miniature version of the Zion narrows, with a small but spectacular canyon carved out by occasional heavy flooding.
The main trail there is a 5-mile loop that climbs about 800 feet. It’s got sections with a 30% grade on the east side, so I recommend taking the counter-clockwise direction, where there are gentler uphill grades thanks to some switchbacks. It’s popular with bikers and runners and is a good place to take your dogs, though you’ll need them leashed. Here’s a map showing the main loop, and a shorter variation you can take.
To get there from LA, drive north up the 5 until the Calgrove exit, and turn left at the stop sign at the end of the ramp. About 1/4 mile along that road you’ll see a sign for the Ed Davis Park on the right. You can either park in the street lot, which is free, or drive in a few hundred yards and pay $5 for one of the interior parking areas. Head north along the fire road, going past the visitor’s center and ranger’s accommodation. You’ll pass a concrete dam, and then go through the Narrows on the streambed, and eventually hit a spot where the trail turns uphill. After this, there’s some well-planned switchbacks, but it’s still hard work getting towards the top. Luckily, the trail is kept in great condition by a dedicated crew of volunteers. Liz and I just returned from a day working with them, and came home with some scrumptious fruitcake as a Christmas gift!
It should be fairly clear which way to go as you hike along the trail; there’s not much vegetation growing in, the tread is in good shape, and there are few unofficial side-trails to confuse you. Be careful if you go in the summer; there’s little shade and it can get extremely hot, so make sure you bring plenty of water. My favorite time to visit is the spring, thanks to the cool weather and the wonderful wildflowers that appear after the rains, including some gorgeous Chocolate Lilies.
You’ll reach a peak of around 2200 feet, and then start heading downhill fairly gently. After around a mile, it will start to get a bit steeper, culminating in a 30% grade section (marked on the map) that seems to head straight down the hill. Thankfully it eases up after that, and you’ll soon pass the largest oil seep, usually with lots of sticks left poked in it by curious children. After that, you’re less than a mile back to the parking lot.
Are White House emails wide open to hackers?

When I heard about the deletion of the White House emails back in April, and Karl Rove’s use of a private email account, my first thought was ‘wow, they must really struggle to keep that secure’. It’s not often my technical research leads to a question of national security, but it turns out they don’t struggle, they just leave a large part of their email system unsecured!
Emails that travel outside of an organization to a private email account like Karl’s go through an unencrypted, plain-text transport system, SMTP. In simple terms, a text document is passed from server to server until it reaches its destination. In theory, anybody who’s sitting on the network can see the contents of those messages. Normally, this isn’t a big issue, since emails are low value (typically not containing credit card numbers or other information valuable to hackers) and there are so many flying around that just being in the right place to sniff them and pick an interesting one out from the noise is tough.
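You can see how exposed this leaves a message: the byte string below is the kind of thing that gets relayed between servers, readable by anyone along the way. A minimal illustration in Python, with made-up addresses:

```python
from email.message import EmailMessage

# Build a message the way a mail client would before handing it to SMTP.
msg = EmailMessage()
msg["From"] = "sender@example.com"
msg["To"] = "recipient@example.com"
msg["Subject"] = "Re: strategy"
msg.set_content("Nothing in here is hidden from anyone on the network.")

# The serialized form is what passes between SMTP servers:
# no encryption, just headers and body as plain text.
wire = msg.as_bytes()
readable = b"Nothing in here is hidden" in wire
```

Encrypted transport (or encrypting the message body itself) is what closes this gap, which is why leaving official traffic on plain SMTP is such a striking choice.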
David Gewirtz, a techie who runs OutlookPower magazine, has spent months researching the technical aspects of the White House’s email use. He’s now published a book, and it’s scary reading for anyone who cares about America’s security. You can read extracts from it at this site, but I recommend looking through the original articles too. Start with "Prepare to be freaked out" to understand how serious the consequences of their poor technology decisions could be. This isn’t a partisan or crazy conspiracy book; email is something that every administration in the last 20 years has made serious mistakes with, and David ends with recommendations on how to improve the current dire situation.
Buy the book, but here’s a full list of the related articles:
- Technical analysis: the White House email controversy
- The White House email controversy: who runs GWB43.COM?
- The White House email controversy: a detour into mob journalism
- The White House email controversy: the nightmare scenario
- The White House email controversy: an archiving plan only FEMA could love
- ‘Deep Mail’ on the White House email controversy
- The White House email controversy: migrating from Notes to Outlook
- The White House email controversy: why does Karl Rove keep losing his BlackBerry?
- The White House email controversy: help us find those missing messages
- The White House email controversy: a historical perspective
- The White House email controversy: prepare to be freaked out
- The White House email controversy: understanding the root causes
- The White House email controversy: our formal recommendations
- The White House email controversy: the final questions