What can you learn from traditional indexing?


I’m a firm believer in studying the techniques developed over centuries by librarians and other traditional information workers. One of the most misunderstood and underrated processes is indexing a book. Anybody who’s spent time trying to extract information from a reference book knows that a good index is crucial, but the work that goes into creating one isn’t obvious.

I’m very interested in that process, since a lot of my content analysis work, and search in general, can be looked at as trying to generate a useful index with no human intervention. That makes professional indexers’ views on automatic indexing software very relevant. Understandably, they’re a little defensive, since most people don’t appreciate the skill it takes to create an index and being compared to software is never fun, but their critiques of automated analysis apply more generally to all automated keyword and search tools. Their main criticisms of software-generated indexes are:

  • Flat. There’s no grouping of concepts into categories and subheadings.
  • Missing concepts. Only words that are mentioned in the text are included; there’s no reading between the lines to spot ideas that are implicit.
  • Lacking priorities. Software can’t tell which words are important, and which are incidental.
  • No anticipation. A good index focuses on the terms that a reader is likely to search for. Software has no way of telling this (though my work on extracting common search terms that lead to a page does provide some of this information).
  • Can’t link. Cross-referencing related ideas makes the navigation of an index much easier, but this requires semantic knowledge.
  • Duplication. Again, spotting which words are synonyms requires linguistic analysis, and isn’t handled well by software. This leads to confusing double entries for keywords.
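
To make these critiques concrete, here is a minimal Python sketch of the kind of naive automatic indexer they apply to. It is an illustration only, not any particular product's algorithm; the function name `naive_index`, the tiny stop-word list and the two sample pages are all made up for the example.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; a real one would be far longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on", "by"}


def naive_index(pages):
    """Map each lower-cased word to the sorted list of pages it appears on."""
    index = defaultdict(set)
    for page_number, text in pages.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOP_WORDS:
                index[word].add(page_number)
    # Alphabetical order, like a printed index, but with no subheadings.
    return {word: sorted(numbers) for word, numbers in sorted(index.items())}


# Made-up sample text, purely for illustration.
sample_pages = {
    1: "The car is parked by the garage.",
    2: "An automobile needs fuel.",
}

index = naive_index(sample_pages)
for word, page_numbers in index.items():
    print(word, page_numbers)
```

On this sample it produces six unrelated entries: `car` and `automobile` are listed separately (duplication), nothing groups them under a heading like "vehicles" (flat, missing concepts), and `parked` gets the same weight as `car` (lacking priorities).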

It’s a wild, wild web

While browsing my visitor logs, I came across viewfour.com. It’s an interesting site that does something similar to my old SearchMash Java applet and ManagedQ’s much more advanced engine: it displays live previews of search results. Unfortunately, it suffers from a problem with frame-busting sites. For example, this search for Pete Warden winds up with the toolfarm preview taking over the parent frame. That’s a big reason why you need either some decent script-blocking code or to deploy as a browser extension, where you can prevent child frames from taking control.

I was curious to discover that there weren’t any organic reviews for the site that I could find, and the copyright was 2005. Most of the Google results pointed to download pages. It also includes a link to ViewSmart, a spyware/malware blocker, which seemed like an odd combination to go with a search engine. In fact, the only user-created review I found in the first few pages was this negative one from a spyware information site. I don’t recommend paying too much attention to anonymous posters, but if you do try out the search site, it would be prudent to avoid the additional download until I can find out more information about it. I’ll see if I can get more information directly from the author, SSHGuru.

How do you access Exchange server data?


Like standards, the wonderful thing about Exchange APIs is that there are so many to choose from. This page from Microsoft is designed to help you figure out which one you should use, and I count over 20 alternatives!

I need something that’s server-based, not a client API, so that does help narrow down the selection a little. MAPI is a venerable interface, and is still used by Outlook to communicate with the server, but unfortunately Microsoft has dropped server-side support for it in Exchange 2007. It is possible to download an extension to enable it, but using a deprecated technology doesn’t feel like a long-term solution. CDOEx is another interface that’s been around for a while, and it’s designed for server code, but it too is deprecated.

Microsoft’s current recommendation is to switch all development to their new web service API. This looks intriguing, since it makes the physical location of the code that interfaces with the server irrelevant, but I’m wary that it will hit performance problems when accessing the large amounts of data that I typically work with. It seems mostly designed with clients in mind, and clients typically have an incremental access pattern where they’re only touching small amounts of data at a time. Another issue is adoption of Exchange 2007. My anecdotal evidence is that many organizations are still running older versions, and even Microsoft’s Small Business Server package still uses 2003. Since it’s likely that the old Exchange versions will be around for a while, it’s tricky to rely on an interface that’s only supported in the very latest version.

Complaints are the best measure of success


I enjoyed Mitchell Ashley’s post about his joy when QA found the first bug in a new product. It’s an unfakeable sign that your software is far enough along to be testable and that you’ve got a system in place for testing it. As he says, there are always bugs, and if you aren’t finding them early then you’re not looking right. That always leads to a very nasty time at the end of the development cycle.

Along similar lines, the best way to tell if released software has potential is whether anyone complains. I’m not advocating deliberately annoying your customers, but all software has problems. If nobody complains, it doesn’t mean it’s perfect; it just means nobody sees enough value in the software to want to see it fixed. It’s a serious investment of time to complain, so almost everybody will just keep quiet and stop using an application that has a serious flaw. Someone who complains must really care and believe in what you’re trying to do. I love being at the point where you get complaints, because it means you’ve created something people feel passionately about. Most software never gets that far.

It’s also great motivation for the team when they know that there’s a real person out there who will be delighted by the latest bug fix because it addresses their problem. Anything you can do to reinforce a relationship between engineers and passionate users pays massive dividends in innovation. User groups and site visits are great for this. Otherwise it’s too easy to lose sight of the goal of what you’re doing in the fog of technical details. Looking at which complaints you’re addressing with your engineering is a great way of ensuring you’re actually doing something that will be useful for your customers.

Skiing in Utah


I’m literally skiing in Utah for the next three days, but Seth’s analogy has been on my mind. Yesterday I was asked why I’m building these mail tools. As a knowledge worker, email is central to my life, and I know how many painful problems there are related to sharing information. I can see a massive pool of mail data sitting there unused. I want to build systems to bridge that gap. There are so many opportunities that I feel like a kid in a candy store, or a skier in Utah.

How to stop reinventing the wheel


Someone recently pointed me towards Tacit Software as a company I’d be interested in. Their team has created a system to automatically catalog expertise within an organization. The question they’re trying to answer is ‘Who can I ask about X?’, and the goal is to prevent redundant work within an organization. They have an interface where employees can ask a question, and the software will try to identify the best people to answer it.

They offer two different products: Illumio, which is based on a desktop client, and ActiveNet, which is centrally deployed. Illumio works a lot like a desktop search system, analysing all the files on a user’s computer, including documents, emails and contacts, to identify areas of expertise. ActiveNet is similar, but looks at the data stored globally on the organization’s servers to figure out who knows about what.
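
As a rough illustration of the ‘Who can I ask about X?’ idea, here is a minimal Python sketch that ranks people by how often the words in a question appear in their documents. This is an assumption about the general shape of the problem, not a description of how Illumio or ActiveNet actually work; the function name `rank_experts`, the people and the sample documents are all hypothetical.

```python
import re
from collections import Counter


def rank_experts(documents_by_person, question):
    """Rank people by how often the question's words appear in their documents."""
    question_terms = set(re.findall(r"[a-z]+", question.lower()))
    scores = {}
    for person, documents in documents_by_person.items():
        counts = Counter()
        for text in documents:
            counts.update(re.findall(r"[a-z]+", text.lower()))
        score = sum(counts[term] for term in question_terms)
        if score > 0:
            scores[person] = score
    # Highest score first; ties are broken alphabetically by name.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up people and documents, purely for illustration.
sample = {
    "alice": ["Procurement contract for the new supplier", "Supplier pricing notes"],
    "bob": ["Research notes on image compression"],
    "carol": ["Supplier meeting agenda"],
}

ranking = rank_experts(sample, "supplier pricing")
print(ranking)
```

Raw word counts like this inherit every weakness of automatic keyword extraction (no synonyms, no sense of which words matter), so a real system would presumably need something considerably smarter.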

One interesting approach they’re using to demonstrate Illumio’s potential is the set of public web groups they’ve set up. To join, you download Illumio and it analyzes your interests. You can then participate in their groups to ask and answer questions on topics ranging from sports to business.

An area they’ve obviously spent a lot of time on is safeguarding users’ privacy. The process they use for answering questions involves getting permission from the people it decides are experts on the topic before any identifying information is returned to the questioner. Privacy is a big concern, but this does seem a bit unwieldy compared to the Knowledge Network approach where experts pre-approve what information is going to be exposed, and it’s then available for easy browsing and searching by other employees.

Their case studies show they’ve deployed in some large organizations and report some impressive satisfaction figures. Their descriptions of hotspots where they see a lot of redundant work are illuminating too: they’ve focused on procurement, research and new project proposals. This definitely fits with my experiences, though I’ve spent most of my time on the research side.

What can an externally-facing social network do for a business?


The Haystack system from Cerado is a social network tool for businesses with a twist. The audience for the network is people outside the company who want to talk about something, and would like to find the right person within the business to approach. Cerado have published a short ebook explaining the uses of this, from sales to M&A, but for me the key insight is that there’s a big difference between using an anonymous email address or phone number as the initial way of contacting a business, and having a named person to talk to.

People don’t have relationships with organizations, they have relationships with individuals. Organizations don’t have any memory, ability to trade favors, or pictures of grandchildren to swap. In my professional life, I’m able to get information and assistance from large companies through the individuals I know who work at them. It’s rare that I’m able to get help through the public forums or mailing lists because the questions I’m stuck on tend to be complex and tricky. Since developer support is bombarded with questions from inexperienced developers, you have to go through the dance of proving that the machine is plugged in to the wall socket and caps lock is off before they’ll dig into your issue. With individuals, I have the credibility for them to assume I’ve done the obvious stuff, and cut to the chase of looking at the issue I’m presenting. They remember the context of the larger system I’m working in, so I don’t have to explain that every time. And since I’ve reciprocated and helped them in the past with issues related to my company, they’re willing to spend some time assisting me.

Haystack is about making it easier to build these sorts of relationships with individuals inside companies, for sales contacts, business partnerships and anything else that requires communication with an organization. They publish profiles of employees, with photos, tags indicating their areas of expertise and contact details. From a customer perspective, I like this a lot more than the usual bland contact page. It adds a human face to the organization and lowers the barrier for me to contact them. I’d have more confidence I’d reach someone who’d be able and willing to answer my questions, rather than a random intern I’d have to persuade to escalate me. I could see this being useful if I wanted to buy from a company, get a job there, sell something to them, ask a technical question or talk about a business partnership.

It does require a change in the way most organizations operate though. The usual practice is that there are designated people who act as gatekeepers to the outside world, both to control the flow of information so that nothing outside of policy is leaked, and to protect employees from being distracted from their internal work. The gatekeepers also derive a lot of power and prestige from their control of the communication channels, so they tend to have a vested interest in keeping them limited. I think that organizations would be better off relaxing the current systems, but the spectre of lawsuits and information leakage makes it a tough sell in any group where avoiding blame is the top priority.

Cerado also run a great blog, which is another good way to add a human face to a company, and one that runs into the same worries.