What can you learn from traditional indexing?

I’m a firm believer in studying the techniques developed over centuries by librarians and other traditional information workers. One of the most misunderstood and underrated processes is indexing a book. Anybody who’s spent time trying to extract information from a reference book knows that a good index is crucial, but the work that goes into creating one isn’t obvious.

I’m very interested in that process, since a lot of my content analysis work, and search in general, can be looked at as trying to generate a useful index with no human intervention. That makes professional indexers’ views on automatic indexing software very relevant. Understandably they’re a little defensive, since most people don’t appreciate the skill it takes to create an index, and being compared to software is never fun, but their critiques of automated analysis apply more generally to all automated keyword and search tools.

  • Flat. There’s no grouping of concepts into categories and subheadings.
  • Missing concepts. Only words that are mentioned in the text are included; there’s no reading between the lines to spot ideas that are implicit.
  • Lacking priorities. Software can’t tell which words are important and which are incidental.
  • No anticipation. A good index focuses on the terms that a reader is likely to search for. Software has no way of telling this (though my work on extracting common search terms that lead to a page does provide some of this information).
  • Can’t link. Cross-referencing related ideas makes navigating an index much easier, but this requires semantic knowledge.
  • Duplication. Again, spotting which words are synonyms requires linguistic analysis and isn’t handled well by software. This leads to confusing double entries for keywords.
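
To make those limitations concrete, here’s a minimal sketch of the kind of frequency counting most automated keyword tools boil down to. It’s my own illustration rather than any particular product’s code, and the input file name is made up, but it shows why the output comes out flat, unprioritized and full of near-duplicates.

<?php
// Naive automatic "indexer": it can only count literal word occurrences.
// Purely illustrative - real indexing software is more sophisticated,
// but shares the basic limitations listed above.
function naive_index($text, $max_terms = 10)
{
    $stop_words = array('the', 'and', 'of', 'to', 'a', 'in', 'is', 'that');

    // Lower-case the text and split on anything that isn't a letter
    $words = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $counts = array();
    foreach ($words as $word) {
        if (in_array($word, $stop_words)) {
            continue;
        }
        // "Synonym" and "synonyms" become separate entries - duplication
        $counts[$word] = isset($counts[$word]) ? $counts[$word] + 1 : 1;
    }

    // Raw frequency is the only notion of importance - no real priorities
    arsort($counts);

    // A flat list of terms - no categories, subheadings or cross-references
    return array_slice($counts, 0, $max_terms, true);
}

// 'chapter.txt' is a placeholder for whatever text you want to index
print_r(naive_index(file_get_contents('chapter.txt')));
?>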

Do you know about Microsoft’s social network for businesses?

Last year Microsoft released a ‘technical preview’ of their Knowledge Network add-on for Sharepoint. It aims to solve two problems: finding other employees who can help with a particular subject (expertise search), and locating colleagues who have contacts with someone I’m trying to reach (connection search). It works by analyzing email, building the social network from who emails whom and inferring expertise from the contents of messages.
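
Microsoft hasn’t published the details of the analysis, but the general approach is easy to sketch. Here’s a rough PHP illustration, where the $messages structure (arrays with ‘from’, ‘to’ and ‘body’ keys) is my own assumption rather than anything Knowledge Network exposes: tally sender-to-recipient pairs for the social graph, and count the words each sender writes as a crude expertise signal.

<?php
// Rough sketch of mining a mail store for connections and expertise.
// Each message is assumed to be array('from' => ..., 'to' => array(...), 'body' => ...).
function analyze_messages($messages)
{
    $connections = array();  // who emails whom, and how often
    $expertise = array();    // which terms each sender uses most

    foreach ($messages as $message) {
        $sender = $message['from'];

        // Social graph: count each sender -> recipient pair
        foreach ($message['to'] as $recipient) {
            $pair = $sender.' -> '.$recipient;
            $connections[$pair] = isset($connections[$pair]) ? $connections[$pair] + 1 : 1;
        }

        // Expertise: crude per-sender word counts from the message body
        $words = preg_split('/[^a-z]+/', strtolower($message['body']), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            if (strlen($word) < 4) {
                continue;  // skip short, common words
            }
            $expertise[$sender][$word] =
                isset($expertise[$sender][$word]) ? $expertise[$sender][$word] + 1 : 1;
        }
    }

    return array($connections, $expertise);
}
?>

A real system would weight by recency, strip out signatures and quoted text, and respect the privacy controls discussed below, but the raw ingredients are counts like these.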

Unfortunately their preview period is now closed, and you can’t download the beta version any more. I assume it’s been promoted to a feature in an upcoming Sharepoint release. It is very interesting to read the customer comments they’ve released on their blog: "I thirst for more information, quality information, about the people in my company", "identifying and accessing expertise within the organization and uncovering connections across the supply chain are critical elements of competitive advantage", "Knowledge in modern organizations isn’t just 80% undocumented, it’s 95% invisible". This all fits with my experience of working in large companies, and backs up my instinct that there’s real unsatisfied demand for solutions that offer ‘people search’ within an organization. It’s also extremely interesting to look through their solutions for privacy control, which boil down to fine-grained control of what’s exposed to whom, a bit like Facebook’s privacy settings.

It’s also surprising that even Microsoft is relying heavily on a client-side component for this, which makes sense from a ‘get up and running quickly’ point of view but is a massive barrier to adoption. I’ll be keeping a careful eye out for any news on future developments.

How’s discovery different from search?

A lot of the really interesting services out there are about discovery, rather than search. Since they’re both ways of finding content, it’s worth looking at what makes them different.

Search is goal-oriented, discovery is about the journey. It’s the difference between going to the hardware store to get a Phillips screwdriver, and browsing the travel section in a book store. Rather than having very specific criteria in mind for what you want, you’re using indirect clues to help you find something that will meet your general needs. For a travel book that might include whether you’ve seen it mentioned in a review, if you’ve enjoyed the author’s work before, if it has an attractive cover, if there’s praise from people you trust, if your hairdresser mentioned the location, if you’d seen a documentary on the area, or if it happened to be sticking out from the shelf a little more than the others.

Search is solitary, discovery is social. Most of the factors behind buying a travel book are about interactions you’ve had with other people. Digg is one example of trying to emulate some of those traditional mechanisms for finding popular items. Facebook’s addictiveness is all about being tapped into the pulse of your social circle, not with a particular goal in mind, but just to keep up with the context and do the equivalent of picking fleas off each other’s backs (now that’s a Facebook App idea!). The power of me.dium comes from the injection of social context into browsing. It restores the cues we’re used to in the physical world, so we can judge locations by seeing where both our friends and strangers hang out.

Search is about universal answers, discovery is customized. Though Google talks about searching the sites your friends frequent, I think that functionality will be much more useful for discovery. It’s not likely that your social circle will be visiting the most authoritative sites on all the specific questions you’ll want definitive answers on; the sample size will just be too small. Instead, finding out which sites on general topics are popular amongst your circle would be a lot more interesting. Discovery is a lot more about your personal taste, and that’s something you likely share with your friends a lot more than with the general population.

Discovery is a background task. Often you’ve got some general interests that you want to know more about, but you’re not actively taking steps to find out more. Instead you’re keeping an ear to the ground while you get on with other activities. This is where applying some social network techniques to the workplace can be really interesting. Seeing updates on what your colleagues are up to will often trigger some thoughts or connections on topics you’re interested in, and lead to discussions you wouldn’t have had otherwise. That could be the real killer app for social networks in the enterprise.

For more thoughts on this, there was a fascinating discussion over on the Lightspeed blog a few months back.

A search cloud for this site


ajax  api  bho  c  crossloop  crossloop review  cruz island  dom  dom in  dom in c  error  example  facebook  facebook api  facebook footprints  find self  firefox  heat load dlls find  hiking santa  hiking santa cruz  hiking santa cruz island  how  how to  ie  ie dom  ie dom in  ie dom in c  in  in c  jolla valley  jolla valley campground  la jolla  la jolla valley  la jolla valley campground  managedq  outlook  outlook api  review  santa cruz  santa cruz island  search  server  socket  socket server  the  to  valley campground  wix  wix heat load dlls find  write 


After using statcounter to track around 800 visits, I’ve put together a search cloud for this blog.

I’m still getting a lot of hits for my early review of Crossloop, thanks largely to the success that Mrinal and the gang have had with their free screen-sharing tool. I’ve seen a lot of people looking for more information on Facebook programming, and that shows up pretty clearly in the cloud too. I get the feeling that most developers are still fairly cagey about sharing information, or just plain too busy to post the collection of tutorials and tips that usually grows up around a platform.

For the Outlook API, it’s almost the opposite problem. There are loads of resources out there, many going back to the mid-90s, but almost nothing that gives a broad introduction to all the different technologies you can use to work with Exchange and Outlook.

It’s been great to see my local camping guides reaching so many people too. It’s a real long-tail endeavour, but it’s a good feeling to know that anyone looking for camping spots in the Santa Monicas can now find them on the web, whereas before the information just wasn’t there.

As I gather more statistics, it’ll be interesting to do some per-page clouds. Since the whole blog covers a range of subjects, it’s hard to tell from this cloud whether the terms give good hints about its meaning. Looking at visits to particular pages that are more focused on a single topic would be a better test.

See what Google thinks your site is about with a search cloud

If you want to know which search terms are most likely to find your site, I’ve uploaded a PHP library that creates search clouds from your logs. To use it, include searchcloud.php and call create_search_cloud(), passing in the location of your log file, the name of your site, the number of tags to produce, and the min/max font sizes as percentages. You’ll get back a string containing the HTML for the cloud. Here’s an example:

echo create_search_cloud("visitlogs_petewarden.txt", "petewarden.com", 50, 50, 250);

You can see it working on this example page based on statistics from my old open-source image processing site, which I’ve also included with the library for testing purposes.
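
The library takes care of parsing the logs and generating the HTML, but the font-scaling step at its heart is simple enough to sketch. This isn’t the actual searchcloud.php code, just the general idea: interpolate each term’s count between the minimum and maximum font sizes.

<?php
// Sketch of scaling term counts to font sizes for a cloud.
// Not the real searchcloud.php internals, just the general technique.
function cloud_html($counts, $min_font, $max_font)
{
    $min_count = min($counts);
    $max_count = max($counts);
    $range = max(1, $max_count - $min_count);

    ksort($counts);  // clouds are usually displayed alphabetically
    $html = '';
    foreach ($counts as $term => $count) {
        // Linearly interpolate this term's size between the two font limits
        $size = $min_font + (($count - $min_count) / $range) * ($max_font - $min_font);
        $html .= '<span style="font-size:'.round($size).'%">'.
                 htmlspecialchars($term).'</span> ';
    }
    return $html;
}

echo cloud_html(array('after effects' => 120, 'plugins' => 45, 'wix' => 3), 50, 250);
?>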

Based on the examples I’ve tried, my hypothesis that the most frequent search terms are a good approximation of a site’s meaning holds up. If you take the top 8 terms from the petewarden.com cloud, you get "after effects", "plugins", "effects", "after", "how to", "install", "how to install", "petes plugins". Four of them would be good tags or taxonomy categories for the content, and on inspection, more sophisticated rejection of duplicates and stop words would help increase that ratio. I’ll be interested to hear how this works on some of your sites.

 

You can turn search on its head

A search engine is designed to take some keywords and return web pages that match them. What fascinates me is that this mapping of words to pages could easily be run in reverse: given a particular web page, tell me which keywords are most likely to find it. My hunch is that this set of words, maybe presented as a tag cloud, would give a pretty good summary of what the page is about.

The closest example I’ve found out there is this blog entry. It’s got what appears at first to be a fairly random list of keywords, but digging into them, it looks like Darrin is a Vancouver-based Titanic fan who’s posted about the beautiful agony art project and has done a lot of wedding posts.

What’s really interesting about this is that the search terms that show up aren’t just based on textual frequency within the site, they’re also the product of how often people search for particular words at all. Essentially it’s giving a lot more weight to terms people actually care about, rather than just all terms that are statistically improbable.

At the moment the only way to implement this is to process an individual site’s visitor logs and pull out the frequency of the keyword searches that lead to a visit. However, search engines know the historical frequency of particular query terms up front, so it would be possible for them to take an arbitrary new page and simulate which searches would be likely to land on it. You could do something similar for a mail message; essentially you’d be filtering statistically improbable phrases down to statistically improbable and interesting phrases instead.
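
The log-processing version is mostly a matter of pulling the query out of search-engine referrers and counting how often each phrase leads to each page. Here’s a rough sketch that assumes Apache-style combined logs and the common ‘q’ query parameter; real log formats and engines vary, so treat it as an illustration rather than production code.

<?php
// Count which search phrases lead visitors to each page.
// Assumes Apache combined log format - adjust the parsing for other logs.
function search_terms_by_page($log_file)
{
    $terms = array();
    foreach (file($log_file) as $line) {
        // Pull out the requested page and the referrer field
        if (!preg_match('/"(?:GET|POST) (\S+)[^"]*" \d+ \S+ "([^"]*)"/', $line, $matches)) {
            continue;
        }
        $page = $matches[1];
        $referrer = $matches[2];

        // Most major engines put the search phrase in a 'q' parameter
        $query_string = parse_url($referrer, PHP_URL_QUERY);
        if (!$query_string) {
            continue;
        }
        parse_str($query_string, $params);
        if (empty($params['q'])) {
            continue;
        }

        $phrase = strtolower(trim($params['q']));
        $terms[$page][$phrase] = isset($terms[$page][$phrase]) ? $terms[$page][$phrase] + 1 : 1;
    }
    return $terms;
}
?>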

How to use corporate data to identify experts

Nick over at the Disruptor Monkey blog talks about how their FindaYoda feature has proved a surprise hit. This is a way of seeing who else has a lot of material with a keyword you’re looking for, and its success backs up one of the hunches that’s driving my work. I know from my own experience of working in a large tech company that there’s an immense amount of wheel-reinventing going on just because it’s so hard to find the right person to talk to.

As a practical example I know of at least four different image comparison tools that were written by different teams for use with automated testing, with pretty much identical requirements. One of the biggest ways I helped productivity was simply by being curious about what other people were working on and making connections when I heard about overlap.

One of the tools I’d love to have is a way to map keywords to people. It’s one of the selling points of Krugle’s enterprise code search engine: once you can easily search the whole company’s code, you can see who else has worked with an API or algorithm. Trampoline Systems aims to do something similar using a whole company’s email store; they describe it as letting you discover knowledge assets. I’m trying to do something similar with my automatic tag generation for email.
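
As a toy illustration of that keyword-to-people mapping (not how Krugle or Trampoline actually do it), you can build an inverted index from terms to the people whose documents mention them, so ‘who should I ask about X?’ becomes a simple lookup. The sample data here is invented.

<?php
// Build an inverted index from keywords to the people who use them.
function experts_by_keyword($documents)
{
    $index = array();
    foreach ($documents as $person => $texts) {
        foreach ($texts as $text) {
            $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
            // Count documents, not raw occurrences, so one chatty email doesn't dominate
            foreach (array_unique($words) as $word) {
                $index[$word][$person] = isset($index[$word][$person]) ? $index[$word][$person] + 1 : 1;
            }
        }
    }
    return $index;
}

// Invented example data: per-person snippets from email, checkins, wiki pages
$documents = array(
    'alice' => array('Fixed the image comparison tool to use perceptual diffs'),
    'bob'   => array('Rewrote the OpenGL renderer', 'Notes on OpenGL texture uploads'),
);

$index = experts_by_keyword($documents);
arsort($index['opengl']);   // rank people by how many of their documents mention the term
print_r($index['opengl']);  // bob comes out on top
?>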

It’s not only useful for the people at the coal face; it’s also a benefit that seems to resonate with managers. The amount and cost of the redundant effort is often clearer to them than to the folks doing the work. Since the executives are the ones who make the purchasing decisions, that should help the sales process.

How can you solve organizational problems with visualizations?

Valdis Krebs and his Orgnet consultancy have probably been looking at practical uses of network analysis longer than anyone. They have applied their InFlow software to hundreds of different cases, with a focus on problem solving within commercial organizations, but also looking at identifying terrorists, and the role of networks in science, medicine, politics and even sport.

I am especially interested in their work helping companies solve communication and organizational issues. I’ve had plenty of personal experience with merged teams that failed to integrate properly, wasted a lot of time reinventing wheels because we didn’t know a problem had already been solved within the company, and been stuck in badly configured hierarchies that got in the way of doing the job. To the people at the coal face the problems were usually clear, but network visualizations are a very powerful tool that could have been used to show management the reality of what was happening. In their case studies, that seems to be exactly how they’ve used their work: as a navigational tool for upper management to get a better grasp on what’s happening in the field, and to suggest possible solutions.

Orgnet’s approach is also interesting because they are solving a series of specialized problems with a bespoke, boutique service, whereas most people analyzing companies’ data are trying to design mass-market tools that will solve a large problem like spam or litigation discovery with little hand-holding from the creators of the software. That gives them unique experience exploring some areas that may lead to really innovative solutions to larger problems in the future.

You should check out the Network Weaving blog, written by Valdis, Jack Ricchiuto and June Holley. Another great thing about their work is that their background is in management and organizations rather than technology. That seems to help them avoid the common problem of having a technical solution in search of a real-world problem to solve!

Ratchetsoft responds

Joe Labbe of Ratchetsoft sent a very thoughtful reply to the previous article; here’s an extract that makes a good point:

The bottom line is the semantic problem becomes a bit more manageable when you break it down into its two base components: access and meaning. At RatchetSoft, I think we’ve gone a long way in solving the access issue by creating a user-focused method for accessing content by leveraging established accessibility standards (MSAA and MUIA).

To your point, the meaning issue is a bit more challenging. On that front, we shift the semantic coding responsibility to the entity that actually reaps the benefit of supplying the semantic metadata. So, if you are a user that wants to add new features to existing application screens, you have a vested interest in supplying metadata about those screens so they can be processed by external services. If you are a publisher who has a financial interest in exposing data in new ways to increase consumption of data, you have a strong motivation to semantically code your information.

That fairer match between the person who puts in the work to mark up the semantic information and the person who benefits feels like the key to making progress.

Can the semantic web evolve from the primordial soup of screen-scraping?

The promise of the semantic web is that it will allow your computer to understand the data on a web page, so you can search, analyze and display it in different forms. The top-down approach is to ask web-site creators to add information about the data on a page. I can’t see this ever working; it just takes too much time for almost no reward to the publisher.

That leaves only two alternatives: the status quo, where data remains locked in silos, or some method of understanding it without help from the publisher.

A generic term for reconstituting the underlying data from a user interface is screen-scraping, from the days when legacy data stores had to be converted by capturing their terminal output and parsing the text. Modern screen-scraping is a lot trickier now that user interfaces are more complex, since there’s far more uninteresting presentation information to wade through before you get to the data you’re after.

In theory, screen-scraping gives you access to any data a person can see. In practice, it’s tricky and time-consuming to write a reliable and complete scraper because of the complexity and changeability of user interfaces. To reach the end goal of an open, semantic web where data flows seamlessly from service to service, every application and site would need a dedicated scraper, and it’s hard to see where the engineering resources for that would come from.
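
To show what reconstituting data from presentation looks like in the simplest HTML case, here’s a toy scraper sketch. The table class and URL are invented for illustration, and as the comments note, any change to the page layout can break it, which is exactly the fragility problem described above.

<?php
// Toy HTML screen-scraper: recover tabular data from presentation markup.
// Fragile by nature - if the site changes its layout, the XPath breaks.
function scrape_prices($html)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);  // suppress warnings from messy real-world HTML
    $xpath = new DOMXPath($doc);

    $results = array();
    // Assumes the data we want lives in a table with class "products"
    foreach ($xpath->query('//table[@class="products"]//tr') as $row) {
        $cells = $xpath->query('.//td', $row);
        if ($cells->length >= 2) {
            $name = trim($cells->item(0)->textContent);
            $price = trim($cells->item(1)->textContent);
            $results[$name] = $price;
        }
    }
    return $results;
}

// The URL is a placeholder, not a real catalog
print_r(scrape_prices(file_get_contents('http://example.com/catalog.html')));
?>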

Where it does get interesting is that there could be a ratchet effect if a particular screen-scraping service became popular. Other sites might want to benefit from the extra users or features it offered, and so start to conform to the general layout, or the particular cues in the mark-up, that it uses to parse its supported sites. In turn, those might evolve towards de facto standards, moving towards the end goal of the top-down approach but with incremental benefits at every stage for the actors involved. This seems more feasible than the unrealistic expectation that people will expend effort on unproven standards in the eventual hope of seeing somebody do something with them.

Talking of ratchets leads me to a very neat piece of software called Ratchet-X. Though they never mention the words anywhere, they’re a platform for building screen-scrapers for both desktop and web apps. They have tools to help parse both Windows interfaces and HTML, and quite a few pre-built plugins for popular services like Salesforce. Screen-scrapers are defined using XML to specify the location and meaning of data within an interface, which holds out the promise that non-technical users could create their own for applications they use. This could be a big step in the evolution of scrapers.

I’m aware of how tricky writing a good scraper can be from my work parsing search results pages for Google Hot Keys, but I’m impressed by the work Ratchet have done to build a platform and SDK, rather than just a closed set of tools. I’ll be digging into it more deeply and hopefully chatting to the developers about how they see this moving forward. As always, stay tuned.