Defrag: How taxonomies meet folksonomies; or the role of semantics on the web


Karen Schneider gave a talk drawing on the centuries of experience that the library community has in classifying and organizing information, and on the relationship between those formal taxonomies and newer tagging approaches. She’s made her slides available here.

She started by laying out the relevance of libraries, with a look at community college students’ library usage. They’re checking out roughly the same number of books as a decade ago, but they now check out many ebooks too, as well as accessing databases, so overall usage has actually increased. Community college students are 49% of the total undergraduate population in the US, and they’re poorer and work longer hours outside of college than average. They’re a very demanding audience, so their heavy usage demonstrates that libraries are providing an efficient and useful service.

She then tackled a few librarian stereotypes. I’ve been a library-blog lurker for years, so I didn’t need any convincing, but she got some laughs with Donna Reed, the spinster librarian from the alternate world in It’s a Wonderful Life. I was disappointed that Giles was missed out, but you can’t have everything.

The next point was a quick demonstration of some typical library software, and how awful it was. The presentation was essentially the same as a card catalog, very static and uninvolving. Doc Searls had talked the day before about data in general being trapped in disconnected silos, in hard-to-use formats, and library systems suffer from the exact same problems.

WorldCat is a universal library catalog that lets you find books and other items in libraries near you. It’s still based on the old card-index model of marked (edit- MARC, that makes more sense, thanks Karen!) data, but it’s a big step forward because it’s linking together a lot of different libraries’ data sources.

There are some issues that traditional taxonomies have been wrestling with for a long time that are also problems for the newer technologies. Authority control is the process of sorting out terms which could be ambiguous, for example by adding a date or other suffix to a name to make it clear which person is referred to. Misspelling is another area that librarians have spent a lot of time developing methods to cope with. Stemming is problematic enough in English, but she discussed Eastern European languages that have even tougher word constructions. Synonyms are another obstacle to finding the results you need. She showed an example where the tags covering the same wireless networking technology included "wifi", "wi-fi", "802.11" and "802.11b". Phrase searching is something that library data services have been handling for a lot longer than search engines. And finally, libraries have been around for long enough that anachronisms have become an issue, something that tagging systems have not had to cope with. Until the ’90s, the Library of Congress resisted changing any of its authoritative terms, such as afro-american or water closet, even though they’d become seriously outdated.
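To make the synonym problem concrete, here is a minimal Python sketch of how a tagging system might fold those wireless variants into one preferred term. This is my own illustration rather than anything shown in the talk: the `PREFERRED` table and the `normalize` and `canonical` helpers are hypothetical names, and a real authority file would be far larger and curated by hand.

```python
import re

# Hypothetical synonym table: maps a normalized variant to one preferred tag.
# The entries are illustrative and not taken from any real catalog.
PREFERRED = {
    "wifi": "wifi",
    "80211": "wifi",
    "80211b": "wifi",
}

def normalize(tag: str) -> str:
    """Lowercase a tag and strip everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", tag.lower())

def canonical(tag: str) -> str:
    """Return the preferred form of a tag, or its normalized form if unknown."""
    key = normalize(tag)
    return PREFERRED.get(key, key)

# The variants mentioned in the talk, plus one capitalized spelling.
tags = ["wifi", "wi-fi", "Wi-Fi", "802.11", "802.11b"]
merged = {canonical(t) for t in tags}
```

Normalizing punctuation and case handles "wi-fi" automatically, but "802.11" only lands on the same term because someone put it in the table, which is roughly the job authority control does in a library catalog.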

Disambiguation or authority control is something that taxonomies are very good at. The creators of the system spot clashes, and figure out a resolution to them. WorldCat Identities is a good example of the power of this approach. Interestingly, Wikipedia is very good at this too, as the disambiguation page for ‘apple’ shows. She believes this is the result of a very strict, well-patrolled community where the naming is held to be extremely important, and that the value of the naming is under-appreciated.

Another strong point of traditional cataloging approaches is the definitions. Wikipedia seems to have informally developed a convention where the first paragraph of any entry is actually a definition too.

Having somebody with in-depth expertise and authority on a subject do a centralized classification can be extremely efficient. She gave the example of a law library in California that has an excellent tagging scheme, but I couldn’t find the reference unfortunately. (edit- Here it is, from the Witkin State Law Library ).

Library catalogs have an excellent topic scheme and a good hierarchy for organizing their classifications, which is something that folksonomies are still trying to catch up with, using ideas like facets.

These are all areas where folksonomies can learn from taxonomies, but there are plenty of ideas that should flow the other way too. One of the strengths of tagging is that it’s really easy to understand how to create and search with tags. The same can’t be said for the Dewey Decimal system. Cataloging in a library involves following a very intimidating series of restrictions. Tagging doesn’t frighten off your workforce like this.

In the short term, tagging is satisficing (good enough), and that trumps the ‘perfection’ of a traditional taxonomy. In 2006, the Library of Congress was proud to report they’d cataloged 350,000 items with 400 catalogers. That works out to 875 items per cataloger over the year, or only about 3.5 records per working day each!
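For anyone who wants to check the arithmetic behind that figure, here is a short Python sketch. The 350,000 items and 400 catalogers come from the talk; the 250 working days per year is my own assumption, chosen because it is a common round number and it reproduces the 3.5 figure.

```python
# Figures quoted in the talk.
items_per_year = 350_000
catalogers = 400

# Assumption (not from the talk): roughly 250 working days in a year.
working_days = 250

# Items handled by one cataloger over the whole year.
per_cataloger_per_year = items_per_year / catalogers

# Items handled by one cataloger on a single working day.
per_cataloger_per_day = per_cataloger_per_year / working_days

print(per_cataloger_per_day)  # prints 3.5
```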

Tagging is also about more than just description. It’s a method for discovery and rediscovery, use and reuse, with your own and other people’s bookmarks. Folksonomies produce good metadata; some people seem concerned that 90% of Flickr photos fall within six facets, but this is actually a good reflection of the real world.

It seems like library conferences are a lot more advanced than Defrag in handling tags, since there’s a formal declaration of a tag for each event in advance, and then that’s used by everybody involved. There wasn’t anything this well-publicized for Defrag, and the one chosen, ‘defragcon’, caused a sad shake of the head, since that’s not future-proof for next year’s conference, and we’ll end up having to change our tags to add a year suffix.

She also brought up the point that basic cataloging and classification techniques seem to be instinctive, and not restricted to a highly-trained elite of catalog librarians. We all tend to pick around four terms to classify items.

There is a common but useless critique of folksonomies: that personal tags pollute them. It’s useless because it’s easy for systems to filter those tags out. A more real problem is the proliferation of tags over time, which ends up cluttering up any results. There’s also the tricky balance between splitters and lumpers: too finely divided categories give ‘onesie’ results where every item is unique, while overly broad classes mean the signal of the results you want is overwhelmed by the noise of irrelevant items.
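The talk didn’t say how systems do that filtering, so the following is only a sketch of one plausible approach: keep a tag only if at least two different people have applied it. The `shared_tags` function, the `min_users` threshold and the sample data are all hypothetical.

```python
from collections import defaultdict

def shared_tags(assignments, min_users=2):
    """Return the tags applied by at least `min_users` distinct users."""
    users_by_tag = defaultdict(set)
    for user, tag in assignments:
        users_by_tag[tag].add(user)
    return {tag for tag, users in users_by_tag.items() if len(users) >= min_users}

# Made-up (user, tag) pairs for a single bookmarked item.
assignments = [
    ("ann", "folksonomy"),
    ("bob", "folksonomy"),
    ("bob", "toread"),
    ("cat", "libraries"),
    ("ann", "libraries"),
    ("cat", "for-mum"),
]

# Personal tags such as "toread" and "for-mum" are used by one person each,
# so they drop out and only the shared vocabulary remains.
kept = shared_tags(assignments)
```

Raising `min_users` pushes the result towards the lumpers’ end of the scale, and the same threshold would also discard genuine ‘onesie’ tags, so a real system would have to tune it.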

There are some examples of ‘uber-folksonomies’, which take the raw power of distributed classification, and apply a layer of hierarchy on top. Wikipedia is the best-known example, and its greatest strength is how well-patrolled the system is. LibraryThing is a system that lets you enter and tag all the books in your personal library. The Danbury library actually uses the information people have entered in LT to recommend books for their patrons who search online, as well as using pre-vetted tags to indicate the categories each book belongs to. The Librarians’ Internet Index is another well-patrolled classification system for websites (though it looked surprisingly sparse when I checked it out). The Assumption College for Sisters has been using to classify its library. Karen pointed out that it’s hard to imagine anyone more trustworthy than a nun librarian! Thunder Bay Public Library has also been busy on

A deep lesson from the success of folksonomies is that great things can be achieved if people want to get involved. We need to incentivize that activity, and she used the phrase ‘handprints and mirrors’. She didn’t expand on the mirrors part, but I took that to mean the enjoyment people took from looking at a reflection of themselves in their work. We all want to feel like we’ve left some kind of handprint on society, so any folk-based system should reflect that desire too.

She only took one question, which asked how libraries are doing. She replied that they’re the only great non-commercial third space. She also gave examples of how people like to be in that space when they’re dealing with information, even if they’re not there for the books.

2 responses

  1. This is a startlingly hi-fi summary of my talk! The only comments I have to make are that the “marked data” is actually MARC data (Machine Readable Cataloging) though I sort of prefer “marked data…” Also, in my slides I link to the delicious collections I mention.
    You are right that I didn’t expand on “mirrors” — I should have built on the point made earlier in the conference that people like to see their own work, though you intuited that quite well.
    Marti Hearst could probably correct some of my faux-science. I think she said that six was the usual limit of what people apply.
    This is a great summary, and I’ll link to it!
