Can the semantic web evolve from the primordial soup of screen-scraping?


The promise of the semantic web is that it will allow your computer to understand the data on a web page, so you can search, analyze and display it in different forms. The top-down approach is to ask web-site creators to add information about the data on a page. I can’t see this ever working: it takes too much of the publisher’s time for almost no reward.

The only alternatives are the status quo, where data remains locked in silos, or some method of understanding it without help from the publisher.

A generic term for reconstituting the underlying data from a user interface is screen-scraping, from the days when legacy data stores had to be converted by capturing their terminal output and parsing the text. Modern screen-scraping is a lot trickier because user interfaces are more complex, and there’s far more uninteresting visual presentation information to wade through before you get to the data you’re after.

In theory, screen-scraping gives you access to any data a person can see. In practice, it’s tricky and time-consuming to write a reliable and complete scraper because of the complexity and changeability of user interfaces. To produce the end-goal of an open, semantic web where data flows seamlessly from service to service, every application and site would need a dedicated scraper, and it’s hard to see where the engineering resources to do that would come from.

Where it does get interesting is that there could be a ratchet effect if a particular screen-scraping service became popular. Other sites might want to benefit from the extra users or features that it offered, and so start to conform to the general layout, or particular cues in the mark-up, that it uses to parse its supported sites. In turn, those might evolve towards de-facto standards, moving towards the end-goal of the top-down approach but with incremental benefits at every stage for the actors involved. This seems more feasible than the unrealistic expectation that people will expend effort on unproven standards in the eventual hope of seeing somebody do something with them.
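To make the idea of "particular cues in the mark-up" concrete, here is a minimal sketch in Python of a scraper that ignores layout entirely and keys only on class names. The cue names ("title", "price") and the two sample pages are hypothetical, invented purely for illustration; no real service is being described.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be tracked on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class CueScraper(HTMLParser):
    """Collects text from elements whose class attribute names a known cue."""

    def __init__(self, cues):
        super().__init__()
        self.cues = set(cues)   # hypothetical cue names, e.g. {"title", "price"}
        self.records = {}       # cue -> list of extracted strings
        self._stack = []        # one entry per open tag: its cue name, or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        cue = next((c for c in classes if c in self.cues), None)
        self._stack.append(cue)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Attribute the text to the innermost enclosing element that has a cue.
        cue = next((c for c in reversed(self._stack) if c), None)
        text = data.strip()
        if cue and text:
            self.records.setdefault(cue, []).append(text)


def scrape(html, cues):
    parser = CueScraper(cues)
    parser.feed(html)
    parser.close()
    return parser.records


# Two differently laid-out pages that happen to share the same cues.
PAGE_A = '<div><h1 class="title">Widget</h1><span class="price">$5</span></div>'
PAGE_B = ('<table><tr><td class="price big">$7</td>'
          '<td><b class="title">Gadget</b></td></tr></table>')

print(scrape(PAGE_A, ["title", "price"]))
print(scrape(PAGE_B, ["title", "price"]))
```

The same scraper handles both pages because it depends on the cues rather than the structure, which is the incentive for a site to adopt them: adding a class name is cheap, and it buys compatibility with whatever the popular scraper offers.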

Talking of ratchets leads me to a very neat piece of software called Ratchet-X. Though its creators never mention the words anywhere, it’s a platform for building screen-scrapers for both desktop and web apps. It has tools to help parse both Windows interfaces and HTML, and quite a few pre-built plugins for popular services like Salesforce. Screen-scrapers are defined using XML to specify the location and meaning of data within an interface, which holds out the promise that non-technical users could create their own for the applications they use. This could be a big step in the evolution of scrapers.
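I haven’t seen the Ratchet-X definition format in detail, so the following is not its schema. It is only a rough sketch, in Python, of the general idea of an XML file that maps a field’s meaning (its name) to its location (a path) in a page. The `<scraper>` and `<field>` elements, the `path` attribute and the sample page are all made up for illustration, and the sketch assumes the page is well-formed XHTML, which real-world HTML often is not.

```python
import xml.etree.ElementTree as ET

# Hypothetical scraper definition: each field pairs a meaning (name)
# with a location (an ElementTree path expression).
DEFINITION = """
<scraper name="example-shop">
  <field name="title" path=".//h1"/>
  <field name="price" path=".//span[@class='price']"/>
</scraper>
"""

# Hypothetical page; assumed to be well-formed so ElementTree can parse it.
PAGE = (
    "<html><body><h1>Widget</h1>"
    "<p>Only <span class='price'>$5</span> today</p>"
    "</body></html>"
)


def load_definition(xml_text):
    """Read the definition into a {field name: path} mapping."""
    root = ET.fromstring(xml_text.strip())
    return {f.get("name"): f.get("path") for f in root.findall("field")}


def apply_definition(fields, page_text):
    """Extract each field's text from the page; None if the path matches nothing."""
    page = ET.fromstring(page_text)
    result = {}
    for name, path in fields.items():
        node = page.find(path)
        result[name] = "".join(node.itertext()).strip() if node is not None else None
    return result


fields = load_definition(DEFINITION)
print(apply_definition(fields, PAGE))
```

The appeal of keeping the definition in data rather than code is that supporting a new site, or fixing a scraper after a redesign, means editing a few paths instead of rewriting a parser.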

I’m aware of how tricky writing a good scraper can be from my work parsing search results pages for Google Hot Keys, but I’m impressed by the work Ratchet have done to build a platform and SDK, rather than just a closed set of tools. I’ll be digging into it more deeply and hopefully chatting to the developers about how they see this moving forward. As always, stay tuned.
