Cross-domain Choices

People have tried a lot of different ways of fetching web pages from third-party sites. The biggest division is between server-based methods and those that run purely on the client.

Server methods

Server-based methods all rely on the privileged status of requests that go to the same domain as the web page the script is on. The usual security restriction only allows a script to read data from its own domain, so these methods route external page requests through the script’s host. They use the server as a proxy: a middleman that passes requests on from the client to the external site, and then passes the results back to the client. Jason Levitt has a good article on some different ways of implementing server proxies, but they all have a lot in common.

  • No client setup needed
  • The proxy will work with almost any browser, and it’s very painless for the end-user

  • You need to configure your server
  • All of the methods involve some degree of either fiddling with Apache config files, or setting up CGI scripts to do the redirection. This can be a problem if you don’t have the access or experience to set that up on the server.

  • You can only access what your server can see through its connection
  • This helps security, because it means there’s no chance of a malicious peek at intranet servers, and you don’t have access to the user’s cookies. It can be a problem, though, if you want to check the availability and contents of a site as it appears to the user. For example, with SearchMash I wanted to bypass cloaking and deal with what the user sees, rather than what the server gives search engines.

  • All traffic goes through your server
  • I’ve seen arguments that this is a good thing ethically, because you’re sharing the bandwidth pain with the site you’re fetching from. It seems a bit inelegant and wasteful though, since you’re using more network resources than if the fetch were handled directly, and I’d prefer to handle throttling explicitly rather than relying on a server’s bandwidth limit. SearchMash has no throttling yet; it’s something I’ll need to consider if traffic grows.
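To make the proxy pattern and the idea of explicit throttling concrete, here is a minimal sketch in Python. It is not taken from any of the proxies mentioned above: the names `HostThrottle` and `proxy_fetch`, the one-second default interval, and the ten-second timeout are all assumptions made for illustration. A real deployment would sit behind a CGI script or web framework and would need more careful validation.

```python
import time
from urllib.parse import urlsplit
from urllib.request import urlopen


class HostThrottle:
    """Enforce a minimum delay between requests to the same host (hypothetical helper)."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # assumed value, tune for your traffic
        self.clock = clock
        self.sleep = sleep
        self.last_request = {}  # host -> time of the most recent request

    def wait(self, host):
        """Block until `host` may be contacted again; return the seconds waited."""
        now = self.clock()
        last = self.last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self.sleep(delay)
        self.last_request[host] = now + delay
        return delay


def default_fetch(url):
    """Fetch a URL over the server's own network connection."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def proxy_fetch(url, throttle, fetch=default_fetch):
    """Fetch an external page on behalf of a client script."""
    parts = urlsplit(url)
    # Only relay plain web requests; reject file:, ftp:, and similar schemes.
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("unsupported URL: %r" % url)
    throttle.wait(parts.hostname)
    return fetch(url)
```

The client script would then request something like a same-domain URL that carries the external address as a parameter, and the server would answer with the result of `proxy_fetch`. Keeping the clock, sleep, and fetch functions injectable is only a convenience for trying the throttle without touching the network.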

Client methods

There are two existing ways to do cross-domain fetches without using a server proxy: signed scripts and FlashXMLHttpRequest. I haven’t used either, since they both had limitations that made them unsuitable for SearchMash, but I’ll summarize what I understood from my research.

FlashXMLHttpRequest

For full information on FlashXMLHttpRequest, check out Julian Couvreur’s blog. It uses Flash’s HTTP library to do the fetching. It’s a great package: it has a JavaScript API that’s just like XMLHttpRequest, Flash is available on almost all machines, and there are no scary security warnings that the user has to click through. Flash will only fetch from sites that explicitly allow it though, so mashing from arbitrary domains is not possible. This makes security less of a headache, but it didn’t support the sort of access I needed for SearchMash.
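The way a site "explicitly allows" Flash to fetch from it is a cross-domain policy file, usually served as crossdomain.xml from the site's root, listing the domains whose Flash content may read its data. The sketch below, in Python, shows roughly how such a policy can be read and checked. The function names and the example policy are made up for illustration, and the wildcard matching is a simplification of Flash's real rules, which also cover ports, secure connections, and meta-policies.

```python
import xml.etree.ElementTree as ET
from fnmatch import fnmatchcase

# Hypothetical policy, as a site might publish at /crossdomain.xml.
EXAMPLE_POLICY = """<cross-domain-policy>
  <allow-access-from domain="*.example.com" />
  <allow-access-from domain="partner.example.org" />
</cross-domain-policy>"""


def allowed_domains(policy_xml):
    """Return the domain patterns listed in a cross-domain policy file."""
    root = ET.fromstring(policy_xml)
    return [
        element.get("domain", "").lower()
        for element in root.iter("allow-access-from")
    ]


def policy_allows(policy_xml, requesting_domain):
    """Simplified check: does any listed pattern match the requesting domain?"""
    domain = requesting_domain.lower()
    return any(
        fnmatchcase(domain, pattern) for pattern in allowed_domains(policy_xml)
    )
```

A site with no policy file, or one that doesn't list your domain, is simply off limits to this approach, which is why it can't be used for fetching from arbitrary sites.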

Signed Scripts

I didn’t find a definitive article on signed scripts, though a few of the articles I found were useful. Using signed scripts allows cross-domain page fetches using the standard XMLHttpRequest API, but it brings up a ‘do you trust this signed script?’ security dialog and only works on Mozilla browsers, which ruled it out for me.

MashProxy

This blog has most of the information on my implementation. It’s very similar in concept to FlashXMLHttpRequest, but it uses Java’s HTTP functions rather than Flash’s. The big difference for my purposes is that it supports fetching from any web site, which makes security a lot harder to implement, but opens up a lot of functionality. It requires a certificate, brings up a security window, and has an API that’s not familiar to users of XMLHttpRequest. Java doesn’t seem to be as widely deployed as Flash, but it’s on around 80-90% of desktops. The current implementation doesn’t work with Java 1.1 (it uses some simple Swing thread functions), but I hope to back-port it in the future. It also requires a visible applet to run in IE, though this can be small.
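To give a flavor of why fetching from any site makes security harder: a client-side fetcher can see the user's intranet, so one check it might want is whether a target host resolves to an internal address. The sketch below is in Python purely for illustration. MashProxy itself is a Java applet and this is not its code; the function names are hypothetical, and a check like this is only one piece of a real defense.

```python
import ipaddress
import socket


def is_internal_address(ip_text):
    """True for private, loopback, or link-local addresses."""
    # Drop any IPv6 zone suffix such as "%eth0" before parsing.
    ip = ipaddress.ip_address(ip_text.split("%")[0])
    return ip.is_private or ip.is_loopback or ip.is_link_local


def host_is_internal(hostname, resolve=socket.getaddrinfo):
    """True if any address the hostname resolves to is internal."""
    # Each getaddrinfo entry ends with a sockaddr tuple whose first item is the IP.
    infos = resolve(hostname, None)
    return any(is_internal_address(info[4][0]) for info in infos)
```

A fetcher could refuse, or ask the user for confirmation, whenever `host_is_internal` returns true for the requested page's host.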