Many different ways of fetching web pages from third-party sites have been tried. The biggest division is between server-based methods and those that run purely on the client.
Server-based methods all rely on the privileged status of requests that come from the same domain as the web page the script is on. The usual security restriction is to only allow data to be read from the script’s domain, so these methods route external page requests through the script’s host. They use the server as a proxy, a middleman that passes requests on from the client to the external site and then passes the results back to the client. Jason Levitt has a good article on some different ways of implementing server proxies, but they all have a lot in common.
- No client setup needed
The proxy will work with almost any browser, and it’s painless for the end user.
- You need to configure your server
All of the methods involve some degree of either fiddling with Apache config files, or setting up CGI scripts to do the redirection. This can be a problem if you don’t have the access or experience to set that up on the server.
- You can only access what your server can see through its connection
This helps security, because it means there’s no chance of a malicious peek at intranet servers, and you don’t have access to the user’s cookies. It can be a problem, though, if you want to check the availability and contents of a site as it appears to the user. For example, with SearchMash I wanted to bypass cloaking and deal with what the user sees, rather than what the server gives search engines.
- All traffic goes through your server
I’ve seen arguments that this is a good thing ethically, because you’re sharing the bandwidth pain with the site you’re fetching from. It seems a bit inelegant and wasteful though, since you’re using more network resources than if the fetch were handled directly, and I’d prefer to handle throttling explicitly, rather than relying on a server’s bandwidth. SearchMash has no throttling; it’s something I’ll need to consider if traffic grows.
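To make the proxy pattern above concrete, here is a minimal Python sketch of the middleman idea, with an explicit per-host throttle instead of relying on the server's bandwidth as a limit. It is not taken from any of the implementations mentioned here: the `url` query parameter, the `HostThrottle` class, the `target_from_path` helper and the one-second interval are all assumptions made up for illustration.

```python
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlsplit


class HostThrottle:
    """Hypothetical throttle: one fetch per host every `min_interval` seconds."""

    def __init__(self, min_interval=1.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock
        self.last_fetch = {}  # host -> time of the last allowed fetch

    def allow(self, host):
        now = self.clock()
        last = self.last_fetch.get(host)
        if last is not None and now - last < self.min_interval:
            return False  # too soon since the previous fetch from this host
        self.last_fetch[host] = now
        return True


def target_from_path(path):
    """Pull the (assumed) ?url= parameter out of the request path.

    Returns the target URL, or None if it is missing or not plain http(s).
    """
    values = parse_qs(urlsplit(path).query).get("url")
    if not values:
        return None
    parts = urlsplit(values[0])
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return None
    return values[0]


throttle = HostThrottle(min_interval=1.0)


class ProxyHandler(BaseHTTPRequestHandler):
    """Fetch the requested page on the client's behalf and relay the body."""

    def do_GET(self):
        target = target_from_path(self.path)
        if target is None:
            self.send_error(400, "Missing or invalid url parameter")
            return
        if not throttle.allow(urlsplit(target).hostname):
            self.send_error(429, "Too many requests for this host")
            return
        try:
            # The fetch is made from the server, so it sees only what the
            # server's connection can reach, and none of the user's cookies.
            with urllib.request.urlopen(target, timeout=10) as upstream:
                body = upstream.read()
                content_type = upstream.headers.get("Content-Type", "text/html")
        except OSError:
            self.send_error(502, "Upstream fetch failed")
            return
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, something like `HTTPServer(("127.0.0.1", 8000), ProxyHandler).serve_forever()` would start it, after which a script on the same host could request `/fetch?url=...` with the target URL percent-encoded. A real deployment would also want to restrict which sites can be fetched, since an open proxy like this will relay anything it is asked for.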
There are two existing ways to do cross-domain fetches without using a server proxy: signed scripts and FlashXMLHttpRequest. I haven’t used either, since they both had limitations that made them unsuitable for SearchMash, but I’ll summarize what I understood from my research.
I didn’t find a definitive article on signed scripts, but here are a few of the articles that I found useful. Using signed scripts allows cross-domain page fetches using the standard XMLHttpRequest API, but it brings up a ‘do you trust this signed script?’ security dialog and only works on Mozilla browsers, which ruled it out for me.
This blog has most of the information on my implementation. It’s very similar in concept to FlashXMLHttpRequest, but uses Java’s HTTP functions rather than Flash’s. The big difference for my purposes is that it supports fetching from any web site, which makes security a lot harder to implement, but opens up a lot of functionality. It requires a certificate, brings up a security window, and has an API that’s not familiar to users of XMLHttpRequest. Java doesn’t seem to be as widely deployed as Flash, but it’s on around 80-90% of desktops. The current implementation doesn’t work with Java 1.1 (it uses some simple Swing thread functions), but I hope to back-port it in the future. It also requires a visible applet to run in IE, though this can be small.