Browser Compatibility

One of the design goals of SearchMash was to work in a lot of different browsers, which is why I tried to avoid using browser-specific technology.

Both the LiveConnect technology that allows Java and JavaScript to call each other, and the JavaScript document object model (DOM) that I use to edit HTML in a page, are well supported by the major browsers. The biggest hurdle is having a recent version of Java installed. I’ve seen figures of around 80-90% of desktops having some version of Java, but I haven’t seen a break-down by version; I suspect many of those are 1.1, which is why I hope to back-port MashProxy to that version. It’s a respectable deployment, but not as high as Flash (I’ve seen 98% figures for some version of that).

My main development platform is OS X, so Safari and Firefox get the most testing, and SearchMash works without problems on both. I’ve also tried some of the other Mac browsers, such as Opera and IE 5.1, but these are not widely used, so I haven’t spent the time to ensure SearchMash works on them. I did hit some quirks in Safari’s LiveConnect implementation: Java strings don’t seem to be converted to JavaScript ones when they’re passed back to the script, but stay as wrapped Java objects. This means you can call Java string functions on them from the script, but JavaScript functions don’t work. Since there’s some overlap between JavaScript’s and Java’s string functions, it took me a while to figure out what was happening, but once I did, I was able to force a conversion by creating a new JavaScript string variable from the one that was passed back.
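The conversion trick can be sketched like this (the function name is mine, and javaResult stands in for a value returned from an applet call):

```javascript
// In Safari, values returned from applet calls stay as wrapped Java
// objects. Building a new JavaScript string from the wrapped value
// forces the conversion.
function toJavaScriptString(javaResult) {
  return String(javaResult); // "" + javaResult works too
}
```

Once converted, the normal JavaScript methods such as substring() work again.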

Since most of the world uses a PC, making sure that SearchMash works on Firefox and Internet Explorer was a priority. I only have one machine with Windows XP available, but I’ve been able to make sure it runs on IE 6 and 7, and on Firefox 1.5.

I’ve found IE’s Java to JavaScript connection to be a bit more picky than the other browsers’. For example, I’ve just found a bug that can cause the applet to crash when I change pages in IE; it looks like using an out-of-date JavaScript DOM object can cause a security exception.

I’ve done no Linux testing because I don’t have a system set up to run it. I didn’t hit any big differences between Firefox on the Mac and Windows, so I’m hopeful it will just work on Linux/Firefox too.

Supported Browsers

  • Internet Explorer 6.0 (Windows)
  • Internet Explorer 7.0 (Windows)
  • Safari 2.0 (OS X)
  • Firefox 1.5 (Windows and OS X)

Untested

  • Internet Explorer 5.0 (Windows)
  • Opera (Windows)
  • Netscape
  • No Linux testing has been done

Tested, but not working

  • Internet Explorer 5.1 (OS X)
  • Opera (OS X)

Bugs and features

One of the nice things that SourceForge offers is a feature and bug tracking system. You can check out the bugs here and the features here, or click on the links from the main project page.

I’ve started things off with some bugs and features I’ve had on my mental list for a while, but you should feel free to jump in and add your own requests.

Bugs

  • Make MashProxy Java 1.1 compatible
  • MashProxy doesn’t really need to use any new Java features, but it relies on a couple just because it was written in a 1.4 environment. I’d like to remove those dependencies, so it’ll run even on really old versions of Java.

  • Previewing missing sites on IE stops window
  • I noticed this while using my parents’ PC on vacation; I normally develop with Safari on a Mac, where I don’t see the problem. Moving the mouse over a missing link to invoke the preview causes the preview window to stop responding to any further requests to show pages, even valid ones.

Features

  • Improve appearance of status indication
  • I think either icons or special formatting of the title would be better than the current (found) or (missing) text that’s added after each link.

  • Provide ask.com as an alternative to Google
  • I normally only use Google, but it seems like it wouldn’t be too hard technically to parse ask.com’s results too, and offer users a choice.

  • Check for search terms in the page
  • This was one of the big features I wanted to help my searching, but that I ran out of time to implement before the first release. It would catch out sites that do ‘cloaking’ (showing one set of results to Google to get the search terms, but another to normal users).

  • Show multiple search pages
  • Ten search results to a page sometimes feels a bit stingy, and it seems like it wouldn’t be too hard to concatenate multiple pages of results, one above the other. I’m not sure how many to show at once, but I’d probably try four pages, and forty results, and see how that feels.

Getting a Certificate

As I mention in my post on building your own MashProxy applet, you’ll need to sign the applet you build with an RSA-signed certificate. Once you’ve got a certificate, the process of signing is fiddly but pretty well documented, so I’m going to focus on acquiring one.

Self-signed

For testing purposes, a self-signed certificate is good enough, and creating one is easy. The downside is that there’s no verification of any information you put in the certificate; you could claim to be Bill Gates at Microsoft, for example. The point of signing an applet is to guarantee that the code comes from a known and verified person or organization. Since self-signed certificates don’t offer that guarantee, many browsers won’t run applets signed with them, or will only run them after the user clicks OK on scary security dialogs.

Trusted Third Parties

Firms like Verisign and Thawte are what are known as ‘Trusted Third Parties’ (TTPs). They do the work of checking that people who want a certificate are actually who they claim to be, by checking phone numbers, addresses and official documentation, and once they’re satisfied, they’ll issue a certificate containing that information. This certificate is itself signed with the TTP’s own certificate, which ships with all the major browsers. The chain of trust is that the browser publisher believes in the TTP’s procedures, so anyone the TTP signs is also treated with a higher level of trust.

In practice this means fewer scary security warnings, and the ability to run even on high security settings.

The downside is that the TTP’s checking procedures can take a long time, need a lot of documentation, and are also fairly costly (several hundred dollars for a year). I used Verisign, and I was very happy with their service, though they’re not the cheapest. Cynthia Klocke dealt with my order very efficiently; if you mail me, I can give you her contact details. Be aware that you’ll need a registered business name and a number in the phone book for that business that they can reach you at; they don’t register individuals, though I’ve heard Thawte will. Here’s a quick description of Verisign’s procedures.

Cross-domain Choices

A lot of different ways of fetching web pages from third-party sites have been tried. The biggest division is between server-based methods and those that run purely on the client.

Server methods

Server-based methods all rely on the privileged status of requests that come from the same domain as the web page the script is on. The usual security restriction is to only allow data to be read from the script’s domain, so these methods route external page requests through the script’s host. They use the server as a proxy: a middleman passing requests on from the client to the external site, and then passing the results back to the client. Jason Levitt has a good article on some different ways of implementing server proxies, but they all have a lot in common.

  • No client setup needed
  • The proxy will work with almost any browser, and it’s very painless for the end-user.

  • You need to configure your server
  • All of the methods involve some degree of either fiddling with Apache config files, or setting up CGI scripts to do the redirection. This can be a problem if you don’t have the access or experience to set that up on the server.

  • You can only access what your server can see through its connection
  • This helps security, because it means there’s no chance of a malicious peek at intranet servers, and you don’t have access to the user’s cookies. It can be a problem though if you want to check the availability and contents of a site as it appears to the user. For example with SearchMash, I wanted to bypass cloaking, and deal with what the user sees, rather than what the server gives search engines.

  • All traffic goes through your server
  • I’ve seen arguments that this is a good thing ethically, because you’re sharing the bandwidth pain with the site you’re fetching from. It seems a bit inelegant and wasteful though, since you’re using more network resources than if the fetch was being handled directly, and I’d prefer to handle throttling explicitly, rather than relying on a server’s bandwidth. SearchMash has no throttling; it’s something I’ll need to consider if traffic grows.

Client methods

There are two existing ways to do cross-domain fetches without using a server proxy: signed scripts and FlashXMLHttpRequest. I haven’t used either, since they both had limitations that made them unsuitable for SearchMash, but I’ll summarize what I understood from my research.

FlashXMLHttpRequest

For full information on FlashXMLHttpRequest, check out Julian Couvreur’s blog. It uses Flash’s HTTP library to do the fetching. It’s a great package: it has a JavaScript API that’s just like XMLHttpRequest, Flash is available on almost all machines, and there are no scary security warnings that the user has to click through. Flash will only fetch from sites that explicitly allow it, though, so mashing from arbitrary domains is not possible. This makes security less of a headache, but it didn’t support the sort of access I needed for SearchMash.
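For reference, a site opts in by serving a policy file called crossdomain.xml from its document root; a maximally permissive one looks like this:

```xml
<?xml version="1.0"?>
<cross-domain-policy>
  <!-- Allow Flash movies from any domain to read this site's pages -->
  <allow-access-from domain="*"/>
</cross-domain-policy>
```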

Signed Scripts

I didn’t find a definitive article on signed scripts, but here are a few of the articles that I found useful. Using signed scripts allows cross-domain page fetches through the standard XMLHttpRequest API, but it brings up a ‘do you trust this signed script?’ security dialog, and it only works on Mozilla browsers, which ruled it out for me.

MashProxy

This blog has most of the information on my implementation. It’s very similar in concept to FlashXMLHttpRequest, but it uses Java’s HTTP functions rather than Flash’s. The big difference for my purposes is that it supports fetching from any web site, which makes security a lot harder to implement, but opens up a lot of functionality. It requires a certificate, brings up a security window, and has an API that’s not familiar to users of XMLHttpRequest. Java doesn’t seem to be as widely deployed as Flash, but it’s on around 80-90% of desktops. The current implementation doesn’t work with Java 1.1 (it uses some simple Swing thread functions), but I hope to back-port it in the future. It also requires a visible applet to run in IE, though this can be small.

Building it yourself

MashProxy is an open source project, released under the GPL. You’ll need CVS to build your own copy of the Java applet if you want to run it on your own website, since my applet is hardcoded to only work on my domains. The SourceForge project page gives directions on how to get the code.

If you just want to experiment on your own machine, then life’s a lot easier. Download these files:

If you have them all in the same directory, and open index.html in your favorite browser, you should see the usual start page appear.

You can then start playing with searchgui.js, making alterations, or even write a whole new page structure and script using the mashproxy applet.

You won’t be able to run the applet on your own website though, for security reasons: I didn’t want code I’d certified as safe being scripted by anyone else. To distribute it yourself, you’ll need to get a certificate, build the Java project yourself, and sign it. That’s a complex process, and will have to wait for a future entry!

SearchMash’s implementation

SearchMash is the JavaScript that actually does something interesting with the MashProxy applet. It’s also a pretty small piece of code, only 423 lines long, and that’s with a deliberately verbose style.

The page it works on is made up of three frames, the left one for the search results, the large right one for the preview of pages in those results, and the small far right one to hold the applet. The far right one is needed because some browsers require that any applet be visible before it can be run, for security reasons.
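The layout described above might be declared along these lines (the frame names and proportions here are illustrative, not copied from the actual source):

```html
<frameset cols="45%,50%,5%">
  <!-- search results / page preview / small visible applet holder -->
  <frame name="searchframe" src="search.html">
  <frame name="previewframe" src="preview.html">
  <frame name="appletframe" src="applet.html">
</frameset>
```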

The script, searchgui.js, is included by the head of the main document, the one that defines the frames. There’s a fair amount of jumping through hoops to allow script communication between the different frames; one of the universal browser security measures is preventing frames from getting access to others from different domains.

The script does some onLoad voodoo to get around the patent-related restrictions that Microsoft implemented in Internet Explorer, so the user doesn’t have to click on the applet to activate it, but the real work doesn’t start until the applet has signalled to the script that it’s loaded. This is done by calling SB_NotifyAppletLoaded_Forward() in the applet’s frame, which then calls SB_NotifyAppletLoaded() in the main script.

This triggers the fetch of the starting search page. One pattern I use a lot to invoke script actions inside functions called from the applet is setTimeout() with a short duration. This is mostly to get around problems I ran into when calling back into the applet from script functions invoked by the applet. In a couple of places it’s more of a fudge, used as a way of trying to make sure that parsing or loading is complete before continuing. This second usage is pretty fragile, and should be replaced with explicit status checking.
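A minimal sketch of the pattern (the helper name and the stand-in functions are mine, not from searchgui.js):

```javascript
// Defer an action so that the Java-to-JavaScript call that invoked
// us can unwind before we call back into the applet.
function SB_Defer(action) {
  setTimeout(action, 0);
}

// For example, the applet's "loaded" notification might defer the
// first page fetch like this (hypothetical stand-ins for the real
// functions):
var fetched = [];
function SB_GetSearch(url) { fetched.push(url); }
function SB_NotifyAppletLoaded() {
  SB_Defer(function () { SB_GetSearch("http://www.google.com/"); });
}

SB_NotifyAppletLoaded();
// fetched is still empty at this point; the fetch happens on the
// next pass through the event loop, after the applet's call has
// returned.
```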

The SB_StartSearch() function looks at the input URL, and if there are any arguments, constructs a Google search link that passes them on. If there are no arguments, the default Google start page is used. The URL argument checking is there so that external plugins, like the search bars in Firefox and Safari, can call SearchMash directly. I had to tighten up the argument passing though: previously I allowed the full URL to be specified there, but now I make sure it’s a Google search one.
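The argument handling might look something like this sketch (the function name and details are mine; the real logic is in SB_StartSearch()):

```javascript
// Turn the URL SearchMash was opened with into a safe Google search
// URL: keep only the query arguments, never a caller-supplied page.
function buildSearchUrl(inputUrl) {
  var queryStart = inputUrl.indexOf("?");
  if (queryStart === -1) {
    // No arguments: use the default Google start page.
    return "http://www.google.com/";
  }
  // Pass the arguments on to a Google search link.
  return "http://www.google.com/search?" + inputUrl.substring(queryStart + 1);
}
```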

SB_GetSearch() sends a request for the search URL, and the script waits until it’s received.

The applet calls back SB_PageRequestDone_Forward() in its frame, which calls SB_PageRequestDone() in searchgui.js. The first thing the function does is some magic to convert the returned objects from Java strings to JavaScript ones. This only seems necessary in Safari; otherwise the objects only respond to Java string methods, which are close enough to JavaScript’s that debugging why some JavaScript ones were failing got very confusing.

The URL is checked to see whether it’s a Google search link, which should end up in the search frame, or an external page that was returned as part of the search results. The check is done in SB_IsSearchLink(): if it’s from a Google domain, then it’s a search result.
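The kind of test SB_IsSearchLink() performs could be sketched like this (the function name and the domain pattern are my own; the real list of accepted domains lives in searchgui.js):

```javascript
// Decide whether a returned page is a Google search result page or
// an external page, based on its host. The suffix pattern below is
// deliberately narrow so that hosts like google.evil.com don't pass.
function isSearchLink(url) {
  var match = url.match(/^https?:\/\/([^\/:]+)/i);
  if (!match) {
    return false;
  }
  var host = match[1].toLowerCase();
  return /(^|\.)google\.(com|co\.[a-z]{2}|[a-z]{2})$/.test(host);
}
```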

If it is destined for the search frame, then a <base> tag pointing to the original page is added to the HTML. Since we’ll be dumping the contents of the page into our own frame with a mashproxy.com location, we need to let the browser know to resolve relative paths back to the right location.
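The insertion itself is a one-line string edit, along these lines (my own sketch, not the actual code):

```javascript
// Add a <base> tag just after <head>, so relative links in the
// copied page resolve against its original location rather than
// our frame's mashproxy.com address.
function addBaseTag(html, originalUrl) {
  return html.replace(/<head([^>]*)>/i,
    '<head$1><base href="' + originalUrl + '">');
}
```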

The next step is a bit odd. As a security measure, browsers don’t let you monkey with the DOM of a frame once you’ve loaded it, even if it’s from the same domain, so I have to add a script into the HTML to do several things:

  • Add some text after external links in the page, indicating if they exist
  • Work out if a URL is present in the search page
  • If the mouse is over an external link, invoke the preview frame
  • If it’s an external link, fetch its contents to see if they exist
  • If the user clicks on a search navigation link (next page, etc), open it through SearchMash
  • If the user submits a new search through a form, route that through SearchMash too

After the script is added, document.write() is called to set up the frame, and the SB_StartDocumentParsing() function is called to start the customization of the search page and the checking of external links. This is one of the more dubious uses of setTimeout(); using an onload handler would be better, but I didn’t want to interfere with the script in the search page.

Once an external link is returned from the applet, it goes through the same SB_PageRequestDone() as search pages, but since it’s not from a Google domain, it gets handled in the second branch. The first thing that branch does is check that the external link is present in the current search page; if it isn’t, no processing is done on it. This is both to improve performance, and to limit the damage if an external script finds a way to request pages.

If it is found, the status text next to the link in the search page is set, and the contents of the page are stored for later use in the preview frame.

The last piece of functionality is invoked when the user moves her mouse over an external link. The event handler in the search page calls SB_SetPreviewFrame(), which either sets the frame’s location to the external link (if it hasn’t been loaded by the status checking yet), or writes the stored contents into the frame (if they’re available).

This step is the one most vulnerable to abuse, since I’m writing external HTML into a frame with my domain’s privileges. To protect against malice, the SB_RemoveScripts() function goes through the contents before they’re added, and removes any scripts using a regex-based blacklist.
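A regex blacklist along these lines gives the flavor (this is my own sketch, not the actual SB_RemoveScripts() code, and blacklists like this are notoriously easy to bypass, which is worth keeping in mind):

```javascript
// Strip <script> blocks, stray script tags and inline on* event
// handlers from third-party HTML before it's written into a
// privileged frame.
function removeScripts(html) {
  return html
    .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, "")
    .replace(/<\/?script\b[^>]*>/gi, "")
    .replace(/\son\w+\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)/gi, "");
}
```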

MashProxy’s implementation

MashProxy is a very small Java applet, only 385 lines long and 12Kb when compiled. Its job is to receive URLs from the JavaScript of the page it’s running in, and return the HTML contents of those pages.

The first thing it has to do is call back into the page it’s contained in when it’s initialized, so that the script knows it’s safe to start asking it to fetch pages. Calling an applet before it’s initialized doesn’t work; I also tried checking its alive state and other methods, but having the applet announce itself was the most reliable way across different browsers to discover that it’s active.

It’s in the init function that the current document’s location is checked, and if it isn’t on the whitelist, or being run from a local file, the applet silently fails.

If it’s determined that it’s OK to run, the applet stores the current JavaScript window and document objects, which it will use for future callbacks, and then calls the SB_NotifyAppletLoaded_Forward() JavaScript function in the current document. The function name is hardcoded; it has a _Forward suffix because the SearchMash implementation runs the applet in a frame, so the JSWindow calls back into that frame, and the function is just a stub that calls the real code in the main frame.

It also starts up a separate thread to handle incoming requests. This is because callbacks into signed applets from JavaScript don’t get the same security privileges, so it’s necessary to pass the information from such a function to a different thread, and have that trusted thread do the actual work.

As far as I can tell from Sun’s comments on the need for this, it’s to prevent accidental exposure of a signed applet’s functions to JavaScript, and it does ensure that you have to explicitly enable any trusted code before you can call it. However, it may be that they were hoping to prevent JavaScript from invoking trusted Java code at all; that’s not the effect they achieved, they just made the code to do it more complex.

The main entry point to the applet is the pageRequest() function. This takes a URL, and a string containing the JavaScript function name to call back. For now I’ve disabled the ability to set the callback, and hardcoded it to SB_PageRequestDone_Forward() in the applet’s frame, to make it tougher for malicious third-party scripts loaded in other frames to use the applet.

The callback then pushes the arguments into member variables of the applet, and signals the trusted thread to handle the request by waking it. The trusted thread reads the arguments, and does the actual work.

It first fires off a HEAD request to the web page using HttpURLConnection to get the status code, and quickly discover if the page is missing or moved. After that, it requests the full page. There’s also a timeout thread on these requests that kicks in after 20 seconds of no response, and sets the status to HTTP_CLIENT_TIMEOUT.

When the results are returned, the pageRequestDone() function is called with the source URL and the status code, and either null for the contents if the page couldn’t be found, or a BufferedReader representing the contents. The reader is converted into a string, and the JavaScript callback function (currently hardcoded to SB_PageRequestDone_Forward) is called with the results.

The applet is able to handle multiple concurrent requests because of its threading model. One quirk is that I had to include a JSObject jar to enable the LiveConnect functionality I needed to call back and forth between JavaScript and Java.