Browser Compatibility

Error
One of the design goals of SearchMash was to work in a lot of different browsers, which is why I tried to avoid using browser-specific technology.

Both the LiveConnect technology that allows Java and JavaScript to call each other, and the JavaScript Document Object Model (DOM) that I use to edit HTML in a page, are well supported by the major browsers. The biggest hurdle is having a recent version of Java installed. I’ve seen figures of around 80-90% of desktops having some version of Java, but I haven’t seen a breakdown by version; I suspect many of those are 1.1, which is why I hope to back-port MashProxy to that version. It’s a respectable deployment, but not as high as Flash (I’ve seen 98% figures for some version of that).

My main development platform is OS X, so Safari and Firefox get the most testing, and SearchMash works without problems on both. I’ve also tried some of the other Mac browsers, such as Opera and IE 5.1, but these are not widely used, so I haven’t spent the time to ensure SearchMash works on them. I did hit some quirks in Safari’s LiveConnect implementation: Java strings don’t seem to be converted to JavaScript ones when they’re passed back to the script, but stay as wrapped Java objects. This means you can call Java string functions on them from the script, but JavaScript functions don’t work. Since there’s some overlap between JavaScript’s and Java’s string functions, it took me a while to figure out what was happening, but once I did, I was able to force a conversion by creating a new JavaScript string variable from the one that was passed back.
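
The conversion itself is tiny. Here’s a sketch of the idea (the function name is mine, and a plain object with a toString() stands in for the Java-wrapped value, since you need a browser to see the real thing):

```javascript
// Sketch of forcing a Java-wrapped value into a native JavaScript string.
// Creating a new JS string from the returned value makes the normal
// JavaScript string methods available again.
function toJsString(javaValue) {
  return String(javaValue);
}

// Any object with a toString() stands in for the wrapped Java string here.
var converted = toJsString({ toString: function () { return "hello"; } });
```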

Since most of the world uses a PC, making sure that SearchMash works on Firefox and Internet Explorer was a priority. I only have one machine with Windows XP available, but I’ve been able to make sure it runs on IE 6 and 7, and on Firefox 1.5.

I’ve found IE’s Java to JavaScript connection to be a bit more picky than the other browsers’. For example, I’ve just found a bug that can cause the applet to crash when I change pages in IE; it looks like using an out-of-date JavaScript DOM object can cause a security exception.

I’ve done no Linux testing, because I don’t have a system set up to run it. I didn’t hit any big differences between Firefox on the Mac and Windows, so I’m hopeful it will just work on Linux/Firefox too.

Supported Browsers

  • Internet Explorer 6.0 (Windows)
  • Internet Explorer 7.0 (Windows)
  • Safari 2.0 (OS X)
  • Firefox 1.5 (Windows and OS X)

Untested

  • Internet Explorer 5.0 (Windows)
  • Opera (Windows)
  • Netscape
  • No Linux testing has been done

Tested, but not working

  • Internet Explorer 5.1 (OS X)
  • Opera (OS X)

Bugs and features

Ladybird
One of the nice things that sourceforge offers is a feature and bug tracking system. You can check out the bugs here and the features here, or click on the links from the main project page.

I’ve started things off with some bugs and features I’ve had on my mental list for a while, but you should feel free to jump in and add your own requests.

Bugs

  • Make MashProxy Java 1.1 compatible
  • MashProxy doesn’t really need to use any new Java features, but it relies on a couple just because it was written in a 1.4 environment. I’d like to remove those dependencies, so it’ll run even on really old versions of Java.

  • Previewing missing sites on IE stops window
  • I noticed this while using my parents’ PC on vacation; I normally develop on a Mac with Safari, where I don’t see the problem. Moving the mouse over a missing link to invoke the preview causes the preview window to stop responding to any further requests to show pages, even valid ones.

Features

  • Improve appearance of status indication
  • I think either icons or special formatting of the title would be better than the current (found) or (missing) text that’s added after each link.

  • Provide ask.com as an alternative to Google
  • I normally only use Google, but it seems like it wouldn’t be too hard technically to parse ask.com’s results too, and offer users a choice.

  • Check for search terms in the page
  • This was one of the big features I wanted to help my searching, but I ran out of time to implement it before the first release. It would catch out sites that do ‘cloaking’ (showing one set of results to Google to get the search terms, but another to normal users).

  • Show multiple search pages
  • Ten search results to a page sometimes feels a bit stingy, and it seems like it wouldn’t be too hard to concatenate multiple pages of results, one above the other. I’m not sure how many to show at once, but I’d probably try four pages, and forty results, and see how that feels.
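
The cloaking check above could be sketched like this (hypothetical code, not anything in the project yet): report whether every search term actually appears in the page text that was fetched for the user.

```javascript
// Hypothetical sketch of the proposed cloaking check: return true only if
// every search term appears somewhere in the page text the user would see.
function termsInPage(pageText, terms) {
  var lower = pageText.toLowerCase();
  for (var i = 0; i < terms.length; i++) {
    if (lower.indexOf(terms[i].toLowerCase()) === -1) {
      return false; // a term is missing: possible cloaking
    }
  }
  return true;
}
```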

Getting a Certificate

Certificate
As I mention in my post on building your own MashProxy applet, you’ll need to sign the applet you build with an RSA-Signed Certificate. Once you’ve got a certificate, the process of signing is fiddly but pretty well documented, so I’m going to focus on acquiring one.

Self-signed

For testing purposes, a self-signed certificate is good enough, and creating one is easy. The downside is that there’s no verification of any information you put in the certificate; for example, you could claim that you’re Bill Gates at Microsoft. The point of signing the applet is to guarantee that the code comes from a known and verified person or organization. Since self-signed certificates don’t offer that guarantee, many browsers won’t run applets signed with them, or will only run them after the user clicks OK on scary security dialogs.

Trusted Third Parties

Firms like Verisign, Thawte and others are what is known as ‘Trusted Third Parties’ (TTPs). They do the work of checking that people who want a certificate are actually who they claim to be, by checking phone numbers, addresses and official documentation, and once they’re satisfied, they’ll issue a certificate containing that information. This certificate is itself signed by the TTP’s own certificate, which ships with all the major browsers. The chain of trust is that the browser publisher believes in the TTP’s procedures, so anyone the TTP signs is also treated with a higher level of trust.

In practice this means fewer scary security warnings, and the ability to run even on high security settings.

The downside is that the TTP’s checking procedures can take a long time, need a lot of documentation, and are fairly costly (several hundred dollars for a year). I used Verisign, and I was very happy with their service, though they’re not the cheapest. Cynthia Klocke dealt with my order very efficiently; if you mail me, I can give you her contact details. Be aware that you’ll need a registered business name and a listed phone number for that business where they can reach you; they don’t register individuals, though I’ve heard Thawte will. Here’s a quick description of Verisign’s procedures.

Cross-domain Choices

Horror

A lot of different ways of fetching web pages from third-party sites have been tried. The biggest division is between server-based methods and those that run purely on the client.

Server methods

Server-based methods all rely on the privileged status of requests that come from the same domain as the web page the script is on. The usual security restriction is to only allow data to be read from the script’s domain, so these route external page requests through the script’s host. They use the server as a proxy, acting as a middleman that passes requests on from the client to the external site, and then passes the results back to the client. Jason Levitt has a good article on some different ways of implementing server proxies, but they all have a lot in common.

  • No client setup needed
  • The proxy will work with almost any browser, and it’s very painless for the end-user

  • You need to configure your server
  • All of the methods involve some degree of either fiddling with Apache config files, or setting up CGI scripts to do the redirection. This can be a problem if you don’t have the access or experience to set that up on the server.

  • You can only access what your server can see through its connection
  • This helps security, because it means there’s no chance of a malicious peek at intranet servers, and you don’t have access to the user’s cookies. It can be a problem though if you want to check the availability and contents of a site as it appears to the user. For example with SearchMash, I wanted to bypass cloaking, and deal with what the user sees, rather than what the server gives search engines.

  • All traffic goes through your server
  • I’ve seen arguments that this is a good thing ethically, because you’re sharing the bandwidth pain with the site you’re fetching from. It seems a bit inelegant and wasteful though, since you’re using more network resources than if the fetch was being handled directly, and I’d prefer to handle throttling explicitly, rather than relying on a server’s bandwidth. SearchMash has no throttling; it’s something I’ll need to consider if traffic grows.
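
To make the routing concrete, here’s a hypothetical helper (the /cgi-bin/proxy.cgi endpoint is invented for illustration) that rewrites an external URL so the request goes through the script’s own host instead of straight to the third-party site:

```javascript
// Hypothetical: route a request for an external page through a CGI proxy
// on the script's own domain, so the same-origin restriction is satisfied.
// The endpoint name is made up for this sketch.
function proxyUrl(target) {
  return "/cgi-bin/proxy.cgi?url=" + encodeURIComponent(target);
}
```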

Client methods

There are two existing ways to do cross-domain fetches without using a server proxy: signed scripts and FlashXMLHttpRequest. I haven’t used either, since they both had limitations that made them unsuitable for SearchMash, but I’ll summarize what I understood from my research.

FlashXMLHttpRequest

For full information on FlashXMLHttpRequest, check out Julian Couvreur’s blog. It uses Flash’s HTTP library to do the fetching. It’s a great package: it has a JavaScript API that’s just like XMLHttpRequest, Flash is available on almost all machines, and there are no scary security warnings that the user has to click through. Flash will only fetch from sites that explicitly allow it, though, so mashing from arbitrary domains is not possible. This makes security less of a headache, but it didn’t support the sort of access I needed for SearchMash.

Signed Scripts

I didn’t find a definitive article on signed scripts, but here are a few of the articles that I found useful. Using signed scripts allows cross-domain page fetches through the standard XMLHttpRequest API, but it brings up a ‘do you trust this signed script?’ security dialog, and only works on Mozilla browsers, which ruled it out for me.

MashProxy

This blog has most of the information on my implementation. It’s very similar in concept to FlashXMLHttpRequest, but uses Java’s HTTP functions rather than Flash’s. The big difference for my purposes is that it supports fetching from any web site, which makes security a lot harder to implement, but opens up a lot of functionality. It requires a certificate, brings up a security dialog, and has an API that’s not familiar to users of XMLHttpRequest. Java doesn’t seem to be as widely deployed as Flash, but it’s on around 80-90% of desktops. The current implementation doesn’t work with Java 1.1 (it uses some simple Swing thread functions), but I hope to back-port it in the future. It also requires a visible applet to run in IE, though this can be small.

Vacation

Swan
I’m back in the UK for the next week, my first trip home in three years! We’ve just headed up to my parents’ in Cambridge, after two days in London. The visit to the US embassy for the routine visa stuff was the usual nightmare of arbitrary bureaucracy, but I survived. Liz spent the afternoon in Regent’s Park, and we went on a guided walk around London in the evening. She’s got some photos up at http://lizbaumann.com/Britain2006.html; the hotel room is really something to be seen, very Austin Powers.

It’s raining and blowing a gale here at the moment, and I’m loving it after the complete lack of weather in LA! A walk around my home village of Over was muddy, but looking around the 800-year-old church and graveyard always gives me a real sense of wonder.

After a weekend of marmite, tea and roast dinners here at my parents’, we’re heading up to Keswick in the Lake District for a few days of hiking and warm beer. My brother, sister and sister-in-law are all coming too, and it’ll be the first time we’ve gone away together since we were kids.

Building it yourself

Hammer

MashProxy is an open source project, released under the GPL. You’ll need CVS to build your own copy of the Java applet if you want to run it on your own website, since my applet is hardcoded to only work on my domains. The sourceforge project page gives directions on how to get the code.

If you just want to experiment on your own machine, then life’s a lot easier. Download these files:

If you have them all in the same directory, and open index.html in your favorite browser, you should see the usual start page appear.

You can then start playing with searchgui.js, making alterations, or even write a whole new page structure and script using the mashproxy applet.

You won’t be able to run the applet on your own website though; for security reasons, I didn’t want code I’d certified as safe being scripted by anyone else. To distribute it yourself, you’ll need to get a certificate, build the Java project yourself, and sign it. That’s a complex process, and will have to wait for a future entry!

Why the hell are you sucking my bandwidth for your mashup?

Angry
That’s a good question.

The informal rules behind what’s acceptable use of someone else’s web server are clear if you write a new browser. Nobody complained when Firefox came along, because there are real people reading the content that the server owners are paying to send.

The rules are also well understood if you write a new robot to crawl the web: it should tread very lightly indeed, respect the robots.txt file, and keep some delay between fetches, so as to avoid slowing down the server for the real traffic.

SearchMash is somewhere in between these two extremes. Originally, it was a pure browser, and it is still entirely user-directed, so there’s a good chance that the bandwidth is going towards your target audience. On the other hand, an entire page of search results will be fetched at once, so it’s not as user-directed as if the user had directly clicked on your link.

To keep the bandwidth demands as small as possible, I avoid fetching anything but the main HTML until the user requests a preview of the page, so no images are requested.

I know not everyone will agree that it’s a net benefit, so I’ve made sure that the User-Agent header is always set to MashProxy for all requests, so servers can easily block my traffic. I considered a whitelist system too, since that would also prevent intranet access, but could see no practical way of it gaining adoption.
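
For a server owner who does want to opt out, the check is trivial. A sketch of a server-side filter (the function name is mine; how you wire it into your server is up to you):

```javascript
// Hypothetical server-side check: MashProxy sets its User-Agent header on
// every request, so its traffic can be dropped with a simple match.
function shouldBlockRequest(userAgent) {
  return /MashProxy/i.test(userAgent || "");
}
```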

Security wishlist

Notebook

Writing SearchMash, there’s a lot of security features I wish I had access to.

  • One way access to frames
  • My big headache is running external, untrusted HTML in a frame that has full access to my applet. There should be a way to fill a frame with programmatic content but deny any scripts in it access to other frames. I believe the security=restricted frame attribute might allow this in IE, but it’s not supported by the other browsers.

  • Disabling scripting for frames
  • I’d be happy if I could turn off scripting entirely for a particular frame. This is what my blacklisting code does, but it seems like it would be a lot easier and more robust to do it at the browser level.

  • Turn off cookies for Java
  • I don’t want my page fetches to send cookies, but there doesn’t seem to be any way to disable this when running an applet inside the browser.

  • Signed scripts
  • These apparently allow JavaScript scripts the same privileges as signed Java applets. I say apparently because they’re only supported by the Netscape family, and since I care about supporting IE, I haven’t tried them. If MS supported signed scripts, it would remove the need for a Java runtime. I don’t think it would solve any of the real security issues though.

    Most of these requests are for more fine-grained control over the security restrictions within the browser. They’re mostly things that are exposed to the user as global switches anyway (like disabling cookies or JavaScript), so it doesn’t seem like it should be problematic to allow increased restrictions when required. I think they would make writing a secure Ajax app a lot easier.

SearchMash’s implementation

Sparkplug

SearchMash is the JavaScript that actually does something interesting with the MashProxy applet. It’s also a pretty small piece of code, only 423 lines long, and that’s with a deliberately verbose style.

The page it works on is made up of three frames: the left one for the search results, the large right one for the preview of pages in those results, and the small far-right one to hold the applet. The far-right frame is needed because some browsers require, for security reasons, that an applet be visible before it can run.

The script, searchgui.js, is included by the head of the main document, the one that defines the frames. There’s a fair amount of jumping through hoops to allow script communication between the different frames, since one of the universal browser security measures is to prevent frames from getting access to others from different domains.

The script does some onLoad voodoo to get around the Explorer patent restrictions that MS implemented, so the user doesn’t have to click on the applet to activate it, but the real work doesn’t start until the applet has signalled to the script that it’s loaded. This is done by calling SB_NotifyAppletLoaded_Forward() in the applet’s frame, which then calls SB_NotifyAppletLoaded() in the main script.

This triggers the fetch of the starting search page. One pattern I use a lot to invoke script actions inside functions called from the applet is setTimeout() with a short duration. This is mostly to get around problems I ran into when I was calling back into the applet from script functions invoked by the applet. In a couple of places, it’s a bit more of a fudge, used as a way of trying to make sure that parsing or loading is complete before continuing. This second usage is pretty fragile, and should be replaced with explicit status checking.
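
The pattern looks roughly like this (the handler names here are illustrative, not the actual SearchMash code): a callback invoked from the applet schedules the real work on a fresh JavaScript stack, so we never call back into the applet from inside an applet-initiated call.

```javascript
// Deferral pattern: when the applet calls into the script, defer the real
// work with a short setTimeout() so it runs on a clean JavaScript stack.
var handled = [];

function handlePageResult(url, contents) { // hypothetical worker function
  handled.push(url);
}

function SB_PageRequestDone_Deferred(url, contents) {
  setTimeout(function () {
    handlePageResult(url, contents);
  }, 10);
}
```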

The SB_StartSearch() function looks at the input URL, and if there are any arguments, constructs a Google search link that passes them on. If there are no arguments, the default Google start page is used. The URL argument checking is there so that external plugins, like the search bars in Firefox and Safari, can call SearchMash directly. I had to tighten up the argument passing though: previously I allowed the full URL to be specified there, but now I make sure it’s a Google search one.
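
A sketch of the kind of logic this involves (a hypothetical reconstruction, not the actual implementation): pull a q argument off the frameset URL’s query string and build a Google search link from it, falling back to the start page.

```javascript
// Hypothetical reconstruction: if the page URL carries a "q" argument,
// build a Google search link from it; otherwise use the start page.
function buildSearchUrl(locationSearch) {
  var match = /[?&]q=([^&]+)/.exec(locationSearch);
  if (match) {
    return "http://www.google.com/search?q=" + match[1];
  }
  return "http://www.google.com/";
}
```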

SB_GetSearch() sends a request for the search URL, and the script waits until it’s received.

The applet calls back SB_PageRequestDone_Forward() in its frame, which calls SB_PageRequestDone() in searchgui.js. The first thing the function does is some magic to convert the returned objects from Java strings to JavaScript ones. This only seems necessary in Safari; otherwise the objects only respond to Java string methods, which are close enough to JavaScript’s that debugging why some JavaScript ones were failing got very confusing.

The URL is checked to see if it’s a Google search link, which should end up in the search frame, or an external page that was returned as part of the search results. The check is done in SB_IsSearchLink(): if it’s from a Google domain, then it’s a search result.
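
A hedged sketch of such a host check (this is my illustration of the idea, not the shipped code): only treat a URL as a search result page if its host is google.com or a subdomain of it, so look-alike domains don’t pass.

```javascript
// Sketch of the host check: the URL counts as a search link only when the
// host is google.com or ends with .google.com, with a proper delimiter after.
function SB_IsSearchLink(url) {
  return /^https?:\/\/([^\/?#]*\.)?google\.com(\/|$|[?#])/.test(url);
}
```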

If it is destined for the search frame, then a <base> tag pointing to the original page is added to the HTML. Since we’ll be dumping the contents of the page into our own frame with a mashproxy.com location, we need to let the browser know to resolve relative paths back to the right location.
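
A minimal sketch of the insertion (my own string-based illustration; the real code may do this differently): slip a <base> element in right after the opening <head> tag.

```javascript
// Sketch: insert a <base> tag just after <head>, so relative links in the
// copied page resolve against the original URL rather than our frame's.
function addBaseTag(html, originalUrl) {
  return html.replace(/<head([^>]*)>/i,
    '<head$1><base href="' + originalUrl + '">');
}
```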

The next step is a bit odd. As a security step, browsers don’t let you monkey with the DOM of a frame once you’ve loaded it, even if it’s the same domain, so I have to add a script into the HTML to do several things:

  • Add some text after external links in the page, indicating if they exist
  • Work out if a URL is present in the search page
  • If the mouse is over an external link, invoke the preview frame
  • If it’s an external link, fetch its contents to see if they exist
  • If the user clicks on a search navigation link (next page, etc), open it through SearchMash
  • If the user submits a new search through a form, route that through SearchMash too

After the script is added, document.write() is called to set up the frame, and the SB_StartDocumentParsing() function is called to start the customization of the search page and the checking of external links. This is one of the more dubious uses of setTimeout(); using an onload handler would be better, but I didn’t want to interfere with the script in the search page.

Once an external link is returned from the applet, it goes through the same SB_PageRequestDone() as search pages, but since it’s not from a Google domain, it gets handled in the second branch. The first thing this branch does is check that the external link is present in the current search page; if it isn’t, no processing is done on it. This is both to improve performance, and to make life harder if an external script finds a way to request pages.
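
A simplified, string-based sketch of that presence check (the real code works against the frame’s DOM; this version only illustrates the idea):

```javascript
// Simplified sketch: only process a returned page if its URL actually
// appears as a link in the current search page's HTML.
function linkInSearchPage(searchHtml, url) {
  return searchHtml.indexOf('href="' + url + '"') !== -1;
}
```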

If it is found, the status text next to the link in the search page is set, and the contents of the page are stored for later use in the preview frame.

The last piece of functionality is invoked when the user moves her mouse over an external link. The event handler in the search page calls SB_SetPreviewFrame(), which either sets the location of the frame to the external link, if it hasn’t been loaded by the status checking yet, or, if the contents have been stored, writes them into the frame.

This step is the one most vulnerable to abuse, since I’m writing external HTML into a frame with my domain’s privileges. To protect against malice, the SB_RemoveScripts() function goes through the contents before they’re added, and removes any scripts, using a regex-based blacklist.
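
To give a flavour of the blacklist approach, here’s a simplified sketch (my own illustration, much cruder than the production SB_RemoveScripts()): strip <script> blocks, inline event handlers, and javascript: URLs before the HTML is written into the frame.

```javascript
// Simplified sketch of a regex blacklist: remove <script> blocks, inline
// on* event handlers, and neutralize javascript: URLs. Not exhaustive.
function removeScripts(html) {
  return html
    .replace(/<script[\s\S]*?<\/script\s*>/gi, "")
    .replace(/\son\w+\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)/gi, "")
    .replace(/javascript\s*:/gi, "blocked:");
}
```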

Script blocking updated

Stop
I’ve checked in some changes to the SearchMash JavaScript, intended to entirely remove any script content from displayed pages. Previously I was just removing <script> tags; now I also look for javascript: URLs, eval calls and event handlers (onload, etc). I’ve also added some ‘canonicalization’ (there should be a better word for that) steps to try and defeat some of the common workarounds for regex blocking, like inserting newlines or spaces.
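
A sketch of what a canonicalization pass like that can look like (illustrative only, not the checked-in code): fold whitespace tricks back into a canonical form before the blacklist regexes run.

```javascript
// Hypothetical canonicalization pass: collapse the newline and spacing
// tricks commonly used to dodge regex blocking, so the blacklist regexes
// see a canonical form.
function canonicalize(html) {
  return html
    .replace(/[\r\n\t]+/g, " ")                     // newlines/tabs to spaces
    .replace(/java\s+script\s*:/gi, "javascript:"); // rejoin split schemes
}
```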

I also changed the way I work out whether a link in a results page is a search link (and so should open in the left frame). This change restricts the pages that get opened as search results to those on the Google domain; previously it would have been possible for someone to set up an http://www.notgoogle.com domain and I’d have opened it there.

For both of these changes, I switched over to using regular expressions, since that’s a lot easier to understand than my previous logic.

I feel a bit better about the security of running external HTML in my domain now; I might try and get back to some feature work after this.

Now, I think I need to spend my remaining weekend with a margarita and a hot tub, and prepare for my real job tomorrow.