Building it yourself

MashProxy is an open source project, released under the GPL. You’ll need CVS to fetch the source and build your own copy of the Java applet if you want to run it on your own website, since my applet is hardcoded to only work on my domains. The SourceForge project page gives directions on how to get the code.

If you just want to experiment on your own machine, then life’s a lot easier. Download these files:

If you have them all in the same directory, and open index.html in your favorite browser, you should see the usual start page appear.

You can then start playing with searchgui.js, making alterations, or even write a whole new page structure and script using the mashproxy applet.

You won’t be able to run the applet on your own website though. For security reasons, I didn’t want code I’d certified as safe being scripted by anyone else. To distribute it yourself, you’ll need to get a certificate, build the Java project yourself, and sign it. That’s a complex process, and will have to wait for a future entry!

Why the hell are you sucking my bandwidth for your mashup?

That’s a good question.

The informal rules about acceptable use of someone else’s web server are clear if you write a new browser. Nobody complained when Firefox came along, because there are real people reading the content that the server owners are paying to send.

The rules are also well understood if you write a new robot to crawl the web: it should tread very lightly indeed, respect the robots.txt file, and leave delays between fetches, so as to avoid slowing down the server for the real traffic.

SearchMash is somewhere in between these two extremes. Originally, it was a pure browser. It is still entirely user-directed, so there’s a good chance that the bandwidth is going towards your target audience. On the other hand, an entire page of search results will be fetched at once, so it’s not as user directed as if they’d directly clicked on your link.

I do avoid fetching anything but the main HTML until the user requests a preview of the page, to keep the bandwidth demands as small as possible, so no images are requested.

I know not everyone will agree that it’s a net benefit, so I’ve made sure that the User-Agent header is always set to MashProxy for all requests, so servers can easily block my traffic. I considered a whitelist system too, since that would also prevent intranet access, but could see no practical way of that gaining adoption.

Security wishlist

While writing SearchMash, I found there are a lot of security features I wish I had access to.

  • One way access to frames
    My big headache is running external, untrusted HTML in a frame that has full access to my applet. There should be a way to set up a frame with programmatic content, but not allow any scripts in it access to other frames. I believe that the security=restricted frame attribute might allow this in IE, but it’s not supported by the other browsers.

  • Disabling scripting for frames
    I’d be happy if I could turn off scripting entirely for a particular frame. This is what my blacklisting code does, but it seems like it would be a lot easier and more robust to do it at the browser level.

  • Turn off cookies for Java
    I don’t want my page fetches to send cookies, but there doesn’t seem to be any way to disable this when running an applet inside the browser.

  • Signed scripts
    These apparently allow JavaScript scripts the same privileges as signed Java applets. I say apparently because they’re only supported by the Netscape family, so since I care about supporting IE, I haven’t tried them. If MS supported signed scripts, it would remove the need for a Java runtime. I don’t think it would solve any of the real security issues though.

Most of these requests are for more fine-grained control over the security restrictions within the browser. It’s mostly things that are exposed to the user as global switches anyway (like disabling cookies or JavaScript), so it doesn’t seem like it should be problematic to allow increased restrictions when required. I think they would make writing a secure Ajax app a lot easier.

SearchMash’s implementation

SearchMash is the JavaScript that actually does something interesting with the MashProxy applet. It’s also a pretty small piece of code, only 423 lines long, and that’s with a deliberately verbose style.

The page it works on is made up of three frames, the left one for the search results, the large right one for the preview of pages in those results, and the small far right one to hold the applet. The far right one is needed because some browsers require that any applet be visible before it can be run, for security reasons.

The script, searchgui.js, is included by the head of the main document, the one that defines the frames. There’s a fair amount of jumping through hoops to allow script communication between the different frames, since one of the universal browser security measures is preventing frames from getting access to others from different domains.

The script does some onLoad voodoo to get around the patent-related restrictions that MS implemented in Internet Explorer, so that the user doesn’t have to click on the applet to activate it. The real work doesn’t start until the applet has signalled to the script that it’s loaded. This is done by calling SB_NotifyAppletLoaded_Forward() in the applet’s frame, which then calls SB_NotifyAppletLoaded() in the main script.
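The forwarding stub can be very small. The sketch below is a guess at its shape rather than the code from the project: it assumes the applet’s frame is a direct child of the frameset document that includes searchgui.js, so that `parent` is the window holding the real function.

```javascript
// Hypothetical script in the applet's frame.
// The applet calls back into the frame it lives in, so this stub
// relays the notification to the main document.
function SB_NotifyAppletLoaded_Forward() {
  // Assumption: `parent` is the frameset window where searchgui.js
  // defines SB_NotifyAppletLoaded().
  parent.SB_NotifyAppletLoaded();
}
```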

This triggers the fetch of the starting search page. One pattern I use a lot to invoke script actions inside functions called from the applet is setTimeout() with a short duration. This is mostly to get around problems I ran into when I was calling back into the applet from script functions invoked by the applet. In a couple of places, the timeouts are more of a fudge, used as a way of trying to make sure that parsing or loading is complete before continuing. This second usage is pretty fragile, and should be replaced with explicit status checking.
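Here is a minimal sketch of the first use of setTimeout(), deferring work so that it doesn’t call back into the applet from inside an applet-invoked function. The names deferAppletWork and onAppletReady, and the 10 ms delay, are made up for illustration; only pageRequest() and SB_PageRequestDone_Forward come from the real code.

```javascript
// Run `work` after the function that the applet called has returned.
// `schedule` defaults to setTimeout; it is a parameter only so the
// sketch can be exercised with a stand-in timer.
function deferAppletWork(work, schedule = setTimeout) {
  schedule(work, 10); // 10 ms: an arbitrary "short duration"
}

// Hypothetical handler that the applet invokes from Java.
function onAppletReady(applet, url, schedule = setTimeout) {
  // Calling applet.pageRequest() right here would call back into the
  // applet while it is still in the middle of its own call into script.
  deferAppletWork(function () {
    applet.pageRequest(url, "SB_PageRequestDone_Forward");
  }, schedule);
}
```

The deferred function runs on a later turn of the browser’s event loop, by which time the applet’s original call has finished.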

The SB_StartSearch() function looks at the input URL, and if there are any arguments, constructs a Google search link that passes them on. If there are no arguments, the default Google start page is used. The URL argument checking is there so that external plugins, like the search bars in Firefox and Safari, can call SearchMash directly. I had to tighten up the argument passing though. Previously I allowed the full URL to be specified there; now I make sure it’s a Google search one.
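The argument handling could look roughly like this. It is a sketch, not the contents of SB_StartSearch(): the function name, the two Google URLs and the decision to pass the whole query string through unchanged are all assumptions.

```javascript
// Assumed base URLs for this sketch.
const GOOGLE_HOME = "http://www.google.com/";
const GOOGLE_SEARCH = "http://www.google.com/search";

// Work out the first page to fetch from the address SearchMash was
// opened with, e.g. http://mashproxy.com/index.html?q=java+applet
function buildStartUrl(pageUrl) {
  const query = new URL(pageUrl).search; // "" or "?q=..."
  // Only the query string is reused. The host is always Google's, so
  // a caller cannot point SearchMash at an arbitrary site.
  return query.length > 1 ? GOOGLE_SEARCH + query : GOOGLE_HOME;
}
```

Whatever the caller puts in the arguments, the result is still a request to the search engine rather than to a site of the caller’s choosing.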

SB_GetSearch() sends a request for the search URL, and the script waits until it’s received.

The applet calls back SB_PageRequestDone_Forward() in its frame, which calls SB_PageRequestDone() in searchgui.js. The first thing the function does is some magic to convert the returned objects from Java strings to JS ones. This only seems necessary in Safari, where the objects otherwise respond only to Java string methods. Those are close enough to the JS ones that debugging why some JS methods were failing got very confusing.
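One common way to do this kind of conversion is to concatenate the value with an empty string, which forces a native JS string. I’m showing it as an illustration only; the helper name is invented and the real conversion in searchgui.js may differ.

```javascript
// Coerce a value handed over from Java into a native JS string.
// Concatenating with "" converts the object to a primitive string;
// null and undefined are passed through unchanged.
function toJsString(value) {
  return value === null || value === undefined ? value : "" + value;
}
```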

The URL is checked to see if it’s a Google search link, which should end up in the search frame, or an external page that was returned as part of the search results. The check is done in SB_IsSearchLink(): if the URL is on a Google domain, it’s treated as a search link; otherwise it’s an external page.
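A domain check like this can be done with one anchored regular expression. The pattern below is my own sketch and assumes only google.com and its subdomains count as search hosts; the expression in SB_IsSearchLink() may be different.

```javascript
// Matches http(s) URLs whose host is google.com or a subdomain of it.
// Each optional prefix label must end in a dot, so a host such as
// www.notgoogle.com does not match.
const GOOGLE_HOST = /^https?:\/\/(?:[a-z0-9-]+\.)*google\.com(?::\d+)?(?:[\/?#]|$)/i;

function isSearchLink(url) {
  return GOOGLE_HOST.test(url);
}
```

Anchoring at the start and requiring a delimiter after the host keeps lookalikes such as google.com.evil.example, or a Google URL buried in another site’s query string, in the external branch.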

If it is destined for the search frame, then a <base> tag pointing to the original page is added to the HTML. Since we’ll be dumping the contents of the page into our own frame with a mashproxy.com location, we need to let the browser know to resolve the relative paths back to the right location.
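As a rough illustration of the base tag step, here is a sketch that inserts the tag straight after the opening head tag, or at the very start if the page has none. The function names are hypothetical and the real code may place the tag differently.

```javascript
// Escape characters that would break out of a double-quoted attribute.
function escapeAttribute(text) {
  return text.replace(/&/g, "&amp;").replace(/"/g, "&quot;")
             .replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Add <base href="..."> so relative links resolve against pageUrl.
function addBaseTag(html, pageUrl) {
  const baseTag = '<base href="' + escapeAttribute(pageUrl) + '">';
  const headOpen = /<head\b[^>]*>/i;
  // A replacer function is used so "$" in the URL is not treated as a
  // replacement pattern.
  return headOpen.test(html)
    ? html.replace(headOpen, function (match) { return match + baseTag; })
    : baseTag + html;
}
```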

The next step is a bit odd. As a security step, browsers don’t let you monkey with the DOM of a frame once you’ve loaded it, even if it’s the same domain, so I have to add a script into the HTML to do several things:

  • Add some text after external links in the page, indicating if they exist
  • Work out if a URL is present in the search page
  • If the mouse is over an external link, invoke the preview frame
  • If it’s an external link, fetch its contents to see if they exist
  • If the user clicks on a search navigation link (next page, etc), open it through SearchMash
  • If the user submits a new search through a form, route that through SearchMash too

After the script is added, document.write() is called to set up the frame, and the SB_StartDocumentParsing() function is called to start the customization of the search page and checking of external links. This is one of the more dubious uses of setTimeout(). Using an onload handler would be better, but I didn’t want to interfere with the script in the search page.

Once an external link is returned from the applet, it goes through the same SB_PageRequestDone() as search pages, but since it’s not from a Google domain, it gets handled in the second branch. The first thing that branch does is check that the external link is present in the current search page. If it isn’t, no processing is done on it. This both improves performance and limits the damage if an external script finds a way to request pages.

If it is found, the status text next to the link in the search page is set, and the contents of the page are stored for later use in the preview frame.

The last piece of functionality is invoked when the user moves her mouse over an external link. The event handler in the search page calls SB_SetPreviewFrame(), and this either sets the location of the frame to the external link, if it hasn’t been loaded by the status checking yet, or if the contents have been stored, writes them into the frame.
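The bookkeeping for external pages and the preview decision can be sketched as below. The data structures and names (storedPages, storeExternalPage, previewActionFor) are invented for the example; the real script works on the frames directly.

```javascript
// url -> HTML text of external pages the applet has already returned.
const storedPages = new Map();

// Called when an external page comes back from the applet.
// `linksInSearchPage` is a Set of the external URLs found in the
// current results page.
function storeExternalPage(url, html, linksInSearchPage) {
  if (!linksInSearchPage.has(url)) {
    return false; // not a link from this results page: ignore it
  }
  storedPages.set(url, html);
  return true;
}

// Decide what the preview frame should do on mouse-over.
function previewActionFor(url) {
  return storedPages.has(url)
    ? { action: "write", html: storedPages.get(url) }
    : { action: "navigate", url: url };
}
```

A "write" result corresponds to writing the stored contents into the preview frame, and "navigate" to setting the frame’s location to the link.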

This step is the one most vulnerable to abuse, since I’m writing external HTML into a frame with my domain’s privileges. To protect against malice, the SB_RemoveScripts() function goes through the contents before they’re added, and removes any scripts, using a regex-based blacklist.

Script blocking updated

I’ve checked in some changes to the SearchMash JavaScript, intended to entirely remove any script content from displayed pages. Previously I was just removing <script> tags; now I also look for javascript: URLs, eval calls and event handlers (onload, etc). I’ve also added some ‘canonicalization’ (there should be a better word for that) steps to try and avoid some of the common workarounds for regex blocking, like inserting newlines or spaces.
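To give a feel for what a regex blacklist of this kind looks like, here is a simplified sketch. These are not the expressions from SB_RemoveScripts(), the replacement token "blocked" is arbitrary, and a filter like this is blunt: it will over-block ordinary text in places and can still miss encodings it doesn’t anticipate.

```javascript
function removeScripts(html) {
  return html
    // Whole <script>...</script> elements, including multi-line bodies.
    .replace(/<script\b[\s\S]*?<\/script\s*>/gi, "")
    // Any leftover opening or closing script tag.
    .replace(/<\/?script\b[^>]*>/gi, "")
    // Inline event handlers such as onload="..." or onclick=foo().
    .replace(/\son\w+\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)/gi, "")
    // javascript: URLs, tolerating whitespace inserted between letters.
    .replace(/j\s*a\s*v\s*a\s*s\s*c\s*r\s*i\s*p\s*t\s*:/gi, "blocked:")
    // eval( calls, with optional whitespace before the parenthesis.
    .replace(/\beval\s*\(/gi, "blocked(");
}
```

Allowing whitespace between the letters of javascript: is one example of a canonicalization step folded into the pattern itself.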

I also changed the way I work out if a link in a results page is a search link (and so should open in the left frame). This change should restrict the pages that get opened as search results to those on the Google domain. Previously it would have been possible for someone to set up a http://www.notgoogle.com domain and I’d open it there.

For both of these changes, I switched over to using regular expressions, since that’s a lot easier to understand than my previous logic.

I feel a bit better about the security of running external HTML in my domain now, so I might try and get back to some feature work after this.

Now, I think I need to spend my remaining weekend with a margarita and a hot tub, and prepare for my real job tomorrow.

MashProxy’s implementation

MashProxy is a very small Java applet, only 385 lines long and 12 KB when it’s compiled. Its job is to receive URLs from the JavaScript of the page it’s running in, and return the HTML contents of those pages.

The first thing it has to do is call back the page it’s contained in when it’s initialized, so that the script knows it’s safe to start asking it to fetch pages. Calling an applet before it’s initialized doesn’t work. I also tried checking the alive state and other methods, but the callback was the most reliable way to discover when the applet is active across different browsers.

It’s in the init function that the current document’s location is checked. If the page is neither on the whitelist nor being run from a local file, the applet silently fails.

If it’s determined that it’s ok to run, the applet stores the current JavaScript window and document objects, which it will use for future callbacks, and then calls the SB_NotifyAppletLoaded_Forward() JavaScript function in the current document. The function name is hardcoded. It has a _Forward suffix because the SearchMash implementation runs the applet in a frame, so the JSWindow calls back into that frame, and the function is just a stub that calls the real code in the main frame.

It also starts up a separate thread to handle incoming requests. This is because calls into signed applets from JavaScript don’t get the same security privileges, so such a function has to pass the information on to a different thread, and have that trusted thread do the actual work.

As far as I can tell from Sun’s comments on the need for this, it is to prevent accidental exposure of a signed applet’s functions to JavaScript, and it does ensure that you have to explicitly enable any trusted code before you can call it. However, it may be that they were hoping to prevent JavaScript invoking trusted Java code at all, which is not the effect they achieved; they just made the code to do it more complex.

The main entry point to the applet is the pageRequest() function. This takes a URL, and a string containing the JavaScript function name to call back. For now I’ve disabled being able to set the callback, and hardcoded the call back to SB_PageRequestDone_Forward() in the applet’s frame, to make it tougher for malicious third-party scripts loaded in other frames to use the applet.

pageRequest() then pushes the arguments into member variables of the applet, and signals the trusted thread to handle a request by waking it. The trusted thread reads in the arguments, and does the actual work.

It first fires off a header request to the web page using HttpURLConnection to get the status code, and quickly discover if it’s missing or moved. After that, it requests the full page. There’s also a timeout thread on these that kicks in after 20 seconds of no response, and sets the status to HTTP_CLIENT_TIMEOUT.

When the results are returned, the pageRequestDone() function is called, either with null for the contents if the page couldn’t be found, or a BufferedReader representing the contents, the source URL and the status code. The reader is converted into a string, and the JavaScript callback function (hardcoded currently to SB_PageRequestDone_Forward) is called with the results.

The applet is able to handle multiple concurrent requests because of its threading model. One quirk is that I had to include a JSObject jar to enable the ‘liveconnect’ functionality I needed to be able to call back and forth between JavaScript and Java.

Security

No security is perfect, but there are a number of important restrictions built into MashProxy and SearchMash.

The security in MashProxy is based around the Java applet being digitally signed. This signature confirms that the code has not been tampered with, and was written by the author. I’m using a certificate from Verisign, which I had to give solid proof of my identity to obtain.

This proof of identity means that it’s very easy to track someone down who did write malicious code. It reduces the decision of ‘Do I want to run this code on my machine?’ to ‘Do I trust the author of the code?’.

I’m an established and reputable open source programmer, as a search on my name will verify, and I’ve been distributing widely used executable code (such as FreeFrame plugins) for many years with no security problems. This should give you some assurance that my intentions are good.

The other question is whether I’ve done a competent job safeguarding the security in my implementation. I believe I have, and to back that up I’ll cover the details of what I’ve done below. The source code is also freely available through the SourceForge project for anyone who wants to examine it for themselves.

The basic foundation of MashProxy is that signed Java applets running within a web browser are able to fetch pages from anywhere on the web, and aren’t restricted by the same-domain policy that controls most page access functions.

My initial attempt at a search mash ran entirely within a signed applet, but I discovered that support for parsing and rendering HTML was too limited in Java, and I wasn’t happy with what I could achieve.

Taking what I’d learnt, I looked into other possible ways of implementing the functionality I wanted. JavaScript fitted the bill, since it has great support for page display and parsing, is well supported and documented, and is cross-browser. The one thing it didn’t have is the ability to load pages from other domains.

However, it was able to call into a signed applet, and the applet could then load the pages from any domain. So, I took my original code for the full applet, and reduced it to just a single public function to load a web page. I was then able to call it from JavaScript and build SearchMash.

Now, since the applet doesn’t know what JavaScript is calling it, this opens up a lot of possibilities for abuse. I set out to engineer some restrictions into the applet to prevent malicious usage:

Applet Safeguards

The major safeguard is that I only allow the applet to run from pages on petewarden.com or mashproxy.com, sites I control. This ensures that the applet can’t be used with my signature on other people’s sites. I also allow the applet to be run from pages that are on the local machine (checking for file as the protocol), with the assumption that local files are trusted. This means that development using my signed applet will still be possible by third parties using local files, but to deploy they’ll need to get a certificate and rebuild their own applet with their site added to the whitelist of trusted domains.

To check the page I’m running on, I call JApplet’s getDocumentBase() function, which returns the page that invokes the applet (not the location of the applet, that would still allow untrusted third parties to call it).

I also make sure that all requests begin with “http://” (even though the HttpURLConnection code that I call in Java shouldn’t allow anything else), to restrict any other protocols from being invoked.
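The applet performs these checks in Java, but the logic of the two safeguards above is easy to sketch in JavaScript. Everything here is illustrative: the function names are invented, and whether the real whitelist accepts subdomains is an assumption on my part.

```javascript
const TRUSTED_DOMAINS = ["petewarden.com", "mashproxy.com"];

// Mirrors the init-time check: is the page hosting the applet trusted?
function isTrustedDocumentBase(documentBase) {
  const location = new URL(documentBase);
  if (location.protocol === "file:") {
    return true; // local files are assumed to be trusted
  }
  const host = location.hostname;
  return TRUSTED_DOMAINS.some(function (domain) {
    // Exact match or a real subdomain (assumption);
    // "notpetewarden.com" fails both tests.
    return host === domain || host.endsWith("." + domain);
  });
}

// Mirrors the per-request check: only http:// URLs are fetched.
function isAllowedRequest(url) {
  return url.startsWith("http://");
}
```

Comparing whole host labels rather than searching for the domain name anywhere in the URL is what stops a lookalike domain from passing.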

The one thing I wasn’t able to do was prevent the passing of cookies automatically. I would prefer to always work without cookies, to avoid the possibility of accessing personal data, but these seem to be added automatically by the browser to all requests made through HttpURLConnection.

SearchMash JavaScript

These are the measures I’ve taken to ensure the applet can’t be used maliciously by someone else. I’ll now cover what the JavaScript that implements SearchMash actually does, step by step, to provide evidence it won’t leave you vulnerable.

When the main index.html is loaded, it contains three frames, the one for results, the one for the preview, and a small one to hold the applet. The small one is necessary because some browsers don’t want to run invisible applets, which seems sensible.

The onload function calls code that Internet Explorer needs to run the applet automatically, because of patent issues. Then, the applet is loaded, and calls back to SB_NotifyAppletLoaded(), which triggers the first fetch of the Google search page.

When a search page is loaded, I add JavaScript hooks into the page links so the user can navigate through pages without leaving the mashup. I also add mouse-over hooks to activate the preview frame. I parse all the links in the page, and put in page requests to the applet for all that appear to be external to Google.

The applet calls back to JavaScript when a page is loaded or determined to be missing. If it’s an external page, the current search page is searched for the link, the status text is updated, and the contents are stored for later use in the preview frame. If it’s a search results page, then the link hooks are added, and the search frame is updated.

At no point in the code is any information from the fetched pages passed back to my site, or any other external site. Those pages never leave your local machine, which ensures that your information is kept private.

Remaining Vulnerabilities

Being able to read from arbitrary domains opens up some malicious possibilities:

  • Reading from web pages using the browser’s cookies to access personal information
  • Reading from web pages that are within a private intranet
  • Passing any information obtained back to the attacker’s website by encoding the information in a URL

Because the HTML from the remote sites that SearchMash fetches is loaded into frames that have access to the applet, any scripts inside the HTML loaded from external sites could use the applet’s functions.

I’ve added some checks to try and remove scripts from loaded pages, but it’s notoriously tough to parse out all scripts (see the MySpace Samy exploit).

Something that would limit the threat would be removing access to cookies. I’m looking into using lower-level functions in Java, or a third-party library, that would let me disable them.

Introducing MashProxy

Hi, I’m Pete Warden (that’s not me in the photo!), and as a fun project, I decided to create a search mashup that let me search the way I wish I could.

While I was doing that, I discovered that there was a big restriction on using XMLHttpRequest and AJAX. You can only request pages from the same server you’re on, as a security measure. This obviously makes doing a mashup of pages on other servers much more difficult.

The standard ways of working around this involve setting up a way to use your server to fetch external web pages. There were several reasons I wanted to avoid this:

– It doesn’t scale with the number of users, since everything has to go through your server.
– A big goal of the project was to discover if the pages found in the search were accessible to the user. This isn’t possible if it’s a remote server doing the checking.
– Setting up the server to act as a proxy requires at least some knowledge of scripting and Apache

The main reason that client-side proxies haven’t been done before is the potential for security holes that it opens up. Chris Shiflett has a great article that covers the problems if XMLHttpRequest were opened up to allow cross-domain requests, which is equivalent to what MashProxy allows.

Julian Couvreur has also helped my understanding of the issues. He’s written something similar using Flash rather than Java, FlashXMLHttpRequest.

I’ll discuss the security policy I adopted in my next post, including the safeguards against abuse I’ve implemented and possible remaining problems.

In short, MashProxy is a Java applet that lets JavaScript request web pages, just like XMLHttpRequest, but without the same domain restriction. This let me build the SearchMash project, implementing my ideas on a better interface to search results. It’s open source, and up on SourceForge, and my hope is that other developers will use it as an easier way to create mashups. I want to see more mashing, and I think the server proxy requirements have been holding things back.