Search Tips for Firefox


  1. Use PeteSearch! The hot-keys, link checking, split-screen preview and term highlighting will save you time.
  2. Use CustomizeGoogle. This add-on allows you to do a lot of handy stuff like excluding sites from your results, and has just added a cool infinite scrolling feature.
  3. Learn Google’s advanced options. There’s a graphical way to do these searches, but it always generates keywords you can see and reuse in the search box.
  4. Look at other search engines. Technorati is handy for breaking news, since posts show up very quickly in their search results. If you’re having trouble finding what you want on Google, Ask uses a different algorithm that may better suit that search.
  5. Use the built-in Firefox search hot-keys. Control (Command on the Mac) and K will take you to the search bar for example, and Ctrl/Cmd+Up/Down will move through the different search engines.
  6. Add more sites to the search bar. The Mycroft site has a list of hundreds of plugins, so if there’s a specialized site you already use, there’s a good chance it’s in there.
  7. Pick distinctive terms. English is full of overloaded words. One example is "Wedding band", which could mean either a musical group or jewellery. If you’re getting back confused results, try a more distinctive synonym, in this case maybe "Wedding ring" or "Wedding music" depending on which you want.
  8. Add words or phrases you expect in the result page. When I’m looking for a page I previously found, I try to remember distinctive words or phrases from it, and add those to the search.
  9. Use the site: advanced operator to focus on a single site. Often Google works better than a site’s built-in search!
  10. Quote search terms. Normally Google will find pages with the terms in any order, but often you want only those with an exact phrase, for example "Pete Warden" to find only those with my name, rather than any pages with both Pete and Warden anywhere on them.



PeteSearch and the semantic web applied

PeteSearch is a semantic web application: it takes web pages designed to be read by humans and turns them into data that can be processed by software. It’s a pretty specialized application, focused purely on pages that list external sites associated with particular search terms, but the wide range of sites I’m able to support using the same code shows that my approach is robust.

The model I use for search pages is that they must contain three pieces of information:

  • A list of search terms, embedded in the URL
  • A list of external sites associated with those terms
  • A link to the next page of results

All of the recognition of these is data-driven, using a definition for each engine that includes

  • What is the host name and action used by the engine, so we can tell what’s a page of search results. For google that’s
  • What precedes the search terms in the URL, e.g. for Google that’s q=
  • Which external sites are linked to, but not part of the results, eg google links to for definitions of words
  • Which words indicate an external link that isn’t part of the results, e.g. Google links to the cached results on numbered servers using Cached as the link’s text
  • Which word is used for the link to the next page of results. For English that’s almost always Next, but I also support other languages

You can experiment with this by editing the SearchEngineList.js file inside PeteSearch. It contains an array of these definitions and an explanation of the exact format they’re stored in. It’s pretty straightforward to add a new engine that fits this pattern, and most of them do.
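To make the model concrete, here is a rough C++ sketch of what one such definition and the URL checks built on it might look like. The struct, its field names, the helper functions and the search.example.com engine are all made up for illustration; the real definitions live in SearchEngineList.js and use their own format.

```cpp
#include <string>
#include <vector>

// Hypothetical definition of one engine. Field names are illustrative only
// and do not match the format used in SearchEngineList.js.
struct SearchEngineDefinition {
    std::string hostName;                      // host that serves the results
    std::string action;                        // path used for searches
    std::string termsPrefix;                   // what precedes the terms in the URL
    std::vector<std::string> ignoredHosts;     // linked to, but not results
    std::vector<std::string> ignoredLinkText;  // link text that marks a non-result
    std::vector<std::string> nextPageWords;    // text of the "next page" link
};

// A made-up engine used only for this example.
SearchEngineDefinition exampleEngine() {
    return SearchEngineDefinition{
        "search.example.com", "/find", "q=",
        {"dictionary.example.com"}, {"Cached"}, {"Next"}};
}

// True if the URL looks like a page of results from this engine.
bool isResultsPage(const SearchEngineDefinition& engine, const std::string& url) {
    return url.find(engine.hostName + engine.action) != std::string::npos;
}

// Extracts the search terms from the URL's query string, splitting on '+'.
// The terms are returned still URL-encoded.
std::vector<std::string> extractSearchTerms(const SearchEngineDefinition& engine,
                                            const std::string& url) {
    std::vector<std::string> terms;
    std::string::size_type pos = url.find('?');
    while (pos != std::string::npos) {
        const std::string::size_type paramStart = pos + 1;
        // Only accept the prefix at the start of a parameter, so that
        // "q=" does not match inside something like "faq=".
        if (url.compare(paramStart, engine.termsPrefix.size(), engine.termsPrefix) == 0) {
            const std::string::size_type valueStart = paramStart + engine.termsPrefix.size();
            const std::string::size_type valueEnd = url.find_first_of("&#", valueStart);
            const std::string raw = url.substr(
                valueStart,
                valueEnd == std::string::npos ? std::string::npos : valueEnd - valueStart);
            std::string::size_type termStart = 0;
            while (termStart <= raw.size()) {
                std::string::size_type termEnd = raw.find('+', termStart);
                if (termEnd == std::string::npos) termEnd = raw.size();
                if (termEnd > termStart) terms.push_back(raw.substr(termStart, termEnd - termStart));
                termStart = termEnd + 1;
            }
            return terms;
        }
        pos = url.find('&', paramStart);
    }
    return terms;
}
```

With this sketch, a URL such as http://search.example.com/find?hl=en&q=wedding+ring&start=10 is recognized as a results page and yields the two terms "wedding" and "ring". Keeping everything engine-specific in the data means the code paths stay the same for every engine.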

The only way that the semantic web is going to progress beyond proofs of concept is if there’s some concrete, practical and commercial application for it. I’ve seen this with AI: its applications in robotics and games have pushed the field forward much more than pure research. The semantic web is stuck in a chicken-and-egg situation; nobody builds applications because nobody builds sites that are data sources, because nobody builds applications, etc.

I’m not the only one to notice this. Piggy Bank is an MIT project that’s much more ambitious, and works like Greasemonkey in that it provides a framework to support data capture from many different sites using plugin scripts.

My goal is to demonstrate that a semantic web application can be useful today, in the real world, by creating a compelling tool based on my approach. I’m worried that unless somebody can show something useful, the semantic web is going to succumb to the AI curse and remain the technology of tomorrow indefinitely!

Porting Firefox Extensions to Internet Explorer Part 3

In the first two parts I covered setting up our tools, and creating a code template to build on. The next big challenge is replacing some of the built-in facilities of Javascript, like string handling, regular expressions and DOM manipulation, with C++ substitutes.

For string handling, watch out for the encoding! Most code examples use 8 bit ASCII strings, but Firefox supports Unicode strings, which allow a lot more languages to be represented. If we want a wide audience for our extension, we’ll need to support them too.

C++ inherits C’s built-in strings, as either char (for ASCII) or wchar_t (for Unicode) pointers. These are pretty old-fashioned and clunky to use: common operations like appending two strings involve explicit function calls, and you have to manually manage the memory allocated for them.

We should use the STL’s string class, std::wstring, instead. This is the Unicode version of std::string, and supports all the same operations, including appending just by using "+". The equivalent of indexOf() is find(), which returns std::wstring::npos rather than -1 if the substring is not found. lastIndexOf() is matched by rfind() (not find_last_of(), which looks for the last occurrence of any single character from its argument). The substring() method is closely matched by the substr() call, but beware: the second argument is the length of the substring you want, not the end index as in JS!
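As a quick illustration of those mappings, here is a small sketch of JS-style helpers built on std::wstring. The helper names are my own and only mirror the Javascript methods; nothing here is taken from the PeteSearch source.

```cpp
#include <algorithm>
#include <string>
#include <utility>

// JS: haystack.indexOf(needle). Returns -1 when the needle is missing,
// translating from std::wstring::npos.
long indexOf(const std::wstring& haystack, const std::wstring& needle) {
    const std::wstring::size_type pos = haystack.find(needle);
    return pos == std::wstring::npos ? -1 : static_cast<long>(pos);
}

// JS: haystack.lastIndexOf(needle). Uses rfind(), which searches for the
// whole substring starting from the end.
long lastIndexOf(const std::wstring& haystack, const std::wstring& needle) {
    const std::wstring::size_type pos = haystack.rfind(needle);
    return pos == std::wstring::npos ? -1 : static_cast<long>(pos);
}

// JS: text.substring(start, end), where end is one past the last character.
// substr() wants a length instead, so we convert.
std::wstring substring(const std::wstring& text, std::size_t start, std::size_t end) {
    if (start > end) std::swap(start, end);  // JS swaps out-of-order arguments
    start = std::min(start, text.size());    // clamp to the string, as JS does
    end = std::min(end, text.size());
    return text.substr(start, end - start);
}

// Appending works with "+", just like in JS.
std::wstring joinWithSpace(const std::wstring& first, const std::wstring& second) {
    return first + L" " + second;
}
```

For example, substring(L"PeteSearch", 4, 10) gives L"Search", which corresponds to the raw call text.substr(4, 6).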

For regular expressions, our best bet is the Boost Regex library. You’ll need to download and install Boost to use it, but luckily the Windows installer is very painless. Once that’s done, we can use the boost::wregex object to do Unicode regular expression work (the plain boost::regex works on narrow char strings). One pain of dealing with REs in C++ is that you have to double the backslashes in the string literals you set them up with, so to get the expression \?, you need the literal "\\?", since the compiler otherwise treats the backslash as the start of a C character escape. The regular expression functions themselves are a bit different from Javascript’s: regex_match() only returns true if the whole string matches the RE, while regex_search() finds a match anywhere in the string and is the one to call repeatedly when you want every match.
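Here is a small sketch of the difference between the two calls. To keep it buildable without installing anything, it uses the standard library’s <regex>, which was modelled on Boost.Regex and uses the same regex_match()/regex_search() names; with Boost you would include <boost/regex.hpp> and swap the std:: prefixes on the regex types and functions for boost::. The function names and the q= pattern are made up for the example.

```cpp
#include <regex>
#include <string>
#include <vector>

// regex_match(): true only if the WHOLE string is a run of digits.
bool isWholeNumber(const std::wstring& text) {
    static const std::wregex digits(L"\\d+");  // "\\d" in the literal is \d in the RE
    return std::regex_match(text, digits);
}

// regex_search(): true if a run of digits appears ANYWHERE in the string.
bool containsNumber(const std::wstring& text) {
    static const std::wregex digits(L"\\d+");
    return std::regex_search(text, digits);
}

// Repeated searching: collect the value of every q= parameter in a URL by
// restarting regex_search() just past the previous match.
std::vector<std::wstring> findAllQueryValues(const std::wstring& url) {
    // The RE is [\?&]q=([^&#]*), written with doubled backslashes.
    static const std::wregex param(L"[\\?&]q=([^&#]*)");
    std::vector<std::wstring> values;
    std::wsmatch match;
    std::wstring::const_iterator searchStart = url.begin();
    while (std::regex_search(searchStart, url.end(), match, param)) {
        values.push_back(match[1].str());  // first capture group: the value
        searchStart = match[0].second;     // continue after this match
    }
    return values;
}
```

The manual restart loop is the closest analogue to calling exec() repeatedly on a global RegExp in Javascript.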

DOM manipulation is possible through the MSHTML collection of interfaces. IHTMLDocument3 is a good start, since it supports a lot of familiar functions such as getElementsByTagName and getElementById. It does involve a lot of COM query-interfacing to work with the DOM, so I’d recommend using ATL pointers to handle some of the house-keeping with reference counts and casting.

PeteSearch is now detecting search page loads, and extracting the search terms and links from the document. Next we’ll look at XMLHttpRequest-style loading from within a BHO.

More posts on porting Firefox add-ons to IE

Porting Firefox extensions to Internet Explorer Part 2


In the first part, I showed how to set up the tools you need to build a BHO. I’ve now created a skeleton project that will compile into a very simple BHO, TinyBHO. I wasn’t able to find any good official sample code, since all the articles start with a wizard-generated template and describe how to build on it, but the SurfHelper example by Xiaolin Zhang on CodeProject was a great help. Having a complete example I could build and run gave me the information I needed to build my own.

To build the BHO, download and unzip the project, open the TinyBHO.vcproj file in Visual Studio Express, and hit F7 to build it. The build process should compile and then add the DLL to the registry, and the next time you run Internet Explorer, a message box should appear whenever a document’s loaded. This provides all the boilerplate to hook into browser events, and provides a stub that you can insert your own document processing logic into.

If we want to build on this, we’ll need our own GUIDs for the interface and type library. Use a tool like uuidgen or one of the online versions to generate three IDs, and then load the TinyBHO.idl file into VS’s editor. Replace the first ID, which begins with 03147ee0, with our first ID. It should only appear once, near the top of the .idl file.

The second ID we need to replace starts with 7cd37f36, and appears once in both TinyBHO.idl and TinyBHO.rgs. The third begins with 00e71626, and defines the main class’s ID. It occurs four times in TinyBHO.rgs and once in TinyBHO.idl. I’d also recommend customizing the class and file names, but that shouldn’t be strictly necessary for the BHO to coexist with others the way that unique GUIDs are.

Now we’ve got the BHO personalized, we can start looking at CTinyBHOClass::Invoke(), the function that gets called whenever a browser event, such as a page load or key press, occurs. Right now, all it does is catch document load events, pull the document interface pointer from the event, and shut down the event handler when the browser quits. For PeteSearch, I’ll be focused on key presses and document loads, since that’s where it does most of its work, but there’s a wide range of other events to grab if you want to implement something else, like a popup blocker. In Firefox terms, it’s pretty close to having called addEventListener() on the browser, except that you get called on all events and have to check yourself to figure out what happened.

The trickiest thing about getting this working was that all the ATL COM handling code is such a black box. The registration part of the build phase was failing, and I eventually tracked it down to a misspelt resource name, but there was no logging or other useful debug information available. I just had to spend a long time inspecting the code looking for anything suspicious!


Porting Firefox extensions to Internet Explorer

Compared to writing Firefox extensions, there isn’t a whole lot of information about writing add-ons for Internet Explorer, and almost nothing about porting from Firefox to IE.  I’m converting PeteSearch over, and I’ll be sharing my notes on the process here.

There’s no way of writing a plugin using an interpreted environment like Firefox’s Javascript. Instead you have to create a binary library in a DLL. These are known as Browser Helper Objects, and give you the ability to catch browser events like page loads, and work with the DOM, but don’t offer any UI access for things like toolbars. Hopefully PeteSearch doesn’t use anything outside of a BHO’s domain, since it works almost entirely on processing and altering existing pages.

Since we’ll need to compile a DLL, the first thing we need is a compiler. Luckily Microsoft has released a cut-down version of Visual Studio for free, though there are some limitations that I’ll describe. I chose to use C++ to create the DLL, because I know it well and it’s what most of the BHO examples use, but you could also use C# or Visual Basic.

After we’ve downloaded the compiler there are some other components we need before we can build a DLL. Since the BHO is a Win32 DLL, we need the Platform SDK. To get it working with the Express edition of Visual Studio, we’ll also need to muck about with some config files, as described in this MSDN article.

For dealing with COM code, ATL/WTL is very useful, so I’d also recommend downloading the open source WTL from MS, and following the steps in this CodeProject article.

Now we’re ready to compile a DLL, the best place to start is with some example code. There’s two good articles from Microsoft that describe writing BHOs, one from 1999 and a more recent 2006 version. Unfortunately they both start off using a wizard to create the base project, and the ATL wizard isn’t available for the express edition! They don’t have any complete examples downloadable, which is a bit puzzling. I’ve contacted one of the authors, Tony Schreiner, in the hope he can provide a complete example. If not, I may finagle a base ATL project from someone with the full studio package, then I’d be able to build it myself following the rest of the steps. Of course, I could purchase the standard studio for $300, but it seems worthwhile to work out a free way to write IE extensions.

René Nyffenegger has some example code that we can use instead, though I found I had to do some tweaking to get it to compile, such as defining a couple of logging macros, and sorting out some string literals that were expected to be wide but weren’t.

Now we should have the starting point for a BHO. I’ll cover the next steps, and show some example code, in a following article.

Edit – I’m trying this on a new Vista laptop, and there’s a few extra steps I noticed:

  • Folder permissions for editing the ATL headers in the SDK are tricky in Vista. You need to make yourself the owner, and only after that sort out the permissions.
  • You need to add the mfc include folder from the SDK too. I may have just forgotten this before.
  • The registry-setting part of the build process doesn’t work. I’ll cover fixing that in my description of writing an installer, and update here once that’s done.

Edit – To fix the registry build stage, you can run Visual C++ as administrator by right-clicking on the app and choosing that option from the menu.


PeteSearch and SEO

Search engine optimization is the art of making sure a site appears as prominently as possible in search results. There’s the respectable kind, which is all about tweaking pages users are likely to want in their search results to appear higher, and black hat SEO, which is aimed at tricking search engines into favoring sites users aren’t likely to want.

One of the major ways that black hats use to deceive is cloaking; presenting a different page to the user than to the search engines when they index the page. WebmasterWorld would drive me nuts with their registration-required pages in search results, though they now claim it was a side-effect of blocking bots. The New York Times always used to show up with pages that required registration, though they may have stopped that now.

This was one of the driving forces behind PeteSearch; I wanted a way to spot sites that were serving me up useless content before I clicked on them and wasted a small part of my life. The first line of defense is the term checking; the page I check is the one that the user will see when they click on the link, so if it’s cloaked, it won’t have the terms, and the result will be flagged. The second defense is the split-screen preview; this scrolls down to where the search terms are, so if the content is buried below a lot of ads, you’ll go straight to it without scrolling.
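The term check itself can be very simple. Here is a minimal C++ sketch of the idea, assuming we already have the visible text of the fetched page and the list of search terms. The function names and the exact flagging rule (flag if any term is missing) are my assumptions for illustration, not a description of PeteSearch’s actual code.

```cpp
#include <algorithm>
#include <cwctype>
#include <string>
#include <vector>

// Lower-cases a copy of the text so the comparison ignores case.
std::wstring toLower(std::wstring text) {
    std::transform(text.begin(), text.end(), text.begin(),
                   [](wchar_t c) { return static_cast<wchar_t>(std::towlower(c)); });
    return text;
}

// Returns the search terms that do not appear anywhere in the page text.
std::vector<std::wstring> missingTerms(const std::wstring& pageText,
                                       const std::vector<std::wstring>& terms) {
    const std::wstring haystack = toLower(pageText);
    std::vector<std::wstring> missing;
    for (const std::wstring& term : terms) {
        if (haystack.find(toLower(term)) == std::wstring::npos) {
            missing.push_back(term);
        }
    }
    return missing;
}

// Assumed rule: flag the result if any of the terms is missing from the
// page the user would actually see.
bool shouldFlagResult(const std::wstring& pageText,
                      const std::vector<std::wstring>& terms) {
    return !missingTerms(pageText, terms).empty();
}
```

A cloaked registration page that only says "Please log in to continue" would be flagged for a search on "wedding ring", while a page that really contains both words would not.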

It’d be nice if this wasn’t needed, but the interests of web site hosts and users are fundamentally in conflict; hosts want as much revenue as possible and users want quick access to useful content. PeteSearch tries to give users a tool to hack through the jungle of misinformation that some hosts resort to.

Highlighting search terms


I’ve just put up version 0.6 of PeteSearch. The biggest change is that your search terms are now highlighted in yellow when you load up a preview page, and you can move through the terms by pressing shift and the up/down arrow keys! This is something I’ve been wanting for a long time. There’s no way to search for multiple terms using Firefox’s normal find, and this lets me look for any of the terms, rather than having to pick just one.

There are also some smaller improvements; Orkut and Blogger links started showing up in Google results, so I removed them, and I tidied up a few other minor bugs.

I just saw another nice blog post on PeteSearch, thanks Farshid!