How to create automatic blog categories with Lijit

Categories
Photo by Hawkexpress

I really like Lijit’s blog search widget, but I don’t want a cloud generated from the most popular searches. I’ve seen other blogs end up with some very inappropriate word combinations, apparently from people gaming the system. I also find the standard notion of tags very limiting; it’s only when I step back and see what I’ve been posting about that natural categories emerge. When I’m writing a post I often have no idea if it’s the first in a long series, or a one-off. I’d much rather have an automatic way of tagging all my posts, based on a few categories I describe after the fact.

If you look on my right bar, you’ll see a new ‘Categories’ list. These are actually canned Lijit searches, so clicking on them will bring up an in-context list of all the posts that match. For each category I’ve defined a Google search, often using the upper-case OR operator to pick a variety of different terms that are present in those types of posts. For example, the ‘Outdoors’ category searches for ‘hiking OR camping OR trails OR biking’.
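As a rough sketch of how one of those canned searches is put together, here’s a small shell function that joins a list of terms with the upper-case OR operator and then swaps the spaces for plus signs, the same escaping the generator script further down applies. The name build_category_query is made up for illustration; it isn’t part of Lijit.

```shell
# Hypothetical helper: join search terms with " OR ", then replace every
# space with "+" so the query can sit inside a URL.
build_category_query() (
    query=""
    for term in "$@"; do
        if [ -z "$query" ]; then
            query="$term"
        else
            query="$query OR $term"
        fi
    done
    printf '%s\n' "$query" | sed 's/ /+/g'
)

# The 'Outdoors' category from above:
build_category_query hiking camping trails biking
# prints hiking+OR+camping+OR+trails+OR+biking
```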

I’ve mentioned this to the very nice people at Lijit as a feature request for a more general widget, but for now I’ve included a simple tool below to generate your own category lists. It generates the raw HTML, and you’ll need to work out how to get it into your own blog. It also calls back into Lijit’s scripts to bring up the in-context results, so you’ll need to have the main widget already installed.

Here’s what it takes to get this into Typepad:

  • Generate the HTML for your list using the form below. When you hit the button, the HTML appears in the textbox; copy it to the clipboard.
  • Go to the TypeLists tab on your Typepad blog administration page.

  • Click on the Create New List link.

  • Set the type of the new list to Notes and the name to ‘Categories’

  • Click on Add Item, and paste the HTML from the generator into the label textbox.
  • Go to the Publish tab and select the blog you want to add it to, and click Save Changes.
  • Go to Weblogs, then Design, and choose Select Content.
  • Disable the built-in categories module if you have it already selected, and click Save Changes.
  • Go to Content Ordering and drag the new ‘Categories’ list to where you want it, and save.

Now if you refresh your site, you should see the new categories appear.

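function get_object(id)
{
    return document.getElementById(id);
}

// Returns the value of the form field with the given id, or "" if there is no such field.
function get_value(id)
{
    var currentobject = get_object(id);
    if (currentobject != null)
        return currentobject.value;
    else
        return "";
}

function generate_widget()
{
    // The HTML markup inside this function's string literals is missing from
    // this copy of the page, so the strings below are empty.
    var widgethtml = "";
    var username = get_value("username");
    var count;
    for (count = 0; count < 12; count += 1)
    {
        var nameid = "name" + count;
        var keywordsid = "keywords" + count;
        var namevalue = get_value(nameid);
        var keywordsvalue = get_value(keywordsid);
        // Skip any row where the name or the search terms were left blank.
        if ((namevalue != "") && (keywordsvalue != ""))
        {
            // Replace spaces with "+" so the search terms can go in a URL.
            var keywordsescaped = keywordsvalue.replace(/ /g, "+");
            var currenttag = "";
            currenttag += namevalue;
            currenttag += "";
            widgethtml += currenttag;
        }
    }
    widgethtml += "";
    // Show the result both as raw HTML to copy and as a rendered preview.
    var textpreview = get_object("textpreview");
    var htmlpreview = get_object("htmlpreview");
    textpreview.value = widgethtml;
    htmlpreview.innerHTML = widgethtml;
}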

Lijit user name:

Name: / Search: (twelve pairs of fields, one per category)

Generated HTML:

Preview:

You can also open this in a separate page in case Typepad’s cleanup breaks the tool, and here’s a screenshot from my category creation:
Lijittutorial5

Easy user authentication for Windows with PHP

Password
Photo by Richard Parmiter

The internet is slowly groping towards a single user identity system through the OpenID initiative, but one of the nice things about working inside a corporate firewall is that there’s already a directory of user names and passwords. In the dominant Microsoft world, you rely on Active Directory to keep track of all that information. The ‘Active’ prefix usually strikes fear into anyone integrating non-MS technology into a Windows world, since it often translates to ‘proprietary’, but they’ve actually done a really good job of making the directory information available through the LDAP open standard.

If you want to try converting your PHP-based internet app to intranet authentication, check out this tutorial on using LDAP from PHP with an Exchange server. If you’re interested in the details of using LDAP with PHP in general, things like how to install the LDAP module if it isn’t there by default on your PHP installation, check out this two-part guide.

Get a real-time view of the Apache error log

Tail
Photo by CR

I spend a lot of time developing on a remote server through an SSH connection, and I’ve found it tough to keep an eye on the error log file. Typically I’ve been running the tail unix command to look at the last 10 lines, but this only gives you a snapshot of the errors at that instant. If I wanted to see more, I had to run tail again. I knew there had to be a better way, something I was missing, since I couldn’t imagine unix wizards putting up with this. Luckily I was right. If you pass the -f or --follow option to tail it will continuously update, so you can see errors in real time as the lines are written to the log file. This is perfect: I can see at a glance what’s going on.

To do the same, just open an SSH session to your remote server, and then type in the following command:

tail -f /var/log/httpd/error_log

The log file location varies on different flavors of Linux (on Debian and Ubuntu it’s usually /var/log/apache2/error.log), and if you have access problems, make sure the logged-in user has high enough permissions to read it.
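If the log is noisy, one option is to pipe the followed output through grep so only matching lines show up. This is a minimal sketch: follow_matching is a hypothetical helper name, and the PHP pattern is only an example. The --line-buffered flag asks grep to print each match as soon as it arrives rather than waiting for its output buffer to fill.

```shell
# Hypothetical helper: follow a log file, showing only lines that match a pattern.
#   $1 = path to the log file, $2 = pattern to look for
follow_matching() {
    tail -f "$1" | grep --line-buffered "$2"
}

# Example (press Ctrl-C to stop):
# follow_matching /var/log/httpd/error_log "PHP"
```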

How to easily search and replace with sed

Textjumble
Photo by wai:ti

If you’ve used any flavor of unix for programming, you’re probably familiar with grep, the tool for locating patterns in text files. That’s great if you just want to search for a string, but what if you want to replace it?

Sed, the stream editor, is the answer, but that also brings up a new question: how on earth do I use it? It’s probably one of the most obscure interfaces ever invented; its syntax makes obfuscated perl look like a model of clarity. Usually with a new tool I start off looking at a series of examples, like these sed one-liners, to get a rough mental model of how it works, and then dive into the documentation on specific points. That didn’t work with sed; I was still baffled even after checking those out. The man page didn’t help either: I could read the words, but they didn’t make any sense.

Finally I came across my salvation, Bruce Barnett’s introduction and tutorial for sed. He hooked me with his first section on The Awful Truth about sed, with its reassurance that it’s not my fault that I’m struggling to make head or tail of anything. He then goes through all the capabilities of sed in the order he learnt them. It’s a massive list, but even if you only get a few pages in you’ll know how to do a simple search and replace. Sed is a very powerful tool, so it’s worth persevering with the rest to discover some of the advanced options, like replacing only between certain tags in a file (e.g. how to change the content text of a particular XML tag) and working from line numbers. Bruce is an entertaining companion for your journey too; he has great fun with some asides demonstrating how to make the syntax even harder to read, just for kicks.
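To give a flavor of those capabilities, here are three small sketches: a plain search and replace, one restricted to a range of line numbers, and one restricted to the lines between an opening and a closing tag. The sample file and its contents are invented for the example, and these are illustrations rather than excerpts from the tutorial.

```shell
# Scratch file to experiment on (contents made up for illustration).
sample="$(mktemp)"
cat > "$sample" <<'EOF'
<title>
draft notes
</title>
<body>
this draft is a draft
</body>
EOF

# 1. Replace every "draft" with "final" and print the result.
#    Without the trailing g, only the first match on each line changes.
sed 's/draft/final/g' "$sample"

# 2. Restrict the substitution to a range of line numbers (lines 4 to 6 here),
#    so the title text is left alone.
sed '4,6s/draft/final/g' "$sample"

# 3. Restrict it to the lines from one pattern to another, here the title tags,
#    so the body text is left alone.
sed '/<title>/,/<\/title>/s/draft/final/' "$sample"

rm -f "$sample"
```

Each command prints the edited text and leaves the file itself untouched, which makes it safe to experiment before committing to a change.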

Why massive datasets beat clever algorithms

Library
Photo by H Wren

Jeremy Liew recently posted some hints, tips and cheats to better datamining. The main thrust, based on Anand Rajaraman’s class at Stanford, is that finding more data is a surer way to improve your results than tweaking the algorithm. This matches both my own experience trying to do datamining, and what I’ve seen with other companies’ technologies. Engineers have a bias towards making algorithms more complex, because that’s the direction that gets the most respect from your peers and offers the most intellectual challenge.

That makes us blind to the advantages of what the Jargon File calls wall-followers, after the Harvey Wallbanger robot that simply kept one hand on a wall at all times to complete a maze, and gave far better results than the sophisticated competitors using complex route-finding. Google’s PageRank is a great example, almost zen in its simplicity, with no semantic knowledge at all.

One hidden advantage to this simple-mindedness is very predictable behavior, since the simplicity means there are far fewer variables that affect the outcome. This is why machine-learning is so scary a change for Google: there’s no guarantee that some untested combination of inputs won’t result in very broken results.

Another obvious benefit is speed of both development and processing. This lets you get up and running very fast, and get through a lot of data. This gives you a lot more coverage. Yahoo’s directory wasn’t beaten because Google ranked pages more accurately than humans, but because Yahoo could only cover a tiny fraction of what was out there.

On the face of it, this doesn’t sound good for a lot of the alternative search engines that are competing out there. If it’s hard to beat a simple ranking algorithm, should they all pack up and go home? No, I think there’s an immense amount that can be improved both on the presentation side and by gathering novel data sets. For example, why can’t I pull a personal page rank based on my friends’ and friends-of-friends’ preferences for sites? What about looking at my and their clickstreams to get an idea of those preferences?

How to profile your PHP code with Xdebug

Tapemeasures
Photo by Jek in the Box

I was adding some functionality to my mail system, and noticed it seemed to be running more slowly. I didn’t have any way to be sure though, and I didn’t know how to get performance information on my PHP code. After a bit of research, I discovered the xdebug PHP extension can gather profile data, and you can then use kcachegrind to view it.

On Fedora Linux, you can install xdebug by running the following line in superuser mode:
yum install php-pecl-xdebug

To enable capturing of performance data, you’ll then need to edit your php.ini file, adding the following lines:

; xdebug settings
xdebug.profiler_enable = 1
xdebug.profiler_append = 1
xdebug.profiler_output_name = cachegrind.out.%s

This will output raw profile data to files named cachegrind.out.<your php script name> in /tmp. There are some options you may want to tweak, for example not appending repeated calls, or naming the files after something other than the script name.

Once you’ve made those changes, restart apache with
/sbin/service httpd restart

Now navigate to the pages you want to profile, and the data should appear as files in /tmp. Once you’ve done the operations you’re interested in, possibly repeatedly to generate a larger sample, set xdebug.profiler_enable to 0 in php.ini and restart apache again.
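Flipping that setting by hand gets tedious if you profile often, so here is a minimal sketch of a shell function that does it with sed. The name set_profiler is made up, and the /etc/php.ini path in the usage comment is an assumption; check where your distribution keeps the file.

```shell
# Hypothetical helper: switch the xdebug profiler on or off in a php.ini file.
#   $1 = path to php.ini, $2 = 0 (off) or 1 (on)
set_profiler() {
    tmp_file="$(mktemp)" || return 1
    # Rewrite only the xdebug.profiler_enable line. The result goes to a
    # temporary file first, then is copied back so the original file keeps
    # its ownership and permissions.
    sed "s/^xdebug\.profiler_enable *=.*/xdebug.profiler_enable = $2/" "$1" > "$tmp_file" &&
        cat "$tmp_file" > "$1"
    status=$?
    rm -f "$tmp_file"
    return "$status"
}

# Example, assuming php.ini lives at /etc/php.ini:
# set_profiler /etc/php.ini 0 && /sbin/service httpd restart
```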

I’d now got a nice collection of data, but that wasn’t much use without a way to understand what it meant. Kcachegrind is the most popular tool for viewing the output files, but it doesn’t have a native OS X version. I tried the darwin ports approach, but as always at least one of the dozens of dependencies failed to compile automatically, so I resorted to my Fedora Linux installation running in Parallels. If you’re on Windows, WinCacheGrind is a native version that’s had some good reviews. I couldn’t find a separate Linux binary, but it’s part of the kdevelop suite, so I was able to install that easily through the package manager.

Once that’s installed, copy over the cachegrind data files from your server, and then open them up in the application. You should see a list of function calls, and if you’re used to desktop profiling, a lot of the options for drilling down through the data should seem familiar. The kcachegrind team has some tips if you’re looking for a good way to get started.

For my case, I found that most of the time was spent inside IMAP, which is actually good news since it means I’m running close to the maximum download speed and my parsing code isn’t getting in the way too much.

An easy way to keep a running total in mysql

Numbers
Photo by Walsh

I needed to keep track of how many emails a user had sent on a given day in my database, and it was surprisingly tough to implement. As I came across a message, I needed to add one to the total for that user on that day. If a record for that time and address already exists, it’s simple:

UPDATE messagefrequencies SET senttocount=senttocount+1
WHERE address=currentaddress AND day=currentday;

When I implemented that, I realized that it didn’t do quite what I wanted. If there wasn’t already a record for that address and day, the update would fail. There were a potentially massive number of combinations of days and addresses, and I didn’t want to create blank rows for all of them, so I needed some way to increment the current value if it already exists, or create it and set it to 1 if there isn’t a row that matches.

My first attempt was to use the IF EXISTS syntax, but I discovered that’s only valid within stored procedures. The real solution turned out to be the opposite of the way I was thinking about the problem, since there’s an ON DUPLICATE KEY UPDATE clause that lets you attempt a row INSERT and then, if the row already exists, do an update instead. For this to work, the table needs a primary key or unique index covering the day and address columns, because that’s what MySQL uses to decide a row is a duplicate. One thing to watch out for is that this update syntax doesn’t require a SET; instead you just specify the columns you want to change.

INSERT INTO messagefrequencies (day, address, senttocount)
VALUES (TO_DAYS('2006-05-29 13:59:10'), 'somebody@gmail.com', 1)
ON DUPLICATE KEY UPDATE senttocount = senttocount+1;