How to create automatic blog categories with Lijit

Photo by Hawkexpress

I really like Lijit’s blog search widget, but I don’t want a cloud generated from the most popular searches. I’ve seen other blogs end up with some very inappropriate word combinations, apparently from people gaming the system. I also find the standard notion of tags very limiting; it’s only when I step back and see what I’ve been posting about that natural categories emerge. When I’m writing a post I often have no idea if it’s the first in a long series, or a one-off. I’d much rather have an automatic way of tagging all my posts, based on a few categories I describe after the fact.

If you look on my right bar, you’ll see a new ‘Categories’ list. These are actually canned Lijit searches, so clicking on them will bring up an in-context list of all the posts that match. For each category I’ve defined a Google search, often using the upper-case OR operator to pick a variety of different terms that are present in those types of posts. For example, the ‘Outdoors’ category searches for ‘hiking OR camping OR trails OR biking’.

I’ve mentioned this to the very nice people at Lijit as a feature request for a more general widget, but for now I’ve included a simple tool below to generate your own category lists. It generates the raw HTML, and you’ll need to work out how to get it into your own blog. It also calls back into Lijit’s scripts to bring up the in-context results, so you’ll need to have the main widget already installed.

Here’s what it takes to get this into Typepad:

  • Generate the HTML for your list using the form below. When you hit the button, copy the HTML that appears in the textbox to the clipboard.
  • Go to the TypeLists tab on your Typepad blog administration page.


  • Click on the Create New List link.


  • Set the type of the new list to Notes and the name to ‘Categories’.


  • Click on Add Item, and paste the HTML from the generator into the label textbox.
  • Go to the Publish tab and select the blog you want to add it to, and click Save Changes.
  • Go to Weblogs, then Design, and choose Select Content.
  • Disable the built-in categories module if you have it already selected, and click Save Changes.
  • Go to Content Ordering and drag the new ‘Categories’ list to where you want it, and save.

Now if you refresh your site, you should see the new categories appear.

function get_object(id)
{
  return document.getElementById(id);
}

function get_value(id)
{
  var currentobject = get_object(id);
  if (currentobject != null)
    return currentobject.value;
  else
    return "";
}

function generate_widget()
{
  // Note: the HTML string literals in this function were lost when the page
  // was converted to text. The list markup and the Lijit search link format
  // below are reconstructions, so check them against the markup your
  // installed Lijit widget actually expects for in-context results.
  var widgethtml = "<ul>";
  var username = get_value("username");
  var count;
  for (count = 0; count < 12; count += 1)
  {
    var nameid = "name" + count;
    var keywordsid = "keywords" + count;
    var namevalue = get_value(nameid);
    var keywordsvalue = get_value(keywordsid);
    if ((namevalue != "") && (keywordsvalue != ""))
    {
      var keywordsescaped = keywordsvalue.replace(/ /g, "+");
      // Assumed link format; the Lijit widget script intercepts these links
      // to show the in-context search results.
      var currenttag = "<li><a href=\"http://www.lijit.com/search?q=" +
        keywordsescaped + "&username=" + username + "\">";
      currenttag += namevalue;
      currenttag += "</a></li>";
      widgethtml += currenttag;
    }
  }
  widgethtml += "</ul>";
  var textpreview = get_object("textpreview");
  var htmlpreview = get_object("htmlpreview");
  textpreview.value = widgethtml;
  htmlpreview.innerHTML = widgethtml;
}

[Generator form: a ‘Lijit user name’ field, twelve pairs of Name and Search fields, a ‘Generated HTML’ textbox, and a live preview area]

You can also open this in a separate page in case Typepad’s cleanup breaks the tool, and here’s a screenshot from my category creation:
[Screenshot of the category creation form]

Easy user authentication for Windows with PHP

Photo by Richard Parmiter

The internet is slowly groping towards a single user identity system through the OpenID initiative, but one of the nice things about working inside a corporate firewall is that there’s already a directory of user names and passwords. In the dominant Microsoft world, you rely on Active Directory to keep track of all that information. The ‘Active’ prefix usually strikes fear into anyone integrating non-MS technology into a Windows world, since it often translates to ‘proprietary’, but they’ve actually done a really good job of making the directory information available through the LDAP open standard.

If you want to try converting your PHP-based internet app to intranet authentication, check out this tutorial on using LDAP from PHP with an Exchange server. If you’re interested in the details of using LDAP with PHP in general, things like how to install the LDAP module if it isn’t there by default on your PHP installation, check out this two-part guide.
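As a rough sketch of the core idea (the server address, domain, and account names here are placeholders, not details from those tutorials), authenticating a user boils down to attempting an LDAP bind with their credentials:

<?php
// Connect to a domain controller; replace the host with your own.
$ldap = ldap_connect('ldap://dc.example.corp');
ldap_set_option($ldap, LDAP_OPT_PROTOCOL_VERSION, 3);
ldap_set_option($ldap, LDAP_OPT_REFERRALS, 0);

// Active Directory accepts binds as DOMAIN\user or user@domain.
$username = 'EXAMPLE\\jsmith';
$password = 'secret';

// ldap_bind() only succeeds if the credentials are valid.
if (@ldap_bind($ldap, $username, $password)) {
    echo "Authenticated\n";
} else {
    echo "Login failed\n";
}
ldap_unbind($ldap);
?>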

Get a real-time view of the Apache error log

Photo by CR

I spend a lot of time developing on a remote server through an SSH connection, and I’ve found it tough to keep an eye on the error log file. Typically I’ve been running the tail unix command to look at the last 10 lines, but this only gives a snapshot of the errors at that instant; if I wanted to see more, I had to run tail again. I knew there had to be a better way, something I was missing, since I couldn’t imagine unix wizards putting up with this. Luckily I was right: if you pass the -f or --follow option to tail, it will continuously update, so you can see errors in real time as the lines are written to the log file. This is perfect; I can see at a glance what’s going on.

To do the same, just open an SSH session to your remote server, and then type in the following command:

tail -f /var/log/httpd/error_log

The log file location varies on different flavors of Linux, and if you have access problems, make sure the logged-in user has high enough permissions to see it.
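If you want some history as well as the live updates, you can combine the follow option with a line count:

# Show the last 50 lines, then keep printing new ones as they arrive:
tail -n 50 -f /var/log/httpd/error_log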

How to easily search and replace with sed

Photo by wai:ti

If you’ve used any flavor of unix for programming, you’re probably familiar with grep, the tool for locating patterns in text files. That’s great if you just want to search for a string, but what if you want to replace it?

Sed, the stream editor, is the answer, but that also brings up a new question: how on earth do I use it? It has one of the most obscure interfaces ever invented; its syntax makes obfuscated perl look like a model of clarity. Usually with a new tool I start off looking at a series of examples, like these sed one-liners, to get a rough mental model of how it works, and then dive into the documentation on specific points. That didn’t work with sed; I was still baffled even after checking those out. The man page didn’t help either; I could read the words but they didn’t make any sense.

Finally I came across my salvation, Bruce Barnett’s introduction and tutorial for sed. He hooked me with his first section on The Awful Truth about sed, with its reassurance that it’s not my fault I’m struggling to make head or tail of anything. He then goes through all the capabilities of sed in the order he learnt them. It’s a massive list, but even if you only get a few pages in you’ll know how to do a simple search and replace, like the one-liner shown below. Sed is a very powerful tool; it’s worth persevering with the rest so you can discover some of the advanced options, like replacing only between certain tags in a file (e.g. how to change the content text of a particular XML tag) and working from line numbers. Bruce is an entertaining companion for your journey too; he has great fun with some asides demonstrating how to make the syntax even harder to read, just for kicks.
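To give a flavor of the basics, here’s the canonical search-and-replace one-liner (the file names are just for illustration):

# Replace every occurrence of 'foo' with 'bar' in input.txt:
sed 's/foo/bar/g' input.txt > output.txt

Without the trailing g flag, sed only replaces the first match on each line, which catches out almost everyone at least once.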

Why massive datasets beat clever algorithms

Photo by H Wren

Jeremy Liew recently posted some hints, tips and cheats to better datamining. The main thrust, based on Anand Rajaraman’s class at Stanford, is that finding more data is a surer way to improve your results than tweaking the algorithm. This matches both my own experience trying to do datamining, and what I’ve seen with other companies’ technologies. Engineers have a bias towards making algorithms more complex, because that’s the direction that gets the most respect from their peers and offers the most intellectual challenge.

That makes us blind to the advantages of what the Jargon File calls wall-followers, after the Harvey Wallbanger robot that simply kept one hand on a wall at all times to complete a maze, and gave far better results than the sophisticated competitors using complex route-finding. Google’s PageRank is a great example, almost zen in its simplicity, with no semantic knowledge at all.

One hidden advantage to this simple-mindedness is very predictable behavior, since the simplicity means there are far fewer variables that affect the outcome. This is why machine learning is such a scary change for Google; there’s no guarantee that some untested combination of inputs won’t produce badly broken results.

Another obvious benefit is speed, of both development and processing. This lets you get up and running very fast and chew through a lot of data, which gives you much more coverage. Yahoo’s directory wasn’t beaten because Google ranked pages more accurately than humans, but because Yahoo could only cover a tiny fraction of what was out there.

On the face of it, this doesn’t sound good for a lot of the alternative search engines competing out there. If it’s hard to beat a simple ranking algorithm, should they all pack up and go home? No, I think there’s an immense amount that can be improved, both on the presentation side and by gathering novel data sets. For example, why can’t I pull up a personal page rank based on my friends’ and friends-of-friends’ preferences for sites? What about looking at my and their clickstreams to get an idea of those preferences?

How to profile your PHP code with Xdebug

Photo by Jek in the Box

I was adding some functionality to my mail system, and noticed it seemed to be running more slowly. I didn’t have any way to be sure though, and I didn’t know how to get performance information on my PHP code. After a bit of research, I discovered the xdebug PHP extension can gather profile data, and you can then use kcachegrind to view it.

On Fedora Linux, you can install xdebug by running the following line in superuser mode:
yum update php-pecl-xdebug

To enable capturing of performance data, you’ll then need to edit your php.ini file, adding the following lines:

; xdebug settings
xdebug.profiler_enable = 1
xdebug.profiler_append = 1
xdebug.profiler_output_name = cachegrind.out.%s

This will output raw profile data to files named cachegrind.out.<your php script name> in /tmp. There are some options you may want to tweak, for example not appending repeated calls, or naming the files after something other than the script name.
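For instance, to get a separate, uniquely-named file per request instead of one ever-growing file, you could switch off appending and use xdebug’s timestamp and process id placeholders (check the xdebug documentation for the full list of specifiers your version supports):

; one profile file per request
xdebug.profiler_enable = 1
xdebug.profiler_append = 0
xdebug.profiler_output_name = cachegrind.out.%t.%p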

Once you’ve made those changes, restart apache with
/sbin/service httpd restart

Now navigate to the pages you want to profile, and the data should appear as files in /tmp. Once you’ve done the operations you’re interested in, possibly repeatedly to generate a larger sample, set xdebug.profiler_enable back to 0 in php.ini and restart apache again.

I’d now got a nice collection of data, but that wasn’t much use without a way to understand what it meant. Kcachegrind is the most popular tool for viewing the output files, but it doesn’t have a native OS X version. I tried the DarwinPorts approach, but as always at least one of the dozens of dependencies failed to compile automatically, so I resorted to my Fedora Linux installation running in Parallels. If you’re on Windows, WinCacheGrind is a native version that’s had some good reviews. I couldn’t find a separate Linux binary, but kcachegrind is part of the kdevelop suite, so I was able to install that easily through the package manager.

Once that’s installed, copy over the cachegrind data files from your server, and then open them up in the application. You should see a list of function calls, and if you’re used to desktop profiling, a lot of the options for drilling down through the data should seem familiar. The kcachegrind team has some tips if you’re looking for a good way to get started.

For my case, I found that most of the time was spent inside IMAP, which is actually good news since it means I’m running close to the maximum download speed and my parsing code isn’t getting in the way too much.

An easy way to keep a running total in mysql

Photo by Walsh

I needed to keep track of how many emails a user had sent on a given day in my database, and it was surprisingly tough to implement. As I came across a message, I needed to add one to the total for that user on that day. If a record for that time and address already exists, it’s simple:

UPDATE messagefrequencies SET senttocount=senttocount+1
WHERE address=currentaddress AND day=currentday;

When I implemented that, I realized that it didn’t do quite what I wanted. If there wasn’t already a record for that address and day, the update would fail. There were a potentially massive number of combinations of days and addresses, and I didn’t want to create blank rows for all of them, so I needed some way to increment the current value if it already exists, or create it and set it to 1 if there isn’t a row that matches.

My first attempt was to use the IF EXISTS syntax, but I discovered that’s only valid within stored procedures. The real solution turned out to be the opposite of the way I was thinking about the problem: there’s an ON DUPLICATE KEY UPDATE clause that lets you attempt a row INSERT, and if the row already exists, perform an update instead. One thing to watch out for is that this update syntax doesn’t use the SET keyword; you just list the column assignments you want to make.

INSERT INTO messagefrequencies (day, address, senttocount)
VALUES (TO_DAYS('2006-05-29 13:59:10'), 'somebody@gmail.com', '1')
ON DUPLICATE KEY UPDATE senttocount = senttocount+1;
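One prerequisite the syntax hides: for the ON DUPLICATE KEY branch to ever fire, the table needs a PRIMARY KEY or UNIQUE index covering the columns that define a duplicate, here the day and address pair. A sketch of a matching table definition (the column types are my assumptions, not the original schema):

CREATE TABLE messagefrequencies (
  day INT NOT NULL,
  address VARCHAR(255) NOT NULL,
  senttocount INT NOT NULL DEFAULT 0,
  PRIMARY KEY (day, address)
);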

How to time mysql queries in PHP

Photo by BadBoy69

When you run a mysql query in the console, you get a line of information telling you how long it took to run. I was hoping to pull the same information in PHP to help me profile my database usage, but unfortunately there isn’t any way to access that directly through the API.

What you can do instead is time the mysql_query() call itself, by recording a time stamp before and after, and subtracting to get the total. This isn’t ideal since it will include a small amount of overhead for things like the socket connection to the database, but it will be good enough for most purposes. This is the code I’m using, as seen in phpMyAdmin:

list($usec, $sec) = explode(' ', microtime());
$querytime_before = ((float)$usec + (float)$sec);

// your query goes here

list($usec, $sec) = explode(' ', microtime());
$querytime_after = ((float)$usec + (float)$sec);

$querytime = $querytime_after - $querytime_before;
$strQueryTime = 'Query took %01.4f sec';
echo sprintf($strQueryTime, $querytime);
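As a side note, if you’re on PHP 5 or later, microtime(true) returns a float directly, so the same measurement can be written more compactly:

$before = microtime(true);
// your query goes here
$after = microtime(true);
echo sprintf('Query took %01.4f sec', $after - $before);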

How to speed up your website with Yslow

Photo by Ezu

One of the downsides of the increase in widgets and customization over the last few years is that they often result in a web page that takes seconds to load. Thanks to my desktop app heritage, I’m really sensitive to this, since poor responsiveness in an application destroys the user experience. The emotional response to waiting is frustration, which gives users both a subconscious motivation to avoid your service and a chance to get distracted by something else and abandon it.

That’s made me very wary of installing new widgets on this blog, since I sometimes see long loading times even now, and I’ve never quite been sure why. I wanted a new discussion service though, and Intense Debate looked very appealing, so I resolved to install it and also figure out how to profile my site.

Firebug is the best tool for getting under the hood and understanding what Firefox is up to when you load a page, but it’s more aimed at debugging script, CSS and markup problems rather than understanding performance issues. That’s where Yslow, a free plugin for Firebug from Yahoo, comes in.

It’s based on some principles of website optimization that Yahoo have worked out. It applies these rules programmatically to your page and then gives you a report card with details of the problems in each area. My site received an F. There’s a whole list of improvements I’ll be looking at implementing, but one interesting one is setting a long expiry time for external objects like scripts and images. This is inconvenient when you change a resource on the server, since you also need to change its name, but Yahoo estimate that 80% of fetches in a typical scenario can be avoided if you set an expiry header that allows the browser to cache the resource locally. I’ll be poking some of my widget providers to see if that’s possible.
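If you control the server, here’s a sketch of what that might look like using Apache’s mod_expires module (the module needs to be enabled, and the content types and lifetime are just illustrative):

# Far-future expiry headers for static assets, in the Apache config or .htaccess:
ExpiresActive On
ExpiresByType image/png "access plus 1 year"
ExpiresByType application/x-javascript "access plus 1 year"
ExpiresByType text/css "access plus 1 year"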

I highly recommend giving Yslow a shot on your site; you’ll learn a lot about what your page loads actually involve, and probably get some ideas for improving performance.

When should you use sessions in PHP?

Photo by BrittneyBush

For anyone switching to the web from traditional desktop programming, one of the hardest things to wrap your head around is the lack of state. There’s no inherent way of keeping information around while you’re interacting with a user. Each page request starts with a blank slate; you don’t have in-memory variables that can keep track of useful information.

If you’re working in PHP, this is where sessions look like a great solution. They’re a general-purpose mechanism built around cookies that lets you store arbitrary variables which are remembered across all page requests from a particular user. Under the hood, PHP sets a single session id cookie on the user’s machine, which is sent along with any subsequent page requests. That id is used to load a file from the server’s disk containing a list of variable names and values stored for that user. Any changes or additions the server makes to the data are saved back into that same file.

From the programmer’s point of view, you call session_start() and then have access to a global associative array, $_SESSION[]. You set and read entries in this array, and they remain persistent across page requests for a given user, as long as they keep sending the cookie. This all looks like a very natural model for storing state, one that traditional app programmers would feel very comfortable with. You could do something similar by setting cookies directly, but then you’d be exposing a lot of information to the user and opening the door to malicious tinkering with your internal server variables.
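Here’s a minimal example of that model in action, counting how many pages a visitor has requested:

<?php
// Start a new session, or resume the existing one for this visitor.
session_start();

// Initialize the counter on the first visit.
if (!isset($_SESSION['pageviews'])) {
    $_SESSION['pageviews'] = 0;
}
$_SESSION['pageviews'] += 1;

echo 'You have viewed ' . $_SESSION['pageviews'] . ' pages';
?>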

As you might have guessed, there’s no such thing as a free lunch, and sessions have some significant drawbacks. The data is stored in a file on the server’s disk, which means you’re tied to a single server and can’t load balance without duplicating that file, and any changes to it, across all machines. The file is locked so it can only be accessed by one request at a time, so simultaneous requests get serialized, which is a serious problem if one of them contains a long-running calculation. The locking also leads to deadlocks if you’re making sub-requests within the main page request to fetch parts of the page and passing the session id cookie along manually. In general, the behind-the-scenes nature of sessions makes it tough to tell who’s connected and to debug state problems.

Some of these issues are fixed if you write your own handler to store the sessions in a database rather than in a file, as sketched below. You still end up locking, though, and the database access makes the operation much more expensive. It also requires some planning ahead to know exactly what state you want to store, which abandons a lot of the flexibility that makes sessions so useful.
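For reference, PHP exposes this through session_set_save_handler(); the skeleton below shows its shape, with the actual database queries left as stubs to fill in:

<?php
// Every callback must be supplied; read() returns the serialized
// session data as a string, the others return success flags.
function sess_open($save_path, $session_name) { return true; }
function sess_close() { return true; }
function sess_read($id) {
    // SELECT the serialized data for $id from your sessions table.
    return '';
}
function sess_write($id, $data) {
    // INSERT or UPDATE the row for $id with $data.
    return true;
}
function sess_destroy($id) {
    // DELETE the row for $id.
    return true;
}
function sess_gc($max_lifetime) {
    // DELETE rows that haven't been touched in $max_lifetime seconds.
    return true;
}

session_set_save_handler('sess_open', 'sess_close', 'sess_read',
                         'sess_write', 'sess_destroy', 'sess_gc');
session_start();
?>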

I ended up with my own API for storing and reading information about each session in a database, using a special cookie ID as a key, generated once a user logs in and is authenticated. I also have a convention where the ID is passed through POST or GET parameters to make sub-requests very easy. It isn’t that different from storing sessions in a database, but it does avoid the locking problem, and makes the database cost explicit on the programming side. The fact that it’s associated with a particular user, and can only be created by logging in, makes it harder to spoof too, and lets you limit the number of connections for a single user.