Two ways you can easily find interesting phrases from an email

Maybe it was my weekly D&D game last night, but probability is on my mind. One thing I’ve learnt from working in games is that accuracy is overrated in AI. Most problems in that domain have no perfect solution. The trick is to find a technique that’s right often enough to be useful, and then make it part of a workflow that makes coping with the incorrect guesses painless for the user.

A lot of Amazon’s algorithms work like this. They recommend other books based on rough statistical measures which bring up mostly uninteresting items, but they’re right often enough to justify me spending a few seconds looking at what they found. The same goes for their statistically improbable phrases (SIPs). They’re odd and random most of the time, but usually one or two of them do give me an insight into the book’s contents.

This is interesting for email because when I’m searching through a lot of messages I need a quick way to understand something about what they contain without reading the whole text. One of the key features of Google’s search results is the summary they extract surrounding the keywords for each hit. This gives you a pretty good idea of what the page is actually discussing. In a similar way I want to present some key phrases from an email that very quickly give you a sense of what it’s about.

The main approach I’m using is vanilla SIPs, but there are a couple of other interesting heuristics (sounds so much more technical than ‘ways of guessing’). The first is looking for capitalized phrases within sentences. These are usually proper nouns, so you’ll get a rough idea of what people or places are discussed in a document. The second is to find sentences that end with a question mark, so you can see what questions are asked in an email.

These are fun because they’re both reliant on easily-parsed quirks of the language, rather than deep semantic processing. This means they’re quick and easy to implement. It also means that they’re not very portable to other languages (German capitalizes all nouns, for example), but one problem at a time!
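
Here’s a minimal sketch of how the two heuristics could look in C. The function names and the one-result-per-line output format are assumptions made for illustration, not the code behind the real feature, and the sentence splitting is deliberately naive (an abbreviation like "e.g." will confuse it).

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns nonzero if c is sentence-ending punctuation. */
static int ends_sentence(char c) {
    return c == '.' || c == '!' || c == '?';
}

/* Appends up to n bytes of src to the NUL-terminated string in out,
   truncating rather than overflowing. */
static void append(char *out, size_t out_size, const char *src, size_t n) {
    size_t used = strlen(out);
    if (used + 1 >= out_size) return;
    size_t room = out_size - used - 1;
    if (n > room) n = room;
    memcpy(out + used, src, n);
    out[used + n] = '\0';
}

/* Heuristic 1: collect runs of capitalized words that do not start a
   sentence. Phrases are written to out, one per line. */
void capitalized_phrases(const char *text, char *out, size_t out_size) {
    if (out_size == 0) return;
    out[0] = '\0';
    int sentence_start = 1; /* the next word opens a sentence */
    int in_phrase = 0;      /* the previous word belongs to an open phrase */
    const char *p = text;
    while (*p) {
        while (*p && isspace((unsigned char)*p)) p++;
        const char *word = p;
        while (*p && !isspace((unsigned char)*p)) p++;
        size_t len = (size_t)(p - word);
        if (len == 0) break;
        /* Trailing punctuation is not part of the word itself. */
        size_t core = len;
        while (core > 0 && ispunct((unsigned char)word[core - 1])) core--;
        /* core > 1 skips the pronoun "I". */
        int capitalized = core > 1 && isupper((unsigned char)word[0]);
        if (capitalized && !sentence_start) {
            if (in_phrase) append(out, out_size, " ", 1);
            else if (out[0] != '\0') append(out, out_size, "\n", 1);
            append(out, out_size, word, core);
            in_phrase = (core == len); /* punctuation closes the phrase */
        } else {
            in_phrase = 0;
        }
        sentence_start = ends_sentence(word[len - 1]);
    }
}

/* Heuristic 2: collect sentences that end with a question mark,
   one per line. */
void question_sentences(const char *text, char *out, size_t out_size) {
    if (out_size == 0) return;
    out[0] = '\0';
    const char *start = text;
    for (const char *p = text; *p; p++) {
        if (!ends_sentence(*p)) continue;
        if (*p == '?') {
            while (start < p && isspace((unsigned char)*start)) start++;
            if (out[0] != '\0') append(out, out_size, "\n", 1);
            append(out, out_size, start, (size_t)(p - start) + 1);
        }
        start = p + 1;
    }
}
```

Both functions only look at whitespace, capital letters and punctuation, which is why they’re cheap to run over a large mailbox and also why they won’t survive a move to German.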

How to use corporate data to identify experts


Nick over at the Disruptor Monkey blog talks about how their FindaYoda feature has proved a surprise hit. This is a way of seeing who else has a lot of material with a keyword you’re looking for, and its success backs up one of the hunches that’s driving my work. I know from my own experience of working in a large tech company that there’s an immense amount of wheel-reinventing going on just because it’s so hard to find the right person to talk to.

As a practical example I know of at least four different image comparison tools that were written by different teams for use with automated testing, with pretty much identical requirements. One of the biggest ways I helped productivity was simply by being curious about what other people were working on and making connections when I heard about overlap.

One of the tools I’d love to have is a way to map keywords to people. It’s one of the selling points of Krugle’s enterprise code search engine. Once you can easily search the whole company’s code you can see who else has worked with an API or algorithm. Trampoline Systems aims to do something similar using a whole company’s email store; they describe it as letting you discover knowledge assets. I’m trying to do something similar with my automatic tag generation for email.

It’s not only useful for the people at the coal face, it’s also a benefit that seems to resonate with managers. The amount and cost of the redundant effort is often clearer to them than to the folks doing the work. Since the executives are the ones who make the purchasing decisions, that should help the sales process.

How to write a socket server VI


Once you’ve got a socket server running locally on a web-hosting machine, you need to expose it to the outside world. Luckily this is quite easy using PHP. Sockets are well-supported and already heavily used to communicate with MySQL.

Take the source from the previous article, and build the server part on the machine you’ll be hosting the service on. You’ll want this service to run even if you’re logged off, so for now use the command:

nohup ./wcserver /tmp/myservicesocket &

This will keep the server process running even if you exit the current terminal window. On a production system you’ll actually want to start the server when the system reboots, but this approach is simpler for testing purposes.

The next hurdle to overcome is making sure that your Apache httpd process, which runs PHP, has the right permissions to access that socket file. This will depend on the user setup on your machine, but typically you’ll have a special apache user account that you’ll need to add to the file access list. For testing purposes you can always grant everyone on the machine access to the socket file, though it would be preferable for security reasons to be a bit more picky in production. Run this command to grant everyone permission to access it:

chmod a+rw /tmp/myservicesocket
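
For the pickier production setup, one option is to have the server tighten the permissions itself once the socket file exists. This is only a sketch: the helper name is hypothetical, and it assumes you’ve already arranged for the socket file’s group to be one that includes the Apache user.

```c
#include <sys/stat.h>

/* Restricts a socket file to its owner and group (mode 0660), so other
   accounts on the machine can no longer connect to it.
   Returns 0 on success, -1 on failure with errno set by chmod(). */
int restrict_socket_permissions(const char *socketPath) {
    return chmod(socketPath, S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP);
}
```

The natural place to call this is right after bind() has created the file and before listen(), passing the same path you gave to the server.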

Now that the server is running and available, you need to write an equivalent to the command-line client in PHP. Here’s the source and I’ll go over the details below.

$fp = stream_socket_client("unix:///tmp/myservicesocket",
    $errno, $errstr, 30);
if (!$fp) die("$errstr ($errno)\n"); // give up if the server isn't reachable

PHP takes care of a lot of the socket setup code for you. The only bit I found tricky was specifying a local file socket. It turns out you do that using the special ‘unix’ protocol specification in the URL, followed by the file path.

    fwrite($fp, "Some Message\n");
    // Echo everything the server sends back until it closes the connection
    while (!feof($fp)) {
        echo fgets($fp, 1024);
    }
    fclose($fp);

The connect call returns a file handle that you can use with the standard file access functions. As you can see, it looks pretty similar to the C version of the same code. You can see the results running on one of my servers here.

The example service isn’t doing very much so far. What’s really exciting about this approach is that it offers a completely language-independent way for any interesting computational service to be integrated into the standard LAMP stack. A lot of my work involves processor-heavy computation on hard problems like laying out large graphs and statistical analysis. This should let me move that processing from the client into online services.

Try a secret new search engine


Well, I’m not sure about secret, but it sure is mysterious. showed up in my visitor logs, and visiting the site it looks like rather a nifty visual search interface. It’s got thumbnails of the top results, and automatically generated keywords sorted by type down the left side:


The interesting part is that most of the site is returning 404 or authorization errors, which makes me wonder if they might still be in stealth mode? Unfortunately email messages to their public address bounce, they’ve got a private domain registration so I can’t get any contact details from that, and Google searches don’t get me any more information, so I can’t check with them before mentioning it here.

They’re using snap for the thumbnails, and I’m not sure how they’re pulling out the tags. The keywords definitely look automatically generated, rather than user driven. I’d love to know more about their work, so if anyone has more details or a way to contact them, email me or add a comment.

How to write a socket server V


As I mentioned in my previous post on security, it’s much better if you can confine your connections to clients on the current machine and completely block external attackers. Since my service is going to fit in the same slot as MySQL in the LAMP ecology, confining it to the local machine and having an external interface provided through PHP is reasonable for my purposes. A bonus is that I should get better performance thanks to skipping several layers of network code.

To restrict connections, you need to use a different family of sockets based on the file system rather than TCP/IP network connections. Instead of a port number, the client and server collaborate by agreeing on a file name they’ll use to contact each other. There are potential security issues involved in choosing the location of the file, but if you trust all processes on the local machine then using something like /tmp/myservicesocket should be fine.

The code itself looks pretty similar to the internet socket example, with the main change being that we use a sockaddr_un structure to specify the file name of the socket rather than a sockaddr_in to define the IP address and port number. Here’s the new source, and I’ll describe the server changes below.

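struct sockaddr_un listenAddress;
// Reject names that won't fit in the fixed-size sun_path array
if (listenFileNameLength >= sizeof(listenAddress.sun_path))
    error("ERROR, the filename must be shorter than the maximum path length "
        "(normally around a hundred characters)\n");

// A stream socket in the local (file system) family. SOCK_STREAM is assumed
// here, to match the earlier internet socket example.
const int listenSocketFile = socket(AF_UNIX, SOCK_STREAM, 0);
if (listenSocketFile < 0)
    error("ERROR opening server socket");

bzero((char *) &listenAddress, sizeof(listenAddress));
listenAddress.sun_family = AF_UNIX;
// The length check above guarantees the name and its terminator fit
strncpy(listenAddress.sun_path, listenFileName,
    sizeof(listenAddress.sun_path) - 1);

// On the client side, connect() takes the same kind of address
const int connectResult = connect(transferSocketFile,
    (struct sockaddr*)&serverAddress,
    sizeof(serverAddress));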

We’re taking a single filename as an argument to both the client and server now, so you’ll need to run them as "./wcserver /tmp/yourfilename" and "./wcclient /tmp/yourfilename". The structure the filename is stored in is an odd one. It’s basically a small integer to specify the family (one or two bytes depending on the system), preceded on some systems by another byte holding a length, and immediately followed in memory by a stream of bytes representing the string for the file name, e.g.:

|sun_family byte|p|a|t|h|b|y|t|e|s|…

There’s a lot of confusion about this structure: whether it’s NULL terminated, whether you can dynamically allocate a larger-than-provided path string, and lots of other behavior that seems to vary between the *nixes.

You specify its total size as an argument to bind() or connect(), so in theory you can duplicate this layout with a string of any length. The default sockaddr_un structure defines a fixed-length array of chars after the family, and in most implementations this holds at least 100 (104 on OS X, 108 on Red Hat). I experimented with dynamic sizing and found myself in a scary wood full of underdocumented questions, so I decided to go with the default structure layout. This limits me to file names less than 100 characters, but it also means I’m using the same code as 99% of the other local socket programs out there. This should make it a lot more portable and less prone to strange OS bugs.
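
To make the default-layout approach concrete, here’s a minimal sketch of filling in the structure and opening a listening socket with it. The helper names are hypothetical and this is not the wcserver source, just one way to apply the length check and fixed-size structure described above.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Fills addr for the given socket file name using the default fixed-size
   layout. Returns 0 on success, -1 if the name does not fit in sun_path. */
int make_unix_address(struct sockaddr_un *addr, const char *fileName) {
    const size_t length = strlen(fileName);
    if (length >= sizeof(addr->sun_path))
        return -1; /* no room for the name plus its terminating NUL */
    memset(addr, 0, sizeof(*addr));
    addr->sun_family = AF_UNIX;
    memcpy(addr->sun_path, fileName, length + 1);
    return 0;
}

/* Creates a listening local socket bound to fileName.
   Returns the socket descriptor, or -1 on any failure. */
int open_listening_socket(const char *fileName) {
    struct sockaddr_un addr;
    if (make_unix_address(&addr, fileName) < 0)
        return -1;
    const int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    unlink(fileName); /* remove a stale socket file from a previous run */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 5) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Passing sizeof(addr) for the whole structure, rather than a hand-computed length, is the choice that sidesteps the per-OS questions about termination and dynamic sizing.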

All the articles in the simple server series

How to write a socket server IV

If you’re going to be writing a socket server, you need to be thinking about security right from the start. You’re opening a new door for hackers to attack your host computers, and you’re responsible for making sure that you’re not making your users vulnerable.

The first option for securing your service is deciding whether you need a TCP/IP based socket open to external machines, or whether an AF_UNIX style socket that is only accessible by other programs on the local computer is good enough. A local socket not only prevents external hackers from connecting, it’s also a lot faster because it skips the network code that AF_INET sockets involve. I’ll provide a modified example that demonstrates this in a later post.

If you do need an internet socket, then your top priority must be to avoid writing code that external attackers can hack using buffer overflows. When you’re writing code that accepts user inputs, never use plain C functions like sprintf() or gets() that don’t allow you to specify a length for the buffer you’re writing into. If the user has set things up so they send input that overflows your buffer, they could write to the stack and start executing arbitrary code. The existing example just uses read() with a fixed length, so a client can’t overflow the buffer, but as soon as you have to start accepting more arbitrary inputs it’s something you need to think about. This is one great reason to use an established language like Java, which doesn’t suffer from the same sort of vulnerabilities as C. If you are using C, look at newer functions that take a buffer length argument, like snprintf().
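
As a small illustration of the length-checked style, here’s a sketch that builds a reply from client-supplied text with snprintf(). The function name and the reply format are made up for the example.

```c
#include <stddef.h>
#include <stdio.h>

/* Formats a reply into buffer without ever writing past bufferSize bytes.
   Returns 0 on success, -1 if the reply did not fit or formatting failed. */
int format_reply(char *buffer, size_t bufferSize, const char *clientText) {
    const int needed = snprintf(buffer, bufferSize, "You said: %s\n", clientText);
    if (needed < 0 || (size_t)needed >= bufferSize)
        return -1; /* output was truncated to fit, so treat it as an error */
    return 0;
}
```

snprintf() returns the length the full string would have needed, so comparing that against the buffer size tells you when the input was too long, and you can reject the request instead of silently sending a truncated answer.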

Another problem that’s tough to find a solution for, but is hopefully not a problem as long as your host isn’t compromised, is local port hijacking. This is where another process on the same machine tries to grab your socket. In certain cases they can get priority over your service, and fool the outside world into connecting to them instead. You can try to avoid it by binding your service to a specific IP address of the host instead of INADDR_ANY, but it can be hard to do this for all the IPs a server may have.
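
Binding to one address instead of INADDR_ANY only changes how the sockaddr_in is filled in. A minimal sketch, with a hypothetical helper name and IPv4 only:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Fills address so that bind() will only accept connections arriving on
   the given local IPv4 address. Returns 0 on success, -1 if ipText is not
   a valid dotted-quad address. */
int make_listen_address(struct sockaddr_in *address, const char *ipText,
                        uint16_t port) {
    memset(address, 0, sizeof(*address));
    address->sin_family = AF_INET;
    address->sin_port = htons(port);
    if (inet_pton(AF_INET, ipText, &address->sin_addr) != 1)
        return -1;
    return 0;
}
```

You’d then pass the result to bind() as (struct sockaddr *)&address with sizeof(address). A host with several IPs needs one listening socket per address, which is where the approach starts to get awkward.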

Firewalls are another issue. It’s tempting to view them as an obstruction, and try to work around them by piggy-backing on a well-known port like 80, but from a security point of view you’re much better off if you can work in concert with them. It will take a little bit more effort sometimes to persuade the firewall owner to authorize your service, but there are some rules that make life easier.

  • Make your port number configurable. This is good in general: there may be another service running on your default port, and it allows you to fit into any firewall policies about which port ranges are open.
  • Only use a single port. Fewer ports mean fewer rules in the firewall, which means less maintenance, lighter processing load and better performance. You can use your own light-weight protocol to distinguish different types of data you’re sending across a single connection, rather than using multiple connections to do the same job.
  • Don’t connect back to the client with a new socket. FTP does this, and it means that the firewalls on both ends need to be set up correctly, and it opens up the possibility that an attacker could connect to a client listening for your incoming connection. It’s much better to leave the connection initiation to the client, and then just use that for two-way communication.
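
The first two rules are easy to sketch in C. The code below is an illustration under my own assumptions, not part of the example server: parse_port() validates a port taken from the command line or a config file, and the three-byte header (a type byte followed by a 16-bit big-endian payload length) is a made-up format showing how one connection can carry several kinds of message.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parses a port number from text such as a command-line argument.
   Returns 0 and stores the port on success, -1 if the text is not a
   whole number in the range 1..65535. */
int parse_port(const char *text, uint16_t *port) {
    char *end = NULL;
    errno = 0;
    const long value = strtol(text, &end, 10);
    if (errno != 0 || end == text || *end != '\0')
        return -1;
    if (value < 1 || value > 65535)
        return -1;
    *port = (uint16_t)value;
    return 0;
}

/* A hypothetical 3-byte message header: one type byte, then the payload
   length as a 16-bit big-endian number. */
enum { HEADER_SIZE = 3 };

/* Writes the message type and payload length into header. */
void encode_header(uint8_t header[HEADER_SIZE], uint8_t type, uint16_t length) {
    header[0] = type;
    header[1] = (uint8_t)(length >> 8);
    header[2] = (uint8_t)(length & 0xff);
}

/* Reads the message type and payload length back out of header. */
void decode_header(const uint8_t header[HEADER_SIZE], uint8_t *type,
                   uint16_t *length) {
    *type = header[0];
    *length = (uint16_t)((header[1] << 8) | header[2]);
}
```

The receiver reads HEADER_SIZE bytes, decodes them, then reads exactly that many payload bytes and dispatches on the type, so everything travels over the one port the firewall already allows.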

These are some of the socket-specific issues, but now that you’re writing code that’s open to external attack, you need to ask yourself about the security implications of every design and implementation decision you make. For more information, here’s one of my favorite guides. It’s actually from Microsoft, but most of the content is applicable on any platform.
