One of the components I need for automatic tagging is an easy way to get the average frequencies of a set of words, so I can spot the unusual ones. The normal way of doing this would be to pull the data from a MySQL database with a query, but investigating this route revealed a lot of performance issues. With the volume and complexity of the analysis I’m doing, a custom-coded database seems the way to go.
A simple way to do that would be a command-line program that handled everything. Unfortunately the raw word-frequency data files can be very large, and reloading them for every query kills performance too.
What I need is something that behaves like the MySQL server, parsing the data files once at startup and then listening for requests. I did consider putting this into an Apache module, but my needs are simple enough that writing a standalone socket server seemed preferable. It also avoids the overhead of routing everything through HTTP.
Since this seems like a fairly common dilemma, and there’s not much information on how to solve it, I’ll show you how to write a simple generic server over the next few articles. I’m using C, not because I’m stuck in the ’80s, but because the high-performance data structures I’m using are in C++.
As a start, here’s a tar file containing the source for a very simple socket-based server, which is based on this great socket page from the Rensselaer Polytechnic Institute. To use it you’ll need a unix-y OS (it’s been tested on Red Hat Fedora and OS X). Un-tar the folders and open two terminals. In one, go into wcserver and run make. Do the same in the second terminal for wcclient. That should compile the two executables you need. Then run ./wcserver 9998 from the server terminal and ./wcclient localhost 9998 from the client.
The server sits listening on the port you specify (9998 in this case) until a client tries to connect. The client asks you to type in a message and passes it to the server. The server prints it out on its own terminal, sends an acknowledgement to the client and exits.
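To make that exchange concrete, here’s a minimal sketch in C of what a one-shot server along these lines might look like. It is not the code from the tar file: the function names (open_listener, handle_client, serve_once), the 256-byte buffer and the acknowledgement text are all placeholders chosen for illustration, and error handling is kept to the bare minimum.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical acknowledgement text; the real server's wording may differ. */
static const char ACK[] = "message received";

/* Create a TCP socket listening on the given port.
   Returns the listening descriptor, or -1 on failure. */
int open_listener(unsigned short port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Let the port be reused straight away after the server exits. */
    int yes = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof yes);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(fd, 5) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Read one message from a connected socket, print it on the server's
   terminal and send the acknowledgement back.
   Returns the number of bytes read, or -1 on error. */
ssize_t handle_client(int conn)
{
    char buf[256]; /* assumed maximum message size for this sketch */
    ssize_t n = read(conn, buf, sizeof buf - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    printf("Received: %s\n", buf);

    if (write(conn, ACK, sizeof ACK - 1) < 0)
        return -1;
    return n;
}

/* Wait for a single client, handle it, then shut down.
   Returns 0 on success, -1 on failure. */
int serve_once(unsigned short port)
{
    int listener = open_listener(port);
    if (listener < 0)
        return -1;

    int conn = accept(listener, NULL, NULL); /* blocks until a client connects */
    if (conn < 0) {
        close(listener);
        return -1;
    }

    ssize_t n = handle_client(conn);
    close(conn);
    close(listener);
    return n < 0 ? -1 : 0;
}
```

Calling serve_once(9998) from a main function would roughly mirror running ./wcserver 9998: the call blocks in accept until a client connects, handles that one message and then returns. Splitting the per-connection work into handle_client is a design choice of this sketch, made so the same function could later be called in a loop for multiple connections.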
In the next post I’ll go over the source code in depth, since there are a lot of subtleties in the details. After that I’ll show you how to handle multiple connections and do something useful for the clients.