Try a secret new search engine

January 21, 2008 By Pete Warden in GoogleHotKeys, PeteSearch, Search Tips, SearchMash 1 Comment

Well, I’m not sure about secret, but it sure is mysterious. http://alpha.managedq.com/ showed up in my visitor logs, and visiting the site it looks like rather a nifty visual search interface. It’s got thumbnails of the top results, and automatically generated keywords sorted by type down the left side:

The interesting part is that most of the site is returning 404 or authorization errors, which makes me wonder if they might still be in stealth mode. Unfortunately, email messages to their public inquiries@managedq.com address bounce, they’ve got a private domain registration so I can’t get any contact details from that, and Google searches don’t turn up any more information, so I can’t check with them before mentioning it here.

They’re using snap for the thumbnails, and I’m not sure how they’re pulling out the tags. The keywords definitely look automatically generated, rather than user driven. I’d love to know more about their work, so if anyone has more details or a way to contact them, email me or add a comment.

Feedburner subscription figures are still broken by Feedblitz

January 20, 2008 By Pete Warden in Uncategorized Leave a comment

Despite their hopes they’d fixed the problem on Thursday night, Feedblitz is still adding random numbers to my Feedburner subscriber total, with 72 extra for Saturday and a whopping 257 on Friday! Now, if they could just make things right by adding that many actual subscribers every day…

How to write a socket server V

January 19, 2008 By Pete Warden in Coding, Simple server Leave a comment


As I mentioned in my previous post on security, it’s much better if you can confine your connections to clients on the current machine and completely block external attackers. Since my service is going to fit in the same slot as MySQL in the LAMP ecology, confining it to the local machine and having an external interface provided through PHP is reasonable for my purposes. A bonus is that I should get better performance thanks to skipping several layers of network code.

To restrict connections, you need to use a different family of sockets based on the file system rather than TCP/IP network connections. Instead of a port number, the client and server collaborate by agreeing on a file name they’ll use to contact each other. There are potential security issues involved in choosing the location of the file, but if you trust all processes on the local machine then using something like /tmp/myservicesocket should be fine.

The code itself looks pretty similar to the internet socket example, with the main change being that we use a sockaddr_un structure to specify the file name of the socket rather than a sockaddr_in to define the IP address and port number. Here’s the new source, and I’ll describe the server changes below.

	/* Server side: describe the socket file we will listen on */
	struct sockaddr_un listenAddress;
	if (listenFileNameLength>=sizeof(listenAddress.sun_path))
		error("ERROR, the filename must be shorter than the maximum path length (normally around a hundred characters)\n");

	const int listenSocketFile = socket(AF_UNIX,
		SOCK_STREAM, 0);
	if (listenSocketFile < 0) 
		error("ERROR opening server socket");

	bzero((char *) &listenAddress, sizeof(listenAddress));
	listenAddress.sun_family = AF_UNIX;
	/* Copy one byte less than the array size so the zeroed final byte always terminates the path */
	strncpy(listenAddress.sun_path, listenFileName, 
		sizeof(listenAddress.sun_path) - 1);

	/* Client side: serverAddress is the client's sockaddr_un holding the same file name */
	const int connectResult = connect(transferSocketFile,
		(struct sockaddr*)&serverAddress,
		sizeof(serverAddress));

We’re taking a single filename as an argument to both the client and server now, so you’ll need to run them as "./wcserver /tmp/yourfilename" and "./wcclient /tmp/yourfilename". The structure the filename is stored in is an odd one. It’s basically a char to specify the family, preceded on some systems by another byte for a string length, and immediately followed in memory by a stream of bytes representing the string for the file name, e.g.:

|sun_family byte|p|a|t|h|b|y|t|e|s|…

There’s a lot of confusion about this structure: whether it’s NULL terminated, whether you can dynamically allocate a larger path string than the one provided, and lots of other behavior that seems to vary between the *nixes.

You specify its total size as an argument to bind() or connect(), so in theory you can duplicate this layout with a string of any length. The default sockaddr_un structure defines a fixed-length array of chars after the family, and in most implementations this is at least 100 (104 on OS X, 108 on Red Hat). I experimented with dynamic sizing and found myself in a scary wood full of underdocumented questions, so I decided to go with the default structure layout. This limits me to file names of less than 100 characters, but it also means I’m using the same code as 99% of the other local socket programs out there. This should make it a lot more portable and less prone to strange OS bugs.
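To make the default-layout approach concrete, here is a minimal sketch of filling in the structure and opening a listener on it. The function names fillLocalAddress and openLocalListener are made up for this illustration and are not part of the downloadable source, and removing a stale socket file with unlink() before binding is an assumption about how you’d want restarts to behave.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Fills in a sockaddr_un using the default fixed-size layout.
   Returns 0 on success, or -1 if the file name plus its terminating
   zero will not fit in sun_path. (Hypothetical helper.) */
int fillLocalAddress(struct sockaddr_un* address, const char* fileName)
{
	const size_t fileNameLength = strlen(fileName);
	if (fileNameLength >= sizeof(address->sun_path))
		return -1;

	memset(address, 0, sizeof(*address));
	address->sun_family = AF_UNIX;
	/* The +1 copies the terminating zero as well */
	memcpy(address->sun_path, fileName, fileNameLength + 1);
	return 0;
}

/* Creates a listening AF_UNIX socket on the given file name.
   Returns the listener's file descriptor, or -1 on any failure.
   (Hypothetical helper.) */
int openLocalListener(const char* fileName)
{
	struct sockaddr_un address;
	if (fillLocalAddress(&address, fileName) != 0)
		return -1;

	const int listenSocketFile = socket(AF_UNIX, SOCK_STREAM, 0);
	if (listenSocketFile < 0)
		return -1;

	/* Assumption: a socket file left over from an earlier run should be
	   replaced, since bind() fails if the file already exists */
	unlink(fileName);

	if (bind(listenSocketFile, (struct sockaddr*)&address, sizeof(address)) < 0 ||
		listen(listenSocketFile, 5) < 0)
	{
		close(listenSocketFile);
		return -1;
	}
	return listenSocketFile;
}
```

Passing sizeof(address) for the whole structure is the conservative choice here, since it sidesteps the question of how each OS counts the path bytes.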

All the articles in the simple server series

How to write a socket server IV

January 18, 2008 By Pete Warden in Coding, Simple server Leave a comment

If you’re going to be writing a socket server, you need to be thinking about security right from the start. You’re opening a new door for hackers to attack your host computers, and you’re responsible for making sure that you’re not making your users vulnerable.

The first option for securing your service is deciding whether you need a TCP/IP based socket open to external machines, or whether an AF_UNIX style socket that is only accessible by other programs on the local computer is good enough. You not only prevent external hackers from connecting, it’s also a lot faster to skip the network code that AF_INET sockets involve. I’ll provide a modified example that demonstrates this in a later post.

If you do need an internet socket, then your top priority must be to avoid writing code that external attackers can hack using buffer overflows. When you’re writing code that accepts user inputs, never use plain C functions like sprintf() or gets() that don’t allow you to specify a length for the buffer you’re writing into. If the user has set things up so they send input that overflows your buffer, they could write to the stack and start executing arbitrary code. The existing example just uses read() with a fixed length, so there’s no chance of a client exploit, but as soon as you have to start accepting more arbitrary inputs it’s something you need to think about. This is one great reason to use an established language like Java, which doesn’t suffer from the same sort of vulnerabilities as C. If you are using C, look at newer functions that take a buffer length argument, like snprintf().
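As a small illustration of the length-limited style, here is a sketch that formats a reply with snprintf() and also notices when the text had to be cut short. The function name formatGreeting and the greeting text are invented for the example.

```c
#include <stddef.h>
#include <stdio.h>

/* Writes "Hello, <clientName>" into output without ever writing past
   outputLength bytes. Returns 0 if the whole string fitted, or -1 if it
   was truncated or snprintf() reported an error. (Hypothetical helper.) */
int formatGreeting(char* output, size_t outputLength, const char* clientName)
{
	/* snprintf() returns the length the full string would have needed,
	   not counting the terminating zero */
	const int neededLength = snprintf(output, outputLength,
		"Hello, %s", clientName);

	if (neededLength < 0 || (size_t)neededLength >= outputLength)
		return -1;
	return 0;
}
```

Checking the return value matters because a silently truncated string can be a bug of its own, even though it can no longer overflow the buffer.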

Another problem that’s tough to find a solution for, but is hopefully not a problem as long as your host isn’t compromised, is local port hijacking. This is where another process on the same machine tries to grab your socket. In certain cases they can get priority over your service, and fool the outside world into connecting to them instead. You can try to avoid it by binding your service to a specific IP address of the host instead of INADDR_ANY, but it can be hard to do this for all the IPs a server may have.
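Binding to one specific address mostly comes down to how you fill in the sockaddr_in before calling bind(). This is a rough sketch, assuming IPv4 and an address supplied as a dotted string; fillListenAddress is a made-up name rather than anything from the example server.

```c
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Fills in an IPv4 address structure for a specific local IP instead of
   INADDR_ANY. Returns 0 on success, or -1 if ipString is not a valid
   dotted IPv4 address. (Hypothetical helper.) */
int fillListenAddress(struct sockaddr_in* address, const char* ipString,
	int portNumber)
{
	memset(address, 0, sizeof(*address));
	address->sin_family = AF_INET;
	address->sin_port = htons((unsigned short)portNumber);

	/* inet_pton() returns 1 only when the string parsed cleanly */
	if (inet_pton(AF_INET, ipString, &address->sin_addr) != 1)
		return -1;
	return 0;
}
```

The result is passed to bind() exactly as before. A host with several addresses would need one listener per address, which is the awkward part mentioned above.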

Firewalls are another issue. It’s tempting to view them as an obstruction, and try to work around them by piggy-backing on a well-known port like 80, but from a security point of view you’re much better off if you can work in concert with them. It will take a little bit more effort sometimes to persuade the firewall owner to authorize your service, but there are some rules that make life easier.

Make your port number configurable. This is good in general, there may be another service running on your default port, and it allows you to fit into any firewall policies about which port ranges are open.
Only use a single port. Fewer ports mean fewer rules in the firewall, which means less maintenance, lighter processing load and better performance. You can use your own light-weight protocol to distinguish different types of data you’re sending across a single connection, rather than using multiple connections to do the same job.
Don’t connect back to the client with a new socket. FTP does this, and it means that the firewalls on both ends need to be set up correctly, and opens up the possibility that an attacker could connect to a client listening for your incoming connection. It’s much better to leave the connection initiation to the client, and then just use that for two-way communication.
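One way to carry several kinds of data over a single port is to put a tiny header in front of every message. The layout below (one type byte followed by a four-byte length in network order) is purely an assumption for illustration, as are the function names; it isn’t the protocol the example server uses.

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* Assumed header layout: 1 byte of message type, then 4 bytes of
   payload length in network byte order */
enum { kFrameHeaderLength = 5 };

/* Writes the header for one message into a kFrameHeaderLength buffer */
void encodeFrameHeader(unsigned char* header, uint8_t messageType,
	uint32_t payloadLength)
{
	const uint32_t networkLength = htonl(payloadLength);
	header[0] = messageType;
	memcpy(header + 1, &networkLength, sizeof(networkLength));
}

/* Reads the type and payload length back out of a received header */
void decodeFrameHeader(const unsigned char* header, uint8_t* messageType,
	uint32_t* payloadLength)
{
	uint32_t networkLength;
	*messageType = header[0];
	memcpy(&networkLength, header + 1, sizeof(networkLength));
	*payloadLength = ntohl(networkLength);
}
```

The receiver reads the five header bytes first, then knows both what kind of message follows and exactly how many bytes to read for it. A real server should also reject lengths above some sensible maximum before allocating anything.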

These are some of the socket-specific issues, but now you’re writing code that’s open to external attack, you need to ask yourself about the security implications of every design and implementation decision you make. For more information, here’s one of my favorite guides. It’s actually from Microsoft, but most of the content is applicable on any platform.

All the articles in the simple server series

Feedburner or Feedblitz subscriber count problems?

January 18, 2008 By Pete Warden in Personal Leave a comment

A couple of weeks ago I added a Feedblitz option (the box on the top left) for people who wanted to subscribe by email. I haven’t seen much takeup, but yesterday’s Feedburner subscriber stats show 69 new people using it! That seemed suspicious to say the least, but it looks like I’m not the only person seeing the same issue. I’m jealous though, he got 16,000 imaginary subscribers whereas the glitch still left my total in triple digits.

Are your emails too robotic?

January 18, 2008 By Pete Warden in Addon Promotion 1 Comment


I love the random email conversations my blogging spawns, which have involved everything from why hedgehogs hibernate to arcane technical details of COM dll registration. I recently received an email from a young entrepreneur asking me to check out their site. Then, a few days later I had another email saying basically "I noticed you read the last email we sent you, why didn’t you reply?". Checking the source of the original mail, he’d embedded an image bug to work like a read receipt.

I had to admire his enthusiasm and drive, but was definitely creeped out by the way he was tracking me. I wish I understood why my marketing sometimes fails, so to try and turn this into something positive, I thought I’d discuss what didn’t work for me with his original message:

——

Subject: Blog Question

I came across your blog and thought you might be interested in trying out my product X. We help you solve common problems with creating your blog from placing in text links, placing pictures and videos, and more.

Check us out at http://X.com , use the invite code: X

If you have problems understanding what we do on the site please let me know, we are a young company with a great product. We have a new site coming out in the next few days that will make everything more clear.
Thanks for your time, I look forward to hearing what you have to say.
Thanks,

Robert X

X@X.com
Forward email
This email was sent to searchbrowser@gmail.com, by X@X.com
Update Profile/Email Address | Instant removal with SafeUnsubscribe™ | Privacy Policy.

Email Marketing by
X | 142 South X | San Francisco | CA | 94103

——-

I came across your blog and thought you might be interested in
trying out my product X.

For a start, there’s no "Hi Pete", or any other personalization that demonstrates he’s an actual reader of my blog. Mentioning a recent article is a quick way to show you’ve spent a little time learning about your recipient.

We help you solve common problems with
creating your blog from placing in text links, placing pictures and
videos, and more.

This sentence is pretty vague and unclear. You have a few seconds of someone’s attention, you need to make a clear and concrete promise about the benefit to them of investing more time on your proposal.

If you have problems understanding what we do on the site please let me know, we are a young company with a great product.

I actually quite like this part, it’s projecting a good attitude and wanting to start a conversation.

We have a new site coming out in the next few days that will make everything more clear.

This, not so much. That made me think, "well, maybe I’ll just wait a few days and look then, rather than spending time now on something unclear".

Thanks for your time, I look forward to hearing what you have to say.
Thanks,

The double thanks adds to the assembled-by-robots feel.

This is a screenshot of the actual footer to the message. The mentions of email marketing and unsubscribe options made it seem even more like this was a mass-mailing. The tracking bug is hidden in there too, though I didn’t know it until I checked.

I will now be replying to Robert, and I hope he takes this in the spirit it’s intended. I spend plenty of time trying to get people interested in my projects, so I know it’s not easy. One thing I have learnt is that there’s no short-cut to getting someone’s attention. You have to put in the time to understand what your recipient wants, and highlight what you’re offering that fits in with that. A form letter like this is unlikely to get the response you want.

How to write a socket server III

January 15, 2008 By Pete Warden in Coding, Simple server Leave a comment


So far, we’ve created a server program that will accept a single client connection and then exit. The next step is to handle more than one client. Have a look at the final code that does this, and then I’ll explain how it works.

The most obvious way to handle multiple requests is to loop on accept() inside main. You could pull a client socket for every new connection and then have the client conversation inside the loop. The flaw with this plan is that transferring data back and forth between the server and the client might take a comparatively long time, and there might be other clients trying to connect who will be blocked until the first client is completely done.

What we need is a way to split up the work so we can have multiple conversations in progress at the same time. This is efficient because networks are a lot slower at transferring data than processors are, so in the gaps where the server is waiting for a client response there’s lots of CPU time to handle other connections, especially on multi-core machines.

There are three main techniques for splitting up the work. Probably the most complex but also the most flexible is using non-blocking sockets and select() to run a single loop that pulls chunks of data from multiple connections. This pattern does a small amount of work for each chunk before looping around and dealing with another conversation in the next iteration. thttpd is an excellent example of this approach. The complexity comes because it takes careful design to make sure that the work you’re doing every time you deal with a chunk of data always runs quickly enough for other conversations to be dealt with responsively. It’s also hard to make this event-based model run well on more than one processor.
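To give a feel for the select() pattern, here is a sketch of a single pass of such a loop. It only counts the bytes it reads, where a real server would hand each chunk to its protocol code, and the function name pollConnectionsOnce and the bytesSeen array are invented for the example.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

/* Runs one iteration of a select()-based loop. Blocks until at least one
   descriptor is readable, reads a single chunk from each ready one and
   adds its size to bytesSeen[i]. Returns the number of connections that
   produced data, or -1 if select() failed. (Hypothetical helper.) */
int pollConnectionsOnce(const int* socketFiles, int socketCount,
	long* bytesSeen)
{
	fd_set readSet;
	int highestFile = -1;
	int activeCount = 0;
	int i;

	/* Tell select() which descriptors we care about */
	FD_ZERO(&readSet);
	for (i = 0; i < socketCount; i++)
	{
		FD_SET(socketFiles[i], &readSet);
		if (socketFiles[i] > highestFile)
			highestFile = socketFiles[i];
	}

	if (select(highestFile + 1, &readSet, NULL, NULL, NULL) < 0)
		return -1;

	/* Do a small amount of work for each connection that has data */
	for (i = 0; i < socketCount; i++)
	{
		char chunk[256];
		ssize_t chunkLength;

		if (!FD_ISSET(socketFiles[i], &readSet))
			continue;

		chunkLength = read(socketFiles[i], chunk, sizeof(chunk));
		if (chunkLength > 0)
		{
			bytesSeen[i] += chunkLength;
			activeCount++;
		}
		/* A length of zero means the other end closed the connection;
		   a real server would close and forget this descriptor here */
	}
	return activeCount;
}
```

The server would call this repeatedly from its main loop, adding the listener socket to the set as well so that new connections show up as readable events.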

The other two approaches are a lot simpler to write. You can create a new process to deal with every client conversation, either using fork() or even explicitly calling a command line like inetd does. This has the advantage of being incredibly simple to write, but on some OS’s (though not Linux) creating a new process can be time-consuming and inefficient. For my purposes I also need to share resources between the conversations, since the goal of my server is to allow fast access to preloaded word frequency data. There are mechanisms to communicate between processes that would allow this, but the simplest way to do it is using threads.
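For comparison with the threaded version, the process-per-client approach might look something like this sketch. The conversation function is a stand-in that reads one message and replies "ok", and both function names are made up for the illustration.

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the real client conversation: reads one
   message and sends back a two-byte acknowledgement */
static void handleClientConversation(int transferSocketFile)
{
	char buffer[256];
	if (recv(transferSocketFile, buffer, sizeof(buffer), 0) > 0)
		send(transferSocketFile, "ok", 2, 0);
}

/* Forks a child process to look after one client. In the parent this
   returns the child's process id (or -1 if fork() failed); the child
   never returns from this function. (Hypothetical helper.) */
pid_t spawnClientProcess(int transferSocketFile)
{
	const pid_t childProcess = fork();

	if (childProcess == 0)
	{
		/* Child: carry out the whole conversation, then exit */
		handleClientConversation(transferSocketFile);
		close(transferSocketFile);
		_exit(0);
	}

	/* Parent: the child holds its own copy of the descriptor, so the
	   parent closes its copy and goes back to accept() */
	if (childProcess > 0)
		close(transferSocketFile);

	return childProcess;
}
```

The parent also has to collect finished children at some point, for example with waitpid(), or they linger as zombie processes. Note that nothing the child loads or changes in memory is visible to the parent afterwards, which is the sharing problem described above.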

Threads are similar to lightweight processes, but they share full access to all the same memory as the parent. This is both a blessing and a curse. Thanks to all the variations that timing and resource-locking introduce, threading bugs can be incredibly time-consuming to track down, and there can often be subtle bugs in even simple threading code.

The basic idea for threading the server is that we’ll create a new worker thread for every client connection. This thread will carry on the conversation, but it will actually be paused and control passed to a different thread every so often, so that both the main listener and other worker threads get a chance to do the work they need to. The beauty of the thread model is that the actual data transfer code looks purely procedural just like the original single connection version, there’s no alteration to the internal logic needed to deal with the multi-tasking.

The most common threading library is the POSIX standard, though Windows has its own libraries for this. I’ve implemented the new server code using pthread, adding -lpthread to the compile flags to ensure the library is linked in. There’s a new function to deal with each client:

void* handleClientRequestThreadFunction(void* threadArgumentPointer)
{
	threadArgumentStruct* threadArgument = (threadArgumentStruct*)(threadArgumentPointer);
	
	const int transferSocketFile = threadArgument->_transferSocketFile;

	free(threadArgument);
...

The rest of the function is exactly the same data transfer code that was in the old server’s main loop. The main subtlety here is that we need to pass in the file descriptor for the connection, but thread functions always take a void pointer as input, not an integer as we need. Instead, we create a structure that holds all the data we want to pass in, since in the future we’ll need to pass more arguments. This structure can’t live on the stack as a local variable in the calling function, since the main thread may have moved out of that function by the time we get here. Instead, we create an area of memory in the heap using malloc to hold the data, and pass a pointer to this into the function, relying on the client thread to free it.

You can get away with simpler techniques for basic data types, such as casting an int directly to a void pointer and back, but these will throw up warnings when the size of those types doesn’t match on a 64 bit machine, so this is generally a cleaner approach at the cost of some extra heap allocations.
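If you do want the simpler technique for a single integer, routing the cast through intptr_t is one way to keep it quiet on 64 bit machines. This is only a sketch with invented function names, and it relies on intptr_t being available, which it is on the mainstream Unix compilers even though the C standard makes it optional.

```c
#include <stdint.h>

/* Packs a small integer into a pointer-sized value for a thread argument.
   Going through intptr_t avoids the size-mismatch warning you get from
   casting int straight to void*. (Hypothetical helper.) */
void* packSocketFile(int socketFile)
{
	return (void*)(intptr_t)socketFile;
}

/* Recovers the integer inside the thread function */
int unpackSocketFile(void* threadArgumentPointer)
{
	return (int)(intptr_t)threadArgumentPointer;
}
```

This only works while there is exactly one integer to pass, so the heap-allocated structure is still the better fit once more arguments are needed.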

The main function now contains a loop that spawns a thread for every client connection, and then goes back to listening:

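	do
	{
		int transferSocketFile = accept(listenSocketFile, 
			(struct sockaddr *) &transferAddress, 
			&sizeofTransferAddress);

		if (transferSocketFile<0)
			break;

		// It's the thread's responsibility to free this memory
		threadArgumentStruct* threadArgument = malloc(sizeof(threadArgumentStruct));
		if (threadArgument==NULL)
		{
			// Out of memory, so drop this client and keep listening
			close(transferSocketFile);
			continue;
		}
		threadArgument->_transferSocketFile = transferSocketFile;
		void* threadArgumentPointer = (void*)(threadArgument);

		pthread_t clientRequestThread;
		const int createResult = pthread_create(&clientRequestThread, 
			NULL, 
			handleClientRequestThreadFunction, 
			threadArgumentPointer);

		if (createResult!=0)
		{
			// The thread never started, so clean up on its behalf
			free(threadArgument);
			close(transferSocketFile);
			continue;
		}

		// We never join the worker, so detach it to let its resources be reclaimed when it finishes
		pthread_detach(clientRequestThread);

	} while (1);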

This loop does very little work other than setting up the data structure for the function arguments and starting off the thread. pthread_create() is the key function here, it’s kicking off the execution of the client request handling function. One key thing to understand is that once that’s called, you really don’t know when the function is getting executed, or when it’s done. It may even be happening at the same time as the main loop if you’re on a multi-processor machine.

This means we need to be very careful about making sure that different threads don’t step on each other’s toes by altering data that another thread might also be working with. I’ll cover how to handle that sort of synchronization next.
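As a small taste of what that synchronization involves, here is a sketch of the most basic tool, a mutex guarding one shared value. The request counter and the function name are invented for the example and aren’t part of the server.

```c
#include <pthread.h>

/* Shared state that every worker thread might touch (hypothetical) */
static pthread_mutex_t requestCountLock = PTHREAD_MUTEX_INITIALIZER;
static long requestCount = 0;

/* Safely bumps the shared counter and returns the new total. Only one
   thread at a time can be between the lock and unlock calls. */
long recordClientRequest(void)
{
	long updatedCount;

	pthread_mutex_lock(&requestCountLock);
	requestCount++;
	updatedCount = requestCount;
	pthread_mutex_unlock(&requestCountLock);

	return updatedCount;
}
```

Without the lock, two threads could both read the old value and both write back the same new one, losing a count. That’s exactly the kind of timing-dependent bug that rarely shows up in testing.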

All the articles in the simple server series

How to write a socket server II

January 14, 2008 By Pete Warden in Coding, Simple server Leave a comment


In the previous article I presented a simple socket-based server and client. The beauty of sockets is that they’re comparatively simple, especially for the client. A program wanting to connect to a server does the equivalent of picking up a phone and dialing a number, and then transfers data back and forth across that open connection.

The server’s job is a little harder, since it has to wait for a call from a client, and potentially take connections from many clients at once. This example doesn’t demonstrate handling multiple connections, but it does show how to take a call. I’ll go over the interesting points inline in the server source code, but the part I always found hardest to wrap my head around is that the server actually has two different types of sockets. The first, which I’ve called the listener below, is the one that gets notified when a client tries to contact the server. This is the one which has the port number assigned to it. No data is transferred over this socket though. Instead, you have to pull another socket from it for every client connection. This, which I call the transfer socket, is the one that the conversation with the client actually happens on, and is the one which handles passing data back and forth. You get a new one of these for every client connection on the server side, though to the client it just appears that they’re talking to the single socket they opened.

The actual data transfer is handled using file-like calls to read and write. There are plenty of more complex ways to transfer data over sockets, but for most purposes this sort of streaming TCP/IP connection is both simplest and sufficient.

If you want the raw source code, you can download the example here.

/* Pete Warden - A simple example demonstrating a socket-based server */
/* Adapted from code at http://www.cs.rpi.edu/courses/sysprog/sockets/sock.html */
/* See http://petewarden.typepad.com/ for more details */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h> 
#include <sys/socket.h>
#include <netinet/in.h>

void error(char *msg)
{
	fprintf(stderr,"%s\n",msg);
	exit(1);
}

int main(int argc, char *argv[])
{
	if (argc < 2)
		error("ERROR, no port provided\n");

This port number allows multiple server programs to listen for different clients on the same machine. Each service has to pick a number that’s like an internal phone extension in a company. Unfortunately there’s no good way to figure out if another service has already picked that extension other than trying to use it. Most well-known services like http and smtp have numbers below 2000, so the usual advice is to pick a random-looking number between that and 9999, and provide a user-configuration option to change that if the administrator of the machine you’re using discovers a conflict. The IANA maintains a list of registered ports here, but there’s no guarantee it’s complete.
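Since the port comes from the user, it’s worth validating before handing it to htons(). The listing uses atoi(), which quietly returns 0 for text that isn’t a number, so here is a slightly stricter sketch using strtol(). The name parsePortNumber is made up for the example.

```c
#include <errno.h>
#include <stdlib.h>

/* Converts a command-line argument into a port number. Returns the port
   (1 to 65535), or -1 if the text is not a whole number in that range.
   (Hypothetical helper.) */
int parsePortNumber(const char* text)
{
	char* end = NULL;
	long value;

	errno = 0;
	value = strtol(text, &end, 10);

	/* Reject overflow, empty input and trailing junk such as "80x" */
	if (errno != 0 || end == text || *end != '\0')
		return -1;

	if (value < 1 || value > 65535)
		return -1;

	return (int)value;
}
```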

	const int listenPortNumber = atoi(argv[1]);

	const int listenSocketFile = socket(AF_INET, SOCK_STREAM, 0);

	if (listenSocketFile < 0) 
		error("ERROR opening server socket");

	struct sockaddr_in listenAddress;
	bzero((char *) &listenAddress, sizeof(listenAddress));

	listenAddress.sin_family = AF_INET;
	listenAddress.sin_addr.s_addr = INADDR_ANY;
	listenAddress.sin_port = htons(listenPortNumber);

	const int bindResult = bind(listenSocketFile, 
		(struct sockaddr *) &listenAddress,
		sizeof(listenAddress));

	if (bindResult<0) 
		error("ERROR on binding");

	listen(listenSocketFile,5);

The code above sets up the socket that’s going to sit on the port and listen for any incoming client connections. The sin_family = AF_INET tells the system we want a network socket, rather than one that’s only available to processes on the same machine. We don’t specify an address since it will be listening on the current machine rather than trying to connect remotely, and we pass in the user-specified port number. The bind() passes the request to the system and tries to set up the socket, which may fail if the port is already in use. Finally the listen() activates the socket so it can start accepting requests for client connections. The second argument is the number of connections to queue up for the server to deal with before new connections are rejected. Older systems capped this at 5, which is why it’s the usual setting in examples, though many current systems allow much longer queues.
A consequence of this is that you have to deal with each client very quickly or you’ll start rejecting connections if you have a lot trying to connect at once. I’ll be covering strategies to handle this later.

	struct sockaddr_in transferAddress;
	socklen_t sizeofTransferAddress = sizeof(transferAddress);

	const int transferSocketFile = accept(listenSocketFile, 
		(struct sockaddr *) &transferAddress, 
		&sizeofTransferAddress);

This call to accept() will not return until a client has tried to connect to the server. It’s possible to create non-blocking sockets that let accept() return immediately with an error if there are no clients waiting, but blocking is often a lot simpler.
The call returns a new socket file descriptor, which I’ve called the transfer socket. This is the actual connection to the client. You can think of the listener socket like an old-style switchboard operator, whose job it is to set up a direct call between the client and server, returning a socket that’s a private connection between the two.
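For reference, the non-blocking alternative mentioned above could be sketched like this. The function names and the -1 / -2 return convention are invented for the illustration; the underlying fcntl() and accept() calls are the standard ones.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/socket.h>

/* Switches a socket into non-blocking mode. Returns 0 on success or -1
   on failure. (Hypothetical helper.) */
int makeNonBlocking(int socketFile)
{
	const int flags = fcntl(socketFile, F_GETFL, 0);
	if (flags < 0)
		return -1;
	return (fcntl(socketFile, F_SETFL, flags | O_NONBLOCK) < 0) ? -1 : 0;
}

/* Tries to accept a client on a non-blocking listener. Returns the new
   transfer socket, -1 if no client is waiting right now, or -2 on a
   genuine error. (Hypothetical helper and return convention.) */
int tryAccept(int listenSocketFile)
{
	const int transferSocketFile = accept(listenSocketFile, NULL, NULL);
	if (transferSocketFile >= 0)
		return transferSocketFile;

	/* These two error codes just mean "nothing to do yet" */
	if (errno == EAGAIN || errno == EWOULDBLOCK)
		return -1;

	return -2;
}
```

The extra return code is the price of not blocking: the caller now has to decide what to do in the meantime and when to try again.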

The server can then treat this transfer socket as a file, though there are more advanced options for accessing the data that I won’t cover.

	if (transferSocketFile<0) 
		error("ERROR on accept");

	const int bufferLength = 256;
	char buffer[bufferLength];

	const int bytesRead = read(transferSocketFile,buffer,(bufferLength-1));
	if (bytesRead<0) 
		error("ERROR reading from socket");

	// Pete- On some systems the buffer past the read bytes may be altered, so make sure the string is zero terminated
	buffer[bytesRead] = '\0';

	printf("Here is the message: %s\n",buffer);

Getting data from the client is simply a matter of doing a read() call like you would do with a local file. I’m not demonstrating how to handle anything beyond a short input message here, but I will be covering that in a future post. One wrinkle I discovered that some examples don’t handle is that Linux (though not OS X) doesn’t define what will be in the buffer past the bytes that were read, even if you cleared the whole buffer to zero before the call, so I had to manually add a terminator.

	const char* outputMessage = "I got your message";
	const int outputMessageLength = strlen(outputMessage);
	const int bytesWritten = write(transferSocketFile, outputMessage, outputMessageLength);

	if (bytesWritten<0) 
		error("ERROR writing to socket");

Just like reading, you call write() to send data back to the client.
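One detail the short example can ignore is that write() is allowed to send fewer bytes than you asked for, which starts to matter with larger replies. A hedged sketch of the usual loop follows; writeAll is an invented name, not a function from the example source.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Keeps calling write() until every byte has been sent. Returns 0 on
   success or -1 if a write failed. (Hypothetical helper.) */
int writeAll(int socketFile, const char* data, size_t dataLength)
{
	size_t totalWritten = 0;

	while (totalWritten < dataLength)
	{
		const ssize_t bytesWritten = write(socketFile,
			data + totalWritten, dataLength - totalWritten);

		if (bytesWritten < 0)
		{
			/* A signal interrupted the call before any data was
			   sent, so just try again */
			if (errno == EINTR)
				continue;
			return -1;
		}
		totalWritten += (size_t)bytesWritten;
	}
	return 0;
}
```

In the listing above, the call would become writeAll(transferSocketFile, outputMessage, outputMessageLength), with a negative result treated as the error case.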

	close(transferSocketFile);

        close(listenSocketFile);

Make sure you close both socket files before you exit. Some OS’s will close them automatically, but Red Hat Linux at least will leave them open as zombies even after the process has exited.

	return 0; 
}

All the articles in the simple server series

How to write a socket server I

January 13, 2008 By Pete Warden in Coding, Simple server Leave a comment


One of the components I need for automatic tagging is an easy way to get the average frequencies of a set of words, so I can spot the unusual ones. The normal way of doing this would be to pull the data from a mysql database with a query, but investigating this route revealed a lot of performance issues. With the volume and complexity of the analysis I’m doing, a custom-coded database seems the way to go.

A simple way to do that would be a command-line program that handled everything. Unfortunately the raw word-frequency data files can be very large, and reloading them for every query kills performance too.

What I need is something that behaves like the mysql server, parsing the data-files once at startup and then listening for requests. I did consider putting this into an Apache mod, but my needs are simple enough that writing a standalone socket server seemed preferable. It also avoids the overhead of routing everything through HTTP.

Since this seems like a fairly common dilemma, and there’s not much information on how to solve it, I’ll show you how to write a simple generic server over the next few articles. I’m using C, not because I’m stuck in the 80’s, but because the high-performance data structures I’m using are in C++.

As a start, here’s a tar file containing the source for a very simple socket-based server, which is based on this great socket page from the Rensselaer Polytechnic Institute. To use it you’ll need a unix-y OS (it’s been tested on Red Hat Fedora and OS X). Un-tar the folders and open two terminals. In one, go into wcserver and run make. Do the same in the second terminal for wcclient. That should compile the two executables you need. Then run ./wcserver 9998 from the server terminal and ./wcclient localhost 9998 from the client.

The server sits listening on the port you specify (9998 in this case) until a client tries to connect. The client asks you to type in a message and passes it to the server. The server prints it out on its own terminal, sends an acknowledgement to the client and exits.

In the next post I’ll go over the source code in depth, since there are a lot of subtleties in the details. After that I’ll show you how to handle multiple connections and do something useful for the clients.

All the articles in the simple server series

40 mile bike ride through the mountains

January 12, 2008 By Pete Warden in Personal Leave a comment

I needed to check out the Guadalasca trail before I lead a trail maintenance crew on it next Saturday (contact me if you’re in the LA area and want to play in the dirt). It’s a joint project with the CORBA biking group and their trail crew boss Hans Keifer invited me along on a group ride they were doing today. Hans organized it through the Over-the-Bars club, and those guys are mountain biking machines! I’m glad to say I made it through the countless hill-climbs and drops to the end, but it left me shattered. I like to think I’m fit, road-biking 4 days a week, but these are some tough hombres. Half of them were 20 years older than me too, which both gives me hope and makes me feel worse.

I’ll be spending the rest of tonight recovering with a martini and some Famous Dave’s…


Pete Warden's blog

Ever tried. Ever failed. No matter. Try Again. Fail again. Fail better.

Monthly Archives: January 2008
