C Hashmap

Knapping
Photo by crazybarefootpoet

I still remember my excitement when I discovered Google after years of struggling with awful search engines like Altavista, but every now and again it really doesn't find what I'm looking for.

I was just going to bed on Tuesday night when I remembered I had to start a job processing 500 GB of data, or it would never be done in time for a deadline. This process (merging adjacent lines of data into a single record) was painfully slow in the scripting languages I tried, so I'd written a tool in plain C to handle it. Unfortunately I'd never tried it on this size of data, and I quickly discovered an O(n^2) performance bug that ground progress to a halt. To fix it, I needed a hashmap of strings to speed up a lookup, so I googled 'c hashmap' to grab an implementation. I was surprised at the sparseness of the results, the top hit appeared to be a learning project by Eliot Back.

Before I go any further, if you need a good plain C hashmap that's been battle-tested and generally rocks, use libjudy. Don't do what I did, trying to build your own is a silly use of anyone's time! My only excuse is that I thought it would be quicker to grab something simpler than libjudy, and I'd had a martini…

I stayed up until 2am trying to get the hash map integrated, discovering some nasty performance bugs in the implementation as I did. For instance, the original code actually tried to completely fill the hash map before it reallocated, which means for a large map it often searches linearly through most of the entries if the key isn't present, since it only stopped when it found a gap. I also removed the thread primitives, and converted it over to use strings as keys, with a CRC32 hashing function.

I don't make any claims for the strength of the resulting code, but at least this version has a unit test and I've used it in anger. Thanks to Eliot for the original, here's my updated source:

http://github.com/petewarden/c_hashmap

Hopefully this will help out any other late-night coders like me searching for 'C hashmap'!

2 responses

  1. Since this is still a top result for a c hashmap search, I just thought I say that it is just not good.

    It kind of works, but disregarding the concept of buckets in favor of probing for the next index kind of sucks. It creates major inefficiencies all over the place. There’s 2 very notable issues though:

    1) the probing is extremely inefficient and results in map full when it shouldn’t.

    2) it is conceivable that a map with lots of elements could get stuck in a resize loop till it ran out of memory as a result of this implementation bucket concept eating up indexes.

    Don’t use it. Rather, don’t use the probing method.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: