How to fetch URLs in parallel using Python

Photo by FunkyMonkey

Here's my class to run web requests in parallel in Python

One of the most common patterns in my work is fetching large numbers of web resources, whether it's crawling sites directly or through REST APIs. Often the majority of the execution time is spent waiting for the HTTP request to make the round trip across the internet and return with the results. The obvious way to speed things up is to change from a synchronous 'call-wait-process' model where each call has to wait for the previous one to finish, to an asynchronous one where multiple calls can be in flight at once.

Unfortunately that's hard to do in most scripting languages, despite being a common idiom in Javascript thanks to Ajax. Threads are too heavy-weight in both resources and in programming complexity since we don't actually need any user-side code to run in parallel, just the wait on the network. In most languages the raw functions you need to build this are available through libcurl, but its multi_curl interface is nightmarishly obscure.

In PHP I ended up writing my own ParallelCurl class that provides a simple interface on top of that multi_curl complexity, letting you specify how many fetches to run in parallel and then just feed it URLs and callback functions. Recently though I've been moving away to using Python for longer-lived offline processing jobs, and since I couldn't find an equivalent to ParallelCurl I ported my PHP code over.

This is the result. You'll need to easy_install pycurl to use it and I'm a Python newbie so I'm sure there's ugliness in the code, but I'm really excited that one of the big barriers to migrating more of my code is now gone. Let me know how you get on with it.

One response

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: