I've been helping a friend who has a startup that relies on processing large amounts of data. He's using Hadoop for the calculation portions of his pipeline, but has a home-brewed system of queues and servers for handling other parts like web crawling and calls to external API providers. I've been advising him to switch almost all of his pipeline to run as streaming jobs within Hadoop, but since there's not much out there on using Hadoop for that sort of problem, it's worth covering why I've found it makes sense and what you have to watch out for.
If you have a traditional "Run through a list of items and transform them" job, you can write that as a streaming job with a map step that calls the API or does another high-latency operation, and then use a pass-through reduce stage.
The key advantage is management. There's a rich ecosystem of tools like ZooKeeper, MRJob, ClusterChef and Cascading that let you define, run and debug complex Hadoop jobs. It's actually comparatively easy to build your own custom system to execute data-processing operations, but in reality you'll spend most of your engineering time maintaining and debugging your pipeline. Having tools available to make that side of it more efficient lets you build new features much faster, and spend much more time on the product and business logic instead of the plumbing. It will also help as you hire new engineers, as they may well be familiar with Hadoop already.
The stumbling block for many people when they think about running a web crawler or external API access as a MapReduce job is the picture of an army of servers hitting the external world far too frequently. In practice, you can mostly avoid this by using a single-machine cluster tuned to run a single job at a time, which serializes access to the resource you're concerned about. If you need finer control, a pattern I've often seen is a gatekeeper server that all access to a particular API or site has to go through. The MapReduce scripts then call that server instead of going directly to the third party's endpoint, so the gatekeeper can throttle the request rate to stay within limits, back off when it sees 5xx errors, and so on.
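The core of such a gatekeeper is just a serialized, rate-limited call path. Here's a minimal sketch of that idea; in a real deployment you'd wrap something like this in a small HTTP service that the MapReduce scripts talk to, and add the 5xx backoff logic, but the class below only shows the throttling part. The `clock` and `sleep` parameters are there so the timing can be controlled in tests.

```python
import threading
import time

class Gatekeeper:
    """Sketch of a rate-limiting gatekeeper: serializes calls to an
    external resource and enforces a minimum interval between them."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between upstream calls
        self.clock = clock
        self.sleep = sleep
        self.lock = threading.Lock()
        self.last_call = None

    def call(self, fn, *args):
        # The lock serializes access, so concurrent clients queue up
        # rather than hammering the upstream API in parallel.
        with self.lock:
            now = self.clock()
            if self.last_call is not None:
                wait = self.min_interval - (now - self.last_call)
                if wait > 0:
                    self.sleep(wait)
            self.last_call = self.clock()
            return fn(*args)
```

The important design point is that the rate limit lives in one place: you can retune it, or add per-API backoff rules, without touching any of the MapReduce scripts.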
So, if you are building a new data pipeline or trying to refactor an existing one, take a good look at Hadoop. It almost certainly won't be as snug a fit as your custom code; it's like building with Lego bricks instead of hand-carving. But I bet it will be faster and easier to build your product with. I'll be interested to hear from anyone who has other opinions or suggestions too, of course!