Scaling PHP with Gearman

There are times when you are asked to solve a really big problem: parsing a multi-gigabyte XML file and loading it into your database, say, or building a system that can process hundreds of individual spider results every day.

These types of scenarios are not generally seen as something PHP is well-suited to handle, partly because each PHP process runs in a single thread. Single-threaded processes, the argument goes, can't scale because everything must be processed synchronously.

Not true.

One of the best ways to really scale up parallel processing is through the use of message queues. In this model you generally have a single PHP script looping over a large problem. As it picks off a chunk that needs to be processed—rather than processing it inline—it sends that work out over a message queue and continues on to the next chunk. This allows the main script to farm out the real work to one or (typically) more "workers" without waiting for that work to actually be completed.
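As a sketch of the Delegator side, here is a minimal client using the pecl/gearman extension. The file name, server address, and "process_record" job name are illustrative assumptions, not part of any real system:

```php
<?php
// Hypothetical Delegator: walk a huge XML file and hand each chunk
// to the queue instead of processing it inline.
// Assumes the pecl/gearman extension and a gearmand server on localhost.
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);

$reader = new XMLReader();
$reader->open('huge-feed.xml'); // stand-in for your multi-gigabyte file

while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'record') {
        // Fire-and-forget: queue the chunk and move on immediately.
        $client->doBackground('process_record', $reader->readOuterXml());
    }
}
```

Because `doBackground()` returns as soon as the job is queued, the loop never blocks on the actual processing.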

Gearman Queues

The message queue in my case is Gearman, but there are many others out there, such as RabbitMQ, ZeroMQ, and, with some work, Redis.

The message queue acts as a sort of load balancer in front of your worker scripts, allowing the main Delegator to just throw work that needs to be done into the queues. Gearman itself can be configured to use MySQL for message persistence, just in case you have more work than Workers. If you see your queues backing up with unprocessed work, it's easy enough to fire up some more Worker processes to get through the backlog quickly.
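For illustration, enabling MySQL persistence is done with flags when starting gearmand; the exact option names vary by gearmand build and version, and the credentials below are placeholders:

```sh
# Persist queued jobs in MySQL so they survive a gearmand restart.
# (Option names as found in gearmand 1.x builds with MySQL support.)
gearmand --queue-type=MySQL \
         --mysql-host=localhost --mysql-user=gearman \
         --mysql-password=secret --mysql-db=gearman &

# Queues backing up? Fire up a few more copies of the worker script.
php worker.php & php worker.php & php worker.php &
```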

As the saying goes, "Many hands make light work." The same applies to worker scripts: the more of them listening on the queue, the more work you can accomplish at once.
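The worker side is a small script you can run as many copies of as you need. This sketch assumes the same hypothetical "process_record" job name as the Delegator; the database insert is left as a comment because it is application-specific:

```php
<?php
// Hypothetical Worker: register for the "process_record" job and block
// waiting for work. Run several copies of this script to add more hands.
// Assumes the pecl/gearman extension and a gearmand server on localhost.
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);

$worker->addFunction('process_record', function (GearmanJob $job) {
    $record = simplexml_load_string($job->workload());
    // ... insert $record into the database here ...
});

while ($worker->work()) {
    // Loop forever, picking up one queued job at a time.
}
```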

At this point the main bottleneck is the Delegator itself. Once you have scaled up your workers, the real challenge is finding ways to hand them work as fast as you can.