Scaling PHP with Gearman

There are times when you are asked to solve a really big problem. For example, parsing a multi-gigabyte XML file and loading it into your database. Maybe you need to build a system that can process hundreds of individual spider results every day.

These types of scenarios are not generally seen as something PHP is well-suited to handle, partly because each PHP process runs in a single thread. Single-threaded processes, the argument goes, can't scale because everything must be processed synchronously.

While PHP is executed sequentially, that does not mean you are locked into a single thread to get your work done.

Fork It!

One of the best ways to really scale up parallel processes in PHP (and many other languages) is through the use of message queues. In this model you generally have a single PHP script that is looping over a large problem. As it picks off a chunk that needs to be processed—rather than processing it inline—it sends that work out over a message queue and continues on to the next chunk of work. This has several advantages:

  1. Applications can focus on solving a smaller portion of the overall problem.
  2. Code running to solve each part of the problem can run on any (or multiple) servers.
  3. You can add more instances of your processing code to meet demand or mitigate the impact of slower execution times.
  4. Your main routine doesn't have to wait for the secondary processes to finish before continuing with its own work.

Gearman Queues

The message queue in my case is Gearman, but there are many others out there, such as RabbitMQ, ZeroMQ, and, with some work, Redis.

The message queue acts as a sort of load balancer in front of your worker scripts, allowing the main Delegator to simply throw work that needs to be done into the queues. Gearman itself can be configured to use MySQL for message persistence, just in case you have more work than Workers. If you see your queues backing up with unprocessed work, it's easy enough to fire up more Worker processes to get through the backlog quickly.
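A minimal Delegator might look like the sketch below. It assumes the pecl/gearman extension is installed, a gearmand server is listening on localhost:4730, and the workers register a hypothetical job name, `process_chunk`; the `$chunks` array stands in for whatever loop is walking your big problem.

```php
<?php
// Delegator sketch: hand each chunk to the queue instead of processing
// it inline. Requires the pecl/gearman extension and a running gearmand.
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);

// Placeholder for the real problem you are looping over (e.g. pieces
// of a multi-gigabyte XML file or a batch of spider results).
$chunks = ['chunk-1', 'chunk-2', 'chunk-3'];

foreach ($chunks as $chunk) {
    // doBackground() returns immediately, so the Delegator moves on to
    // the next chunk without waiting for a worker to finish this one.
    $client->doBackground('process_chunk', serialize($chunk));

    if ($client->returnCode() !== GEARMAN_SUCCESS) {
        error_log('Failed to queue chunk: ' . $client->error());
    }
}
```

Because `doBackground()` is fire-and-forget, the Delegator's loop runs at the speed it can read chunks, not at the speed the work gets done.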

As the saying goes, "Many hands make light work," and the same applies to worker scripts. The more worker scripts you have listening on the queue, the more work you can accomplish at once.
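A matching worker is just as short. The sketch below makes the same assumptions as above (pecl/gearman extension, gearmand on localhost:4730, and the hypothetical `process_chunk` job name); run as many copies of this script as you need, and each one will pull jobs off the same queue.

```php
<?php
// Worker sketch: register a handler for the job name the Delegator
// submits to, then block waiting for work.
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);

$worker->addFunction('process_chunk', function (GearmanJob $job) {
    $chunk = unserialize($job->workload());
    // ... do the real processing on $chunk here ...
});

// work() blocks until a job arrives and handles one job per call.
while ($worker->work()) {
    if ($worker->returnCode() !== GEARMAN_SUCCESS) {
        break;
    }
}
```

Scaling up is then a deployment question, not a code question: start more copies of this script, on this server or any other that can reach gearmand.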

At this point the main bottleneck you have is the Delegator. Once you have scaled up your workers, the real issue is finding ways to give them work as fast as you can.