This page proposes a new feature and reviews design options
This page describes a feature that's in development
This page documents an implemented feature
 Why we want it
- To run "big" jobs in the background, without making users wait. Good candidates:
- Data import
- Search index update
- Digital object processing and normalization (e.g. thumbnailing, converting video/audio to FLV)
- OAI-PMH Harvester
- These jobs don't need to be done immediately; they can happen "eventually".
- Scalability - the job server can be completely separate from the web server and new job servers (workers) can be added as needed
 Sample architecture diagram
Original Libre-Office Draw file (editable): File:Qubit distributed services architecture.odg
 PHP daemon
Write a custom PHP CLI "qubitd" daemon that will run continuously in the background and fork a new PHP process when jobs are waiting.
 cron and wget
This is the solution currently implemented by Drupal. Their documentation (http://drupal.org/cron) describes how to add a command like the following as a cron job:
$ wget http://.../drupal/cron.php
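As a minimal sketch of what such a cron job could look like, the helper below prints a crontab line that requests a cron URL once an hour. The hourly schedule and the example.org address are assumptions for illustration, not taken from the Drupal documentation; the function only prints the line, and installing it (for example with `crontab -e`) is left to the administrator.

```shell
#!/bin/sh
# Print a crontab line that requests the given cron URL once an hour.
# The URL is a placeholder; substitute the real address of cron.php.
cron_line() {
  url=$1
  # Crontab fields: minute hour day-of-month month day-of-week command.
  # wget -q is quiet, -O - writes the response to stdout, and the
  # redirection discards it so cron does not send mail on every run.
  printf '0 * * * * wget -q -O - %s >/dev/null 2>&1\n' "$url"
}

cron_line "http://example.org/drupal/cron.php"
```

The web server still does all the work here; cron only supplies the clock that PHP lacks between requests.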
 Proposed cron and wget solution
This was proposed by Jack and Mathieu (?) for an OAI-PMH job scheduler with web interface.
As we need to move forward with this, here's what is proposed:
- scheduling on a daily/monthly/yearly basis
- jobs to occur once or on repetition
- jobs will be represented as paths to launch, so the jobs will have to be actions with parameters
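The proposal above could be sketched roughly as follows, assuming each job is stored as a frequency keyword plus a path to request. The function names, the midnight start time, and the `/oai/harvest` path are hypothetical and only illustrate how a scheduled job might translate into a crontab line. One-off jobs are not covered, since cron itself only handles repetition.

```shell
#!/bin/sh
# Map a frequency keyword to the five crontab time fields.
# Midnight on the first day of the period is an arbitrary choice.
schedule_fields() {
  case "$1" in
    daily)   echo "0 0 * * *" ;;
    monthly) echo "0 0 1 * *" ;;
    yearly)  echo "0 0 1 1 *" ;;
    *)       echo "unknown frequency: $1" >&2; return 1 ;;
  esac
}

# Build a crontab line that requests a job path (action plus parameters)
# on the given base URL. The URL is quoted so that characters such as
# "?" and "&" are not interpreted by the shell that cron starts.
job_cron_line() {
  freq=$1; base=$2; path=$3
  fields=$(schedule_fields "$freq") || return 1
  printf "%s wget -q -O /dev/null '%s%s'\n" "$fields" "$base" "$path"
}

# Hypothetical job: harvest a repository once a day.
job_cron_line daily "http://example.org" "/oai/harvest?repository=1"
```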
To illustrate the initial options, here's the first sketch of the proposed new job screen
 Java: Quartz Job Scheduler
Here's a scheduling framework for Java: http://www.opensymphony.com/quartz/
 Beanstalkd
Beanstalk is a simple, fast work queue. Its interface is generic, but was originally designed for reducing the latency of page views in high-volume web applications by running time-consuming tasks asynchronously.
Written in C, with client libraries for most major languages, including several for PHP. Runs as a daemon. Clients add and fetch jobs over a TCP socket.
 Gearman
Gearman provides a generic application framework to farm out work to other machines or processes that are better suited to do the work. It allows you to do work in parallel, to load balance processing, and to call functions between languages. It can be used in a variety of applications, from high-availability web sites to the transport of database replication events.
 LAC solution: shell scripts
At Library and Archives Canada (LAC) we simply use shell scripts that are executed by a central scheduling application, which also controls server usage.
We basically launch a PHP instance from the command line, with a PHP file written as a script:
$ php -q /where/ever/is/your/app.php
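When a scheduler launches scripts like this, one practical concern is a new run starting before the previous one has finished. The wrapper below is a minimal sketch of one way to guard against that with a lock directory. It is an assumption added for illustration, not a description of the LAC setup, and the function name, lock path, and exit code 75 are arbitrary choices.

```shell
#!/bin/sh
# Run a command only if no other run currently holds the lock.
# mkdir is atomic, so only one caller can create the lock directory.
run_exclusive() {
  lockdir=$1; shift
  if ! mkdir "$lockdir" 2>/dev/null; then
    echo "already running: $lockdir" >&2
    return 75
  fi
  "$@"
  status=$?
  # Release the lock and pass the command's exit status to the caller.
  rmdir "$lockdir"
  return $status
}

# Example (not executed here):
#   run_exclusive /tmp/app.lock php -q /where/ever/is/your/app.php
```

If the script is killed before it reaches `rmdir`, the lock directory stays behind and has to be removed by hand; a production wrapper would need to handle that case.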
 DJJob
DJJob allows PHP web applications to process long-running tasks asynchronously. It is a PHP port of delayed_job (developed at Shopify), which has been used in production at SeatGeek since April 2010.
The creators are using symfony tasks (!?) to run workers, and http://god.rubyforge.org/ to monitor and restart workers when necessary.
 David and MJ via email
On 2011-10-25 09:53 AM (PST) User:David wrote:
On 11-10-25 05:43 AM (PST), MJ Suhonos wrote:
> Hi all,
>
> Allow me to take a step back for a moment here. I'm not fully privy to the Archivematica requirements that point to using a queue server, but based on our discussion last week, our Qubit requirements are:
>
> 1. Process long-running jobs asynchronously from web requests
> 2. Minimize deployment requirements beyond LAMP
>
> *Requiring* a queue server for Qubit breaks (2), and is not strictly required to satisfy (1). Beanstalkd, gearman, rabbitmq, are designed in large part for quickly processing a high volume of parallel tasks, but Qubit jobs don't need to run fast, they just need to be non-blocking.
No matter what we do here, it's going to require extra work for deployment - whether it's setting up a crontab, firing up a custom php daemon or installing a queue server. I think any sort of background processing should be optional and the default setup of Qubit will do what it's doing now - run tasks serially and make the user wait for completion. I think this is a fine solution for small archives that have < 10,000 records (well except for page load times, which is another problem entirely).
> For reference: http://highscalability.com/blog/2010/5/17/7-lessons-learned-while-building-reddit-to-270-million-page.html
> (In particular, #2 and #7)
I'm not sure how this is supporting your argument - Reddit is using RabbitMQ (#7) as their queue manager. Is the argument that we are not Reddit, so we don't need to use a queue manager?
> So, what do we do when a queue server is not an option due to deployment/configuration limitations? We are already beginning to address this by separating Qubit into read- and write- components (roughly speaking), so all we need is a simple persistent queue (eg. a MySQL table or even a binlog) and a way to fork background processes that are self-limiting. ignore_user_abort() and php-fastcgi even make this possible from Apache without blocking web requests.
I'm open to this, but I see a few problems:
1) We have to build our OWN queue server using MySQL and PHP, rather than using something "off the shelf"
2) AFAIK, we still have to either add a crontab or run a php daemon to run a php script in a non-blocking way. I don't think it's any easier to ask users to set up a cronjob or maintain a php daemon (how do you restart it if it dies?) than to install beanstalkd. I've seen a few code snippets on the PHP site for calling a bash script (which could be a php cli script) then sending the output to /dev/null or a file to make it non blocking, but see point #3 below.
3) My other concern is querying a running job for status - both to provide the user with updates and so we know when a job dies for some reason and needs to be restarted. I haven't seen any way to spawn a new php process from within a script, then be able to check the status of the background process.
> TL;DR: I am pro-beanstalkd/gearman, but I would like to solve the simple case first. Less is more.
TL;DR: I've tried to think of a way to do job scheduling without using a queue manager, but I think it actually *adds* complexity instead of reducing it.
David
=> http://php.net/manual/en/function.system.php
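To make points 2 and 3 of this exchange concrete, here is a minimal shell sketch of the "send the output to /dev/null" technique combined with a PID file, so that a later request can at least ask whether the background process is still alive. The function names and the PID-file approach are assumptions for illustration, not something proposed in the emails.

```shell
#!/bin/sh
# Start a command in the background with its output discarded, so the
# caller is not blocked, and record the PID for later status checks.
start_job() {
  pidfile=$1; shift
  nohup "$@" >/dev/null 2>&1 &
  echo $! > "$pidfile"
}

# Report whether the process recorded in the PID file is still alive.
# kill -0 sends no signal; it only tests that the process exists.
job_status() {
  pid=$(cat "$1" 2>/dev/null) || { echo "unknown"; return 2; }
  if kill -0 "$pid" 2>/dev/null; then
    echo "running"
  else
    echo "finished"
    return 1
  fi
}

# Example (not executed here):
#   start_job /tmp/import.pid php -q /where/ever/is/your/app.php
#   job_status /tmp/import.pid
```

This only answers "is it still running?". It cannot tell a successful job from a crashed one, and a recycled PID could give a false "running", which is part of the complexity the email is concerned about.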
 Jack and Mathieu via chatroom
mathieufortinlac: I was wondering if you had any timed jobs in the application
mathieufortinlac: usually, a oai harvester will go periodically to other servers to check for updates
Jack Bates: no... we don't have anything like that yet
mathieufortinlac: now, here we have a main schedule facility for that sort of thing...
mathieufortinlac: I know Java has a framework for it
Jack Bates: but i can think of many places where it'd be useful...
mathieufortinlac: but I am still looking in PHP... ever heard of a scheduler framework?
Jack Bates: ok - the challenge is that PHP is only running when a request is being served...
mathieufortinlac: It is useful, but it's not sure we will be able to use cron-like jobs on all installations of the app... So I need to find a way to make it flexible..
Jack Bates: there are basically two "server apis" - ways of running php
Jack Bates: php-cgi and mod_php
mathieufortinlac: Yeah, I believe that the Tomcat server is loading a timer module that looks for config files...
Jack Bates: right - you can do it with java because the container (e.g. tomcat) is always running
Jack Bates: but php quits after each request is finished
mathieufortinlac: ah! apache only calls php to execute...
Jack Bates: so there's no guaruntee it's running at 12:00 when the event you want needs to be triggered
mathieufortinlac: I did not realize that... I thought it stayed dormant...
Jack Bates: mathieufortinlac: zactly
mathieufortinlac: yeah.
- 11:11 -
Jack Bates: so, one way the Drupal PHP app gets around it is putting "wget http://.../timer.php" in a ron job
Jack Bates: s/ron/cron
mathieufortinlac: Yeah, here we have it all on scripting language to start jobs
mathieufortinlac: yeah, that's what I do here as well, but that means that you'll have to maintain a second string of scripts for IIs no^
mathieufortinlac: ?
Jack Bates: how is iis different?
Jack Bates: what does it not support?
Jack Bates: (why do we need second scripts?)
mathieufortinlac: shoot, ok, no sorry, I am thinking to close to my implementation, we use shell scripting...
mathieufortinlac: But that does not complicate the installation, you don't need permissions to schedule on servers?
Jack Bates: mathieufortinlac: i think it does complicate a bit...
Jack Bates: you need to be able to add a cron job
Jack Bates: but that's the only option i see
Jack Bates: (and the one chosen by the largest and most widely deployed php app)
 Gearman deployment
- These instructions are designed for Ubuntu Linux, but the steps should be similar in other environments.
- A Gearman library written in pure PHP is also available from the PEAR repository; it has fewer dependencies.
- Take a look at https://github.com/brianlmoon/GearmanManager; it can work with both libraries (the Gearman PHP extension or the PHP library).
- I am using sfGearmanPlugin right now; we can extend this plugin to use GearmanManager internally.
The first step is to install the Gearman job server (gearmand). This service can be installed on a different machine, as long as its socket listener (by default, TCP port 4730) is reachable from the Qubit server.
sudo apt-get install gearman-job-server
Gearman also provides two APIs to interact with the job server from our application: the client API, to create new jobs, and the worker API, to process them in a queue. Both APIs are bundled together in a library available for different programming languages. In our case, we'll use the Gearman PHP extension, available through PECL, a PHP extension repository.
sudo apt-get install php-pear
The package above contains both the pear and pecl client tools. The main difference between them is that PEAR delivers libraries and code written in PHP, whereas PECL delivers PHP extensions written in C. Before using pecl, we need to install php5-dev and some other packages so that our system can compile the extension properly.
sudo apt-get install php5-dev libgearman-dev libgearman-server-dev
Now, let's download and build the Gearman PHP extension:
sudo pecl install gearman-beta
Did you get an error related to libuuid during the compilation? Take a look at this post: http://mgribov.blogspot.com/2010/05/gearman-pecl-package-on-ubuntu-lucid.html
The extension should be installed and we can enable it now in our PHP configuration:
echo "extension=gearman.so" | sudo tee /etc/php5/conf.d/gearman.ini
sudo apache2ctl restart
Let's check that the extension was installed correctly:
$ php -i | grep -i gearman
/etc/php5/cli/conf.d/gearman.ini,
gearman
gearman support => enabled
libgearman version => 0.10
- Rasmus Lerdorf blog on implementing Gearman with PHP
- Todd Hoff (High Scalability) on Reddit's scalable web design
- Blog on Implementing beanstalkd in Ruby-on-Rails by Adam Wiggins
- Comparison of messaging queues by Graham King
- Second Life Message Queue Evaluation wiki page
- Ubuntu Upstart daemon start/stop and monitoring + automatic restart
- PHP code to behave as a daemon: http://simas.posterous.com/writing-a-php-daemon-application