Job scheduling

From Qubit Toolkit

Design

This page proposes a new feature and reviews design options

Development

This page describes a feature that's in development

Documentation

This page documents an implemented feature

Why we want it

  • To run "big" jobs in the background, without making users wait. Good candidates:
    • Data import
    • Search index update
    • Digital object processing and normalization (e.g. thumbnailing, converting video/audio to FLV)
    • OAI-PMH Harvester
  • These jobs don't need to run immediately; they can happen "eventually"
  • Scalability - the job server can be completely separate from the web server and new job servers (workers) can be added as needed

Sample architecture diagram

Qubit distributed services architecture.png

Original LibreOffice Draw file (editable): File:Qubit distributed services architecture.odg

Options

PHP daemon

Write a custom PHP CLI "qubitd" daemon that runs continuously in the background and forks a new PHP process when jobs are waiting.
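
The loop such a daemon would run can be sketched as follows. This is only an illustration of the "start a child process per waiting job" idea, written in Python rather than PHP; the in-memory queue, the drain function and the sample commands are all hypothetical. A real daemon would loop forever and read jobs from persistent storage, whereas this sketch returns once the queue is empty.

```python
import subprocess
import sys
import time
from collections import deque


def drain(queue, max_children=2, poll_interval=0.05):
    """Run every queued command in its own child process.

    At most max_children run at once. Returns the exit codes in the
    order in which finished children were collected.
    """
    running = []     # Popen handles for children that are still alive
    exit_codes = []
    while queue or running:
        # Start new children while jobs are waiting and there is capacity.
        while queue and len(running) < max_children:
            running.append(subprocess.Popen(queue.popleft()))
        # Collect children that have exited; keep the rest for the next pass.
        alive = []
        for child in running:
            code = child.poll()
            if code is None:
                alive.append(child)
            else:
                exit_codes.append(code)
        running = alive
        time.sleep(poll_interval)
    return exit_codes


if __name__ == "__main__":
    # Hypothetical jobs; a real daemon would launch PHP CLI scripts instead.
    jobs = deque([
        [sys.executable, "-c", "print('importing data')"],
        [sys.executable, "-c", "print('updating search index')"],
    ])
    print(drain(jobs))
```

Because each child's exit code is collected, the parent can tell a failed job from a successful one, which matters for restarting jobs that die.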

cron and wget

This is the solution currently implemented by Drupal. Their documentation (http://drupal.org/cron) describes how to add a command like the following as a cron job:

$ wget http://.../drupal/cron.php

Proposed cron and wget solution

This was proposed by Jack and Mathieu (?) for an OAI-PMH job scheduler with web interface.

As we need to move forward with this, here's what is proposed:

  • scheduling on a daily/monthly/yearly basis
  • jobs that run either once or repeatedly
  • jobs will be represented as paths to launch, so the jobs will have to be actions with parameters
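
The three points above could be modelled roughly as below. This is a sketch under assumptions, not part of the proposal: the ScheduledJob fields, the function names and the example path are hypothetical, and it is written in Python purely for illustration. It shows how the next run time might be derived from a daily/monthly/yearly interval, with one-off jobs yielding no further run.

```python
import calendar
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ScheduledJob:
    path: str           # action path to launch, parameters included
    interval: str       # "daily", "monthly" or "yearly"
    repeat: bool        # False: run once; True: run repeatedly
    next_run: datetime  # when the job is due next


def add_months(moment, months):
    """Shift by whole months, clamping the day (Jan 31 + 1 month -> end of Feb)."""
    index = moment.year * 12 + (moment.month - 1) + months
    year, month = divmod(index, 12)
    month += 1
    day = min(moment.day, calendar.monthrange(year, month)[1])
    return moment.replace(year=year, month=month, day=day)


def following_run(job):
    """Return the run time after job.next_run, or None for one-off jobs."""
    if not job.repeat:
        return None
    if job.interval == "daily":
        return job.next_run + timedelta(days=1)
    if job.interval == "monthly":
        return add_months(job.next_run, 1)
    if job.interval == "yearly":
        return add_months(job.next_run, 12)
    raise ValueError("unknown interval: %s" % job.interval)


if __name__ == "__main__":
    # Hypothetical action path; the real paths depend on the application.
    job = ScheduledJob("/oai/harvest?repository=1", "monthly", True,
                       datetime(2012, 1, 31, 2, 0))
    print(following_run(job))
```

Clamping the day is one possible policy for months that are too short; skipping the month entirely would be another.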

To illustrate the initial options, here's a first sketch of the proposed new job screen:

Scheduler.png

Java: Quartz Job Scheduler

Quartz is a job scheduling framework for Java: http://www.opensymphony.com/quartz/

Beanstalkd

http://kr.github.com/beanstalkd/

Beanstalk is a simple, fast work queue. Its interface is generic, but was originally designed for reducing the latency of page views in high-volume web applications by running time-consuming tasks asynchronously.

Written in C with client libraries for most major languages, including several for PHP. Runs as a daemon. Clients add and reserve jobs over a plain TCP socket (default port 11300) using a simple text protocol, not HTTP.

Gearman

http://gearman.org/

Gearman provides a generic application framework to farm out work to other machines or processes that are better suited to do the work. It allows you to do work in parallel, to load balance processing, and to call functions between languages. It can be used in a variety of applications, from high-availability web sites to the transport of database replication events.

Written in C with many client/worker APIs including PHP and Python. Gearman is currently being used for Archivematica 0.8-alpha.

LAC solution: shell scripts

At Library and Archives Canada (LAC) we simply use shell scripts that are executed by a central scheduling application, which also controls server usage.

We basically launch a PHP instance from the command line with a PHP file written as a script:

$ php -q /where/ever/is/your/app.php

DJJob

https://github.com/seatgeek/djjob

DJJob allows PHP web applications to process long-running tasks asynchronously. It is a PHP port of delayed_job (developed at Shopify), and it has been used in production at SeatGeek since April 2010.

The creators are using symfony tasks (!?) to run workers, and http://god.rubyforge.org/ to monitor and restart workers when necessary.

Discussion

David and MJ via email

On 2011-10-25 09:53 AM (PST) User:David wrote:

On 11-10-25 05:43 AM (PST), MJ Suhonos wrote:
> Hi all,
>
> Allow me to take a step back for a moment here.  I'm not fully privy to the Archivematica requirements that point to using a queue server, but based on our discussion last week, our Qubit requirements are:
>
> 1. Process long-running jobs asynchronously from web requests
> 2. Minimize deployment requirements beyond LAMP
>
> *Requiring* a queue server for Qubit breaks (2), and is not strictly required to satisfy (1).  Beanstalkd, gearman, rabbitmq, are designed in large part for quickly processing a high volume of parallel tasks, but Qubit jobs don't need to run fast, they just need to be non-blocking.

No matter what we do here, it's going to require extra work for deployment - whether it's setting up a crontab, firing up a custom PHP daemon or installing a queue server.   I think any sort of background processing should be optional and the default setup of Qubit will do what it's doing now - run tasks serially and make the user wait for completion.  I think this is a fine solution for small archives that have < 10,000 records (well, except for page load times, which is another problem entirely).

> For reference: http://highscalability.com/blog/2010/5/17/7-lessons-learned-while-building-reddit-to-270-million-page.html
> (In particular, #2 and #7)

I'm not sure how this is supporting your argument - Reddit is using RabbitMQ (#7) as their queue manager.  Is the argument that we are not Reddit, so we don't need to use a queue manager?

> So, what do we do when a queue server is not an option due to deployment/configuration limitations?  We are already beginning to address this by separating Qubit into read- and write- components (roughly speaking), so all we need is a simple persistent queue (eg. a MySQL table or even a binlog) and a way to fork background processes that are self-limiting.  ignore_user_abort() and php-fastcgi even make this possible from Apache without blocking web requests.

I'm open to this, but I see a few problems:

1) We have to build our OWN queue server using MySQL and PHP, rather than using something "off the shelf"

2) AFAIK, we still have to either add a crontab or run a PHP daemon to run a PHP script in a non-blocking way.  I don't think it's any easier to ask users to set up a cronjob or maintain a PHP daemon (how do you restart it if it dies?) than to install beanstalkd.  I've seen a few code snippets [1] on the PHP site for calling a bash script (which could be a PHP CLI script) then sending the output to /dev/null or a file to make it non-blocking, but see point #3 below.

3) My other concern is querying a running job for status - both to provide the user with updates and so we know when a job dies for some reason and needs to be restarted.  I haven't seen any way to spawn a new php process from within a script, then be able to check the status of the background process.

> TL;DR: I am pro-beanstalkd/gearman, but I would like to solve the simple case first.  Less is more.

TL;DR: I've tried to think of a way to do job scheduling without using a queue manager, but I think it actually *adds* complexity instead of reducing it.

David

[1] => http://php.net/manual/en/function.system.php
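
For reference, the "simple persistent queue" idea from the thread above, together with the status tracking raised in point 3, could look roughly like this. It is a sketch only: Python's bundled sqlite3 module stands in for the MySQL table, and the table layout, status values and function names are assumptions rather than anything agreed in the discussion. A single worker is assumed; several concurrent workers would need row locking (for example SELECT ... FOR UPDATE in MySQL) when claiming a job.

```python
import sqlite3


def open_queue(path=":memory:"):
    """Open the job table, creating it on first use."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS job ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " command TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'waiting')"
    )
    db.commit()
    return db


def enqueue(db, command):
    """Called by the web request: record the job and return its id at once."""
    cursor = db.execute("INSERT INTO job (command) VALUES (?)", (command,))
    db.commit()
    return cursor.lastrowid


def claim_next(db):
    """Mark the oldest waiting job as running; return (id, command) or None."""
    row = db.execute(
        "SELECT id, command FROM job WHERE status = 'waiting' "
        "ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    with db:  # commits the update when the block exits without error
        db.execute("UPDATE job SET status = 'running' WHERE id = ?", (row[0],))
    return row


def finish(db, job_id, ok=True):
    """Record the outcome so the web interface can report it."""
    with db:
        db.execute("UPDATE job SET status = ? WHERE id = ?",
                   ("done" if ok else "failed", job_id))


def status(db, job_id):
    """Return the stored status of a job, or None if the id is unknown."""
    row = db.execute("SELECT status FROM job WHERE id = ?",
                     (job_id,)).fetchone()
    return row[0] if row else None
```

This only records what the worker reports; a job whose worker dies mid-run would stay in the 'running' state, so detecting dead jobs still needs something extra, such as a heartbeat timestamp.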

Jack and Mathieu via chatroom

mathieufortinlac: I was wondering if you had any timed jobs in the application
mathieufortinlac: usually, a oai harvester will go periodically to other servers to check for updates
Jack Bates: no... we don't have anything like that yet
mathieufortinlac: now, here we have a main schedule facility for that sort of thing...
mathieufortinlac: I know Java has a framework for it
Jack Bates: but i can think of many places where it'd be useful...
mathieufortinlac: but I am still looking in PHP... ever heard of a scheduler framework?
Jack Bates: ok - the challenge is that PHP is only running when a request is being served...
mathieufortinlac: It is useful, but it's not sure we will be able to use cron-like jobs on all installations of the app... So I need to find a way to make it flexible..
Jack Bates: there are basically two "server apis" - ways of running php
Jack Bates: :
Jack Bates: php-cgi and mod_php
mathieufortinlac: Yeah, I believe that the Tomcat server is loading a timer module that looks for config files...
Jack Bates: right - you can do it with java because the container (e.g. tomcat) is always running
Jack Bates: but php quits after each request is finished
mathieufortinlac: ah! apache only calls php to execute...
Jack Bates: so there's no guarantee it's running at 12:00 when the event you want needs to be triggered
mathieufortinlac: I did not realize that... I thought it stayed dormant...
Jack Bates: mathieufortinlac: zactly
mathieufortinlac: yeah.

- 11:11 -
Jack Bates: so, one way the Drupal PHP app gets around it is putting "wget http://.../timer.php" in a ron job
Jack Bates: s/ron/cron
mathieufortinlac: Yeah, here we have it all on scripting language to start jobs
mathieufortinlac: yeah, that's what I do here as well, but that means that you'll have to maintain a second string of scripts for IIs no^
mathieufortinlac: ?
Jack Bates: how is iis different?
Jack Bates: what does it not support?
Jack Bates: (why do we need second scripts?)
mathieufortinlac: shoot, ok, no sorry, I am thinking too close to my implementation, we use shell scripting...
mathieufortinlac: But that does not complicate the installation, you don't need permissions to schedule on servers?
Jack Bates: mathieufortinlac: i think it does complicate a bit...
Jack Bates: you need to be able to add a cron job
Jack Bates: but that's the only option i see
Jack Bates: (and the one chosen by the largest and most widely deployed php app)

Gearman deployment

Gearman stack

Notes:

  • These instructions are designed for Ubuntu Linux, but the steps should be similar in other environments
  • Gearman also provides a client library written in pure PHP, available through the PEAR repository, which has fewer dependencies
  • Take a look at https://github.com/brianlmoon/GearmanManager; it can work with both libraries (the gearman PHP extension or the PHP library).
    • I am using sfGearmanPlugin right now; we can extend this plugin to use GearmanManager internally

The first step is to install the Gearman job server (gearmand). This service can be installed on a different machine as long as its socket listener (by default, TCP port 4730) is reachable from the Qubit server.

sudo apt-get install gearman-job-server

Gearman also provides two APIs to interact with the job server from our application: the client API, to create new jobs, and the worker API, to process them in a queue. Both APIs are bundled together in a library available for different programming languages. In our case, we'll use the Gearman PHP extension, available through PECL, a PHP extension repository.

sudo apt-get install php-pear

The package above contains both the pear and pecl client tools. The main difference between them is that PEAR only delivers libraries and code written in PHP, while PECL delivers PHP extensions written in C. Before using pecl, we need to install php5-dev and some other packages so that our system can compile the extension properly.

sudo apt-get install php5-dev libgearman-dev libgearman-server-dev

Now, let's download and build the Gearman PHP extension:

sudo pecl install gearman-beta

Did you get an error related to libuuid during the compilation? Take a look at this post: http://mgribov.blogspot.com/2010/05/gearman-pecl-package-on-ubuntu-lucid.html

The extension should now be installed, and we can enable it in our PHP configuration:

echo "extension=gearman.so" | sudo tee /etc/php5/conf.d/gearman.ini

Restart Apache:

sudo apache2ctl restart

Let's check that the extension was installed correctly:

$ php -i | grep -i gearman
  /etc/php5/cli/conf.d/gearman.ini,
  gearman
  gearman support => enabled
  libgearman version => 0.10

Resources