8 Jobs - In Detail

This section describes the internal structure and properties of an HQueue job. You can skip this section if you are only interested in how to submit jobs from Houdini and how to monitor them. This part of the documentation is aimed at people who plan to create their own jobs and submit them from outside of Houdini (e.g. from a Python script). However, knowing a little about the HQueue job design does not hurt, as it helps clarify some of the terminology used throughout the web interface.

8.1 A Simple Example

In its most basic form, an HQueue job is merely a set of shell commands. When it is assigned to a client machine by the scheduler, the set of commands is executed. If the commands pass, then the job is said to have succeeded; otherwise, the job is said to have failed.

For example, suppose we want to create a job which simply prints "Hello World!" to the console. In that case, the command set would just be:

echo "Hello World!"

When it is submitted to HQueue, the job passes through several states before it completes. Here is a depiction of the status changes for our Hello World example:

waiting for machine ==> running ==> succeeded

Our job is initially set with the waiting for machine status. At this time, the scheduler searches for an available client machine on the farm. When one is found, the scheduler assigns it to our job and the machine is notified about the assignment. The client responds to the notification and picks up the job from the HQueue server. The job's status is changed to running and its command set is executed on the client. Once execution is complete, the client contacts the HQueue server and the job is finally set to succeeded.

Simple enough?

8.2 The Job Specification

Every HQueue job is defined by a specification -- a simple structure containing the job properties. Specifically, the specification is a Python dictionary where the keys are the property names and the values are the property values.

For our Hello World example, the job specification would look like the following:

{
         "name": "Print Hello World",
         "shell": "bash",
         "command": "echo 'Hello World!'"
}

In this example, we can see that there are 3 job properties -- name, shell and command. The properties indicate that the job is titled "Print Hello World", that the terminal shell to use when executing the commands is "bash" (as opposed to csh, tcsh, etc.), and that a single command is to be executed on the client machine, specifically "echo 'Hello World!'".

For a complete list of the job properties that can be included in the job specification, see 8.4 Job Properties.

Submitting the Job Specification

Now that our Hello World job is defined, we can use the specification to submit the job to HQueue. To do this, we can write a simple Python script:

import xmlrpclib

# Connect to the HQueue server.
hq_server = xmlrpclib.ServerProxy("http://hq_server_hostname:5000")

# Define a job which prints "Hello World!".
job_spec = {
         "name": "Print Hello World",
         "shell": "bash",
         "command": "echo 'Hello World!'"
}

# Submit the job to the server.
# newjob() returns a list of job ids (in case multiple jobs are passed in at once).
job_ids = hq_server.newjob(job_spec)

In the above script, we first make a connection to the HQueue server using the xmlrpclib Python module. Once a connection is established, we store a reference to it in the hq_server variable. Next we define our Hello World job and assign the specification to the job_spec variable. Finally, we submit the job to HQueue by calling the newjob() method from the HQueue server's API (see 12 Python API for a complete list of functions). Note that we pass the specification as a parameter to newjob(). Once the job is submitted, HQueue generates a unique id for it and returns the id to the caller inside a list. In the example above, we store that list in job_ids.
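Because newjob() returns a list of ids, a small wrapper can make single-job submission more convenient. The following is a minimal sketch; the helper name submit_single_job is hypothetical and not part of the HQueue API, and hq_server is assumed to be a server proxy like the one created in the script above.

```python
def submit_single_job(hq_server, job_spec):
    """Submit one job specification and return its id.

    hq_server is any object that exposes newjob(), such as the
    XML-RPC server proxy created in the script above.
    """
    # newjob() returns a list of ids, even for a single specification.
    job_ids = hq_server.newjob(job_spec)
    if not job_ids:
        raise RuntimeError("The server did not return a job id.")
    return job_ids[0]
```

Calling submit_single_job(hq_server, job_spec) then yields the id of the Hello World job directly rather than a one-element list.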

Commandless Jobs

It should be noted that the only required property in the job specification is name. Not even the command property is required. Jobs with no commands can still run on HQueue, but when assigned to a client machine, they do not perform any work.

Sometimes commandless jobs can be useful. For example, you may want to test calling newjob() on the HQueue server without burdening the farm with any real tasks. Or you may want to create a commandless "container" job with dependencies on other jobs so that the other jobs appear to be grouped together when viewed on the web interface. Job dependencies are explained in detail in the next section.
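As an illustration of the "container" idea, the sketch below builds a commandless job that groups two other jobs. The helper name make_container_job and the task names are hypothetical; the children property it relies on is described in 8.3 Parent-Child Relationships.

```python
def make_container_job(name, child_specs):
    """Return a commandless job specification that groups child jobs."""
    # There is no "command" property, so the job performs no work itself.
    return {"name": name, "children": list(child_specs)}


container = make_container_job(
    "Nightly Tasks",
    [
        {"name": "Task A", "shell": "bash", "command": "echo 'A'"},
        {"name": "Task B", "shell": "bash", "command": "echo 'B'"},
    ],
)
```

The resulting dictionary can be passed to newjob() like any other specification.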

8.3 Parent-Child Relationships

Sometimes a job, say Job A, cannot execute until another job, say Job B, has completed. In other words, Job A has a dependency on Job B. In HQueue terminology, when such a dependency exists, Job A is said to be the parent of Job B and Job B is said to be the child of Job A.

A Simple Parent-Child Example

Suppose we want to create an AVI video from a few frames, say 3, in a scene. Assuming that we have already generated IFD files for these frames, this task can be done in 2 steps:

  1. Generate the frame images using Mantra.
  2. Encode the images into a video.

For the first step, we define an HQueue job for each frame that would take the IFD and generate the image using Mantra. The job for rendering frame 1 would look like:

{
         "name": "Render Frame 1",
         "shell": "bash",
         "command": (
                  "cd $HQROOT/houdini_distros/hfs; "
                  "source houdini_setup; "
                  "mantra < $HQROOT/path/to/ifds/frame0001.ifd"
         )
}

The job specifications for rendering the second and third frames would be similar.

Let us suppose that Mantra generates the images to a path on the shared network drive, say $HQROOT/path/to/output/frame*.png. Then for the encoding step, we can pass in these images as input to our utility encoder. So the HQueue job for the encoding step would look like:

{
         "name": "Encode Video",
         "shell": "bash",
         "command":
                  "someEncoder --input=$HQROOT/path/to/output/frame*.png --output=$HQROOT/path/to/output/myVideo.avi"
}

Now we can create a dependency from the encoding job to the 3 render jobs so that the encoding does not start until the rendering is complete. Once the render jobs are finished, then the encoding job is assigned to a client machine and is executed.

To create the dependency, we can use the children property. The children property accepts a list of child job specifications.

So the final specification for our encoding job would look like:

{
         "name": "Encode Video",
         "shell": "bash",
         "command":
                  "someEncoder --input=$HQROOT/path/to/output/frame*.png --output=$HQROOT/path/to/output/myVideo.avi",
         "children": [
                  {
                           "name": "Render Frame 1",
                           "shell": "bash",
                           "command": (
                                    "cd $HQROOT/houdini_distros/hfs; "
                                    "source houdini_setup; "
                                    "mantra < $HQROOT/path/to/ifds/frame0001.ifd"
                           )
                  },
                  {
                           "name": "Render Frame 2",
                           "shell": "bash",
                           "command": (
                                    "cd $HQROOT/houdini_distros/hfs; "
                                    "source houdini_setup; "
                                    "mantra < $HQROOT/path/to/ifds/frame0002.ifd"
                           )
                  },
                  {
                           "name": "Render Frame 3",
                           "shell": "bash",
                           "command": (
                                    "cd $HQROOT/houdini_distros/hfs; "
                                    "source houdini_setup; "
                                    "mantra < $HQROOT/path/to/ifds/frame0003.ifd"
                           )
                  },
         ]
}
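Since the three render jobs differ only in their frame number, the same specification can also be generated in a loop. This is a minimal sketch; the helper names make_render_job and make_encode_job are hypothetical, and the paths and the someEncoder command are the placeholders used in the example above.

```python
def make_render_job(frame):
    """Return the specification for rendering one frame with Mantra."""
    ifd_path = "$HQROOT/path/to/ifds/frame%04d.ifd" % frame
    return {
        "name": "Render Frame %d" % frame,
        "shell": "bash",
        "command": (
            "cd $HQROOT/houdini_distros/hfs; "
            "source houdini_setup; "
            "mantra < " + ifd_path
        ),
    }


def make_encode_job(frames):
    """Return the encoding job with one child render job per frame."""
    return {
        "name": "Encode Video",
        "shell": "bash",
        "command": (
            "someEncoder --input=$HQROOT/path/to/output/frame*.png "
            "--output=$HQROOT/path/to/output/myVideo.avi"
        ),
        "children": [make_render_job(frame) for frame in frames],
    }


# Frames 1 to 3, as in the example above.
encode_job = make_encode_job(range(1, 4))
```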

Status Changes

With child dependencies, the parent job moves through a slightly different flow of status changes:

waiting for children ==> waiting for machine ==> running ==> succeeded

The parent job initially has the waiting for children status. At this time, the scheduler assigns client machines to the child jobs and the child jobs are executed. When the children finish, if at least one of them has failed, the parent job's status is set to failed and its command is not run. Otherwise, the parent job's status changes to waiting for machine. The scheduler locates an available client machine and assigns it to the parent job. Once the client picks up the job from the HQueue server, the parent's status changes to running. Finally, when the parent completes, it changes to succeeded.
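The rule applied when the children finish can be written as a small function. This is only a model of the behaviour described above, not HQueue's scheduler code; the function name is hypothetical, and it assumes that every child has already finished.

```python
def parent_status_after_children(child_statuses):
    """Return the parent's next status once every child has finished."""
    # A single failed child is enough to fail the parent.
    if any(status == "failed" for status in child_statuses):
        return "failed"
    # Otherwise the parent is ready to be assigned to a client machine.
    return "waiting for machine"
```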

Submitting Child Jobs from Within The Parent

It is possible to submit new child jobs from within a running job. You may want to do this if, for example, you have several tasks that need processing but you do not know how many there are until runtime. In that case, you can add commands to your parent job that calculate how many children it requires and submit specifications for the children to the HQueue server.

To submit new child jobs, you can use the newjob() API function. To assign the new job as a child, you can pass the current job's id as a parameter. In a running job, the id is stored in the $JOBID environment variable.

Below is an example of a Python script which creates a new job and assigns it as a child to the currently running job:

import os
import xmlrpclib

# Connect to the HQueue server.
hq_server = xmlrpclib.ServerProxy("http://hq_server_hostname:5000")

# Define the child job.
child_job_spec = {
         "name": "The Child Job",
         "shell": "bash",
         "command": "echo 'Hello World!'"
}

# Get the id of the current job.
# It should be defined in the environment.
current_job_id = os.environ["JOBID"]

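# Submit the child job to the server and assign it to the current job.
# XML-RPC proxies cannot send keyword arguments, so the parent id is passed
# positionally; its position as the second argument is an assumption.
# newjob() returns a list of job ids (in case multiple jobs are passed in at once).
job_ids = hq_server.newjob(child_job_spec, current_job_id)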

Now if we save this script into a file, say createChild.py, we can add it to the command property of the parent job. So the parent job's specification would look like:

{
         "name": "The Parent Job",
         "shell": "bash",
         "command": "python $HQROOT/path/to/scripts/createChild.py"
}

Note that the status changes for a job that submits its own children are slightly different:

waiting for machine ==> running ==> waiting for children ==> succeeded

The parent job initially has no children and so its status is set to waiting for machine. When a client machine is assigned and executes it, the job is changed to running. During execution, the job submits new children to the HQueue server. While the parent continues to run, its children are assigned to client machines and they run as well. When the parent finishes, it needs to wait for its children before it can declare that it succeeded and so its status is changed to waiting for children. When all the children finish, then the parent job is finally set to succeeded.
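When the number of children is only known at runtime, the parent's script can create them in a loop. The sketch below assumes Python 3, where the xmlrpclib module is named xmlrpc.client. The helper names and the task commands are hypothetical, and the parent id is passed as the second positional argument of newjob(), which is an assumption (XML-RPC proxies do not accept keyword arguments).

```python
import os
import xmlrpc.client


def make_task_specs(task_count):
    """Return one simple job specification per task."""
    return [
        {
            "name": "Task %d" % number,
            "shell": "bash",
            "command": "echo 'Processing task %d'" % number,
        }
        for number in range(1, task_count + 1)
    ]


def submit_children(hq_server, task_count, parent_id):
    """Submit task_count child jobs under parent_id and return their ids."""
    job_ids = []
    for spec in make_task_specs(task_count):
        # ASSUMPTION: the parent id is newjob()'s second positional parameter.
        job_ids.extend(hq_server.newjob(spec, parent_id))
    return job_ids


# Inside a running job, with a placeholder hostname:
#   server = xmlrpc.client.ServerProxy("http://hq_server_hostname:5000")
#   submit_children(server, 5, os.environ["JOBID"])
```

Submitting one specification per call keeps the sketch independent of how newjob() accepts several specifications at once.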

8.4 Job Properties

Below is a list of the job properties that can exist in a job specification.

Property Name Description
children A list of job specifications to submit to the server. The new jobs are assigned as children.
childrenIds A list of ids for existing jobs that should be assigned as children.
command The set of shell commands to execute on the assigned client machine.
conditions A list of conditions which tells the HQueue scheduler to assign the job to a restricted set of client machines. For more information, please read 8.5 Job Conditions.
cpus The minimum number of CPUs that the job will use. The default is 1.
description The job description.
emailReasons A comma separated list of reasons to send emails to the addresses specified by the `emailTo` property. If this is empty or not specified, no emails will be sent. Valid reasons are 'abandoned', 'cancelled', 'failed', 'paused', 'pausing', 'priority changed', 'queued', 'rescheduled', 'resumed', 'resuming', 'runnable', 'running', 'succeeded' and 'waiting'.
emailTo A comma separated list of email addresses to send emails to based on reasons specified by the `emailReasons` property.
environment A dictionary of variables to define in the environment when the job's command set is executed on a machine. The keys and values of the dictionary are the variable names and values respectively.
host The hostname of the machine that the job should execute on. If this property is not set, then the job can execute on any machine.
maxHosts The maximum number of client machines required to process the job. The default is 1.
minHosts The minimum number of client machines required to process the job. The default is 1.
name The title of the job.
priority The job's priority. Jobs with higher priorities are scheduled and processed before jobs with lower priorities. 0 is the lowest priority. The default is 0.
shell The terminal shell to use when executing the job's command set.
submittedBy What the name of the submitter should be. For child jobs, if this value is not specified, it is inherited from a parent job.
tags A list of tags to apply to the job. Tags can be used to control whether the job requires a dedicated machine or whether it can share a machine with other running jobs. For more information, see 8.6 Job Tags.
triesLeft The number of times the job should be automatically rescheduled in an attempt to make it succeed after a failure. If the job fails after `triesLeft` times, then it remains as failed. The default value is 0.
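The defaults listed in the table can be made explicit on the client side before a job is submitted. This is a minimal sketch; the names DOCUMENTED_DEFAULTS and with_documented_defaults are hypothetical, and the function only mirrors the defaults stated above rather than querying the server.

```python
# Defaults as documented in the property list above.
DOCUMENTED_DEFAULTS = {
    "cpus": 1,
    "maxHosts": 1,
    "minHosts": 1,
    "priority": 0,
    "triesLeft": 0,
}


def with_documented_defaults(job_spec):
    """Return a copy of job_spec with the documented defaults filled in."""
    # name is the only required property in a job specification.
    if "name" not in job_spec:
        raise ValueError("A job specification requires a name.")
    full_spec = dict(DOCUMENTED_DEFAULTS)
    # Values given by the caller override the defaults.
    full_spec.update(job_spec)
    return full_spec
```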

You may also add arbitrary properties to a job spec. These have no special meaning to the HQueue server but they might be useful to specific or custom jobs. The recommended way to do this is to use a separate entry for the class of job, holding all the custom properties it needs in a dictionary.

For example, the HQueue Render ROP submits jobs that have an HQPARM property which is used by the scripts that are invoked by these job submissions.
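A custom entry might look like the sketch below, which assumes Python 3 (where xmlrpclib is named xmlrpc.client). The property name publishParms, its contents and the script path are hypothetical. The round trip through xmlrpc.client.dumps() and loads() only shows that a nested dictionary survives XML-RPC encoding; it does not contact a server.

```python
import xmlrpc.client

job_spec = {
    "name": "Publish Shot",
    "shell": "bash",
    "command": "python $HQROOT/path/to/scripts/publish.py",
    # Hypothetical custom entry: one dictionary holding every custom value.
    "publishParms": {"shot": "sh010", "version": 3},
}

# Encode the specification as it would travel in a newjob() request.
payload = xmlrpc.client.dumps((job_spec,), methodname="newjob")

# Decode it again to confirm that the custom entry is preserved.
decoded_params, decoded_method = xmlrpc.client.loads(payload)
```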


Job Properties Example

Below is an example of a job specification which demonstrates the use of some of the properties:

{
         "name": "The Main Job",
         "shell": "bash",
         "environment": {
                  "SHOW_MSG": "1",
                  "MSG": "Hello World!"
         },
         "command": (
                  "if [ $SHOW_MSG = 1 ]; then "
                  "echo $MSG; "
                  "fi"
         ),
         "tags": [ "single" ],
         "maxHosts": 1,
         "minHosts": 1,
         "priority": 0,
         "children": [
                  {
                           "name": "The Child Job",
                           "shell": "bash",
                           "command": "echo 'Hello World!'"
                  }
         ]
}

The example above defines a job named "The Main Job" which has a priority level of 0. It uses the 'bash' shell to execute its command set and it defines two variables, "SHOW_MSG" and "MSG", in the environment. Its command set directly references these two variables. The job requires one dedicated machine as defined by the "single" tag, and the "maxHosts" and "minHosts" properties. Finally, it has a single child job which prints out "Hello World".

8.5 Job Conditions

You can attach conditions on a job which inform the HQueue scheduler to assign the job to a restricted set of client machines. A job condition is defined by a type, name, operator and value. Together they form a comparison test that the scheduler can use to determine whether a machine is acceptable to run the job. If a client machine passes ALL of the assigned conditions, then it can run the job.

Below is a description of each of the condition components:

Component Description
type The type of the condition dictates how the condition should be applied in the HQueue scheduling system. Since HQueue only supports client conditions at the moment, the type should always be set to "client". Client conditions are used to determine whether a client machine is acceptable to run a target job or not.
name The name of the condition identifies the part of the client that should be tested to determine whether it is acceptable or not. The supported names are:
  • hostname - The condition should be tested against the client's hostname.
  • group - The condition should be tested against the client's group memberships.
op The comparison operator that should be used to test the client's attribute (as indicated by the name component) against the condition's value. The supported operators are:
  • == - Returns true if the client's attribute exactly matches the condition value.
  • != - Returns true if the client's attribute does not match the condition value.
  • any - Returns true if the client's attribute matches any element in the condition value. Use commas to separate multiple elements in the value.
value The condition value is used to test against the requested client attribute. If the condition operator is "any", then the value can be a list of multiple elements where commas are used to separate elements.

Job Condition Examples

Below is an example which demonstrates how to attach a condition to a job specification:

{
         "name": "A Job with Conditions",
         "shell": "bash",
         "command": "echo 'I should be running on either machine1 or machine2!'",
         "conditions": [
                  { "type" : "client", "name":"hostname", "op":"any", "value":"machine1,machine2" },
         ]
}

The example above defines a job which can only be assigned to a client machine named either "machine1" or "machine2". Note that the conditions property is a list of Python-like dictionaries where each dictionary defines a single condition and its 4 components.

The next example shows how to set a condition where the job can only be assigned to client machines that are members of the "Simulation" group:

{
         "name": "A Job for the Simulation Group",
         "shell": "bash",
         "command": "echo 'I should be running on a machine that is a member of the Simulation group!'",
         "conditions": [
                  { "type" : "client", "name":"group", "op":"==", "value":"Simulation" },
         ]
}

The example above defines a job which can only be assigned to a client machine that is a member of the "Simulation" group. As in the previous example, the conditions property is a list containing a single condition dictionary.
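To make the comparison test concrete, the sketch below evaluates conditions against a client's hostname and groups. It is a model of the rules described in this section, not HQueue's scheduler code, and all function names are hypothetical. For the group name it assumes that "==" passes when any of the client's group memberships equals the value, and that "!=" passes when none does.

```python
def make_condition(name, op, value):
    """Return a client condition dictionary with its 4 components."""
    return {"type": "client", "name": name, "op": op, "value": value}


def condition_passes(condition, hostname, groups):
    """Evaluate one condition against a client's hostname and groups."""
    if condition["name"] == "hostname":
        attributes = [hostname]
    elif condition["name"] == "group":
        attributes = list(groups)
    else:
        raise ValueError("Unsupported condition name: %r" % condition["name"])

    op = condition["op"]
    value = condition["value"]
    if op == "==":
        return value in attributes
    if op == "!=":
        return value not in attributes
    if op == "any":
        # Commas separate the elements of the condition value.
        wanted = [element.strip() for element in value.split(",")]
        return any(attribute in wanted for attribute in attributes)
    raise ValueError("Unsupported operator: %r" % op)


def client_is_acceptable(conditions, hostname, groups):
    """A client may run the job only if it passes ALL of the conditions."""
    return all(condition_passes(c, hostname, groups) for c in conditions)
```

For instance, client_is_acceptable([make_condition("hostname", "any", "machine1,machine2")], "machine2", []) evaluates to True under these assumptions.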

8.6 Job Tags

Tags can be used to describe whether the job requires a dedicated machine or whether it can share a machine with other running jobs. If no tags are specified, then by default, the job is configured to share the machine it is running on as long as the machine has enough CPUs.

To declare that your job needs a dedicated machine, add the "single" tag to the tags property. This guarantees that no other jobs will be assigned to the machine while your job is running.

You can also create custom single tags to control which sets of running jobs can share machines and which sets cannot. The sharing rule works as follows -- the scheduler will never assign running jobs to the same machine if they have matching tags.

To create a custom single tag, simply prefix the tag name with "single".

For example, suppose we create a custom single tag named "single:1" and assign it to two jobs, say A and B. And suppose we create another custom single tag named "single:2" and assign it to two other jobs, say C and D. This means that jobs A and B cannot execute on the same machine concurrently, and the same goes for jobs C and D. However, job A or job B can run concurrently on the same machine as job C or job D, and vice versa.
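The sharing rule can be modelled with a short function. This is a sketch of the rule as described in this section, not HQueue's scheduler code; the function name is hypothetical, only tags that start with "single" are assumed to take part in the comparison, and CPU availability is ignored.

```python
def can_share_machine(tags_a, tags_b):
    """Return True if two running jobs may be placed on the same machine."""
    # The plain "single" tag requests a dedicated machine.
    if "single" in tags_a or "single" in tags_b:
        return False
    # Custom single tags only block jobs that carry a matching tag.
    custom_a = {tag for tag in tags_a if tag.startswith("single")}
    custom_b = {tag for tag in tags_b if tag.startswith("single")}
    return not (custom_a & custom_b)


job_a = ["single:1"]
job_b = ["single:1"]
job_c = ["single:2"]
job_d = ["single:2"]
```

With these tags, can_share_machine(job_a, job_b) is False while can_share_machine(job_a, job_c) is True, matching the example above.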

8.7 Job Statuses

Below is a list of all the job statuses and their descriptions.

Status Description
abandoned The job is assigned to a client machine but the machine is not reporting on its progress or status. This can happen if the machine fails (e.g. reboots) while running the job.
cancelled The job is no longer on the scheduling queue because the user interrupted it before it finished.
failed The job is finished but an error was reported when executing its command set, or when executing the command set of one of its descendant jobs.
paused The job has been paused by the user. The scheduler does not assign a client machine to the job while it has this status. If the job is already running on a machine, then its execution is halted.
pausing The job is running on a client machine but has been requested by the user to halt execution. The HQueue server is waiting for contact from the client to confirm that the job has been paused.
resuming The job is assigned to a client machine and is currently paused but has been ordered by the user to resume execution. The HQueue server is waiting for contact from the client to confirm that the job has been resumed.
running The job is being processed on an assigned client machine.
running (X clients assigned) One or more of the job's descendant jobs are running and a total of X clients are assigned to run those jobs.
succeeded The job is finished and no errors were reported during command execution.
waiting for machine The job is ready for processing but is waiting for an available machine.

8.8 Job Variables

Below is a list of the built-in variables available in the command environment of a running job.

Environment Variable Description
HQCLIENT The folder path to the client code on the machine running the job.
HQCLIENTARCH The platform of the client machine that is running the job. It consists of the operating system and machine architecture. Here is a quick list of the possible values:
  • linux-x86_64 ==> Linux 64-bit
  • linux-i686 ==> Linux 32-bit
HQROOT The folder path to the HQueue shared network drive. Depending on the platform of the machine running the job, $HQROOT will be set to one of the hqserver.sharedNetwork.mount.linux, hqserver.sharedNetwork.mount.windows, or hqserver.sharedNetwork.mount.darwin variables found in the HQueue server configuration file (see 9 Configuration).
HQSERVER The address of the HQueue server. It consists of the HQueue server's hostname and the port number that the server is listening on.
JOBID The id of the current job.
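A script running inside a job can read these variables from its environment. The sketch below is illustrative only: the function names are hypothetical, and server_url() assumes that HQSERVER has the form "hostname:port" and that the server is reached over plain HTTP.

```python
import os

BUILT_IN_VARIABLES = ("HQCLIENT", "HQCLIENTARCH", "HQROOT", "HQSERVER", "JOBID")


def describe_job_environment(environ=os.environ):
    """Return the built-in job variables, with None for any that are unset."""
    return {name: environ.get(name) for name in BUILT_IN_VARIABLES}


def server_url(environ=os.environ):
    """Build the HQueue server URL from the HQSERVER variable."""
    # ASSUMPTION: HQSERVER looks like "hostname:port" and HTTP is used.
    return "http://" + environ["HQSERVER"]
```

Outside of a running job these variables are normally unset, so describe_job_environment() returns None for each of them and server_url() raises a KeyError.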