Houdini 20.5 Executing tasks with PDG/TOPs

Troubleshooting PDG scheduler issues on the farm

Useful information to help you troubleshoot scheduling PDG work items on the farm.

On this page

General debugging tips
Work items fail to report results due to connection refused or time out
- Firewalls
- DNS
- MQ
Work items fail due to required files not found
Python not found

This page contains general troubleshooting recommendations for all TOP schedulers. For more in-depth information about schedulers, please see the TOP scheduler documentation.

General debugging tips

Logs

Check the farm logs or job logs for warnings or errors.
Attach any warning or error messages you find to your bug reports.
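When a log is long, a short script can pull out the lines worth attaching to a report. The following is a minimal sketch, assuming plain-text logs that label problems with the words "warning" or "error". The function name and the matching rule are illustrative and are not part of PDG or any farm manager.

```python
import re

# Assumption: problems are labelled with "warning" or "error" (any case).
# Adjust the pattern to match how your farm software writes its logs.
_PROBLEM = re.compile(r"\b(?:warning|error)s?\b", re.IGNORECASE)


def find_problem_lines(lines):
    """Return (line_number, text) pairs for lines that mention a warning or error."""
    hits = []
    for number, text in enumerate(lines, start=1):
        if _PROBLEM.search(text):
            hits.append((number, text.rstrip("\n")))
    return hits
```

The function accepts any iterable of lines, so an open file object can be passed to it directly.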

Deadline scheduler

In the Deadline Scheduler node, turn on the Deadline ▸ Verbose Logging parameter to enable log output from the scheduler. The log output may contain useful warning or error messages. Please note that we will be adding this parameter to the other TOP schedulers soon.
Add the log output to your support tickets or your SideFX forum posts to help SideFX track down the problem.

Farm machines

Make sure that all your farm machines and your submitting machine have access to your network file system.
Ideally, you should have at least two farm machines or nodes available for cooking PDG work items. A single farm machine may not be able to run both the work item tasks and the MQ job, especially for the TOP Deadline Scheduler.
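One way to confirm file system access is to run the same small check on the submitting machine and on a farm machine. The sketch below assumes you know the shared directory path as each machine sees it. The helper name is hypothetical, and the check only proves that a file can be created and removed. It does not prove that both machines see the same storage.

```python
import os
import tempfile


def can_write_to(directory):
    """Return True if this machine can create and delete a file in directory."""
    if not os.path.isdir(directory):
        return False
    try:
        # Creating a real file catches read-only mounts that a
        # permission-bit check alone can miss.
        with tempfile.NamedTemporaryFile(dir=directory, prefix="pdg_access_check_"):
            pass
    except OSError:
        return False
    return True
```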

Paths

Do not use single backslashes (\) in paths, because Houdini evaluates them as escape sequences. Use double backslashes (\\) instead, or simply use forward slashes (/).
Houdini does not handle unescaped spaces in paths. Surround such paths with quotation marks (") or escape each space character with a backslash (\).
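The two path rules above can be applied mechanically. This is a minimal sketch, assuming the forward-slash and quotation-mark options are acceptable wherever the path is used. The function name is hypothetical, and whether quoting is appropriate depends on where the path ends up (a command line versus a parameter field).

```python
def houdini_safe_path(path):
    """Use forward slashes and quote the path if it contains spaces."""
    # Forward slashes avoid the escape-sequence evaluation of backslashes.
    normalized = path.replace("\\", "/")
    # Quote only when needed, and do not quote an already quoted path.
    if " " in normalized and not normalized.startswith('"'):
        normalized = '"' + normalized + '"'
    return normalized
```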

Work items fail to report results due to connection refused or time out

PDG work items executing on farm machines have to report their results back to the Houdini process that initiated their cook. This Houdini process is typically run on a user’s workstation, also known as the submitting machine, which is not a farm machine and in some cases may even have a different network environment than the farm machines.

The results are reported back via a network socket-based Remote Procedure Call (RPC). To receive these results, a server is automatically started on the submitting machine to listen for these RPCs and to respond back if needed.

That is why the executing work items need to know the IP address (or host name) and port number of the submitting machine, and there needs to be a resolvable network route from each farm machine to the submitting machine.
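To test whether such a route exists, you can try a plain TCP connection from a farm machine to the submitting machine. The sketch below uses only the Python standard library. The host and port you pass in are whatever your setup uses, and the function name is an assumption made for illustration. It is not a PDG API.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and failed name lookups.
        return False
```

A result of False while a cook is in progress on the submitting machine points to a firewall, routing, or name resolution problem rather than to the work item itself.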

Firewalls

Problems

Firewalls and host name resolution can cause issues with the PDG work item reporting mechanism.
Firewalls can get in the way of RPCs if they are enabled on any of your farm machines, your submitting machines, or between networks.

To work around this, PDG uses the Message Queue (MQ) server. The MQ server can run as a task or job on your farm machines, behind your firewalls. It can also use a limited number of ports (at least two) if those ports are allowed through your firewalls to the submitting machine.

Solutions

Contact your IT Administrator to allow a few ports through your firewalls.
Specify these ports in the Task Callback Port and Relay Port parameter fields on your TOP scheduler nodes.

For more information on these nodes, see TOP nodes.
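Before entering a port number in those parameter fields, it can help to confirm that nothing else on the machine is already using it. The following is a rough sketch with a hypothetical helper name. It only tests whether the port can be bound locally at this moment, and it says nothing about whether a firewall allows traffic to it.

```python
import socket


def port_is_free(port, host=""):
    """Return True if a TCP socket can currently be bound to the port on this machine."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permission error.
            return False
    return True
```

Run the check before starting the cook, because a running cook that already listens on the port will make the check report the port as taken.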

DNS

Problem

Domain Name System (DNS) resolution can cause issues when reporting results via RPCs. Currently, the reporting mechanism uses the host name by default, and that host name must be resolved to an actual IP address via a hosts file or DNS.

Solutions

For the hosts file, you can edit:

Windows

C:\Windows\System32\Drivers\etc\hosts

Linux

/etc/hosts

Mac

/etc/hosts

If neither is available (for example, on an AWS farm without DNS), the RPC mechanism can attempt to resolve the IP address of the MQ server.
- You can enable this by setting the PDGMQ_USE_IP=1 environment variable in the work item job process or the .hip file.
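To find out whether a farm machine can resolve a host name at all, you can query the system resolver from Python. The sketch below is illustrative and the function name is hypothetical. It relies on the standard library, which consults both the hosts file and DNS through the operating system.

```python
import socket


def resolve_host(name):
    """Return the sorted IP addresses the name resolves to, or [] if it does not resolve."""
    try:
        infos = socket.getaddrinfo(name, None)
    except OSError:
        # socket.gaierror (a subclass of OSError) is raised when lookup fails.
        return []
    # Each entry ends with a socket address whose first element is the IP.
    return sorted({info[4][0] for info in infos})
```

An empty list on a farm machine for the relevant host name suggests that a hosts file entry, a DNS record, or PDGMQ_USE_IP=1 is needed.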

MQ

For Submit Graph as Job cooks, the MQ server runs locally in the submitted job on the farm, so it should avoid these networking issues.
Running MQ as its own job or task takes up a farm machine for some scheduler set-ups. In addition, each scheduler node might run its own MQ server.

Work items fail due to required files not found

PDG on farms requires a network file system that is accessible by all machines involved in the process; this includes the submitting machine as well as all of the farm machines. All the files required by this process are copied to the PDG working directory specified by the scheduler located on the network file system. For more information, please see paths.

Problems

Issues that can interfere with this process are:

Different file paths for submitting machine vs. farm machines.
Non-homogeneous farm machine set-ups (for example, when you have Windows, macOS, and Linux machines in the same farm).

Solution

Each of the TOP scheduler nodes provides parameters to specify the remote file paths separately from the local file paths for the working directory.

Specify the local file path for the submitting machine.
Specify the remote file path that the farm machines can resolve.
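The relationship between the two paths can be pictured as a root swap. The sketch below only illustrates that idea: the scheduler performs its own mapping, the function name is hypothetical, and the example roots in the note are made up. It also ignores case-insensitive matching on Windows.

```python
def to_remote_path(local_path, local_root, remote_root):
    """Swap the local shared root for the remote one; return None if outside the root."""
    # Compare with forward slashes so Windows and POSIX spellings line up.
    path = local_path.replace("\\", "/")
    local = local_root.replace("\\", "/").rstrip("/")
    remote = remote_root.replace("\\", "/").rstrip("/")
    if path == local:
        return remote
    if path.startswith(local + "/"):
        return remote + path[len(local):]
    # The file is not under the shared root, so farm machines cannot see it.
    return None
```

For example, with a local root of C:/pdg/work and a remote root of /mnt/pdg/work, the file C:/pdg/work/geo/a.bgeo maps to /mnt/pdg/work/geo/a.bgeo. A result of None flags a file that was never placed under the shared working directory.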

HQueue Scheduler

Turn on the Override Local Shared Root parameter on your TOP scheduler node and then specify the appropriate Local Shared Root Paths.

Deadline Scheduler

For the local file path, use the Working Directory ▸ Local Shared Path parameter field on your TOP scheduler node.
For the remote file path, use the Working Directory ▸ Remote Shared Path parameter field on your TOP scheduler node.

Tractor Scheduler

Use the Shared File Root Path parameter fields on your TOP scheduler node.

Python not found

Problem

PDG requires Python for executing work on the farm. As such, the TOP schedulers assume that the Python executable is accessible via the system path.
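To see whether a Python executable is visible on the system path of a farm machine, you can run a check like the one below on that machine. This is a minimal sketch. The candidate names python and python3 are assumptions, because the exact executable name the schedulers look for is not stated here.

```python
import shutil


def find_python(names=("python", "python3")):
    """Return the full path of the first Python executable found on PATH, or None."""
    for name in names:
        found = shutil.which(name)
        if found:
            return found
    return None
```

Because this check itself needs Python, run it with any interpreter you can start by its full path. It then reports what the system path alone provides.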