TOP Deadline scheduler node updated (Houdini 17.5.362)

User Avatar
Staff
571 posts
Joined: May 2017
Offline
Hello,

Please update to 17.5.362 or newer.

The TOP Deadline scheduler has gone through a significant update both in terms of scheduling behaviour and UX.

The most significant change is that a Deadline scheduler node now schedules each work item in a PDG graph as a Deadline Task under a single Deadline Job. This reduces the “job noisiness” and improves performance when dealing with a large number of work items.

Another change, perhaps not a direct user concern, is the use of a new Deadline plugin called PDGDeadline. Previously, the default Deadline CommandLine plugin was used for PDG cooks, but it offered limited control. The new plugin allows for greater control when setting up and managing each task. It is shipped with Houdini, and is copied over to the job directory to be used for each cook.

The UI has gone through a significant change as well, in order to reduce initial setup and have things just work. Customizing the scheduler is still possible, but those options are hidden behind override toggles in the new Advanced section.

For the Scheduler UI section:



The Working Directory is simplified to take an absolute path, with variables if needed. If you have a common farm setup, where all slaves share the same path as your local machine, then you can supply the path directly in Local Shared Path. Otherwise, you can toggle on Remote Shared Path and specify a variable-based path that can be mapped in Deadline's Mapped Paths, which gets resolved on the slave. This is necessary in a mixed farm setup (e.g. Windows, macOS, and Linux slaves).
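Conceptually, a Mapped Paths entry is a per-OS prefix substitution applied on the slave. Here is a minimal Python sketch of that idea; the mapping table and the `$PDG_SHARE` variable are made-up examples, not Deadline's actual implementation:

```python
# Illustrative sketch of how a variable-based Remote Shared Path
# could be resolved per slave OS. Deadline's real resolution lives
# in the repository's Mapped Paths settings; this table is made up.
MAPPED_PATHS = {
    "$PDG_SHARE": {
        "windows": "P:/pdg_share",
        "linux": "/mnt/pdg_share",
        "mac": "/Volumes/pdg_share",
    },
}

def resolve_remote_path(path, slave_os):
    """Replace a known variable prefix with the slave's local path."""
    for prefix, per_os in MAPPED_PATHS.items():
        if path.startswith(prefix):
            return per_os[slave_os] + path[len(prefix):]
    return path  # no mapping needed (common shared-path setup)

print(resolve_remote_path("$PDG_SHARE/projects/jobA", "linux"))
# -> /mnt/pdg_share/projects/jobA
```

This is also why a mixed-OS farm needs the variable form: the same job path has to resolve to a different local path on each platform.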



Added Machine Limit, Machine List, and Blacklist job parameters.



A new Deadline section has been added, which for the most part you don't need to worry about. But it is useful in several ways:
  • Verbose Logging - Turn this on to get a log of everything that happens during the cook. Very useful for debugging problems. If you run into trouble and submit a bug report, please also copy this verbose log out to a file and attach it. I am thinking of doing this automatically when the toggle is on.
  • PDGMQ Server As Task - A slightly advanced feature, but if you get broken socket errors, turning this on might help. It runs the new task tracker (the PDGMQ server) as its own task, instead of as a background process.
  • Force Reload Plugin - This Deadline job parameter is disabled if PDGMQ Server As Task is disabled. It is meant to clean the slate between tasks, which we can't do when the PDGMQ server runs as a background process that has to be kept alive.

Advanced Section: Don't change anything here unless you know what you are doing.
  • Task Submit Batch Max - The number of work items to submit at a time (on each tick update). You can play with this if you have a large number of work items and want to trade a less responsive UI for faster scheduling.
  • Task Check Batch Max - The number of in-flight tasks whose status is checked at a time. Again, useful when dealing with a large number of work items.
  • Repository - If you have multiple Deadline repositories and want to use one other than the system default, set it here.
  • PDG Deadline Plugin - The scheduler uses a new custom Deadline plugin written for the PDG cook process. Only change this if you have written your own plugin that conforms to the expected behaviour. Please contact me via support if you want to do this, as there is no documentation for it at the moment.
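The two batch parameters above boil down to chunking the pending lists on each tick. A rough Python sketch of the idea (names are illustrative, not the scheduler's actual internals):

```python
# Sketch of the batching behind Task Submit Batch Max and
# Task Check Batch Max: on each tick, submit at most N pending
# work items and poll at most M in-flight tasks. Bigger batches
# mean faster scheduling but a less responsive UI between ticks.
def take_batch(queue, batch_max):
    """Remove and return up to batch_max items from the front of queue."""
    batch, queue[:] = queue[:batch_max], queue[batch_max:]
    return batch

pending = list(range(10))           # stand-ins for pending work items
submitted = take_batch(pending, 4)  # one tick's worth of submissions
print(submitted, len(pending))      # -> [0, 1, 2, 3] 6
```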



For the new task tracker, you can specify the ports here. This is useful if you have network issues or limitations, such as firewalls, when communicating with the farm.

For the Job Parms section:



HFS will need to point to the Houdini installation that all slaves will be using. Again, for a straightforward setup where your local machine is the same as your slaves, you can simply leave it as $HFS. But if you have slaves using a different Houdini installation path, then you'll need to supply a variable here that Deadline can map via its Mapped Paths setting. For example, you can set HFS=\$HFS (note the \), and then apply a $HFS= mapping in the Mapped Paths. The \ escapes Houdini's evaluation of $HFS.
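For concreteness, the escaped parameter and its repository-side mappings could look like this (the installation paths are made-up examples):

```
# TOP Deadline scheduler parameter (on the local machine):
HFS = \$HFS            # the \ stops Houdini from expanding it locally

# Deadline repository > Mapped Paths (resolved on each slave):
$HFS  ->  C:/Program Files/Side Effects Software/Houdini 17.5.362   (Windows slaves)
$HFS  ->  /opt/hfs17.5.362                                          (Linux slaves)
```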

Similarly, for Python, the default uses the system Python. The \$PDG_EXE token is automatically evaluated by the PDGDeadline plugin, which replaces it with .exe on Windows and strips it out on other platforms. Deadline requires the .exe extension for executables on Windows. If you specify another Python path, you'll need to keep \$PDG_EXE if you are using Windows-based slaves.

Note that the Hython field has been removed. The PDGDeadline plugin evaluates it on the slave by formulating it from $HFS (e.g. $HFS/bin/hython on Linux).
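The substitution rules from the last two paragraphs can be sketched in a few lines of Python (an illustration of the described behaviour, not the PDGDeadline source):

```python
def resolve_pdg_exe(path, is_windows):
    """Replace the $PDG_EXE token with .exe on Windows, strip it elsewhere.
    (Sketch of the behaviour described above, not the plugin's code.)"""
    return path.replace("$PDG_EXE", ".exe" if is_windows else "")

def hython_path(hfs, is_windows):
    """Formulate the hython executable path from $HFS on the slave."""
    return hfs + "/bin/hython" + (".exe" if is_windows else "")

print(resolve_pdg_exe("C:/Python27/python$PDG_EXE", True))  # -> C:/Python27/python.exe
print(hython_path("/opt/hfs17.5", False))                   # -> /opt/hfs17.5/bin/hython
```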



Added new Pre Task Script and Post Task Script which run before and after each task.



Allows work items to inherit the local environment, set HOUDINI_MAXTHREADS, and add work item specific environment values.
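As a sketch, the resulting task environment could be assembled like this (function and variable names are illustrative, not the scheduler's code):

```python
import os

# Sketch of how a work item's task environment might be built:
# optionally inherit the local environment, cap the thread count,
# then layer on work-item-specific values, which win on conflict.
def build_task_env(inherit_local, max_threads, item_env):
    env = dict(os.environ) if inherit_local else {}
    if max_threads:
        env["HOUDINI_MAXTHREADS"] = str(max_threads)
    env.update(item_env)  # work-item-specific overrides take precedence
    return env

env = build_task_env(False, 8, {"WEDGE_INDEX": "3"})
print(env)  # -> {'HOUDINI_MAXTHREADS': '8', 'WEDGE_INDEX': '3'}
```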

Other new features and changes:
  • Submit Graph As Job is now supported, which lets you schedule the hip file to be cooked on the farm and quit Houdini.
  • Added Job Name parm to allow specifying the Deadline Job name.
  • Improved performance and error handling.
  • Fixed displaying the correct task report file via Task Info > Show Log.
  • Fixed shell command execution.
  • Removed the deadline_jobpreload.py script as it is no longer required.

The previous version ran into issues with large numbers of work items, but this update handles hundreds of thousands of work items.

The help page has been updated with these changes: https://www.sidefx.com/docs/houdini/nodes/top/deadlinescheduler.html

Known Issues:

There is a current limitation when using more than 1 TOP Deadline scheduler in a graph. I'll be looking into improving this next.

When using a mixed OS farm (or if using variables for the HFS path), the slaves might error out with not finding the PDGDeadline plugin. This is a Deadline bug which they've said they will fix (the bug is that the custom plugin directory path is not evaluated via Mapped Paths before looking for the plugin). For now, the workaround is to copy the entire PDGDeadline folder ($HFS/houdini/pdg/plugins/PDGDeadline) into the Deadline repository's custom plugin folder ($repo/custom/plugin/PDGDeadline). If you do this, make sure you also update the copy when installing a new version of Houdini, in case the plugin has changed. It is annoying, but if it becomes a real issue, I can look into improving this.
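The workaround is just a recursive folder copy. A self-contained Python sketch, using mock directories in place of the real $HFS and repository paths so it runs anywhere:

```python
import os
import shutil
import tempfile

# Sketch of the workaround above: copy the shipped PDGDeadline plugin
# folder into the repository's custom plugin folder. The real paths are
# $HFS/houdini/pdg/plugins/PDGDeadline and $repo/custom/plugin/PDGDeadline;
# mock directories stand in for them here.
root = tempfile.mkdtemp()
src = os.path.join(root, "hfs", "houdini", "pdg", "plugins", "PDGDeadline")
dst = os.path.join(root, "repo", "custom", "plugin", "PDGDeadline")
os.makedirs(src)
open(os.path.join(src, "PDGDeadline.py"), "w").close()  # stand-in plugin file

if os.path.isdir(dst):       # re-copying after a Houdini update
    shutil.rmtree(dst)
shutil.copytree(src, dst)    # creates intermediate directories as needed
print(os.path.isfile(os.path.join(dst, "PDGDeadline.py")))  # -> True
```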


As always, feedback is welcome. Or submit bug tickets if you run into issues.
Edited by seelan - Sept. 9, 2019 11:27:42

Attachments:
dl_new_ui_working_dir.png (11.0 KB)
dl_new_ui_machine.png (8.0 KB)
dl_new_ui_dl.png (39.6 KB)
dl_new_ui_mq.png (9.8 KB)
dl_new_ui_paths.png (18.2 KB)
dl_new_ui_tasks.png (7.8 KB)
dl_new_ui_taskenv.png (14.3 KB)

User Avatar
Member
6 posts
Joined: Feb. 2016
Offline
Hi, my first time using the TOP Deadline scheduler. When I press Submit Graph As Job I get the error below. I feel that everything is correctly installed. I am not sure what I missed. I attached my file. Any advice will be appreciated. Thanks

Houdini 17.5.360
Deadline 10.0.27.3
 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "opdef:/Top/deadlinescheduler?PythonModule", line 3, in submitGraphAsJob
  File "C:/PROGRA~1/SIDEEF~1/HOUDIN~1.360/houdini/python2.7libs\pdg\scheduler.py", line 407, in submitGraphAsJob
    topnode.cook(True)
  File "C:/PROGRA~1/SIDEEF~1/HOUDIN~1.360/houdini/python2.7libs\hou.py", line 10535, in cook
    return _hou.Node_cook(*args, **kwargs)
OperationFailed: The attempted operation failed.
Error while cooking.
Edited by R0B - Sept. 9, 2019 12:04:11
User Avatar
Staff
571 posts
Joined: May 2017
Offline
Please update to 17.5.362 or newer. There is a bug with 17.5.360.
User Avatar
Member
6 posts
Joined: Feb. 2016
Offline
Hi, I have updated. I am not sure what the steps are to submit this correctly. Sorry if this has been covered before.
 
1. Placed the deadlinescheduler inside the topnet and pointed/referenced it from the outside.
2. Inside the topnet, on the deadlinescheduler, pressed “Submit Graph As Job”.
3. Only one job named “PDG TOP_Deadline_scheduler” was submitted, but it crashes.

Attached hip file + screen shots 
Thanks

Attachments:
tops_01.JPG (53.1 KB)
tops_02.JPG (97.0 KB)
tops_03.JPG (46.6 KB)
tops_04.JPG (110.9 KB)
TOP_Deadline_scheduler_04.hip (1.6 MB)

User Avatar
Staff
571 posts
Joined: May 2017
Offline
Could you paste the rest of the Deadline log (it's cut off halfway)?

If you simply do “Dirty and Cook Output Node”, what happens? (So not using the “Submit Graph As Job”).
User Avatar
Member
2 posts
Joined: Sept. 2014
Offline
I've got this scheduler working for the most part, but it appears it's ignoring the slaves' GPU affinity in Deadline? Redshift ROP and Deadline ROP, same issue.
User Avatar
Staff
571 posts
Joined: May 2017
Offline
From my understanding of how Deadline's GPU affinity works, it is normally set per slave instance via Deadline Monitor, but it can be overridden by the job, if the job supports it (e.g. a renderer executable with a GPU argument).

Since there isn't a generic way to specify the GPU affinity for any type of job, it falls on the specific job's command (and the renderer) to set the GPU affinity. The TOP Deadline scheduler is generic, so it wouldn't be able to specify GPU affinity directly. This means for TOP nodes, you'll need to specify the GPU affinity in the job's command arguments. Deadline's shipped plugins take care of this for each renderer they support, since they know exactly the command to run for the job.

One thing I can recommend is to create multiple slave instances on a single machine, and set the GPU affinity via Deadline Monitor for each slave instance. Then group the slaves, and specify the group in the TOP Deadline scheduler.

Let me know if you have suggestions to improve this.
User Avatar
Member
451 posts
Joined: Oct. 2011
Offline
I would not use multiple slaves on a single machine, because you have to specify the actual GPU ids for each slave instance. It's not very flexible and is kind of cumbersome to maintain on many machines. The way we do it with Redshift and the Deadline ROP node is like this (haven't tried with TOPs yet as we're still on 17.5.360):
- We have machines with a mix of 2 and 4 GPUs.
- We render Redshift jobs using 2 GPUs per task, as this seems to be the most efficient combination.
- In Monitor we set the Slave Property “Concurrent Task Limit Override” to 1 for our 2-GPU machines and 2 for our 4-GPU machines.
- When we submit jobs we set “Concurrent Tasks” to 2, but we check “Limit Tasks to Slave's Task Limit” on the Deadline ROP. “Limit Tasks to Slave's Task Limit” is not present on the Deadline Scheduler in 17.5.360 as far as I can see. Not sure how to set this option without the checkbox present in the UI.
- So when the job renders, the default is to use 2 GPUs per task and render 2 tasks concurrently on each machine.
So to sum it up:
Machines with 2 GPUs have an override to only render 1 task concurrently and will therefore only use 2 GPUs. All the other machines will render 2 tasks at a time, using 2 GPUs each.
When we render CPU jobs, we uncheck “Limit Tasks to Slave's Task Limit” and set Concurrent Tasks to 1.

So on the Deadline Scheduler in TOPs you would probably have to make a “Plugin File Key-Value” entry with “GPUsPerTask” and a value of 2 to mimic our setup, and set “Concurrent Tasks” to 2.

It might be that the Deadline ROP has some code to select GPU ids for different tasks, so it might not work in TOPs.
I'll test this as soon as we upgrade.

-b
Edited by bonsak - Oct. 2, 2019 09:00:08
http://www.racecar.no
User Avatar
Member
2 posts
Joined: Sept. 2014
Offline
seelan
…..

One thing I can recommend is to create multiple slave instances on a single machine, and set the GPU affinity via Deadline Monitor for each slave instance. Then group the slaves, and specify the group in the TOP Deadline scheduler.

Let me know if you have suggestions to improve this.

This is actually how I currently have it set up, and it ignores the GPU affinity that is set per slave. It only ignores it when submitting via TOPs; it works in Maya and Houdini using the normal Deadline submission methods.

bonsak, that is definitely the recommended route for render nodes and was my old workflow, but I've switched to multiple slave instances per machine so I can disable specific instances and keep specific GPUs free on workstations. I've only got three machines though, so management is easy.
User Avatar
Staff
571 posts
Joined: May 2017
Offline
Some info on this: https://www.awsthinkbox.com/to-affinity-and-beyond

Note that unlike CPU affinity, GPU affinity isn’t automatically applied to all renders. It is up to the individual Deadline application plugins to pull this information from the Slave’s at render time and pass them to the renderer.

So there are 2 ways to make use of GPU setting in Deadline:

1. Setting GPU Affinity per Slave via Deadline Monitor.

2. Setting GPUs Per Task and using Concurrent Tasks at the job level (via the plugin interface).

Again, the scheduling plugin needs to translate this to the GPU setting for each renderer command, since it's not an OS setting.

I believe what is lacking for the TOP Deadline scheduler is the mapping from GPU Affinity, or GPUs Per Task, to the GPU setting for each render command. Since the TOP scheduler is generic and doesn't look at the type of command being run, this wasn't put in. Ideally you'd add that to the command you are running. Since Deadline's render plugins take care of this automatically, as a user you are used to it just working.

What I can do is provide options to add the GPU settings when the TOP Deadline scheduler detects that a render command is being run, and I can do this for the most common renderers, such as the Deadline ROP and Redshift ROP. Please submit an RFE for the renderers you want support for.

If you are curious and want to try it out yourself, these are the environment settings used by the Deadline Houdini plugin (for OpenCL with a GPU setting), which you can set via the TOP Deadline scheduler's environment key-value mapping:

HOUDINI_OCL_DEVICETYPE=GPU
HOUDINI_OCL_VENDOR=
HOUDINI_OCL_DEVICENUMBER=<insert gpu ids, e.g.: 0,2>

Redshift:
Add the ‘-gpu’ argument with the id of each GPU to the command (e.g. -gpu 0 -gpu 2).

Ultimately, GPU Affinity and GPUs Per Task do the above automatically for the respective render command.
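For Redshift, turning a list of allowed device ids into the command arguments is a one-liner. A hypothetical helper, just illustrating the mapping above:

```python
def redshift_gpu_args(gpu_ids):
    """Build Redshift's per-GPU command arguments,
    e.g. [0, 2] -> '-gpu 0 -gpu 2'.
    Hypothetical helper, not part of the PDGDeadline plugin."""
    return " ".join("-gpu %d" % gid for gid in gpu_ids)

print(redshift_gpu_args([0, 2]))  # -> -gpu 0 -gpu 2
```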
Edited by seelan - Oct. 5, 2019 08:48:16
User Avatar
Member
30 posts
Joined: Dec. 2010
Offline
Hi Seelan,

I am in a similar boat to Chris Denny, and can't work out how to get Deadline PDG Redshift ROP tasks to respect each individual slave's GPU affinity.

Basically I have 2 slaves configured on my single workstation, each slave is set to use 1 of 2 GPUs each in the Deadline slave GPU affinity settings.
This all works great and as expected when submitting a regular Redshift render job from Houdini.
However, when the PDG Deadline scheduler submits a Redshift ROP fetch job, the GPU affinity doesn't get respected, and I end up in a case where both slaves are running a Redshift render job but each job is trying to use both GPUs at the same time, so VRAM maxes out and renders slow to a crawl.

Looking at the link to the Deadline website you sent I see this paragraph:

“Note that unlike CPU affinity, GPU affinity isn't automatically applied to all renders. It is up to the individual Deadline application plugins to pull this information from the Worker's at render time and pass them to the renderer.”


So does that mean I would have to modify the PDGDeadline.py plugin file to have some logic to respect the GPU affinity settings on the Deadline slaves?
Sorry if this is what you already covered above but I'm having a bit of trouble figuring it out.
What you mentioned about the Redshift -gpu argument looks interesting but how can I get that argument into the Deadline render commands and would it be possible to set it up so that one slave always uses GPU 0 and one slave always uses GPU 1?

Basically, at the end of the day I just want to be able to run 2 slaves at the same time, each one only ever using 1 of 2 GPUs.

BTW I'm using Houdini 17.5.425
Upgrading to Houdini 18 isn't an option just yet on this production.

Thanks for any help!
-MC
MC
User Avatar
Staff
571 posts
Joined: May 2017
Offline
I'll add support for Redshift GPU specification for the TOP Deadline scheduler. It'll work similarly to the native Deadline plugin.
User Avatar
Member
30 posts
Joined: Dec. 2010
Offline
“Ask and you shall receive”

Thanks Seelan!
Would this be a Houdini 18 feature only?
MC
User Avatar
Staff
571 posts
Joined: May 2017
Offline
Aiming for 17.5 as well, since you are stuck on it for now. It would be great if you can test it out as soon as it's in; I'll update this thread when that happens.
User Avatar
Member
30 posts
Joined: Dec. 2010
Offline
Absolutely I'm keen to test it out in 17.5 as soon as it's ready

Thanks again!
MC
User Avatar
Staff
571 posts
Joined: May 2017
Offline
Support for setting GPU affinity for Redshift and OpenCL nodes has been added to the TOP Deadline scheduler in the Houdini 17.5.460 and 18.0.309 daily builds. This works pretty much the same as Deadline's Houdini plugin.

If you have your farm setup such that a Deadline worker has GPU affinity setting, then it will be respected and used when rendering with Redshift or OpenCL.

If you want to override this at a PDG job level, there is a new GPU Affinity Overrides section in the Job Parms tab on the TOP Deadline scheduler (as well as for each TOP processor node under Schedulers > Deadline Scheduler).
  • OpenCL Force GPU Rendering will use the GPU affinity set for each worker for OpenCL work.
  • GPUs Per Task will use the number of GPUs specified, as long as the worker is allowed to use them according to its GPU affinity setting in Deadline's Slave Properties.
  • Select GPU Devices will use the specified GPU device IDs. This is a comma-separated list (e.g. 0,2,5), but the device IDs must be a subset of what that worker is allowed to use according to its GPU affinity setting in Deadline's Slave Properties.

The easiest thing to do is to just set up the GPU affinity in Deadline's Slave Properties for each worker, then render away.
Edited by seelan - Dec. 5, 2019 09:26:23

Attachments:
dl_gpu.png (16.6 KB)

User Avatar
Member
30 posts
Joined: Dec. 2010
Offline
Awesome!
Thanks so much Seelan I will give it a try as soon as I have an opportunity to install the daily build!
MC