Please help me understand PDGDeadline

User Avatar
Member
8982 posts
Joined: July 2007
Offline
Are the partitions you are talking about something on Deadline side?
If you mean wrapping all work items into a single partition on the TOP side, then I imagine you lose granular dependencies on input work items; it essentially becomes wait-for-all, which can be undesirable in many scenarios. It's also not very friendly towards truly dynamic work items, or mixing with other types of partitioning, since this sounds like an otherwise unnecessary additional partition wrapper, and I don't know if you can nest partitions.
Tomas Slancik
FX Supervisor
Method Studios, NY
User Avatar
Member
41 posts
Joined: May 2019
Offline
Indeed, we mean waiting for all or part of the upstream tasks. Partitions could contain batches that are translated directly to batches in Deadline. The way Deadline works, it takes a frame range and a path template, and determines the job tasks itself.
It's true that we might lose some efficiency this way, but it will be solid and avoid the race condition of dynamically producing Deadline tasks.
Edited by monomon - Oct. 8, 2024 03:57:20
User Avatar
Member
85 posts
Joined: November 2017
Offline
Partitions are on the Houdini side: when we have work items that are per frame, we can partition them into work items that are per job. Deadline needs the frame range for the job, and we can provide that from a work item attribute.
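
As a minimal sketch of that idea (a hypothetical helper, not the actual PDG API): the per-job work item only needs the minimum and maximum frame of the per-frame items it wraps.

def frame_range_for_job(frames):
    # frames: the frame numbers of the per-frame work items in one partition
    return f'{min(frames)}-{max(frames)}'

frame_range_for_job([1001, 1002, 1003, 1004])  # -> '1001-1004'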
User Avatar
Staff
1286 posts
Joined: July 2005
Offline
Hi All,

Just an update. We recently had discussions with AWS Thinkbox regarding the challenges that Houdini users are facing with the new H20.5 PDG Deadline Scheduler. In light of these discussions, as well as new limitations reported by users, we have decided to add a toggle to the PDG Deadline Scheduler that will enable you to switch between the H20.0 behaviour of one-job-of-many-tasks (which will be the default) and the H20.5 behaviour of one-batch-of-many-jobs.

Our aim is to roll the new toggle into an upcoming H20.5 production build update, hopefully some time near the end of November or the beginning of December.

Cheers,
Rob
User Avatar
Member
56 posts
Joined: August 2017
Offline
That's awesome. I'm definitely going to use the toggle and try my luck with the old approach. It's good to have both solutions.
Edited by alexmajewski - Nov. 1, 2024 06:36:51
User Avatar
Member
10 posts
Joined: August 2017
Offline
rvinluan
Hi All,

Just an update. We recently had discussions with AWS Thinkbox regarding the challenges that Houdini users are facing with the new H20.5 PDG Deadline Scheduler. In light of these discussions, as well as new limitations reported by users, we have decided to add a toggle to the PDG Deadline Scheduler that will enable you to switch between the H20.0 behaviour of one-job-of-many-tasks (which will be the default) and the H20.5 behaviour of one-batch-of-many-jobs.

Our aim is to roll the new toggle into an upcoming H20.5 production build update, hopefully some time near the end of November or the beginning of December.

Cheers,
Rob

Hey @rvinluan,

Could I ask you a question regarding Deadline Scheduler speeds compared to the Local Scheduler? You seem to be the right person to ask this type of question.

I am comparing a very fast Axiom sim that takes 5 seconds to simulate 240 frames. With the Local Scheduler via TOPs the sim takes around 10-20 seconds, while the same sim takes about 1.5 minutes via the Deadline Scheduler (1 minute and 40 seconds, to be precise). The difference is that the Local Scheduler prepares 20-30 tasks on the fly (at least it looks like that in the UI), while the Deadline Scheduler always has only 1-2 items going, plus overhead. I deliberately don't use "Cook Frames As Single Work Item" for the sim here, as an example of the speed difference between the schedulers. Is it supposed to work like this?
I am on H20.5.410 and Deadline 10.4 btw.

Thanks in advance!
Edited by lavrenovlad - Dec. 7, 2024 13:08:36

Attachments:
localscheduler.png (140.2 KB)
deadlinescheduler.png (140.1 KB)

User Avatar
Member
85 posts
Joined: November 2017
Offline
You can try running the Axiom sim as a single work item, so it goes to a single machine on the farm. You can use the Frames per Batch parameter on the ROP Fetch for that.
User Avatar
Member
10 posts
Joined: August 2017
Offline
HristoVelev
You can try running the Axiom sim as a single work item, so it goes to a single machine on the farm. You can use the Frames per Batch parameter on the ROP Fetch for that.

Yeah, I know that; it works pretty well for sims. I was just testing the speeds of both schedulers in general, and asking whether that difference is normal or whether it's something I set up incorrectly on my side. My Deadline is set up locally, with no network delays or anything, so I'd assume it would be fast.
Edited by lavrenovlad - Dec. 9, 2024 07:18:39
User Avatar
Member
85 posts
Joined: November 2017
Offline
Each task boots up a new Houdini process, so for short tasks the overhead is significant.
User Avatar
Staff
1286 posts
Joined: July 2005
Offline
lavrenovlad
Hey @rvinluan,

Could I ask you a question regarding Deadline Scheduler speeds compared to the Local Scheduler? You seem to be the right person to ask this type of question.

I am comparing a very fast Axiom sim that takes 5 seconds to simulate 240 frames. With the Local Scheduler via TOPs the sim takes around 10-20 seconds, while the same sim takes about 1.5 minutes via the Deadline Scheduler (1 minute and 40 seconds, to be precise). The difference is that the Local Scheduler prepares 20-30 tasks on the fly (at least it looks like that in the UI), while the Deadline Scheduler always has only 1-2 items going, plus overhead. I deliberately don't use "Cook Frames As Single Work Item" for the sim here, as an example of the speed difference between the schedulers. Is it supposed to work like this?
I am on H20.5.410 and Deadline 10.4 btw.


Thanks in advance!

As @HristoVelev mentioned, each task boots up a Houdini process, which adds overhead to the overall time and can be relatively significant for short tasks. You can batch tasks/frames together or use PDG Services (https://www.sidefx.com/docs/houdini/tops/services.html) to help reduce the overhead attributed to starting up processes.
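
As a rough illustration of how much of that overhead is process startup (the numbers below are assumptions, not measurements):

# Back-of-envelope with assumed numbers (the 20 s boot time is a guess):
boot_secs = 20.0   # assumed time to start one Houdini process per task
sim_secs = 5.0     # total sim work for all frames, per the post above
frames = 240

per_frame_tasks = frames * boot_secs + sim_secs  # ~4805 s of aggregate farm time
single_batch = 1 * boot_secs + sim_secs          # ~25 s with Frames per Batch = 240

That is aggregate farm time rather than wall-clock time, but with only 1-2 workers picking up tasks, wall-clock time tracks it closely.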

In general, I would expect Deadline scheduling to be slower than Local scheduling. There is overhead with submitting jobs, waiting for Deadline to provision and assign worker nodes to tasks, and then waiting for the workers to pick up the tasks and execute the task commands. I can't really say how much slower; it varies, but it's definitely slower.

Judging by your attached screenshots, it looks like you may only have 1-2 Deadline workers on the farm, compared to many "slots" available when performing local scheduling. Note that the number of concurrent tasks is determined by the Total Slots parameter on the Local Scheduler TOP node (https://www.sidefx.com/docs/houdini/nodes/top/localscheduler.html#maxprocsmenu) for local scheduling, and by the number of available Deadline workers for Deadline scheduling. There is a Concurrent Tasks parameter on the Deadline Scheduler TOP node (https://www.sidefx.com/docs/houdini/nodes/top/deadlinescheduler.html#deadline_concurrenttasks) that you can set to control or increase the number of tasks running concurrently on your workers, but it's currently broken (I'm working on a fix).

Cheers,
Rob
User Avatar
Staff
1286 posts
Joined: July 2005
Offline
While I'm on here, I'll provide an update to the Deadline scheduling changes I mentioned earlier in this forum thread. I've added a new toggle parameter to the Deadline Scheduler TOP node to control whether you want pre-Houdini 20.5 scheduling of one-job-of-many-tasks or the new Houdini 20.5 scheduling of one-batch-of-many-jobs. The changes are currently in an in-house development build and are undergoing testing. I'm hoping to roll the changes into a Houdini 20.5 build very soon.

With pre-Houdini 20.5 scheduling, the Concurrent Tasks parameter will work once again.

Cheers,
Rob
User Avatar
Member
85 posts
Joined: 11月 2017
Offline
Great, looking forward to that build!
User Avatar
Staff
1286 posts
Joined: July 2005
Offline
Hi All,

FYI, starting in tomorrow's Houdini 20.5.452 build, the Deadline Scheduler TOP node will behave as it did pre-Houdini 20.5 and will schedule work items as tasks by default. There will also be a new Schedule Work Items As Jobs toggle parameter on the Deadline Scheduler that, when checked on, switches the scheduling mode back to scheduling work items as separate jobs. The toggle parameter provides a workaround for anyone who experiences dropped tasks on Deadline as a result of high concurrent activity on the farm.

Cheers,
Rob
User Avatar
Member
1 posts
Joined: December 2024
Offline
Hi All,

I believe I found a workaround for the race condition issue that was the reason for SideFX moving from the task-based to the job-based PDG Deadline Scheduler. At least it fixed the Deadline task creation instability for us at Pixomondo.

At Pixomondo we believe in the power of collaboration, and I am glad to share the solution with the community!

The issue seems to be located in how the scheduler uses AppendJobFrameRange.

The current implementation passes a list of frames separated by commas:
AppendJobFrameRange <JobID> 0,1,2,3
while if you try to append frames via the Deadline Monitor UI, it seems to do something like this:
AppendJobFrameRange <JobID> 0-3

It looks mostly the same, but the difference appears when you want to append a single frame to a job that already has several frames. For example, say you want to append frame 4 to a job that already has frames 0, 1, 2 and 3.

The scheduler's approach will be:
AppendJobFrameRange <JobID> 4
while the Deadline Monitor will stick with the idea of ranges instead of lists:
AppendJobFrameRange <JobID> 0-4

Note that the entire new frame range is specified in the second example.

Appending frames this way seems to cure tasks that were corrupted by the race condition.
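
My guess at why this cures tasks that were already corrupted (speculation on my part, not confirmed by Thinkbox): a full-range append re-states every frame seen so far, so it can repair an earlier append that was lost to the race, whereas a list-style append only ever sends the new frame. For example, suppose the append for frame 3 was lost. Then
AppendJobFrameRange <JobID> 4
leaves the job with frames 0,1,2,4, while
AppendJobFrameRange <JobID> 0-4
restores frame 3 along with adding frame 4.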

We were able to create a complex scene with a very high chance of reproducing the race condition using the default append-frames-as-lists approach, and with the new append-frames-as-full-ranges approach we were no longer able to reproduce the issue.

However, the nature of the race condition is still unclear and should be investigated by SideFX and/or Thinkbox.
This is just a workaround for one case of the race condition that we faced at PXO; there might be others that we are unaware of.

The line that you want to modify is in <HOUDINI_INSTALL_DIR>/houdini/pdg/types/schedulers/tbdeadline.py:
Just replace
frames = ','.join('{}'.format(str(i)) for i in task_ids)
with
frames = f'0-{task_ids[-1]}'

Two important things are assumed for the patch to work properly:
  1. New frames are appended in strictly ascending order, meaning you can't append frame 4 after frame 5.
  2. The last member of the task_ids list is always the highest number.

I hope SideFX will tell us if these assumptions are incorrect, but no issues so far.
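
If you apply the patch, here is a slightly more defensive variant (my own sketch, not the shipped code) that fails loudly when either assumption is violated instead of silently producing a wrong range:

def format_frame_range(task_ids):
    # Assumes frames start at 0 and task_ids is strictly ascending
    # (the two assumptions above); raises instead of corrupting the range.
    if not task_ids:
        raise ValueError('task_ids must not be empty')
    if any(a >= b for a, b in zip(task_ids, task_ids[1:])):
        raise ValueError('task_ids must be strictly ascending')
    return f'0-{task_ids[-1]}'

frames = format_frame_range(task_ids)  # drop-in for the patched line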

The fix is published AS IS, but feel free to reach out to me here, via LinkedIn (Aleksei Garifov, can't post a link here) or via al.garifov@gmail.com if the fix works for you or if you're struggling with other issues related to this race condition.

Cheers,
Aleksei Garifov
User Avatar
Staff
1286 posts
Joined: July 2005
Offline
algarifov
Hi All,

I believe I found a workaround for the race condition issue that was the reason for SideFX moving from the task-based to the job-based PDG Deadline Scheduler. At least it fixed the Deadline task creation instability for us at Pixomondo.

At Pixomondo we believe in the power of collaboration, and I am glad to share the solution with the community!

The issue seems to be located in how the scheduler uses AppendJobFrameRange.

....


Hi Aleksei,

This is a fantastic analysis! What's even more interesting is that the code in the Deadline Scheduler, which executes the `AppendJobFrameRange` command, has a comment that reads:
frame_range should be a string with format: start-end Examples: '0-5' '1-4', '10-20'
but the actual code formats the frame range differently (i.e. 0,1,2,3,4), as you pointed out.

The comment doesn't explain why the frame range should be formatted that way (of course), but it makes me wonder: did we discover this workaround years ago when the Deadline Scheduler was first implemented, and then change the way we specified the frame range without realizing we were re-introducing the race condition? For what it's worth, the Deadline documentation doesn't even mention what the frame range format should be, so it's unclear to me where we obtained the information on the frame range format. Perhaps it was just based on observations of the Deadline Monitor UI at the time.

Anyway, I'll reach out to our Thinkbox contacts and get their thoughts on this workaround.

As for the proposed code change, I think your two assumptions are correct but I can't say for sure. I'll discuss this with the rest of the PDG dev team here.

Thanks for this!

Cheers,
Rob
User Avatar
Member
41 posts
Joined: May 2019
Offline
Hey Aleksei, kudos on the work.

It might still be vulnerable to the race condition, but it should at least improve the outcome. The assumption of frames being ascending would hold the vast majority of the time.

The race condition goes something along these lines:
- Several tasks are submitted at around the same time.
- Each task has a view of the deadline job's properties.
- If tasks A and B both submit their version of the Deadline job, then A's state can overwrite B's in the repository; A's state never saw B.
Each task does something like this (pseudocode):
job = GetJob()
job.frames = job.frames + myFrame
job.Save()

One possible solution would be to use a shared Queue, which serializes access to the job - A and B are submitted to the same queue, which executes them in order. The job is "refreshed" (re-obtained from the repo) between the executions.
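
As a rough sketch of that idea (hypothetical Python with an in-memory dict standing in for the Deadline repository, not Deadline's actual API):

import queue
import threading

# Hypothetical stand-in for the job state held in the Deadline repository:
jobs = {'job-123': {'frames': [0, 1, 2, 3]}}

def frame_append_worker(q):
    # Single consumer: all appends are applied in order, so each one
    # sees the frames written by the previous one (no lost updates).
    while True:
        item = q.get()
        if item is None:                          # sentinel to stop the worker
            break
        job_id, frame = item
        job = jobs[job_id]                        # "refresh" the job state
        job['frames'] = job['frames'] + [frame]   # update and save back

appends = queue.Queue()
worker = threading.Thread(target=frame_append_worker, args=(appends,))
worker.start()
appends.put(('job-123', 4))                       # producers just enqueue
appends.put(None)
worker.join()
print(jobs['job-123']['frames'])                  # [0, 1, 2, 3, 4]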

Ideally, Deadline would have an "Add task" function. But that's a bit semantically fraught, because, e.g., what happens if the job has already completed and you add a task?
Edited by monomon - Jan. 24, 2025 00:44:33
User Avatar
Staff
1286 posts
Joined: July 2005
Offline
Hi Aleksei,

I heard back from Thinkbox, and as monomon alluded to, they said that the race condition can still happen even with the frame range format change.

Thinkbox said that no matter what frame range format (1,2,3 or 1-3) is passed into AppendJobFrameRange, the frame range string is then passed into FrameUtils.Parse(framelist, False) (https://docs.thinkboxsoftware.com/products/deadline/10.4/2_Scripting%20Reference/class_deadline_1_1_scripting_1_1_frame_utils.html#a38a00ffc9defc0c0eeb5a1e503516f9b), and the resulting int array is used by Deadline.
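
In other words, both formats should normalize to the same int array. A rough Python analogue of that parsing step (illustrative only; the real implementation is Deadline's C# FrameUtils, and this sketch ignores things like negative frames and step syntax):

def parse_frames(framelist):
    # Expand a Deadline-style frame list ('0,1,2,3' or '0-3') into ints.
    frames = []
    for part in framelist.split(','):
        if '-' in part:
            start, end = part.split('-')
            frames.extend(range(int(start), int(end) + 1))
        else:
            frames.append(int(part))
    return frames

parse_frames('0,1,2,3')  # [0, 1, 2, 3]
parse_frames('0-3')      # [0, 1, 2, 3]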

It's still possible your change helps reduce the frequency at which the race condition occurs, but it's hard to say. It would be great if others who have hit the race condition could confirm whether the patch improves their overall experience.

Cheers,
Rob