Hi All,
I believe I have found a workaround for the Race Condition issue that led SideFX to move the PDG Deadline Scheduler from Task-based to Job-based. At least it fixed the Deadline Task creation instability for us at Pixomondo.
At Pixomondo we believe in the power of collaboration, and I am glad to share the solution with the community!
The issue seems to lie in how the Scheduler uses AppendJobFrameRange.
The current implementation passes a comma-separated list of frames:
AppendJobFrameRange <JobID> 0,1,2,3
If you append frames via the Deadline Monitor UI, however, it seems to do something like this:
AppendJobFrameRange <JobID> 0-3
The two look mostly the same, but the difference appears when you want to append a single frame to a Job that already has several frames. For example, say you want to append frame 4 to a Job that already has frames 0, 1, 2 and 3.
The Scheduler's approach will be:
AppendJobFrameRange <JobID> 4
Deadline Monitor, on the other hand, sticks with ranges instead of lists:
AppendJobFrameRange <JobID> 0-4
Note that the whole new frame range is mentioned in the second example.
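To make the difference concrete, here is a minimal Python sketch of how the two frame arguments are built. The function names are hypothetical and only illustrate the strings passed to AppendJobFrameRange. They are not part of the scheduler code.

```python
def frames_as_list(task_ids):
    """Comma-separated frame list, the way the stock scheduler builds it."""
    return ','.join(str(i) for i in task_ids)


def frames_as_full_range(task_ids):
    """Whole range from frame 0 up to the last id, the way the Monitor seems to do it."""
    return f'0-{task_ids[-1]}'


# Appending frame 4 to a Job that already has frames 0..3:
print(frames_as_list([4]))        # 4
print(frames_as_full_range([4]))  # 0-4
```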
Appending frames this way seems to cure Tasks that were corrupted by Race Condition.
We were able to build a complex scene with a very high chance of reproducing the Race Condition issue using the default approach of appending frames as lists. With the new approach of appending frames as full ranges, we could no longer reproduce the issue. However, the nature of the Race Condition is still unclear and should be investigated by SideFX and/or Thinkbox.
This is just a workaround for one case of this Race Condition that we faced at PXO. There might be others that we are unaware of.
The line that you want to modify is in
<HOUDINI_INSTALL_DIR>/houdini/pdg/types/schedulers/tbdeadline.py
Just replace
frames = ','.join('{}'.format(str(i)) for i in task_ids)
with
frames = f'0-{task_ids[-1]}'
The patch assumes two important things in order to work properly:
- New frames are appended in strictly ascending order, meaning you can't append frame 5 before frame 4.
- The last member of the task_ids list is always the highest number.
I hope SideFX will tell us if these assumptions are incorrect, but no issues so far.
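If you want the patch to be less dependent on the second assumption, a slightly more defensive variant could look like the sketch below. The helper name build_full_range is hypothetical, and it assumes task_ids is a list of integers. It cannot check the first assumption, because the order of appends across separate calls is not visible from inside a single call.

```python
def build_full_range(task_ids):
    """Return a '0-N' frame range where N is the highest id in task_ids.

    Uses max() instead of task_ids[-1], so the result does not depend
    on the list being sorted. Raises ValueError on an empty list.
    """
    if not task_ids:
        raise ValueError('task_ids must contain at least one frame')
    return f'0-{max(task_ids)}'


# Same result as the one-line patch when the list is sorted...
print(build_full_range([0, 1, 2, 3]))  # 0-3
# ...and still the full range when it is not.
print(build_full_range([4, 2, 3]))     # 0-4
```

In tbdeadline.py this would stand in for the one-line replacement, e.g. frames = build_full_range(task_ids).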
The fix is published AS IS, but feel free to reach out to me here, via LinkedIn (Aleksei Garifov, can't post a link here) or via al.garifov@gmail.com if the fix works for you or if you are struggling with other issues related to this Race Condition.
Cheers,
Aleksei Garifov