Tops and Tractor diary (plus bugs/rfe/discussion points)
mestela
Made a little ranty diary of trying to get this all to work, with a summary of what we needed to adjust for our setup. Will keep building on this as we try more complex setups, but it's pleasing to see that so far it's working better than our many prior attempts!

https://www.tokeru.com/cgwiki/index.php?title=HoudiniTops#Tops_and_tractor_diary
mestela
After a week I have a better sense of pdg and tractor. It's definitely more usable than it's ever been, but there are a few things I feel need a bit of work. Only a handful are bugs; most are workflow ideas that I'd like to discuss here with others before wrapping up as proper RFEs. It's a big list, so let me know, sidefx folk, if you'd prefer I break this up:

  • RFE - visual indicator for which is the default scheduler. It's stored on the topnet node itself, but it seems silly to have to keep jumping up and down to check that. Maybe an orange output circle on the default scheduler, or have it in text form on the top status bar
  • RFE - Better python fallbacks, or a dropdown list of likely options? I.e. if $PYTHON is empty, look for /usr/bin/python; but I've found almost nothing works with tractor unless you use hython, so maybe that should be the default? (Rough sketch of the fallback order I mean after this list.)
  • RFE - errors should be better wrapped for the more common issues. Better to have a single line that says ‘cannot connect to message queue, check the callback and relay ports are valid’ instead of trying to decipher a wall of text python traceback.
  • RFE - better preflight warnings; feels like a lot of issues could be caught and flagged before anything gets sent to the farm, eg ‘tractorscheduler library not found’?
  • RFE - potential preflight to warn on fetch tops ‘hey this is a file cache, if this is after a sim you want to turn on all frames in one batch’
  • RFE - option for a timeout limit on the MQ, so the tractor job for the MQ kills itself if it can't talk to the artist machine; waiting 15 mins is too long to hog a blade on a busy farm.
  • RFE - the log for ‘workItemState.CookedCache’ isn't clear… something like ‘output exists, skipping recook’?
  • RFE - MQ should resubmit failed items. A big one for me, I'm manually resubmitting failed jobs, that doesn't feel right. Someone said it used to do this, is that true?
  • RFE - Tractor Scheduler options for retry attempts and timeout limit. Maybe even a smart timeout option that looks at the average of the frames around it?
  • RFE - Tractor Scheduler memory requirement options. ‘this flip sim needs 128gb ram’, or is that expected to be handled by service keys?
  • RFE - USD Rop, ‘Error: Layer saved to a location generated from a node path’ should be a warning, not an error - It's non fatal, but currently being flagged as an error can halt downstream pdg tasks
  • RFE - option for tractor settings per top node rather than globally on the scheduler. Eg a flip sim requires a blade to itself, all the ram, some procedural rock generators can pack 16 items to a blade, a renderman render might pack 2 items to a blade. If they're all chained together, they all have to share the same tractor parameters.
  • RFE - workflow should be more automated. Because frames don't retry themselves, and resubmitting often fails the first go (see the bug listed below), I spend a lot of time doing a submit, wait, clear temp, resubmit to catch errored frames, repeat until all frames are done.
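
To flesh out the python fallback idea above, here's a rough standalone sketch (python 3, names purely illustrative) of the lookup order I'd like; this is just what I mean, not how the scheduler actually resolves it:

    import os
    import shutil

    def guess_farm_python():
        # Fallback order I'd like: $PYTHON if it's set, then hython
        # (the only thing I've had work reliably with tractor), then
        # a plain system python as a last resort.
        if os.environ.get("PYTHON"):
            return os.environ["PYTHON"]
        hython = shutil.which("hython")
        if hython:
            return hython
        if os.path.exists("/usr/bin/python"):
            return "/usr/bin/python"
        return None

    print(guess_farm_python())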

And what I think are bugs, I'll send these to support:

  • BUG - ‘cook output node’ twice will fail 99% of the time, with an MQ ‘connection refused’ or similar. Right-click and ‘delete temp directory’ on the tractor scheduler fixes it most times, but this should be automatic
  • BUG - inherit local environment doesn't seem to work as expected. It should be my local environment, baked to the farm, right? If so, why do renders fail?
  • BUG - ‘delete files on disk’ doesn't work for fetch nodes linked to file cache sops
  • BUG - ffmpeg top crashes houdini
  • BUG - tractor can't seem to handle more than about 50 jobs at once. More than 60 mantra jobs at once, or 40 renderman jobs at once, hangs forever on the farm, with no warning and no timeout. Setting the max limit to 50 fixes it, but then we're not utilising the farm effectively. Are other folk seeing this?
  • BUG - once the tractor scheduler gets an error badge, it stays there, can't be reset

More info and swearing can be found on the wiki page above; any questions, just shout. It's the first time I've seen the system work, and I have a feeling it could be a powerful and reliable production tool, but right now there are quite a few rough edges that need some love!

Cheers,

-matt
davidoberst
For anyone using Tractor 2.2, note that the 2.3 release notes [rmanwiki.pixar.com] indicate a pair of 2.2 bug fixes relating to the “expand chunk” mechanism for dynamically creating tasks.
  • Address a task state transition race condition in some “expand chunk” use cases.
  • Fixed full job restart pruning of previously expanded tasks that were created by the “expand chunk” mechanism.
The Houdini PDG source code does indeed have references to using TR_EXPAND_CHUNK. On our 2.2 farm, we are seeing occasional tasks in Houdini tractor scheduler jobs whose logs indicate they finished successfully (and they did), but Tractor still shows their state as something other than done, so the job wrongly thinks it is still waiting to continue even though it has nothing left to do. This certainly sounds like the “task state transition” bug that 2.3 apparently fixes.
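
As a side note, if anyone wants to spot these stuck-but-actually-finished tasks without staring at the dashboard, Tractor's python query API can give a rough report. This is an untested sketch; the engine host/port and job id are placeholders, and check the Tractor docs for the exact query syntax (and whether your engine needs a login first):

    import tractor.api.query as tq

    # Point the query module at your engine (placeholder host/port).
    tq.setEngineClientParam(hostname="tractor-engine", port=80)

    jid = 1234567  # the PDG-submitted job you suspect is stuck
    for task in tq.tasks("jid=%d" % jid):
        # Rows come back as dictionaries; print anything not marked done.
        if task.get("state") != "done":
            print(task.get("tid"), task.get("title"), task.get("state"))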
mestela
Heh, we just jumped to tractor 2.4 yesterday. So far no surprises, but I'll keep an eye out!
chrisgreb
Tractor 2.4 actually has more related fixes. I think this fix is for a bug we reported a while back for 2.3. The workaround in place now is to rate-limit the expanded job submissions.

A more robust job-global system for sorting newly ready commands produced by “expand” tasks. This change addresses the “Cmd not Ready?” error problem - which was due to sorting key collisions (precision) on large recursively expanded jobs.
davidoberst
For the rate-limiting you mention, is it still creating all the tasks at once, but just putting some sort of delay in the loop so there's a bit more time between each one being given to Tractor? Or is it creating a batch of tasks, then waiting until most of those are done before creating the next batch, so that Tractor doesn't have all the tasks pending at once?

You don't happen to know when this workaround was introduced? I couldn't seem to find a changelog entry that sounded like that.
chrisgreb
davidoberst
For the rate-limiting you mention, is it still creating all the tasks at once, but just putting some sort of delay in the loop so there's a bit more time between each one being given to Tractor? Or is it creating a batch of tasks, then waiting until most of those are done before creating the next batch, so that Tractor doesn't have all the tasks pending at once?

You don't happen to know when this workaround was introduced? I couldn't seem to find a changelog entry that sounded like that.

It's the former. It was added back in 17.5.346

Added batching of job submissions to the Tractor binding. Tasks are submitted using the TR_EXPAND_CHUNK method, and each such file will contain up to some number of task specs (50 by default), at some minimum period (1 second by default). This can be overridden with environment variables on the job side: $PDG_TR_TASKS_PER_TICK and $PDG_TR_TICK_PERIOD
This change has been added to avoid an error where the Tractor engine would not progress with a job, and would continually add error messages to the engine log with the prefix “assigner Cmd not Ready?”.
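
For anyone wanting to experiment with those knobs, a minimal sketch, assuming the farm job picks up the environment of the submitting Houdini session (e.g. via the scheduler's inherit-local-environment behaviour); the values are examples, not recommendations:

    import os

    # Defaults per the changelog above are 50 task specs per chunk file
    # and a 1 second minimum period; these example values loosen that.
    os.environ["PDG_TR_TASKS_PER_TICK"] = "100"
    os.environ["PDG_TR_TICK_PERIOD"] = "0.5"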