After a week I have a better sense of PDG and Tractor. It's definitely more usable than it's ever been, but there are a few things I feel need a bit of work. Only a handful are bugs; most are workflow ideas that I'd like to discuss here with others before wrapping them up as proper RFEs. It's a big list, so SideFX folk, let me know if you'd prefer I break this up:
- RFE - visual indicator for which is the default scheduler. It's stored on the top of the topnet itself, but it seems silly to have to keep jumping up and down to check that. Maybe an orange output circle on the default scheduler, or have it in text form on the top status bar
- RFE - Better python fallbacks, or a dropdown list of likely options? I.e. if $PYTHON is empty, look for /usr/bin/python. That said, I've found almost nothing works with tractor unless you use hython, so maybe that should be the default?
- RFE - errors should be better wrapped for the more common issues. Better to have a single line that says ‘cannot connect to message queue, check the callback and relay ports are valid’ instead of trying to decipher a wall of text python traceback.
- RFE - better preflight warnings, feels like a lot of stuff could be caught and flagged before stuff gets sent to the farm, eg ‘tractorscheduler library not found’?
- RFE - potential preflight to warn on fetch tops ‘hey this is a file cache, if this is after a sim you want to turn on all frames in one batch’
- RFE - option for a timeout limit on the MQ. The tractor job for the MQ kills itself if it can't talk to the artist machine after 15 mins, which is too long to hog a blade on a busy farm.
- RFE - the log for ‘workItemState.CookedCache’ isn't clear… something like ‘output exists, skipping recook’?
- RFE - MQ should resubmit failed items. A big one for me, I'm manually resubmitting failed jobs, that doesn't feel right. Someone said it used to do this, is that true?
- RFE - Tractor Scheduler options for retry attempts, timeout limit. Maybe even a smart timeout limit option, look at average of frames around it?
- RFE - Tractor Scheduler memory requirement options. ‘this flip sim needs 128gb ram’, or is that expected to be handled by service keys?
- RFE - USD Rop, ‘Error: Layer saved to a location generated from a node path’ should be a warning, not an error. It's non-fatal, but because it's flagged as an error it can halt downstream pdg tasks.
- RFE - option for tractor settings per top node rather than globally on the scheduler. E.g. a flip sim requires a blade to itself and all the ram, some procedural rock generators can pack 16 items to a blade, and a renderman render might pack 2 items to a blade. If they're all chained together, they all have to share the same tractor parameters.
- RFE - workflow should be more automated. Because frames don't retry themselves, and resubmitting often fails the first go (see the bug listed below), I spend a lot of time doing a submit, wait, clear temp, resubmit to catch errored frames, repeat until all frames are done.
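A rough sketch of what the retry and smart-timeout RFEs above could look like, in plain Python. Nothing here is a PDG or Tractor API: submit and clear_temp are hypothetical callables standing in for "cook on the farm and return the frames that failed" and "delete temp directory", and the window, factor and fallback values are made-up defaults.

```python
from statistics import median


def smart_timeout(durations, frame, window=2, factor=3.0, fallback=3600.0):
    """Guess a timeout (seconds) for `frame` from nearby finished frames.

    `durations` maps frame number -> runtime in seconds for frames that
    have already finished. All default values are illustrative assumptions.
    """
    nearby = [
        durations[f]
        for f in range(frame - window, frame + window + 1)
        if f != frame and f in durations
    ]
    # With no neighbours to compare against, fall back to a fixed limit.
    return factor * median(nearby) if nearby else fallback


def cook_until_done(frames, submit, clear_temp, max_attempts=3):
    """Automate the submit / wait / clear temp / resubmit loop.

    `submit(frames)` is a hypothetical callable that blocks until the farm
    job ends and returns the frames that failed. `clear_temp()` is a
    hypothetical stand-in for 'delete temp directory' on the scheduler.
    Returns the set of frames still failing after the last attempt.
    """
    pending = set(frames)
    for attempt in range(max_attempts):
        if not pending:
            break
        if attempt > 0:
            # Resubmits tend to fail unless the temp directory is cleared.
            clear_temp()
        pending = set(submit(sorted(pending)))
    return pending
```

The timeout uses the median rather than the mean so that one frame that hung for an hour doesn't inflate the limit for its neighbours.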
And what I think are bugs, I'll send these to support:
- BUG - ‘cook output node’ twice will fail 99% of the time, with MQ ‘connection refused’ or similar. Right-click and ‘delete temp directory’ on the tractor scheduler fixes it most times, but this should be automatic
- BUG - inherit local environment doesn't seem to work as expected. It should be my local environment, baked to the farm, right? If so, why do renders fail?
- BUG - ‘delete files on disk’ doesn't work for fetch nodes linked to file cache sops
- BUG - ffmpeg top crashes houdini
- BUG - tractor can't seem to handle more than about 50 jobs at once. More than 60 mantra jobs at once, or 40 renderman jobs at once, hang forever on the farm, no warning, no timeout. Setting max limit to 50 fixes it, but then we're not utilising the farm effectively. Are other folk seeing this?
- BUG - once the tractor scheduler gets an error badge, it stays there, can't be reset
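To make the error-wrapping RFE concrete against the ‘connection refused’ bug above, here's a minimal sketch of the idea in plain Python. The two one-line messages are the ones suggested in the list; the regex patterns are guesses at what the tracebacks contain, not text taken from real PDG or Tractor logs, and summarise is a hypothetical helper name.

```python
import re

# (pattern, one-line hint) pairs. The patterns are assumptions about what
# the traceback text looks like; adjust them to match real logs.
HINTS = [
    (
        re.compile(r"connection refused", re.IGNORECASE),
        "cannot connect to message queue, check the callback and relay "
        "ports are valid",
    ),
    (
        re.compile(r"No module named .*tractor", re.IGNORECASE),
        "tractorscheduler library not found",
    ),
]


def summarise(traceback_text):
    """Return a single readable line for a wall-of-text traceback."""
    for pattern, hint in HINTS:
        if pattern.search(traceback_text):
            return hint
    # No known pattern: fall back to the last non-empty traceback line,
    # which is normally the exception type and message.
    lines = [ln.strip() for ln in traceback_text.splitlines() if ln.strip()]
    return lines[-1] if lines else "unknown error"
```

The full traceback would still want to be available one click away, with this line shown first.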
More info and swearing can be found on the wiki page above; any questions, just shout. It's the first time I've seen the system work, and I have a feeling that it could be a powerful and reliable production tool, but right now there are quite a few rough edges that need some love!
Cheers,
-matt