top.py spawned by pdgjobcmd.py, and Tractor cancels

User Avatar
Member
27 posts
Joined: March 2020
Offline
I'm investigating houdini-engine licenses not being released by TractorScheduler farm jobs, and have a couple of questions about what I've found. This is Houdini 17.5.460, and the farm blades run CentOS 7.
A job (“PDG->Submit…”) has a main controlling task (“PDG->Cook…”), and all the various ropmantra_ropfetch frame tasks. The command for the controlling Cook task is something effectively like this (without full paths, setenvs and some other parms, etc):
hython pdgjobcmd.py --norcp ... <hython top.py --report none ...>
Tractor launches that on a blade, which results in hython running pdgjobcmd.py as, say, process PID=1234 and starting hserver to get an Engine license. That's the only process the Tractor blade knows about. pdgjobcmd.py eventually spawns the whole second hython…top.py command string (“shell_cmd”), which results in a separate hython process (which will share the Engine license checkout), say PID=5678, and waits for it to finish before exiting itself:
    proc = subprocess.Popen(shell_cmd, stdin=subprocess.PIPE, shell=True)
    proc.stdin.close()
    proc.wait()
The problem is when the pdgjobcmd.py 1234 process is caused to terminate for some reason - nothing will cause the top.py 5678 process to terminate as well. If we tell Tractor to cancel the job, for instance, it will do a “killsweep” on active tasks, and the blades will send SIGINT (I think) to the command process they launched. So the pdgjobcmd.py 1234 process will end, Tractor will get a non-zero return code from it, and think all is done. But the top.py 5678 process just keeps chugging away. I believe (at least on Linux) that it becomes a child of the system “init” process, which takes over the wait(). A mantra frame-cooking task might eventually finish and terminate its hython, although it would probably be better if it were cancelled at that point. But the top.py 5678 process of the main PDG->Cook controlling task is presumably going to sit there indefinitely, still consuming the Engine license, waiting for reports back from cook blades that will never come?

I would think that pdgjobcmd.py needs to take responsibility for the hython top.py process it spawned, and arrange for it to be terminated if it itself is terminated before the proc.wait() returns? I'm not sure what the most robust method would be, but saving the result from Popen and using a combination of atexit.register() and signal.signal() to have an exit function terminate the top.py process if necessary might be one approach?
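
To illustrate what I mean, here is a rough sketch of that approach - this is just my own illustration, not actual pdgjobcmd.py code, and the function/handler names are made up:

import atexit
import signal
import subprocess
import sys

def runAndBabysit(shell_cmd):
    # spawn the hython top.py command string, same as pdgjobcmd.py does now
    proc = subprocess.Popen(shell_cmd, stdin=subprocess.PIPE, shell=True)
    proc.stdin.close()

    def cleanupChild():
        if proc.poll() is None:    # spawned process still running?
            proc.terminate()       # or proc.send_signal(signal.SIGINT)

    def onSignal(signum, stackframe):
        cleanupChild()
        sys.exit(1)                # wrapper should still exit non-zero

    # cover both a normal interpreter exit and being signalled by the blade
    atexit.register(cleanupChild)
    signal.signal(signal.SIGINT, onSignal)
    signal.signal(signal.SIGTERM, onSignal)

    return proc.wait()

One wrinkle: with shell=True the Popen handle may be the shell rather than hython itself, so depending on the command string it might be necessary to launch the child in its own process group (e.g. preexec_fn=os.setsid) and signal the whole group with os.killpg() instead.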
User Avatar
Member
603 posts
Joined: Sept. 2016
Offline
Thanks, I've logged an RFE for this cleanup.
User Avatar
Member
27 posts
Joined: March 2020
Offline
Do not many people use TractorScheduler, at least with the “Submit Graph As Job” option? Interrupting or cancelling a Houdini job in Tractor can't be that uncommon. Doing so immediately locks up a license, makes any Tractor limit count on “hython” or whatever (which we configure to the number of Engine licenses) incorrect, lets tasks dispatch and then immediately error out because a license won't be available, and generally messes up a Houdini farm. I'm really surprised this hasn't caused people problems long before now.

For anyone interested, this is how I'm working around it for now. We created a modified copy of “pdgjobcmd.py”, which we store in our $HSITE area. To avoid having to replace the original on every workstation and farm blade, our HSITE “pythonrc.py” startup code monkey-patches the PyScheduler class with a modified _copyJobSupportFiles() method that copies our modified pdgjobcmd.py instead of the stock Houdini one, something like this:

import os

def ourReplacement_copyJobSupportFiles(self):
    # transfer our patched copy instead of the stock
    # '$HFS/houdini/python2.7libs/pdgjob/pdgjobcmd.py'
    self.transferFile(os.path.expandvars('$HSITE/patches/pdgjobcmd.py'))
    self.transferFile(os.path.expandvars(
        '$HFS/houdini/python2.7libs/pdgjob/pdgcmd.py'))

from pdg.scheduler import PyScheduler as thingToPatch
refToPatch = "_copyJobSupportFiles"
newVal = ourReplacement_copyJobSupportFiles
origVal = getattr(thingToPatch, refToPatch, None)
setattr(thingToPatch, refToPatch, newVal)
The mods to pdgjobcmd.py are something like this:
#patch just before the subprocess.Popen() call in original pdgjobcmd.py code
#(and add an 'import signal' near the top of pdgjobcmd.py if it isn't there already)
proc = None
def abnormalExitHandler(signum, stackframe):
    print("in abnormalExitHandler() for signal %s" % (signum))
    if proc and proc.poll() is None:  # make sure the spawned process is still running
        proc.send_signal(signal.SIGINT)
# end of first patch lines

proc = subprocess.Popen(shell_cmd,
                stdin=subprocess.PIPE,
                shell=True)
# Avoid inheriting stdin to avoid python bug on Windows 7
# https://bugs.python.org/issue3905
proc.stdin.close()

# a couple more patch lines
signal.signal(signal.SIGINT, abnormalExitHandler)
signal.signal(signal.SIGTERM, abnormalExitHandler)
# end of patch

proc.wait()

#should undo signal handlers here
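
For what it's worth, a sketch of how that undo could look (my own suggestion, not part of the patch above): signal.signal() returns the previously installed handler, so the last few patch lines could be written as:

# keep the previous handlers and restore them once the spawned process has finished
origIntHandler = signal.signal(signal.SIGINT, abnormalExitHandler)
origTermHandler = signal.signal(signal.SIGTERM, abnormalExitHandler)
try:
    proc.wait()
finally:
    # a previous handler of None means it wasn't installed from Python;
    # fall back to the default handler in that case
    signal.signal(signal.SIGINT, origIntHandler or signal.SIG_DFL)
    signal.signal(signal.SIGTERM, origTermHandler or signal.SIG_DFL)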
It could probably be more robust, and since our farm is Linux-only I haven't looked into any Windows process quirks, but in a couple of quick tests it has worked to kill the spawned top.py when Tractor terminates the main cooking controller task.
User Avatar
Member
1737 posts
Joined: May 2006
Offline
We're trying to use Tractor and TOPs, and would like to use that ‘submit as job’ feature, but have had a few other issues crop up (as well as the general daily distraction of production fun, teaching students, etc.).

Hoping that as the need to push stuff to the farm mounts, we'll focus more on TOPs. When that happens, expect more posts here with more questions about Tractor.
http://www.tokeru.com/cgwiki
https://www.patreon.com/mattestela
User Avatar
Member
603 posts
Joined: Sept. 2016
Offline
davidoberst
Interrupting or cancelling a Houdini job in Tractor can't be that uncommon
I think it usually doesn't show up as a problem because when Tractor does the kill-sweep it kills all the running tasks, and since the PDG cook is polling the task states it will generally cascade failures and terminate itself quickly. How are you doing the cancel? In any case your fix makes sense, so we'll look at backporting something like that ASAP.
User Avatar
Member
603 posts
Joined: Sept. 2016
Offline
This should now be fixed in 18.0.404 and 17.5.554.
User Avatar
Member
27 posts
Joined: March 2020
Offline
chrisgreb
This should now be fixed in 18.0.404 and 17.5.554.
The changelog shows a SIGINT/SIGTERM fix in 18.0.403 (March 11/2020). But I can't seem to find an equivalent entry for 17.5.554? Did this make it into the 17.5 branch?

Also, the last production build of 17.5 is 17.5.460 (from Dec 5/2019), although there is a daily as recent as 17.5.631 from May 27. Are there plans for another production build of 17.5?
User Avatar
Member
603 posts
Joined: Sept. 2016
Offline
davidoberst
Did this make it into the 17.5 branch?

Also, the last production build of 17.5 is 17.5.460 (from Dec 5/2019), although there is a daily as recent as 17.5.631 from May 27. Are there plans for another production build of 17.5?

It's there:
https://www.sidefx.com/changelog/?journal=17.5&categories=54&body=&version=&build_0=&build_1=&show_versions=on&show_compatibility=on&items_per_page=

There are no plans right now for another production build.