top.py spawned by pdgjobcmd.py, and Tractor cancels

User Avatar
Member
27 posts
Joined: March 2020
Offline
I'm investigating houdini-engine licenses not being released by TractorScheduler farm jobs, and have a couple of questions about what I've found. This is Houdini 17.5.460, and the farm blades run CentOS 7.
A job (“PDG->Submit…”) has a main controlling task (“PDG->Cook…”), and all the various ropmantra_ropfetch frame tasks. The command for the controlling Cook task is something effectively like this (without full paths, setenvs and some other parms, etc):
hython pdgjobcmd.py --norcp ... <hython top.py --report none ...>
Tractor launches that on a blade, which results in hython running pdgjobcmd.py as, say, process PID=1234 and starting hserver to get an Engine license. That's the only process the Tractor blade knows about. pdgjobcmd.py eventually spawns the whole second hython…top.py command string (“shell_cmd”), which results in a separate hython process (which will share the Engine license checkout), say PID=5678, and waits for it to finish before exiting itself:
    proc = subprocess.Popen(shell_cmd, stdin=subprocess.PIPE, shell=True)
    proc.stdin.close()
    proc.wait()
The problem is when the pdgjobcmd.py 1234 process is caused to terminate for some reason - nothing will cause the top.py 5678 process to terminate as well. If we tell Tractor to cancel the job, for instance, it will do a “killsweep” on active tasks, and the blades will send SIGINT (I think) to the command process they launched. So the pdgjobcmd.py 1234 process will end, Tractor will get a non-zero return code from it, and think all is done. But the top.py 5678 process just keeps chugging away. I believe (at least on Linux) that it becomes a child of the system “init” process, which takes over the wait(). A mantra frame-cooking task might eventually finish and terminate its hython, although it would probably be better if it were cancelled at that point. But the top.py 5678 process of the main PDG->Cook controlling task is presumably going to sit there indefinitely, still consuming the Engine license, waiting for reports back from cook blades that will never come?

I would think that pdgjobcmd.py needs to take responsibility for the hython top.py process it spawned, and arrange for it to be terminated if it itself is terminated before the proc.wait() returns? I'm not sure what the most robust method would be, but saving the result from Popen and using a combination of atexit.register() and signal.signal() to have an exit function terminate the top.py process if necessary might be one approach?
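
To illustrate what I mean, here is a rough sketch of that approach - this is just my own illustration, not actual pdgjobcmd.py code, and the function/handler names are made up:

import atexit
import signal
import subprocess
import sys

def runAndBabysit(shell_cmd):
    # spawn the hython top.py command string, same as pdgjobcmd.py does now
    proc = subprocess.Popen(shell_cmd, stdin=subprocess.PIPE, shell=True)
    proc.stdin.close()

    def cleanupChild():
        if proc.poll() is None:    # spawned process still running?
            proc.terminate()       # or proc.send_signal(signal.SIGINT)

    def onSignal(signum, stackframe):
        cleanupChild()
        sys.exit(1)                # wrapper should still exit non-zero

    # cover both a normal interpreter exit and being signalled by the blade
    atexit.register(cleanupChild)
    signal.signal(signal.SIGINT, onSignal)
    signal.signal(signal.SIGTERM, onSignal)

    return proc.wait()

One wrinkle: with shell=True the Popen handle may be the shell rather than hython itself, so depending on the command string it might be necessary to launch the child in its own process group (e.g. preexec_fn=os.setsid) and signal the whole group with os.killpg() instead.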
User Avatar
Member
603 posts
Joined: Sept. 2016
Offline
Thanks, I've logged an RFE for this cleanup.
User Avatar
Member
27 posts
Joined: March 2020
Offline
Do not many people use TractorScheduler, at least with the “Submit Graph As Job” option? Interrupting or cancelling a Houdini job in Tractor can't be that uncommon. Doing so immediately locks up a license, makes any Tractor limit count on “hython” or whatever (which we configure to the number of Engine licenses) incorrect, lets tasks dispatch and then immediately error out because a license won't be available, and generally messes up a Houdini farm. I'm really surprised this hasn't caused people problems long before now.

For anyone interested, this is how I'm working around it for now. We created a modified copy of “pdgjobcmd.py”, which we store in our $HSITE area. To avoid having to replace the original on every workstation and farm blade, our HSITE “pythonrc.py” startup code monkey-patches the PyScheduler class with a modified _copyJobSupportFiles() method that copies our modified pdgjobcmd.py instead of the stock Houdini one, something like this:

import os

def ourReplacement_copyJobSupportFiles(self):
    # transfer our patched copy instead of the stock
    # '$HFS/houdini/python2.7libs/pdgjob/pdgjobcmd.py'
    self.transferFile(os.path.expandvars('$HSITE/patches/pdgjobcmd.py'))
    self.transferFile(os.path.expandvars(
        '$HFS/houdini/python2.7libs/pdgjob/pdgcmd.py'))

from pdg.scheduler import PyScheduler as thingToPatch
refToPatch = "_copyJobSupportFiles"
newVal = ourReplacement_copyJobSupportFiles
origVal = getattr(thingToPatch, refToPatch, None)
setattr(thingToPatch, refToPatch, newVal)
The mods to pdgjobcmd.py are something like this:
#patch just before the subprocess.Popen() call in original pdgjobcmd.py code
#(and add an 'import signal' near the top of pdgjobcmd.py if it isn't there already)
proc = None
def abnormalExitHandler(signum, stackframe):
    print("in abnormalExitHandler() for signal %s" % (signum))
    if proc and proc.poll() is None:  # make sure the spawned process is still running
        proc.send_signal(signal.SIGINT)
# end of first patch lines

proc = subprocess.Popen(shell_cmd,
                stdin=subprocess.PIPE,
                shell=True)
# Avoid inheriting stdin to avoid python bug on Windows 7
# https://bugs.python.org/issue3905
proc.stdin.close()

# a couple more patch lines
signal.signal(signal.SIGINT, abnormalExitHandler)
signal.signal(signal.SIGTERM, abnormalExitHandler)
# end of patch

proc.wait()

#should undo signal handlers here
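
For what it's worth, a sketch of how that undo could look (my own suggestion, not part of the patch above): signal.signal() returns the previously installed handler, so the last few patch lines could be written as:

# keep the previous handlers and restore them once the spawned process has finished
origIntHandler = signal.signal(signal.SIGINT, abnormalExitHandler)
origTermHandler = signal.signal(signal.SIGTERM, abnormalExitHandler)
try:
    proc.wait()
finally:
    # a previous handler of None means it wasn't installed from Python;
    # fall back to the default handler in that case
    signal.signal(signal.SIGINT, origIntHandler or signal.SIG_DFL)
    signal.signal(signal.SIGTERM, origTermHandler or signal.SIG_DFL)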
It could probably be more robust, and since our farm is Linux-only I haven't looked into any Windows process quirks, but in a couple of quick tests it has worked to kill the spawned top.py when Tractor terminates the main cooking controller task.
User Avatar
Member
1737 posts
Joined: May 2006
Offline
We're trying to use Tractor and TOPs, and would like to use that ‘submit as job’ feature, but have had a few other issues crop up (as well as the general daily distraction of production fun, teaching students, etc.).

Hoping that as the need to push stuff to the farm mounts, we'll focus more on TOPs. When that happens, expect more posts here with more questions about Tractor.
http://www.tokeru.com/cgwiki
https://www.patreon.com/mattestela
User Avatar
Member
603 posts
Joined: Sept. 2016
Offline
davidoberst
Interrupting or cancelling a Houdini job in Tractor can't be that uncommon
I think it usually doesn't show up as a problem because when Tractor does the kill-sweep it kills all the running tasks, and since the PDG cook is polling the task states it will generally cascade failures and terminate itself quickly. How are you doing the cancel? In any case your fix makes sense, so we'll look at backporting something like that ASAP.
User Avatar
Member
603 posts
Joined: Sept. 2016
Offline
This should now be fixed in 18.0.404 and 17.5.554.
User Avatar
Member
27 posts
Joined: March 2020
Offline
chrisgreb
This should now be fixed in 18.0.404 and 17.5.554.
The changelog shows a SIGINT/SIGTERM fix in 18.0.403 (March 11/2020). But I can't seem to find an equivalent entry for 17.5.554? Did this make it into the 17.5 branch?

Also, the last production build of 17.5 is 17.5.460 (from Dec 5/2019), although there is a daily as recent as 17.5.631 from May 27. Are there plans for another production build of 17.5?
User Avatar
Member
603 posts
Joined: Sept. 2016
Offline
davidoberst
Did this make it into the 17.5 branch?

Also, the last production build of 17.5 is 17.5.460 (from Dec 5/2019), although there is a daily as recent as 17.5.631 from May 27. Are there plans for another production build of 17.5?

It's there:
https://www.sidefx.com/changelog/?journal=17.5&categories=54&body=&version=&build_0=&build_1=&show_versions=on&show_compatibility=on&items_per_page=

There are no plans right now for another production build.