I'm investigating Houdini Engine licenses not being released from TractorScheduler farm jobs, and have a couple of questions about what I've found. This is Houdini 17.5.460, and the farm blades are CentOS 7.
A job (“PDG->Submit…”) has a main controlling task (“PDG->Cook…”), and all the various ropmantra_ropfetch frame tasks. The command for the controlling Cook task is something effectively like this (without full paths, setenvs and some other parms, etc):
hython pdgjobcmd.py --norcp ... <hython top.py --report none ...>
Tractor launches that on a blade, which results in hython running pdgjobcmd.py as, say, PID=1234, and starting hserve to get an Engine license. That's the only process the Tractor blade knows about. pdgjobcmd.py eventually spawns that whole second hython…top.py command string (“shell_cmd”) as a separate hython process, say PID=5678 (which will share the Engine license checkout), and waits for it to finish before exiting itself:
proc = subprocess.Popen(shell_cmd, stdin=subprocess.PIPE, shell=True)
proc.stdin.close()
proc.wait()
The problem is that when the pdgjobcmd.py 1234 process is terminated for some reason, nothing causes the top.py 5678 process to terminate as well. For instance, if we tell Tractor to cancel the job, it will do a “killsweep” on active tasks, and the blades will send SIGINT (I think) to the command process they have launched. So the pdgjobcmd.py 1234 process will end, Tractor will get a non-zero return code from that, and think all is done. But the top.py 5678 process just keeps chugging away. I believe (at least on Linux) that it becomes a child of the system “init” process, which will take over the wait(). A mantra frame-cooking task might eventually finish and terminate its hython, although it would probably be better if it were cancelled at that point. But the top.py 5678 process of the main PDG->Cook controlling task is presumably going to sit there indefinitely, still consuming the Engine license, waiting for reports back from cook blades that will never come?
I would think that pdgjobcmd.py needs to take responsibility for the hython top.py process it spawned, and arrange for it to be terminated if pdgjobcmd.py itself is terminated before the proc.wait() returns? I'm not sure what the most robust method would be, but saving the result from Popen and using a combination of atexit.register() and signal.signal() to have an exit function terminate the top.py process if necessary might be one approach?
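To make that idea concrete, here is a rough sketch of what I have in mind. It is not the actual pdgjobcmd.py code: the function names (spawn_in_own_group, terminate_group, run_supervised) are hypothetical, the list of signals is a guess since I'm not sure what Tractor really sends, and it is POSIX-only. One detail worth noting is that with shell=True the PID that Popen reports is the shell's, not necessarily hython's, so the sketch puts the child in its own process group with os.setsid and signals the whole group rather than a single PID.

```python
import atexit
import os
import signal
import subprocess
import sys
import time


def spawn_in_own_group(shell_cmd):
    """Start shell_cmd as the leader of a new session/process group (POSIX only)."""
    proc = subprocess.Popen(shell_cmd, stdin=subprocess.PIPE, shell=True,
                            preexec_fn=os.setsid)
    proc.stdin.close()
    return proc


def terminate_group(proc, grace=5.0):
    """SIGTERM the child's process group; escalate to SIGKILL after `grace` seconds."""
    if proc.poll() is not None:
        return  # child already exited and was reaped
    try:
        os.killpg(proc.pid, signal.SIGTERM)  # pgid == pid because of setsid
    except OSError:
        return  # the group is already gone
    deadline = time.time() + grace
    while proc.poll() is None and time.time() < deadline:
        time.sleep(0.1)
    if proc.poll() is None:
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except OSError:
            pass
        proc.wait()


def _exit_on_signal(signum, frame):
    # Raises SystemExit in the main thread, which interrupts proc.wait()
    # and lets the atexit handler run during interpreter shutdown.
    sys.exit(128 + signum)


def run_supervised(shell_cmd):
    """Run shell_cmd, making a best effort to take it down if we are terminated."""
    # Assumption: these are the signals a blade might send; adjust as needed.
    for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGHUP):
        signal.signal(sig, _exit_on_signal)
    proc = spawn_in_own_group(shell_cmd)
    # A small race remains between Popen returning and this registration.
    atexit.register(terminate_group, proc)
    return proc.wait()
```

In pdgjobcmd.py the three Popen/close/wait lines would then become something like run_supervised(shell_cmd). This can only ever be best effort: SIGKILL cannot be caught, so if the blade kills the 1234 process that way the child is still orphaned. Also, because the child now lives in its own session, it no longer receives signals aimed at the parent's process group, so the cleanup depends entirely on these handlers running.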