tractorscheduler submit 17.5 vs 18.0
davidoberst
I'm working to offload some Houdini work from our Windows desktops to our Tractor 2.2 CentOS farm. Our production files don't use TOP nets at all right now, not even with just a local scheduler, so what I'm doing programmatically is this:
  1. Have them select a node “theirnode” (filecache, ifd, rop_geometry are some likely types) they want to run on the farm
  2. Create a “topnet” node
  3. Create a “tractorscheduler” node in the topnet
  4. Delete the localscheduler that got created with the topnet, and set the topnet default scheduler to the tractorscheduler created above
  5. Create a “ropfetch” node, point “roppath” parm at “theirnode”, “framegeneration” to 1 (frame range) and turn on “Reset $HIP on cook”
  6. Call hou.hipFile.save()
  7. Programmatically pressButton() the topscheduler.submitjob button to submit the graph as a job.
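The seven steps above can be sketched roughly as follows. This is a sketch only: it must run inside a Houdini session, and the parameter names `topscheduler` (the topnet's default-scheduler parm) and `submitjob` (the scheduler's submit button) are assumptions based on the thread, so verify them against your build.

```python
def build_farm_submit(rop_path, parent_path="/obj"):
    """Build a topnet around an existing ROP and submit it to Tractor.

    Sketch only; parm names "topscheduler" and "submitjob" are
    assumptions, and this must run inside a Houdini session.
    """
    import hou  # only available inside Houdini

    topnet = hou.node(parent_path).createNode("topnet", "farmtop")

    # Create the Tractor scheduler, drop the localscheduler that ships
    # with a fresh topnet, and make the tractorscheduler the default.
    tractor = topnet.createNode("tractorscheduler")
    local = topnet.node("localscheduler")
    if local is not None:
        local.destroy()
    topnet.parm("topscheduler").set(tractor.path())

    # Fetch the user's ROP: one work item per frame.
    fetch = topnet.createNode("ropfetch")
    fetch.parm("roppath").set(rop_path)
    fetch.parm("framegeneration").set(1)  # 1 = frame range
    # (also enable "Reset $HIP on cook" on the ropfetch here)

    hou.hipFile.save()
    tractor.parm("submitjob").pressButton()  # "Submit Graph as Job"
    return topnet, fetch
```

Because the `hou` import is deferred into the function body, the module can be loaded outside Houdini for testing; the function itself only works in a live session.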

This works with 17.5.460: the main PDG->Cook task in the farm job generates additional ropfetch tasks for each output frame/item. But the same code with 18.0.460 (and a test install of the 18.0.491 daily) doesn't. The job goes to Tractor and the main PDG->Cook task starts up, but it creates no subtasks and generates no output, and the main task finishes with output like this:
Given Node ‘farmtop’, Cooking Node ‘ropfetch_farm_torus_transform’
Finished Cook
Work Item States:
==== 2020/06/09 21:45:32 process complete, exit code: 0 ====

I'm guessing it is something about the work items or dirty flags of my topnet or ropfetch that is initialized differently in 18.0 vs 17.5, where it worked. If I do this:

  1. Add a localscheduler to my topnet and make it the default
  2. Click the topnet's “Cook output node” button
  3. Let it start itself up, then do “Cancel Cook”
then the ropfetch node now shows task counts, and I may have some output files depending on how long I let it run before cancelling. If I then delete the localscheduler node, set the tractorscheduler back as the default, and press Submit Graph as Job, the main task in the farm job will now sometimes continue on and create output.

If anyone sees something obvious I should be doing in the code between creating the ropfetch node, saving the file, and doing the Submit Graph as Job to get this working again with 18.0, please let me know. I've tried assorted calls to dirtyTasks(), dirtyAllTasks() and cook() on the various topnet nodes, but haven't figured it out yet.
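For reference, the dirty-and-recook attempts mentioned above look roughly like this. It is a sketch of the calls named in the thread (`dirtyTasks(bool)`, `dirtyAllTasks(bool)`, `cook()`), must run inside Houdini, and whether the topnet itself exposes `dirtyAllTasks` may vary by build:

```python
def force_regenerate(topnet_path):
    """Dirty every TOP node in a network and force a recook, in the
    hope that work items regenerate. Sketch only; runs inside Houdini.
    """
    import hou  # only available inside Houdini

    topnet = hou.node(topnet_path)
    for node in topnet.children():
        if isinstance(node, hou.TopNode):
            node.dirtyTasks(False)  # dirty without deleting outputs
    topnet.dirtyAllTasks(False)     # if available on the topnet type
    topnet.cook(force=True)
```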
chrisgreb
The log indicates that no work items were generated and the cook finished right away.

Are you showing the complete task log? My 18.0.460 log has more lines:


Running Houdini 18.0.460 with PID 29084
Loading .hip file /...
Given Node 'topnet1', Cooking Node 'smoke_src'
PDG Callback Server at upton.sidefx.com:60095
Finished Cook


Is this only happening with your script? Does the cook work if you open the saved hip file and submit as job manually?
davidoberst
Yes, earlier in the tractor log there are those two lines about the PID, and the Loading .hip file.
My code is just programmatically pressing the “Submit Graph as Job” button, but if I do it manually nothing different happens.

I went back to 17.5.460 and another copy of the test file, and ran my code to build the topnet and make its tractorscheduler node the default scheduler. If I then manually choose “Dirty and Cook Selected Node” on the ropfetch, or press the topnet's “Generate Static Work Items” button, the ropfetch node display gets the dots and numbers indicating its work items, and they show up in the node's “Task Graph Table”. In 18.0 this doesn't happen (no work items) with the tractorscheduler as the default, but if I make a localscheduler the default the work items do get created. So it seems that the ropfetch isn't creating work items in the tractorscheduler case?
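A quick way to probe this outside the UI might be to trigger static work item generation from Python. This is a sketch; `generateStaticWorkItems(block=...)` is the hou.TopNode call as I understand the 18.0 API, so verify it against your build's documentation:

```python
def generate_items(fetch_path):
    """Ask a TOP node (e.g. a ropfetch) to generate its static work
    items, blocking until generation finishes. Sketch only; the
    generateStaticWorkItems signature should be checked per build.
    """
    import hou  # only available inside Houdini

    fetch = hou.node(fetch_path)
    fetch.generateStaticWorkItems(block=True)
    # With a localscheduler as default, the node display should now
    # show work-item dots/counts; per the thread, with tractorscheduler
    # as default in 18.0 this silently produced none.
```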

Is there something additional I need to do when programmatically creating my ropfetch node in 18.0 that I didn't in 17.5, so it works with tractorscheduler? I'll see if I can package up our simple example file and send it along if I can't figure this out.
chrisgreb
I can't seem to reproduce this, could you please attach your hip file here or to a support ticket.
davidoberst
I think I may have found the problem. We had to monkeypatch your TractorScheduler._initEngineConnection() routine to work around the “Tractor session file” issue on our Linux farm blades. On Windows it merely calls your original method, so it should be harmless. But somehow, with that patch in place, work item generation fails when a tractorscheduler is the default, even though it works fine on 17.5. No error seems to be thrown, and even with the patch in place 18.0 is still able to submit jobs and function on the farm, etc.

Since you fixed that session file bug in 18.0.421, I'm not going to worry too much about the whys; I can just disable that monkeypatch for 18.0. That seems to have fixed the work item generation, and so far our 18.0 Tractor submit tests are working.
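The version-gated monkeypatch pattern described above can be expressed generically like this. This is a generic illustration, not the actual TractorScheduler code: the `Scheduler` class and method names here are made up stand-ins, and the gate reflects the fix landing in 18.0.421 per the thread.

```python
import platform


class Scheduler:
    """Stand-in for the real scheduler class being patched."""

    def _init_engine_connection(self):
        return "original"


def install_session_file_workaround(cls, houdini_version):
    """Wrap cls._init_engine_connection with a Linux-only workaround,
    but skip the patch entirely on versions where the upstream bug is
    already fixed (18.0.421+ per the thread)."""
    if houdini_version >= (18, 0, 421):
        return False  # bug fixed upstream; leave the class untouched

    original = cls._init_engine_connection

    def patched(self):
        result = original(self)
        if platform.system() == "Linux":
            # ...apply the session-file workaround here...
            result = result + "+workaround"
        return result

    cls._init_engine_connection = patched
    return True
```

Keeping the gate in the installer (rather than inside the patched method) means newer Houdini builds never see the wrapper at all, which avoids exactly the kind of subtle interference reported above.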