Feedback Loop - Dynamic Partitioning

Member | 66 posts | Joined: Feb. 2017
A bit of a pickle here. Example attached.

I need to loop over each work_item one-by-one using a feedback loop, respecting work item order.

When “Use Dynamic Partitioning” for the Feedback loop is turned “off”, order is respected, but if one work_item fails, the entire loop fails.

When “Use Dynamic Partitioning” for the Feedback loop is turned “on”, order is NOT respected, but if one work_item fails, the loop continues regardless.

What I'm hoping for is respected order of the work_items, but if one work_item fails, for the loop to continue regardless (simplified example attached). The “production” scenario is described below.


Production Scenario:

Each work_item is responsible for initiating a subprocess (with arguments specific to the work_item) of a program that can only have one instance running at a time. After the work_item is done with the subprocess, the subprocess is completely shut down and the next work_item restarts it.

So with two work_items in the loop, it'd go…

work_item_0 - Open subprocess
work_item_0 - Close subprocess
work_item_1 - Open subprocess
work_item_1 - Close subprocess


I'd like to do this all without crashing the loop if one work_item fails with a raised dependency error. Thank you for any help!
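
In the Python Script TOP, each work_item essentially does something like this minimal sketch (the "program_path" and "item_args" attribute names are placeholders, not the actual setup in the attached file):

    # Minimal per-work_item sketch; "program_path" and "item_args"
    # are hypothetical attribute names.
    import subprocess

    cmd = [work_item.attribValue("program_path")] + list(work_item.attribValues("item_args"))

    proc = subprocess.Popen(cmd)   # open the subprocess for this work_item
    proc.wait()                    # block until it exits, so only one instance runs at a time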

Attachments:
top_example.hip (119.3 KB)

User Avatar
Staff
585 posts
Joined: May 2014
Offline
The problem with the dynamic partitioning case is that it isn't actually creating a feedback loop. Dynamic partitioning should be enabled only if the feedback begin node itself is dynamic. The configuration using dynamic partitioning in the file should actually be an error: note that the end block isn't producing any partitions, which is why the whole feedback loop doesn't work properly.

In PDG, a work item won't evaluate if one of its dependencies has failed. That's the case for work items both inside a loop and when not using loops. To get this working the way you want, you'll need to catch the errors before they cause the work item to fail. If your work item is out of process, you can use the scheduler options to control what happens on task failure. That can be configured on a per-node basis using scheduler job parms for that node: https://www.sidefx.com/docs/houdini/tops/schedulers.html#jobparms

If you're writing the process-spawning code yourself in a Python Script TOP, then the best option is probably to catch any exceptions/failures and record the state as an attribute if you need it downstream.
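
For example, a rough sketch of that pattern (the attribute names are just illustrative):

    # Inside a Python Script TOP; work_item is provided by the node.
    import subprocess

    cmd = [work_item.attribValue("program_path")] + list(work_item.attribValues("item_args"))

    try:
        # check=True raises CalledProcessError on a nonzero exit code
        subprocess.run(cmd, check=True)
        work_item.setIntAttrib("spawn_failed", 0)
    except Exception as exc:
        # record the failure instead of raising, so the work item
        # still cooks and downstream nodes can react to the attribute
        work_item.setIntAttrib("spawn_failed", 1)
        work_item.setStringAttrib("spawn_error", str(exc))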
Member | 66 posts | Joined: Feb. 2017
Thank you for the reply.

In the actual production scenario it's an HDA Processor TOP that raises the following error when attempting to read in an OBJ file:

Error: Unable to read file "SCREW;HH;;GB-5782(M10-1.5 x 45).obj". Expected array object (near byte offset 1, line 1, column 2)
The error appears to happen inconsistently throughout the TOPs operation, and the OBJs that the system hangs up on open fine via the File SOP, outside of TOPs. I did put in a support ticket this morning for the “Expected array object” error on a seemingly perfectly good OBJ (Side Effects Support Ticket #91349).

I was able to bypass any raised errors caused by the Python Script by using a Split TOP to route around the other TOP nodes in the loop in case of error (which seems to work well enough).
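
For instance, assuming a failure flag like the hypothetical spawn_failed attribute sketched earlier, the Split TOP's expression would be something like:

    @spawn_failed == 1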

I'll attempt pointing the HDA Processor to a different scheduler for appropriate error handling to see if I can gain access to a hook to tell the system to split out the stream on error, like I've done in the past with the Python Script TOP. Thank you for that advice.

I'll give it a try today and run the TOPs process overnight for a stress test. Again, thank you.
Staff | 585 posts | Joined: May 2014
9krausec wrote:
I'll attempt pointing the HDA Processor to a different scheduler for appropriate error handling to see if I can gain access to a hook to tell the system to split out the stream on error, like I've done in the past with the Python Script TOP. Thank you for that advice.

You don't need to create a different scheduler. You can override the failure handling scheduler parameters just for that specific HDA Processor node instance by adding the job parameters as described in the documentation link in my last post. That way you can still share the same settings/concurrent job limits with the rest of the nodes in your graph.

Is your .obj file being copied to its destination as part of the PDG cook by an upstream node? Or is the .obj file on a shared drive? Random failures in reading files like that typically indicate that the file is being accessed or modified by another process at the same time, or hasn't been fully copied when it's being loaded by the HDA Processor.
Edited by tpetrick - May 20, 2020 14:14:37
Member | 66 posts | Joined: Feb. 2017
Thank you Taylor. I followed the documentation and set the HDA Processor TOP node's error handling from “Report Error” to “Report Warning”.

The OBJ is being created by a Python Script TOP call to third-party software. So the OBJ is not being copied; it is being created.

Perhaps the HDA Processor (a downstream item that ingests the OBJ) is trying to call upon the OBJ before it has actually been saved out. Perhaps I need to add a slight sleep delay right after the OBJ is generated to give it enough breathing room.
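
If the timing does turn out to be the cause, a fixed sleep might be fragile. A rough sketch of a sturdier wait (all names here are hypothetical) would poll until the file exists and its size stops changing:

    import os, time

    def wait_for_file(path, timeout=60.0, settle=1.0):
        # Return True once `path` exists and its size has been stable
        # for `settle` seconds; give up after `timeout` seconds.
        deadline = time.time() + timeout
        last_size = -1
        while time.time() < deadline:
            if os.path.exists(path):
                size = os.path.getsize(path)
                if size == last_size and size > 0:
                    return True
                last_size = size
            time.sleep(settle)
        return False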

I'll run the batch test overnight and let you know tomorrow how it turns out.

Thanks again.