I am using simtracker.py to distribute slices on my Tractor farm, and have come across a weird situation. Let's say one of the slices runs out of memory and errors out. The other slices should either keep going or stop. Unfortunately, what happens is something in between: the other slices log that an error has occurred and that they are aborting, but then don't actually abort and stay in a hung state.
My code snippet that renders the slice:
    try:
        cache_node.render(verbose=True)
        print(datetime.now().strftime("%H:%M:%S"))
        print("Cache completed.")
    except Exception as exc:
        print(datetime.now().strftime("%H:%M:%S"))
        print(f"Slice {args.slice_num} failed execution with exception {exc}")
    print(datetime.now().strftime("%H:%M:%S"))
    print("Cached and completed. Bye bye.")
The actual logs on the running slices when they detect one of the slices has crashed:
    14:54:15 save_slices frame 91 (91 of 240)
    14:54:23 save_slices frame 92 (92 of 240)
    14:54:31 save_slices frame 93 (93 of 240)
    Tracker reports an error, aborting
    Tracker reports an error, aborting
    Tracker reports an error, aborting
    Tracker reports an error, aborting
    Tracker reports an error, aborting
    Tracker reports an error, aborting
    EOF at position 285 of 369295617
    Error occurred in message 369295617 state is 1
    ---- Pump enters error status ----
    Tracker reports an error, aborting
    EOF at position 285 of 369295617
    Error occurred in message 369295617 state is 1
    ---- Pump enters error status ----
    Tracker reports an error, aborting
As you can see, it wants to abort, but it doesn't throw an exception (the print in the except block is missing), it doesn't continue past the call to render() (the next print is missing), and it doesn't exit the process (the process keeps running on the render node, and I still see it when running ps). It just stays hung inside the render() call.
Is there anything I may be doing wrong? Or is there any way I can handle this situation in a better way - detect that one slice has gotten jammed up and then end the remaining slice processes gracefully?
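For illustration, here is a minimal sketch of one possible workaround. It assumes each slice can be launched as its own command from a small wrapper script, so the wrapper can watch the slice's output and end the process if it reports an abort or goes silent. The names `run_with_watchdog`, `ABORT_MARKER`, and the timeout values are hypothetical and not part of simtracker.py or Tractor. Only the marker text is taken from the logs above.

```python
import queue
import subprocess
import sys
import threading

# Marker text copied from the slice logs; everything else here is an assumption.
ABORT_MARKER = "Tracker reports an error, aborting"


def run_with_watchdog(cmd, stall_timeout=600.0, grace=5.0, marker=ABORT_MARKER):
    """Run cmd, echo its output, and end it if it reports an abort or goes silent.

    Returns (reason, returncode); reason is "finished", "aborted" or "stalled".
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    lines = queue.Queue()

    def pump():
        # Forward the child's output line by line; None marks end of stream.
        for line in proc.stdout:
            lines.put(line)
        lines.put(None)

    threading.Thread(target=pump, daemon=True).start()

    reason = "finished"
    while True:
        try:
            line = lines.get(timeout=stall_timeout)
        except queue.Empty:
            reason = "stalled"  # no output for stall_timeout seconds
            break
        if line is None:
            break  # child closed its output, normally because it exited
        sys.stdout.write(line)
        if marker in line:
            reason = "aborted"  # child says it is aborting; do not wait for it
            break

    if reason != "finished":
        # Ask politely first, then force-kill if the child ignores the request.
        proc.terminate()
        try:
            proc.wait(timeout=grace)
        except subprocess.TimeoutExpired:
            proc.kill()
    return reason, proc.wait()
```

The wrapper would call something like `run_with_watchdog([...slice command...])` and exit with a non-zero status when the reason is not "finished", so the farm sees the task as failed. Both checks rely on the slice's output reaching the pipe promptly, so the child may need unbuffered output, and `stall_timeout` has to be longer than the slowest legitimate frame.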