PDG local scheduler Redshift fetching issue

Member
5 posts
Joined: June 2018
Hi, I have an issue with batch Redshift rendering using PDG and the Local Scheduler.
The Local Scheduler assigns Redshift a new render task before the current one is finished, so some frames are lost.
The job is set to 'single', and Redshift really is assigned only one workitem at a time. However, according to the GPU load, Redshift is often assigned a new workitem in the middle of the current render job, and judging by the subsequent GPU load pattern, Redshift ignores the newly assigned workitem and continues rendering the current one.
When all workitems in the ROP Fetch node are processed (but only about half of the images have been rendered and written to disk), the render process and GPU load terminate.
The queue jumping is different each time the network is dirtied and recooked. The first workitems are often processed without jumping ahead (or at least without jumping far ahead), and the probability of queue jumping increases over time, so the last workitems are often assigned up to 10 times faster than the duration of a single workitem render.
It seems that the Local Scheduler doesn't correctly handle the render-finish feedback from Redshift and decides that Redshift is already free while rendering is still in progress.

I used:
- Houdini 17.5 builds 268, 293, 327;
- Redshift 2.6.44, 3.0.0.5.
I tried:
- setting the job to 'single' in the Local Scheduler, in the ROP Fetch, and in both;
- using the ROP Fetch in a loop;
- setting 'non-blocking current frame rendering';
- different combinations of the 'max cpus', 'cpus per thread', and 'max threads' settings;
all unsuccessful.

I'm currently not too deep into the PDG task and event system, but since Redshift can execute 'pre render' and 'post render' scripts in HScript or Python, there may be a way to build an intermediary scripted processor. It would copy all workitems for rendering from the upstream node and set their state to 'scheduled', then set the first workitem's state to 'cooking' and launch the Redshift render; when the render is done, the Redshift post-render script would tell the scripted processor to change that workitem's state to 'cooked' and repeat the procedure with the next workitem.
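To make the intent concrete, here is a minimal pure-Python sketch of the proposed state handoff. It only models the idea with plain objects; real PDG workitem states would be changed through the PDG API, and the `WorkItem` class, `process_serially` function, and the callback are all illustrative stand-ins, not PDG or Redshift calls.

```python
# Illustrative sketch only: models the proposed workitem state handoff
# ("scheduled" -> "cooking" -> "cooked") with plain Python objects.
# It is NOT the PDG API; it just shows the serialized flow being described.
class WorkItem:
    def __init__(self, name):
        self.name = name
        self.state = "scheduled"   # all items start as scheduled

def process_serially(items, render):
    """Cook items strictly one at a time: the next item starts only
    after the post-render step has marked the current one cooked."""
    for item in items:
        item.state = "cooking"     # hand the item to the renderer
        render(item)               # blocking render of this one item
        item.state = "cooked"      # the post-render script would do this

items = [WorkItem("frame_%d" % i) for i in range(3)]
process_serially(items, render=lambda item: None)
print([it.state for it in items])  # ['cooked', 'cooked', 'cooked']
```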

Maybe there is an easier way to solve this problem. If the only way is to make a scripted processor (or even a scripted scheduler), I would appreciate any help with its code, or an example/tutorial on 'manual' interaction between PDG and a renderer, since I haven't found any.
Member
5 posts
Joined: June 2018
My (temporary) solution is as follows:

Inside the TOP For-Loop, one Python script pushes workitem attributes to the target channels in the current Houdini session (so we see these changes applied in the 3D view and other contexts during the batch render). It also pushes the constructed path string, with the wedged variable values, to the ROP node's output path.

The next Python node in the For-Loop has a spare parameter, 'render is busy'. The script first sets this parameter to 1 and then launches the render by calling the rop_node.render() method. After launching the render, it polls the 'render is busy' parameter (with a small delay between checks using time.sleep()) and breaks out of the polling loop when the parameter is set to 0. While polling, the node stays in the 'cooking' state and blocks the TOP For-Loop from launching the next render iteration.

The Redshift ROP runs a post-frame script that sets 'render is busy' to 0. Since the Redshift ROP always runs this script correctly after the render is finished (unlike the ROP Fetch with the Local Scheduler, which always generates some number of false 'render finish' events), all frames are always rendered and saved. And unlike checking for a rendered image file appearing on disk, this method even allows rendering to MPlay, which is handy at the lookdev stage.
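The busy-flag handshake above can be sketched in pure Python. In Houdini the flag would be a spare parameter read and set via `hou.parm(...)`, the render launch would be `rop_node.render()`, and the flag would be cleared by the Redshift post-frame script; here a dictionary and a background thread stand in for all three, so the names `render_is_busy`, `fake_render`, and `render_frame_blocking` are illustrative, not Houdini API.

```python
# Pure-Python simulation of the busy-flag handshake described above.
# A thread stands in for the Redshift render; clearing the flag stands
# in for the ROP's post-frame script setting 'render is busy' to 0.
import threading
import time

render_is_busy = {"flag": 0}      # stands in for the spare parameter

def fake_render(frames_done, frame):
    """Stands in for the Redshift render of one frame."""
    time.sleep(0.05)              # pretend to render
    frames_done.append(frame)     # frame written to disk
    render_is_busy["flag"] = 0    # what the post-frame script does

def render_frame_blocking(frames_done, frame):
    """Stands in for the Python TOP node: launch, then poll the flag."""
    render_is_busy["flag"] = 1    # mark busy before launching
    t = threading.Thread(target=fake_render, args=(frames_done, frame))
    t.start()                     # launch the "render"
    while render_is_busy["flag"]: # poll until the flag is cleared
        time.sleep(0.01)          # small delay between checks
    t.join()

frames_done = []
for frame in (1, 2, 3):
    render_frame_blocking(frames_done, frame)

print(frames_done)  # [1, 2, 3] -- each frame finishes before the next starts
```

The key property the flag gives you is serialization: the loop cannot advance until the render itself (not the scheduler) says it is done.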

I've also noticed that besides losing frames, Redshift is more than 2 times slower when launched from the ROP Fetch compared to a 'manual' launch. Since in my method the ROP is actually launched in a 'simulated manual' mode, its speed equals that of a manually launched Redshift ROP.

Although my solution works like a charm on a single machine, I guess it can't easily be scaled to a render farm. To be scalable, the TOP network would have to contain both ropfetch+netscheduler and pythonfetch+localscheduler branches. So fixing this ropfetch+localscheduler+redshift issue would be great.
Staff
292 posts
Joined: Aug. 2017
Hello! Can you submit this bug to support with the .hip file attached? This will be very helpful in tracking down the problem that you're experiencing. Thank you!
Edited by BrandonA - Aug. 16, 2019 10:38:21