Compile Block slower than For Loop

Member, joined July 2018:
I was testing one of Matt Estela's examples, the cube slicer solver.

I tried to do it in a for loop, and then put a compile block around that.

It seems the compile block isn't contributing anything, and it's even slower if the geo is heavy (a rubber toy or our beloved squab instead of a cube).
Looking at the Performance Monitor, the compile block is the slowest to cook, while in the for-loop example the Clip nodes are the heavy ones.

Any idea why that is so?
Edited by papsphilip - Oct. 4, 2022 14:10:35

Attachments:
Cubeslicesolver_TEST.hiplc (370.1 KB)

Member, joined June 2008:
I believe only certain nodes are "compilable." If you use non-compilable nodes inside the loop, you won't gain the benefit. Not sure where the list of non-compilable nodes is, however.
Using Houdini Indie 20.5
Windows 11 64GB Ryzen 16 core.
nVidia RTX 3060, 12GB RAM.
Member, joined July 2007:
2 reasons:

1. I'd suggest wrapping both in a subnet and then looking at the time next to the subnet in the Performance Monitor,
since otherwise, while Compile End shows you the time the whole block took, the plain foreach will not, so you'd just see individual nodes whose times you'd have to sum up.
For me the compiled version is slightly faster, even though it doesn't matter much because of 2.

2. Multithreading of for loops can only happen if the Gather Method on the Block End is Merge Each Iteration. Yours is Feedback Each Iteration, so each iteration has to wait for the previous one to finish before it can be fed back, which means the iterations can't run in parallel.
You can still get some boost from compiling, but definitely not from a multithreaded for loop.
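The difference between the two gather methods can be sketched outside Houdini in plain Python. Here `slice_step` is just a stand-in for one iteration's Clip work (the function name and data are illustrative, not the actual SOP code): a feedback loop chains each result into the next call, while a merge-style loop runs every iteration against the same input and combines the results afterwards, which is what makes it parallelizable.

```python
from concurrent.futures import ThreadPoolExecutor

def slice_step(geo, i):
    """Stand-in for one iteration's work (e.g. one Clip)."""
    return geo + [i]

# Feedback Each Iteration: iteration i needs the result of iteration i-1,
# so the loop is inherently serial.
geo = []
for i in range(4):
    geo = slice_step(geo, i)   # depends on the previous iteration's output
# geo == [0, 1, 2, 3]

# Merge Each Iteration: every iteration reads the same base input,
# so the iterations are independent and can run concurrently.
base = []
with ThreadPoolExecutor() as pool:
    parts = list(pool.map(lambda i: slice_step(base, i), range(4)))
merged = [x for part in parts for x in part]
# merged == [0, 1, 2, 3], but the four calls could run in parallel
```

The outputs are the same here only because `slice_step` is trivial; the point is the dependency structure, not the result.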
Edited by tamte - Oct. 4, 2022 14:42:05
Tomas Slancik
CG Supervisor
Framestore, NY
Member, joined July 2018:
tamte
2. Multithreading of for loops can only happen if the Gather Method on the Block End is Merge Each Iteration. Yours is Feedback Each Iteration, so each iteration has to wait for the previous one to finish before it can be fed back, which means the iterations can't run in parallel.
You can still get some boost from compiling, but definitely not from a multithreaded for loop.

Yes, that makes perfect sense! Thank you!
Making it faster in this case would require a different approach, since multithreading is not possible.
Member, joined Feb. 2012:
You are not going to get much performance by compiling a feedback loop. Your best bet is to reduce the number of operations, e.g. performing 1 clip instead of 2, but as is that might be tricky.

An easy way to speed this up would be to process each piece in parallel using another for loop network at the top level inside the same compile network.

If you have a lot of patience, you could also implement the entire thing in VEX. In my implementation of the Poly Carve SOP, which is far more complex than what the Clip SOP is doing, I got about 3x the performance of the Clip SOP:
https://forums.odforce.net/topic/44143-poly-carve-sop/?do=findComment&comment=232434 [forums.odforce.net]

So if all you want is to split a geometry in half, you could get much faster performance for this operation. I have to preserve all attributes and groups, as well as stitch up adjacent primitives, among many other operations, all of which have a performance cost.
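For illustration only (this is not the actual Poly Carve code, just a hypothetical minimal sketch): the core of any clip-style split is a signed-distance test of each point against the cutting plane, with everything else (attribute transfer, stitching, group handling) layered on top.

```python
def side_of_plane(p, origin, normal):
    # Signed distance of point p from the plane defined by (origin, normal):
    # dot(p - origin, normal). Positive means the "above" side.
    return sum((pi - oi) * ni for pi, oi, ni in zip(p, origin, normal))

def split_points(points, origin, normal):
    """Classify points by which side of the plane they fall on."""
    above, below = [], []
    for p in points:
        (above if side_of_plane(p, origin, normal) >= 0 else below).append(p)
    return above, below

pts = [(0, 1, 0), (0, -2, 0), (1, 0.5, 3)]
above, below = split_points(pts, origin=(0, 0, 0), normal=(0, 1, 0))
# above == [(0, 1, 0), (1, 0.5, 3)]
# below == [(0, -2, 0)]
```

A real clip also has to split the primitives that straddle the plane and rebuild the cut surface, which is where most of the cost mentioned above comes from.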
Senior FX TD @ Industrial Light & Magic
Get to the NEXT level in Houdini & VEX with Pragmatic VEX! [www.pragmatic-vfx.com] https://lnk.bio/animatrix [lnk.bio]
Member, joined June 2019:
For me the compiled block is also slightly faster.
As tamte said, the main reason here is the feedback loop: it just can't be run in parallel.

Compiled blocks are very powerful but also a bit "blackboxed" I'd say.

Afaik they combine several kinds of optimization:
- parallelized runs of the compiled operation in loops (only if you're merging the results)
- in-place operations on geometry (without copying the input). Again, you can't easily tell whether an operation supports running in place; for example, ops that don't change the topology can be applied in place, but something like Subdivide can't.
- parallelized cooking of a node's inputs, if it has more than one. In your example, in theory the Merge inputs could be cooked in parallel. This wasn't supported when compiled blocks were introduced, but was reserved "for the future"; I have no idea if it's supported right now.
- very specific use of the OpenCL SOP, trying to avoid copying data between the CPU and GPU

All of this explained here: https://vimeo.com/222881605 [vimeo.com]
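The in-place point can be illustrated outside Houdini. A hypothetical Python sketch (illustrative names, not Houdini's internals): an operation that preserves the element count can reuse its input buffer, while a topology-changing operation like subdivision produces a different number of elements and must allocate a new one.

```python
def move_points_inplace(positions, offset):
    # Topology-preserving edit: same point count, so the input
    # buffer can be modified in place with no copy.
    for i, (x, y, z) in enumerate(positions):
        positions[i] = (x + offset[0], y + offset[1], z + offset[2])
    return positions

def subdivide(positions):
    # Topology-changing edit: the output has a different element
    # count, so a new buffer has to be allocated.
    out = []
    for a, b in zip(positions, positions[1:]):
        mid = tuple((ai + bi) / 2 for ai, bi in zip(a, b))
        out.extend([a, mid])
    out.append(positions[-1])
    return out

pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
moved = move_points_inplace(pts, (0.0, 1.0, 0.0))
# moved is pts: the same buffer was reused, nothing was copied
refined = subdivide(pts)
# refined is a new list with 3 points; pts is untouched by subdivide
```

This is the distinction the compiler has to reason about per node, which is part of why the resulting task graph is hard to predict from outside.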

The main problem is that even if you manage to fit nodes into a compiled block, you just don't know exactly what you get out of it. "Compilation" tries to create an optimal task graph with all of these optimizations, but hides exactly what that graph looks like. Most of the time you basically have to rely on intuition.

Being such low-level, technical nodes, they definitely lack debugging and detailed reporting functions.