Compile Block slower than For Loop

Member, joined July 2018:
I was testing one of Matt Estela's examples, the cube slicer solver.

I tried to do it in a for loop, and then put a compile block around that.

It seems the compile block isn't contributing anything, and it's even slower if the geo is heavy (a rubber toy or our beloved squab instead of a cube).
Looking at the Performance Monitor, the compile block is the slowest to cook, while in the for-loop example the Clip nodes are the heavy ones.

Any idea why that is so?
Edited by papsphilip - Oct. 4, 2022 14:10:35

Attachments:
Cubeslicesolver_TEST.hiplc (370.1 KB)

Member, joined June 2008:
I believe only certain nodes are "compilable." If you use non-compilable nodes inside the loop, you won't gain the benefit. Not sure where the list of non-compilable nodes is, however.
Using Houdini Indie 20.5
Windows 11 64GB Ryzen 16 core.
nVidia RTX 3060, 12GB RAM.
Member, joined July 2007:
2 reasons:

1. I'd suggest wrapping both in a subnet and then looking at the time next to the subnet in the Performance Monitor,
since otherwise, while Compile End shows you the time the whole block took, the plain foreach will not, so you'd just see individual nodes whose times you'd have to sum up.
For me the compiled version is slightly faster, even though it doesn't matter much because of 2.

2. Multithreading of for loops can only happen if the Gather Method on the Block End is Merge Each Iteration. Yours is Feedback Each Iteration, so each iteration has to wait for the previous one to finish before it can be fed back, which means the iterations can't run in parallel.
You can still get some boost from compiling, but definitely not from a multithreaded for loop.
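The difference between the two gather methods can be sketched outside Houdini in plain Python. Here `slice_step` is just a stand-in for one iteration's Clip work (the function name and data are illustrative, not the actual SOP code): a feedback loop chains each result into the next call, while a merge-style loop runs every iteration against the same input and combines the results afterwards, which is what makes it parallelizable.

```python
from concurrent.futures import ThreadPoolExecutor

def slice_step(geo, i):
    """Stand-in for one iteration's work (e.g. one Clip)."""
    return geo + [i]

# Feedback Each Iteration: iteration i needs the result of iteration i-1,
# so the loop is inherently serial.
geo = []
for i in range(4):
    geo = slice_step(geo, i)   # depends on the previous iteration's output
# geo == [0, 1, 2, 3]

# Merge Each Iteration: every iteration reads the same base input,
# so the iterations are independent and can run concurrently.
base = []
with ThreadPoolExecutor() as pool:
    parts = list(pool.map(lambda i: slice_step(base, i), range(4)))
merged = [x for part in parts for x in part]
# merged == [0, 1, 2, 3], but the four calls could run in parallel
```

The outputs are the same here only because `slice_step` is trivial; the point is the dependency structure, not the result.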
Edited by tamte - Oct. 4, 2022 14:42:05
Tomas Slancik
CG Supervisor
Framestore, NY
Member, joined July 2018:
tamte
2. Multithreading of for loops can only happen if the Gather Method on the Block End is Merge Each Iteration. Yours is Feedback Each Iteration, so each iteration has to wait for the previous one to finish before it can be fed back, which means the iterations can't run in parallel.
You can still get some boost from compiling, but definitely not from a multithreaded for loop.

Yes, that makes perfect sense! Thank you!
Making it faster in this case would require a different approach, since multithreading is not possible.
Member, joined Feb. 2012:
You are not going to get much performance by compiling a feedback loop. Your best bet is to reduce the number of operations, e.g. performing 1 clip instead of 2, but as is that might be tricky.

An easy way to speed this up would be to process each piece in parallel using another for loop network at the top level inside the same compile network.

If you have a lot of patience, you could also implement the entire thing in VEX. In my implementation of the Poly Carve SOP, which is far more complex than what the Clip SOP is doing, I got about 3x the performance of the Clip SOP:
https://forums.odforce.net/topic/44143-poly-carve-sop/?do=findComment&comment=232434 [forums.odforce.net]

So if all you want is to split a geometry in half, you could get much faster performance for this operation. I have to preserve all attributes and groups, as well as stitch up adjacent primitives, among many other operations, all of which have a performance cost.
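For illustration only (this is not the actual Poly Carve code, just a hypothetical minimal sketch): the core of any clip-style split is a signed-distance test of each point against the cutting plane, with everything else (attribute transfer, stitching, group handling) layered on top.

```python
def side_of_plane(p, origin, normal):
    # Signed distance of point p from the plane defined by (origin, normal):
    # dot(p - origin, normal). Positive means the "above" side.
    return sum((pi - oi) * ni for pi, oi, ni in zip(p, origin, normal))

def split_points(points, origin, normal):
    """Classify points by which side of the plane they fall on."""
    above, below = [], []
    for p in points:
        (above if side_of_plane(p, origin, normal) >= 0 else below).append(p)
    return above, below

pts = [(0, 1, 0), (0, -2, 0), (1, 0.5, 3)]
above, below = split_points(pts, origin=(0, 0, 0), normal=(0, 1, 0))
# above == [(0, 1, 0), (1, 0.5, 3)]
# below == [(0, -2, 0)]
```

A real clip also has to split the primitives that straddle the plane and rebuild the cut surface, which is where most of the cost mentioned above comes from.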
Senior FX TD @ Industrial Light & Magic
Get to the NEXT level in Houdini & VEX with Pragmatic VEX! [www.pragmatic-vfx.com] https://lnk.bio/animatrix [lnk.bio]
Member, joined June 2019:
For me the compiled block is also slightly faster.
As tamte said, the main reason here is the feedback loop: it just can't be run in parallel.

Compiled blocks are very powerful but also a bit "blackboxed" I'd say.

Afaik they combine several kinds of optimization:
- parallelized runs of the compiled operation in loops (only if you're merging the results)
- in-place operations on geometry (without copying the input). Again, you can't easily tell whether an operation supports running in place; for example, ops that don't change the topology can be applied in place, but something like Subdivide can't.
- parallelized cooking of a node's inputs, if it has more than one. In your example, in theory the Merge inputs could be cooked in parallel. This wasn't supported when compiled blocks were introduced, but was reserved "for the future"; I have no idea if it's supported right now.
- very specific use of the OpenCL SOP, trying to avoid copying data between the CPU and GPU

All of this explained here: https://vimeo.com/222881605 [vimeo.com]
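The in-place point can be illustrated outside Houdini. A hypothetical Python sketch (illustrative names, not Houdini's internals): an operation that preserves the element count can reuse its input buffer, while a topology-changing operation like subdivision produces a different number of elements and must allocate a new one.

```python
def move_points_inplace(positions, offset):
    # Topology-preserving edit: same point count, so the input
    # buffer can be modified in place with no copy.
    for i, (x, y, z) in enumerate(positions):
        positions[i] = (x + offset[0], y + offset[1], z + offset[2])
    return positions

def subdivide(positions):
    # Topology-changing edit: the output has a different element
    # count, so a new buffer has to be allocated.
    out = []
    for a, b in zip(positions, positions[1:]):
        mid = tuple((ai + bi) / 2 for ai, bi in zip(a, b))
        out.extend([a, mid])
    out.append(positions[-1])
    return out

pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
moved = move_points_inplace(pts, (0.0, 1.0, 0.0))
# moved is pts: the same buffer was reused, nothing was copied
refined = subdivide(pts)
# refined is a new list with 3 points; pts is untouched by subdivide
```

This is the distinction the compiler has to reason about per node, which is part of why the resulting task graph is hard to predict from outside.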

The main problem is that even if you manage to fit nodes into a compiled block, you just don't know exactly what you get out of it. "Compilation" tries to create an optimal task graph with all of these optimizations, but hides exactly what that graph looks like. Most of the time you basically have to rely on intuition.

Being such low-level, technical nodes, they definitely lack debugging and detailed reporting functions.