Attribute Wrangle performance and grouping

Member
14 posts
Joined: Feb. 2017
Hi,

I keep running into situations similar to these (Houdini 15.5.x):

1. Let's say I have a SOP geometry of N points flowing into a chain of 4 Attribute Wrangle nodes (all time dependent, so they execute every frame). It is significantly faster to move the code from these four independent nodes into a single Attribute Wrangle (a rough sketch follows below). I can see how there might be an explanation, but I would like to know exactly why this happens, just to gain some understanding of what happens in the network when geometry is passed from node to node.

2. Now consider the geometry from 1. alongside a geometry with double the number of points, organised into two groups of equal size (half and half, i.e. N points and N points). If I put the geometry from 1. through an Attribute Wrangle, I would expect roughly the same performance as when I put the doubled geometry through a wrangle acting on only one of the groups (using the Group parameter). In theory VEX executes on only N points in both cases, but performance drops significantly in the latter. Why is this? Is there a better way to act on groups selectively with performance in mind?
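
To make 1. concrete, a rough sketch (the kernels here are made up; v@rest is an assumed pre-existing attribute):

// Hypothetical one-liners, one per wrangle in the chain:
//   wrangle1: @P.y += 0.1 * @Time;
//   wrangle2: v@vel = @P - v@rest;
//   wrangle3: f@speed = length(v@vel);
//   wrangle4: @Cd = set(f@speed, 0, 0);
// The same work in a single Point Wrangle: one VEX invocation, one
// set of attribute bindings, one geometry copy.
@P.y += 0.1 * @Time;
v@vel = @P - v@rest;
f@speed = length(v@vel);
@Cd = set(f@speed, 0, 0);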

Thank you very much beforehand!

Cheers
Member
459 posts
Joined: Oct. 2011
Hi
Much easier to answer if you post an example.

-b
http://www.racecar.no
Staff
6225 posts
Joined: July 2005
1) Geometry is *not* passed node-to-node in Houdini. Each node has its own geometry that it copies from its input. This is why you can move the display flag upstream and see intermediate results without recooking. There are a huge number of optimizations that make this not as crazy as it sounds.

First, note your first-frame time will always be significantly slower as the VEX has to be compiled & optimized.

Barring this, one has to deal with the natural overhead of invoking VEX. At best, all the attributes you bind need to be streamed out of your source gdp into the VEX engine, then streamed back into the original attributes. If your kernel is simple enough, this can take most of the time. Also, if you are working with array attributes, it can be particularly important to make sure you aren't writing back attributes you only read. The “Attributes to Create” mask can be used to ensure you don't do this: we can't determine whether an attribute is written to or not, so with the mask at its default value, any attribute you access will be written back.
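
For example, consider a hypothetical kernel that reads an assumed array attribute f[]@samples but only ever writes @Cd; setting “Attributes to Create” to Cd avoids streaming samples back out:

// Read-only use of the (assumed) array attribute f[]@samples.
// With "Attributes to Create" at its default (*), samples would be
// streamed back out unchanged; restricting the mask to "Cd" skips that.
float total = 0;
foreach (float s; f[]@samples)
    total += s;
@Cd = set(total, total, total);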

We're continuously spending effort to minimize this overhead, however. 16.0 has more aggressive caching of code & bindings that should help.

Another cost is synchronizing the incoming geometry data. This has to be done 4 times in the 4-node example, as we keep a separate copy for each node. Polygons are particularly slow, but this has been sped up significantly in 16.0. A 3 MPoly model I just tested went from 0.062s -> 0.010s on switching to 16.0 for a simple @P += 1; kernel.

2) It depends where the time is spent. If it is primarily your kernel, you will see the speed you expect. But if it is in synchronizing the geometry (as in the 3 MPoly example), you still need to synchronize everything, so it will be unchanged. Similarly, if the group is an ad-hoc group (like 0-100000 or @foo>0.5), there is the time to actually create, initialize, and destroy the group to account for.
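
One way around the ad-hoc cost is to bake the selection into a named group once, in an upstream wrangle that is not time dependent, and then reference that group by name in the per-frame wrangle. A sketch (f@foo is an assumed attribute):

// Upstream Point Wrangle, cooked once: bake the selection into a
// named point group instead of re-evaluating @foo>0.5 every frame.
i@group_upper = f@foo > 0.5;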

As bonsak suggests, however, a concrete .hip file would help in understanding what is generating these differences.
Member
14 posts
Joined: Feb. 2017
Hi jlait,

Thanks for your answer, it really helps to know more about what's going on under the hood.

Following up on question 2:
The kernel's code is fairly simple (IMO as optimized as it can be) and the group exists beforehand, so there is no ad-hoc creation. I have mocked up a toy SOP network where you can see a similar behaviour performance-wise. Apologies, but it is not possible for me to upload a hip file (see attached image).

(From the image) Both wrangle nodes run the very same code on points, reading @Time just so they are time dependent. On my machine, on Houdini 15.5.x, the wrangle with the input geometry of 100000 points runs at ~110 fps, whereas the node with the input geometry of 200000 points, set to run on a group of 100000 points, runs at ~56 fps.
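
For reference, a stand-in kernel along these lines (the exact code in the screenshot isn't recoverable here, so this is a guess):

// Trivial per-point work; @Time is read only to make the node
// time dependent.
@P.y += 0.001 * sin(@Time);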

I hope this example illustrates what I was trying to explain.

Cheers,
Ruben

Attachments:
groupingPerformanceComparison.png (63.4 KB)

Staff
6225 posts
Joined: July 2005
Yes, I can reproduce that difference. The problem here is overhead: iterating over the group accounts for the performance difference. We have a special fast path for no groups and fully defragmented point lists that can run directly over the point offsets.
Member
14 posts
Joined: Feb. 2017
When you say ‘overhead’, are you referring to the overhead of synchronizing the geometry, as you mentioned in your previous post? (And hence, larger geometry means larger overhead, despite the target geometry sizes being the same?)

It sounds like acting on small groups sequentially on relatively large geometries might bring performance down, even when the size of the groups is small?
Staff
6225 posts
Joined: July 2005
I meant the overhead of stepping through the group to see what points are active, rather than just directly working on the entire point list.

So, yes, I would think acting on small groups sequentially will not be as efficient as acting on everything and doing your if test inside the kernel. But keep in mind that this sort of optimization can and will change version to version, so I don't like giving too strong a rule of thumb for fear it will be followed long after we speed up whatever made it necessary.
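
For example (assuming a pre-existing point group named half), leave the wrangle's Group parameter empty and test membership in the kernel:

// Run over all points and branch per point, instead of binding
// the wrangle to the group via the Group parameter.
if (inpointgroup(0, "half", @ptnum))
    @P.y += 0.001 * sin(@Time);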
Member
14 posts
Joined: Feb. 2017
I'll keep that in mind. Thanks for your answer!

Something that I've been asking for on a different thread is some reference docs with technical documentation, best practices, performance considerations, etc. Is there anything like that available anywhere?
Staff
6225 posts
Joined: July 2005
Unfortunately, not that I know of.

The problem is you get into some messy details very quickly, and these messy details can change abruptly.

For example, the speed of blocks of computation is highly sensitive to whether they get JIT compiled to native code. But the choice to do this is quite a black box, so it is hard to provide concrete information to optimize for.

Probably the best example of this is pbd_granular.h's computePointDistanceDelta():

void
computePointDistanceDelta(vector4 dPr, dPa;
                          const vector pi, pj;
                          const float mass, massj;
                          const float curdist;
                          const float restdist;
                          const float kpr, kpa;
                          const float wr, wa;
                          const int shocktype;
                          const vector shockaxis;
                          const float shockscale)
{
    vector r = pi - pj;
    // Constraint and gradient.
    float C = curdist - restdist;
    vector gradC = r / curdist;
    float ks = 1;
    if (shocktype == 2)         // local
    {
        ks = __taylor_exp( shockscale * dot(r, shockaxis));
    }
    float weight = ks * massj / (mass + ks * massj);
    // Handle opposing weight being 0, ie, infinite
    weight = select(massj == 0, 1, weight);
    // Use weighted distance constraint if within attract distance.
    vector4 dpj = -kpa * C * gradC * (weight * wa);
    dpj.w = wa;
    dPa += dpj;
    // Use weighted (inequality) repel constraint if within repel distance.
    dpj = -kpr * C * gradC * weight;
    dpj.w = wr;
    dpj = select(C < 0, dpj, 0);
    dPr += dpj;
}

This block of code is used by the grain solver, so we want it as fast as possible. In particular, we fetch the neighbour point attributes (like mass, position, radius) before this block of computation. The point() functions can't be made native code, so by pulling them out we make sure we have a long block of potentially native code.
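
A sketch of that caller pattern (everything here is illustrative: the neighbour index, the constants, and the assumed f@mass and f@pscale attributes):

#include <pbd_granular.h>

// Sketch only: `nbr` stands in for a neighbour point number that the
// real solver finds via a point-cloud lookup.
int    nbr      = 0;
vector pj       = point(0, "P", nbr);
float  massj    = point(0, "mass", nbr);
float  curdist  = distance(@P, pj);
float  restdist = 2.0 * f@pscale;

// All point() fetches are done above; everything from here on is one
// straight-line block of maths that is a candidate for native code.
vector4 dPr = 0;
vector4 dPa = 0;
computePointDistanceDelta(dPr, dPa, @P, pj, f@mass, massj,
                          curdist, restdist,
                          1.0, 1.0,     // kpr, kpa (illustrative)
                          1.0, 1.0,     // wr, wa (illustrative)
                          0, {0,0,0},   // shocktype, shockaxis
                          0.0);         // shockscale
// (The real solver then applies dPr and dPa using their .w weights.)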

The next thing to note is that this is used by functions like pointDistanceUpdateNoMassUniformPscale(), which make assumptions like mass == 1.0. computePointDistanceDelta() does not have any apparent special case for mass == 1.0, but VEX uses very strong constant propagation, meaning all those mass computations will be cut out.

    float weight = ks * massj / (mass + ks * massj);

This, for example, becomes ks / (1 + ks).

But, likewise, shocktype is uniform - it never varies particle by particle. So VEX will build a version of the function with shocktype set to a constant value, making

 if (shocktype == 2)  

the equivalent of either if (0) or if (1). If it ends up as if (0), ks is then a constant 1, which causes weight to be a constant 0.5, further cutting out code…
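
Illustratively, hand-folding those constants (a sketch, not actual compiler output) leaves something like:

// What the function effectively reduces to when mass == massj == 1.0
// and shocktype is the uniform constant 0.
void computePointDistanceDelta_folded(vector4 dPr, dPa;
                                      const vector pi, pj;
                                      const float curdist, restdist;
                                      const float kpr, kpa;
                                      const float wr, wa)
{
    vector r = pi - pj;
    float C = curdist - restdist;
    vector gradC = r / curdist;
    // ks folded to 1 (shock branch gone); weight folded to 1/(1+1) == 0.5.
    vector4 dpj = -kpa * C * gradC * (0.5 * wa);
    dpj.w = wa;
    dPa += dpj;
    dpj = -kpr * C * gradC * 0.5;
    dpj.w = wr;
    dpj = select(C < 0, dpj, 0);
    dPr += dpj;
}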

You will note the use of select() where the condition actually varies per particle. In this case we don't want an if(), as that would break the chain of native code. Since computing both halves of the expression is cheap enough, we compute both and use the select() instruction to avoid branching.
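
In wrangle terms, a trivial sketch of the same idea:

// Branchy form (breaks the straight-line block):
//   if (@P.y > 0) @Cd = {1,0,0}; else @Cd = {0,0,1};
// Branch-free form: compute both values and let select() pick one
// per point.
@Cd = select(@P.y > 0, {1,0,0}, {0,0,1});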
Member
14 posts
Joined: Feb. 2017
Thanks for the example and all the detailed information!