Attribute wrangle significantly slower than the old point wrangle?

User Avatar
Member
29 posts
Joined: Oct. 2015
Offline
Hello, just wondering, is it normal that the attribute wrangle is significantly slower than the old (hidden in H16) point wrangle?

As an example, the performance dips from 2.5s to 4.9s just by changing the type of a point wrangle to an attribute wrangle - test file attached.
Edited by Lyubomir Popov - Sept. 3, 2017 14:30:40

Attachments:
fast_point_wrangle.png (479.2 KB)
point_vs_attr_wrangle_speed.hiplc (52.2 KB)
slow_attr_wrangler.png (477.4 KB)

User Avatar
Member
2537 posts
Joined: June 2008
Offline
I can confirm your findings. But we are only talking about one hundredth of a second.

Attachments:
Untitled-1.jpg (45.8 KB)

Using Houdini Indie 20.0
Windows 11 64GB Ryzen 16 core.
nVidia 3050RTX 8GB RAM.
User Avatar
Member
29 posts
Joined: Oct. 2015
Offline
Which numbers are you comparing? Seems nearly 2 times slower in my tests.

From your screenshot, the point wrangle is 4.367s, while the attrib wrangle is 8.139s.

I measured separately with the other one disabled, to make sure they don't interfere with each other.
Edited by Lyubomir Popov - Sept. 3, 2017 15:04:34
User Avatar
Member
7770 posts
Joined: Sept. 2011
Online
The old point wrangle ran in SOP context, whereas the new attrib wrangle runs in the more generic CVEX context. There are cases where the old SOP context was faster, especially for trivial operations. The older one is now deprecated and in danger of being removed in a future version. For this reason and because of the added power and flexibility of the CVEX based nodes, I wouldn't use the old deprecated one even though it may be faster in some cases.

Edit:
I took a look at your example, and it does appear that pcopen has a significant performance deficit with the cvex based attribvop. Perhaps this is a bug worth investigating further by someone at SideFX.
Edited by jsmack - Sept. 3, 2017 18:19:08
User Avatar
Member
29 posts
Joined: Oct. 2015
Offline
jsmack thanks for the reply, I try to avoid deprecated nodes as much as possible - which is why it took me months to pinpoint the issue.

I was taking something from an example, putting it in an asset full of attrib wrangles and the asset would become 2x slower than the original. Only after eliminating every other difference did I come to the conclusion it has to be the attrib wrangle's fault.

Re your edit, what is the best way to flag this with SideFX?

As to the sop context nodes being removed in a future version, I hope it doesn't happen before the cvex ones are comparably fast.
Edited by Lyubomir Popov - Sept. 4, 2017 14:23:00
User Avatar
Member
4515 posts
Joined: Feb. 2012
Offline
The best way is to submit a bug with your scene to SESI using this link:
https://www.sidefx.com/bugs/submit/

Then wait for the bug to be slain and/or an explanation from the developers.
Senior FX TD @ Industrial Light & Magic
Get to the NEXT level in Houdini & VEX with Pragmatic VEX! [www.pragmatic-vfx.com]

youtube.com/@pragmaticvfx | patreon.com/animatrix | animatrix2k7.gumroad.com
User Avatar
Member
29 posts
Joined: Oct. 2015
Offline
animatrix3d Thanks, doing that now.
User Avatar
Member
4189 posts
Joined: June 2012
Offline
Times are equal for 1 core but with 4 cores it's diverging. Screenshots with 10001 points.

Attachments:
Screen Shot 2017-09-05 at 6.46.34 AM.png (509.6 KB)
Screen Shot 2017-09-05 at 6.46.21 AM.png (454.0 KB)

User Avatar
Member
29 posts
Joined: Oct. 2015
Offline
Thanks aRtye, I've submitted a bug report, and linked back to this post so they can see everyone's results.
User Avatar
Member
4515 posts
Joined: Feb. 2012
Offline
Another thing to note is that the new Wrangles can be compiled while the old ones cannot, because they use the good ol' VOPSOP. This means there are some scenarios where the new Wrangles can crush the old ones, via chains of Wrangles and/or merged For Each loop blocks.
User Avatar
Staff
6205 posts
Joined: July 2005
Offline
Thank you! This was a very interesting file. First, I'd propose a different piece of code:

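// Gather up to 9999 neighbours within a radius of 9999 around this point.
int ptlist[] = pcfind(0, "P", @P, 9999, 9999);
vector pt;
vector t = { 0, 0, 0 };
int count = 0;
foreach (int ptidx; ptlist)
{
    pt = point(0, "P", ptidx);
    count += 1;
    t += pt;    // running sum of neighbour positions
    v@t = t;
    v@pt = pt;
}
if (count > 0) {
    // Offset by the scaled average neighbour position.
    @P += 100 * cos(@Frame) * t / float(count);
}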

since I don't like pciterate(). This runs faster in my tests, and I personally like the flow better.

The strange doubling of time is due to thread scheduling.

The old VOP SOP broke the incoming geometry into pages of points and ran on each page. The downside is if you have groups that have only a few points in a page, you are getting lots of overhead.

The new VOP SOP marshals into 1k buffers and runs on those 1k buffers. This allows sparse groups to be merged together.

The downside of the merging, however, is if you have holes in your geometry. In this example the old scatter sop has left 8 holes in the point list. So the first thread will grab 1016 points from the first page and 8 points from the second page. This then locks the second page until the first thread is done. Normally this isn't noticeable as processing a page is very fast, but in a very expensive task like this it means that after all threads are done, one is stuck waiting for the first thread to complete. I didn't notice any difference on my 8-thread home machine, for example, as one has to wait for the second round to complete anyways; while my 12-thread linux machine has a big difference because we can complete the entire sequence in one burst vs two bursts.

If you use the new scatter sop you'll see the same performance, and slightly faster with my code. If you add a Sort SOP after the scatter, you'll force a defragment and close the holes and likewise get similar speed. I've added some notes to the bug to see if we can't get better heuristics to avoid locking a page like this. But this is all part of the eternal fight between grain-size and multithreading :> You have to chop up your work fine enough that things are balanced, but if you chop it too fine you die due to overhead. With VEX we have absolutely no idea how expensive your task will be. If doing really expensive stuff like this, you might be best using the By Number option to take control of the grain size and allow even finer grain threading.
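To make the "one burst vs two bursts" point concrete, here is a toy model in Python. It is only a sketch and not how Houdini's scheduler is actually implemented. It assumes roughly 10 equal-cost 1k batches (about 10,000 points), treats each batch as taking one "round", and models the locked page as a single dependency: batch 1 cannot start until batch 0 has finished. The function name `rounds_needed` and the `blocked_by` mapping are made up for illustration.

```python
def rounds_needed(num_jobs, num_threads, blocked_by=None):
    """Count unit-time rounds needed to finish equal-cost jobs.

    blocked_by maps a job index to the job that must finish first
    (a stand-in for a page locked by another thread).
    """
    blocked_by = blocked_by or {}
    done = set()
    pending = list(range(num_jobs))
    rounds = 0
    while pending:
        # A job is runnable if it has no blocker, or its blocker
        # finished in an earlier round.
        runnable = [j for j in pending
                    if blocked_by.get(j) is None or blocked_by[j] in done]
        started = runnable[:num_threads]
        if not started:
            raise ValueError("no runnable jobs: circular dependency?")
        done.update(started)
        pending = [j for j in pending if j not in done]
        rounds += 1
    return rounds


for threads in (8, 12):
    free = rounds_needed(10, threads)
    locked = rounds_needed(10, threads, blocked_by={1: 0})
    print(f"{threads} threads: {free} round(s) without the lock, "
          f"{locked} round(s) with it")
```

Under these assumptions the 8-thread case needs two rounds either way, while the 12-thread case goes from one round to two, which is consistent with the doubling described above.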
User Avatar
Member
29 posts
Joined: Oct. 2015
Offline
Thanks for the detailed reply Jeff, I'm trying to understand all the information.

Your code runs faster in both types of wrangles, so I re-ran the test with your code only.
Depending on the number of points, I get:

1,500pts - equal performance between old and new wrangles.
2,000pts - old wrangle 2x faster
5,000pts - old wrangle 4x faster (!)
10,000pts - same speed

So the new wrangle seems very sensitive to the number of points it is working on.

To further isolate the issue, I've replaced the scatter with a simple arrangement of the points done in another wrangle, followed by a sort. I'm eager to try the “Run Over Numbers” mode, but couldn't find an example of how to bind attributes in that mode.

I'm attaching the updated file and a couple of screenshots.

P.S. This example might seem a bit contrived, but it is a simple way to illustrate the issue I'm having as part of a bigger flocking sim where I set attributes on points based on those of their neighbours.
Edited by Lyubomir Popov - Sept. 5, 2017 19:33:36

Attachments:
point_vs_attr_wrangle_speed_01.hiplc (58.4 KB)
5000pts.png (390.8 KB)
2000pts.png (358.3 KB)
1500pts.png (355.4 KB)

User Avatar
Member
29 posts
Joined: Oct. 2015
Offline
UPDATE: Just managed to get Run Over Numbers to work, and it is massively faster and more controllable! Thank you for the suggestion.

I've linked the “Number count” to the number of points I have, and set “Thread Job Size” to number of points / available threads(8). To make sure the comparison isn't affected by the wrangles competing for threads, I run each on its own.

Results with 2000 points:
Attrib wrangle run over number: 0.825s
Old point wrangle: 2.064s
Attrib wrangle run over points: 3.851s

Just one more question - you mentioned holes in the point list left by the old scatter sop, how can I tell there are holes in the point list?
Edited by Lyubomir Popov - Sept. 6, 2017 04:13:21

Attachments:
wrangle_comparison.hiplc (62.8 KB)
comparisson_by_number.jpg (293.0 KB)

User Avatar
Staff
6205 posts
Joined: July 2005
Offline
Lyubomir Popov
Just one more question - you mentioned holes in the point list left by the old scatter sop, how can I tell there are holes in the point list?

In geometry spreadsheet turn on “Map Offset” in the list of attributes. Scroll to the bottom and see if the last offset number matches your last point number. If it does, there are no holes.
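The same check can be written down as a tiny Python sketch. It works on a plain list of Map Offset values as you would read them off the spreadsheet, in point-number order. It does not talk to Houdini at all, and `count_holes` is a hypothetical helper name. It assumes offsets start at 0 and increase with point number.

```python
def count_holes(map_offsets):
    """Return how many unused offsets sit below the last point's offset.

    map_offsets: Map Offset values listed in point-number order,
    assumed to start at 0 and to be strictly increasing.
    """
    if not map_offsets:
        return 0
    last_point_number = len(map_offsets) - 1
    # With no holes, the last offset equals the last point number.
    return map_offsets[-1] - last_point_number


print(count_holes([0, 1, 2, 3]))        # no gaps in the offsets
print(count_holes([0, 1, 2, 4, 5, 7]))  # offsets 3 and 6 are unused
```

A result of 0 corresponds to "the last offset number matches your last point number"; anything larger means holes.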

I've linked the “Number count” to the number of points I have, and set “Thread Job Size” to number of points / available threads(8).

This is dangerous, as you might run into the same thread-balancing situation as before. If your operation is known to be really expensive, you can just use a fixed smaller number like 128, which balances across any number of points. (For example, if the last half of your geometry has very sparse points that aren't within interaction distance, you might be able to skip those faster.)
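A rough way to see why a fixed small job size balances better is the Python sketch below. It is a toy model and not Houdini's actual scheduler. The per-point costs are invented (the first half of the points is 10x more expensive than the sparse second half), jobs are handed out in order to whichever thread frees up first, and `makespan` is a hypothetical helper name.

```python
import heapq


def makespan(costs, job_size, num_threads):
    """Total time when jobs of job_size points go to the first free thread."""
    # Cost of each job is the sum of the per-point costs it covers.
    jobs = [sum(costs[i:i + job_size])
            for i in range(0, len(costs), job_size)]
    free_at = [0] * num_threads  # time at which each thread becomes free
    heapq.heapify(free_at)
    for job in jobs:
        start = heapq.heappop(free_at)      # earliest available thread
        heapq.heappush(free_at, start + job)
    return max(free_at)


# 2000 points on 8 threads: dense, expensive first half; cheap second half.
costs = [10] * 1000 + [1] * 1000

print("job size = points / threads:", makespan(costs, 2000 // 8, 8))
print("job size = 128:             ", makespan(costs, 128, 8))
```

With one big job per thread, the threads that drew the cheap half finish early and sit idle while the others grind through the expensive half. With jobs of 128 points, the idle threads can pick up the remaining work, so the total time lands much closer to the ideal even split.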
User Avatar
Member
1 posts
Joined: Oct. 2021
Offline
jlait
The downside of the merging, however, is if you have holes in your geometry. In this example the old scatter sop has left 8 holes in the point list. So the first thread will grab 1016 points from the first page and 8 points from the second page. This then locks the second page until the first thread is done. Normally this isn't noticeable as processing a page is very fast, but in a very expensive task like this it means that after all threads are done, one is stuck waiting for the first thread to complete. I didn't notice any difference on my 8-thread home machine, for example, as one has to wait for the second round to complete anyways; while my 12-thread linux machine has a big difference because we can complete the entire sequence in one burst vs two bursts.

If you use the new scatter sop you'll see the same performance, and slightly faster with my code. If you add a Sort SOP after the scatter, you'll force a defragment and close the holes and likewise get similar speed. I've added some notes to the bug to see if we can't get better heuristics to avoid locking a page like this. But this is all part of the eternal fight between grain-size and multithreading :> You have to chop up your work fine enough that things are balanced, but if you chop it too fine you die due to overhead. With VEX we have absolutely no idea how expensive your task will be. If doing really expensive stuff like this, you might be best using the By Number option to take control of the grain size and allow even finer grain threading.

The Sort SOP's "Optimize Internal Vertex Order" option says that it affects the linear vertex order, but if you use "i@vertex_index = @vtxnum;" in a wrangle, the index does not change while the Map Offset does change (closing the holes?).
As far as I know, a hole is a gap left in the internal array (which keeps all values of all elements of a particular type together) when you delete some points/prims, and the Sort node can rewrite that array. So is the description of the Sort node wrong?