stamps faster than compiled sops? lies!

Member
1498 posts
Joined: May 2006
Offline
Some chat on the Discord forum led to this example: boxes on a curve, stamping the width vs. compile + foreach to set the width.

I'm clearly doing something wrong, because as the number of points increases (say +1000), the stamp is substantially faster than the compiled version; on this machine the stamp runs at 10fps, compiled runs at 2fps.

Ideas?

Attachments:
stamp_vs_compile.hip (83.2 KB)
compile_vs_stamp.jpg (33.2 KB)

http://www.tokeru.com/cgwiki [www.tokeru.com]
https://www.patreon.com/mattestela [www.patreon.com]
Member
6468 posts
Joined: Sept. 2011
Online
I get 30 fps for the compiled, and 15 for the stamp in your scene.

Edit:
For comparison, I get 85 fps with Copy to Points using {foo,1,1} as a scale attribute instead of stamping one by one.
Edited by jsmack - April 17, 2017 23:02:18
Member
460 posts
Joined: July 2005
Offline
Hi everybody,
I've been testing and yes, the old copy stamping is faster than the compiled version.

I'm using 2 Intel Xeon E5-2630 v3 processors for a total of 32 threads; not the latest generation but still good, or so I thought.

Another user with an Intel i7-5930K processor with 12 threads is getting much faster speeds, around 10x.

So I'm not sure what to say. Downgrade to one processor?
varomix - Founder | Educator @ Mix Training
Technical Artist @ Meta Reality Labs
Member
460 posts
Joined: July 2005
Offline
We should be talking about cook time, not viewport fps.

These are my results:

Attachments:
copyStamp_profile.png (30.2 KB)

varomix - Founder | Educator @ Mix Training
Technical Artist @ Meta Reality Labs
tamte
Member
7206 posts
Joined: July 2007
Offline
12 fps stamp
5 fps uncompiled for each
21 fps compiled (4 cores)

So I guess the lie is that an uncompiled for-each is as fast (or slow) as stamping. I've noticed in many other cases that this is not true, which is sad.
H16.0.572
Tomas Slancik
FX Supervisor
Method Studios, NY
Member
6468 posts
Joined: Sept. 2011
Online
If you do trivial operations, then stamping can still be fast, but if you actually do something where parallelizing matters, then you can see massive gains. Also, stamping transformations is a bad example, since Copy to Points will handle that orders of magnitude faster. I have yet to find a test case where the compiled for-loop version is slower than the alternatives, excluding cases better served by Copy to Points (stamping transforms) or by VEX parallelism (modifying all the data at once).
Member
4189 posts
Joined: June 2012
Offline
Compiled
-j1 = ~5fps
-j6 = ~13fps
-j12 = ~13 fps
-j24 = ~3 fps

Stamped
-jn ~5fps

Ubuntu - 2 * X5680 @ 3.33GHz × 24
Edited by anon_user_37409885 - April 17, 2017 23:25:29
tamte
Member
7206 posts
Joined: July 2007
Offline
comparing cook times 240f:
i7-6700HQ 2.6GHz 4Cores
Edited by tamte - April 17, 2017 23:19:26

Attachments:
stamp_compile_benchmark.png (22.5 KB)

Tomas Slancik
FX Supervisor
Method Studios, NY
Member
6468 posts
Joined: Sept. 2011
Online
60 frames:

Attachments:
cookstats.jpg (77.5 KB)

Member
6468 posts
Joined: Sept. 2011
Online
Artye
Compiled
-j1 = ~5fps
-j6 = ~13fps
-j12 = ~13 fps
-j24 = ~3 fps

Stamped
-jn ~5fps

Ubuntu

Looks like you found a threading bug; this must be why those guys are seeing crazy slow speeds.
Member
4189 posts
Joined: June 2012
Offline
jsmack
Looks like you found a threading bug, this must be why those guys are seeing crazy slow speeds.

It'll be mighty tasty when fixed
Member
1498 posts
Joined: May 2006
Offline
A threading bug was what I suspected, but I wanted to get a few eyes on it to make sure I wasn't doing something really stupid (which is always highly likely).
http://www.tokeru.com/cgwiki [www.tokeru.com]
https://www.patreon.com/mattestela [www.patreon.com]
Staff
5647 posts
Joined: July 2005
Offline
As mentioned, please use perfmonitor or wall clock time, not FPS. FPS not only includes draw time, but also has a weird inverse relationship to time. It is too tempting to say: “I lost 5 FPS!” when that is a very different thing going 15->10 versus 60->55.
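
To make the inverse relationship concrete, here is a small Python sketch (not part of the original post) that converts the two FPS drops mentioned above into added milliseconds per frame. The function names are made up for illustration.

```python
def frame_time_ms(fps: float) -> float:
    """Convert frames per second to milliseconds per frame."""
    return 1000.0 / fps


def added_cost_ms(fps_before: float, fps_after: float) -> float:
    """Extra time per frame implied by dropping from fps_before to fps_after."""
    return frame_time_ms(fps_after) - frame_time_ms(fps_before)


slow = added_cost_ms(15, 10)  # roughly 33.3 ms extra per frame
fast = added_cost_ms(60, 55)  # roughly 1.5 ms extra per frame

print(f"15 -> 10 fps: +{slow:.1f} ms per frame")
print(f"60 -> 55 fps: +{fast:.1f} ms per frame")
```

The same "5 FPS lost" is about a 20x difference in actual time, which is why cook time is the safer number to compare.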

Please do also keep in mind that we worked really hard to make copy/stamp fast! And we didn't spend any time trying to slow it down in order for new approaches to seem fast by comparison.

A quick test on my machine (6 cores, 12 threads) gives, for 240 frame playback:

compiled, no threading: 23.3sec
compiled, multithread: 15.3sec
Copy/stamp: 15.2 + 8.7 = 23.9 sec

So we see here only a very small threading gain, which makes sense as the workload is very trivial. This is why we expose things like Job Size in the Attribute Wrangle. We may need to have something like that added to make this easier to handle.
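
For a rough sense of scale, the timings quoted above can be turned into a speedup and a per-core efficiency. This is only arithmetic on the numbers in this post; treating the 6 physical cores as the ideal speedup is an assumption made for illustration.

```python
single_thread_s = 23.3  # compiled, no threading (from the timings above)
multi_thread_s = 15.3   # compiled, multithreaded (from the timings above)
cores = 6               # physical cores; assumed here to be the ideal speedup

speedup = single_thread_s / multi_thread_s
efficiency = speedup / cores

print(f"speedup: {speedup:.2f}x")
print(f"efficiency per core: {efficiency:.0%}")
```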

The massive drop at 24 threads suggests something is tripping over itself when the thread count gets high enough in this example.

In the attached I have added a Blocked variant that creates a block attribute to group points into sets of 256. Then it does a parallel foreach over these blocks, and only then iterates over each individual point. This should cut the threading overhead. Can you let me know if it improves on your 24 thread machines?
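
The blocking idea can be sketched outside Houdini in plain Python. This is not the node network from the attached file, only an illustration of the structure: give each point a block id, run one task per block, and loop over the individual points inside that task. `process_point` is a hypothetical stand-in for the per-point work, and Python threads will not show the real speedup; the point is the number of tasks handed to the scheduler.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 256  # points per block, as described in the post


def block_id(ptnum: int, block_size: int = BLOCK_SIZE) -> int:
    """Block attribute for a point: integer division of the point number."""
    return ptnum // block_size


def group_by_block(num_points: int, block_size: int = BLOCK_SIZE) -> list:
    """Return a list of blocks, each a list of point numbers."""
    blocks = defaultdict(list)
    for pt in range(num_points):
        blocks[block_id(pt, block_size)].append(pt)
    return [blocks[b] for b in sorted(blocks)]


def process_point(pt: int) -> float:
    """Hypothetical per-point work (stands in for setting a box width)."""
    return pt * 0.5


def process_block(points: list) -> list:
    """Inner loop: iterate serially over the points of one block."""
    return [process_point(pt) for pt in points]


def run_blocked(num_points: int, workers: int = 4) -> list:
    """Outer loop: one parallel task per block instead of one per point."""
    blocks = group_by_block(num_points)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(process_block, blocks)  # map preserves block order
    return [value for block in results for value in block]


widths = run_blocked(1000)
print(len(group_by_block(1000)), "tasks instead of", len(widths))
```

With 1000 points this schedules 4 tasks rather than 1000, so the per-task overhead is paid far less often.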

Attachments:
stamp_vs_compile_blocked.hip (102.0 KB)

Member
460 posts
Joined: July 2005
Offline
These are my results from that new scene.

The new block example Jeff added is almost exactly the same speed as the stamp version, which is nice because I like the stamp way; I'm used to it, and the new way is a little complex for new people.

So what's the magic ingredient that makes it faster?

This is on a 32-thread machine.
Edited by varomix - April 18, 2017 14:03:45

Attachments:
copyStamp_profileJL.png (35.7 KB)

varomix - Founder | Educator @ Mix Training
Technical Artist @ Meta Reality Labs
Staff
5647 posts
Joined: July 2005
Offline
A hard problem with multithreading is picking “grain size”. This is the size of the amount of work you run on each task. If you make the work too small, like adding two numbers together, you'll spend all your time in task management and end up way, way, slower. So whenever you multithread in C++ you always have to think about grainsize and ensure small datasets are batched together.

It is sort of like submitting jobs to a render farm. If your frames get short enough, it can become necessary to render 10 frames at once per job rather than each frame individually.
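
A toy cost model shows why grain size matters. Everything here is an assumption for illustration: the per-item work (1 microsecond), the per-task scheduling overhead (50 microseconds), the item count, and the simplification that tasks run in even waves across the threads. It is not a measurement of Houdini or of any real scheduler.

```python
import math


def estimated_wall_time_us(items: int, grain: int, work_per_item_us: int,
                           overhead_per_task_us: int, threads: int) -> int:
    """Toy estimate: tasks run in waves, each task pays a fixed overhead."""
    tasks = math.ceil(items / grain)
    waves = math.ceil(tasks / threads)
    per_task_us = overhead_per_task_us + grain * work_per_item_us
    return waves * per_task_us


ITEMS = 1536       # hypothetical point count
WORK_US = 1        # hypothetical cost of one trivial item
OVERHEAD_US = 50   # hypothetical cost of scheduling one task
THREADS = 6

serial = ITEMS * WORK_US                                                     # 1536
fine = estimated_wall_time_us(ITEMS, 1, WORK_US, OVERHEAD_US, THREADS)       # 13056
blocked = estimated_wall_time_us(ITEMS, 256, WORK_US, OVERHEAD_US, THREADS)  # 306

print(f"serial, no tasks:      {serial} us")
print(f"one item per task:     {fine} us")
print(f"256 items per task:    {blocked} us")
```

Under these made-up numbers, one item per task is slower than not threading at all, while batching 256 items per task comes out ahead.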

The problem is that we have no idea from the outside what the size of your contained network is. If it is a big chunk of work, the task scheduling won't show up and you should see nice scaling. But with really simple examples the opposite happens.

What I have done in that example is manually batch 256 points at a time. With the given point counts, this also means only 6 threads really become active, saving your 32 thread machine from tripping over itself for no benefit.

That said, I will be looking closer at this high-thread behaviour. Currently the point() function has an unnecessary lock in this example that could also be responsible for things locking up.

I've submitted Bug: 82242 to track this.
Member
4189 posts
Joined: June 2012
Offline
Using Perf Monitor. Playback of 240 frames -j24

New:
27.849sec

Old:
1min 14sec

CopySop:
48.791 sec

Playback of 240 frames -j6

New:
23.142 sec

Old:
24.437 sec

CopySop:
46.090 sec
Member
4189 posts
Joined: June 2012
Offline
Hmmm, macOS is rubbish. H16.0.577, same machine as the Ubuntu test.

New: 1min 12sec
Old: 2min 14sec
CopyS: 44 sec
Member
7285 posts
Joined: July 2005
Offline
I wonder if jemalloc improves this on OSX since we last tried in H14:

DYLD_INSERT_LIBRARIES=$HFS/../Libraries/libjemalloc.1.dylib houdini
Member
7285 posts
Joined: July 2005
Offline
OR perhaps tbbmalloc_proxy:

DYLD_INSERT_LIBRARIES=$HFS/../Libraries/libtbbmalloc_proxy.dylib houdini
Edited by edward - April 18, 2017 23:52:39
tamte
Member
7206 posts
Joined: July 2007
Offline
Does anyone else see non-compiled for-each being much slower than Copy Stamp?
Since it's not always possible to compile, I think it's quite a big deal.

Attachments:
stamp_vs_compile_vs_notcompile.hip (133.8 KB)
stamp_compile_benchmark2.png (21.7 KB)

Tomas Slancik
FX Supervisor
Method Studios, NY