FLIP pressure projection - threading issues

Member
48 posts
Joined: June 2011
I've been trying to boil down the optimal performance settings for FLIP simulations, and I've come across a difficult trade-off when it comes to the pressure projection part of the solve.

Running on a dual-socket, 16-core workstation, all other parts of the solve tend to make great use of the full core count.

With the preconditioner option on, as expected, it drops to a single core but gets there in the end.
With the preconditioner option off, it will get to about 75% CPU usage, and actually takes almost the same time, maybe a hint longer… also expected.

The significant thing here is, if I launch Houdini with affinity constrained to 8 cores on a single socket, and 8 threads, the pressure projection with preconditioner off uses 100% CPU on that socket, and completes in around two-thirds of the time… a full 50% speed increase while only using half of the machine.
That would be great, but running on 8 cores instead of 16 makes the rest of the sim take fully twice as long, which cancels out the benefit.
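
For anyone wanting to try the same thing, this is roughly what I mean by constraining affinity. A minimal Python sketch, assuming Linux or Windows with the third-party psutil package; the core numbers are only an example, so check your own socket layout:

# Launch Houdini pinned to the 8 cores of one physical socket, so the OS
# allocates the sim's memory from that socket's local NUMA node.
# Assumes the third-party psutil package; cores 0-7 are just an example.
import subprocess
import psutil

SOCKET0_CORES = list(range(0, 8))   # adjust to match your socket/core layout

proc = subprocess.Popen(["houdini"])                  # or hbatch/hython for batch sims
psutil.Process(proc.pid).cpu_affinity(SOCKET0_CORES)  # pin before the heavy allocations start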



It's a problem I've found with Naiad, and various other memory-heavy applications when running across NUMA nodes on a workstation. That “Quick Path Interconnect” between sockets just isn't quick enough for this stuff!

Does anyone know of any way to script the pressure projection microsolver to use a particular affinity/thread count, while leaving the rest of the sim alone?
Member
4189 posts
Joined: June 2012
Which CPUs are they?

Are you sure it's not also just running at a higher clock speed with fewer threads?
Member
48 posts
Joined: June 2011
They're Xeon E5-2687W - in a Dell T7600 workstation.

I've been comparing like-for-like. They clock up to 3.4GHz whether I've got both CPUs running or just one (speedstepping only clocks up higher if there are unused cores on the same discrete CPU).

It's definitely an issue with Non-Uniform Memory Access between CPU sockets… the interconnect is fast, but it's nowhere near as fast as each socket's native memory bus, so whenever a task is both hugely CPU intensive and attempting to read/write vast amounts of data to RAM, it struggles to shift the information across the link fast enough to keep the CPUs fed with data.
So if you constrain the process to only that physical processor, the OS is intelligent enough to only use the RAM attached to that socket as well, and it will avoid using the interconnect unless you end up using more than 50% of your system RAM.
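
To put rough numbers on that, here's a crude Python sketch I'd use to see the effect, assuming numpy and psutil and a 2x8-core layout like mine: it first-touches a large buffer while pinned to socket 0, then re-reads it from socket 1 so every access has to cross the interconnect.

# Crude local-vs-remote read test. Assumes numpy + psutil, and that cores
# 0-7 / 8-15 map to socket 0 / socket 1 (check your own layout first).
import time
import numpy as np
import psutil

me = psutil.Process()

me.cpu_affinity(list(range(0, 8)))               # run on socket 0
data = np.ones(500_000_000, dtype=np.float64)    # ~4 GB, first-touched on NUMA node 0

def read_gb_per_s():
    t0 = time.perf_counter()
    data.sum()                                   # streams the whole buffer from RAM
    return data.nbytes / (time.perf_counter() - t0) / 1e9

print("local read :", read_gb_per_s(), "GB/s")   # socket 0 cores reading node 0 memory

me.cpu_affinity(list(range(8, 16)))              # move to socket 1; data stays on node 0
print("remote read:", read_gb_per_s(), "GB/s")   # every read now crosses the interconnect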
Member
4189 posts
Joined: June 2012
Can you share your test scene, please?

I'd like to run some tests too!
Member
48 posts
Joined: June 2011
Afraid I can't share these ones; they're part of a project.

It'll be apparent in any scene that has a particularly high-resolution grid attached to the solve. I'm running a sim which has around 60 million particles, but also a ~60-megavoxel grid.

The higher the particle count compared to the grid resolution, the less noticeable the impact will be… for this sim, I need decent motion-resolution as well as particle count.
Member
4189 posts
Joined: June 2012
Very interesting! The help says 4+ sockets are affected:

“During the pressure projection and viscosity solves, the matrices involved can be preconditioned to speed up the solution. However, this is a single threaded process. On machines with 4+ sockets it may be faster to disable this preconditioning and use a simpler Jacobi preconditioner which multithreads well, but can take more iterations to converge.”

http://www.sidefx.com/docs/houdini13.0/nodes/dop/flipsolver#use_preconditioner [sidefx.com]
Member
48 posts
Joined: June 2011
Yep, I don't know much about what they actually are, but the Jacobi preconditioner (off) certainly makes use of all the cores, while the default Modified Incomplete Cholesky seems to be entirely single-threaded, as the help suggests… but it's much faster to converge, so the total solve time ends up similar either way.

Based on my experience of a 2-socket machine, I suspect a 4+ socket machine would actually have real trouble with the Jacobi method: in my case it's considerably slower the moment it's allowed to communicate between sockets. I'd imagine coordinating data between 4 sockets would push the overhead even higher, and the solve would take even longer than on a 2-socket machine with (near-)identical CPUs.

The problem is that everything else scales pretty much linearly with core count, even across sockets: processing the particles in the sim, and all of the field processing except the pressure solve… so whatever is gained by constraining a sim to a single socket, you lose more through the extra time it takes to process the rest.


The key thing is, if you've got more sims to run on a project than you have workstations available, it's ultimately much more efficient on a 2-socket machine to run 2 sims in parallel, affinity-constrained to 8 cores each, than to run one sim after the other on all 16.
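
For reference, something along these lines would kick off two sims pinned to different sockets. A rough Python sketch assuming psutil and hython, with placeholder script names and core ranges matching my 2x8-core layout:

# Run two sims side by side, each pinned to its own socket. Script names are
# placeholders; assumes hython for batch sims and the third-party psutil package.
import subprocess
import psutil

jobs = [
    (["hython", "run_sim_A.py"], list(range(0, 8))),    # socket 0
    (["hython", "run_sim_B.py"], list(range(8, 16))),   # socket 1
]

procs = []
for cmd, cores in jobs:
    p = subprocess.Popen(cmd)
    psutil.Process(p.pid).cpu_affinity(cores)   # pin before the heavy allocations start
    procs.append(p)

for p in procs:
    p.wait()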

This is more for those (frequent) times when I need to throw everything I've got at a single mega-sim to push it through as fast as possible (which usually coincides with it being a sim that will happily eat up more than 50% of the machine's RAM, which would also make it impossible to run two in parallel :-P)
Member
4189 posts
Joined: June 2012
Not much help, but disabling the preconditioner nodes is not the solution.

Attachments:
DisabledPreconditioner.png (687.9 KB)

Member
48 posts
Joined: June 2011
Strange… what were you disabling? I've not noticed any significant difference in the end result, whether running with the preconditioner toggle “on” or “off”. Both of them iterate until they reach a (nearly-)divergence-free solution. One just takes more iterations than the other to get there.

If you're disabling the entire node, it would effectively prevent the sim from performing fluid dynamics at all.
Staff
809 posts
Joined: July 2006
Very interesting tests, Dan. As much benchmarking as we've done of FLIP, I don't think we've ever tested hardware affinity with Use Preconditioner off. Jacobi preconditioning is a very simple preconditioning scheme that is trivial to multithread, but it consists of a few operations over very large memory buffers, so it represents the classic case of being memory-bandwidth bound.
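
For anyone unfamiliar, the overall shape is roughly the following toy NumPy sketch of a Jacobi-preconditioned conjugate gradient solve. This is not our actual implementation, just an illustration of why the Jacobi step multithreads so easily: it's a pure element-wise divide over a huge buffer. A is assumed symmetric positive definite.

# Toy Jacobi-preconditioned conjugate gradient. The "preconditioner" is just an
# element-wise divide by the matrix diagonal, which is embarrassingly parallel.
import numpy as np

def jacobi_pcg(A, b, tol=1e-6, max_iter=1000):
    inv_diag = 1.0 / np.diag(A)      # the whole Jacobi preconditioner
    x = np.zeros_like(b)
    r = b - A @ x
    z = inv_diag * r                 # preconditioner apply: pure element-wise work
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        z = inv_diag * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x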

There's not currently any way of setting affinity in Houdini at the level of detail you're describing. We're mostly limited to what TBB (Intel's Threading Building Blocks) gives us, and there's only limited support for hardware affinity (you can read what they offer here [software.intel.com] if interested).

As I understand it, we might see a speedup using the TBB affinity tools if everything fit into the cache, which for large FLIP sims is clearly not the case. There might still be some speedup available from enforcing thread affinity across internal iterations of the pressure solve, but we'd have to do some testing. Frankly I'm skeptical it will benefit, as I think the limiting factor is on which bus the memory is allocated in the first place. I'll put in an RFE to test thread-level affinity, however.

(The TBB forums have several posts asking how to solve the NUMA problem, so you've identified a common one, I'm afraid).