Why can't you use a local machine for distributed solvers that are poorly threaded?

Member
99 posts
Joined: March 2009
If you have a FLIP or Pyro solver that doesn't scale well but can be divided up over many machines to divvy up the work, why can't you do the same thing on one machine for max efficiency? Or can you do this?
Member
806 posts
Joined: Oct. 2016
Hi,

could you elaborate on what you mean by “do the same thing on one machine”?

Running on several machines means you are distributing the calculation effort across more than one CPU (or GPU). Running the calculations on one machine means all of that effort lands on that single machine's processors. There isn't much “efficiency maxing” possible, assuming that your solver already parallelizes properly.
Also note that parallelizing - no matter if on one processor (be it CPU or GPU) or on several, distributed over machines - always adds management overhead. This is assuming that you are trying to parallelize calculations. If you are splitting calculations (i.e. you can work on separate areas of a simulation without having to combine the data at all), the best you can get on one machine is good parallel computing (cores/CPU+GPU), which is already doable in Houdini.
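As a back-of-the-envelope illustration of why there is little “efficiency maxing” left once a solver parallelizes properly (a toy sketch, not Houdini code), Amdahl's law caps the speedup by the serial fraction of the work:

```python
# Amdahl's law: speedup is limited by the serial fraction of the work,
# no matter how many workers (local threads or extra machines) you add.
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)

# A hypothetical solver that is 90% parallelizable:
for n in (1, 2, 4, 8, 16, 64):
    print(f"{n:3d} workers -> {amdahl_speedup(0.9, n):.2f}x speedup")
# Tops out near 10x; past a point, extra workers buy almost nothing,
# and the management overhead can even make things slower.
```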

Marc
Member
99 posts
Joined: March 2009
I guess what I'm saying is: let's say you have a scene set up that isn't going to scale to 12 hyperthreaded cores on the host machine, which is why you divide it up for distributed solving. Wouldn't it be just as valuable to find the max-efficiency thread count for a solve and divide the work into buckets on the host machine the same way?
Member
806 posts
Joined: Oct. 2016
Hi,

cgbeige
I guess what I'm saying is: let's say you have a scene set up that isn't going to scale to 12 hyperthreaded cores on the host machine, which is why you divide it up for distributed solving. Wouldn't it be just as valuable to find the max-efficiency thread count for a solve and divide the work into buckets on the host machine the same way?

that sounds a bit convoluted to me - there must be a reason why crunching the numbers doesn't work (well enough) in parallel. If the reason is, for example, that some threads depend on other threads having finished their calculations, you are looking at a sequential problem anyway.
Parallelizing usually means that you split up your problem into independent sub-problems. “Independent” is the important bit here, because dependence kicks you out of parallel processing (in most cases, with a few exceptions).
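To make that distinction concrete, a toy sketch in plain Python (nothing Houdini-specific; solve_slice is a made-up stand-in for cooking one independent region of a domain):

```python
from multiprocessing import Pool

def solve_slice(slice_id: int) -> float:
    # Stand-in for simulating one independent region of the domain.
    return sum(i * i for i in range(slice_id * 100_000, (slice_id + 1) * 100_000))

if __name__ == "__main__":
    # Independent sub-problems: all slices can cook at the same time.
    with Pool(4) as pool:
        results = pool.map(solve_slice, range(4))

    # Dependent sub-steps: each one needs the previous result, so no
    # number of workers helps - this part stays sequential.
    state = 0.0
    for r in results:
        state = state * 0.5 + r
```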

In *theory* there shouldn't be a difference between solving a calculation on two separate CPUs versus two cores of one CPU. Obviously there are some differences - like sharing access to data. It might be that you have to access vast amounts of data and keep a copy of it on each single computer. That MAY be faster than pushing everything through a single CPU with a single data bus.

But I'd say - from my personal and therefore non-universal experience with parallel processing and PARALLEL processing (i.e. on one CPU versus across several computers) - it really, really, REALLY depends on the concrete setup. In almost all cases, I'd guess that with what you describe (the data access thing aside) you wouldn't gain anything from shifting calculations to another computer if you could just as well use additional threads on your first CPU. That assumes the parallelizing basically does the same as the splitting, which I would think it does.

(Let me stress: Of course data access can be the issue anyway, maybe your CPU's cache can be utilized better for 1-2 cores so that more CPUs running at 1-2 cores but full cache availability are faster. In short: Data access CAN be a thing. But that's really very specific.)
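If you want to find that max-efficiency threshold empirically, a minimal sketch of a timing loop (assuming Houdini honours the HOUDINI_MAXTHREADS environment variable, which caps its thread count, and a hypothetical sim.py hython script of your own that loads the .hip file, cooks the sim range, and exits):

```python
import os
import subprocess
import time

# sim.py is a hypothetical hython script that cooks your sim and exits.
for threads in (1, 2, 4, 8, 12, 24):
    env = dict(os.environ, HOUDINI_MAXTHREADS=str(threads))
    start = time.time()
    subprocess.run(["hython", "sim.py"], env=env, check=True)
    print(f"{threads:2d} threads: {time.time() - start:.1f}s")
# Where the timings flatten out is the thread count past which
# the solve has stopped scaling.
```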

Just my 3.14159265 cents, obviously

Marc
Member
4189 posts
Joined: June 2012
cgbeige
I guess what I'm saying is: let's say you have a scene set up that isn't going to scale to 12 hyperthreaded cores on the host machine, which is why you divide it up for distributed solving. Wouldn't it be just as valuable to find the max-efficiency thread count for a solve and divide the work into buckets on the host machine the same way?

Yes - people do this already for the Wire solver and FEM, I think. You are meant to be able to run a few instances of Houdini at the same time.
Staff
6187 posts
Joined: July 2005
cgbeige
why can't you do the same thing on one machine for max efficiency? Or can you do this?

You can do this. It's even how I often test distributed sims without needing to wrangle lots of hardware. But if an algorithm can be distributed, it is a much easier task to thread it. So most of the time, if you have stopped getting any returns from more threads, you are not going to get any more returns by distributing to more processes on the same machine.

Historically, there was a notable exception to this. I had so poorly implemented the SPH multithreading that distributing 4 single-threaded versions on one machine was faster than running a 4-thread simulation. FLIP and Pyro are much better implemented, however, so I would be surprised if you got any improvement by distributing within one machine vs using multiple cores.
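For anyone who wants to try that local test, a rough sketch of its shape (hedged: simtracker.py ships with Houdini but its location varies by version, slice_sim.py is a hypothetical hython script of your own, and this assumes the hip file's distributed setup reads the slice index from the SLICE environment variable, as stock HQueue-style setups do):

```python
import os
import subprocess

NUM_SLICES = 4
TRACKER_PORT = 8000

# Start the sim tracker the slices use to find each other and
# exchange boundary data. Adjust the simtracker.py path for your install.
tracker = subprocess.Popen(
    ["hython", "simtracker.py", str(TRACKER_PORT), str(TRACKER_PORT + 1)]
)

# One process per slice, all on the same machine.
slices = []
for s in range(NUM_SLICES):
    env = dict(os.environ, SLICE=str(s))
    slices.append(subprocess.Popen(["hython", "slice_sim.py"], env=env))

for p in slices:
    p.wait()
tracker.terminate()
```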
Member
99 posts
Joined: March 2009
jlait
cgbeige
why can't you do the same thing on one machine for max efficiency? Or can you do this?

You can do this. It's even how I often test distributed sims without needing to wrangle lots of hardware. But if an algorithm can be distributed, it is a much easier task to thread it. So most of the time, if you have stopped getting any returns from more threads, you are not going to get any more returns by distributing to more processes on the same machine.

Historically, there was a notable exception to this. I had so poorly implemented the SPH multithreading that distributing 4 single-threaded versions on one machine was faster than running a 4-thread simulation. FLIP and Pyro are much better implemented, however, so I would be surprised if you got any improvement by distributing within one machine vs using multiple cores.

Right – this is where I was confused, I guess. If a sim shows diminishing returns with a lot of local cores, it wouldn't make sense to distribute it either, since it wouldn't scale there for the same reason.
Member
48 posts
Joined: Aug. 2013
So guys, is it possible to distribute FLIP sims on a single machine? And if so, do the slices need to run in parallel, or can they be serialized, so that after slice 01 finishes simulating, slice 02 can begin?