OpenCL on CPU

Hey,
I tried to run OpenCL on the CPU as described here by SideFX: http://www.sidefx.com/docs/houdini14.0/news/13/opencl [sidefx.com]
It is supposed to be 1.5 - 3 times faster, even on a CPU. So I made a test…
38 minutes instead of 40. Is a special CPU necessary for this to work properly? I use an i7 quad core at 3.4 GHz.
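For anyone else running this comparison: a quick way to double-check which OpenCL devices your drivers actually expose is to list them outside of Houdini. This is only a sketch and assumes the standalone pyopencl package is installed; the HOUDINI_OCL_DEVICETYPE variable in the comment is the one the linked docs describe for forcing the CPU device, so verify the exact spelling there.

```python
# List the OpenCL platforms/devices the installed drivers expose.
# Requires the standalone "pyopencl" package; run outside of Houdini.
import pyopencl as cl

for platform in cl.get_platforms():
    for dev in platform.get_devices():
        print(platform.name, "->", dev.name,
              "(", cl.device_type.to_string(dev.type), ")")

# To force Houdini onto the CPU device (per the linked docs), set this
# before launching:  export HOUDINI_OCL_DEVICETYPE=CPU
```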
- malexander (Staff)
What generation of Intel? The i5 2000 series has 256b AVX, but it isn't until Haswell (the 4000 series) that it gets a 512b upgrade. That could be part of the reason.
The other one is that not all parts of the simulation are OpenCL accelerated. The parts that are run quite a bit faster, even on the CPU, but the parts that aren't stay exactly the same. It sounds like the majority of your benchmark is probably not OpenCL accelerated, or is encountering another bottleneck. We won't know unless we see the file, though.
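To see which of those instruction sets a particular CPU actually reports (and therefore what the OpenCL CPU driver has to work with), a quick check like the one below works on Linux; it just reads the kernel's flag list, so treat it as a rough sanity check rather than anything Houdini-specific.

```python
# Linux-only: print which SIMD instruction sets the CPU reports.
flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            break

for isa in ("sse4_2", "avx", "avx2", "fma", "avx512f"):
    print(f"{isa:8s}: {'yes' if isa in flags else 'no'}")
```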
- malexander (Staff)
OpenCL also benefits larger simulations more than small ones, i.e. you'll see a bigger speedup on a 500x500x500 volume than on an 80x80x80 one. There's just not enough “work” available to really see a speedup in the small-volume case. Since these large cases take a huge amount of time to solve (e.g. overnight), even a 50% speedup can net you a gain of hours over a long simulation.
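Back-of-the-envelope numbers for that, assuming the solve time scales roughly with voxel count (the 10-hour sim below is just a hypothetical):

```python
# Rough scaling: voxel counts of the two example grids, and what a
# "50% speedup" (1.5x) is worth on a hypothetical 10-hour sim.
small = 80 ** 3           # 512,000 voxels
large = 500 ** 3          # 125,000,000 voxels
print(f"work ratio: {large / small:.0f}x")   # ~244x more voxels

overnight_hours = 10.0    # hypothetical long sim
speedup = 1.5
print(f"hours saved: {overnight_hours - overnight_hours / speedup:.1f}")  # ~3.3
```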
- johner (Staff)
There's some information about getting the best performance from OpenCL on the CPU in this thread [sidefx.com], as well as some benchmark files you can test with. Also here [sidefx.com].
In general you can use the Performance Monitor to see where the time is going. OpenCL should generally be faster for advection and the multigrid solve, which are often the most expensive parts of a large smoke / pyro sim. However, if you're spending most of your time doing other things such as deforming collisions, sourcing, drawing the viewport, or caching the simulation to memory, you'll see less of a speedup.
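If you'd rather drive that from Python than from the pane itself, something along these lines cooks a frame range while recording a profile. This is a rough sketch from memory, so check the hou.perfMon documentation for the exact method names.

```python
# Sketch: cook a frame range and record a Performance Monitor profile.
# Method names are from memory -- verify against the hou.perfMon docs.
import time
import hou

profile = hou.perfMon.startProfile("pyro benchmark")
for frame in range(1, 101):
    t0 = time.time()
    hou.setFrame(frame)      # advancing the frame cooks the DOP simulation
    print(f"frame {frame:3d}: {time.time() - t0:6.2f} s")
profile.stop()               # results appear in the Performance Monitor pane
```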
- jlait (Staff)
In general, C++ on the CPU can be every bit as fast as OpenCL on the CPU. The fact that it isn't is because we are either using different data structures in the C++ case, or we haven't fully SIMDified all of our code.
First, Intel has put a lot of work into their OpenCL drivers to SSE/SIMD/AVX the resulting code, which isn't in our C++ version. While VEX operations chain down to SSE instructions eventually, our Multigrid has no such code for key operations like the Laplacian smooth.
Second, and likely most important for advection, we use a tiled grid format. With OpenCL we are forced to use a flat grid. This is less efficient for memory use, and much slower if you have big empty areas. For example, with FLIP our tiled layout is essential to minimize evaluation outside of the liquid. (A lot of the speed gains from H13 -> H14 come from better use of this tiled structure.) But in Pyro your grid tends to be filled with stuff; specifically, velocities tend to be non-zero everywhere. We thus don't get the memory advantage, but pay a performance cost whenever we randomly access the grid.
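Purely as an illustration of that tiled-versus-flat trade-off (this is not how Houdini's grids are implemented), compare a dense array with a dictionary of tiles: empty space costs nothing in the tiled layout, but every random access pays an extra lookup.

```python
import numpy as np

TILE = 16  # illustrative tile size only

# Flat grid: memory scales with the whole bounding box, even if mostly empty.
flat = np.zeros((256, 256, 256), dtype=np.float32)   # ~64 MB of mostly zeros

# Tiled grid: only tiles that actually hold data get allocated.
tiles = {}   # (tx, ty, tz) -> TILE^3 block

def set_voxel(x, y, z, value):
    key = (x // TILE, y // TILE, z // TILE)
    tile = tiles.setdefault(key, np.zeros((TILE, TILE, TILE), dtype=np.float32))
    tile[x % TILE, y % TILE, z % TILE] = value

set_voxel(10, 20, 30, 1.0)
print(len(tiles), "tile(s) allocated")   # 1 -- but each access needs a dict lookup
```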
- johner (Staff)
What Jeff said.
The only thing I'd add is that we specifically write the OpenCL kernels to avoid branching as much as possible, since divergent code paths drastically slow down the GPU. For example, for boundary conditions during the multigrid solve we use padded grids rather than have any “if” statements in the kernels.
Fortunately, avoiding branching also makes the kernels much easier to automatically “vectorize” to SSE/AVX on the CPU by the generally excellent Intel compiler technology embedded within their driver. Regular C++ compilers have a more complex language and more general code to deal with, so they have a harder time with auto-vectorization, meaning we'd likely have to do any similar vectorization by hand.
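A tiny 2D numpy illustration of the padded-grid idea (nothing to do with the actual kernels, and the boundary handling differs slightly between the two): the padded version runs the same stencil with no per-voxel branch at all, which is exactly the shape of code that auto-vectorizers and GPUs like.

```python
import numpy as np

def smooth_branchy(grid):
    # One "if" per voxel just to skip the boundary -- hard to vectorize.
    out = grid.copy()
    nx, ny = grid.shape
    for i in range(nx):
        for j in range(ny):
            if 0 < i < nx - 1 and 0 < j < ny - 1:
                out[i, j] = 0.25 * (grid[i - 1, j] + grid[i + 1, j] +
                                    grid[i, j - 1] + grid[i, j + 1])
    return out

def smooth_padded(grid):
    # Pad once up front (edge-clamped, for illustration); the stencil body
    # then has no "if" anywhere.
    p = np.pad(grid, 1, mode="edge")
    return 0.25 * (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:])
```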