OpenCL on CPU

Member
184 posts
Joined: March 2015
Hey,

I tried to run OpenCL on the CPU as described here by SideFX: http://www.sidefx.com/docs/houdini14.0/news/13/opencl

It is meant to be 1.5 - 3 times faster, even on a CPU. So I ran a test…

38 min instead of 40. Is a special CPU necessary for this to work properly? I use a quad-core i7 at 3.4 GHz.
Staff
5161 posts
Joined: July 2005
What generation of Intel? The i5 2000 series (Sandy Bridge) has 256-bit AVX, but it isn't until Haswell (the 4000 series) that it gets the AVX2 upgrade, which adds fused multiply-add and 256-bit integer operations. That could be part of the reason.

The other one is that not all parts of the simulation are OpenCL accelerated. The parts that are accelerated run quite a bit faster, even on the CPU, but the parts that aren't stay exactly the same. It sounds like the majority of your benchmark is probably not OpenCL accelerated, or is encountering another bottleneck. We won't know unless we see the file, though.
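As a rough illustration of that point, Amdahl's law relates the overall speedup to the share of the runtime that actually gets faster. The sketch below uses the 40 min and 38 min timings from the first post and assumes the accelerated parts run 2x faster (an assumed figure inside the quoted 1.5 - 3x range, not a measurement). The function names are made up for this example.

```python
def overall_speedup(p, kernel_speedup):
    """Amdahl's law: total speedup when only a fraction p of the runtime gets faster."""
    return 1.0 / ((1.0 - p) + p / kernel_speedup)


def accelerated_fraction(observed_speedup, kernel_speedup):
    """Invert Amdahl's law: which fraction of the runtime must have been accelerated."""
    return (1.0 - 1.0 / observed_speedup) / (1.0 - 1.0 / kernel_speedup)


observed = 40.0 / 38.0          # timings from the first post
assumed_kernel_speedup = 2.0    # assumption, not a measured value
fraction = accelerated_fraction(observed, assumed_kernel_speedup)
print(f"accelerated share of runtime: {fraction:.0%}")
```

Under that assumption only about a tenth of the original runtime was spent in accelerated code, which would be consistent with most of the benchmark not being OpenCL accelerated.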
Member
184 posts
Joined: March 2015
It's a Core i7 3770. The setup is really simple. It's an imported Alembic, or rather an animated head.

I also tried putting a sphere in the scene and just clicking the explosion shelf tool without changing anything but the resolution.

The speed increases are very small every time.
Staff
5161 posts
Joined: July 2005
OpenCL also benefits larger simulations more than small ones, i.e., you'll see a bigger speedup on a 500x500x500 volume than on an 80x80x80 one. There's just not enough “work” available to really see a speedup in the small volume case. Since these large cases take a huge amount of time to solve (e.g. overnight), even a 50% speedup can net you a gain of hours over a long simulation.
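To put the two example resolutions side by side, a quick count of the voxels involved (plain arithmetic, with a helper name invented for this sketch):

```python
def voxel_count(resolution):
    """Number of voxels in a cubic grid with the given resolution per axis."""
    return resolution ** 3


small = voxel_count(80)     # the 80x80x80 case
large = voxel_count(500)    # the 500x500x500 case
ratio = large / small
print(small, large, round(ratio))
```

The large volume holds roughly 244 times as many voxels as the small one, so there is far more parallel work over which to spread the fixed per-step overhead.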
Member
184 posts
Joined: March 2015
Well, one version filled up 85% of my memory (24 GB). That's not SO low res. I tried it with my GPU too. With an old Radeon 7770 it took less than half the time, but not at THAT high a resolution (2 GB).
Staff
817 posts
Joined: July 2006
There's some information about getting the best performance from OpenCL CPU in this thread [sidefx.com], as well as some benchmark files you can test with. Also here [sidefx.com].

In general you can use the Performance Monitor to see where the time is going. OpenCL should generally be faster for advection and the multigrid solve, which are often the most expensive parts of a large smoke / pyro sim. However, if you're spending all your time doing other things, such as deforming collisions, sourcing, drawing the viewport, or caching the simulation to memory, you'll see less of a speedup.
Member
4189 posts
Joined: June 2012
As a side question: I've been searching for a good knowledge base on how OpenCL on the CPU is more efficient than C++ on the CPU. Is there an easy explanation? E.g. contiguous memory layout, SIMD, SSE2/3, etc.
Staff
6245 posts
Joined: July 2005
In general, C++ on the CPU can be every bit as fast as OpenCL on the CPU. Where it isn't, that is because we are either using different data structures in the C++ case, or we haven't fully SIMDified all of our code.

First, Intel has put a lot of work into their OpenCL drivers to vectorize the resulting code with SSE/AVX, which isn't done in our C++ version. While VEX operations chain down to SSE instructions eventually, our Multigrid has no such code for key operations like the Laplacian smooth.

Second, and likely most important for advection, we use a tiled grid format. With OpenCL we are forced to use a flat grid. This is less efficient for memory use, and much slower if you have big empty areas. For example, with FLIP our tiled layout is essential to minimize evaluation outside of the liquid. (A lot of the speed gains from H13 -> H14 come from making better use of this tiled structure.) But in Pyro your grid tends to be filled with stuff; specifically, velocities tend to be non-zero everywhere. We thus don't get the memory advantage, but pay a performance cost whenever we randomly access the grid.
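To make the flat versus tiled trade-off concrete, here is a toy sketch. It is not Houdini's actual data structure: the tile size, the class and the function names are all invented for illustration. The flat grid stores every voxel and reaches it with one index computation, while the tiled grid allocates a tile only when something is written into it, at the cost of an extra lookup on every access.

```python
TILE = 4  # hypothetical tile edge length, chosen only for illustration


def flat_index(x, y, z, nx, ny):
    """Offset of voxel (x, y, z) in a flat array stored x-fastest."""
    return x + nx * (y + ny * z)


class TiledGrid:
    """Toy sparse grid: only tiles that were written to allocate storage."""

    def __init__(self):
        self.tiles = {}  # tile coordinate -> list of TILE**3 values

    def set(self, x, y, z, value):
        key = (x // TILE, y // TILE, z // TILE)
        tile = self.tiles.setdefault(key, [0.0] * TILE ** 3)
        tile[flat_index(x % TILE, y % TILE, z % TILE, TILE, TILE)] = value

    def get(self, x, y, z):
        tile = self.tiles.get((x // TILE, y // TILE, z // TILE))
        if tile is None:  # untouched tile: implicit zero, no memory used
            return 0.0
        return tile[flat_index(x % TILE, y % TILE, z % TILE, TILE, TILE)]

    def stored_voxels(self):
        return len(self.tiles) * TILE ** 3


nx = ny = nz = 16
flat = [0.0] * (nx * ny * nz)  # flat grid: every voxel is stored
tiled = TiledGrid()
flat[flat_index(5, 6, 7, nx, ny)] = 1.0
tiled.set(5, 6, 7, 1.0)
print(len(flat), tiled.stored_voxels())
```

With a single non-zero voxel the tiled grid stores one tile instead of the whole volume. Once every tile is occupied, as with Pyro velocities, that saving disappears and only the extra lookup per access remains.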
Staff
817 posts
Joined: July 2006
What Jeff said.

The only thing I'd add is that we specifically write the OpenCL kernels to avoid branching as much as possible, since divergent code paths drastically slow down the GPU. For example, for boundary conditions during the multigrid solve we use padded grids rather than have any “if” statements in the kernels.

Fortunately, avoiding branching also makes the kernels much easier to automatically “vectorize” to SSE/AVX on the CPU by the generally excellent Intel compiler technology embedded within their driver. Regular C++ compilers have a more complex language and more general code to deal with, so they have a harder time with auto-vectorization, meaning we'd likely have to do any similar vectorization by hand.
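The padded-grid idea can be sketched in a few lines. This is a 1D toy with a clamped boundary (an assumption made for the example), not the actual multigrid kernel, and Python itself won't vectorize anything; the point is only the shape of the inner loop. The first version tests for the edges on every cell, the second moves the boundary handling into the padding so that every cell runs the same arithmetic.

```python
def smooth_with_branches(values):
    """Three-point average where edge cells test for missing neighbours."""
    n = len(values)
    out = []
    for i in range(n):
        left = values[i - 1] if i > 0 else values[i]
        right = values[i + 1] if i < n - 1 else values[i]
        out.append((left + values[i] + right) / 3.0)
    return out


def smooth_padded(values):
    """Same result, but the boundary handling lives in the padding."""
    padded = [values[0]] + list(values) + [values[-1]]
    # Every interior cell of the padded array uses the identical expression.
    return [(padded[i - 1] + padded[i] + padded[i + 1]) / 3.0
            for i in range(1, len(padded) - 1)]


data = [1.0, 4.0, 2.0, 8.0]
print(smooth_with_branches(data) == smooth_padded(data))
```

A loop body with no conditionals, like the one in `smooth_padded`, is the kind of code an auto-vectorizer can map onto SIMD lanes most easily.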
Edited by - Dec. 2, 2015 17:44:45
Member
4189 posts
Joined: June 2012
Awesome! 8) Thanking you both!
Member
184 posts
Joined: March 2015
Well, I tried all of those things… deactivating caching and so on. The speed increases are still tiny. If my GPU had enough memory, the increase would be very big in high-res sims even without deactivating all that stuff. Even with my old GPU I get at least 2x the speed.
Staff
5161 posts
Joined: July 2005
If you'd like to post your test to this thread, we can investigate further. But otherwise all we can do is conjecture as to why you're not seeing a larger speedup.
Member
184 posts
Joined: March 2015
Like I said: you can use a standard setup by just adding a sphere to the scene and clicking the explosion shelf tool. Then just activate OpenCL everywhere possible, deactivate caching and exclude all fields except density from being cached.
Member
4189 posts
Joined: June 2012
Just upload your scene file so it can be tested, and also post the OpenCL section of the About Houdini dialog. It is quite common to have another OpenCL installation picked up instead of the correct Intel one.
Member
184 posts
Joined: March 2015
Here is a screenshot of the help.

Attachments:
Unbenannt.png (511.0 KB)

Member
4189 posts
Joined: June 2012
Thanks. That's the slower AMD OpenCL library; the Intel one is more efficient and faster.

I can't recall the exact steps, but that needs to read Intel instead of AMD.
Member
184 posts
Joined: March 2015
Ah, I just forgot to add the second line to the variables:

HOUDINI_OCL_DEVICETYPE=CPU

HOUDINI_OCL_VENDOR=Intel(R) Corporation
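A small sanity check for this kind of slip, as a sketch only: it assumes the variables are kept as plain `NAME=value` lines in a text file, and the helper name is invented for the example. It reports which of the two settings above are missing or set to something else.

```python
REQUIRED = {
    "HOUDINI_OCL_DEVICETYPE": "CPU",
    "HOUDINI_OCL_VENDOR": "Intel(R) Corporation",
}


def missing_ocl_settings(env_text):
    """Return the required variables that are absent or set to another value."""
    found = {}
    for line in env_text.splitlines():
        line = line.strip()
        # Skip blanks, comments and anything that is not a NAME=value line.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        found[name.strip()] = value.strip().strip('"')
    return [name for name, value in REQUIRED.items() if found.get(name) != value]


sample = "HOUDINI_OCL_DEVICETYPE=CPU\n"  # only the first line is set
print(missing_ocl_settings(sample))
```

With only the device type set, the check flags `HOUDINI_OCL_VENDOR`, which is the case where the AMD library was still being picked up.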
Member
184 posts
Joined: March 2015
Unfortunately another error message appears now:

Anyway… it crashes every now and then, but it is much faster now. So thanks, everyone.

Attachments:
Unbenannt2.png (34.7 KB)
