OpenCL on CPU



Rosko Ron: Member; 184 posts; Joined: March 2015; Offline

Dec. 2, 2015 11:43 a.m.

Hey,

I tried to run OpenCL on the CPU as described here by SideFX: http://www.sidefx.com/docs/houdini14.0/news/13/opencl [sidefx.com]

It is meant to be 1.5 - 3 times faster, even on a CPU. So I made a test….

38 min instead of 40. Is a special CPU necessary for this to work properly? I use an i7 quad core, 3.4 GHz.


malexander: Staff; 5158 posts; Joined: July 2005; Offline

Dec. 2, 2015 12:10 p.m.

What generation of Intel? The i5 2000 series has 256-bit AVX, but it isn't until Haswell (the 4000 series) that it gets the AVX2 and FMA upgrade. That could be part of the reason.

The other one is that not all parts of the simulation are OpenCL accelerated. The parts that are accelerated run quite a bit faster, even on the CPU, but the parts that aren't stay exactly the same. It sounds like the majority of your benchmark is probably not OpenCL accelerated, or is hitting another bottleneck. We won't know unless we see the file, though.
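To put rough numbers on this, here is a minimal sketch using Amdahl's law. The timings (40 and 38 minutes) come from the first post; the 2x speedup for the accelerated parts is only an assumed value picked from the 1.5 - 3x range quoted there, and the function names are made up for illustration.

```python
def overall_speedup(accelerated_fraction, kernel_speedup):
    """Amdahl's law: speedup of the whole run when only part of it gets faster."""
    remaining = (1.0 - accelerated_fraction) + accelerated_fraction / kernel_speedup
    return 1.0 / remaining


def implied_fraction(time_before, time_after, kernel_speedup):
    """Invert Amdahl's law: which share of the run must have been accelerated?"""
    saved = 1.0 - time_after / time_before
    return saved / (1.0 - 1.0 / kernel_speedup)


# 40 min without OpenCL, 38 min with it (from the first post).
# kernel_speedup=2.0 is an assumption, not a measured value.
fraction = implied_fraction(40.0, 38.0, kernel_speedup=2.0)
print(f"accelerated share of the run: {fraction:.0%}")
print(f"overall speedup: {overall_speedup(fraction, 2.0):.3f}x")
```

Under that assumption only about a tenth of the run time would be spent in accelerated code, which would explain why the total barely moves.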


Rosko Ron: Member; 184 posts; Joined: March 2015; Offline

Dec. 2, 2015 12:16 p.m.

It's a Core i7 3770. The setup is really simple. It's an imported Alembic, or rather an animated head.

I also tried to put a sphere in the scene and just click on the explosion shelf tool without changing anything but the resolution.

The speed increases are very small every time.


malexander: Staff; 5158 posts; Joined: July 2005; Offline

Dec. 2, 2015 12:23 p.m.

OpenCL also benefits larger simulations more than small ones, i.e., you'll see a bigger speedup on a 500x500x500 volume than on an 80x80x80 one. There's just not enough “work” available to really see a speedup in the small volume case. Since these large cases take a huge amount of time to solve (e.g. overnight), even a 50% speedup can net you a gain of hours over a long simulation.
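For a sense of scale between those two cases, a quick back-of-the-envelope sketch. The 4 bytes per voxel (one 32-bit float per scalar field) is an assumption for illustration, and the helper names are hypothetical.

```python
def voxel_count(resolution):
    """Number of voxels in a cubic grid with `resolution` voxels per side."""
    return resolution ** 3


def field_megabytes(resolution, bytes_per_voxel=4):
    """Memory for one scalar field; 4 bytes per voxel is an assumed float size."""
    return voxel_count(resolution) * bytes_per_voxel / 1e6


for res in (80, 500):
    print(f"{res}^3: {voxel_count(res):,} voxels, "
          f"{field_megabytes(res):.1f} MB per scalar field")

print(f"work ratio: {voxel_count(500) / voxel_count(80):.0f}x")
```

The large grid has roughly 244 times as many voxels as the small one, so fixed per-step overhead matters far less there.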


Rosko Ron: Member; 184 posts; Joined: March 2015; Offline

Dec. 2, 2015 3:46 p.m.

Well, one version filled up 85% of my memory (24 GB). That's not SO low res. I tried it with my GPU too. With an old Radeon 7770 it took less than half the time, but not at THAT high a resolution (2 GB).


johner: Staff; 809 posts; Joined: July 2006; Offline

Dec. 2, 2015 4:15 p.m.

There's some information about getting the best performance from OpenCL CPU in this thread [sidefx.com], as well as some benchmark files you can test with. Also here [sidefx.com].

In general you can use the Performance Monitor to see where the time is going. OpenCL should generally be faster for advection and the multigrid solve, which are often the most expensive parts of a large smoke / pyro sim. However if you're spending all your time doing other things such as deforming collisions, sourcing, drawing the viewport, caching the simulation to memory, you'll see less of a speedup.


anon_user_37409885: Member; 4189 posts; Joined: June 2012; Offline

Dec. 2, 2015 4:29 p.m.

As a side question: I've been searching for a good knowledge base on how OpenCL on the CPU is more efficient than C++ on the CPU. Is there an easy explanation? E.g. contiguous memory layout, SIMD, SSE2/3, etc.


jlait: Staff; 6201 posts; Joined: July 2005; Offline

Dec. 2, 2015 4:41 p.m.

In general, C++ on the CPU can be every bit as fast as OpenCL on the CPU. The reason it isn't in our case is that we are either using different data structures in the C++ code, or we haven't fully SIMDified all of our code.

First, Intel has put a lot of work into their OpenCL drivers to vectorize the resulting code with SSE/AVX, and our C++ version doesn't have that. While VEX operations chain down to SSE instructions eventually, our Multigrid has no such code for key operations like the Laplacian smooth.

Second, and likely most important for advection, we use a tiled grid format. With OpenCL we are forced to use a flat grid. This is less efficient for memory use, and much slower if you have big empty areas. For example, with FLIP our tiled layout is essential to minimize evaluation outside of the liquid. (A lot of the speed gains from H13 -> H14 come from making better use of this tiled structure.) But in Pyro your grid tends to be filled with stuff; specifically, velocities tend to be non-zero everywhere. We thus don't get the memory advantage, but pay a performance cost whenever we randomly access the grid.
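The memory side of that trade-off can be sketched roughly as follows. This is not Houdini's actual data structure: the tile size of 16, the rule that an all-zero tile collapses to a single stored value, and all names are assumptions for illustration only.

```python
TILE = 16  # tile edge length in voxels (assumed value)


def flat_storage(res):
    """A flat grid stores every voxel, empty or not."""
    return res ** 3


def tiled_storage(res, is_nonzero):
    """Stored values when all-zero tiles collapse to one value.

    `res` must be a multiple of TILE; `is_nonzero(x, y, z)` says whether a voxel holds data.
    """
    tiles_per_side = res // TILE
    stored = 0
    for tx in range(tiles_per_side):
        for ty in range(tiles_per_side):
            for tz in range(tiles_per_side):
                occupied = any(
                    is_nonzero(tx * TILE + i, ty * TILE + j, tz * TILE + k)
                    for i in range(TILE)
                    for j in range(TILE)
                    for k in range(TILE)
                )
                stored += TILE ** 3 if occupied else 1
    return stored


# FLIP-like case: liquid only in the bottom quarter of a 64^3 grid.
liquid = tiled_storage(64, lambda x, y, z: z < 16)
# Pyro-like case: velocities non-zero everywhere.
smoke = tiled_storage(64, lambda x, y, z: True)
print(flat_storage(64), liquid, smoke)
```

In the liquid case the tiled layout stores about a quarter of what the flat grid does; in the filled case it stores exactly the same amount, so only the access overhead remains.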


johner: Staff; 809 posts; Joined: July 2006; Offline

Dec. 2, 2015 4:50 p.m.

What Jeff said.

The only thing I'd add is that we specifically write the OpenCL kernels to avoid branching as much as possible, since divergent code paths drastically slow down the GPU. For example for boundary conditions during the multigrid solve we use padded grids rather than have any “if” statements in the kernels.

Fortunately, avoiding branching also makes the kernels much easier to automatically “vectorize” to SSE/AVX on the CPU by the generally excellent Intel compiler technology embedded within their driver. Regular C++ compilers have a more complex language and more general code to deal with, so they have a harder time with auto-vectorization, meaning we'd likely have to do any similar vectorization by hand.
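The padded-grid idea can be illustrated with a small sketch. The real kernels are written in OpenCL, not Python, and the 1D stencil, the zero boundary values and the function names here are all assumptions chosen to keep the example short. It only shows how padding removes the boundary tests from the inner loop; Python itself does not vectorize anything.

```python
def laplacian_branching(values):
    """1D Laplacian with explicit boundary checks (zero assumed outside the grid)."""
    n = len(values)
    out = []
    for i in range(n):
        left = values[i - 1] if i > 0 else 0.0
        right = values[i + 1] if i < n - 1 else 0.0
        out.append(left - 2.0 * values[i] + right)
    return out


def laplacian_padded(values):
    """Same stencil on a grid padded with one ghost cell per side: no boundary tests."""
    padded = [0.0] + list(values) + [0.0]
    return [
        padded[i - 1] - 2.0 * padded[i] + padded[i + 1]
        for i in range(1, len(padded) - 1)
    ]


field = [1.0, 4.0, 9.0, 16.0]
print(laplacian_branching(field))
print(laplacian_padded(field))
```

Both versions give the same result, but in the padded one every cell runs the identical instruction sequence, which is the property that suits both GPUs and automatic vectorization.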

Edited by - Dec. 2, 2015 17:44:45


anon_user_37409885: Member; 4189 posts; Joined: June 2012; Offline

Dec. 2, 2015 5:35 p.m.

Awesome! 8) Thanking you both!


Rosko Ron: Member; 184 posts; Joined: March 2015; Offline

Dec. 3, 2015 10:58 a.m.

Well, I tried all those things… deactivating caching and so on. The speed increases are still tiny. If my GPU had enough memory, the increase would be very big in high-res sims, even without deactivating all that stuff. With my old GPU lady I get at least 2x the speed.


malexander: Staff; 5158 posts; Joined: July 2005; Offline

Dec. 3, 2015 3:29 p.m.

If you'd like to post your test to this thread, we can investigate further. But otherwise all we can do is conjecture as to why you're not seeing a larger speedup.


Rosko Ron: Member; 184 posts; Joined: March 2015; Offline

Dec. 3, 2015 5:46 p.m.

Like I said: you can use a standard setup by just adding a sphere to the scene and clicking on the explosion shelf tool. Then just activate OpenCL everywhere possible, deactivate caching, and exclude all fields except density from being cached.


anon_user_37409885: Member; 4189 posts; Joined: June 2012; Offline

Dec. 4, 2015 3:25 a.m.

Just upload your scene file so it can be tested, and also post the OpenCL section of the About Houdini dialog. It is quite common to have another OpenCL installation active instead of the correct Intel one.


Rosko Ron: Member; 184 posts; Joined: March 2015; Offline

Dec. 6, 2015 2:02 a.m.

Here is a screenshot of the help.

Attachments:
Unbenannt.png (511.0 KB)


anon_user_37409885: Member; 4189 posts; Joined: June 2012; Offline

Dec. 6, 2015 2:17 a.m.

Thanks, that's the slower AMD OpenCL library. The Intel one is more efficient and faster.

I can't recall the exact steps, but that needs to read Intel instead of AMD.


Rosko Ron: Member; 184 posts; Joined: March 2015; Offline

Dec. 6, 2015 11:09 a.m.

Ah, I just forgot to add the second line to the variables:

HOUDINI_OCL_DEVICETYPE=CPU

HOUDINI_OCL_VENDOR=Intel(R) Corporation
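As a small sketch, the same two settings could also be applied from a launcher script. The variable names and values are the ones above; the helper name and the idea of launching Houdini through a script are assumptions for illustration.

```python
import os


def opencl_cpu_env(base_env=None):
    """Return a copy of the environment with the two OpenCL variables from this post set."""
    env = dict(os.environ if base_env is None else base_env)
    env["HOUDINI_OCL_DEVICETYPE"] = "CPU"
    env["HOUDINI_OCL_VENDOR"] = "Intel(R) Corporation"
    return env


env = opencl_cpu_env()
for name in ("HOUDINI_OCL_DEVICETYPE", "HOUDINI_OCL_VENDOR"):
    print(f"{name}={env[name]}")

# To launch with these settings (assumes an executable named "houdini" on PATH):
# import subprocess
# subprocess.Popen(["houdini"], env=env)
```

Afterwards, the OpenCL section of the About Houdini dialog is the place to confirm which device and vendor were actually picked up.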


Rosko Ron: Member; 184 posts; Joined: March 2015; Offline

Dec. 6, 2015 2:02 p.m.

Unfortunately another error message appears now:

Anyway… it crashes every now and then, but it is much faster now. So thanks, everyone.

Attachments:
Unbenannt2.png (34.7 KB)
