Intel OpenCL (on Linux) status?

Member
7046 posts
Joined: July 2005
Hi,

Following the Beta thread about the Intel OpenCL CPU driver, we gave it a try with unremarkable results. Just checking to see if I've missed something or if something has changed.

http://www.sidefx.com/index.php?option=com_forum&Itemid=172&page=viewtopic&t=29029&postdays=0&postorder=asc&highlight=intel+opencl&start=25&sid=072cbffc6040eaeaff7756509fc7e5c1

(in a tcsh shell)
setenv HOUDINI_OCL_DEVICETYPE CPU
setenv LD_PRELOAD /opt/intel/opencl/lib64/libtbb_preview.so.2
hmaster

OpenCL is linked to version 1.2.someThingReallyLong

Houdini is using it: the diagnostics say so, and with the memory output enabled I see the memory diagnostics whenever I turn on OpenCL.

Test setup:

1. Ctrl Click on Sphere
2. Ctrl Click on Explosion

400x400x400: 24 secs for 20 frames
400x400x400 OCL: 18 secs for 20 frames

OK not bad.

600x600x600: 54 secs for 20 frames
600x600x600 OCL: 49 secs for 20 frames

Negligible difference.
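For reference, the timings above work out to roughly a 1.33x speedup at 400^3 and about 1.10x at 600^3. A small sketch of that arithmetic (the helper name is hypothetical; only the four timings come from the post):

```python
def speedup(baseline_secs, opencl_secs):
    """Return how many times faster the OpenCL run is than the baseline."""
    return baseline_secs / opencl_secs


# (resolution, seconds without OpenCL, seconds with OpenCL), 20 frames each,
# copied from the timings quoted in the post.
runs = [
    ("400x400x400", 24.0, 18.0),
    ("600x600x600", 54.0, 49.0),
]

for name, base, ocl in runs:
    saved = 100.0 * (base - ocl) / base
    print(f"{name}: {speedup(base, ocl):.2f}x faster, {saved:.0f}% less wall-clock time")
```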

Wondering if there are other variables?

13.0.239 Centos 6.5 (RHEL 6.5)
driver 319.something

Cheers,

Peter B
Cheers,

Peter Bowmar
____________
Houdini 20.5.262 Win 10 Py 3.11
Member
4189 posts
Joined: June 2012
Which CPUs are you using?
Staff
818 posts
Joined: July 2006
pbowmar
Negligible difference.

Wondering if there are other variables?

The basic idea is to minimize the amount of data that has to be copied from OpenCL memory back to regular memory (even though on the CPU they are the same *type* of memory, they are stored in different formats: dense grids for OpenCL vs. tiled for regular CPU).

That means turning off caching since that copies the data for the entire sim, and not displaying the results of each frame in the viewport. So for interactive work the memory transfer overhead mostly negates the increased CPU performance, as you've found. The main performance gain shows up for offline sims where caching is off and you're just writing density and maybe vel to file with Save In Background on.
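To get a feel for how much data such a copy involves, here is a rough back-of-the-envelope sketch. It assumes 32-bit floats, a fully dense grid at the stated resolution, and no padding; none of that is confirmed in this thread, and the function name is made up for illustration.

```python
def dense_field_bytes(res_x, res_y, res_z, components=1, bytes_per_value=4):
    """Size of one dense grid; 4 bytes per value assumes 32-bit floats (an assumption)."""
    return res_x * res_y * res_z * components * bytes_per_value


for res in (400, 600):
    density = dense_field_bytes(res, res, res)             # one scalar field
    vel = dense_field_bytes(res, res, res, components=3)   # three components
    print(f"{res}^3: density ~{density / 1e6:.0f} MB, vel ~{vel / 1e6:.0f} MB per copy")
```

Under those assumptions a single density field at 600^3 is already several hundred MB, so copying fields back every frame for caching or display adds up quickly.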

Even then, the type of sim makes a difference. A resizing Pyro sim with several complicated sources and collisions is going to spend enough time generating those fields in SOPs and then transferring them to OpenCL that again you won't see as much of a speedup. A “pure” smoke sim, on the other hand, will show a big speedup. That's basically the difference between the two attached charts: one is a pure smoke sim, the other is a Pyro sim with a source and collision object. Still faster in OpenCL, but not by as much.

All of the above holds for GPU as well, or even more so: the processor gain is higher, but the memory transfer goes across the PCIe bus, so it is even more expensive.

Attachments:
sphere_sticktion_625.mp4 (515.5 KB)
smoke_turb_benchmark_512.mp4 (155.0 KB)
sphere_sticktion_smoke_Linux.png (35.5 KB)
clexplanation_turb_Linux.png (32.4 KB)
sphere_sticktion_smoke.hip (1.7 MB)
clexplanation_turb.hip (785.4 KB)

Member
7046 posts
Joined: July 2005
OK that did the trick, thanks Johner.

I notice that HOUDINI_OCL_MEMORY_POOL_SIZE=.5 isn't working for GPU mode. I set it to .5 of my 4 GB card, but Houdini still reported only 1 GB. Perhaps the diagnostics are just wrong?

Cheers,

Peter B
Staff
818 posts
Joined: July 2006
pbowmar
I notice that HOUDINI_OCL_MEMORY_POOL_SIZE=.5 isn't working for GPU mode. I set it to .5 of my 4 GB card, but Houdini still reported only 1 GB. Perhaps the diagnostics are just wrong?

Just so we're using the same terms, the memory pool is just an acceleration method: allocating memory on the GPU is slow, and we re-use lots of identical-sized memory buffers during the solve, so it makes sense to have a pool of recently-used buffers for quick re-use. The HOUDINI_OCL_MEMORY_POOL_SIZE just sets the amount of total GPU memory available for that pool of buffers. The solver works just fine with it set to zero; in fact you'll have more memory available for the solve itself, but you'll pay a small price in performance.
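The general idea of a size-keyed buffer pool with a byte budget can be sketched as below. This is a toy illustration only, not Houdini's implementation: the class, its fields, and the use of `bytearray` as a stand-in for a device buffer are all assumptions made for the example.

```python
class BufferPool:
    """Toy size-keyed buffer pool with a byte budget (illustration only)."""

    def __init__(self, budget_bytes):
        self.budget_bytes = budget_bytes
        self.pooled_bytes = 0
        self.free = {}               # size in bytes -> list of idle buffers
        self.fresh_allocations = 0   # counts the "slow" allocations

    def acquire(self, size):
        idle = self.free.get(size)
        if idle:
            # Reuse a recently released buffer of identical size.
            self.pooled_bytes -= size
            return idle.pop()
        # Stand-in for a slow device allocation.
        self.fresh_allocations += 1
        return bytearray(size)

    def release(self, buf):
        size = len(buf)
        if self.pooled_bytes + size <= self.budget_bytes:
            self.free.setdefault(size, []).append(buf)
            self.pooled_bytes += size
        # Otherwise the buffer is dropped, leaving more memory for the solve.


def simulate(budget_bytes, steps=10, size=1024):
    """Acquire and release one same-sized scratch buffer per step."""
    pool = BufferPool(budget_bytes)
    for _ in range(steps):
        scratch = pool.acquire(size)
        pool.release(scratch)
    return pool.fresh_allocations


print(simulate(budget_bytes=4096))  # pool has room: the buffer is reused
print(simulate(budget_bytes=0))     # pool disabled: every step allocates
```

With a zero budget everything still works, but each step pays for a fresh allocation, which mirrors the trade-off described above.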

So normally only a small amount of memory is allocated for the memory pool (12.5 % of total device memory IIRC), but you can increase it if you have lots of memory on the GPU. You usually don't, so I wouldn't necessarily recommend increasing it, since you'll run out of memory for the actual solve sooner. On the other hand, the memory pool turns out to increase performance (more than I would have thought) with the Intel OpenCL on CPU as well, where you might actually have some insanely huge amount of memory, in which case increasing the pool size might make sense.

If you've got HOUDINI_OCL_REPORT_MEMORY_USE set, the Total Memory Allocated is the total amount of device memory allocated. If your GPU is also your display device, you'll never get close to 4 GB, as there's too much other memory being used for display buffers, textures and such. The In Memory Pool is the amount of memory in the memory pool, while Active Memory is the amount currently taken by the buffers in the current sim. These two numbers should add up to the Total Memory at the end of each timestep.
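As a quick sanity check on the numbers discussed here, the sketch below treats the pool size setting as a plain fraction of total device memory, which is how it is used in this thread; exactly how Houdini derives the budget internally is an assumption. The helper names and the sample report figures are hypothetical.

```python
GB = 1024 ** 3


def pool_budget_bytes(device_bytes, fraction):
    """Bytes the pool may hold when given a fraction of total device memory."""
    return int(device_bytes * fraction)


def totals_consistent(total, in_pool, active):
    """True when pool and active memory account for all allocated memory."""
    return in_pool + active == total


card = 4 * GB
print(pool_budget_bytes(card, 0.125) / GB, "GB at the default fraction recalled above")
print(pool_budget_bytes(card, 0.5) / GB, "GB at the .5 setting tried above")

# Hypothetical end-of-timestep report figures, in bytes.
print(totals_consistent(total=1000, in_pool=300, active=700))
```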