I am trying to convert some VEX code to OpenCL in order to speed it up significantly. The code I am playing around with right now is quite simple and only writes to a 1000x1000x1 float volume (generating an image!). Converting the code was easy enough, but the results are very slow: using an RTX 2080 Ti with OpenCL is only 3-4 times as fast as running VEX on a 6-core Intel 5820.
Using global memory in OpenCL is not advised, so I would like to move to texture memory instead, but I don't know how to interface with Houdini then. Is there any info on, or examples of, efficient OpenCL code running in an OpenCL SOP?
For reference, running (more or less) the same code as a GLSL shader is about 100 times faster on the same graphics card.
Effective opencl
- filipw
- Member
- 135 posts
- Joined: March 2018
- kahuna031
- Member
- 897 posts
- Joined: July 2018
- filipw
After some further tests using both points and volumes as containers for the data, I would estimate I get roughly a 10x performance gain on my computer. It seems like a better investment to get a faster CPU and stay with VEX as it is now. The gains are better than nothing, of course, but nowhere near the potential for simple workloads like this one. Again, my reference is the same code (translated) running as a fragment shader.
Maybe there could be some image- and volume-specific bindings for the OpenCL SOP that use texture memory as an option?
- kahuna031
Isn't that what you've got already? The bottleneck isn't compute but I/O. I've gotten crazy speeds with OCL when the problem was isolated to one compute step: http://beautyapproximations.blogspot.com/2018/01/cloth-demo-gif.html?m=1
B.Henriksson, DICE
- filipw
@kahuna031, in this test case I just have a generator that depends on the world position, so my kernel is basically:
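```c
// ...
kernel void kernelName(int P_length, global float *P, global float *density)
{
    int idx = get_global_id(0);
    if (idx >= P_length)    // skip padding work-items past the end of the data
        return;
    float3 p = vload3(idx, P);
    density[idx] = massiveArithmetics(p);  // user's generator, not shown
}
```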
I feel like I am missing something here, but maybe I just cannot expect GLSL-level performance?
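Since this generator only needs each element's world position, one conceivable way to avoid uploading P at all is to derive the position from the work-item index. The C sketch below shows only that index-to-position arithmetic; `voxel_center` and `vec3f` are hypothetical names, and the x-fastest layout and axis-aligned bounds are assumptions, not Houdini's documented volume layout. The same arithmetic would carry over to OpenCL C using the built-in `float3`.

```c
#include <assert.h>

/* Plain 3-float vector; OpenCL C would use its built-in float3 instead. */
typedef struct { float x, y, z; } vec3f;

/* Hypothetical helper: world-space centre of voxel `idx` in a
 * res_x * res_y * res_z grid spanning the axis-aligned box [bmin, bmax].
 * Assumes x varies fastest, then y, then z (an assumption, not Houdini's API). */
static vec3f voxel_center(int idx, int res_x, int res_y, int res_z,
                          vec3f bmin, vec3f bmax)
{
    assert(res_x > 0 && res_y > 0 && res_z > 0);
    assert(idx >= 0 && idx < res_x * res_y * res_z);

    int ix = idx % res_x;
    int iy = (idx / res_x) % res_y;
    int iz = idx / (res_x * res_y);

    /* (i + 0.5f) / res is the fractional position of the voxel centre.
     * Every literal has the f suffix so the maths stays in float. */
    vec3f p;
    p.x = bmin.x + (bmax.x - bmin.x) * ((float)ix + 0.5f) / (float)res_x;
    p.y = bmin.y + (bmax.y - bmin.y) * ((float)iy + 0.5f) / (float)res_y;
    p.z = bmin.z + (bmax.z - bmin.z) * ((float)iz + 0.5f) / (float)res_z;
    return p;
}
```

For a 1000x1000x1 grid over [0, 1000] x [0, 1000] x [0, 1], index 0 maps to (0.5, 0.5, 0.5) and the last index, 999999, maps to (999.5, 999.5, 0.5).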
- kahuna031
It is GLSL-level performance, but VEX is already quite fast, and unless massiveArithmetics is very slow in VEX, your bottleneck will be the CPU-GPU-CPU copying.
Try commenting out massiveArithmetics and see what compute time you get; that is basically the overhead. If this number is close to VEX, then OpenCL offers nothing.
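To put a rough number on that copying, here is a back-of-envelope C sketch of the data moved per cook for a kernel like the one above on the 1000x1000x1 grid from the first post: P comes in as 3 floats per element and density goes back as 1 float. `bytes_per_cook` is a hypothetical helper; the count assumes nothing stays resident on the GPU between cooks and ignores any other driver or Houdini-side overhead.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes copied per cook: n_elements * (floats read + floats written) floats.
 * Assumes every buffer is uploaded and downloaded on every cook. */
static size_t bytes_per_cook(size_t n_elements,
                             size_t floats_in, size_t floats_out)
{
    assert(n_elements > 0);
    return n_elements * (floats_in + floats_out) * sizeof(float);
}
```

With 4-byte floats, `bytes_per_cook(1000 * 1000, 3, 1)` is 16,000,000 bytes: 12 MB up for P and 4 MB back for density, paid on every cook no matter how little work massiveArithmetics does.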
- filipw
EDIT: OK, the biggest error I made in my first test was sloppiness: leaving out the trailing f's on the float literals, which I assume resulted in a lot of conversions to and from double.
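The effect is easy to reproduce in plain C, whose literal typing and arithmetic-conversion rules OpenCL C follows on devices with double-precision support: an unsuffixed literal such as `0.5` is a `double`, so multiplying a `float` by it happens in double precision and is narrowed back afterwards. On consumer GPUs, where double-precision throughput is a small fraction of single precision, that can be expensive. The helper names and the `IS_DOUBLE` macro below are illustrative only.

```c
#include <assert.h>

/* 1 if expression e has type double, 0 otherwise (C11 _Generic). */
#define IS_DOUBLE(e) _Generic((e), double: 1, default: 0)

/* 0.5 has no suffix, so it is a double: x is promoted, the multiply runs in
 * double precision, and the result is narrowed back to float on return. */
static float half_unsuffixed(float x) { return x * 0.5; }

/* 0.5f is a float literal, so the whole expression stays single precision. */
static float half_suffixed(float x) { return x * 0.5f; }

/* Checks the promotion rule at run time. */
static void check_promotion(void)
{
    float x = 3.0f;
    assert(IS_DOUBLE(x * 0.5) == 1);   /* float * double -> double */
    assert(IS_DOUBLE(x * 0.5f) == 0);  /* float * float  -> float  */
}
```

Both helpers return the same value here; the difference is the precision, and therefore the hardware path, used for the multiply. Outside Houdini, OpenCL's `-cl-single-precision-constant` build option treats unsuffixed floating-point constants as single precision, but appending `f` works everywhere.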
There is no doubt that the calculation is what takes time. Anyway, I made a test case, and you can check this .hip if you want to. The crazy thing is that this one behaves more in line with what one would expect…
It is based on this ShaderToy code (https://www.shadertoy.com/view/ldf3DN), which calculates a Mandelbrot set with orbit traps.
I converted the code to VEX and OpenCL. In both cases I had to add some functionality that was missing, but all in all they are very similar. I made sure everything used floats, with no conversions from doubles in the CL code. I might still have missed something, but I think the comparison is quite fair.
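For readers who have not opened the shader, the core of an orbit-trap Mandelbrot is an escape-time loop that also records how close the orbit comes to a "trap". The C sketch below is a generic version with a single point trap, not a port of the ShaderToy code; `mandel_trap`, the point trap and the squared-distance measure are illustrative choices. Every literal carries the `f` suffix so the loop stays in single precision.

```c
#include <assert.h>
#include <float.h>

/* escape_iter: iteration at which |z| first exceeded 2, or max_iter if it
 * never did. trap_dist2: smallest squared distance from the orbit to the trap. */
typedef struct { int escape_iter; float trap_dist2; } mandel_result;

/* Generic escape-time Mandelbrot with one point orbit trap at (tx, ty). */
static mandel_result mandel_trap(float cx, float cy, int max_iter,
                                 float tx, float ty)
{
    assert(max_iter > 0);
    float zx = 0.0f, zy = 0.0f;
    float best = FLT_MAX;
    int i;
    for (i = 0; i < max_iter; ++i) {
        /* z = z^2 + c */
        float nx = zx * zx - zy * zy + cx;
        float ny = 2.0f * zx * zy + cy;
        zx = nx;
        zy = ny;
        /* Closest approach of the orbit to the trap point. */
        float dx = zx - tx, dy = zy - ty;
        float d2 = dx * dx + dy * dy;
        if (d2 < best)
            best = d2;
        /* |z|^2 > 4 means the orbit has escaped. */
        if (zx * zx + zy * zy > 4.0f)
            break;
    }
    mandel_result r = { i, best };
    return r;
}
```

Each pixel is computed independently of every other, which is why this kind of workload maps well onto one GPU work-item per voxel.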
Changing the Shadertoy code to not use antialiasing, setting the iterations to 100000 and running the window at 1200x675 gives me 20-30 fps, so rendering 60 frames takes 2-3 s.
With the same settings in Houdini for both VEX and OpenCL, running 60 frames with the Performance Monitor on reports that my RTX 2080 Ti running OpenCL is about 12x faster at the calculation than my 6-core Intel 5820 CPU.
So, for 60 frames:
- GLSL on Shadertoy: 2-3 s
- VEX: 31 s
- OpenCL: 2.7 s
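The totals above can be turned into per-frame times and a speedup with a couple of lines of arithmetic; the helper names below are illustrative.

```c
#include <assert.h>

/* Average time per frame in milliseconds for `frames` frames in `total_s` s. */
static double ms_per_frame(double total_s, int frames)
{
    assert(frames > 0);
    return 1000.0 * total_s / (double)frames;
}

/* How many times faster the `fast_s` run was than the `slow_s` run. */
static double speedup(double slow_s, double fast_s)
{
    assert(fast_s > 0.0);
    return slow_s / fast_s;
}
```

`speedup(31.0, 2.7)` is about 11.5, the "about 12x" above. Per frame that is roughly 517 ms for VEX, 45 ms for OpenCL and 33-50 ms for GLSL, so the OpenCL time falls inside the GLSL range.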
Hmm, maybe it is fine after all.
(Making the calculation domain larger/higher-res increased the benefit of OpenCL, just as expected.)
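One way to read that observation is a simple cost model: the GPU path pays a fixed per-cook cost plus a small per-element cost, while the CPU path pays only a larger per-element cost. The C sketch below encodes that toy model; the struct, the function and all constants are hypothetical, not measurements from this thread.

```c
#include <assert.h>

/* Toy cost model with hypothetical constants, not measured data. */
typedef struct {
    double gpu_overhead_s;  /* fixed cost per cook (setup, launch, ...) */
    double gpu_per_elem_s;  /* GPU cost per element, transfers included */
    double cpu_per_elem_s;  /* CPU cost per element */
} cost_model;

/* Modelled CPU time divided by modelled GPU time for n_elements. */
static double model_speedup(const cost_model *m, double n_elements)
{
    assert(n_elements > 0.0);
    double cpu = n_elements * m->cpu_per_elem_s;
    double gpu = m->gpu_overhead_s + n_elements * m->gpu_per_elem_s;
    return cpu / gpu;
}
```

With any positive fixed overhead, the modelled speedup grows with the element count and levels off at `cpu_per_elem_s / gpu_per_elem_s` once the fixed term is amortized, which matches the trend described; real per-element GPU costs also include the CPU-GPU-CPU copying discussed earlier in the thread.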
I still don't get why my earlier example, with a very similar setup and principle, showed such a large difference between OpenCL and GLSL.
Edited by filipw - Dec. 20, 2019 14:03:11