Karma XPU - dual RTX 4090 setup - performance issues

User Avatar
Member
22 posts
Joined: Oct. 2015
Online
I'm running 2 × RTX 4090 plus a 128-core Threadripper, but compared to running the same system with only 1 × RTX 4090 I'm seeing only around a 50% increase in speed. Is this normal, or is there anything I can do to squeeze more juice out of it? Currently on Houdini 19.5.716.

cheers
Edited by timjan - Nov. 12, 2023 05:28:10
User Avatar
Staff
486 posts
Joined: May 2019
Offline
That is a beast of a CPU.

You can enable/disable different types of devices
https://www.sidefx.com/docs/houdini/solaris/karma_xpu.html#disablingdevices

It would be great to get some render times from you for (eg)...
- CPUdevice=on GPUdevice0=off GPUdevice1=off
- CPUdevice=off GPUdevice0=on GPUdevice1=off
- CPUdevice=off GPUdevice0=off GPUdevice1=on
- CPUdevice=off GPUdevice0=on GPUdevice1=on
- CPUdevice=on GPUdevice0=on GPUdevice1=on

This way we can verify that the performance gain from using 1 or 2 GPUs is correct/expected.
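
For anyone who wants to script the comparison, here is a rough, hypothetical sketch of timing those five combinations in a batch. It assumes the KARMA_XPU_DISABLE_CPU_DEVICE variable from the linked "disabling devices" docs section, uses the standard NVIDIA CUDA_VISIBLE_DEVICES variable to hide individual GPUs, and the husk command line is only a placeholder to adapt to your own scene and Houdini version.

# Hypothetical benchmark driver for the device combinations listed above.
# Assumptions (verify against your install):
#   - KARMA_XPU_DISABLE_CPU_DEVICE=1 disables the CPU device (see the linked
#     docs section for the exact variable names).
#   - CUDA_VISIBLE_DEVICES is the standard NVIDIA variable for hiding GPUs
#     from CUDA/OptiX, used here to emulate GPUdevice0/1 on/off.
#   - HUSK_CMD is a placeholder; replace it with your actual render command.
import os
import subprocess
import time

HUSK_CMD = ["husk", "--renderer", "BRAY_HdKarmaXPU", "scene.usd", "-o", "out.exr"]

# (cpu, gpu0, gpu1) on/off combinations from the post above.
COMBOS = [
    (True,  False, False),
    (False, True,  False),
    (False, False, True),
    (False, True,  True),
    (True,  True,  True),
]

for cpu, gpu0, gpu1 in COMBOS:
    env = os.environ.copy()
    env["KARMA_XPU_DISABLE_CPU_DEVICE"] = "0" if cpu else "1"
    visible = [str(i) for i, on in enumerate((gpu0, gpu1)) if on]
    env["CUDA_VISIBLE_DEVICES"] = ",".join(visible)  # empty string hides all GPUs
    start = time.time()
    subprocess.run(HUSK_CMD, env=env, check=True)
    print(f"CPU={cpu} GPU0={gpu0} GPU1={gpu1}: {time.time() - start:.1f}s")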

thanks
User Avatar
Member
22 posts
Joined: Oct. 2015
Online
Yes, it's indeed powerful for all things CPU related; I'm very pleased with the system. Thanks for the update, will check it out!
User Avatar
Member
11 posts
Joined: Jan. 2017
Offline
If it works like Radeon ProRender, only the primary card (an RTX 4090 in my case; the secondary is a 7900 XTX, and yes, surprisingly that combination actually works) was allowed to work on samples above the minimum. The adaptive samples require on-the-spot decisions based on rays / photon casts from the rest of the scene, depending on how it's being done, whereas the non-adaptive samples only require the data resulting from those, which isn't really needed until all the samples are mixed together (so it probably doesn't need to be transferred to the other card at all). My guess is that it's too slow to be beneficial without something like NVLink, which Nvidia conveniently killed off on everything below the $8500 L40 in the Ada generation. That, or that's simply as much work as can be offloaded from the CPU. The 4090 only has 256 fp64 cores out of its 16,000-odd CUDA cores, and they're spread out amongst all the SMs, which makes it impossible to get any kind of cache locality working with them, so anything that needs higher than fp32 precision is probably ending up on the CPU.
User Avatar
Member
10 posts
Joined: Feb. 2022
Offline
GnomeToys
The 4090 only has 256 fp64 cores out of its 16,000-odd CUDA cores, and they're spread out amongst all the SMs, which makes it impossible to get any kind of cache locality working with them, so anything that needs higher than fp32 precision is probably ending up on the CPU.

Referring to the quote above, could anyone help me out with info on where and under what circumstances Karma XPU uses higher than fp32 precision? Sorry, maybe it's basic knowledge, but I'm a bit lost on this point.

Thanks in advance if anyone can drop some info!
User Avatar
Staff
486 posts
Joined: May 2019
Offline
Polybud
Referring to the quote above, could anyone help me out with info on where and under what circumstances Karma XPU uses higher than fp32 precision?

I got a bit lost reading GnomeToys' reply, sorry, but I'll try to cover XPU's GPU/multi-device architecture briefly, which will hopefully clarify things.

XPU treats each device (including the CPU device) as a separate entity. There is no memory sharing between devices. They do not know about each other or communicate. They each have a separate copy of the scene data.

XPU instructs each of them to render separate passes of the image (some will do this faster than others), which it receives and blends into the final image in whatever order they arrive.

This is a failsafe architecture because it doesn’t matter what combination of devices someone has, or if (eg) one of them fails or whatever, we still end up with the same final result.

For this to work, each type of device needs to produce the EXACT same result (including the CPU device). So to this end we only use fp32 calculations across all devices.
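
To illustrate the model, here is a toy sketch (not XPU source code, just an illustration of the description above): each device renders whole passes, the merger accumulates them as they arrive, and because every device produces identical fp32 passes the blended result is independent of arrival order or device mix.

# Illustrative sketch of the "separate passes, blended in arrival order" model.
import numpy as np

HEIGHT, WIDTH = 4, 4  # toy image size

def render_pass(seed: int) -> np.ndarray:
    """Stand-in for one pass rendered by any device (CPU or GPU).
    All devices use fp32 and identical sampling, so the pass seed fully
    determines the result regardless of which device computed it."""
    rng = np.random.default_rng(seed)
    return rng.random((HEIGHT, WIDTH), dtype=np.float32)

accum = np.zeros((HEIGHT, WIDTH), dtype=np.float32)
num_passes = 0

# Passes arrive in whatever order the devices finish them.
arrival_order = [3, 0, 5, 1, 4, 2]
for pass_id in arrival_order:
    accum += render_pass(pass_id)
    num_passes += 1
    preview = accum / num_passes  # progressive image shown while rendering

final_image = accum / num_passes  # same result for any arrival order / device mix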
Edited by brians - Feb. 24, 2024 02:19:45