Karma XPU - dual RTX 4090 setup - performance issues

User Avatar
Member
22 posts
Joined: Oct. 2015
Online
I'm running 2 × RTX 4090 plus a 128-core Threadripper, but compared to running the same system with only 1 × RTX 4090 I'm seeing only around a 50% increase in speed. Is this normal, or is there anything I can do to squeeze more juice out of it? Currently on Houdini 19.5.716.

cheers
Edited by timjan - Nov. 12, 2023 05:28:10
User Avatar
Staff
486 posts
Joined: May 2019
Offline
That is a beast of a CPU.

You can enable/disable different types of devices
https://www.sidefx.com/docs/houdini/solaris/karma_xpu.html#disablingdevices

It would be great to get some render times from you for (eg)...
- CPUdevice=on GPUdevice0=off GPUdevice1=off
- CPUdevice=off GPUdevice0=on GPUdevice1=off
- CPUdevice=off GPUdevice0=off GPUdevice1=on
- CPUdevice=off GPUdevice0=on GPUdevice1=on
- CPUdevice=on GPUdevice0=on GPUdevice1=on

This way we can verify that the performance gain from using 1 or 2 GPUs is correct/expected.
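
For anyone who wants to script the comparison, here is a rough, hypothetical sketch of timing those five combinations in a batch. It assumes the KARMA_XPU_DISABLE_CPU_DEVICE variable from the linked "disabling devices" docs section, uses the standard NVIDIA CUDA_VISIBLE_DEVICES variable to hide individual GPUs, and the husk command line is only a placeholder to adapt to your own scene and Houdini version.

# Hypothetical benchmark driver for the device combinations listed above.
# Assumptions (verify against your install):
#   - KARMA_XPU_DISABLE_CPU_DEVICE=1 disables the CPU device (see the linked
#     docs section for the exact variable names).
#   - CUDA_VISIBLE_DEVICES is the standard NVIDIA variable for hiding GPUs
#     from CUDA/OptiX, used here to emulate GPUdevice0/1 on/off.
#   - HUSK_CMD is a placeholder; replace it with your actual render command.
import os
import subprocess
import time

HUSK_CMD = ["husk", "--renderer", "BRAY_HdKarmaXPU", "scene.usd", "-o", "out.exr"]

# (cpu, gpu0, gpu1) on/off combinations from the post above.
COMBOS = [
    (True,  False, False),
    (False, True,  False),
    (False, False, True),
    (False, True,  True),
    (True,  True,  True),
]

for cpu, gpu0, gpu1 in COMBOS:
    env = os.environ.copy()
    env["KARMA_XPU_DISABLE_CPU_DEVICE"] = "0" if cpu else "1"
    visible = [str(i) for i, on in enumerate((gpu0, gpu1)) if on]
    env["CUDA_VISIBLE_DEVICES"] = ",".join(visible)  # empty string hides all GPUs
    start = time.time()
    subprocess.run(HUSK_CMD, env=env, check=True)
    print(f"CPU={cpu} GPU0={gpu0} GPU1={gpu1}: {time.time() - start:.1f}s")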

thanks
User Avatar
Member
22 posts
Joined: Oct. 2015
Online
Yes, it's indeed powerful for all things CPU related; I'm very pleased with the system. Thanks for the update, will check it out!
User Avatar
Member
11 posts
Joined: Jan. 2017
Offline
If it works like Radeon ProRender, only the primary card (an RTX 4090 in my case; the secondary is a 7900 XTX, and yes, surprisingly that combination actually works) was allowed to work on samples above the minimum. The adaptive samples require on-the-spot decisions based on rays / photon casts from the rest of the scene, depending on how it's being done, whereas the non-adaptive samples only require the data resulting from those, which isn't really needed until all the samples are mixed together (so it probably doesn't need to be transferred to the other card at all). My guess is that it's too slow to be beneficial without something like NVLink, which Nvidia conveniently killed off on everything below the $8500 L40 in the Ada generation. That, or that's simply as much work as can be offloaded from the CPU. The 4090 only has 256 fp64 cores out of its 16,000-odd CUDA cores, and they're spread out amongst all the SMs, which makes it impossible to get any kind of cache locality working with them, so anything that needs higher than fp32 precision is probably ending up on the CPU.
User Avatar
Member
10 posts
Joined: Feb. 2022
Offline
GnomeToys
The 4090 only has 256 fp64 cores out of its 16,000-odd CUDA cores, and they're spread out amongst all the SMs, which makes it impossible to get any kind of cache locality working with them, so anything that needs higher than fp32 precision is probably ending up on the CPU.

Referring to the quote above, could anyone help me out with info on where and under what circumstances Karma XPU uses higher than fp32 precision? Sorry, maybe it's basic knowledge, but I'm a bit lost on this point.

Thanks in advance if anyone can drop some info!
User Avatar
Staff
486 posts
Joined: May 2019
Offline
Polybud
Referring to the quote above, could anyone help me out with info on where and under what circumstances Karma XPU uses higher than fp32 precision?

I got a bit lost reading GnomeToys' reply, sorry, but I'll try to cover XPU's GPU/multi-device architecture briefly, which will hopefully clarify things.

XPU treats each device (including the CPU device) as a separate entity. There is no memory sharing between devices. They do not know about each other or communicate. They each have a separate copy of the scene data.

XPU instructs each of them to render separate passes of the image (some will do this faster than others), which it receives and blends into the final image in whatever order they arrive.

This is a failsafe architecture because it doesn’t matter what combination of devices someone has, or if (eg) one of them fails or whatever, we still end up with the same final result.

For this to work, each type of device needs to produce the EXACT same result (including the CPU device). So to this end we only use fp32 calculations across all devices.
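
To illustrate the model, here is a toy sketch (not XPU source code, just an illustration of the description above): each device renders whole passes, the merger accumulates them as they arrive, and because every device produces identical fp32 passes the blended result is independent of arrival order or device mix.

# Illustrative sketch of the "separate passes, blended in arrival order" model.
import numpy as np

HEIGHT, WIDTH = 4, 4  # toy image size

def render_pass(seed: int) -> np.ndarray:
    """Stand-in for one pass rendered by any device (CPU or GPU).
    All devices use fp32 and identical sampling, so the pass seed fully
    determines the result regardless of which device computed it."""
    rng = np.random.default_rng(seed)
    return rng.random((HEIGHT, WIDTH), dtype=np.float32)

accum = np.zeros((HEIGHT, WIDTH), dtype=np.float32)
num_passes = 0

# Passes arrive in whatever order the devices finish them.
arrival_order = [3, 0, 5, 1, 4, 2]
for pass_id in arrival_order:
    accum += render_pass(pass_id)
    num_passes += 1
    preview = accum / num_passes  # progressive image shown while rendering

final_image = accum / num_passes  # same result for any arrival order / device mix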
Edited by brians - Feb. 24, 2024 02:19:45