Karma XPU feels really slow!

User Avatar
Member
355 posts
Joined: Nov. 2015
Offline
jsmack
alexwheezy
traileverse
XPU at 11 mins? could that be a linux thing? Is it slower on windows? and CPU is doing half the work. Your GPU only is still better than my 3090. Why driver 535.129.03? is that the latest on linux?

Yes, the XPU engine rendered the picture in 11 minutes. I can't speak to differences in how XPU behaves on Windows versus Linux; I haven't run tests with my configuration on Windows yet.
The driver used is the latest stable from the repository.

yeah, something sounds way off. It takes 30 minutes to render with a 3090 doing 90% of the work on Windows. A 3090 is probably 2x as fast as a 3070Ti in rendering too. Also 51% work done by a 5950? I don't buy it. A 3070Ti gpu is at least a few times faster than that CPU. Maybe you rendered it to viewport size and not to MPlay at 1920x1080?

If that 11-minute render is correct, then something is way off, and someone from SideFX needs to jump in here and check whether something is wrong on Windows. Those are the render times I was expecting; on a 3090 (2x faster) that would be about 5 to 7 minutes per frame for that bubble wrap scene.
hou.f*ckatdskmaya().forever()
User Avatar
Member
201 posts
Joined: Jan. 2013
Offline
jsmack
alexwheezy
traileverse
XPU at 11 mins? could that be a linux thing? Is it slower on windows? and CPU is doing half the work. Your GPU only is still better than my 3090. Why driver 535.129.03? is that the latest on linux?

Yes, the XPU engine rendered the picture in 11 minutes. I can't speak to differences in how XPU behaves on Windows versus Linux; I haven't run tests with my configuration on Windows yet.
The driver used is the latest stable from the repository.

yeah, something sounds way off. It takes 30 minutes to render with a 3090 doing 90% of the work on Windows. A 3090 is probably 2x as fast as a 3070Ti in rendering too. Also 51% work done by a 5950? I don't buy it. A 3070Ti gpu is at least a few times faster than that CPU. Maybe you rendered it to viewport size and not to MPlay at 1920x1080?

Oh yeah, that's my mistake. I should have specified in the post that the render resolution was 1280 x 720. I ran this test on H20 Apprentice, and that is the maximum render resolution for that version. I didn't notice that the render node settings specified Full HD. Otherwise, the load between CPU and GPU was distributed as described in the previous post.
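To put the resolution difference in numbers, here is a rough sketch. It assumes render time scales linearly with pixel count at a fixed number of samples per pixel, which ignores fixed costs such as scene loading, so treat the result as a ballpark only. The function names are made up for illustration.

```python
def pixel_count(width, height):
    """Number of pixels in a frame of the given size."""
    return width * height


def scale_render_time(minutes, from_res, to_res):
    """Scale a render time by the ratio of pixel counts.

    Assumption: time is proportional to pixel count, with no fixed overhead.
    """
    ratio = pixel_count(*to_res) / pixel_count(*from_res)
    return minutes * ratio


# 11 minutes at 1280 x 720, extrapolated to 1920 x 1080
ratio = pixel_count(1920, 1080) / pixel_count(1280, 720)
estimate = scale_render_time(11, (1280, 720), (1920, 1080))
print(f"pixel ratio: {ratio:.2f}")
print(f"estimated Full HD time: {estimate:.2f} minutes")
```

Under that assumption, Full HD has 2.25 times the pixels of 1280 x 720, so the 11-minute render would land somewhere near 25 minutes at 1920 x 1080.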
Edited by alexwheezy - Nov. 13, 2023 14:23:20
User Avatar
Member
33 posts
Joined: Aug. 2017
Offline
ok, I gave the bubble wrapper a spin using the original settings from the file.

Rendering on a threadripper pro 64 core, 2x 4500 RTX (they are not superfast but together should beat a 3090), windows 10

renderdistribution was 21%/25%/52% (optix/optix/embree)

rendertime was 14:38


There is still visible noise in the DOF areas. To be honest, I could have stopped at about 50% and cleaned the rest up with denoising. But I guess that's not the point. I also find it intriguing that the OptiX devices don't have the same render contribution and that the CPU is faster than the two combined (with other scenes this is also sometimes the case, but usually the GPUs are much faster than the CPU).

Not sure how fast this would be on a 4000-series card, as I remember reading a post saying they might be faster at rendering refractions because of the shader sorting functionality.
Edited by ronald_a - Nov. 14, 2023 03:37:52
User Avatar
Member
433 posts
Joined: April 2018
Offline
Man, I'd imagine the scene would render in 30 seconds on that system! Very interesting.
Subscribe to my Patreon for the best CG tips, tricks and tutorials! https://patreon.com/bhgc [patreon.com]

Twitter: https://twitter.com/brianhanke [twitter.com]
Behance: https://www.behance.net/brianhanke/projects [www.behance.net]
User Avatar
Staff
469 posts
Joined: May 2019
Offline
ronald_a
renderdistribution was 21%/25%/52% (optix/optix/embree)

Anything with lots of nested refraction is fairly problematic for GPU (creates lots of divergence). So I'm not surprised this bubble-wrap scene has speed issues with XPU. But it would be good to make sure there is not something else at play. What happens if you do a full render with the CPU device disabled?
you can do that by setting this environment variable
KARMA_XPU_DISABLE_EMBREE_DEVICE=1
https://www.sidefx.com/docs/houdini/solaris/karma_xpu.html#disablingdevices [www.sidefx.com]
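One way to apply that variable for a single session is to set it only in the environment of the process you launch, rather than system-wide. The sketch below does that with Python's standard library. The variable name comes from the post above; the helper names and the example launch command are hypothetical and depend on your install.

```python
import os
import subprocess


def xpu_gpu_only_env(base_env=None):
    """Return a copy of the environment with the XPU Embree (CPU) device disabled.

    The input mapping is not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["KARMA_XPU_DISABLE_EMBREE_DEVICE"] = "1"
    return env


def launch_gpu_only(command):
    """Start `command` (a list of strings) with the CPU device disabled.

    `command` is whatever starts Houdini or your renderer on your machine;
    nothing about its name or arguments is assumed here.
    """
    return subprocess.Popen(command, env=xpu_gpu_only_env())


# Hypothetical usage; replace the command with your own launcher:
# launch_gpu_only(["houdini", "bubble_wrap.hip"])
```

Because the variable is set on a copy of the environment, other sessions keep using all devices.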

ronald_a
Not sure how fast this would be on a 4000-series card, as I remember reading a post saying they might be faster at rendering refractions because of the shader sorting functionality.

Yes, the new Ada cards have something called "shader execution reordering" (SER). It's not activated on XPU yet, but we will be doing that in the coming weeks/months. It's designed to reduce divergence, so it should help with scenes like this. It will be interesting to see what the speed increase is like.
https://developer.nvidia.com/blog/improve-shader-performance-and-in-game-frame-rates-with-shader-execution-reordering/ [developer.nvidia.com]

thanks
Edited by brians - Nov. 13, 2023 21:34:45
User Avatar
Member
17 posts
Joined: Feb. 2017
Offline
Hi Brian, very happy to hear SER is being implemented (I'm running 2x 4090). Hopefully we'll see it in the dailies/production builds of 20.0.x, fingers crossed.
User Avatar
Member
98 posts
Joined: Aug. 2015
Online
Rendered the scene on 2x 4090, Ryzen 5950X, Linux. I used the one from the start of the thread; not sure if someone posted different versions. 1 min 7 sec.
Increasing samples to 1024 took 1 min 58 sec. Still noisy, but just to give some idea.
Btw, do we have any denoising options available in Karma XPU?

Attachments:
Screenshot_20231114_102208.png (1.6 MB)

User Avatar
Staff
34 posts
Joined: May 2022
Offline
Hi Mirko,

You may try the modified hip file from this post [www.sidefx.com] where the portal geometry is added to improve the dome light sampling.

The denoiser can be enabled via
- Karma Render Settings (node) > Image Output > Filters > Denoiser , or
- Display Options > Enable Denoising

There are two denoisers available: Optix (interactive), Intel OIDN (only applies after render is completed).
User Avatar
Member
33 posts
Joined: Aug. 2017
Offline
brians
ronald_a
renderdistribution was 21%/25%/52% (optix/optix/embree)

Anything with lots of nested refraction is fairly problematic for GPU (creates lots of divergence). So I'm not surprised this bubble-wrap scene has speed issues with XPU. But it would be good to make sure there is not something else at play. What happens if you do a full render with the CPU device disabled?
you can do that by setting this environment variable
KARMA_XPU_DISABLE_EMBREE_DEVICE=1
https://www.sidefx.com/docs/houdini/solaris/karma_xpu.html#disablingdevices [www.sidefx.com]

Disabling Embree clocks in at 29:06 which I guess is expected.
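As a sanity check on "expected", here is a back-of-the-envelope sketch. It assumes the reported contribution percentages are proportional to each device's throughput and that the GPUs run at the same speed with or without the CPU device active, which is a simplification. The reported shares add up to 98% (presumably rounding) and are used as-is; the function names are made up for illustration.

```python
def parse_mmss(text):
    """Convert a 'minutes:seconds' string to seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


def format_mmss(total_seconds):
    """Format a number of seconds as 'minutes:seconds'."""
    total = round(total_seconds)
    return f"{total // 60}:{total % 60:02d}"


def predict_time_without(total_seconds, shares, dropped):
    """Estimate the render time if one device is removed.

    Assumption: each device's share of the work equals its share of throughput,
    so the remaining devices need total_time / remaining_share to finish alone.
    """
    remaining = sum(share for name, share in shares.items() if name != dropped)
    return total_seconds / remaining


# Contribution shares and render time as reported earlier in the thread
shares = {"optix_0": 0.21, "optix_1": 0.25, "embree": 0.52}
predicted = predict_time_without(parse_mmss("14:38"), shares, "embree")
print("predicted GPU-only time:", format_mmss(predicted))
print("reported GPU-only time:  29:06")
```

The estimate comes out a little under 32 minutes, which is in the same range as the measured 29:06, so the GPU-only result is consistent with the reported work split.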
User Avatar
Member
7770 posts
Joined: Sept. 2011
Offline
traileverse
Could you share that optimized version? Also, are you on a 3090 as well?

Here's my version of a refraction-less transparent material.

and the result:
15:15


true refraction version:
35:08


These materials and render settings are not identical to the one from the content library so times can't be compared. I have increased refraction bounces to 16 to fill in the darkened areas of the true thin transmissive version. Color limit has been increased to 1000 to prevent clamping the blue reflections to white. This introduces additional fireflies, but with the color clamping a lot of energy is lost and doesn't tone map correctly.
Edited by jsmack - Nov. 14, 2023 11:20:04

Attachments:
bubble_wrap_mantra2.hip (2.1 MB)
bubble_wrap_mantra2.fake_thin.0001.jpg (988.3 KB)
bubble_wrap_mantra2.true_thin.0001.jpg (1002.8 KB)

User Avatar
Member
355 posts
Joined: Nov. 2015
Offline
jsmack
traileverse
Could you share that optimized version? Also, are you on a 3090 as well?

Here's my version of a refraction-less transparent material.

and the result:
15:15


true refraction version:
35:08


These materials and render settings are not identical to the one from the content library so times can't be compared. I have increased refraction bounces to 16 to fill in the darkened areas of the true thin transmissive version. Color limit has been increased to 1000 to prevent clamping the blue reflections to white. This introduces additional fireflies, but with the color clamping a lot of energy is lost and doesn't tone map correctly.

Thanks for this. I will let you know what my times are. Overall though, atm there isn’t any economic reason to go with XPU; it’s just much slower than I would’ve expected, and for what I do, it simply would be foolish not to use Redshift. It’s miles faster!
I was truly hoping to make the switch, but not right now! I’m also going to try Octane, which from my understanding is also unbiased, and I’m seeing tests online where it’s beating Redshift in a few scenes. Quality is mighty important, but for a modern renderer, and a GPU one at that, speed with a slight hit in quality is more important for motion design! I’m not gonna be spending 2k on a 4090 for these render times! No way, that’s putting me out of business. Karma, IMO, has a ways to go in that department.
hou.f*ckatdskmaya().forever()
User Avatar
Member
355 posts
Joined: Nov. 2015
Offline
Mirko Jankovic
Rendered scene on 2x4090, ryzen 5950x, linux. One from the start of the thread not sure if someone sent some different versions. 1min7sec
Increasing samples to 1024, took 1min58sec, still noisy but just to give some idea.
Btw do we have any denoising options available in Karma XPU?

Hey, could you share which motherboard, PSU, and PC case you’re using with the dual 4090s?
hou.f*ckatdskmaya().forever()
User Avatar
Member
355 posts
Joined: Nov. 2015
Offline
ali_f
Hi Mirko,

You may try the modified hip file from this post [www.sidefx.com] where the portal geometry is added to improve the dome light sampling.

The denoiser can be enabled via
- Karma Render Settings (node) > Image Output > Filters > Denoiser , or
- Display Options > Enable Denoising

There are two denoisers available: Optix (interactive), Intel OIDN (only applies after render is completed).

The modified version of the scene does perform much better but still a ways behind RS. I haven’t tried it with denoise (it’s still a bit noisy too) so I will do that as well.
hou.f*ckatdskmaya().forever()
User Avatar
Member
7770 posts
Joined: Sept. 2011
Offline
traileverse
Thanks for this. I will let you know what my times are. Overall though, atm there isn’t any economic reason to go with XPU; it’s just much slower than I would’ve expected, and for what I do, it simply would be foolish not to use Redshift. It’s miles faster!

The choice is between Karma CPU, XPU, and Mantra. Third-party renderers never enter into my consideration, as they have an added cost and lack the quality and flexibility of Mantra. And XPU is much faster than Mantra most of the time. The only question is whether it is flexible enough.
User Avatar
Member
355 posts
Joined: Nov. 2015
Offline
jsmack
traileverse
Thanks for this. I will let you know what my times are. Overall though, atm there isn’t any economic reason to go with XPU; it’s just much slower than I would’ve expected, and for what I do, it simply would be foolish not to use Redshift. It’s miles faster!

The choice is between Karma CPU, XPU, and Mantra. Third-party renderers never enter into my consideration, as they have an added cost and lack the quality and flexibility of Mantra. And XPU is much faster than Mantra most of the time. The only question is whether it is flexible enough.

Man I wish I could say that, cause I’m not a fan of 3rd party anything either. Coming from after effects and C4D I don’t even like the words plug-in side by side. It’s one of the main reasons houdini became home because I could escape the need for a ton of overhead crap especially ones to do simple things the tools should do natively.

Except when it came to rendering! clients want things fast. And who am I kidding, I want my renders fast too! Whatever is fastest on the hardware. If XPU is a bit slower that’s fine. But 5-6-7 times slower, naaah! I’ll continue getting better with it because the more you know about the renderer the better times you can get.

I also hope to see generous speed improvements in the future, whether they come from the software or only from improvements in the hardware.
hou.f*ckatdskmaya().forever()
User Avatar
Member
7770 posts
Joined: Sept. 2011
Offline
traileverse
Except when it came to rendering! clients want things fast. And who am I kidding, I want my renders fast too! Whatever is fastest on the hardware. If XPU is a bit slower that’s fine. But 5-6-7 times slower, naaah! I’ll continue getting better with it because the more you know about the renderer the better times you can get.

Then also hope to see generous speed improvements in the future or maybe only improvements in the hardware will allow, whichever one.

If speed is the most important thing, rendering in real time is probably worth consideration.
User Avatar
Member
355 posts
Joined: Nov. 2015
Offline
jsmack
traileverse
Except when it came to rendering! clients want things fast. And who am I kidding, I want my renders fast too! Whatever is fastest on the hardware. If XPU is a bit slower that’s fine. But 5-6-7 times slower, naaah! I’ll continue getting better with it because the more you know about the renderer the better times you can get.

Then also hope to see generous speed improvements in the future or maybe only improvements in the hardware will allow, whichever one.

If speed is the most important thing, rendering in real time is probably worth consideration.

Yeh I started using unreal like 2 months ago, lots of fun, getting better with it day by day and I feel houdini together with unreal is as good as it gets. The balance between quality and speed is what’s most important though, not just speed.
hou.f*ckatdskmaya().forever()
User Avatar
Member
1621 posts
Joined: March 2009
Offline
jsmack
as they have an added cost

The topic of costs will lead to some interesting discussion, once Karma goes full commercial and costs per seat (that mantra deal was always really great in that regard, especially for shops strapped for cash but still with some racks full of nodes, like ours).
Martin Winkler
money man at Alarmstart Germany
User Avatar
Member
7770 posts
Joined: Sept. 2011
Offline
I went and tested the bubble wrap scene with Cycles, since I remembered it having similar raw performance when I tested it a few years ago. Although it doesn't have a thin transmissive material, it can simulate one with the solidify modifier by making the surface double walled. After testing, I think there's something broken with XPU, performance-wise. The raw sampling speed with Cycles seems to be 30x higher than with XPU. I couldn't get the result to look exactly like the XPU one though, so maybe there is something incorrect that Cycles is doing that makes it faster. 4096 samples takes on the order of 3-5 minutes for the bubble wrap scene, depending on how many bells and whistles you enable on the material. XPU was on the order of 30 minutes for 1024 samples with the absolute most basic shader.
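For reference, the "30x" figure can be reproduced from the numbers above with a quick sketch. It compares raw samples per minute only and assumes a sample costs the same amount of work in both renderers, which the image differences suggest may not be true. The function name is made up for illustration.

```python
def samples_per_minute(samples, minutes):
    """Raw sampling rate for a render."""
    return samples / minutes


# XPU: about 1024 samples in about 30 minutes
xpu_rate = samples_per_minute(1024, 30)

# Cycles: 4096 samples in roughly 3 to 5 minutes
cycles_fast = samples_per_minute(4096, 3)
cycles_slow = samples_per_minute(4096, 5)

low = cycles_slow / xpu_rate
high = cycles_fast / xpu_rate
print(f"Cycles is roughly {low:.0f}x to {high:.0f}x faster in raw samples per minute")
```

With those inputs the ratio falls between about 24x and 40x, so 30x is a fair middle estimate.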

this is the result with Cycles. It seems to be losing a lot of energy when it stacks up in depth, even though I have indirect unclamped and 24 bounces. is it just fast because it's cheating and wrong? I would think even 4096 wrong samples would take longer than 1024 good ones.
Edited by jsmack - Nov. 14, 2023 14:45:26

Attachments:
cycles.bubble_wrap_solidify_unclamped_complex_mtl.jpg (912.1 KB)

User Avatar
Member
67 posts
Joined: June 2022
Offline
jsmack
I went and tested the bubble wrap scene with Cycles, I remembered it having similar raw performance when I tested it a few years ago. Although it doesn't have a thin transmissive material, it can simulate it with the solidify modifier by making the surface double walled. After testing, I think there's something broken with XPU, performance wise. The raw sampling speed with cycles seems to be 30x higher than with XPU. I couldn't get the result to look exactly like the XPU one though so maybe there some incorrect thing cycles is doing that makes it faster. 4096 samples takes on the order of 3-5 minutes for the bubble wrap scene depending on how many bells and whistles you enable on the material. XPU was on the order of 30 minutes for 1024 samples with the absolute most basic shader.

this is the result with Cycles. It seems to be losing a lot of energy when it stacks up in depth, even though I have indirect unclamped and 24 bounces. is it just fast because it's cheating and wrong? I would think even 4096 wrong samples would take longer than 1024 good ones.
Isn't the translucent shader in Cycles the equivalent of "thin-walled"? Also, in Cycles you set a hard limit on transparency rays. I see a big difference on the right of the screen; maybe you should allow more transparent rays in Cycles to get closer results?
Edited by sniegockiszymon - Nov. 14, 2023 15:29:43