Karma XPU failure on 3090ti

Member · 207 posts · Joined: Nov. 2015
Hi;

I have a scene which I'd like to render with XPU. Prior to today, I was working on this scene on a machine containing a 1080 and a 1080ti graphics card. The scene typically failed to render using these GPUs, I think because it was running out of VRAM. My indication for this is that when I shift over to Karma in the viewport, I see two OPTIX processes and one EMBREE process; after a few moments, the OPTIX ones say "fail", and the render soldiers on with the Embree process only (and, of course, this is punishingly slow).

So, I decided to roll the dice and invest in a shiny new 3090ti with 24GB VRAM, which showed up today. I installed it, ensured I had the latest drivers installed, loaded my scene and switched the viewport to Karma. The scene sat in an "initializing" state for nearly two minutes; I saw both OPTIX and EMBREE processes initializing. Then, crushingly, the OPTIX one said "fail". I tried resetting Karma and restarting the render, only now I see NO Optix process at all. Oddly, my Task Manager shows the GPU under some amount of load, though Karma seems to indicate it isn't seeing the card at all.

Other than trying to divine what's going on using just these viewport indicators, what other things can I do to debug this to better understand what might be going on?

Edited by dhemberg - Oct. 16, 2022 22:23:37

Attachments:
Screenshot 2022-10-16 211337.png (426.3 KB)

Member · 59 posts · Joined: Mar. 2012
First: is the Intel HD Graphics 530 perhaps selected inside Houdini?
You can go to Edit > Preferences > Miscellaneous > OpenCL Device
to check which graphics card is selected.

Second: I'm assuming your physical RAM (29.5/31.9 GB) has already run out while processing OptiX.

If you have tons of polygons in your scene, and
you import it via a Scene Import node in Solaris, you may need to conserve physical RAM.

So:
1. Cache the scene to an intermediate format such as .bgeo/.abc/.usd, then
2. import the cache file with a File node; that might let your scene render.

Fingers Crossed,
Member · 207 posts · Joined: Nov. 2015
Hm, unfortunately neither of these things gets me unstuck, and I'm left still not understanding where I might be going off the rails with my scene (though I am grateful for the reply!).

I have a vague awareness that USD can rapidly create complexity via time samples if one isn't careful. So far, I haven't been particularly careful about the use of wrangles and python nodes in my scene, and I see several of them have little clock icons next to them, despite the intention of the code within to just do something once (i.e. it is not my intent to animate a camera position, though I am using a wrangle to set it).

How can I avoid creating time samples via the use of parameter expressions and wrangles? I can just use a timeShift at the end of everything to effectively kill all animation, but this seems blunt and I WOULD like to animate part of my scene. What are ways I can control this more thoughtfully?

(I'm unsure this question will actually get me going vis-à-vis my XPU renders, but I gotta start debugging somewhere...)
Edited by dhemberg - Oct. 18, 2022 14:02:25
Staff · 466 posts · Joined: May 2019
some thoughts:

1)
Can you get a basic scene going in XPU? (eg a single unshaded torus, with no lights)
If so, that is a start and we can work on from there

2)
if you open the viewport render stats, you'll see a more detailed reason as to why XPU failed

https://www.sidefx.com/docs/houdini/solaris/karma_xpu.html#howto
https://www.sidefx.com/docs/houdini/images/solaris/solaris_xpu_render_stats.png
Edited by brians - Oct. 18, 2022 22:58:52
Member · 207 posts · Joined: Nov. 2015
Hi;

1) Yes, a basic scene renders fine; I can confirm the card itself is working.

2) This unfortunately doesn't seem to uncover any clues, and my experience so far is inconsistent: sometimes the card fails, other times it doesn't, and the stats and log don't seem to offer clues as to what's happening when. Either way, the render forges ahead; when it works, I get renders in minutes; when the card fails, I get renders in hours. Is there a way to say "Hey, if XPU is failing on a graphics card, bail on the render outright and tell me what went wrong"? Is there a verbosity level on the USD ROP LOP that might help? The logs right now (set to "render stats") don't offer any obvious clues.

Absent this, I'm doing some good ol' rolling-up-the-sleeves TD debugging. The failures seem to happen when I have motion blur enabled on a particular USD layer (in my case: tree animation generated by Labs Trees + Vellum). My spidey sense encourages me to wonder if I have a varying point count in my USD file that might be causing time samples (and, by extension, motion blur) to fail. The only way I can think to test this is by manually spelunking my USD to make sure point counts are not varying frame to frame.

I'm comfortable doing my own debugging; what I'm struggling with is the apparent lack of tools to pinpoint what might be going wrong. I've never dealt with a renderer that has this fallback behavior; I'm used to renders either failing or not, so this thing where the render continues with no messages about why the GPU side of it is failing is puzzling to me.
Member · 7740 posts · Joined: Sep. 2011
dhemberg
I'm comfortable doing my own debugging; what I'm struggling with is the apparent lack of tools to pinpoint what might be going wrong. I've never dealt with a renderer that has this fallback behavior; I'm used to renders either failing or not, so this thing where the render continues with no messages about why the GPU side of it is failing is puzzling to me.

disable the embree xpu device, and it won't fall back to it anymore.
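One way to do this is with environment variables set before launching Houdini or husk. A minimal sketch follows; the variable names here are what I believe the Karma XPU docs list, but treat them as assumptions and verify the exact spellings against the current documentation:

```shell
# Assumed variable names -- verify against the Karma XPU docs.
# Disable the Embree (CPU) device so an Optix failure is fatal and
# obvious, rather than silently falling back to a slow CPU render:
export KARMA_XPU_DISABLE_EMBREE_DEVICE=1

# Conversely, to test the CPU path alone, disable the Optix device:
# export KARMA_XPU_DISABLE_OPTIX_DEVICE=1
```

With the fallback device disabled, a GPU failure surfaces immediately instead of being masked by hours-long CPU renders.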

dhemberg
Absent this, I'm doing some good ol' rolling-up-the-sleeves TD debugging. The failures seem to happen when I have motion blur enabled on a particular USD layer (in my case: tree animation generated by Labs Trees + Vellum). My spidey sense encourages me to wonder if I have a varying point count in my USD file that might be causing time samples (and, by extension, motion blur) to fail. The only way I can think to test this is by manually spelunking my USD to make sure point counts are not varying frame to frame.

Perhaps the layer is causing otherwise instanced geometry to become unique? 99% of the time Optix fails it's simply running out of VRAM. The other 1% is driver problems. Make sure the layer is only adding position time samples and not topology ones. If the scene had instances, the animation layer might need to be added to a class primitive that specializes the instances.
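To illustrate that last point, here is a rough sketch (hypothetical prim names, trimmed to the composition arcs only) of what "an animation layer added to a class primitive that the instances specialize" could look like in ASCII USD:

```usda
#usda 1.0
# Base file: every tree instance specializes a single class prim.
class "_treeClass"
{
}

def "tree_01" (
    specializes = </_treeClass>
)
{
}

# Animation layer (a separate file, sublayered on top): override the
# class, not each instance, so the time samples flow to all the trees.
over "_treeClass"
{
    point3f[] points.timeSamples = {
        1: [(0, 0, 0)],
        2: [(0, 0.1, 0)],
    }
}
```

Because specializes is the weakest composition arc, edits authored on the class reach every instance without making their geometry unique.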
Edited by jsmack - Oct. 23, 2022 22:27:52
Staff · 466 posts · Joined: May 2019
dhemberg
no messages about why the GPU side of it is failing

Are you able to reproduce the problem in the viewport? Or does it only happen offline (ie via husk)?

In Viewport:
if you open the viewport render stats, you'll see a more detailed reason as to why XPU failed
https://www.sidefx.com/docs/houdini/solaris/karma_xpu.html#howto
https://www.sidefx.com/docs/houdini/images/solaris/solaris_xpu_render_stats.png

Do you get a more detailed message in the viewport render stats?
Or... is it like... blank or something?


offline/husk:
There should be an error message printed out in the render log (ie to the terminal)
Something like this
"KarmaXPU: device Type:Optix ID:0 has registered a critical error cudaErrorIllegalAddress, so will now stop functioning. Future error messages will be suppressed"
Edited by brians - Oct. 23, 2022 22:34:57
Member · 207 posts · Joined: Nov. 2015
jsmack
disable the embree xpu device, and it won't fall back to it anymore.

Oh, this seems quite useful; how does one do this? I don't see options in my LOP nodes, and I'm not sure where to look in Preferences...

jsmack
Perhaps the layer is causing otherwise instanced geometry to become unique? 99% of the time Optix fails it's simply running out of VRAM. The other 1% is driver problems.

OK, thank you; both useful clues. Here's how I tried setting my scene up (predicated on this [www.sidefx.com] thread):

  • I make some trees using Labs trees
  • I get them wiggling around a little using Vellum: by this I mean I get the trunk wiggling, then use the animated trunk to get the leaves wiggling, both with vellum. At this point, nothing is instanced (in Houdini parlance).
  • I export static trees+leaves+materials. This gives me tree01.usd, tree02.usd, etc.
  • I remove N/uv, and export the tree/leaf geo again for a frame range. I *think* the resulting USD files (tree_anim01.usd, tree_anim02.usd, etc.) contain only time-sample data for the points.
  • In my renderable scene file, I import all tree#.usd.
  • I sublayer the tree_anim#.usds. In my Scene Graph Details, I see the tree points turn green, indicating they are animated (I think).



  • I then use a Lop Instancer to instance these trees where I want them.

This all seems to work fine, I can render a still image on my GPU fine. But, if I switch from rendering Current Frame to Frame Range, the render fails.

If I add a Render Geometry Settings node and disable motion blur on the trees, no crashed Optix. If I disable the sublayer of animation, no crashed Optix. So all clues point to something being hosed in what I think is my animation data for the trees.


jsmack
Make sure the layer is only adding position time samples and not topology ones.

What might be a way I can verify this? I can manually scrub the timeline and ensure my point count is the same each frame, though with a 240-frame sequence this seems tedious to verify. Just curious how I might approach this.


jsmack
If the scene had instances, the animation layer might need to be added to a class primitive that specializes the instances.

Hm, still learning USD so I only understand some of this jargon. I *think* what you're saying aligns with what I'm attempting to do as described above, though if that's not right could I trouble you to elaborate?
Edited by dhemberg - Oct. 24, 2022 11:01:37

Attachments:
trees.png (892.7 KB)

Member · 207 posts · Joined: Nov. 2015
brians
Do you get a more detailed message in the viewport render stats?

Hi; I get this. I *think* the relevant thing I should be noticing here is the PeakDeviceMemTotal (?). Is this suggesting my scene is trying to take up 400GB of memory?



brians
offline/husk:
There should be an error message printed out in the render log (ie to the terminal)
Something like this
"KarmaXPU: device Type:Optix ID:0 has registered a critical error cudaErrorIllegalAddress, so will now stop functioning. Future error messages will be suppressed"

Yes, exactly, I see this exact message. I'm afraid it's not very helpful; I can easily understand when the GPU is failing (render times shoot through the roof), it's the "why" that I'm trying to pinpoint.
Edited by dhemberg - Oct. 24, 2022 11:09:21

Attachments:
trees_fail.png (1.7 MB)

Staff · 466 posts · Joined: May 2019
"cudaErrorIllegalAddress" is the piece of information we're looking for.
It's no magic bullet, but it does rule out some other things.
thanks for the info!
Member · 7740 posts · Joined: Sep. 2011
dhemberg
jsmack
Make sure the layer is only adding position time samples and not topology ones.

What might be a way I can verify this? I can manually scrub the timeline and ensure my point count is the same each frame, though with a 240 frame sequence these seems tedious to verify. Just curious how I might approach this.

With USD, the point count being constant is irrelevant. Animated topology can have all the keys with the same values and still be considered animated. Notice in your screen shot that faceVertexCounts and faceVertexIndices are green which indicates that they have time samples.

When writing the animation layer, don't worry about which attributes are stripped off in SOPs; instead, look at which attributes are imported to USD with SOP Import. This way the animation layer can be restricted to containing only points/normals time samples and nothing else that is not animated. On the topology settings, 'None' can be specified, which imports no topology, keeping whatever topology is on the layer stack unchanged. I'm not sure if this is the cause of the failure; XPU should support changing topology without problems, but it could lead to memory bloat.
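One cheap way to audit which attributes actually carry time samples (a sketch of mine, not a built-in tool; it assumes the layer is ASCII .usda, so run binary .usdc files through `usdcat` first) is to scan the layer text for `.timeSamples` blocks and count them per attribute:

```python
import re
from collections import Counter

def time_sampled_attrs(usda_path):
    """Count timeSamples blocks per attribute name in an ASCII .usda layer.

    Good enough to spot unwanted faceVertexCounts/faceVertexIndices
    samples hiding in an animation layer that should only animate points.
    """
    # Matches e.g. "point3f[] points.timeSamples = {" -> group 1 = "points";
    # [\w:]* also catches namespaced names like "primvars:st".
    pattern = re.compile(r'(\w[\w:]*)\.timeSamples\s*=')
    counts = Counter()
    with open(usda_path) as f:
        for line in f:
            m = pattern.search(line)
            if m:
                counts[m.group(1)] += 1
    return counts
```

If `faceVertexCounts` or `faceVertexIndices` show up in the result for an animation layer, the layer is authoring topology samples, not just point samples.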
Member · 207 posts · Joined: Nov. 2015
I think I found a smoking gun.

Here is an archive of a very simple scene file setup:
https://www.dropbox.com/sh/xt6ozjuvpsi1tqa/AAA6n0SubrgdlgpYmiD5HJ_ba?dl=0

The archive contains a hip file, and a folder containing USD files of a single static tree, and an animated tree. There's also a couple of textures in there, and this is where I think it gets interesting.

The scene file just reads the static tree, assigns a MaterialX Shader to the leaves that includes opacity and an albedo map driving subsurface:





When I disconnect either the subsurface or the opacity connection in the shader, the Optix crash disappears.

Am I doing something obviously illegal in this setup? I can't deduce from the Render Stats in the viewport what might be going awry.

Attachments:
tree_failure.png (863.2 KB)
leaf_shader.png (39.9 KB)

Member · 7740 posts · Joined: Sep. 2011
Does my simple scene crash for you? It didn't crash for me.

I tried your scene, but the static tree file is empty, so nothing is assigned as a material. The static tree file is just a sublayer of this file, which is local to you:
subLayers = [
@c:/Users/beautifulShelves/geometry/exterior/vegetation/trees/static/tree_static_01.usd@
]
Edited by jsmack - Oct. 25, 2022 19:22:03

Attachments:
maple_leaf_xpu.hip (638.9 KB)

Member · 7740 posts · Joined: Sep. 2011
I was able to render the tree without the static file just fine. The animation file contains the entire tree so the static one is made redundant. I removed the static tree reference and moved the material assignment to after the animation is referenced.

Edit:
It fails with your render settings, but it works when I disable caustics. (How did that get turned on? Caustics pretty much shouldn't ever be used.)

to sidefx and materialX: what do you expect when you confound transparency with opacity by using refraction rays to simulate opacity?
Edited by jsmack - Oct. 25, 2022 19:38:03

Attachments:
tree_optix.png (2.2 MB)

Member · 207 posts · Joined: Nov. 2015
Heyo;

jsmack
Does my simple scene crash for you? It didn't crash for me.

Indeed, this does seem to work! Thank you for a further simplification of this, and sorry about this oversight of mine:

jsmack
I tried your scene, but the static tree file is empty, so nothing is assigned as a material. The static tree file is just a sublayer of this file, which is local to you:
subLayers = [
@c:/Users/beautifulShelves/geometry/exterior/vegetation/trees/static/tree_static_01.usd@
]

Apologies.

jsmack
I was able to render the tree without the static file just fine. The animation file contains the entire tree so the static one is made redundant. I removed the static tree reference and moved the material assignment to after the animation is referenced.

I'm still rebuilding things as you described above, which feels much cleaner and clearer, so thank you for that. I suspected my attempt at setting up animation layering wasn't all the way right, but I was hesitant to muddy already muddy waters while trying to get to the bottom of this one, so I tried to hold off on questions about that!


jsmack
Edit:
It fails with your render settings, but it works when I disable caustics. (how did that get turned on, they pretty much shouldn't ever be used?)

Er, hold the phone: say what?

My (larger) renderable scene file is an archvis setup; these trees are outside windows, and I need light to come into my interior, but need to see reflections on the windows. Thin-walled refraction did not seem to be an obvious option in MaterialX, so I'm dealing with windows the hard, terrible way, which is to say I give them proper thickness and deal with awful render times, which is primarily why I'm interested in XPU to help things along. If I disable caustics, I simply get fully shadowed, black interiors.

I haven't suffered any issues with this other than render times, and haven't come across any indication that using caustics causes unusual problems...this is the first I'm learning of this! The pictures themselves look great, though.





jsmack
to sidefx and materialX: what do you expect when you confound transparency with opacity by using refraction rays to simulate opacity?

Yeah, this bit has been pretty tricky to find my way around; I agree that this leaf setup is pretty odd compared to how I might handle it in any other renderer. But I remain confounded that this causes a failure on my GPU yet works on the CPU; that's pretty tricky.

Anyway, much appreciated for the continued debugging help.
Edited by dhemberg - Oct. 25, 2022 21:32:34

Attachments:
20220817_1630_beautifulShelf.png (2.2 MB)
20220904_2249_beautifulShelf.png (1.9 MB)
20220829_0707_beautifulShelf.png (2.3 MB)

Member · 207 posts · Joined: Nov. 2015
Just to add: simply disabling caustics in my Render Settings on my full tree setup here doesn't remedy the Optix failures I'm seeing.
Staff · 466 posts · Joined: May 2019
jsmack
to sidefx and materialX: what do you expect when you confound transparency with opacity by using refraction rays to simulate opacity?

If you...
- set IOR=1
- transmission=1
- roughness=0

With...
- EnableCaustics=0

Then you do get an opacity-like effect. The main reason this works is that we now have "fake caustics" (ie transparent shadows) working by default. So you'll get a nice semitransparent shadow.




With
- EnableCaustics=1

you'll get a hard shadow with caustic fireflies etc...



For reference, here is the same image but using opacity instead of transmission



My 2 cents is that one should really use opacity for something like tree leaves. Using transmission comes with other issues. One is that propagating many paths can be problematic/costly. Another is that, without thin-walled, the reverse side of a quad can look problematic due to IOR issues.

dhemberg
Thin-walled refraction did not seem to be an obvious option in MaterialX

"thinwalled" on MtlxStandardSurface works fine in XPU
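For reference, a thin-walled setup along the lines described above might look roughly like this as a flattened USD shader prim. This is a hedged sketch, not taken from the scene in question; the input names follow the MaterialX `standard_surface` specification, so check the node's parameter pane for the exact spellings:

```usda
def Shader "leaf_surface"
{
    uniform token info:id = "ND_standard_surface_surfaceshader"
    # IOR 1 + full transmission + zero roughness approximates opacity...
    float inputs:specular_IOR = 1.0
    float inputs:transmission = 1.0
    float inputs:specular_roughness = 0.0
    # ...and thin_walled sidesteps back-face IOR artifacts on single quads.
    int inputs:thin_walled = 1
}
```

With caustics left off, the default fake-caustics path then gives the semitransparent shadows described above.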

dhemberg
If I disable caustics, I simply get fully shadowed, black interiors.

Really? I get the opposite.
Do my attached images and scene look/behave as you'd expect?

thanks
Edited by brians - Oct. 26, 2022 03:38:23

Attachments:
transmission_with_enable_caustics_off.JPG (52.1 KB)
transmission_with_enable_caustics_on.JPG (56.0 KB)
opacity_with_enable_caustics_off.JPG (55.6 KB)
trans_vs_opacity.hip (384.0 KB)

Member · 207 posts · Joined: Nov. 2015
I'm not sure we're barking up the right tree here. @brians: in the shader graph I provide above (as well as in the scene file I provide), I am driving "opacity" of the MaterialX shader with a greyscale map that is intended to say where my tree leaves are visible and where they are not. I'm not driving anything by transmission.

I have Caustics enabled in my renderable scene for reasons entirely separate from these tree leaves. I copy/pasted the Render Settings node from my renderable scene into the hip file I'm using to set up my trees so that I have a clear understanding of how they'll behave in my renderable scene, but I do not specifically rely on caustics to drive the look of my tree.

I assume @jsmack means that the underlying MaterialX code in Karma uses the same transmission code path for both opacity and transmission, but that could be me misunderstanding what he's saying.

In any case, I've taken another try at isolating this issue, again here is a (hopefully more self-contained scene):
https://www.dropbox.com/sh/xt6ozjuvpsi1tqa/AAA6n0SubrgdlgpYmiD5HJ_ba?dl=0

When I open the scene as is and switch to Karma in the viewport (the RS node is set to use XPU, and importantly, Caustics are disabled on the RS node), I consistently get an Optix failure.




Then, if I:
--disable the material assignment node
OR
--disconnect the Opacity Map from the Opacity input of the MaterialX node (note again: no transmission is involved here at all)
OR
--disable the animation sublayer

the Optix failure goes away (I have to restart Houdini to bring Optix back up after a failure, so the above actions are being done after freshly launching the scene).

I don't think this is a caustics/transmission issue. Or, if it is, it's not clear to me how to fix it from this thread so far.

Attachments:
tree_failure.png (1.7 MB)

Member · 7740 posts · Joined: Sep. 2011
dhemberg
the Optix failure goes away (I have to restart Houdini to bring Optix back up after a failure, so the above actions are being done after freshly launching the scene).

I don't think this is a caustics/transmission issue. Or, if it is, it's not clear to me how to fix it from this thread so far.

Disabling caustics is all it took for it to work for me. I think there is bug there somewhere.

brians
My 2-cents is that someone should really use opacity for something like tree-leaves. Using transmission comes with other issues. One is that propagating many paths can be problematic/costly. Another is without thin-walled the reverse side of a quad will can look problematic due to IOR issues.

We are using opacity here, for leaf cutout, but I think Karma uses ray continuation for opacity, no? It's not doing a sum over all hits like Mantra used to. That's refraction in my book.
Member · 7740 posts · Joined: Sep. 2011
dhemberg
My (larger) renderable scene file is an archvis setup; these trees are outside windows, and I need light to come into my interior, but need to see reflections on the windows. Thin-walled refraction did not seem to be an obvious option in MaterialX, so I'm dealing with windows the hard, terrible way, which is to say I give them proper thickness and deal with awful render times, which is primarily why I'm interested in XPU to help things along. If I disable caustics, I simply get fully shadowed, black interiors.

You should not get black interiors, MaterialX in Karma enables fake caustics by default. You only need to enable caustics when you want cool patterns, not just to let light into an interior. Even with how fast XPU is, if you use true caustics you'll need 1000-1000000x as many rays to resolve it, as well as having to increase the color limit for indirect light to be as bright as your light sources.