XPU stopped working after Houdini upgrade (605 to 653).

User Avatar
Member
479 posts
Joined: Aug. 2014
Offline
It seems that after upgrading from 20.0.605 to 20.0.653 (production build) I have lost the ability to render with Karma XPU. There are no errors printed to stdout, but Log Viewer contains several errors and warnings (logs are in the attachment). HUD in the upper right corner of the viewport doesn't even mention OptiX. Same thing happens with the newest daily build (20.0.675).

I had to roll back to 20.0.605 where XPU still works.

My specs: Debian 12.5 (Bookworm), nvidia-driver/libnvoptix1 550.54.15-1 (upstream), RTX 3070.

Has anyone else experienced this problem?

Attachments:
houdinilogs.json.tar.gz (1.6 KB)

User Avatar
Member
479 posts
Joined: Aug. 2014
Offline
Nobody?

All right, then I guess it's an isolated case. I'll report it to support.
User Avatar
Member
11 posts
Joined: May 2014
Offline
I am using 653 under Windows and XPU is working fine.
User Avatar
Member
479 posts
Joined: Aug. 2014
Offline
Can you tell me which version of the NVIDIA driver you are using? Support suggested that I upgrade to 550.67, but this version isn't available in the upstream repository yet, so I cannot test this solution.

Perhaps something was changed in the XPU architecture between 605 and 653, and it now requires some functions that only exist in a newer GPU driver?
User Avatar
Member
1646 posts
Joined: March 2009
Online
I can reproduce some sort of problem.

When using XPU in .653 with NVIDIA driver 550.54.14 on Linux (not Debian, though), I get 100% Embree only, no OptiX.
It renders, just CPU-only. So there is something up.
Martin Winkler
money man at Alarmstart Germany
User Avatar
Member
479 posts
Joined: Aug. 2014
Offline
Yes, this sounds like the same issue I'm experiencing.
User Avatar
Staff
484 posts
Joined: May 2019
Offline
We are indeed loading a new driver binary.
Try rendering via the offline "karma" command at verbosity level 5, then post the log here.
I'm pretty sure we'll see an error message "Failed to load CUDA DSO ..."
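
A minimal sketch of that step, assuming the `karma` binary lives under `/opt/hfs20.0/bin` and that `test.usd` is any scene you have at hand. It is not stated here whether the message goes to stdout or stderr, so the usage line captures both. `dso_errors` is a hypothetical helper name, not part of Houdini.

```shell
# Hypothetical helper: print only the lines that report a failed CUDA DSO load.
# Reads a log on stdin; prints a note and returns 1 when nothing matches.
dso_errors() {
    grep 'Failed to load CUDA DSO' || {
        echo 'no CUDA DSO load error found'
        return 1
    }
}

# Usage (install path and scene name are assumptions):
#   /opt/hfs20.0/bin/karma -V 5 test.usd 2>&1 | tee karma.log | dso_errors
```

Keeping the full log in `karma.log` via `tee` means the complete output can still be attached to a post if the filter finds nothing.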
User Avatar
Member
479 posts
Joined: Aug. 2014
Offline
Bingo.
[20:16:57] KarmaXPU: Failed to load CUDA DSO [libnvidia-ml.so: cannot open shared object file: No such file or directory]

But why does it complain that this library cannot be found? I have it inside the /usr/lib/x86_64-linux-gnu/nvidia/current/ path. It's a symlink to libnvidia-ml.so.1, which in turn is a symlink to libnvidia-ml.so.550.54.15.
Edited by ajz3d - April 16, 2024 14:46:26

Attachments:
h653.log.tar.gz (6.1 KB)

User Avatar
Staff
484 posts
Joined: May 2019
Offline
Maybe a path issue.
We load two files dynamically at runtime, libcuda.so and libnvidia-ml.so.
Do they live beside each other in the same directory? Or is the libcuda.so file in a different location?
Or maybe it's sym-linked in another location, and that's what we're picking up?
I might put a log-message about where exactly we're picking these files up from, and that might give us more clues about where these files are being found at runtime.
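
Until such a log message exists, the question can be answered by hand. The sketch below is only an illustration: `check_nvidia_libs` is a hypothetical helper, and the two directories passed to it are the Debian locations discussed in this thread, which may differ on other distributions.

```shell
# Hypothetical helper: for each directory given, report whether the two
# unversioned library names exist there and which real file they resolve to.
check_nvidia_libs() {
    for dir in "$@"; do
        for lib in libcuda.so libnvidia-ml.so; do
            if [ -e "$dir/$lib" ]; then
                # readlink -f follows the whole symlink chain to the real file
                echo "$dir/$lib -> $(readlink -f "$dir/$lib")"
            else
                echo "$dir/$lib MISSING"
            fi
        done
    done
}

# Directories are assumptions taken from the Debian layout in this thread.
check_nvidia_libs /usr/lib/x86_64-linux-gnu /usr/lib/x86_64-linux-gnu/nvidia/current
```

A `MISSING` line for `libnvidia-ml.so` in a directory that does have `libcuda.so` would point to the kind of mismatch asked about above.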
Edited by brians - April 16, 2024 19:41:07
User Avatar
Member
479 posts
Joined: Aug. 2014
Offline
/usr/lib/x86_64-linux-gnu/nvidia/current/ contains libcuda.so as well as the libnvidia-ml.so* set of files.

However, /usr/lib/x86_64-linux-gnu/ contains libcuda.so, but only libnvidia-ml.so.1. I believe Houdini is checking this particular path, because after I created libnvidia-ml.so as a symlink to /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1, OptiX started working again.

So yes, it's a problem with paths.
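
For anyone who wants to try the same workaround, here is a rough sketch. `link_nvml` is a hypothetical helper, the directory is the Debian one from this thread, and writing there needs root. Treat it as a stopgap and remove the link again before testing a fixed Houdini build.

```shell
# Hypothetical helper: create the unversioned libnvidia-ml.so name in a
# directory, pointing at the libnvidia-ml.so.1 that is already there.
link_nvml() {
    dir=$1
    if [ -e "$dir/libnvidia-ml.so" ]; then
        echo "already present: $dir/libnvidia-ml.so"
        return 0
    fi
    if [ ! -e "$dir/libnvidia-ml.so.1" ]; then
        echo "no libnvidia-ml.so.1 in $dir" >&2
        return 1
    fi
    ln -s "$dir/libnvidia-ml.so.1" "$dir/libnvidia-ml.so"
}

# Usage on the Debian layout from this thread (run as root):
#   link_nvml /usr/lib/x86_64-linux-gnu
# To undo:
#   rm /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
```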
Edited by ajz3d - April 17, 2024 07:38:13
User Avatar
Member
7859 posts
Joined: Sept. 2011
Online
ajz3d
/usr/lib/x86_64-linux-gnu/nvidia/current/ contains libcuda.so as well as the libnvidia-ml.so* set of files.

However, /usr/lib/x86_64-linux-gnu/ contains libcuda.so, but only libnvidia-ml.so.1. I believe Houdini is checking this particular path, because after I created libnvidia-ml.so as a symlink to /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1, OptiX started working again.

So yes, it's a problem with paths.

Does that make it a bug with the distro or the nvidia driver installer?
User Avatar
Member
479 posts
Joined: Aug. 2014
Offline
Hard to say. I can only speculate, but I would definitely exclude the distro from the blame list, because the only thing I did before OptiX stopped working in XPU was to upgrade Houdini from 20.0.605 to 20.0.653. No apt upgrades, no new nvidia-driver installations or anything like that. And rolling back to Houdini 20.0.605 makes OptiX work again in XPU. Besides, I'm not using nvidia-driver from Debian's repositories, but the one from the upstream repo, which the Debian team has no control over.

I'd say it's most likely Houdini or NVIDIA. Brians said that they're loading a new driver binary now. I assume he had this libnvidia-ml.so in mind. So maybe they're loading from the wrong path? It might also be that Houdini uses the correct path to dynamically link this library, but NVIDIA misconfigured their .deb packages and that's why the symlink to libnvidia-ml.so wasn't created in the /usr/lib/x86_64-linux-gnu path when the nvidia-driver package was installed. Who knows? :/
User Avatar
Member
39 posts
Joined: Nov. 2013
Online
XPU + Optix is working for me in 20.0.653 on Linux Mint with Nvidia driver 535.171.04.
User Avatar
Staff
484 posts
Joined: May 2019
Offline
ajz3d
Brians said that they're loading a new driver binary now. I assume he had this libnvidia-ml.so in mind. So maybe they're loading from the wrong path?

The thing that has changed is that we are loading the libnvidia-ml.so file. But it should live beside libcuda.so, meaning that if we can load one, then we should be able to load the other from the same path.

I'm not sure if it's the distro or NVIDIA at fault, but I think we'll just fix Houdini to go looking for the libnvidia-ml.so file in the location of the actual libcuda.so binary (not the symlink). Hopefully that should address the issue.
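
The idea can be illustrated from the shell. This is only a sketch of the lookup described above, not Houdini's actual code: `nvml_search_dir` is a hypothetical helper, and the example path is the Debian one from this thread.

```shell
# Hypothetical helper: given a path to libcuda.so (possibly a symlink),
# print the directory that holds the real file it resolves to.
nvml_search_dir() {
    real=$(readlink -f "$1") || return 1
    dirname "$real"
}

# Usage (path is an assumption based on the Debian layout in this thread):
#   dir=$(nvml_search_dir /usr/lib/x86_64-linux-gnu/libcuda.so)
#   ls -l "$dir"/libnvidia-ml.so*
```

If the symlinked `libcuda.so` resolves into a directory that also holds `libnvidia-ml.so`, searching there avoids depending on an unversioned link in the symlink's own directory.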
Edited by brians - April 18, 2024 05:58:31
User Avatar
Member
479 posts
Joined: Aug. 2014
Offline
Great. Let me know once you implement the changes, so I can test them. Of course, I'll remove the manually created symlink beforehand. :P
User Avatar
Staff
484 posts
Joined: May 2019
Offline
Hi guys

I've made this change to 20.0.685
When you get a chance, can you please test and let me know either way.

thanks!
User Avatar
Member
109 posts
Joined: Aug. 2015
Online
I just came in to check on the same issue: XPU isn't working and the GPUs aren't used at all, while Redshift still works fine on 20.0.653.

No luck with 20.0.685 either. Nobara Linux (Fedora), NVIDIA driver 550.67.
User Avatar
Member
479 posts
Joined: Aug. 2014
Offline
brians
I've made this change to 20.0.685
When you get a chance, can you please test and let me know either way.
Hi Brians,

I removed the manually created /usr/lib/x86_64-linux-gnu/libnvidia-ml.so symlink, restarted Debian (just in case), installed 20.0.685, and ran some XPU test renders from both the GUI and the offline renderer. The problem seems to be fixed, as there were no errors. OptiX kicked in and I had 99% load on the GPU.

I'm still on nvidia-driver 550.54.15-1.
User Avatar
Member
109 posts
Joined: Aug. 2015
Online
I never created anything manually. I just installed 20.0.685 and tried running it, but no luck; it did not work for me.
NVIDIA driver 550.67.
User Avatar
Member
479 posts
Joined: Aug. 2014
Offline
Mirko, have you tried running an offline render with verbosity of 5 or higher, like Brians suggested in one of his posts? It should provide you with more detailed information than what you normally get from the Log Viewer.

/opt/hfs20.0/bin/karma -V 5 test.usd