random fails on deadline

Member
279 posts
Joined: Aug. 2015
I'm trying to render a bunch of shots / ROPs from Solaris, all sent to Deadline. I randomly get this error: a frame fails, then it gets picked up again and renders. Machines fail at random; the same machine renders one frame and fails another, and so on.
Can anyone decipher what is going on here? Rendering from the Houdini GUI works fine, but on Deadline I get this mess...

2024-08-05 16:06:37: 0: STDOUT: start_thread <libc.so.6>
2024-08-05 16:06:37: 0: STDOUT: __clone3 <libc.so.6>
2024-08-05 16:06:37: 0: STDOUT: -- TRACEBACK END --
2024-08-05 16:06:37: 0: STDOUT: 122904: Fatal error: Segmentation fault (sent by pid 122904)
2024-08-05 16:06:37: 0: STDOUT: -- TRACEBACK BEGIN --
2024-08-05 16:06:37: 0: STDOUT: Traceback from karma 20.5.278 (Compiled on linux-x86_64-gcc11.2):
2024-08-05 16:06:37: 0: STDOUT: stackTrace(UTsignalHandlerArg) <libHoudiniUT.so>
2024-08-05 16:06:37: 0: STDOUT: signalCallback(UTsignalHandlerArg) <libHoudiniUT.so>
2024-08-05 16:06:37: 0: STDOUT: UT_Signal::UT_ComboSignalHandler::operator()(int, siginfo_t*, void*) const <libHoudiniUT.so>
2024-08-05 16:06:37: 0: STDOUT: UT_Signal::processSignal(int, siginfo_t*, void*) <libHoudiniUT.so>
2024-08-05 16:06:37: 0: STDOUT: __GI___sched_yield <libc.so.6>
2024-08-05 16:06:37: 0: STDOUT: __GI___sched_yield <libc.so.6>
2024-08-05 16:06:37: 0: STDOUT: tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task(long&, long) (custom_scheduler.h:313)
2024-08-05 16:06:37: 0: STDOUT: tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) (custom_scheduler.h:713)
2024-08-05 16:06:37: 0: STDOUT: pxrInternal_v0_24__pxrReserved__::WorkDispatcher::Wait() <libpxr_work.so>
2024-08-05 16:06:37: 0: STDOUT: pxrInternal_v0_24__pxrReserved__::WorkDispatcher::~WorkDispatcher() <libpxr_work.so>
2024-08-05 16:06:37: 0: STDOUT: pxrInternal_v0_24__pxrReserved__::HdRenderIndex::SyncAll(std::vector<std::shared_ptr<pxrInternal_v0_24__pxrReserved__::HdTask>, std::allocator<std::shared_ptr<pxrInternal_v0_24__pxrReserved__::HdTask> > >*, std::unordered_map<pxrInternal_v0_24__pxrReserved__::TfToken, pxrInternal_v0_24__pxrReserved__::VtValue, pxrInternal_v0_24__pxrReserved__::TfToken::HashFunctor, std::equal_to<pxrInternal_v0_24__pxrReserved__::TfToken>, std::allocator<std::pair<pxrInternal_v0_24__pxrReserved__::TfToken const, pxrInternal_v0_24__pxrReserved__::VtValue> > >*)::{lambda()#2}::operator()() const <libpxr_hd.so>
2024-08-05 16:06:37: 0: STDOUT: tbb::interface7::internal::isolate_within_arena(tbb::interface7::internal::delegate_base&, long) <libtbb.so.2>
2024-08-05 16:06:37: 0: STDOUT: pxrInternal_v0_24__pxrReserved__::HdRenderIndex::SyncAll(std::vector<std::shared_ptr<pxrInternal_v0_24__pxrReserved__::HdTask>, std::allocator<std::shared_ptr<pxrInternal_v0_24__pxrReserved__::HdTask> > >*, std::unordered_map<pxrInternal_v0_24__pxrReserved__::TfToken, pxrInternal_v0_24__pxrReserved__::VtValue, pxrInternal_v0_24__pxrReserved__::TfToken::HashFunctor, std::equal_to<pxrInternal_v0_24__pxrReserved__::TfToken>, std::allocator<std::pair<pxrInternal_v0_24__pxrReserved__::TfToken const, pxrInternal_v0_24__pxrReserved__::VtValue> > >*) <libpxr_hd.so>
2024-08-05 16:06:37: 0: STDOUT: pxrInternal_v0_24__pxrReserved__::HdEngine::Execute(pxrInternal_v0_24__pxrReserved__::HdRenderIndex*, std::vector<std::shared_ptr<pxrInternal_v0_24__pxrReserved__::HdTask>, std::allocator<std::shared_ptr<pxrInternal_v0_24__pxrReserved__::HdTask> > >*) <libpxr_hd.so>
2024-08-05 16:06:37: 0: STDOUT: pxrInternal_v0_24__pxrReserved__::XUSD_HuskEngine::doRender() <libHoudiniUSD.so>
2024-08-05 16:06:37: 0: STDOUT: pxrInternal_v0_24__pxrReserved__::XUSD_HuskEngine::Render(double) <libHoudiniUSD.so>
2024-08-05 16:06:37: 0: STDOUT: <husk>
2024-08-05 16:06:37: 0: STDOUT: <husk>
2024-08-05 16:06:37: 0: STDOUT: <husk>
2024-08-05 16:06:37: 0: STDOUT: __libc_start_call_main <libc.so.6>
2024-08-05 16:06:37: 0: STDOUT: __libc_start_main_alias_2 <libc.so.6>
2024-08-05 16:06:37: 0: STDOUT: <husk>
2024-08-05 16:06:37: 0: STDOUT: -- TRACEBACK END --
2024-08-05 16:06:37: 0: STDOUT: Error: Caught exception: The attempted operation failed.
2024-08-05 16:06:37: 0: STDOUT: Error: Command Exit Code: 139
2024-08-05 16:06:37: 0: STDOUT: Failed to complete render: exit code 139
2024-08-05 16:06:37: 0: STDOUT: Use a Log Viewer with External Render Processes enabled for more information.
2024-08-05 16:06:37: 0: STDOUT: Traceback (most recent call last):
2024-08-05 16:06:37: 0: STDOUT: File "/var/lib/Thinkbox/Deadline10/workers/cgoven110/plugins/66b0c2817455a30055aefce0/hrender_dl.py", line 882, in <module>
2024-08-05 16:06:37: 0: STDOUT: rop.render( frameTuple, resolution, ignore_inputs=ignoreInputs )
2024-08-05 16:06:37: 0: STDOUT: File "/opt/hfs20.5.278/houdini/python3.11libs/hou.py", line 80706, in render
2024-08-05 16:06:37: 0: STDOUT: return _hou.RopNode_render(self, *args, **kwargs)
2024-08-05 16:06:37: 0: STDOUT: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-08-05 16:06:37: 0: STDOUT: hou.OperationFailed: The attempted operation failed.
2024-08-05 16:06:37: 0: STDOUT: Error: Command Exit Code: 139
2024-08-05 16:06:37: 0: STDOUT: Failed to complete render: exit code 139
2024-08-05 16:06:37: 0: STDOUT: Use a Log Viewer with External Render Processes enabled for more information.
2024-08-05 16:06:37: 0: Done executing plugin command of type 'Render Task'
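
For reference, exit code 139 in the log above follows the usual Linux shell convention of 128 plus the signal number, so 139 corresponds to signal 11 (SIGSEGV), which matches the "Segmentation fault" line. A minimal sketch for decoding such codes; the helper name `describe_exit_code` is made up for illustration and is not part of Deadline or Houdini:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret an exit code using the shell convention 128 + signal number."""
    if code > 128:
        signum = code - 128
        try:
            # signal.Signals maps a signal number to its symbolic name.
            return f"killed by {signal.Signals(signum).name} (signal {signum})"
        except ValueError:
            return f"killed by unknown signal {signum}"
    return f"exited with status {code}"


if __name__ == "__main__":
    # 139 is the code reported in the Deadline log above.
    print(describe_exit_code(139))  # on Linux: killed by SIGSEGV (signal 11)
```

This only says the process crashed on a bad memory access; it does not say which component caused it.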
Member
279 posts
Joined: Aug. 2015
So it is weird: the crashing happens when I connect materials to the characters. If I don't have any materials and textures on them, it renders fine. Anyway, I'm going through them one by one to see what's happening, but the weird part is that it renders fine in 90% of cases and fails randomly just from time to time with a segmentation fault error.
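
Going through materials one by one can be shortened by halving the set each time. The following is only a sketch under two assumptions: exactly one material is at fault, and there is some way to test a subset. The `crashes` callback is hypothetical; in practice it would assign only that subset and render a few frames. Because the crash is intermittent, each subset is tried several times.

```python
from typing import Callable, List, Sequence


def find_culprit(
    items: Sequence[str],
    crashes: Callable[[List[str]], bool],
    trials: int = 1,
) -> str:
    """Binary-search for the single item whose presence makes `crashes` return True."""
    if not items:
        raise ValueError("need at least one candidate")
    candidates = list(items)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # An intermittent crash may not show on the first try, so repeat;
        # a single crash is enough to implicate this half.
        if any(crashes(half) for _ in range(trials)):
            candidates = half
        else:
            candidates = candidates[len(half):]
    return candidates[0]


if __name__ == "__main__":
    # Made-up material names and a fake, deterministic crash test.
    materials = ["mat_a", "mat_b", "bad_mat", "mat_c", "mat_d"]
    print(find_culprit(materials, lambda subset: "bad_mat" in subset))
```

With a real, flaky render, a half that happens to render cleanly can be wrongly cleared, so `trials` should be set high enough for the observed failure rate.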
Member
66 posts
Joined: March 2012
I normally get this error when a file (caches, USD, something) is missing.

There may be network issues (too many machines looking for the same textures or files). If so, you might lower the max machine limit for stability.

You may also want to check the Python version, e.g. choosing Python 3.10.
https://www.youtube.com/watch?v=UmQ9TflX4wU (this one is for 3.9, though)
https://forums.thinkboxsoftware.com/t/houdini-20-patch-files/31411/18 (another thread about the H20 patch, not 20.5 though)

Setting the chunk size to 1 (rendering 1 frame per machine) might solve the problem. :/
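
If the job is submitted manually rather than through the submitter UI, the chunk size and machine limit suggested above could be written into the job info file. A rough sketch, assuming the `ChunkSize` and `MachineLimit` keys from Deadline's manual job submission documentation; the job name and frame range are made up, and the key names should be checked against your Deadline version:

```python
def build_job_info(
    name: str,
    frames: str,
    chunk_size: int = 1,
    machine_limit: int = 0,
) -> str:
    """Return job info text with one frame per task by default.

    machine_limit=0 is assumed to mean "no limit"; a positive value caps
    how many workers pick up the job at once.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if machine_limit < 0:
        raise ValueError("machine_limit cannot be negative")
    lines = [
        "Plugin=Houdini",  # assumed plugin name, verify for your setup
        f"Name={name}",
        f"Frames={frames}",
        f"ChunkSize={chunk_size}",
        f"MachineLimit={machine_limit}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical job: cap at 3 machines to reduce load on shared textures.
    print(build_job_info("shot010_karma", "1400-1500", machine_limit=3))
```

A matching plugin info file would still be needed for an actual submission; it is left out here because its contents depend on the scene.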
Edited by icecreamumai - Aug. 7, 2024 05:31:55
Member
279 posts
Joined: Aug. 2015
But then how does it fail once and then render that same frame again without a problem?
I don't think it is a network problem. I've tried with a single machine as well and the same thing happens.
Also, it's only 6 machines on a 10 gig network, with 20 gig aggregation on the NAS... that part is solid.

Right now I'm using chunk size 1, so when a frame fails, it fails once and then gets rendered. At least it is moving, but that's not a solution. Something is messy here.
Although... I can test on the Windows boot on my main machines. All are Linux at the moment: Nobara 40 (practically Fedora 40).
Member
279 posts
Joined: Aug. 2015
Just tested on Windows and got the same random failure on Deadline, just with a slightly different set of .dll files instead of the Linux .so ones.

2024-08-08 11:08:04: 0: STDOUT: Rendering frame 1433
2024-08-08 11:08:22: 0: STDOUT: KarmaXPU: ShaderGraph _sg_s_B3323BE34FCEF510 has rootnode-type not handled by XPU, skipped
2024-08-08 11:08:23: 0: STDOUT: Invalid InsertValueInst operands!
2024-08-08 11:08:23: 0: STDOUT: %0 = insertvalue %mx_surfaceshader undef, %mx_bsdf %mxF1, 0
2024-08-08 11:08:23: 0: STDOUT: Invalid ExtractValueInst operands!
2024-08-08 11:08:23: 0: STDOUT: %mxF = extractvalue %mx_surfaceshader %mx_surfaceshader, 0
2024-08-08 11:08:23: 0: STDOUT: Invalid InsertValueInst operands!
2024-08-08 11:08:23: 0: STDOUT: %0 = insertvalue %mx_surfaceshader undef, %mx_bsdf %mxF1, 0
2024-08-08 11:08:23: 0: STDOUT: Invalid ExtractValueInst operands!
2024-08-08 11:08:23: 0: STDOUT: %mxF = extractvalue %mx_surfaceshader %mx_surfaceshader, 0
2024-08-08 11:08:23: 0: STDOUT: 29608: Fatal error: Segmentation fault
.....
.....
.....
2024-08-08 11:08:24: 0: STDOUT: +0x7ffc5912e0b4 C:\Windows\SYSTEM32\ntdll.dll
2024-08-08 11:08:24: 0: STDOUT: +0x7ffc56df4030 C:\Windows\System32\KERNELBASE.dll
2024-08-08 11:08:24: 0: STDOUT: +0x7ffc56df3f2e C:\Windows\System32\KERNELBASE.dll
2024-08-08 11:08:24: 0: STDOUT: +0x7ffbc9a03d34 C:\Windows\system32\DriverStore\FileRepository\nv_dispi.inf_amd64_34f9511bafd21ff9\nvcuda64.dll
2024-08-08 11:08:24: 0: STDOUT: +0x7ffbc9ad3fa2 C:\Windows\system32\DriverStore\FileRepository\nv_dispi.inf_amd64_34f9511bafd21ff9\nvcuda64.dll
2024-08-08 11:08:24: 0: STDOUT: +0x7ffbc9a03757 C:\Windows\system32\DriverStore\FileRepository\nv_dispi.inf_amd64_34f9511bafd21ff9\nvcuda64.dll
2024-08-08 11:08:24: 0: STDOUT: +0x7ffbc9ddb1a8 C:\Windows\system32\DriverStore\FileRepository\nv_dispi.inf_amd64_34f9511bafd21ff9\nvcuda64.dll
2024-08-08 11:08:24: 0: STDOUT: +0x7ffc57517374 C:\Windows\System32\KERNEL32.DLL
2024-08-08 11:08:24: 0: STDOUT: +0x7ffc590dcc91 C:\Windows\SYSTEM32\ntdll.dll
2024-08-08 11:08:24: 0: STDOUT: -- TRACEBACK END --
2024-08-08 11:08:24: 0: STDOUT: Error: Caught exception: The attempted operation failed.
2024-08-08 11:08:24: 0: STDOUT: Error: Command Exit Code: 139
2024-08-08 11:08:24: 0: STDOUT: Failed to complete render: exit code 139
2024-08-08 11:08:24: 0: STDOUT: Use a Log Viewer with External Render Processes enabled for more information.
2024-08-08 11:08:24: 0: STDOUT: Traceback (most recent call last):
2024-08-08 11:08:24: 0: STDOUT: File "C:\ProgramData\Thinkbox\Deadline10\workers\cgoven110\plugins\66b46fbebe55c73242aee6e6\hrender_dl.py", line 882, in <module>
2024-08-08 11:08:24: 0: STDOUT: rop.render( frameTuple, resolution, ignore_inputs=ignoreInputs )
2024-08-08 11:08:24: 0: STDOUT: File "C:\PROGRA~1/SIDEEF~1/HOUDIN~1.278/houdini/python3.11libs\hou.py", line 80706, in render
2024-08-08 11:08:24: 0: STDOUT: return _hou.RopNode_render(self, *args, **kwargs)
2024-08-08 11:08:24: 0: STDOUT: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-08-08 11:08:24: 0: STDOUT: hou.OperationFailed: The attempted operation failed.
2024-08-08 11:08:24: 0: STDOUT: Error: Command Exit Code: 139
2024-08-08 11:08:24: 0: STDOUT: Failed to complete render: exit code 139
2024-08-08 11:08:24: 0: STDOUT: Use a Log Viewer with External Render Processes enabled for more information.


But most likely it looks like some issue with Deadline itself. Weird for it to be so random.
Member
62 posts
Joined: July 2007
I'm getting this too. It's odd since it works on certain machines (ironically the ones with older GPUs) but it kicks up seg faults on my machines with 4090s in them. Same job, rendered via Deadline. The same job renders successfully when rendered locally on the same machine. Weird.
Edited by Lenscowboy - Aug. 8, 2024 08:36:20
Member
279 posts
Joined: Aug. 2015
What I've figured out so far is that it fails when rendering from the Houdini GUI as well, on both Linux and Windows.
Right now it also happens when I connect materials to the characters brought in via USD from Maya.
Without materials it renders fine...
I just tried the latest daily build as well, 20.5.320, same thing:
2024-08-08 14:57:57: 0: STDOUT: Invalid InsertValueInst operands!
2024-08-08 14:57:57: 0: STDOUT: %0 = insertvalue %mx_surfaceshader undef, %mx_bsdf %mxF1, 0
2024-08-08 14:57:57: 0: STDOUT: Invalid ExtractValueInst operands!
2024-08-08 14:57:57: 0: STDOUT: %mxF = extractvalue %mx_surfaceshader %mx_surfaceshader, 0
2024-08-08 14:57:57: 0: STDOUT: Invalid InsertValueInst operands!
2024-08-08 14:57:57: 0: STDOUT: %0 = insertvalue %mx_surfaceshader undef, %mx_bsdf %mxF1, 0
2024-08-08 14:57:57: 0: STDOUT: Invalid ExtractValueInst operands!
2024-08-08 14:57:57: 0: STDOUT: %mxF = extractvalue %mx_surfaceshader %mx_surfaceshader, 0
2024-08-08 14:57:57: 0: STDOUT: 24346 ThreadId=0x7f3710e00680: Fatal error: Segmentation fault

Looks like there is something going on with materials...
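
To check whether the same messages precede every crash, the failed task logs can be scanned for the lines just before the segfault. A small sketch, assuming the timestamp/STDOUT prefix format seen in the logs pasted in this thread; `lines_before_crash` is a made-up helper name:

```python
import re
from collections import deque
from typing import List

# Strips a prefix like "2024-08-08 11:08:23: 0: STDOUT: " from a log line.
PREFIX = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}:\s+\d+:\s+(?:STDOUT:\s*)?"
)
CRASH_MARKER = "Fatal error: Segmentation fault"


def lines_before_crash(log_text: str, context: int = 4) -> List[str]:
    """Return up to `context` messages logged just before the first segfault."""
    recent: deque = deque(maxlen=context)
    for raw in log_text.splitlines():
        message = PREFIX.sub("", raw).rstrip()
        if CRASH_MARKER in message:
            return list(recent)
        if message:
            recent.append(message)
    return []  # no segfault found in this log


if __name__ == "__main__":
    sample = (
        "2024-08-08 11:08:22: 0: STDOUT: KarmaXPU: ShaderGraph "
        "_sg_s_B3323BE34FCEF510 has rootnode-type not handled by XPU, skipped\n"
        "2024-08-08 11:08:23: 0: STDOUT: Invalid InsertValueInst operands!\n"
        "2024-08-08 11:08:23: 0: STDOUT: %0 = insertvalue %mx_surfaceshader "
        "undef, %mx_bsdf %mxF1, 0\n"
        "2024-08-08 11:08:23: 0: STDOUT: 29608: Fatal error: Segmentation fault\n"
    )
    for line in lines_before_crash(sample, context=2):
        print(line)
```

If the "Invalid InsertValueInst operands!" lines show up before the crash in every failed task and never in successful ones, that points at shader compilation rather than at the farm.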
Member
62 posts
Joined: July 2007
So begins the debug!
I can successfully render a clean scene with a simple cube and plane and no materials.
I will eliminate materials from my complex scene and let you know the results.
Member
279 posts
Joined: Aug. 2015
OK, I think I've found it.
This material connected to the surface was the problem. Once I removed mtlxsubsurface_bsdf1, it is all fine again.
At least that's how it seems for now. I will test more, but so far it seems to be OK...

Attachments:
Screenshot_20240808_155204.png (156.6 KB)

Member
62 posts
Joined: July 2007
Mine turned out to be an ocean surface and interior. When they are deleted, the file renders in Karma XPU. I'll try to dig deeper when I get a gap.
Member
279 posts
Joined: Aug. 2015
It was definitely this BSDF material connected to the surface.
The weird part is that it still rendered fine in 90% of frames, failing randomly in some. Solved this one for the moment.
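
Since the crash only hit roughly 1 frame in 10, a handful of clean frames after the fix does not prove much. As a rough estimate, assuming each frame fails independently with the same probability (a simplification, and the 10% figure is only the estimate given in this thread), the number of consecutive clean frames needed before trusting the fix can be computed like this:

```python
import math


def clean_runs_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest n such that (1 - failure_rate) ** n <= 1 - confidence.

    That is: if the bug were still present, seeing n clean frames in a row
    would happen with probability at most 1 - confidence.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    # 10% failure rate: 0.9 ** 29 is about 0.047, just under 0.05.
    print(clean_runs_needed(0.10))  # 29
```

So with a 10% failure rate, about 29 clean frames in a row are needed for 95% confidence that the crash is gone, not just hiding.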