brians
Perhaps they're not being released when the GPU goes into an error state. In the meantime I'll have a look at the code to see if this might be the case.
I've checked this, and although we could be better at re-allocating resources after a device has failed (there are perhaps 1-2 threads we could be doing better with), this wouldn't explain the slowdown you're getting on your Threadripper.
It definitely seems to be something to do with your GPU memory maxing out, so let's try a few more things.
1)
Can you try stripping your scene down somewhat (so that it all fits in GPU memory) and then do a render? And report back what kinds of times you get with/without the GPU (i.e. by using that envvar).
2)
There is a new Windows/NVIDIA feature where OptiX/CUDA GPU memory can "spill" over to CPU RAM when the GPU gets full:
https://nvidia.custhelp.com/app/answers/detail/a_id/5490/~/system-memory-fallback-for-stable-diffusion
This is still a very new feature and we're having trouble getting information about it. But I can see your machine is doing this because the "Shared GPU memory" is at 18 GB. What happens if you set "CUDA - Sysmem Fallback Policy" to "Prefer No Sysmem Fallback" in the NVIDIA Control Panel? (This will disable the feature.)
3)
Are you able to watch the Task Manager stats during a render where the GPU fails, and perhaps take a screenshot? I want to see what happens at the point of failure. We should be releasing GPU (and also "Shared GPU memory") resources once it fails, but perhaps we are not.
4)
We output stats for the devices into the EXR header (as long as the new driver is used, not legacy). They can be viewed with the command

iinfo -v myfile.exr

It's very dense/unreadable, sorry, but could you perhaps post the result of that here for a GPU-failing render? It will let us see a little more about what is happening to each device.
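If you don't have iinfo (OpenImageIO's info tool) to hand, the header attributes can also be listed with a short stdlib-only Python sketch. This is just my own helper, not part of our tooling: it handles single-part EXR files only, and I'm assuming here that the device stats are written as string attributes.

```python
import struct

def exr_header_attributes(path):
    """Yield (name, type, raw bytes) for each attribute in a single-part
    EXR header.  Multi-part files (several headers back to back) are not
    handled by this sketch."""
    with open(path, "rb") as f:
        magic, _version = struct.unpack("<ii", f.read(8))
        if magic != 20000630:  # the OpenEXR magic number
            raise ValueError("not an EXR file")

        def read_cstring():
            # Attribute names and types are null-terminated strings.
            chars = bytearray()
            while (b := f.read(1)) not in (b"", b"\x00"):
                chars.extend(b)
            return chars.decode("latin-1")

        while True:
            name = read_cstring()
            if not name:        # an empty attribute name ends the header
                break
            attr_type = read_cstring()
            (size,) = struct.unpack("<i", f.read(4))
            yield name, attr_type, f.read(size)

# e.g. dump the string attributes (assuming the stats are stored as strings):
# for name, typ, raw in exr_header_attributes("myfile.exr"):
#     if typ == "string":
#         print(name, "=", raw.decode("latin-1"))
```

The parser just walks the name/type/size/value records that make up an EXR header, so it shows everything iinfo -v would, minus the pretty-printing.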
We really should be outputting more information regarding device utilization/failure into the regular stats too. I'll get that prioritized.
cheers