Karma XPU - Machines with GPU present take longer to render

Member
27 posts
Joined: March 2023
Good day,

I've been doing some render testing with the Karma XPU.
Some frames have rendered successfully. Here are the tests:
The same XPU job was submitted to both machines.

#1 Machine
CPU - 1 x AMD Ryzen Threadripper 3970X 32-Core Processor (32 cores, 64 logical) with 130945MB
GPU - No GPU.
Average time per frame - 4mins 20secs

#2 Machine
CPU - 1 x AMD Ryzen Threadripper 3970X 32-Core Processor (32 cores, 64 logical) with 130945MB
GPU - NVIDIA GeForce RTX 4090 @ 2520MHz (compute 8.9) with 24563MB (23279MB available) (NVLink:0)
Average time per frame - 7mins



It seems crazy that adding a 4090 GPU to help render the job yields a slower render time?
Is this expected behavior? Is there anything I can look at adjusting?

The Karma/Husk logs don't seem to report any useful information regarding CPU/GPU usage or where time is spent.
Our render manager log at the end does seem to show that the GPU was being used:
R257| Total processing time: 00:07.00 h:m.s 
R258| Max memory usage: 101.6GB (of 130GB installed, total system usage at same time: 109.4GB. Apr 04. 18:18.35)
R259| System free memory left: 19.0GB (Apr 04. 18:18.35)
R260| Max CPU usage: 96% 61.7 of 64 cores (total system usage at same time: 0.0. Apr 04. 18:19.04)
R261| Max GPU usage:100% 1.0 GPUs (of 1 GPUs assigned to this thread. Apr 04. 18:15.29. Max system GPU usage at same time: 1.0/1. Apr 04. 18:15.29)


all the best,
amwilkins
Edited by am_wilkins - April 5, 2024 10:26:17
Member
27 posts
Joined: March 2023
Hi again,

Apologies to those who saw this already and were likely confused. I got the average times mixed up.
I've edited the original post now.

The key takeaway is that I'm getting roughly a 60% slower render time (7mins vs 4mins 20secs) on the machine with a 4090 in it.


Thanks!
amwilkins
Edited by am_wilkins - April 5, 2024 10:28:10
Member
7785 posts
Joined: Sept. 2011
Weird, it's about 10x faster with a GPU in my experience. Are you sure it's using the GPU at all? Your log shows over 100GB of memory usage, but the 4090 only has 24GB, so it must be failing a few seconds in anyway.

Also, is this the time for the first run, or an average of runs? On the first run there may be shader compile times, which can take several minutes.
Member
27 posts
Joined: March 2023
Hi jsmack,

Yeah, the only thing I can see in the logs relating to GPU or memory utilization is that last log from our render manager. It does show 100GB, but that's mostly what gets loaded into system RAM.
Are there any better logs to get from a Karma Husk render?

The times are from an average of runs.

I definitely wasn't expecting slower render times with a 4090 involved. I logged into the machine to check what was happening while the job was running. It does try to load some GPU memory, but over a few frames it never seems to fully use the CUDA cores or the GPU to render.

Here is one frame that was using the CUDA cores for a bit, but memory usage is very high. So maybe it wasn't getting everything it needs into memory to render at max efficiency.



Looks like we are also getting an error on some of the frames.
R 76| StdErr: [09:51:43] KarmaXPU: device Type:Optix ID:0 has registered a critical error [cudaErrorMemoryAllocation], so will now stop functioning.  Future error messages will be suppressed

This now might be in relation to my other post:
https://www.sidefx.com/forum/topic/95306/?page=1#post-418837 [www.sidefx.com]

I had first suspected they were unrelated since I wasn't seeing that "cudaErrorMemoryAllocation".
Let me try to solve the memory usage first, and then I can post updated averages here.


thanks,
amwilkins
Edited by am_wilkins - April 8, 2024 04:48:01

Attachments:
xpu_render_stats_01.png (264.8 KB)

Staff
470 posts
Joined: May 2019
I think the first thing to try is to render with only the EmbreeCPU device active on that machine, to make sure we get the same performance as the machine without the GPU.
So with this envvar here
KARMA_XPU_DISABLE_OPTIX_DEVICE=1
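For example, something like this from a command prompt before launching husk (just a sketch; myscene.usd and the frame options are placeholders, and on Linux you'd use export instead of set):

set KARMA_XPU_DISABLE_OPTIX_DEVICE=1
husk -f 101 -n 1 myscene.usd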

Please report back if you get the same performance, thanks.

We reserve some threads for GPU shader compilation, and release them to the EmbreeCPU device once compilation is done. Perhaps they're not being released when the GPU goes into an error state. In the meantime I'll have a look at the code to see if this might be the case.
Member
27 posts
Joined: March 2023
Hi brians,

Thanks for the recommendations.

brians
I think the first thing to try is to render with only the EmbreeCPU device active on that machine
Rendering with this envvar enabled on the same machine that has the 4090 yielded a much faster render time using purely the CPU.
No envvar = an average of 6mins 00secs
With envvar = an average of 3mins 40secs

After resetting the job a few times with and without the envvar, I also noted some inconsistent behavior.
No envvar.
frame 101 = 2min 50secs
frame 102 = 4min 55secs
frame 103 = 3min 10secs
frame 104 = 2min 45secs
frame 105 = 12min 30secs

We actually see the first and fourth frames using the GPU more than the CPU, which gives the kind of render time we're expecting to see. However, subsequent frames are much longer again, most frequently averaging as mentioned above.
So possibly slowing down when GPU VRAM is overloaded? Or something else?

A separate set of frames, for example (Karma XPU, same settings, same scene, no envvar; this one had a "catmullClark" scheme on the geometries, so it's more likely to use extra memory):
frame 396 = 7min 25secs
frame 397 = 6min 25secs
frame 398 = 6min 25secs
frame 399 = 7min 30secs
frame 400 = 7min 30secs


Over on the other thread regarding the excessive memory usage:
https://www.sidefx.com/forum/topic/95306/?page=1#post-419135 [www.sidefx.com]

I removed all the subdivisions for the scene, which reduced memory usage (still not as low as it should be), but the 24GB VRAM is still maxing out on the GPU.
No "cudaErrorMemoryAllocation" reported in the log.

I also increased the verbosity, but I'm not seeing much that's useful in the logs:
Karma XPU
No envvar.
R622| [14:45:39] Unified Cache: 363.29 MiB of 31.97 GiB used 
R623| [14:45:39] In Cache Faults Generated
R624| [14:45:39] 363.29 MiB 48,825 363.29 MiB 7.62 KiB
R625| [14:45:39] TBF Cache: 310 file opens (133.66 MiB read)
R626| [14:45:39] RAT Disk Cache: 60 hits
R627| [14:45:39] accept_unmipped : 1
R628| [14:45:39] accept_untiled : 1
R629| [14:45:39] automip : 0
R630| [14:45:39] autoscanline : 1
R631| [14:45:39] autotile : 512
R632| [14:45:39] deduplicate : 1
R633| [14:45:39] failure_retries : 0
R634| [14:45:39] forcefloat : 0
R635| [14:45:39] max_errors_per_file : 100
R636| [14:45:39] max_memory_MB : 63.94 GiB
R637| [14:45:39] max_mip_res : 1,073,741,824
R638| [14:45:39] max_open_files : 1024
R639| [14:45:39] searchpath : ''
R640| [14:45:39] trust_file_extensions : 0
R641| [14:45:39] unassociatedalpha : 0

R658| OpenImageIO ImageCache statistics (shared) ver 2.3.14
R659| Options: max_memory_MB=65473.0 max_open_files=1024 autotile=512
R660| autoscanline=1 automip=0 forcefloat=0 accept_untiled=1
R661| accept_unmipped=1 deduplicate=1 unassociatedalpha=0
R662| failure_retries=0
R663| Images : 63 unique
R664| ImageInputs : 62 created, 2 current, 12 peak
R665| Total pixel data size of all images referenced : 2.6 GB
R666| Total actual file size of all images referenced : 979.9 MB
R667| Pixel data read : 239.7 MB
R668| File I/O time : 14.1s (0.2s average per thread, for 81 threads)
R669| File open time only : 0.2s
R670| Tiles: 475 created, 472 current, 472 peak
R671| total tile requests : 157261461
R672| micro-cache misses : 92888 (0.059066%)
R673| main cache misses : 475 (0.000302045%)
R674| redundant reads: 0 tiles, 0 B
R675| Peak cache memory : 239.7 MB

R761| [14:45:39] Object Counts:
R762| [14:45:39] Cameras: 1
R763| [14:45:39] Coordinate Spaces: 0
R764| [14:45:39] Curve Meshes: 0
R765| [14:45:39] Light Tree: 0
R766| [14:45:39] Lights: 2
R767| [14:45:39] Point Meshes: 0
R768| [14:45:39] Polygon Meshes: 285,891 total 1,627 unique
R769| [14:45:39] Volumes: 0
R770| [14:45:39] Geometry Counts:
R771| [14:45:39] Curves: 0
R772| [14:45:39] Points: 0
R773| [14:45:39] Polygons: 60,271,321 total 14,230,441 unique
R774| [14:45:39] Polygons (Diced): 2,070,273
R775| [14:45:39] Light Types:
R776| [14:45:39] Cylinder: 0
R777| [14:45:39] Disk: 0
R778| [14:45:39] Distant: 1
R779| [14:45:39] Dome: 1
R780| [14:45:39] Geometry: 0
R781| [14:45:39] Line: 0
R782| [14:45:39] Point: 0
R783| [14:45:39] Rectangle: 0
R784| [14:45:39] Sphere: 0
R785| [14:45:39] Shader Nodes:
R786| [14:45:39] CPU Shaders: 107 total 15 unique
R787| [14:45:39] Function Errors: 0
R788| [14:45:39] Functions Loaded: 93
R789| [14:45:39] Largest Shader: 64
R790| [14:45:39] Shader Nodes: 5,502
R791| [14:45:39] Shaders: 396
R792| [14:45:39] USD Preview Shaders: 2
R793| [14:45:39] Ray Counts:
R794| [14:45:39] Camera Rays: 265,420,800
R795| [14:45:39] Indirect: 283,525,303
R796| [14:45:39] Light Geometry: 0
R797| [14:45:39] Occlusion: 1,009,531,091
R798| [14:45:39] Probe: 5,733,485
R799| [14:45:39] Total: 1,564,210,679
R800| [14:45:39] Shader Calls:
R801| [14:45:39] Displacement: 188,534,307
R802| [14:45:39] Emission: 0
R803| [14:45:39] Light: 0
R804| [14:45:39] Opacity: 0
R805| [14:45:39] Surface: 0
R806| [14:45:39] Volume: 0
R807| [14:45:39] Primvar Cache: 35,386 hits, 12,414 misses
R808| [14:45:39] Primvar Memory Usage Actual Uncompressed
R809| [14:45:39] real32[3] <dicedmesh> 9.42 GiB 9.43 GiB
R810| [14:45:39] int32 <topology> 1.99 GiB 2.25 GiB
R811| [14:45:39] real32[3] N 63.30 MiB 598.36 MiB
R812| [14:45:39] real32[3] Pref 18.25 MiB 166.57 MiB
R813| [14:45:39] real32[3] P 18.25 MiB 166.44 MiB
R814| [14:45:39] int32 QuadVerts 8.23 MiB 115.07 MiB
R815| [14:45:39] int32 TriVerts 823.73 KiB 11.25 MiB
R816| [14:45:39] real32[2] st 594.35 KiB 3.34 MiB

R858| [14:45:39] Bucket Time Breakdown:
R859| [14:45:39] Category Time Percentage
R860| [14:45:39] Dicing 0:00:14.62 100.00
R861| [14:45:39] Filtering 0:00:00 0.00
R862| [14:45:39] Indirect rays 0:00:00 0.00
R863| [14:45:39] Lighting 0:00:00 0.00
R864| [14:45:39] Primary rays 0:00:00 0.00
R865| [14:45:39] SSS samples 0:00:00 0.00
R866| [14:45:39] Shading 0:00:00 0.00
R867| [14:45:39] Shadows 0:00:00 0.00
R868| [14:45:39] Unaccounted 0:00:00 0.00
R869| [14:45:39] Total Wall Clock Time: 0:09:29.68
R870| [14:45:39] Total CPU Time: 0:00:07.55
R871| [14:45:39] System CPU Time Only: 0:00:05.58
R872| [14:45:39] Current Memory Usage: 87.45 GiB
R873| [14:45:39] Peak Memory Usage: 87.45 GiB
R874| [14:45:40] Image save time: 0:00:00.47

Karma XPU
With envvar. (KARMA_XPU_DISABLE_OPTIX_DEVICE=1)
R562| [15:00:14] Unified Cache: 355.30 MiB of 31.97 GiB used 
R563| [15:00:14] In Cache Faults Generated
R564| [15:00:14] 355.30 MiB 47,752 355.30 MiB 7.62 KiB
R565| [15:00:14] TBF Cache: 441 file opens (127.25 MiB read)
R566| [15:00:14] RAT Disk Cache: 56 hits
R567| [15:00:14] accept_unmipped : 1
R568| [15:00:14] accept_untiled : 1
R569| [15:00:14] automip : 0
R570| [15:00:14] autoscanline : 1
R571| [15:00:14] autotile : 512
R572| [15:00:14] deduplicate : 1
R573| [15:00:14] failure_retries : 0
R574| [15:00:14] forcefloat : 0
R575| [15:00:14] max_errors_per_file : 100
R576| [15:00:14] max_memory_MB : 63.94 GiB
R577| [15:00:14] max_mip_res : 1,073,741,824
R578| [15:00:14] max_open_files : 4096
R579| [15:00:14] searchpath : ''
R580| [15:00:14] trust_file_extensions : 0
R581| [15:00:14] unassociatedalpha : 0
R582| [15:00:14] OpenImageIO Stats String:
R583| [15:00:14] OpenImageIO Texture statistics
R584| Options: gray_to_rgb=0 flip_t=0 max_tile_channels=6
R585| Queries/batches :
R586| texture : 156777427 queries in 156777427 batches
R587| texture 3d : 0 queries in 0 batches
R588| shadow : 0 queries in 0 batches
R589| environment : 0 queries in 0 batches
R590| gettextureinfo : 0 queries
R591| Interpolations :
R592| closest : 0
R593| bilinear : 156777427
R594| bicubic : 0
R595| Average anisotropic probes : 1
R596| Max anisotropy in the wild : 1
R597|
R598| OpenImageIO ImageCache statistics (shared) ver 2.3.14
R599| Options: max_memory_MB=65473.0 max_open_files=4096 autotile=512
R600| autoscanline=1 automip=0 forcefloat=0 accept_untiled=1
R601| accept_unmipped=1 deduplicate=1 unassociatedalpha=0
R602| failure_retries=0
R603| Images : 63 unique
R604| ImageInputs : 62 created, 6 current, 11 peak
R605| Total pixel data size of all images referenced : 2.6 GB
R606| Total actual file size of all images referenced : 979.9 MB
R607| Pixel data read : 239.7 MB
R608| File I/O time : 19.8s (0.2s average per thread, for 99 threads)
R609| File open time only : 0.2s
R610| Tiles: 476 created, 472 current, 472 peak
R611| total tile requests : 157261461
R612| micro-cache misses : 87199 (0.0554484%)
R613| main cache misses : 476 (0.000302681%)
R614| redundant reads: 0 tiles, 0 B
R615| Peak cache memory : 239.7 MB

R701| [15:00:14] Object Counts:
R702| [15:00:14] Cameras: 1
R703| [15:00:14] Coordinate Spaces: 0
R704| [15:00:14] Curve Meshes: 0
R705| [15:00:14] Light Tree: 0
R706| [15:00:14] Lights: 2
R707| [15:00:14] Point Meshes: 0
R708| [15:00:14] Polygon Meshes: 285,891 total 1,627 unique
R709| [15:00:14] Volumes: 0
R710| [15:00:14] Geometry Counts:
R711| [15:00:14] Curves: 0
R712| [15:00:14] Points: 0
R713| [15:00:14] Polygons: 60,271,321 total 14,230,441 unique
R714| [15:00:14] Polygons (Diced): 1,657,614
R715| [15:00:14] Light Types:
R716| [15:00:14] Cylinder: 0
R717| [15:00:14] Disk: 0
R718| [15:00:14] Distant: 1
R719| [15:00:14] Dome: 1
R720| [15:00:14] Geometry: 0
R721| [15:00:14] Line: 0
R722| [15:00:14] Point: 0
R723| [15:00:14] Rectangle: 0
R724| [15:00:14] Sphere: 0
R725| [15:00:14] Shader Nodes:
R726| [15:00:14] CPU Shaders: 107 total 15 unique
R727| [15:00:14] Function Errors: 0
R728| [15:00:14] Functions Loaded: 93
R729| [15:00:14] Largest Shader: 64
R730| [15:00:14] Shader Nodes: 5,502
R731| [15:00:14] Shaders: 396
R732| [15:00:14] USD Preview Shaders: 2
R733| [15:00:14] Ray Counts:
R734| [15:00:14] Camera Rays: 265,420,800
R735| [15:00:14] Indirect: 283,525,303
R736| [15:00:14] Light Geometry: 0
R737| [15:00:14] Occlusion: 1,009,531,114
R738| [15:00:14] Probe: 5,733,485
R739| [15:00:14] Total: 1,564,210,702
R740| [15:00:14] Shader Calls:
R741| [15:00:14] Displacement: 188,534,307
R742| [15:00:14] Emission: 0
R743| [15:00:14] Light: 0
R744| [15:00:14] Opacity: 0
R745| [15:00:14] Surface: 0
R746| [15:00:14] Volume: 0
R747| [15:00:14] Primvar Cache: 35,386 hits, 12,414 misses
R748| [15:00:14] Primvar Memory Usage Actual Uncompressed
R749| [15:00:14] real32[3] <dicedmesh> 9.42 GiB 9.43 GiB
R750| [15:00:14] int32 <topology> 1.99 GiB 2.25 GiB
R751| [15:00:14] real32[3] N 63.30 MiB 598.36 MiB
R752| [15:00:14] real32[3] Pref 18.25 MiB 166.57 MiB
R753| [15:00:14] real32[3] P 18.25 MiB 166.44 MiB
R754| [15:00:14] int32 QuadVerts 8.23 MiB 115.07 MiB
R755| [15:00:14] int32 TriVerts 823.73 KiB 11.25 MiB
R756| [15:00:14] real32[2] st 594.35 KiB 3.34 MiB

R798| [15:00:14] Bucket Time Breakdown:
R799| [15:00:14] Category Time Percentage
R800| [15:00:14] Dicing 0:00:14.58 100.00
R801| [15:00:14] Filtering 0:00:00 0.00
R802| [15:00:14] Indirect rays 0:00:00 0.00
R803| [15:00:14] Lighting 0:00:00 0.00
R804| [15:00:14] Primary rays 0:00:00 0.00
R805| [15:00:14] SSS samples 0:00:00 0.00
R806| [15:00:14] Shading 0:00:00 0.00
R807| [15:00:14] Shadows 0:00:00 0.00
R808| [15:00:14] Unaccounted 0:00:00 0.00
R809| [15:00:14] Total Wall Clock Time: 0:03:47.37
R810| [15:00:14] Total CPU Time: 0:00:08.39
R811| [15:00:14] System CPU Time Only: 0:00:02.08
R812| [15:00:14] Current Memory Usage: 61.54 GiB
R813| [15:00:14] Peak Memory Usage: 61.54 GiB
R814| [15:00:15] Image save time: 0:00:00.47


thanks,
amwilkins
Staff
470 posts
Joined: May 2019
brians
Perhaps they're not being released when the GPU goes into an error state. In the meantime I'll have a look at the code to see if this might be the case.

I've checked this, and although we could be better at re-allocating resources after a device has failed (perhaps 1-2 threads' worth), this wouldn't explain the slowdown you're getting on your Threadripper.

It definitely seems to be something to do with your GPU memory maxing out. So let's try a few more things.

1)
Can you try stripping your scene down somewhat (so that it all fits in GPU memory), and then do a render? And report back what kinds of times you get with/without the GPU (i.e. by using that envvar).

2)
There is a new Windows/NVidia feature where OptiX/CUDA GPU memory can "spill" over to CPU RAM when the GPU gets full.
https://nvidia.custhelp.com/app/answers/detail/a_id/5490/~/system-memory-fallback-for-stable-diffusion [nvidia.custhelp.com]

This is still a very new feature and we're having trouble getting information about it. But I can see your machine is doing this because the "Shared GPU memory" is at 18GB. What happens if you set "CUDA system fallback policy" to "Prefer No Sysmem fallback" in the NVidia control panel (which will disable this feature)?

3)
Are you able to watch the Task Manager stats during a render where the GPU fails, and perhaps take a screenshot? I want to see what happens at the point of failure. We should be releasing GPU (and also "Shared GPU memory") resources once it fails, but perhaps we are not.

4)
We output stats for the devices into the EXR header (as long as the new driver is used, not legacy). They can be viewed using the command iinfo -v myfile.exr. It's very dense/unreadable, sorry, but could you perhaps post the result of that here for a GPU-failing render? It lets us see a little more about what is happening on each device.
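For example, to capture the header to a text file you can attach here (myfile.exr being a placeholder for one of the GPU-failing frames):

iinfo -v myfile.exr > myfile_exr_header.txt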

We really should be outputting more information regarding device utilization/failure into the regular stats too. I'll get that prioritized.

cheers
Edited by brians - April 8, 2024 23:32:25
Member
27 posts
Joined: March 2023
Hi,

1) Okay, cool... I stripped out all the ground, background, and grass scatters (just leaving a few hand-placed trees and bushes).
No envvar.
However, memory usage was still high:
R883| [11:19:53] Total Wall Clock Time: 0:06:24.98 
R884| [11:19:53] Total CPU Time: 0:00:05.87
R885| [11:19:53] System CPU Time Only: 0:00:06.46
R886| [11:19:53] Current Memory Usage: 107.69 GiB
R887| [11:19:53] Peak Memory Usage: 107.69 GiB

So I went further, removing all hand-placed assets and leaving only a small ground plane and a character.
No envvar.
R755| [11:30:47]        Total CPU Time: 0:00:04.50 
R756| [11:30:47] System CPU Time Only: 0:00:02.87
R757| [11:30:47] Current Memory Usage: 16.09 GiB
R758| [11:30:47] Peak Memory Usage: 16.09 GiB
(Around 6.6GB VRAM checking Task Manager while the render was in progress; 100% CPU / around 40% GPU usage.)

Render time is now very fast.
Still more RAM than I would have expected given what's in my scene.
Subdivisions are still disabled. A very small piece of ground and one low-poly character with simple shaders (maps connected to the shader node), and that's it.

I brought everything back except the hand-placed assets.
Still fast render times. So it's likely just the 10 bushes, 8 small rocks, and 14 trees that are causing the super high memory use (with subdivisions disabled), i.e. the assets hand-placed using a Stage Manager node.
R944| [11:45:13] Total Wall Clock Time: 0:00:50.67 
R945| [11:45:13] Total CPU Time: 0:00:03.31
R946| [11:45:13] System CPU Time Only: 0:00:05.37
R947| [11:45:13] Current Memory Usage: 17.95 GiB
R948| [11:45:13] Peak Memory Usage: 17.95 GiB


2) Interesting, good to know. With that setting enabled, the GPU fails to load much into VRAM (or keep it loaded) and appears to just render purely on the CPU.



3) Attached 4 images, trying to capture some of what the GPU is doing during and after the frame is completed.
As for the point of failure, I only really noted that about 25% into rendering the frame the CUDA cores stopped reporting utilization; however, the GPU would still sometimes list itself at 100%.
(ps. this was a mostly stripped-down scene, but the memory usage is still high)

4) Attached please find the output stats from the EXR.
This was from a job in which we received the "cudaErrorMemoryAllocation" error.


So at the moment, it really does seem like certain hand-placed geometry is using excessive amounts of memory (just in Karma), which maxes out the GPU memory and causes the render to take longer overall than on the CPU alone.

brians
We really should be outputting more information regarding device utilization/failure into the regular stats too. I'll get that prioritized.
Thanks, that would be awesome!



all the best,
amwilkins

Attachments:
frame_0398_log.txt (26.0 KB)
xpu_render_stats_during_01.png (314.7 KB)
xpu_render_stats_during_02.png (320.8 KB)
xpu_render_stats_during_03.png (306.8 KB)
xpu_render_stats_after_01.png (286.3 KB)
cuda_fallback_setting.png (105.1 KB)
cuda_fallback.png (222.7 KB)

Member
27 posts
Joined: March 2023
Hi again,

I've narrowed it down to a tree asset; let's call it "TreeLarge".
Just 6 of these hand-placed in the scene spikes the RAM usage, and only in Karma.
R739| [13:59:06] Total Wall Clock Time: 0:04:52.15 
R740| [13:59:06] Total CPU Time: 0:00:10.67
R741| [13:59:06] System CPU Time Only: 0:00:04.27
R742| [13:59:06] Current Memory Usage: 98.73 GiB
R743| [13:59:06] Peak Memory Usage: 98.73 GiB
R744| [13:59:06] Image save time: 0:00:00.35

The tree itself is quite simple, but has lots of geometry shells.



The shader tree is super basic, has some displacement.


Without materials:
R390| [14:07:07] Total Wall Clock Time: 0:00:17.56 
R391| [14:07:07] Total CPU Time: 0:00:03.33
R392| [14:07:07] System CPU Time Only: 0:00:02.87
R393| [14:07:07] Current Memory Usage: 17.30 GiB
R394| [14:07:07] Peak Memory Usage: 17.30 GiB
R395| [14:07:07] Image save time: 0:00:00.06


I noticed that as soon as I move the camera closer, from framing the full tree to about mid-distance, the RAM usage spikes dramatically:
from the 17-20GB above up to 50GB.
R604| [14:13:48] Total Wall Clock Time: 0:02:50.93 
R605| [14:13:48] Total CPU Time: 0:00:03.17
R606| [14:13:48] System CPU Time Only: 0:00:09.67
R607| [14:13:48] Current Memory Usage: 50.63 GiB
R608| [14:13:48] Peak Memory Usage: 50.63 GiB

As I go even closer:
R607| [14:27:42] Total Wall Clock Time: 0:04:28.28 
R608| [14:27:42] Total CPU Time: 0:00:05.05
R609| [14:27:42] System CPU Time Only: 0:00:03.02
R610| [14:27:42] Current Memory Usage: 123.71 GiB
R611| [14:27:42] Peak Memory Usage: 123.71 GiB
R612| [14:27:42] Image save time: 0:00:00.08


It's definitely camera-distance related, but this is with the "subdivisionScheme" set to "none", which should be unsubdivided if I understand Karma correctly. Rendering in the scene shows this is the case.
So what would increase RAM based on distance to the camera?
Is Karma still doing some sort of dicing?


Edit:
Here is some testing with the dicing quality; it definitely appears to be a factor.

dicing quality 0 on all geometry


dicing quality default on all geometry


subdivisionScheme set to catclark




thanks,
amwilkins
Edited by am_wilkins - April 9, 2024 09:08:23

Attachments:
tree_large.png (1.2 MB)
tree_large_shaders.png (259.0 KB)
dicing_quality_0.png (306.7 KB)
dicing_quality_default.png (289.2 KB)
subdivScheme_catclark.png (288.4 KB)

Member
7785 posts
Joined: Sept. 2011
am_wilkins
has some displacement.

there's your problem
Member
27 posts
Joined: March 2023
jsmack
am_wilkins
has some displacement.

there's your problem


I have displacement on the ground, the rocks, and the trunk of a different tree type.
None of those assets appear to spike the memory like this "TreeLarge".

Displacement shouldn't inherently be a problem...other engines aren't having any trouble with this scene or with this asset.
Does Karma handle displacement poorly?
Even if I remove the material, the RAM usage is still very high.

Did you see the camera distance testing, how that is increasing memory by huge amounts?

I'm currently attempting to tweak the "Dicing Quality" settings to hopefully achieve a more manageable result.

Edit:
Some success!

It looks like turning down "Dicing Quality" to around 0.1 on all the geometry related to this tree asset drastically reduces the memory usage.

I'm now getting better memory usage and render times.
No envvar.
With everything active/visible and with the subdiv scheme on.
R1945| [17:49:44] Total Wall Clock Time: 0:01:36.35 
R1946| [17:49:44] Total CPU Time: 0:00:04.71
R1947| [17:49:44] System CPU Time Only: 0:00:02.46
R1948| [17:49:44] Current Memory Usage: 37.20 GiB
R1949| [17:49:44] Peak Memory Usage: 37.20 GiB
R1950| [17:49:45] Image save time: 0:00:00.46

The memory was increasing as the camera moved closer to this tree asset, and that also made it harder to diagnose.

Coming from mainly using Arnold for years, this dicing quality is a little confusing.
For example, what's the equivalent of one subdivision iteration in terms of dicing quality?



thanks,
amwilkins
Edited by am_wilkins - April 9, 2024 11:52:29
Member
7785 posts
Joined: Sept. 2011
Dicing is in screen space, so each mesh is chopped until the triangles are pixel-sized at dicing quality 1. If that tree has displacement on the leaves, that's a lot of surface area to be diced.
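As rough back-of-the-envelope math (my own approximation, not exact Karma numbers): diced triangle count ≈ projected screen area in pixels × dicing quality². A tree covering ~2 million pixels dices to around 2 million triangles at quality 1.0; halve the camera distance and the projected area (and the diced-geometry memory) roughly quadruples, while dropping the quality to 0.1 cuts the count by about 100x. That would explain the distance-dependent RAM spikes.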
Member
27 posts
Joined: March 2023
jsmack
Dicing is in screen space, so each mesh is chopped until the triangles are pixel-sized at dicing quality 1. If that tree has displacement on the leaves, that's a lot of surface area to be diced.

Still new to Karma, isn't dicing separate from whether something has displacement or not?

Yeah, the leaves and twigs don't have displacement thankfully... that'd be overkill. Just the trunk.
Member
7785 posts
Joined: Sept. 2011
am_wilkins
Still new to Karma, isn't dicing separate from whether something has displacement or not?

I'm not sure what you mean. Yes, it's separate, but the presence of displacement or subdivision will trigger dicing up to the quality level allowed. I'm not sure how it interacts with instancing in Karma. I know in mantra it would warn you that the shared dicing across instances could lead to inappropriate dicing since you couldn't control which instance was the reference for dicing.
Staff
470 posts
Joined: May 2019
Thanks for doing the tests

am_wilkins
2) Interesting, good to know. With that setting enabled, the GPU fails to load much into VRAM (or keep it loaded) and appears to just render purely on the CPU.

What render time did you get when using that NVidia driver setting?
Did it go back to being as fast as CPU-only again?
I'm wondering if it's the NVidia driver memory-swapping stuff that is slowing down your scene with GPU+CPU vs CPU-only.

am_wilkins
As for the point of failure, I only really noted that about 25% into rendering the frame the CUDA cores stopped reporting utilization; however, the GPU would still sometimes list itself at 100%.

Yea... we've found the Windows "GPU utilization" metrics UI to be very incorrect for NVidia/OptiX :/
For example, when rendering with two GPUs, the 2nd GPU often registers as not having any load, even though it's working full steam.

That last image, xpu_render_stats_after_01.png, is curious to me.
Does that show the stats after the GPU has failed, but while the CPU is still rendering the frame? Or has the whole frame finished at this stage (e.g. husk/karma has finished and closed)? The thing I'm trying to determine is whether the GPU releases all its memory at the point of failure, or if that only happens at the end of the frame.

Thanks
Edited by brians - April 9, 2024 22:09:27
Member
27 posts
Joined: March 2023
jsmack
Yes, it's separate, but the presence of displacement or subdivision will trigger dicing up to the quality level allowed. I'm not sure how it interacts with instancing in Karma. I know in mantra it would warn you that the shared dicing across instances could lead to inappropriate dicing since you couldn't control which instance was the reference for dicing.

Yeah, I was just thinking it was the same as Arnold's "Adaptive Subdivision" or Redshift's "Minimum Edge Loop" & "Maximum Subdivision" kind of settings, so it doesn't matter whether there is displacement or not; it just controls how the mesh gets subdivided.

In both of those engines, if you use adaptive subdivision on instanced geometry, they give a warning at render time that it will cause issues, and I think they just disable it.
On productions we tend to just not use it on anything that's going to get instanced.

I was wondering how Karma's dicing would handle this too...


brians
What render time did you get when using that NVidia driver setting?
Did it go back to being as fast as CPU-only again?

Yes, back to the same speed as CPU-only again, i.e. that ~4min frame time from the first post.


brians
Does that show the stats after the GPU has failed, but while the CPU is still rendering the frame? Or has the whole frame finished at this stage (e.g. husk/karma has finished and closed)?

Yes, I believe that screen grab was from when the frame was done rendering pixels: waiting for the next frame to start, likely busy saving out the frame and all that post-process stuff.
As I do further testing and notice any other behaviors, I'll let you know.



amwilkins