Karma XPU - Machines with GPU present take longer to render

Member
27 posts
Joined: March 2023
Good day,

I've been doing some render testing with the Karma XPU.
Some frames have rendered successfully. Here are the tests:
The same XPU job was submitted to both machines.

#1 Machine
CPU - 1 x AMD Ryzen Threadripper 3970X 32-Core Processor (32 cores, 64 logical) with 130945MB
GPU - No GPU.
Average time per frame - 4mins 20secs

#2 Machine
CPU - 1 x AMD Ryzen Threadripper 3970X 32-Core Processor (32 cores, 64 logical) with 130945MB
GPU - NVIDIA GeForce RTX 4090 @ 2520MHz (compute 8.9) with 24563MB (23279MB available) (NVLink:0)
Average time per frame - 7mins



It seems crazy that adding a 4090 GPU to help render the job yields a slower render time?
Is this expected behavior? Is there anything I can look at adjusting?

The Karma/Husk logs don't seem to report any useful information regarding CPU/GPU usage or where time is spent.
Our render manager log at the end does seem to show that the GPU was being used:
R257| Total processing time: 00:07.00 h:m.s 
R258| Max memory usage: 101.6GB (of 130GB installed, total system usage at same time: 109.4GB. Apr 04. 18:18.35)
R259| System free memory left: 19.0GB (Apr 04. 18:18.35)
R260| Max CPU usage: 96% 61.7 of 64 cores (total system usage at same time: 0.0. Apr 04. 18:19.04)
R261| Max GPU usage:100% 1.0 GPUs (of 1 GPUs assigned to this thread. Apr 04. 18:15.29. Max system GPU usage at same time: 1.0/1. Apr 04. 18:15.29)


all the best,
amwilkins
Edited by am_wilkins - April 5, 2024 10:26:17
Member
27 posts
Joined: March 2023
Hi again,

Apologies to those who saw this already and were likely confused. I got the average times mixed up.
I've edited the original post now.

The key takeaway is that I'm getting roughly a 60% slower render time (7mins vs 4mins 20secs) on the machine with a 4090 in it.


Thanks!
amwilkins
Edited by am_wilkins - April 5, 2024 10:28:10
Member
7785 posts
Joined: Sept. 2011
Weird, it's about 10x faster with a GPU in my experience. Are you sure it's using the GPU at all? Your log shows over 100GB of memory usage, but the 4090 only has 24GB, so it must be failing a few seconds in anyway.

Also, is this the time for the first run, or an average of runs? On the first run there may be shader compile times, which can take several minutes.
Member
27 posts
Joined: March 2023
Hi jsmack,

Yeah, the only thing I can see in the logs relating to GPU or memory utilization is that last log from our render manager. It does show 100GB, but that's mostly what gets loaded into system RAM.
Are there any better logs to get from a Karma Husk render?

The times are from an average of runs.

I definitely wasn't expecting slower render times with a 4090 involved. I logged into the machine to check what was happening while the job was running. It does try to load some GPU memory, but over a few frames it never seems to fully use the CUDA cores or the GPU to render.

Here is one frame that was using the CUDA cores for a bit, but memory usage is very high. So maybe it wasn't getting everything it needs into memory to render at max efficiency.



Looks like we are also getting an error on some of the frames.
R 76| StdErr: [09:51:43] KarmaXPU: device Type:Optix ID:0 has registered a critical error [cudaErrorMemoryAllocation], so will now stop functioning.  Future error messages will be suppressed

This now might be in relation to my other post:
https://www.sidefx.com/forum/topic/95306/?page=1#post-418837 [www.sidefx.com]

I had first suspected they were unrelated since I wasn't seeing that "cudaErrorMemoryAllocation".
Let me try to solve the memory usage first, and then I can post updated averages here.


thanks,
amwilkins
Edited by am_wilkins - April 8, 2024 04:48:01

Attachments:
xpu_render_stats_01.png (264.8 KB)

Staff
470 posts
Joined: May 2019
I think the first thing to try is to render with only the EmbreeCPU device active on that machine, to make sure we get the same performance as the machine without the GPU.
So with this envvar here
KARMA_XPU_DISABLE_OPTIX_DEVICE=1
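For example, something like this from a command prompt before launching husk (just a sketch; myscene.usd and the frame options are placeholders, and on Linux you'd use export instead of set):

set KARMA_XPU_DISABLE_OPTIX_DEVICE=1
husk -f 101 -n 1 myscene.usd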

Please report back if you get the same performance, thanks.

We reserve some threads for GPU shader compilation, and release them to the EmbreeCPU device once compilation is done. Perhaps they're not being released when the GPU goes into an error state. In the meantime I'll have a look at the code to see if this might be the case.
Member
27 posts
Joined: March 2023
Hi brians,

Thanks for the recommendations.

brians
I think the first thing to try is to render with only the EmbreeCPU device active on that machine
Rendering with this envvar enabled on the same machine that has the 4090 yielded a much faster render time using purely the CPU.
No envvar = an average of 6mins 00secs
With envvar = an average of 3mins 40secs

After resetting the job a few times with and without the envvar, I also noted some inconsistent behavior.
No envvar.
frame 101 = 2min 50secs
frame 102 = 4min 55secs
frame 103 = 3min 10secs
frame 104 = 2min 45secs
frame 105 = 12min 30secs

We actually see the first and fourth frames using the GPU more than the CPU, which gives the kind of render time we're expecting to see. However, subsequent frames are much longer again, most frequently averaging as mentioned above.
So possibly slowing down when GPU VRAM is overloaded? Or something else?

A separate set of frames, for example (Karma XPU, same settings, same scene, no envvar; this one had a "catmullClark" scheme on the geometries, so it's more likely to use extra memory):
frame 396 = 7min 25secs
frame 397 = 6min 25secs
frame 398 = 6min 25secs
frame 399 = 7min 30secs
frame 400 = 7min 30secs


Over on the other thread regarding the excessive memory usage:
https://www.sidefx.com/forum/topic/95306/?page=1#post-419135 [www.sidefx.com]

I removed all the subdivisions for the scene, which reduced memory usage (still not as low as it should be), but the 24GB VRAM is still maxing out on the GPU.
No "cudaErrorMemoryAllocation" reported in the log.

I also increased the verbosity, but I'm not seeing much that's useful in the logs:
Karma XPU
No envvar.
R622| [14:45:39] Unified Cache: 363.29 MiB of 31.97 GiB used 
R623| [14:45:39] In Cache Faults Generated
R624| [14:45:39] 363.29 MiB 48,825 363.29 MiB 7.62 KiB
R625| [14:45:39] TBF Cache: 310 file opens (133.66 MiB read)
R626| [14:45:39] RAT Disk Cache: 60 hits
R627| [14:45:39] accept_unmipped : 1
R628| [14:45:39] accept_untiled : 1
R629| [14:45:39] automip : 0
R630| [14:45:39] autoscanline : 1
R631| [14:45:39] autotile : 512
R632| [14:45:39] deduplicate : 1
R633| [14:45:39] failure_retries : 0
R634| [14:45:39] forcefloat : 0
R635| [14:45:39] max_errors_per_file : 100
R636| [14:45:39] max_memory_MB : 63.94 GiB
R637| [14:45:39] max_mip_res : 1,073,741,824
R638| [14:45:39] max_open_files : 1024
R639| [14:45:39] searchpath : ''
R640| [14:45:39] trust_file_extensions : 0
R641| [14:45:39] unassociatedalpha : 0

R658| OpenImageIO ImageCache statistics (shared) ver 2.3.14
R659| Options: max_memory_MB=65473.0 max_open_files=1024 autotile=512
R660| autoscanline=1 automip=0 forcefloat=0 accept_untiled=1
R661| accept_unmipped=1 deduplicate=1 unassociatedalpha=0
R662| failure_retries=0
R663| Images : 63 unique
R664| ImageInputs : 62 created, 2 current, 12 peak
R665| Total pixel data size of all images referenced : 2.6 GB
R666| Total actual file size of all images referenced : 979.9 MB
R667| Pixel data read : 239.7 MB
R668| File I/O time : 14.1s (0.2s average per thread, for 81 threads)
R669| File open time only : 0.2s
R670| Tiles: 475 created, 472 current, 472 peak
R671| total tile requests : 157261461
R672| micro-cache misses : 92888 (0.059066%)
R673| main cache misses : 475 (0.000302045%)
R674| redundant reads: 0 tiles, 0 B
R675| Peak cache memory : 239.7 MB

R761| [14:45:39] Object Counts:
R762| [14:45:39] Cameras: 1
R763| [14:45:39] Coordinate Spaces: 0
R764| [14:45:39] Curve Meshes: 0
R765| [14:45:39] Light Tree: 0
R766| [14:45:39] Lights: 2
R767| [14:45:39] Point Meshes: 0
R768| [14:45:39] Polygon Meshes: 285,891 total 1,627 unique
R769| [14:45:39] Volumes: 0
R770| [14:45:39] Geometry Counts:
R771| [14:45:39] Curves: 0
R772| [14:45:39] Points: 0
R773| [14:45:39] Polygons: 60,271,321 total 14,230,441 unique
R774| [14:45:39] Polygons (Diced): 2,070,273
R775| [14:45:39] Light Types:
R776| [14:45:39] Cylinder: 0
R777| [14:45:39] Disk: 0
R778| [14:45:39] Distant: 1
R779| [14:45:39] Dome: 1
R780| [14:45:39] Geometry: 0
R781| [14:45:39] Line: 0
R782| [14:45:39] Point: 0
R783| [14:45:39] Rectangle: 0
R784| [14:45:39] Sphere: 0
R785| [14:45:39] Shader Nodes:
R786| [14:45:39] CPU Shaders: 107 total 15 unique
R787| [14:45:39] Function Errors: 0
R788| [14:45:39] Functions Loaded: 93
R789| [14:45:39] Largest Shader: 64
R790| [14:45:39] Shader Nodes: 5,502
R791| [14:45:39] Shaders: 396
R792| [14:45:39] USD Preview Shaders: 2
R793| [14:45:39] Ray Counts:
R794| [14:45:39] Camera Rays: 265,420,800
R795| [14:45:39] Indirect: 283,525,303
R796| [14:45:39] Light Geometry: 0
R797| [14:45:39] Occlusion: 1,009,531,091
R798| [14:45:39] Probe: 5,733,485
R799| [14:45:39] Total: 1,564,210,679
R800| [14:45:39] Shader Calls:
R801| [14:45:39] Displacement: 188,534,307
R802| [14:45:39] Emission: 0
R803| [14:45:39] Light: 0
R804| [14:45:39] Opacity: 0
R805| [14:45:39] Surface: 0
R806| [14:45:39] Volume: 0
R807| [14:45:39] Primvar Cache: 35,386 hits, 12,414 misses
R808| [14:45:39] Primvar Memory Usage Actual Uncompressed
R809| [14:45:39] real32[3] <dicedmesh> 9.42 GiB 9.43 GiB
R810| [14:45:39] int32 <topology> 1.99 GiB 2.25 GiB
R811| [14:45:39] real32[3] N 63.30 MiB 598.36 MiB
R812| [14:45:39] real32[3] Pref 18.25 MiB 166.57 MiB
R813| [14:45:39] real32[3] P 18.25 MiB 166.44 MiB
R814| [14:45:39] int32 QuadVerts 8.23 MiB 115.07 MiB
R815| [14:45:39] int32 TriVerts 823.73 KiB 11.25 MiB
R816| [14:45:39] real32[2] st 594.35 KiB 3.34 MiB

R858| [14:45:39] Bucket Time Breakdown:
R859| [14:45:39] Category Time Percentage
R860| [14:45:39] Dicing 0:00:14.62 100.00
R861| [14:45:39] Filtering 0:00:00 0.00
R862| [14:45:39] Indirect rays 0:00:00 0.00
R863| [14:45:39] Lighting 0:00:00 0.00
R864| [14:45:39] Primary rays 0:00:00 0.00
R865| [14:45:39] SSS samples 0:00:00 0.00
R866| [14:45:39] Shading 0:00:00 0.00
R867| [14:45:39] Shadows 0:00:00 0.00
R868| [14:45:39] Unaccounted 0:00:00 0.00
R869| [14:45:39] Total Wall Clock Time: 0:09:29.68
R870| [14:45:39] Total CPU Time: 0:00:07.55
R871| [14:45:39] System CPU Time Only: 0:00:05.58
R872| [14:45:39] Current Memory Usage: 87.45 GiB
R873| [14:45:39] Peak Memory Usage: 87.45 GiB
R874| [14:45:40] Image save time: 0:00:00.47

Karma XPU
With envvar. (KARMA_XPU_DISABLE_OPTIX_DEVICE=1)
R562| [15:00:14] Unified Cache: 355.30 MiB of 31.97 GiB used 
R563| [15:00:14] In Cache Faults Generated
R564| [15:00:14] 355.30 MiB 47,752 355.30 MiB 7.62 KiB
R565| [15:00:14] TBF Cache: 441 file opens (127.25 MiB read)
R566| [15:00:14] RAT Disk Cache: 56 hits
R567| [15:00:14] accept_unmipped : 1
R568| [15:00:14] accept_untiled : 1
R569| [15:00:14] automip : 0
R570| [15:00:14] autoscanline : 1
R571| [15:00:14] autotile : 512
R572| [15:00:14] deduplicate : 1
R573| [15:00:14] failure_retries : 0
R574| [15:00:14] forcefloat : 0
R575| [15:00:14] max_errors_per_file : 100
R576| [15:00:14] max_memory_MB : 63.94 GiB
R577| [15:00:14] max_mip_res : 1,073,741,824
R578| [15:00:14] max_open_files : 4096
R579| [15:00:14] searchpath : ''
R580| [15:00:14] trust_file_extensions : 0
R581| [15:00:14] unassociatedalpha : 0
R582| [15:00:14] OpenImageIO Stats String:
R583| [15:00:14] OpenImageIO Texture statistics
R584| Options: gray_to_rgb=0 flip_t=0 max_tile_channels=6
R585| Queries/batches :
R586| texture : 156777427 queries in 156777427 batches
R587| texture 3d : 0 queries in 0 batches
R588| shadow : 0 queries in 0 batches
R589| environment : 0 queries in 0 batches
R590| gettextureinfo : 0 queries
R591| Interpolations :
R592| closest : 0
R593| bilinear : 156777427
R594| bicubic : 0
R595| Average anisotropic probes : 1
R596| Max anisotropy in the wild : 1
R597|
R598| OpenImageIO ImageCache statistics (shared) ver 2.3.14
R599| Options: max_memory_MB=65473.0 max_open_files=4096 autotile=512
R600| autoscanline=1 automip=0 forcefloat=0 accept_untiled=1
R601| accept_unmipped=1 deduplicate=1 unassociatedalpha=0
R602| failure_retries=0
R603| Images : 63 unique
R604| ImageInputs : 62 created, 6 current, 11 peak
R605| Total pixel data size of all images referenced : 2.6 GB
R606| Total actual file size of all images referenced : 979.9 MB
R607| Pixel data read : 239.7 MB
R608| File I/O time : 19.8s (0.2s average per thread, for 99 threads)
R609| File open time only : 0.2s
R610| Tiles: 476 created, 472 current, 472 peak
R611| total tile requests : 157261461
R612| micro-cache misses : 87199 (0.0554484%)
R613| main cache misses : 476 (0.000302681%)
R614| redundant reads: 0 tiles, 0 B
R615| Peak cache memory : 239.7 MB

R701| [15:00:14] Object Counts:
R702| [15:00:14] Cameras: 1
R703| [15:00:14] Coordinate Spaces: 0
R704| [15:00:14] Curve Meshes: 0
R705| [15:00:14] Light Tree: 0
R706| [15:00:14] Lights: 2
R707| [15:00:14] Point Meshes: 0
R708| [15:00:14] Polygon Meshes: 285,891 total 1,627 unique
R709| [15:00:14] Volumes: 0
R710| [15:00:14] Geometry Counts:
R711| [15:00:14] Curves: 0
R712| [15:00:14] Points: 0
R713| [15:00:14] Polygons: 60,271,321 total 14,230,441 unique
R714| [15:00:14] Polygons (Diced): 1,657,614
R715| [15:00:14] Light Types:
R716| [15:00:14] Cylinder: 0
R717| [15:00:14] Disk: 0
R718| [15:00:14] Distant: 1
R719| [15:00:14] Dome: 1
R720| [15:00:14] Geometry: 0
R721| [15:00:14] Line: 0
R722| [15:00:14] Point: 0
R723| [15:00:14] Rectangle: 0
R724| [15:00:14] Sphere: 0
R725| [15:00:14] Shader Nodes:
R726| [15:00:14] CPU Shaders: 107 total 15 unique
R727| [15:00:14] Function Errors: 0
R728| [15:00:14] Functions Loaded: 93
R729| [15:00:14] Largest Shader: 64
R730| [15:00:14] Shader Nodes: 5,502
R731| [15:00:14] Shaders: 396
R732| [15:00:14] USD Preview Shaders: 2
R733| [15:00:14] Ray Counts:
R734| [15:00:14] Camera Rays: 265,420,800
R735| [15:00:14] Indirect: 283,525,303
R736| [15:00:14] Light Geometry: 0
R737| [15:00:14] Occlusion: 1,009,531,114
R738| [15:00:14] Probe: 5,733,485
R739| [15:00:14] Total: 1,564,210,702
R740| [15:00:14] Shader Calls:
R741| [15:00:14] Displacement: 188,534,307
R742| [15:00:14] Emission: 0
R743| [15:00:14] Light: 0
R744| [15:00:14] Opacity: 0
R745| [15:00:14] Surface: 0
R746| [15:00:14] Volume: 0
R747| [15:00:14] Primvar Cache: 35,386 hits, 12,414 misses
R748| [15:00:14] Primvar Memory Usage Actual Uncompressed
R749| [15:00:14] real32[3] <dicedmesh> 9.42 GiB 9.43 GiB
R750| [15:00:14] int32 <topology> 1.99 GiB 2.25 GiB
R751| [15:00:14] real32[3] N 63.30 MiB 598.36 MiB
R752| [15:00:14] real32[3] Pref 18.25 MiB 166.57 MiB
R753| [15:00:14] real32[3] P 18.25 MiB 166.44 MiB
R754| [15:00:14] int32 QuadVerts 8.23 MiB 115.07 MiB
R755| [15:00:14] int32 TriVerts 823.73 KiB 11.25 MiB
R756| [15:00:14] real32[2] st 594.35 KiB 3.34 MiB

R798| [15:00:14] Bucket Time Breakdown:
R799| [15:00:14] Category Time Percentage
R800| [15:00:14] Dicing 0:00:14.58 100.00
R801| [15:00:14] Filtering 0:00:00 0.00
R802| [15:00:14] Indirect rays 0:00:00 0.00
R803| [15:00:14] Lighting 0:00:00 0.00
R804| [15:00:14] Primary rays 0:00:00 0.00
R805| [15:00:14] SSS samples 0:00:00 0.00
R806| [15:00:14] Shading 0:00:00 0.00
R807| [15:00:14] Shadows 0:00:00 0.00
R808| [15:00:14] Unaccounted 0:00:00 0.00
R809| [15:00:14] Total Wall Clock Time: 0:03:47.37
R810| [15:00:14] Total CPU Time: 0:00:08.39
R811| [15:00:14] System CPU Time Only: 0:00:02.08
R812| [15:00:14] Current Memory Usage: 61.54 GiB
R813| [15:00:14] Peak Memory Usage: 61.54 GiB
R814| [15:00:15] Image save time: 0:00:00.47


thanks,
amwilkins
Staff
470 posts
Joined: May 2019
brians
Perhaps they're not being released when the GPU goes into an error state. In the meantime I'll have a look at the code to see if this might be the case.

I've checked this, and although we could be better at re-allocating resources after a device has failed (perhaps 1-2 threads' worth), this wouldn't explain the slowdown you're getting on your Threadripper.

It definitely seems to be something to do with your GPU memory maxing out. So let's try a few more things.

1)
Can you try stripping your scene down somewhat (so that it all fits in GPU memory), and then do a render? And report back what kinds of times you get with/without the GPU (i.e. by using that envvar).

2)
There is a new Windows/NVidia feature where OptiX/CUDA GPU memory can "spill" over to CPU RAM when the GPU gets full.
https://nvidia.custhelp.com/app/answers/detail/a_id/5490/~/system-memory-fallback-for-stable-diffusion [nvidia.custhelp.com]

This is still a very new feature and we're having trouble getting information about it. But I can see your machine is doing this because the "Shared GPU memory" is at 18GB. What happens if you set "CUDA system fallback policy" to "Prefer No Sysmem fallback" in the NVidia control panel (which will disable this feature)?

3)
Are you able to watch the Task Manager stats during a render where the GPU fails, and perhaps take a screenshot? I want to see what happens at the point of failure. We should be releasing GPU (and also "Shared GPU memory") resources once it fails, but perhaps we are not.

4)
We output stats for the devices into the EXR header (as long as the new driver is used, not legacy). They can be viewed using the command iinfo -v myfile.exr. It's very dense/unreadable, sorry, but could you perhaps post the result of that here for a GPU-failing render? It lets us see a little more about what is happening on each device.
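For example, to capture the header to a text file you can attach here (myfile.exr being a placeholder for one of the GPU-failing frames):

iinfo -v myfile.exr > myfile_exr_header.txt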

We really should be outputting more information regarding device utilization/failure into the regular stats too. I'll get that prioritized.

cheers
Edited by brians - April 8, 2024 23:32:25
Member
27 posts
Joined: March 2023
Hi,

1) Okay, cool... I stripped out all the ground, background, and grass scatters (just leaving a few hand-placed trees and bushes).
No envvar.
However, memory usage was still high:
R883| [11:19:53] Total Wall Clock Time: 0:06:24.98 
R884| [11:19:53] Total CPU Time: 0:00:05.87
R885| [11:19:53] System CPU Time Only: 0:00:06.46
R886| [11:19:53] Current Memory Usage: 107.69 GiB
R887| [11:19:53] Peak Memory Usage: 107.69 GiB

So I went further, removing all hand-placed assets and leaving only a small ground plane and a character.
No envvar.
R755| [11:30:47]        Total CPU Time: 0:00:04.50 
R756| [11:30:47] System CPU Time Only: 0:00:02.87
R757| [11:30:47] Current Memory Usage: 16.09 GiB
R758| [11:30:47] Peak Memory Usage: 16.09 GiB
(Around 6.6GB VRAM checking Task Manager while the render was in progress; 100% CPU / around 40% GPU usage.)

Render time is now very fast.
Still more RAM than I would have expected given what's in my scene.
Subdivisions are still disabled. A very small piece of ground and one low-poly character with simple shaders (maps connected to the shader node), and that's it.

I brought everything back except the hand-placed assets.
Still fast render times. So it's likely just the 10 bushes, 8 small rocks, and 14 trees that are causing the super high memory use (with subdivisions disabled), i.e. the assets hand-placed using a Stage Manager node.
R944| [11:45:13] Total Wall Clock Time: 0:00:50.67 
R945| [11:45:13] Total CPU Time: 0:00:03.31
R946| [11:45:13] System CPU Time Only: 0:00:05.37
R947| [11:45:13] Current Memory Usage: 17.95 GiB
R948| [11:45:13] Peak Memory Usage: 17.95 GiB


2) Interesting, good to know. With that setting enabled, the GPU fails to load much into VRAM (or keep it loaded) and appears to just render purely on the CPU.



3) Attached 4 images, trying to capture some of what the GPU is doing during and after the frame is completed.
As for the point of failure, I only really noted that about 25% into rendering the frame the CUDA cores stopped reporting utilization; however, the GPU would still sometimes list itself at 100%.
(ps. this was a mostly stripped-down scene, but the memory usage is still high)

4) Attached please find the output stats from the EXR.
This was from a job in which we received the "cudaErrorMemoryAllocation" error.


So at the moment, it really does seem like certain hand-placed geometry is using excessive amounts of memory (just in Karma), which maxes out the GPU memory and causes the render to take longer overall than on the CPU alone.

brians
We really should be outputting more information regarding device utilization/failure into the regular stats too. I'll get that prioritized.
Thanks, that would be awesome!



all the best,
amwilkins

Attachments:
frame_0398_log.txt (26.0 KB)
xpu_render_stats_during_01.png (314.7 KB)
xpu_render_stats_during_02.png (320.8 KB)
xpu_render_stats_during_03.png (306.8 KB)
xpu_render_stats_after_01.png (286.3 KB)
cuda_fallback_setting.png (105.1 KB)
cuda_fallback.png (222.7 KB)

Member
27 posts
Joined: March 2023
Hi again,

I've narrowed it down to a tree asset; let's call it "TreeLarge".
Just 6 of these hand-placed in the scene spikes the RAM usage, and only in Karma.
R739| [13:59:06] Total Wall Clock Time: 0:04:52.15 
R740| [13:59:06] Total CPU Time: 0:00:10.67
R741| [13:59:06] System CPU Time Only: 0:00:04.27
R742| [13:59:06] Current Memory Usage: 98.73 GiB
R743| [13:59:06] Peak Memory Usage: 98.73 GiB
R744| [13:59:06] Image save time: 0:00:00.35

The tree itself is quite simple, but has lots of geometry shells.



The shader tree is super basic, has some displacement.


Without materials:
R390| [14:07:07] Total Wall Clock Time: 0:00:17.56 
R391| [14:07:07] Total CPU Time: 0:00:03.33
R392| [14:07:07] System CPU Time Only: 0:00:02.87
R393| [14:07:07] Current Memory Usage: 17.30 GiB
R394| [14:07:07] Peak Memory Usage: 17.30 GiB
R395| [14:07:07] Image save time: 0:00:00.06


I noticed that as soon as I move the camera closer, from framing the full tree to about mid-distance, the RAM usage spikes dramatically:
from the 17-20GB above up to 50GB.
R604| [14:13:48] Total Wall Clock Time: 0:02:50.93 
R605| [14:13:48] Total CPU Time: 0:00:03.17
R606| [14:13:48] System CPU Time Only: 0:00:09.67
R607| [14:13:48] Current Memory Usage: 50.63 GiB
R608| [14:13:48] Peak Memory Usage: 50.63 GiB

As I go even closer:
R607| [14:27:42] Total Wall Clock Time: 0:04:28.28 
R608| [14:27:42] Total CPU Time: 0:00:05.05
R609| [14:27:42] System CPU Time Only: 0:00:03.02
R610| [14:27:42] Current Memory Usage: 123.71 GiB
R611| [14:27:42] Peak Memory Usage: 123.71 GiB
R612| [14:27:42] Image save time: 0:00:00.08


It's definitely camera-distance related, but this is with the "subdivisionScheme" set to "none", which should be unsubdivided if I understand Karma correctly. Rendering in the scene shows this is the case.
So what would increase RAM based on distance to the camera?
Is Karma still doing some sort of dicing?


Edit:
Here is some testing with the dicing quality; it definitely appears to be a factor.

dicing quality 0 on all geometry


dicing quality default on all geometry


subdivisionScheme set to catclark




thanks,
amwilkins
Edited by am_wilkins - April 9, 2024 09:08:23

Attachments:
tree_large.png (1.2 MB)
tree_large_shaders.png (259.0 KB)
dicing_quality_0.png (306.7 KB)
dicing_quality_default.png (289.2 KB)
subdivScheme_catclark.png (288.4 KB)

Member
7785 posts
Joined: Sept. 2011
am_wilkins
has some displacement.

there's your problem
Member
27 posts
Joined: March 2023
jsmack
am_wilkins
has some displacement.

there's your problem


I have displacement on the ground, the rocks, and the trunk of a different tree type.
None of those assets appear to spike the memory like this "TreeLarge".

Displacement shouldn't inherently be a problem...other engines aren't having any trouble with this scene or with this asset.
Does Karma handle displacement poorly?
Even if I remove the material, the RAM usage is still very high.

Did you see the camera distance testing, how that is increasing memory by huge amounts?

I'm currently attempting to tweak the "Dicing Quality" settings to hopefully achieve a more manageable result.

Edit:
Some success!

It looks like turning down "Dicing Quality" to around 0.1 on all the geometry related to this tree asset drastically reduces the memory usage.

I'm now getting better memory usage and render times.
No envvar.
With everything active/visible and with the subdiv scheme on.
R1945| [17:49:44] Total Wall Clock Time: 0:01:36.35 
R1946| [17:49:44] Total CPU Time: 0:00:04.71
R1947| [17:49:44] System CPU Time Only: 0:00:02.46
R1948| [17:49:44] Current Memory Usage: 37.20 GiB
R1949| [17:49:44] Peak Memory Usage: 37.20 GiB
R1950| [17:49:45] Image save time: 0:00:00.46

The memory was increasing as the camera moved closer to this tree asset, and that also made it harder to diagnose.

Coming from mainly using Arnold for years, this dicing quality is a little confusing.
For example, what's the equivalent of one subdivision iteration in terms of dicing quality?



thanks,
amwilkins
Edited by am_wilkins - April 9, 2024 11:52:29
Member
7785 posts
Joined: Sept. 2011
Dicing is in screen space, so each mesh is chopped until the triangles are pixel-sized at dicing quality 1. If that tree has displacement on the leaves, that's a lot of surface area to be diced.
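As rough back-of-the-envelope math (my own approximation, not exact Karma numbers): diced triangle count ≈ projected screen area in pixels × dicing quality². A tree covering ~2 million pixels dices to around 2 million triangles at quality 1.0; halve the camera distance and the projected area (and the diced-geometry memory) roughly quadruples, while dropping the quality to 0.1 cuts the count by about 100x. That would explain the distance-dependent RAM spikes.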
Member
27 posts
Joined: March 2023
jsmack
Dicing is in screen space, so each mesh is chopped until the triangles are pixel-sized at dicing quality 1. If that tree has displacement on the leaves, that's a lot of surface area to be diced.

Still new to Karma, isn't dicing separate from whether something has displacement or not?

Yeah, the leaves and twigs don't have displacement thankfully... that'd be overkill. Just the trunk.
Member
7785 posts
Joined: Sept. 2011
am_wilkins
Still new to Karma, isn't dicing separate from whether something has displacement or not?

I'm not sure what you mean. Yes, it's separate, but the presence of displacement or subdivision will trigger dicing up to the quality level allowed. I'm not sure how it interacts with instancing in Karma. I know in mantra it would warn you that the shared dicing across instances could lead to inappropriate dicing since you couldn't control which instance was the reference for dicing.
Staff
470 posts
Joined: May 2019
Thanks for doing the tests

am_wilkins
2) Interesting, good to know. With that setting enabled, the GPU fails to load much into VRAM (or keep it loaded) and appears to just render purely on the CPU.

What render time did you get when using that NVidia driver setting?
Did it go back to being as fast as CPU-only again?
I'm wondering if it's the NVidia driver memory-swapping stuff that is slowing down your scene with GPU+CPU vs CPU-only.

am_wilkins
As for the point of failure, I only really noted that about 25% into rendering the frame the CUDA cores stopped reporting utilization; however, the GPU would still sometimes list itself at 100%.

Yea... we've found the Windows "GPU utilization" metrics UI to be very incorrect for NVidia/OptiX :/
For example, when rendering with two GPUs, the 2nd GPU often registers as not having any load, even though it's working full steam.

That last image, xpu_render_stats_after_01.png, is curious to me.
Does that show the stats after the GPU has failed, but while the CPU is still rendering the frame? Or has the whole frame finished at this stage (e.g. husk/karma has finished and closed)? The thing I'm trying to determine is whether the GPU releases all its memory at the point of failure, or if that only happens at the end of the frame.

Thanks
Edited by brians - April 9, 2024 22:09:27
Member
27 posts
Joined: March 2023
jsmack
Yes, it's separate, but the presence of displacement or subdivision will trigger dicing up to the quality level allowed. I'm not sure how it interacts with instancing in Karma. I know in mantra it would warn you that the shared dicing across instances could lead to inappropriate dicing since you couldn't control which instance was the reference for dicing.

Yeah, I was just thinking it was the same as Arnold's "Adaptive Subdivision" or Redshift's "Minimum Edge Loop" & "Maximum Subdivision" kind of settings, so it doesn't matter whether there is displacement or not; it just controls how the mesh gets subdivided.

In both of those engines, if you use adaptive subdivision on instanced geometry, they give a warning at render time that it will cause issues, and I think they just disable it.
On productions we tend to just not use it on anything that's going to get instanced.

I was wondering how Karma's dicing would handle this too...


brians
What render time did you get when using that NVidia driver setting?
Did it go back to being as fast as CPU-only again?

Yes, back to the same speed as CPU-only again, i.e. that ~4min frame time from the first post.


brians
Does that show the stats after the GPU has failed, but while the CPU is still rendering the frame? Or has the whole frame finished at this stage (e.g. husk/karma has finished and closed)?

Yes, I believe that screen grab was from when the frame was done rendering pixels: waiting for the next frame to start, likely busy saving out the frame and all that post-process stuff.
As I do further testing and notice any other behaviors, I'll let you know.



amwilkins