I am doing some tests with the new ML volume upres node in H21, and oddly, choosing DirectML as the execution provider is a lot faster than CUDA. I verified that Houdini actually picks up CUDA Toolkit 12.8 by checking the environment from Houdini's Python shell:
import os

# Print the CUDA toolkit location and the search path as Houdini sees them
print(os.environ.get("CUDA_PATH"))
print(os.environ.get("PATH"))
My cuDNN version is 9.13 and it's also on my PATH environment variable. My understanding is that CUDA should be the fastest option, but in practice it's 3-5x slower than DirectML.
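In case it helps narrow this down, here is the kind of check I was planning to run next from the same Python shell (a minimal sketch, assuming the node runs inference through ONNX Runtime and that its onnxruntime module is importable from Houdini's Python; "model.onnx" is just a placeholder path, and the provider names are the standard onnxruntime ones):

import onnxruntime as ort

# Which execution providers this onnxruntime build can actually load
print(ort.get_available_providers())

# Ask for CUDA explicitly and see what the session really ends up using
sess = ort.InferenceSession(
    "model.onnx",  # placeholder: any ONNX model file on disk
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())  # if CUDA failed to load, only CPUExecutionProvider shows up

If CUDAExecutionProvider doesn't show up, or the session silently falls back to CPU, that would at least tell me the CUDA provider isn't loading correctly.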
Any ideas what could be going wrong here?