I've been getting to grips with OpenCL in Houdini recently, and it's got me wondering: unless you're running thousands of iterations on a data set between copy-backs, or doing some very complex calculations, the lion's share of the time seems to go on moving data to and from the GPU (even when being careful to transfer only the minimum necessary).

So what is the principal bottleneck there in a typical modern high-end desktop system? Say, in a dual-channel DDR5 system with a PCIe 5.0 x16 link to the GPU, would it be the RAM bandwidth, the PCIe bus, or some other factor?
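
For a rough sense of scale, here is a back-of-the-envelope sketch of the theoretical peak figures for that setup. It assumes DDR5-5600 modules (the speed grade is my assumption, not a given), a 64-bit data path per memory channel, and the PCIe 5.0 signalling rate of 32 GT/s per lane with 128b/130b encoding. The function names are made up for illustration, and real sustained throughput will be lower than either number because of protocol overhead, driver behaviour, and whether the host buffer is pinned.

```python
def pcie_peak_gb_per_s(gt_per_s_per_lane=32.0, lanes=16, encoding=128 / 130):
    """Theoretical one-direction PCIe bandwidth in GB/s (10^9 bytes/s).

    Defaults model PCIe 5.0 x16: 32 GT/s per lane, 128b/130b line encoding.
    """
    gbits_per_s = gt_per_s_per_lane * lanes * encoding  # usable gigabits/s
    return gbits_per_s / 8.0                            # bits -> bytes


def ddr_peak_gb_per_s(mega_transfers_per_s=5600, channels=2, bytes_per_transfer=8):
    """Theoretical DRAM bandwidth in GB/s (10^9 bytes/s).

    Defaults model dual-channel DDR5-5600 (assumed speed grade), with each
    channel moving 64 bits (8 bytes) per transfer.
    """
    bytes_per_s = mega_transfers_per_s * 1e6 * bytes_per_transfer * channels
    return bytes_per_s / 1e9


if __name__ == "__main__":
    pcie = pcie_peak_gb_per_s()
    ddr = ddr_peak_gb_per_s()
    print(f"PCIe 5.0 x16 peak (one direction): {pcie:.1f} GB/s")
    print(f"Dual-channel DDR5-5600 peak:       {ddr:.1f} GB/s")
```

Under those assumptions the link works out to roughly 63 GB/s per direction against roughly 89.6 GB/s for the RAM, so on paper the PCIe side is the narrower pipe. That is only the theoretical ceiling though, which is why I'm asking what dominates in practice.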