In VEX it has to be done in two steps, whereas in OpenCL the same thing can be achieved in a single step using barriers [khronos.org].
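To illustrate the barrier idea in general terms, here is a minimal sketch that simulates it on the CPU with Python threads. It is not the actual kernel: the function name `smooth_one_pass`, the doubling in phase 1 and the neighbour average in phase 2 are all made up for illustration. Each thread stands in for one OpenCL work-item, and `threading.Barrier` plays the role of `barrier(CLK_LOCAL_MEM_FENCE)`. Keep in mind that a real OpenCL barrier only synchronizes the work-items within one work-group.

```python
import threading


def smooth_one_pass(values):
    """Hypothetical two-phase computation done in a single launch.

    Phase 1 writes an intermediate value per element; phase 2 reads the
    neighbours' intermediates. The barrier between them is what removes
    the need for a second, separate pass.
    """
    n = len(values)
    if n == 0:
        return []

    shared = [0.0] * n   # stands in for OpenCL local memory
    result = [0.0] * n
    barrier = threading.Barrier(n)  # one party per simulated work-item

    def work_item(i):
        # Phase 1: every work-item publishes its own intermediate value.
        shared[i] = values[i] * 2.0

        # Nobody continues until all work-items have finished phase 1,
        # analogous to barrier(CLK_LOCAL_MEM_FENCE) in an OpenCL kernel.
        barrier.wait()

        # Phase 2: it is now safe to read the neighbours' intermediates
        # (indices wrap around at both ends).
        left = shared[(i - 1) % n]
        right = shared[(i + 1) % n]
        result[i] = (left + shared[i] + right) / 3.0

    threads = [threading.Thread(target=work_item, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result


if __name__ == "__main__":
    print(smooth_one_pass([1.0, 2.0, 3.0, 4.0]))
```

Without the barrier, a work-item could read a neighbour's slot before that neighbour has written it, which is why the same logic without barriers has to be split into two passes: one that writes the intermediates and one that reads them.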
For an 822K-polygon model, the OpenCL version is over 500 times faster than VEX on a GTX 970 for 1000 iterations, and over 600 times faster for 10K iterations.
The timings were recorded separately, as there seems to be some overhead when running the tests one after another.
Special thanks to SESI.