There are a few things going on here that affect performance.
First, SMT does not double performance. You might get an extra 10% or so, because the two threads sharing a core can use it more fully, with work from one thread filling execution units the other leaves idle (maybe). The downside is that each thread also needs cache and memory bandwidth, so running twice as many threads leaves less cache per thread. That means more cache misses, and so more high-latency trips to main memory. And the more misses there are, the more memory bandwidth each thread uses, when you're already running twice as many threads.
Second, a threaded job is made of two parts: the threaded part (A) and the single-threaded part (B). So if you're looking at total render time, it's
total time = time(A)/#threads + time(B)
Each time the number of threads doubles, the time taken to do A halves. Assuming ideal scaling (more on that in a moment), at 32 threads A takes about 3% of its single-thread time, and at 64 about 1.5% (but not really, 'cause SMT isn't 2x faster). What happens is that the total time begins approaching time(B), the single-threaded stuff. You'll get a curve that drops quickly at first (it goes as 1/#threads) and then settles onto a plateau where adding threads doesn't really improve much anymore. This is known as Amdahl's law, and the only way to get around it is to optimize B as much as possible.
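To put numbers on the formula above, here is a minimal Python sketch. The function names and the 95 s / 5 s split are made up for illustration, not measurements from any real render, and it assumes ideal scaling of the threaded part.

```python
def total_time(parallel_time, serial_time, threads):
    """total time = time(A)/#threads + time(B), assuming ideal scaling of A."""
    if threads < 1:
        raise ValueError("threads must be >= 1")
    return parallel_time / threads + serial_time


def speedup(parallel_time, serial_time, threads):
    """How many times faster the job is than the same job on one thread."""
    one_thread = total_time(parallel_time, serial_time, 1)
    return one_thread / total_time(parallel_time, serial_time, threads)


if __name__ == "__main__":
    # Hypothetical job: 95 s of threaded work (A), 5 s of single-threaded work (B).
    a, b = 95.0, 5.0
    for n in (1, 2, 4, 8, 16, 32, 64):
        print(f"{n:3d} threads: {total_time(a, b, n):7.2f} s, "
              f"{speedup(a, b, n):5.2f}x faster")
```

However many threads you add, the speedup in this model can never exceed (A + B) / B, which is 20x for the made-up split above.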
The last problem is thread contention. Anything that needs temporary exclusive access to a resource can chip away at the speedup, because other threads have to wait for it. Even if you get rid of all exclusive resources in the code, there will still be waits at the system level. This problem also gets worse as the number of threads increases.
As a user, you can do a few things to reduce this. 1) Run the optimum number of threads for the job (this often needs a bit of testing to determine). 2) Run multiple mantra jobs with a lower thread count each, which reduces the performance loss from the third point (but still runs into the performance issues from the first). 3) Disable SMT.
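For the testing in option 1, a rough Python sketch along these lines could help. Everything here is hypothetical: `run` stands for whatever function you write to launch a test render with a given thread count, and the 5% tolerance is an arbitrary default.

```python
import time


def time_thread_counts(run, counts):
    """Call run(n) once per thread count and record wall-clock seconds."""
    timings = {}
    for n in counts:
        start = time.perf_counter()
        run(n)  # hypothetical: launches one test render using n threads
        timings[n] = time.perf_counter() - start
    return timings


def pick_thread_count(timings, tolerance=0.05):
    """Return the smallest thread count within `tolerance` of the best time."""
    best = min(timings.values())
    good_enough = [n for n, t in timings.items() if t <= best * (1.0 + tolerance)]
    return min(good_enough)
```

Picking the smallest count that is close to the best time, rather than the absolute fastest, leaves cores free for a second job, which ties in with option 2. A single run per thread count is noisy, so repeating each measurement a few times would give a more trustworthy answer.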
Hope that helps!