albinopapa wrote:
I've been looking (not extensively) for this info on AMD chips. Where did you find this? Intel has the Intrinsics Guide, which has a lot of this info, but I haven't found anything similar for AMD.

There you go, dude:
https://www.agner.org/optimize/instruction_tables.pdf

albinopapa wrote:
I personally never got any speed increase from loop unrolling. In most of my trials, the compiler had usually unrolled the loops already, or there wasn't enough work to be done between the loads/stores, or the loop was probably already memory-bandwidth limited. The prefetcher fetches 64 bytes (four SSE registers' worth) of an array as it is, so four single-iteration loops are already cached if you are processing arrays. This is probably the reason why unrolling usually doesn't do much of anything.

I use an instruction (VPCMPGTQ) that on Haswell has a latency of 5 and a throughput of 1 CPI. Back to back on the same dependency chain, two of them take 10 clocks, but by "unrolling" the AVX loop so that the two are independent, I can complete (in theory) 2 VPCMPGTQ instructions in 6 clocks, hence the speedup.
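
To make the latency argument concrete, here is a minimal sketch of the kind of unrolling I mean. It is not my actual loop, just an illustration: a signed 64-bit max reduction, which on AVX2 has to be built from VPCMPGTQ plus a blend because there is no packed 64-bit max instruction. The names `max_epi64`, `max_i64_avx2` and `max_i64` are made up for this example, and it assumes GCC or Clang on x86-64 (it relies on `__attribute__((target("avx2")))` and `__builtin_cpu_supports`). With a single accumulator every compare has to wait for the previous one; with two accumulators the two chains are independent, so the CPU can overlap them.

```cpp
#include <immintrin.h>
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: per-lane signed 64-bit max via VPCMPGTQ + blend.
__attribute__((target("avx2")))
static inline __m256i max_epi64(__m256i a, __m256i b) {
    __m256i gt = _mm256_cmpgt_epi64(a, b);   // VPCMPGTQ: all-ones where a > b
    return _mm256_blendv_epi8(b, a, gt);     // take a where the mask is set, else b
}

// Unrolled by two: acc0 and acc1 form two independent dependency chains,
// so the compare for acc1 can issue while the one for acc0 is still in flight.
__attribute__((target("avx2")))
static std::int64_t max_i64_avx2(const std::int64_t* data, std::size_t n) {
    __m256i acc0 = _mm256_set1_epi64x(INT64_MIN);
    __m256i acc1 = acc0;
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i x0 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i));
        __m256i x1 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i + 4));
        acc0 = max_epi64(acc0, x0);          // chain 0
        acc1 = max_epi64(acc1, x1);          // chain 1, independent of chain 0
    }
    // Merge the two chains, then reduce the four lanes in scalar code.
    __m256i acc = max_epi64(acc0, acc1);
    alignas(32) std::int64_t lanes[4];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), acc);
    std::int64_t best = std::max(std::max(lanes[0], lanes[1]),
                                 std::max(lanes[2], lanes[3]));
    for (; i < n; ++i) best = std::max(best, data[i]);   // leftover elements
    return best;
}

// Dispatcher: uses the AVX2 path when the CPU supports it, else plain scalar.
// Returns INT64_MIN for an empty range.
std::int64_t max_i64(const std::int64_t* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    if (__builtin_cpu_supports("avx2")) return max_i64_avx2(data, n);
    std::int64_t best = INT64_MIN;
    for (std::size_t i = 0; i < n; ++i) best = std::max(best, data[i]);
    return best;
}
```

Whether this actually wins depends on the loop: if the compiler already unrolls with separate accumulators, or the loop is memory-bound as albinopapa describes, the extra accumulator will not show up in the timings. It only helps when the loop-carried dependency through VPCMPGTQ is the bottleneck.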