我在SSE2和AVX中实现了4x4矩阵逆.两者都比普通实现更快.但是如果启用AVX(-mavx),则SSE2实现比手动AVX实现运行得更快.似乎编译器使我的SSE2实现与AVX更友好:(
在我的AVX实现中,乘法次数减少,添加次数减少......所以我希望AVX可以比SSE更快.也许有些像说明资讯_mm256_permute2f128_ps
,_mm256_permutevar_ps/_mm256_permute_ps
使得AVX慢?我不是要将SSE/XMM寄存器加载到AVX/YMM寄存器.
如何使我的AVX实现比SSE更快?
我的CPU:Intel(R)Core(TM)i7-3615QM CPU @ 2.30GHz(Ivy Bridge)
Plain with -O3 : 0.045853 secs SSE2 with -O3 : 0.026021 secs SSE2 with -O3 -mavx: 0.024336 secs AVX1 with -O3 -mavx: 0.031798 secs Updated (See bottom of question) all have -O3 -mavx flags: AVX1 (reduced div) : 0.027666 secs AVX1 (using rcp_ps) : 0.023205 secs SSE2 (using rcp_ps) : 0.021969 secs
初始矩阵:
Matrix (float4x4): |0.0714 -0.6589 0.7488 2.0000| |0.9446 0.2857 0.1613 4.0000| |-0.3202 0.6958 0.6429 6.0000| |0.0000 0.0000 0.0000 1.0000|
测试代码:
start = clock(); for (int i = 0; i <1000000; i++) { glm_mat4_inv_sse2(m, m); // glm_mat4_inv_avx(m, m); // glm_mat4_inv(m, m) } end = clock(); total = (float)(end - start) / CLOCKS_PER_SEC; printf("%f secs\n\n", total);
实现:
图书馆:http://github.com/recp/cglm
SSE Impl:https://gist.github.com/recp/690025c955c2e69a91e3a60a13768dee
AVX Impl:https://gist.github.com/recp/8ccc5ad0d19f5516de55f9bf7b5045b2
SSE2实现输出(使用godbolt;选项-O3):
glm_mat4_inv_sse2: movaps xmm8, XMMWORD PTR [rdi+32] movaps xmm2, XMMWORD PTR [rdi+16] movaps xmm5, XMMWORD PTR [rdi+48] movaps xmm6, XMMWORD PTR [rdi] movaps xmm4, xmm8 movaps xmm13, xmm8 movaps xmm11, xmm8 shufps xmm11, xmm2, 170 shufps xmm4, xmm5, 238 movaps xmm3, xmm11 movaps xmm1, xmm8 pshufd xmm12, xmm4, 127 shufps xmm13, xmm2, 255 movaps xmm0, xmm13 movaps xmm9, xmm8 pshufd xmm4, xmm4, 42 shufps xmm9, xmm2, 85 shufps xmm1, xmm5, 153 movaps xmm7, xmm9 mulps xmm0, xmm4 pshufd xmm10, xmm1, 42 movaps xmm1, xmm11 shufps xmm5, xmm8, 0 mulps xmm3, xmm12 pshufd xmm5, xmm5, 128 mulps xmm7, xmm12 mulps xmm1, xmm10 subps xmm3, xmm0 movaps xmm0, xmm13 mulps xmm0, xmm10 mulps xmm13, xmm5 subps xmm7, xmm0 movaps xmm0, xmm9 mulps xmm0, xmm4 subps xmm0, xmm1 movaps xmm1, xmm8 movaps xmm8, xmm11 shufps xmm1, xmm2, 0 mulps xmm8, xmm5 movaps xmm11, xmm7 mulps xmm4, xmm1 mulps xmm5, xmm9 movaps xmm9, xmm2 mulps xmm12, xmm1 shufps xmm9, xmm6, 85 pshufd xmm9, xmm9, 168 mulps xmm1, xmm10 movaps xmm10, xmm2 shufps xmm10, xmm6, 0 pshufd xmm10, xmm10, 168 subps xmm4, xmm8 mulps xmm7, xmm10 movaps xmm8, xmm2 shufps xmm2, xmm6, 255 shufps xmm8, xmm6, 170 pshufd xmm8, xmm8, 168 pshufd xmm2, xmm2, 168 mulps xmm11, xmm8 subps xmm12, xmm13 movaps xmm13, XMMWORD PTR .LC0[rip] subps xmm1, xmm5 movaps xmm5, xmm3 mulps xmm5, xmm9 mulps xmm3, xmm10 subps xmm5, xmm11 movaps xmm11, xmm0 mulps xmm11, xmm2 mulps xmm0, xmm10 addps xmm5, xmm11 movaps xmm11, xmm12 mulps xmm11, xmm8 mulps xmm12, xmm9 xorps xmm5, xmm13 subps xmm3, xmm11 movaps xmm11, xmm4 mulps xmm4, xmm9 subps xmm7, xmm12 mulps xmm11, xmm2 mulps xmm2, xmm1 mulps xmm1, xmm8 subps xmm0, xmm4 addps xmm3, xmm11 movaps xmm11, XMMWORD PTR .LC1[rip] addps xmm2, xmm7 addps xmm0, xmm1 movaps xmm1, xmm5 xorps xmm3, xmm11 xorps xmm2, xmm13 shufps xmm1, xmm3, 0 xorps xmm0, xmm11 movaps xmm4, xmm2 shufps xmm4, xmm0, 0 shufps xmm1, xmm4, 136 mulps xmm1, xmm6 pshufd xmm4, xmm1, 27 addps xmm1, xmm4 pshufd xmm4, xmm1, 65 addps xmm1, xmm4 movaps xmm4, XMMWORD PTR .LC2[rip] divps xmm4, xmm1 mulps xmm5, xmm4 mulps xmm3, xmm4 mulps xmm2, xmm4 mulps xmm0, xmm4 movaps XMMWORD PTR [rsi], xmm5 movaps XMMWORD PTR [rsi+16], xmm3 movaps XMMWORD PTR [rsi+32], xmm2 movaps XMMWORD PTR [rsi+48], xmm0 ret .LC0: .long 0 .long 2147483648 .long 0 .long 2147483648 .LC1: .long 2147483648 .long 0 .long 2147483648 .long 0 .LC2: .long 1065353216 .long 1065353216 .long 1065353216 .long 1065353216
SSE2实现(启用AVX)输出(使用godbolt;选项-O3 -mavx):
glm_mat4_inv_sse2: vmovaps xmm9, XMMWORD PTR [rdi+32] vmovaps xmm6, XMMWORD PTR [rdi+48] vmovaps xmm2, XMMWORD PTR [rdi+16] vmovaps xmm7, XMMWORD PTR [rdi] vshufps xmm5, xmm9, xmm6, 238 vpshufd xmm13, xmm5, 127 vpshufd xmm5, xmm5, 42 vshufps xmm1, xmm9, xmm6, 153 vshufps xmm11, xmm9, xmm2, 170 vshufps xmm12, xmm9, xmm2, 255 vmulps xmm3, xmm11, xmm13 vpshufd xmm1, xmm1, 42 vmulps xmm0, xmm12, xmm5 vshufps xmm10, xmm9, xmm2, 85 vshufps xmm6, xmm6, xmm9, 0 vpshufd xmm6, xmm6, 128 vmulps xmm8, xmm10, xmm13 vmulps xmm4, xmm10, xmm5 vsubps xmm3, xmm3, xmm0 vmulps xmm0, xmm12, xmm1 vsubps xmm8, xmm8, xmm0 vmulps xmm0, xmm11, xmm1 vsubps xmm4, xmm4, xmm0 vshufps xmm0, xmm9, xmm2, 0 vmulps xmm9, xmm12, xmm6 vmulps xmm13, xmm0, xmm13 vmulps xmm5, xmm0, xmm5 vmulps xmm0, xmm0, xmm1 vsubps xmm12, xmm13, xmm9 vmulps xmm9, xmm11, xmm6 vmovaps xmm13, XMMWORD PTR .LC0[rip] vmulps xmm6, xmm10, xmm6 vshufps xmm10, xmm2, xmm7, 85 vpshufd xmm10, xmm10, 168 vsubps xmm5, xmm5, xmm9 vshufps xmm9, xmm2, xmm7, 170 vpshufd xmm9, xmm9, 168 vsubps xmm1, xmm0, xmm6 vmulps xmm11, xmm8, xmm9 vshufps xmm0, xmm2, xmm7, 0 vshufps xmm2, xmm2, xmm7, 255 vmulps xmm6, xmm3, xmm10 vpshufd xmm2, xmm2, 168 vpshufd xmm0, xmm0, 168 vmulps xmm3, xmm3, xmm0 vmulps xmm8, xmm8, xmm0 vmulps xmm0, xmm4, xmm0 vsubps xmm6, xmm6, xmm11 vmulps xmm11, xmm4, xmm2 vaddps xmm6, xmm6, xmm11 vmulps xmm11, xmm12, xmm9 vmulps xmm12, xmm12, xmm10 vxorps xmm6, xmm6, xmm13 vsubps xmm3, xmm3, xmm11 vmulps xmm11, xmm5, xmm2 vmulps xmm5, xmm5, xmm10 vsubps xmm8, xmm8, xmm12 vmulps xmm2, xmm1, xmm2 vmulps xmm1, xmm1, xmm9 vaddps xmm3, xmm3, xmm11 vmovaps xmm11, XMMWORD PTR .LC1[rip] vsubps xmm0, xmm0, xmm5 vaddps xmm2, xmm8, xmm2 vxorps xmm3, xmm3, xmm11 vaddps xmm0, xmm0, xmm1 vshufps xmm1, xmm6, xmm3, 0 vxorps xmm2, xmm2, xmm13 vxorps xmm0, xmm0, xmm11 vshufps xmm4, xmm2, xmm0, 0 vshufps xmm1, xmm1, xmm4, 136 vmulps xmm1, xmm1, xmm7 vpshufd xmm4, xmm1, 27 vaddps xmm1, xmm1, xmm4 vpshufd xmm4, xmm1, 65 vaddps xmm1, xmm1, xmm4 vmovaps xmm4, XMMWORD PTR .LC2[rip] vdivps xmm1, xmm4, xmm1 vmulps xmm6, xmm6, xmm1 vmulps xmm3, xmm3, xmm1 vmulps xmm2, xmm2, xmm1 vmulps xmm1, xmm0, xmm1 vmovaps XMMWORD PTR [rsi], xmm6 vmovaps XMMWORD PTR [rsi+16], xmm3 vmovaps XMMWORD PTR [rsi+32], xmm2 vmovaps XMMWORD PTR [rsi+48], xmm1 ret .LC0: .long 0 .long 2147483648 .long 0 .long 2147483648 .LC1: .long 2147483648 .long 0 .long 2147483648 .long 0 .LC2: .long 1065353216 .long 1065353216 .long 1065353216 .long 1065353216
AVX实现输出(使用godbolt;选项-O3 -mavx):
glm_mat4_inv_avx: vmovaps ymm3, YMMWORD PTR [rdi] vmovaps ymm1, YMMWORD PTR [rdi+32] vmovdqa ymm2, YMMWORD PTR .LC1[rip] vmovdqa ymm0, YMMWORD PTR .LC0[rip] vperm2f128 ymm6, ymm3, ymm3, 3 vperm2f128 ymm5, ymm1, ymm1, 0 vperm2f128 ymm1, ymm1, ymm1, 17 vmovdqa ymm10, YMMWORD PTR .LC4[rip] vpermilps ymm9, ymm5, ymm0 vpermilps ymm7, ymm1, ymm2 vperm2f128 ymm8, ymm6, ymm6, 0 vpermilps ymm1, ymm1, ymm0 vpermilps ymm5, ymm5, ymm2 vpermilps ymm0, ymm8, ymm0 vmulps ymm4, ymm7, ymm9 vpermilps ymm8, ymm8, ymm2 vpermilps ymm11, ymm6, 1 vmulps ymm2, ymm5, ymm1 vmulps ymm7, ymm0, ymm7 vmulps ymm1, ymm8, ymm1 vmulps ymm0, ymm0, ymm5 vmulps ymm5, ymm8, ymm9 vmovdqa ymm9, YMMWORD PTR .LC3[rip] vmovdqa ymm8, YMMWORD PTR .LC2[rip] vsubps ymm4, ymm4, ymm2 vsubps ymm7, ymm7, ymm1 vperm2f128 ymm2, ymm4, ymm4, 0 vperm2f128 ymm4, ymm4, ymm4, 17 vshufps ymm1, ymm2, ymm4, 77 vpermilps ymm1, ymm1, ymm9 vsubps ymm5, ymm0, ymm5 vpermilps ymm0, ymm2, ymm8 vmulps ymm0, ymm0, ymm11 vperm2f128 ymm1, ymm1, ymm2, 0 vshufps ymm2, ymm2, ymm4, 74 vpermilps ymm4, ymm6, 90 vmulps ymm1, ymm1, ymm4 vpermilps ymm2, ymm2, ymm10 vpermilps ymm6, ymm6, 191 vmovaps ymm11, YMMWORD PTR .LC5[rip] vperm2f128 ymm2, ymm2, ymm2, 0 vperm2f128 ymm4, ymm3, ymm3, 0 vpermilps ymm12, ymm4, YMMWORD PTR .LC7[rip] vmulps ymm2, ymm2, ymm6 vinsertf128 ymm6, ymm7, xmm5, 1 vperm2f128 ymm5, ymm7, ymm5, 49 vshufps ymm7, ymm6, ymm5, 77 vpermilps ymm9, ymm7, ymm9 vsubps ymm0, ymm0, ymm1 vpermilps ymm1, ymm4, YMMWORD PTR .LC6[rip] vpermilps ymm4, ymm4, YMMWORD PTR .LC8[rip] vaddps ymm2, ymm0, ymm2 vpermilps ymm0, ymm6, ymm8 vshufps ymm6, ymm6, ymm5, 74 vpermilps ymm6, ymm6, ymm10 vmulps ymm1, ymm1, ymm0 vmulps ymm0, ymm12, ymm9 vmulps ymm6, ymm4, ymm6 vxorps ymm2, ymm2, ymm11 vdpps ymm3, ymm3, ymm2, 255 vsubps ymm0, ymm1, ymm0 vdivps ymm2, ymm2, ymm3 vaddps ymm0, ymm0, ymm6 vxorps ymm0, ymm0, ymm11 vdivps ymm0, ymm0, ymm3 vperm2f128 ymm5, ymm2, ymm2, 3 vshufps ymm1, ymm2, ymm5, 68 vshufps ymm2, ymm2, ymm5, 238 vperm2f128 ymm4, ymm0, ymm0, 3 vshufps ymm6, ymm0, ymm4, 68 vshufps ymm0, ymm0, ymm4, 238 vshufps ymm3, ymm1, ymm6, 136 vshufps ymm1, ymm1, ymm6, 221 vinsertf128 ymm1, ymm3, xmm1, 1 vshufps ymm3, ymm2, ymm0, 136 vshufps ymm0, ymm2, ymm0, 221 vinsertf128 ymm0, ymm3, xmm0, 1 vmovaps YMMWORD PTR [rsi], ymm1 vmovaps YMMWORD PTR [rsi+32], ymm0 vzeroupper ret .LC0: .long 2 .long 1 .long 1 .long 0 .long 0 .long 0 .long 0 .long 0 .LC1: .long 3 .long 3 .long 2 .long 3 .long 2 .long 1 .long 1 .long 1 .LC2: .long 0 .long 0 .long 1 .long 2 .long 0 .long 0 .long 1 .long 2 .LC3: .long 0 .long 1 .long 1 .long 2 .long 0 .long 1 .long 1 .long 2 .LC4: .long 0 .long 2 .long 3 .long 3 .long 0 .long 2 .long 3 .long 3 .LC5: .long 0 .long 2147483648 .long 0 .long 2147483648 .long 2147483648 .long 0 .long 2147483648 .long 0 .LC6: .long 1 .long 0 .long 0 .long 0 .long 1 .long 0 .long 0 .long 0 .LC7: .long 2 .long 2 .long 1 .long 1 .long 2 .long 2 .long 1 .long 1 .LC8: .long 3 .long 3 .long 3 .long 2 .long 3 .long 3 .long 3 .long 2
编辑:
我在macOS上使用Xcode(版本10.0(10A255))(在MacBook Pro(2012年中期),Retina,15')上使用-O3优化选项构建和运行测试.它用clang编译测试代码.我在godbolt中使用GCC 8.2来查看asm(对不起),但是程序集输出看起来很相似.
我通过启用cglm选项启用了shuffd:CGLM_USE_INT_DOMAIN.我忘了在查看asm时禁用它.
#ifdef CGLM_USE_INT_DOMAIN # define glmm_shuff1(xmm, z, y, x, w) \ _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(xmm), \ _MM_SHUFFLE(z, y, x, w))) #else # define glmm_shuff1(xmm, z, y, x, w) \ _mm_shuffle_ps(xmm, xmm, _MM_SHUFFLE(z, y, x, w)) #endif
整个测试代码(标题除外):
#include#include #include int main(int argc, const char * argv[]) { CGLM_ALIGN(32) mat4 m = GLM_MAT4_IDENTITY_INIT; double start, end, total; /* generate invertible matrix */ glm_translate(m, (vec3){1,2,3}); glm_rotate(m, M_PI_2, (vec3){1,2,3}); glm_translate(m, (vec3){1,2,3}); glm_mat4_print(m, stderr); start = clock(); for (int i = 0; i <1000000; i++) { glm_mat4_inv_sse2(m, m); // glm_mat4_inv_avx(m, m); // glm_mat4_inv(m, m); } end = clock(); total = (float)(end - start) / CLOCKS_PER_SEC; printf("%f secs\n\n", total); glm_mat4_print(m, stderr); }
编辑2:
我通过使用乘法减少了一个除法(1 set_ps + 1 div_ps + 2 mul_ps似乎优于2 div_ps):
旧版:
r1 = _mm256_div_ps(r1, y4); r2 = _mm256_div_ps(r2, y4);
新版本(SSE2版本使用了这样的部门):
y5 = _mm256_div_ps(_mm256_set1_ps(1.0f), y4); r1 = _mm256_mul_ps(r1, y5); r2 = _mm256_mul_ps(r2, y5);
新版本(快速版):
y5 = _mm256_rcp_ps(y4); r1 = _mm256_mul_ps(r1, y5); r2 = _mm256_mul_ps(r2, y5);
现在它比以前更好,但仍然不比常春藤桥CPU上的SSE快.我更新了测试结果.
您的CPU是Intel IvyBridge.
Sandybridge/IvyBridge具有每时钟1个mul并在不同端口上增加吞吐量,因此它们不会相互竞争.
但是,对于256位shuffle和所有FP shuffle(甚至128位shufps
),每个时钟只有1个随机播放吞吐量. 但是,对于整数shuffle,它具有每时钟2个吞吐量,并且我注意到您的编译器pshufd
用作FP指令之间的复制和混洗. 在为SSE2编译时,这是一个可靠的胜利,特别是在VEX编码不可用的情况下(因此它movaps
通过替换movaps xmm0, xmm1
/ shufps xmm0, xmm0, 65
或其他来节省一个.)即使AVX可用,你的编译器也会这样做,所以它可以使用vshufps xmm0, xmm1,xmm1, 65
,但它要么巧妙地选择vpshufd
微架构的原因,或者它很幸运,或者它的启发式/指令成本模型的设计考虑到了这一点.(我怀疑它是铿锵的,但你没有在问题中说出来或显示你编译的C源代码).
在Haswell及更高版本(支持AVX2,因此支持每个整数shuffle的256位版本)中,所有shuffle只能在端口5上运行.但在仅支持AVX1的IvB中,它只有FP shuffle,最高可达256位.整数shuffle总是只有128位,并且可以在端口1或端口5上运行,因为这两个端口上都有128位的shuffle执行单元.(https://agner.org/optimize/)
我没有详细查看asm,因为它很长,但是如果使用更宽的向量来节省添加/倍数会花费更多的时间,那就会更慢.
除了因为你所有的shuffle都变成了FP shuffle所以它们只能在端口5上运行,而不是利用端口1.我怀疑它有很多改组,它是一个瓶颈,而不是端口0(FP乘法)或端口1(FP add) .
BTW,Haswell和后来有两个FMA单元,每个单元在p0和p1上,因此乘法吞吐量是其两倍.Skylake以及后来在这些FMA单元上运行FP add,因此它们每个时钟吞吐量都有2个.(如果您可以有效地使用实际的FMA指令,您可以完成两倍的工作.)
此外,您的基准测试是测试延迟,而不是吞吐量,因为m
输入和输出也是如此. 但是,可能有足够的指令级并行性来解决混乱吞吐量的瓶颈问题.
在IvB上,车道交叉改组vperm2f128
并且vinsertf128
具有2个周期延迟,而对于仅具有单周期延迟的车道内改组(包括所有128位改组).英特尔的指南声称有一个不同的数字,IIRC,但是2个周期是Agner Fog在依赖链中实际发现的实际测量结果.(这可能是1个周期+某种旁路延迟).在Haswell和之后,交叉点改组是3个周期的延迟. 为什么英特尔公布的一些Haswell AVX延迟比Sandy Bridge慢3倍?
还有关:AVX512中的128位跨通道操作能提供更好的性能吗? 你有时可以通过在一个有用的点切成128位半部分的未对齐负载来减少混洗量,然后使用内部随机播放.这对于AVX1可能有用,因为它缺少vpermps
粒度小于128位的其他通道混洗.