Author: 普林 | Source: Internet | 2023-05-19 06:46
I have an i5-4250U, which has AVX2 and FMA3. I am testing some dense matrix multiplication code that I wrote, using GCC 4.8.1 on Linux. Below are the three different ways I compile it.
SSE2: gcc matrix.cpp -o matrix_gcc -O3 -msse2 -fopenmp
AVX: gcc matrix.cpp -o matrix_gcc -O3 -mavx -fopenmp
AVX2+FMA: gcc matrix.cpp -o matrix_gcc -O3 -march=native -fopenmp -ffast-math
The SSE2 and AVX versions clearly differ in performance. However, the AVX2+FMA version is no better than the AVX version, and I don't understand why. I get over 80% of the CPU's peak flops assuming there is no FMA, but I think I should be able to do a lot better with FMA. Matrix multiplication should benefit directly from FMA. I'm essentially doing eight dot products at once in AVX. When I check what -march=native enables, it gives:
cc -march=native -E -v - </dev/null 2>&1 | grep cc1 | grep fma
...-march=core-avx2 -mavx -mavx2 -mfma -mno-fma4 -msse4.2 -msse4.1 ...
So I can see it's enabled (just to be sure I added -mfma, but it makes no difference). -ffast-math should allow a relaxed floating-point model (see "How to use Fused Multiply-Add (FMA) instructions with SSE/AVX").
Edit:
Based on Mysticial's comments I went ahead and used _mm256_fmadd_ps and now the AVX2+FMA version is faster. I'm not sure why the compiler won't do this for me. I'm now getting about 80 GFLOPS (110% of the peak flops without FMA) for over 1000x1000 matrices. In case anyone does not trust my peak flop calculation here is what I did.
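For illustration, here is a minimal sketch of what calling _mm256_fmadd_ps explicitly can look like. The function name fma_axpy and its signature are hypothetical and this is not the kernel from the post. It assumes GCC or Clang, takes the intrinsic path only when the build enables FMA (for example -mfma or -march=native on an FMA-capable CPU), and otherwise falls back to std::fma.

```cpp
#include <cmath>
#include <cstddef>
#if defined(__AVX__) && defined(__FMA__)
#include <immintrin.h>
#endif

// Hypothetical helper (not the author's kernel): c[i] = a * b[i] + c[i],
// with every multiply-add rounded once (fused).
void fma_axpy(float a, const float* b, float* c, std::size_t n) {
    std::size_t i = 0;
#if defined(__AVX__) && defined(__FMA__)
    // FMA enabled at compile time: handle eight floats per iteration.
    const __m256 va = _mm256_set1_ps(a);
    for (; i + 8 <= n; i += 8) {
        __m256 vb = _mm256_loadu_ps(b + i);  // unaligned loads keep the sketch simple
        __m256 vc = _mm256_loadu_ps(c + i);
        vc = _mm256_fmadd_ps(va, vb, vc);    // va * vb + vc in one instruction
        _mm256_storeu_ps(c + i, vc);
    }
#endif
    // Scalar tail, or the whole range when FMA is not enabled in the build.
    for (; i < n; ++i) {
        c[i] = std::fma(a, b[i], c[i]);
    }
}
```

Both paths use a fused multiply-add, so they produce the same values; only the throughput differs. A GEMM inner loop would call something like this once per element of a row of A, accumulating into a row of C.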
peak flops (no FMA) = frequency * simd_width * ILP * cores
                    = 2.3 GHz * 8 * 2 * 2 = 73.6 GFLOPS
peak flops (with FMA) = 2 * peak flops (no FMA) = 147.2 GFLOPS
My CPU in turbo mode when using both cores is 2.3 GHz. I get 2 for ILP because Ivy Bridge can do one AVX multiplication and one AVX addition at the same time (and I have unrolled the loop several times to ensure this).
I'm only getting about 55% of the peak flops (with FMA). I'm not sure why, but at least I'm seeing something now.
One side effect is that I now get a small error when I compare against a simple matrix multiplication algorithm that I trust. I think that's because FMA rounds only once instead of the usual twice (which ironically deviates from the strict IEEE result of a separate multiply and add, even though it's probably more accurate).
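The single-rounding effect can be reproduced in isolation. The sketch below is not from the post and uses hypothetical function names: it compares a separately rounded multiply-then-add with std::fma. The volatile temporary is there only to stop the compiler from contracting the two operations into one FMA.

```cpp
#include <cmath>

// Two roundings: the product is rounded to double, then the sum is rounded.
double mul_then_add(double a, double b, double c) {
    volatile double product = a * b;  // force the rounded product to be materialized
    return product + c;
}

// One rounding: a * b + c is computed exactly and rounded once.
double fused(double a, double b, double c) {
    return std::fma(a, b, c);
}
```

Take a = 1 + 2^-27, b = 1 - 2^-27 and c = -1. The exact product is 1 - 2^-54, which rounds to 1.0 in double precision, so mul_then_add returns 0 while fused returns -2^-54. Differences of this size, accumulated over a long dot product, are consistent with the small mismatch described above.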
Edit:
Somebody needs to redo "How do I achieve the theoretical maximum of 4 FLOPs per cycle?" but for 8 double-precision FLOPs per cycle on Haswell.
Edit:
Actually, Mysticial has updated his project to support FMA3 (see his answer in the link above). I ran his code on Windows 8 with MSVC 2012 (because the Linux version did not compile with FMA support). Here are the results.
Testing AVX Mul + Add:
Seconds = 22.7417
FP Ops = 768000000000
FLOPs = 3.37705e+010
sum = 17.8122
Testing FMA3 FMA:
Seconds = 22.1389
FP Ops = 1536000000000
FLOPs = 6.938e+010
sum = 333.309
That's 69.38 GFLOPS for FMA3 in double precision. For single precision I need to double it, so that's 138.76 SP GFLOPS. My calculated peak is 147.2 SP GFLOPS (2.3 GHz * 8 * 2 * 2, doubled for FMA), so that's about 94% of the peak! In other words, I should be able to improve my GEMM code quite a bit (although it's already quite a bit faster than Eigen).
2 Answers