
FMA3 in GCC: how to enable


I have an i5-4250U which has AVX2 and FMA3. I am testing some dense matrix multiplication code which I wrote, with GCC 4.8.1 on Linux. Below are the three different ways I compile it.


SSE2:     gcc matrix.cpp -o matrix_gcc -O3 -msse2 -fopenmp
AVX:      gcc matrix.cpp -o matrix_gcc -O3 -mavx  -fopenmp
AVX2+FMA: gcc matrix.cpp -o matrix_gcc -O3 -march=native -fopenmp -ffast-math

The SSE2 and AVX versions are clearly different in performance. However, the AVX2+FMA version is no better than the AVX version. I don't understand this. I get over 80% of the peak flops of the CPU assuming there is no FMA, but I think I should be able to do a lot better with FMA. Matrix multiplication should benefit directly from FMA: I'm essentially doing eight dot products at once in AVX. When I check -march=native it gives:


gcc -march=native -E -v - </dev/null 2>&1 | grep cc1 | grep fma
...-march=core-avx2 -mavx -mavx2 -mfma -mno-fma4 -msse4.2 -msse4.1 ...

So I can see it's enabled (just to be sure, I added -mfma, but it makes no difference). -ffast-math should allow a relaxed floating point model (see How to use Fused Multiply-Add (FMA) instructions with SSE/AVX).

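One direct way to confirm whether FMA instructions actually end up in the generated code is to look at the assembly (file names here are just an example):

gcc -O3 -march=native -ffast-math -S matrix.cpp -o matrix.s
grep -c vfmadd matrix.s    # 0 means no FMA instructions were emitted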

Edit:


Based on Mysticial's comments, I went ahead and used _mm256_fmadd_ps, and now the AVX2+FMA version is faster. I'm not sure why the compiler won't do this for me. I'm now getting about 80 GFLOPS (about 110% of the peak flops without FMA) for matrices larger than 1000x1000. In case anyone does not trust my peak flop calculation, here is what I did.


peak flops (no FMA)   = frequency * simd_width * ILP * cores
                      = 2.3 GHz   * 8          * 2   * 2     =  73.6 GFLOPS
peak flops (with FMA) = 2 * peak flops (no FMA)              = 147.2 GFLOPS

My CPU runs at 2.3 GHz in turbo mode when both cores are in use. I count 2 for ILP because Haswell can do one AVX multiplication and one AVX addition at the same time (and I have unrolled the loop several times to ensure this).


I'm only getting about 55% of the peak flops with FMA (80 of 147.2 GFLOPS). I'm not sure why, but at least I'm seeing something now.

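The kernel itself is not shown in the question, but the change described above looks roughly like this (a minimal sketch with made-up names, not the actual code):

#include <immintrin.h>

// Before: separate multiply and add, as in the plain AVX version.
static inline __m256 madd_avx(__m256 a, __m256 b, __m256 acc) {
    return _mm256_add_ps(_mm256_mul_ps(a, b), acc);
}

// After: one fused multiply-add instruction (requires -mfma).
static inline __m256 madd_fma(__m256 a, __m256 b, __m256 acc) {
    return _mm256_fmadd_ps(a, b, acc);
}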

One side effect is that I now see a small error when I compare the result to a simple matrix multiplication algorithm I know I can trust. I think that's because FMA performs only one rounding instead of what would normally be two (which, ironically, breaks IEEE floating point rules even though it's probably more accurate).

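A tiny illustration of the one-rounding effect (not the question's code; build with contraction disabled, e.g. -ffp-contract=off, so a*b + c really takes two roundings):

#include <cmath>
#include <cstdio>

int main() {
    double e = 1.0 / 134217728.0;   // 2^-27
    double a = 1.0 + e, b = 1.0 + e;
    double c = -(1.0 + 2.0 * e);    // -(1 + 2^-26)
    // Exact product: 1 + 2^-26 + 2^-54. The 2^-54 term is lost when
    // a*b is rounded to double before the separate addition.
    printf("mul+add: %g\n", a * b + c);          // prints 0
    printf("fma:     %g\n", std::fma(a, b, c));  // prints 2^-54 ~ 5.55e-17
    return 0;
}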

Edit:


Somebody needs to redo "How do I achieve the theoretical maximum of 4 FLOPs per cycle?", but doing 8 double floating point FLOPs per cycle with Haswell.

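For reference, a sketch of the usual shape of such a peak-FLOPS kernel on Haswell (not Mysticial's actual code): two FMA ports times 5-cycle latency means roughly 10 independent accumulator chains are needed to keep both ports busy every cycle.

#include <immintrin.h>

float peak_fma_kernel(long long iters) {
    __m256 a = _mm256_set1_ps(1.0000001f);
    __m256 b = _mm256_set1_ps(-0.0000001f);
    __m256 acc[10];
    for (int i = 0; i < 10; ++i)
        acc[i] = _mm256_set1_ps(1.0f + i);
    for (long long n = 0; n < iters; ++n)
        // 10 independent FMAs per iteration: 10 * 8 lanes * 2 = 160 SP FLOPs.
        for (int i = 0; i < 10; ++i)
            acc[i] = _mm256_fmadd_ps(a, acc[i], b);
    // Reduce the accumulators so the work cannot be optimized away.
    __m256 s = acc[0];
    for (int i = 1; i < 10; ++i)
        s = _mm256_add_ps(s, acc[i]);
    float out[8];
    _mm256_storeu_ps(out, s);
    float total = 0.0f;
    for (int i = 0; i < 8; ++i) total += out[i];
    return total;
}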

Edit:


Actually, Mysticial has updated his project to support FMA3 (see his answer in the link above). I ran his code on Windows 8 with MSVC 2012 (because the Linux version did not compile with FMA support). Here are the results.


Testing AVX Mul + Add:
Seconds = 22.7417
FP Ops  = 768000000000
FLOPs   = 3.37705e+010
sum = 17.8122

Testing FMA3 FMA:
Seconds = 22.1389
FP Ops  = 1536000000000
FLOPs   = 6.938e+010
sum = 333.309

That's 69.38 GFLOPS of FMA3 throughput at double precision. For single precision I can double it, which gives 138.76 SP GFLOPS. I calculated my peak as 147.2 SP GFLOPS, so that's 94% of the peak! In other words, I should be able to improve my GEMM code quite a bit (although it's already quite a bit faster than Eigen).


2 Answers

#1 (6 votes)

Only answering a very small part of the question here. If you write _mm256_add_ps(_mm256_mul_ps(areg0,breg0), tmp0), gcc-4.9 handles it almost like inline asm and does not optimize it much. If you replace it with areg0*breg0+tmp0, a syntax that is supported by both gcc and clang, then gcc starts optimizing and may use FMA if available. I improved that for gcc-5, _mm256_add_ps for instance is now implemented as an inline function that simply uses +, so the code with intrinsics can be optimized as well.

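A minimal illustration of the two spellings (hypothetical function names):

#include <immintrin.h>

// Intrinsic form: gcc-4.9 treats this almost like inline asm and
// will not contract it to an FMA.
__m256 madd_intrinsic(__m256 a, __m256 b, __m256 c) {
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);
}

// Operator form (a gcc/clang extension on vector types): the compiler
// sees an ordinary multiply-add expression and can emit vfmadd213ps
// when built with -mfma.
__m256 madd_operators(__m256 a, __m256 b, __m256 c) {
    return a * b + c;
}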

#2 (3 votes)

The following compiler options are now sufficient to contract _mm256_add_ps(_mm256_mul_ps(a, b), c) to a single fma instruction (e.g. vfmadd213ps):


GCC 5.3:   -O2 -mavx2 -mfma
Clang 3.7: -O1 -mavx2 -mfma -ffp-contract=fast
ICC 13:    -O1 -march=core-avx2
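A minimal test case to try those flag sets on (hypothetical function name); with contraction enabled it should compile to a single vfmadd213ps rather than vmulps + vaddps (check the assembly with, e.g., gcc -O2 -mavx2 -mfma -S):

#include <immintrin.h>

__m256 mul_add(__m256 a, __m256 b, __m256 c) {
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);
}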

I tried /O2 /arch:AVX2 /fp:fast with MSVC, but it still does not contract (surprise, surprise). MSVC will contract scalar operations, though.

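For example, a scalar version along these lines should get contracted by MSVC (e.g. with /O2 /fp:fast /arch:AVX2), even though the packed intrinsic form above is not:

double scalar_mul_add(double a, double b, double c) {
    return a * b + c;
}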

GCC has been doing this since at least GCC 5.1.


