
FMA3 in GCC: how to enable


I have an i5-4250U which has AVX2 and FMA3. I am testing some dense matrix multiplication code which I wrote, with GCC 4.8.1 on Linux. Below are the three different ways I compile it.


SSE2:     gcc matrix.cpp -o matrix_gcc -O3 -msse2 -fopenmp
AVX:      gcc matrix.cpp -o matrix_gcc -O3 -mavx  -fopenmp
AVX2+FMA: gcc matrix.cpp -o matrix_gcc -O3 -march=native -fopenmp -ffast-math

The SSE2 and AVX versions are clearly different in performance. However, the AVX2+FMA version is no better than the AVX version. I don't understand this. I get over 80% of the peak flops of the CPU assuming there is no FMA, but I think I should be able to do a lot better with FMA. Matrix multiplication should benefit directly from FMA: I'm essentially doing eight dot products at once in AVX. When I check -march=native it gives:


gcc -march=native -E -v - </dev/null 2>&1 | grep cc1 | grep fma
...-march=core-avx2 -mavx -mavx2 -mfma -mno-fma4 -msse4.2 -msse4.1 ...

So I can see it's enabled (just to be sure, I added -mfma, but it makes no difference). -ffast-math should allow a relaxed floating point model (see How to use Fused Multiply-Add (FMA) instructions with SSE/AVX).

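One direct way to confirm whether FMA instructions actually end up in the generated code is to look at the assembly (file names here are just an example):

gcc -O3 -march=native -ffast-math -S matrix.cpp -o matrix.s
grep -c vfmadd matrix.s    # 0 means no FMA instructions were emitted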

Edit:


Based on Mysticial's comments, I went ahead and used _mm256_fmadd_ps, and now the AVX2+FMA version is faster. I'm not sure why the compiler won't do this for me. I'm now getting about 80 GFLOPS (about 110% of the peak flops without FMA) for matrices larger than 1000x1000. In case anyone does not trust my peak flop calculation, here is what I did.


peak flops (no FMA)   = frequency * simd_width * ILP * cores
                      = 2.3 GHz   * 8          * 2   * 2     =  73.6 GFLOPS
peak flops (with FMA) = 2 * peak flops (no FMA)              = 147.2 GFLOPS

My CPU runs at 2.3 GHz in turbo mode when both cores are in use. I count 2 for ILP because Haswell can do one AVX multiplication and one AVX addition at the same time (and I have unrolled the loop several times to ensure this).


I'm only getting about 55% of the peak flops with FMA (80 of 147.2 GFLOPS). I'm not sure why, but at least I'm seeing something now.

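The kernel itself is not shown in the question, but the change described above looks roughly like this (a minimal sketch with made-up names, not the actual code):

#include <immintrin.h>

// Before: separate multiply and add, as in the plain AVX version.
static inline __m256 madd_avx(__m256 a, __m256 b, __m256 acc) {
    return _mm256_add_ps(_mm256_mul_ps(a, b), acc);
}

// After: one fused multiply-add instruction (requires -mfma).
static inline __m256 madd_fma(__m256 a, __m256 b, __m256 acc) {
    return _mm256_fmadd_ps(a, b, acc);
}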

One side effect is that I now see a small error when I compare the result to a simple matrix multiplication algorithm I know I can trust. I think that's because FMA performs only one rounding instead of what would normally be two (which, ironically, breaks IEEE floating point rules even though it's probably more accurate).

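A tiny illustration of the one-rounding effect (not the question's code; build with contraction disabled, e.g. -ffp-contract=off, so a*b + c really takes two roundings):

#include <cmath>
#include <cstdio>

int main() {
    double e = 1.0 / 134217728.0;   // 2^-27
    double a = 1.0 + e, b = 1.0 + e;
    double c = -(1.0 + 2.0 * e);    // -(1 + 2^-26)
    // Exact product: 1 + 2^-26 + 2^-54. The 2^-54 term is lost when
    // a*b is rounded to double before the separate addition.
    printf("mul+add: %g\n", a * b + c);          // prints 0
    printf("fma:     %g\n", std::fma(a, b, c));  // prints 2^-54 ~ 5.55e-17
    return 0;
}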

Edit:


Somebody needs to redo "How do I achieve the theoretical maximum of 4 FLOPs per cycle?", but doing 8 double floating point FLOPs per cycle with Haswell.

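For reference, a sketch of the usual shape of such a peak-FLOPS kernel on Haswell (not Mysticial's actual code): two FMA ports times 5-cycle latency means roughly 10 independent accumulator chains are needed to keep both ports busy every cycle.

#include <immintrin.h>

float peak_fma_kernel(long long iters) {
    __m256 a = _mm256_set1_ps(1.0000001f);
    __m256 b = _mm256_set1_ps(-0.0000001f);
    __m256 acc[10];
    for (int i = 0; i < 10; ++i)
        acc[i] = _mm256_set1_ps(1.0f + i);
    for (long long n = 0; n < iters; ++n)
        // 10 independent FMAs per iteration: 10 * 8 lanes * 2 = 160 SP FLOPs.
        for (int i = 0; i < 10; ++i)
            acc[i] = _mm256_fmadd_ps(a, acc[i], b);
    // Reduce the accumulators so the work cannot be optimized away.
    __m256 s = acc[0];
    for (int i = 1; i < 10; ++i)
        s = _mm256_add_ps(s, acc[i]);
    float out[8];
    _mm256_storeu_ps(out, s);
    float total = 0.0f;
    for (int i = 0; i < 8; ++i) total += out[i];
    return total;
}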

Edit:


Actually, Mysticial has updated his project to support FMA3 (see his answer in the link above). I ran his code on Windows 8 with MSVC 2012 (because the Linux version did not compile with FMA support). Here are the results.


Testing AVX Mul + Add:
Seconds = 22.7417
FP Ops  = 768000000000
FLOPs   = 3.37705e+010
sum = 17.8122

Testing FMA3 FMA:
Seconds = 22.1389
FP Ops  = 1536000000000
FLOPs   = 6.938e+010
sum = 333.309

That's 69.38 GFLOPS of FMA3 throughput at double precision. For single precision I can double it, which gives 138.76 SP GFLOPS. I calculated my peak as 147.2 SP GFLOPS, so that's 94% of the peak! In other words, I should be able to improve my GEMM code quite a bit (although it's already quite a bit faster than Eigen).


2 Answers

#1 (6 votes)

Only answering a very small part of the question here. If you write _mm256_add_ps(_mm256_mul_ps(areg0,breg0), tmp0), gcc-4.9 handles it almost like inline asm and does not optimize it much. If you replace it with areg0*breg0+tmp0, a syntax that is supported by both gcc and clang, then gcc starts optimizing and may use FMA if available. I improved that for gcc-5, _mm256_add_ps for instance is now implemented as an inline function that simply uses +, so the code with intrinsics can be optimized as well.

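A minimal illustration of the two spellings (hypothetical function names):

#include <immintrin.h>

// Intrinsic form: gcc-4.9 treats this almost like inline asm and
// will not contract it to an FMA.
__m256 madd_intrinsic(__m256 a, __m256 b, __m256 c) {
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);
}

// Operator form (a gcc/clang extension on vector types): the compiler
// sees an ordinary multiply-add expression and can emit vfmadd213ps
// when built with -mfma.
__m256 madd_operators(__m256 a, __m256 b, __m256 c) {
    return a * b + c;
}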

#2 (3 votes)

The following compiler options are now sufficient to contract _mm256_add_ps(_mm256_mul_ps(a, b), c) to a single fma instruction (e.g. vfmadd213ps):


GCC 5.3:   -O2 -mavx2 -mfma
Clang 3.7: -O1 -mavx2 -mfma -ffp-contract=fast
ICC 13:    -O1 -march=core-avx2
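A minimal test case to try those flag sets on (hypothetical function name); with contraction enabled it should compile to a single vfmadd213ps rather than vmulps + vaddps (check the assembly with, e.g., gcc -O2 -mavx2 -mfma -S):

#include <immintrin.h>

__m256 mul_add(__m256 a, __m256 b, __m256 c) {
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);
}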

I tried /O2 /arch:AVX2 /fp:fast with MSVC, but it still does not contract (surprise, surprise). MSVC will contract scalar operations, though.

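For example, a scalar version along these lines should get contracted by MSVC (e.g. with /O2 /fp:fast /arch:AVX2), even though the packed intrinsic form above is not:

double scalar_mul_add(double a, double b, double c) {
    return a * b + c;
}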

GCC has been doing this since at least GCC 5.1.


