pthread mutex vs atomic ops in Solaris

I was doing some tests with a simple program that measures the performance of a simple atomic increment on a 64-bit value, comparing atomic_add_64 with a mutex lock approach. What puzzles me is that atomic_add_64 is slower than the mutex lock by a factor of 2.

EDIT!!! I've done some more testing. Looks like atomics are faster than mutex and scale up to 8 concurrent threads. After that the performance of atomics degrades significantly.

The platform I've tested is:

SunOS 5.10 Generic_141444-09 sun4u sparc SUNW,Sun-Fire-V490

CC: Sun C++ 5.9 SunOS_sparc Patch 124863-03 2008/03/12

The program is quite simple:

#include <stdio.h>
#include <stdint.h>
#include <pthread.h>
#include <atomic.h>

uint64_t        g_Loops = 1000000;
volatile uint64_t       g_Counter = 0;
volatile uint32_t       g_Threads = 20;

pthread_mutex_t g_Mutex;
pthread_mutex_t g_CondMutex;
pthread_cond_t  g_Condition;

void LockMutex() 
{ 
  pthread_mutex_lock(&g_Mutex); 
}

void UnlockMutex() 
{ 
   pthread_mutex_unlock(&g_Mutex); 
}

void InitCond()
{
   pthread_mutex_init(&g_CondMutex, 0);
   pthread_cond_init(&g_Condition, 0);
}

void SignalThreadEnded()
{
   pthread_mutex_lock(&g_CondMutex);
   --g_Threads;
   pthread_cond_signal(&g_Condition);
   pthread_mutex_unlock(&g_CondMutex);
}

void* ThreadFuncMutex(void* arg)
{
   uint64_t counter = g_Loops;
   while(counter--)
   {
      LockMutex();
      ++g_Counter;
      UnlockMutex();
   }
   SignalThreadEnded();
   return 0;
}

void* ThreadFuncAtomic(void* arg)
{
   uint64_t counter = g_Loops;
   while(counter--)
   {
      atomic_add_64(&g_Counter, 1);
   }
   SignalThreadEnded();
   return 0;
}


int main(int argc, char** argv)
{
   pthread_mutex_init(&g_Mutex, 0);
   InitCond();
   bool bMutexRun = true;
   if(argc > 1)
   {
      bMutexRun = false;
      printf("Atomic run!\n");
   }
   else
        printf("Mutex run!\n");

   // start threads
   uint32_t threads = g_Threads;
   while(threads--)
   {
      pthread_t thr;
      if(bMutexRun)
         pthread_create(&thr, 0,ThreadFuncMutex, 0);
      else
         pthread_create(&thr, 0,ThreadFuncAtomic, 0);
   }
   pthread_mutex_lock(&g_CondMutex);
   while(g_Threads)
   {
      pthread_cond_wait(&g_Condition, &g_CondMutex);
      printf("Threads to go %d\n", g_Threads);
   }
   printf("DONE! g_Counter=%ld\n", (long)g_Counter);
}

The results of a test run on our box are:

$ CC -o atomictest atomictest.C
$ time ./atomictest
Mutex run!
Threads to go 19
...
Threads to go 0
DONE! g_Counter=20000000

real    0m15.684s
user    0m52.748s
sys     0m0.396s

$ time ./atomictest 1
Atomic run!
Threads to go 19
...
Threads to go 0
DONE! g_Counter=20000000

real    0m24.442s
user    3m14.496s
sys     0m0.068s

Did you run into this type of performance difference on Solaris? Any ideas why this happens?

On Linux the same code (using the gcc __sync_fetch_and_add) yields a 5-fold performance improvement over the mutex version.
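
For readers without a Solaris box, the Linux variant presumably differs only in the increment call. A minimal sketch, assuming GCC or Clang; the helper name add_n_atomic is made up for illustration and does not appear in the original program:

```c
#include <stdint.h>

/* Hypothetical helper: performs `loops` atomic increments on *counter,
 * mirroring the loop body of ThreadFuncAtomic. */
static void add_n_atomic(volatile uint64_t *counter, uint64_t loops)
{
    while (loops--) {
        /* GCC/Clang builtin standing in for Solaris atomic_add_64(counter, 1) */
        __sync_fetch_and_add(counter, 1);
    }
}
```

Each thread would call this once with g_Loops as the iteration count.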

Thanks, Octav

1 answer

#1

You have to be careful about what is happening here.

1) It takes significant time to create a thread. Thus, it's likely that not all the threads are executing simultaneously. As evidence, I took your code, removed the mutex lock, and got the correct answer every time I ran it. This means that none of the threads were executing at the same time! You should not count the time to create/destroy threads in your test. You should wait until all threads are created and running before you start the test.

2) Your test isn't fair. It has artificially high lock contention. For whatever reason, the atomic add_and_fetch suffers in that situation. In real life, you would do some work in the thread. Once you add even a little bit of work, the atomic ops perform a lot better, because the chance of two threads hitting the counter at the same moment drops significantly. When there is no contention, the atomic op has lower overhead than the mutex.

3) # of threads. The fewer threads running, the lower the contention. This is why fewer threads do better for the atomic in this test. Your 8 thread number might be the number of simultaneous threads your system supports. Then again, it might not be, because your test was so skewed towards contention. It would seem to me that your test would scale up to the number of simultaneous threads allowed and then plateau. One thing I cannot figure out is why, when the # of threads gets higher than the number of simultaneous threads the system can handle, we don't see evidence of the situation where the mutex is left locked while the thread sleeps. Maybe we do, I just can't see it happening.
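
The advice in point 1, holding every thread at a gate until all of them exist, could be sketched with a POSIX barrier. This is only an illustration under assumptions: the names start_gate, worker, run_gated and the constants N_THREADS and N_LOOPS are invented here, and a barrier is a different mechanism from the spin flag used in the listing further down.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdint.h>

#define N_THREADS 4        /* illustrative values, not from the post */
#define N_LOOPS   100000

static pthread_barrier_t start_gate;
static volatile uint64_t shared_counter;

static void *worker(void *arg)
{
    (void)arg;
    /* Blocks until all N_THREADS workers have been created and arrived;
     * a per-thread timer would be started right after this returns. */
    pthread_barrier_wait(&start_gate);
    for (int i = 0; i < N_LOOPS; i++)
        __sync_fetch_and_add(&shared_counter, 1);
    return NULL;
}

/* Creates the workers, waits for them, and returns the final counter. */
static uint64_t run_gated(void)
{
    pthread_t tid[N_THREADS];
    shared_counter = 0;
    pthread_barrier_init(&start_gate, NULL, N_THREADS);
    for (int i = 0; i < N_THREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < N_THREADS; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&start_gate);
    return shared_counter;
}
```

With the gate in place, thread creation cost stays outside whatever is measured inside worker.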

Bottom line, the atomics are a lot faster in most real life situations. They are not very good when you have to hold a lock for a long time...something you should avoid anyway (well in my opinion at least!)
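
The uncontended case from point 2 can be looked at in isolation by running both variants on a single thread. A rough sketch, assuming GCC or Clang on a POSIX system; the function names are made up for illustration, and the timings it produces depend entirely on the machine:

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdint.h>
#include <time.h>

static uint64_t now_ns(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return 1000000000ULL * (uint64_t)t.tv_sec + (uint64_t)t.tv_nsec;
}

/* One thread only, so nothing ever contends. Returns elapsed nanoseconds
 * and stores the final counter value in *out. */
static uint64_t time_uncontended_atomic(uint64_t loops, uint64_t *out)
{
    volatile uint64_t counter = 0;
    uint64_t t0 = now_ns();
    for (uint64_t i = 0; i < loops; i++)
        __sync_add_and_fetch(&counter, 1);
    uint64_t elapsed = now_ns() - t0;
    *out = counter;
    return elapsed;
}

/* Same loop, but each increment is wrapped in lock/unlock. */
static uint64_t time_uncontended_mutex(uint64_t loops, uint64_t *out)
{
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    volatile uint64_t counter = 0;
    uint64_t t0 = now_ns();
    for (uint64_t i = 0; i < loops; i++) {
        pthread_mutex_lock(&m);
        ++counter;
        pthread_mutex_unlock(&m);
    }
    uint64_t elapsed = now_ns() - t0;
    *out = counter;
    pthread_mutex_destroy(&m);
    return elapsed;
}
```

Comparing the two return values for the same loop count gives a feel for the per-operation overhead when no other thread is involved.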

I changed your code so you can test with no work, barely any work, and a little more work as well as change the # of threads.

6sm = 6 threads, barely any work, mutex
6s = 6 threads, barely any work, atomic

Use a capital S to get more work, and no s to get no work.
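
The argument format can be decoded the same way main does it in the listing further down: atoi reads the leading number and strchr looks for the letters. A small sketch; the struct run_opts and the function parse_run_arg are hypothetical names used only for illustration:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the decoded options. */
struct run_opts {
    int threads;   /* leading number, e.g. 6 in "6sm" */
    int work;      /* 0 = no work, 1 = 's' (barely any), 2 = 'S' (more) */
    int use_mutex; /* 1 if 'm' is present, otherwise the atomic run */
};

static struct run_opts parse_run_arg(const char *arg)
{
    struct run_opts o;
    o.threads = atoi(arg);  /* atoi stops at the first non-digit */
    o.work = strchr(arg, 's') ? 1 : (strchr(arg, 'S') ? 2 : 0);
    o.use_mutex = strchr(arg, 'm') != NULL;
    return o;
}
```

So "10S" means 10 threads, the heavier work loop, atomic increments, and appending m switches the same run to the mutex.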

These results show that with 10 threads, the amount of work affects how much faster atomics are. In the first case, there is no work, and the atomics are barely faster (about 1.6 s). Add a little work and the gap grows to about 6 s; add a lot of work and it reaches about 9 s.

(2) /dev_tools/Users/c698174/temp/atomic 
[c698174@shldvgfas007] $ t=10; a.out $t ; a.out "$t"m
ATOMIC FAST g_Counter=10000000 13.6520 s
MUTEX  FAST g_Counter=10000000 15.2760 s

(2) /dev_tools/Users/c698174/temp/atomic 
[c698174@shldvgfas007] $ t=10s; a.out $t ; a.out "$t"m
ATOMIC slow g_Counter=10000000 11.4957 s
MUTEX  slow g_Counter=10000000 17.9419 s

(2) /dev_tools/Users/c698174/temp/atomic 
[c698174@shldvgfas007] $ t=10S; a.out $t ; a.out "$t"m
ATOMIC SLOW g_Counter=10000000 14.7108 s
MUTEX  SLOW g_Counter=10000000 23.8762 s

With 20 threads, atomics are still better, but by a smaller margin. With no work, they are almost the same speed. With a lot of work, atomics take the lead again.

(2) /dev_tools/Users/c698174/temp/atomic 
[c698174@shldvgfas007] $ t=20; a.out $t ; a.out "$t"m
ATOMIC FAST g_Counter=20000000 27.6267 s
MUTEX  FAST g_Counter=20000000 30.5569 s

(2) /dev_tools/Users/c698174/temp/atomic 
[c698174@shldvgfas007] $ t=20S; a.out $t ; a.out "$t"m
ATOMIC SLOW g_Counter=20000000 35.3514 s
MUTEX  SLOW g_Counter=20000000 48.7594 s

2 threads. Atomics dominate.

(2) /dev_tools/Users/c698174/temp/atomic 
[c698174@shldvgfas007] $ t=2S; a.out $t ; a.out "$t"m
ATOMIC SLOW g_Counter=2000000 0.6007 s
MUTEX  SLOW g_Counter=2000000 1.4966 s
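
To compare the runs on one scale, the timings quoted above can be turned into mutex/atomic ratios. A small sketch; the helper names slowdown and print_slowdowns are made up, and the numbers are copied from the output above:

```c
#include <stdio.h>

/* How many times longer the mutex run took than the atomic run. */
static double slowdown(double mutex_s, double atomic_s)
{
    return mutex_s / atomic_s;
}

/* Prints the ratio for each pair of timings quoted above. */
static void print_slowdowns(void)
{
    static const struct { const char *run; double atomic_s, mutex_s; } r[] = {
        { "10 threads, no work", 13.6520, 15.2760 },
        { "10 threads, s",       11.4957, 17.9419 },
        { "10 threads, S",       14.7108, 23.8762 },
        { "20 threads, no work", 27.6267, 30.5569 },
        { "20 threads, S",       35.3514, 48.7594 },
        { "2 threads, S",         0.6007,  1.4966 },
    };
    for (size_t i = 0; i < sizeof r / sizeof r[0]; i++)
        printf("%-20s mutex/atomic = %.2f\n", r[i].run,
               slowdown(r[i].mutex_s, r[i].atomic_s));
}
```

The ratios range from roughly 1.1 (no work, 10 or 20 threads) to roughly 2.5 (2 threads with the heavier work loop).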

Here is the code (Red Hat Linux, using gcc atomics):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <pthread.h>
#include <time.h>

volatile uint64_t __attribute__((aligned (64))) g_Loops = 1000000 ;
volatile uint64_t __attribute__((aligned (64))) g_Counter = 0;
volatile uint32_t __attribute__((aligned (64))) g_Threads = 7; 
volatile uint32_t __attribute__((aligned (64))) g_Active = 0;
volatile uint32_t __attribute__((aligned (64))) g_fGo = 0;
int g_fSlow = 0;

#define true 1
#define false 0
#define NANOSEC(t) (1000000000ULL * (t).tv_sec + (t).tv_nsec)

pthread_mutex_t g_Mutex;
pthread_mutex_t g_CondMutex;
pthread_cond_t  g_Condition;

void LockMutex() 
{ 
  pthread_mutex_lock(&g_Mutex); 
}

void UnlockMutex() 
{ 
   pthread_mutex_unlock(&g_Mutex); 
}

void Start(struct timespec *pT)
{
   int cActive = __sync_add_and_fetch(&g_Active, 1);
   while(!g_fGo) {} 
   clock_gettime(CLOCK_THREAD_CPUTIME_ID, pT);
}

uint64_t End(struct timespec *pT)
{
   struct timespec T;
   int cActive = __sync_sub_and_fetch(&g_Active, 1);
   clock_gettime(CLOCK_THREAD_CPUTIME_ID, &T);
   return NANOSEC(T) - NANOSEC(*pT);
}
void Work(double *x, double z)
{
      *x += z;
      *x /= 27.6;
      if ((uint64_t)(*x + .5) - (uint64_t)*x != 0)
        *x += .7;
}
void* ThreadFuncMutex(void* arg)
{
   struct timespec T;
   uint64_t counter = g_Loops;
   double x = 0, z = 0;
   int fSlow = g_fSlow;

   Start(&T);
   if (!fSlow) {
     while(counter--) {
        LockMutex();
        ++g_Counter;
        UnlockMutex();
     }
   } else {
     while(counter--) {
        if (fSlow==2) Work(&x, z);
        LockMutex();
        ++g_Counter;
        z = g_Counter;
        UnlockMutex();
     }
   }
   *(uint64_t*)arg = End(&T);
   return (void*)(int)x;
}

void* ThreadFuncAtomic(void* arg)
{
   struct timespec T;
   uint64_t counter = g_Loops;
   double x = 0, z = 0;
   int fSlow = g_fSlow;

   Start(&T);
   if (!fSlow) {
     while(counter--) {
        __sync_add_and_fetch(&g_Counter, 1);
     }
   } else {
     while(counter--) {
        if (fSlow==2) Work(&x, z);
        z = __sync_add_and_fetch(&g_Counter, 1);
     }
   }
   *(uint64_t*)arg = End(&T);
   return (void*)(int)x;
}


int main(int argc, char** argv)
{
   int i;
   int bMutexRun = strchr(argv[1], 'm') != NULL;
   pthread_t thr[1000];
   uint64_t aT[1000];
   g_Threads = atoi(argv[1]);
   g_fSlow = (strchr(argv[1], 's') != NULL) ? 1 : ((strchr(argv[1], 'S') != NULL) ? 2 : 0);

   // start threads
   pthread_mutex_init(&g_Mutex, 0);
   for (i=0 ; i
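
The listing breaks off in the middle of the thread-start loop in this copy, and the rest of main is not recoverable from it. What follows is not the author's code. It is a separate, self-contained sketch of how a harness in the same spirit might launch the threads, release them together through a go flag, and sum the per-thread CPU times. All names prefixed sk_, the constant SKETCH_LOOPS, the 64-thread cap, and the use of the __atomic load/store builtins are assumptions made for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdint.h>
#include <time.h>

#define SKETCH_LOOPS 100000ULL   /* assumed per-thread iteration count */

static volatile uint64_t sk_counter;
static volatile uint32_t sk_active, sk_go;

static uint64_t sk_ns(struct timespec t)
{
    return 1000000000ULL * (uint64_t)t.tv_sec + (uint64_t)t.tv_nsec;
}

/* Each worker checks in, spins until released, then times its own loop. */
static void *sk_worker(void *arg)
{
    struct timespec t0, t1;
    __sync_add_and_fetch(&sk_active, 1);
    while (!__atomic_load_n(&sk_go, __ATOMIC_ACQUIRE)) { }
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t0);
    for (uint64_t i = 0; i < SKETCH_LOOPS; i++)
        __sync_add_and_fetch(&sk_counter, 1);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t1);
    *(uint64_t *)arg = sk_ns(t1) - sk_ns(t0);
    return NULL;
}

/* Launches n workers (n <= 64), releases them together, joins them.
 * Returns the final counter; *cpu_ns receives the summed CPU time. */
static uint64_t sk_run(int n, uint64_t *cpu_ns)
{
    pthread_t thr[64];
    uint64_t per_thread[64];
    sk_counter = 0;
    sk_active = 0;
    sk_go = 0;
    for (int i = 0; i < n; i++)
        pthread_create(&thr[i], NULL, sk_worker, &per_thread[i]);
    /* Wait until every worker has checked in, then open the gate. */
    while (__atomic_load_n(&sk_active, __ATOMIC_ACQUIRE) < (uint32_t)n) { }
    __atomic_store_n(&sk_go, 1, __ATOMIC_RELEASE);
    *cpu_ns = 0;
    for (int i = 0; i < n; i++) {
        pthread_join(thr[i], NULL);
        *cpu_ns += per_thread[i];
    }
    return sk_counter;
}
```

A mutex variant would replace the increment in sk_worker with a lock, a plain ++, and an unlock, as ThreadFuncMutex does above.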
