I was doing some tests with a simple program measuring the performance of a simple atomic increment on a 64-bit value, using atomic_add_64 vs. a mutex lock approach. What puzzles me is that atomic_add is slower than the mutex lock by a factor of 2.
EDIT!!! I've done some more testing. It looks like atomics are faster than the mutex and scale up to 8 concurrent threads. After that, the performance of atomics degrades significantly.
The platform I've tested is:
SunOS 5.10 Generic_141444-09 sun4u sparc SUNW,Sun-Fire-V490
CC: Sun C++ 5.9 SunOS_sparc Patch 124863-03 2008/03/12
The program is quite simple:
#include <stdio.h>
#include <stdint.h>
#include <pthread.h>
#include <atomic.h>
uint64_t g_Loops = 1000000;
volatile uint64_t g_Counter = 0;
volatile uint32_t g_Threads = 20;
pthread_mutex_t g_Mutex;
pthread_mutex_t g_CondMutex;
pthread_cond_t g_Condition;
void LockMutex()
{
pthread_mutex_lock(&g_Mutex);
}
void UnlockMutex()
{
pthread_mutex_unlock(&g_Mutex);
}
void InitCond()
{
pthread_mutex_init(&g_CondMutex, 0);
pthread_cond_init(&g_Condition, 0);
}
void SignalThreadEnded()
{
pthread_mutex_lock(&g_CondMutex);
--g_Threads;
pthread_cond_signal(&g_Condition);
pthread_mutex_unlock(&g_CondMutex);
}
void* ThreadFuncMutex(void* arg)
{
uint64_t counter = g_Loops;
while(counter--)
{
LockMutex();
++g_Counter;
UnlockMutex();
}
SignalThreadEnded();
return 0;
}
void* ThreadFuncAtomic(void* arg)
{
uint64_t counter = g_Loops;
while(counter--)
{
atomic_add_64(&g_Counter, 1);
}
SignalThreadEnded();
return 0;
}
int main(int argc, char** argv)
{
pthread_mutex_init(&g_Mutex, 0);
InitCond();
bool bMutexRun = true;
if(argc > 1)
{
bMutexRun = false;
printf("Atomic run!\n");
}
else
printf("Mutex run!\n");
// start threads
uint32_t threads = g_Threads;
while(threads--)
{
pthread_t thr;
if(bMutexRun)
pthread_create(&thr, 0,ThreadFuncMutex, 0);
else
pthread_create(&thr, 0,ThreadFuncAtomic, 0);
}
pthread_mutex_lock(&g_CondMutex);
while(g_Threads)
{
pthread_cond_wait(&g_Condition, &g_CondMutex);
printf("Threads to go %d\n", g_Threads);
}
printf("DONE! g_Counter=%ld\n", (long)g_Counter);
}
The results of a test run on our box are:
$ CC -o atomictest atomictest.C
$ time ./atomictest
Mutex run!
Threads to go 19
...
Threads to go 0
DONE! g_Counter=20000000
real 0m15.684s
user 0m52.748s
sys 0m0.396s
$ time ./atomictest 1
Atomic run!
Threads to go 19
...
Threads to go 0
DONE! g_Counter=20000000
real 0m24.442s
user 3m14.496s
sys 0m0.068s
Did you run into this type of performance difference on Solaris? Any ideas why this happens?
On Linux the same code (using the gcc __sync_fetch_and_add builtin) yields a 5-fold performance improvement over the mutex version.
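Roughly, the only change on the Linux side is the atomic thread function; a sketch (not the exact code I ran, just the gcc builtin it maps to):

void* ThreadFuncAtomic(void* arg)
{
uint64_t counter = g_Loops;
while(counter--)
{
// gcc builtin: atomically adds 1 and returns the old value (ignored here)
__sync_fetch_and_add(&g_Counter, 1);
}
SignalThreadEnded();
return 0;
}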
Thanks, Octav
You have to be careful about what is happening here.
1) It takes significant time to create a thread. Thus, it's likely that not all the threads are executing simultaneously. As evidence, I took your code, removed the mutex lock, and got the correct answer every time I ran it. This means that none of the threads were executing at the same time! You should not count the time to create/destroy threads in your test. You should wait till all threads are created and running before you start the test (see the barrier sketch after these points).
2) Your test isn't fair. Your test has artificially high lock contention. For whatever reason, the atomic add_and_fetch suffers in that situation. In real life, you would do some work in the thread. Once you add even a little bit of work, the atomic ops perform a lot better, because the chance of contending on the counter has dropped significantly. When there is no contention, the atomic op has lower overhead than the mutex.
3) Number of threads. The fewer threads running, the lower the contention. This is why fewer threads do better for the atomic in this test. Your 8-thread number might be the number of simultaneous threads your system supports, or it might not be, because your test was so skewed towards contention. It would seem to me that your test would scale to the number of simultaneous threads allowed and then plateau. One thing I cannot figure out is why, when the number of threads gets higher than the number of simultaneous threads the system can handle, we don't see evidence of the situation where the mutex is left locked while the thread sleeps. Maybe we do, I just can't see it happening.
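For point 1, a simple way to make sure you only time the increment loop is to gate every thread on a barrier before it starts counting. A minimal sketch using your mutex thread function (my code below spins on a g_fGo flag instead, but the idea is the same):

pthread_barrier_t g_Barrier;   // in main, before creating threads: pthread_barrier_init(&g_Barrier, 0, g_Threads);

void* ThreadFuncMutex(void* arg)
{
uint64_t counter = g_Loops;
// no thread proceeds until all g_Threads threads have been created and reached this point
pthread_barrier_wait(&g_Barrier);
while(counter--)
{
LockMutex();
++g_Counter;
UnlockMutex();
}
SignalThreadEnded();
return 0;
}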
Bottom line, the atomics are a lot faster in most real-life situations. They are not very good when you have to hold a lock for a long time... something you should avoid anyway (well, in my opinion at least!).
I changed your code so you can test with no work, barely any work, and a little more work as well as change the # of threads.
6sm = 6 threads, barely any work, mutex
6s = 6 threads, barely any work, atomic
Use a capital S to get more work, and no s to get no work.
These results show that with 10 threads, the amount of work affects how much faster atomics are. In the first case, there is no work, and the atomics are barely faster. Add a little work and the gap grows to about 6 sec; with a lot of work it almost reaches 10 sec.
(2) /dev_tools/Users/c698174/temp/atomic
[c698174@shldvgfas007] $ t=10; a.out $t ; a.out "$t"m
ATOMIC FAST g_Counter=10000000 13.6520 s
MUTEX FAST g_Counter=10000000 15.2760 s
(2) /dev_tools/Users/c698174/temp/atomic
[c698174@shldvgfas007] $ t=10s; a.out $t ; a.out "$t"m
ATOMIC slow g_Counter=10000000 11.4957 s
MUTEX slow g_Counter=10000000 17.9419 s
(2) /dev_tools/Users/c698174/temp/atomic
[c698174@shldvgfas007] $ t=10S; a.out $t ; a.out "$t"m
ATOMIC SLOW g_Counter=10000000 14.7108 s
MUTEX SLOW g_Counter=10000000 23.8762 s
20 threads, atomics still better, but by a smaller margin. No work, they are almost the same speed. With a lot of work, atomics take the lead again.
(2) /dev_tools/Users/c698174/temp/atomic
[c698174@shldvgfas007] $ t=20; a.out $t ; a.out "$t"m
ATOMIC FAST g_Counter=20000000 27.6267 s
MUTEX FAST g_Counter=20000000 30.5569 s
(2) /dev_tools/Users/c698174/temp/atomic
[c698174@shldvgfas007] $ t=20S; a.out $t ; a.out "$t"m
ATOMIC SLOW g_Counter=20000000 35.3514 s
MUTEX SLOW g_Counter=20000000 48.7594 s
2 threads. Atomics dominate.
(2) /dev_tools/Users/c698174/temp/atomic
[c698174@shldvgfas007] $ t=2S; a.out $t ; a.out "$t"m
ATOMIC SLOW g_Counter=2000000 0.6007 s
MUTEX SLOW g_Counter=2000000 1.4966 s
Here is the code (redhat linux, using gcc atomics):
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>
#include <time.h>
volatile uint64_t __attribute__((aligned (64))) g_Loops = 1000000 ;
volatile uint64_t __attribute__((aligned (64))) g_Counter = 0;
volatile uint32_t __attribute__((aligned (64))) g_Threads = 7;
volatile uint32_t __attribute__((aligned (64))) g_Active = 0;
volatile uint32_t __attribute__((aligned (64))) g_fGo = 0;
int g_fSlow = 0;
#define true 1
#define false 0
#define NANOSEC(t) (1000000000ULL * (t).tv_sec + (t).tv_nsec)
pthread_mutex_t g_Mutex;
pthread_mutex_t g_CondMutex;
pthread_cond_t g_Condition;
void LockMutex()
{
pthread_mutex_lock(&g_Mutex);
}
void UnlockMutex()
{
pthread_mutex_unlock(&g_Mutex);
}
void Start(struct timespec *pT)
{
int cActive = __sync_add_and_fetch(&g_Active, 1);
while(!g_fGo) {}
clock_gettime(CLOCK_THREAD_CPUTIME_ID, pT);
}
uint64_t End(struct timespec *pT)
{
struct timespec T;
int cActive = __sync_sub_and_fetch(&g_Active, 1);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &T);
return NANOSEC(T) - NANOSEC(*pT);
}
void Work(double *x, double z)
{
*x += z;
*x /= 27.6;
if ((uint64_t)(*x + .5) - (uint64_t)*x != 0)
*x += .7;
}
void* ThreadFuncMutex(void* arg)
{
struct timespec T;
uint64_t counter = g_Loops;
double x = 0, z = 0;
int fSlow = g_fSlow;
Start(&T);
if (!fSlow) {
while(counter--) {
LockMutex();
++g_Counter;
UnlockMutex();
}
} else {
while(counter--) {
if (fSlow==2) Work(&x, z);
LockMutex();
++g_Counter;
z = g_Counter;
UnlockMutex();
}
}
*(uint64_t*)arg = End(&T);
return (void*)(intptr_t)x;
}
void* ThreadFuncAtomic(void* arg)
{
struct timespec T;
uint64_t counter = g_Loops;
double x = 0, z = 0;
int fSlow = g_fSlow;
Start(&T);
if (!fSlow) {
while(counter--) {
__sync_add_and_fetch(&g_Counter, 1);
}
} else {
while(counter--) {
if (fSlow==2) Work(&x, z);
z = __sync_add_and_fetch(&g_Counter, 1);
}
}
*(uint64_t*)arg = End(&T);
return (void*)(intptr_t)x;
}
int main(int argc, char** argv)
{
int i;
int bMutexRun = strchr(argv[1], 'm') != NULL;
pthread_t thr[1000];
uint64_t aT[1000];
g_Threads = atoi(argv[1]);
g_fSlow = (strchr(argv[1], 's') != NULL) ? 1 : ((strchr(argv[1], 'S') != NULL) ? 2 : 0);
// start threads
pthread_mutex_init(&g_Mutex, 0);
for (i=0 ; i