Pthreads Parallel Programming: A Performance Comparison of Spin Locks and Mutexes

2010-12-01 11:46

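POSIX threads (Pthreads for short) is a widely used API for parallel programming on multi-core platforms. Thread synchronization is an essential means of coordination between threads in a parallel program, and its most typical use is protecting a critical section shared by several threads with one of the locks that Pthreads provides (another common synchronization mechanism is the barrier).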

Pthreads provides several lock-related synchronization primitives:
(1) Mutex: pthread_mutex_***
(2) Spin lock: pthread_spin_***
(3) Condition variable: pthread_cond_***
(4) Read/write lock: pthread_rwlock_***

The main Pthreads API calls for operating on a mutex are:
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
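To make the three calls concrete, here is a minimal sketch that protects a shared counter with a mutex. The names counter, counter_lock, increment and try_increment are made up for illustration and do not come from the test programs in this article.

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical shared state, protected by counter_lock. */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;

/* Blocking update: waits (sleeping if necessary) until the lock is free. */
long increment(void)
{
    pthread_mutex_lock(&counter_lock);
    long value = ++counter;              /* critical section */
    pthread_mutex_unlock(&counter_lock);
    return value;
}

/* Non-blocking update: returns 0 on success, or the error code from
 * pthread_mutex_trylock (EBUSY when the lock is already held). */
int try_increment(void)
{
    int rc = pthread_mutex_trylock(&counter_lock);
    if (rc != 0)
        return rc;
    ++counter;                           /* critical section */
    pthread_mutex_unlock(&counter_lock);
    return 0;
}
```

pthread_mutex_trylock is the call to reach for when a thread has something useful to do instead of waiting; otherwise pthread_mutex_lock is the simpler choice.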

The main Pthreads API calls for operating on a spin lock are:
int pthread_spin_lock(pthread_spinlock_t *lock);
int pthread_spin_trylock(pthread_spinlock_t *lock);
int pthread_spin_unlock(pthread_spinlock_t *lock);
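The spin lock calls are used in much the same way, except that a spin lock has no static initializer and must be set up with pthread_spin_init. The sketch below is only an illustration; hits, hits_lock and the helper functions are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

/* Hypothetical shared state, protected by hits_lock. */
static pthread_spinlock_t hits_lock;
static long hits = 0;

/* Must be called once before the lock is used. */
int hits_init(void)
{
    return pthread_spin_init(&hits_lock, PTHREAD_PROCESS_PRIVATE);
}

/* Busy-waits until the lock is free, then updates the counter. */
long record_hit(void)
{
    pthread_spin_lock(&hits_lock);
    long value = ++hits;                 /* keep this section very short */
    pthread_spin_unlock(&hits_lock);
    return value;
}

/* Single attempt: returns 0 on success, EBUSY if the lock is held. */
int try_record_hit(void)
{
    int rc = pthread_spin_trylock(&hits_lock);
    if (rc != 0)
        return rc;
    ++hits;
    pthread_spin_unlock(&hits_lock);
    return 0;
}

void hits_destroy(void)
{
    pthread_spin_destroy(&hits_lock);
}
```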

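In terms of implementation, a mutex is a sleep-waiting lock. Take a dual-core machine with two threads, A and B, running on Core0 and Core1 respectively. Suppose thread A calls pthread_mutex_lock to acquire the lock of a critical section while thread B is holding it. Thread A is then blocked: Core0 performs a context switch and puts thread A on a wait queue, so Core0 is free to run other work (for example another thread C) instead of busy-waiting. A spin lock behaves differently. It is a busy-waiting lock: if thread A requests the lock with pthread_spin_lock, it keeps spinning on Core0, retrying the lock request over and over until it finally gets the lock.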
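The difference between sleep-waiting and busy-waiting can be observed directly by measuring how much CPU time a waiting thread consumes. The following is a rough sketch, not part of the original benchmarks: the main thread holds a lock for a fixed time while a second thread waits for it and records its own CPU usage via CLOCK_THREAD_CPUTIME_ID. All names (waiter_cpu_ms, demo_mutex, demo_spin and so on) are hypothetical, and sleeping while holding a spin lock is done here purely for demonstration; it is not something to copy into real code.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <time.h>

static pthread_mutex_t demo_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_spinlock_t demo_spin;

struct waiter {
    int use_spin;     /* 1: wait on the spin lock, 0: wait on the mutex */
    double cpu_ms;    /* CPU time the waiter consumed while waiting */
};

/* CPU time consumed so far by the calling thread, in milliseconds. */
static double thread_cpu_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

/* Waiter thread: acquires and releases the lock, timing its own CPU use. */
static void *waiter_main(void *arg)
{
    struct waiter *w = arg;
    double start = thread_cpu_ms();

    if (w->use_spin) {
        pthread_spin_lock(&demo_spin);      /* busy-waits */
        pthread_spin_unlock(&demo_spin);
    } else {
        pthread_mutex_lock(&demo_mutex);    /* sleeps in the kernel */
        pthread_mutex_unlock(&demo_mutex);
    }
    w->cpu_ms = thread_cpu_ms() - start;
    return NULL;
}

/* Holds the chosen lock for hold_ms milliseconds while a second thread
 * waits for it; returns the CPU milliseconds the waiter consumed. */
double waiter_cpu_ms(int use_spin, long hold_ms)
{
    struct waiter w = { use_spin, 0.0 };
    struct timespec hold = { hold_ms / 1000, (hold_ms % 1000) * 1000000L };
    pthread_t tid;

    if (use_spin) {
        pthread_spin_init(&demo_spin, PTHREAD_PROCESS_PRIVATE);
        pthread_spin_lock(&demo_spin);
    } else {
        pthread_mutex_lock(&demo_mutex);
    }

    pthread_create(&tid, NULL, waiter_main, &w);
    nanosleep(&hold, NULL);                 /* keep the lock held */

    if (use_spin)
        pthread_spin_unlock(&demo_spin);
    else
        pthread_mutex_unlock(&demo_mutex);

    pthread_join(tid, NULL);
    if (use_spin)
        pthread_spin_destroy(&demo_spin);
    return w.cpu_ms;
}
```

Calling waiter_cpu_ms(1, 200) and waiter_cpu_ms(0, 200) should show the spin lock waiter consuming CPU time for most of the hold period, while the mutex waiter consumes almost none.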

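If you look at the source of NPTL (Native POSIX Thread Library), the implementation of the Pthreads API in Linux glibc (the command "getconf GNU_LIBPTHREAD_VERSION" prints the NPTL version on your system), you will find that when pthread_mutex_lock() fails to take the lock, it makes a system call to wait and the current thread is added to the wait queue of that mutex. (NPTL today is built on futexes, which handle the uncontended case in user space, so system calls are needed far less often and performance has improved considerably.) A spin lock, by contrast, can be thought of as a lock operation implemented with inline assembly inside a while(1) loop (I recall reading a paper saying that in the Linux kernel acquiring a spin lock takes only two CPU instructions and releasing it takes just one). If you are interested, have a look at the Pthreads API implementation in sanos, a small kernel: mutex.c and spinlock.c. Although that code differs from NPTL's, it is very simple and easy to follow, which makes it helpful for understanding the characteristics of spin locks and mutexes.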
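To illustrate the "lock operation inside a while loop" idea, here is a toy test-and-set spin lock written with C11 atomics instead of inline assembly. This is only a teaching sketch: it is not the code used by glibc, the Linux kernel or sanos, and the toy_spin_* names are invented for this example.

```c
#include <stdatomic.h>

/* A toy spin lock: the flag is set while some thread holds the lock. */
typedef struct {
    atomic_flag held;
} toy_spinlock;

#define TOY_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Spin until the flag was previously clear, i.e. until we set it. */
static inline void toy_spin_lock(toy_spinlock *l)
{
    /* test-and-set returns the previous value; true means the lock was
     * already held, so keep retrying (busy-waiting). */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;
}

/* Single attempt: returns 1 if the lock was acquired, 0 otherwise. */
static inline int toy_spin_trylock(toy_spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Releasing the lock is a single atomic store. */
static inline void toy_spin_unlock(toy_spinlock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

Real implementations add refinements such as pause instructions and fairness, but the shape is the same: acquiring is an atomic read-modify-write in a loop, and releasing is a plain atomic store.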

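So which performs better in real programs, the mutex or the spin lock? Spin locks are used very widely in the Linux kernel; does that mean they are faster? Let's measure it with real code (make sure a reasonably recent g++ is installed on your system).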

// Name: spinlockvsmutex1.cc
// Source: http://www.alexonlinux.com/pthread-mutex-vs-pthread-spinlock
// Compile (spin lock version): g++ -o spin_version -DUSE_SPINLOCK spinlockvsmutex1.cc -lpthread
// Compile (mutex version): g++ -o mutex_version spinlockvsmutex1.cc -lpthread
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <errno.h>
#include <sys/time.h>
#include <list>
#include <pthread.h>

#define LOOPS 50000000

using namespace std;

list<int> the_list;

#ifdef USE_SPINLOCK
pthread_spinlock_t spinlock;
#else
pthread_mutex_t mutex;
#endif

// Get the thread id. Named get_tid because recent glibc versions
// declare their own gettid() in <unistd.h>.
pid_t get_tid() { return syscall(__NR_gettid); }

// Thread body: pop entries from the shared list until it is empty.
void *consumer(void *ptr)
{
    int i;

    printf("Consumer TID %lu\n", (unsigned long)get_tid());

    while (1)
    {
#ifdef USE_SPINLOCK
        pthread_spin_lock(&spinlock);
#else
        pthread_mutex_lock(&mutex);
#endif

        if (the_list.empty())
        {
#ifdef USE_SPINLOCK
            pthread_spin_unlock(&spinlock);
#else
            pthread_mutex_unlock(&mutex);
#endif
            break;
        }

        i = the_list.front();
        the_list.pop_front();

#ifdef USE_SPINLOCK
        pthread_spin_unlock(&spinlock);
#else
        pthread_mutex_unlock(&mutex);
#endif
    }

    return NULL;
}

int main()
{
    int i;
    pthread_t thr1, thr2;
    struct timeval tv1, tv2;

#ifdef USE_SPINLOCK
    pthread_spin_init(&spinlock, 0);
#else
    pthread_mutex_init(&mutex, NULL);
#endif

    // Creating the list content...
    for (i = 0; i < LOOPS; i++)
        the_list.push_back(i);

    // Measuring time before starting the threads...
    gettimeofday(&tv1, NULL);

    pthread_create(&thr1, NULL, consumer, NULL);
    pthread_create(&thr2, NULL, consumer, NULL);

    pthread_join(thr1, NULL);
    pthread_join(thr2, NULL);

    // Measuring time after threads finished...
    gettimeofday(&tv2, NULL);

    // Borrow one second if the microsecond part would go negative.
    if (tv1.tv_usec > tv2.tv_usec)
    {
        tv2.tv_sec--;
        tv2.tv_usec += 1000000;
    }

    // %06ld keeps leading zeros in the microsecond part.
    printf("Result - %ld.%06ld\n", (long)(tv2.tv_sec - tv1.tv_sec),
           (long)(tv2.tv_usec - tv1.tv_usec));

#ifdef USE_SPINLOCK
    pthread_spin_destroy(&spinlock);
#else
    pthread_mutex_destroy(&mutex);
#endif

    return 0;
}

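The program works as follows. The main thread first initializes a list and inserts LOOPS entries into it, then creates two new threads that both run consumer(). The two threads pop entries from the list concurrently. The main thread measures the time from creating the two threads until both have finished and prints it as the "Result" shown below.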
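The elapsed-time arithmetic on struct timeval is easy to get subtly wrong, so here it is as a small standalone sketch. The helper names elapsed_usec and format_elapsed are hypothetical. Working in total microseconds avoids the manual borrow between seconds and microseconds, and the zero-padded format matters: without it, an interval of 1.000050 s would be printed as 1.50.

```c
#include <stdio.h>
#include <sys/time.h>

/* Microseconds elapsed from start to end (end must not precede start). */
long long elapsed_usec(struct timeval start, struct timeval end)
{
    return (long long)(end.tv_sec - start.tv_sec) * 1000000LL
         + (long long)(end.tv_usec - start.tv_usec);
}

/* Writes the interval as "seconds.microseconds" with a six-digit,
 * zero-padded fraction; returns the number of characters written. */
int format_elapsed(char *buf, size_t len, long long usec)
{
    return snprintf(buf, len, "%lld.%06lld",
                    usec / 1000000LL, usec % 1000000LL);
}
```

In a benchmark, gettimeofday would fill in the start value just before pthread_create and the end value right after the last pthread_join.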

Test machine:
Ubuntu 9.04 x86_64
Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz
4.0 GB memory

Here are the test results:

gchen@gchen-desktop:~/Workspace/mutex$ g++ -o spin_version -DUSE_SPINLOCK spinvsmutex1.cc -lpthread
gchen@gchen-desktop:~/Workspace/mutex$ g++ -o mutex_version spinvsmutex1.cc -lpthread
gchen@gchen-desktop:~/Workspace/mutex$ time ./spin_version
Consumer TID 5520
Consumer TID 5521
Result - 5.888750

real    0m10.918s
user    0m15.601s
sys     0m0.804s

gchen@gchen-desktop:~/Workspace/mutex$ time ./mutex_version
Consumer TID 5691
Consumer TID 5692
Result - 9.116376

real    0m14.031s
user    0m12.245s
sys     0m4.368s

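As you can see, the spin lock version performs better in this program. Also note the sys time: the mutex version spends far more time in system calls (4.368 s versus 0.804 s), because under contention a mutex makes a system call to put the waiting thread to sleep.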
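The user and sys figures reported by the time command can also be sampled from inside a program with the POSIX getrusage call, which is handy when only part of a run should be measured. This is a minimal sketch; struct cpu_times and get_cpu_times are names invented for this example.

```c
#include <sys/resource.h>
#include <sys/time.h>

/* CPU time consumed by the whole process so far, in seconds. */
struct cpu_times {
    double user_s;   /* time spent executing in user mode */
    double sys_s;    /* time spent in the kernel on behalf of the process */
};

/* Fills *out with the current totals; returns 0 on success, -1 on error. */
int get_cpu_times(struct cpu_times *out)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;

    out->user_s = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
    out->sys_s  = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
    return 0;
}
```

Taking one sample before the threads start and one after they are joined, then subtracting, gives the user and sys cost of just the contended section.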

But does that mean a spin lock is always better? Let's look at another example program, one with very heavy lock contention:

// Name: svm2.c
// Source: http://www.solarisinternals.com/wiki/index.php/DTrace_Topics_Locks
// Compile (spin lock version): gcc -o spin -DUSE_SPINLOCK svm2.c -lpthread
// Compile (mutex version): gcc -o mutex svm2.c -lpthread
#define _GNU_SOURCE   // exposes syscall() and pthread_spin_*
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/syscall.h>

#define THREAD_NUM 2

pthread_t g_thread[THREAD_NUM];
#ifdef USE_SPINLOCK
pthread_spinlock_t g_spin;
#else
pthread_mutex_t g_mutex;
#endif
uint64_t g_count;

// Named get_tid because recent glibc versions declare their own gettid().
pid_t get_tid(void)
{
    return syscall(SYS_gettid);
}

// Thread body: repeatedly take the lock and hold it for a long inner loop.
void *run_amuck(void *arg)
{
    int i, j;

    printf("Thread %lu started.\n", (unsigned long)get_tid());

    for (i = 0; i < 10000; i++) {
#ifdef USE_SPINLOCK
        pthread_spin_lock(&g_spin);
#else
        pthread_mutex_lock(&g_mutex);
#endif
        // Large critical section: 100000 increments per acquisition.
        for (j = 0; j < 100000; j++) {
            if (g_count++ == 123456789)
                printf("Thread %lu wins!\n", (unsigned long)get_tid());
        }
#ifdef USE_SPINLOCK
        pthread_spin_unlock(&g_spin);
#else
        pthread_mutex_unlock(&g_mutex);
#endif
    }

    printf("Thread %lu finished!\n", (unsigned long)get_tid());

    return (NULL);
}

int main(int argc, char *argv[])
{
    int i, threads = THREAD_NUM;

    printf("Creating %d threads...\n", threads);
#ifdef USE_SPINLOCK
    pthread_spin_init(&g_spin, 0);
#else
    pthread_mutex_init(&g_mutex, NULL);
#endif
    for (i = 0; i < threads; i++)
        pthread_create(&g_thread[i], NULL, run_amuck, (void *)(intptr_t)i);

    for (i = 0; i < threads; i++)
        pthread_join(g_thread[i], NULL);

    printf("Done.\n");

    return (0);
}

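What characterizes this program is its very large critical section, so the two threads contend fiercely for the lock. This is of course an extreme case: critical sections in real applications are not this large and contention is not this intense. The results show that the mutex version performs better: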

gchen@gchen-desktop:~/Workspace/mutex$ time ./spin
Creating 2 threads...
Thread 31796 started.
Thread 31797 started.
Thread 31797 wins!
Thread 31797 finished!
Thread 31796 finished!
Done.

real    0m5.748s
user    0m10.257s
sys     0m0.004s

gchen@gchen-desktop:~/Workspace/mutex$ time ./mutex
Creating 2 threads...
Thread 31801 started.
Thread 31802 started.
Thread 31802 wins!
Thread 31802 finished!
Thread 31801 finished!
Done.

real    0m4.823s
user    0m4.772s
sys     0m0.032s

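Another detail worth noting is that the spin lock version burns much more user time (10.257 s versus 4.772 s). The two threads run on separate cores and, most of the time, only one of them can hold the lock, so the other busy-waits on its core with CPU usage pinned at 100%. With a mutex, a failed lock request triggers a context switch, which frees a core for other work. (This context switching also costs the thread that holds the lock: when it releases the lock it has to ask the operating system to wake the blocked threads, which is extra overhead.)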

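Summary
(1) A mutex suits scenarios in which lock operations are very frequent, and it adapts better to varying workloads. It costs more than a spin lock (mainly because of context switches), but it copes with the complex situations found in real-world development and offers more flexibility while still delivering reasonable performance.

(2) A spin lock has faster lock/unlock operations (fewer CPU instructions), but it is only suitable when the critical section runs for a very short time. In real software development, using a spin lock is not a good idea unless the programmer understands the program's locking behavior very well (a multithreaded program typically performs tens of thousands of lock operations, and if too many of them are contended lock requests, a lot of time is wasted spinning).

(3) A safer approach is probably to start (conservatively) with a mutex and then, if more performance is needed, try a spin lock as a tuning step. After all, our programs rarely have performance requirements as demanding as the Linux kernel's (the locks most commonly used in the Linux kernel are spin locks and rwlocks).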
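One way to keep that tuning step cheap is to hide the lock type behind a thin wrapper, so that switching between the two is a compile-time flag rather than an edit at every call site. The sketch below follows the same USE_SPINLOCK convention as the test programs; the app_lock_* names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>

#ifdef USE_SPINLOCK
/* Build with -DUSE_SPINLOCK to try the busy-waiting variant. */
typedef pthread_spinlock_t app_lock_t;

static inline int app_lock_init(app_lock_t *l)
{
    return pthread_spin_init(l, PTHREAD_PROCESS_PRIVATE);
}
static inline int app_lock_acquire(app_lock_t *l) { return pthread_spin_lock(l); }
static inline int app_lock_release(app_lock_t *l) { return pthread_spin_unlock(l); }
static inline int app_lock_destroy(app_lock_t *l) { return pthread_spin_destroy(l); }

#else
/* Default: the conservative sleep-waiting mutex. */
typedef pthread_mutex_t app_lock_t;

static inline int app_lock_init(app_lock_t *l)
{
    return pthread_mutex_init(l, NULL);
}
static inline int app_lock_acquire(app_lock_t *l) { return pthread_mutex_lock(l); }
static inline int app_lock_release(app_lock_t *l) { return pthread_mutex_unlock(l); }
static inline int app_lock_destroy(app_lock_t *l) { return pthread_mutex_destroy(l); }
#endif
```

With this in place, both variants can be built from the same source and compared with time, just like the two test programs in this article.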

Addendum, March 3, 2010: this view is supported by Oracle's documentation:

During configuration, Berkeley DB selects a mutex implementation for the architecture. Berkeley DB normally prefers blocking-mutex implementations over non-blocking ones. For example, Berkeley DB will select POSIX pthread mutex interfaces rather than assembly-code test-and-set spin mutexes because pthread mutexes are usually more efficient and less likely to waste CPU cycles spinning without getting any work accomplished.

P.S. Both syscall(SYS_gettid) and syscall(__NR_gettid) return the id of the current thread :)
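As a small Linux-specific sketch of the two spellings (the function names current_tid and current_tid_nr are made up for this example):

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Thread id via the SYS_gettid constant. */
static pid_t current_tid(void)
{
    return (pid_t)syscall(SYS_gettid);
}

/* Thread id via the __NR_gettid constant. */
static pid_t current_tid_nr(void)
{
    return (pid_t)syscall(__NR_gettid);
}
```

Both constants come from <sys/syscall.h> and refer to the same system call, so the two functions return the same value when called from the same thread.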

When reposting, please credit the source: www.parallellabs.com
