
TBB Basics: A Detailed Look at parallel_for

2015-04-07 16:30

To introduce parallel_for, let's start with an example that visits every element of an array. The conventional serial version looks like this:

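#include <cstdio>

// Print a single value with two decimal places.
template<typename T> void Visit( T var)
{
	printf("%0.2f, ", var);
}

// Visit every element of fArray in order, one after another.
void Sequence_Visit( const float* fArray, int nSize)
{
	for ( int i=0; i<nSize; i++)
	{
		Visit<float>(fArray[i]);

		// Break the line after indices 5, 10, 15, ...
		if ( i%5==0 && i>0)
		{
			printf("\n");
		}
	}
}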

This code should look familiar, since it is how such a loop is usually written, so we won't analyze it further. Rewritten with TBB's parallel_for, it becomes:

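#include <cstdio>
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"

using namespace tbb;

// Body object: parallel_for copies it and calls operator() once per sub-range.
class ApplyFoo
{
	float* const my_a;

public:
	// Process the sub-range [range.begin(), range.end()) handed to this call.
	void operator()(const blocked_range<size_t>& range) const
	{
		float* a = my_a;
		for ( size_t i = range.begin(); i!=range.end(); ++i)
		{
			Foo(a[i]);

			// Sub-ranges may run concurrently, so the output order
			// (and where the line breaks land) is not deterministic.
			if ( i%5==0 && i>0)
			{
				printf("\n");
			}
		}
	}

	ApplyFoo( float a[]):my_a(a)
	{
	}

	void Foo( float var) const
	{
		printf("%0.2f, ", var);
	}
};

// Apply ApplyFoo to a[0..n) in parallel. With a grainsize of 100,
// ranges of 100 or fewer elements are not split any further.
void ParallelApplyFoo( float a[], size_t n )
{
	parallel_for( blocked_range<size_t>(0, n, 100), ApplyFoo(a), auto_partitioner());
}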

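In this code we first define a class ApplyFoo that overloads operator(), and then call parallel_for from ParallelApplyFoo to process the array elements in parallel. Two new names appear here: blocked_range and parallel_for. Let's look at each in turn: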

1. blocked_range

blocked_range is a class template that represents a one-dimensional iteration space. Its definition can be found in the header blocked_range.h:

template<typename Value>
	class blocked_range {
	public:
		//! Type of a value		
		typedef Value const_iterator;
		typedef std::size_t size_type;

		//! Construct range with default-constructed values for begin and end.	
		blocked_range() : my_end(), my_begin() {}

		//! Construct range over half-open interval [begin,end), with the given grainsize.
		blocked_range( Value begin_, Value end_, size_type grainsize_=1 ) : 
		my_end(end_), my_begin(begin_), my_grainsize(grainsize_) 
		{
			__TBB_ASSERT( my_grainsize>0, "grainsize must be positive" );
		}

		//! Beginning of range.
		const_iterator begin() const {return my_begin;}

		//! One past last value in range.
		const_iterator end() const {return my_end;}

		//! Size of the range	
		size_type size() const {
			__TBB_ASSERT( !(end()<begin()), "size() unspecified if end()<begin()" );
			return size_type(my_end-my_begin);
		}

		//! The grain size for this range.
		size_type grainsize() const {return my_grainsize;}

		//------------------------------------------------------------------------
		// Methods that implement Range concept
		//------------------------------------------------------------------------

		//! True if range is empty.
		bool empty() const {return !(my_begin<my_end);}

		//! True if range is divisible.
		/** Unspecified if end()<begin(). */
		bool is_divisible() const {return my_grainsize<size();}

		//! Split range.  
		/** The new Range *this has the second half, the old range r has the first half. 
		Unspecified if end()<begin() or !is_divisible(). */
		blocked_range( blocked_range& r, split ) : 
		my_end(r.my_end),
			my_begin(do_split(r)),
			my_grainsize(r.my_grainsize)
		{}

	private:
		/** NOTE: my_end MUST be declared before my_begin, otherwise the forking constructor will break. */
		Value my_end;
		Value my_begin;
		size_type my_grainsize;

		//! Auxiliary function used by forking constructor.	
		static Value do_split( blocked_range& r ) {
			__TBB_ASSERT( r.is_divisible(), "cannot split blocked_range that is not divisible" );
			Value middle = r.my_begin + (r.my_end-r.my_begin)/2u;
			r.my_end = middle;
			return middle;
		}

		template<typename RowValue, typename ColValue>
		friend class blocked_range2d;

		template<typename RowValue, typename ColValue, typename PageValue>
		friend class blocked_range3d;
	};

From the definition above we can see that the class has three constructors. The sample code uses this one:

blocked_range( Value begin_, Value end_, size_type grainsize_=1 ) : 
		my_end(end_), my_begin(begin_), my_grainsize(grainsize_) 
		{
			__TBB_ASSERT( my_grainsize>0, "grainsize must be positive" );
		}

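The first parameter is the start of the range and the second is the end. Both are of type Value (which the class also exposes as const_iterator), and together they describe the half-open interval [begin, end). The third parameter, grainsize, is the size of a "reasonably sized" chunk that is processed in a single loop. If the range is larger than grainsize, parallel_for splits it into independent blocks and schedules them separately (possibly on different threads).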
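To make the splitting rule concrete, here is a minimal sketch that uses only the standard library and mimics the is_divisible() and do_split() logic from the header excerpt above. MiniRange, split_off_second_half, collect_chunks, and chunks_for are hypothetical names for illustration, not part of TBB. The recursion splits every divisible range all the way down, which corresponds to what simple_partitioner does; auto_partitioner may stop earlier and produce larger chunks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for blocked_range<size_t>; not a TBB type.
struct MiniRange {
    std::size_t begin;
    std::size_t end;
    std::size_t grainsize;

    std::size_t size() const { return end - begin; }
    // Same rule as blocked_range::is_divisible(): divisible only while size > grainsize.
    bool is_divisible() const { return grainsize < size(); }
};

// Split r in the middle, as do_split() does: r keeps the first half,
// and the returned range covers the second half.
MiniRange split_off_second_half(MiniRange& r) {
    assert(r.is_divisible());
    std::size_t middle = r.begin + (r.end - r.begin) / 2u;
    MiniRange second = {middle, r.end, r.grainsize};
    r.end = middle;
    return second;
}

// Recursively split until no piece is divisible, collecting the leaf chunks in order.
void collect_chunks(MiniRange r, std::vector<MiniRange>& out) {
    if (!r.is_divisible()) {
        out.push_back(r);
        return;
    }
    MiniRange second = split_off_second_half(r);
    collect_chunks(r, out);
    collect_chunks(second, out);
}

// Convenience wrapper: the leaf chunks for [0, n) with the given grainsize.
std::vector<MiniRange> chunks_for(std::size_t n, std::size_t grainsize) {
    std::vector<MiniRange> out;
    MiniRange whole = {0, n, grainsize};
    collect_chunks(whole, out);
    return out;
}
```

For example, chunks_for(1000, 100) yields 16 chunks of 62 or 63 elements each (1000 is halved to 500, 250, 125, and then 62 + 63). With full splitting the chunks therefore end up between roughly grainsize/2 and grainsize, not exactly grainsize.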

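So grainsize effectively determines when TBB splits the data. If grainsize is too small, too many blocks are produced and the per-block overhead (for example, the cost of scheduling and switching between threads) grows, which can hurt performance. Conversely, if grainsize is so large that the array is hardly split at all, parallel_for cannot deliver the parallelism it is meant to provide, and performance again falls short. So choose grainsize with care, ideally by tuning and measuring. You can also leave it out, which gives the default of 1, and let TBB decide the chunk sizes through auto_partitioner (usually not optimal); note that the example above does pass an explicit grainsize of 100. An empirical procedure for tuning grainsize: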

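1) Start with a grainsize somewhat larger than you expect to need, typically 10000.

2) Run on a single-processor machine and record the performance.

3) Halve the grainsize and see how much performance drops, repeating as needed. Once the slowdown is in the 5%-10% range, that grainsize is already a good setting.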
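The steps above can be sketched as a small loop. This is only an illustration under stated assumptions: tune_grainsize is a hypothetical helper, not a TBB function; the caller supplies measure_seconds, which should run the real workload with the given grainsize and return the elapsed time; and the slowdown is measured relative to the first (largest) grainsize, which is one possible reading of step 3. model_seconds is a made-up cost model included only so the loop can be exercised without a real workload.

```cpp
#include <cstddef>
#include <functional>

// Hypothetical helper, not part of TBB: halve the grainsize until the
// slowdown relative to the starting value reaches the 5%-10% band.
std::size_t tune_grainsize(const std::function<double(std::size_t)>& measure_seconds,
                           std::size_t start = 10000) {
    const double baseline = measure_seconds(start);   // steps 1 and 2
    std::size_t grainsize = start;
    while (grainsize > 1) {
        const std::size_t candidate = grainsize / 2;   // step 3: halve
        const double slowdown = measure_seconds(candidate) / baseline - 1.0;
        if (slowdown > 0.10) break;    // too costly: keep the previous value
        grainsize = candidate;
        if (slowdown >= 0.05) break;   // inside the 5%-10% band: accept
    }
    return grainsize;
}

// Synthetic cost model (an assumption for demonstration only):
// one second of useful work plus an overhead that grows as chunks shrink.
double model_seconds(std::size_t grainsize) {
    return 1.0 + 100.0 / static_cast<double>(grainsize);
}
```

With this synthetic model, tune_grainsize(model_seconds) settles on 1250: the slowdown is about 1% at 5000, about 3% at 2500, and about 7% at 1250. Real measurements are noisy, so in practice each timing would be averaged over several runs.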

Also, near the end of the class definition you can see these declarations:

template<typename RowValue, typename ColValue>
		friend class blocked_range2d;

		template<typename RowValue, typename ColValue, typename PageValue>
		friend class blocked_range3d;

blocked_range2d in particular is very useful for processing matrices and image data.

2. parallel_for

parallel_for is the core of this article, so let's first look at its definition (in parallel_for.h; only part of the code is excerpted here):

template<typename Range, typename Body>
void parallel_for( const Range& range, const Body& body ) {
	internal::start_for<Range,Body,__TBB_DEFAULT_PARTITIONER>::run(range,body,__TBB_DEFAULT_PARTITIONER());
}

//! Parallel iteration over range with simple partitioner.
template<typename Range, typename Body>
void parallel_for( const Range& range, const Body& body, const simple_partitioner& partitioner ) {
	internal::start_for<Range,Body,simple_partitioner>::run(range,body,partitioner);
}

//! Parallel iteration over range with auto_partitioner.
template<typename Range, typename Body>
void parallel_for( const Range& range, const Body& body, const auto_partitioner& partitioner ) {
	internal::start_for<Range,Body,auto_partitioner>::run(range,body,partitioner);
}

//! Parallel iteration over range with affinity_partitioner.
template<typename Range, typename Body>
void parallel_for( const Range& range, const Body& body, affinity_partitioner& partitioner ) {
	internal::start_for<Range,Body,affinity_partitioner>::run(range,body,partitioner);
}

This article uses the third form, the one that takes an auto_partitioner. Its parameters are:

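1) range: the iteration space that is split into blocks.

2) body: the operation applied to each block. Body can be viewed as a functor whose operator() is invoked with a blocked_range argument. Passing a function pointer also works, as long as it can be called with a blocked_range argument.

3) partitioner: the partitioning strategy. The overloads excerpted above accept simple_partitioner, auto_partitioner, and affinity_partitioner.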
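As a rough illustration of the Range/Body contract described above, the following standard-library-only sketch applies a body to consecutive chunks of an index range. It runs serially and is not TBB code: Chunk, serial_for_chunks, DoubleAll, and count_elements are hypothetical names, and the real parallel_for splits recursively and may run chunks on different threads. What it does show is that both a functor with a const operator() and a plain function pointer can serve as the body, as long as each can be called with a chunk.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for blocked_range<size_t>: the half-open range [first, last).
struct Chunk {
    std::size_t first;
    std::size_t last;
    std::size_t begin() const { return first; }
    std::size_t end() const { return last; }
};

// Serial stand-in for parallel_for: call body once per chunk of at most
// grainsize elements. The body is taken by const reference, so a functor's
// operator() must be const, just like ApplyFoo::operator().
template <typename Body>
void serial_for_chunks(std::size_t n, std::size_t grainsize, const Body& body) {
    assert(grainsize > 0);
    for (std::size_t start = 0; start < n; start += grainsize) {
        const std::size_t stop = (n - start > grainsize) ? start + grainsize : n;
        Chunk c = {start, stop};
        body(c);
    }
}

// Functor body: doubles every element in the chunk it is handed.
class DoubleAll {
    float* const my_a;
public:
    explicit DoubleAll(float* a) : my_a(a) {}
    void operator()(const Chunk& c) const {
        for (std::size_t i = c.begin(); i != c.end(); ++i) my_a[i] *= 2.0f;
    }
};

// Function-pointer body: counts the elements it was handed. The shared
// counter is safe only because this sketch is serial; under a real
// parallel_for it would be a data race.
std::size_t g_visited = 0;
void count_elements(const Chunk& c) { g_visited += c.end() - c.begin(); }
```

A call such as serial_for_chunks(n, 100, DoubleAll(a)) uses the functor, and serial_for_chunks(n, 100, &count_elements) uses the function pointer. In both cases each element is visited exactly once, because the chunks are disjoint and together cover [0, n).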

Usage example:

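#include <cstdio>
#include "tbb/tick_count.h"

using namespace tbb;

// Assumes Sequence_Visit and ParallelApplyFoo from above are visible here.
int main()
{
	const int CONST_SIZE = 300000;

	float *fArray = new float[CONST_SIZE];

	// Fill the array with sample data.
	for ( long i =0; i<CONST_SIZE; i++)
	{
		fArray[i] = static_cast<float>(i/2.0+i/3.0);
	}

	tick_count t1 = tick_count::now();

	Sequence_Visit(fArray, CONST_SIZE);
	tick_count t2 = tick_count::now();

	ParallelApplyFoo(fArray, CONST_SIZE);
	tick_count t3 = tick_count::now();

	// Both runs spend most of their time in printf, so these timings
	// mostly reflect console I/O, not the loop itself.
	printf("\nSeq seconds:%g\n",(t2-t1).seconds());
	printf("TBB seconds:%g\n",(t3-t2).seconds());

	delete[] fArray;	// release the buffer allocated above

	getchar();	// keep the console window open
	return 0;
}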

