
Dual-GPU C++/MATLAB Mixed Programming

2016-07-04 20:59

Working through a dual-GPU programming example routine, which depends on the header book.h.

Compile error: windows.h not found

Mixing MATLAB with VS2013: mex reports that windows.h cannot be found, even though the dual-GPU test program builds and runs fine under VS2013 alone, where the include paths resolve correctly.

Step 1: add all the Windows Kits paths inside MATLAB — mex still fails.

Step 2: absolute paths work, but far too many paths would have to be added, and the same goes for environment variables.

Step 3: adding environment variables — does not solve it.

Step 4: searching the CUDA programming guide for "include" shows that nvcc accepts extra include directories via -I path. This fixes the missing windows.h and mex now compiles, but a new error appears next: "identifier "IUnknown" is undefined".
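The general form of the workaround is to pass each required include directory to nvcc with -I. A sketch only — the paths below are hypothetical and the actual Windows Kits directories vary by SDK version and install location:

```shell
rem Hypothetical paths -- substitute the Windows Kits include directories
rem that actually exist on your machine (version numbers vary by SDK).
nvcc -I"C:\Program Files (x86)\Windows Kits\8.1\Include\um" ^
     -I"C:\Program Files (x86)\Windows Kits\8.1\Include\shared" ^
     -c kernel.cu
```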

Error: identifier "IUnknown" is undefined

Step 5: searching for this error turned up a fix — add the following at the top of the source file:

#ifdef _WIN32
#define WIN32_LEAN_AND_MEAN
#endif


This resolves the error: the macro keeps windows.h from pulling in rarely-used Windows APIs, including the COM declarations where IUnknown appears.

The mexFunction gateway exchanges data with MATLAB through its input and output arguments. On the MATLAB side the call looks like

[a b c ...] = fun(A,B,C,...)


while on the C++ side the gateway is declared as

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])


where nrhs is the number of input (right-hand-side) arguments and nlhs the number of output (left-hand-side) arguments; prhs and plhs are the corresponding arrays of mxArray pointers.

Inside mexFunction, use mexPrintf() to print data; it takes the same format arguments as printf().

A silly mistake

I used mexErrMsgTxt() as a print function — but it raises a MATLAB error and terminates the MEX function.

Calling the function on the MATLAB side as

[a,b] = double(func(c,d,e));


works when there is a single output, but with two or more it breaks: because func is nested inside double(), MATLAB only ever requests one output from it, so nlhs inside the C++ function is always 1 and the call fails with a "too many output arguments" error. Call func directly with both outputs and convert each result afterwards.

Dual-GPU example code skeleton

/*
* Copyright 1993-2010 NVIDIA Corporation.  All rights reserved.
*
* NVIDIA Corporation and its licensors retain all intellectual property and
* proprietary rights in and to this software and related documentation.
* Any use, reproduction, disclosure, or distribution of this software
* and related documentation without an express license agreement from
* NVIDIA Corporation is strictly prohibited.
*
* Please refer to the applicable NVIDIA end user license agreement (EULA)
* associated with this source code for terms and conditions that govern
* your use of this NVIDIA software.
*
*/

#include "../common/book.h"

#define imin(a,b) (a<b?a:b)

#define     N    (33*1024*1024)
const int threadsPerBlock = 256;
const int blocksPerGrid =
            imin( 32, (N/2+threadsPerBlock-1) / threadsPerBlock );

__global__ void dot( int size, float *a, float *b, float *c ) {
    __shared__ float cache[threadsPerBlock];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int cacheIndex = threadIdx.x;

    float   temp = 0;
    while (tid < size) {
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }

    // set the cache values
    cache[cacheIndex] = temp;

    // synchronize threads in this block
    __syncthreads();

    // for reductions, threadsPerBlock must be a power of 2
    // because of the following code
    int i = blockDim.x/2;
    while (i != 0) {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
        i /= 2;
    }

    if (cacheIndex == 0)
        c[blockIdx.x] = cache[0];
}

struct DataStruct {
    int     deviceID;
    int     size;
    float   *a;
    float   *b;
    float   returnValue;
};

void* routine( void *pvoidData ) {
    DataStruct  *data = (DataStruct*)pvoidData;
    HANDLE_ERROR( cudaSetDevice( data->deviceID ) );

    int     size = data->size;
    float   *a, *b, c, *partial_c;
    float   *dev_a, *dev_b, *dev_partial_c;

    // allocate memory on the CPU side
    a = data->a;
    b = data->b;
    partial_c = (float*)malloc( blocksPerGrid*sizeof(float) );

    // allocate the memory on the GPU
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a,
                              size*sizeof(float) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_b,
                              size*sizeof(float) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_partial_c,
                              blocksPerGrid*sizeof(float) ) );

    // copy the arrays 'a' and 'b' to the GPU
    HANDLE_ERROR( cudaMemcpy( dev_a, a, size*sizeof(float),
                              cudaMemcpyHostToDevice ) );
    HANDLE_ERROR( cudaMemcpy( dev_b, b, size*sizeof(float),
                              cudaMemcpyHostToDevice ) );

    dot<<<blocksPerGrid,threadsPerBlock>>>( size, dev_a, dev_b,
                                            dev_partial_c );
    // copy the array 'c' back from the GPU to the CPU
    HANDLE_ERROR( cudaMemcpy( partial_c, dev_partial_c,
                              blocksPerGrid*sizeof(float),
                              cudaMemcpyDeviceToHost ) );

    // finish up on the CPU side
    c = 0;
    for (int i=0; i<blocksPerGrid; i++) {
        c += partial_c[i];
    }

    HANDLE_ERROR( cudaFree( dev_a ) );
    HANDLE_ERROR( cudaFree( dev_b ) );
    HANDLE_ERROR( cudaFree( dev_partial_c ) );

    // free memory on the CPU side
    free( partial_c );

    data->returnValue = c;
    return 0;
}

int main( void ) {
    int deviceCount;
    HANDLE_ERROR( cudaGetDeviceCount( &deviceCount ) );
    if (deviceCount < 2) {
        printf( "We need at least two compute 1.0 or greater "
                "devices, but only found %d\n", deviceCount );
        return 0;
    }

    float   *a = (float*)malloc( sizeof(float) * N );
    HANDLE_NULL( a );
    float   *b = (float*)malloc( sizeof(float) * N );
    HANDLE_NULL( b );

    // fill in the host memory with data
    for (int i=0; i<N; i++) {
        a[i] = i;
        b[i] = i*2;
    }

    // prepare for multithread
    DataStruct  data[2];
    data[0].deviceID = 0;
    data[0].size = N/2;
    data[0].a = a;
    data[0].b = b;

    data[1].deviceID = 1;
    data[1].size = N/2;
    data[1].a = a + N/2;
    data[1].b = b + N/2;

    CUTThread   thread = start_thread( routine, &(data[0]) );
    routine( &(data[1]) );
    end_thread( thread );

    // free memory on the CPU side
    free( a );
    free( b );

    printf( "Value calculated:  %f\n",
            data[0].returnValue + data[1].returnValue );

    return 0;
}


Notes: the two devices execute the same code on different data, so two threads are needed, along with a struct (or class) carrying the device ID, the data pointers, and so on.

The GPU work is packaged into a single function — the example gives it a void* signature so it can run on a thread — which internally performs these steps:

1. Select the device to use from its device ID;

2. Allocate device memory and copy the data over;

3. Launch the kernel.

The main function performs these steps:

1. Pack the data to be processed into separate structs;

2. Start a worker thread and run the routine on it;

3. Run the routine for the other device on the main thread;

4. Free the allocated memory.