
CUDA Optimization Examples (3): Shared Memory

2018-03-11 21:52


Preface

The previous experiments showed that shared memory is a core tool for optimizing CUDA programs. It lets a kernel access global memory in a coalesced pattern while redirecting the kernel's strided, interleaved accesses to shared memory instead. Matrix transpose is a typical case: the rows of a 2D array must be written into the columns of another, and writing down a column is exactly such a strided access. The idea is to load the data a block needs into shared memory with coalesced reads, perform the awkward strided accesses against shared memory, and then write the results back to global memory in a coalesced way.

Points to note when using shared memory:

Shared memory serves a whole block, so you generally need to pay attention to synchronization, i.e., the use of
__syncthreads()
.
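
The classic block-level array reversal makes the hazard concrete. A minimal sketch (not part of the transpose example below), assuming a single block of 64 threads:

__global__ void reverseInBlock(int *d)
{
    __shared__ int s[64];
    int t = threadIdx.x;
    s[t] = d[t];         // every thread writes its own slot of the tile
    __syncthreads();     // wait until ALL writes to the tile are done
    d[t] = s[63 - t];    // now safe to read a slot written by another thread
}

// launched as: reverseInBlock<<<1, 64>>>(d_data);

Without the barrier, a thread might read s[63 - t] before the thread responsible for it has written it.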

Shared memory handles complex access patterns well because it is divided into 32 banks that can be accessed simultaneously: if a warp's memory requests fall in different banks, they are all served at once, but requests that hit the same bank are serialized and the access slows down. Padding is one way to overcome bank conflicts; it is introduced below.

Each shared memory bank can be 32 or 64 bits wide, i.e. 4 or 8 bytes, matching single- and double-precision data. In the common case each thread of a kernel handles one 4-byte single-precision or one 8-byte double-precision value, so the 32 threads of a warp line up exactly with one row of the 32 banks: 32 × 4 bytes = 128 bytes. If there are no bank conflicts, a single shared memory transaction suffices; otherwise several are needed. For single-precision data the default 32-bit bank width is fine; for double precision, to avoid wasting the scarce shared memory resource, the bank width can be set to 64 bits explicitly.
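
The bank width is selected from host code. A hedged sketch: cudaDeviceSetSharedMemConfig is the runtime call for this on Kepler-class devices; on later architectures (including Pascal cards such as the GTX 1050 Ti) the banks are fixed at 32 bits and the call has no effect:

// request 64-bit banks before launching kernels that work on doubles
cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
// ... launch double-precision kernels ...

// restore the default 32-bit banks for single-precision work
cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeFourByte);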

Shared memory is a scarce per-SM resource, so it affects how many blocks can reside on an SM and therefore affects occupancy; it is worth computing a lower bound on what each block may use. For the 32-bit bank case: my GTX 1050 Ti has 48 KB of shared memory per SM. To keep full occupancy, with at most 16 resident blocks per SM, each block can be allocated at most 48 KB / 16 = 3 KB, i.e. 768 single-precision floats. If a block requests more than that, 16 blocks can no longer fit on every SM and occupancy is not guaranteed. (Registers are similarly limited: with 2048 resident threads, each thread gets about 32 registers.)
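
The limits used in that calculation can be read from the device rather than hard-coded; a small host-side sketch (not part of the original sample):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
printf("shared memory per SM : %zu bytes\n", prop.sharedMemPerMultiprocessor);
printf("max threads per SM   : %d\n", prop.maxThreadsPerMultiProcessor);
printf("registers per SM     : %d\n", prop.regsPerMultiprocessor);
// e.g. 49152 bytes / 16 blocks = 3072 bytes = 768 floats per block,
// and 65536 registers / 2048 threads = 32 registers per thread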

Experiment:

This experiment uses an example taken from NVIDIA's official materials.

Code:

/* Copyright (c) 1993-2015, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
*  * Redistributions of source code must retain the above copyright
*    notice, this list of conditions and the following disclaimer.
*  * Redistributions in binary form must reproduce the above copyright
*    notice, this list of conditions and the following disclaimer in the
*    documentation and/or other materials provided with the distribution.
*  * Neither the name of NVIDIA CORPORATION nor the names of its
*    contributors may be used to endorse or promote products derived
*    from this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
* EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
* PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
* CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
* EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
* PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
* OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

#include <stdio.h>
#include <assert.h>

// Convenience function for checking CUDA runtime API results
// can be wrapped around any runtime API call. No-op in release builds.
inline
cudaError_t checkCuda(cudaError_t result)
{
#if defined(DEBUG) || defined(_DEBUG)
  if (result != cudaSuccess) {
    fprintf(stderr, "CUDA Runtime Error: %s\n", cudaGetErrorString(result));
    assert(result == cudaSuccess);
  }
#endif
  return result;
}

const int TILE_DIM = 32;
const int BLOCK_ROWS = 8;
const int NUM_REPS = 100;

// Check errors and print GB/s
void postprocess(const float *ref, const float *res, int n, float ms)
{
  bool passed = true;
  for (int i = 0; i < n; i++)
    if (res[i] != ref[i]) {
      printf("%d %f %f\n", i, res[i], ref[i]);
      printf("%25s\n", "*** FAILED ***");
      passed = false;
      break;
    }
  if (passed)
    printf("%20.2f\n", 2 * n * sizeof(float) * 1e-6 * NUM_REPS / ms );
}

// simple copy kernel
// Used as reference case representing best effective bandwidth.
__global__ void copy(float *odata, const float *idata)
{
  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int width = gridDim.x * TILE_DIM;

  for (int j = 0; j < TILE_DIM; j+= BLOCK_ROWS)
    odata[(y+j)*width + x] = idata[(y+j)*width + x];
}

// copy kernel using shared memory
// Also used as reference case, demonstrating effect of using shared memory.
__global__ void copySharedMem(float *odata, const float *idata)
{
  __shared__ float tile[TILE_DIM * TILE_DIM];

  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int width = gridDim.x * TILE_DIM;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    tile[(threadIdx.y+j)*TILE_DIM + threadIdx.x] = idata[(y+j)*width + x];

  __syncthreads();

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    odata[(y+j)*width + x] = tile[(threadIdx.y+j)*TILE_DIM + threadIdx.x];
}

// naive transpose
// Simplest transpose; doesn't use shared memory.
// Global memory reads are coalesced but writes are not.
__global__ void transposeNaive(float *odata, const float *idata)
{
  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int width = gridDim.x * TILE_DIM;

  for (int j = 0; j < TILE_DIM; j+= BLOCK_ROWS)
    odata[x*width + (y+j)] = idata[(y+j)*width + x];
}

// coalesced transpose
// Uses shared memory to achieve coalesing in both reads and writes
// Tile width == #banks causes shared memory bank conflicts.
__global__ void transposeCoalesced(float *odata, const float *idata)
{
  __shared__ float tile[TILE_DIM][TILE_DIM];

  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int width = gridDim.x * TILE_DIM;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    tile[threadIdx.y+j][threadIdx.x] = idata[(y+j)*width + x];

  __syncthreads();

  x = blockIdx.y * TILE_DIM + threadIdx.x;  // transpose block offset
  y = blockIdx.x * TILE_DIM + threadIdx.y;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    odata[(y+j)*width + x] = tile[threadIdx.x][threadIdx.y + j];
}

// No bank-conflict transpose
// Same as transposeCoalesced except the first tile dimension is padded
// to avoid shared memory bank conflicts.
__global__ void transposeNoBankConflicts(float *odata, const float *idata)
{
  __shared__ float tile[TILE_DIM][TILE_DIM+1];

  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int width = gridDim.x * TILE_DIM;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    tile[threadIdx.y+j][threadIdx.x] = idata[(y+j)*width + x];

  __syncthreads();

  x = blockIdx.y * TILE_DIM + threadIdx.x;  // transpose block offset
  y = blockIdx.x * TILE_DIM + threadIdx.y;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    odata[(y+j)*width + x] = tile[threadIdx.x][threadIdx.y + j];
}

int main(int argc, char **argv)
{
  const int nx = 1024;
  const int ny = 1024;
  const int mem_size = nx*ny*sizeof(float);

  dim3 dimGrid(nx/TILE_DIM, ny/TILE_DIM, 1);
  dim3 dimBlock(TILE_DIM, BLOCK_ROWS, 1);

  int devId = 0;
  if (argc > 1) devId = atoi(argv[1]);

  cudaDeviceProp prop;
  checkCuda( cudaGetDeviceProperties(&prop, devId));
  printf("\nDevice : %s\n", prop.name);
  printf("Matrix size: %d %d, Block size: %d %d, Tile size: %d %d\n",
         nx, ny, TILE_DIM, BLOCK_ROWS, TILE_DIM, TILE_DIM);
  printf("dimGrid: %d %d %d. dimBlock: %d %d %d\n",
         dimGrid.x, dimGrid.y, dimGrid.z, dimBlock.x, dimBlock.y, dimBlock.z);

  checkCuda( cudaSetDevice(devId) );

  float *h_idata = (float*)malloc(mem_size);
  float *h_cdata = (float*)malloc(mem_size);
  float *h_tdata = (float*)malloc(mem_size);
  float *gold    = (float*)malloc(mem_size);

  float *d_idata, *d_cdata, *d_tdata;
  checkCuda( cudaMalloc(&d_idata, mem_size) );
  checkCuda( cudaMalloc(&d_cdata, mem_size) );
  checkCuda( cudaMalloc(&d_tdata, mem_size) );

  // check parameters and calculate execution configuration
  if (nx % TILE_DIM || ny % TILE_DIM) {
    printf("nx and ny must be a multiple of TILE_DIM\n");
    goto error_exit;
  }

  if (TILE_DIM % BLOCK_ROWS) {
    printf("TILE_DIM must be a multiple of BLOCK_ROWS\n");
    goto error_exit;
  }

  // host
  for (int j = 0; j < ny; j++)
    for (int i = 0; i < nx; i++)
      h_idata[j*nx + i] = j*nx + i;

  // correct result for error checking
  for (int j = 0; j < ny; j++)
    for (int i = 0; i < nx; i++)
      gold[j*nx + i] = h_idata[i*nx + j];

  // device
  checkCuda( cudaMemcpy(d_idata, h_idata, mem_size, cudaMemcpyHostToDevice) );

  // events for timing
  cudaEvent_t startEvent, stopEvent;
  checkCuda( cudaEventCreate(&startEvent) );
  checkCuda( cudaEventCreate(&stopEvent) );
  float ms;

  // ------------
  // time kernels
  // ------------
  printf("%25s%25s\n", "Routine", "Bandwidth (GB/s)");

  // ----
  // copy
  // ----
  printf("%25s", "copy");
  checkCuda( cudaMemset(d_cdata, 0, mem_size) );
  // warm up
  copy<<<dimGrid, dimBlock>>>(d_cdata, d_idata);
  checkCuda( cudaEventRecord(startEvent, 0) );
  for (int i = 0; i < NUM_REPS; i++)
    copy<<<dimGrid, dimBlock>>>(d_cdata, d_idata);
  checkCuda( cudaEventRecord(stopEvent, 0) );
  checkCuda( cudaEventSynchronize(stopEvent) );
  checkCuda( cudaEventElapsedTime(&ms, startEvent, stopEvent) );
  checkCuda( cudaMemcpy(h_cdata, d_cdata, mem_size, cudaMemcpyDeviceToHost) );
  postprocess(h_idata, h_cdata, nx*ny, ms);

  // -------------
  // copySharedMem
  // -------------
  printf("%25s", "shared memory copy");
  checkCuda( cudaMemset(d_cdata, 0, mem_size) );
  // warm up
  copySharedMem<<<dimGrid, dimBlock>>>(d_cdata, d_idata);
  checkCuda( cudaEventRecord(startEvent, 0) );
  for (int i = 0; i < NUM_REPS; i++)
    copySharedMem<<<dimGrid, dimBlock>>>(d_cdata, d_idata);
  checkCuda( cudaEventRecord(stopEvent, 0) );
  checkCuda( cudaEventSynchronize(stopEvent) );
  checkCuda( cudaEventElapsedTime(&ms, startEvent, stopEvent) );
  checkCuda( cudaMemcpy(h_cdata, d_cdata, mem_size, cudaMemcpyDeviceToHost) );
  postprocess(h_idata, h_cdata, nx * ny, ms);

  // --------------
  // transposeNaive
  // --------------
  printf("%25s", "naive transpose");
  checkCuda( cudaMemset(d_tdata, 0, mem_size) );
  // warmup
  transposeNaive<<<dimGrid, dimBlock>>>(d_tdata, d_idata);
  checkCuda( cudaEventRecord(startEvent, 0) );
  for (int i = 0; i < NUM_REPS; i++)
    transposeNaive<<<dimGrid, dimBlock>>>(d_tdata, d_idata);
  checkCuda( cudaEventRecord(stopEvent, 0) );
  checkCuda( cudaEventSynchronize(stopEvent) );
  checkCuda( cudaEventElapsedTime(&ms, startEvent, stopEvent) );
  checkCuda( cudaMemcpy(h_tdata, d_tdata, mem_size, cudaMemcpyDeviceToHost) );
  postprocess(gold, h_tdata, nx * ny, ms);

  // ------------------
  // transposeCoalesced
  // ------------------
  printf("%25s", "coalesced transpose");
  checkCuda( cudaMemset(d_tdata, 0, mem_size) );
  // warmup
  transposeCoalesced<<<dimGrid, dimBlock>>>(d_tdata, d_idata);
  checkCuda( cudaEventRecord(startEvent, 0) );
  for (int i = 0; i < NUM_REPS; i++)
    transposeCoalesced<<<dimGrid, dimBlock>>>(d_tdata, d_idata);
  checkCuda( cudaEventRecord(stopEvent, 0) );
  checkCuda( cudaEventSynchronize(stopEvent) );
  checkCuda( cudaEventElapsedTime(&ms, startEvent, stopEvent) );
  checkCuda( cudaMemcpy(h_tdata, d_tdata, mem_size, cudaMemcpyDeviceToHost) );
  postprocess(gold, h_tdata, nx * ny, ms);

  // ------------------------
  // transposeNoBankConflicts
  // ------------------------
  printf("%25s", "conflict-free transpose");
  checkCuda( cudaMemset(d_tdata, 0, mem_size) );
  // warmup
  transposeNoBankConflicts<<<dimGrid, dimBlock>>>(d_tdata, d_idata);
  checkCuda( cudaEventRecord(startEvent, 0) );
  for (int i = 0; i < NUM_REPS; i++)
    transposeNoBankConflicts<<<dimGrid, dimBlock>>>(d_tdata, d_idata);
  checkCuda( cudaEventRecord(stopEvent, 0) );
  checkCuda( cudaEventSynchronize(stopEvent) );
  checkCuda( cudaEventElapsedTime(&ms, startEvent, stopEvent) );
  checkCuda( cudaMemcpy(h_tdata, d_tdata, mem_size, cudaMemcpyDeviceToHost) );
  postprocess(gold, h_tdata, nx * ny, ms);

error_exit:
  // cleanup
  checkCuda( cudaEventDestroy(startEvent) );
  checkCuda( cudaEventDestroy(stopEvent) );
  checkCuda( cudaFree(d_tdata) );
  checkCuda( cudaFree(d_cdata) );
  checkCuda( cudaFree(d_idata) );
  free(h_idata);
  free(h_tdata);
  free(h_cdata);
  free(gold);
}


Results



Performance, from best to worst:

Kernel                     Bandwidth (GB/s)
copy                                  95.42
shared memory copy                    97.16
conflict-free transpose               94.90
coalesced transpose                   53.72
naive transpose                       27.81
The first three are essentially the same; the explanation follows below.

Analysis

const int TILE_DIM = 32;
const int BLOCK_ROWS = 8;
const int NUM_REPS = 100;


Each block has only 32 × 8 threads, yet it processes a 32 × 32 tile of data, because each thread in the kernel handles four rows, one every BLOCK_ROWS rows, as follows:

for (int j = 0; j < TILE_DIM; j+= BLOCK_ROWS)
  odata[(y+j)*width + x] = idata[(y+j)*width + x];


NUM_REPS makes each kernel run 100 times, and the average bandwidth over those runs is reported. The figure counts both the read and the write of every element: 2 * n * sizeof(float) * NUM_REPS / time.

What each kernel does:

naive transpose:

This is the most naive way to implement a matrix transpose: rows are read and columns are written. The row reads are coalesced, but the column writes are not. It serves as the performance floor, i.e., the worst case.

for (int j = 0; j < TILE_DIM; j+= BLOCK_ROWS)
  odata[x*width + (y+j)] = idata[(y+j)*width + x];


copy:

This gives the maximum achievable bandwidth: both its reads and its writes are coalesced, so it represents the effective peak. Since the problem at hand is a transpose, reads and writes cannot both be coalesced without further tricks; copy therefore does not transpose at all, it is a plain element-by-element copy. The check makes this obvious: the result is compared against the input array. Its role is to provide a reference ceiling. As below, both arrays hold the original data and nothing is transposed:

postprocess(h_idata, h_cdata, nx*ny, ms);


shared memory copy:

Like copy, this kernel is also a copy, just staged through shared memory: the loads from global memory into shared memory are coalesced, and so are the stores from shared memory back to global memory. It likewise performs no transpose; it exists to provide the same reference ceiling while isolating the cost of going through shared memory, keeping the comparison controlled.

transposeCoalesced:

Here the loads from global memory into the shared tile are coalesced, and the stores to global memory are coalesced as well; the strided accesses are shifted onto shared memory. But when writing out, each warp reads a column of the tile, and all 32 elements of a tile column fall into the same bank, so this kernel suffers severe (32-way) bank conflicts. Even so, it is still much faster than the transpose that does not use shared memory.

conflict-free transpose:

This kernel has the same structure as transposeCoalesced: coalesced loads from global memory into the shared tile, with the column-wise reads happening in shared memory. Because writing to global memory reads a column of the tile, this function appends one extra column to the shared array, which achieves the padding:

__shared__ float tile[TILE_DIM][TILE_DIM+1];


Padding: the elements being accessed all lie in one column, but after padding with one extra column they are staggered so that each lands in a different bank. This completely eliminates the bank conflicts, and performance is as good as the coalesced copies.
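
The staggering is easy to verify numerically. A small host-side sketch (hypothetical, not part of the NVIDIA sample) that computes the bank index of each element in column 0 of the tile, with and without the padding column:

#include <stdio.h>

int main(void)
{
    const int TILE_DIM = 32;
    for (int row = 0; row < 32; row++) {
        int bankUnpadded = (row * TILE_DIM       + 0) % 32; // tile[row][0], row width 32
        int bankPadded   = (row * (TILE_DIM + 1) + 0) % 32; // tile[row][0], row width 33
        printf("row %2d: unpadded bank %2d, padded bank %2d\n",
               row, bankUnpadded, bankPadded);
    }
    return 0;
}

Without padding, every element of the column lands in bank 0 (a 32-way conflict); with the extra column, row r lands in bank r, so a warp's 32 accesses are spread across all 32 banks.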

The first three perform identically because all of their global memory loads and stores are coalesced, so each runs at the effective peak bandwidth.

Padding is put to perfect use in this example.

Conclusion

Shared memory can optimize a program substantially, but using it well is somewhat involved. The ultimate goal when applying shared memory is to make every global memory access coalesced, achieving the highest memory efficiency.