CUDA Optimization Examples (3): Shared Memory
2018-03-11 21:52
Preface
The experiments in the previous posts showed that shared memory is a core technique for optimizing CUDA programs. With shared memory, a kernel can keep its global-memory traffic fully coalesced while redirecting its strided, interleaved access patterns at shared memory instead. The matrix transpose is a typical example: the rows of a 2D array are written into the columns of another, and writing a column is exactly such a strided access. A block can load the data it works on into shared memory with coalesced reads, perform the complicated strided accesses against shared memory, and finally write the results back to global memory in a coalesced way. When using shared memory, note the following:
Shared memory is scoped to the thread block, so pay attention to synchronization, i.e. the use of __syncthreads().
Shared memory handles complicated access patterns well because it is divided into 32 banks that can be accessed simultaneously. If a warp's requests fall on different banks, all of the data is returned at once; if several requests fall on the same bank (more precisely, on the same row of a bank), the accesses are serialized and throughput drops. Padding is a technique for eliminating bank conflicts and is demonstrated below.
Each bank can be 32 or 64 bits wide, i.e. 4 or 8 bytes, to match single- and double-precision data. Usually each thread of a kernel touches a single 4-byte float or 8-byte double at a time, so the 32 threads of a warp exactly match one row of the 32 banks: 32 banks × 4 bytes = 128 bytes. If there are no bank conflicts, a single shared-memory transaction suffices; otherwise the request is replayed several times. For single-precision data the default 32-bit bank width is fine; for double precision, to avoid wasting the scarce shared-memory resource, the bank width can be set explicitly to 64 bits on devices that support it.
Shared memory is a scarce per-SM resource, so it constrains how many blocks can reside on an SM and therefore affects occupancy; it is worth computing the per-block budget. For the 32-bit bank case: on my GTX 1050 Ti each SM has 48 KB of shared memory, and keeping full occupancy requires up to 16 resident blocks per SM, so each block may allocate at most 48 KB / 16 = 3 KB, which works out to 768 single-precision floats. If a block requests more than that, 16 blocks no longer fit on the SM and full occupancy cannot be guaranteed. (Registers are limited the same way: with 2048 resident threads, each thread can use at most 32 registers.)
Experiment:
This experiment uses an example taken from NVIDIA's official samples. Code:
```cuda
/* Copyright (c) 1993-2015, NVIDIA CORPORATION. All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *  * Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 *  * Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *  * Neither the name of NVIDIA CORPORATION nor the names of its
 *    contributors may be used to endorse or promote products derived
 *    from this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
 * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
 * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
 * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
 * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
 * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
 * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */

#include <stdio.h>
#include <assert.h>

// Convenience function for checking CUDA runtime API results
// can be wrapped around any runtime API call. No-op in release builds.
inline cudaError_t checkCuda(cudaError_t result)
{
#if defined(DEBUG) || defined(_DEBUG)
  if (result != cudaSuccess) {
    fprintf(stderr, "CUDA Runtime Error: %s\n", cudaGetErrorString(result));
    assert(result == cudaSuccess);
  }
#endif
  return result;
}

const int TILE_DIM = 32;
const int BLOCK_ROWS = 8;
const int NUM_REPS = 100;

// Check errors and print GB/s
void postprocess(const float *ref, const float *res, int n, float ms)
{
  bool passed = true;
  for (int i = 0; i < n; i++)
    if (res[i] != ref[i]) {
      printf("%d %f %f\n", i, res[i], ref[i]);
      printf("%25s\n", "*** FAILED ***");
      passed = false;
      break;
    }
  if (passed)
    printf("%20.2f\n", 2 * n * sizeof(float) * 1e-6 * NUM_REPS / ms);
}

// simple copy kernel
// Used as reference case representing best effective bandwidth.
__global__ void copy(float *odata, const float *idata)
{
  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int width = gridDim.x * TILE_DIM;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    odata[(y+j)*width + x] = idata[(y+j)*width + x];
}

// copy kernel using shared memory
// Also used as reference case, demonstrating effect of using shared memory.
__global__ void copySharedMem(float *odata, const float *idata)
{
  __shared__ float tile[TILE_DIM * TILE_DIM];

  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int width = gridDim.x * TILE_DIM;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    tile[(threadIdx.y+j)*TILE_DIM + threadIdx.x] = idata[(y+j)*width + x];

  __syncthreads();

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    odata[(y+j)*width + x] = tile[(threadIdx.y+j)*TILE_DIM + threadIdx.x];
}

// naive transpose
// Simplest transpose; doesn't use shared memory.
// Global memory reads are coalesced but writes are not.
__global__ void transposeNaive(float *odata, const float *idata)
{
  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int width = gridDim.x * TILE_DIM;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    odata[x*width + (y+j)] = idata[(y+j)*width + x];
}

// coalesced transpose
// Uses shared memory to achieve coalescing in both reads and writes
// Tile width == #banks causes shared memory bank conflicts.
__global__ void transposeCoalesced(float *odata, const float *idata)
{
  __shared__ float tile[TILE_DIM][TILE_DIM];

  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int width = gridDim.x * TILE_DIM;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    tile[threadIdx.y+j][threadIdx.x] = idata[(y+j)*width + x];

  __syncthreads();

  x = blockIdx.y * TILE_DIM + threadIdx.x;  // transpose block offset
  y = blockIdx.x * TILE_DIM + threadIdx.y;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    odata[(y+j)*width + x] = tile[threadIdx.x][threadIdx.y + j];
}

// No bank-conflict transpose
// Same as transposeCoalesced except the first tile dimension is padded
// to avoid shared memory bank conflicts.
__global__ void transposeNoBankConflicts(float *odata, const float *idata)
{
  __shared__ float tile[TILE_DIM][TILE_DIM+1];

  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int width = gridDim.x * TILE_DIM;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    tile[threadIdx.y+j][threadIdx.x] = idata[(y+j)*width + x];

  __syncthreads();

  x = blockIdx.y * TILE_DIM + threadIdx.x;  // transpose block offset
  y = blockIdx.x * TILE_DIM + threadIdx.y;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    odata[(y+j)*width + x] = tile[threadIdx.x][threadIdx.y + j];
}

int main(int argc, char **argv)
{
  const int nx = 1024;
  const int ny = 1024;
  const int mem_size = nx*ny*sizeof(float);

  dim3 dimGrid(nx/TILE_DIM, ny/TILE_DIM, 1);
  dim3 dimBlock(TILE_DIM, BLOCK_ROWS, 1);

  int devId = 0;
  if (argc > 1) devId = atoi(argv[1]);

  cudaDeviceProp prop;
  checkCuda( cudaGetDeviceProperties(&prop, devId) );
  printf("\nDevice : %s\n", prop.name);
  printf("Matrix size: %d %d, Block size: %d %d, Tile size: %d %d\n",
         nx, ny, TILE_DIM, BLOCK_ROWS, TILE_DIM, TILE_DIM);
  printf("dimGrid: %d %d %d. dimBlock: %d %d %d\n",
         dimGrid.x, dimGrid.y, dimGrid.z, dimBlock.x, dimBlock.y, dimBlock.z);

  checkCuda( cudaSetDevice(devId) );

  float *h_idata = (float*)malloc(mem_size);
  float *h_cdata = (float*)malloc(mem_size);
  float *h_tdata = (float*)malloc(mem_size);
  float *gold    = (float*)malloc(mem_size);

  float *d_idata, *d_cdata, *d_tdata;
  checkCuda( cudaMalloc(&d_idata, mem_size) );
  checkCuda( cudaMalloc(&d_cdata, mem_size) );
  checkCuda( cudaMalloc(&d_tdata, mem_size) );

  // check parameters and calculate execution configuration
  if (nx % TILE_DIM || ny % TILE_DIM) {
    printf("nx and ny must be a multiple of TILE_DIM\n");
    goto error_exit;
  }
  if (TILE_DIM % BLOCK_ROWS) {
    printf("TILE_DIM must be a multiple of BLOCK_ROWS\n");
    goto error_exit;
  }

  // host
  for (int j = 0; j < ny; j++)
    for (int i = 0; i < nx; i++)
      h_idata[j*nx + i] = j*nx + i;

  // correct result for error checking
  for (int j = 0; j < ny; j++)
    for (int i = 0; i < nx; i++)
      gold[j*nx + i] = h_idata[i*nx + j];

  // device
  checkCuda( cudaMemcpy(d_idata, h_idata, mem_size, cudaMemcpyHostToDevice) );

  // events for timing
  cudaEvent_t startEvent, stopEvent;
  checkCuda( cudaEventCreate(&startEvent) );
  checkCuda( cudaEventCreate(&stopEvent) );
  float ms;

  // ------------
  // time kernels
  // ------------
  printf("%25s%25s\n", "Routine", "Bandwidth (GB/s)");

  // ----
  // copy
  // ----
  printf("%25s", "copy");
  checkCuda( cudaMemset(d_cdata, 0, mem_size) );
  // warm up
  copy<<<dimGrid, dimBlock>>>(d_cdata, d_idata);
  checkCuda( cudaEventRecord(startEvent, 0) );
  for (int i = 0; i < NUM_REPS; i++)
    copy<<<dimGrid, dimBlock>>>(d_cdata, d_idata);
  checkCuda( cudaEventRecord(stopEvent, 0) );
  checkCuda( cudaEventSynchronize(stopEvent) );
  checkCuda( cudaEventElapsedTime(&ms, startEvent, stopEvent) );
  checkCuda( cudaMemcpy(h_cdata, d_cdata, mem_size, cudaMemcpyDeviceToHost) );
  postprocess(h_idata, h_cdata, nx*ny, ms);

  // -------------
  // copySharedMem
  // -------------
  printf("%25s", "shared memory copy");
  checkCuda( cudaMemset(d_cdata, 0, mem_size) );
  // warm up
  copySharedMem<<<dimGrid, dimBlock>>>(d_cdata, d_idata);
  checkCuda( cudaEventRecord(startEvent, 0) );
  for (int i = 0; i < NUM_REPS; i++)
    copySharedMem<<<dimGrid, dimBlock>>>(d_cdata, d_idata);
  checkCuda( cudaEventRecord(stopEvent, 0) );
  checkCuda( cudaEventSynchronize(stopEvent) );
  checkCuda( cudaEventElapsedTime(&ms, startEvent, stopEvent) );
  checkCuda( cudaMemcpy(h_cdata, d_cdata, mem_size, cudaMemcpyDeviceToHost) );
  postprocess(h_idata, h_cdata, nx * ny, ms);

  // --------------
  // transposeNaive
  // --------------
  printf("%25s", "naive transpose");
  checkCuda( cudaMemset(d_tdata, 0, mem_size) );
  // warmup
  transposeNaive<<<dimGrid, dimBlock>>>(d_tdata, d_idata);
  checkCuda( cudaEventRecord(startEvent, 0) );
  for (int i = 0; i < NUM_REPS; i++)
    transposeNaive<<<dimGrid, dimBlock>>>(d_tdata, d_idata);
  checkCuda( cudaEventRecord(stopEvent, 0) );
  checkCuda( cudaEventSynchronize(stopEvent) );
  checkCuda( cudaEventElapsedTime(&ms, startEvent, stopEvent) );
  checkCuda( cudaMemcpy(h_tdata, d_tdata, mem_size, cudaMemcpyDeviceToHost) );
  postprocess(gold, h_tdata, nx * ny, ms);

  // ------------------
  // transposeCoalesced
  // ------------------
  printf("%25s", "coalesced transpose");
  checkCuda( cudaMemset(d_tdata, 0, mem_size) );
  // warmup
  transposeCoalesced<<<dimGrid, dimBlock>>>(d_tdata, d_idata);
  checkCuda( cudaEventRecord(startEvent, 0) );
  for (int i = 0; i < NUM_REPS; i++)
    transposeCoalesced<<<dimGrid, dimBlock>>>(d_tdata, d_idata);
  checkCuda( cudaEventRecord(stopEvent, 0) );
  checkCuda( cudaEventSynchronize(stopEvent) );
  checkCuda( cudaEventElapsedTime(&ms, startEvent, stopEvent) );
  checkCuda( cudaMemcpy(h_tdata, d_tdata, mem_size, cudaMemcpyDeviceToHost) );
  postprocess(gold, h_tdata, nx * ny, ms);

  // ------------------------
  // transposeNoBankConflicts
  // ------------------------
  printf("%25s", "conflict-free transpose");
  checkCuda( cudaMemset(d_tdata, 0, mem_size) );
  // warmup
  transposeNoBankConflicts<<<dimGrid, dimBlock>>>(d_tdata, d_idata);
  checkCuda( cudaEventRecord(startEvent, 0) );
  for (int i = 0; i < NUM_REPS; i++)
    transposeNoBankConflicts<<<dimGrid, dimBlock>>>(d_tdata, d_idata);
  checkCuda( cudaEventRecord(stopEvent, 0) );
  checkCuda( cudaEventSynchronize(stopEvent) );
  checkCuda( cudaEventElapsedTime(&ms, startEvent, stopEvent) );
  checkCuda( cudaMemcpy(h_tdata, d_tdata, mem_size, cudaMemcpyDeviceToHost) );
  postprocess(gold, h_tdata, nx * ny, ms);

error_exit:
  // cleanup
  checkCuda( cudaEventDestroy(startEvent) );
  checkCuda( cudaEventDestroy(stopEvent) );
  checkCuda( cudaFree(d_tdata) );
  checkCuda( cudaFree(d_cdata) );
  checkCuda( cudaFree(d_idata) );
  free(h_idata);
  free(h_tdata);
  free(h_cdata);
  free(gold);
}
```
Results

Measured bandwidths (higher is better):

| Kernel | Bandwidth (GB/s) |
|---|---|
| copy | 95.42 |
| shared memory copy | 97.16 |
| conflict-free transpose | 94.90 |
| coalesced transpose | 53.72 |
| naive transpose | 27.81 |
Analysis
```cuda
const int TILE_DIM = 32; const int BLOCK_ROWS = 8; const int NUM_REPS = 100;
```
Each block has only 32×8 threads, yet it processes a 32×32 tile, because the loop in each kernel makes every thread handle four rows:

```cuda
for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
  odata[(y+j)*width + x] = idata[(y+j)*width + x];
```
NUM_REPS makes each kernel run 100 times, and the reported result is the average bandwidth over those runs.
What each kernel does:
naive transpose:
This is the most straightforward way to implement a transpose: read rows, write columns. The reads are coalesced, but the writes are not. It serves as the lower bound, i.e. the worst case, among all the kernels:

```cuda
for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
  odata[x*width + (y+j)] = idata[(y+j)*width + x];
```
copy:
This provides the practical maximum: both its loads and its stores are coalesced, so it represents the achievable peak bandwidth. In a transpose the stores can never all be coalesced, and indeed copy performs no transpose at all; it is a plain element-wise copy. This is visible in its correctness check, which compares the output against the untransposed input array. Its role is purely to provide a reference peak:

```cuda
postprocess(h_idata, h_cdata, nx*ny, ms);
```
shared memory copy:
Like copy, this kernel only copies, but it stages the data through shared memory: the tile is read from global memory into shared memory with coalesced accesses and written back out the same way. It performs no transpose either; it provides the same reference peak while also isolating the cost of the shared-memory round trip, acting as a controlled comparison.
transposeCoalesced:
Here the tile is loaded from global memory into shared memory with coalesced accesses, and the non-coalesced, transposed access pattern is shifted onto shared memory. When writing out, however, each warp reads a column of the 32×32 shared-memory tile, and every element of that column falls in the same bank, so this kernel suffers severe bank conflicts. Even so, it is far faster than the transpose that does not use shared memory.
conflict-free transpose
This kernel is identical to transposeCoalesced, except that since the write-out phase reads a column of shared memory, the tile is padded with one extra column:

```cuda
__shared__ float tile[TILE_DIM][TILE_DIM+1];
```

Padding: the elements to be read all lie in one column, and with one padding column each row is shifted over by one bank, so the column's 32 elements land on 32 different banks. Bank conflicts are completely eliminated, and performance is as good as the coalesced copy.
copy, shared memory copy, and conflict-free transpose perform roughly the same because all of their global-memory loads and stores are coalesced; they all run at the effective peak. Padding is put to perfect use in this example.
Conclusion
Shared memory offers substantial optimization gains, but using it takes some care. The ultimate goal when applying shared memory is to make every global-memory access coalesced and thereby reach the highest memory efficiency.