您的位置:首页 > 其它

CUDA Pro Tip: Occupancy API Simplifies Launch Configuration

2015-08-31 20:11 435 查看
CUDA programmers often need to decide on a block size to use for a kernel launch. For key kernels, its important to understand the constraints of the kernel and the GPU it is running on to choose a block size that will result in good performance. One common
heuristic used to choose a good block size is to aim for high occupancy,
which is the ratio of the number of active warps per multiprocessor to the maximum number of warps that can be active on the multiprocessor at once. Higher occupancy does not always mean higher performance, but it is a useful metric for gauging the latency
hiding ability of a kernel.


Release
Candidate Available


Become a CUDA Registered Developer and download now!

Before CUDA 6.5, calculating occupancy was tricky. It required implementing a complex computation that took account of the present GPU and its capabilities (including register file and shared memory size), and the properties of the kernel (shared memory usage,
registers per thread, threads per block). Implementating the occupancy calculation is difficult, so very few programmers take this approach, instead using the occupancy calculator spreadsheet included with the CUDA Toolkit to find good block sizes for each
supported GPU architecture.

CUDA 6.5 includes several new runtime functions to aid in occupancy calculations and launch configuration. The core occupancy calculator API,
cudaOccupancyMaxActiveBlocksPerMultiprocessor
produces an
occupancy prediction based on the block size and shared memory usage of a kernel. This function reports occupancy in terms of the number of concurrent thread blocks per multiprocessor. Note that this value can be converted to other metrics. Multiplying by
the number of warps per block yields the number of concurrent warps per multiprocessor; further dividing concurrent warps by max warps per multiprocessor gives the occupancy as a percentage.

CUDA 6.5 also introduces occupancy-based launch configurator APIs,
cudaOccupancyMaxPotentialBlockSize
and
cudaOccupancyMaxPotentialBlockSizeVariableSMem
,
which heuristically calculate a block size that achieves the maximum multiprocessor-level occupancy. You can use the
VariableSmem
version
for kernels where the amount of shared memory allocated depends on the number of threads per block. Note that there are also CUDA driver API equivalents of these functions. The following example demonstrates the use of these APIs. It first chooses a reasonable
block size by calling
cudaOccupancyMaxPotentialBlockSize
, and then calculates the theoretical maximum occupancy the kernel will achieve
on the present device by calling
cudaGetDeviceProperties
and
cudaOccupancyMaxActiveBlocksPerMultiprocessor
.
#include "stdio.h"

__global__ void MyKernel(int *array, int arrayCount)
{
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < arrayCount)
{
array[idx] *= array[idx];
}
}

void launchMyKernel(int *array, int arrayCount)
{
int blockSize;   // The launch configurator returned block size
int minGridSize; // The minimum grid size needed to achieve the
// maximum occupancy for a full device launch
int gridSize;    // The actual grid size needed, based on input size

cudaOccupancyMaxPotentialBlockSize( &minGridSize, &blockSize,
MyKernel, 0, arrayCount);
// Round up according to array size
gridSize = (arrayCount + blockSize - 1) / blockSize;

MyKernel<<< gridSize, blockSize >>>(array, arrayCount);

cudaDeviceSynchronize();

// calculate theoretical occupancy
int maxActiveBlocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor( &maxActiveBlocks,
MyKernel, blockSize,
0);

int device;
cudaDeviceProp props;
cudaGetDevice(&device);
cudaGetDeviceProperties(&props, device);

float occupancy = (maxActiveBlocks * blockSize / props.warpSize) /
(float)(props.maxThreadsPerMultiProcessor /
props.warpSize);

printf("Launched blocks of size %d. Theoretical occupancy: %f\n",
blockSize, occupancy);
}


cudaOccupancyMaxPotentialBlockSize
makes it possible to compute a reasonably efficient execution configuration for a kernel without
having to directly query the kernel’s attributes or the device properties, regardless of what device is present or any compilation details. This can greatly simplify the task of frameworks (such as Thrust), that must launch user-defined kernels. This is also
handy for kernels that are not primary performance bottlenecks, where the programmer just wants a simple way to run the kernel with correct results, rather than hand-tuning the execution configuration.

The CUDA Toolkit version 6.5 also provides a self-documenting, standalone occupancy calculator and launch configurator implementation in
<CUDA_Toolkit_Path>/include/cuda_occupancy.h
for
any use cases that cannot depend on the CUDA software stack. A spreadsheet version of the occupancy calculator is also included (and has been for many CUDA releases). The spreadsheet version is particularly useful as a learning tool that visualizes the impact
of changes to the parameters that affect occupancy (block size, registers per thread, and shared memory per thread). You can find more information in the CUDA C Programming Guide and CUDA Runtime API Reference.
内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
标签: