CUDA Pro Tip: Occupancy API Simplifies Launch Configuration
2015-08-31 20:11
CUDA programmers often need to decide on a block size to use for a kernel launch. For key kernels, it's important to understand the constraints of the kernel and of the GPU it is running on to choose a block size that will result in good performance. One common
heuristic used to choose a good block size is to aim for high occupancy,
which is the ratio of the number of active warps per multiprocessor to the maximum number of warps that can be active on the multiprocessor at once. Higher occupancy does not always mean higher performance, but it is a useful metric for gauging the latency
hiding ability of a kernel.
Before CUDA 6.5, calculating occupancy was tricky. It required implementing a complex computation that took into account the present GPU and its capabilities (including register file and shared memory size) and the properties of the kernel (shared memory usage, registers per thread, threads per block). Implementing the occupancy calculation is difficult, so very few programmers take this approach, instead using the occupancy calculator spreadsheet included with the CUDA Toolkit to find good block sizes for each
supported GPU architecture.
CUDA 6.5 includes several new runtime functions to aid in occupancy calculations and launch configuration. The core occupancy calculator API, cudaOccupancyMaxActiveBlocksPerMultiprocessor, produces an
occupancy prediction based on the block size and shared memory usage of a kernel. This function reports occupancy in terms of the number of concurrent thread blocks per multiprocessor. Note that this value can be converted to other metrics. Multiplying by
the number of warps per block yields the number of concurrent warps per multiprocessor; further dividing concurrent warps by max warps per multiprocessor gives the occupancy as a percentage.
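As a purely illustrative example with hypothetical numbers (not taken from any particular GPU): if the API reports 8 concurrent blocks per multiprocessor for a block size of 128 threads (4 warps per block), that is 8 × 4 = 32 concurrent warps; on a device whose multiprocessors support at most 64 resident warps, the resulting occupancy is 32 / 64 = 50%.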
CUDA 6.5 also introduces occupancy-based launch configurator APIs, cudaOccupancyMaxPotentialBlockSize and cudaOccupancyMaxPotentialBlockSizeVariableSMem, which heuristically calculate a block size that achieves the maximum multiprocessor-level occupancy. You can use the VariableSMem version
for kernels where the amount of shared memory allocated depends on the number of threads per block. Note that there are also CUDA driver API equivalents of these functions. The following example demonstrates the use of these APIs. It first chooses a reasonable
block size by calling cudaOccupancyMaxPotentialBlockSize, and then calculates the theoretical maximum occupancy the kernel will achieve on the present device by calling cudaGetDeviceProperties and cudaOccupancyMaxActiveBlocksPerMultiprocessor.
#include "stdio.h" __global__ void MyKernel(int *array, int arrayCount) { int idx = threadIdx.x + blockIdx.x * blockDim.x; if (idx < arrayCount) { array[idx] *= array[idx]; } } void launchMyKernel(int *array, int arrayCount) { int blockSize; // The launch configurator returned block size int minGridSize; // The minimum grid size needed to achieve the // maximum occupancy for a full device launch int gridSize; // The actual grid size needed, based on input size cudaOccupancyMaxPotentialBlockSize( &minGridSize, &blockSize, MyKernel, 0, arrayCount); // Round up according to array size gridSize = (arrayCount + blockSize - 1) / blockSize; MyKernel<<< gridSize, blockSize >>>(array, arrayCount); cudaDeviceSynchronize(); // calculate theoretical occupancy int maxActiveBlocks; cudaOccupancyMaxActiveBlocksPerMultiprocessor( &maxActiveBlocks, MyKernel, blockSize, 0); int device; cudaDeviceProp props; cudaGetDevice(&device); cudaGetDeviceProperties(&props, device); float occupancy = (maxActiveBlocks * blockSize / props.warpSize) / (float)(props.maxThreadsPerMultiProcessor / props.warpSize); printf("Launched blocks of size %d. Theoretical occupancy: %f\n", blockSize, occupancy); }
cudaOccupancyMaxPotentialBlockSize makes it possible to compute a reasonably efficient execution configuration for a kernel without having to directly query the kernel's attributes or the device properties, regardless of what device is present or any compilation details. This can greatly simplify the task of frameworks (such as Thrust) that must launch user-defined kernels. This is also handy for kernels that are not primary performance bottlenecks, where the programmer just wants a simple way to run the kernel with correct results, rather than hand-tuning the execution configuration.
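To make that use case concrete, here is a minimal sketch of my own (not Thrust's actual implementation): a small helper that launches any kernel with the signature (int *, int) over n elements, deriving the block size from the occupancy API instead of hard-coding it. The helper name and signature are illustrative, not part of the CUDA Toolkit.

// Generic launcher for kernels with the signature (int *, int).
typedef void (*ElementwiseKernel)(int *, int);

cudaError_t launchElementwise(ElementwiseKernel kernel, int *data, int n)
{
    int minGridSize = 0;
    int blockSize   = 0;
    cudaError_t err = cudaOccupancyMaxPotentialBlockSize(
        &minGridSize, &blockSize, kernel, 0, n);
    if (err != cudaSuccess) {
        return err;
    }

    // Round up so every element is covered, then launch through the pointer.
    int gridSize = (n + blockSize - 1) / blockSize;
    kernel<<<gridSize, blockSize>>>(data, n);
    return cudaGetLastError();
}

// Example: launchElementwise(MyKernel, array, arrayCount);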
The CUDA Toolkit version 6.5 also provides a self-documenting, standalone occupancy calculator and launch configurator implementation in
<CUDA_Toolkit_Path>/include/cuda_occupancy.h for
any use cases that cannot depend on the CUDA software stack. A spreadsheet version of the occupancy calculator is also included (and has been for many CUDA releases). The spreadsheet version is particularly useful as a learning tool that visualizes the impact
of changes to the parameters that affect occupancy (block size, registers per thread, and shared memory per thread). You can find more information in the CUDA C Programming Guide and CUDA Runtime API Reference.