Microsoft C++ AMP Accelerated Massive Parallelism
2011-09-30 09:29
Microsoft's C++ AMP Unveiled
http://drdobbs.com/windows/231600761
Over the past few years, some developers have started to take advantage of the power of GPU hardware in their apps. In other words, from their CPU code, they have been offloading the compute-intensive parts of their app to the GPU and enjoying overall performance increases in their solution.
The parts of the code that can be offloaded to an accelerator such as the GPU are data-parallel algorithms operating over large (often multidimensional) arrays of data. So if you have code that fits that pattern, it is in your interest to explore taking advantage of the GPU from your app.
As you start on such a journey, you will face the same issue that so many before you have faced: While your app is written in a mainstream programming language such as C++, most of the options for targeting a data-parallel accelerator involve learning a new niche syntax, obtaining a new set of development tools (including a separate compiler), trying to figure out which hardware you can target and which is out of reach for your chosen option, and perhaps drawing a deployment matrix of what needs to ship to the customer's machine for your solution.
Microsoft is aiming to significantly lower the barrier to entry by providing a mainstream C++ option that we are calling "C++ Accelerated Massive Parallelism" or "C++ AMP" for short.
C++ AMP introduces a key new language feature to C++ and a minimal STL-like library that enables you to work easily with large multidimensional arrays and to express your data-parallel algorithms in a manner that exposes massive parallelism on an accelerator, such as the GPU.
We announced this technology at the AMD Fusion Developer Summit in June 2011. At the same time, we announced our intent to make the specification open, and we are working with other compiler vendors so they can support it in their compilers (on any platform).
Microsoft's implementation of C++ AMP is part of the next version of the Visual C++ compiler and the next version of Visual Studio. A Developer Preview of that release should be available publicly at the time you are reading this article.
Microsoft's implementation targets Windows by building on top of the ubiquitous and reliable Direct3D platform, and that means that in addition to the performance and productivity advantages of C++ AMP, you will benefit from hardware portability across all major hardware vendors. The core API surface area is general and Direct3D-neutral, such that one could think of Direct3D as an implementation detail; in future releases, we could offer additional implementations targeting other kinds of hardware and topologies (e.g., cloud), while preserving your investments in learning our data-parallel API.
So what does the C++ AMP API look like? Before delving into a complete example, the next section covers the basic structure of C++ AMP code. The API itself is declared in the amp.h file, in the concurrency namespace.
C++ AMP API in a Nutshell
You can see an example of a matrix multiplication in C++ AMP in this blog post. Figure 1 shows a simple array addition example demonstrating some core concepts.
Figure 1: Simple array addition using C++ AMP.
To use large datasets with C++ AMP, you can either copy them to an array or wrap them with an array_view. The array class is a container of data of element type T and of rank N, residing on a specific accelerator, that you must explicitly copy data to and from. The array_view class is a wrapper over existing data, and any copying of data required for computations is done implicitly on demand.
The entry point to an algorithm implemented in C++ AMP is one of the overloads of parallel_for_each. Invocations of parallel_for_each are translated by our compiler into GPU code and GPU runtime calls (through Direct3D).
The first argument to parallel_for_each is a grid object. The grid class lets you define an N-dimensional space. The second argument to the parallel_for_each call is a lambda whose parameter list consists of an index object. (If you are not familiar with lambdas in C++, visit this page for further details.) The index class lets you define an N-dimensional point.
The lambda you write and pass to parallel_for_each is called by the C++ AMP runtime once per thread, passing in the thread ID as an index, which you can use to index into your array or array_view objects. The variables for those objects are not explicit in any signature; instead, you capture them into the lambda as needed, which is one of the beauties of a lambda-based design.
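The "one lambda call per index, data captured rather than passed" shape described above can be illustrated in plain standard C++. The sketch below does not use C++ AMP: for_each_index and add_arrays are hypothetical names invented for this illustration, and the helper runs its calls one after another on the CPU, whereas an accelerator runtime is free to run them concurrently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU stand-in for the shape of a parallel_for_each call:
// invokes `body` once for every index in [0, n). Not part of C++ AMP.
template <typename Body>
void for_each_index(std::size_t n, Body body) {
    for (std::size_t i = 0; i < n; ++i) {
        body(i);
    }
}

// Element-wise array addition written in the per-index lambda style.
std::vector<int> add_arrays(const std::vector<int>& a,
                            const std::vector<int>& b) {
    assert(a.size() == b.size());
    std::vector<int> sum(a.size());
    // The lambda's only parameter is the index; a, b, and sum are
    // captured from the enclosing scope instead of being passed in.
    for_each_index(sum.size(), [&](std::size_t i) {
        sum[i] = a[i] + b[i];
    });
    return sum;
}
```

Because each call writes only to its own element of sum, no call depends on another, which is the property that makes this shape suitable for a data-parallel accelerator.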
Note that the lambda (and any other functions that it calls) must be annotated with the new restrict modifier, indicating that the function should be compiled for the target named in the restrict specifier; in our case, any Direct3D device. This new language feature, whose usage is very simple, will be covered in more depth in a separate article.
There are other classes in the C++ AMP API (for example, accelerator and accelerator_view) that let you check the capabilities of accelerators and query information about them, and hence let you choose which one you want your algorithm to execute on.
Finally, there is a tiled overload of parallel_for_each that accepts a tiled_grid and whose lambda takes a tiled_index; within the lambda body, you can use the new tile_static storage class and the tile_barrier class. This variant of parallel_for_each lets you take advantage of the programmable cache on the GPU (also known as shared memory, and known in C++ AMP as tile_static memory) for maximum performance benefits. An example of it will be shown in the next part of the article.
Calculating a Moving Average with C++ AMP
A common problem in finance and science is that of calculating a moving average over a time series. For example, for a given company stock, financial Web sites provide, in addition to a normal chart tracking the stock's price as a function of time, a smoother curve that tracks the stock's average price over the last 50 or 200 days. That curve is the simple moving average of the stock's price.
We start with a serial implementation, which calculates the simple moving average directly from its definition. (In our presentation of the problem, we will assume that we don't need to calculate the values of the moving average for points in the range [0…window-2].)
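A minimal sketch of such a serial implementation in standard C++ follows. The function name and the use of std::vector are assumptions made for this illustration (the article's own listing works on float* buffers named series and moving_average); entries in the range [0…window-2] are simply left at zero.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial simple moving average computed directly from the definition:
// each output point is the mean of the `window` most recent inputs.
// Entries [0, window-2] are left at 0.0f because they are not needed.
std::vector<float> simple_moving_average(const std::vector<float>& series,
                                         std::size_t window) {
    assert(window > 0);
    std::vector<float> moving_average(series.size(), 0.0f);
    for (std::size_t i = window - 1; i < series.size(); ++i) {
        float sum = 0.0f;
        // Add up series[i - window + 1] .. series[i].
        for (std::size_t k = 0; k < window; ++k) {
            sum += series[i - k];
        }
        moving_average[i] = sum / static_cast<float>(window);
    }
    return moving_average;
}
```

Each output point costs `window` additions, so the total work grows with the series length times the window size.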
Simple C++ AMP Version
Our first C++ AMP version of this algorithm is produced by parallelizing the serial algorithm. We note that each iteration of the loop over i is independent and, thus, could safely be parallelized. Compare the serial version with the parallel version below:
We first wrapped the two float* buffers, series and moving_average, with array_view classes. Then we transformed the loop over i into a parallel_for_each call, where the loop bounds have been replaced by an instance of class grid. The body of the lambda is then almost identical to the serial loop.
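The transformation can be sketched in plain standard C++ as a CPU analogy. This is not C++ AMP code: for_each_index is a hypothetical helper standing in for parallel_for_each with its grid argument, it executes the calls sequentially, and std::vector stands in for the array_view wrappers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for parallel_for_each over a 1-D range:
// calls `body` once per index in [first, last). Runs sequentially here.
template <typename Body>
void for_each_index(std::size_t first, std::size_t last, Body body) {
    for (std::size_t i = first; i < last; ++i) {
        body(i);
    }
}

std::vector<float> moving_average_per_index(const std::vector<float>& series,
                                            std::size_t window) {
    assert(window > 0);
    std::vector<float> moving_average(series.size(), 0.0f);
    // The outer loop over i has become a range plus a lambda. The lambda
    // body is the former loop body and touches only moving_average[i],
    // so the calls are independent of one another.
    for_each_index(window - 1, series.size(), [&](std::size_t i) {
        float sum = 0.0f;
        for (std::size_t k = 0; k < window; ++k) {
            sum += series[i - k];
        }
        moving_average[i] = sum / static_cast<float>(window);
    });
    return moving_average;
}
```

The inner loop over the window stays serial inside each call; only the outer loop over output points is handed to the per-index mechanism.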
When we try the parallel algorithm on a machine with a decent DX11 card, we find it is significantly faster than the serial version. For example, on our hardware, for certain data sizes, we saw the GPU perform the calculation 150 times faster than the serial CPU implementation. In fact, the parallel algorithm is also faster than more-sophisticated serial algorithms. The reasons for this increased speed are that the GPU has a high degree of parallelism (many cores) and a wide pipe to the memory subsystem to service each of those cores.
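The article does not say which more-sophisticated serial algorithms were compared. One common candidate, shown here purely as an assumption, is a running-sum version that updates the window total in constant time per point instead of re-adding the whole window. The function name is hypothetical, and the accumulator is a double to limit the rounding drift that builds up over a long series.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Running-sum simple moving average: one add and one subtract per point.
// Entries [0, window-2] are left at 0.0f, as in the direct version.
std::vector<float> moving_average_running_sum(const std::vector<float>& series,
                                              std::size_t window) {
    assert(window > 0);
    std::vector<float> moving_average(series.size(), 0.0f);
    if (series.size() < window) {
        return moving_average;  // No complete window fits in the series.
    }
    double sum = 0.0;
    for (std::size_t i = 0; i < window; ++i) {
        sum += series[i];  // Total of the first full window.
    }
    const double w = static_cast<double>(window);
    moving_average[window - 1] = static_cast<float>(sum / w);
    for (std::size_t i = window; i < series.size(); ++i) {
        sum += series[i];           // Newest point enters the window.
        sum -= series[i - window];  // Oldest point leaves it.
        moving_average[i] = static_cast<float>(sum / w);
    }
    return moving_average;
}
```

This version does far less arithmetic, but every step depends on the previous sum, so it cannot be split into independent per-index calls the way the direct version can.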