CPU vs. CUDA GPU Memory Bandwidth

2011-07-22 10:28
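Original article: http://blog.cudachess.org/2009/07/cpu-vs-cuda-gpu-memory-bandwidth/
Introduction:
I have recently been planning to learn CUDA, but a classmate mentioned in conversation that GPUs are poorly suited to certain kinds of computation because the bottleneck is I/O. Yet when I looked at GPU specifications, the memory bandwidth was very high, so how can that be? The article below answers this question.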

How does the memory bandwidth of a modern CPU compare with that of a CUDA-enabled GPU, and how should the gap be interpreted?

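From what I had worked out so far, I assumed that although GPU memory bandwidth is huge, a CPU's L1 cache could in practice deliver more bandwidth than the current CUDA architecture.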

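A CUDA GPU delivers up to ten times the gigaflops (billions of floating-point operations per second) of a current Core i7/Nehalem processor. To make full use of that compute power, we need to feed it data and unload results as fast as possible, whether from global video memory or from the computer's main memory.
I found an interesting article that benchmarked the cache and memory bandwidth of an overclocked Core i7 with triple-channel DDR3. The L1 cache peaks at about 50 GB/s for reads or for writes, and because both can proceed at once the combined peak reaches 100 GB/s, whereas main memory (triple-channel DDR3) is limited to 16 GB/s. That is astonishing: the L1 cache of a three-year-old Athlon X2 3800+ (2×2 GHz) delivers no more than today's main memory! (Translator's note: I suspect a typo in the original; the intended point may be that the three-year-old cache is faster than today's main memory, not the reverse.)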
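One way to get a feel for figures like these is a rough copy benchmark. The sketch below is not the method used in the cited benchmark; it is a minimal Python illustration, and the helper name `copy_bandwidth_gb_s` and its parameters are invented for this example. It times bulk `bytearray` copies and counts both the bytes read and the bytes written. Interpreter overhead means that small, cache-sized buffers will report less than the hardware can actually deliver, so treat the output as a lower bound.

```python
import time


def copy_bandwidth_gb_s(size_bytes: int, total_bytes: int = 1 << 28, repeats: int = 3) -> float:
    """Estimate copy bandwidth in GB/s for a buffer of size_bytes (hypothetical helper)."""
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    # Copy the buffer enough times that each timed run moves about total_bytes.
    inner = max(1, total_bytes // size_bytes)
    best = float("inf")
    for _run in range(repeats):
        start = time.perf_counter()
        for _copy in range(inner):
            dst[:] = src  # bulk copy: size_bytes read plus size_bytes written
        best = min(best, time.perf_counter() - start)
    # Count both the bytes read and the bytes written.
    return 2 * size_bytes * inner / best / 1e9


if __name__ == "__main__":
    # Buffer sizes are assumptions: 16 KiB should fit in a 32 KB L1 cache,
    # while 64 MiB is meant to exceed the caches of most CPUs.
    for label, size in [("16 KiB buffer", 16 << 10), ("64 MiB buffer", 64 << 20)]:
        print(f"{label}: {copy_bandwidth_gb_s(size):.1f} GB/s")
```

On most machines the small buffer should report a clearly higher figure than the large one, which is the cache versus main-memory gap the article describes.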

The GPU counterpart of the CPU's L1 cache (32 KB) is CUDA shared memory (16 KB per group of 8 scalar processors), and it runs at about the same speed, around 50 GB/s.

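The GPU counterpart of main memory is global memory, whose bandwidth reaches 100 GB/s to 150 GB/s, nearly 8 times that of the computer's main memory. This comes from having more 64-bit interfaces (8 versus 3) and higher clock rates.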
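The role of interface count and clock rate can be made concrete with the usual peak-bandwidth formula: bytes per transfer times transfers per second. The sketch below is illustrative only. The interface counts (8 versus 3) and the measured figures (16 GB/s, 100 to 150 GB/s) come from the article; the transfer rates are assumptions that the article does not state (DDR3-1333 for the CPU, and 2484 MT/s for the GTX 285's GDDR3), and the function name is hypothetical.

```python
def peak_bandwidth_gb_s(interfaces: int, bits_per_interface: int, transfers_per_s: float) -> float:
    """Theoretical peak bandwidth in GB/s (hypothetical helper)."""
    bytes_per_transfer = interfaces * bits_per_interface / 8
    return bytes_per_transfer * transfers_per_s / 1e9


# Assumed transfer rates; the article gives only the interface counts.
cpu_peak = peak_bandwidth_gb_s(3, 64, 1333e6)  # triple-channel DDR3-1333, about 32 GB/s
gpu_peak = peak_bandwidth_gb_s(8, 64, 2484e6)  # 8 x 64-bit GDDR3, about 159 GB/s

# Measured figures quoted in the article.
measured_main_memory = 16.0
measured_global_memory = (100.0, 150.0)
measured_ratios = tuple(bw / measured_main_memory for bw in measured_global_memory)

print(f"theoretical peak: CPU {cpu_peak:.1f} GB/s, GPU {gpu_peak:.1f} GB/s")
print(f"measured ratio: {measured_ratios[0]:.2f}x to {measured_ratios[1]:.2f}x")
```

Under these assumptions the measured ratio spans roughly 6x to 9x, which brackets the article's "nearly 8 times", while measured main-memory bandwidth sits well below its theoretical peak.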

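Next, compare the aggregate speed of GPU shared memory with that of the CPU's L1 caches. On the i7, each of the four cores has its own L1 cache, so the combined peak reaches 200 to 400 GB/s depending on the task. The CUDA GTX 285 has 30 groups of 8 scalar processors, so its shared-memory bandwidth can reach 1500 GB/s, about 4 times the aggregate L1 bandwidth of an overclocked i7.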
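The aggregate figures follow from simple multiplication, which the short sketch below reproduces. All inputs are the article's own numbers; only the variable names are invented here.

```python
CPU_CORES = 4
L1_PER_CORE_GB_S = (50, 100)   # one direction only, or reads and writes at once
GPU_GROUPS = 30                # groups of 8 scalar processors on a GTX 285
SHARED_PER_GROUP_GB_S = 50

cpu_l1_total = tuple(CPU_CORES * bw for bw in L1_PER_CORE_GB_S)
gpu_shared_total = GPU_GROUPS * SHARED_PER_GROUP_GB_S
ratio = gpu_shared_total / cpu_l1_total[1]  # against the CPU's best case

print(f"aggregate L1 cache: {cpu_l1_total[0]} to {cpu_l1_total[1]} GB/s")
print(f"aggregate shared memory: {gpu_shared_total} GB/s")
print(f"ratio: {ratio}x")
```

Measured against the CPU's best case of 400 GB/s the ratio is 3.75, which the article rounds to 4.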

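To sum up, a CUDA GPU's global memory is up to 8 times as fast as the computer's main memory, and its shared memory is about 4 times as fast as the L1 cache of a modern CPU.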

The original text follows:

What is the memory bandwidth of modern CPU versus that of CUDA-enabled GPU?

As far as I figured it out, I thought GPU memory bandwidth was huge, but I thought that memory bandwidth of CPU L1-cache could be effectively better than actual CUDA architecture.

With all the horsepower delivered by CUDA GPU, up to 10X Gigaflops on GTX than current Core i7/Nehalem processors, we all need to be able to feed them with data and unload results as fast as possible in memory (global videocard memory or computer’s main memory).

I found an interesting article that benchmarked overclocked Core i7 cache and memory bandwidth, in triple-channel with fast DDR3: L1 cache peaks around 50GB/s reading or writing but could do both at once, peaking at 100GB/s, while main computer memory (triple-channel DDR3) was limited to 16 GB/s. That’s actually astonishing anyway, a 3 years old Athlon X2 3800+ (2×2Hz) L1-cache doesn’t deliver more than actual main memory of today!!!

To compare the L1 cache of a CPU (32KB), we should use CUDA Shared Memory (16KB/8 Scalar Processors), and it delivers around 50GB/s too, a value that is strangely similar.

To compare the main memory of the computer we have the Global Memory and it delivers between 100GB/s and 150GB/s, nearly 8X the computer’s main memory bandwidth, due to multiple 64-bits interface (8 instead 3) and higher clock values.

But when you test a shared memory access or a L1-cache access speed, you have to think there’s 4 core on a core i7, each one with it’s dedicated L1-cache, peaking at 200GB-400GB/s depending on the tasks.

On the other side, with 30 groups of 8 Scalar Processors, the Shared Memory of a CUDA GTX 285 may deliver 1500 GB/s, around 4X the aggregated L1-cache of an overclocked Core i7!

To resume, CUDA-enabled GPU offers up to 8X the speed of main memory and 4X the speed of L1-cache compared to a moderne CPU, and it shows!