Analyzing the STREAM Memory Benchmark (Part 1): Basic Parameters
2015-07-16 10:11
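STREAM is the industry-standard benchmark for measuring memory bandwidth. As a server engineer evaluating and comparing server performance, how do you turn its single source file into a test tool that fits your own machine? Let's walk through the basic process.
First, the simplest way to compile it on Linux: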
gcc -O stream.c -o stream.o
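This build uses the default parameters of both the program and the compiler. Despite the .o suffix, the resulting stream.o is a linked executable and can be run directly. The output looks like this: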
# ./stream.o
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 45000000 (elements), Offset = 0 (elements)
Memory per array = 343.3 MiB (= 0.3 GiB).
Total memory required = 1030.0 MiB (= 1.0 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 122074 microseconds.
(= 122074 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            5376.3     0.134082     0.133922     0.134222
Scale:           5139.4     0.140465     0.140093     0.140754
Add:             5680.4     0.190304     0.190128     0.190547
Triad:           5476.8     0.197417     0.197195     0.197598
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
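The Best Rate column in the output above can be reproduced from the Min time column. The sketch below assumes the accounting used by the standard stream.c: Copy and Scale touch two arrays per pass (one read, one written), Add and Triad touch three, and 1 MB means 10^6 bytes. The function name stream_rate_mbs is my own and does not appear in the STREAM source.

```c
#include <stddef.h>

/* Bandwidth in MB/s (1 MB = 10^6 bytes) for a kernel that touches
 * `arrays_touched` arrays of `array_size` elements in `min_time_s` seconds.
 * arrays_touched is 2 for Copy and Scale, 3 for Add and Triad (assumed
 * to match the byte counts used by stream.c). */
double stream_rate_mbs(int arrays_touched, size_t array_size,
                       size_t elem_bytes, double min_time_s)
{
    double bytes_moved = (double)arrays_touched
                       * (double)elem_bytes
                       * (double)array_size;
    return 1.0e-6 * bytes_moved / min_time_s;
}
```

For example, stream_rate_mbs(2, 45000000, 8, 0.133922) evaluates to roughly 5376.3, which matches the Copy row, and stream_rate_mbs(3, 45000000, 8, 0.197195) to roughly 5476.8, which matches the Triad row.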
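The results above were measured on a 6-core Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz with 8 x 1333 MHz memory. The defaults that matter most for this result are single-threaded execution and Array size = 45000000. The following are all of the parameters that can be defined in the source file.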
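#ifdef _OPENMP                  /* set by the compiler when OpenMP is enabled (multiprocessor runs) */
extern int omp_get_num_threads();
#endif
#ifndef STREAM_ARRAY_SIZE       /* number of elements in each array */
#   define STREAM_ARRAY_SIZE 45000000
#endif
#ifdef NTIMES                   /* if NTIMES was defined with an invalid value (<= 1), fall back to 10 */
#if NTIMES<=1
#   define NTIMES 10
#endif
#endif
#ifndef NTIMES                  /* if NTIMES was not defined, run each kernel 10 times */
#   define NTIMES 10
#endif
#ifndef OFFSET                  /* array offset, in elements */
#   define OFFSET 0
#endif
#ifndef STREAM_TYPE             /* element type: double precision, 64 bits, 8 bytes */
#define STREAM_TYPE double
#endif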
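These are the defaults predefined in the source. They can be overridden at compile time with -D; only the thread count can be changed at run time, through OMP_NUM_THREADS. To make this concrete, here is the command I use to build for a multi-core machine with a large amount of memory: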
gcc -mtune=native -march=native -O3 -mcmodel=medium -fopenmp -DSTREAM_ARRAY_SIZE=100000000 -DNTIMES=30 -DOFFSET=4096 stream.c -o stream.o
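Explanation:
-mtune=native -march=native; optimizes for the CPU's instruction set. The build machine is also the machine under test here, so native is the appropriate choice. For more CPU-specific compiler options see: http://gcc.gnu.org/onlinedocs/gcc-4.5.3/gcc/i386-and-x86_002d64-Options.html
-O3; the compiler optimization level.
-mcmodel=medium; needed once the statically allocated arrays exceed 2 GB in total. With 100000000 elements each array is about 0.75 GiB, so the three together (about 2.2 GiB) cross that limit.
-fopenmp; enables OpenMP for multiprocessor runs. Once enabled, the program defaults to as many threads as the CPU has hardware threads. The thread count can also be set at run time: export OMP_NUM_THREADS=12 # 12 is the number of threads to use
-DSTREAM_ARRAY_SIZE=100000000; the number of elements in each of the arrays a[], b[] and c[].
-DNTIMES=30; the number of times each kernel is run. The best of these runs is reported.
-DOFFSET=4096; the array offset. It can usually be left undefined.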
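To check how many threads a -fopenmp build will actually use, a small helper along the following lines can be compiled with and without the flag. This is only a sketch: the function name stream_thread_count is hypothetical, and it calls omp_get_max_threads() from the OpenMP runtime rather than counting threads inside a parallel region the way stream.c does.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Number of threads an OpenMP parallel region would use by default.
 * Built without -fopenmp, _OPENMP is undefined and the run is
 * single-threaded, so the function reports 1. */
int stream_thread_count(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();   /* reflects OMP_NUM_THREADS if it is set */
#else
    return 1;
#endif
}
```

Guarding the OpenMP call with _OPENMP follows the same pattern as the parameter block shown earlier, so the same file builds either way.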
Of these, STREAM_ARRAY_SIZE has a large effect on the results, and the source code gives rules of thumb for choosing it:
You should adjust the value of 'STREAM_ARRAY_SIZE' (below)
* to meet *both* of the following criteria:
* (a) Each array must be at least 4 times the size of the
* available cache memory. I don't worry about the difference
* between 10^6 and 2^20, so in practice the minimum array size
* is about 3.8 times the cache size.
* Example 1: One Xeon E3 with 8 MB L3 cache
* STREAM_ARRAY_SIZE should be >= 4 million, giving
* an array size of 30.5 MB and a total memory requirement
* of 91.5 MB. // Note: each array should occupy at least 4 times the CPU cache. When converting an element count to bytes, remember that a double is 64 bits, i.e. 8 bytes.
* Example 2: Two Xeon E5's with 20 MB L3 cache each (using OpenMP)
* STREAM_ARRAY_SIZE should be >= 20 million, giving
* an array size of 153 MB and a total memory requirement
* of 458 MB.
* (b) The size should be large enough so that the 'timing calibration'
* output by the program is at least 20 clock-ticks.
* Example: most versions of Windows have a 10 millisecond timer
* granularity. 20 "ticks" at 10 ms/tic is 200 milliseconds.
* If the chip is capable of 10 GB/s, it moves 2 GB in 200 msec.
* This means the each array must be at least 1 GB, or 128M elements. // Note: check the timing hint printed at run time and adjust the array size if needed.
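The two criteria quoted above come down to simple arithmetic. The sketch below is my own illustration, not code from stream.c: all function names are hypothetical, cache sizes use decimal megabytes as the source comment does, and criterion (b) assumes the kernel touches two arrays per pass (as Copy does).

```c
/* Criterion (a): each array should be at least 4x the total cache size.
 * total_cache_bytes is the sum over all sockets. */
unsigned long long min_elements_for_cache(unsigned long long total_cache_bytes,
                                          unsigned long long elem_bytes)
{
    return 4ULL * total_cache_bytes / elem_bytes;
}

/* Criterion (b): one pass should last at least 20 timer ticks.
 * Microseconds times MB/s (10^6 bytes/s) gives bytes directly.
 * The bytes moved are split over 2 arrays (assumption: a Copy-like kernel). */
unsigned long long min_elements_for_timer(unsigned long long tick_us,
                                          unsigned long long bandwidth_mb_per_s,
                                          unsigned long long elem_bytes)
{
    unsigned long long bytes_moved = 20ULL * tick_us * bandwidth_mb_per_s;
    return bytes_moved / 2ULL / elem_bytes;
}

/* Footprint of one array in MiB, the unit STREAM prints at startup. */
double array_mib(unsigned long long elements, unsigned long long elem_bytes)
{
    return (double)elements * (double)elem_bytes / (1024.0 * 1024.0);
}
```

With an 8 MB cache and 8-byte elements, min_elements_for_cache(8000000, 8) gives 4000000 elements, and array_mib(4000000, 8) is about 30.5, as in Example 1. For a 10 ms timer and 10 GB/s, min_elements_for_timer(10000, 10000, 8) gives 125000000 elements, which the source comment rounds to 128M by using binary units.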
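-DOFFSET=4096; the array offset. It changes how the arrays are aligned relative to each other in memory, which can matter because the CPU does not read memory one bit at a time but in larger units such as cache lines. It can usually be left undefined, and defining it does not necessarily have any effect. Below are the explanation from the source and the array declarations, the one place where OFFSET affects the data layout.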
Users are allowed to modify the "OFFSET" variable, which *may* change the
* relative alignment of the arrays (though compilers may change the
* effective offset by making the arrays non-contiguous on some systems).
* Use of non-zero values for OFFSET can be especially helpful if the
* STREAM_ARRAY_SIZE is set to a value close to a large power of 2.
* OFFSET can also be set on the compile line without changing the source
* code using, for example, "-DOFFSET=56".
#ifndef STREAM_TYPE
#define STREAM_TYPE double
#endif
static STREAM_TYPE a[STREAM_ARRAY_SIZE+OFFSET],
                   b[STREAM_ARRAY_SIZE+OFFSET],
                   c[STREAM_ARRAY_SIZE+OFFSET]; // setting OFFSET makes each array longer
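The remark about sizes close to a large power of 2 can be illustrated with a little address arithmetic. The sketch below rests on two assumptions that the source does not guarantee: that the compiler places a, b and c back to back, and that a[i] and b[i] compete for the same cache set when their distance is an exact multiple of the cache size divided by its associativity. Both function names and the 32 KiB figure in the example are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Distance in bytes between a[i] and b[i], ASSUMING the static arrays
 * are laid out contiguously in declaration order. */
size_t array_stride_bytes(size_t array_size, size_t offset, size_t elem_bytes)
{
    return (array_size + offset) * elem_bytes;
}

/* True when two addresses `stride_bytes` apart map to the same cache set.
 * way_span_bytes = cache size / associativity (hypothetical value). */
bool maps_to_same_set(size_t stride_bytes, size_t way_span_bytes)
{
    return stride_bytes % way_span_bytes == 0;
}
```

With 2^24 elements of 8 bytes and OFFSET left at 0, the stride is 2^27 bytes and maps_to_same_set(stride, 32768) is true. With OFFSET=56, as in the source comment, the stride grows by 448 bytes and the check is false, so the three arrays no longer line up on the same sets.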