Evaluating Intel vector instructions for matrix multiplication
2017-02-13 22:05
With the rapid development of machine learning and other AI techniques, matrix multiplication is used more and more widely. Intel chips have successively provided several families of vector instructions, including MMX, SSE, and AVX, all supporting SIMD operation. Later, to better support matrix multiplication, the FMA (Fused Multiply-Add) instructions were added. An FMA instruction takes three vector operands va, vb, vc and computes (va * vb) + vc, where both the multiplication and the addition are elementwise, so the result is a vector of the same length. FMA is convenient for matrix multiplication, but the same effect can be achieved by combining the multiply and add instructions of the AVX family. This article uses an example to compare the performance and precision of the different vector instructions in matrix multiplication.
The example computes the product of a matrix W and a vector x; the number of columns of W equals the length of x, and the result is again a vector whose length equals the number of rows of W. The implementation is as follows.
#include <stdio.h>
#include <stdlib.h>   // rand()
#include <time.h>
#include <x86intrin.h>

int main() {
    const int col = 1024, row = 64, num_trails = 1000000;
    float w[row][col];
    float x[col];
    float y[row];
    float scratchpad[8];

    // Fill W and x with pseudo-random values in [0, 1.25).
    for (int i = 0; i < row; i++) {
        for (int j = 0; j < col; j++) {
            w[i][j] = (float)(rand() % 1000) / 800.0f;
        }
    }
    for (int j = 0; j < col; j++) {
        x[j] = (float)(rand() % 1000) / 800.0f;
    }

    clock_t t1, t2;

    // The original (scalar) matrix multiplication version.
    t1 = clock();
    for (int r = 0; r < num_trails; r++)
        for (int j = 0; j < row; j++) {
            float sum = 0;
            float *wj = w[j];
            for (int i = 0; i < col; i++)
                sum += wj[i] * x[i];
            y[j] = sum;
        }
    t2 = clock();
    float diff = ((float)t2 - (float)t1) / CLOCKS_PER_SEC;
    printf("\nTime taken: %.2f second.\n", diff);
    for (int i = 0; i < row; i++) {
        printf("%.4f, ", y[i]);
    }
    printf("\n");

    // The AVX matrix multiplication version: separate multiply and add.
    const int col_reduced_8 = col - col % 8;  // largest multiple of 8 <= col
    __m256 op0, op1, tgt, tmp_vec;
    t1 = clock();
    for (int r = 0; r < num_trails; r++)
        for (int i = 0; i < row; i++) {
            float res = 0;
            tgt = _mm256_setzero_ps();
            for (int j = 0; j < col_reduced_8; j += 8) {
                op0 = _mm256_loadu_ps(&x[j]);
                op1 = _mm256_loadu_ps(&w[i][j]);
                tmp_vec = _mm256_mul_ps(op0, op1);
                tgt = _mm256_add_ps(tmp_vec, tgt);
            }
            // Horizontal sum of the 8 partial sums.
            _mm256_storeu_ps(scratchpad, tgt);
            for (int k = 0; k < 8; k++)
                res += scratchpad[k];
            // Scalar tail for columns beyond the last full vector.
            for (int l = col_reduced_8; l < col; l++) {
                res += w[i][l] * x[l];
            }
            y[i] = res;
        }
    t2 = clock();
    diff = ((float)t2 - (float)t1) / CLOCKS_PER_SEC;
    printf("\nTime taken: %.2f second.\n", diff);
    for (int i = 0; i < row; i++) {
        printf("%.4f, ", y[i]);
    }
    printf("\n");

    // The FMA matrix multiplication version: fused multiply-add.
    t1 = clock();
    for (int r = 0; r < num_trails; r++)
        for (int i = 0; i < row; i++) {
            float rlt = 0;
            tgt = _mm256_setzero_ps();
            for (int j = 0; j < col_reduced_8; j += 8) {
                op0 = _mm256_loadu_ps(&x[j]);
                op1 = _mm256_loadu_ps(&w[i][j]);
                tgt = _mm256_fmadd_ps(op0, op1, tgt);
            }
            _mm256_storeu_ps(scratchpad, tgt);
            for (int k = 0; k < 8; k++) {
                rlt += scratchpad[k];
            }
            for (int l = col_reduced_8; l < col; l++) {
                rlt += w[i][l] * x[l];
            }
            y[i] = rlt;
        }
    t2 = clock();
    diff = ((float)t2 - (float)t1) / CLOCKS_PER_SEC;
    printf("\nTime taken: %.2f second.\n", diff);
    for (int i = 0; i < row; i++) {
        printf("%.4f, ", y[i]);
    }
    printf("\n");

    return 0;
}
On Ubuntu, the program is compiled with:
gcc -O2 -mfma test.c -o test
Note that the program can only run on chips that support FMA. You can check whether the CPU supports it with:
cat /proc/cpuinfo | grep fma
The output of one run is:
Time taken: 93.56 second.
409.8341, 413.4546, 398.7332, 399.8303, 404.1195, 402.3861, 394.6979, 412.6429, 409.0014, 390.9019, 400.3911, 392.7900, 400.5019, 418.6781, 399.3336, 404.0719, 414.9839, 411.6887, 396.0086, 406.6972, 384.5781, 399.3724, 400.0473, 391.6383, 401.3511, 400.8543, 418.4066, 406.6425, 405.5102, 408.4534, 403.0285, 406.3510, 410.2005, 414.9617, 417.3602, 406.4511, 397.1705, 406.1265, 393.3314, 407.1777, 389.9053, 397.3145, 401.7866, 413.3134, 415.7482, 414.2341, 403.3439, 405.4922, 395.4076, 399.6389, 409.6675, 419.8184, 412.3336, 399.8252, 403.3434, 387.4861, 402.2747, 399.8241, 414.1568, 405.4861, 406.6151, 410.4040, 408.9755, 398.9610,
Time taken: 10.94 second.
409.8341, 413.4549, 398.7335, 399.8304, 404.1191, 402.3860, 394.6979, 412.6424, 409.0016, 390.9022, 400.3909, 392.7900, 400.5020, 418.6781, 399.3336, 404.0718, 414.9842, 411.6884, 396.0087, 406.6971, 384.5780, 399.3723, 400.0472, 391.6382, 401.3510, 400.8541, 418.4067, 406.6424, 405.5103, 408.4536, 403.0287, 406.3513, 410.2007, 414.9618, 417.3603, 406.4513, 397.1708, 406.1266, 393.3315, 407.1776, 389.9049, 397.3150, 401.7864, 413.3134, 415.7483, 414.2341, 403.3439, 405.4922, 395.4075, 399.6392, 409.6674, 419.8183, 412.3336, 399.8253, 403.3433, 387.4865, 402.2746, 399.8239, 414.1567, 405.4861, 406.6153, 410.4034, 408.9752, 398.9612,
Time taken: 12.08 second.
409.8341, 413.4549, 398.7335, 399.8304, 404.1191, 402.3860, 394.6979, 412.6424, 409.0016, 390.9022, 400.3909, 392.7900, 400.5021, 418.6781, 399.3336, 404.0718, 414.9842, 411.6884, 396.0087, 406.6971, 384.5780, 399.3722, 400.0472, 391.6382, 401.3510, 400.8541, 418.4067, 406.6424, 405.5102, 408.4536, 403.0287, 406.3513, 410.2007, 414.9618, 417.3603, 406.4513, 397.1708, 406.1266, 393.3315, 407.1776, 389.9050, 397.3150, 401.7864, 413.3134, 415.7483, 414.2341, 403.3439, 405.4922, 395.4075, 399.6392, 409.6674, 419.8183, 412.3336, 399.8253, 403.3433, 387.4865, 402.2746, 399.8239, 414.1568, 405.4861, 406.6153, 410.4034, 408.9752, 398.9612,
As the output shows, the AVX multiply-and-add combination is even slightly faster than the FMA version here (10.94 s vs. 12.08 s). The precision of the two vectorized versions is similar, and both deviate slightly from the scalar computation in the last decimal places.