<2012 12 17> Why ARM's EABI Matters? Floating-point performance on ARM-Linux platforms.
2012-12-17 16:57
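There are two ways to do floating-point arithmetic on an ARM processor that has no hardware FPU (floating point unit). The first is the OABI approach: the cross compiler builds the program the ordinary way, as if the platform had floating-point hardware; at run time each floating-point instruction is invalid on the core and raises an exception, which the kernel handles by calling a library routine that emulates the operation. The other is the EABI approach: the compiler is given a soft-float option (“ -mfloat-abi=soft ”), so floating-point operations become calls to soft-float library functions that are linked in at build time, and the generated code contains no trapping instructions. Note that “ -mfpu=vfp -mfloat-abi=softfp ” is a different setting: it keeps the soft-float calling convention but lets the compiler emit VFP instructions, so it fits only cores that actually have a VFP unit.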
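The EABI approach is a real step forward. By default the old-ABI cross compiler assumes hardware floating point in the form of the FPA (Floating Point Accelerator). A CPU without an FPA, such as the SAMSUNG S3C2410/S3C2440, falls back on the FPE (floating point emulation, i.e. soft float done in the kernel) mechanism, which severely limits speed. EABI (Embedded Application Binary Interface) improves on this. ARM EABI has many improvements, and the most prominent is floating-point performance: floating-point operations are carried out by soft-float library calls, or by VFP (Vector Floating Point) instructions on cores that have a VFP unit, instead of by trapping into the kernel, which greatly speeds up programs that do floating-point work.
The article below explains the two approaches well and benchmarks them (the improvement is obvious, roughly 10x):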
Why ARM's EABI Matters
by Andres Calderon and Nelson Castillo
It's common nowadays to hear of the new ARM EABI (embedded application binary interface) Linux port. There are many motivations to start using it, but there is one we especially like -- it's much faster for floating point operations. Since many ARM cores lack a hardware FPU (floating point unit), any software acceleration is more than welcome.
It might be hard to switch to EABI, though. For instance, for the Debian distribution, EABI is actually considered a new port.
Without EABI
The ARM EABI improves the floating point performance. This is not surprising, if you read how your processor is wasting a lot of cycles now. From the DebianARM-EABI wiki:
The current Debian port creates hardfloat FPA instructions. FPA comes from "Floating Point Accelerator." Since the FPA floating point unit was implemented only in very few ARM cores, these days FPA instructions are emulated in kernel via Illegal instruction faults. This is of course very inefficient: about 10 times slower than -msoft-float for a FIR test program. The FPA unit also has the peculiarity of having mixed-endian doubles, which is usually the biggest grief for ARM porters, along with structure packing issues.
So, what does this mean? It means that the compilers usually generate instructions for a piece of hardware, namely a Floating Point Unit, that is not actually there! When you make a floating point operation, such as 3.58*x, the CPU runs into an illegal instruction, and it raises an exception. The kernel catches this specific exception and performs the intended floating point operation, and then resumes executing the program. And this is slow because it implies a context switch.
The benchmark
We decided to make a simple benchmark using our Open Hardware Free ECB_AT91 ARM (ARMv4T) development board, based on an Atmel AT91RM9200 processor.
We used a simple benchmark we have used before: the dot product of two given vectors, the Euclidean distance of the vectors, and the FFT (fast Fourier transform) algorithm (complex valued, Cooley and Tukey radix-2). The source code we used is available here (GPL).
It's common to use the number of floating point operations per second (FLOPS) performed by a given program for benchmarking purposes. However, this can be misleading, because some operations (e.g. division) take more time than others (e.g. addition). To ensure uniformity, we ran the same program in both setups, with similar compiler flags.
First we tried the Old ABI using the Debian distribution (Debian Sid), and an image that we bootstrapped. Then, for the EABI test, we used the Angstrom Distribution, part of the OpenEmbedded project.
Results
[Figure: EABI vs. OABI, floating point benchmark (Free_ECB_AT91_V1.5, AT91RM9200)]
[Figure: EABI/OABI speed-up, floating point benchmark (Free_ECB_AT91_V1.5, AT91RM9200)]
In each context switch, both the data and instruction cache are flushed, and this hurts the Old ABI's performance. You will notice it in the graphs because the performance with the old ABI does not depend on the size (N) of the input data, whereas in EABI the impact of the cache on performance is seen clearly. The dot-product performance only goes down when N > 4096 (when we use more than 16 KB of memory); the Atmel processor we're using has a 16 KB data cache.
Source: http://linuxdevices.com/articles/AT5920399313.html