ARM basic training

2015-12-14 11:37

Concepts

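Architecture, instruction set, and products

Four terms need to be kept apart: architecture (Arch), instruction set, ARM core / ARM product, and the various SoC products built on them.

An architecture and its instruction set generally map one to one (they can be treated as the same category); for example, the ARMv7 architecture corresponds to the ARMv7 instruction set.

A core generally refers to a processor design released by ARM, such as the Cortex-A series A53, A57, and A72.

An SoC product contains the CPU, GPU, baseband, and so on. Chip vendors design their own SoCs around either ARM cores or the ARM instruction set. Some use the stock A53 design (a processor license), such as the Qualcomm Snapdragon 810; others do not use the A53 and design their own core (an instruction set license), such as the Qualcomm Snapdragon 820 with the Kryo core.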

RISC vs CISC

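A RISC (reduced instruction set computer) is a microprocessor that executes a smaller number of instruction types. It originated with the MIPS machines of the 1980s (that is, RISC machines); the microprocessors used in RISC machines are collectively called RISC processors.

ARM, MIPS, and PowerPC are all processor architectures based on RISC.

The x86 architecture is a CISC (complex instruction set computer) design with variable-length instructions.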

APCS

ARM Procedure Call Standard

ARMv7

Reference

DDI0406C_C_arm_architecture_reference_manual.pdf, downloaded from www.arm.com

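ARM operating modes

ARM instructions are 32 bits wide; Thumb instructions are 16 bits wide (Thumb-2 adds 32-bit encodings).

ARM operating modes are divided into unprivileged and privileged modes.

The unprivileged mode (User) mainly runs user-space programs.

The privileged modes are FIQ, IRQ, Supervisor (SVC mode, entered on reset or a software interrupt; this is the mode the kernel runs in), Abort (memory access violations), Undef (undefined instructions), and System.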
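The current mode is held in the low five bits (M[4:0]) of the CPSR. The following is a minimal sketch, not taken from the original notes, that decodes those bits using the ARMv7-A mode encodings; the names arm_mode, cpsr_mode_name, and cpsr_is_privileged are made up for illustration.

```c
#include <stdint.h>

/* Mode field M[4:0] of the CPSR, ARMv7-A encodings for the modes listed above. */
enum arm_mode {
    MODE_USR = 0x10, /* User (unprivileged) */
    MODE_FIQ = 0x11,
    MODE_IRQ = 0x12,
    MODE_SVC = 0x13, /* Supervisor */
    MODE_ABT = 0x17, /* Abort */
    MODE_UND = 0x1B, /* Undefined */
    MODE_SYS = 0x1F  /* System */
};

/* Hypothetical helper: return a short name for the mode encoded in a CPSR value. */
static const char *cpsr_mode_name(uint32_t cpsr)
{
    switch (cpsr & 0x1Fu) {
    case MODE_USR: return "User";
    case MODE_FIQ: return "FIQ";
    case MODE_IRQ: return "IRQ";
    case MODE_SVC: return "Supervisor";
    case MODE_ABT: return "Abort";
    case MODE_UND: return "Undef";
    case MODE_SYS: return "System";
    default:       return "other"; /* e.g. Monitor or Hyp, not covered in these notes */
    }
}

/* Hypothetical helper: User is the only unprivileged mode. */
static int cpsr_is_privileged(uint32_t cpsr)
{
    return (cpsr & 0x1Fu) != MODE_USR;
}
```

For example, a CPSR value whose low byte is 0xD3 decodes to Supervisor mode, which is what a core reports right after reset.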

ARM registers

[DDI0406C_C] A2.3 ARM core registers

thirteen general-purpose 32-bit registers, R0 to R12

three 32-bit registers with special uses, SP, LR, and PC, that can be described as R13 to R15.

R13: SP (stack pointer)

R14: LR (link register)

R15: PC (program counter)

CPSR (current program status register)

SPSR (saved program status register): accessible in the privileged modes other than System mode

Condition flags update

How the condition flags in the APSR are updated

Ref: DUI0473J_armasm_user_guide.pdf 5.4 Updates to the condition flags

Some instructions update all flags, and some instructions only update a subset of the flags. Most instructions update the condition flags only if the S suffix is specified. The instructions CMP, CMN, TEQ, and TST always update the flags.

N

Set to 1 when the result of the operation is negative, cleared to 0 otherwise.

Z

Set to 1 when the result of the operation is zero, cleared to 0 otherwise.

C

Set to 1 when the operation results in a carry, or when a subtraction results in no borrow, cleared to 0 otherwise.

[A shift operation can also set the C flag; see C10.6.]

V

Set to 1 when the operation causes overflow, cleared to 0 otherwise.

For details on how C and V are used, see chapter 5.4.
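The C and V rules are the easiest to get wrong, so here is a small sketch of how the four flags could be computed for a 32-bit ADDS and SUBS (CMP behaves like SUBS without writing the result). It models the flag definitions above in portable C; the struct and function names are assumptions for illustration, not part of any ARM toolchain.

```c
#include <stdint.h>

/* Hypothetical container for the four condition flags (each 0 or 1). */
struct nzcv {
    unsigned n, z, c, v;
};

/* Flags after ADDS: result = a + b. */
static struct nzcv flags_adds(uint32_t a, uint32_t b)
{
    uint32_t r = a + b;
    struct nzcv f;
    f.n = r >> 31;                            /* sign bit of the result */
    f.z = (r == 0);
    f.c = (r < a);                            /* unsigned carry out of bit 31 */
    f.v = ((~(a ^ b) & (a ^ r)) >> 31) & 1u;  /* inputs share a sign, result differs */
    return f;
}

/* Flags after SUBS or CMP: result = a - b. */
static struct nzcv flags_subs(uint32_t a, uint32_t b)
{
    uint32_t r = a - b;
    struct nzcv f;
    f.n = r >> 31;
    f.z = (r == 0);
    f.c = (a >= b);                           /* C set means "no borrow" */
    f.v = (((a ^ b) & (a ^ r)) >> 31) & 1u;   /* inputs differ in sign, result sign differs from a */
    return f;
}
```

Comparing 3 with 5 clears C (a borrow occurred) and sets N, while adding 1 to 0x7FFFFFFF sets V without setting C.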

Condition code suffixes

Suffix      Flags                        Meaning
EQ          Z set                        Equal
NE          Z clear                      Not equal
CS or HS    C set                        Higher or same (unsigned >=)
CC or LO    C clear                      Lower (unsigned <)
MI          N set                        Negative
PL          N clear                      Positive or zero
VS          V set                        Overflow
VC          V clear                      No overflow
HI          C set and Z clear            Higher (unsigned >)
LS          C clear or Z set             Lower or same (unsigned <=)
GE          N and V the same             Signed >=
LT          N and V differ               Signed <
GT          Z clear, N and V the same    Signed >
LE          Z set, N and V differ        Signed <=
AL          Any                          Always. This suffix is normally omitted.

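My understanding: an instruction carrying a suffix executes only when the flags satisfy that suffix's condition; for example, an instruction with the CS suffix executes only when C is set.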
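The suffix table can be written down as a predicate over the four flags. This is an illustrative sketch only; the enum and function names are made up, and the enum order happens to follow the order of the table.

```c
/* Condition suffixes in the order of the table above. */
enum cond { EQ, NE, CS, CC, MI, PL, VS, VC, HI, LS, GE, LT, GT, LE, AL };

/* Hypothetical helper: return 1 if an instruction with suffix cc would
 * execute, given flag values n, z, c, v (each 0 or 1). */
static int cond_passed(enum cond cc, unsigned n, unsigned z, unsigned c, unsigned v)
{
    switch (cc) {
    case EQ: return z != 0;
    case NE: return z == 0;
    case CS: return c != 0;            /* also written HS */
    case CC: return c == 0;            /* also written LO */
    case MI: return n != 0;
    case PL: return n == 0;
    case VS: return v != 0;
    case VC: return v == 0;
    case HI: return c != 0 && z == 0;
    case LS: return c == 0 || z != 0;
    case GE: return n == v;
    case LT: return n != v;
    case GT: return z == 0 && n == v;
    case LE: return z != 0 || n != v;
    default: return 1;                 /* AL: always */
    }
}
```

After CMP with operands 3 and 5 the flags are N=1, Z=0, C=0, V=0, so LT and LO pass while GE and HI do not.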

Stack implementation using LDM and STM

Ref: DUI0473J_armasm_user_guide.pdf C4.5

You can use the LDM and STM instructions to implement pop and push operations respectively. You use a suffix to indicate the stack type.

Suffixes such as FD and DB are explained in C4.5.

My understanding: descending and ascending describe the direction in which the stack grows.

Full and empty describe what SP points to: the last item pushed (full) or the next free slot (empty). Because FD (full descending) corresponds to DB (decrement before), SP must be decremented before a new item is stored; hence STMFD is a synonym of STMDB.

See the following table, taken from C4.5 Table 4-6.

Table 4-6 Suffixes for load and store multiple instructions

Stack type          Store                               Load
Full descending     STMFD (STMDB, Decrement Before)     LDMFD (LDM, Increment After)
Full ascending      STMFA (STMIB, Increment Before)     LDMFA (LDMDA, Decrement After)
Empty descending    STMED (STMDA, Decrement After)      LDMED (LDMIB, Increment Before)
Empty ascending     STMEA (STM, Increment After)        LDMEA (LDMDB, Decrement Before)

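Serker: note that plain STM and plain LDM do not pair up for the same stack type: STM (increment after) is the store for an empty ascending stack, while LDM (increment after) is the load for a full descending stack.

Also note that the PUSH and POP instructions assume a full descending stack. They are the preferred synonyms for STMDB and LDM with writeback.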
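To make "full descending" concrete, here is a small model of such a stack in plain C: push decrements SP before storing (as STMDB does) and pop loads before incrementing (as LDM with writeback does). The structure and function names are hypothetical, and the 16-word memory is just an arbitrary size for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FD_WORDS 16

/* Hypothetical model of a full descending stack. */
struct fd_stack {
    uint32_t mem[FD_WORDS];
    size_t sp; /* index of the last pushed word; FD_WORDS when the stack is empty */
};

static void fd_init(struct fd_stack *s)
{
    s->sp = FD_WORDS; /* one past the highest word, like an initial stack top */
}

/* Models a one-register STMDB sp!: decrement before, then store. */
static void fd_push(struct fd_stack *s, uint32_t value)
{
    assert(s->sp > 0);
    s->sp -= 1;
    s->mem[s->sp] = value;
}

/* Models a one-register LDM sp!: load, then increment after. */
static uint32_t fd_pop(struct fd_stack *s)
{
    uint32_t value;
    assert(s->sp < FD_WORDS);
    value = s->mem[s->sp];
    s->sp += 1;
    return value;
}
```

After two pushes SP has moved down by two words and points at the most recently pushed item, which is exactly the "full" property.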

Usage of registers

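(NOTE: this section comes from the Internet.)

1) Subroutines pass arguments in registers R0-R3, which may be referred to by their aliases a1-a4. A called subroutine does not need to restore R0-R3 before returning. A subroutine passes its return value to the caller in R0: it stores the value in R0 before returning, and the caller reads R0 after the return.

On a function call, R0-R3 are the argument registers. Even if the caller has no arguments to pass, the callee may freely modify R0-R3 without worrying about destroying values the caller kept there, and it need not restore them before returning. Under AAPCS these are caller-saved registers: the caller must save any live data in R0-R3 (for example by pushing it) before making the call, so that the callee cannot destroy it.

The compiler cannot predict when an interrupt handler will run, however, so the interrupted function has no chance to save R0-R3 in advance; the interrupt handler must therefore push every one of R0-R11 that it uses. R12 is put to other uses by some compiler versions and user programs cannot rely on it, so a hand-written assembly function must push it as well to make sure its value is not destroyed.

2) Within a subroutine, R4-R11 hold local variables and may be referred to by their aliases v1-v8. If a subroutine uses any of these registers, it must save their values on entry and restore them before returning.

Registers that the subroutine does not use need no such handling. In Thumb code, usually only R4-R7 are available for local variables.

3) R12 serves as the scratch register between subroutines, alias IP.

4) R13 is the stack pointer, alias SP. A subroutine must not use R13 for any other purpose, and its value on exit must equal its value on entry.

5) R14 is the link register, alias LR, which holds the subroutine's return address. Once the subroutine has saved the return address, R14 can be used for other purposes.

6) R15 is the program counter, alias PC; it has no other use.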

Vector table

From the ARM Cortex-A Series Programmer's Guide

See Chapter 11 "Exception Handling", 11.1.1 Exception priorities.

The vector offsets are listed in detail in Table 11-1 "Summary of exception behavior".

For example, with high vectors:

0xFFFF0000 is Reset (listed as Not used for the Non-secure vector table)

0xFFFF0004 is Undefined instruction

0xFFFF0008 is Supervisor Call

......
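The addresses above are a base address plus a fixed offset per exception. As a rough sketch, the offsets can be written as a C enum; the entries beyond the three listed in these notes (prefetch abort, data abort, IRQ, FIQ) come from the ARMv7-A architecture rather than from the notes, and all identifiers here are made up for illustration.

```c
#include <stdint.h>

/* Offset of each exception vector from the vector base (ARMv7-A). */
enum arm_vector {
    VEC_RESET = 0x00,
    VEC_UNDEF = 0x04, /* Undefined instruction */
    VEC_SVC   = 0x08, /* Supervisor Call */
    VEC_PABT  = 0x0C, /* Prefetch abort */
    VEC_DABT  = 0x10, /* Data abort */
    VEC_IRQ   = 0x18,
    VEC_FIQ   = 0x1C
};

#define VECTORS_LOW  0x00000000u /* normal vectors */
#define VECTORS_HIGH 0xFFFF0000u /* high vectors, as in the examples above */

/* Hypothetical helper: address the core fetches from for a given exception. */
static uint32_t vector_address(uint32_t base, enum arm_vector v)
{
    return base + (uint32_t)v;
}
```

With the high-vector base this reproduces the addresses listed above, for example 0xFFFF0008 for a Supervisor Call.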

ARMv7 Instruction set

Definition

UAL: Unified Assembler Language

Assembler syntax

STR vs STM

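STR and STM take their operands in opposite order: STR Rt, [Rn] versus STM Rn, {registers}.

Rn is the base register, that is, the target address of the memory copy; its position in the syntax is reversed between STM and STR.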

ADR

From DUI0473 C4.9

The ADR instruction loads an address within a certain range, without performing a data load.

ADR accepts a PC-relative expression, that is, a label with an optional offset where the address of the label is relative to the PC.

NOTE: The label used with ADR must be within the same code section.

For ARM, the range (that is, the offset) is any value that can be produced by rotating an 8-bit value right by any even number of bits within a 32-bit word. The range is relative to the PC.

JC: ADR is either PC-relative or register-relative; ADRL can reach a larger address range.
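The "8-bit value rotated right by an even number of bits" rule can be checked mechanically. The sketch below tests whether a 32-bit offset fits that form by undoing each possible even rotation; the function names are hypothetical and the code only models the encoding rule quoted above, not what an assembler actually does.

```c
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (n taken modulo 32). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (x << n) | (x >> (32u - n)) : x;
}

/* Hypothetical helper: return 1 if value equals some 8-bit constant
 * rotated right by an even amount (0, 2, ..., 30), else 0. */
static int arm_imm_encodable(uint32_t value)
{
    unsigned rot;
    for (rot = 0; rot < 32; rot += 2) {
        /* Rotating left by rot undoes a rotate right by rot. */
        if (rotl32(value, rot) <= 0xFFu)
            return 1;
    }
    return 0;
}
```

So 0xFF, 0x3FC, and 0xFF000000 are valid offsets, while 0x101 (bits too far apart) and 0x1FE (would need an odd rotation) are not.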

COMMON RULE

Memory accesses (addressing modes)

DUI0473 C4.18

(1) Offset addressing, [Rn, offset]: the access address is the base in Rn plus offset; the value of Rn is unchanged.

(2) Pre-indexed addressing, [Rn, offset]!: same as above, except that the computed address is also written back to Rn.

(3) Post-indexed addressing, [Rn], offset: the access uses Rn as the address; offset is only used to update Rn afterwards.

Note: offset can be

"An immediate constant",

"An index register, Rm",

"A shifted index register, such as Rm, LSL #shift" (shift operations are described in C10.6)

Example:

STR R0,[R1],#8 ; store the word in R0 at the address in R1, then write the new address R1+8 back to R1.

STR R0,[R1,#8] ; store the word in R0 at address R1+8.
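The three addressing modes differ only in which address is accessed and whether Rn changes. A small model in C makes that explicit; the struct and function names are assumptions for illustration, and only immediate offsets are modelled.

```c
#include <stdint.h>

/* Hypothetical result of one access: the address used and Rn afterwards. */
struct access {
    uint32_t address;
    uint32_t rn_after;
};

/* [Rn, #off]: access Rn + off, Rn unchanged. */
static struct access addr_offset(uint32_t rn, int32_t off)
{
    struct access a = { rn + (uint32_t)off, rn };
    return a;
}

/* [Rn, #off]!: access Rn + off and write that address back to Rn. */
static struct access addr_pre_indexed(uint32_t rn, int32_t off)
{
    struct access a = { rn + (uint32_t)off, rn + (uint32_t)off };
    return a;
}

/* [Rn], #off: access Rn, then update Rn to Rn + off. */
static struct access addr_post_indexed(uint32_t rn, int32_t off)
{
    struct access a = { rn, rn + (uint32_t)off };
    return a;
}
```

With R1 = 0x1000, the post-indexed store STR R0,[R1],#8 writes to 0x1000 and leaves R1 = 0x1008, whereas STR R0,[R1,#8] writes to 0x1008 and leaves R1 alone.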

Example

Switching stacks, as in a function call [JC: NOT CLEAR]

(1) Assembly code segment

ENTRY(call_with_stack2)
	/* save current stack and return pointer */
	str	sp, [r2, #-4]!		// save the current sp and lr at the top of the new stack
	str	lr, [r2, #-4]!

	/* copy parameters for new stack */
	mov	sp, r2			// set the top of the new stack
	mov	r2, r0
	mov	r0, r1			// set the argument of the function to jump to
	adr	lr, BSYM(1f)		// put the return address in lr
	mov	pc, r2			// jump to the new function

1:	ldr	lr, [sp]		// restore the saved lr
	ldr	sp, [sp, #4]		// restore the saved sp
	mov	pc, lr
ENDPROC(call_with_stack2)

(2) C code segment

static u8 dummy_stack[PAGE_SIZE / 4] __nosavedata;

call_with_stack2(mem_dummy, 0, dummy_stack + ARRAY_SIZE(dummy_stack));

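Explanation: the first argument of call_with_stack2 is the address of the code to jump to (a special piece of memory copied into a region of preallocated memory); the second is the argument passed to that function; the third is the "top" of the new stack (note that the actual top of the new stack ends up eight bytes lower).

Points to note:

In the statement "mov sp, r2 // set the top of the new stack", r2 actually points at the word where lr was saved, not at the next unused address. This is presumably the stack convention: to push an item (say 4 bytes), first move down 4 bytes and then store, as in "str sp, [r2, #-4]!".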
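The eight-byte adjustment can be traced with a small simulation of the two pre-indexed stores and the two restoring loads. This is only a model of the register arithmetic in call_with_stack2, with made-up names and a two-word array standing in for the memory just below the new stack top; it assumes the called function returns with sp unchanged.

```c
#include <stdint.h>

/* Hypothetical summary of one simulated stack switch. */
struct stack_switch {
    uint32_t sp_during_call; /* sp seen by the called function */
    uint32_t restored_lr;
    uint32_t restored_sp;
};

static struct stack_switch simulate_stack_switch(uint32_t old_sp, uint32_t old_lr,
                                                 uint32_t new_top)
{
    uint32_t frame[2];              /* frame[0] is at new_top - 8, frame[1] at new_top - 4 */
    uint32_t base = new_top - 8u;   /* byte address of frame[0] */
    uint32_t r2 = new_top;
    struct stack_switch out;

    r2 -= 4u;                               /* str sp, [r2, #-4]! */
    frame[(r2 - base) / 4u] = old_sp;
    r2 -= 4u;                               /* str lr, [r2, #-4]! */
    frame[(r2 - base) / 4u] = old_lr;

    out.sp_during_call = r2;                /* mov sp, r2 */

    /* ... the called function runs here and returns with sp unchanged ... */

    out.restored_lr = frame[(r2 - base) / 4u];        /* ldr lr, [sp] */
    out.restored_sp = frame[(r2 + 4u - base) / 4u];   /* ldr sp, [sp, #4] */
    return out;
}
```

For a new stack top of 0x3000, the called function runs with sp = 0x2FF8, and the original sp and lr are recovered from the two words just above it.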

ARMv8

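ARMv8 supports both 32-bit and 64-bit execution, as ARMv8 AArch32 and ARMv8 AArch64 respectively.

In the 32-bit state its operating modes are the same as in ARMv7. AArch64 simplifies the CPU state into EL0 to EL3: EL0 is User mode, EL1 is Kernel mode, EL2 is for hypervisor virtualization (on the PXA1928 platform, Marvell uses it for the bootloader (uBoot) and obm), and EL3 is for the TrustZone secure monitor.

Compared with earlier ARM processors, one of the biggest highlights of ARMv8 is support for the Crypto instruction set. This currently consists mainly of SIMD-based hardware acceleration instructions for AES, SHA1, and SHA2-256; see the summary table at the following link.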

http://loda.hala01.com/2014/12/armv8-%E8%88%87-linux%E7%9A%84%E6%96%B0%E6%89%8B%E7%AD%86%E8%A8%98/

The SHA used on Marvell loke7 only reads data from an address of the form 0x5xxxx, so it does appear to use the hardware feature.

MISC

Atomic operations

Reference: http://infocenter.arm.com/help/index.jsp (search for "LDREX" and "STREX")

Spinlock implementation on the ARM platform (arch/arm/include/asm/spinlock.h)

static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
	unsigned long tmp;

	__asm__ __volatile__(
"1:	ldrex	%0, [%1]\n"		/* tmp = lock->lock (exclusive load) */
"	teq	%0, #0\n"		/* is the lock free? */
#ifdef CONFIG_CPU_32v6K
"	wfene\n"			/* if not, wait for an event */
#endif
"	strexeq	%0, %2, [%1]\n"		/* if free, try to store 1; tmp = 0 on success */
"	teqeq	%0, #0\n"		/* did the exclusive store succeed? */
"	bne	1b"			/* lock held or store failed: retry */
	: "=&r" (tmp)
	: "r" (&lock->lock), "r" (1)
	: "cc");

	smp_mb();
}

static inline int __raw_spin_trylock(raw_spinlock_t *lock)
{
	unsigned long tmp;

	__asm__ __volatile__(
"	ldrex	%0, [%1]\n"
"	teq	%0, #0\n"
"	strexeq	%0, %2, [%1]"		/* single attempt, no retry loop */
	: "=&r" (tmp)
	: "r" (&lock->lock), "r" (1)
	: "cc");

	if (tmp == 0) {
		smp_mb();
		return 1;
	} else {
		return 0;
	}
}
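For comparison, the same lock and trylock logic can be sketched in portable C11 with <stdatomic.h>, where the compiler generates the exclusive load/store sequence (or an equivalent) itself. This is not the kernel's implementation; the type and function names below are hypothetical, and the acquire/release orderings stand in for the smp_mb() calls above.

```c
#include <stdatomic.h>

/* Hypothetical lock type: 0 means free, 1 means held. */
typedef struct {
    atomic_int locked;
} sketch_spinlock_t;

static void sketch_spin_init(sketch_spinlock_t *l)
{
    atomic_init(&l->locked, 0);
}

/* One attempt: succeeds only if the lock was free. Returns 1 on success. */
static int sketch_spin_trylock(sketch_spinlock_t *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &l->locked, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

/* Retry until the lock is taken, like the "bne 1b" loop. */
static void sketch_spin_lock(sketch_spinlock_t *l)
{
    while (!sketch_spin_trylock(l)) {
        /* busy-wait; the kernel version can sleep in WFE here instead */
    }
}

static void sketch_spin_unlock(sketch_spinlock_t *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

A second trylock on a held lock fails immediately, and it succeeds again once the holder has called the unlock function.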

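Notes:

libjpeg-turbo is an extension of libjpeg that supports SIMD instructions, such as MMX, SSE, SSE2, and 3DNow! on x86 and NEON on ARM, which speeds up JPEG encoding and decoding.

The CxImage 7.01 graphics library bundles libjpeg, so it is easy to replace it with libjpeg-turbo. I ran a rough test decoding a 2560x1920 JPEG (about 14 MB of memory once decoded) on a WinCE 6 platform with a 1 GHz Cortex-A8.

libjpeg took 2642 ms and libjpeg-turbo took 1900 ms, an improvement of roughly 30% (about 28% less time, or a speedup of about 1.39x). With digital camera resolutions growing ever larger, NEON instructions alone cannot fully meet the need; hardware decoding is the real answer.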