
How to use SSE2 instructions to improve the performance of memory copy?

2009-04-04 09:41


How to improve the performance of data transfers is an ageless topic among programmers. For the common scenario, the memcpy provided by the C/C++ library is enough to handle the routine copying tasks. However, what happens if the size of the data is larger than the last-level data cache in your system? For example, the last-level data cache in my system is a 1 MB L2 cache, so what happens when copying 10 MB of data? Obviously, the copy becomes slow, because the cache is polluted after invoking memcpy.
As a result, we often ask whether there is a good solution that reduces the degree of data-cache pollution and improves performance at the same time. Fortunately, such a solution exists in the SSEx instruction sets. This article focuses only on SSE2 instructions; the more advanced SSEx extensions will be discussed in following articles.
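As a side note, before taking the SSE2 path it is wise to verify at run time that the CPU actually supports SSE2. A minimal sketch using MSVC's __cpuid intrinsic might look like the following (the helper name cpu_has_sse2 is just for illustration):

#include <intrin.h>   /* MSVC: __cpuid */

/* Illustrative helper: returns non-zero if the CPU reports SSE2 support
   (CPUID leaf 1, EDX bit 26). */
static int cpu_has_sse2( void )
{
    int info[4] = { 0 };

    __cpuid( info, 1 );
    return (info[3] >> 26) & 1;
}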
Before showing the source code, we had better understand several basic SSE instructions; their C intrinsic equivalents are sketched right after this list. They are:
a. PREFETCHNTA.
Non-temporal data—fetch data into a location close to the processor, minimizing cache pollution.
• Pentium III processor—1st-level cache
• Pentium 4 and Intel Xeon processor—2nd-level cache
b. MOVNTDQ.
The MOVNTDQ (store double quadword using non-temporal hint) instruction stores packed integer data from an XMM register to memory, using a non-temporal hint.
c. MOVDQA.
The MOVDQA (move aligned double quadword) instruction transfers a double quadword operand from memory to an XMM register or vice versa, or between XMM registers. The memory address must be aligned to a 16-byte boundary; otherwise, a general-protection exception (#GP) is generated.
d. MOVDQU.
The MOVDQU (move unaligned double quadword) instruction performs the same operations as the MOVDQA instruction, except that 16-byte alignment of a memory address is not required. Its efficiency is lower than that of MOVDQA.
e. SFENCE.
The SFENCE (store fence) instruction controls write ordering by creating a fence for memory store operations. It guarantees that the result of every store instruction that precedes the fence in program order is globally visible before any store instruction that follows the fence. SFENCE provides an efficient way of ensuring ordering between procedures that produce weakly-ordered data and procedures that consume that data.
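For readers who prefer compiler intrinsics over raw mnemonics, the instructions above map roughly onto the <emmintrin.h> intrinsics shown in the sketch below. The function name is hypothetical and only demonstrates the mapping; it assumes both pointers are 16-byte aligned and point to at least 32 valid bytes.

#include <emmintrin.h>   /* SSE2 intrinsics (also pulls in the SSE prefetch/fence intrinsics) */

/* Illustrative only: shows the intrinsic that corresponds to each instruction above.
   Assumes src and dst are 16-byte aligned and point to at least 32 valid bytes. */
void sse2_instruction_examples( void *dst, const void *src )
{
    __m128i a, u;

    _mm_prefetch( (const char *) src, _MM_HINT_NTA );     /* PREFETCHNTA */
    a = _mm_load_si128( (const __m128i *) src );          /* MOVDQA (aligned load)   */
    u = _mm_loadu_si128( (const __m128i *) src );         /* MOVDQU (unaligned load) */
    _mm_stream_si128( (__m128i *) dst, a );               /* MOVNTDQ (non-temporal store) */
    _mm_store_si128( ((__m128i *) dst) + 1, u );          /* MOVDQA (aligned store)  */
    _mm_sfence();                                         /* SFENCE */
}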

The greatest advantage is that the SSE instructions raise the data-transfer throughput: a single instruction moves 16 bytes of data, whereas a plain MOV only handles 4 bytes. Moreover, by cooperating with PREFETCHNTA, the data-cache hit rate is barely affected after the SSE memcpy, compared with the traditional way.
OK, we have spent enough time on the preliminaries, so let's get to the main part.

[The block of source code]


void* sse2_fast_memcpy( void *pDst, const void *pSrc, size_t len )
{
    void *pBegin = pDst;
    int offset;

    /* Number of bytes needed to bring pDst up to a 16-byte boundary. */
    if ( (offset = ((unsigned long) pDst) & SSE_ALIGNMENT_MASK) > 0 )
        offset = SSE_ALIGNMENT_VAL - offset;

    /* Too small to bother with SSE: fall back to the library memcpy. */
    if ( len < (size_t) offset + 16 )
    {
        return memcpy( pDst, pSrc, len );
    }

    /* Copy the unaligned head so that pDst becomes 16-byte aligned. */
    if ( offset > 0 )
    {
        memcpy( pDst, pSrc, offset );
        len  -= offset;
        pDst  = ((char *) pDst) + offset;
        pSrc  = ((char *) pSrc) + offset;
    }

    if ( SSE_CAN_ALIGN( pDst, pSrc ) )
    {
        /* Source and destination share the same alignment: aligned loads (movdqa). */
        _asm
        {
            mov         ecx, len
            mov         esi, pSrc
            mov         edi, pDst

            cmp         ecx, 128
            jb          LA2
            prefetchnta [esi]
        LA1:                                        // main loop: 128 bytes per iteration
            prefetchnta XMMWORD PTR [esi + 16 * 4]
            movdqa      xmm0, XMMWORD PTR [esi]
            movntdq     XMMWORD PTR [edi], xmm0
            movdqa      xmm1, XMMWORD PTR [esi + 16 * 1]
            movntdq     XMMWORD PTR [edi + 16 * 1], xmm1
            movdqa      xmm2, XMMWORD PTR [esi + 16 * 2]
            movntdq     XMMWORD PTR [edi + 16 * 2], xmm2
            movdqa      xmm3, XMMWORD PTR [esi + 16 * 3]
            movntdq     XMMWORD PTR [edi + 16 * 3], xmm3

            prefetchnta XMMWORD PTR [esi + 16 * 8]
            movdqa      xmm4, XMMWORD PTR [esi + 16 * 4]
            movntdq     XMMWORD PTR [edi + 16 * 4], xmm4
            movdqa      xmm5, XMMWORD PTR [esi + 16 * 5]
            movntdq     XMMWORD PTR [edi + 16 * 5], xmm5
            movdqa      xmm6, XMMWORD PTR [esi + 16 * 6]
            movntdq     XMMWORD PTR [edi + 16 * 6], xmm6
            movdqa      xmm7, XMMWORD PTR [esi + 16 * 7]
            movntdq     XMMWORD PTR [edi + 16 * 7], xmm7

            add         esi, 128
            add         edi, 128
            sub         ecx, 128
            cmp         ecx, 128
            jae         LA1
        LA2:                                        // remaining 64-byte block
            cmp         ecx, 64
            jb          LA3
            prefetchnta XMMWORD PTR [esi]
            sub         ecx, 64
            movdqa      xmm0, XMMWORD PTR [esi]
            movntdq     XMMWORD PTR [edi], xmm0
            movdqa      xmm1, XMMWORD PTR [esi + 16 * 1]
            movntdq     XMMWORD PTR [edi + 16 * 1], xmm1
            movdqa      xmm2, XMMWORD PTR [esi + 16 * 2]
            movntdq     XMMWORD PTR [edi + 16 * 2], xmm2
            movdqa      xmm3, XMMWORD PTR [esi + 16 * 3]
            movntdq     XMMWORD PTR [edi + 16 * 3], xmm3

            add         esi, 64
            add         edi, 64
        LA3:                                        // remaining 32-byte block
            prefetchnta XMMWORD PTR [esi]
            cmp         ecx, 32
            jb          LA4
            sub         ecx, 32
            movdqa      xmm4, XMMWORD PTR [esi]
            movntdq     XMMWORD PTR [edi], xmm4
            movdqa      xmm5, XMMWORD PTR [esi + 16 * 1]
            movntdq     XMMWORD PTR [edi + 16 * 1], xmm5

            add         esi, 32
            add         edi, 32
        LA4:                                        // remaining 16-byte block
            cmp         ecx, 16
            jb          LA5
            sub         ecx, 16
            movdqa      xmm6, XMMWORD PTR [esi]
            movntdq     XMMWORD PTR [edi], xmm6

            //add       esi, 16
            //add       edi, 16
        LA5:
            sfence                                  // make the non-temporal stores globally visible
        }
    }
    else    // Unaligned source: use movdqu for the loads
    {
        _asm
        {
            mov         ecx, len
            mov         esi, pSrc
            mov         edi, pDst

            cmp         ecx, 128
            jb          LB2
            prefetchnta [esi]
        LB1:                                        // main loop: 128 bytes per iteration
            prefetchnta XMMWORD PTR [esi + 16 * 4]
            movdqu      xmm0, XMMWORD PTR [esi]
            movntdq     XMMWORD PTR [edi], xmm0
            movdqu      xmm1, XMMWORD PTR [esi + 16 * 1]
            movntdq     XMMWORD PTR [edi + 16 * 1], xmm1
            movdqu      xmm2, XMMWORD PTR [esi + 16 * 2]
            movntdq     XMMWORD PTR [edi + 16 * 2], xmm2
            movdqu      xmm3, XMMWORD PTR [esi + 16 * 3]
            movntdq     XMMWORD PTR [edi + 16 * 3], xmm3

            prefetchnta XMMWORD PTR [esi + 16 * 8]
            movdqu      xmm4, XMMWORD PTR [esi + 16 * 4]
            movntdq     XMMWORD PTR [edi + 16 * 4], xmm4
            movdqu      xmm5, XMMWORD PTR [esi + 16 * 5]
            movntdq     XMMWORD PTR [edi + 16 * 5], xmm5
            movdqu      xmm6, XMMWORD PTR [esi + 16 * 6]
            movntdq     XMMWORD PTR [edi + 16 * 6], xmm6
            movdqu      xmm7, XMMWORD PTR [esi + 16 * 7]
            movntdq     XMMWORD PTR [edi + 16 * 7], xmm7

            add         esi, 128
            add         edi, 128
            sub         ecx, 128
            cmp         ecx, 128
            jae         LB1
        LB2:                                        // remaining 64-byte block
            cmp         ecx, 64
            jb          LB3
            prefetchnta XMMWORD PTR [esi]
            sub         ecx, 64
            movdqu      xmm0, XMMWORD PTR [esi]
            movntdq     XMMWORD PTR [edi], xmm0
            movdqu      xmm1, XMMWORD PTR [esi + 16 * 1]
            movntdq     XMMWORD PTR [edi + 16 * 1], xmm1
            movdqu      xmm2, XMMWORD PTR [esi + 16 * 2]
            movntdq     XMMWORD PTR [edi + 16 * 2], xmm2
            movdqu      xmm3, XMMWORD PTR [esi + 16 * 3]
            movntdq     XMMWORD PTR [edi + 16 * 3], xmm3

            add         esi, 64
            add         edi, 64
        LB3:                                        // remaining 32-byte block
            prefetchnta XMMWORD PTR [esi]
            cmp         ecx, 32
            jb          LB4
            sub         ecx, 32
            movdqu      xmm4, XMMWORD PTR [esi]
            movntdq     XMMWORD PTR [edi], xmm4
            movdqu      xmm5, XMMWORD PTR [esi + 16 * 1]
            movntdq     XMMWORD PTR [edi + 16 * 1], xmm5

            add         esi, 32
            add         edi, 32
        LB4:                                        // remaining 16-byte block
            cmp         ecx, 16
            jb          LB5
            sub         ecx, 16
            movdqu      xmm6, XMMWORD PTR [esi]
            movntdq     XMMWORD PTR [edi], xmm6

            //add       esi, 16
            //add       edi, 16
        LB5:
            sfence                                  // make the non-temporal stores globally visible
        }
    }
    // End if ( SSE_CAN_ALIGN( pDst, pSrc ) )

    /* Copy the remaining tail (fewer than 16 bytes). */
    offset = len & 0x0F;
    if ( offset > 0 )
    {
        memcpy( ((char *) pDst) + (len - offset), ((char *) pSrc) + (len - offset), offset );
    }

    return pBegin;
}
[Comments on the source code]


1) The library memcpy is adopted when the length of the data is small. If you prefer to do everything yourself, the small_fast_memcpy listed below may be an ideal choice for handling small blocks of data.
void* small_fast_memcpy( void *pDst, const void *pSrc, size_t len )
{
    _asm
    {
        mov ecx, len
        mov edi, pDst
        mov esi, pSrc
        rep movsb       // copy ecx bytes from [esi] to [edi], one byte at a time
    }

    return pDst;
}
2) To keep the emphasis of the source code clear, the simple logical judgements are implemented in C syntax. In fact, it is very easy to convert them into assembly as well; if you do not like the hybrid approach, why not do it yourself now? An intrinsics-based alternative for the inner loop is sketched below.
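As a related note, MSVC's _asm blocks are available only when compiling for 32-bit x86; on x64, the same aligned inner loop can be written with the <emmintrin.h> intrinsics instead. A minimal sketch follows, assuming both pointers are already 16-byte aligned and len is a multiple of 64 (the helper name is hypothetical and this is not the author's original code):

#include <stddef.h>      /* size_t */
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Sketch of the aligned inner loop with intrinsics instead of inline assembly.
   Assumes dst and src are 16-byte aligned and len is a multiple of 64. */
static void sse2_copy_aligned_sketch( void *dst, const void *src, size_t len )
{
    __m128i       *d = (__m128i *) dst;
    const __m128i *s = (const __m128i *) src;

    for ( ; len >= 64; len -= 64, s += 4, d += 4 )
    {
        _mm_prefetch( (const char *) (s + 4), _MM_HINT_NTA );   /* prefetch the next 64-byte block */
        _mm_stream_si128( d + 0, _mm_load_si128( s + 0 ) );     /* movdqa load + movntdq store */
        _mm_stream_si128( d + 1, _mm_load_si128( s + 1 ) );
        _mm_stream_si128( d + 2, _mm_load_si128( s + 2 ) );
        _mm_stream_si128( d + 3, _mm_load_si128( s + 3 ) );
    }
    _mm_sfence();   /* make the non-temporal stores globally visible */
}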

3) Macros are used in the source code to avoid widespread changes if the instructions ever require 64-byte alignment (some SSE4x instructions need 64-byte alignment). Please refer to the block below.
#define SSE_ALIGNMENT_VAL   (16)
#define SSE_ALIGNMENT_MASK  (SSE_ALIGNMENT_VAL - 1)
#define SSE_CAN_ALIGN( addr1, addr2 ) \
    ((((unsigned long) (addr1)) & SSE_ALIGNMENT_MASK) == (((unsigned long) (addr2)) & SSE_ALIGNMENT_MASK))
4) Interleaving the XMMx registers reduces read/write dependencies and improves instruction-level parallelism.

5) Interleaving the read and write operations adds enough delay to hide the latency of the prefetch that is currently in flight.

6) The prefetch granularity is 64 bytes, which is the data-cache line size on my PC. It should be adjusted for the real environment, but it is an appropriate size in the normal case; a sketch of how to query the cache-line size at run time is given below.
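If you prefer not to hard-code 64, the cache-line size can be queried at run time. A minimal sketch using MSVC's __cpuid intrinsic (the helper name is hypothetical; CPUID leaf 1 reports the CLFLUSH line size in EBX bits 15:8, in 8-byte units):

#include <intrin.h>   /* MSVC: __cpuid */

/* Illustrative helper: returns the CLFLUSH line size in bytes (CPUID leaf 1,
   EBX bits 15:8, reported in 8-byte units), which normally matches the
   data-cache line size; falls back to 64 if the field reads as zero. */
static unsigned int cache_line_size( void )
{
    int          info[4] = { 0 };
    unsigned int size;

    __cpuid( info, 1 );
    size = (((unsigned int) info[1] >> 8) & 0xFF) * 8;
    return (size != 0) ? size : 64;
}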

[Summary]
1. A speed improvement is obtained only when the data size is larger than the last-level data cache; otherwise, performance becomes worse. In my tests on large blocks of data, the improvement is about +50%. (A minimal timing sketch follows this list.)
2. Be careful with the prefetch instructions: abusing them can seriously disturb overall system performance.
3. Using SSE2 is not necessarily the best solution; if the more advanced SSEx instructions are used, roughly another +1% gain is available compared with SSE2.
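For readers who want to reproduce the comparison, a minimal timing sketch is given below; the buffer size, iteration count, and helper names are arbitrary choices for illustration, not the author's original benchmark.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

void* sse2_fast_memcpy( void *pDst, const void *pSrc, size_t len );   /* from this article */

#define BUF_SIZE    (10 * 1024 * 1024)   /* 10 MB: larger than the last-level data cache */
#define ITERATIONS  100

static double time_copy( void* (*copy)(void *, const void *, size_t),
                         void *dst, const void *src )
{
    int     i;
    clock_t start = clock();

    for ( i = 0; i < ITERATIONS; ++i )
        copy( dst, src, BUF_SIZE );

    return (double) (clock() - start) / CLOCKS_PER_SEC;
}

int main( void )
{
    char *src = (char *) malloc( BUF_SIZE );
    char *dst = (char *) malloc( BUF_SIZE );

    if ( src == NULL || dst == NULL )
        return 1;
    memset( src, 0xA5, BUF_SIZE );   /* touch the source so it is backed by real pages */

    printf( "memcpy:           %.3f s\n", time_copy( memcpy, dst, src ) );
    printf( "sse2_fast_memcpy: %.3f s\n", time_copy( sse2_fast_memcpy, dst, src ) );

    free( src );
    free( dst );
    return 0;
}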

Anyway, I hope readers will give me feedback to improve or correct this usage of SSEx. I think it will help me understand the internals more deeply.