How to use SSE2 instructions to improve the performance of memory copy?
2009-04-04 09:41
It is an ageless topic for programmers: how to improve the performance of data transfers. In the common scenario, the memcpy provided by the C/C++ runtime library is enough to handle routine copying tasks. However, what happens if the data is larger than the last-level data cache in your system? For example, on my system the last-level cache is a 1 MB L2 data cache; when copying 10 MB of data, the copy obviously slows down, because invoking memcpy pollutes the cache.
As a result, we often ask whether there is a good solution that reduces cache pollution and improves performance at the same time. Fortunately, one exists in the SSEx instruction sets. This article focuses only on SSE2 instructions; the more advanced SSEx extensions will be discussed in following articles.
Before showing the source code, we should understand several basic SSE instructions:
a. PREFETCHNTA.
Fetches non-temporal data into a location close to the processor, minimizing cache pollution:
• Pentium III processor: 1st-level cache
• Pentium 4 and Intel Xeon processors: 2nd-level cache
b. MOVNTDQ.
The MOVNTDQ (store double
quadword using non-temporal hint) instruction stores packed integer data from
an XMM register to memory, using a non-temporal hint.
c. MOVDQA.
The MOVDQA (move aligned
double quadword) instruction transfers a double quadword operand from memory to
an XMM register or vice versa; or between XMM registers. The memory address
must be aligned to a 16-byte boundary; otherwise, a general-protection
exception (#GP) is generated.
d. MOVDQU.
The MOVDQU (move unaligned double quadword) instruction performs the same operation as MOVDQA, except that 16-byte alignment of the memory address is not required. Its efficiency is lower than that of MOVDQA.
e. SFENCE.
The SFENCE (Store Fence)
instruction controls write ordering by creating a fence for memory store
operations. This instruction guarantees that the result of every store
instruction that precedes the store fence in program order is globally visible
before any store instruction that follows the fence. The SFENCE instruction
provides an efficient way of ensuring ordering between procedures that produce
weakly-ordered data and procedures that consume that data.
The biggest advantage is that SSE instructions increase data-transfer throughput: a single instruction moves 16 bytes of data, whereas a plain MOV moves only 4 bytes. Moreover, by cooperating with PREFETCHNTA, the data-cache hit rate is barely affected after the SSE memcpy, compared with the traditional approach.
OK, we have spent enough time on the preliminaries; let's move on to the main part.
[The block of source code]
void* sse2_fast_memcpy( void *pDst, const void *pSrc, size_t len )
{
    void *pBegin = pDst;
    int offset;

    if ( (offset = ((unsigned long) pDst) & SSE_ALIGNMENT_MASK) > 0 )
        offset = SSE_ALIGNMENT_VAL - offset;

    if ( len < (size_t) offset + 16 )
    {
        return memcpy( pDst, pSrc, len );
    }

    if ( offset > 0 )
    {
        memcpy( pDst, pSrc, offset );
        len  -= offset;
        pDst  = ((char *) pDst) + offset;
        pSrc  = ((char *) pSrc) + offset;
    }

    if ( SSE_CAN_ALIGN( pDst, pSrc ) )
    {
        _asm
        {
            mov         ecx, len
            mov         esi, pSrc
            mov         edi, pDst
            cmp         ecx, 128
            jb          LA2
            prefetchnta [esi]
LA1:
            prefetchnta XMMWORD PTR[esi + 16 * 4]
            movdqa      xmm0, XMMWORD PTR[esi]
            movntdq     XMMWORD PTR[edi], xmm0
            movdqa      xmm1, XMMWORD PTR[esi + 16 * 1]
            movntdq     XMMWORD PTR[edi + 16 * 1], xmm1
            movdqa      xmm2, XMMWORD PTR[esi + 16 * 2]
            movntdq     XMMWORD PTR[edi + 16 * 2], xmm2
            movdqa      xmm3, XMMWORD PTR[esi + 16 * 3]
            movntdq     XMMWORD PTR[edi + 16 * 3], xmm3
            prefetchnta XMMWORD PTR[esi + 16 * 8]
            movdqa      xmm4, XMMWORD PTR[esi + 16 * 4]
            movntdq     XMMWORD PTR[edi + 16 * 4], xmm4
            movdqa      xmm5, XMMWORD PTR[esi + 16 * 5]
            movntdq     XMMWORD PTR[edi + 16 * 5], xmm5
            movdqa      xmm6, XMMWORD PTR[esi + 16 * 6]
            movntdq     XMMWORD PTR[edi + 16 * 6], xmm6
            movdqa      xmm7, XMMWORD PTR[esi + 16 * 7]
            movntdq     XMMWORD PTR[edi + 16 * 7], xmm7
            add         esi, 128
            add         edi, 128
            sub         ecx, 128
            cmp         ecx, 128
            jae         LA1
LA2:
            cmp         ecx, 64
            jb          LA3
            prefetchnta XMMWORD PTR[esi]
            sub         ecx, 64
            movdqa      xmm0, XMMWORD PTR[esi]
            movntdq     XMMWORD PTR[edi], xmm0
            movdqa      xmm1, XMMWORD PTR[esi + 16 * 1]
            movntdq     XMMWORD PTR[edi + 16 * 1], xmm1
            movdqa      xmm2, XMMWORD PTR[esi + 16 * 2]
            movntdq     XMMWORD PTR[edi + 16 * 2], xmm2
            movdqa      xmm3, XMMWORD PTR[esi + 16 * 3]
            movntdq     XMMWORD PTR[edi + 16 * 3], xmm3
            add         esi, 64
            add         edi, 64
LA3:
            prefetchnta XMMWORD PTR[esi]
            cmp         ecx, 32
            jb          LA4
            sub         ecx, 32
            movdqa      xmm4, XMMWORD PTR[esi]
            movntdq     XMMWORD PTR[edi], xmm4
            movdqa      xmm5, XMMWORD PTR[esi + 16 * 1]
            movntdq     XMMWORD PTR[edi + 16 * 1], xmm5
            add         esi, 32
            add         edi, 32
LA4:
            cmp         ecx, 16
            jb          LA5
            sub         ecx, 16
            movdqa      xmm6, XMMWORD PTR[esi]
            movntdq     XMMWORD PTR[edi], xmm6
            //add       esi, 16
            //add       edi, 16
LA5:
            sfence
        }
    }
    else // Unalignment
    {
        _asm
        {
            mov         ecx, len
            mov         esi, pSrc
            mov         edi, pDst
            cmp         ecx, 128
            jb          LB2
            prefetchnta [esi]
LB1:
            prefetchnta XMMWORD PTR[esi + 16 * 4]
            movdqu      xmm0, XMMWORD PTR[esi]
            movntdq     XMMWORD PTR[edi], xmm0
            movdqu      xmm1, XMMWORD PTR[esi + 16 * 1]
            movntdq     XMMWORD PTR[edi + 16 * 1], xmm1
            movdqu      xmm2, XMMWORD PTR[esi + 16 * 2]
            movntdq     XMMWORD PTR[edi + 16 * 2], xmm2
            movdqu      xmm3, XMMWORD PTR[esi + 16 * 3]
            movntdq     XMMWORD PTR[edi + 16 * 3], xmm3
            prefetchnta XMMWORD PTR[esi + 16 * 8]
            movdqu      xmm4, XMMWORD PTR[esi + 16 * 4]
            movntdq     XMMWORD PTR[edi + 16 * 4], xmm4
            movdqu      xmm5, XMMWORD PTR[esi + 16 * 5]
            movntdq     XMMWORD PTR[edi + 16 * 5], xmm5
            movdqu      xmm6, XMMWORD PTR[esi + 16 * 6]
            movntdq     XMMWORD PTR[edi + 16 * 6], xmm6
            movdqu      xmm7, XMMWORD PTR[esi + 16 * 7]
            movntdq     XMMWORD PTR[edi + 16 * 7], xmm7
            add         esi, 128
            add         edi, 128
            sub         ecx, 128
            cmp         ecx, 128
            jae         LB1
LB2:
            cmp         ecx, 64
            jb          LB3
            prefetchnta XMMWORD PTR[esi]
            sub         ecx, 64
            movdqu      xmm0, XMMWORD PTR[esi]
            movntdq     XMMWORD PTR[edi], xmm0
            movdqu      xmm1, XMMWORD PTR[esi + 16 * 1]
            movntdq     XMMWORD PTR[edi + 16 * 1], xmm1
            movdqu      xmm2, XMMWORD PTR[esi + 16 * 2]
            movntdq     XMMWORD PTR[edi + 16 * 2], xmm2
            movdqu      xmm3, XMMWORD PTR[esi + 16 * 3]
            movntdq     XMMWORD PTR[edi + 16 * 3], xmm3
            add         esi, 64
            add         edi, 64
LB3:
            prefetchnta XMMWORD PTR[esi]
            cmp         ecx, 32
            jb          LB4
            sub         ecx, 32
            movdqu      xmm4, XMMWORD PTR[esi]
            movntdq     XMMWORD PTR[edi], xmm4
            movdqu      xmm5, XMMWORD PTR[esi + 16 * 1]
            movntdq     XMMWORD PTR[edi + 16 * 1], xmm5
            add         esi, 32
            add         edi, 32
LB4:
            cmp         ecx, 16
            jb          LB5
            sub         ecx, 16
            movdqu      xmm6, XMMWORD PTR[esi]
            movntdq     XMMWORD PTR[edi], xmm6
            //add       esi, 16
            //add       edi, 16
LB5:
            sfence
        }
    } // End if ( SSE_CAN_ALIGN( pDst, pSrc ) )

    offset = len & 0x0F;
    if ( offset > 0 )
    {
        memcpy( ((char *) pDst) + (len - offset),
                ((char *) pSrc) + (len - offset), offset );
    }
    return pBegin;
}
[Comments of the source code]
1) The system memcpy is adopted when the data length is small. If you prefer to do everything yourself, the small_fast_memcpy listed in the next section may be an ideal choice for handling small blocks of data.
void* small_fast_memcpy( void *pDst, const void *pSrc, size_t len )
{
    _asm
    {
        mov ecx, len
        mov edi, pDst
        mov esi, pSrc
        rep movsb
    }
    return pDst;
}
2) To keep the emphasis of the source code easy to follow, the simple logical judgments are implemented in C syntax. In fact, it is very easy to convert them into assembly as well. If you do not prefer the hybrid approach, why not do it yourself now? :)
3) Macros are utilized in the source code to avoid modifications if instructions require 64-byte alignment (some SSE4x instructions need 64-byte alignment). Please refer to the block below.
#define SSE_ALIGNMENT_VAL   (16)
#define SSE_ALIGNMENT_MASK  (SSE_ALIGNMENT_VAL - 1)
#define SSE_CAN_ALIGN( addr1, addr2 ) \
    ((((unsigned long) (addr1)) & SSE_ALIGNMENT_MASK) == \
     (((unsigned long) (addr2)) & SSE_ALIGNMENT_MASK))
4) Interleaving the XMMx registers decreases read/write dependences and improves instruction-level parallelism.
5) Interleaving the read and write operations adds delay that helps cover the latency of the outstanding prefetch.
6) The prefetch size is 64 bytes, the data-cache line length on my PC. It should be adjusted for the real environment, but it is an appropriate size in the normal case.
[Summary]
1. A speed improvement is only obtained when the data size is larger than the last-level data cache; otherwise, performance becomes worse. Testing with large data shows roughly a +50% improvement.
2. Be careful when using prefetch instructions; abusing them can heavily disturb system performance.
3. SSE2 is not the ultimate solution: if more advanced SSEx instructions are used, about a +1% further gain is available compared with SSE2.
Anyway, I hope readers will give me feedback to improve or correct my usage of SSEx; I think it will help me understand the internals more deeply.