all-about-SIMD-on_x86_64
1.Vectorized Memory Access

Here two categories of instruction sets are introduced: SSE/SSE2/SSE3/SSE4.1/SSE4.2 and AVX/AVX2/AVX-512 (even though I have never found a machine whose CPU is capable of AVX-512), where SSE stands for Streaming SIMD Extensions and AVX for Advanced Vector Extensions. AVX was brought into the x86 architecture later than SSE, and the parallel data width extends to 512 bits in AVX-512. That is to say, one AVX-512 instruction can manipulate all 64 bytes of a cache line (if, as is usually the case, the cache line is 64 bytes).
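
As a taste of the widest variant, here is a minimal sketch of copying one 64-byte cache line with a single AVX-512 load/store pair. It assumes a CPU with AVX-512F support and 64-byte aligned buffers, which I could not verify on the machine used below, so treat it as illustration only.

#include <immintrin.h>

/* copy one 64-byte cache line with a single AVX-512 load/store pair;
 * assumes AVX-512F support and 64-byte aligned src/dst (untested here,
 * the benchmark machine below has no AVX-512) */
static inline void copy_cache_line_avx512(void *dst, const void *src)
{
    __m512i line = _mm512_load_si512(src);        /* one 64-byte aligned load       */
    _mm512_stream_si512((__m512i *)dst, line);    /* one 64-byte non-temporal store */
}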

Here we use several benchmarks to show how vectorized memory access can sometimes accelerate memory accesses. The scenario: we construct a 2500*2500 matrix, where each matrix element is a 64-byte array aligned to the cache line size.

What we do is walk through the whole matrix, sequentially or randomly, and overwrite each element with another element of the matrix taken from the corresponding sequential or random address, i.e. copy elements with mat[i][j] = mat[2499-j][2499-i]. This will potentially increase the opportunity for cache conflicts.

Note that 2500*2500*64 = 400,000,000 bytes, which is almost 400 MB of memory. Fortunately we can allocate it from DPDK EAL memory, align it, and even take it from a specific NUMA socket (later we will demonstrate how inefficient cross-NUMA memory access is compared to in-socket access).
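
For reference, here is a minimal sketch of how that allocation might look with the DPDK EAL allocator; rte_malloc_socket() lets us request 64-byte alignment and pick the NUMA socket the memory comes from (the 405 MB size is simply a round figure above the 400 MB we need):

#include <rte_malloc.h>

/* allocate the matrix backing store from DPDK EAL memory:
 * ~405 MB, 64-byte aligned, taken from the given NUMA socket */
static void *alloc_matrix_on_socket(int socket_id)
{
    return rte_malloc_socket(NULL,               /* no type tag                   */
                             1024UL*1024*405,    /* a bit more than 2500*2500*64  */
                             64,                 /* cache line alignment          */
                             socket_id);         /* NUMA socket to take from      */
}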

Here we first give the whole benchmark framework. It runs in one lcore callback routine of the DPDK example helloworld on NUMA socket 1. Let's check the CPU layout first:

[root@server-64 ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Stepping:              2
CPU MHz:               1200.562
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
[root@server-64 ~]#

We can find out that the CPU TSC frequency (total ticks per second) is 2400002495 by invoking the DPDK exported function rte_get_tsc_hz().
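
All the raw TSC deltas below can be converted to wall time with this frequency; a small helper, assuming only rte_get_tsc_hz() from rte_cycles.h:

#include <stdint.h>
#include <rte_cycles.h>

/* convert a TSC cycle delta into milliseconds using the calibrated
 * TSC frequency (2400002495 Hz on this machine) */
static double cycles_to_ms(uint64_t cycles)
{
    return (double)cycles * 1000.0 / (double)rte_get_tsc_hz();
}
/* e.g. cycles_to_ms(299832720) is roughly 125 ms, as seen in case 1) below */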

Here is the overall skeleton:

#include <stdio.h>
#include <assert.h>
#include <stdint.h>
#include <inttypes.h>
#include <rte_lcore.h>
#include <rte_malloc.h>
#include <rte_cycles.h>

#define is_cache_line_aligned(addr) (!(((uint64_t)(addr))&0x3f))

struct foo{
        unsigned char bar[64];          /* one cache line per element */
}__attribute__((aligned(64)));

static int
lcore_hello(__attribute__((unused)) void *arg)
{
        uint64_t start,end;
        unsigned lcore_id;
        lcore_id = rte_lcore_id();
        int idx_row,idx_col;
        void * base;
        struct foo  (*ptr)[2500];
        printf("hello from core %u\n", lcore_id);
        if(lcore_id != 4)               /* run the benchmark on one lcore only */
                return 0;
        base=rte_malloc(NULL,1024*1024*405,64);   /* ~405 MB, 64-byte aligned */
        ptr=(void*)(0+(char*)base);
        start=rte_rdtsc();
        for(idx_col=0;idx_col<2500;idx_col++)
            for (idx_row=0;idx_row<2500;idx_row++)
            {
                assert(is_cache_line_aligned(&ptr[idx_row][idx_col]));
                /*do memory copy*/

            }
        end=rte_rdtsc();
        printf("mem base:%p total cycles:%"PRIu64"\n",base,end-start);
        return 0;
}

1) copy with glibc memcpy

Put this code snippet into the spot marked above:

{
    memcpy(&ptr[idx_row][idx_col],&ptr[2499-idx_col][2499-idx_row],64);
}

Here the overall TSC cycle count measured is 299832720, which at 2400002495 ticks per second is almost 125 ms.

2) copy with the DPDK memory utility function rte_memcpy

Here is the code snippet:

{
    rte_memcpy(&ptr[idx_row][idx_col],&ptr[2499-idx_col][2499-idx_row],64);
}

The total TSC cycle count is 297097232, from which we can see that rte_memcpy brings very little improvement here, perhaps because of the default compilation options.

3) copy with SSE vectorized non-temporal stream instructions

Here is the code snippet:

{ /*sse2/4_1 optimized*/
    __m128i m0,m1,m2,m3;
    m0=_mm_stream_load_si128((__m128i *)(0+(char*)&ptr[2499-idx_col][2499-idx_row]));
    m1=_mm_stream_load_si128((__m128i *)(16+(char*)&ptr[2499-idx_col][2499-idx_row]));
    m2=_mm_stream_load_si128((__m128i *)(32+(char*)&ptr[2499-idx_col][2499-idx_row]));
    m3=_mm_stream_load_si128((__m128i *)(48+(char*)&ptr[2499-idx_col][2499-idx_row]));
    _mm_stream_si128 ((__m128i *)(0+(char*)&ptr[idx_row][idx_col]),m0);
    _mm_stream_si128 ((__m128i *)(16+(char*)&ptr[idx_row][idx_col]),m1);
    _mm_stream_si128 ((__m128i *)(32+(char*)&ptr[idx_row][idx_col]),m2);
    _mm_stream_si128 ((__m128i *)(48+(char*)&ptr[idx_row][idx_col]),m3);
}

The total TSC cycle count is 159769811, almost 66 ms; that is nearly half of the time taken by the glibc memcpy version. One limitation is that when we use SSE stream instructions, the addresses must be 16-byte aligned.
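
If that 16-byte alignment cannot be guaranteed, the unaligned load/store variants still work, but they go through the cache as ordinary accesses; a minimal sketch of such a fallback (my own illustration, not part of the benchmark above):

#include <stdint.h>
#include <emmintrin.h>   /* _mm_loadu_si128 / _mm_storeu_si128 / _mm_stream_si128 */
#include <smmintrin.h>   /* _mm_stream_load_si128 (SSE4.1) */

#define is_16B_aligned(addr) (!(((uint64_t)(addr))&0xf))

/* copy 16 bytes: take the non-temporal path only when both addresses
 * are 16-byte aligned, otherwise fall back to unaligned (cached) copies */
static inline void copy16(void *dst, void *src)
{
    if (is_16B_aligned(dst) && is_16B_aligned(src)) {
        __m128i m = _mm_stream_load_si128((__m128i *)src);
        _mm_stream_si128((__m128i *)dst, m);
    } else {
        __m128i m = _mm_loadu_si128((const __m128i *)src);
        _mm_storeu_si128((__m128i *)dst, m);
    }
}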

4) copy with AVX/AVX2 vectorized non-temporal stream instructions

Code snippet again:

{ /*avx/avx2 stream optimized*/
    __m256i m0,m1;
    m0=_mm256_stream_load_si256((__m256i *)(0+(char*)&ptr[2499-idx_col][2499-idx_row]));
    m1=_mm256_stream_load_si256((__m256i *)(32+(char*)&ptr[2499-idx_col][2499-idx_row]));
    _mm256_stream_si256((__m256i *)(0+(char*)&ptr[idx_row][idx_col]),m0);
    _mm256_stream_si256((__m256i *)(32+(char*)&ptr[idx_row][idx_col]),m1);
}

With AVX/AVX2 the data width is 256 bits, which halves the instruction count. The total TSC cycle count is 156915100; even with fewer instructions, the result is almost the same as the SSE version. We will explain the reason later.

5) copy with AVX/AVX2 vectorized temporal load/store instructions

Code snippet:

{/*avx/2 temporal version*/
    __m256i m0,m1;
    m0=_mm256_load_si256((__m256i *)(0+(char*)&ptr[2499-idx_col][2499-idx_row]));
    m1=_mm256_load_si256((__m256i *)(32+(char*)&ptr[2499-idx_col][2499-idx_row]));
    _mm256_store_si256((__m256i *)(0+(char*)&ptr[idx_row][idx_col]),m0);
    _mm256_store_si256((__m256i *)(32+(char*)&ptr[idx_row][idx_col]),m1);
}

The total TSC cycle count is 296072472, about the same as the glibc version.

Here we explain why.

With the plain glibc memory copy, both the source and the destination memory have to be fetched into the cache, one cache line at a time, and since the source addresses are effectively random, a great many cache misses occur. So do not be surprised that it takes so long.

As I said, vectorized instructions may accelerate overall performance, but notice cases 5) and 1): even with vectorized instructions the total time is similar. The answer is that when it comes to memory I/O, vectorization is not the leading factor; the cache matters most. Cases 3) and 4) are fast because they bypass cache fetching and eviction.

Next we will explain how cache bypassing works.

Let's reiterate: when a read or write to an address happens, the corresponding cache line is first loaded into the L3/L2/L1d caches, which usually takes many CPU cycles, and on top of that it forces some other cache line to be evicted at some level. The idea of cache bypassing is to load data from memory directly into CPU registers and to write registers directly back to memory. This sounds simple, but direct interaction between the CPU and memory is never cheaper than between the CPU and its caches; the benefit is that cache bypassing never pollutes the current cache contents. Still, you may wonder whether a cache-bypassing write of a few bytes immediately forces the CPU to write that data to memory. The answer is no: the CPU maintains a set of write-combining buffers, and when a cache-bypassing write happens it is first recorded there; once a whole buffer is filled up to the cache line size, or a fence (e.g. sfence) is issued, the buffer is flushed to memory. This is why cache-bypassing instructions require well-aligned addresses and can still deliver high performance.
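
This also means the non-temporal stores are weakly ordered: before the timing ends (or before another core consumes the data), the write-combining buffers should be drained with a store fence. A minimal sketch of how the streaming-store cases above could be fenced, in the same fragment style as the snippets:

#include <xmmintrin.h>   /* _mm_sfence */

{
    /* after the last _mm_stream_si128 / _mm256_stream_si256 of the loop */
    _mm_sfence();        /* drain the write-combining buffers so every
                          * non-temporal store is globally visible */
    end = rte_rdtsc();   /* only now is the end timestamp meaningful */
}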

With cache-bypassing reads, the first read loads the cache line into a streaming-load buffer, and the succeeding reads can be served from that buffer as long as the data is still there, accelerating the whole read while polluting no cache. But remember, loading the entire cache line from memory is still costly.

From case 5) we know that even when vectorized instructions are used, the overall memory copy does not get much faster, because of the costly and frequent cache fetching and eviction.

By the way, we just mentioned how bad cross-NUMA memory access is; here we demonstrate it. Even with the AVX2-optimized memory copy, cross-NUMA access is still expensive: we get 324323287 cycles. See, the time roughly doubles, i.e. the performance degrades by half.
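
To reproduce that cross-NUMA case, the only change needed is to take the memory from the remote socket while the lcore stays where it is; a minimal sketch, assuming the DPDK calls rte_socket_id() and rte_malloc_socket():

#include <rte_lcore.h>
#include <rte_malloc.h>

/* allocate the matrix on the other NUMA node than the one the current
 * lcore runs on, so every element copy crosses the socket interconnect */
static void *alloc_matrix_remote(void)
{
    int local  = (int)rte_socket_id();     /* socket of the calling lcore          */
    int remote = (local == 0) ? 1 : 0;     /* the other socket on this two-node box */
    return rte_malloc_socket(NULL, 1024UL*1024*405, 64, remote);
}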
