1. Vectorized Memory Access
Two categories of instruction sets are introduced here: SSE/SSE2/SSE3/SSE4.1/SSE4.2 and AVX/AVX2/AVX-512 (though I have never found a machine whose CPU is capable of AVX-512), where SSE stands for Streaming SIMD Extensions and AVX for Advanced Vector Extensions. AVX was brought into the x86 architecture later than SSE, and the parallel data width extends to 512 bits in AVX-512. That is to say, one AVX-512 instruction can manipulate all 64 bytes of a cache line (assuming, as is usual, the line is 64 bytes).
Here we use several benchmarks to show how vectorized memory access can sometimes accelerate accessing memory. The scenario: we construct a 2500x2500 matrix, where each matrix element is a 64-byte array aligned to the cache line size.
What we do is walk through the whole matrix and replace each element with another element of the matrix at the mirrored position, i.e. copy with mat[i][j] = mat[2499-j][2499-i], so while the destination addresses are sequential, the source addresses are effectively scattered.
This access pattern potentially increases the opportunity for cache conflicts.
Note that 2500 * 2500 * 64 = 400,000,000,
which is almost 400 MB of memory. Fortunately, we can allocate it from DPDK EAL memory, align it, and even place it on a specific NUMA socket (later we will demonstrate how inefficient cross-NUMA memory access is compared with in-socket access).
Here we first give the whole benchmark framework. It runs in an lcore callback routine of the DPDK helloworld example, on NUMA socket 1. Let's check the CPU layout first:
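DPDK ships a helper script (usertools/cpu_layout.py in recent releases) that prints the core-to-socket mapping; its output is machine-specific, so it is not reproduced here:

```
$ python3 usertools/cpu_layout.py
```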
We can learn that the CPU's TSC frequency (total ticks per second) is 2400002495 by invoking the DPDK exported function rte_get_tsc_hz().
Here is the overall skeleton:
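A minimal sketch of what such a skeleton might look like, assuming DPDK's rte_malloc_socket(), rte_rdtsc() and rte_get_tsc_hz(); the constants and helper names (DIM, ELEM_SIZE, bench, copy_elem, lcore_bench) are illustrative placeholders, not from the original:

```c
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <rte_malloc.h>
#include <rte_cycles.h>

#define DIM        2500
#define ELEM_SIZE  64                  /* one cache line per element */

typedef uint8_t elem_t[ELEM_SIZE];
static elem_t *mat;                    /* DIM * DIM elements, ~400 MB */

/* time one full pass over the matrix with a given element-copy routine */
static void
bench(const char *name, void (*copy_elem)(void *dst, const void *src))
{
    uint64_t start = rte_rdtsc();
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            copy_elem(&mat[i * DIM + j],
                      &mat[(DIM - 1 - j) * DIM + (DIM - 1 - i)]);
    printf("%s: %" PRIu64 " cycles\n", name, rte_rdtsc() - start);
}

static int
lcore_bench(void *arg)
{
    (void)arg;
    printf("tsc hz: %" PRIu64 "\n", rte_get_tsc_hz());
    /* cache-line-aligned allocation from EAL memory, on NUMA socket 1 */
    mat = rte_malloc_socket(NULL, sizeof(elem_t) * DIM * DIM,
                            ELEM_SIZE, 1);
    if (mat == NULL)
        return -1;
    /* one bench() call per case below, e.g. bench("memcpy", copy_memcpy); */
    rte_free(mat);
    return 0;
}
```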
1) Copy with glibc memcpy
Put this code snippet into the skeleton above:
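The baseline case is just a 64-byte memcpy per element; a sketch using the copy_elem scaffolding assumed above:

```c
#include <string.h>

static void
copy_memcpy(void *dst, const void *src)
{
    memcpy(dst, src, ELEM_SIZE);   /* plain glibc memcpy, one cache line */
}
```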
Here we can calculate that the overall TSC cycles needed are 299832720, almost 125 ms.
2) Copy with the DPDK memory utility function rte_memcpy
Here is the code snippet:
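Again a sketch; rte_memcpy() is DPDK's SIMD-optimized memcpy replacement, declared in rte_memcpy.h:

```c
#include <rte_memcpy.h>

static void
copy_rte_memcpy(void *dst, const void *src)
{
    rte_memcpy(dst, src, ELEM_SIZE);   /* DPDK's vectorized memcpy */
}
```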
The total TSC cycles needed are 297097232. From this we can see that rte_memcpy gains very little over glibc here, perhaps because of the default compilation options.
3) Copy with SSE vectorized non-temporal stream instructions
Here is the code snippet:
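A sketch of a non-temporal SSE copy, built on the real intrinsics _mm_stream_load_si128 (SSE4.1, MOVNTDQA) and _mm_stream_si128 (SSE2, MOVNTDQ); the helper name is a placeholder:

```c
#include <smmintrin.h>   /* SSE4.1: _mm_stream_load_si128 */

static void
copy_sse_nt(void *dst, const void *src)
{
    __m128i *d = (__m128i *)dst;
    __m128i *s = (__m128i *)src;

    /* both addresses must be 16-byte aligned; 4 x 16 bytes = one line */
    for (int k = 0; k < 4; k++) {
        __m128i tmp = _mm_stream_load_si128(&s[k]);  /* NT load  */
        _mm_stream_si128(&d[k], tmp);                /* NT store */
    }
    _mm_sfence();   /* drain the write-combining buffers; could be
                       hoisted out of the per-element copy */
}
```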
The total TSC cycles needed are 159769811, almost 66 ms. That is nearly half the time of the glibc memcpy version, but one limitation is that with SSE stream instructions the addresses must be 16-byte aligned.
4) Copy with AVX/AVX2 vectorized non-temporal stream instructions
The code snippet again:
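A sketch with the 256-bit intrinsics; _mm256_stream_load_si256 requires AVX2 and _mm256_stream_si256 requires AVX, and both need 32-byte-aligned addresses (satisfied here, since every element is 64-byte aligned):

```c
#include <immintrin.h>   /* AVX/AVX2 intrinsics */

static void
copy_avx_nt(void *dst, const void *src)
{
    __m256i *d = (__m256i *)dst;
    __m256i *s = (__m256i *)src;

    /* 2 x 32 bytes = one cache line */
    _mm256_stream_si256(&d[0], _mm256_stream_load_si256(&s[0]));
    _mm256_stream_si256(&d[1], _mm256_stream_load_si256(&s[1]));
    _mm_sfence();
}
```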
With AVX/AVX2 the data width is 256 bits, which halves the instruction count. The total TSC cycles: 156915100. Even with fewer instructions, the result is almost the same as the SSE version; we will explain the reason later.
5) Copy with AVX/AVX2 vectorized temporal (ordinary, cache-going) instructions
The code snippet:
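A sketch using the ordinary (temporal) 256-bit load/store intrinsics, which go through the cache like any other access:

```c
#include <immintrin.h>

static void
copy_avx_temporal(void *dst, const void *src)
{
    __m256i *d = (__m256i *)dst;
    const __m256i *s = (const __m256i *)src;

    /* regular loads/stores: both lines travel through L1d/L2/L3 */
    _mm256_store_si256(&d[0], _mm256_load_si256(&s[0]));
    _mm256_store_si256(&d[1], _mm256_load_si256(&s[1]));
}
```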
The total TSC cycles: 296072472, essentially the same as glibc.
Here we explain why. With the glibc non-optimized memory copy, both the source and destination memory must be fetched into the cache, one cache line at a time, and since the source addresses are scattered, a great many cache misses occur. So do not be surprised that it takes so long.
As I said, vectorized instructions may accelerate overall performance, but compare cases 5) and 1): even with vectorized instructions the total time is similar. The answer is that when it comes to memory I/O, vectorization is not the leading factor; the cache matters most. Cases 3) and 4) win because they bypass cache fetching and eviction.
Next we will explain how cache bypassing works.
Let's reiterate: when a read or write to an address happens, the corresponding cache line is first loaded into the L3/L2/L1d cache, which usually takes many CPU cycles; moreover, this evicts some other cache line at a certain level. The idea of cache bypassing is to load memory directly into CPU registers and to write CPU registers directly to memory. This sounds simple, but direct interaction between the CPU and memory is never cheaper than between the CPU and the cache; the benefit is that cache bypassing never pollutes the current cache layout.
Still, you may worry that when you write even a single byte with cache bypassing, the CPU immediately writes that data to memory. The answer is probably not: the CPU maintains write-combining (WC) buffers, and when a cache-bypassing write happens it is first recorded there; when a whole buffer is filled up to cache line size, or an _mm_sfence() is issued, the buffer is flushed out. This is why cache-bypassing instructions require well-aligned addresses and can still achieve high performance.
With a cache-bypassing read, the first read loads the cache line into a streaming-load buffer, and the succeeding reads can be served from that buffer as long as the data is still there, thus accelerating the whole read while polluting no cache. But remember, loading the entire cache line is still costly. (Note that on ordinary write-back memory many CPUs treat streaming loads like regular loads; the full non-temporal benefit of MOVNTDQA applies to write-combining memory.)
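As an illustration (my own sketch, not from the original), reading one 64-byte element with streaming loads looks like this:

```c
#include <smmintrin.h>

static void
read_element_nt(const void *src, uint8_t out[ELEM_SIZE])
{
    const __m128i *s = (const __m128i *)src;

    /* the first MOVNTDQA fills the streaming-load buffer for this line;
       the remaining three loads are served from that buffer */
    for (int k = 0; k < 4; k++) {
        __m128i tmp = _mm_stream_load_si128((__m128i *)&s[k]);
        _mm_storeu_si128((__m128i *)(out + 16 * k), tmp);
    }
}
```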
From case 5) we know that even when vectorized instructions are used, the overall memory copy does not get much faster, due to costly and frequent cache fetching and eviction.
By the way, we mentioned how awful cross-NUMA memory access is; here we demonstrate it. Even with the AVX2-optimized memory copy, cross-NUMA access is still expensive: we get 324323287 cycles. See, the cycle count is doubled, i.e. performance degrades by half.
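To reproduce this case, it should suffice to allocate the matrix on the remote socket while the benchmark lcore stays on socket 1; a sketch (the socket ids are assumptions based on the setup above):

```c
/* same benchmark, but the matrix now lives on remote socket 0
   while the lcore keeps running on socket 1 */
mat = rte_malloc_socket(NULL, sizeof(elem_t) * DIM * DIM, ELEM_SIZE, 0);
```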