3.principle of vectorizing ring buffer

3.1 essence of the ring buffer

I would say ring buffer is kind of static queue which is hosted in continuous memory space,from data structure class in school I know that we must preserve a element as a sentinel ,which will indicates whether the qeueu is full.

but here the queue is another one which is consistent with the queue implementation of dpdk ring ,i.e. the rear and front pointer does not wrap around,instead ,a mask is bit-and applied whever we need the real in-memory location.

another important thing is that ring buffer element size is equal to cache line size,which means minimum fetching size of ring buffer queue is cache line size (usually it's 64 bytes on Intel x86 architecture),besides ,the starting address of ring buffer should be cache line size aligned,and thus ensuring the following blocks of 64-bytes are also aligned.this is extrodinary important because lots of memory relevant instructions need address to be aligned to certain bytes length.

3.2 how to copy memory effictively

copying memory will take lots of cpu cycles,i.e. that will share lots of cpu time and result in inefficient usage for other processes ,image we have mechanism with which certain cpu instruction or special device assists copying memory as DMA behaves,once copying is done ,it will notify cpu through intrupt or something ,but there is no device or native cpu can do this.if we really need to copy memory ,cpu must be involved .

another fact is memory copied to or from ring buffer is probably not gonna be referenced recently ,that's to say:these memory is not temporally local,so we should use non-temporal instructions when we access the memory , when the non-temporal instruction is executed ,memory will directly loaded from memory and written back to memory ,thus bypassing cache (this could be very critical because memory copying between packet buffer and ring buffer of virtual link imposes no negative influence on existing cache layout,it means no cache line should be evicted out of cache because of this behavior).as we know clearly ,the nearer data is to cpu ,the cheaper data accessing is .often when data is available at Level 1 data cache , it takes 1-2 cycles ,Level 2 cache needs almost cycles ,the worst case is the data only is available in main memory ,the total cycles needed is about cycles (according to Intel architecture optimization reference). one may wonder since non-temporal instruction will interfere directory memory directory with higher cost,is that really possible to bypass cache and still gaining high performance ? I would it's possible ,the reason is when you load a certain bytes from memory with non-temporal instruction ,it will load the whole cache line into stream loading buffer ,and then subsequent load operation will access this buffer as long as the referenced data is still in the same cache line. and this is true with non-temporal store instruction ,the writing combining protocol will snoop cpu store instruction ,when a cache line is fully written or before another cache line data is gonna to be written back,the previous cache line writing takes effect immediately ,this is how load and store is not as inefficient as you may think.alas ,the address must be restrictly aligned ,or exception will be triggled.

3.3 common data structure

the following data structure definition should be self explainary:

struct vecring_block64{
    union{
        struct{
            __attribute__((aligned(32))) void *dummy32_0 ;
            __attribute__((aligned(32))) void *dummy32_1 ;
        };
        struct{
            __attribute__((aligned(16))) void *dummy16_0 ;
            __attribute__((aligned(16))) void *dummy16_1 ;
            __attribute__((aligned(16))) void *dummy16_2 ;
            __attribute__((aligned(16))) void *dummy16_3 ;
        };
    };
};

struct vecring_header_t{
    uint64_t vring_ready;
    uint64_t front;
    uint64_t rear;
    uint32_t nr_block64;/*must be power of 2*/
    /*where nr_block64-1 is the mask*/
    __attribute__((aligned(64))) struct vecring_block64 data[0] ;
};

struct vecring_element_t{
    union{
        uint64_t data_start_index;
        struct {
            uint32_t data_start_index0;
            uint32_t data_start_index1;
        };
    };
    union{
        uint32_t data_u32;
        struct {
            uint16_t data_length;
            uint8_t is_fetched;
            uint8_t end_of_local_block;
        };
    };
    uint32_t reserved_32;
}__attribute__((packed));

vecring_block64 represents element definition of ring buffer,it offser helper fildes for developer to address 16 or 32 bytes offset address because of compiler alignment options.

vecring_header_t comprises whatever a queue needs to maintain ,including front and rear pointer,a field ving_ready is employed by two sides of virtual link to synchronize status.nr_block64 which is always power of 2,indicates how many elements the ring buffer can contain,also ,alignment must be exploited to make sure the header is cache line size aligned.

vecring_element_t is supposed to be the ctrl header of packet,it occupies 16 butes,and 4 ctrl header can be grouped into one cache line ,that's to say :we emerge at most 4 ctrl headers which can support more efficient enqueuing bulk data transfering. we will explain this later.

3.4 steps of enqueuing

Vecring can support packet transfering with 3 kinds of bulking enqueuing methods,they are respectively x4/x2/x1 . the reason why maximum x4 is explained just before.

next will demonstrate the scenario of how enqueuing behaves with x4 speed,the ring buffer layout is presented as below:

rember we can put at most 4 ctrl header into one 64-byte block ,and the 4th ctrl header also indicate end of ctrl block ,as a matter of fact ,receiving side will know this even this flag is not set.each ctrl header actually container data length field and begining index of 64-bit length which can be used for address the data location.

as a complementary ,the following figure depicts x2 scenario:

we should know how important the flag end of ctrl block is ,if it's not set,both sides can crash down. Last ,we use a wrapper function to judge which function will be invoked and grouped according to available blocks still in the ring .

3.5 steps of dequeuing

dequeuing from the ring buffer the opposite operation of enqueuing ,it's also kind of bulk operation .

when are trying to fetch packets from ring buffer ,when load the current first block which may contain more than one ctrl header,we decode header and then allocate dpdk native rte_mbuf ,copy data into rte_mbuf. and then update rear header.

one problem we may encounter is rte_mbuf allocation may fail , to mitigrate this failure ,we should mark previous handled ctrl header as fetched then write back where the are from.

Previous2.environment setup Next4.performance of vecring

Last updated 4 years ago

Was this helpful?