all-about-SIMD-on_x86_64
2.Intel 64 instructions demo

1. copy a string

code snippet :

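char * _strcpy(char* dst,const char* src)
{
        char *d = dst;          /* working copies: the asm advances both pointers */
        const char *s = src;
        asm volatile("cld\n\t"
                "1:lodsb \n\t"
                "stosb \n\t"
                "testb %%al,%%al\n\t"
                "jne 1b\n\t"
        :"+D"(d),"+S"(s)        /* RDI and RSI are read and modified */
        :
        :"rax","cc","memory"    /* lodsb overwrites AL, testb changes the flags */
        );
        return dst;
}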

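lodsX loads a [byte/word/doubleword/quadword] from the address in RSI into [al/ax/eax/rax] and advances RSI; stosX stores the same register to the address in RDI and advances RDI. cld clears the direction flag so that both pointers move upward. test, like the arithmetic instructions, only updates the EFLAGS register, so we then branch on the Z flag. Note the numeric local label: a reference to it takes the suffix 'b' (backward, the nearest earlier definition) or 'f' (forward, the nearest later definition).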

2. read cpu timestamp register

code snippet:

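uint64_t rd_tsc(void)
{
        uint64_t tsc=0;
        #if 0
        /* variant 1: compose EDX:EAX in RDX, then move it to the output register */
        /* volatile: without it the compiler may merge or drop repeated reads */
        asm volatile("xorq %%rax,%%rax\n\t"
                "rdtsc\n\t"
                "shlq $32,%%rdx\n\t"
                "orq %%rax,%%rdx\n\t"
                "movq %%rdx,%0"
        :"=r"(tsc)
        :
        :"%rdx","%rax","cc"
        );
        #else
        /* variant 2: store the two halves through the address of tsc */
        asm volatile("leaq %0,%%rbx\n\t"
                "rdtsc\n\t"
                "movl %%edx,4(%%rbx)\n\t"
                "movl %%eax,(%%rbx)\n\t"
        :"=m"(tsc)
        :
        :"rbx","%rdx","%rax");
        #endif
        return tsc;
}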

Note that RDTSC reads the CPU timestamp counter into EDX:EAX (high half in EDX, low half in EAX). Both variants work: the first composes the 64-bit value in RDX with a shift and an OR and then moves it to the output register, while the second loads the effective address of the result first and then stores EAX and EDX to its low and high halves separately.

The DPDK implementation is simpler: it binds EAX and EDX directly to the two halves of a union in the output operand list, so the compiler generates the memory stores. Let the code show you:

static inline uint64_t
rte_rdtsc(void)
{
    union {
        uint64_t tsc_64;
        struct {
            uint32_t lo_32;
            uint32_t hi_32;
        };
    } tsc;

#ifdef RTE_LIBRTE_EAL_VMWARE_TSC_MAP_SUPPORT
    if (unlikely(rte_cycles_vmware_tsc_map)) {
        /* ecx = 0x10000 corresponds to the physical TSC for VMware */
        asm volatile("rdpmc" :
                     "=a" (tsc.lo_32),
                     "=d" (tsc.hi_32) :
                     "c"(0x10000));
        return tsc.tsc_64;
    }
#endif

    asm volatile("rdtsc" :
             "=a" (tsc.lo_32),
             "=d" (tsc.hi_32));
    return tsc.tsc_64;
}

The only prerequisite is a union that overlays a 64-bit value on two 32-bit halves; x86 is little-endian, so lo_32 comes first. Nothing else is special.

3. compare and set implementation

here is my code:

int cmpandset(uint64_t * dst_ptr,uint64_t target_data,uint64_t replace_data)
{
        uint8_t ret;
        asm volatile ("lock\n"
                "cmpxchgq %%rdx,(%%rbx)\n"
                "setz %0\n"
        :"=q"(ret),"+a"(target_data)    /* on failure cmpxchg loads *dst_ptr into RAX */
        :"b"(dst_ptr),"d"(replace_data)
        :"memory","cc"
        );
        return ret;

}

Let's take a look at how the DPDK library implements it:

static inline int
rte_atomic64_cmpset(volatile uint64_t *dst, uint64_t exp, uint64_t src)
{
    uint8_t res;


    asm volatile(
            MPLOCKED
            "cmpxchgq %[src], %[dst];"
            "sete %[res];"
            : [res] "=a" (res),     /* output */
              [dst] "=m" (*dst)
            : [src] "r" (src),      /* input */
              "a" (exp),
              "m" (*dst)
            : "memory");            /* no-clobber list */

    return res;
}

Note that the DPDK version uses the "m" constraint for the destination, while I put the pointer in a register and write the effective address by hand; both are OK. I use setz to capture the result while DPDK uses sete; these are two mnemonics for the same instruction, so both are still OK.

4. scan least set bit

The gcc builtin __builtin_ffsll() finds the least significant set bit: if the value is zero it returns 0, otherwise it returns the bit index (starting from 0) plus 1. Now I will implement something similar on my own; note that this version returns the 0-based index, and -1 for zero:

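int least_set_bit(uint64_t val)
{
        uint64_t ret;
        asm("bsf %1,%0;\n"              /* ZF=1 if val is zero, otherwise %0 = index */
                "jnz 2f;\n"             /* non-zero input: keep the index */
                "movq $-1,%0;\n"        /* zero input: report -1 */
                "2:"
        :"=q"(ret)
        :"m"(val)
        :"cc"
        );
        return (int)ret;
}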

This time I again forgot the q suffix (bsfq), but the result is still right: the assembler infers the 64-bit operand size from the register operand.
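For comparison, the same 0-based result can be obtained from the GCC builtins directly. This is a small sketch; lowest_set_bit_sketch is a hypothetical name, and it relies on __builtin_ctzll(), whose result is undefined for a zero argument, hence the explicit check.

```c
#include <stdint.h>

/* Hypothetical helper: 0-based index of the lowest set bit, or -1 for zero. */
static inline int lowest_set_bit_sketch(uint64_t val)
{
    if (val == 0)
        return -1;                   /* ctz is undefined for zero, so handle it first */
    return __builtin_ctzll(val);     /* trailing zero count == index of the lowest set bit */
}
```

For any non-zero value, __builtin_ffsll(val) equals lowest_set_bit_sketch(val) + 1; the two only differ in how they report zero (0 versus -1).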

5. generate cmp mask with MMX instruction

here is the code:

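uint64_t cmp_byte(uint64_t v1,uint64_t v2)
{
        uint64_t ret;
        asm("movq %1,%%mm0;"
                "movq %2,%%mm1;"
                "pcmpeqb %%mm1,%%mm0;"  /* each equal byte becomes 0xFF, otherwise 0x00 */
                "movq %%mm0,%0;"
                "emms;"                 /* leave MMX state so later x87 code still works */
                :"=m"(ret)
                :"m"(v1),"m"(v2)
                :"memory","mm0","mm1");
        return ret;

}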
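To show what the mask looks like, here is a plain C reference for the byte-wise comparison that pcmpeqb performs on a 64-bit operand. The name cmp_byte_ref_sketch is hypothetical; the function is meant as a readable model to check the MMX version against, not as a replacement for it.

```c
#include <stdint.h>

/* Hypothetical reference model: 0xFF in every byte lane where v1 and v2 agree. */
static inline uint64_t cmp_byte_ref_sketch(uint64_t v1, uint64_t v2)
{
    uint64_t mask = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t a = (uint8_t)(v1 >> (8 * i));   /* byte lane i of v1 */
        uint8_t b = (uint8_t)(v2 >> (8 * i));   /* byte lane i of v2 */
        if (a == b)
            mask |= (uint64_t)0xFF << (8 * i);
    }
    return mask;
}
```

For example, comparing 0x1122334455667788 with 0x1100334400667700 gives 0xFF00FFFF00FFFF00: the lanes holding 0x22, 0x55 and 0x88 differ, and every other lane matches.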