# 2. Intel 64 instructions demo

## 1. copy a string

code snippet:

```c
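char * _strcpy(char* dst,const char* src)
{
        /* working copies: lodsb/stosb advance RSI/RDI, so they must be
           declared as read-write operands, not as plain inputs */
        char *d = dst;
        const char *s = src;
        asm volatile("cld\n\t"              /* clear DF: pointers move forward */
                "1:lodsb \n\t"              /* AL = *RSI++ */
                "stosb \n\t"                /* *RDI++ = AL */
                "testb %%al,%%al\n\t"       /* was it the terminating NUL? */
                "jne 1b\n\t"                /* no: jump back to label 1 */
        :"+D"(d),"+S"(s)
        :
        :"rax","memory","cc"                /* AL and the flags are overwritten */
        );
        return dst;
}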
```

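`lodsX` loads a byte/word/doubleword/quadword from the address in RSI into AL/AX/EAX/RAX, and `stosX` stores that register to the address in RDI. Both then advance their pointer (forward here, because `cld` cleared the direction flag). `test`, like the arithmetic instructions, only updates the EFLAGS register, so we branch on the Z flag afterwards. Note the numeric local label: a reference to it takes the suffix `b` (backward, the nearest previous definition) or `f` (forward, the nearest following definition).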

## 2. read cpu timestamp register

code snippet:

```c
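uint64_t rd_tsc(void)
{
        uint64_t tsc=0;
        /* volatile: the asm has no inputs, so without it the compiler may
           merge or hoist repeated reads of the counter */
        #if 0
        /* variant 1: compose EDX:EAX in RDX, then move it to the output */
        asm volatile("xorq %%rax,%%rax\n\t"
                "rdtsc\n\t"
                "shlq $32,%%rdx\n\t"
                "orq %%rax,%%rdx\n\t"
                "movq %%rdx,%0"
        :"=r"(tsc)
        :
        :"%rdx","%rax","cc"
        );
        #else
        /* variant 2: take the address of tsc, then store the halves separately */
        asm volatile("leaq %0,%%rbx\n\t"
                "rdtsc\n\t"
                "movl %%edx,4(%%rbx)\n\t"   /* high 32 bits */
                "movl %%eax,(%%rbx)\n\t"    /* low 32 bits */
        :"=m"(tsc)
        :
        :"rbx","%rdx","%rax");
        #endif
        return tsc;
}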
```

Note that `RDTSC` reads the CPU timestamp counter into EDX:EAX. Both variants work: the first composes the 64-bit value in RDX and moves it to the output operand, while the second loads the effective address of `tsc` first and then stores EDX and EAX to the high and low halves in memory.

The DPDK implementation is simpler: it binds EAX and EDX directly to output operands and leaves the stores to the compiler. Let the code show you:

```c
static inline uint64_t
rte_rdtsc(void)
{
    union {
        uint64_t tsc_64;
        struct {
            uint32_t lo_32;
            uint32_t hi_32;
        };
    } tsc;

#ifdef RTE_LIBRTE_EAL_VMWARE_TSC_MAP_SUPPORT
    if (unlikely(rte_cycles_vmware_tsc_map)) {
        /* ecx = 0x10000 corresponds to the physical TSC for VMware */
        asm volatile("rdpmc" :
                     "=a" (tsc.lo_32),
                     "=d" (tsc.hi_32) :
                     "c"(0x10000));
        return tsc.tsc_64;
    }
#endif

    asm volatile("rdtsc" :
             "=a" (tsc.lo_32),
             "=d" (tsc.hi_32));
    return tsc.tsc_64;
}
```

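The only prerequisite is the union, which overlays the two 32-bit halves on the 64-bit value. This works because x86 is little-endian, so `lo_32` occupies the low half. Nothing special.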

## 3. compare and set implementation

here is my code:

```c
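int cmpandset(uint64_t * dst_ptr,uint64_t target_data,uint64_t replace_data)
{
        uint8_t ret;
        /* cmpxchg compares RAX with the memory operand: if equal, it stores
           RDX there and sets ZF; otherwise it loads the memory value into RAX.
           RAX can therefore change, so it is a read-write ("+a") operand. */
        asm volatile ("lock\n"
                "cmpxchgq %%rdx,(%%rbx)\n"
                "setz %0\n"
        :"=r"(ret),"+a"(target_data)
        :"b"(dst_ptr),"d"(replace_data)
        :"memory","cc"
        );
        return ret;
}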
```

Let's take a look at how the DPDK library implements it:

```c
static inline int
rte_atomic64_cmpset(volatile uint64_t *dst, uint64_t exp, uint64_t src)
{
    uint8_t res;


    asm volatile(
            MPLOCKED
            "cmpxchgq %[src], %[dst];"
            "sete %[res];"
            : [res] "=a" (res),     /* output */
              [dst] "=m" (*dst)
            : [src] "r" (src),      /* input */
              "a" (exp),
              "m" (*dst)
            : "memory");            /* no-clobber list */

    return res;
}
```

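Note that the DPDK version uses the "m" constraint and lets the compiler form the memory operand, while I load the pointer into a register and build the effective address myself. Both work. I use `setz` to capture the result while DPDK uses `sete`; these are two mnemonics for the same instruction (set byte if ZF = 1), so both are fine as well.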

## 4. scan least set bit

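The GCC builtin `__builtin_ffsll()` finds the least significant set bit: it returns 0 if the value is zero, otherwise the bit index (starting from 0) plus 1. Now I will implement it on my own. Note that my version uses a different convention: it returns -1 for zero and the 0-based index otherwise.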

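```c
int least_set_bit(uint64_t val)
{
        uint64_t ret;
        /* bsf sets ZF when the source is zero and leaves the
           destination undefined, so that case is patched to -1 */
        asm("bsf %1,%0;\n"
                "jnz 2f;\n"
                "movq $-1,%0;\n"
                "2:"
        :"=q"(ret)
        :"m"(val)
        :"cc"
        );
        return (int)ret;
}
```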

This time I again forgot the `q` suffix (`bsfq`), but the result is still right: the assembler infers the 64-bit operand size from the destination register.
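For cross-checking, the builtin can be mapped onto the same convention (-1 for zero, 0-based index otherwise) by subtracting 1. A minimal sketch, assuming GCC or Clang. The name `least_set_bit_builtin` is made up for this example.

```c
#include <stdint.h>

/* Hypothetical wrapper: __builtin_ffsll() returns 0 for zero and
   index + 1 otherwise, so subtracting 1 yields -1 or the 0-based index. */
static int least_set_bit_builtin(uint64_t val)
{
    return __builtin_ffsll((long long)val) - 1;
}
```

Comparing this wrapper with the hand-written version over a few inputs (zero, a single low bit, only the top bit) is a quick way to validate the asm.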

## 5. generate cmp mask with MMX instruction

here is the code:

```c
uint64_t cmp_byte(uint64_t v1,uint64_t v2)
{
        uint64_t ret;
        asm("movq %1,%%mm0;"
                "movq %2,%%mm1;"
                "pcmpeqb %%mm1,%%mm0;"  /* each byte: 0xFF if equal, else 0x00 */
                "movq %%mm0,%0;"
                "emms;"                 /* leave MMX state so later x87 code works */
                :"=m"(ret)
                :"m"(v1),"m"(v2)
                :"memory","mm0","mm1");
        return ret;
}
```
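`pcmpeqb` compares the two operands byte by byte and writes 0xFF into each result byte where they are equal and 0x00 where they differ. The portable sketch below spells out that behaviour in plain C and can serve as a reference when testing the MMX version. The name `cmp_byte_ref` is made up for this example.

```c
#include <stdint.h>

/* Hypothetical reference implementation of the byte-wise equality mask. */
static uint64_t cmp_byte_ref(uint64_t v1, uint64_t v2)
{
    uint64_t mask = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t a = (uint8_t)(v1 >> (8 * i));   /* byte i of v1 */
        uint8_t b = (uint8_t)(v2 >> (8 * i));   /* byte i of v2 */
        if (a == b)
            mask |= (uint64_t)0xFF << (8 * i);  /* mark byte i as equal */
    }
    return mask;
}
```

For example, `0x1122334455667788` and `0x1100330055007700` agree in every other byte, so the mask is `0xFF00FF00FF00FF00`.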

