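lodsX loads a string element [byte/word/doubleword/quadword] from memory into [al/ax/eax/rax], and stosX stores it back the reverse way. test only affects the EFLAGS register (arithmetic instructions update the flags as well), and then we branch according to the Z flag. Note the numeric labels: when we jump to one we add a suffix, 'b' (backward, the nearest definition before the jump) or 'f' (forward, the nearest definition after it).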
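To make the pieces concrete, here is a minimal sketch of a bounded string copy that uses lodsb/stosb, test, and both label suffixes. It assumes x86-64 with GCC or Clang and AT&T syntax; the function name `asm_copy_str` is hypothetical and is not taken from the original listing.

```c
#include <stddef.h>

/* Hypothetical helper: copy at most n bytes from src to dst, stopping
 * after the terminating NUL. Returns the number of bytes written. */
static size_t asm_copy_str(char *dst, const char *src, size_t n)
{
    size_t left = n;

    __asm__ __volatile__(
        "cld\n\t"                        /* DF = 0, so RSI/RDI move upward  */
        "1:\n\t"
        "testq %[left], %[left]\n\t"     /* only EFLAGS change              */
        "jz 2f\n\t"                      /* forward to the next label 2     */
        "lodsb\n\t"                      /* AL = *RSI++                     */
        "stosb\n\t"                      /* *RDI++ = AL                     */
        "decq %[left]\n\t"
        "testb %%al, %%al\n\t"           /* ZF = 1 when AL is the NUL byte  */
        "jnz 1b\n\t"                     /* back to the previous label 1    */
        "2:\n\t"
        : "+S"(src), "+D"(dst), [left] "+r"(left)
        :
        : "rax", "memory", "cc");

    return n - left;
}
```

Numeric labels can be defined many times in one file, which is why the jump has to say which direction to search; this also keeps the asm safe when the compiler inlines the function more than once.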
Note that RDTSC reads the CPU timestamp counter into EDX:EAX. Both ways work: the first composes the 64-bit timestamp in a register and stores it to memory as one quadword, while the second loads the effective address first and then stores EAX and EDX to memory separately.
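The two variants might look roughly like the sketch below. It assumes x86-64 with GCC or Clang; the function names are hypothetical, and in the second variant the "r" constraint is what places the destination address in a register.

```c
#include <stdint.h>

/* Way 1 (hypothetical name): build the quadword in RAX, then store it once. */
static inline void rdtsc_store_quad(uint64_t *out)
{
    __asm__ __volatile__(
        "rdtsc\n\t"                  /* EDX:EAX = timestamp; in 64-bit mode the
                                        upper halves of RAX and RDX are cleared */
        "shlq $32, %%rdx\n\t"        /* move the high half into bits 63..32     */
        "orq %%rdx, %%rax\n\t"       /* RAX = full 64-bit timestamp             */
        "movq %%rax, %0\n\t"         /* single quadword store                   */
        : "=m"(*out)
        :
        : "rax", "rdx", "cc");
}

/* Way 2 (hypothetical name): keep the address in a register and store
 * the two halves separately (little-endian: low half first). */
static inline void rdtsc_store_halves(uint64_t *out)
{
    __asm__ __volatile__(
        "rdtsc\n\t"
        "movl %%eax, (%0)\n\t"       /* low 32 bits at offset 0  */
        "movl %%edx, 4(%0)\n\t"      /* high 32 bits at offset 4 */
        :
        : "r"(out)
        : "rax", "rdx", "memory");
}
```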
The DPDK implementation is simpler: it moves the memory store operation into the asm input/output operand list.
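A sketch of that operand-list style for RDTSC, assuming x86-64 and GCC-style extended asm. It is written in the spirit of the approach described and is not copied from the DPDK source; `rdtsc_via_outputs` is a hypothetical name.

```c
#include <stdint.h>

/* Hypothetical helper: the asm body is only "rdtsc". The output
 * constraints tell the compiler that EAX and EDX hold the results,
 * and the compiler generates whatever moves or stores are needed. */
static inline uint64_t rdtsc_via_outputs(void)
{
    uint32_t lo, hi;

    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));

    return ((uint64_t)hi << 32) | lo;
}
```

Because no registers are hard-coded in the asm body beyond what the constraints declare, there is no clobber list to maintain and the compiler is free to keep the result in registers when the caller never stores it.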
Note that the DPDK version uses the "m" constraint, while I use a register to construct the effective address; both are OK. I use setz to test the result while DPDK uses sete; both still work, because setz and sete are two mnemonics for the same instruction (set the byte if ZF = 1).
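The difference can be sketched with a 32-bit compare-and-set written both ways. This assumes the operation under discussion is a `lock cmpxchg` style compare-and-set on x86-64; both function names are hypothetical and neither body is the DPDK source.

```c
#include <stdint.h>

/* Variant A (hypothetical): memory operand via the "m" constraint, result via sete. */
static inline int cmpset32_m(volatile uint32_t *dst, uint32_t exp, uint32_t src)
{
    uint8_t res;

    __asm__ __volatile__(
        "lock cmpxchgl %[src], %[dst]\n\t"   /* if (*dst == EAX) *dst = src */
        "sete %[res]"                        /* res = ZF                    */
        : [res] "=q"(res), [dst] "+m"(*dst), "+a"(exp)
        : [src] "r"(src)
        : "memory", "cc");
    return res;
}

/* Variant B (hypothetical): address held in a register, result via setz. */
static inline int cmpset32_r(volatile uint32_t *dst, uint32_t exp, uint32_t src)
{
    uint8_t res;

    __asm__ __volatile__(
        "lock cmpxchgl %[src], (%[ptr])\n\t"
        "setz %[res]"                        /* same opcode as sete */
        : [res] "=q"(res), "+a"(exp)
        : [src] "r"(src), [ptr] "r"(dst)
        : "memory", "cc");
    return res;
}
```

With "m" the compiler picks the addressing mode itself; with "r" the pointer must first be placed in a register, which costs nothing here but leaves less freedom to the compiler.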
4. scan the least significant set bit
The gcc builtin function __builtin_ffsll() scans for the least significant set bit: if the value is all zeros it returns 0, otherwise it returns the bit's index (starting from 0) plus 1. Now I will implement it on my own:
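One way such an implementation might look is sketched here with the bsf instruction, assuming x86-64 and GCC or Clang. The name `my_ffsll` is hypothetical. Since bsf leaves its destination undefined when the source is zero, the sketch captures ZF with setz and handles that case separately.

```c
#include <stdint.h>

/* Hypothetical re-implementation of __builtin_ffsll() using bsf. */
static inline int my_ffsll(uint64_t value)
{
    uint64_t idx;
    uint8_t  zero;

    __asm__(
        "bsfq %[val], %[idx]\n\t"   /* idx = index of lowest set bit; ZF = 1 if val == 0 */
        "setz %[zero]"              /* remember whether the input was all zeros          */
        : [idx] "=r"(idx), [zero] "=q"(zero)
        : [val] "rm"(value)
        : "cc");

    /* idx is meaningless when the input was zero, so do not read it then. */
    return zero ? 0 : (int)idx + 1;
}
```

For example, `my_ffsll(8)` returns 4 because bit 3 is the lowest set bit, and `my_ffsll(0)` returns 0.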