all-about-SIMD-on_x86_64

3.Intel 64 Base Architecture notes

This article is about the Intel 64 and IA-32 Architectures Software Developer's Manual, volume 1: Basic Architecture. We skip the IA-32-only material and focus on Intel 64, although we point out the difference wherever 32-bit and 64-bit mode need to be distinguished. We are also only concerned with integer operations; floating-point operations are not covered.

chapter 3,Basic Execution Environment

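address space: in 64-bit mode the linear address space ranges from 0 to 2^64-1, and the physical address space from 0 to 2^46-1 (the width actually implemented is processor specific and is reported by CPUID leaf 80000008H). Note that since the P6 family, IA-32 processors can address a physical space of 0 to 2^36-1 through PAE paging.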

basic execution registers: there are 16 general-purpose registers: RAX, RBX, RCX, RDX, RDI, RSI, RBP, RSP and R8-R15 (remember there are only 8 registers in 32-bit mode), plus the RIP instruction pointer register and the RFLAGS control and status register (the 64-bit extension of EFLAGS). A REX prefix is used to select a 64-bit operand size or to access the R8-R15 registers. Common prefixes in 64-bit mode include the REX prefix, the operand-size prefix 66h and the address-size prefix 67h.

segment registers: segmentation is mostly disabled in 64-bit mode, which creates a flat address space. The bases of CS, DS, ES and SS are treated as zero; only FS and GS can still carry a non-zero base.

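SIMD registers: 8 MM registers in MMX; 8 XMM registers introduced in SSE (16 in 64-bit mode); and 16 YMM registers in 64-bit mode, introduced in AVX. Note that AVX-512 widens the vector registers to 512-bit ZMM registers and doubles their number to 32 in 64-bit mode; the MM registers are unchanged.

other registers are omitted here.

I/O address space: it is separate from the memory address space and consists of 64K 8-bit ports, so a port number is always 16 bits wide, even in 64-bit mode.

Next, how is an address offset calculated? offset = base + index*scale + displacement. In AT&T assembly syntax we write this as disp(base, index, scale), where the displacement is an 8-bit, 16-bit or 32-bit value, base and index are 64-bit general-purpose registers, and the scale factor is one of 1, 2, 4 or 8. The exception is that RSP cannot be used as the index register.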
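
The offset formula can be modeled in a few lines of portable C. This is only a sketch of the arithmetic: the helper name `effective_offset` is hypothetical, and it assumes the displacement is sign-extended to 64 bits and that the sum wraps modulo 2^64.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: models how an offset is formed from
 * disp(base, index, scale). The sum wraps modulo 2^64. */
uint64_t effective_offset(uint64_t base, uint64_t index,
                          unsigned scale, int32_t disp)
{
    /* Only these scale factors can be encoded in an instruction. */
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    /* The displacement is sign-extended to 64 bits before the addition. */
    return base + index * scale + (uint64_t)(int64_t)disp;
}
```

For example, `16(%rdi,%rcx,8)` with RDI = 0x1000 and RCX = 3 corresponds to `effective_offset(0x1000, 3, 8, 16)`, which is 0x1000 + 3*8 + 16 = 0x1028.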

chapter 4,Data Types

The fundamental data types are byte (8-bit), word (16-bit), doubleword (32-bit), quadword (64-bit) and double quadword (128-bit, introduced with the SSE extensions). As for numeric data types, there are signed and unsigned integers, and single-precision (32-bit) and double-precision (64-bit) floating-point values.

The packed SIMD data types are packed bytes, packed words, packed doublewords and packed quadwords.

we skip chapter 5 because it is an instruction set summary, and chapter 6 because it covers the procedure-call, interrupt and exception mechanisms.

chapter 7,Programming with General purpose instruction.

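In 64-bit mode, instructions with a REX prefix can access the 64-bit registers. Note that the flags register is extended to the 64-bit RFLAGS, but its upper 32 bits are reserved, so the flags themselves are the same as in EFLAGS. We will introduce these instructions step by step.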

data transfer instructions.

  • general data movement instruction

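    mov (and the conditional cmovcc). There is one limitation: mov cannot move data directly between two memory operands.

  • exchange instruction

    xchg swaps two registers, or a register and a memory location, without the help of a third register. When one operand is in memory, the processor applies its locking protocol automatically, with or without a LOCK prefix, so the exchange is atomic.

    cmpxchg is the classic compare-and-exchange instruction: it compares the accumulator (AL/AX/EAX/RAX) with the destination; if they are equal it stores the source into the destination, otherwise it loads the destination into the accumulator. With a LOCK prefix it is the usual building block for lock-free code.

  • stack manipulation instruction

    push/pop, and pusha/popa (not available in 64-bit mode).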
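
To make the compare-and-exchange pattern concrete, here is a minimal sketch using the C11 `<stdatomic.h>` interface rather than assembly. The helper name `cas_fetch_max` is hypothetical; the only library calls are `atomic_load` and `atomic_compare_exchange_weak`.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: atomically replace *p with max(*p, v) using a
 * compare-and-swap loop. Returns the value observed before the update. */
uint64_t cas_fetch_max(_Atomic uint64_t *p, uint64_t v)
{
    uint64_t seen = atomic_load(p);
    /* On failure, atomic_compare_exchange_weak refreshes `seen` with the
     * current value, much as cmpxchg loads the destination into the
     * accumulator when the comparison fails. */
    while (seen < v && !atomic_compare_exchange_weak(p, &seen, v)) {
        /* retry with the refreshed value of `seen` */
    }
    return seen;
}
```

On x86-64, compilers typically lower the compare-exchange call to a `lock cmpxchg` instruction, although that is a property of the compiler and not something the C standard promises.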

binary arithmetic instruction.

add/sub and inc/dec, and cmp, which subtracts one integer operand from the other but does not store the result into the destination operand; it only updates the status flags, so we can follow it with jcc or setcc according to the flags in EFLAGS.
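
The way cmp feeds jcc and setcc can be sketched as a small C model of the flags. The struct and function names below are hypothetical; the model covers 64-bit operands and only the four flags used by the common signed and unsigned conditions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the status flags produced by comparing a with b,
 * that is, by computing a - b and throwing the difference away. */
struct cmp_flags { bool zf, cf, sf, of; };

struct cmp_flags model_cmp64(uint64_t a, uint64_t b)
{
    uint64_t diff = a - b;               /* computed, then discarded by cmp */
    struct cmp_flags f;
    f.zf = (diff == 0);                  /* operands are equal */
    f.cf = (a < b);                      /* unsigned borrow */
    f.sf = (diff >> 63) != 0;            /* sign bit of the difference */
    /* Signed overflow: a and b differ in sign, and the sign of the
     * difference differs from the sign of a. */
    f.of = (((a ^ b) & (a ^ diff)) >> 63) != 0;
    return f;
}

/* setl / jl test SF != OF: "a < b" for signed operands. */
bool model_setl(struct cmp_flags f) { return f.sf != f.of; }

/* setb / jb test CF: "a < b" for unsigned operands. */
bool model_setb(struct cmp_flags f) { return f.cf; }
```

The same bit pattern can compare differently depending on the condition chosen: 0xFFFFFFFFFFFFFFFF is below 1 as a signed value (-1) but above it as an unsigned value.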

logical instruction.

and/or/xor/not: all of them accept a REX prefix for 64-bit operands.

shift and rotate instruction.

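sal/sar/shl/shr perform logical or arithmetic shifts. sal and shl are the same operation: both shift left and fill the low bits with zeros. shr shifts right and fills the high bits with zeros, while sar shifts right and keeps the most significant (sign) bit. The last bit shifted out is placed in the CF status flag.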
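
The difference between shr and sar is easy to see in a portable model. The function names are hypothetical, and the sketch only handles shift counts from 1 to 63 on 64-bit operands.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical models of 64-bit shr and sar for counts 1..63.
 * *cf receives the last bit shifted out, as the CF flag would. */
uint64_t model_shr64(uint64_t x, unsigned n, bool *cf)
{
    assert(n >= 1 && n <= 63);
    *cf = ((x >> (n - 1)) & 1) != 0;
    return x >> n;                        /* vacated high bits become 0 */
}

uint64_t model_sar64(uint64_t x, unsigned n, bool *cf)
{
    assert(n >= 1 && n <= 63);
    /* n copies of the sign bit, placed in the vacated high positions. */
    uint64_t sign_fill = (x >> 63) ? (~(uint64_t)0 << (64 - n)) : 0;
    *cf = ((x >> (n - 1)) & 1) != 0;
    return (x >> n) | sign_fill;
}
```

Shifting 0x8000000000000001 right by one gives 0x4000000000000000 with shr and 0xC000000000000000 with sar, and both leave CF set because the bit shifted out was 1.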

bit and byte instruction

  • bit test and modify instructions

    bt copies the selected bit into CF; bts/btr/btc do the same and then set/reset/complement that bit.

  • bit scan

    bsf/bsr scan for the least/most significant set bit and return its index; please refer to the demo.

  • byte set

    setcc sets a byte to 0 or 1 according to the condition flags in EFLAGS.

  • test instruction

    test is very similar to and, but it only updates EFLAGS and discards the result.
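
The bit test and bit scan instructions above can be sketched as plain C functions. All names are hypothetical. The model covers 64-bit register operands, where the bit index is taken modulo 64, and it returns -1 from the scans for a zero source, a case in which the real instructions set ZF and leave the destination undefined.

```c
#include <stdbool.h>
#include <stdint.h>

/* Each bit-test helper returns the old value of the selected bit,
 * which is what the instruction places in CF. */
bool model_bt(uint64_t x, unsigned i)
{
    return ((x >> (i & 63)) & 1) != 0;
}

bool model_bts(uint64_t *x, unsigned i)          /* test, then set */
{
    bool cf = model_bt(*x, i);
    *x |= (uint64_t)1 << (i & 63);
    return cf;
}

bool model_btr(uint64_t *x, unsigned i)          /* test, then reset */
{
    bool cf = model_bt(*x, i);
    *x &= ~((uint64_t)1 << (i & 63));
    return cf;
}

bool model_btc(uint64_t *x, unsigned i)          /* test, then complement */
{
    bool cf = model_bt(*x, i);
    *x ^= (uint64_t)1 << (i & 63);
    return cf;
}

int model_bsf(uint64_t x)                        /* lowest set bit */
{
    if (x == 0) return -1;
    int i = 0;
    while (((x >> i) & 1) == 0) i++;
    return i;
}

int model_bsr(uint64_t x)                        /* highest set bit */
{
    if (x == 0) return -1;
    int i = 63;
    while (((x >> i) & 1) == 0) i--;
    return i;
}
```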

byte string operation

movs/cmps/scas/lods/stos rely on the RSI (source) and RDI (destination) registers, and the direction flag DF is the switch that controls whether the addresses grow upward or downward. A common usage is to combine a string operation with the REP prefix, which repeats the instruction and decrements RCX until RCX reaches zero. A concrete demo will be presented in article 2.
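
As a rough illustration of how REP, RCX and DF interact, here is a C model of `rep movsb`. The struct and function names are hypothetical; the fields merely stand in for RSI, RDI, RCX and DF and do not touch the real registers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register file for the string instructions. */
struct string_regs {
    const uint8_t *rsi;   /* source pointer */
    uint8_t *rdi;         /* destination pointer */
    uint64_t rcx;         /* remaining repeat count */
    int df;               /* 0 after cld (ascending), 1 after std (descending) */
};

void model_rep_movsb(struct string_regs *r)
{
    ptrdiff_t step = r->df ? -1 : 1;
    while (r->rcx != 0) {          /* REP: stop once RCX reaches zero */
        *r->rdi = *r->rsi;         /* movsb: copy one byte */
        r->rsi += step;
        r->rdi += step;
        r->rcx--;
    }
}
```

With DF clear, copying n bytes leaves RSI and RDI pointing just past the copied range and RCX at zero.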

IO instruction

in/out and their variants (inb/outb and so on in AT&T syntax) perform input/output operations on a port. Note that even in 64-bit mode the maximum data width is still a doubleword.

EFLAGs manipulation instruction

These instructions set or clear a flag in EFLAGS; for example, cld clears the direction flag and std sets it.

miscellaneous instruction

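lea loads the effective address of a memory operand into a register. cpuid returns processor identification and feature information in EAX, EBX, ECX and EDX. nop does nothing except take up space in the instruction stream. rdrand loads a hardware-generated random number (16, 32 or 64 bits wide) into the destination register and sets CF when a value was available.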

chapter 9,Programming with Intel MMX technology.

MMX, the first generation of SIMD instructions, was introduced into IA-32 in the Intel Pentium II processor family. The MMX extension adds 8 extra 64-bit registers, named MM0 through MM7. Besides the MMX instruction set itself, some SSE/SSE2 integer instructions also operate on these registers. MMX introduces three data types: 64-bit packed byte integers, packed word integers and packed doubleword integers.

MMX data transfer instruction

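movd moves 32 bits between an MMX register and a general-purpose register or memory; movq moves 64 bits between two MMX registers, or between an MMX register and memory. There is no memory-to-memory form.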

MMX arithmetic instruction

paddb/paddw/paddd/psubb/psubw/psubd perform element-wise (vector) addition and subtraction with wraparound. Additional instructions perform multiplication (pmullw/pmulhw/pmaddwd); there is no packed integer divide.
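
What "vector arithmetic" means for paddb can be modeled on a plain 64-bit integer. The function name is hypothetical; the sketch treats the operand as eight independent byte lanes, each wrapping modulo 256.

```c
#include <stdint.h>

/* Hypothetical model of paddb on a 64-bit MMX operand: eight independent
 * byte lanes, with no carry from one lane into its neighbour. */
uint64_t model_paddb(uint64_t a, uint64_t b)
{
    uint64_t out = 0;
    for (int lane = 0; lane < 8; lane++) {
        uint8_t x = (uint8_t)(a >> (8 * lane));
        uint8_t y = (uint8_t)(b >> (8 * lane));
        uint8_t sum = (uint8_t)(x + y);           /* wraparound add */
        out |= (uint64_t)sum << (8 * lane);
    }
    return out;
}
```

Adding 0x01FF and 0x0101 this way yields 0x0200, whereas an ordinary 64-bit add would give 0x0300 because the carry out of the low byte would leak into the next lane.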

MMX comparison instruction

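pcmpeqb/pcmpeqw/pcmpeqd compare the corresponding data elements (bytes/words/doublewords) and generate a mask: an element of the result is all ones where the two elements are equal and all zeros otherwise. This turns out to be very useful when deciding whether a particular element exists.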
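
The mask produced by pcmpeqb can be sketched in portable C as follows. The function name is hypothetical and the model works on a 64-bit MMX-sized operand.

```c
#include <stdint.h>

/* Hypothetical model of pcmpeqb: each result byte is 0xFF where the
 * corresponding bytes of a and b are equal, and 0x00 otherwise. */
uint64_t model_pcmpeqb(uint64_t a, uint64_t b)
{
    uint64_t mask = 0;
    for (int lane = 0; lane < 8; lane++) {
        uint8_t x = (uint8_t)(a >> (8 * lane));
        uint8_t y = (uint8_t)(b >> (8 * lane));
        if (x == y)
            mask |= (uint64_t)0xFF << (8 * lane);  /* all ones on a match */
    }
    return mask;
}
```

Comparing 0x1122334455667788 with 0x1100330055007700 gives 0xFF00FF00FF00FF00: lanes 1, 3, 5 and 7 match.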

MMX logical instruction

pand/pandn/por/pxor operate bitwise on the whole 64-bit operand, so they have no element-size suffix.

MMX shift instruction

psll/psrl (w/d/q) perform a logical shift on each data element of the given size. Arithmetic shifts (psraw/psrad) are not covered here.

chapter 10,Programming with Intel SSE technology.

SSE was introduced into IA-32 in the Pentium III processor family. The breakthrough is that it brings 128-bit wide data into the SIMD framework. There are 8 XMM registers, XMM0 through XMM7, and 16 in 64-bit mode, XMM0 through XMM15; MMX technology itself is unchanged. Note that XMM registers can only be used to perform calculations on data. They cannot be used to address memory; addressing is done with the general-purpose registers. SSE also adds the MXCSR control and status register for floating-point calculation.

Only one data type is introduced along with SSE: the packed single-precision floating-point type. I skip everything about floating point, so next we go straight to the integer instructions.

SSE/SSE2 add integer instructions that complement MMX technology. Let's look at them: pavgb/pavgw (there are no other element sizes) compute the rounded average of the corresponding unsigned bytes/words.
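
The rounding behaviour of pavgb is (a + b + 1) >> 1 on unsigned bytes, computed without losing the carry. A one-lane sketch, with a hypothetical function name:

```c
#include <stdint.h>

/* Hypothetical model of one byte lane of pavgb. */
uint8_t model_pavgb_lane(uint8_t a, uint8_t b)
{
    /* Widen first so the 9-bit intermediate sum is not truncated. */
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}
```

The average of 1 and 2 is therefore 2 (rounded up), and the average of 255 and 255 stays 255 instead of overflowing.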

pextrw extracts a word from an MMX register and stores it into a general-purpose register. This is useful when we want to extract and inspect a single word.

pinsrw inserts a word into an MMX register at a given position.

pmaxub/pmaxsw (and pminub/pminsw) compute the element-wise maximum (minimum) of unsigned bytes or signed words.

pmovmskb collects the most significant bit of each byte and packs these bits into the low bits of a general-purpose register (8 bits from an MMX register). This turns out to be of great use when we decide something according to the final bitmask.
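
A typical use of the bitmask is to find the position of a byte inside an 8-byte chunk. The sketch below models pmovmskb in portable C and combines it with a byte-equality compare and a lowest-set-bit scan; both function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical model of pmovmskb on a 64-bit MMX operand: bit i of the
 * result is the most significant bit of byte lane i. */
unsigned model_pmovmskb(uint64_t v)
{
    unsigned mask = 0;
    for (int lane = 0; lane < 8; lane++)
        mask |= (unsigned)((v >> (8 * lane + 7)) & 1) << lane;
    return mask;
}

/* Index of the first lane of `chunk` equal to `needle`, or -1 if none.
 * Mirrors the compare -> pmovmskb -> bit scan idiom. */
int first_matching_lane(uint64_t chunk, uint8_t needle)
{
    uint64_t eq = 0;                               /* byte-equality mask */
    for (int lane = 0; lane < 8; lane++)
        if ((uint8_t)(chunk >> (8 * lane)) == needle)
            eq |= (uint64_t)0xFF << (8 * lane);

    unsigned mask = model_pmovmskb(eq);
    if (mask == 0)
        return -1;
    int i = 0;
    while (((mask >> i) & 1) == 0)                 /* lowest set bit */
        i++;
    return i;
}
```

With the bytes of "hello" packed into the low lanes (0x0000006F6C6C6568), searching for 'l' returns lane 2 and searching for the zero byte returns lane 5.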

We skip the multiply and sum-of-absolute-differences instructions. Next is the shuffle instruction pshufw, which rearranges the four words of an MMX register according to an 8-bit immediate mask.
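
How the immediate drives pshufw can be shown with a small model: each pair of bits in the immediate selects which source word lands in the corresponding destination word. The function name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical model of pshufw: destination word i is the source word
 * selected by bits [2i+1 : 2i] of imm8. */
uint64_t model_pshufw(uint64_t src, uint8_t imm8)
{
    uint64_t out = 0;
    for (int i = 0; i < 4; i++) {
        unsigned sel = (imm8 >> (2 * i)) & 3;      /* which source word */
        uint64_t word = (src >> (16 * sel)) & 0xFFFF;
        out |= word << (16 * i);
    }
    return out;
}
```

An immediate of 0xE4 (binary 11 10 01 00) leaves the words in place, 0x1B (binary 00 01 10 11) reverses them, and 0x00 broadcasts word 0 to all four positions.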

Next are the cacheability control, prefetch and memory ordering instructions.

The instructions new in SSE offer non-temporal stores, which means that data moved from registers to memory does not pollute the cache. In SSE, the instructions with a non-temporal hint include movntq, which stores a quadword from an MMX register into memory; movntps, which stores packed single-precision floating-point data from an XMM register into memory; and maskmovq, which stores selected bytes of an MMX register. A non-temporal store is issued with write-combining semantics: if the destination is already in the cache hierarchy, the cached line is evicted, and the stores are grouped and written to memory at some later time. So with non-temporal stores the ordering of subsequent writes is weak, and keeping the data coherent is our own job: we should use the sfence instruction, which ensures that all earlier stores are globally visible to every processor before any later store.

The last group is the prefetch instructions: prefetcht0/prefetcht1/prefetcht2/prefetchnta fetch a cache line ahead of use, with a temporal or non-temporal hint that selects the level of the cache hierarchy.


Last updated 4 years ago
