4.Intel 64 Base Architecture:2

Chapter 11: Programming with Intel SSE2 Technology

The SSE2 instruction set was introduced with the Pentium 4 and Intel Xeon processors. SSE2 adds no new registers; it operates on the existing XMM and MMX registers.

Compared with SSE, SSE2 introduces the packed double-precision floating-point data type and the 128-bit packed integer types, from packed byte integers through packed quadword integers. 128-bit packed integer data is supported only from SSE2 onward.

Here we skip all floating-point operations in SSE2, as we did before.

SSE2 64-bit and 128-bit integer instructions

movdqa/movdqu move 16 bytes between memory and an XMM register, with an aligned/unaligned address respectively. paddq/psubq perform quadword addition/subtraction. pshufd shuffles packed doubleword integers. pslldq/psrldq shift a whole double quadword logically left/right by a number of bytes. movq2dq/movdq2q move a quadword between an MMX register and an XMM register.
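To make the instructions above concrete, here is a minimal sketch using the SSE2 compiler intrinsics from `<emmintrin.h>`. The function names (`add_u64x2`, `reverse_u32x4`, `shift_left_4_bytes`) are made up for illustration, and the mapping from intrinsic to instruction in the comments is what compilers typically emit, not a guarantee.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Adds two pairs of 64-bit integers lane by lane (paddq),
 * using unaligned loads and an unaligned store (movdqu). */
void add_u64x2(const uint64_t a[2], const uint64_t b[2], uint64_t out[2])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi64(va, vb));
}

/* Reverses the order of four doublewords (pshufd, immediate 0x1B). */
void reverse_u32x4(const uint32_t in[4], uint32_t out[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    v = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
    _mm_storeu_si128((__m128i *)out, v);
}

/* Shifts the whole 128-bit value left by 4 bytes (pslldq):
 * lane 0 becomes zero, every other lane takes its lower neighbour. */
void shift_left_4_bytes(const uint32_t in[4], uint32_t out[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, _mm_slli_si128(v, 4));
}
```

Note that pslldq/psrldq count in bytes, not bits, which is why a shift of 4 moves every doubleword by exactly one lane.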

Cacheability control and memory ordering instructions

clflush writes back (if modified) and invalidates the cache line that contains the specified address, at every level of the cache hierarchy.

movntdq/movntpd store an XMM register to memory with a non-temporal hint.

Additionally, movnti stores a doubleword or quadword from a general-purpose register to memory with a non-temporal hint.

As with maskmovq in SSE, maskmovdqu stores selected bytes of a double quadword to memory under a byte mask, with a non-temporal hint.
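The non-temporal stores can be sketched with intrinsics as follows. `fill_stream` is a hypothetical helper, and it assumes the destination is 16-byte aligned (movntdq requires that). The trailing `_mm_sfence()` is there because non-temporal stores are weakly ordered.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Fills n 32-bit elements of dst with value, bypassing the cache.
 * Assumption: dst is 16-byte aligned. */
void fill_stream(uint32_t *dst, size_t n, uint32_t value)
{
    __m128i v = _mm_set1_epi32((int)value);
    size_t i = 0;

    /* Main loop: one 16-byte non-temporal store per iteration (movntdq). */
    for (; i + 4 <= n; i += 4)
        _mm_stream_si128((__m128i *)(dst + i), v);

    /* Tail: 4-byte non-temporal stores from a general-purpose register (movnti). */
    for (; i < n; i++)
        _mm_stream_si32((int *)(dst + i), (int)value);

    /* Order the weakly ordered stores before any later stores. */
    _mm_sfence();
}
```

This pattern is usually only worth it for buffers that will not be read again soon; for data that is reused right away, ordinary stores tend to be the better choice.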

For memory ordering, SSE2 adds lfence, a load fence, and mfence, a fence that orders both loads and stores; they complement sfence from SSE.
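A small sketch of clflush combined with mfence, using the `_mm_clflush` and `_mm_mfence` intrinsics. `flush_range` is a hypothetical helper, and the 64-byte cache line size is an assumption (robust code would query it through CPUID).

```c
#include <emmintrin.h>  /* _mm_clflush, _mm_mfence */
#include <stddef.h>
#include <stdint.h>

/* Writes back and invalidates every cache line overlapping [p, p + len).
 * Assumption: cache lines are 64 bytes. */
void flush_range(const void *p, size_t len)
{
    /* Round the start address down to a cache line boundary. */
    const char *line = (const char *)((uintptr_t)p & ~(uintptr_t)63);
    const char *end = (const char *)p + len;

    for (; line < end; line += 64)
        _mm_clflush(line);

    /* Fence so that the flushes are ordered before later loads and stores. */
    _mm_mfence();
}
```

Flushing does not change the data; a later read simply misses the cache and fetches the line from memory again.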

SSE2 also introduces branch hint prefixes for the conditional jump instructions (Jcc): 2EH hints that the branch is not taken, 3EH that it is taken.

Chapter 12: Programming with Intel SSE3/SSSE3/SSE4 Technology

SSE3 introduces a special load instruction, lddqu, which loads an unaligned 128-bit double quadword into an XMM register. The other SSE3 instructions are almost all about floating-point data, so I will not cover them here.

SSSE3 complements the SSE/SSE2 instruction sets and still operates on both MMX and XMM registers. It offers horizontal addition/subtraction instructions: phadd/phsub plus a size suffix (phaddw, phaddd, phsubw, phsubd, and the saturating phaddsw/phsubsw).
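As an illustration of horizontal addition, the sketch below sums four doublewords with two phaddd steps. `hsum_i32x4` is a made-up name. The `__attribute__((target("ssse3")))` annotation is a GCC/Clang-specific assumption that lets the function build without `-mssse3`; with other compilers, enable SSSE3 through the usual flags instead.

```c
#include <tmmintrin.h>  /* SSSE3 intrinsics */
#include <stdint.h>

/* Returns v[0] + v[1] + v[2] + v[3] using two horizontal adds (phaddd). */
__attribute__((target("ssse3")))
int32_t hsum_i32x4(const int32_t v[4])
{
    __m128i x = _mm_loadu_si128((const __m128i *)v);
    x = _mm_hadd_epi32(x, x);  /* {v0+v1, v2+v3, v0+v1, v2+v3} */
    x = _mm_hadd_epi32(x, x);  /* every lane now holds the full sum */
    return _mm_cvtsi128_si32(x);
}
```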

The next instruction class computes packed absolute values: pabsb/pabsw/pabsd.
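A minimal sketch of pabsd through the `_mm_abs_epi32` intrinsic. `abs_i32x4` is a hypothetical helper, and the `target("ssse3")` attribute is a GCC/Clang assumption used to avoid extra compiler flags.

```c
#include <tmmintrin.h>  /* SSSE3 intrinsics */
#include <stdint.h>

/* Computes the absolute value of four signed doublewords (pabsd). */
__attribute__((target("ssse3")))
void abs_i32x4(const int32_t in[4], int32_t out[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, _mm_abs_epi32(v));
}
```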

pshufb performs a byte-granular permutation, controlled by a second operand that holds one index per destination byte.
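One common use of pshufb is reversing the bytes of a register. The sketch below does that with `_mm_shuffle_epi8`; `reverse_bytes16` is a made-up name, and the `target("ssse3")` attribute is a GCC/Clang assumption.

```c
#include <tmmintrin.h>  /* SSSE3 intrinsics */
#include <stdint.h>

/* Reverses 16 bytes with a single pshufb. */
__attribute__((target("ssse3")))
void reverse_bytes16(const uint8_t in[16], uint8_t out[16])
{
    /* Control vector: destination byte i takes source byte idx[i]. */
    const __m128i idx = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                      7, 6, 5, 4, 3, 2, 1, 0);
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, _mm_shuffle_epi8(v, idx));
}
```

If the high bit of a control byte is set, pshufb writes zero to that destination byte instead of copying, which makes it handy for masked gathers within a register.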

SSE4 comprises two sets of extensions, SSE4.1 and SSE4.2. From now on, MMX registers are no longer our concern.

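pmuldq multiplies the signed doublewords in lanes 0 and 2 of its operands and produces two 64-bit (quadword) products; it is the signed counterpart of pmuludq, which has existed since SSE2. SSE4.1 also adds pmulld, which keeps the low 32 bits of each of the four doubleword products.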
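The lane behaviour of the signed doubleword multiply (pmuldq) is easy to get wrong, so here is a minimal sketch with `_mm_mul_epi32`. `mul_even_i32` is a hypothetical name, and the `target("sse4.1")` attribute is a GCC/Clang assumption.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics */
#include <stdint.h>

/* Signed 32x32 -> 64-bit multiply of the even lanes (pmuldq):
 * out[0] = a[0] * b[0], out[1] = a[2] * b[2]. Lanes 1 and 3 are ignored. */
__attribute__((target("sse4.1")))
void mul_even_i32(const int32_t a[4], const int32_t b[4], int64_t out[2])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_mul_epi32(va, vb));
}
```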

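movntdqa, the streaming load instruction, loads an aligned 16-byte double quadword with a non-temporal hint. On write-combining (WC) memory the processor may pull the whole aligned 64-byte cache line into a streaming load buffer, so that subsequent movntdqa loads within the same line are served directly from the buffer. The details are in section 12.10.3, Streaming Load Hint Instruction.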

SSE4.1 also provides blend instructions for single-precision, double-precision, and packed integer data. pblendw blends packed words from two sources under an 8-bit immediate mask.
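A short sketch of the word blend (pblendw) using `_mm_blend_epi16`: bit i of the immediate selects word i from the second source when set, and from the first source otherwise. `blend_odd_words` is a made-up name, and the `target("sse4.1")` attribute is a GCC/Clang assumption.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics */
#include <stdint.h>

/* Takes words 1, 3, 5, 7 from b and words 0, 2, 4, 6 from a (pblendw). */
__attribute__((target("sse4.1")))
void blend_odd_words(const uint16_t a[8], const uint16_t b[8], uint16_t out[8])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* 0xAA = 10101010b: set bits pick the word from vb. */
    _mm_storeu_si128((__m128i *)out, _mm_blend_epi16(va, vb, 0xAA));
}
```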

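SSE4.1 fills in the packed min/max instructions that the earlier extensions lacked: pminsb/pmaxsb (signed byte), pminuw/pmaxuw (unsigned word), pminud/pmaxud (unsigned doubleword), and pminsd/pmaxsd (signed doubleword).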
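A typical use of the SSE4.1 packed min/max instructions is clamping. The sketch below clamps four signed doublewords with pmaxsd and pminsd through `_mm_max_epi32` and `_mm_min_epi32`. `clamp_i32x4` is a hypothetical helper, and the `target("sse4.1")` attribute is a GCC/Clang assumption.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics */
#include <stdint.h>

/* Clamps each of four signed doublewords into [lo, hi]. */
__attribute__((target("sse4.1")))
void clamp_i32x4(const int32_t in[4], int32_t lo, int32_t hi, int32_t out[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    v = _mm_max_epi32(v, _mm_set1_epi32(lo));  /* pmaxsd: raise to lo */
    v = _mm_min_epi32(v, _mm_set1_epi32(hi));  /* pminsd: cap at hi */
    _mm_storeu_si128((__m128i *)out, v);
}
```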

pinsrb/d/q insert a byte/doubleword/quadword into an XMM register at a given position, and pextrb/w/d/q extract a byte/word/doubleword/quadword from a given position.
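Insert and extract can be sketched together: the hypothetical helper below reads lane 2 with pextrd and then overwrites it with pinsrd. The lane index must be a compile-time constant, since it is encoded as an immediate. The `target("sse4.1")` attribute is a GCC/Clang assumption.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics */
#include <stdint.h>

/* Replaces lane 2 of a four-doubleword vector and returns the old value. */
__attribute__((target("sse4.1")))
int32_t replace_lane2(const int32_t in[4], int32_t value, int32_t out[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    int32_t old = _mm_extract_epi32(v, 2);   /* pextrd */
    v = _mm_insert_epi32(v, value, 2);       /* pinsrd */
    _mm_storeu_si128((__m128i *)out, v);
    return old;
}
```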

phminposuw searches eight packed unsigned words for the minimum and stores both the minimum value and its index in the destination.
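The result layout of phminposuw is compact: the minimum lands in bits 15:0 of the destination and its index in bits 18:16. A minimal sketch with `_mm_minpos_epu16` follows; `min_u16x8` is a made-up name, and the `target("sse4.1")` attribute is a GCC/Clang assumption.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics */
#include <stdint.h>

/* Returns the smallest of eight unsigned words and writes its position
 * to *index. On ties the lowest index wins. */
__attribute__((target("sse4.1")))
unsigned min_u16x8(const uint16_t v[8], unsigned *index)
{
    __m128i r = _mm_minpos_epu16(_mm_loadu_si128((const __m128i *)v));
    unsigned packed = (unsigned)_mm_cvtsi128_si32(r);
    *index = packed >> 16;      /* bits 18:16, the bits above are zero */
    return packed & 0xFFFFu;    /* bits 15:0 */
}
```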

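ptest performs a logical AND of two XMM registers and sets ZF (and CF, from an AND-NOT of the same operands) according to the result, without writing to either register. I prefer pcmpeqb/w/d, though.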
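One place where ptest is convenient is an all-zero check, because it sets the flags directly and needs no separate mask extraction. A small sketch with `_mm_testz_si128`; `is_all_zero16` is a hypothetical helper, and the `target("sse4.1")` attribute is a GCC/Clang assumption.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics */

/* Returns 1 if the 16 bytes at p are all zero, 0 otherwise. */
__attribute__((target("sse4.1")))
int is_all_zero16(const void *p)
{
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    /* ptest sets ZF when (v AND v) == 0; the intrinsic returns ZF. */
    return _mm_testz_si128(v, v);
}
```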

pcmpeqq is similar to pcmpeqb, but with quadword granularity.

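SSE4.2 introduces text processing instructions: pcmpestri/pcmpestrm/pcmpistri/pcmpistrm perform explicit-length (pcmpestr*) or implicit-length (pcmpistr*) string comparison and return the result as an index in ECX (the *i forms) or as a mask in XMM0 (the *m forms). Their memory operands have no alignment requirement.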
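As a rough illustration of the implicit-length form, the sketch below uses `_mm_cmpistri` in "equal any" mode to find the first character of a string that belongs to a set, similar in spirit to `strpbrk`. `find_first_of16` is a made-up helper that only looks at the first 15 characters of each argument; it copies both into zero-padded 16-byte buffers so the 16-byte loads never read past the end of a string. The `target("sse4.2")` attribute is a GCC/Clang assumption.

```c
#include <nmmintrin.h>  /* SSE4.2 intrinsics */
#include <string.h>

/* Returns the index of the first character in text that occurs in set,
 * or -1 if there is none. Only the first 15 characters of each argument
 * are considered (a limitation of this sketch). */
__attribute__((target("sse4.2")))
int find_first_of16(const char *text, const char *set)
{
    char tbuf[16] = {0};
    char sbuf[16] = {0};
    strncpy(tbuf, text, 15);  /* zero padding terminates the string */
    strncpy(sbuf, set, 15);

    __m128i t = _mm_loadu_si128((const __m128i *)tbuf);
    __m128i s = _mm_loadu_si128((const __m128i *)sbuf);

    /* pcmpistri: the index of the first match is returned (in ECX at the
     * instruction level); 16 means no match. */
    int idx = _mm_cmpistri(s, t,
                           _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
                           _SIDD_LEAST_SIGNIFICANT);
    return idx == 16 ? -1 : idx;
}
```

A real implementation would loop over the input in 16-byte chunks and take care not to cross into an unmapped page; that is left out here to keep the example short.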

Another SSE4.2 instruction is pcmpgtq, which compares packed signed quadwords for greater-than.
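Both quadword comparisons, pcmpeqq (SSE4.1) and pcmpgtq (SSE4.2), produce an all-ones lane where the condition holds and an all-zeros lane otherwise. The sketch below turns those lanes into a 2-bit mask. `eq_mask_i64x2` and `gt_mask_i64x2` are hypothetical names, and the `target("sse4.2")` attribute is a GCC/Clang assumption.

```c
#include <nmmintrin.h>  /* SSE4.2 intrinsics (also pulls in SSE4.1 and SSE2) */
#include <stdint.h>

/* Bit i of the result is set when a[i] == b[i] (pcmpeqq). */
__attribute__((target("sse4.2")))
int eq_mask_i64x2(const int64_t a[2], const int64_t b[2])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i eq = _mm_cmpeq_epi64(va, vb);
    /* movmskpd collects the top bit of each 64-bit lane. */
    return _mm_movemask_pd(_mm_castsi128_pd(eq));
}

/* Bit i of the result is set when a[i] > b[i], signed (pcmpgtq). */
__attribute__((target("sse4.2")))
int gt_mask_i64x2(const int64_t a[2], const int64_t b[2])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i gt = _mm_cmpgt_epi64(va, vb);
    return _mm_movemask_pd(_mm_castsi128_pd(gt));
}
```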


Last updated 4 years ago
