4.Intel 64 Base Architecture:2
chapter 11,Programming with Intel SSE2 technology.
The SSE2 instruction set was introduced with the Pentium 4 and Intel Xeon processors. SSE2 adds no new registers.
Compared with SSE, SSE2 introduces the packed double-precision floating-point data type and the 128-bit packed integer types, from packed byte integers through packed quadword integers. 128-bit packed integer data is supported from SSE2 onward.
Here we skip all floating-point related operations in SSE2, as we did before.
SSE2 64-bit and 128-bit integer instructions
movdqa/movdqu
move 16 bytes between memory and an XMM register (or between two XMM registers). movdqa requires a 16-byte-aligned memory address, movdqu accepts any address.
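A minimal sketch of the aligned/unaligned distinction, using the SSE2 intrinsics from `<emmintrin.h>`. `copy16` is a hypothetical helper name; the compiler is free to choose the exact instruction, but `_mm_load_si128`/`_mm_store_si128` correspond to movdqa and `_mm_loadu_si128`/`_mm_storeu_si128` to movdqu.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy 16 bytes through an XMM register. */
static void copy16(void *dst, const void *src)
{
    assert(dst != NULL && src != NULL);
    if ((((uintptr_t)dst | (uintptr_t)src) & 15u) == 0) {
        /* Both addresses are 16-byte aligned: the movdqa forms are legal. */
        __m128i v = _mm_load_si128((const __m128i *)src);
        _mm_store_si128((__m128i *)dst, v);
    } else {
        /* Otherwise use the movdqu forms, which accept any address. */
        __m128i v = _mm_loadu_si128((const __m128i *)src);
        _mm_storeu_si128((__m128i *)dst, v);
    }
}
```

Calling `copy16(dst + 1, src + 3)` on byte buffers takes the unaligned path; using the aligned form on such addresses would fault.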
paddq/psubq
perform packed quadword addition/subtraction.
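As a rough illustration, the two quadword lanes can be added or subtracted in one step with `_mm_add_epi64` and `_mm_sub_epi64`. The helper names `add_q` and `sub_q` are made up for this sketch; the arithmetic wraps around on overflow.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: out[i] = a[i] + b[i] for both 64-bit lanes (paddq). */
static void add_q(const uint64_t a[2], const uint64_t b[2], uint64_t out[2])
{
    assert(a && b && out);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi64(va, vb));
}

/* Hypothetical helper: out[i] = a[i] - b[i] for both 64-bit lanes (psubq). */
static void sub_q(const uint64_t a[2], const uint64_t b[2], uint64_t out[2])
{
    assert(a && b && out);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_sub_epi64(va, vb));
}
```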
pshufd
shuffles packed doubleword integers according to an 8-bit immediate (two selector bits per destination doubleword).
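A small sketch of the doubleword shuffle through `_mm_shuffle_epi32`. The helper names are hypothetical; `_MM_SHUFFLE` is the standard macro that packs the four 2-bit selectors into the immediate, highest destination lane first.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics, also provides _MM_SHUFFLE */
#include <stdint.h>

/* Hypothetical helper: reverse the order of the four doublewords. */
static void reverse_dwords(const int32_t in[4], int32_t out[4])
{
    assert(in && out);
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    /* destination lanes 3,2,1,0 take source lanes 0,1,2,3 */
    v = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
    _mm_storeu_si128((__m128i *)out, v);
}

/* Hypothetical helper: copy doubleword 0 into all four lanes. */
static void broadcast_dword0(const int32_t in[4], int32_t out[4])
{
    assert(in && out);
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    v = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 0, 0, 0));
    _mm_storeu_si128((__m128i *)out, v);
}
```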
pslldq/psrldq
shift the whole 128-bit double quadword left/right by an immediate number of bytes, filling the vacated bytes with zeros.
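To make the byte granularity concrete, here is a sketch with `_mm_slli_si128` and `_mm_srli_si128`, where the shift count is in bytes, not bits. `shift_bytes` is a hypothetical helper. On little-endian x86, a left shift moves bytes toward higher memory addresses.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: shift a 16-byte value by 4 bytes in each direction. */
static void shift_bytes(const uint8_t in[16], uint8_t left4[16], uint8_t right4[16])
{
    assert(in && left4 && right4);
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)left4,  _mm_slli_si128(v, 4));  /* pslldq */
    _mm_storeu_si128((__m128i *)right4, _mm_srli_si128(v, 4));  /* psrldq */
}
```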
movq2dq/movdq2q
move a quadword between an MMX register and the low quadword of an XMM register.
Cacheability control and memory ordering instructions
clflush
writes back the cache line containing the specified address if it has been modified, and invalidates that line at every level of the cache hierarchy.
movntdq/movntpd
store an XMM register into memory with a non-temporal hint.
additionally, movnti
stores a doubleword (or a quadword in 64-bit mode) from a general-purpose register into memory with a non-temporal hint.
as before, maskmovdqu
selectively stores the bytes of a double quadword to memory under a byte mask, with a non-temporal hint.
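A sketch of how the two non-temporal stores might be combined to fill a buffer: movnti (`_mm_stream_si32`) for the unaligned head and tail, movntdq (`_mm_stream_si128`) for the 16-byte-aligned middle. `stream_fill` is a hypothetical helper, and the trailing `_mm_sfence` is the usual way to order the weakly ordered stores before later stores.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics, pulls in _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill n ints with value using non-temporal stores. */
static void stream_fill(int *dst, size_t n, int value)
{
    size_t i = 0;
    __m128i v = _mm_set1_epi32(value);

    assert(dst != NULL && ((uintptr_t)dst % sizeof(int)) == 0);

    /* movnti until the destination reaches a 16-byte boundary */
    while (i < n && ((uintptr_t)(dst + i) & 15u) != 0) {
        _mm_stream_si32(dst + i, value);
        i++;
    }
    /* movntdq needs a 16-byte-aligned destination */
    for (; i + 4 <= n; i += 4)
        _mm_stream_si128((__m128i *)(dst + i), v);
    /* movnti for the remaining tail */
    for (; i < n; i++)
        _mm_stream_si32(dst + i, value);

    _mm_sfence();   /* make the streamed data globally visible in order */
}
```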
as memory ordering additions, SSE2 introduces lfence/mfence
to issue a load fence and a full (load plus store) memory fence, respectively.
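A sketch that ties clflush and mfence together: flush every cache line covering a buffer, then fence so the flushes are ordered with respect to later memory accesses. `flush_range` is a hypothetical helper, and the 64-byte line size is an assumption; the real value is reported by CPUID.

```c
#include <assert.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: flush the cache lines covering [p, p + len).
   Returns the number of lines flushed. */
static size_t flush_range(const void *p, size_t len)
{
    const uintptr_t line = 64;   /* assumed cache line size */
    size_t flushed = 0;

    assert(p != NULL);
    if (len == 0)
        return 0;

    uintptr_t addr = (uintptr_t)p & ~(line - 1);   /* round down to line start */
    uintptr_t end  = (uintptr_t)p + len;
    for (; addr < end; addr += line) {
        _mm_clflush((const void *)addr);           /* clflush */
        flushed++;
    }
    _mm_mfence();                                   /* order the flushes */
    return flushed;
}
```

Flushing does not change the data, it only evicts the lines, so the buffer reads back unchanged afterwards.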
SSE2 also introduces branch hint
prefixes for the conditional jump instructions (Jcc): 2EH hints that the branch is not taken, 3EH that it is taken.
chapter 12,Programming with Intel SSE3/SSSE3/SSE4 technology.
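SSE3 introduced a special load instruction: lddqu
which loads an unaligned 128-bit double quadword into an XMM register and may be faster than movdqu when the data crosses a cache-line boundary. The other SSE3 instructions are almost all about floating-point data, so I will not cover them here.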
The SSSE3 instruction set complements the SSE/SSE2 instruction sets and still operates on both MMX and XMM registers. SSSE3 offers horizontal addition/subtraction instructions: phadd/phsub
with a size suffix (phaddw/phaddd, phsubw/phsubd), plus the saturating word forms phaddsw/phsubsw.
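A sketch of a horizontal sum built from two phaddd steps via `_mm_hadd_epi32`. `hsum4` is a hypothetical helper, and the `target("ssse3")` attribute is a GCC/Clang extension used here in place of the `-mssse3` compiler flag.

```c
#include <assert.h>
#include <tmmintrin.h>   /* SSSE3 intrinsics */

/* Hypothetical helper: sum of four 32-bit integers using phaddd. */
__attribute__((target("ssse3")))
static int hsum4(const int v[4])
{
    assert(v != NULL);
    __m128i x = _mm_loadu_si128((const __m128i *)v);
    x = _mm_hadd_epi32(x, x);   /* {v0+v1, v2+v3, v0+v1, v2+v3} */
    x = _mm_hadd_epi32(x, x);   /* every lane now holds v0+v1+v2+v3 */
    return _mm_cvtsi128_si32(x);
}
```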
The next instruction class is packed absolute value calculation: pabsb/pabsw/pabsd
pshufb
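performs a byte-wise permutation: each byte of the control operand selects one source byte, or zeroes the result byte when its high bit is set.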
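A short sketch of both SSSE3 operations on bytes: `_mm_abs_epi8` for pabsb and `_mm_shuffle_epi8` for pshufb. The helper names are hypothetical, and the `target` attribute is a GCC/Clang extension standing in for `-mssse3`. Note that the absolute value of -128 does not fit in a signed byte, so the result is stored as unsigned.

```c
#include <assert.h>
#include <stdint.h>
#include <tmmintrin.h>   /* SSSE3 intrinsics */

/* Hypothetical helper: per-byte absolute value (pabsb). */
__attribute__((target("ssse3")))
static void abs_bytes(const int8_t in[16], uint8_t out[16])
{
    assert(in && out);
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, _mm_abs_epi8(v));
}

/* Hypothetical helper: reverse 16 bytes with one pshufb. */
__attribute__((target("ssse3")))
static void reverse_bytes(const uint8_t in[16], uint8_t out[16])
{
    /* control byte i holds the index of the source byte for result byte i */
    const __m128i ctrl = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                       7, 6, 5, 4, 3, 2, 1, 0);
    assert(in && out);
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, _mm_shuffle_epi8(v, ctrl));
}
```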
SSE4 comprises two sets of extensions: SSE4.1 and SSE4.2. From now on, MMX registers are no longer our concern.
pmuludq/pmuldq
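multiply the even-indexed doublewords of two XMM registers into two full 64-bit quadword products: pmuludq (available since SSE2) treats them as unsigned, pmuldq (new in SSE4.1) as signed.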
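A sketch contrasting the signed and unsigned widening multiplies through `_mm_mul_epi32` (pmuldq) and `_mm_mul_epu32` (pmuludq). Only doublewords 0 and 2 take part; doublewords 1 and 3 are ignored. The helper names are hypothetical, and the `target` attribute is a GCC/Clang extension standing in for `-msse4.1`.

```c
#include <assert.h>
#include <smmintrin.h>   /* SSE4.1 intrinsics */
#include <stdint.h>

/* Hypothetical helper: out = {a[0]*b[0], a[2]*b[2]} as signed 64-bit products. */
__attribute__((target("sse4.1")))
static void mul_even_signed(const int32_t a[4], const int32_t b[4], int64_t out[2])
{
    assert(a && b && out);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_mul_epi32(va, vb));   /* pmuldq */
}

/* Hypothetical helper: same lanes, unsigned 64-bit products. */
static void mul_even_unsigned(const uint32_t a[4], const uint32_t b[4], uint64_t out[2])
{
    assert(a && b && out);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_mul_epu32(va, vb));   /* pmuludq */
}
```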
movntdqa
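the streaming load instruction loads an aligned 16-byte double quadword with a non-temporal hint. When the source is write-combining (WC) memory, the processor may fetch the whole aligned cache line into a streaming load buffer,
and subsequent streaming loads within the same cache line can then fetch their data directly from that buffer. The details are in section 12.10.3, Streaming Load Hint Instruction.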
SSE4.1 also provides blend instructions for single-precision, double-precision and packed integer data. pblendw
blends packed word integers from two sources under an 8-bit immediate mask.
SSE4.1 also completes the packed integer min/max set for byte/word/doubleword: pminsb/pmaxsb, pminuw/pmaxuw, pminsd/pmaxsd and pminud/pmaxud.
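A sketch of the immediate-controlled blend with `_mm_blend_epi16`: bit i of the mask picks word i from the second source when set, from the first source when clear. `blend_odd_words` is a hypothetical helper, and the `target` attribute is a GCC/Clang extension standing in for `-msse4.1`.

```c
#include <assert.h>
#include <smmintrin.h>   /* SSE4.1 intrinsics */
#include <stdint.h>

/* Hypothetical helper: keep even words of a, take odd words from b. */
__attribute__((target("sse4.1")))
static void blend_odd_words(const uint16_t a[8], const uint16_t b[8], uint16_t out[8])
{
    assert(a && b && out);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* 0xAA = 10101010b: words 1, 3, 5, 7 come from b */
    _mm_storeu_si128((__m128i *)out, _mm_blend_epi16(va, vb, 0xAA));
}
```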
pinsrb/d/q
and pextrb/w/d/q
insert a byte/doubleword/quadword into an XMM register at a given position, and extract a byte/word/doubleword/quadword from a given position.
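A sketch of lane-level access with `_mm_extract_epi32` (pextrd) and `_mm_insert_epi32` (pinsrd). The position must be a compile-time constant, which is why it is hard-coded to lane 2 here. `swap_lane2` is a hypothetical helper, and the `target` attribute is a GCC/Clang extension standing in for `-msse4.1`.

```c
#include <assert.h>
#include <smmintrin.h>   /* SSE4.1 intrinsics */

/* Hypothetical helper: replace doubleword 2 and return its old value. */
__attribute__((target("sse4.1")))
static int swap_lane2(int v[4], int new_value)
{
    assert(v != NULL);
    __m128i x = _mm_loadu_si128((const __m128i *)v);
    int old = _mm_extract_epi32(x, 2);        /* pextrd */
    x = _mm_insert_epi32(x, new_value, 2);    /* pinsrd */
    _mm_storeu_si128((__m128i *)v, x);
    return old;
}
```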
phminposuw
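searches for the minimum unsigned word among the eight words of the source, and stores the minimum value in the low word of the destination and its index in the next word.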
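A sketch of reading both results of phminposuw through `_mm_minpos_epu16`: word 0 of the result holds the minimum, and the low three bits of word 1 hold its index. `min_u16` is a hypothetical helper, and the `target` attribute is a GCC/Clang extension standing in for `-msse4.1`.

```c
#include <assert.h>
#include <smmintrin.h>   /* SSE4.1 intrinsics */
#include <stdint.h>

/* Hypothetical helper: minimum of eight unsigned words and its position. */
__attribute__((target("sse4.1")))
static unsigned min_u16(const uint16_t v[8], unsigned *index)
{
    assert(v != NULL && index != NULL);
    __m128i r = _mm_minpos_epu16(_mm_loadu_si128((const __m128i *)v));
    *index = (unsigned)_mm_extract_epi16(r, 1) & 7u;   /* bits 18:16 of the result */
    return (unsigned)_mm_extract_epi16(r, 0);          /* bits 15:0 of the result  */
}
```

When the minimum occurs more than once, the lowest index is reported.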
ptest
performs a logical AND (and AND NOT) of two XMM registers and only sets ZF/CF from the result, without writing a destination register; I love pcmpeqb/w/d more.
pcmpeqq
is similar to pcmpeqb, but compares at quadword granularity.
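A sketch comparing the two styles for an "is this register all zero?" check: `_mm_testz_si128` (ptest) versus the SSE2 route of pcmpeqb followed by a byte mask extraction. The helper names are hypothetical, and the `target` attribute is a GCC/Clang extension standing in for `-msse4.1`.

```c
#include <assert.h>
#include <smmintrin.h>   /* SSE4.1 intrinsics, includes the SSE2 ones */

/* Hypothetical helper: ptest sets ZF when (v AND v) == 0. */
__attribute__((target("sse4.1")))
static int all_zero_ptest(const void *p)
{
    assert(p != NULL);
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    return _mm_testz_si128(v, v);
}

/* Hypothetical helper: SSE2-only version using pcmpeqb + pmovmskb. */
static int all_zero_sse2(const void *p)
{
    assert(p != NULL);
    __m128i v  = _mm_loadu_si128((const __m128i *)p);
    __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128());   /* 0xFF where byte == 0 */
    return _mm_movemask_epi8(eq) == 0xFFFF;                /* all 16 bytes matched */
}
```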
SSE4.2 introduced text processing instructions: pcmpestri/pcmpestrm/pcmpistri/pcmpistrm
perform explicit-length/implicit-length string comparison and return the result as an index in ECX or as a mask in XMM0. Their memory operands have no alignment requirement.
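A sketch of the implicit-length form through `_mm_cmpistri` in "equal any" mode, which behaves like a 16-byte `strpbrk`: it returns the index of the first character of the text that appears in the set, or 16 when there is none. `find_first_of16` is a hypothetical helper that only looks at the first 16 characters of each string; both are copied into zero-padded blocks so the loads never read past the end of the input. The `target` attribute is a GCC/Clang extension standing in for `-msse4.2`.

```c
#include <assert.h>
#include <nmmintrin.h>   /* SSE4.2 intrinsics */
#include <stddef.h>

/* Hypothetical helper: index of the first char of text found in set, else 16. */
__attribute__((target("sse4.2")))
static int find_first_of16(const char *text, const char *set)
{
    char t[16] = {0};
    char s[16] = {0};

    assert(text != NULL && set != NULL);
    for (int i = 0; i < 16 && text[i] != '\0'; i++) t[i] = text[i];
    for (int i = 0; i < 16 && set[i] != '\0'; i++)  s[i] = set[i];

    __m128i vt = _mm_loadu_si128((const __m128i *)t);
    __m128i vs = _mm_loadu_si128((const __m128i *)s);

    /* first operand: the character set, second operand: the text to scan */
    return _mm_cmpistri(vs, vt,
                        _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
                        _SIDD_LEAST_SIGNIFICANT);
}
```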
Another instruction provided by SSE4.2 is PCMPGTQ,
which compares packed signed quadwords for greater than and sets each result quadword to all ones or all zeros.
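A sketch of the signed quadword comparison through `_mm_cmpgt_epi64`: each result lane is all ones (-1) where `a > b` and zero otherwise. `gt_q` is a hypothetical helper, and the `target` attribute is a GCC/Clang extension standing in for `-msse4.2`.

```c
#include <assert.h>
#include <nmmintrin.h>   /* SSE4.2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: mask[i] = (a[i] > b[i]) ? -1 : 0, signed compare. */
__attribute__((target("sse4.2")))
static void gt_q(const int64_t a[2], const int64_t b[2], int64_t mask[2])
{
    assert(a && b && mask);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)mask, _mm_cmpgt_epi64(va, vb));   /* pcmpgtq */
}
```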