# 4.Intel 64 Base Architecture:2

## `chapter 11,Programming with Intel SSE2 technology.`

The SSE2 instruction set was introduced with the Pentium 4 and Intel Xeon processors.\
*SSE2 does not introduce any additional registers.*

Compared with SSE, SSE2 introduces the packed double-precision floating-point data type and the 128-bit packed integer types, from packed byte integers through packed quadword integers. *128-bit packed integer data types are supported since SSE2*.

Here we skip all floating-point operations in SSE2, as we did before.

### SSE2 64-bit and 128-bit integer instructions

`movdqa/movdqu` move 16 bytes between memory and XMM registers (or between XMM registers); `movdqa` requires a 16-byte-aligned address, `movdqu` does not.\
`paddq/psubq` perform quadword integer addition/subtraction.\
`pshufd` shuffles packed doubleword integers.\
`pslldq/psrldq` perform a logical left/right shift of the whole double quadword, counted in bytes.\
`movq2dq/movdq2q` move a quadword between MMX registers and XMM registers.
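To make the instructions above concrete, here is a minimal sketch using the C intrinsics that compilers map to them. The function names are made up for illustration, and the code assumes an x86-64 target (where SSE2 is always available) and a GCC- or Clang-compatible compiler.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Adds two pairs of 64-bit integers with paddq.
   The pointers may be unaligned because movdqu is used. */
void add_u64x2(const uint64_t a[2], const uint64_t b[2], uint64_t out[2])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* movdqu load  */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i sum = _mm_add_epi64(va, vb);               /* paddq        */
    _mm_storeu_si128((__m128i *)out, sum);             /* movdqu store */
}

/* Reverses the order of four doublewords with pshufd. */
void reverse_u32x4(const uint32_t in[4], uint32_t out[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    /* _MM_SHUFFLE(0, 1, 2, 3): result lane 3 takes source lane 0,
       lane 2 takes lane 1, lane 1 takes lane 2, lane 0 takes lane 3. */
    __m128i r = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
    _mm_storeu_si128((__m128i *)out, r);
}

/* Shifts the whole 128-bit value left by 4 bytes with pslldq.
   The byte count must be a compile-time constant. */
void shift_left_4_bytes(const uint32_t in[4], uint32_t out[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    __m128i r = _mm_slli_si128(v, 4);
    _mm_storeu_si128((__m128i *)out, r);
}
```

Note that `paddq` wraps around on overflow, and that `pslldq` shifts by bytes, so shifting by 4 moves every doubleword up by one lane and zeroes the lowest one.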

### Cacheability control and memory ordering instructions

`clflush` writes back (if modified) and invalidates the cache line associated with the specified address.

`movntdq/movntpd` store an XMM register to memory with a non-temporal hint.

Additionally, `movnti` stores an integer from a general-purpose register to memory with a non-temporal hint.

As with its SSE counterpart seen before, `maskmovdqu` stores selected bytes of an XMM register (a masked double quadword) to memory with a non-temporal hint.

As memory ordering additions, SSE2 introduces `lfence` to issue a load fence and `mfence` to issue a full memory fence that orders both loads and stores.
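The cacheability and ordering instructions can be sketched with intrinsics as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the destination of `fill_streaming` is assumed to be 16-byte aligned, and `_mm_sfence` (the `sfence` instruction, which comes from SSE rather than SSE2) is used to order the non-temporal stores.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in SSE) */
#include <stddef.h>
#include <stdint.h>

/* Fills `blocks` 16-byte blocks with a 32-bit pattern using movntdq.
   Assumption: dst is 16-byte aligned. */
void fill_streaming(void *dst, size_t blocks, uint32_t pattern)
{
    __m128i v = _mm_set1_epi32((int)pattern);
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < blocks; i++)
        _mm_stream_si128(p + i, v);  /* movntdq: non-temporal 16-byte store */
    _mm_sfence();  /* order the weakly ordered stores before later stores */
}

/* Writes a 32-bit flag with movnti, then issues mfence so that the store
   is ordered before any later load or store. */
void publish_flag(int *flag)
{
    _mm_stream_si32(flag, 1);  /* movnti */
    _mm_mfence();              /* mfence */
}

/* Flushes the cache line containing p from the cache hierarchy. */
void evict_line(const void *p)
{
    _mm_clflush(p);  /* clflush: write back if modified, then invalidate */
    _mm_mfence();    /* order the flush with respect to later accesses */
}
```

Non-temporal stores are mainly useful for large buffers that will not be read again soon; for small, frequently reused data, ordinary stores are usually the better choice.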

SSE2 also introduces `branch hint` prefixes for the conditional jump instructions (Jcc): 2EH hints that the branch is not taken, 3EH that the branch is taken.

## `chapter 12,Programming with Intel SSE3/SSSE3/SSE4 technology.`

SSE3 introduced a special load instruction, `lddqu`, which loads an unaligned 128-bit double quadword into an XMM register. The other SSE3 instructions mostly deal with floating-point data, so I will not cover them here.

The SSSE3 instruction set complements the SSE/SSE2 instruction sets; it still operates on MMX and XMM registers. SSSE3 offers horizontal addition/subtraction instructions: `phadd/phsub` followed by a size suffix (for example `phaddw`, `phaddd`).

The next instruction class is packed absolute value calculation: `pabsb/pabsw/pabsd`.

`pshufb` performs a byte-granular permutation: each byte of the control operand selects which source byte goes to that position.
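A minimal sketch of the SSSE3 instructions above, written with intrinsics. The function names and the `SSSE3_FN` macro are made up for illustration; the macro relies on the GCC/Clang `target` function attribute so that the file builds without `-mssse3`, which is an assumption about the toolchain.

```c
#include <tmmintrin.h>  /* SSSE3 intrinsics */
#include <stdint.h>

/* Assumption: GCC or Clang; enables SSSE3 for the annotated function only. */
#define SSSE3_FN __attribute__((target("ssse3")))

/* Reverses 16 bytes with pshufb: control byte i selects source byte 15 - i. */
SSSE3_FN void reverse_bytes16(const uint8_t in[16], uint8_t out[16])
{
    const __m128i ctrl = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                       7, 6, 5, 4, 3, 2, 1, 0);
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, _mm_shuffle_epi8(v, ctrl));  /* pshufb */
}

/* Sums |x| over four signed 32-bit lanes: pabsd, then two rounds of phaddd. */
SSSE3_FN int32_t sum_abs_i32x4(const int32_t in[4])
{
    __m128i v = _mm_abs_epi32(_mm_loadu_si128((const __m128i *)in)); /* pabsd */
    v = _mm_hadd_epi32(v, v);  /* phaddd: (a0+a1, a2+a3, a0+a1, a2+a3) */
    v = _mm_hadd_epi32(v, v);  /* every lane now holds the total */
    return _mm_cvtsi128_si32(v);
}
```

The intrinsics used here are the XMM forms; the MMX-register forms of the same instructions exist as well but are not shown.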

SSE4 comprises two sets of extensions: SSE4.1 and SSE4.2. From now on, MMX registers are no longer our concern.

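`pmuludq/pmuldq` multiply the doublewords in lanes 0 and 2 of each operand and produce two full quadword products; `pmuludq` (available since SSE2) is unsigned, while `pmuldq` (new in SSE4.1) is signed.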
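The lane behaviour of these two multiplies is easy to get wrong, so here is a small sketch. Only doubleword lanes 0 and 2 of each input are used, and each product is a full 64-bit value. The function names are hypothetical, and the `target` attribute is a GCC/Clang feature assumed here so that no `-msse4.1` flag is needed.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics (also pulls in SSE2) */
#include <stdint.h>

/* pmuludq (SSE2): unsigned 32x32 -> 64 multiply of lanes 0 and 2. */
void mul_even_u32(const uint32_t a[4], const uint32_t b[4], uint64_t out[2])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_mul_epu32(va, vb));
}

/* pmuldq (SSE4.1): signed 32x32 -> 64 multiply of lanes 0 and 2. */
__attribute__((target("sse4.1")))
void mul_even_i32(const int32_t a[4], const int32_t b[4], int64_t out[2])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_mul_epi32(va, vb));
}
```

Lanes 1 and 3 of the inputs are ignored; to multiply them, shift or shuffle them into the even positions first.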

`movntdqa`, the *streaming load instruction*, loads an aligned 16-byte double quadword with a non-temporal hint. On write-combining (WC) memory the processor may fetch the whole aligned 64-byte cache line into a `streaming load buffer`, so that subsequent loads within that cache line are served directly from the buffer. The details are in *12.10.3 Streaming load hint instruction*.

SSE4.1 also provides blend instructions for single-precision, double-precision and packed integer data. `pblendw` blends two arrays of packed word integers.

The min/max instructions for packed byte/word/doubleword integers are listed below:

![](https://3196714288-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MRJ9AEvbiVcjmd0rkS6%2Fsync%2Ffa950ecaa277c66798f784cc2cb0c5905a22a1b8.png?generation=1610950994704227\&alt=media)

`pinsrb/d/q` insert a byte/doubleword/quadword into an XMM register at a given position, and `pextrb/w/d/q` extract a byte/word/doubleword/quadword from a given position.

`phminposuw` searches the eight unsigned words of the source for the minimum and stores the minimum value (bits 15:0) and its index (bits 18:16) in the destination.
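As an illustration of the insert/extract and minimum-search instructions, the sketch below uses the corresponding SSE4.1 intrinsics. The helper names and the `SSE41_FN` macro are invented for this example, and the `target` attribute assumes GCC or Clang.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics */
#include <stdint.h>

/* Assumption: GCC or Clang; enables SSE4.1 for the annotated function only. */
#define SSE41_FN __attribute__((target("sse4.1")))

/* Returns the smallest of eight unsigned words and writes its index.
   phminposuw puts the minimum in bits 15:0 and the index in bits 18:16. */
SSE41_FN uint16_t min_u16x8(const uint16_t in[8], unsigned *index)
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    __m128i r = _mm_minpos_epu16(v);             /* phminposuw */
    *index = (unsigned)_mm_extract_epi16(r, 1);  /* word 1 = index */
    return (uint16_t)_mm_extract_epi16(r, 0);    /* word 0 = minimum */
}

/* Replaces doubleword lane 2 and returns the value that was there. */
SSE41_FN int32_t swap_in_lane2(int32_t lanes[4], int32_t value)
{
    __m128i v = _mm_loadu_si128((const __m128i *)lanes);
    int32_t old = _mm_extract_epi32(v, 2);  /* pextrd */
    v = _mm_insert_epi32(v, value, 2);      /* pinsrd */
    _mm_storeu_si128((__m128i *)lanes, v);
    return old;
}
```

When the minimum occurs more than once, `phminposuw` reports the lowest index.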

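`ptest` performs a logical AND of two XMM registers and only sets flags: ZF is set if the AND result is all zeros, and CF is set if the AND of the source with the inverted destination is all zeros. Neither register is modified. I still prefer `pcmpeqb/w/d`, though.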

`pcmpeqq` is similar to `pcmpeqb`, but with quadword granularity.
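The two styles of testing a register can be compared side by side. The sketch below checks whether 16 bytes are all zero, once with `ptest` and once with the SSE2-only `pcmpeqb` + `pmovmskb` pair, and then shows `pcmpeqq`. All function names are hypothetical, and the `target` attribute assumes GCC or Clang.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics (also pulls in SSE2) */
#include <stdint.h>

/* ptest: returns 1 when all 128 bits at p are zero (ZF = ((v AND v) == 0)). */
__attribute__((target("sse4.1")))
int is_all_zero_ptest(const void *p)
{
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    return _mm_testz_si128(v, v);
}

/* SSE2-only variant: pcmpeqb against zero, then pmovmskb collects one bit
   per byte; all 16 bits set means every byte compared equal to zero. */
int is_all_zero_pcmpeqb(const void *p)
{
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    return _mm_movemask_epi8(eq) == 0xFFFF;
}

/* pcmpeqq: returns a 2-bit mask, bit i set when quadword lane i is equal. */
__attribute__((target("sse4.1")))
int equal_mask_u64x2(const uint64_t a[2], const uint64_t b[2])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i eq = _mm_cmpeq_epi64(va, vb);          /* all-ones per equal lane */
    return _mm_movemask_pd(_mm_castsi128_pd(eq));  /* take each lane's top bit */
}
```

`ptest` saves the extra mask-extraction step when only a yes/no answer is needed, while the `pcmpeq*` family is more useful when the per-lane result is needed afterwards.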

SSE4.2 introduced text processing instructions: `pcmpestri/pcmpestrm/pcmpistri/pcmpistrm` perform explicit-length/implicit-length string comparison and return the result in ECX (the index forms, ending in `i`) or XMM0 (the mask forms, ending in `m`). Their memory operands have no alignment requirement.
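As a rough sketch of the implicit-length index form, the function below uses `pcmpistri` in "equal any" mode to find the first byte of a text chunk that belongs to a character set. The function name is made up, both arguments are assumed to point to 16 readable bytes (shorter strings must be NUL-terminated inside those 16 bytes), and the `target` attribute assumes GCC or Clang.

```c
#include <nmmintrin.h>  /* SSE4.2 intrinsics */

/* Returns the index of the first byte in `text` that also appears in `set`,
   or 16 when there is no such byte. Both buffers must be 16 bytes long;
   a NUL byte ends the string early (implicit length). */
__attribute__((target("sse4.2")))
int find_first_of16(const char text[16], const char set[16])
{
    __m128i t = _mm_loadu_si128((const __m128i *)text);
    __m128i s = _mm_loadu_si128((const __m128i *)set);
    /* First operand: the character set. Second operand: the text searched.
       The returned index refers to the second operand. */
    return _mm_cmpistri(s, t,
                        _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
                        _SIDD_LEAST_SIGNIFICANT);
}
```

A longer string would be processed 16 bytes at a time, advancing while the function returns 16 and no NUL byte has been seen; that loop is left out to keep the sketch short.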

Another instruction provided by SSE4.2 is `pcmpgtq`, which compares packed signed quadwords for greater-than.
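One common use of a greater-than mask is a lane-wise select. The sketch below builds a signed 64-bit maximum out of `pcmpgtq` plus SSE2 bitwise operations; the function name is hypothetical and the `target` attribute assumes GCC or Clang.

```c
#include <nmmintrin.h>  /* SSE4.2 intrinsics (also pulls in SSE2) */
#include <stdint.h>

/* Lane-wise signed maximum of two pairs of 64-bit integers. */
__attribute__((target("sse4.2")))
void max_i64x2(const int64_t a[2], const int64_t b[2], int64_t out[2])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i gt = _mm_cmpgt_epi64(va, vb);  /* pcmpgtq: all-ones where a > b */
    /* Select a where the mask is set, b elsewhere. */
    __m128i r = _mm_or_si128(_mm_and_si128(gt, va),
                             _mm_andnot_si128(gt, vb));
    _mm_storeu_si128((__m128i *)out, r);
}
```

The comparison is signed, so unsigned inputs would first need their sign bits flipped on both operands.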
