6. Intel Architecture Optimization
branch prediction optimization
To reduce the chance of misprediction and the number of branch target buffer (BTB) entries a program consumes, follow these principles:
1). Arrange code so that basic blocks are contiguous.
2). Unroll loops.
3). Use CMOVcc, the conditional move instruction.
4). Use SETcc, the conditional set instruction.
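As a rough illustration of items 3) and 4), the sketch below writes a maximum and a threshold count without explicit if/else branches. The function names are hypothetical, and whether the compiler actually emits CMOVcc or SETcc depends on the target and the optimization level, so the generated assembly (for example from gcc -O2 -S) should be inspected before relying on it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: select the larger of two values.
 * A simple ternary like this is a common candidate for CMOVcc on x86,
 * although the compiler remains free to emit a branch instead. */
int32_t max_i32(int32_t a, int32_t b)
{
    return (a > b) ? a : b;
}

/* Hypothetical helper: count the elements below a limit.
 * The comparison evaluates to 0 or 1, which compilers often materialize
 * with SETcc (or vectorize), so the loop body needs no data-dependent
 * branch. */
size_t count_below(const int32_t *v, size_t n, int32_t limit)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (size_t)(v[i] < limit);
    return count;
}
```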
Inline functions.
Inlining avoids a deep call stack and eliminates parameter passing. It also lets the compiler optimize the inlined body together with its caller, so branch mispredictions may be fewer than when the code sits in a separate function. Without inlining, the generated assembly prepares the parameters on the stack and then executes a call instruction to reach the real function body.
With inlining, requested here with the GCC decorator __attribute__((always_inline)) inline, the work that add would have done is embedded directly in main, where a single addl performs the addition.
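The contrast can be sketched as follows. This is a minimal example with hypothetical function names (add_call, add_inline, and the two callers are illustrative); noinline is used only to force a real call so that the two variants can be compared.

```c
/* Hypothetical example functions for comparing a call with an inlined body. */

/* noinline forces GCC to keep a separate function body and call it. */
__attribute__((noinline))
int add_call(int a, int b)
{
    return a + b;
}

/* always_inline asks GCC to embed the body at every call site.
 * 'static' keeps the definition local to this translation unit. */
static inline __attribute__((always_inline))
int add_inline(int a, int b)
{
    return a + b;
}

int sum_with_call(int x, int y)
{
    return add_call(x, y);    /* expected to compile to a call instruction */
}

int sum_inlined(int x, int y)
{
    return add_inline(x, y);  /* expected to compile to a plain add */
}
```

Compiling with gcc -O2 -S and comparing sum_with_call with sum_inlined should show the call in the first and only the addition in the second, though the exact instructions depend on the target and compiler version.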
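principles of instruction selection
inc/dec do not update the carry flag (CF) in EFLAGS; they change only the other arithmetic flags. On some microarchitectures this partial flag update can create a dependency on an earlier flag-writing instruction, which add/sub avoid because they overwrite all the flags.
Use lea to construct an effective address, letting the addressing mode do the work instead of general-purpose computation.
Use an arithmetic or logical instruction rather than a move to clear a register, for example xor of a register with itself.
Use test when comparing a value in a register with zero. test performs an and calculation but only affects EFLAGS, without modifying the destination register. A test of a register with itself is better than a cmp against zero because the code is smaller.
nop does nothing, but it still occupies one byte of code, so the RIP register advances when it is executed.
Moving data between registers can be zero-latency on processors that eliminate the move at register rename.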
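The small C functions below are a sketch of source patterns that commonly lead a compiler to the instructions discussed above. The function names are hypothetical and the instruction noted in each comment is only what GCC typically emits for x86 at -O2, not a guarantee.

```c
#include <stddef.h>
#include <stdint.h>

/* base + index*4 + 8 matches the x86 addressing form
 * [base + index*scale + disp], so it is a typical candidate for one lea. */
uintptr_t element_address(uintptr_t base, size_t index)
{
    return base + index * 4 + 8;
}

/* Comparison with zero: compilers usually emit "test reg, reg" here
 * rather than "cmp reg, 0". */
int is_zero(int32_t x)
{
    return x == 0;
}

/* Returning a constant zero is usually compiled to "xor eax, eax"
 * rather than "mov eax, 0". */
int32_t zero(void)
{
    return 0;
}
```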
optimizing memory accesses
Alignment applies to:
dynamically allocated data structures.
members of a data structure.
global and local variables.
parameters on the stack.
GCC's built-in alignment attribute can align a data structure that the compiler places itself. If the structure is dynamically allocated, however, the base address can only be aligned by asking the allocation function for an aligned block.
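A minimal sketch of both cases follows, assuming a 64-byte cache line. The struct and function names are hypothetical; __attribute__((aligned(N))) is the GCC attribute, and aligned_alloc is the C11 allocation function used here as one possible choice.

```c
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64  /* assumed cache line size; check the target CPU */

/* Objects placed by the compiler: the attribute aligns every instance
 * and pads the size up to a multiple of CACHE_LINE. */
struct counters {
    uint64_t hits;
    uint64_t misses;
} __attribute__((aligned(CACHE_LINE)));

struct counters global_counters;  /* aligned by the compiler */

/* Dynamically allocated objects: request the alignment from the
 * allocator instead. Returns NULL on failure; release the result
 * with free(). */
struct counters *counters_alloc(size_t n)
{
    /* The size is a multiple of CACHE_LINE because of the padding above,
     * as aligned_alloc requires. */
    return aligned_alloc(CACHE_LINE, n * sizeof(struct counters));
}
```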
Next comes data structure layout. By default GCC takes the platform's alignment requirements into account when placing members, often using double-word alignment. Depending on how the data is accessed, we can also choose between an array of structures and a structure of arrays.
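The two layouts can be sketched as below. The type names, field names, and element count are hypothetical; the point is only the difference in memory stride when a loop reads a single field.

```c
#include <stddef.h>

#define N 1024  /* arbitrary element count for the sketch */

/* Array of structures: all fields of one point are adjacent.
 * Suits code that touches every field of an element together. */
struct point_aos { float x, y, z; };

/* Structure of arrays: each field is contiguous across elements.
 * Suits loops that stream over one field, and SIMD loads. */
struct points_soa {
    float x[N];
    float y[N];
    float z[N];
};

float sum_x_aos(const struct point_aos *p, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p[i].x;   /* stride of sizeof(struct point_aos) bytes */
    return s;
}

float sum_x_soa(const struct points_soa *p, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p->x[i];  /* unit stride: every loaded byte is used */
    return s;
}
```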
An unaligned stack may incur a large penalty, so keeping the stack pointer aligned is necessary. One way to align it is to round the stack pointer down to the required boundary in the function prologue.
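As a minimal sketch of the idea, assuming a 16-byte boundary, the helper below performs the same rounding arithmetic that a prologue applies to the stack pointer, and the second function checks that an aligned local really lands on such a boundary. Both function names are hypothetical, and a real prologue does this in assembly rather than in C.

```c
#include <stdint.h>

/* Round an address down to a power-of-two boundary.
 * This is the arithmetic behind masking the stack pointer,
 * for example "and esp, -16" on x86. */
uintptr_t align_down(uintptr_t addr, uintptr_t boundary)
{
    return addr & ~(boundary - 1);
}

/* Ask GCC for a 16-byte-aligned local and report whether its address
 * is a multiple of 16. The compiler aligns the frame as needed. */
int local_is_aligned(void)
{
    __attribute__((aligned(16))) unsigned char buf[32];
    buf[0] = 0;  /* touch the buffer so it is really allocated */
    return ((uintptr_t)buf % 16) == 0;
}
```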
Why use write combining? It improves performance in two ways:
On a write miss to the first-level cache, it allows multiple stores to the same cache line to occur before that cache line is read for ownership (RFO).
It allows multiple writes to be assembled in the write combining buffer and evicted together at a later time.
Write combining is particularly important for non-temporal writes, where it avoids partial writes as much as possible.
Peak system bus bandwidth is shared by several kinds of bus activity, including reads, RFOs, and writes. Writes to write-back memory must share bandwidth with RFO traffic, while non-temporal writes do not require RFO traffic.
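A sketch of non-temporal stores written so that they can combine into full cache lines is shown below. The function name is hypothetical, and the code assumes a 64-byte cache line, a destination aligned to 64 bytes, and a size that is a multiple of 64. The SSE2 intrinsics _mm_stream_si128 and _mm_sfence are used when available, with a plain loop as a fallback on other targets.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Fill 'bytes' bytes at 'dst' with the 32-bit pattern 'value'.
 * Assumptions: dst is 64-byte aligned and bytes is a multiple of 64,
 * so each loop iteration writes one complete 64-byte line. */
void fill_streaming(void *dst, size_t bytes, int32_t value)
{
#if defined(__SSE2__)
    __m128i v = _mm_set1_epi32(value);
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < bytes / 16; i += 4) {
        /* Four adjacent 16-byte stores cover a whole line, giving the
         * write combining buffer the chance to emit one full-line write
         * instead of several partial ones. */
        _mm_stream_si128(p + i + 0, v);
        _mm_stream_si128(p + i + 1, v);
        _mm_stream_si128(p + i + 2, v);
        _mm_stream_si128(p + i + 3, v);
    }
    _mm_sfence();  /* order the streaming stores before later stores */
#else
    /* Fallback for targets without SSE2: ordinary cached stores. */
    int32_t *p = (int32_t *)dst;
    for (size_t i = 0; i < bytes / sizeof(int32_t); i++)
        p[i] = value;
#endif
}
```

Interleaving stores to many different lines would tend to evict partially filled buffers, which is why the loop finishes one line before starting the next.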