6. Intel Architecture Optimization

branch prediction optimization

To reduce the chance of misprediction and the number of branch target buffer (BTB) entries the code consumes, there are several principles we should follow:

1). arrange code to make basic blocks contiguous

2). unroll loops

3). use the CMOVcc conditional move instruction.

4). use the SETcc conditional set instruction.
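
As a rough illustration of principles 3 and 4, the sketch below contrasts a data-dependent branch with branch-free C that a compiler can lower to SETcc or CMOVcc. The function names are hypothetical, and whether the compiler actually emits these instructions depends on the compiler version and optimization flags, so checking the generated assembly is advisable.

```c
#include <stddef.h>

/* Branchy version: the `if` depends on the data and is hard to predict
   when the input is random. */
size_t count_below_branchy(const int *v, size_t n, int limit)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] < limit)
            count++;
    }
    return count;
}

/* Branch-free version: the comparison yields 0 or 1, which a compiler
   can materialize with SETcc instead of a conditional jump. */
size_t count_below_branchless(const int *v, size_t n, int limit)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (size_t)(v[i] < limit);
    return count;
}

/* A simple select; compilers often lower this to CMOVcc. */
int max_int(int a, int b)
{
    return (a > b) ? a : b;
}
```

Both counting functions return the same result; only the shape of the generated code is expected to differ.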

inline code together.

Inlining avoids an overly deep call stack and eliminates parameter passing. Once the callee's body is visible inside the caller, the compiler can optimize the combined code on its own, and branch misprediction may be lower than with a standalone function. Let the code show how inlining works. Code without inlining:

int add(int a,int b)
{
        return a+b;
}
int main()
{
        int a=12;
        int b=34;
        int c=add(a,b);
        return 0;
}

The unoptimized assembly stages the arguments in stack slots, copies them into the %edi and %esi registers, and then issues a call instruction to reach the real function body.

add:
        ... ...
        movl    %edi, -4(%rbp)
        movl    %esi, -8(%rbp)
        movl    -8(%rbp), %eax
        movl    -4(%rbp), %edx
        addl    %edx, %eax
        popq    %rbp
        ... ...

main:
        ... ...
        movq    %rsp, %rbp
        subq    $16, %rsp
        movl    $12, -4(%rbp)
        movl    $34, -8(%rbp)
        movl    -8(%rbp), %edx
        movl    -4(%rbp), %eax
        movl    %edx, %esi
        movl    %eax, %edi
        call    add
        ... ...

With inlining, we mark the function with the GCC attribute __attribute__((always_inline)) together with the inline keyword. The work that add used to do is now done inside main, because the function body is embedded there:

main:
        ... ...
        movl    $12, -4(%rbp)
        movl    $34, -8(%rbp)
        movl    -4(%rbp), %eax
        movl    %eax, -16(%rbp)
        movl    -8(%rbp), %eax
        movl    %eax, -20(%rbp)
        movl    -20(%rbp), %eax
        movl    -16(%rbp), %edx
        addl    %edx, %eax
        movl    %eax, -12(%rbp)
        movl    $0, %eax
        popq    %rbp
        ... ...

Here addl performs the addition directly inside main, and no call instruction is needed.
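
For reference, a minimal sketch of how the source might be annotated to get the inlined form. The caller is named compute here (a hypothetical name standing in for main) so that the snippet can be dropped into a larger program.

```c
/* always_inline asks GCC to inline add() even when optimization is off. */
static inline __attribute__((always_inline)) int add(int a, int b)
{
    return a + b;
}

/* Hypothetical caller standing in for main() in the text. */
int compute(void)
{
    int a = 12;
    int b = 34;
    return add(a, b);   /* body of add() is expanded here, no call expected */
}
```

Compiling with `gcc -S` and looking for the absence of `call add` is one way to confirm the attribute took effect.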

principle of instruction selection

inc/dec update only a subset of the EFLAGS bits (they leave the carry flag unchanged), which can create a dependence on earlier flag writes; add/sub avoid this.
use the lea instruction to construct an effective address.
use addressing modes rather than general-purpose computation.
use a zeroing idiom such as xor reg, reg rather than a move instruction to clear a register.
use test when comparing a value in a register with zero. test performs an AND but only affects EFLAGS, without modifying the destination register, and it is better than cmp with an immediate zero due to smaller code size.
nop does nothing, but it still occupies one byte of code, which means the RIP register advances when the instruction is executed.
moving data between registers can be zero-latency on microarchitectures that eliminate register-to-register moves.
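
A small sketch of C expressions that typically map onto two of these idioms. The function names are hypothetical, and the instruction choices named in the comments are what compilers commonly emit at -O2, not a guarantee.

```c
/* base + index*8 + 16 matches the x86 addressing form
   disp(base, index, scale), so it can usually be computed by one lea. */
long element_offset(long base, long index)
{
    return base + index * 8 + 16;
}

/* Comparison with zero: compilers usually emit `test reg, reg`
   followed by SETcc, rather than `cmp $0, reg`. */
int is_zero(long x)
{
    return x == 0;
}
```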

optimizing memory accesses

Alignment applies to:

dynamically allocated data structure alignment.
members of a data structure.
global or local variables.
parameters on the stack.
We can use GCC's built-in alignment attribute to arrange a data structure. If the structure is dynamically allocated, however, we can only ask the allocation function to align the base address.
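
A minimal sketch of both cases, assuming a 64-byte cache line. The struct and helper names are hypothetical; `__attribute__((aligned))` is the GCC attribute and `aligned_alloc` is the C11 allocation function.

```c
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64   /* assumed cache-line size */

/* Static or automatic objects: the attribute aligns the type itself,
   which also pads sizeof(struct counters) up to 64 bytes. */
struct counters {
    long hits;
    long misses;
} __attribute__((aligned(CACHE_LINE)));

/* Dynamic objects: only the base address can be requested from the
   allocator. aligned_alloc expects the size to be a multiple of the
   alignment, which holds here because of the padding above. */
struct counters *alloc_counters(size_t n)
{
    return aligned_alloc(CACHE_LINE, n * sizeof(struct counters));
}

/* Returns 1 when p is a multiple of alignment. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```

Memory obtained this way is released with the ordinary `free`.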

Next comes data structure layout. By default, GCC takes the platform's alignment rules into consideration, often double-word alignment. Moreover, depending on how the data is accessed, we can decide whether to use an array of structures or a structure of arrays.
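
The two layouts can be sketched as follows. The type names and the element count are hypothetical; the point is only the memory stride seen by a loop that touches a single field.

```c
#include <stddef.h>

#define NPOINTS 4   /* illustrative element count */

/* Array of structures: the fields of one element sit next to each other. */
struct point_aos { float x, y, z; };

/* Structure of arrays: all x values are contiguous, then all y, then all z. */
struct points_soa { float x[NPOINTS], y[NPOINTS], z[NPOINTS]; };

float sum_x_aos(const struct point_aos *p, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p[i].x;      /* stride of sizeof(struct point_aos) bytes */
    return s;
}

float sum_x_soa(const struct points_soa *p, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p->x[i];     /* unit stride, every loaded byte is used */
    return s;
}
```

When a loop reads only one field, the structure-of-arrays form tends to use cache lines more fully; when every field of an element is used together, the array-of-structures form keeps them in one place.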

An unaligned stack may lead to a large penalty, so aligning the stack pointer is necessary. To align it, we could use the following code snippet:

#include <x86intrin.h>
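int main()
{
        /* extended asm: registers are written %%reg, clobbers are declared */
        asm volatile("movq %%rsp,%%rbx;"   /* save the original stack pointer */
                     "subq $8,%%rsp;"      /* reserve one quadword */
                     "andq $-8,%%rsp;"     /* clear the low 3 bits of rsp */
                     /*do something with quadword-aligned stack*/
                     "movq %%rbx,%%rsp"    /* restore the original stack pointer */
                     : : : "rbx", "memory", "cc");
        return 0;
}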

Why use write combining? It improves performance in two ways:

On a write miss to the first-level cache, it allows multiple stores to the same cache line to occur before the cache line is read for ownership (RFO).
Write combining allows multiple writes to be assembled together; they are evicted from the write combining buffer as a unit at a later time.

Write combining is particularly important for non-temporal writes, because it avoids partial writes as much as possible.

Peak system bus bandwidth is often shared by several kinds of bus activity, including reads, reads for ownership (RFO), and writes. Write-back memory must share bandwidth with RFO traffic, while non-temporal writes do not require RFO traffic.
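
A hedged sketch of non-temporal writes using the SSE2 intrinsics `_mm_stream_si128` and `_mm_sfence`. The function name is hypothetical, and the code assumes an x86 target, a 16-byte-aligned destination, and an element count that is a multiple of 4.

```c
#include <emmintrin.h>   /* SSE2: _mm_set1_epi32, _mm_stream_si128 */
#include <stddef.h>
#include <stdint.h>

/* Fill dst with value using non-temporal stores.
   Assumptions: dst is 16-byte aligned and count is a multiple of 4. */
void fill_nontemporal(int32_t *dst, size_t count, int32_t value)
{
    __m128i v = _mm_set1_epi32(value);
    for (size_t i = 0; i < count; i += 4) {
        /* Streaming store: goes through the write combining buffers
           and does not need to read the line for ownership first. */
        _mm_stream_si128((__m128i *)(dst + i), v);
    }
    _mm_sfence();   /* order the streamed stores before later stores */
}
```

Writing whole cache lines in consecutive order, as this loop does, gives the write combining buffers the best chance of being flushed as full lines rather than as partial writes.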

