# 6.Intel Architecture Optimization

* branch prediction optimization

To reduce the chance of misprediction and the number of branch target buffer (BTB) entries the code consumes, there are several principles we should follow:

1\). arrange code to make basic blocks contiguous

2\). unroll loops

3\). use the `CMOVcc` conditional move instruction

4\). use the `SETcc` conditional set instruction
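
As a minimal sketch of principles 3 and 4, the two functions below are written so that a compiler has the option of emitting `CMOVcc` or `SETcc` instead of a conditional jump. The names `max_int` and `count_below` are hypothetical, and whether the compiler actually picks these instructions depends on the compiler, the optimization level, and the target, so it is worth checking the generated assembly.

```c
#include <stddef.h>

/* A side-effect-free select: compilers commonly lower this to
   cmp + cmovcc at -O2 rather than a conditional jump (not guaranteed). */
static int max_int(int a, int b)
{
        return (a > b) ? a : b;
}

/* The comparison result is used as a 0/1 value, which maps naturally to
   cmp + setcc; the only branch left is the loop condition itself. */
static size_t count_below(const int *v, size_t n, int limit)
{
        size_t count = 0;
        for (size_t i = 0; i < n; i++)
                count += (v[i] < limit);
        return count;
}
```

Both functions avoid a data-dependent branch in their body, so there is nothing for the predictor to mispredict when the input data is random.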

* inline code together.

Inlining removes the call and return, avoids overly deep call stacks, and eliminates parameter passing. Once the callee's body is visible inside the caller, the compiler can optimize across it, and there may be fewer branch mispredictions than when the code lives in a separate function.\
Let the code show how inlining works.\
Code without inlining:

```c
int add(int a,int b)
{
        return a+b;
}
int main()
{       int a=12;
        int b=34;
        int c=add(a,b);
        return 0;
}
```

The generated assembly stores the locals on the stack, loads the arguments into the `%edi` and `%esi` registers, and then uses a `call` instruction to reach the function body.

```c
add:
        ... ...
        movl    %edi, -4(%rbp)
        movl    %esi, -8(%rbp)
        movl    -8(%rbp), %eax
        movl    -4(%rbp), %edx
        addl    %edx, %eax
        popq    %rbp
        ... ...

main:
        ... ...
        movq    %rsp, %rbp
        subq    $16, %rsp
        movl    $12, -4(%rbp)
        movl    $34, -8(%rbp)
        movl    -8(%rbp), %edx
        movl    -4(%rbp), %eax
        movl    %edx, %esi
        movl    %eax, %edi
        call    add
        ... ...
```

With inlining, we mark the function with the GCC attribute `__attribute__((always_inline)) inline`. The work that `add` used to do is now done inside `main`, because the function body is embedded there:

```c
main:
        ... ...
        movl    $12, -4(%rbp)
        movl    $34, -8(%rbp)
        movl    -4(%rbp), %eax
        movl    %eax, -16(%rbp)
        movl    -8(%rbp), %eax
        movl    %eax, -20(%rbp)
        movl    -20(%rbp), %eax
        movl    -16(%rbp), %edx
        addl    %edx, %eax
        movl    %eax, -12(%rbp)
        movl    $0, %eax
        popq    %rbp
        ... ...
```

Here `addl` performs the addition directly inside `main`, and there is no `call` instruction.
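
For reference, a source file along the following lines would be expected to produce an inlined listing of that shape. This is a sketch: the caller name `sum_example` is hypothetical (the original listing uses `main`), and the exact assembly depends on the GCC version and flags.

```c
/* always_inline asks GCC to inline the call even without optimization. */
static inline __attribute__((always_inline)) int add(int a, int b)
{
        return a + b;
}

/* Hypothetical caller: after inlining, the addition happens here and
   no call instruction to add is emitted. */
int sum_example(void)
{
        int a = 12;
        int b = 34;
        return add(a, b);
}
```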

## principle of instruction selection

* `inc`/`dec` update most status flags but leave the carry flag (CF) untouched, so they write EFLAGS only partially; prefer `add`/`sub` where that partial update creates a flag dependency.
* use `lea` instruction to construct an effective address.
* use addressing modes rather than general-purpose computation to form addresses.
* use an arithmetic or logic instruction such as `xor`, rather than a move instruction, to clear a register.
* use `test` when comparing a value in a register with zero. `test` performs an `and` but only affects EFLAGS, without modifying the destination register, and it is better than `cmp` with zero due to smaller code size.
* `nop` does nothing, but it still occupies one byte of code, which means that the RIP register advances when this instruction is executed.
* moving data between registers can be zero-latency on microarchitectures that eliminate the move at register rename.

## optimizing memory accesses

Alignment applies to:

* dynamically allocated data structure alignment.
* members of a data structure.
* global or local variables.
* parameters on the stack.

We can use GCC's built-in alignment attributes to align statically declared data structures. If the data is dynamically allocated, however, the only option is an allocation function that takes an alignment argument, which aligns the base address.
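
A minimal sketch of both cases, assuming a 64-byte cache line: the type is aligned with GCC's `aligned` attribute, and the dynamic case uses C11 `aligned_alloc`. The names `CACHE_LINE`, `struct counters`, `global_counters`, and `alloc_counters` are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64   /* assumed cache-line size in bytes */

/* The attribute aligns every object of this type to CACHE_LINE and
   pads sizeof(struct counters) up to a multiple of CACHE_LINE. */
struct counters {
        uint64_t hits;
        uint64_t misses;
} __attribute__((aligned(CACHE_LINE)));

/* Static case: the compiler places this object on a 64-byte boundary. */
static struct counters global_counters;

/* Dynamic case: only the base address can be requested from the
   allocator. aligned_alloc expects the size to be a multiple of the
   alignment, which holds here because of the padding noted above. */
static struct counters *alloc_counters(size_t n)
{
        return aligned_alloc(CACHE_LINE, n * sizeof(struct counters));
}
```

Memory returned by `alloc_counters` is released with `free`, as with any `aligned_alloc` allocation.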

Next comes data structure layout. By default, GCC takes the platform's alignment requirements into consideration when laying out members, often using doubleword alignment. Moreover, depending on how the data is accessed, we can choose between an `array of structures` and a `structure of arrays`.
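
The two layouts can be sketched as follows. The type names, the element count `N`, and the two helper functions are hypothetical and only illustrate how the same field is reached in each layout.

```c
#include <stddef.h>

#define N 4     /* hypothetical element count */

/* Array of structures: the fields of one point sit next to each other. */
struct point_aos {
        float x, y, z;
};

/* Structure of arrays: each field is contiguous across all points,
   which suits loops (and SIMD loads) that touch a single field. */
struct points_soa {
        float x[N];
        float y[N];
        float z[N];
};

/* Strided access: only one float out of every three is used. */
static float sum_x_aos(const struct point_aos *p, size_t n)
{
        float s = 0.0f;
        for (size_t i = 0; i < n; i++)
                s += p[i].x;
        return s;
}

/* Unit-stride access: every loaded float is used. */
static float sum_x_soa(const struct points_soa *p, size_t n)
{
        float s = 0.0f;
        for (size_t i = 0; i < n; i++)
                s += p->x[i];
        return s;
}
```

Which layout is better depends on the access pattern: code that uses all fields of one element at a time tends to favor the array of structures, while code that sweeps one field across many elements tends to favor the structure of arrays.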

An unaligned stack may lead to a significant penalty, so aligning the stack pointer is necessary. To align it, we could use the following code snippet:

```c
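int main(void)
{
        /* Save %rsp in a scratch register, round it down to an 8-byte
           (quadword) boundary, and restore it before the asm statement
           ends so the compiler's view of the stack stays valid. */
        asm volatile("movq %%rsp, %%rax\n\t"  /* save the original stack pointer */
                     "subq $8, %%rsp\n\t"     /* reserve one quadword */
                     "andq $-8, %%rsp\n\t"    /* clear the low 3 bits of %rsp */
                     /* do something with the quadword-aligned stack */
                     "movq %%rax, %%rsp"      /* restore the original stack pointer */
                     :
                     :
                     : "rax", "cc", "memory");
        return 0;
}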
```

As for write combining, why is it there? It improves performance in two ways:

* on a write miss to the first-level cache, it allows multiple stores to the same cache line to occur before the cache line is read for ownership (RFO).
* it allows multiple writes to be assembled in a write-combining buffer and evicted from that buffer together at a later time.

Write combining is particularly important for non-temporal writes, because it avoids partial writes as much as possible.

Peak system bus bandwidth is shared by several kinds of bus activity, including reads, reads for ownership (RFO), and writes. Stores to write-back memory must share bandwidth with RFO traffic, while non-temporal stores do not require RFO traffic.
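
A minimal sketch of non-temporal writes using the SSE2 intrinsic `_mm_stream_si128`. The function name `fill_stream` is hypothetical, and the code assumes an x86 target with SSE2, a 16-byte-aligned destination, and an element count that is a multiple of 4.

```c
#include <emmintrin.h>  /* SSE2: _mm_set1_epi32, _mm_stream_si128 */
#include <stddef.h>
#include <stdint.h>

/* Fills dst with value using non-temporal stores.
   Assumptions: dst is 16-byte aligned and count is a multiple of 4. */
static void fill_stream(int32_t *dst, size_t count, int32_t value)
{
        __m128i v = _mm_set1_epi32(value);

        /* Consecutive 16-byte stores walk through each cache line in
           order, giving the write-combining buffer the chance to
           collect a full line before it is evicted. */
        for (size_t i = 0; i < count; i += 4)
                _mm_stream_si128((__m128i *)(dst + i), v);

        /* Non-temporal stores are weakly ordered; the fence makes them
           globally visible before any later stores. */
        _mm_sfence();
}
```

Writing whole cache lines back to back, as the loop does, is what helps avoid the partial writes mentioned above. Aligning the buffer to the cache-line size rather than just 16 bytes is a reasonable further step.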


