2.DPDK memory tuning

In my previous design,I aggregate all physical memory segments which may not be physically continuous into one big virtual segment,but within DPDK,the memory must be virtually continuous and physically continuous at the same time ,but after changing the memory mapping ,the kernel part code will encounter error due to the non-pyhsically-continuous memory,this is indeed what happens when de-segmentating in DPDK.

so why not use DPDK ivshmem library, please refer to DPDK prog's guide ,you will find DPDK ivshmem really can mapping only partial host DPDK hugepage memory into guest through metadata,by invoking APIs ,such as

rte_ivhshmem_metadata_add_memzone() 
rte_ivshmem_metadata_add_ring() 
rte_ivshmem_metadata_add_mempool()

and later ,dpdk ivshmem will generate command line parameter by calling rte_ivshmem_metadata_cmdline_generate() that is passed to qemu,so the limitation is obvious,the qemu must boot or startup after DPDK process already starts up, and what's more fatal ,once DPDK process crashes down ,qemu ivshmem will be homeless, and it must restart.so I bet DPDK ivshmem will find no room where the developers will think highly of.

obviously ,even vhost-dpdk can not still meet zero-copy requirment in DPDK and qemu yet, so in order to realize real zero-copy packet switching between host and guest within DPDK, the unique way is to map DPDK hugepage into guest backend device wholly or partially.

the partial memory mapping is like what DPDK ivshmem does ,the dpdk can only map the interested part of hugepage into qemu device selectivly,we call the part switching-relevant memory ,but once dpdk lane process crashes ,we still need to restart qemu to remap the newly switching-relevant memory. the will break the need that all components of virtual link must be plugable .

the utimate way is to map the whole hugepage into qemu device,this way,the device will see the whole hugepage ,and when link component plug an play,any component will impose no influence on others .

this way sounds rather simple ,but I will cope with several address spaces and translate them when memory block in different context.next few paragraphs will introduce more.

1.why memory segmentation is necessary ?

In my first desgin ,I emerged all physically contiguous zones into one large virtual contiguous zone,which ended up with unknow errors,I guess the emerging brought the disaster,the reason is simple :when the dpdk pmd got the virtually contiguous memory segment,it is represented to kernel where the memory will be regarded as physically contiguous memory,but the truth is it is not,so memory access in kernel will crash the pmd down. so one principle must be followed is the dpdk memory layout remains untouched and multi-segmentated.

1.how DPDK memory is modified to adapt to my design ?

what is mentioned before is to un-touch dpdk hugepage memory mapping layout,but instead we will generate dpdk memory metata which records hugefiles and its physical beginning adress .the format is described below:

...
/dev/hugepages/rtemap_0 0x3c800000
/dev/hugepages/rtemap_2 0x3cc00000
/dev/hugepages/rtemap_1 0x3ce00000
/dev/hugepages/rtemap_4 0x3d000000
/dev/hugepages/rtemap_3 0x3d200000
...

these items are utilized by other processes to map hugefile in same order with dpdk eal.what's more ,we do not lookup \/proc\/{ID}\/pagemap to gain physical memory again.

Intsed ,we use address translation,it's not complicated as you may think,so the easiest way to do translation(virtual2physical) is read pagemap to get the relevant physical address,please refer to this link:https:\/\/www.kernel.org\/doc\/Documentation\/vm\/pagemap.txt, but this procedure is not invertible,there is now way to translate a physical address to virtual address .

Fortunately ,dpdk hugepage memory is hugesize aligned ,which mean that we can build a index table or hash table to ease address translation between virtual space and physical space .then given an virt(phy) address addr,then the page number is calculated by:addr&(~hugepage_size),hugepage_size must be power of 2,next we can index the hash table or index table to gain relevant phy(virt) address ,denoting page_begin,and finally we can get translated address by adding the offset:page_begin+addr&(hugepage_size-1).

Basing on the address translation mechanism,we can now describe how the whole virtual link works.the following figure illustrates how addresses is translated in different contexts.

There are four stages in translations.

dpdk context maintain hugepage memory in its own process including pmd.once packets enter virtual link ,they must be translated into physical addresses.
once the packets are queued ,all the time qemu will see the packets 's physical ,qemu MUST provide function to translate physical address into virtual address in qemu's process address space.note that this step is very critical
in guest's kernel context device driver, it must execute IO operation to translate host's address into guest's address, doing this way,but not yet finished .
last would be map guest 's physical memory into its kernel virtual address or userspace virtual address ,with virtual adderess ,we can access packets now.
see, it's not easy ,also we need to make translation efficient.to obtain this objective ,we use hash of big size which lower the opportunity for two hashes conflicting.
but it does conflict sometimes,a lot of measures can be utilized ,here I used linear rehashing.

last I will introduce two functions will translate virtual address into physical one:

phys_addr_t rte_mem_virt2phy(const void *virtaddr);
uint64_t translate_virt_address(uint64_t virt_address)

both functions will do the same job and get same result,but the procedures are quite different ,rte_mem_virt2phy will open /proc/self/pagemap ,lseek to appropriate position and then fetch and parse to gain relevant address.

translate_virt_address is based on hashing and indexing ,so it's is quick, often the searching complexicity will be O(1) as long as size of the hashtable is large enough.the size is easy to estimate.

Previous1.overview Next3.Agent architecture

Last updated 4 years ago

Was this helpful?