4. Virtual link host PMD
The host PMD is the DPDK lane of the virtual link: it implements the DPDK PMD interface. The EAL parameter for the virtual link has the following format: --vdev=eth_vlink0,link=<tap-name>,mode=[bond|mq]. During PMD initialization, the PMD tries to connect to the agent and fetch device information from it: the vm_domain name, how many links it has, and what they are respectively. With that, the PMD knows exactly where the channels are, i.e. the channel offsets.
The virtual link can work in two modes: VLINK_PMD_MODE_BONDING and VLINK_PMD_MODE_MULTI_Q. I am going to demonstrate in which cases each of them should be used.
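To make the --vdev parameters and the two mode constants concrete, here is a minimal sketch of how such a virtual device could parse its arguments with DPDK's rte_kvargs API; the vlink_params struct and the handler names are my own illustration, not the actual driver code.

```c
#include <stdio.h>
#include <string.h>
#include <rte_common.h>
#include <rte_kvargs.h>

enum vlink_pmd_mode { VLINK_PMD_MODE_BONDING, VLINK_PMD_MODE_MULTI_Q };

/* hypothetical container for the parsed --vdev arguments */
struct vlink_params {
	char tap_name[64];         /* link=<tap-name>     */
	enum vlink_pmd_mode mode;  /* mode=bond | mode=mq */
};

static int
set_link(const char *key __rte_unused, const char *value, void *opaque)
{
	struct vlink_params *p = opaque;

	snprintf(p->tap_name, sizeof(p->tap_name), "%s", value);
	return 0;
}

static int
set_mode(const char *key __rte_unused, const char *value, void *opaque)
{
	struct vlink_params *p = opaque;

	if (strcmp(value, "bond") == 0)
		p->mode = VLINK_PMD_MODE_BONDING;
	else if (strcmp(value, "mq") == 0)
		p->mode = VLINK_PMD_MODE_MULTI_Q;
	else
		return -1;  /* unknown mode string */
	return 0;
}

/* parse the part after "eth_vlink0,", e.g. "link=tap123,mode=bond" */
static int
vlink_parse_args(const char *args, struct vlink_params *p)
{
	static const char * const valid[] = { "link", "mode", NULL };
	struct rte_kvargs *kvlist = rte_kvargs_parse(args, valid);

	if (kvlist == NULL)
		return -1;
	rte_kvargs_process(kvlist, "link", set_link, p);
	rte_kvargs_process(kvlist, "mode", set_mode, p);
	rte_kvargs_free(kvlist);
	return 0;
}
```

After parsing, the real driver would contact the agent with the parsed tap name and learn the channel layout described above.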
Before telling the whole story, I need to state a fact: a 10 Gigabit Ethernet NIC, taking the Intel 82599 as an example, will bear a load of about 14 Mpps of 64-byte packets, at which point the NIC is fully loaded, i.e. running at 10 Gbps line rate. I generate traffic with pktgen-dpdk, but is 14 Mpps the maximum transmit rate a single CPU (lcore) can generate? No, actually it is not. I used a dummy PMD driver (which does nothing but free all the packets handed to it for transmission). See! A single core can generate more than 24 Mpps of 64-byte packets when it gets the whole CPU time, and it should be even higher on platforms with a higher CPU frequency. It is really impressive that the Intel 10G NIC can accommodate more than 14 Mpps; the hardware is really fast. Multiple queues will not raise the transmit rate, except that RSS can distribute traffic so that the other ends share the load evenly.
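For reference, the "dummy PMD" used above boils down to a transmit callback that only frees the mbufs it is handed; a minimal sketch (my own reconstruction, not the exact driver I used) looks like this:

```c
#include <stdint.h>
#include <rte_common.h>
#include <rte_mbuf.h>

/*
 * Dummy transmit burst: pretend every packet was sent by simply
 * freeing it. Whatever rate pktgen-dpdk reaches against this
 * callback is the pure software generation rate of one lcore.
 */
static uint16_t
dummy_tx_burst(void *tx_queue __rte_unused,
	       struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
{
	uint16_t i;

	for (i = 0; i < nb_pkts; i++)
		rte_pktmbuf_free(tx_pkts[i]);

	return nb_pkts;  /* report everything as transmitted */
}
```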
Now that we know a CPU can push far more packets than that into a link, how much traffic the link itself can bear matters a lot. Hardware usually handles this with less time and less cost than a software virtual device, but that is not absolutely true, especially when using tricks that raise total throughput, such as multiple queues and NUMA-aware deployment. Next I will demonstrate how performance depends on these factors.
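As a rough illustration of the multi-queue trick (not the vlink driver's actual setup code), configuring several RX/TX queue pairs on a DPDK port looks roughly like this; the queue count, descriptor counts and helper name are example values of mine:

```c
#include <rte_ethdev.h>
#include <rte_mempool.h>

/* configure nb_queues RX/TX queue pairs on port_id (illustrative only) */
static int
setup_multi_queue(uint16_t port_id, uint16_t nb_queues, struct rte_mempool *mp)
{
	struct rte_eth_conf conf = { 0 };  /* default port configuration */
	uint16_t q;
	int ret;

	ret = rte_eth_dev_configure(port_id, nb_queues, nb_queues, &conf);
	if (ret != 0)
		return ret;

	for (q = 0; q < nb_queues; q++) {
		ret = rte_eth_rx_queue_setup(port_id, q, 512,
					     rte_eth_dev_socket_id(port_id),
					     NULL, mp);
		if (ret != 0)
			return ret;
		ret = rte_eth_tx_queue_setup(port_id, q, 512,
					     rte_eth_dev_socket_id(port_id),
					     NULL);
		if (ret != 0)
			return ret;
	}
	return rte_eth_dev_start(port_id);
}
```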
It would be remiss not to present the test framework here, so let me briefly describe how I conduct the experiment:
pktgen-dpdk with the eth_vlink device available: ./app/app/x86_64-native-linuxapp-gcc/pktgen -c ffff0000 --vdev=eth_vlink0,link=tap123,mode=bond -- -P -T -f themes/black-yellow.theme -m " [17:19].0"
The other end of the virtual link is a userspace application which polls the rx_stub queue and echoes the packets back through the tx_queue, emulating an end-point residing in a virtual machine. I pin it to a specific processor with taskset: taskset 20 ./pmd_test_main 0
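The echoing end-point is essentially a tight poll loop. Below is a heavily simplified sketch of it, where struct vlink_stub, vlink_stub_recv() and vlink_stub_send() are hypothetical placeholders for whatever receive/transmit calls the stub queues really expose:

```c
#include <stdint.h>

#define BURST 32

/* hypothetical stub-queue API, stand-ins for the real vlink calls */
struct vlink_stub;
extern unsigned vlink_stub_recv(struct vlink_stub *rxq, void **pkts, unsigned n);
extern unsigned vlink_stub_send(struct vlink_stub *txq, void **pkts, unsigned n);

/* poll the rx stub and echo every received packet back on the tx queue */
static void
echo_loop(struct vlink_stub *rxq, struct vlink_stub *txq)
{
	void *pkts[BURST];

	for (;;) {
		unsigned n = vlink_stub_recv(rxq, pkts, BURST);
		unsigned sent = 0;

		/* retry until the whole burst has been echoed back */
		while (sent < n)
			sent += vlink_stub_send(txq, pkts + sent, n - sent);
	}
}
```

Pinning this loop with taskset keeps it on a single core, so each end-point costs one busy-polling CPU.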
Here three screenshots are presented to show the throughput when the other end does nothing upon receiving packets except echo them back.
First of all, with only one channel:
With two channels enabled:
With three channels enabled:
We can observe from the pictures above that with three channels it is quite easy to load the virtual link with 22 Mpps of 64-byte packets. That is quite a lot, and the total line rate is about 15 Gbps. In summary, total throughput scales with the number of channels, but not linearly; the reason will be explained later.
Address translation means translating the physical addresses transported along the queues into virtual addresses in the end-point's own mapping; after that, the end-point accesses a few bytes of the packet payload, which consumes a lot of CPU cycles. A minimal sketch of that translation follows, and then the results:
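Here is a minimal sketch of what such a physical-to-virtual translation can look like when the packet buffers live in one contiguous region mapped into the end-point; the region bookkeeping and names are my own simplification, not the actual vlink code:

```c
#include <stdint.h>
#include <stddef.h>

/* one contiguous packet-buffer region mapped into this process */
struct mem_region {
	uintptr_t phys_base;  /* physical/IOVA base recorded by the producer */
	uint8_t  *virt_base;  /* where the same region is mapped locally     */
	size_t    len;
};

/* translate a physical address carried in a descriptor into a local pointer */
static inline void *
phys_to_virt(const struct mem_region *r, uintptr_t phys)
{
	uintptr_t off = phys - r->phys_base;

	if (off >= r->len)
		return NULL;  /* address outside the shared region */
	return r->virt_base + off;
}

/* touching even a few payload bytes forces the cache misses measured below */
static inline uint8_t
peek_first_byte(const struct mem_region *r, uintptr_t phys)
{
	uint8_t *p = phys_to_virt(r, phys);

	return p != NULL ? p[0] : 0;
}
```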
With only one channel enabled:
With two channels enabled:
With four channels enabled:
We can see from the pictures above that once some overhead is added at the other end-point, the total throughput drops sharply from 10 Mpps to ~2 Mpps per channel; fortunately, it still scales roughly linearly with the number of channels. It is a little disheartening to devote a lot of CPUs and get poor performance in return. The only point that comforts me is that it does scale, and it gets better when NUMA is taken into account. Next I will detail the NUMA aspects.
The most common resource the virtual link accesses is memory, and memory is distributed across all the CPUs (when multiple CPU sockets are present); accessing local memory is much faster than accessing remote memory. This affects the virtual link a lot, so the idea for improving performance is to arrange the two ends of one virtual link on the same CPU socket.
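As an illustration of keeping both ends socket-local (not the exact code used in these tests), a DPDK application can create its mbuf pool on the NUMA node of the lcore that will touch the packets:

```c
#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

/*
 * Create the packet pool on the NUMA node of the polling lcore, so both
 * the generator and the echoing end-point touch socket-local memory.
 * Pool size, cache size and pool name are arbitrary example values.
 */
static struct rte_mempool *
create_local_pktpool(unsigned lcore_id)
{
	unsigned socket = rte_lcore_to_socket_id(lcore_id);

	return rte_pktmbuf_pool_create("vlink_pool_local",
				       8192,  /* number of mbufs */
				       256,   /* per-lcore cache */
				       0,     /* private size    */
				       RTE_MBUF_DEFAULT_BUF_SIZE,
				       socket);
}
```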
The CPU layout is shown below:
All the CPUs used here reside in socket 1, and pktgen-dpdk always allocates mbufs locally, so the other end-point of the virtual link accesses that memory locally. Let's have a look at how much improvement this brings. With only one channel enabled:
With two channels enabled:
With five channels enabled:
We can see that when a single channel is deployed, more than 4 Mpps is reached, and throughput scales very well with the number of channels; it even reaches 28 Mpps when 5 channels are used. So NUMA matters a lot for the virtual link.