4. performance of vecring

4.1 hardware environment

processor capability:

[root@compute1 jzheng(keystone_admin)]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Stepping:              2
CPU MHz:               3011.625
BogoMIPS:              4803.99
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31

we can see that two cpu sockets are available. next we check the cpu flags it supports:

flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat 
pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm 
constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf 
eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr 
pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx 
f16c rdrand lahf_lm abm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept 
vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc

we can see that even AVX2 is supported.
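
a quick way to confirm a specific flag from the shell (nothing vecring-specific, just /proc/cpuinfo):

# count the logical cpus whose flags line contains avx2 (should print 32 on this box)
grep -c avx2 /proc/cpuinfo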

next, we allocate hugepage memory for each socket (1024 2MB hugepages per socket):

AnonHugePages:   8855552 kB
HugePages_Total:    2052
HugePages_Free:     2048
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

note that 4 extra pages had already been allocated by something else.
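
for reference, a minimal sketch of how the per-socket reservation can be done through sysfs (assuming 2MB pages, NUMA nodes 0 and 1, and that /dev/hugepages is already mounted):

# reserve 1024 x 2MB hugepages on each NUMA node
echo 1024 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
echo 1024 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages

# check the global counters afterwards
grep Huge /proc/meminfo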

4.2 core allocation

we choose socket 1 as the preferred socket in our test. next we assign cores to the rx/tx tasks:

core id    role    location
13         rx      host
15         tx      host
17         rx      LXC
19         tx      LXC
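
as a quick check that cores 13, 15, 17 and 19 really live on NUMA node 1, we can use lscpu's parseable output:

# print cpu,node pairs and keep only the node-1 cpus (the odd-numbered ones on this box)
lscpu -p=CPU,NODE | grep ',1$'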

4.3 pktgen-dpdk setup

host side pktgen-dpdk command line:

./app/app/x86_64-native-linuxapp-gcc/pktgen  --huge-unlink  --no-pci  \
 --huge-dir=/dev/hugepages  --socket-mem 256,256  \
 --vdev=eth_vecring0,domain=testcontainer,\
 link=tap456,master=true,mac=00:ec:f4:bb:d9:7f,socket=1   \
 -- -P -T -f themes/black-yellow.theme  -m "[13:15].0"

lxc side pktgen-dpdk command line:

./app/app/x86_64-native-linuxapp-gcc/pktgen  --huge-unlink  --no-pci  \
--huge-dir=/dev/hugepages  --socket-mem 256,256  \
--vdev=eth_vecring0,domain=testcontainer,link=tap456,mac=00:ec:f4:bb:d9:7f   \
-- -P -T -f themes/black-yellow.theme  -m "[17:19].0"
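
a few annotations on the vdev parameters, as read from the two command lines above (my reading, not formal vecring documentation):

# domain=testcontainer    same domain value on the host and LXC side
# link=tap456             same virtual link name on both sides
# master=true             set only on the host-side vdev
# mac=00:ec:f4:bb:d9:7f   mac address used on both sides here
# socket=1                the preferred socket chosen in 4.2 (host side only)
# -m "[13:15].0"          pktgen core map: core 13 rx, core 15 tx for port 0 (host)
# -m "[17:19].0"          core 17 rx, core 19 tx for port 0 (LXC), matching the table in 4.2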

4.4 throughput rate overview

first, we use 64-byte minimum-size frames as the baseline. with traffic flowing in only one direction, we get the result below:

every core can generate or receive more than 20Mpps, which is a rather heavy load.

with traffic in both directions, the pps a single core can generate/receive decreases.

at ~18Mpps, single-core performance drops a little, and total throughput does not scale linearly. we would say this is because the data path is memory based, and the memory bandwidth available to each core/socket is limited.

next we increase the packet size to 1514 bytes.

as before, here is what a single core can generate or receive:

about 2.4Mpps (~30Gbps) for a 1514-byte frame size.
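
as a rough sanity check on that figure (payload bits only, ignoring preamble and inter-frame gap): 2.4 Mpps x 1514 bytes x 8 bits ≈ 29 Gbps, in line with the ~30Gbps reading.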

again, with traffic in both directions we expect a single core to generate/receive less than it does with one-direction traffic; we verify it:

as we can see, single-core capability is about 1.2Mpps (~15Gbps), almost halved.

4.5 smoketest

we create four virtual links: two assigned to socket 0 and the other two to socket 1. here is the host-side EAL command line:

./app/app/x86_64-native-linuxapp-gcc/pktgen  --huge-unlink  --no-pci  \
--huge-dir=/dev/hugepages  --socket-mem 256,256 \
 --vdev=eth_vecring0,domain=testcontainer,link=tap456,master=true,mac=00:ec:f4:bb:d9:7f,socket=1 \
 --vdev=eth_vecring1,domain=testcontainer,link=tap12345678,master=true,socket=1,mac=00:ec:f4:bb:d9:50 \
 --vdev=eth_vecring2,domain=testcontainer,link=tap555,master=true,socket=0,mac=00:ec:f4:bb:d9:99 \
 --vdev=eth_vecring3,domain=testcontainer,link=tap666,master=true,socket=0,mac=00:ec:f4:bb:d9:98  \
 -- -P -T -f themes/black-yellow.theme  -m "[7:9].0,[11:13].1,[8:10].2,[12:14].3"
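
note how the core map lines up with each port's socket parameter (my annotation, based on the NUMA node lists in 4.1):

# -m "[7:9].0,[11:13].1,[8:10].2,[12:14].3"
#   ports 0 and 1 (socket=1) use odd cores 7,9,11,13   -> NUMA node 1
#   ports 2 and 3 (socket=0) use even cores 8,10,12,14 -> NUMA node 0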

the container side has similar parameters:

./app/app/x86_64-native-linuxapp-gcc/pktgen  --huge-unlink  \
--no-pci  --huge-dir=/dev/hugepages  --socket-mem 256,256  \
--vdev=eth_vecring0,domain=testcontainer,link=tap456,mac=00:ec:f4:bb:d9:7f  \
--vdev=eth_vecring1,domain=testcontainer,link=tap12345678 \
--vdev=eth_vecring2,domain=testcontainer,link=tap555 \
--vdev=eth_vecring3,domain=testcontainer,link=tap666  \
-- -P -T -f themes/black-yellow.theme  -m "[15:17].0,[19:21].1,[20:22].2,[26:28].3"

we have the initial port view now:

virtual links on different sockets do not influence each other much. here is what we get with pktgen> start 0,2 :

that is to say, only virtual links on the same socket affect each other. here is what we get with pktgen> start 0,1,2 :

finally, we set the frame size to 1514 and start all ports in one direction only:
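
the console sequence for this step looks roughly like the following (a sketch assuming pktgen-dpdk's usual set/start commands; for the dual-direction case below, the same start all is issued on the container side as well):

pktgen> set all size 1514
pktgen> start all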

note that we achieve a total of 4.9Mpps (~60Gbps). next we start all ports with traffic in both directions:

so far, we know that no matter how we arrange the cpu, NUMA and virtual-link layout, memory bandwidth limits the total throughput of the system. we can see this clearly from the following figure of total throughput with 64-byte frames:

so, we can be a little proud that a bundle of virtual links can handle packet flows of almost 60Gbps for frames of any size.

4.6 summary

the bottleneck of vecring is definitely memory bandwidth, the same as for a virtio-backed virtual link.
