
KubeVirt: The Next Gen of VM Orchestration?

The first time I heard about KubeVirt was from my friend kuliserper, who brought up this tool when he spoke at OpenInfra Meetup 13. Sadly I couldn't catch his presentation because at that time I wasn't in Jakarta :'(.

Honestly, at first I wasn't very excited, because it sounded like just running a VM on top of a container (qemu-in-docker), which is nothing special. But my whole perspective changed after hearing: "the problem with KubeVirt is that the VM is not accessible from outside the cluster by default; if you want it accessible you should use multus-cni, and you need another NIC for the provider network/floating IP network."

WHATTT?? WHAYY?? HOWW?

That sounded like nonsense to me. What's the point of having a VM if no one can access it from outside, right? Theoretically the VM should be able to use an SNAT function. Attaching another NIC just for the provider network/floating IP network is technically wasteful, unless you have a special case like needing SR-IOV inside your VM. And what if I want to add another network? Should I attach a new NIC? Or create a new VLAN & bridge subnet on every node? That's very inefficient.

OK, enough with this nonsense. Let's dig deeper and prove whether my theory is possible with the SNAT & DNAT functions.

Setup

Wait, let me show you my kube topology first.

[topology diagram]

So I have 3 nodes: 1 master and 2 workers. All nodes have two NICs: one for kube communication between nodes, and a second one for the provider network. The networking is simple: red is the provider network/floating IP, green is the internal node network, and purple is the pod network.

Here are the subnet details:

  • Green: 201.0.0.0/24
  • Red: 192.168.100.0/24
  • Purple: 100.0.0.0/24

Let's start the setup. The first step is installing Kube-OVN, since as far as I know Kube-OVN is the only CNI that provides an SNAT function.

Install kube-ovn

  • wget https://raw.githubusercontent.com/kubeovn/kube-ovn/release-1.12/dist/images/install.sh
  • nano install.sh # change POD_CIDR, POD_GATEWAY, EXCLUDE_IPS (see the example below)
  • bash install.sh
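For reference, this is roughly what I end up changing in install.sh. The values below are purely illustrative for my purple pod network (the guest interface later shows a /16, so I assume /16 here); adjust them to your own ranges, and note that SVC_CIDR, if your copy of the script has it, must match your cluster's service CIDR:

POD_CIDR="100.0.0.0/16"                 # pod network (purple), use whatever CIDR you actually want
POD_GATEWAY="100.0.0.1"                 # first usable IP of that CIDR
EXCLUDE_IPS="100.0.0.1..100.0.0.10"     # addresses Kube-OVN should never hand out
SVC_CIDR="10.96.0.0/12"                 # must match the kube-apiserver service CIDR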

And wait until all the pods are running. To verify that Kube-OVN is working you can use:

  • kubectl ko nbctl show
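Another sanity check I like (assuming your Kube-OVN version ships the Subnet CRD, which recent releases do) is to make sure the default subnet really picked up the CIDR you set:

kubectl get subnets
# ovn-default should show your POD_CIDR, next to the join subnet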

Install kubevirt

  • export RELEASE=$(curl https://storage.googleapis.com/kubevirt-prow/release/kubevirt/kubevirt/stable.txt)
  • kubectl apply -f https://github.com/kubevirt/kubevirt/releases/download/${RELEASE}/kubevirt-operator.yaml
  • kubectl apply -f https://github.com/kubevirt/kubevirt/releases/download/${RELEASE}/kubevirt-cr.yaml
  • kubectl get pods -A

And wait until all the pods are running.
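Besides watching the pods, the KubeVirt CR reports a deployment phase; if I remember the field path right, this should print Deployed once the operator has rolled everything out:

kubectl get kubevirt.kubevirt.io/kubevirt -n kubevirt -o=jsonpath="{.status.phase}"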

And don't forget to install virtctl:

  • wget https://github.com/kubevirt/kubevirt/releases/download/${RELEASE}/virtctl-${RELEASE}-linux-amd64
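The download is just a bare binary, so make it executable and drop it somewhere on your PATH (the destination here is just my habit, not a requirement):

chmod +x virtctl-${RELEASE}-linux-amd64
sudo mv virtctl-${RELEASE}-linux-amd64 /usr/local/bin/virtctl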

Now let's try to create a VM.

vm.yaml

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: testvm
spec:
  running: false
  template:
    metadata:
      labels:
        kubevirt.io/size: small
        kubevirt.io/domain: testvm
    spec:
      domain:
        devices:
          disks:
            - name: containerdisk
              disk:
                bus: virtio
            - name: cloudinitdisk
              disk:
                bus: virtio
          interfaces:
          - name: default
            masquerade: {}
        resources:
          requests:
            memory: 64M
      networks:
      - name: default
        pod: {}
      volumes:
        - name: containerdisk
          containerDisk:
            image: quay.io/kubevirt/cirros-container-disk-demo
        - name: cloudinitdisk
          cloudInitNoCloud:
            userDataBase64: SGkuXG4=
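One note on the cloud-init volume: the userDataBase64 value here is just a dummy payload, it doesn't provision anything. You can decode it yourself:

echo "SGkuXG4=" | base64 -d
# prints the literal string: Hi.\n

Now apply it: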
root@ubuntu-kube-1:/home/humanz# kubectl apply -f vm.yaml
virtualmachine.kubevirt.io/testvm created
root@ubuntu-kube-1:/home/humanz# kubectl get vm
NAME                   AGE   STATUS    READY
testvm                 91s   Stopped   False

Now start the VM:

root@ubuntu-kube-1:/home/humanz# virtctl start testvm
VM testvm was scheduled to start
root@ubuntu-kube-1:/home/humanz# kubectl get vm
NAME                   AGE    STATUS    READY
testvm                 3m6s   Running   True

Let's try to get a console on it:

root@ubuntu-kube-1:/home/humanz# virtctl console testvm
Successfully connected to testvm console. The escape sequence is ^]

login as 'cirros' user. default password: 'gocubsgo'. use 'sudo' for root.
testvm login: cirros
Password:
$ sudo -i
#

Nice, now the VM is running and accessible.

Digging down the rabbit hole

Ladies and gentlemen, let's start our journey.

First, let's check the IP address of the VM:

# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:00:00:F8:96:4E
          inet addr:10.0.2.2  Bcast:10.0.2.255  Mask:255.255.255.0
          inet6 addr: fe80::200:ff:fef8:964e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:8900  Metric:1
          RX packets:445 errors:0 dropped:0 overruns:0 frame:0
          TX packets:495 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:59203 (57.8 KiB)  TX bytes:53577 (52.3 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

The IP address is 10.0.2.2. Huh, that's strange, since I never defined this subnet anywhere on my kube cluster.
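As far as I know, this 10.0.2.0/24 range is simply the default CIDR KubeVirt uses for the masquerade binding: the guest always gets an address from that fixed range, no matter what the pod network is. If I read the API right, it can even be overridden per VM with vmNetworkCIDR on the pod network (illustrative fragment only, I haven't tested it here):

      networks:
      - name: default
        pod:
          vmNetworkCIDR: 10.20.0.0/24   # assumption: overrides the default 10.0.2.0/24 masquerade range

Anyway, back to the pod side, let's check the pod IP: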

root@ubuntu-kube-1:/home/humanz# kubectl get pods -o wide
NAME                                               READY   STATUS    RESTARTS   AGE   IP           NODE            NOMINATED NODE   READINESS GATES
nfs-subdir-external-provisioner-5b67d5c597-55pmr   1/1     Running   0          35h   100.0.0.8    ubuntu-kube-3   <none>           <none>
virt-launcher-testvm-szq8b                         3/3     Running   0          78m   100.0.0.22   ubuntu-kube-3   <none>           1/1
virt-launcher-vm-cirros-datavolume-d9phb           2/2     Running   0          34h   100.0.0.21   ubuntu-kube-3   <none>           1/1

Huh, so the pod IP of the VM is 100.0.0.22, but how did it become 10.0.2.2 inside the VM? Let's try ping and SSH.
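(Quick cross-check before the ping test: the VMI object itself should report that same pod-network address in its IP column, so you don't even need to look inside the guest. Output omitted, it's the same 100.0.0.22.)

kubectl get vmi testvm -o wide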

root@ubuntu-kube-1:/home/humanz# ping -c 3 10.0.2.2
PING 10.0.2.2 (10.0.2.2) 56(84) bytes of data.

--- 10.0.2.2 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2039ms

root@ubuntu-kube-1:/home/humanz# ping -c 3 100.0.0.22
PING 100.0.0.22 (100.0.0.22) 56(84) bytes of data.
64 bytes from 100.0.0.22: icmp_seq=1 ttl=62 time=1.67 ms
64 bytes from 100.0.0.22: icmp_seq=2 ttl=62 time=1.46 ms
64 bytes from 100.0.0.22: icmp_seq=3 ttl=62 time=0.818 ms

--- 100.0.0.22 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 0.818/1.314/1.670/0.361 ms

Yeah, as expected 10.0.2.2 is unreachable but 100.0.0.22 is reachable, so let's SSH into the VM:

root@ubuntu-kube-1:/home/humanz# ssh cirros@100.0.0.22
cirros@100.0.0.22's password:
$ sudo -i
# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:00:00:F8:96:4E
          inet addr:10.0.2.2  Bcast:10.0.2.255  Mask:255.255.255.0
          inet6 addr: fe80::200:ff:fef8:964e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:8900  Metric:1
          RX packets:537 errors:0 dropped:0 overruns:0 frame:0
          TX packets:559 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:73430 (71.7 KiB)  TX bytes:62799 (61.3 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

Sure enough, SSH works fine. But how does 10.0.2.2 inside the VM turn into 100.0.0.22 at the pod level? Let's find out.

root@ubuntu-kube-1:/home/humanz# kubectl exec -it virt-launcher-testvm-szq8b -c compute bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
bash-5.1$ id
uid=107(qemu) gid=107(qemu) groups=107(qemu)

Hmm, crap, I don't have root inside the container >:'(, I just hate that.
Let's escalate.

root@ubuntu-kube-3:/home/humanz# crictl ps -a | grep virt-launcher-testvm-szq8b
a5eae1b6f29a0       d340d99a7f602d364b5e26566a8e315d9f530332d8e43bd5a141a1058c62330e                                                                      2 hours ago         Running             guest-console-log                 0                   8f9ba90bc68bb       virt-launcher-testvm-szq8b
8d23679bfd2d2       quay.io/kubevirt/cirros-container-disk-demo@sha256:0e5ac38b20abcc7752293425b239a147868facd62cd5030dede6da6f2fc526a1                   2 hours ago         Running             volumecontainerdisk               0                   8f9ba90bc68bb       virt-launcher-testvm-szq8b
621a06bc3851e       d340d99a7f602d364b5e26566a8e315d9f530332d8e43bd5a141a1058c62330e                                                                      2 hours ago         Running             compute                           0                   8f9ba90bc68bb       virt-launcher-testvm-szq8b
56654182c5130       quay.io/kubevirt/cirros-container-disk-demo@sha256:0e5ac38b20abcc7752293425b239a147868facd62cd5030dede6da6f2fc526a1                   2 hours ago         Exited              volumecontainerdisk-init          0                   8f9ba90bc68bb       virt-launcher-testvm-szq8b
ad71e10028d6a       d340d99a7f602d364b5e26566a8e315d9f530332d8e43bd5a141a1058c62330e                                                                      2 hours ago         Exited              container-disk-binary             0                   8f9ba90bc68bb       virt-launcher-testvm-szq8b
root@ubuntu-kube-3:/home/humanz# crictl inspect 621a06bc3851e | grep pid
    "pid": 3300896,
          "pids": {
            "type": "pid"
root@ubuntu-kube-3:/home/humanz# nsenter --all -S 0 -G 0 -t 3300896
[root@testvm /]# id
uid=0(root) gid=0(root) groups=0(root)
[root@testvm /]#

GG EZ. Anyway, let's check the network:

[root@testvm /]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: k6t-eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8900 qdisc noqueue state UP group default qlen 1000
    link/ether 02:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.1/24 brd 10.0.2.255 scope global k6t-eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::ff:fe00:0/64 scope link
       valid_lft forever preferred_lft forever
3: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8900 qdisc fq_codel master k6t-eth0 state UP group default qlen 1000
    link/ether 5e:4b:5a:69:35:8e brd ff:ff:ff:ff:ff:ff
    inet6 fe80::5c4b:5aff:fe69:358e/64 scope link
       valid_lft forever preferred_lft forever
34: eth0@if35: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8900 qdisc noqueue state UP group default
    link/ether 00:00:00:f8:96:4e brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 100.0.0.22/16 brd 100.0.255.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::200:ff:fef8:964e/64 scope link
       valid_lft forever preferred_lft forever
[root@testvm /]# ss -tulpn
Netid                            State                             Recv-Q                             Send-Q                                                         Local Address:Port                                                         Peer Address:Port                            Process
udp                              UNCONN                            0                                  0                                                                    0.0.0.0:67                                                                0.0.0.0:*                                users:(("virt-launcher",pid=13,fd=12))
[root@testvm /]#

Hmm, nothing special here. We have eth0 holding the pod IP, which is a veth whose peer sits on the host side with the CNI, and a tap interface (tap0) paired with k6t-eth0, which is the interface facing the QEMU side.
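If you want to confirm the veth pairing yourself, the @ifNN suffix on eth0 is the peer's ifindex on the host. A quick sketch, using the 35 that happened to show up in my output:

# inside the virt-launcher network namespace: eth0@if35 means the peer has ifindex 35 on the node
cat /sys/class/net/eth0/iflink      # should print 35
# then on the node itself, look that index up
ip -o link | grep '^35:'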

***Pssst, here's a fun fact:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: k6t-eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8900 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 02:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
3: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8900 qdisc fq_codel master k6t-eth0 state UP mode DEFAULT group default qlen 1000
    link/ether 5e:e0:c9:74:76:65 brd ff:ff:ff:ff:ff:ff
36: eth0@if37: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8900 qdisc noqueue state UP mode DEFAULT group default
    link/ether 00:00:00:f8:96:4e brd ff:ff:ff:ff:ff:ff link-netnsid 0
    alias f5f4cbdae79f_c

Look at tap0: that interface has k6t-eth0 as its master. Let's verify that tap0 is really the VM's interface:

[root@testvm ~]# virsh dumpxml 1 | grep -m 1 interface -A 8
Authorization not available. Check if polkit service is running or see debug message for more information.
    <interface type='ethernet'>
      <mac address='00:00:00:f8:96:4e'/>
      <target dev='tap0' managed='no'/>
      <model type='virtio-non-transitional'/>
      <mtu size='8900'/>
      <alias name='ua-default'/>
      <rom enabled='no'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </interface>

Yep, tap0 is the VM's interface. Let me draw it:

[vm-peer diagram]

The function of the k6t-eth0 interface is to act as the bridge/master for the VM's tap interface and to serve as the DHCP server.
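To double-check that k6t-eth0 really is a plain Linux bridge with tap0 enslaved, standard iproute2 is enough, nothing KubeVirt-specific:

ip -d link show k6t-eth0    # the details should include 'bridge'
bridge link show            # should list tap0 with 'master k6t-eth0'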

Now the question is: how can the pod IP serve as the VM IP? As you saw, I can ping and SSH into the VM itself using the pod IP. What magic is behind this?

The most logical answer is DNAT (Destination Network Address Translation), because the destination gets changed, right? The destination was 100.0.0.22 but it ends up as 10.0.2.2. Let me check the iptables rules:

root@testvm:~# iptables -t nat -nvL
iptables v1.8.7 (nf_tables): table `nat' is incompatible, use 'nft' tool.

Huh, looks like the KubeVirt devs prefer nftables rather than iptables:

root@testvm:~# nft list table nat
table ip nat {
        chain prerouting {
                type nat hook prerouting priority dstnat; policy accept;
                iifname "eth0" counter packets 3 bytes 180 jump KUBEVIRT_PREINBOUND
        }

        chain input {
                type nat hook input priority 100; policy accept;
        }

        chain output {
                type nat hook output priority -100; policy accept;
                ip daddr 127.0.0.1 counter packets 0 bytes 0 dnat to 10.0.2.2
        }

        chain postrouting {
                type nat hook postrouting priority srcnat; policy accept;
                ip saddr 10.0.2.2 counter packets 3 bytes 202 masquerade
                oifname "k6t-eth0" counter packets 5 bytes 624 jump KUBEVIRT_POSTINBOUND
        }

        chain KUBEVIRT_PREINBOUND {
                counter packets 3 bytes 180 dnat to 10.0.2.2
        }

        chain KUBEVIRT_POSTINBOUND {
                ip saddr 127.0.0.1 counter packets 0 bytes 0 snat to 10.0.2.1
        }
}

And bingo, the NATting happens right here. As you can see, the prerouting chain jumps every packet coming in on eth0 to the KUBEVIRT_PREINBOUND chain, which DNATs the destination to 10.0.2.2, while the postrouting chain masquerades the traffic coming back from 10.0.2.2.
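Just to make the pattern explicit, here is a minimal standalone sketch of the same DNAT + masquerade trick in plain nft. This is not KubeVirt's actual code, just the same idea replayed by hand (the table and chain names are mine, the 10.0.2.2 target comes from the output above):

nft add table ip demo_nat
nft 'add chain ip demo_nat prerouting { type nat hook prerouting priority dstnat; }'
nft add rule ip demo_nat prerouting iifname "eth0" counter dnat to 10.0.2.2       # whatever hits the pod IP is forwarded to the guest
nft 'add chain ip demo_nat postrouting { type nat hook postrouting priority srcnat; }'
nft add rule ip demo_nat postrouting ip saddr 10.0.2.2 counter masquerade         # the guest's traffic leaves with the pod IP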

Great, now we understand how KubeVirt works with the pod IP and how the VM behaves. In the next part, maybe I'll try it with multus-cni.


Fun fact about the KubeVirt DHCP server

As you can see, KubeVirt runs its own DHCP server inside the virt-launcher pod, and the KubeVirt devs decided to write an empty file just to mark that the DHCP server is already running.
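You can actually poke at that DHCP server from inside the guest. Assuming the cirros image ships busybox udhcpc (it normally does), re-running the handshake makes it hand out the 10.0.2.2 lease again:

# inside the cirros VM
sudo udhcpc -i eth0 -n -q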

[dhcp-server screenshot]

[empty-file screenshot]

source code

giga chad dev
[giga-chad meme]
