VyOS and the mystery conntrack counter

VyOS

Router Go Fast!

For those that know me, it’s no secret I’m a huge VyOS fan.  I moved away from pfSense after they released a version that had issues with more than 1 vCPU, and tried VyOS instead.  Once I’d seen how much better it performed under Proxmox I stayed on it and have never looked back.
I really feel the pfSense project lost its way when they fucked up Wireguard and then lashed out at everyone who was just trying to help get it into FreeBSD.  But I digress.

Flow Offload (Flowtable) Bug?

I recently upgraded from VyOS 1.3 to VyOS 1.4.  1.4 is a huge step forward for the project as it moves from iptables to nftables.  It also brings some great new features, like Flowtable (software/hardware flow offload).  This means that once a flow has an established conntrack entry, all future packets that match this flow take a fast path that skips most of the normal forwarding and firewall processing, assuming you have a rule to allow this like so:

firewall {
    flowtable FastVyos {
        description "Vyos Fast Software Offload Table"
        interface eth1
        interface eth0
        offload software
    }
    ipv4 {
        forward {
            filter {
                default-action accept
                description "Filter for packets being forwarded through the router"
                rule 10 {
                    action offload
                    description "Offload Established TCP and UDP Traffic"
                    offload-target FastVyos
                    protocol tcp_udp
                    state established
                    state related
                }
            }
        }
    }
}

This results in better performance, and on older/slower hardware it will increase the number of packets-per-second that a device can handle.  This is a very good thing!  You can tell it’s working when entries in your conntrack table are flagged [OFFLOAD]:

conntrack -L -u offload
<snip>
tcp      6 src=x.x.x.x dst=x.x.x.x sport=43392 dport=x packets=2317 bytes=192092 src=192.168.0.5 dst=x.x.x.x sport=x dport=43392 packets=2644 bytes=149748 [OFFLOAD] mark=0 use=2
tcp      6 src=x.x.x.x dst=x.x.x.x sport=55400 dport=x packets=1336 bytes=111233 src=192.168.0.5 dst=x.x.x.x sport=x dport=55400 packets=1535 bytes=88580 [OFFLOAD] mark=0 use=2
tcp      6 src=x.x.x.x dst=x.x.x.x sport=49836 dport=x packets=850 bytes=70559 src=192.168.0.5 dst=x.x.x.x sport=x dport=49836 packets=1060 bytes=58247 [OFFLOAD] mark=0 use=2
tcp      6 src=x.x.x.x dst=x.x.x.x sport=60797 dport=x packets=10014 bytes=827040 src=192.168.0.5 dst=x.x.x.x sport=x dport=60797 packets=10125 bytes=570347 [OFFLOAD] mark=0 use=2
conntrack v1.4.7 (conntrack-tools): 693 flow entries have been shown.

 

Once I’d turned on Flowtable though, I started to have issues with Firebase Cloud Messaging on my Android phones.  It’d keep timing out and I wouldn’t get push notifications until I woke up my phone.  I spent ages debugging, Wiresharking, and testing with Flowtable on and off.  It always worked with Flowtable off, but would fail/disconnect with Flowtable enabled.  In the end, convinced I had found a bug in nftables (quite the accusation to make!), I logged a bug in the Netfilter BugTracker.  Turns out I was actually correct: there was an issue with PPPoE encapsulation and Flowtable.  I’d actually switched ISPs to one that does DHCP (not because of the bug!) and I hadn’t noticed the problem with DHCP, so it was good validation to see it was a PPPoE + Flowtable bug.
I should point out for any PPPoE users out there, the bug is fixed in Linux 6.6.30 onwards, which the latest VyOS 1.5 rolling images are using.

Conntrack Clashes?

So now my router was working perfectly, but for some reason at some stage I decided to look at the conntrack table statistics.  I just like to see how things work “under the hood” I guess.

Wait, what’s this? What the hell is clash_resolve in my conntrack statistics and why is it going up by ~300 a minute? That can’t be a good thing, can it?

tim@ferrari:~$ conntrack -S
cpu=0 found=13872 invalid=64978 insert=0 insert_failed=2130 drop=2130 early_drop=0 error=1966 search_restart=0 clash_resolve=1091611 chaintoolong=0 
cpu=1 found=13353 invalid=64876 insert=0 insert_failed=2164 drop=2164 early_drop=0 error=1760 search_restart=0 clash_resolve=1098408 chaintoolong=0
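
If you want to watch how fast a counter like that is moving rather than eyeballing it, a small Python sketch along these lines can add up a counter across the per-CPU lines and turn two samples into a per-minute rate.  It assumes the `conntrack -S` output looks like the lines above; the function names are made up for illustration and are not part of conntrack-tools.

```python
import re

# Two lines copied from the `conntrack -S` output shown above.
SAMPLE = """\
cpu=0 found=13872 invalid=64978 insert=0 insert_failed=2130 drop=2130 early_drop=0 error=1966 search_restart=0 clash_resolve=1091611 chaintoolong=0
cpu=1 found=13353 invalid=64876 insert=0 insert_failed=2164 drop=2164 early_drop=0 error=1760 search_restart=0 clash_resolve=1098408 chaintoolong=0
"""


def parse_conntrack_stats(text):
    """Turn `conntrack -S` output into one dict of integer counters per CPU line."""
    rows = []
    for line in text.splitlines():
        fields = dict(re.findall(r"(\w+)=(\d+)", line))
        if "cpu" in fields:  # skip anything that is not a per-CPU stats line
            rows.append({name: int(value) for name, value in fields.items()})
    return rows


def total(rows, counter):
    """Sum one counter (e.g. clash_resolve) across all CPUs."""
    return sum(row.get(counter, 0) for row in rows)


def per_minute(before, after, seconds):
    """Rate of increase per minute between two samples taken `seconds` apart."""
    return (after - before) * 60.0 / seconds


if __name__ == "__main__":
    rows = parse_conntrack_stats(SAMPLE)
    print("clash_resolve total:", total(rows, "clash_resolve"))
```

In practice you would capture the output twice, some seconds apart, and feed the two totals to `per_minute` to see whether the counter is still climbing.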

I spent a lot of time googling, but real information about what it does was hard to find.  These were the main links I found that offered some insight.

It turns out what a clash_resolve is, at least to my understanding, is this: when conntrack tries to insert a new entry and finds there’s already an entry for that tuple (protocol, source IP, source port, destination IP, destination port), usually because two packets of the same brand-new flow raced each other, it doesn’t just drop the packet.  Instead it resolves the clash, as far as I can tell by attaching the later packet to the entry that got there first, and counts that under clash_resolve.  I haven’t explained that very well because I never could quite find out exactly what was going on to a level I felt I understood.  Probably I’m too stupid really, so if you have a better plain English explanation I’d welcome it!
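
One rough way to picture a clash is the toy model below.  It is only a sketch of the idea (two packets of the same new flow each build an entry, and the second one to be inserted finds the tuple already taken); it is not the kernel’s code, and the class and method names are invented for illustration.

```python
from typing import Dict, NamedTuple


class Flow(NamedTuple):
    """The tuple that identifies a flow in this toy model."""
    proto: str
    src: str
    sport: int
    dst: str
    dport: int


class ToyConntrack:
    """Toy connection table: an assumption-laden model, not real conntrack."""

    def __init__(self) -> None:
        self.table: Dict[Flow, dict] = {}
        self.clash_resolve = 0

    def new_entry(self, flow: Flow) -> dict:
        # A packet for an unknown flow gets a fresh entry that is not in the table yet.
        return {"flow": flow, "packets": 1}

    def confirm(self, entry: dict) -> dict:
        # Try to insert the entry; return whichever entry ends up owning the flow.
        existing = self.table.get(entry["flow"])
        if existing is None:
            self.table[entry["flow"]] = entry
            return entry
        if existing is entry:
            return entry
        # Clash: another packet inserted the same tuple first.  Resolve it by
        # reusing the existing entry instead of dropping this packet.
        existing["packets"] += entry["packets"]
        self.clash_resolve += 1
        return existing


if __name__ == "__main__":
    ct = ToyConntrack()
    flow = Flow("udp", "192.168.0.253", 48288, "192.168.0.1", 53)
    first = ct.new_entry(flow)   # e.g. one DNS query
    second = ct.new_entry(flow)  # a second query from the same socket, racing the first
    ct.confirm(first)
    ct.confirm(second)
    print("entries:", len(ct.table), "clash_resolve:", ct.clash_resolve)
```

The point of the model is only that a clash ends with one entry in the table and a counter bump, not with a lost packet.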

But I did find the cause of the problem.  My VyOS router runs a caching nameserver; it’s my home router, so it makes sense for it to cache most DNS lookups.  I have a Zabbix server at home too and it generates A LOT of DNS requests.  I found that as soon as I turned off my Zabbix server, clash_resolve stopped incrementing.  After looking at the conntrack table I realised there were hundreds of conntrack entries between the DNS server on the router and my Zabbix server:

<snip snip>
udp      17 19 src=192.168.0.253 dst=192.168.0.1 sport=48288 dport=53 packets=2 bytes=124 src=192.168.0.1 dst=192.168.0.253 sport=53 dport=48288 packets=2 bytes=189 mark=0 use=1
udp      17 12 src=192.168.0.253 dst=192.168.0.1 sport=56102 dport=53 packets=1 bytes=68 src=192.168.0.1 dst=192.168.0.253 sport=53 dport=56102 packets=1 bytes=84 mark=0 use=1
udp      17 12 src=192.168.0.253 dst=192.168.0.1 sport=56240 dport=53 packets=2 bytes=136 src=192.168.0.1 dst=192.168.0.253 sport=53 dport=56240 packets=2 bytes=201 mark=0 use=1
udp      17 28 src=192.168.0.253 dst=192.168.0.1 sport=38695 dport=53 packets=2 bytes=124 src=192.168.0.1 dst=192.168.0.253 sport=53 dport=38695 packets=2 bytes=189 mark=0 use=1
udp      17 10 src=192.168.0.253 dst=192.168.0.1 sport=33689 dport=53 packets=2 bytes=128 src=192.168.0.1 dst=192.168.0.253 sport=53 dport=33689 packets=2 bytes=193 mark=0 use=1
udp      17 13 src=192.168.0.253 dst=192.168.0.1 sport=36236 dport=53 packets=2 bytes=128 src=192.168.0.1 dst=192.168.0.253 sport=53 dport=36236 packets=2 bytes=193 mark=0 use=1
udp      17 10 src=192.168.0.253 dst=192.168.0.1 sport=49932 dport=53 packets=2 bytes=116 src=192.168.0.1 dst=192.168.0.253 sport=53 dport=49932 packets=2 bytes=181 mark=0 use=1
udp      17 3 src=192.168.0.253 dst=192.168.0.1 sport=54581 dport=53 packets=2 bytes=126 src=192.168.0.1 dst=192.168.0.253 sport=53 dport=54581 packets=2 bytes=191 mark=0 use=1
udp      17 1 src=192.168.0.253 dst=192.168.0.1 sport=40209 dport=53 packets=2 bytes=126 src=192.168.0.1 dst=192.168.0.253 sport=53 dport=40209 packets=2 bytes=191 mark=0 use=2
udp      17 12 src=192.168.0.253 dst=192.168.0.1 sport=48388 dport=53 packets=2 bytes=136 src=192.168.0.1 dst=192.168.0.253 sport=53 dport=48388 packets=2 bytes=201 mark=0 use=1
udp      17 4 src=192.168.0.253 dst=192.168.0.1 sport=51367 dport=53 packets=2 bytes=128 src=192.168.0.1 dst=192.168.0.253 sport=53 dport=51367 packets=2 bytes=193 mark=0 use=1
conntrack v1.4.7 (conntrack-tools): 167 flow entries have been shown.
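
Spotting a pattern like this by eye gets tedious with hundreds of entries.  As a sketch, assuming `conntrack -L` lines in the format shown above, the following Python groups entries by protocol, source, destination and destination port so the noisiest pairs float to the top.  The helper names are hypothetical.

```python
import re
from collections import Counter

# Three lines copied from the conntrack listing above.
SAMPLE = """\
udp      17 19 src=192.168.0.253 dst=192.168.0.1 sport=48288 dport=53 packets=2 bytes=124 src=192.168.0.1 dst=192.168.0.253 sport=53 dport=48288 packets=2 bytes=189 mark=0 use=1
udp      17 12 src=192.168.0.253 dst=192.168.0.1 sport=56102 dport=53 packets=1 bytes=68 src=192.168.0.1 dst=192.168.0.253 sport=53 dport=56102 packets=1 bytes=84 mark=0 use=1
udp      17 12 src=192.168.0.253 dst=192.168.0.1 sport=56240 dport=53 packets=2 bytes=136 src=192.168.0.1 dst=192.168.0.253 sport=53 dport=56240 packets=2 bytes=201 mark=0 use=1
conntrack v1.4.7 (conntrack-tools): 167 flow entries have been shown.
"""

TUPLE_RE = re.compile(r"src=(\S+) dst=(\S+) sport=(\S+) dport=(\S+)")


def original_direction(line):
    """Return (proto, src, dst, dport) for the first tuple on a line, or None."""
    words = line.split()
    if not words or words[0] not in ("tcp", "udp"):
        return None  # summary lines, <snip> markers, blank lines
    match = TUPLE_RE.search(line)  # first match is the original direction
    if match is None:
        return None
    src, dst, _sport, dport = match.groups()
    return words[0], src, dst, dport


def top_talkers(text, n=5):
    """Count entries per (proto, src, dst, dport), ignoring the source port."""
    counts = Counter()
    for line in text.splitlines():
        key = original_direction(line)
        if key is not None:
            counts[key] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    for key, count in top_talkers(SAMPLE):
        print(count, key)
```

Ignoring the source port is deliberate: each DNS query uses a fresh ephemeral port, so grouping without it is what makes a single chatty client stand out.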

Fixing Conntrack

And here’s the fix: Those connections don’t need to be in conntrack! There’s no NAT going on and I’m not doing any firewalling.  It’s local LAN traffic.  So the fix is to put in an exception rule, so that all traffic from my LAN, talking to my DNS Server on the router, bypasses conntrack.  Meaning there’s no state at all, it’s just a normal routed packet.
Note that you have to create a rule in both directions, otherwise the router sending back replies generates a conntrack entry.

The configuration looks like this, placed under the “system conntrack” stanza:

show system conntrack
 ignore {
     ipv4 {
         rule 10 {
             description "Ignore Conntrack for LAN DNS Requests to Router"
             destination {
                 address 192.168.0.1
                 port 53
             }
             inbound-interface eth1
             protocol udp
             source {
                 address 192.168.0.0/24
             }
         }
         rule 200 {
             description "Ignore Conntrack for LAN DNS Replies from Router"
             destination {
                 address 192.168.0.0/24
             }
             protocol udp
             source {
                 address 192.168.0.1
                 port 53
             }
         }
     }
 }

With this in place, the conntrack table statistics for clash_resolve have stopped going up rapidly.  I still see some increasing, but that’s expected.  In fact clash_resolve isn’t even a problem as such, it’s just saying a clash was noticed and resolved.
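
To see why a rule is needed in each direction, the two ignore rules can be modelled as a simple match function.  This is only a Python sketch of the matching logic as written in the config above (LAN 192.168.0.0/24, router at 192.168.0.1, LAN interface eth1); it is not how VyOS or nftables evaluates the rules internally, and the function name is made up.

```python
from ipaddress import ip_address, ip_network

LAN = ip_network("192.168.0.0/24")
ROUTER = ip_address("192.168.0.1")


def bypasses_conntrack(proto, src, sport, dst, dport, in_iface=None):
    """Return True if a packet matches either ignore rule in this toy model."""
    src_ip, dst_ip = ip_address(src), ip_address(dst)
    if proto != "udp":
        return False
    # Rule 10: DNS queries arriving on eth1 from the LAN to the router.
    if in_iface == "eth1" and src_ip in LAN and dst_ip == ROUTER and dport == 53:
        return True
    # Rule 200: DNS replies from the router back to the LAN.  These are
    # generated locally, so the rule has no inbound interface to match on.
    if src_ip == ROUTER and sport == 53 and dst_ip in LAN:
        return True
    return False


if __name__ == "__main__":
    query = ("udp", "192.168.0.253", 48288, "192.168.0.1", 53, "eth1")
    reply = ("udp", "192.168.0.1", 53, "192.168.0.253", 48288)
    print("query bypasses:", bypasses_conntrack(*query))
    print("reply bypasses:", bypasses_conntrack(*reply))
```

Without the second branch the query would skip conntrack but the reply would not, which is the one-sided case the note about needing both directions warns about.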

I’ve also saved myself 650+ entries in the conntrack table that I didn’t need:

[Before the change above was made]
tim@ferrari# conntrack -L -s 192.168.0.0/24 -d 192.168.0.0/24
<snip>
conntrack v1.4.7 (conntrack-tools): 665 flow entries have been shown.

And my VyOS router is as performant as ever!

Tim
