DNS over TLS forward zone error

Hi,

IPFire 2.25 Core 153

Unbound is set up with the supplied unbound.conf file (note that this file has some temporary options):

#
# Unbound configuration file for IPFire
#
# The full documentation is available at:
# https://nlnetlabs.nl/documentation/unbound/unbound.conf/
#

server:
# CUSTOM
val-permissive-mode: yes

# Common Server Options
chroot: ""
directory: "/etc/unbound"
username: "nobody"
do-ip6: no
port: 53

# System Tuning
include: "/etc/unbound/tuning.conf"

# Logging Options
use-syslog: yes
log-time-ascii: yes

# Unbound Statistics
statistics-interval: 86400
extended-statistics: yes

# Prefetching
prefetch: yes
prefetch-key: yes

# Randomise any cached responses
rrset-roundrobin: yes

# Privacy Options
hide-identity: yes
hide-version: yes

# DNSSEC
auto-trust-anchor-file: "/var/lib/unbound/root.key"
#trust-anchor-file: "/var/lib/unbound/root.key"
val-log-level: 2
log-servfail: yes

# Hardening Options
harden-large-queries: yes
harden-referral-path: yes
aggressive-nsec: yes

# TLS
tls-cert-bundle: /etc/ssl/certs/ca-bundle.crt

# EDNS Buffer Size (#12240)
edns-buffer-size: 1232

# Harden against DNS cache poisoning
unwanted-reply-threshold: 1000000

# Listen on all interfaces
interface-automatic: yes
interface: 0.0.0.0

# Allow access from everywhere
access-control: 0.0.0.0/0 allow

# Bootstrap root servers
root-hints: "/etc/unbound/root.hints"

# Include DHCP leases
include: "/etc/unbound/dhcp-leases.conf"

# Include hosts
include: "/etc/unbound/hosts.conf"

# Include any forward zones
include: "/etc/unbound/forward.conf"

remote-control:
control-enable: yes
control-use-cert: no
control-interface: 127.0.0.1

# Import any local configurations
include: "/etc/unbound/local.d/*.conf"

Here’s the forward.conf:

# This file is automatically generated and any changes
# will be overwritten. DO NOT EDIT!

forward-zone:
name: "."
forward-tls-upstream: yes
forward-addr: 8.8.8.8@853#dns.google
forward-addr: 8.8.4.4@853#dns.google

resolv.conf:

search [the firewall hostname]
nameserver 127.0.0.1
options trust-ad

OK, I struggled to death to get the correct anchor file, but that was eventually resolved (get it…) by passing the -f parameter to unbound-anchor to supply my own resolv.conf file (one with a nameserver of 8.8.8.8 instead of 127.0.0.1). That gave me the correct anchor file.

But it still didn't work, so I found a topic on here and an answer from Michael that said to try switching the PROTO option on the dns.cgi WUI page between UDP/TCP/TLS. Now here is where it gets interesting:

On UDP: it just gives an unbound error saying the anchor isn't trusted.
On TCP: same error as UDP - the anchor isn't trusted.
On TLS: it works exactly as expected, so kudos. Perfect.

But now comes my problem…

When I restart the machine, DNS fails completely with error:

<SERVFAIL> <domain - this includes all domains that need resolving, A/AAAA IN>: all the configured stub or forward servers failed, at zone .

Obviously this goes without saying, but I'm gonna say it anyway: unbound is running after the restart, it's just failing with that error in /var/log/messages.

The forward.conf file remains unchanged between reboots, and so does the unbound.conf file - I double-checked that.

ONLY WAY TO FIX IT:
At this point, if you check the firewall connections page, the firewall isn't opening any connections to the nameservers (8.8.8.8, 8.8.4.4) on ports 53 or 853.

Step 1: Go onto the WUI -> Network -> Domain Name System and just click the Save button without changing any options.

This makes it so that the connections page starts showing connections to the nameservers on port 853, but there are still no DNS resolutions for firewall clients (clients connected to the green interface). The unbound log also stops complaining about the forward zones, and name resolution on the IPFire server itself works (nslookup with 127.0.0.1, and dig).
If you check DNSSEC with dig on the CLI (example below), everything checks out - so unbound DNSSEC is now working perfectly.
Also, if you click the “Check DNS Servers” button on the Network -> Domain Name System page (dns.cgi), it says OK (before Step 1 it failed), but the Status title still says “broken”.
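
For reference, the dig check was along these lines (the domain is just an example):

dig +dnssec ipfire.org @127.0.0.1
# the reply should come back NOERROR with the "ad" (authenticated data) flag set in the header,
# and because of +dnssec the answer section should include RRSIG records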

Step 2: on the CLI, run /etc/init.d/unbound restart

This fixes it completely: now all clients can send DNS queries, you can see open connections on the connections page, and there are no more errors in the unbound log.
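
For completeness, a quick way to confirm from a client on green is something like this (using the firewall's green IP, shown here as a placeholder):

dig ipfire.org @192.168.x.x
# should return NOERROR with an answer, instead of SERVFAIL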

I inspected the dns.cgi file briefly; here is what I discovered:

Firstly, after you click the Save button it generates a YAML file with the correct nameservers configured on the DNS page. I assume that gets copied over to forward.conf by another service; I didn't investigate that far yet.

Also, I don't know Perl all that well, so it's slow going.

Then it goes on to issue the system command "suricatactrl restart".

As far as I can tell, that's all the WUI does after clicking the Save button.

Even if I issue the suricatactrl restart command on the CLI and then /etc/init.d/unbound restart, this still does not fix my problem.
If I remember right, the suricatactrl restart command also restarts the firewall service.
Also, I disabled the proxy and IPS/IDS services while resolving all the errors I got after the Core 153 update, so they are still disabled and not starting.

So I tried restarting the firewall (/etc/init.d/firewall restart) and then /etc/init.d/unbound restart too, with no success. I'm running out of things to check next; I can't spend three full days investigating and fixing something that should be working by default.

I need a fix for this.

Please can I get some feedback from someone who knows, or who maybe has had a similar issue?

Let me know if you need more log files or conf files

Thank you

Hi @vectorsolutions,

Welcome to the IPFire Community.

I am not an expert at all on DNS, but I do have a question as to what drove the modification of the resolv.conf file. There is no mention of what errors were being seen that led to changing that file from using localhost (i.e. IPFire) as the nameserver to using Google.

On my Core 153 IPFire I have the default resolv.conf file with nameserver 127.0.0.1, and I have 6 DNS over TLS servers specified on the DNS WUI page; they are all working with no errors. This status is maintained over reboots.


No, I didn't alter the resolv.conf file.

It’s just an option on the unbound-anchor binary that lets you specify a different resolver since the binary can’t reach the required domain on 127.0.0.1 while you are still setting up unbound.

Thus, while unbound can't resolve names yet, trying to fetch the anchor using 127.0.0.1 as the nameserver will fail.

Here unbound specifically mentions this problem in their documentation.

(NLnet Labs Documentation - Unbound - unbound-anchor.8)

See the -f switch.

So basically I cp'd the /etc/resolv.conf file to a temp file (/etc/resolv-tmp.conf) and then specified that file using the -f switch (unbound-anchor -f /etc/resolv-tmp.conf -a /var/lib/unbound/root.key -vv).
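
So the full sequence was roughly this (the temp file name is just what I picked, nothing special):

cp /etc/resolv.conf /etc/resolv-tmp.conf
# edit the temp file so the nameserver line reads 8.8.8.8 instead of 127.0.0.1
unbound-anchor -f /etc/resolv-tmp.conf -a /var/lib/unbound/root.key -vv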

And yes, your anchor shouldn't be a problem because you probably installed from a more recent core release. This installation was on Core 113; when I updated, it broke everything and I had to fix a metric ton of stuff.

In my research I read that the root trust anchor changed in 2017 or something (the root KSK rollover), so I had to get a new anchor for DNSSEC to work after I upgraded to Core 153.
The devs will know more about which core update it was when the anchor changed.

Thanks for the clarification. I better understand the difficulties you are having.

Unfortunately this problem is beyond my capabilities to help with.

Hopefully you will get feedback from others better capable of helping on this topic.


At least you engaged with me; thanks for trying.

Maybe my troubles will help you someday, if you run into a similar issue.

PS: DNSSEC on Core 113 was also set up and worked fine, just without TLS.


*** UPDATE ***

/etc/init.d/network stop
/etc/init.d/network start

This recreates the problem without rebooting.

Bringing up the green0 interface...
Adding IPv4 address x.x.x.x to green0 interface... #Correct
Bringing up the red0 interface...
Adding IPv4 address x.x.x.x to the red0 interface... #Correct
Setting up the default gateway x.x.x.x... #Correct
Wait for carrier on red0............. # This is where I can clearly see there is some connectivity issue because it takes way longer than it should
Adding static routes...
Reloading firewall...
DNS not functioning... Trying to sync time with ntp.ipfire.org (81.3.27.46)...
RTNETLINK answers: No such file or directory... #Suspicious
Adding static routes...
Adding static routes
Mounting network filesystems...

Waiting for carrier on red0 takes (about) 17 dots to complete; I assume that's the timeout period.
The RTNETLINK message seems suspicious, but the network works (after the next part), and it looks like some driver option that isn't understood by the Ethernet PCI card (it's a D-Link).

Then by running two CLI commands I can restore the network to how it should be:

suricatactrl reload
unboundctrl restart

suricatactrl reload is the command run by the dns.cgi page (Step 1);
unboundctrl reload is also run by dns.cgi;
unboundctrl restart is the one I need to run to fix everything.

This doesn't make any sense to me; sure, I guess Suricata could totally block all network access, but I'm not getting any relevant log entries that would explain the lack of network access.
And note that dns.cgi only reloads unbound, while I have to restart it to restore full access.

This test was run with IPS/IDS active.
I will test it next with IPS inactive, then with IDS/IPS inactive.

Hi @vectorsolutions

Don’t worry about this message. I see it on my production physical IPFire system and also on my VM testbed systems.

Some people on other Linux systems have found it to be related to a kernel module (sch_netem) that has been compiled in but not loaded.

rtnetlink - Linux IPv4 routing socket - allows the kernel’s routing tables to be read and altered.

Network emulator (NETEM) - if you want to emulate network delay, loss, and packet re-ordering. This is often useful to simulate networks when testing applications or protocols.

In IPFire it could be due to another module related to RTNETLINK, but either way I don't believe it's a problem.
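
If you want to confirm whether that is what is happening on your box (assuming it really is sch_netem and not some other module), something like this should show it:

lsmod | grep sch_netem   # is the netem qdisc module currently loaded?
modprobe sch_netem       # load it by hand, then restart the network and see if the RTNETLINK message goes away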


Yea, I came to the same conclusion.

Also found NETEM

I'm not worried about that, but the total loss of network access is a BIG problem.
The client wants their firewall back; they JUST phoned me.

I told them I'm going to put it back, only with DNSSEC turned off,
and then fix it over the weekend.

I’m convinced it has something to do with the unbound protocol used.

When I stopped Suricata last night and retested, I couldn't get it back on the network until I selected UDP then Save, then TLS then Save.

So even with IPS/IDS completely turned off, the network still broke after a /etc/init.d/network restart.

When the network is restarted and the server has no access on red0, I can see these connections (on the connections page of the WUI):

TCP
Source: 127.0.0.1:xxxxx
Destination: 127.0.0.1:8953

I'm thinking it has something to do with iptables.
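
When I get a chance, these are the checks I have in mind (just a sketch, nothing authoritative):

# 8953 is unbound's own remote-control port, so that 127.0.0.1 connection is
# most likely just unbound-control (or the init script) talking to unbound
unbound-control status          # is unbound up and answering control commands?
unbound-control list_forwards   # is the "." forward zone with the TLS servers actually loaded?
iptables -L -v -n               # look for drops/rejects and dead counters around port 853
iptables -t nat -L -v -n        # and check the NAT table in case DNS is being redirected somewhere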

Hi @vectorsolutions

You indicated that the previous core version was 113 and that in going to 153 a lot of stuff broke which you had to fix.

If so much needed fixing, there could still be some things not correctly set up.
What about taking a backup of the system, re-installing Core 153 from scratch, confirming that DNS and unbound are working with the minimum of settings required, and then restoring the backup to see if that then works?


Hi,

Thanks for the help so far, Adolf.
Where is Michael? Why hasn't he chimed in?

I would like to know what is causing this; I'm one of those people who likes to solve the puzzle, even if it's no longer my job.

I spoke to the client and explained the situation.
They are opting for a new server, with a re-installation of IPFire and a re-setup of all the components and services.

I was only trying to fix the existing setup because it is rather complex, with RAID, Samba, IPS/IDS, proxy, DNS, DHCP, you know, the whole works. So even if I restore from a backup it's not going to be a seamless process; it's going to take a couple of days to set everything up again.

I was also dreading the RAID setup, as they have a very complex permissions setup on all the devices and folders.

Reinstalling will also not carry over all the data they have gathered so far, like data usage, file access, etc.
And even if I could migrate all that stuff, I don't want to, as it can break things; I've seen it happen.
For instance, after the update I had to reissue the OpenVPN and IPSec certs and re-create the users and nets. The relevant Status page still shows the OpenVPN and IPSec users even after I deleted them, so that part is going to be another day or two just to get fixed, and then it still might give errors later on.

So anyway, this is going to be a completely new installation on a rack mount server.

The issue no longer applies, but while the server is still running under test conditions I would still like to know what is causing this. So while I decide which server to quote them on, I'm still going to be fiddling to find the cause.

Let's try long shots… What else can I check?

PS: this server ran for 6 years with ZERO issues (oh, except for Guardian; I'm glad you finally adopted Suricata).
So I'm satisfied with how IPFire handled itself on this client's setup. Kudos!