Hi, I’m a new user of ipfire, trying it after the hardware running pfsense needed replacing.
A failure to reconnect to the external network after a power cycle is reproducible. I am on the stable build (IPFire 2.27 (x86_64) - Core Update 172).
The problem is as described above: my HFC modem takes 30 to 60s to reconnect.
By that time, ipfire has already booted and failed to get a DHCP lease, so my home network does not work.
Power outages are rare and my home network gear is not protected by a UPS (although I have now ordered a small one, which should provide around 30 minutes). Even so, a UPS provides only a limited amount of backup time, and when we do get a power outage, it is often longer than that. I want my home network to be resilient: in case I am travelling, I want everything to come back on its own. My server will, my IP security devices (e.g. cameras) will, but right now my connection to the internet won't.
I don’t think I can use ipfire without this resilience. I am very surprised by this problem.
Is there really no workaround, no retry?
EDIT: Could I try a cron script that reboots if I can’t ping google.com for example?
I don’t understand that. I don’t know what Unbound is (new user here).
In the past 30 minutes, I discovered watchdog, and I have configured it to reboot when it cannot successfully ping 8.8.8.8 (Google DNS).
I just tested it, and it works. After powering off all network gear, the “race” condition happens predictably, and ipfire does not establish an internet connection (RED interface). But this time, watchdog detects that, and reboots ipfire. After booting, the connection is established (since the modem is by now ready and waiting).
For future readers, this means
a) installing the watchdog add-on (from ‘pakfire’)
b) editing the existing file /etc/watchdog.conf
All you have to do is copy one of the inactive ping tests, uncomment the line, and choose your test IP address. There is no need to touch anything else. In particular, do not uncomment the #watchdog-device line near the top; it works just fine with that line commented out.
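For illustration, here is roughly what the relevant lines look like afterwards (a sketch based on my file; 8.8.8.8 is just the address I chose):

# /etc/watchdog.conf (excerpt)
# leave the device line commented out:
#watchdog-device = /dev/watchdog
# one ping test uncommented, with the address to check:
ping = 8.8.8.8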
ps -ef | grep watchdog will tell you if the service is running. For me, it was working immediately after a reboot.
I’ve just tried this.
But the immediate shutdown clears the statistics held in RAM (the system graphs, for example). This doesn't matter after a power failure, but it does in the case of a temporary DHCP problem.
Would it be advisable to define ‘ipfirereboot boot’ as the repair program?
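For reference, watchdog does support a repair program via its repair-binary directive; a sketch of the relevant /etc/watchdog.conf lines might look like this (the script path is only an example, and if the repair program fails, watchdog falls back to a reboot):

# /etc/watchdog.conf (excerpt)
# run this program first when a test fails; if it does not
# fix the problem, watchdog falls back to rebooting the machine
repair-binary = /usr/local/bin/netrepair.sh
repair-timeout = 60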
I struggled with the same issue: when my gateway (my cable modem) rebooted at the same time as my firewall, I'd have a non-working network.
Buying the right size UPS solved all of my issues. I have a 1500W CyberPower UPS. No more power failures! (hint: you may need a bigger UPS to last a little longer).
The UPS powers my:
Cable Modem
Firewall
Wifi Access Point
and three Netgear hubs
note: I don’t need all three hubs and could easily remove one.
When I set this all up I knew the power draw of everything and how long the CyberPower UPS would last (sorry, I don’t remember these numbers at the moment).
This has an unexpected benefit: when the power went out to our house and the neighborhood, we still had Internet in the house. Not expected, but still somewhat nice!
I think pretending that a UPS is the solution to recovery from a power outage is just wrong. It’s an expensive and unreliable workaround. It solves other problems: power filtering and short-term power outages. In my case, I’ve never had a power surge, and power outages are rare, but when they happen they are long (utility maintenance, or a large-scale failure such as a major storm). I suspect this outage profile is fairly common in urban areas in the developed world. The idea of spending $300 on a solution that only works some of the time is not appealing when the problem is a software issue.
I think it is best not to suggest this as a solution to the problem of ipfire not retrying failed connections; I suspect I am not the only person who would find it frustrating. This is the first router where I have experienced the problem.
It is clear that there is a $0 software fix (rebooting with watchdog), which is therefore already a vastly better solution to the problem. I will now optimise it to use the suggested recovery executable and avoid the reboot. [EDIT: I misunderstood. I was hoping that ipfirereboot did some kind of ipfire restart rather than a machine reboot, but in fact it is a machine reboot.]
Also, I am new to ipfire and not sure how robust it is when the internet connection drops out, which happens much more often than power outages; watchdog will deal with that too.
I have in any case bought a small UPS to provide line filtering for my network hardware. I will connect only the router hardware to the battery backup, which buys probably about 30 minutes.
Sorry if I have not explained myself well; English is not my mother tongue and I have to use a translator…
What I was trying to say is: if the UnboundWD attachment were modified, calling it NetWD for example, and the code changed as follows, wouldn’t that work for this purpose?
#!/bin/bash
# Ping test options: -q quiet, -c number of pings to send
service=network

if ping -q -c 5 google.com > /dev/null; then
    echo "ok"
else
    # No reply: restart the network service to re-establish connections
    /etc/init.d/$service restart
fi
When you run “/etc/init.d/network restart”, all network interfaces are restarted and connections are re-established. Wouldn’t this work better than rebooting IPFire completely?
This worked. After watchdog noticed the internet connection was not working, it restarted the network service. After a few seconds the internet connection was working again.
This is the fastest and most elegant method to fix the problem.
Possibly this could be documented in the wiki. I don’t mind doing that, but I will document what I understand, which is the use of watchdog. Perhaps the fcron solution is better, since it does not need watchdog to be installed.
While I initially suggested cron, I did not know about watchdog at the time, and I like that watchdog is so configurable, including a fallback to rebooting if the repair binary does not work. It gives me a lot of confidence: I have tested the power cycle nearly ten times now, and both the service restart and the reboot have fixed the problem every time. So if the first approach, the service restart, does not work, the next step is a reboot. And this solution needs only a small number of steps.
I have edited my reply to be more friendly for machine translation, I hope.
Just setting the link down and up doesn’t do anything more.
Checking the status of RED via DNS is a very bad idea, and a reboot also seems like a bad idea to me.
You will enter a reboot loop whenever your modem is offline or your DNS just does not work. Not good.
That’s why I recommend writing a script with ethtool, ifconfig and grep to get basic information about RED.
First: is there even an active link on RED?
ethtool red0 | grep "Link detected"
Second: if ‘Link detected: yes’, check whether the interface is set to DHCP:
grep "RED_TYPE" /var/ipfire/ethernet/settings
Third: If the interface is set to DHCP, check the IP:
ifconfig red0 | grep "inet"
If the IP is 0.0.0.0 or 169.254.x.x, then renew the lease for RED. Note that ifconfig itself cannot renew a DHCP lease; that has to go through the DHCP client, for example by telling the running dhcpcd to rebind:
dhcpcd -n red0
However, somebody needs to help out with the code so that we really extract only the information needed with grep (or anything else that makes the if/else logic easier).
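As a rough, untested sketch of those three checks in one script (the dhcpcd call at the end is an assumption, since ifconfig itself cannot renew a lease, and the ifconfig parsing assumes the modern net-tools output format):

#!/bin/bash
# Rough sketch: check the state of RED step by step.

# 1) Is there a physical link on red0?
if ! ethtool red0 | grep -q "Link detected: yes"; then
    echo "no link detected on red0"
    exit 1
fi

# 2) Is RED configured for DHCP?
if ! grep -q "RED_TYPE=DHCP" /var/ipfire/ethernet/settings; then
    echo "RED is not set to DHCP, nothing to do"
    exit 0
fi

# 3) Does red0 have a usable IPv4 address?
#    (assumes output like "inet 192.0.2.10  netmask ...")
ip=$(ifconfig red0 | grep "inet " | awk '{print $2}')
case "$ip" in
    ""|0.0.0.0|169.254.*)
        # no valid lease: ask the running dhcpcd to rebind red0
        dhcpcd -n red0
        ;;
    *)
        echo "red0 has IP $ip"
        ;;
esac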
These techniques might be necessary for aarch64, but for x86_64 a much simpler approach should work.
The GRUB bootloader boots the default menu item after an 8-second delay. Change that to “set timeout=60”. After a proof of concept it would be necessary to also make the equivalent alteration in /etc/grub.d/00_header as well as in /etc/default/grub, so that the change persists when the grub.cfg file is regenerated.
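As a sketch of the persistent part, assuming the usual GRUB file layout:

# /etc/default/grub
GRUB_TIMEOUT=60

# then regenerate the config so the change lands in grub.cfg:
grub-mkconfig -o /boot/grub/grub.cfg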
That is probably not long enough. dhcpcd, which gets the IP lease on the red WAN connection, already has a timeout of 60 seconds defined for how long it will try to get an IP from the WAN connection, so the modem must be taking longer than 60 seconds to reconnect.
That timeout in dhcpcd.conf could be made longer, but as IPFire 2.x works via sysvinit, nothing else will be done in the boot-up cycle until that timeout has passed or an IP has been obtained from the modem connection.
The timeout used to be the dhcpcd.conf default of 30 seconds, but in mid-2021 the value was increased to the current 60 seconds.
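For anyone experimenting with that, it is a single directive in dhcpcd.conf (120 is just an example value; the file's location varies by installation):

# dhcpcd.conf
# how many seconds dhcpcd waits for a lease before giving up
timeout 120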
dhcpcd gives up trying to get an IP after some time. Some sort of ‘watchdog’ which restarts the red interface can help. The condition must be defined (ping an IP, analysis of the interface state, …).
Some ISPs (at least mine) do not handle the DHCP protocol adequately. A renew request isn’t answered, resulting in a rebind after a number of retries. I haven’t found the reason for this behaviour yet, because so far I haven’t found competent technical support at Vodafone.
In our experience (mine and other topics in the community), it is sufficient to restart the DHCP client, which an interface restart does.
The ‘watchdog’ can be just a little script testing the connectivity and periodically started by fcrontab.
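For example, a sketch of such an fcrontab entry, calling a hypothetical check script every five minutes:

# edit the table with: fcrontab -e
*/5 * * * * /usr/local/bin/checkred.sh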
The only open problem is a sufficient test condition: ping or interface state.
Because it is a bit tricky to reproduce the problem, I can’t really give an adequate condition.
My analysis of threads about this problem has not shown a common situation/state so far.
I think there are people here hinting at better solutions.
However, I have written up what I did (with the idea of submitting it to the wiki).
I agree that this means the ipfire machine could potentially go into a reboot loop if there is an innocent reason for Google DNS to be offline. But to put this in context: within only two hours of swapping to ipfire, I reproduced in testing what would be for me a complete disaster: no remote access to my home office network after a power failure. I bought a UPS for the network gear right away, but that is not a deterministic solution. I think I am not the only one who considers the risk of a reboot loop by far the lesser of two evils. I do not doubt that testing by pinging 8.8.8.8 is far from the best solution.
I don’t know enough to easily come up with a better solution.
I am also surprised that we have to reach for watchdog-style workarounds; I would have thought that recovery from this situation is core business for a router. It seems we need to find a way for developers to reproduce this. Can it not be reproduced by physically disconnecting an ipfire machine from the RED network, booting it, waiting for DHCP to fail, and then reconnecting RED?
UPDATE:
I’ve commented out clientid, duid, and option rapid_commit.
Since then the effect has gone.
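For reference, the disabled lines in dhcpcd.conf look like this:

# dhcpcd.conf (excerpt) - the three options commented out
#clientid
#duid
#option rapid_commit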
Will try stepwise re-enabling of these settings and report. Cases:
clientid : ok
clientid,rapid_commit : ok
duid : ok
duid, rapid_commit : ok
clientid, duid : ok
clientid, duid, rapid_commit : ok
Seems my problem has gone
Don’t know why.
Test: Modem power down for ~3 min.
After power up and the modem restart (~4 min!) there were no problems; the IP is acquired.