Hi, I’m a new user of IPFire, trying it after the hardware running pfSense needed replacing.
A failure to reconnect to the external network after a power cycle is reproducible for me. I am on the stable build (IPFire 2.27 (x86_64) - Core-Update 172).
The problem is as described above: my HFC modem takes 30 to 60 seconds to reconnect.
By that time, IPFire has already booted and failed to get a DHCP lease, etc. My home network does not work.
Power outages are rare here, and my home network gear is not protected by a UPS (although I have now ordered a small one, which should provide around 30 minutes). Any UPS provides only a limited amount of backup, and when we do get a power outage, it is often longer than that. I want my home network to be resilient: if I am travelling, I want everything to come back by itself. My server will, my IP security devices (e.g. cameras) will, but right now my connection to the internet won’t.
I don’t think I can use ipfire without this resilience. I am very surprised by this problem.
Is there really no workaround, no retry?
EDIT: Could I try a cron script that reboots if I can’t ping google.com for example?
I don’t understand that. I don’t know what Unbound is (new user here).
In the past 30 minutes, I discovered the watchdog add-on, and I have configured it to reboot when it cannot successfully ping 184.108.40.206 (Google DNS).
I just tested it, and it works. After powering off all network gear, the “race” condition happens predictably, and ipfire does not establish an internet connection (RED interface). But this time, watchdog detects that, and reboots ipfire. After booting, the connection is established (since the modem is by now ready and waiting).
For future readers, this means:
a) installing the watchdog add-on (from Pakfire)
b) editing the existing file /etc/watchdog.conf
All you have to do is copy one of the inactive ping tests, uncomment the line and choose your test IP address. No need to touch anything else. In particular, do not uncomment the #watchdog-device line near the top; it works just fine with that line commented out.
ps -ef | grep watchdog will tell you if the service is running. For me, after a reboot, it was working immediately.
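For illustration, here is roughly what the relevant part of /etc/watchdog.conf looks like after step b). The ping address and interval below are my own choices, not values the add-on requires; only the ping line actually needs changing:

```
# /etc/watchdog.conf (excerpt)

# Leave this commented out - the daemon works fine
# without a hardware watchdog device:
#watchdog-device = /dev/watchdog

# Uncommented ping test: reboot when this address
# stops answering (pick your own test target):
ping = 8.8.8.8

# Seconds between test runs (the shipped default):
interval = 1
```

After editing, restart the watchdog service (or reboot) so the new test takes effect.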
I’ve just tried this.
But the immediate shutdown clears the statistics held in RAM (the system graphs, for example).
This doesn’t matter after a power failure, but it does in the case of a temporary DHCP problem.
Would it be advisable to define ipfirereboot boot as ‘repair program’?
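For reference, the watchdog daemon does support exactly this: its repair-binary directive names a program that is run first, and watchdog only falls back to a reboot if the test still fails afterwards. A sketch (the script path is a hypothetical name of mine, not something watchdog ships):

```
# /etc/watchdog.conf (excerpt)

# Test condition: this address no longer answers pings.
ping = 8.8.8.8

# Try this program before rebooting; watchdog reboots only
# if the repair does not bring the test back to passing:
repair-binary = /usr/local/bin/red-repair.sh
repair-timeout = 60
```

Note that ipfirereboot itself would not be a useful repair binary here, since it performs a full machine reboot anyway.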
I think pretending that a UPS is the solution to recovery from a power outage is just wrong. It is an expensive and unreliable workaround. It solves other problems: power filtering and short-term outages. In my case, I have never had a power surge, and power outages are rare. But when they happen, they are long (utility maintenance, or large-scale failure such as a major storm). I suspect this profile of power outages is fairly common in urban areas in the developed world. The idea of spending $300 on a solution that only works some of the time is not appealing when the problem is a software issue.
I think it is best not to suggest this as a solution to the problem of IPFire not retrying failed connections; I am surely not the only person who would find it frustrating. This is the first router where I have experienced this problem.
It is clear that there is a $0 software fix (rebooting with watchdog), which is therefore already a vastly better solution to the problem. I will now optimise it to use the suggested recovery executable to avoid the reboot. [EDIT: I misunderstood. I was hoping that ipfirereboot did some kind of IPFire restart rather than a machine reboot, but in fact it is a machine reboot.]
Also I am new to ipfire, and I am not sure how robust it is when the internet connection drops out, which happens much more often than power outages… watchdog will deal with that too.
I have in any case bought a small UPS to provide line filtering for my network hardware. I will connect only the router hardware to the battery backup, which buys probably about 30 minutes.
This worked. After watchdog noticed the internet connection was not working, it restarted the network service. After a few seconds the internet connection was working again.
This is the fastest and most elegant method to fix the problem.
Possibly this could be documented in the wiki. I don’t mind doing that, but I will document what I understand, which is the use of watchdog. Perhaps the fcron solution is better, since it does not need watchdog to be installed.
While I initially suggested cron, I did not know about watchdog then, and I like that watchdog is so configurable, including a fallback to rebooting if the repair binary does not work. That gives me a lot of confidence: I have tested the power cycle nearly ten times now, and both the service restart and the reboot have fixed the problem every time. So if the first approach, a service restart, does not work, the next step is a reboot. And this solution needs only a small number of steps.
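The repair script can be sketched as follows. Everything here is illustrative: the script name is my own, and "restart-red" is a placeholder for whatever command restarts the RED interface on your installation, not a real IPFire command. What is fixed by watchdog itself is the calling convention: the repair binary receives the failing test’s error code as its first argument, and a zero exit status tells watchdog to re-run its test before deciding to reboot.

```shell
#!/bin/sh
# /usr/local/bin/red-repair.sh - hypothetical repair binary for watchdog.
# watchdog invokes it with the failing test's error code as $1.
# Exit 0 = "repair attempted, re-run the test"; non-zero = reboot.

repair() {
    errcode="${1:-?}"
    # $2 exists only for testing; by default run the (placeholder)
    # command that restarts the RED interface on your system.
    restart="${2:-restart-red}"
    echo "watchdog repair called with error code $errcode" >&2
    if $restart; then
        return 0    # repair ran; watchdog re-tests before rebooting
    else
        return 1    # repair failed; watchdog falls back to a reboot
    fi
}

# Only act when watchdog actually passes an error code:
if [ $# -gt 0 ]; then
    repair "$@"
fi
```

If the restart command itself fails, the non-zero exit lets watchdog proceed to its reboot fallback, which matches the two-stage behaviour described above.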
I have edited my reply to be more friendly for machine translation, I hope.
These techniques might be necessary for aarch64, but for x86_64 a much simpler approach should work.
The GRUB bootloader boots the default menu item after an 8-second delay. Change that to “set timeout=60”. After the proof of concept, it would be necessary to make the equivalent alteration in /etc/grub.d/00_header as well as in /etc/default/grub, so that the change persists when the grub.cfg file is regenerated.
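On a stock GRUB setup, /etc/grub.d/00_header emits the timeout from the GRUB_TIMEOUT variable set in /etc/default/grub, so the persistent change is usually just:

```
# /etc/default/grub
GRUB_TIMEOUT=60
```

followed by regenerating the config with grub-mkconfig -o /boot/grub/grub.cfg (the grub.cfg path may differ on IPFire).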
That probably is not long enough. dhcpcd, which gets the IP lease on the RED WAN connection, already has a timeout of 60 seconds defined for how long it will try to get an IP from the WAN connection, so the modem must be taking longer than 60 seconds to reconnect.
That timeout in dhcpcd.conf could be made longer, but as IPFire 2.x works via sysvinit, nothing else will be done in the boot-up cycle until that timeout has passed or the IP has been obtained from the modem connection.
The timeout used to be dhcpcd.conf’s default of 30 seconds, but back in mid-2021 the value was increased to the current 60 seconds.
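For anyone experimenting with this, the relevant dhcpcd option is timeout (in seconds); the value below is only an example, and IPFire may keep its dhcpcd.conf somewhere other than the usual /etc/dhcpcd.conf:

```
# dhcpcd.conf (excerpt)
# Wait up to 180 seconds for a DHCP lease instead of the current 60.
# With sysvinit, boot will block for up to this long if RED is down.
timeout 180
```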
dhcpcd gives up after some time trying to get an IP. Some sort of ‘watchdog’ which restarts the RED interface can help. The test condition must be defined (ping an IP, analysis of the interface state, …).
Some ISPs (at least mine) do not handle the DHCP protocol adequately. A renew request isn’t answered, resulting in a rebind after a number of retries. I couldn’t find the reason for this behaviour yet, because I haven’t found competent technical support at Vodafone so far.
In our experience (other topics in the community, and my own), it is sufficient to restart the DHCP client, which is done by an interface restart.
The ‘watchdog’ can be just a little script that tests connectivity and is started periodically by fcrontab.
The only open problem is a sufficient test condition: ping or interface state.
Because the problem is a bit tricky to reproduce, I can’t really give an adequate condition.
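A sketch of such a script, started periodically from fcrontab. Everything here is illustrative: the test address, the script path, and the "restart-red" placeholder, which stands in for whatever command restarts the RED interface on your installation:

```shell
#!/bin/sh
# /usr/local/bin/check-red.sh - hypothetical connectivity check for fcrontab.
# The test condition here is a ping; "restart-red" is a placeholder for
# your RED interface restart command, not a real IPFire command.

red_is_up() {
    # Two pings, 3-second reply timeout each (iputils options).
    ping -c 2 -W 3 "$1" >/dev/null 2>&1
}

main() {
    target="${1:-8.8.8.8}"    # test address - pick one you trust
    if ! red_is_up "$target"; then
        ${RESTART_CMD:-restart-red}
    fi
}

# Example fcrontab entry, running the check every two minutes
# (fcron accepts Vixie-cron syntax):
#   */2 * * * * /usr/local/bin/check-red.sh
if [ $# -gt 0 ] || [ -n "$RESTART_CMD" ]; then
    main "$@"
fi
```

The interface-state variant would replace red_is_up with a check of the RED link status instead of a ping; the surrounding structure stays the same.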
My analysis of threads about this problem has not revealed a common situation/state so far.
I think there are people who are hinting at better solutions.
However, I have written up what I did (with the idea of submitting it to the wiki).
I agree that this means the IPFire machine could potentially go into a reboot loop if there is an innocent reason for Google DNS to be offline. But to put this in context: within only two hours of swapping to IPFire, I reproduced in testing what would be a complete disaster for me: no remote access to my home office network after a power failure. I bought a UPS for the network gear right away, but that is not a deterministic solution. I think I am not the only one who considers the risk of a reboot loop by far the lesser of two evils. I do not doubt that testing by pinging 220.127.116.11 is far from the best solution.
I don’t know enough to easily come up with a better solution.
I am also surprised that we have to reach for watchdog-style workarounds; I would have thought that recovery from this situation is core business for a router. It seems we need to find a way for developers to reproduce this. But can it not be reproduced by physically disconnecting an IPFire machine from the RED network, booting it, waiting for DHCP to fail, and then reconnecting RED?