Power Cycle Issue - modem takes 30+s to reconnect, ipfire boots & fails to get a DHCP lease

timattrn · 6 January 2023 20:16

Hi, I’m a new user of ipfire, trying it after the hardware running pfsense needed replacing.
A failure to reconnect to the external network after a power cycle is reproducible. I am on the stable build ( IPFire 2.27 (x86_64) - Core-Update 172)
The problem is as described above: my HFC modem takes 30 to 60s to reconnect.
By the time, ipfire has booted and failed to get a DHCP lease etc. My home network does not work.

Power outages are rare and my home network gear is not protected by UPS (although I have now ordered a small one which should provide around 30 minutes). Any UPS provides only a few minutes of backup. When we do get a power outage, it is often longer. I want my home network to be resilient, in case I am travelling, I want everything to come back. My server will, any IP security devices will (e.g. cameras) but now, my connection to the internet won’t.

I don’t think I can use ipfire without this resilience. I am very surprised by this problem.
Is there really no workaround, no retry?

EDIT: Could I try a cron script that reboots if I can’t ping google.com for example?

roberto · 6 January 2023 20:37

Hi @timattrn.

You can try this:

This is valid to restart any demon modifing the code.

Bye.

timattrn · 6 January 2023 21:06

I don’t understand that. I don’t know what Unbound is (new user here).
In the past 30 minutes, I discovered watchdog, and I have configured it to reboot when it can not successfully ping 8.8.8.8 (google DNS).
I just tested it, and it works. After powering off all network gear, the “race” condition happens predictably, and ipfire does not establish an internet connection (RED interface). But this time, watchdog detects that, and reboots ipfire. After booting, the connection is established (since the modem is by now ready and waiting).

For future readers, this means
a) installing the add on watchdog (from ‘pakfire’)
b) edit the existing file /etc/watchdog.conf
All you have to do is copy one of the inactive ping tests, uncomment the line and choose your test IP address. No need to touch anything else. In particular, do not uncomment the
#watchdog-device
row near the top. It works just fine with this line commented.

ps -ef | grep watchdog will tell if you the service is running. For me, after a reboot, it was working immediately.

you can force a reboot to test it by

echo "hello" > /dev/watchdog

bbitsch · 6 January 2023 22:51

I’ve just tried this.
But the immediate shutdown clears the statistics held in RAM ( system graphics for example ).
This doesn’t matter after a power failure, but in case of temporary DHCP problem.
Would it be advisable to define ipfirereboot boot as ‘repair program’?

jon · 7 January 2023 04:43

Hi Tim,

I struggled with the same issue. When my gateway (my cable modem) rebooted at the same time as my firewall then I’d have a non-working network.

Buying the right size UPS solved all of my issues. I have a 1500W CyberPower UPS. No more power failures! (hint: you may need a bigger UPS to last a little longer).

The UPS powers my:

Cable Modem
Firewall
Wifi Access Point
and three Netgear hubs
- note: I don’t need all three hubs and could easily delete one.

When I set this all up I knew the power of everything and how long the CyberPower UPS would last. (Sorry I don’t remember these numbers at this moment).

This has an unexpected benefit - when the power went out to our house & the neighborhood we still had Internet in the house. Not expected but still somewhat nice!

So my vote would be for a bigger UPS!

xperimental · 7 January 2023 05:09

My parents got the same issue, but they restart the devices manually.

Isn’t it much better to implement a function:

RED link up
RED setup with DHCP
RED no IP or fallback IP address

→ cycle: ifconfig red0 renew

timattrn · 7 January 2023 05:25

I think pretending that a UPS is the solution to the recovery from a power outage is just wrong. It’s an expensive and unreliable workaround. It solves other problems: power filtering and short term power outs. In my case, I’ve never had a power surge, and power outages are rare. But when they happen, they are long (utility maintenance, large scale failure such as a major storm). I speculate that this profile of power outages is fairly common in urban areas in the developed world. The idea of spending $300 for a solution that only works some of the time is not appealing when the problem is a software issue.

I think that it best not to suggest this as a solution to the problem of ipfire not retrying failed connections, I think I am not the only person who would find it frustrating. It is the first router where I have experienced this problem.

It is clear that there is a $0 software fix (rebooting with watchdog) which is therefore already a vastly better solution to the problem. I will optimise it now to use the suggested recovery executable to avoid the reboot. [EDIT I misunderstood, I was hoping that ipfirereboot did some kind of ipfire restart rather a machine reboot, but in fact it is a machine reboot]

Also I am new to ipfire, and I am not sure how robust it is when the internet connection drops out, which happens much more often than power outages… watchdog will deal with that too.

I have anyway bought a small UPS to provide line filtering for my network hardware. I will only connect the router hardware to the battery backup, which buys probably about 30m.

timattrn · 7 January 2023 06:22

I have made a script: /usr/local/bin/ipfirereboot_boot.sh and made it executable.

#!/bin/sh
/usr/local/bin/ipfirereboot boot

and used that as the repair binary.
It reboots ok, so if that’s a better way of rebooting, very good.

timattrn · 7 January 2023 06:44

I tried this script as the repair binary

#!/bin/sh
ip link set red0 down
ip link set red0 up

but it does not fix the connection in the power cycle scenario.

roberto · 7 January 2023 07:45

Sorry if I have not explained myself well, since English is not my mother tongue and I have to use the translators…

What I was trying to say, is that modifying the UnboundWD attachment, calling it NetWD, for example, and changing the code as follows, wouldn’t it work for this purpose?:

#!/bin/bash
 
# -q quiet
# -c nb of pings to perform
 
service=network
 
ping -q -c5 google.com > /dev/null
 
if [ $? -eq 0 ]
then
	echo "ok"
else
	/etc/init.d/$service restart
fi

When you run “/etc/init.d/network restart” all network interfaces are restarted and connections are re-established. Wouldn’t this work better than not rebooting IPFire completely?.

Here is Log:

Here is file:

NetWD.zip (285 Bytes)

You need copy to /etc/fcrom/cyclic and modify permissions to rwxr-xr-x (0755)

If there is a connection and it is not necessary to restart “network” this is output:

[root@bs ~]# /etc/fcron.cyclic/NetWD
ok
[root@bs ~]#

Bye.

timattrn · 7 January 2023 09:39

Thanks. I implemented your solution with watchdog since I am impressed with watchdog. This must sound a bit funny, apparently it is ancient Linux.

Therefore my repair binary is this script

[root@ipfire bin]# cat restart_network_service.sh

#!/bin/sh
service=network
/etc/init.d/$service restart

This worked. After watchdog noticed the internet connection was not working, it restarted the network service. After a few seconds the internet connection was working again.

This is the fastest and most elegant method to fix the problem.

Possibly this could be documented in the wiki. I don’t mind doing that, but I will document what I understand, which is the use of watchdog. Perhaps the fcon solution is better since it does not need watchdog to be installed.

While I initially suggested cron, I did not know about watchdog, and I like that watchdog is so configurable, including a fallback to rebooting if the repair binary does not work. It gives me a lot of confidence, since I have tested the power cycle nearly ten times now, and both the service restart and the reboot have fixed the problem every time. So if the first approach, service restart, does not work, the next step is reboot. And this solution needs only a small number of steps.

I have edited my reply to be more friendly for machine translation, I hope.

xperimental · 7 January 2023 11:17

It is a little bit more complex than that.

Just setting the link down and up doesn’t do any more.

Cheching the status on RED via DNS is a very bad idea. Also a reboot seems to me a bad idea.

You will enter a reboot loop whenever your modem is offline oder your DNS just does not work. Not good.

That’s why I recommend to write a script with ethtool, ifconfig and grep to get basic information of RED.

First: is there even a active link on RED:

ethtool red0 | grep "Link detected"

Second: If ‘Link detected: yes’ then check if the interface is set to DHCP:

grep "RED_TYPE" /var/ipfire/ethernet/settings

Third: If the interface is set to DHCP, check the IP:

ifconfig red0 | grep "inet"

If the IP is 0.0.0.0 or 169.x.x.x then renew the IP for RED:

ifconfig red0 renew

However somebody needs to help out with the code so we really only get the information needed with grep or anything else that makes it much easier with the if/else loops.

rodneyp · 9 January 2023 10:06

These techniques might be necessary for aarch64, but for x86_64 a much simpler approach should work.

GRUB bootloader boots the default menu item after 8 seconds delay. Change that to “set timeout=60”. After proof-of-concept it would be necessary to also make the equivalent alteration in /etc/grub.d/00_header as well as in /etc/default/grub, so that the change persists through remaking of the grub.cfg file.

bonnietwin · 9 January 2023 10:47

That probably is not long enough. dhcpcd, which gets the IP lease from the red wan connection already has a timeout of 60 secs defined foir how long it will try to get an IP from the WAN connection, so the modem must be taking longer than 60 secs to re-connect.

That timeout in dhcpcd.conf could be made longer but as IPFire2.x works via sysvinit then nothing else will be done in the boot up cycle until that timeout is passed or the IP is obtained from the modem connection.

The timeout used to be the default for dhcpcd.conf which is 30 secs but back in mid 2021 the value was increased to the current value of 60 secs.

bbitsch · 9 January 2023 11:48

My experience with this problem shows two topics

dhcpcd gives up after some time to try to get an IP. Some sort of ‘watchdog’ which restarts the red interface can help. The condition must be defined ( ping an IP, analysis of the interface state, … )
Some ISPs ( at least my ISP ) do not handle the DHCP protocol adequate. A renew request isn’t answered, resulting in a rebind after a number of trials. I couldn’t find the reason for this behaviour yet, because I didn’t find a competent technical support until now at Vodafone.

tphz · 9 January 2023 16:11

The current topic is similar to the one below

hvacguy · 9 January 2023 16:48

Should this not be a core functionality?

bbitsch · 9 January 2023 17:33

According to our experience ( other topics in the community and mine ) it is sufficient to restart the dhcp client, which is done by a interface restart.
The ‘watchdog’ can be just a little script testing the connectivity and periodically started by fcrontab.
Only open problem is a sufficient test condition: ping or interface state.
Because it is a bit tricky to reproduce the problem, I can’t really give a adequate condition.

My analysis of threads about this problem did show a common situation/state so far.

timattrn · 18 January 2023 06:22

I think there are people who are hinting at better solutions.
However, I have written what I did (with the idea of submitting this to the wiki)

I agree that this means that potentially the ipfire machine goes into a reboot loop if there is an innocent reason for Google DNS to be offline. But to put this in context, within only two hours of swapping to ipfire, I achieved in testing what would be for me a complete disaster: no access to my home office network remotely after a power failure. I bought a UPS for the network gear right away, but this is not a deterministic solution. I think I am not the only one who consider the risk of a reboot loop by far the lessor of two evils. I do not doubt that testing by pinging 8.8.8.8 is far from the best solution.
I don’t know enough to easily come up with a better solution.
I am also surprised that we have to reach for watchdog-style workarounds; I would have thought that recovery from this situation is core business for a router. It seems we need to find a way for developers to reproduce this. But can it not be reproduced by physically disconnecting an ipfire machine from the RED network booting it, waiting for DHCP to fail, and then reconnecting RED?

bbitsch · 18 January 2023 13:19

UPDATE:
I’ve commented out clientid, duid, and option rapid_commit.
Since then the effect has gone.
Will try stepwise re-enabling of these settings and report.
Cases

clientid : ok
clientid,rapid_commit : ok
duid : ok
duid, rapid_commit : ok
clientid, duid : ok
clientid, duid, rapid_commit : ok

Seems my problem has gone
Don’t know why.

Test: Modem power down for ~3 min.
After power up and modem restart ( ~4 min! ) no problems, IP is acquired.