Rapid_Commit in dhcpcd Results in Non-Functional Connection (was: No internet after reboot)

jb68 · 7 October 2023 02:17

Hi,
I have a weird bug. After reboot I have no internet but if I physically unplug and then plug back in the network cable will work.
Also if I do:

/etc/rc.d/init.d/networking/red stop
/etc/rc.d/init.d/networking/red start

will also drop the internet connection but it claims it got the right IP.

Bringing up the red0 interface…
Starting dhcpcd on the red0 interface… [ OK ]
DHCP Assigned Settings for red0:
IP Address: 173.180.240.144
Subnet Mask: 255.255.252.0
Default Gateway: 173.180.240.1
DNS Server: 75.153.171.67 75.153.171.116

My power and internet are very stable so I don’t need to do this very often but I’d like to fix it for the time I am not at home and need VPN.

I also tried with a new install and is the same so it seems that is somehow my hardware, I guess some incompatibility with my modem, but before I had zeroshell and didn’t have this issue.
Any ideas?

rodneyp · 7 October 2023 04:48

Further info needed:

is your red0 via USB-Ethernet ?
does the modem include a switch ?

hvacguy · 7 October 2023 11:50

Sound similar to this?

jb68 · 7 October 2023 16:58

My Hw is an APU 2 with bios up2date so no usb on red0
Modem is in bridge mode but it does include a switch.
I did more digging and on

[root@ipfire ~]# /etc/rc.d/init.d/networking/red stop
Stopping dhcpcd on the red0 interface… [ OK ]
Bringing down the red0 interface… [ OK ]
[root@ipfire ~]# /etc/rc.d/init.d/networking/red start
Bringing up the red0 interface…
Starting dhcpcd on the red0 interface… [ OK ]
DHCP Assigned Settings for red0:
IP Address: 173.180.243.195
Subnet Mask: 255.255.252.0
Default Gateway: 173.180.240.1
DNS Server: 75.153.171.67 75.153.171.116

On parallel ssh instance we have:
tail -f /var/log/messages | grep dhc
will bring

Oct 7 09:27:41 d173-180-240-140 dhcpcd[18116]: sending signal ALRM to pid 11618
Oct 7 09:27:41 d173-180-240-140 dhcpcd[18116]: waiting for pid 11618 to exit
Oct 7 09:27:41 d173-180-240-140 dhcpcd[11619]: received SIGALRM, releasing
Oct 7 09:27:41 d173-180-240-140 dhcpcd[11619]: red0: removing interface
Oct 7 09:27:41 d173-180-240-140 dhcpcd[11619]: red0: releasing lease of 173.180.242.226
Oct 7 09:27:41 d173-180-240-140 dhcpcd[11619]: red0: deleting route to 173.180.240.0/22
Oct 7 09:27:41 d173-180-240-140 dhcpcd[11619]: red0: deleting default route via 173.180.240.1
Oct 7 09:27:41 d173-180-240-140 dhcpcd.exe[18117]: red0 has been brought down (STOP)
Oct 7 09:27:43 d173-180-240-140 dhcpcd[11619]: main: control_stop: No such file or directory
Oct 7 09:27:43 d173-180-240-140 dhcpcd[11619]: dhcpcd exited
Oct 7 09:28:47 d173-180-240-140 dhcpcd[18445]: dhcpcd-10.0.1 starting
Oct 7 09:28:47 d173-180-240-140 dhcpcd[18448]: DUID 00:01:00:01:24:4e:69:c1:00:0d:b9:4a:50:18
Oct 7 09:28:47 d173-180-240-140 dhcpcd[18448]: red0: waiting for carrier
Oct 7 09:28:50 d173-180-240-140 dhcpcd[18448]: red0: carrier acquired
Oct 7 09:28:50 d173-180-240-140 dhcpcd[18448]: red0: IAID b9:4a:50:18
Oct 7 09:28:50 d173-180-240-140 dhcpcd[18448]: red0: soliciting a DHCP lease
Oct 7 09:28:50 d173-180-240-140 dhcpcd[18448]: red0: probing address 173.180.243.195/22
Oct 7 09:28:55 d173-180-240-140 dhcpcd[18448]: red0: leased 173.180.243.195 for 14400 seconds
Oct 7 09:28:55 d173-180-240-140 dhcpcd[18448]: red0: adding route to 173.180.240.0/22
Oct 7 09:28:55 d173-180-240-140 dhcpcd[18448]: red0: adding default route via 173.180.240.1
Oct 7 09:28:56 d173-180-240-140 dhcpcd.exe[18505]: red0 has been (re)configured with IP=173.180.243.195
Oct 7 09:30:23 d173-180-240-140 dhcpd: DHCPREQUEST for 192.168.7.143 from 00:15:00:c6:c5:f0 (X250) via green0
Oct 7 09:30:23 d173-180-240-140 dhcpd: DHCPACK on 192.168.7.143 to 00:15:00:c6:c5:f0 (X250) via green0

Now physically unplug the cable:

Oct 7 09:45:25 d173-180-240-140 dhcpcd[21072]: red0: carrier lost
Oct 7 09:45:25 d173-180-240-140 dhcpcd[21072]: red0: deleting route to 173.180.240.0/22
Oct 7 09:45:25 d173-180-240-140 dhcpcd[21072]: red0: deleting default route via 173.180.240.1
Oct 7 09:45:25 d173-180-240-140 dhcpcd.exe[21945]: red0 has been brought down (EXPIRE)
Oct 7 09:45:29 d173-180-240-140 dhcpcd[21072]: red0: carrier acquired
Oct 7 09:45:29 d173-180-240-140 dhcpcd[21072]: red0: IAID b9:4a:50:18
Oct 7 09:45:31 d173-180-240-140 dhcpcd[21072]: red0: rebinding lease of 173.180.242.184
Oct 7 09:45:31 d173-180-240-140 dhcpcd[21072]: red0: probing address 173.180.242.184/22
Oct 7 09:45:37 d173-180-240-140 dhcpcd[21072]: red0: leased 173.180.242.184 for 14400 seconds
Oct 7 09:45:37 d173-180-240-140 dhcpcd[21072]: red0: adding route to 173.180.240.0/22
Oct 7 09:45:37 d173-180-240-140 dhcpcd[21072]: red0: adding default route via 173.180.240.1
Oct 7 09:45:37 d173-180-240-140 dhcpcd.exe[22167]: red0 has been (re)configured with IP=173.180.242.184

Route ( good state ):

root@ipfire ~]# route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
default gateway 0.0.0.0 UG 1002 0 0 red0
10.153.89.0 10.153.89.2 255.255.255.0 UG 0 0 0 tun0
10.153.89.2 0.0.0.0 255.255.255.255 UH 0 0 0 tun0
173.180.240.0 0.0.0.0 255.255.252.0 U 1002 0 0 red0
192.168.7.0 0.0.0.0 255.255.255.0 U 0 0 0 green0

Route on bad state

[root@ipfire ~]# route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
default gateway 0.0.0.0 UG 1002 0 0 red0
10.153.89.0 10.153.89.2 255.255.255.0 UG 0 0 0 tun0
10.153.89.2 0.0.0.0 255.255.255.255 UH 0 0 0 tun0
173.180.240.0 0.0.0.0 255.255.252.0 U 1002 0 0 red0
192.168.7.0 0.0.0.0 255.255.255.0 U 0 0 0 green0

As far as I can tell everything is pretty much the same. I also did some log filtering on red0, everything is messier but there one more line that is present on bad state
Oct 7 09:48:29 d173-180-240-140 kernel: 8021q: adding VLAN 0 to HW filter on device red0
Not sure if is important or not.

jb68 · 7 October 2023 17:39

If config in good state:

red0: flags=67<UP,BROADCAST,RUNNING> mtu 1500
inet 173.180.242.226 netmask 255.255.252.0 broadcast 173.180.243.255
ether 00:0d:b9:4a:50:18 txqueuelen 1000 (Ethernet)
RX packets 805605 bytes 713703116 (680.6 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 525169 bytes 66607207 (63.5 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device memory 0xd0300000-d031ffff

If config in bad state:

red0: flags=67<UP,BROADCAST,RUNNING> mtu 1500
inet 173.180.242.226 netmask 255.255.252.0 broadcast 173.180.243.255
ether 00:0d:b9:4a:50:18 txqueuelen 1000 (Ethernet)
RX packets 798861 bytes 706643543 (673.9 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 522420 bytes 66148590 (63.0 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device memory 0xd0300000-d031ffff

Don’t see any diff.

rodneyp · 7 October 2023 21:03

I have found that the FIB table in a switch, at least for un-managed switches, can get out of kilter. It has also occurred with my modem, although that is in routing mode. Your re-plugging of the network cable might be temporarily fixing it.

I find that a switch has to be powered down for several minutes to correct this, although that will probably vary with switches. In your case, try powering down the modem for several minutes.

jb68 · 7 October 2023 21:56

Good point, my modem is alive for more than 1 year so lets kill it.
Unfortunately didn’t fix the issue.
I also tried to power cycle the modem and the apu in the same time. It seems that the apu come alive first, cannot get an IP then remain in this state forever.
This is a different bug but will affect me as well because if power goes down this is the case. Here is what I could get after 10 minutes

root@ipfire ~]# /etc/rc.d/init.d/networking/red stop
Stopping dhcpcd on the red0 interface… Not running. [ WARN ]
[root@ipfire ~]# /etc/rc.d/init.d/networking/red start
Bringing up the red0 interface…
Starting dhcpcd on the red0 interface… [ OK ]
DHCP Assigned Settings for red0:
IP Address: 173.180.241.178
Subnet Mask: 255.255.252.0
Default Gateway: 173.180.240.1
DNS Server: 75.153.171.67 75.153.171.116
[root@ipfire ~]# ping 173.180.240.1
PING 173.180.240.1 (173.180.240.1) 56(84) bytes of data.
From 173.180.241.178 icmp_seq=1 Destination Host Unreachable

As you can see pinging the gatway doesn’t work. After plug in out the cable then

[root@ipfire ~]# ping 173.180.240.1
PING 173.180.240.1 (173.180.240.1) 56(84) bytes of data.
64 bytes from 173.180.240.1: icmp_seq=1 ttl=64 time=4.68 ms
64 bytes from 173.180.240.1: icmp_seq=2 ttl=64 time=4.56 ms

I also tried the power cycle ( no reboot or soft reset ) the apu and surprise, is coming alive ok which is good for me, except… in real world will be in same time with modem.
So I have 2 bugs.:

networking is dead on red after doing /etc/rc.d/init.d/networking/red start
dhcp client does not retry on failure to acquire an IP after hard reboot.

rodneyp · 7 October 2023 22:15

I recall that the issue of delaying the timing of red0 being brought up has been discussed previously in this forum.

It is also fairly simple to edit the /boot/grub/grub.conf file to impose a much longer delay on the default boot, although that was deemed, at the time, to not be the optimum solution.

hvacguy · 7 October 2023 22:52

The link I posted above speaks of setting the delay to bring up the red interface.(dhcpcd)
To a longer time.

jb68 · 8 October 2023 22:00

I am aware of the “delay” option, but being a devops and doing automation I can tell you that is a bad solution. Usually the right way is to have a number of retries. That sounds easy right… we just change the /etc/init.d/networking/red where we have

		if [ -z "${pid}" ]; then
				# No wpa_supplicant is running. So it's save to start dhcpcd.
				dhcpcd_start "${DEVICE}"
			fi

I tried to apply this and was hit by the poorly written library. Instead of using return the function use exit and that is not even consistent ( bug number 3 ) . So in order to check a return you need to run the function in a subshell.
This is also really tough to debug because no messages are logged to syslog to be able to see / inspect after reboot.
I also discovered that if red0 network cable is not present at boot time then dhcpcd fails and if cable is connected later on will stay in that status which brings ( bug number 4 ). This is because if dhcpcd is already started it will handle nicely the red0 up and down.
Here is the diff that fix bug #2

[root@ipfire networking]# diff red red.orig
85d84
< 			logger "Bringing up the ${DEVICE} interface..."
102d100
< 				logger "Interface ${DEVICE} doesn't exists. ${FAILURE}"
175,185c173
< 				for i in {1..20}
< 				do
< 					if ( dhcpcd_start "${DEVICE}" );
< 					then
< 						logger "dhcpcd successfully started"
< 						break
< 					fi
< 					boot_mesg "dhcpcd does not start... retry #: $i"
< 				      	logger "dhcpcd does not start.... retry #: $i"	
< 					sleep 20
< 				done
---
> 				dhcpcd_start "${DEVICE}"

OOOPs … turn out I didn’t bother to save functions.network … LOL so here is the pertinent part

dhcpcd_start() {
	# This function will start a dhcpcd on a speciefied device.
	local device="$1"
	shift

	local dhcp_start=()

	boot_mesg -n "Starting dhcpcd on the ${device} interface..."
	logger "Starting dhcpcd on the ${device} interface... _start"
	# Check if a dhcpcd is already running.
	local pid="$(dhcpcd_get_pid "${device}")"

	if  dhcpcd_is_running "${pid}"; then
		boot_mesg "dhcpcd already running!" ${WARNING}
		logger "dhcpcd already running"
		echo_warning
		exit 2
	fi

	# Check if a DHCP hostname has been set.
	if [ -n "${RED_DHCP_HOSTNAME}" ]; then
		dhcp_start+=( "-h" "${RED_DHCP_HOSTNAME}" )
	fi

	# Tell dhcpcd to use the configured MTU
	if [ -n "${RED_DHCP_FORCE_MTU}" ]; then
		dhcp_start+=( "--static" "mtu=${RED_DHCP_FORCE_MTU}" )
	fi

	# Append any further command line options
	dhcp_start+=( $@ )

	# Start dhcpcd.
	echo "Invoking dhcpcd"
	/sbin/dhcpcd "${dhcp_start[@]}" ${device} >/dev/null 2>&1
	ret="$?"
	echo "dhcpcd return code: $ret"
	if [ "${ret}" -eq 0 ]; then
		. /var/ipfire/dhcpc/dhcpcd-"${device}".info

			if [ $ip_address ]; then
			echo ""
			echo_ok
			boot_mesg "           DHCP Assigned Settings for ${device}:"
			boot_mesg_flush
			boot_mesg "           IP Address:      $ip_address"
			boot_mesg_flush

			if [ -n "${RED_DHCP_HOSTNAME}" ]; then
				boot_mesg "           Hostname:        $RED_DHCP_HOSTNAME"
				boot_mesg_flush
			fi

			boot_mesg "           Subnet Mask:     $subnet_mask"
			boot_mesg_flush
			boot_mesg "           Default Gateway: $routers"
			boot_mesg_flush
			boot_mesg "           DNS Server:      $domain_name_servers"
			boot_mesg_flush
		else
			echo ""
			echo_ok
			boot_mesg "DHCP for ${device} still running..."
			boot_mesg_flush
		fi
	else
		echo "Can't start dhcpcd ${ret}"
		logger "Can't start dhcpcd ${ret}"
		$(exit "${ret}")
 		evaluate_retval
		exit "$ret"
	fi
}

Basically I added exit "$ret" to fix a rookie mistake and some logger. I also reduced the delay to 20 seconds but… it doesn’t fix my original bug. However it fix the delay of my modem start, makes ipfire to boot faster and is way more flexible.
If you have a github I can make a PR.
I also noticed that ifconfig red0 down then ifconfig red0 up worked once so I will try more tomorrow on this direction.
I’ll will also try to install pfsense to see if this issue is there, maybe is my HW combination.

jb68 · 8 October 2023 22:09

Also here are some logs for this change. It needed 3 tries…

Oct 8 14:09:48 ipfire dhcpcd[2337]: dhcpcd exited
Oct 8 14:09:48 ipfire root: Can’t start dhcpcd 1
Oct 8 14:09:48 ipfire root: dhcpcd does not start… retry #: 3
Oct 8 14:10:08 ipfire root: starting dhcpcd on red0
Oct 8 14:10:08 ipfire root: Starting dhcpcd on the red0 interface… _start
Oct 8 14:10:08 ipfire dhcpcd[2409]: dhcpcd-10.0.1 starting
Oct 8 14:10:08 ipfire dhcpcd[2412]: DUID 00:01:00:01:24:4e:69:c1:00:0d:b9:4a:50:18
Oct 8 14:10:08 ipfire dhcpcd[2412]: red0: IAID b9:4a:50:18
Oct 8 14:10:09 ipfire dhcpcd[2412]: red0: soliciting a DHCP lease
Oct 8 14:10:09 ipfire dhcpcd[2412]: red0: probing address 205.250.125.234/22
Oct 8 14:10:14 ipfire dhcpcd[2412]: red0: leased 205.250.125.234 for 14400 seconds
Oct 8 14:10:14 ipfire dhcpcd[2412]: red0: adding route to 205.250.124.0/22
Oct 8 14:10:14 ipfire dhcpcd[2412]: red0: adding default route via 205.250.124.1
Oct 8 14:10:15 ipfire dhcpcd.exe[2451]: red0 has been (re)configured with IP=205.250.125.234
Oct 8 14:11:40 ipfire root: dhcpcd successfully started

hvacguy · 8 October 2023 22:41

Perhaps submit as a patch. or Partial patch.
Your skills are far beyond mine.

bonnietwin · 9 October 2023 12:21

IPFire is also on github but PR’s are not accepted for change submissions as the PR process has no structural review process.

Any patch submissions should be submitted as per the wiki
https://wiki.ipfire.org/devel/submit-patches

When I am submitting any changes I do them on a branch of a clone of the IPFire-2.x git repo.

I then run git commit -sv to create the git message and have it signed off by my email (specified in the git config).
https://wiki.ipfire.org/devel/git/commit-messages

Then I run git format-patch which creates the patch in the correct format for submitting. I find that this is much simpler than trying to manually format the patch into the correct style.

Then I run git send-email with the formatted patch as the argument and the development@lists.ipfire.org as the email destination (again defined in the git config together with the mail server details that you use for sending your mails)

Any patch submissions will then be automatically picked up by the IPFire mail server and added into the IPFire Patchwork system.

jb68 · 9 October 2023 16:50

Wow, you have a bit cumbersome process. Thanks for detailing it.
Usually people nowadays follow the GH / GL PR process. Work on their own clone, then do a PR with review.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/about-pull-request-reviews

I will clean it, take a look on the boot_mesg function cause instead of logger "bla bla I should have something like logger $1 so more correct it should be like 2 individual commits / patches.

jb68 · 9 October 2023 19:26

LOL, did some reading and it turns out I was misunderstanding how dhcpcd works.
According to docs dhcpcd - ArchWiki
The dhcpcd does the request then wait so the timeout is how long it will wait for a response and not a delay before requesting. Technically it can wait indefinitely ( with 0 )
The problem is that in current implementation the boot on the router will stop for the wait, there is way around that described in docs but not implemented in ipfire.
This kind of makes my retry patch useless. However the exit is a good fix but not necessary either. Did some tries and 5 minutes works in same way as 10 seconds.

Back to my issues, I made some progress. ifconfig red0 down then ifconfig red0 up fix my issue so I can easy make some script to automate this.
However on bad state:

[root@ipfire ~]# cat /var/ipfire/dhcpc/dhcpcd-red0.info
broadcast_address=205.250.127.255
dhcp_lease_time=14400
dhcp_message_type=2
dhcp_server_identifier=10.27.191.10
domain_name_servers='75.153.171.67 75.153.171.116'
host_name=ipfire.local
ip_address=205.250.124.237
network_number=205.250.124.0
rapid_commit=
routers=205.250.124.1
subnet_cidr=22
subnet_mask=255.255.252.0

And in good state

[root@ipfire ~]# cat /var/ipfire/dhcpc/dhcpcd-red0.info
broadcast_address=205.250.127.255
dhcp_lease_time=14400
dhcp_message_type=5
dhcp_server_identifier=10.27.191.10
domain_name_servers='75.153.171.67 75.153.171.116'
host_name=ipfire.local
ip_address=205.250.124.237
network_number=205.250.124.0
routers=205.250.124.1
subnet_cidr=22
subnet_mask=255.255.252.0

The only difference is Message Type and rapid_commit
Based on other post I said to try to remove the rapid_commit option and give it a try.
It turns out that this fixed my issue. Really weird.
The good:

no patches or scripts are necessary,

The I’d like to:

fix boot_log to log to /var/log/messages
fix library to exit consistent on error and red script to proper log errors.

I will try to make a patch for this.

cfusco · 10 October 2023 08:32

can you explain this part? I got lost here.

EDIT: got it. I am slow.

Few info from analyzing your logs and chatting with GPT4.

In the good state log, dhcp_message_type value is 5, which stands for DHCPACK, an acknowledgment message. In the bad state, the value is 2, which stands for DHCPOFFER, an offer message from the provider.

The rapid_commit option is part of the DHCP protocol and is supported by dhcpcd. It’s used to speed up the initial DHCP handshake. Normally, a DHCP handshake is a 4-step process: DISCOVER, OFFER, REQUEST, and ACK. With rapid_commit, the process is shortened to just 2 messages: SOLICIT (which is a combination of DISCOVER and REQUEST) and a combined OFFER/ACK. This makes the acquisition of an IP address faster.

If rapid_commit is present in the dhcpcd information file, it indicates that the DHCP server and client have agreed to use the Rapid Commit option for faster IP address assignment.

To me this means that shortened handshake fails with your provider but it is attempted nonetheless.

bonnietwin · 10 October 2023 08:46

The rapid_commit option in dhcpcd should be used if the ISP has enabled rapid_commit on their server. If not enabled on the ISP server then having it enabled on the client should just be ignored. That is the status if the ISP is following the dhcp rfc’s.

Unfortunately not all do and other posters have had to comment out the rapid_commit otherwise the ISP’s dhcp server does not complete the dhcp communication flow properly.
The ISP has obviously configured their dhcp server not in line with the rfc’s.

For most people it is fine but there have been around 4 or 5 posters that have had problems that were solved by commenting out that option.

Another poster solved their problem by commenting out another standard option.

bonnietwin · 10 October 2023 08:51

What should happen is that the client asks if the 2 message option can be used. The server ignores it and starts with the normal 4 message communication.

Some ISP’s have configured their dhcp servers so that they stop or do something very non-standard if they get the 2 message request

It doesn’t indicate that anything has been agreed between the server and the client. It just means that the client will ask if the 2 message option can be used. The server then either agrees and uses the 2 message option or ignores the request and starts with the full 4 message approach.

For most people with their ISP’s that is working fine but not with some ISP’s.

cfusco · 10 October 2023 08:56

which is what I wrote, quote:

the message was present in the bad state, therefore from the point of view of dhcpcd the server agreed to the rapid handshake. As you said, the server is improperly configured.

EDIT: after reading again, now I see what you meant. Yes, your language is more precise in describing what happened.

bonnietwin · 10 October 2023 09:02

Sorry, i misunderstood what you wrote.
We are in agreement.