Interfaces disappear after upgrade from 168 to 169

I made a backup, downloaded it, and upgraded from 168 to 169. After the upgrade, everything still looked fine, so I rebooted. On boot, none of my interfaces existed, so I rebooted again in case it was a fluke. No luck, so I set this aside for later troubleshooting and pressed another system into service with a fresh install of 169 and the backup. That’s working fine.

As a quick sanity check, I booted the interface-less system with a live CD, and it saw all the interfaces just fine. I suspect a fresh install and restore would get it running again, but I’d like to understand what happened and how it could be fixed without a reinstall. I figure this might also be useful info for the community (assuming it doesn’t already exist somewhere I overlooked).

How do I go about troubleshooting from here?

Thanks.

Do you have a “normal” system (on separate HW, without virtualisation)?
What kind of NICs do you use?

The boot sequence and its messages can be monitored on the console (serial or keyboard/monitor).

BTW: I didn’t have those problems on update from 168 → 169.

No virtualization. This is a Dell PowerEdge R620 with the following Ethernet controllers:

  • Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
  • Intel Corporation I350 Gigabit Network Connection (rev 01)

I have a test system (different hardware) that I upgraded last week without issue.

I’ll reboot to try to catch earlier boot messages, but I’ll attach two pics showing some of the text.


I had trouble capturing the text because it was scrolling by so quickly, but I didn’t really see any errors there. Just a complaint that “Alternate GPT header not at the end of the disk.”

I decided to run the diagnostics in Dell’s Lifecycle Controller. No problems were found, but after another reboot (maybe the tenth after the update), all the interfaces are back again! (FWIW, it still complained about GPT.)

I’ve not yet opened the case, tried to reseat anything, etc. I’m having trouble imagining what would cause it to stop working for IPFire, work for a Peppermint live boot (USB), continue to not work for IPFire, and then work for IPFire after system diagnostics. I’ve not changed anything.

Maybe some (most?) messages are stored in the /var/log/bootlog files.
Old bootlog files are bootlog..gz.
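If it helps, here is a minimal Python sketch for pulling NIC-related lines out of a boot log, plain or gzip-compressed. It assumes the Intel X540 and I350 controllers use the Linux ixgbe and igb drivers respectively. The keyword list and the function names are made up for illustration and are not part of IPFire.

```python
import gzip
from pathlib import Path

# Assumption: ixgbe (X540) and igb (I350) are the relevant driver names;
# green0/red0/orange0 are the zone interface names mentioned in this thread.
NIC_KEYWORDS = ("ixgbe", "igb", "green0", "red0", "orange0")


def nic_lines(text, keywords=NIC_KEYWORDS):
    """Return the lines of a boot log that mention any NIC-related keyword."""
    hits = []
    for line in text.splitlines():
        lowered = line.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append(line)
    return hits


def read_log(path):
    """Read a plain or gzip-compressed log file as text."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", errors="replace") as handle:
        return handle.read()
```

Running `nic_lines(read_log("/var/log/bootlog"))` on a good boot and on a bad boot and comparing the two lists should show whether the drivers ever reported the cards.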

Your screenshots show the logical interfaces green0, orange0, red0 don’t exist. The boot code wasn’t able to bind them to physical NICs and their drivers. This should be documented in error messages.
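A quick way to confirm this from a shell on the affected box is to look at /sys/class/net, where Linux lists every network interface the kernel currently knows about. The sketch below is only an illustration: the list of expected names comes from the screenshots, and the helper names are hypothetical.

```python
from pathlib import Path

# Interface names taken from this thread; adjust to the zones actually configured.
EXPECTED = ("green0", "red0", "orange0")


def missing_interfaces(expected=EXPECTED, sysfs_net="/sys/class/net"):
    """Return the expected interface names that have no entry under sysfs_net."""
    root = Path(sysfs_net)
    present = {entry.name for entry in root.iterdir()} if root.is_dir() else set()
    return [name for name in expected if name not in present]


def driver_of(name, sysfs_net="/sys/class/net"):
    """Return the kernel driver bound to an interface, or None if there is none."""
    link = Path(sysfs_net) / name / "device" / "driver"
    # The 'driver' entry is a symlink to the driver directory (e.g. .../igb).
    return link.resolve().name if link.exists() else None
```

If `missing_interfaces()` returns all three names while a live system on the same hardware shows the cards, the problem is more likely in driver loading or interface renaming than in the NICs themselves.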

I feel silly, but I’ve just discovered that there’s another disk in the system, with IPFire v167. This is the current boot disk. I’ll update when I know more.

Here’s a bootlog from yesterday while the interfaces were still not showing up. I don’t see any reference to the NICs, errors or otherwise.

https://gist.github.com/imneedham/b8400cf2c61a6ff7a92df65eaa6491ef

For comparison, here’s a bootlog where the NICs show up, both in the bootlog and in the OS.

https://gist.github.com/imneedham/e88dbb7ac675990955a15bc1305256df

Each bootlog is from the same disk, so I think the disk weirdness mentioned yesterday is probably a red herring, and there’s an intermittent hardware problem with the Dell. It’s from 2013, and that seems like the best explanation for both NICs disappearing and reappearing at the same time.

Hi,

hm, a RAID issue fixed in Core Update 168 comes to my mind while reading this. In case that machine is running a RAID setup, could you please check if the RAID is consistent, and every disk has all the files from Core Update 169?

Thanks, and best regards,
Peter Müller

Your thinking is correct and I have found the remnants of a RAID at /dev/md127. It’s very strange: mirroring a 256GB SSD and a 256GB partition on a 1.9TB HD.

I’m still looking into it, but this at least would explain some of the disk weirdness. I can’t see how it would be responsible for the interfaces disappearing (or reappearing), but it’s something!

Thanks.

Hi,

During Core Update 168, this script was executed to fix previously broken RAID setups. The breakage was caused by a dracut change, see here for further information on it.

To the best of my understanding, there is a slim chance that this script will not fully repair a RAID. Unfortunately, it looks like you have hit such an edge case, where your IPFire experience depends on which disk the machine boots from, since the disks are no longer fully in sync.
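For anyone who wants to see how far two former mirror members have drifted apart, one option is to mount each of them read-only and compare the file trees by checksum. This is only a sketch under that assumption: the mount points are up to you, the function names are hypothetical, and it reads each file fully into memory, so it suits a small system tree rather than large data volumes.

```python
import hashlib
from pathlib import Path


def file_digests(root):
    """Map each regular file under root (by relative path) to its SHA-256 digest."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        # Skip symlinks, directories, and special files.
        if path.is_file() and not path.is_symlink():
            relative = str(path.relative_to(root))
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def compare_trees(root_a, root_b):
    """Return (only in A, only in B, present in both but different)."""
    a, b = file_digests(root_a), file_digests(root_b)
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    differ = sorted(p for p in set(a) & set(b) if a[p] != b[p])
    return only_a, only_b, differ
```

Three empty lists would suggest the two copies match; anything else shows which files depend on the disk that happened to boot.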

Reinstalling IPFire is unfortunately the only way to fix this situation.

Apologies for the inconvenience, and sorry to disappoint,
Peter Müller

From the way I see the situation, you have nothing to apologize for. Unfortunately, these things happen, and while anyone can empathize with IPFire users who have been affected by this or any other bug, the developers and contributors to this project have done nothing that calls for an apology. I want to take this opportunity to thank you for the work you are doing here.

It’s unfortunate to have bugs, but not a big deal for me. And as far as I can tell, I experienced an unrelated transient hardware failure (disappearing NICs), and I may not have noticed the RAID issue had the system booted with its interfaces intact.

Maybe it’s different for some people, but reinstallation is easy and takes minimal time. After a few minutes I restore settings from backup and I’m back in service. I’m not sure how it could be any easier or quicker unless IPFire included a mechanism for automatic failover to a secondary system.

That’s a brilliant idea for a new feature request, I think.