Our office in India recently had an issue where their ldap & license server image became corrupted. We’re still trying to find the root cause, but I thought I would put a heads-up here to see if anyone else has run into something similar.
Our ldap VM was a 50G qcow2 image (actual size was 7.9G with on-demand expansion enabled). After a recent reboot, the VM image somehow became corrupted. The host OS (IPFire) doesn’t report any filesystem issues, but qemu-img showed massive leaks and the VM filesystems were mostly garbled. I attached the image to a boot ISO in rescue mode and was only barely able to recover the ldap database. I restored their VM by copying their ldap database into a clone of our VM and copying that clone back up to their host, but 24 hours later it is now corrupt as well.
Their firewall is a fairly modern Dell workstation with a 1T drive and 8G of memory (similar to what I am running here in Portland, Oregon). They are on IPFire build 139; we’re on 136 (I’ve been too busy to update our system).
This most likely has nothing to do with libvirt, but rather with qemu (which was updated to version 4.1.0) or with a broken filesystem.
Could you post some logs or anything similar, so we can see what caused the corruption?
Looks like it may be an incompatibility with an older qcow2 format. I’ll have to deep dive into qemu history to see.
The qemu log for the VM has this repeating multiple times:
qcow2_free_clusters failed: Invalid argument
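For anyone who wants to see how often this shows up in their own logs, here is a minimal Python sketch that counts the matching lines. The function names are made up for illustration, and the log location in the usage comment is an assumption (the usual libvirt default), so adjust it to wherever your qemu log lives.

```python
from pathlib import Path

# The error string as it appears in the qemu log quoted above.
NEEDLE = "qcow2_free_clusters failed"


def count_free_cluster_errors(lines):
    """Count log lines that contain the qcow2_free_clusters failure."""
    return sum(1 for line in lines if NEEDLE in line)


def count_in_file(log_path):
    """Count matching lines in a log file on disk."""
    # errors="replace" so a stray non-UTF-8 byte does not abort the scan.
    with Path(log_path).open(errors="replace") as fh:
        return count_free_cluster_errors(fh)


# Hypothetical usage; the path is an assumption, not taken from this thread:
# print(count_in_file("/var/log/libvirt/qemu/ldap.log"))
```

A count that keeps growing between runs would suggest the image is still being damaged while the VM runs, rather than having been damaged once at reboot.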
I set this up originally almost 2 years ago here in our Oregon datacenter. When we opened our branch office in Bangalore, I cloned the ldap VM (far easier than starting from scratch). We started seeing issues when the BLR office updated their firewall to build 139. Our PDX office is still on build 136.
I was able to recover the ldap database from the corrupt image and copy over a fresh clone of the PDX VM. As soon as I copy the VM to IPFire, qemu-img check reports errors, which leads me to suspect a qcow2 format change. I now have the VM running from a raw format image. So far no errors, but it has only been 24 hours.
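If anyone wants to compare images between hosts, below is a rough Python sketch of how the `qemu-img check` and `qemu-img info` results could be collected in a comparable form. It relies on the `--output=json` option of both subcommands; the helper names and the image path are hypothetical, and the JSON keys are the ones I believe qemu-img emits, so verify them against your own output. Run it only while the VM is shut down.

```python
import json
import subprocess


def qemu_img_json(subcommand, image):
    """Run `qemu-img <subcommand> --output=json <image>` and parse the output."""
    # check=False: `qemu-img check` exits non-zero when it finds leaks or
    # corruption, but it should still print its JSON report on stdout.
    proc = subprocess.run(
        ["qemu-img", subcommand, "--output=json", image],
        capture_output=True, text=True, check=False,
    )
    if not proc.stdout.strip():
        raise RuntimeError(proc.stderr.strip() or "qemu-img produced no output")
    return json.loads(proc.stdout)


def summarize_check(report):
    """Reduce a `qemu-img check` JSON report to the counters of interest."""
    # "leaks" and "corruptions" may be absent when nothing was found.
    return {
        "leaks": report.get("leaks", 0),
        "corruptions": report.get("corruptions", 0),
        "check_errors": report.get("check-errors", 0),
    }


def qcow2_compat(info):
    """Return the qcow2 compat level from `qemu-img info` JSON, or None."""
    specific = info.get("format-specific") or {}
    if specific.get("type") != "qcow2":
        return None
    return specific.get("data", {}).get("compat")


# Hypothetical usage; the image path is a placeholder:
# image = "/path/to/ldap.qcow2"
# print(summarize_check(qemu_img_json("check", image)))
# print(qcow2_compat(qemu_img_json("info", image)))
```

Running this against the same image on the build 136 host and the build 139 host should show whether the errors appear only under the newer qemu, and the compat value shows which qcow2 version the image was created with.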
I plan on testing this theory here locally on a test system. Will update this thread when I know more.