
Snapshot deletion issues#3969

Merged
andrijapanicsb merged 18 commits into 4.13 from snapshot-deletion-issues
Apr 11, 2020

Conversation

@DaanHoogland
Contributor

@DaanHoogland DaanHoogland commented Mar 16, 2020

Description

Copy of PR #3649, moved to a joint account for easier cooperation.

Tested both cases:

  • Snapshot backed up on Secondary - manual and scheduled
  • Snapshot stored only on primary storage (works with KVM + Ceph only)

Note: @GabrielBrascher changed the name of the strategy from XenserverSnapshotStrategy to DefaultSnapshotStrategy because the "Xenserver" strategy also handles Ceph and KVM, and returns StrategyPriority.DEFAULT in every case except REVERT.

    @Override
    public StrategyPriority canHandle(Snapshot snapshot, SnapshotOperation op) {
        if (SnapshotOperation.REVERT.equals(op)) {
            long volumeId = snapshot.getVolumeId();
            VolumeVO volumeVO = volumeDao.findById(volumeId);

            if (volumeVO != null && ImageFormat.QCOW2.equals(volumeVO.getFormat())) {
                return StrategyPriority.DEFAULT;
            }
            return StrategyPriority.CANT_HANDLE;
        }
        return StrategyPriority.DEFAULT;
    }

Fixes: #3646

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

WHAT HAS BEEN FIXED / WORKS FINE:

  • VMware, KVM - no garbage on Secondary/Primary storage (snapshot.backup.to.secondary=TRUE) 👍
    -- snapshot.backup.to.secondary=FALSE works (by design) only with KVM + Ceph - no garbage on Secondary/Primary storage 👍
    --- snapshot.backup.to.secondary=FALSE does NOT work (by design) with other hypervisor/storage combos - so don't bother trying 😄
  • ad-hoc single XenServer volume snapshot (i.e. NOT consecutive/scheduled snaps 👎 )

WHAT HAS NOT BEEN FIXED (tracked via #4018 ):

  • Major: XenServer consecutive/scheduled volume snapshots remain broken, because XenServer snapshots have a parent/child relation (unlike VMware and KVM, where consecutive/scheduled snaps are NOT chained). One can continue to create and use XS volume snaps, but they are not removed properly on Primary and Secondary storage and thus leave garbage all over the place; eventually the "Snapshot chain too long" error may occur on Xen. See Snapshots GC from DB and XenServer Primary Storage garbage #4018 for a possible workaround/fix and more explanation of how it works at the moment. 👎 👎 👎
  • Minor: VMware and KVM PRIMARY references (role_type=primary) are NOT removed from the snapshot_store_ref table - minor garbage, but to be fixed eventually 👎

@DaanHoogland
Contributor Author

DaanHoogland commented Mar 16, 2020

using KVM/NFS, I tested this scenario:

  1. created VM with data disk
  2. both disks snapshotted
  3. both deleted
  4. both marked deleted in the DB and removed secondary
  5. snapshot_store_ref has entries for both disks on primary and on secondary; all four are marked 'Destroyed'.
  6. no changes on primary; everything had already been cleaned (not sure if the snapshots were removed before)

Basically, different results from those reported by @GabrielBrascher .

This is with @GabrielBrascher 's fix in this PR.
That would mean we are good to go, but it could also mean there are unknown circumstances that create garbage. (cc @andrijapanicsb )

@DaanHoogland DaanHoogland added this to the 4.13.1.0 milestone Mar 16, 2020
@DaanHoogland DaanHoogland changed the base branch from master to 4.13 March 16, 2020 15:49
@andrijapanicsb
Contributor

andrijapanicsb commented Mar 23, 2020

Not good (still testing a few more scenarios and will update this comment once done):

XENSERVER (7.1) + NFS

snapshot.backup.to.secondary = TRUE (default)

MANUAL SNAPS: LGTM

  • deletion works fine - removed from both primary/secondary store, DB records in snapshot_store_ref marked as DESTROYED, all good.

SCHEDULED SNAPS: MAYBE OK - someone should advise

  • for disk snaps we are making snap chains (parent_snapshot_id pointing to the ID of the previous scheduled snapshot). While the DB says that older snaps are deleted, that is not really true on the storage side; but as those are diff snaps (a chain), it makes sense not to delete the older/parent snap while there are child/linked snaps. However, an older snap can still be selectively deleted manually - which is a bit strange, as it would break the snap chain.
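
The chain described above can be inspected directly in the management database. A hedged sketch - the volume id and the DB credentials are placeholders; the table and parent_snapshot_id column are the ones named above:

```shell
# Sketch: list a volume's snapshots with their parent links, oldest first.
# volume_id=42 is a placeholder; run the query against the management-server DB.
volume_id=42
query="SELECT id, parent_snapshot_id, status FROM snapshots WHERE volume_id = ${volume_id} ORDER BY id;"
echo "$query"
# mysql -u cloud -p cloud -e "$query"
```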

snapshot.backup.to.secondary = FALSE

MANUAL SNAPS: LGTM
deletion works fine - removed from primary store, DB records in snapshot_store_ref marked as DESTROYED, all good.

SCHEDULED SNAPS:
Works fine - garbage deleted in both DB and on Primary NFS server

KVM (CENT7.1) + NFS

snapshot.backup.to.secondary TRUE (default)

MANUAL SNAPS: LGTM

  • deletion works fine - removed from both primary/secondary store, DB records in snapshot_store_ref marked as DESTROYED, all good.
  • I also see we are NOT keeping the snap in the QCOW2 file (good, as we made a copy to secondary storage) - and even though the "PRIMARY" record (store_role) is still created and kept as "Ready" in snapshot_store_ref, it does get properly marked as DESTROYED when the snap is deleted, with no exception

SCHEDULED SNAPS:

  • works fine, created and deleted as expected in both DB and on the primary/secondary storage.

snapshot.backup.to.secondary = FALSE

MANUAL SNAPS: BROKEN

  • can't even create manual KVM volume snap

SCHEDULED SNAPS: NA

  • not tested as manual ones fail

VMware not tested,

but will re-test that it's NOT affected once we have everything else fixed.

@GabrielBrascher
Member

@andrijapanicsb thanks for all the tests, I am going to check this issue.

@GabrielBrascher
Member

@andrijapanicsb thank you very much for all the tests. Here follows an update on snapshot.backup.to.secondary = FALSE.

I have been testing this implementation (tested with 4.13.1 and also on a port to 4.14.0). Snapshotting with snapshot.backup.to.secondary = FALSE seems to be working fine for me: snapshots were created and deleted on (i) XenServer + NFS, (ii) KVM + NFS, (iii) KVM + Ceph.

Environment:

  • CloudStack management node: Ubuntu 18.04
  • KVM nodes: Ubuntu 18.04
  • XenServer 6.5

Can you please share more details on the issue that you had? Are there exceptions on management or the agent node? Is the DB being correctly updated (snapshots, volumes, snapshot_store_ref)?

Note that snapshots on NFS as primary storage are handled a bit differently than on secondary. The snapshot handling for NFS (primary) is executed with the help of the managesnapshot.sh script. On KVM + NFS (primary) that script creates an LVM volume, which makes it a bit tricky to find the snapshot volume compared to other approaches (e.g. Ceph: rbd snap ls foo, NFS secondary: ls -lah /export/secondary/snapshots/, XenServer: xe vdi-list).

Basically, the script runs lvm lvcreate: lvm lvcreate --size ${lv_bytes}b --name "${snapshotname}-cow" ${vg} >&2 || return 2. Let's say that the volume path is /mnt/<uuid primary>/<uuid volume>.

Thus, the snapshot path will become something like /mnt/<uuid primary>/<uuid volume>/<uuid snapshot>. However, someone looking at the NFS storage via ls /mnt/<uuid primary>/ will only see /mnt/<uuid primary>/<uuid volume>.
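
The path layout described above can be sketched as follows (all UUID values are placeholders standing in for <uuid primary>, <uuid volume>, <uuid snapshot>):

```shell
# Placeholder UUIDs; in a real deployment these are the storage, volume
# and snapshot UUIDs from the CloudStack DB.
primary_uuid="primary-uuid"
volume_uuid="volume-uuid"
snapshot_uuid="snapshot-uuid"

volume_path="/mnt/${primary_uuid}/${volume_uuid}"
snapshot_path="${volume_path}/${snapshot_uuid}"

# A plain `ls /mnt/<uuid primary>/` shows only the volume entry; the
# snapshot sits one level deeper, which is why it is easy to miss.
echo "$snapshot_path"
```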

Is that happening on your test environment?

@davidjumani
Contributor

@GabrielBrascher That is happening in our environment. However, when I create a first snapshot of a data volume, it also makes a copy in the secondary storage. Successive snapshots do not get created in the secondary storage, but when I try to restore them or create a new volume from them it fails; I can only use the first snapshot I created to create a new volume / restore the snapshot.

@GabrielBrascher
Member

Got it @davidjumani, thanks for the feedback.

@andrijapanicsb
Contributor

So - yes, when using KVM + NFS and snapshot.backup.to.secondary=false, all kinds of snapshot mess happens - both in 4.11.3 and in master - which implies this combo was never properly implemented.

The rest looks good and I'm tempted to give this specific PR a LGTM (after some more testing of what was already tested before as good)

@andrijapanicsb
Contributor

I propose that we continue/merge this PR as it is (I want to confirm a few more things with testing, specifically that VMware is still not affected) and close the "garbage" problem as it seems solved.

For the very bad KVM + NFS/local behaviour when snapshot.backup.to.secondary=false - which traces back to the original implementation, seemingly done only for KVM + Ceph - I propose that we simply change the description of this global setting to "... Applicable only for KVM + Ceph".

@GabrielBrascher sounds ok to you?

@GabrielBrascher
Member

@andrijapanicsb sounds good to me, considering the raised context and all tests done so far. Let's run a few more tests then.

    if (isSnapshotOnPrimaryStorage(snapshotId) && snapshotSvr.deleteSnapshot(snapshotOnPrimaryInfo)) {
        snapshotOnPrimary.setState(State.Destroyed);
        snapshotStoreDao.update(snapshotOnPrimary.getId(), snapshotOnPrimary);
        snapshotDao.remove(snapshotId);
Contributor


Perhaps this is leading to the record having a removed date in the snapshots table, which later causes issues during GC?

Member


That looks like the issue indeed, I will take a second look at this.

@andrijapanicsb andrijapanicsb reopened this Apr 9, 2020
@Slair1
Contributor

Slair1 commented Apr 9, 2020

So it does @Slair1 ...

@andrijapanicsb could you double-check the code? When I manually check 4.13, I don't see the lines added in the screenshot below from #3194 .
[screenshot of the code added in #3194]

@andrijapanicsb
Contributor

@Slair1 I can't, I'm off as of 10 minutes ago (late night, time for beer) - perhaps @DaanHoogland can confirm once more tomorrow...

@andrijapanicsb
Contributor

Is this for volume or VM snapshots @Slair1 ? - volume snapshots are NOT possible for KVM while the VM is running (someone thought it was interesting to do that, due to qcow2 safety... so now it's useless for KVM...)

@Slair1
Contributor

Slair1 commented Apr 9, 2020

Is this for volume or VM snapshots @Slair1 ? - volume snapshots are NOT possible for KVM while the VM is running (someone thought it was interesting to do that, due to qcow2 safety... so now it's useless for KVM...)

It is for volume snapshots, but you're correct, the global setting kvm.snapshot.enabled needs to be enabled. This brings up another point: we are wondering if we need to do a suspend before deleting VM snapshots as well. From our examination of the code, it looks like both VM snapshots and volume snapshots use the Java libvirt libraries to take and delete the snapshot... so I would think they could both have the same corruption issue if the snapshot is deleted while the VM is powered on and not suspended during the delete...

@blueorangutan

Trillian test result (tid-1401)
Environment: vmware-67u3 (x2), Advanced Networking with Mgmt server 7
Total time taken: 32576 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3969-t1401-vmware-67u3.zip
Intermittent failure detected: /marvin/tests/smoke/test_privategw_acl.py
Intermittent failure detected: /marvin/tests/smoke/test_ssvm.py
Smoke tests completed. 76 look OK, 1 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_02_vpc_privategw_static_routes Failure 258.83 test_privategw_acl.py
test_03_vpc_privategw_restart_vpc_cleanup Failure 253.80 test_privategw_acl.py
test_04_rvpc_privategw_static_routes Failure 379.55 test_privategw_acl.py

@andrijapanicsb
Contributor

@blueorangutan test centos7 xenserver-71

@blueorangutan

@andrijapanicsb a Trillian-Jenkins test job (centos7 mgmt + xenserver-71) has been kicked to run smoke tests

@andrijapanicsb
Contributor

@blueorangutan test

@blueorangutan

@andrijapanicsb a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-1403)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 30283 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3969-t1403-xenserver-71.zip
Intermittent failure detected: /marvin/tests/smoke/test_privategw_acl.py
Intermittent failure detected: /marvin/tests/smoke/test_scale_vm.py
Smoke tests completed. 75 look OK, 2 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_02_vpc_privategw_static_routes Failure 252.71 test_privategw_acl.py
test_03_vpc_privategw_restart_vpc_cleanup Failure 244.28 test_privategw_acl.py
test_04_rvpc_privategw_static_routes Failure 323.21 test_privategw_acl.py
test_01_scale_vm Failure 24.57 test_scale_vm.py

@blueorangutan

Trillian test result (tid-1404)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 30295 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3969-t1404-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_privategw_acl.py
Smoke tests completed. 76 look OK, 1 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_02_vpc_privategw_static_routes Failure 186.51 test_privategw_acl.py
test_03_vpc_privategw_restart_vpc_cleanup Failure 179.71 test_privategw_acl.py
test_04_rvpc_privategw_static_routes Failure 244.52 test_privategw_acl.py

@DaanHoogland
Contributor Author

Those were obviously spurious failures, @andrijapanicsb, as they passed after applying insanity. If your manual verification checks out, we should merge and release.

Contributor

@andrijapanicsb andrijapanicsb left a comment


LGTM from my side:

  • 3 x LGTMs (2 x approvals), test matrix (KVM, VMware, XS) runs fine (known privategw test issues, fixed in master)
  • Manual tests fine (with some outstanding fix/refactoring work to be done via issue #4018 )

Will update the PR description to explain what HAS and HAS NOT been fixed.

Merging.

@Slair1
Contributor

Slair1 commented Apr 13, 2020

So it does @Slair1 ...

@andrijapanicsb could you double-check the code? When I manually check 4.13, I don't see the lines added in the screenshot below from #3194 .
[screenshot of the code added in #3194]

@andrijapanicsb just wanted to ping you on this again thanks!

@andrijapanicsb
Contributor

I don't see it myself, despite @DaanHoogland confirming the commit is present in the 4.13 branch... Some regression by someone?

@andrijapanicsb
Contributor

But you can test 4.13.1 RC1 and see if VMs are suspended as expected - perhaps some parts of the code were refactored or so...

@Slair1
Contributor

Slair1 commented Apr 13, 2020

But you can test 4.13.1 RC1 and see if VMs are suspended as expected - perhaps some parts of the code were refactored or so...

@andrijapanicsb @DaanHoogland from our testing it is just gone... so strange - I hope other commits didn't get dropped somehow. I even checked the blame and I don't see a refactor - the last change in branch 4.13 for that section of code was 4 years ago.

@andrijapanicsb
Contributor

Agree, it's gone/messed up (trying to understand what happened, together with some colleagues who are good with git). Some code refactoring happened between 4.11 and 4.12, and the code seems to have been lost then (it is missing from 4.12 onwards) - so this is NOT a new issue.

@andrijapanicsb
Contributor

@Slair1 do you mind creating a PR for MASTER (not 4.13 - we already have RC1 up for vote today) to add back those few lines? It's incredible - the global setting description saying the VM will be suspended is STILL there in 4.12/4.13/master, but the "vm.suspend" code is gone.
@DaanHoogland your opinion?

@andrijapanicsb
Contributor

@Slair1 - the bug you shared (RHEL bug) was marked as WON'T FIX, and was raised against qemu-kvm-0.12.1.2-2.355.el6_4.1.x86_64

Can you please describe in which env (versions of qemu) you are seeing the corruption (as master/4.14 does NOT even support CentOS 6)?
The issue happened during the merge of 4.11 into the 4.12 branch - we identified WHEN it was messed up, but not HOW (which is now irrelevant)

I would like to first REPRODUCE this issue on CentOS 7+ instead of just suspending the VM (now that we are thinking of introducing the suspend step AGAIN)

@Slair1
Contributor

Slair1 commented Apr 14, 2020

@Slair1 - the bug you shared (RHEL bug) was marked as WON'T FIX, and was raised against qemu-kvm-0.12.1.2-2.355.el6_4.1.x86_64

Can you please describe in which env (versions of qemu) you are seeing the corruption (as master/4.14 does NOT even support CentOS 6)?
The issue happened during the merge of 4.11 into the 4.12 branch - we identified WHEN it was messed up, but not HOW (which is now irrelevant)

I would like to first REPRODUCE this issue on CentOS 7+ instead of just suspending the VM (now that we are thinking of introducing the suspend step AGAIN)

Thanks @andrijapanicsb . We have always been on CentOS 7, never 6. We have had many corrupt VMs from using the scheduled volume snapshot functionality... We had to turn off the ability to do volume snapshots while a VM is powered on... We also stopped using VM snapshots, since the delete VM snapshot functionality removes snapshots from the qcow2 files the same way volume snapshots are deleted (after the snap is backed up to secondary storage). We are hoping the pause VM, delete snapshot, resume VM sequence fixes this - we are still testing.

@andrijapanicsb
Contributor

Clear, I did a bit of reading myself. Say, what about corruption when creating snapshots? In recent qemu releases (2.10) qemu-img actually opens qcow2 with read-write access instead of read-only as before (don't ask me why; documented behaviour). This would probably mean that even backing up the snap to secondary storage (with qemu 2.10) will not work (it requires the --force-share flag, which we don't have implemented in ACS) unless the VM is suspended. Did you have a chance to test qemu 2.10 (that is a separate CentOS SIG repo), etc.? What about corruption while creating snapshots with the stock qemu version 1.5.3?
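
For reference, the flag discussed above can be exercised roughly like this (a dry-run sketch; the image path is a placeholder, and --force-share requires qemu-img >= 2.10):

```shell
# Build the command as a string so this sketch runs without qemu-img installed;
# drop the echo to execute it on a host with qemu-img >= 2.10.
img="/mnt/primary/volume-uuid.qcow2"   # placeholder path
cmd="qemu-img info --force-share ${img}"
echo "$cmd"
# Without --force-share, qemu-img 2.10+ refuses to open an image that a
# running VM holds locked, which is why the backup/inspection step can fail.
```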

@andrijapanicsb
Contributor

andrijapanicsb commented Apr 14, 2020

Tested, confirmed the issue with the latest CentOS 7 host (CentOS 5.5 ext3 guest - simulated high IO load while deleting the snap and broke the VM)

Raised PR (for master only) #4029

EDIT: well, I "broke" the image (qemu-img check showing thousands of errors) but not the EXT3 inside the VM - it seems EXT3 can't be broken (retrying the test with CentOS 7 and the famous XFS, which has a tendency to self-crash for fun...)

@ravening
Member

ravening commented Sep 1, 2020

@DaanHoogland @GabrielBrascher @andrijapanicsb @rhtyd

Found an issue with this commit. Below are the steps to reproduce

  1. Create a VM snapshot from a running VM
  2. Create a snapshot from the above VM snapshot
  3. Delete the snapshot
  4. Try to create a snapshot again. This will fail.
    This is because when we delete the snapshot for the first time, it deletes the VM snapshot as well on the primary storage.
    So when we try to create a snapshot again, it is not able to find the VM snapshot and throws an exception

I have made a fix so that it checks whether the VM snapshot from which the snapshot was created still exists on primary storage. If it does, the snapshot is deleted only from secondary, not from primary.
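
The reproduction steps above could be driven with CloudMonkey roughly as follows. This is a hedged dry-run sketch: all ids are placeholders, and the command spellings assume cmk's usual lower-casing of API names such as createVMSnapshot and createSnapshotFromVMSnapshot:

```shell
# Dry-run sketch of the reproduction steps; all ids are placeholders.
# Drop the echos to execute against a configured cmk profile.
vm_id="vm-uuid"; volume_id="vol-uuid"; vmsnap_id="vmsnap-uuid"; snap_id="snap-uuid"

echo "cmk create vmsnapshot virtualmachineid=${vm_id}"                                    # 1. VM snapshot of a running VM
echo "cmk create snapshotfromvmsnapshot vmsnapshotid=${vmsnap_id} volumeid=${volume_id}"  # 2. volume snapshot from it
echo "cmk delete snapshot id=${snap_id}"                                                  # 3. delete the volume snapshot
echo "cmk create snapshotfromvmsnapshot vmsnapshotid=${vmsnap_id} volumeid=${volume_id}"  # 4. fails if the VM snapshot was removed from primary
```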

@andrijapanicsb
Contributor

@ravening for which hypervisor is this?
If a VM snapshot is deleted, we should also delete it from the primary storage - that is fine.
This sounds more like some DB garbage, due to which we can't create a new one (chained, parent/child snaps?).
Which PR did you create?

@ravening
Member

ravening commented Sep 1, 2020

@ravening for which hypervisor is this?
If a VM snapshot is deleted, we should also delete it from the primary storage - that is fine.
This sounds more like some DB garbage, due to which we can't create a new one (chained, parent/child snaps?).
Which PR did you create?

@andrijapanicsb This is for KVM

The VM snapshot is not yet deleted. Of course, if the VM snapshot is deleted then we should clean up the garbage.
In my test case, I have not deleted the VM snapshot. I just created a snapshot from the VM snapshot, deleted it, and tried to create a snapshot again, which failed



Development

Successfully merging this pull request may close these issues.

Garbage snapshots are left on Primary and Secondary storage after a volume snapshot is deleted

9 participants