kvm: suspend/resume in deleting vm snapshot on kvm #4033

Merged
DaanHoogland merged 1 commit into apache:4.13 from ustcweizhou:4.13-suspend-vm-delete-vmsnapshot
Apr 16, 2020

@ustcweizhou
Contributor

Description

To avoid qcow2 image corruption, we should suspend the VM while deleting a VM snapshot and resume it once the VM snapshot has been removed.
Fixes: #3193
related to: #3194 #4029
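
The ordering described above can be sketched as follows. This is only an illustration under stated assumptions: `VmHandle`, `SnapshotDeleteSketch`, and `deleteWhileSuspended` are hypothetical names standing in for the hypervisor domain object and the wrapper code, and are not taken from the patch itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a hypervisor domain handle; not the libvirt API.
interface VmHandle {
    void suspend();
    void resume();
}

final class SnapshotDeleteSketch {
    // Suspends the VM, runs the deletion, and always resumes afterwards,
    // even if the deletion throws.
    static void deleteWhileSuspended(VmHandle vm, Runnable deleteSnapshot) {
        vm.suspend();
        try {
            deleteSnapshot.run();
        } finally {
            vm.resume();
        }
    }

    // Records the order of calls so the sequence can be inspected.
    static List<String> trace() {
        List<String> events = new ArrayList<>();
        VmHandle vm = new VmHandle() {
            @Override public void suspend() { events.add("suspend"); }
            @Override public void resume() { events.add("resume"); }
        };
        deleteWhileSuspended(vm, () -> events.add("delete"));
        return events;
    }

    public static void main(String[] args) {
        System.out.println(trace());
    }
}
```

Running `main` prints `[suspend, delete, resume]`. The `finally` block is the point of the sketch: the VM should not be left paused if the deletion fails.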

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Screenshots (if appropriate):

How Has This Been Tested?

@andrijapanicsb
Contributor

@weizhouapache being Wei Zhou, as usual... :) thanks!

Shall we try to squeeze this into master only (4.13 is already out for voting), or shall we craft another 4.13 RC2 because of this? I don't see a problem there. This seems to be a serious enough issue/fix to warrant an RC2, and a simple enough change that it should NOT require full-blown retesting of everything.

/cc @DaanHoogland @rhtyd

@andrijapanicsb andrijapanicsb added this to the 4.13.1.0 milestone Apr 15, 2020
@andrijapanicsb
Contributor

@blueorangutan package

@blueorangutan

@andrijapanicsb a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos7 ✔debian. JID-1171

@andrijapanicsb
Contributor

@blueorangutan test

@blueorangutan

@andrijapanicsb a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@andrijapanicsb
Contributor

Good boy, ape...

@luhaijiao

Would expect to have it in 4.13 RC2 given its potential but serious impact.

@blueorangutan

Trillian test result (tid-1420)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 27668 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4033-t1420-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_privategw_acl.py
Smoke tests completed. 76 look OK, 1 have error(s)
Only failed tests results shown below:

Test | Result | Time (s) | Test File
test_02_vpc_privategw_static_routes | Failure | 174.10 | test_privategw_acl.py
test_03_vpc_privategw_restart_vpc_cleanup | Failure | 176.12 | test_privategw_acl.py
test_04_rvpc_privategw_static_routes | Failure | 237.66 | test_privategw_acl.py

Contributor

@DaanHoogland DaanHoogland left a comment

Code looks good except for one scenario: the VM was already suspended and the user wants to delete a snapshot without starting the VM up again. I'm not sure if this can happen. Can we be sure it can't?

@andrijapanicsb
Contributor

Not sure I understand your question, @DaanHoogland.
In general, we should first check if VM is running, and then suspend/delete/resume.
Not sure this would work for a stopped VM? (about to test the PR now)

@yadvr yadvr requested a review from andrijapanicsb April 16, 2020 11:48
@yadvr yadvr closed this Apr 16, 2020
@yadvr yadvr reopened this Apr 16, 2020
@yadvr
Member

yadvr commented Apr 16, 2020

@andrijapanicsb @DaanHoogland this also LGTM, smaller than the other two. The code to suspend is okay, but a check may be preferable.

@blueorangutan package

@blueorangutan

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

Contributor

@andrijapanicsb andrijapanicsb left a comment

LGTM

Tested manually and it works as expected - the VM gets suspended very briefly.

NOTE: the VM does get resumed unconditionally - I'm fine with that (if someone suspends the VM manually, outside of ACS, then ACS resumes their VM - "sucks to be you") :)
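
For comparison, a state-aware variant that would leave a manually paused or stopped VM untouched could look like the sketch below. It is not what the merged change does. `VmState`, `FakeVm`, and `GuardedSnapshotDelete` are hypothetical names used only for illustration, and the in-memory `FakeVm` stands in for a real domain handle.

```java
// Hypothetical VM states and handle; illustrative only, not the libvirt API.
enum VmState { RUNNING, PAUSED, SHUT_OFF }

final class FakeVm {
    VmState state;
    FakeVm(VmState initial) { this.state = initial; }
    void suspend() { state = VmState.PAUSED; }
    void resume() { state = VmState.RUNNING; }
}

final class GuardedSnapshotDelete {
    // Suspends only a running VM and resumes only if this call suspended it,
    // so a VM that was already paused or shut off keeps its original state.
    static void delete(FakeVm vm, Runnable deleteSnapshot) {
        boolean suspendedHere = false;
        if (vm.state == VmState.RUNNING) {
            vm.suspend();
            suspendedHere = true;
        }
        try {
            deleteSnapshot.run();
        } finally {
            if (suspendedHere) {
                vm.resume();
            }
        }
    }

    public static void main(String[] args) {
        // Each initial state should be unchanged after the deletion.
        for (VmState initial : VmState.values()) {
            FakeVm vm = new FakeVm(initial);
            delete(vm, () -> { });
            System.out.println(initial + " -> " + vm.state);
        }
    }
}
```

The trade-off is one extra state check per deletion in exchange for not overriding a pause that was made outside of ACS.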

@blueorangutan

Packaging result: ✔centos7 ✔debian. JID-1173

@andrijapanicsb
Contributor

2 x LGTMs/Approvals, manual testing done fine, regression tests are fine.

Ready for merge @DaanHoogland

@DaanHoogland DaanHoogland merged commit 2637a86 into apache:4.13 Apr 16, 2020
@andrijapanicsb
Contributor

andrijapanicsb commented May 6, 2020

@weizhouapache ... should we also suspend the VM while we delete the VM snapshot right after backing up the volume snapshot to Secondary Storage, in the code below:

https://github.com/ustcweizhou/cloudstack/blob/6f30fdfed2986a52c78fb095b2cdc917f1e8f216/plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/wrapper/LibvirtBackupSnapshotCommandWrapper.java#L175

The code from this PR (#4033) is executed when we delete a VM snapshot that we created manually earlier (i.e. proper resume during creation and deletion of the VM snapshot) - all good here.

But when kvm.snapshot.enabled=true and you snapshot just one volume of a running VM, I could see that the whole VM is paused, a whole VM snapshot is taken (all volumes and RAM), and the VM is resumed. Then the single volume snapshot is exported from the qcow2 via qemu-img to Secondary Storage, and finally the original VM snapshot is deleted. During this VM snapshot deletion I could not confirm that the VM is ALWAYS paused/suspended: there are never any log messages, but I could see it being paused most of the time when a DATA volume snapshot is deleted, and not paused when ROOT volumes are snapshotted. In other words, the code of this PR is NOT executed here, but some other code is (what I shared above, I guess?), since I never see, for example, this log message from this PR: s_logger.debug("Suspending domain " + vmName);

The only thing I would like to see verified is that the VM is always suspended (like in this PR) when we CREATE a VOLUME SNAPSHOT, because a whole VM snapshot is taken and deleted even when we take just a volume snapshot (running VM, with kvm.snapshot.enabled=true).

Could you please advise?
