Speed up SSVM startup time by postponing checks#4641

Closed
ravening wants to merge 1 commit into apache:main from
ravening:ssvm-speedup

Conversation

@ravening
Member

@ravening ravening commented Feb 2, 2021

Description

Currently the SSVM takes a long time to start up.
This PR reduces the startup time by moving the volume/template
check to a post-connect TimerTask.
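
For readers unfamiliar with the pattern, here is a minimal sketch of deferring an expensive check to a `java.util.TimerTask` so that the connect path returns immediately. The class and method names (`DeferredCheckSketch`, `onConnect`, `awaitCheck`) are hypothetical and are not taken from the CloudStack code base; the actual change in this PR may be structured differently.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: all names here are illustrative, not CloudStack's.
class DeferredCheckSketch {
    // Daemon timer thread, so it does not keep the JVM alive on its own.
    private final Timer timer = new Timer("post-connect", true);
    private final CountDownLatch checkDone = new CountDownLatch(1);

    // Returns immediately; the expensive check runs later on the timer thread.
    void onConnect(long delayMillis, Runnable expensiveCheck) {
        timer.schedule(new TimerTask() {
            @Override
            public void run() {
                try {
                    expensiveCheck.run();
                } finally {
                    checkDone.countDown();
                }
            }
        }, delayMillis);
    }

    // Blocks until the deferred check has run, or the timeout elapses.
    boolean awaitCheck(long timeoutMillis) {
        try {
            return checkDone.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Note that in this sketch the caller of `onConnect` learns nothing about the outcome of the check unless it later calls `awaitCheck`, which is the tradeoff discussed in the conversation below.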

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

Currently the SSVM takes a long time to start up.
This PR reduces the startup time by moving the volume/template
check to a post-connect TimerTask.
Comment on lines +325 to +326
private void postConnect(Host agent, StartupCommand cmd) throws ConnectionException {
if (cmd instanceof StartupSecondaryStorageCommand) {
Contributor

This is only ever called with a StartupSecondaryStorageCommand, so can we change the parameter type and skip the class check?
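
A small sketch of what this suggestion could look like. The two command classes below are empty stand-ins declared only so the example compiles; they are not the real CloudStack types, and the method bodies are placeholders.

```java
// Hypothetical stand-ins for the real CloudStack types, for illustration only.
class StartupCommand {}
class StartupSecondaryStorageCommand extends StartupCommand {}

class PostConnectSketch {
    // Before: accepts any StartupCommand and filters at runtime.
    static boolean postConnectChecked(StartupCommand cmd) {
        if (cmd instanceof StartupSecondaryStorageCommand) {
            return true;  // the storage checks would run here
        }
        return false;     // silently ignored for every other command type
    }

    // After: the compiler enforces the type, so no instanceof check is needed.
    static boolean postConnectTyped(StartupSecondaryStorageCommand cmd) {
        return true;      // the storage checks would run here
    }
}
```

With the narrower signature, passing the wrong command type becomes a compile-time error rather than a silent no-op.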

Comment on lines +318 to +319
if (_postConnectTask != null)
_postConnectTask.cancel();
Contributor

Is this only ever working on one download job? It seems we are canceling the download check for someone else here.
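
To illustrate the concern: if a single `_postConnectTask` field is shared by all connecting hosts, a second host's connect would cancel the first host's pending check. One possible alternative, sketched here under the assumption that tasks can be keyed by a numeric host ID, is to track one task per host. `PerHostTaskRegistry` and its methods are hypothetical names, not part of the PR.

```java
import java.util.Map;
import java.util.TimerTask;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; the class, field, and method names are illustrative.
class PerHostTaskRegistry {
    private final Map<Long, TimerTask> tasksByHost = new ConcurrentHashMap<>();

    // Registers a task for one host, cancelling only that host's previous task.
    void register(long hostId, TimerTask task) {
        TimerTask previous = tasksByHost.put(hostId, task);
        if (previous != null) {
            previous.cancel();
        }
    }

    // True if a task is currently tracked for the given host.
    boolean hasTask(long hostId) {
        return tasksByHost.containsKey(hostId);
    }
}
```

In this sketch, a reconnect from host 1 cancels only host 1's earlier task and leaves any task registered for host 2 untouched.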

@shwstppr shwstppr added this to the 4.16.0.0 milestone Feb 8, 2021
@yadvr
Member

yadvr commented Jun 17, 2021

I think it may cause some issue if we don't do this, esp in env with very large no. of templates, volumes, snapshots - pl discuss on dev/users@ @ravening

@yadvr
Member

yadvr commented Sep 8, 2021

Ping @ravening

@ravening
Member Author

ravening commented Sep 8, 2021

@weizhouapache do you have any feedback on this PR?

@yadvr
Member

yadvr commented Sep 14, 2021

Pl discuss on dev@, in my opinion this should be closed as postponing check isn't something we would want. The tradeoff of speedup vs check needs to be discussed.

@weizhouapache
Member

> Pl discuss on dev@, in my opinion this should be closed as postponing check isn't something we would want. The tradeoff of speedup vs check needs to be discussed.

@ravening
Maybe I am missing something, but I think this is good (it has been used for many years as far as I know).
Anyway, it is not a major bug fix.

@yadvr
Member

yadvr commented Sep 14, 2021

@weizhouapache I'm just being risk-averse :) I see that the checks are only postponed, but I worry whether not doing the checks at the time of initial connection could cause failures. For example, you allow the agent to connect and it starts doing some storage-related operation, and later the post-connect thread/check runs and finds out we shouldn't have connected, or it fails, but since the agent is connected, the agent continues to process new commands? (One case could be that the template/volume scanning logic times out but the connection is still up, so the agent gets some new operation, say re-downloading a template from a URL that is wrong/404 etc., and it ends up overwriting existing files?)

@weizhouapache
Member

> @weizhouapache I'm just being risk-averse :) I see that the checks are only postponed, but I worry whether not doing the checks at the time of initial connection could cause failures. For example, you allow the agent to connect and it starts doing some storage-related operation, and later the post-connect thread/check runs and finds out we shouldn't have connected, or it fails, but since the agent is connected, the agent continues to process new commands? (One case could be that the template/volume scanning logic times out but the connection is still up, so the agent gets some new operation, say re-downloading a template from a URL that is wrong/404 etc., and it ends up overwriting existing files?)

@rhtyd these are very good questions.
@ravening are you OK with closing this PR? The issue this PR aims to fix is that the SSVM will be Up only once all images/volumes are checked, which I don't think is a major issue.

@ravening
Member Author

closing it after discussion
