-
Notifications
You must be signed in to change notification settings - Fork 16
Allow suspend/resume of simulations on Batch runs #1668
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… to work with AZ_* env vars.
- takes the supplied job id and makes it the full path to the saved job - not perfect
- Okay after simulation is setup
- useful when all draws can be resumed from a a single specified draw
- we may be restoring simulation from a baseline without any parameters - the draw itself may override parameters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements suspend/resume functionality for simulations running on Azure Batch, allowing users to pause and later continue long-running simulations. The implementation adds support for specifying suspend/resume parameters through command-line arguments and handles the path resolution for pickled simulation files in the Azure Batch environment.
- Adds command-line argument parsing to
batch-submitto handle suspend/resume parameters - Modifies simulation loading logic to support resuming from pickled simulations with flexible path handling
- Includes a test scenario file to validate suspend/resume functionality
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/tlo/scenario.py | Enhanced simulation loading logic to support resuming from pickled files with environment variable expansion and flexible path handling |
| src/tlo/cli.py | Added argument parsing to batch-submit command to handle suspend/resume parameters and construct Azure Batch file paths |
| src/scripts/dev/scenarios/suspend-resume-test.py | Added test scenario for validating suspend/resume functionality |
Comments suppressed due to low confidence (1)
src/tlo/cli.py:1
- This code assumes that '--resume-simulation' is always followed by exactly one argument, but doesn't validate that
i+1is within bounds. If '--resume-simulation' is the last argument, accessingscenario_args[i+1]will raise an IndexError.
"""The TLOmodel command-line interface"""
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
…/UCL/TLOmodel into tamuri/1665-suspend-resume-batch Rebase to remote
|
Hi @tamuri and @tbhallett, Below the results of my tests as well as general feedback on the suspend/resume functionality TEST SET-UP
OUTPUTS
OUTCOME OF TEST
Possible reasons for discrepancy:
GENERAL FEEDBACK
even if no changes are actually contemplated by draw 0 upon resuming, so that each draw can “use” that event if needed.
|
|
I don't think this is the cause of the problem but I noticed that the It needs to change so that when TLOmodel/src/tlo/methods/healthsystem.py Lines 882 to 907 in cb63057
|
Oh absolutely - I had assumed that the parameter value was looked up when the parameter-change event is ran, but clearly it isn't. I can issue a PR to update that (also notice that the mode switch, which works correctly with suspend/resume, does this already) |
|
Confirming that fix brought in by PR #1777 has now resolved all discrepancies between suspend/resume and standard batch-submit, as shown by the plots below. Noting here again that the rescaling of mode 2 is not working as expected in master, likely due to issue highlighted in #1770. Until fix in PR #1773 (which will also have to simultaneously bring in PRs #1662 and #1689) is merged, it might be safer to reverse PR #1617 in master.
|
Suspend/resume could potentially be used for any reason, not just when testing interventions using year of change (e.g. when you need to suspend because you'll run out of compute time on your institution's compute cluster). So I prefer if this is something the analyst has to check themselves.
Good idea. I'll create an issue for this.
I think this is now working as expected, at least for the built-in switches. We at least have a very simple example of how it should work.
Yes - we should guide users to schedule a future event to update all variables that need to change after resuming, as we do for the healthsystem switches.
We implemented suspend/resume to work both ways - resume everything from draw 0 or resume all runs in all draws. If we think there's no use for the other case (I think there is, but happy to be convinced otherwise).
Could output this on simulation end?
👍 |
Yes this is fair enough (and also your point about the suspend date being calculated from the year_of_change). I agree that the fact that there are two distinct applications for this makes it difficult to create in-built checks that apply to both. My counter suggestion would be to add a check-list on the wiki for the second kind of application, including:
|















Towards implementing #1665, for suspend/resume when using
tlo batch-submitwith following caveats:Currently only suspends/resumes exactly the draw+run that was suspended (i.e. cannot use other draw's runs to resume different draws - which is what we want)doneHave to know some internal Azure Batch environment variables to get the path to pickled simulation - want to get rid of that, somehow.done