@tamuri
Collaborator

@tamuri tamuri commented Aug 15, 2025

Towards implementing #1665, for suspend/resume when using tlo batch-submit, with the following caveats:

  1. Currently only suspends/resumes exactly the draw+run that was suspended (i.e. cannot use another draw's runs to resume different draws, which is what we want). (done)
  2. Have to know some internal Azure Batch environment variables to get the path to the pickled simulation; want to get rid of that, somehow. (done)

tamuri added 9 commits August 15, 2025 10:58
- takes the supplied job id and makes it the full path to the saved job
- not perfect
- useful when all draws can be resumed from a single specified draw
- we may be restoring simulation from a baseline without any parameters
- the draw itself may override parameters
@tamuri tamuri requested a review from Copilot August 26, 2025 10:39
Copilot AI left a comment

Pull Request Overview

This PR implements suspend/resume functionality for simulations running on Azure Batch, allowing users to pause and later continue long-running simulations. The implementation adds support for specifying suspend/resume parameters through command-line arguments and handles the path resolution for pickled simulation files in the Azure Batch environment.

  • Adds command-line argument parsing to batch-submit to handle suspend/resume parameters
  • Modifies simulation loading logic to support resuming from pickled simulations with flexible path handling
  • Includes a test scenario file to validate suspend/resume functionality

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File | Description
src/tlo/scenario.py | Enhanced simulation loading logic to support resuming from pickled files with environment variable expansion and flexible path handling
src/tlo/cli.py | Added argument parsing to batch-submit command to handle suspend/resume parameters and construct Azure Batch file paths
src/scripts/dev/scenarios/suspend-resume-test.py | Added test scenario for validating suspend/resume functionality

Comments suppressed due to low confidence (1)

src/tlo/cli.py:1

  • This code assumes that '--resume-simulation' is always followed by exactly one argument, but doesn't validate that i+1 is within bounds. If '--resume-simulation' is the last argument, accessing scenario_args[i+1] will raise an IndexError.
"""The TLOmodel command-line interface"""
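
A minimal sketch of the bounds check this suppressed comment asks for, assuming the flag is extracted by scanning a list of arguments. The helper name `pop_resume_argument` and the error message are hypothetical and are not taken from `src/tlo/cli.py`.

```python
from typing import List, Optional, Tuple


def pop_resume_argument(
    scenario_args: List[str], flag: str = "--resume-simulation"
) -> Tuple[Optional[str], List[str]]:
    """Return (value following the flag or None, remaining arguments)."""
    if flag not in scenario_args:
        return None, list(scenario_args)
    i = scenario_args.index(flag)
    # Guard against the flag being the last argument, which would
    # otherwise raise IndexError on scenario_args[i + 1].
    if i + 1 >= len(scenario_args):
        raise ValueError(f"{flag} requires a value")
    value = scenario_args[i + 1]
    remaining = scenario_args[:i] + scenario_args[i + 2:]
    return value, remaining
```

Raising a `ValueError` with a clear message is one option; a command-line framework's own usage error would serve the same purpose.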

@marghe-molaro
Collaborator

marghe-molaro commented Jan 12, 2026

Hi @tamuri and @tbhallett,

Below are the results of my tests, as well as general feedback on the suspend/resume functionality.

TEST SET-UP

  • Submitted from branch molaro/test-suspend-resume, which branched off this PR's branch and additionally incorporated the bug fix in #1775 (Add missing clinic eligibility info when performing mode transition)
  • 20k pop
  • Considered 3 runs per draw (to ensure seeds are matched when resuming)
  • Considered an initial period of 4 years (2010-2013) before suspension (to ensure period is long enough to accrue discrepancies if suspend/resume not working as expected)
  • Suspend-date considered was 2013-12-31
  • Considered 5 draws: no changes to HS, mode change, mode change with rescaling of capabilities, mode change + change of HRH scenario, cons. avail. change

OUTPUTS
These can be downloaded with the following IDs:

  • Using suspend/resume (Resumed from effect_of_capabilities_scaling-2026-01-12T132957Z): Job ID: effect_of_capabilities_scaling-2026-01-12T144859Z
  • Submitted with standard tlo batch-submit: Job ID: effect_of_capabilities_scaling-2026-01-12T144942Z

OUTCOME OF TEST
The plots below were made from the outputs of the two job IDs above (i.e. the output from the first period, effect_of_capabilities_scaling-2026-01-12T132957Z, was never downloaded). The different line thicknesses distinguish the different runs. Outcome:

  • The resume function seems to work well for the baseline case and for two of the mode transitions (with or without rescaling). (Separately, the rescaling appears to have no effect on the outcome, even in the standard submission, which indicates an issue.)
  • The resume function seems to struggle with the transition to perfect consumables and with combining the transition to mode 2, rescaling, and switching to a different HRH configuration (Mode 2 rescaling and funded plus)

Possible reasons for discrepancy:

  • My error: it may be that the scenario file I submitted did not adequately prepare the simulation for the transition; I will double-check.
  • Issue with using suspend/resume specifically when transitioning to a different consumable or HRH configuration
[Plots: Baseline; Mode 2 no rescaling; Mode 2 with rescaling; Mode 2 with rescaling and funded plus; Mode 1 perfect consumables]

GENERAL FEEDBACK

  1. Typo in Wiki
    “Suspend the simulation at a date before the intervention is applied. So, if the intervention is 2020-01-01, we suspend the simulation 2019-12-01.”
    Should read “suspend the simulation in 2019-12-31”

  2. Ensuring dates treated consistently
    (related to 1) There are currently two variables floating around:

  • a “year (or date) of change” in the scenario file,
  • the “suspend date”,
    It would be safer for only one of these to be specified by the user, and the other one calculated accordingly (otherwise at least issue an error message if the suspend_date differs from year_of_change by more than expected +1 day difference?)
  3. tlo batch-download
    It seems unlikely that the user might ever need to download the suspended simulation, could tlo batch-download be modified to only download standard outputs/ignore suspended sim?

  4. Ensuring draw 0 in first period schedules all possible event changes
    As mentioned by tamuri during post Q3 2025 discussion, it should be made clear to the user that draw 0 from which the analysis will resume should, upon initialisation, schedule all events enacting changes that will be considered by at least one of the draws upon resuming the simulation.
    Example that could be provided: analysis where changes to the HS are enforced in YEAR_OF_CHANGE=2025. Among the four draws considered, the first draw considers no changes, the second considers a HealthSystem mode shift in that year, the third a consumable shift, and the fourth both.
    Draw 0 submitted for the first period (2010/01/01 - 2024/12/31) should ensure that events to enforce a mode and consumable switch are both scheduled for YEAR_OF_CHANGE upon initialisation, i.e. draw 0 should be submitted with the following parameters:
    year_cons_availability_switch: YEAR_OF_CHANGE ← will ensure event to switch cons availability will be “accessible” to all draws when resuming
    year_mode_switch: YEAR_OF_CHANGE ← will ensure event to switch will be accessible to all draws when resuming
    mode_appt_constraints: standard for first period (i.e. 1)
    cons_availability: standard for first period (i.e. “default”)
    mode_appt_constraints_postSwitch: 1 ← will be overwritten by each draw upon resuming
    cons_availability_postSwitch: “default” ← will be overwritten by each draw upon resuming

    even if no changes are actually contemplated by draw 0 upon resuming, so that each draw can “use” that event if needed.

  5. Clarifying use of instance variable initialisation for the user
    It should be made clear to the user (if I understood this correctly) that any instance variables initialised from parameter values at the start of the first period will not be updated upon resuming the simulation. I can imagine this may be an issue in quite a few modules.

  6. Remind user to “switch off” all draws >0 when submitting scenario for the first period if intending to copy and paste first period across draws!
    It is easy to forget this, however doing so would compromise the number one reason for introducing suspend/resume, i.e. to save on Azure resources. Can this be highlighted more prominently in the Wiki, or (better) could inbuilt checks be introduced? E.g. if tlo batch-submit with suspend-date, but number of draws > 0, check that this is really the user’s intention?

  7. Issue with rescaling population
    Information on scaling parameter not available when downloading only the second period of the simulation after resuming, hence couldn’t rescale.

  8. Longer term improvements
    Issues 2, 3, and 6 listed here could be addressed by creating the unique submission option that Tim suggested after Q3, whereby a user would specify the suspend date + which draw to use for the first period upfront, and then behind the curtains the code would ‘switch off’ all other draws for the first period (issue 6), resume a day after the suspension date (issue 2), and combine outputs for the user (issue 3).
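
As an illustration of the point about draw 0 scheduling every change event, the first-period configuration can be sketched as plain Python dictionaries. The parameter names and first-period values are those listed in that item; the variable names, the helper `parameters_after_resume`, and the post-switch values chosen for the other draws (mode 2, "all" consumables) are illustrative assumptions and are not taken from a real scenario file.

```python
YEAR_OF_CHANGE = 2025

# Draw 0, first period: both switch events are scheduled, with post-switch
# values equal to the first-period values so draw 0 itself changes nothing.
draw_0_first_period = {
    "year_mode_switch": YEAR_OF_CHANGE,
    "year_cons_availability_switch": YEAR_OF_CHANGE,
    "mode_appt_constraints": 1,
    "cons_availability": "default",
    "mode_appt_constraints_postSwitch": 1,
    "cons_availability_postSwitch": "default",
}

# Hypothetical overrides applied by each of the four draws when resuming.
resumed_draw_overrides = [
    {},                                       # draw 0: no changes
    {"mode_appt_constraints_postSwitch": 2},  # draw 1: mode shift
    {"cons_availability_postSwitch": "all"},  # draw 2: consumables shift
    {                                         # draw 3: both
        "mode_appt_constraints_postSwitch": 2,
        "cons_availability_postSwitch": "all",
    },
]


def parameters_after_resume(draw: int) -> dict:
    """Parameters seen by a resumed draw: draw 0 values plus its overrides."""
    return {**draw_0_first_period, **resumed_draw_overrides[draw]}
```

Only the `_postSwitch` values differ between draws; the switch years are fixed by draw 0, because the events were already scheduled before suspension.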

@marghe-molaro
Collaborator

Re-ran with standard tlo batch-submit (effect_of_capabilities_scaling-2026-01-13T110935Z), and the results are consistent. Outputs included below.

[Plots, re-run: Mode 2 with rescaling; Baseline; Mode 1 perfect consumables; Mode 2 no rescaling; Mode 2 with rescaling and funded plus]

@tamuri
Collaborator Author

tamuri commented Jan 14, 2026

I don't think this is the cause of the problem but I noticed that the HealthSystemChangeParameters event is scheduled at the start of the simulation with the value of (for example) cons_availability_postSwitch at that moment, which will be default in draw 0. This means it can't be used for resuming a draw with a different value.

It needs to change so that when HealthSystemChangeParameters runs, it picks up the "live" value from the HealthSystem module, not one recorded at the start of the sim.

# Schedule a consumables availability switch
sim.schedule_event(
    HealthSystemChangeParameters(
        self, parameters={"cons_availability": self.parameters["cons_availability_postSwitch"]}
    ),
    Date(self.parameters["year_cons_availability_switch"], 1, 1),
)
# Schedule an equipment availability switch
sim.schedule_event(
    HealthSystemChangeParameters(
        self, parameters={"equip_availability": self.parameters["equip_availability_postSwitch"]}
    ),
    Date(self.parameters["year_equip_availability_switch"], 1, 1),
)
# Schedule a funded-or-actual staffing switch
sim.schedule_event(
    HealthSystemChangeParameters(
        self,
        parameters={
            "use_funded_or_actual_staffing": self.parameters["use_funded_or_actual_staffing_postSwitch"]
        },
    ),
    Date(self.parameters["year_use_funded_or_actual_staffing_switch"], 1, 1),
)
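
The difference between recording a value when the event is scheduled and reading the "live" value when the event runs can be shown with a toy example. The classes `ToyModule`, `EagerChange` and `LazyChange` are hypothetical stand-ins and do not use the TLOmodel API; the value "all" is an arbitrary non-default setting chosen for illustration.

```python
class ToyModule:
    def __init__(self):
        self.parameters = {
            "cons_availability": "default",
            "cons_availability_postSwitch": "default",
        }


class EagerChange:
    """Captures the post-switch value when the event is created."""

    def __init__(self, module, new_value):
        self.module = module
        self.new_value = new_value

    def apply(self):
        self.module.parameters["cons_availability"] = self.new_value


class LazyChange:
    """Reads the post-switch value from the module when the event runs."""

    def __init__(self, module):
        self.module = module

    def apply(self):
        params = self.module.parameters
        params["cons_availability"] = params["cons_availability_postSwitch"]


def run_switch(event_kind):
    module = ToyModule()
    if event_kind == "eager":
        event = EagerChange(module, module.parameters["cons_availability_postSwitch"])
    else:
        event = LazyChange(module)
    # A resumed draw overrides the parameter after the event was scheduled.
    module.parameters["cons_availability_postSwitch"] = "all"
    event.apply()
    return module.parameters["cons_availability"]


print(run_switch("eager"))  # default
print(run_switch("lazy"))   # all
```

With the eager event, the override made after scheduling is ignored, which is the behaviour described above for draw 0.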

@marghe-molaro
Collaborator

marghe-molaro commented Jan 14, 2026

Oh absolutely - I had assumed that the parameter value was looked up when the parameter-change event is run, but clearly it isn't. I can issue a PR to update that.

(also notice that the mode switch, which works correctly with suspend/resume, does this already)

@tamuri tamuri marked this pull request as ready for review January 14, 2026 14:15
@marghe-molaro
Collaborator

Confirming that the fix brought in by PR #1777 has now resolved all discrepancies between suspend/resume and standard batch-submit, as shown by the plots below.

Noting here again that the rescaling of mode 2 is not working as expected in master, likely due to the issue highlighted in #1770. Until the fix in PR #1773 (which will also have to bring in PRs #1662 and #1689 at the same time) is merged, it might be safer to revert PR #1617 in master.

[Plots, post-fix: Baseline; Mode 1 perfect consumables; Mode 2 no rescaling; Mode 2 with rescaling and funded plus; Mode 2 with rescaling]

@tamuri
Collaborator Author

tamuri commented Jan 15, 2026

1. Typo in Wiki
   “Suspend the simulation at a date before the intervention is applied. So, if the intervention is 2020-01-01, we suspend the simulation 2019-12-01.”
   Should read “suspend the simulation in 2019-12-31”

2. Ensuring dates treated consistently
   (related to 1) There are currently two variables floating around:
   • a “year (or date) of change” in the scenario file,
   • the “suspend date”,
   It would be safer for only one of these to be specified by the user, and the other one calculated accordingly (otherwise at least issue an error message if the suspend_date differs from year_of_change by more than expected +1 day difference?)

Suspend/resume could potentially be used for any reason, not just when testing interventions using a year of change (e.g. when you need to suspend because you'll run out of compute time on your institution's compute cluster). So I'd prefer this to be something the analyst checks themselves.

3. tlo batch-download
   It seems unlikely that the user might ever need to download the suspended simulation, could tlo batch-download be modified to only download standard outputs/ignore suspended sim?

Good idea. I'll create an issue for this.

4. Ensuring draw 0 in first period schedules all possible event changes
   As mentioned by tamuri during post Q3 2025 discussion, it should be made clear to the user that draw 0 from which the analysis will resume should, upon initialisation, schedule all events enacting changes that will be considered by at least one of the draws upon resuming the simulation.
   Example that could be provided: analysis where changes to the HS are enforced in YEAR_OF_CHANGE=2025. Among the four draws considered, the first draw considers no changes, the second considers a HealthSystem mode shift in that year, the third a consumable shift, and the fourth both.
   Draw 0 submitted for the first period (2010/01/01 - 2024/12/31) should ensure that events to enforce a mode and consumable switch are both scheduled for YEAR_OF_CHANGE upon initialisation, i.e. draw 0 should be submitted with the following parameters:
   year_cons_availability_switch: YEAR_OF_CHANGE ← will ensure event to switch cons availability will be “accessible” to all draws when resuming
   year_mode_switch: YEAR_OF_CHANGE ← will ensure event to switch will be accessible to all draws when resuming
   mode_appt_constraints: standard for first period (i.e. 1)
   cons_availability: standard for first period (i.e. “default”)
   mode_appt_constraints_postSwitch: 1 ← will be overwritten by each draw upon resuming
   cons_availability_postSwitch: “default” ← will be overwritten by each draw upon resuming

   even if no changes are actually contemplated by draw 0 upon resuming, so that each draw can “use” that event if needed.

I think this is now working as expected, at least for the built-in switches. We at least have a very simple example of how it should work.

5. Clarifying use of instance variable initialisation for the user
   It should be made clear to the user (if I understood this correctly) that any instance variables initialised from parameter values at the start of the first period will not be updated upon resuming the simulation. I can imagine this may be an issue in quite a few modules.

Yes - we should guide users to schedule a future event to update all variables that need to change after resuming, as we do for the healthsystem switches.

6. Remind user to “switch off” all draws >0 when submitting scenario for the first period if intending to copy and paste first period across draws!
   It is easy to forget this, however doing so would compromise the number one reason for introducing suspend/resume, i.e. to save on Azure resources. Can this be highlighted more prominently in the Wiki, or (better) could inbuilt checks be introduced? E.g. if tlo batch-submit with suspend-date, but number of draws > 0, check that this is really the user’s intention?

We implemented suspend/resume to work both ways: resume everything from draw 0, or resume all runs in all draws. I think there is a use for the other case, but I'm happy to be convinced otherwise.

7. Issue with rescaling population
   Information on scaling parameter not available when downloading only the second period of the simulation after resuming, hence couldn’t rescale.

Could output this on simulation end?

8. Longer term improvements
   Issues 2, 3, and 6 listed here could be addressed by creating the unique submission option that Tim suggested after Q3, whereby a user would specify the suspend date + which draw to use for the first period upfront, and then behind the curtains the code would ‘switch off’ all other draws for the first period (issue 6), resume a day after the suspension date (issue 2), and combine outputs for the user (issue 3).

👍
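
On point 5, a toy sketch of why an instance variable initialised from a parameter goes stale after resuming, and how an explicit refresh (for example, triggered by a scheduled event) fixes it. `ToyModule`, `beds` and `refresh_from_parameters` are hypothetical names, not TLOmodel code, and `copy.deepcopy` is used here only as a stand-in for the save/restore round trip.

```python
import copy


class ToyModule:
    """Hypothetical module that caches a value derived from a parameter."""

    def __init__(self, parameters):
        self.parameters = dict(parameters)
        # Instance variable initialised from a parameter at start-up
        self.capacity = self.parameters["beds"] * 2

    def refresh_from_parameters(self):
        """Recompute cached state; meant to be called from a scheduled event."""
        self.capacity = self.parameters["beds"] * 2


first_period = ToyModule({"beds": 10})

# deepcopy stands in for the save/restore round trip: like unpickling,
# it rebuilds the object without calling __init__ again.
resumed = copy.deepcopy(first_period)
resumed.parameters["beds"] = 25  # the resumed draw overrides the parameter

stale_capacity = resumed.capacity  # still derived from beds=10
resumed.refresh_from_parameters()
fresh_capacity = resumed.capacity  # now derived from beds=25

print(stale_capacity, fresh_capacity)  # 20 50
```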

@marghe-molaro
Collaborator

We implemented suspend/resume to work both ways: resume everything from draw 0, or resume all runs in all draws. I think there is a use for the other case, but I'm happy to be convinced otherwise.

Yes this is fair enough (and also your point about the suspend date being calculated from the year_of_change). I agree that the fact that there are two distinct applications for this makes it difficult to create in-built checks that apply to both.

My counter-suggestion would be to add a checklist on the wiki for the second kind of application, including:

  • Ensure that the suspend date and year of change in the scenario are compatible;
  • Ensure that draw 0 schedules all potential parameter change events in the first period, even if these won't be used by draw 0 itself;
  • Ensure that the scenario file is first modified so that all draws except draw 0 are 'turned off' when submitting the first period.
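
Two of these checklist items could also be expressed as a small pre-submission check. This is a sketch only: `check_first_period_submission` and its arguments are hypothetical and not part of tlo batch-submit, and it assumes the change takes effect on 1 January of the year of change, so the suspend date should be 31 December of the previous year. The item about draw 0 scheduling every change event is not covered.

```python
from datetime import date, timedelta
from typing import List


def check_first_period_submission(
    suspend_date: date, year_of_change: int, number_of_draws: int
) -> List[str]:
    """Return a list of warnings; an empty list means both checks passed."""
    warnings = []
    change_date = date(year_of_change, 1, 1)
    # The suspend date should be exactly one day before the change date.
    if change_date - suspend_date != timedelta(days=1):
        warnings.append(
            f"suspend date {suspend_date} is not the day before {change_date}"
        )
    # Only draw 0 should be submitted for the first period.
    if number_of_draws != 1:
        warnings.append(
            f"{number_of_draws} draws submitted for the first period; expected only draw 0"
        )
    return warnings
```

For example, `check_first_period_submission(date(2019, 12, 31), 2020, 1)` returns an empty list, while a suspend date of 2019-12-01 produces a warning.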

@tamuri tamuri merged commit 85ec43b into master Jan 15, 2026
103 of 132 checks passed
@tamuri tamuri deleted the tamuri/1665-suspend-resume-batch branch January 15, 2026 14:20