Allow suspend/resume of simulations on Batch runs #1668

tamuri · 2025-08-15T10:06:16Z

Towards implementing #1665: suspend/resume when using tlo batch-submit, with the following caveats:

~~Currently only suspends/resumes exactly the draw+run that was suspended (i.e. cannot use other draw's runs to resume different draws - which is what we want)~~ done
~~Have to know some internal Azure Batch environment variables to get the path to pickled simulation - want to get rid of that, somehow.~~ done

… to work with AZ_* env vars.

- takes the supplied job id and makes it the full path to the saved job - not perfect

- Okay after simulation is setup

- useful when all draws can be resumed from a single specified draw

- we may be restoring simulation from a baseline without any parameters - the draw itself may override parameters

Copilot

Pull Request Overview

This PR implements suspend/resume functionality for simulations running on Azure Batch, allowing users to pause and later continue long-running simulations. The implementation adds support for specifying suspend/resume parameters through command-line arguments and handles the path resolution for pickled simulation files in the Azure Batch environment.

- Adds command-line argument parsing to batch-submit to handle suspend/resume parameters
- Modifies simulation loading logic to support resuming from pickled simulations with flexible path handling
- Includes a test scenario file to validate suspend/resume functionality

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/tlo/scenario.py | Enhanced simulation loading logic to support resuming from pickled files with environment variable expansion and flexible path handling |
| src/tlo/cli.py | Added argument parsing to batch-submit command to handle suspend/resume parameters and construct Azure Batch file paths |
| src/scripts/dev/scenarios/suspend-resume-test.py | Added test scenario for validating suspend/resume functionality |
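
The environment variable expansion mentioned for src/tlo/scenario.py can be illustrated with the standard library. This is a minimal sketch, not the code in the PR: the helper name `resolve_resume_path` and the variable `EXAMPLE_BATCH_DIR` are hypothetical, and the real implementation may resolve paths differently.

```python
import os
from pathlib import Path


def resolve_resume_path(raw_path: str) -> Path:
    """Expand environment variables and '~' in a user-supplied path (hypothetical helper)."""
    # os.path.expandvars leaves unknown variables untouched rather than raising.
    expanded = os.path.expandvars(raw_path)
    return Path(expanded).expanduser()


# EXAMPLE_BATCH_DIR is an illustrative variable, not one defined by Azure Batch.
os.environ["EXAMPLE_BATCH_DIR"] = "/mnt/batch/tasks"
print(resolve_resume_path("$EXAMPLE_BATCH_DIR/job-1/suspended.pkl").as_posix())
```

Because unknown variables are left unexpanded, a caller would probably still want to check that the resulting path exists before trying to load from it.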

Comments suppressed due to low confidence (1)

src/tlo/cli.py:1

This code assumes that '--resume-simulation' is always followed by exactly one argument, but doesn't validate that i+1 is within bounds. If '--resume-simulation' is the last argument, accessing scenario_args[i+1] will raise an IndexError.
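
One way to guard against the out-of-bounds access described here is to check the index before reading the value. The sketch below is an assumption about how such a check could look; `get_option_value` is a hypothetical helper, and how `scenario_args` is actually processed in cli.py is not shown in this thread.

```python
from typing import List, Optional


def get_option_value(args: List[str], flag: str) -> Optional[str]:
    """Return the value following `flag`, or None if the flag is absent.

    Raises ValueError (instead of an IndexError) if the flag is the last argument.
    """
    if flag not in args:
        return None
    i = args.index(flag)
    if i + 1 >= len(args):
        raise ValueError(f"{flag} requires a value")
    return args[i + 1]


print(get_option_value(["--resume-simulation", "job-1"], "--resume-simulation"))
```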

"""The TLOmodel command-line interface"""



marghe-molaro · 2026-01-12T18:19:55Z

Hi @tamuri and @tbhallett,

Below are the results of my tests, as well as general feedback on the suspend/resume functionality.

TEST SET-UP

- Submitted from branch molaro/test-suspend-resume, which branched off this PR's branch and additionally incorporated the bug fix "Add missing clinic eligibility info when performing mode transition" #1775
- Population of 20k
- 3 runs per draw (to ensure seeds are matched when resuming)
- An initial period of 4 years (2010-2013) before suspension (to ensure the period is long enough to accrue discrepancies if suspend/resume is not working as expected)
- Suspend date: 2013-12-31
- 5 draws: no changes to the HS, mode change, mode change with rescaling of capabilities, mode change + change of HRH scenario, and consumables availability change

OUTPUTS
These can be downloaded with the following IDs:

Using suspend/resume (Resumed from effect_of_capabilities_scaling-2026-01-12T132957Z): Job ID: effect_of_capabilities_scaling-2026-01-12T144859Z
Submitted with standard tlo batch-submit: Job ID: effect_of_capabilities_scaling-2026-01-12T144942Z

OUTCOME OF TEST
The plots below were made based on the outputs of the two job IDs above (i.e. the output from the first period, effect_of_capabilities_scaling-2026-01-12T132957Z, was never downloaded). The different thickness of the lines distinguishes the different runs. Outcome:

The resume function seems to work well for the baseline case and for two of the mode transitions (with or without rescaling). (On a separate note, it should be noted that the rescaling appears to have no effect on outcome, not even in the standard submission, which is indicative of an issue).
The resume function seems to struggle with the transition to perfect consumables and with combining the transition to mode 2, rescaling, and switching to a different HRH configuration (Mode 2 rescaling and funded plus)

Possible reasons for discrepancy:

My error: it may be that the scenario file I submitted did not adequately prepare the simulation for the transition, I will double check.
Issue with using suspend/resume specifically when transitioning to a different consumable or HRH configuration

plot_run_Mode 2 with rescaling and funded plus

GENERAL FEEDBACK

1. Typo in Wiki
   “Suspend the simulation at a date before the intervention is applied. So, if the intervention is 2020-01-01, we suspend the simulation 2019-12-01.”
   Should read “suspend the simulation in 2019-12-31”

2. Ensuring dates are treated consistently
   (related to 1) There are currently two variables floating around:
   - a “year (or date) of change” in the scenario file,
   - the “suspend date”.

   It would be safer for only one of these to be specified by the user, and the other one calculated accordingly (otherwise, at least issue an error message if the suspend_date differs from year_of_change by more than the expected 1-day difference?)

3. tlo batch-download
   It seems unlikely that the user would ever need to download the suspended simulation. Could tlo batch-download be modified to only download standard outputs and ignore the suspended sim?

4. Ensuring draw 0 in first period schedules all possible event changes
   As mentioned by tamuri during the post-Q3 2025 discussion, it should be made clear to the user that draw 0, from which the analysis will resume, should, upon initialisation, schedule all events enacting changes that will be considered by at least one of the draws upon resuming the simulation.
   Example that could be provided: an analysis where changes to the HS are enforced in YEAR_OF_CHANGE=2025. Among the four draws considered, the first draw considers no changes, the second considers a HealthSystem mode shift in that year, the third a consumable shift, and the fourth both.
   Draw 0 submitted for the first period (2010/01/01 - 2024/12/31) should ensure that events to enforce a mode and consumable switch are both scheduled for YEAR_OF_CHANGE upon initialisation, i.e. draw 0 should be submitted with the following parameters:
   - year_cons_availability_switch: YEAR_OF_CHANGE ← will ensure the event to switch cons availability will be “accessible” to all draws when resuming
   - year_mode_switch: YEAR_OF_CHANGE ← will ensure the event to switch mode will be accessible to all draws when resuming
   - mode_appt_constraints: standard for first period (i.e. 1),
   - cons_availability: standard for first period (i.e. “default”),
   - mode_appt_constraints_postSwitch: 1, ← will be overwritten by each draw upon resuming
   - cons_availability_postSwitch: "default", ← will be overwritten by each draw upon resuming

   even if no changes are actually contemplated by draw 0 upon resuming, so that each draw can “use” that event if needed.

5. Clarifying use of instance variable initialisation for the user
   It should be made clear to the user (if I understood this correctly) that any instance variables initialised from parameter values at the start of the first period will not be updated upon resuming the simulation. I can imagine this may be an issue in quite a few modules.

6. Remind user to “switch off” all draws >0 when submitting the scenario for the first period if intending to copy and paste the first period across draws!
   It is easy to forget this, but forgetting it would compromise the number one reason for introducing suspend/resume, i.e. to save on Azure resources. Can this be highlighted more prominently in the Wiki, or (better) could inbuilt checks be introduced? E.g. if tlo batch-submit is run with suspend-date, but number of draws > 0, check that this is really the user’s intention?

7. Issue with rescaling population
   Information on the scaling parameter is not available when downloading only the second period of the simulation after resuming, hence I couldn’t rescale.

8. Longer term improvements
   Issues 2, 3, and 6 listed here could be addressed by creating the unique submission option that Tim suggested after Q3, whereby a user would specify the suspend date + which draw to use for the first period upfront, and then behind the curtains the code would ‘switch off’ all other draws for the first period (issue 6), resume a day after the suspension date (issue 2), and combine outputs for the user (issue 3).
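
The draw 0 set-up described under "Ensuring draw 0 in first period schedules all possible event changes" could be written down as a plain dictionary. This is an illustration only: the parameter names are the ones listed in that feedback, but the function name `first_period_parameters`, the grouping under a `"HealthSystem"` key, and the use of integer years are assumptions and not the actual scenario file.

```python
YEAR_OF_CHANGE = 2025


def first_period_parameters() -> dict:
    """Parameters for draw 0 in the first period (hypothetical helper)."""
    return {
        "HealthSystem": {
            # Schedule both switch events so that every resumed draw can use them.
            "year_cons_availability_switch": YEAR_OF_CHANGE,
            "year_mode_switch": YEAR_OF_CHANGE,
            # Standard values for the first period.
            "mode_appt_constraints": 1,
            "cons_availability": "default",
            # Placeholders: each draw is expected to override these upon resuming.
            "mode_appt_constraints_postSwitch": 1,
            "cons_availability_postSwitch": "default",
        }
    }


print(first_period_parameters()["HealthSystem"]["year_mode_switch"])
```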

marghe-molaro · 2026-01-13T17:17:39Z

Re-ran with standard tlo batch-submission (effect_of_capabilities_scaling-2026-01-13T110935Z), and results are consistent. Including outputs below

plot_run_Mode 2 with rescaling and funded plus2

tamuri · 2026-01-14T07:43:42Z

I don't think this is the cause of the problem but I noticed that the HealthSystemChangeParameters event is scheduled at the start of the simulation with the value of (for example) cons_availability_postSwitch at that moment, which will be default in draw 0. This means it can't be used for resuming a draw with a different value.

It needs to change so that when HealthSystemChangeParameters runs, it picks up the "live" value from the HealthSystem module, not one recorded at the start of the sim.

TLOmodel/src/tlo/methods/healthsystem.py

Lines 882 to 907 in cb63057

    
```python
        # Schedule a consumables availability switch
        sim.schedule_event(
            HealthSystemChangeParameters(
                self, parameters={"cons_availability": self.parameters["cons_availability_postSwitch"]}
            ),
            Date(self.parameters["year_cons_availability_switch"], 1, 1),
        )
        # Schedule an equipment availability switch
        sim.schedule_event(
            HealthSystemChangeParameters(
                self, parameters={"equip_availability": self.parameters["equip_availability_postSwitch"]}
            ),
            Date(self.parameters["year_equip_availability_switch"], 1, 1),
        )
        # Schedule an equipment availability switch
        sim.schedule_event(
            HealthSystemChangeParameters(
                self,
                parameters={
                    "use_funded_or_actual_staffing": self.parameters["use_funded_or_actual_staffing_postSwitch"]
                },
            ),
            Date(self.parameters["year_use_funded_or_actual_staffing_switch"], 1, 1),
        )
```
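
The difference between capturing a value when the event is scheduled and reading it when the event runs can be shown with a toy model. This is a sketch only: `ToyModule`, `EagerChange` and `LazyChange` are hypothetical stand-ins that do not reproduce HealthSystemChangeParameters, and `"all"` is just an illustrative post-switch value.

```python
class ToyModule:
    """Stand-in for a module holding parameters (hypothetical)."""

    def __init__(self):
        self.parameters = {
            "cons_availability": "default",
            "cons_availability_postSwitch": "default",
        }


class EagerChange:
    """Stores the post-switch value at scheduling time."""

    def __init__(self, module):
        self.module = module
        self.value = module.parameters["cons_availability_postSwitch"]

    def apply(self):
        self.module.parameters["cons_availability"] = self.value


class LazyChange:
    """Looks up the post-switch value only when the event runs."""

    def __init__(self, module):
        self.module = module

    def apply(self):
        params = self.module.parameters
        params["cons_availability"] = params["cons_availability_postSwitch"]


def run(event_cls) -> str:
    module = ToyModule()
    event = event_cls(module)  # scheduled in the first period (draw 0)
    module.parameters["cons_availability_postSwitch"] = "all"  # overridden on resume
    event.apply()  # event fires after resuming
    return module.parameters["cons_availability"]


print(run(EagerChange))  # default
print(run(LazyChange))   # all
```

With the eager variant the resumed draw's override is ignored, because the value was fixed when draw 0 scheduled the event; the lazy variant picks up whatever value is current when the event fires.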

marghe-molaro · 2026-01-14T08:26:57Z

> I don't think this is the cause of the problem but I noticed that the HealthSystemChangeParameters event is scheduled at the start of the simulation with the value of (for example) cons_availability_postSwitch at that moment, which will be default in draw 0. This means it can't be used for resuming a draw with a different value.
>
> It needs to change so that when HealthSystemChangeParameters runs, it picks up the "live" value from the HealthSystem module, not one recorded at the start of the sim.
>
> TLOmodel/src/tlo/methods/healthsystem.py, lines 882 to 907 in cb63057

Oh absolutely - I had assumed that the parameter value was looked up when the parameter-change event is run, but clearly it isn't. I can issue a PR to update that.

(also notice that the mode switch, which works correctly with suspend/resume, does this already)

marghe-molaro · 2026-01-14T22:29:36Z

Confirming that the fix brought in by PR #1777 has now resolved all discrepancies between suspend/resume and standard batch-submit, as shown by the plots below.

Noting here again that the rescaling of mode 2 is not working as expected in master, likely due to the issue highlighted in #1770. Until the fix in PR #1773 (which will also have to simultaneously bring in PRs #1662 and #1689) is merged, it might be safer to reverse PR #1617 in master.

plot_run_Mode 1 perfect consumables_post_fix

plot_run_Mode 2 with rescaling and funded plus_post_fix

tamuri · 2026-01-15T08:59:13Z

> 1. Typo in Wiki
>    “Suspend the simulation at a date before the intervention is applied. So, if the intervention is 2020-01-01, we suspend the simulation 2019-12-01.”
>    Should read “suspend the simulation in 2019-12-31”
>
> 2. Ensuring dates treated consistently
>    (related to 1) There are currently two variables floating around:
>    - a “year (or date) of change” in the scenario file,
>    - the “suspend date”,
>
>    It would be safer for only one of these to be specified by the user, and the other one calculated accordingly (otherwise at least issue an error message if the suspend_date differs from year_of_change by more than expected +1 day difference?)

Suspend/resume could potentially be used for any reason, not just when testing interventions using year of change (e.g. when you need to suspend because you'll run out of compute time on your institution's compute cluster). So I prefer if this is something the analyst has to check themselves.

> 3. tlo batch-download
>    It seems unlikely that the user might ever need to download the suspended simulation, could tlo batch-download be modified to only download standard outputs/ignore suspended sim?

Good idea. I'll create an issue for this.

> 4. Ensuring draw 0 in first period schedules all possible event changes
>    As mentioned by tamuri during post Q3 2025 discussion, it should be made clear to the user that draw 0 from which the analysis will resume should, upon initialisation, schedule all events enacting changes that will be considered by at least one of the draws upon resuming the simulation.
>    Example that could be provided: analysis where changes to the HS are enforced in YEAR_OF_CHANGE=2025. Among the four draws considered, the first draw considers no changes, the second considers a HealthSystem mode shift in that year, the third a consumable shift, and the fourth both.
>    Draw 0 submitted for the first period (2010/01/01 - 2024/12/31) should ensure that events to enforce a mode and consumable switch are both scheduled for YEAR_OF_CHANGE upon initialisation, i.e. draw 0 should be submitted with the following parameters:
>    - year_cons_availability_switch: YEAR_OF_CHANGE ← will ensure event to switch cons availability will be “accessible” to all draws when resuming
>    - year_mode_switch: YEAR_OF_CHANGE ← will ensure event to switch will be accessible to all draws when resuming
>    - mode_appt_constraints: standard for first period (i.e. 1),
>    - cons_availability: standard for first period (i.e. “default”),
>    - mode_appt_constraints_postSwitch: 1, ← will be overwritten by each draw upon resuming
>    - cons_availability_postSwitch: "default", ← will be overwritten by each draw upon resuming
>
>    even if no changes are actually contemplated by draw 0 upon resuming, so that each draw can “use” that event if needed.

I think this is now working as expected, at least for the built-in switches. We at least have a very simple example of how it should work.

> 5. Clarifying use of instance variable initialisation for the user
>    It should be made clear to the user (if I understood this correctly) that any instance variables initialised from parameter values at the start of the first period will not be updated upon resuming the simulation. I can imagine this may be an issue in quite a few modules.

Yes - we should guide users to schedule a future event to update all variables that need to change after resuming, as we do for the healthsystem switches.
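
The stale instance variable problem, and the effect of an explicit update, can be sketched with a toy class. Everything here is hypothetical (`CapacityModule`, `capacity_factor`, `refresh`), and `copy.deepcopy` only stands in for the suspend/resume round trip, which in the real workflow saves and restores a pickled simulation; like unpickling, it restores attributes without re-running `__init__`.

```python
import copy


class CapacityModule:
    """Toy module that caches a value derived from a parameter (hypothetical)."""

    def __init__(self, parameters: dict):
        self.parameters = parameters
        # Instance variable computed once, at initialisation.
        self.scaled_capacity = 100 * parameters["capacity_factor"]

    def refresh(self) -> None:
        """What a scheduled update event could do after resuming."""
        self.scaled_capacity = 100 * self.parameters["capacity_factor"]


module = CapacityModule({"capacity_factor": 1.0})
restored = copy.deepcopy(module)              # stand-in for suspend, then resume
restored.parameters["capacity_factor"] = 2.0  # the resumed draw overrides the parameter

stale = restored.scaled_capacity  # still reflects the first-period parameter
restored.refresh()
fresh = restored.scaled_capacity  # reflects the overridden parameter
print(stale, fresh)  # 100.0 200.0
```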

> 6. Remind user to “switch off” all draws >0 when submitting scenario for the first period if intending to copy and paste first period across draws!
>    It is easy to forget this, however doing so would compromise the number one reason for introducing suspend/resume, i.e. to save on Azure resources. Can this be highlighted more prominently in the Wiki, or (better) could inbuilt checks be introduced? E.g. if tlo batch-submit with suspend-date, but number of draws > 0, check that this is really the user’s intention?

We implemented suspend/resume to work both ways - resume everything from draw 0 or resume all runs in all draws. If we think there's no use for the other case (I think there is, but happy to be convinced otherwise).

> 7. Issue with rescaling population
>    Information on scaling parameter not available when downloading only the second period of the simulation after resuming, hence couldn’t rescale.

Could output this on simulation end?

> 8. Longer term improvements
>    Issues 2, 3, and 6 listed here could be addressed by creating the unique submission option that Tim suggested after Q3, whereby a user would specify the suspend date + which draw to use for the first period upfront, and then behind the curtains the code would ‘switch off’ all other draws for the first period (issue 6), resume a day after the suspension date (issue 2), and combine outputs for the user (issue 3).

👍

marghe-molaro · 2026-01-15T09:23:10Z

> We implemented suspend/resume to work both ways - resume everything from draw 0 or resume all runs in all draws. If we think there's no use for the other case (I think there is, but happy to be convinced otherwise).

Yes this is fair enough (and also your point about the suspend date being calculated from the year_of_change). I agree that the fact that there are two distinct applications for this makes it difficult to create in-built checks that apply to both.

My counter suggestion would be to add a check-list on the wiki for the second kind of application, including:

- Ensure that the suspend date and the year of change in the scenario are compatible;
- Ensure that draw 0 schedules all potential parameter change events in the first period, even if these won't be used by draw 0 itself;
- Ensure that the scenario file is modified in the first instance so that all draws except draw 0 are 'turned off' when submitting the first period.
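
The first check-list item could also be verified with a few lines of code by the analyst. The sketch below assumes that the change takes effect on 1 January of the year of change and that the simulation should be suspended on the day before; the function names are hypothetical and nothing like this is part of the PR.

```python
from datetime import date, timedelta


def expected_suspend_date(year_of_change: int) -> date:
    """Last day before a change that takes effect on 1 January of `year_of_change`."""
    return date(year_of_change, 1, 1) - timedelta(days=1)


def check_suspend_date(suspend_date: date, year_of_change: int) -> None:
    """Raise ValueError if the suspend date is not the day before the change."""
    expected = expected_suspend_date(year_of_change)
    if suspend_date != expected:
        raise ValueError(
            f"suspend date {suspend_date} does not match expected {expected}"
        )


check_suspend_date(date(2019, 12, 31), 2020)  # compatible, so nothing is raised
```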

tamuri added 9 commits August 15, 2025 10:58

Pass through command-line arguments to scenario

f6a0556

Expand environment variables in specified path if provided (quick fix…

70d721b

… to work with AZ_* env vars.

Close the log file handle before pickling the simulation

81c36d9

Scenario for testing suspend/resume

4723a4e

Rewrite the argument for resume-simulation argument

fb3c80d

- takes the supplied job id and makes it the full path to the saved job - not perfect

Use printing because logger only works when configured via simulation

e040894

- Okay after simulation is setup

Improve message

16b2618

Allow user to specific a specific draw to resume simulation from

91fa801

- useful when all draws can be resumed from a a single specified draw

Override parameters when restoring simulation

509fc76

- we may be restoring simulation from a baseline without any parameters - the draw itself may override parameters

tamuri requested a review from Copilot August 26, 2025 10:39

Copilot AI reviewed Aug 26, 2025


tamuri and others added 9 commits August 26, 2025 16:04

Handle resuming from multi-digit draws

6a2ac5d

Merge branch 'master' into tamuri/1665-suspend-resume-batch

d72ae20

Create scenario to test suspend/resume in HealthSystem

44cdc96

Merge branch 'tamuri/1665-suspend-resume-batch' of https://github.com…

e2cec0c

…/UCL/TLOmodel into tamuri/1665-suspend-resume-batch Rebase to remote

Modify scenario file to run the complete analysis upon resuming

827b362

Add analysis file to check suspend/resume

b376010

Ensure year of transition occurs later to avoid bugs

090d445

Switch off non-0 draws for first period

73bf3eb

Restore all draws in scenario

cb63057

This was referenced Jan 14, 2026

Make cons, equip, and hrh switch suspend/resume compatible #1777

Merged

Computation of rescaling factors is incorrect #1770

Open

Merge branch 'master' into tamuri/1665-suspend-resume-batch

dcf80ce

tamuri marked this pull request as ready for review January 14, 2026 14:15

tamuri and others added 2 commits January 14, 2026 14:26

Linting

f0fb6ce

Fix scenario file in this branch and turn off all draws > 0

27aaa35

Restore draws > 0

2e1d217

marghe-molaro added 3 commits January 15, 2026 09:31

Fix analysis file

8026834

Style fixes

88bd5d5

Fix import format on analysis file

bc131f8

tamuri merged commit 85ec43b into master Jan 15, 2026
103 of 132 checks passed

tamuri deleted the tamuri/1665-suspend-resume-batch branch January 15, 2026 14:20
