Add method to generate scaling experiments to PayuManager #46

Merged
micaeljtoliveira merged 3 commits into main from generate_experiments on Feb 17, 2026
Conversation

@micaeljtoliveira
Member Author

@manodeep Would you mind also having a look at this PR? It should clarify some of the changes made in ACCESS-NRI/access-config-utils#34

@codecov

codecov bot commented Feb 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (4ae8487) to head (22c20f2).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##              main       #46   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           16        16           
  Lines          791       834   +43     
=========================================
+ Hits           791       834   +43     

☔ View full report in Codecov by Sentry.

@micaeljtoliveira
Member Author

@manodeep Later this week I'll do a final check with the latest changes from ACCESS-NRI/access-config-utils#34, but I think this is ready for review.

@manodeep (Collaborator) left a comment:

@micaeljtoliveira Looked through the changes - mostly looks good to me, and I've asked for clarification on a few points.

I don't know enough about 'MagicMock' to properly review the testing bits, but they seem fine. Coverage is 100%, right?

        return logs

    def generate_core_layouts_from_node_count(
        self, num_nodes: float, cores_per_node: int, layout_search_config: LayoutSearchConfig | None = None
Collaborator:

I don't know the | None = None syntax - does that mean "None" is allowed as an input, and is then passed on to the routine? (Basically, trying to figure out what the = None bit is doing)

Member Author:

You need to read this in two parts:
layout_search_config: LayoutSearchConfig | None: the parameter is of type LayoutSearchConfig or None.
layout_search_config = None: the usual way of providing a default value.

Collaborator:

Thanks! I still don't follow why the =None is necessary. Either the parameter is passed as an instance of LayoutSearchConfig or it is passed as None.

Regardless, this is just for my curiosity :)

Collaborator:

Scratch that - type-hinting does not actually do runtime checks, so the =None is indeed required
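For the curious, a small sketch illustrating the point above: annotations are metadata only, and CPython does not enforce them at runtime.

```python
def describe(count: int) -> str:
    # Despite the "int" annotation, no runtime check is performed here;
    # annotations are only consumed by static checkers like mypy.
    return f"count is {count}"


# Passing a string despite the "int" annotation raises no error at runtime:
result = describe("not an int")
print(result)  # count is not an int
```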

Member Author:

Yeah, sorry for not clarifying that part.

    logger.info(f"Generated {len(layouts)} layouts for {num_nodes} nodes. Layouts: {layouts}")

    # TODO: the branch name needs to be simpler and model agnostic
    branch_name = f"layout-unused-cores-to-cice-{layout_config.allocate_unused_cores_to_ice}"
Collaborator:

This reminded me - that feature is broken :(

Member Author:

Thanks for creating an issue about it!

    for layout in layouts:
        pert_config = self.generate_perturbation_block(layout=layout, branch_name_prefix=branch_name)
        branch = pert_config["branches"][0]
        pert_config["config.yaml"]["walltime"] = str(timedelta(hours=walltime_hrs))
Collaborator:

Why does the walltime need to be set explicitly?

(I remember there was something odd about the walltime in experiment-generator - but maybe just add a comment that clarifies why the walltime needs to be set)

Member Author:

The walltime is a value set in the payu config (config.yaml), so here we are letting the user set it differently for each perturbation by providing a function.

You had similar logic in the ESM1.6 scaling notebook, but there you were directly replacing the text of the yaml block:

        walltime_hrs += nblocks_added * (1.5 * 4.0/num_nodes) # use a 1.5 hrs time for 4-node runs, and then scale linearly
        entire_block += block

    entire_block = entire_block.replace("walltime: Inf", f"walltime: {int(walltime_hrs)}:00:00")
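A minimal sketch of the per-perturbation approach described above: the user supplies a function mapping node count to walltime hours, and each perturbation's config gets its own walltime string. The names walltime_func and the inverse-scaling rule (mirroring the notebook line above) are illustrative assumptions, not the actual PayuManager API.

```python
from datetime import timedelta


def walltime_func(num_nodes: float) -> float:
    # Hypothetical user-supplied rule: 1.5 hours for a 4-node run,
    # scaled inversely with node count (as in the notebook snippet above).
    return 1.5 * 4.0 / num_nodes


# Each perturbation's config.yaml would get its own walltime string:
walltime = str(timedelta(hours=walltime_func(2.0)))
print(walltime)  # 3:00:00
```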

Member Author:

To be clear, one could just pass it as part of the control_options, but then it would always be the same for all perturbations.

Collaborator:

I am not sure why the janky logic of increasing the walltime with the number of scaling experiments actually worked 😅 - I might have had to edit it by hand afterwards

Member Author:

Hopefully the new implementation will be less janky 😄

    # Verify the configuration passed to ExperimentGenerator has correct walltime
    call_args = mock_experiment_generator.call_args[0][0]
    assert (
        call_args["Perturbation_Experiment"]["Experiment_1"]["config.yaml"]["walltime"] == "5:00:00"
Collaborator:

If you split out num_nodes_list into a variable above, then you can use that variable in the manager.generate_scaling_experiments call, as well as here with calls to walltime_func for each element in that list.

Member Author:

Yes, but that requires further manipulation of the result of the function, just like what's done in the PayuManager code:

                pert_config["config.yaml"]["walltime"] = str(timedelta(hours=walltime_hrs))

So the tests would be:

    assert (
        call_args["Perturbation_Experiment"]["Experiment_1"]["config.yaml"]["walltime"] == str(timedelta(hours=walltime_func(2.0)))
    )

which then doesn't check that str(timedelta(hours=walltime_func(2.0))) is doing what it's supposed to do.

Collaborator:

I can see it both ways. I like making changes in one place and having that update everywhere; however, I can also see that, since this is in a "tests" context, it makes sense to have duplicated logic.

Member Author:

Yes, I agree that keeping the code DRY is a good thing, but it's not always clear how far to take that in tests. I think a good rule of thumb is to not include any of the code being tested in the unit test itself.
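To make the rule of thumb concrete, here is a small illustrative sketch (format_walltime is a hypothetical simplification of the conversion under discussion, not actual PayuManager code):

```python
from datetime import timedelta


def format_walltime(hours: float) -> str:
    # Simplified stand-in for the conversion the code under test applies.
    return str(timedelta(hours=hours))


# Recomputing the expectation with the code under test proves nothing:
assert format_walltime(5.0) == str(timedelta(hours=5.0))  # tautological

# Hard-coding the expected value actually exercises the conversion:
assert format_walltime(5.0) == "5:00:00"
```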

…ake a list of node counts, generate layouts that match the user-supplied criteria, and create the corresponding experiments.

Currently the method still has some code that is somewhat ESM1.6-specific (e.g., the branch naming scheme), but we need to support more models before deciding on the best way to generalise it.

Deleted the generate_experiments method, as it should not be needed anymore. If other types of generation are needed (i.e., not scaling tests), specific methods should be implemented for those.
@manodeep
Collaborator

@micaeljtoliveira Thanks for the clarifications - is this all finalised now?

@micaeljtoliveira
Member Author

@manodeep Yes, all finalised from my side. And thanks for the review.

@micaeljtoliveira
Member Author

I forgot to say, I did test this as best I could by making sure it was producing the same perturbations as the ESM1.6 scaling notebook, but I couldn't actually run those perturbation experiments (the corresponding executables are gone...). That means that, when we update the notebook, we'll have to check the results carefully.

@micaeljtoliveira micaeljtoliveira merged commit 20576d1 into main Feb 17, 2026
8 checks passed
@micaeljtoliveira micaeljtoliveira deleted the generate_experiments branch February 17, 2026 02:52