Feat: launch-slurm should upload important config values when registering an experiment #1423

@tjhunter

Description

Describe the task. It can be a feature, a set of experiments, documentation, etc.

Use case: as a scientist, I launch a batch of experiments with a loop such as this one:

for lr in "5e-5" "1e-4" "2e-4" "4e-4" ; do
  for node in 2 4 8 ; do
    echo "$lr $node"
    ../WeatherGenerator-private/hpc/launch-slurm.py --chain-jobs 1 --nodes "$node" --options "wgtags.org='ecmwf'" "wgtags.exp='lr_scaling'" "wgtags.issue='1168'" "lr_max=$lr" "num_mini_epochs=1024" "wgtags.num_nodes=$node"
  done
done

Currently, the config and the tags are uploaded only at the end of the training run, so I have to wait for an experiment to complete before I can see the tags associated with it. This prevents me from:

  • knowing which run_id belongs to which experiment
  • monitoring a large batch of experiments (8+) from within MLflow.

Feature request: when an experiment is launched and registered, also upload the wgtags.* namespace of the config (at minimum; the rest of the config as well if that is easy to do).

Marked as initiative because it has to happen after the config is fully resolved.
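
To make the request concrete, here is a minimal sketch of what uploading the wgtags.* entries at registration time could look like. It is not the existing launch-slurm code. The function names flatten_tags and register_tags are hypothetical, and the sketch assumes the resolved config is available as a nested mapping with a top-level "wgtags" key. The uploader is passed in as a callable so the example runs without a tracking server. With MLflow, that callable could be mlflow.set_tags, called inside an active run.

```python
from collections.abc import Callable, Mapping
from typing import Any


def flatten_tags(node: Mapping[str, Any], prefix: str = "wgtags") -> dict[str, str]:
    """Flatten a nested tag mapping into dotted keys with string values."""
    flat: dict[str, str] = {}
    for key, value in node.items():
        dotted = f"{prefix}.{key}"
        if isinstance(value, Mapping):
            # Recurse so that nested tags become e.g. "wgtags.group.name".
            flat.update(flatten_tags(value, dotted))
        else:
            flat[dotted] = str(value)
    return flat


def register_tags(
    resolved_config: Mapping[str, Any],
    set_tags: Callable[[dict[str, str]], None],
) -> dict[str, str]:
    """Upload the wgtags.* entries of an already-resolved config.

    Hypothetical helper: `set_tags` is injected by the caller
    (assumption: something like `mlflow.set_tags` in a real setup).
    """
    tags = flatten_tags(resolved_config.get("wgtags", {}))
    if tags:
        set_tags(tags)
    return tags


if __name__ == "__main__":
    # Illustrative config mirroring the options used in the loop above.
    config = {
        "lr_max": 5e-5,
        "num_mini_epochs": 1024,
        "wgtags": {"org": "ecmwf", "exp": "lr_scaling", "issue": "1168", "num_nodes": 2},
    }
    uploaded: dict[str, str] = {}
    register_tags(config, uploaded.update)  # a dict stands in for the tracking server
    print(uploaded)
```

Calling something like register_tags right after the config is fully resolved, rather than at the end of training, would make the tags visible while the run is still in progress.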

Hedgedoc URL, if you are keeping notes, plots, logs in hedgedoc.

No response

URL to the design document

No response

Area

  • datasets, data readers, data preparation and transfer
  • model
  • science
  • infrastructure and engineering
  • evaluation, export and visualization
  • documentation

Metadata

Assignees

No one assigned

Labels

  • infra: Issues related to infrastructure
  • initiative: Large piece of work covering multiple sprints

