
Fine-tuning recipe fails when using MLFlowLogger #405

@mohittalele

Description

Describe the bug

When using pl.loggers.MLFlowLogger with NeMo 2.0's nl.NeMoLogger during fine-tuning, the training fails with a TypeError because MLflow's protobuf expects an integer step value but receives a float (e.g., 1.0 instead of 1). The error occurs during metric logging: TypeError: Cannot set mlflow.Metric.step to 1.0: 1.0 has type <class 'float'>, but expected one of: (<class 'int'>,).
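The failure boils down to protobuf's strict integer type check: `1.0` is rejected for an `int` field even though its value is integral. A minimal stand-in for that check (no MLflow or protobuf required; `check_int_field` is a hypothetical name, not a real protobuf API) illustrates the behavior:

```python
def check_int_field(value):
    # Stand-in for google.protobuf's integer field type checker: a float is
    # rejected even when it carries an integral value such as 1.0.
    if type(value) is not int:
        raise TypeError(
            f"{value!r} has type {type(value)}, but expected one of: (<class 'int'>,)"
        )
    return value
```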

Steps/Code to reproduce bug
Run the example notebook below with the MLflow logger enabled:
https://github.com/NVIDIA-NeMo/NeMo/blob/main/tutorials/llm/llama/nemo2-sft-peft/nemo2-peft.ipynb

and add the MLflow logger as follows:

```python
def logger() -> run.Config[nl.NeMoLogger]:
    mlflow_logger = run.Config(
        pl.loggers.MLFlowLogger,
        experiment_name="nemo2_peft_test",
        run_name=None,
        tracking_uri="http://mlflow:5000",
        log_model=False,
        tags={
            "model": "llama-3.2-1b",
            "task": "squad",
            "method": "lora",
            "framework": "nemo2.0",
        },
    )

    ckpt = run.Config(
        nl.ModelCheckpoint,
        save_last=True,
        every_n_train_steps=10,
        monitor="reduced_train_loss",
        save_top_k=1,
        save_on_train_epoch_end=True,
        save_optim_on_train_end=True,
    )

    return run.Config(
        nl.NeMoLogger,
        name="nemo2_peft",
        log_dir="./results",
        use_datetime_version=False,
        ckpt=ckpt,
        wandb=None,
        extra_loggers=[mlflow_logger],
    )
```

Running this example fails with:

```
TypeError: Cannot set mlflow.Metric.step to 1.0: 1.0 has type <class 'float'>, but expected one of: (<class 'int'>,)
```

Stack trace:

```
 [default0]:[NeMo W 2026-01-05 13:54:20 rerun_state_machine:1264] Implicit initialization of Rerun State Machine!
i.finetune/0 [default0]:[NeMo W 2026-01-05 13:54:20 rerun_state_machine:239] RerunStateMachine initialized in mode RerunMode.DISABLED
i.finetune/0 [default0]:Training epoch 0, iteration 0/19 | lr: 0.0001 | global_batch_size: 8 | global_step: 0 | reduced_train_loss: 2.833
i.finetune/0 [default0]:Training epoch 0, iteration 1/19 | lr: 0.0001 | global_batch_size: 8 | global_step: 1 | reduced_train_loss: 2.355 | consumed_samples: 16
i.finetune/0 [default0]:🏃 View run serious-foal-183 at: http://mlflow:5000/#/experiments/41/runs/f1db1b8a68ca4e06a2fb599545454587
i.finetune/0 [default0]:🧪 View experiment at: http://mlflow:5000/#/experiments/41
i.finetune/0 [default0]:[rank0]: Traceback (most recent call last):
i.finetune/0 [default0]:[rank0]:   File "/usr/local/lib/python3.12/dist-packages/google/protobuf/internal/python_message.py", line 737, in field_setter
i.finetune/0 [default0]:[rank0]:     new_value = type_checker.CheckValue(new_value)
i.finetune/0 [default0]:[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.finetune/0 [default0]:[rank0]:   File "/usr/local/lib/python3.12/dist-packages/google/protobuf/internal/type_checkers.py", line 168, in CheckValue
i.finetune/0 [default0]:[rank0]:     raise TypeError(message)
i.finetune/0 [default0]:[rank0]: TypeError: 1.0 has type <class 'float'>, but expected one of: (<class 'int'>,)
i.finetune/0 [default0]:
i.finetune/0 [default0]:[rank0]: During handling of the above exception, another exception occurred:
i.finetune/0 [default0]:
i.finetune/0 [default0]:[rank0]: Traceback (most recent call last):
i.finetune/0 [default0]:[rank0]:   File "<frozen runpy>", line 198, in _run_module_as_main
i.finetune/0 [default0]:[rank0]:   File "<frozen runpy>", line 88, in _run_code
i.finetune/0 [default0]:[rank0]:   File "/opt/NeMo-Run/nemo_run/core/runners/fdl_runner.py", line 72, in <module>
i.finetune/0 [default0]:[rank0]:     fdl_runner_app()
i.finetune/0 [default0]:[rank0]:   File "/usr/local/lib/python3.12/dist-packages/typer/main.py", line 339, in __call__
i.finetune/0 [default0]:[rank0]:     raise e
i.finetune/0 [default0]:[rank0]:   File "/usr/local/lib/python3.12/dist-packages/typer/main.py", line 322, in __call__
i.finetune/0 [default0]:[rank0]:     return get_command(self)(*args, **kwargs)
i.finetune/0 [default0]:[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.finetune/0 [default0]:[rank0]:   File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1161, in __call__
i.finetune/0 [default0]:[rank0]:     return self.main(*args, **kwargs)
i.finetune/0 [default0]:[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^
i.finetune/0 [default0]:[rank0]:   File "/usr/local/lib/python3.12/dist-packages/typer/core.py", line 677, in main
i.finetune/0 [default0]:[rank0]:     return _main(
i.finetune/0 [default0]:[rank0]:            ^^^^^^
i.finetune/0 [default0]:[rank0]:   File "/usr/local/lib/python3.12/dist-packages/typer/core.py", line 195, in _main
i.finetune/0 [default0]:[rank0]:     rv = self.invoke(ctx)
i.finetune/0 [default0]:[rank0]:          ^^^^^^^^^^^^^^^^
i.finetune/0 [default0]:[rank0]:   File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1443, in invoke
i.finetune/0 [default0]:[rank0]:     return ctx.invoke(self.callback, **ctx.params)
i.finetune/0 [default0]:[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.finetune/0 [default0]:[rank0]:   File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 788, in invoke
i.finetune/0 [default0]:[rank0]:     return __callback(*args, **kwargs)
i.finetune/0 [default0]:[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.finetune/0 [default0]:[rank0]:   File "/usr/local/lib/python3.12/dist-packages/typer/main.py", line 697, in wrapper
i.finetune/0 [default0]:[rank0]:     return callback(**use_params)
i.finetune/0 [default0]:[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^
i.finetune/0 [default0]:[rank0]:   File "/opt/NeMo-Run/nemo_run/core/runners/fdl_runner.py", line 68, in fdl_direct_run
i.finetune/0 [default0]:[rank0]:     fdl_fn()
i.finetune/0 [default0]:[rank0]:   File "/opt/NeMo/nemo/collections/llm/api.py", line 221, in finetune
i.finetune/0 [default0]:[rank0]:     return train(
i.finetune/0 [default0]:[rank0]:            ^^^^^^
i.finetune/0 [default0]:[rank0]:   File "/opt/NeMo/nemo/collections/llm/api.py", line 126, in train
i.finetune/0 [default0]:[rank0]:     trainer.fit(model, data)
i.finetune/0 [default0]:[rank0]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
i.finetune/0 [default0]:[rank0]:     call._call_and_handle_interrupt(
i.finetune/0 [default0]:[rank0]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
i.finetune/0 [default0]:[rank0]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
i.finetune/0 [default0]:[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.finetune/0 [default0]:[rank0]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
i.finetune/0 [default0]:[rank0]:     return function(*args, **kwargs)
i.finetune/0 [default0]:[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
i.finetune/0 [default0]:[rank0]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
i.finetune/0 [default0]:[rank0]:     self._run(model, ckpt_path=ckpt_path)
i.finetune/0 [default0]:[rank0]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 981, in _run
i.finetune/0 [default0]:[rank0]:     results = self._run_stage()
i.finetune/0 [default0]:[rank0]:               ^^^^^^^^^^^^^^^^^
i.finetune/0 [default0]:[rank0]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 1025, in _run_stage
i.finetune/0 [default0]:[rank0]:     self.fit_loop.run()
i.finetune/0 [default0]:[rank0]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
i.finetune/0 [default0]:[rank0]:     self.advance()
i.finetune/0 [default0]:[rank0]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
i.finetune/0 [default0]:[rank0]:     self.epoch_loop.run(self._data_fetcher)
i.finetune/0 [default0]:[rank0]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
i.finetune/0 [default0]:[rank0]:     self.advance(data_fetcher)
i.finetune/0 [default0]:[rank0]:   File "/opt/NeMo/nemo/lightning/pytorch/trainer.py", line 47, in advance
i.finetune/0 [default0]:[rank0]:     super().advance(data_fetcher)
i.finetune/0 [default0]:[rank0]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/loops/training_epoch_loop.py", line 278, in advance
i.finetune/0 [default0]:[rank0]:     trainer._logger_connector.update_train_step_metrics()
i.finetune/0 [default0]:[rank0]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 163, in update_train_step_metrics
i.finetune/0 [default0]:[rank0]:     self.log_metrics(self.metrics["log"])
i.finetune/0 [default0]:[rank0]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 117, in log_metrics
i.finetune/0 [default0]:[rank0]:     logger.log_metrics(metrics=scalar_metrics, step=step)
i.finetune/0 [default0]:[rank0]:   File "/usr/local/lib/python3.12/dist-packages/lightning_utilities-0.14.0-py3.12.egg/lightning_utilities/core/rank_zero.py", line 42, in wrapped_fn
i.finetune/0 [default0]:[rank0]:     return fn(*args, **kwargs)
i.finetune/0 [default0]:[rank0]:            ^^^^^^^^^^^^^^^^^^^
i.finetune/0 [default0]:[rank0]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/loggers/mlflow.py", line 270, in log_metrics
i.finetune/0 [default0]:[rank0]:     self.experiment.log_batch(run_id=self.run_id, metrics=metrics_list, **self._log_batch_kwargs)
i.finetune/0 [default0]:[rank0]:   File "/usr/local/lib/python3.12/dist-packages/mlflow/tracking/client.py", line 2581, in log_batch
i.finetune/0 [default0]:[rank0]:     return self._tracking_client.log_batch(
i.finetune/0 [default0]:[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.finetune/0 [default0]:[rank0]:   File "/usr/local/lib/python3.12/dist-packages/mlflow/telemetry/track.py", line 30, in wrapper
i.finetune/0 [default0]:[rank0]:     result = func(*args, **kwargs)
i.finetune/0 [default0]:[rank0]:              ^^^^^^^^^^^^^^^^^^^^^
i.finetune/0 [default0]:[rank0]:   File "/usr/local/lib/python3.12/dist-packages/mlflow/tracking/_tracking_service/client.py", line 581, in log_batch
i.finetune/0 [default0]:[rank0]:     self.store.log_batch(run_id=run_id, metrics=metrics_batch, params=[], tags=[])
i.finetune/0 [default0]:[rank0]:   File "/usr/local/lib/python3.12/dist-packages/mlflow/store/tracking/rest_store.py", line 914, in log_batch
i.finetune/0 [default0]:[rank0]:     metric_protos = [metric.to_proto() for metric in metrics]
i.finetune/0 [default0]:[rank0]:                      ^^^^^^^^^^^^^^^^^
i.finetune/0 [default0]:[rank0]:   File "/usr/local/lib/python3.12/dist-packages/mlflow/entities/metric.py", line 84, in to_proto
i.finetune/0 [default0]:[rank0]:     metric.step = self.step
i.finetune/0 [default0]:[rank0]:     ^^^^^^^^^^^
i.finetune/0 [default0]:[rank0]:   File "/usr/local/lib/python3.12/dist-packages/google/protobuf/internal/python_message.py", line 739, in field_setter
i.finetune/0 [default0]:[rank0]:     raise TypeError(
i.finetune/0 [default0]:[rank0]: TypeError: Cannot set mlflow.Metric.step to 1.0: 1.0 has type <class 'float'>, but expected one of: (<class 'int'>,)
```
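Until this is fixed upstream, one possible workaround is to cast the step to an `int` before it reaches MLflow. The sketch below is untested against NeMo 2.0, and `IntStepMLFlowLogger` is a hypothetical name, not an official NeMo or Lightning API; it only assumes `MLFlowLogger.log_metrics(metrics, step)` as the override point:

```python
def coerce_step(step):
    # MLflow's protobuf requires Metric.step to be an int, but NeMo/Lightning
    # can hand the logger a float such as 1.0, so cast it before logging.
    return int(step) if step is not None else None

# A subclass could then apply the cast in log_metrics, roughly like this
# (hypothetical sketch, shown as a comment to keep the example self-contained):
#
#   class IntStepMLFlowLogger(pl.loggers.MLFlowLogger):
#       def log_metrics(self, metrics, step=None):
#           super().log_metrics(metrics, step=coerce_step(step))
```

Passing an instance of such a subclass via `extra_loggers` in place of the plain `MLFlowLogger` config should avoid the protobuf `TypeError`.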

