Description
Describe the bug
When using pl.loggers.MLFlowLogger with NeMo 2.0's nl.NeMoLogger during fine-tuning, the training fails with a TypeError because MLflow's protobuf expects an integer step value but receives a float (e.g., 1.0 instead of 1). The error occurs during metric logging: TypeError: Cannot set mlflow.Metric.step to 1.0: 1.0 has type <class 'float'>, but expected one of: (<class 'int'>,).
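In miniature, the mismatch looks like this. The sketch below is plain Python mimicking protobuf's strict integer check on `mlflow.Metric.step` (the real check lives in `google.protobuf.internal.type_checkers`); `check_int_field` is an illustrative stand-in, not MLflow or protobuf code:

```python
# Illustrative stand-in for protobuf's type check on an int64 field such as
# mlflow.Metric.step -- it accepts only true ints and rejects floats like 1.0.

def check_int_field(value):
    # protobuf's int checkers reject floats (and bools) outright rather
    # than coercing them, which is what surfaces as the TypeError here.
    if not isinstance(value, int) or isinstance(value, bool):
        raise TypeError(
            f"{value!r} has type {type(value)}, but expected one of: (<class 'int'>,)"
        )
    return value

check_int_field(1)        # a plain int is accepted
try:
    check_int_field(1.0)  # a float step is rejected, as in the traceback
except TypeError as e:
    print(e)
```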
Steps/Code to reproduce bug
Run the example notebook below with the MLflow logger:
https://github.com/NVIDIA-NeMo/NeMo/blob/main/tutorials/llm/llama/nemo2-sft-peft/nemo2-peft.ipynb
and add the MLflow logger as follows:
```python
def logger() -> run.Config[nl.NeMoLogger]:
    mlflow_logger = run.Config(
        pl.loggers.MLFlowLogger,
        experiment_name="nemo2_peft_test",
        run_name=None,
        tracking_uri="http://mlflow:5000",
        log_model=False,
        tags={
            "model": "llama-3.2-1b",
            "task": "squad",
            "method": "lora",
            "framework": "nemo2.0",
        },
    )
    ckpt = run.Config(
        nl.ModelCheckpoint,
        save_last=True,
        every_n_train_steps=10,
        monitor="reduced_train_loss",
        save_top_k=1,
        save_on_train_epoch_end=True,
        save_optim_on_train_end=True,
    )
    return run.Config(
        nl.NeMoLogger,
        name="nemo2_peft",
        log_dir="./results",
        use_datetime_version=False,
        ckpt=ckpt,
        wandb=None,
        extra_loggers=[mlflow_logger],
    )
```
Running this example fails with the following error:
TypeError: Cannot set mlflow.Metric.step to 1.0: 1.0 has type <class 'float'>, but expected one of: (<class 'int'>,)
Full stack trace:

```
[default0]:[NeMo W 2026-01-05 13:54:20 rerun_state_machine:1264] Implicit initialization of Rerun State Machine!
i.finetune/0 [default0]:[NeMo W 2026-01-05 13:54:20 rerun_state_machine:239] RerunStateMachine initialized in mode RerunMode.DISABLED
i.finetune/0 [default0]:Training epoch 0, iteration 0/19 | lr: 0.0001 | global_batch_size: 8 | global_step: 0 | reduced_train_loss: 2.833
i.finetune/0 [default0]:Training epoch 0, iteration 1/19 | lr: 0.0001 | global_batch_size: 8 | global_step: 1 | reduced_train_loss: 2.355 | consumed_samples: 16
i.finetune/0 [default0]:🏃 View run serious-foal-183 at: http://mlflow:5000/#/experiments/41/runs/f1db1b8a68ca4e06a2fb599545454587
i.finetune/0 [default0]:🧪 View experiment at: http://mlflow:5000/#/experiments/41
i.finetune/0 [default0]:[rank0]: Traceback (most recent call last):
i.finetune/0 [default0]:[rank0]: File "/usr/local/lib/python3.12/dist-packages/google/protobuf/internal/python_message.py", line 737, in field_setter
i.finetune/0 [default0]:[rank0]: new_value = type_checker.CheckValue(new_value)
i.finetune/0 [default0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.finetune/0 [default0]:[rank0]: File "/usr/local/lib/python3.12/dist-packages/google/protobuf/internal/type_checkers.py", line 168, in CheckValue
i.finetune/0 [default0]:[rank0]: raise TypeError(message)
i.finetune/0 [default0]:[rank0]: TypeError: 1.0 has type <class 'float'>, but expected one of: (<class 'int'>,)
i.finetune/0 [default0]:
i.finetune/0 [default0]:[rank0]: During handling of the above exception, another exception occurred:
i.finetune/0 [default0]:
i.finetune/0 [default0]:[rank0]: Traceback (most recent call last):
i.finetune/0 [default0]:[rank0]: File "<frozen runpy>", line 198, in _run_module_as_main
i.finetune/0 [default0]:[rank0]: File "<frozen runpy>", line 88, in _run_code
i.finetune/0 [default0]:[rank0]: File "/opt/NeMo-Run/nemo_run/core/runners/fdl_runner.py", line 72, in <module>
i.finetune/0 [default0]:[rank0]: fdl_runner_app()
i.finetune/0 [default0]:[rank0]: File "/usr/local/lib/python3.12/dist-packages/typer/main.py", line 339, in __call__
i.finetune/0 [default0]:[rank0]: raise e
i.finetune/0 [default0]:[rank0]: File "/usr/local/lib/python3.12/dist-packages/typer/main.py", line 322, in __call__
i.finetune/0 [default0]:[rank0]: return get_command(self)(*args, **kwargs)
i.finetune/0 [default0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.finetune/0 [default0]:[rank0]: File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1161, in __call__
i.finetune/0 [default0]:[rank0]: return self.main(*args, **kwargs)
i.finetune/0 [default0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
i.finetune/0 [default0]:[rank0]: File "/usr/local/lib/python3.12/dist-packages/typer/core.py", line 677, in main
i.finetune/0 [default0]:[rank0]: return _main(
i.finetune/0 [default0]:[rank0]: ^^^^^^
i.finetune/0 [default0]:[rank0]: File "/usr/local/lib/python3.12/dist-packages/typer/core.py", line 195, in _main
i.finetune/0 [default0]:[rank0]: rv = self.invoke(ctx)
i.finetune/0 [default0]:[rank0]: ^^^^^^^^^^^^^^^^
i.finetune/0 [default0]:[rank0]: File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1443, in invoke
i.finetune/0 [default0]:[rank0]: return ctx.invoke(self.callback, **ctx.params)
i.finetune/0 [default0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.finetune/0 [default0]:[rank0]: File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 788, in invoke
i.finetune/0 [default0]:[rank0]: return __callback(*args, **kwargs)
i.finetune/0 [default0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.finetune/0 [default0]:[rank0]: File "/usr/local/lib/python3.12/dist-packages/typer/main.py", line 697, in wrapper
i.finetune/0 [default0]:[rank0]: return callback(**use_params)
i.finetune/0 [default0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^
i.finetune/0 [default0]:[rank0]: File "/opt/NeMo-Run/nemo_run/core/runners/fdl_runner.py", line 68, in fdl_direct_run
i.finetune/0 [default0]:[rank0]: fdl_fn()
i.finetune/0 [default0]:[rank0]: File "/opt/NeMo/nemo/collections/llm/api.py", line 221, in finetune
i.finetune/0 [default0]:[rank0]: return train(
i.finetune/0 [default0]:[rank0]: ^^^^^^
i.finetune/0 [default0]:[rank0]: File "/opt/NeMo/nemo/collections/llm/api.py", line 126, in train
i.finetune/0 [default0]:[rank0]: trainer.fit(model, data)
i.finetune/0 [default0]:[rank0]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
i.finetune/0 [default0]:[rank0]: call._call_and_handle_interrupt(
i.finetune/0 [default0]:[rank0]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
i.finetune/0 [default0]:[rank0]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
i.finetune/0 [default0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.finetune/0 [default0]:[rank0]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
i.finetune/0 [default0]:[rank0]: return function(*args, **kwargs)
i.finetune/0 [default0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
i.finetune/0 [default0]:[rank0]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
i.finetune/0 [default0]:[rank0]: self._run(model, ckpt_path=ckpt_path)
i.finetune/0 [default0]:[rank0]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 981, in _run
i.finetune/0 [default0]:[rank0]: results = self._run_stage()
i.finetune/0 [default0]:[rank0]: ^^^^^^^^^^^^^^^^^
i.finetune/0 [default0]:[rank0]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 1025, in _run_stage
i.finetune/0 [default0]:[rank0]: self.fit_loop.run()
i.finetune/0 [default0]:[rank0]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
i.finetune/0 [default0]:[rank0]: self.advance()
i.finetune/0 [default0]:[rank0]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
i.finetune/0 [default0]:[rank0]: self.epoch_loop.run(self._data_fetcher)
i.finetune/0 [default0]:[rank0]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
i.finetune/0 [default0]:[rank0]: self.advance(data_fetcher)
i.finetune/0 [default0]:[rank0]: File "/opt/NeMo/nemo/lightning/pytorch/trainer.py", line 47, in advance
i.finetune/0 [default0]:[rank0]: super().advance(data_fetcher)
i.finetune/0 [default0]:[rank0]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/loops/training_epoch_loop.py", line 278, in advance
i.finetune/0 [default0]:[rank0]: trainer._logger_connector.update_train_step_metrics()
i.finetune/0 [default0]:[rank0]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 163, in update_train_step_metrics
i.finetune/0 [default0]:[rank0]: self.log_metrics(self.metrics["log"])
i.finetune/0 [default0]:[rank0]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 117, in log_metrics
i.finetune/0 [default0]:[rank0]: logger.log_metrics(metrics=scalar_metrics, step=step)
i.finetune/0 [default0]:[rank0]: File "/usr/local/lib/python3.12/dist-packages/lightning_utilities-0.14.0-py3.12.egg/lightning_utilities/core/rank_zero.py", line 42, in wrapped_fn
i.finetune/0 [default0]:[rank0]: return fn(*args, **kwargs)
i.finetune/0 [default0]:[rank0]: ^^^^^^^^^^^^^^^^^^^
i.finetune/0 [default0]:[rank0]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/loggers/mlflow.py", line 270, in log_metrics
i.finetune/0 [default0]:[rank0]: self.experiment.log_batch(run_id=self.run_id, metrics=metrics_list, **self._log_batch_kwargs)
i.finetune/0 [default0]:[rank0]: File "/usr/local/lib/python3.12/dist-packages/mlflow/tracking/client.py", line 2581, in log_batch
i.finetune/0 [default0]:[rank0]: return self._tracking_client.log_batch(
i.finetune/0 [default0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i.finetune/0 [default0]:[rank0]: File "/usr/local/lib/python3.12/dist-packages/mlflow/telemetry/track.py", line 30, in wrapper
i.finetune/0 [default0]:[rank0]: result = func(*args, **kwargs)
i.finetune/0 [default0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^
i.finetune/0 [default0]:[rank0]: File "/usr/local/lib/python3.12/dist-packages/mlflow/tracking/_tracking_service/client.py", line 581, in log_batch
i.finetune/0 [default0]:[rank0]: self.store.log_batch(run_id=run_id, metrics=metrics_batch, params=[], tags=[])
i.finetune/0 [default0]:[rank0]: File "/usr/local/lib/python3.12/dist-packages/mlflow/store/tracking/rest_store.py", line 914, in log_batch
i.finetune/0 [default0]:[rank0]: metric_protos = [metric.to_proto() for metric in metrics]
i.finetune/0 [default0]:[rank0]: ^^^^^^^^^^^^^^^^^
i.finetune/0 [default0]:[rank0]: File "/usr/local/lib/python3.12/dist-packages/mlflow/entities/metric.py", line 84, in to_proto
i.finetune/0 [default0]:[rank0]: metric.step = self.step
i.finetune/0 [default0]:[rank0]: ^^^^^^^^^^^
i.finetune/0 [default0]:[rank0]: File "/usr/local/lib/python3.12/dist-packages/google/protobuf/internal/python_message.py", line 739, in field_setter
i.finetune/0 [default0]:[rank0]: raise TypeError(
i.finetune/0 [default0]:[rank0]: TypeError: Cannot set mlflow.Metric.step to 1.0: 1.0 has type <class 'float'>, but expected one of: (<class 'int'>,)
```
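Until this is fixed upstream, a possible client-side workaround is to coerce the step to a built-in `int` before it reaches MLflow's protobuf setter. The sketch below is illustrative: `coerce_step` and `IntStepMLFlowLogger` are names I made up, not part of NeMo or Lightning, and the subclassing approach assumes Lightning's `MLFlowLogger.log_metrics(metrics, step)` signature:

```python
# Hedged workaround sketch: cast the step Lightning hands over (which may be
# a float such as 1.0) to a plain int, since protobuf requires an integer
# for mlflow.Metric.step.

def coerce_step(step):
    """Return the step as a built-in int, or None if no step was given."""
    return None if step is None else int(step)

# In a training script this could live in a thin logger subclass, e.g.:
#
#   class IntStepMLFlowLogger(pl.loggers.MLFlowLogger):
#       def log_metrics(self, metrics, step=None):
#           super().log_metrics(metrics, step=coerce_step(step))
#
# and IntStepMLFlowLogger would then replace pl.loggers.MLFlowLogger in the
# run.Config above.

print(coerce_step(1.0))  # 1
```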