Skip to content

Conversation

@therealnaveenkamal
Copy link
Contributor

@therealnaveenkamal therealnaveenkamal commented Dec 14, 2025

What does this PR do ?

Adds OneLogger (https://github.com/NVIDIA/nv-one-logger) support for NeMo-RL

Issues

Addresses #1152

Usage

  • You can potentially add a usage example below
uv run examples/run_sft.py logger.one_logger_enabled=True logger.one_logger.project=my-nemo-project logger.one_logger.name=my-rl-run

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Summary by CodeRabbit

Release Notes

  • New Features
    • Added One Logger as a new logging backend for experiment tracking, metrics, and hyperparameters.
    • Supports configuration alongside existing logging backends (Weights & Biases, TensorBoard, MLflow).
    • Includes new configuration options to enable and customize One Logger in training runs.

✏️ Tip: You can customize this high-level summary in your review settings.

Signed-off-by: Naveenraj Kamalakannan <[email protected]>
@therealnaveenkamal therealnaveenkamal requested review from a team as code owners December 14, 2025 01:09
@therealnaveenkamal therealnaveenkamal changed the title added one-logger support for NeMo RL [feat]: added one-logger support for NeMo RL Dec 14, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 14, 2025

📝 Walkthrough

Walkthrough

This pull request integrates NV-One-Logger as a new logging backend into the NeMo RL framework. It adds the nv-one-logger submodule, creates wrapper packages with setup and configuration files, introduces the OneLogger class implementing the LoggerInterface, and updates project dependencies and example configurations to support the new logger backend.

Changes

Cohort / File(s) Change Summary
Submodule and Workspace Setup
.gitmodules, 3rdparty/NV-One-Logger-workspace/nv-one-logger
Adds new git submodule reference for NV-One-Logger at 3rdparty/NV-One-Logger-workspace/nv-one-logger pointing to [email protected]:NVIDIA/nv-one-logger.git
NV-One-Logger Root Workspace
3rdparty/NV-One-Logger-workspace/pyproject.toml, 3rdparty/NV-One-Logger-workspace/setup.py, 3rdparty/NV-One-Logger-workspace/is_nvlogger_installed.py
Adds project configuration and setup script for the core nv-one-logger workspace, including dependency resolution from nested pyproject.toml and package discovery; adds utility script to check nv_one_logger installation status
Wrapper Package Configurations
3rdparty/NV-One-Logger-workspace/wrappers/nv-one-logger-otel/*, 3rdparty/NV-One-Logger-workspace/wrappers/nv-one-logger-training-telemetry/*, 3rdparty/NV-One-Logger-workspace/wrappers/nv-one-logger-wandb/*
Adds pyproject.toml and setup.py files for three wrapper packages (otel, training-telemetry, wandb), each with Poetry dependency parsing and setuptools package configuration
Logger Module Integration
nemo_rl/utils/logger.py
Introduces OneLogger class implementing LoggerInterface, OneLoggerConfigDict TypedDict, extends LoggerConfig with one_logger configuration options, integrates OneLogger backend initialization into Logger class with conditional enablement per rank
Example Configurations
examples/configs/grpo_math_1B.yaml, examples/configs/sft.yaml
Adds optional one_logger configuration blocks with project and name fields to logger section in example config files
Project Dependencies
pyproject.toml
Adds four nv-one-logger packages (core, otel, training-telemetry, wandb) as dependencies and optional dependencies, registers workspace entries for UV package manager, and includes packages in no-build-isolation list

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

  • nemo_rl/utils/logger.py: New OneLogger class and LoggerConfig TypedDict extensions introduce public API surface; apparent code duplication at end of file; GPU-event integration scaffolding uses new API patterns
  • Setup.py files (3rdparty/*/setup.py): Dependency parsing logic converts Poetry format to pip install_requires across four similar files; potential edge cases in version handling and path resolution
  • pyproject.toml workspace configuration: Multiple new dependencies and workspace entries; verify correctness of workspace member paths and dependency declarations
  • Config file additions: Straightforward additions but verify schema compatibility with logger configuration expectations

Suggested labels

CI, Run CICD

Suggested reviewers

  • terrykong
  • parthchadha
  • yfw

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Test Results For Major Changes ⚠️ Warning PR introduces OneLogger class with full logging functionality but no tests were added for it; existing test file contains no OneLogger references. Add comprehensive unit tests for OneLogger class to tests/unit/utils/test_logger.py covering initialization, logging methods, and Logger class integration.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'added one-logger support for NeMo RL' directly and clearly summarizes the main change: integrating OneLogger (nv-one-logger) into the NeMo RL framework. It matches the core objective described in the PR.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 14

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_rl/utils/logger.py (1)

948-978: Require explicit cfg["one_logger"] configuration when one_logger_enabled is True, consistent with other backends.

The one_logger initialization uses cfg.get("one_logger", {}) which allows silent fallback to an empty config dict, whereas wandb, swanlab, tensorboard, and mlflow all require their config keys to be present (using direct cfg[key] access). This inconsistency can result in OneLogger being silently initialized with defaults instead of failing fast when configuration is missing.

Change from:

if cfg["one_logger_enabled"]:
    # OneLogger needs no log dir setup
    self.one_logger = OneLogger(cfg.get("one_logger", {}), log_dir=self.base_log_dir)
    self.loggers.append(self.one_logger)

To:

if cfg["one_logger_enabled"]:
    one_logger_cfg = cfg["one_logger"]  # Require key to be present
    # OneLogger needs no log dir setup
    self.one_logger = OneLogger(one_logger_cfg, log_dir=self.base_log_dir)
    self.loggers.append(self.one_logger)

This matches the pattern used by all other logging backends and ensures configuration is explicitly provided.

🧹 Nitpick comments (7)
pyproject.toml (1)

191-191: Minor clarity improvement for the comment.

The comment starting with "If we don't do this" could be slightly clearer about the consequence.

Consider this diff for improved clarity:

-  # If we don't do this, the research projects do not see the global uv.lock, and may mistakenly
+  # If we don't do this, research projects will not see the global uv.lock and may mistakenly
3rdparty/NV-One-Logger-workspace/pyproject.toml (1)

5-9: Consider aligning Python version requirement with project standards.

The requires-python = ">=3.8" may conflict with the project's Python 3.12+ guideline. If this wrapper is intended for use only within NeMo-RL (which targets 3.12+), consider updating to requires-python = ">=3.12". However, if maintaining compatibility with the upstream nv-one-logger package is intentional, this can remain as-is.

3rdparty/NV-One-Logger-workspace/wrappers/nv-one-logger-otel/setup.py (2)

26-30: Poetry caret conversion loses upper-bound semantics.

Poetry's ^ specifier (e.g., ^1.2.3) means >=1.2.3, <2.0.0 for major versions. Converting to just >=1.2.3 removes the upper bound, potentially allowing breaking changes. If strict version compatibility is needed, consider using the poetry-core or poetry-semver libraries for accurate conversion, or explicitly set upper bounds.


17-59: Consider extracting shared logic to reduce duplication.

This dependency extraction and setup logic is duplicated across all four setup.py files (core, otel, wandb, training-telemetry). Consider creating a shared utility module (e.g., _setup_utils.py) that handles Poetry-to-setuptools conversion, then have each setup.py import and call it with package-specific parameters.

Example shared utility approach:

# 3rdparty/NV-One-Logger-workspace/_setup_utils.py
def setup_nv_logger_package(src_dir, name, description):
    """Shared setup logic for nv-one-logger wrappers."""
    # ... common logic here ...

Then each setup.py becomes:

from _setup_utils import setup_nv_logger_package
setup_nv_logger_package(
    src_dir=Path(__file__).parent.parent.parent / "nv-one-logger" / "nv_one_logger" / "one_logger_otel",
    name="nv-one-logger-otel",
    description="Extensions to onelogger library to use Open telemetry (OTEL) as a backend.",
)
3rdparty/NV-One-Logger-workspace/wrappers/nv-one-logger-wandb/setup.py (1)

6-59: Duplicated setup logic.

This file duplicates the same Poetry-to-setuptools conversion logic found in the other wrapper setup.py files. Please consider the shared utility refactor suggested in the otel wrapper review to improve maintainability.

3rdparty/NV-One-Logger-workspace/wrappers/nv-one-logger-training-telemetry/setup.py (1)

6-59: Duplicated setup logic.

Same duplication concern as other wrappers. Please consider the shared utility refactor suggested in the otel wrapper review.

3rdparty/NV-One-Logger-workspace/setup.py (1)

6-59: Duplicated setup logic - consider this file as the host for shared utility.

This root setup.py shares identical logic with all wrapper setup.py files. Consider placing a shared _setup_utils.py module here in 3rdparty/NV-One-Logger-workspace/ that all setup.py files can import from.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7e5df0c and 046c049.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (15)
  • .gitmodules (1 hunks)
  • 3rdparty/NV-One-Logger-workspace/is_nvlogger_installed.py (1 hunks)
  • 3rdparty/NV-One-Logger-workspace/nv-one-logger (1 hunks)
  • 3rdparty/NV-One-Logger-workspace/pyproject.toml (1 hunks)
  • 3rdparty/NV-One-Logger-workspace/setup.py (1 hunks)
  • 3rdparty/NV-One-Logger-workspace/wrappers/nv-one-logger-otel/pyproject.toml (1 hunks)
  • 3rdparty/NV-One-Logger-workspace/wrappers/nv-one-logger-otel/setup.py (1 hunks)
  • 3rdparty/NV-One-Logger-workspace/wrappers/nv-one-logger-training-telemetry/pyproject.toml (1 hunks)
  • 3rdparty/NV-One-Logger-workspace/wrappers/nv-one-logger-training-telemetry/setup.py (1 hunks)
  • 3rdparty/NV-One-Logger-workspace/wrappers/nv-one-logger-wandb/pyproject.toml (1 hunks)
  • 3rdparty/NV-One-Logger-workspace/wrappers/nv-one-logger-wandb/setup.py (1 hunks)
  • examples/configs/grpo_math_1B.yaml (1 hunks)
  • examples/configs/sft.yaml (1 hunks)
  • nemo_rl/utils/logger.py (7 hunks)
  • pyproject.toml (4 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • 3rdparty/NV-One-Logger-workspace/wrappers/nv-one-logger-training-telemetry/pyproject.toml
  • .gitmodules
  • examples/configs/grpo_math_1B.yaml
  • 3rdparty/NV-One-Logger-workspace/setup.py
  • 3rdparty/NV-One-Logger-workspace/wrappers/nv-one-logger-wandb/setup.py
  • pyproject.toml
  • 3rdparty/NV-One-Logger-workspace/nv-one-logger
  • 3rdparty/NV-One-Logger-workspace/wrappers/nv-one-logger-otel/pyproject.toml
  • 3rdparty/NV-One-Logger-workspace/pyproject.toml
  • 3rdparty/NV-One-Logger-workspace/wrappers/nv-one-logger-training-telemetry/setup.py
  • 3rdparty/NV-One-Logger-workspace/wrappers/nv-one-logger-wandb/pyproject.toml
  • 3rdparty/NV-One-Logger-workspace/is_nvlogger_installed.py
  • examples/configs/sft.yaml
  • 3rdparty/NV-One-Logger-workspace/wrappers/nv-one-logger-otel/setup.py
  • nemo_rl/utils/logger.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code

Files:

  • 3rdparty/NV-One-Logger-workspace/setup.py
  • 3rdparty/NV-One-Logger-workspace/wrappers/nv-one-logger-wandb/setup.py
  • 3rdparty/NV-One-Logger-workspace/wrappers/nv-one-logger-training-telemetry/setup.py
  • 3rdparty/NV-One-Logger-workspace/is_nvlogger_installed.py
  • 3rdparty/NV-One-Logger-workspace/wrappers/nv-one-logger-otel/setup.py
  • nemo_rl/utils/logger.py
**/*.{py,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)

Files:

  • 3rdparty/NV-One-Logger-workspace/setup.py
  • 3rdparty/NV-One-Logger-workspace/wrappers/nv-one-logger-wandb/setup.py
  • 3rdparty/NV-One-Logger-workspace/wrappers/nv-one-logger-training-telemetry/setup.py
  • 3rdparty/NV-One-Logger-workspace/is_nvlogger_installed.py
  • 3rdparty/NV-One-Logger-workspace/wrappers/nv-one-logger-otel/setup.py
  • nemo_rl/utils/logger.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

For any source file under nemo_rl/*.py that defines a class or function decorated with @ray.remote, add a coverage pragma (# pragma: no cover) because these run in separate Ray processes

Files:

  • nemo_rl/utils/logger.py
🪛 Ruff (0.14.8)
3rdparty/NV-One-Logger-workspace/setup.py

10-10: Avoid specifying long messages outside the exception class

(TRY003)

3rdparty/NV-One-Logger-workspace/wrappers/nv-one-logger-wandb/setup.py

10-10: Avoid specifying long messages outside the exception class

(TRY003)

3rdparty/NV-One-Logger-workspace/wrappers/nv-one-logger-training-telemetry/setup.py

10-10: Avoid specifying long messages outside the exception class

(TRY003)

3rdparty/NV-One-Logger-workspace/is_nvlogger_installed.py

15-15: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)

3rdparty/NV-One-Logger-workspace/wrappers/nv-one-logger-otel/setup.py

10-10: Avoid specifying long messages outside the exception class

(TRY003)

nemo_rl/utils/logger.py

467-467: Unused method argument: log_dir

(ARG002)


512-512: Unused method argument: step_metric

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (5)
examples/configs/sft.yaml (1)

209-212: LGTM! Consistent logger configuration pattern.

The new one_logger configuration follows the established pattern used by other loggers (wandb, swanlab) with appropriate variable interpolation and safe default (one_logger_enabled: false).

examples/configs/grpo_math_1B.yaml (1)

286-289: LGTM! Consistent with sft.yaml configuration.

The one_logger configuration is consistent with the pattern established in sft.yaml and other logger configurations, ensuring uniformity across example configs.

3rdparty/NV-One-Logger-workspace/is_nvlogger_installed.py (1)

1-21: LGTM! Copyright header and structure are correct.

The file follows coding guidelines correctly:

  • NVIDIA copyright header with current year (2025)
  • Upper snake case for module-level constant INSTALLED
  • Clean try/except pattern for import detection

The Ruff RUF100 hint about the unused noqa directive is a false positive—the noqa: F401 is necessary since F401 (unused import) is enabled in the project's ruff configuration (line 316 of root pyproject.toml), and the import is intentionally unused as it's only checked for availability.

pyproject.toml (2)

158-161: LGTM! Proper workspace configuration.

The workspace sources, members, and no-build-isolation-package entries are correctly configured and follow the same pattern as other workspace packages (e.g., megatron-core, megatron-bridge).

Also applies to: 185-189, 218-221


52-55: The nv-one-logger packages should remain as required dependencies in dependencies, not optional.

The packages are unconditionally imported at the top level of nemo_rl/utils/logger.py (lines 57–63), which itself is unconditionally imported across all main example scripts and algorithm modules (sft.py, distillation.py, dpo.py, grpo.py, rm.py, etc.). If nv-one-logger were optional, importing any of these modules would fail if the packages were not installed. The one_logger_enabled configuration flag only controls whether the logger is used, not whether it can be imported.

Comment on lines +21 to +23
[submodule "3rdparty/NV-One-Logger-workspace/nv-one-logger"]
path = 3rdparty/NV-One-Logger-workspace/nv-one-logger
url = [email protected]:NVIDIA/nv-one-logger.git
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n .gitmodules

Repository: NVIDIA-NeMo/RL

Length of output: 1102


🏁 Script executed:

curl -s -o /dev/null -w "%{http_code}" https://github.com/NVIDIA/nv-one-logger

Repository: NVIDIA-NeMo/RL

Length of output: 61


Add branch and shallow clone configuration for consistency.

The new submodule entry lacks branch and shallow properties that are present in all four existing submodules. Additionally, it uses an SSH URL instead of HTTPS like all other NVIDIA submodules. This creates inconsistency and may lead to unintended behavior:

  1. Without shallow = true, the submodule will perform full clones instead of shallow clones, increasing download size and time.
  2. Without a branch specification, the submodule will track HEAD rather than a stable, pinned branch.
  3. The SSH URL ([email protected]:) differs from the HTTPS URLs used for all other NVIDIA submodules in this file.

Apply this diff to align with existing submodule configurations:

 [submodule "3rdparty/NV-One-Logger-workspace/nv-one-logger"]
 	path = 3rdparty/NV-One-Logger-workspace/nv-one-logger
-	url = [email protected]:NVIDIA/nv-one-logger.git
+	url = https://github.com/NVIDIA/nv-one-logger.git
+	branch = main
+	shallow = true
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
[submodule "3rdparty/NV-One-Logger-workspace/nv-one-logger"]
path = 3rdparty/NV-One-Logger-workspace/nv-one-logger
url = [email protected]:NVIDIA/nv-one-logger.git
[submodule "3rdparty/NV-One-Logger-workspace/nv-one-logger"]
path = 3rdparty/NV-One-Logger-workspace/nv-one-logger
url = https://github.com/NVIDIA/nv-one-logger.git
branch = main
shallow = true
🤖 Prompt for AI Agents
In .gitmodules around lines 21-23, the NV-One-Logger submodule entry is missing
the branch and shallow settings and uses an SSH URL; update the entry to match
the other NVIDIA submodules by changing the URL to the HTTPS form, adding a
branch line matching the other NVIDIA submodules (e.g., branch = main or the
same branch used by the other entries), and add shallow = true so the submodule
is shallow-cloned.

@@ -0,0 +1 @@
Subproject commit c382dd5193ae987233c7a61de33fb16f91724b7d
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

echo "=== Checking repository structure ==="
ls -la 3rdparty/ 2>/dev/null | head -20 || echo "3rdparty directory listing failed"

echo -e "\n=== Checking if submodule file exists ==="
if [ -f "3rdparty/NV-One-Logger-workspace/nv-one-logger" ]; then
  echo "File found. Content:"
  cat "3rdparty/NV-One-Logger-workspace/nv-one-logger"
else
  echo "File not found at specified path"
fi

echo -e "\n=== Checking .gitmodules configuration ==="
if [ -f .gitmodules ]; then
  echo ".gitmodules content:"
  cat .gitmodules
else
  echo ".gitmodules file not found"
fi

echo -e "\n=== Searching for OneLogger references in codebase ==="
rg -i "onelogger|one.?logger" --type py --type sh 2>/dev/null | head -30 || echo "No OneLogger references found"

echo -e "\n=== Searching for test files ==="
find . -type f \( -name "*test*.py" -o -name "*test*.sh" \) 2>/dev/null | grep -i "logger\|log" || echo "No logger-related test files found"

echo -e "\n=== Searching for documentation files mentioning OneLogger ==="
rg -i "onelogger|one.?logger" --type md --type txt 2>/dev/null | head -30 || echo "No documentation found mentioning OneLogger"

echo -e "\n=== Listing Python files (excluding tests) to check for imports ==="
fd -e py . --exclude tests --exclude test_\* 2>/dev/null | xargs grep -l "onelogger\|nv_one_logger" 2>/dev/null || echo "No Python files import OneLogger"

Repository: NVIDIA-NeMo/RL

Length of output: 4838


🏁 Script executed:

head -100 tests/unit/utils/test_logger.py

Repository: NVIDIA-NeMo/RL

Length of output: 3519


🏁 Script executed:

rg -n "OneLogger" tests/unit/utils/test_logger.py

Repository: NVIDIA-NeMo/RL

Length of output: 40


Add OneLogger integration tests and documentation.

OneLogger is integrated in nemo_rl/utils/logger.py but lacks test coverage and documentation. Before merging:

  1. Add unit tests for the OneLogger class in tests/unit/utils/test_logger.py, including configuration, initialization, and logging functionality.
  2. Add documentation describing OneLogger setup, configuration options, and usage examples (e.g., README or integration guide).
  3. Verify the submodule commit hash in .gitmodules is accessible and stable in the NVIDIA/nv-one-logger repository.
🤖 Prompt for AI Agents
In 3rdparty/NV-One-Logger-workspace/nv-one-logger around lines 1 to 1, tests and
documentation for OneLogger are missing; add unit tests at
tests/unit/utils/test_logger.py that instantiate OneLogger with various
valid/invalid configurations, assert initialization state, and verify that log
methods emit expected messages (use a temporary config and capture logs or mock
handlers), create documentation (README or integration guide) detailing setup
steps, available configuration options and usage examples showing initialization
and common logging calls, and finally confirm the .gitmodules submodule commit
hash points to an accessible, stable commit in the NVIDIA/nv-one-logger repo and
update it if necessary.

Comment on lines +1 to +4
import sys
import toml
from pathlib import Path
import setuptools
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Missing NVIDIA copyright header and unused import.

Same issues as the wrapper setup.py files: add the NVIDIA copyright header and remove the unused sys import. As per coding guidelines for Python files.

Apply this diff:

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
-import sys
 import toml
 from pathlib import Path
 import setuptools
🤖 Prompt for AI Agents
In 3rdparty/NV-One-Logger-workspace/setup.py around lines 1-4, the file is
missing the required NVIDIA copyright header and contains an unused import of
sys; remove the unused "import sys" line and prepend the standard NVIDIA
copyright header comment block at the top of the file (matching other wrapper
setup.py files in the repo) so the file begins with the header followed by the
remaining imports (toml, Path, setuptools).

[project]
name = "nv-one-logger-otel"
description = "Wrapper for nv-one-logger-otel"
requires-python = ">=3.8"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Inconsistent Python version requirement.

The requires-python = ">=3.8" setting is inconsistent with the root pyproject.toml which specifies requires-python = ">=3.12" (line 16) and the coding guidelines which require Python 3.12+ conformance. This could allow installation in unsupported Python environments.

Apply this diff to align with project standards:

-requires-python = ">=3.8"
+requires-python = ">=3.12"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
requires-python = ">=3.8"
requires-python = ">=3.12"
🤖 Prompt for AI Agents
In 3rdparty/NV-One-Logger-workspace/wrappers/nv-one-logger-otel/pyproject.toml
around line 8, update the Python requirement from ">=3.8" to ">=3.12" to match
the root pyproject.toml and project guidelines; edit the requires-python field
to ">=3.12" so the package cannot be installed on older, unsupported Python
versions and ensure any related CI/metadata remains consistent.

Comment on lines +1 to +4
import sys
import toml
from pathlib import Path
import setuptools
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Missing NVIDIA copyright header and unused import.

Per coding guidelines, Python files require the NVIDIA copyright header. Also, sys is imported but never used.

Apply this diff:

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
-import sys
 import toml
 from pathlib import Path
 import setuptools
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
import sys
import toml
from pathlib import Path
import setuptools
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import toml
from pathlib import Path
import setuptools
🤖 Prompt for AI Agents
In 3rdparty/NV-One-Logger-workspace/wrappers/nv-one-logger-otel/setup.py lines
1-4, the file is missing the required NVIDIA copyright header and imports an
unused module (`sys`); add the standard NVIDIA Python file header comment block
at the very top of the file and remove the unused `import sys` line so the file
only imports required modules (toml, Path, setuptools). Ensure the header is
placed before any imports and keep the remaining imports unchanged.

Comment on lines +44 to 69
# Add 3rdparty to path to support vendored nv-one-logger
import sys
_3rd_party_root = os.path.abspath(os.path.join(os.path.dirname(__file__), "../../3rdparty/nv-one-logger/nv_one_logger"))
if os.path.isdir(_3rd_party_root):
# We need to add the src directories of the individual packages to sys.path
_core_src = os.path.join(_3rd_party_root, "one_logger_core/src")
_otel_src = os.path.join(_3rd_party_root, "one_logger_otel/src")

if os.path.isdir(_core_src) and _core_src not in sys.path:
sys.path.append(_core_src)
if os.path.isdir(_otel_src) and _otel_src not in sys.path:
sys.path.append(_otel_src)

from nv_one_logger.api.config import OneLoggerConfig as BaseOneLoggerConfig
from nv_one_logger.api.one_logger_provider import OneLoggerProvider
from nv_one_logger.core.attributes import Attributes
from nv_one_logger.core.event import Event
from nv_one_logger.core.span import StandardSpanName
from nv_one_logger.otel.exporter.otel_exporter import OTelExporter
from nv_one_logger.recorder.default_recorder import DefaultRecorder
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

from nemo_rl.data.interfaces import LLMMessageLogType
from nemo_rl.distributed.batched_data_dict import BatchedDataDict
HAVE_ONE_LOGGER = True

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, locate the file and check its size
find . -name "logger.py" -path "*/nemo_rl/utils/*" -type f

Repository: NVIDIA-NeMo/RL

Length of output: 83


🏁 Script executed:

# Get the file size to determine how to read it
wc -l nemo_rl/utils/logger.py 2>/dev/null || echo "File not found, trying alternate path"

Repository: NVIDIA-NeMo/RL

Length of output: 86


🏁 Script executed:

# Read the relevant section and context around lines 44-69
sed -n '1,100p' nemo_rl/utils/logger.py | cat -n

Repository: NVIDIA-NeMo/RL

Length of output: 3945


🏁 Script executed:

# Search for HAVE_ONE_LOGGER usage throughout the file
rg "HAVE_ONE_LOGGER" nemo_rl/utils/logger.py -n

Repository: NVIDIA-NeMo/RL

Length of output: 233


🏁 Script executed:

# Check the OneLogger class initialization and how it's used
rg -A 20 "class OneLogger" nemo_rl/utils/logger.py

Repository: NVIDIA-NeMo/RL

Length of output: 1404


🏁 Script executed:

# Look for any import error handling or try-except blocks around imports
sed -n '40,100p' nemo_rl/utils/logger.py | cat -n

Repository: NVIDIA-NeMo/RL

Length of output: 2512


🏁 Script executed:

# Check Logger.__init__ to see how OneLogger backend is instantiated
sed -n '900,1000p' nemo_rl/utils/logger.py | cat -n

Repository: NVIDIA-NeMo/RL

Length of output: 5019


🏁 Script executed:

# Look for where LoggerConfig one_logger field is accessed
rg -n "one_logger" nemo_rl/utils/logger.py | head -30

Repository: NVIDIA-NeMo/RL

Length of output: 1620


🏁 Script executed:

# Check if there's any env var or fallback for one_logger config
rg -B5 -A5 "OneLogger(" nemo_rl/utils/logger.py

Repository: NVIDIA-NeMo/RL

Length of output: 128


🏁 Script executed:

# Check OneLogger.__init__ signature and usage of log_dir parameter
sed -n '464,520p' nemo_rl/utils/logger.py | cat -n

Repository: NVIDIA-NeMo/RL

Length of output: 2558


🏁 Script executed:

# Verify OneLoggerConfigDict is correctly defined and its usage
sed -n '95,125p' nemo_rl/utils/logger.py | cat -n

Repository: NVIDIA-NeMo/RL

Length of output: 1073


Wrap nv-one-logger imports in try-except to prevent import-time crash when package is unavailable.

Currently, the unconditional imports at lines 57–64 will raise ImportError before line 68 can execute, making HAVE_ONE_LOGGER ineffective and preventing nemo_rl/utils/logger.py from importing at all if nv-one-logger is neither vendored nor installed. The check in OneLogger.__init__ (line 468) cannot protect against this module-level failure.

 # Add 3rdparty to path to support vendored nv-one-logger
 import sys
 _3rd_party_root = os.path.abspath(os.path.join(os.path.dirname(__file__), "../../3rdparty/nv-one-logger/nv_one_logger"))
 if os.path.isdir(_3rd_party_root):
     # We need to add the src directories of the individual packages to sys.path
     _core_src = os.path.join(_3rd_party_root, "one_logger_core/src")
     _otel_src = os.path.join(_3rd_party_root, "one_logger_otel/src")
     
-    if os.path.isdir(_core_src) and _core_src not in sys.path:
-        sys.path.append(_core_src)
-    if os.path.isdir(_otel_src) and _otel_src not in sys.path:
-        sys.path.append(_otel_src)
+    if os.path.isdir(_core_src) and _core_src not in sys.path:
+        sys.path.insert(0, _core_src)
+    if os.path.isdir(_otel_src) and _otel_src not in sys.path:
+        sys.path.insert(0, _otel_src)
 
-from nv_one_logger.api.config import OneLoggerConfig as BaseOneLoggerConfig
-from nv_one_logger.api.one_logger_provider import OneLoggerProvider
-from nv_one_logger.core.attributes import Attributes
-from nv_one_logger.core.event import Event
-from nv_one_logger.core.span import StandardSpanName
-from nv_one_logger.otel.exporter.otel_exporter import OTelExporter
-from nv_one_logger.recorder.default_recorder import DefaultRecorder
-from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
+try:
+    from nv_one_logger.api.config import OneLoggerConfig as BaseOneLoggerConfig
+    from nv_one_logger.api.one_logger_provider import OneLoggerProvider
+    from nv_one_logger.core.attributes import Attributes
+    from nv_one_logger.core.event import Event
+    from nv_one_logger.core.span import StandardSpanName
+    from nv_one_logger.otel.exporter.otel_exporter import OTelExporter
+    from nv_one_logger.recorder.default_recorder import DefaultRecorder
+    from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
+except ImportError:
+    HAVE_ONE_LOGGER = False
+else:
+    HAVE_ONE_LOGGER = True
 
 from nemo_rl.data.interfaces import LLMMessageLogType
 from nemo_rl.distributed.batched_data_dict import BatchedDataDict
-HAVE_ONE_LOGGER = True
🤖 Prompt for AI Agents
In nemo_rl/utils/logger.py around lines 44 to 69, the unconditional imports from
nv_one_logger will raise ImportError at module import time and make
HAVE_ONE_LOGGER ineffective; wrap the nv_one_logger imports in a try/except
ImportError block so that if the package is unavailable you catch the
ImportError, set HAVE_ONE_LOGGER = False (and optionally record a debug/warning
via the standard logging module), and only set HAVE_ONE_LOGGER = True inside the
try block after successful imports; keep the existing sys.path adjustments
outside the try so vendored paths are still attempted before importing.

Comment on lines +95 to +99
class OneLoggerConfigDict(TypedDict):
project: str
name: str
enable_for_current_rank: NotRequired[bool]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Align TypedDict requirements with runtime behavior (avoid hidden defaults; enforce config presence when enabled).
OneLoggerConfigDict requires project/name, but Logger passes {} and OneLogger supplies fallback defaults in code, which conflicts with the “YAML is source of truth / no hidden defaults” guideline.

 class OneLoggerConfigDict(TypedDict):
     project: str
     name: str
     enable_for_current_rank: NotRequired[bool]
 
 class LoggerConfig(TypedDict):
@@
     one_logger_enabled: bool
@@
     one_logger: NotRequired[OneLoggerConfigDict]

And then enforce presence when enabled (see Logger.init comment below) and access cfg["project"] / cfg["name"] (see OneLogger.init comment below).

Also applies to: 106-118

🤖 Prompt for AI Agents
In nemo_rl/utils/logger.py around lines 95-99 (and similarly 106-118), the
TypedDict currently requires project/name but runtime passes {} and OneLogger
supplies fallbacks; update the TypedDict so project and name are NotRequired
(make them optional), then add a validation in Logger.__init__ that if logging
is enabled it raises a descriptive error when cfg lacks "project" or "name", and
finally update OneLogger.__init__ to assume presence and access cfg["project"]
and cfg["name"] (instead of using .get) after the Logger check ensures they
exist.

Comment on lines +464 to +557
class OneLogger(LoggerInterface):
"""OneLogger backend."""

def __init__(self, cfg: OneLoggerConfigDict, log_dir: Optional[str] = None):
if not HAVE_ONE_LOGGER:
print("Warning: nv_one_logger not installed. OneLogger will be disabled.")
return

self.provider = OneLoggerProvider.instance()
project = cfg.get("project", "nemo_rl")
name = cfg.get("name", "nemo_rl_run")

if not self.provider.one_logger_ready:
# Configure OneLogger
one_logger_config = BaseOneLoggerConfig(
application_name=project,
session_tag_or_fn=name,
world_size_or_fn=lambda: int(os.environ.get("WORLD_SIZE", 1)),
enable_for_current_rank=cfg.get("enable_for_current_rank", True),
)

# Setup OTLP exporter
otel_endpoint = os.environ.get(
"OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4318/v1/traces"
)
os.environ["OTEL_SERVICE_NAME"] = project
otel_exporter = OTelExporter(
otel_span_exporter=OTLPSpanExporter(endpoint=otel_endpoint)
)

recorder = DefaultRecorder(exporters=[otel_exporter])
self.provider.configure(one_logger_config, recorder)
print(f"Initialized OneLogger for project {project}, run {name}")

self.recorder = self.provider.recorder

# Ensure we have an active application span
app_spans = self.recorder.get_active_spans_by_name(StandardSpanName.APPLICATION)
if app_spans:
self.app_span = app_spans[0]
else:
self.app_span = self.recorder.start(StandardSpanName.APPLICATION)

def log_metrics(
self,
metrics: dict[str, Any],
step: int,
prefix: Optional[str] = "",
step_metric: Optional[str] = None,
) -> None:
if not HAVE_ONE_LOGGER or not self.app_span:
return

attributes_dict = {"step": step}

for name, value in metrics.items():
if prefix:
name = f"{prefix}/{name}"
# OneLogger attributes support basic types.
if isinstance(value, (int, float, str, bool)):
attributes_dict[name] = value
elif isinstance(value, torch.Tensor) and value.numel() == 1:
attributes_dict[name] = value.item()

# Record as an event
self.recorder.event(
span=self.app_span,
event=Event.create(
name="metrics_update",
attributes=Attributes(attributes_dict),
),
)

def log_hyperparams(self, params: Mapping[str, Any]) -> None:
if not HAVE_ONE_LOGGER or not self.app_span:
return

flat_params = flatten_dict(params)
# Convert to supported types
valid_params = {}
for k, v in flat_params.items():
if isinstance(v, (int, float, str, bool)):
valid_params[k] = v
else:
valid_params[k] = str(v)

self.app_span.attributes.add_attributes(Attributes(valid_params))

def log_histogram(self, histogram: list[Any], step: int, name: str) -> None:
pass

def log_plot(self, figure: plt.Figure, step: int, name: str) -> None:
pass

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's check the file structure and see if it exists
git ls-files nemo_rl/utils/logger.py

Repository: NVIDIA-NeMo/RL

Length of output: 81


🏁 Script executed:

# Read the specific file to verify current state
wc -l nemo_rl/utils/logger.py

Repository: NVIDIA-NeMo/RL

Length of output: 86


🏁 Script executed:

# Read the OneLogger class implementation
sed -n '464,557p' nemo_rl/utils/logger.py | cat -n

Repository: NVIDIA-NeMo/RL

Length of output: 4085


🏁 Script executed:

# Check for config definitions - likely a TypedDict
rg "OneLoggerConfigDict" -A 5 -B 2

Repository: NVIDIA-NeMo/RL

Length of output: 1301


🏁 Script executed:

# Check if there's a copyright header at the top of the file
head -20 nemo_rl/utils/logger.py

Repository: NVIDIA-NeMo/RL

Length of output: 728


🏁 Script executed:

# Check for @ray.remote decorators in the file
rg "@ray.remote" nemo_rl/utils/logger.py

Repository: NVIDIA-NeMo/RL

Length of output: 40


🏁 Script executed:

# Check for OpenTelemetry dependency version in pyproject.toml or requirements
fd -e toml -e txt | xargs grep -l "opentelemetry\|otel" 2>/dev/null | head -5

Repository: NVIDIA-NeMo/RL

Length of output: 148


🏁 Script executed:

# Look for config schema or YAML files that define defaults
fd -e yaml | xargs grep -l "onelogger\|one_logger" -i 2>/dev/null

Repository: NVIDIA-NeMo/RL

Length of output: 118


🏁 Script executed:

# Check YAML examples for one_logger configuration
cat examples/configs/grpo_math_1B.yaml | grep -A 10 "one_logger"

Repository: NVIDIA-NeMo/RL

Length of output: 436


🏁 Script executed:

# Check sft.yaml
cat examples/configs/sft.yaml | grep -A 10 "one_logger"

Repository: NVIDIA-NeMo/RL

Length of output: 508


🏁 Script executed:

# Check pyproject.toml for OpenTelemetry dependency version
grep -A 2 -B 2 "opentelemetry\|otel" pyproject.toml

Repository: NVIDIA-NeMo/RL

Length of output: 837


🏁 Script executed:

# Check if log_dir is used anywhere in the class or as base class method
rg "log_dir" nemo_rl/utils/logger.py

Repository: NVIDIA-NeMo/RL

Length of output: 2942


🏁 Script executed:

# Verify step_metric usage in log_metrics
rg "step_metric" nemo_rl/utils/logger.py -A 3 -B 3

Repository: NVIDIA-NeMo/RL

Length of output: 6620


Initialize all class members in __init__ path; don't skip initialization on early return.

On early return (line 5-7), self.app_span and self.recorder are never initialized. Later calls (lines 51, 66, 75, 87) then fail with AttributeError. Always initialize these members before any early return.

Also: For required config keys project and name, use direct access (cfg["project"]) instead of cfg.get() with defaults—the TypedDict marks them required. The unused step_metric parameter should be explicitly ignored with _ = step_metric.

🧰 Tools
🪛 Ruff (0.14.8)

467-467: Unused method argument: log_dir

(ARG002)


512-512: Unused method argument: step_metric

(ARG002)

Comment on lines 951 to 957
Parameters:
cfg (LoggerConfig): Configuration mapping. Expected keys include:
- "log_dir": base directory for backend logs.
- "wandb_enabled", "swanlab_enabled", "tensorboard_enabled", "mlflow_enabled": booleans to enable backends.
- "wandb", "swanlab", "tensorboard", "mlflow": per-backend configuration dicts.
- "log_dir": base directory for backend logs.
- "wandb_enabled", "swanlab_enabled", "tensorboard_enabled", "mlflow_enabled", "one_logger_enabled": booleans to enable backends.
- "wandb", "swanlab", "tensorboard", "mlflow", "one_logger": per-backend configuration dicts.
- "monitor_gpus": boolean to enable Ray GPU monitoring.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Docstring has duplicated bullets (“log_dir” repeated).
Clean up to avoid confusing rendered docs.

             cfg (LoggerConfig): Configuration mapping. Expected keys include:
                 - "log_dir": base directory for backend logs.
-                - "log_dir": base directory for backend logs.
                 - "wandb_enabled", "swanlab_enabled", "tensorboard_enabled", "mlflow_enabled", "one_logger_enabled": booleans to enable backends.
                 - "wandb", "swanlab", "tensorboard", "mlflow", "one_logger": per-backend configuration dicts.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Parameters:
cfg (LoggerConfig): Configuration mapping. Expected keys include:
- "log_dir": base directory for backend logs.
- "wandb_enabled", "swanlab_enabled", "tensorboard_enabled", "mlflow_enabled": booleans to enable backends.
- "wandb", "swanlab", "tensorboard", "mlflow": per-backend configuration dicts.
- "log_dir": base directory for backend logs.
- "wandb_enabled", "swanlab_enabled", "tensorboard_enabled", "mlflow_enabled", "one_logger_enabled": booleans to enable backends.
- "wandb", "swanlab", "tensorboard", "mlflow", "one_logger": per-backend configuration dicts.
- "monitor_gpus": boolean to enable Ray GPU monitoring.
Parameters:
cfg (LoggerConfig): Configuration mapping. Expected keys include:
- "log_dir": base directory for backend logs.
- "wandb_enabled", "swanlab_enabled", "tensorboard_enabled", "mlflow_enabled", "one_logger_enabled": booleans to enable backends.
- "wandb", "swanlab", "tensorboard", "mlflow", "one_logger": per-backend configuration dicts.
- "monitor_gpus": boolean to enable Ray GPU monitoring.
🤖 Prompt for AI Agents
In nemo_rl/utils/logger.py around lines 951 to 957, the docstring parameter list
contains a duplicated bullet for "log_dir"; remove the duplicate entry (or
replace it with the correct intended key if a different config key was meant) so
the docstring lists each expected configuration key only once and update any
wording to keep the list consistent and unambiguous.

@therealnaveenkamal
Copy link
Contributor Author

@terrykong I've integrated nv-one-logger:

  • added as a git submodule at 3rdparty/NV-One-Logger-workspace/nv-one-logger
  • created wrapper packages nv-one-logger-core, -otel, -training-telemetry, -wandb
  • tested with an OTEL collector and confirmed metrics are being sent successfully

let me know what you think, thanks!

Copy link
Contributor

@terrykong terrykong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @therealnaveenkamal for the help adding one-logger. please see my review below and let me know if you have any questions

Copy link
Contributor

@terrykong terrykong Dec 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so for all the one logger dependencies, can we switch to VCS installs in pyproject.toml? see https://github.com/NVIDIA-NeMo/RL/blob/main/pyproject.toml#L154 for example

for context, the 3rdparty repo contains dependencies we expect to change a lot and require us to edit the code to have a monorepo experience. considering logging is probably not something we will change very often, i think it will be easier to maintain this as a regular python package.

Comment on lines +44 to +55
# Add 3rdparty to path to support vendored nv-one-logger
import sys
_3rd_party_root = os.path.abspath(os.path.join(os.path.dirname(__file__), "../../3rdparty/nv-one-logger/nv_one_logger"))
if os.path.isdir(_3rd_party_root):
# We need to add the src directories of the individual packages to sys.path
_core_src = os.path.join(_3rd_party_root, "one_logger_core/src")
_otel_src = os.path.join(_3rd_party_root, "one_logger_otel/src")

if os.path.isdir(_core_src) and _core_src not in sys.path:
sys.path.append(_core_src)
if os.path.isdir(_otel_src) and _otel_src not in sys.path:
sys.path.append(_otel_src)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assuming this is moved to pyproject.toml as a regular VCS install, i think we can just assume one logger is installed to simplify this

"""OneLogger backend."""

def __init__(self, cfg: OneLoggerConfigDict, log_dir: Optional[str] = None):
if not HAVE_ONE_LOGGER:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can remove

Comment on lines +473 to +474
project = cfg.get("project", "nemo_rl")
name = cfg.get("name", "nemo_rl_run")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these default should come from yaml

Suggested change
project = cfg.get("project", "nemo_rl")
name = cfg.get("name", "nemo_rl_run")
project = cfg["project"]
name = cfg["name"]

one_logger_config = BaseOneLoggerConfig(
application_name=project,
session_tag_or_fn=name,
world_size_or_fn=lambda: int(os.environ.get("WORLD_SIZE", 1)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WORLD_SIZE doesn't have a value in this context b/c this is executed on the driver and there may be multiple "worlds" b/c multiple policy worker groups and the generation worker groups. Probably best thing to do here is num_nodes * gpus_per_node

url = https://github.com/NVIDIA-NeMo/Gym.git
branch = main
shallow = true
[submodule "3rdparty/NV-One-Logger-workspace/nv-one-logger"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't be needed

Comment on lines +158 to +161
nv-one-logger-core = { workspace = true }
nv-one-logger-otel = { workspace = true }
nv-one-logger-training-telemetry = { workspace = true }
nv-one-logger-wandb = { workspace = true }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for these vcs installs can probably use subdir syntax https://docs.astral.sh/uv/concepts/projects/dependencies/#git


# Setup OTLP exporter
otel_endpoint = os.environ.get(
"OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4318/v1/traces"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's this for?

return


class OneLogger(LoggerInterface):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please add unit tests for this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants