
Commit 4005779

Add support for nemotron-3-nano-30b reasoning models
This commit adds full support for nemotron-3-nano-30b-a3b models with reasoning/thinking capabilities, addressing behavioral differences among the 9b-v2, 30b, and 49b model variants identified through comprehensive testing.
Parent: b084586

File tree (5 files changed: +168, -22 lines)

- deploy/compose/docker-compose-rag-server.yaml
- deploy/compose/nvdev.env
- deploy/helm/nvidia-blueprint-rag/values.yaml
- docs/enable-nemotron-thinking.md
- src/nvidia_rag/utils/llm.py

deploy/compose/docker-compose-rag-server.yaml

Lines changed: 5 additions & 0 deletions
```diff
@@ -72,6 +72,11 @@ services:
       LLM_MAX_TOKENS: ${LLM_MAX_TOKENS:-32768}
       LLM_TEMPERATURE: ${LLM_TEMPERATURE:-0}
       LLM_TOP_P: ${LLM_TOP_P:-1.0}
+
+      # Enable/disable thinking/reasoning for nemotron-3-nano models (30b variant)
+      # Set to "true" to enable reasoning mode with reasoning_budget
+      # Set to "false" to disable reasoning and get direct answers
+      ENABLE_NEMOTRON_3_NANO_THINKING: ${ENABLE_NEMOTRON_3_NANO_THINKING:-true}
 
       ##===Query Rewriter Model specific configurations===
       APP_QUERYREWRITER_MODELNAME: ${APP_QUERYREWRITER_MODELNAME:-"nvidia/llama-3.3-nemotron-super-49b-v1.5"}
```
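The flag's value is parsed case-insensitively by the server (see the src/nvidia_rag/utils/llm.py diff below). A minimal Python sketch of the accepted values, mirroring the commit's parsing logic; the helper name here is hypothetical:

```python
import os

def nemotron_thinking_enabled() -> bool:
    # Hypothetical helper mirroring the check in
    # _bind_thinking_tokens_if_configured: default is "true",
    # compared case-insensitively against a small allow-list.
    raw = os.getenv("ENABLE_NEMOTRON_3_NANO_THINKING", "true").lower()
    return raw in ("true", "1", "yes")

# "true", "TRUE", "1", and "yes" all enable thinking; any other
# value ("false", "0", "off", "") disables it.
assert nemotron_thinking_enabled()  # unset -> default enabled
os.environ["ENABLE_NEMOTRON_3_NANO_THINKING"] = "false"
assert not nemotron_thinking_enabled()
```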

deploy/compose/nvdev.env

Lines changed: 3 additions & 0 deletions
```diff
@@ -16,6 +16,9 @@ export NVIDIA_API_KEY=${NGC_API_KEY}
 # === Internally NVIDIA hosted NIM Endpoints (for cloud deployment) ===
 # WAR: Use public endpoint for inference
 export APP_LLM_MODELNAME=nvidia/llama-3.3-nemotron-super-49b-v1.5
+# For nemotron-3-nano models hosted on NVIDIA cloud, use:
+# export APP_LLM_MODELNAME=nvidia/nemotron-3-nano-30b-a3b
+# Note: For locally deployed nemotron-3-nano, use: nvidia/nemotron-3-nano
 export APP_FILTEREXPRESSIONGENERATOR_MODELNAME=nvidia/llama-3.3-nemotron-super-49b-v1.5
 export APP_EMBEDDINGS_MODELNAME=nvdev/nvidia/llama-3.2-nv-embedqa-1b-v2
 # For VLM Embedding Model (Nemoretriever-1b-vlm-embed-v1)
```

deploy/helm/nvidia-blueprint-rag/values.yaml

Lines changed: 5 additions & 0 deletions
```diff
@@ -164,6 +164,11 @@ envVars:
   LLM_TEMPERATURE: "0"
   LLM_TOP_P: "1.0"
 
+  # Enable/disable thinking/reasoning for nemotron-3-nano models (30b variant)
+  # Set to "true" to enable reasoning mode with reasoning_budget
+  # Set to "false" to disable reasoning and get direct answers
+  ENABLE_NEMOTRON_3_NANO_THINKING: "true"
+
   ##===Query Rewriter Model specific configurations===
   APP_QUERYREWRITER_MODELNAME: "nvidia/llama-3.3-nemotron-super-49b-v1.5"
   # URL on which query rewriter model is hosted. If "", Nvidia hosted API is used
```

docs/enable-nemotron-thinking.md

Lines changed: 43 additions & 6 deletions
````diff
@@ -112,11 +112,17 @@ When the thinking budget is enabled, the model monitors the token count within t
 
 As of NIM version 1.12, the Thinking Budget feature is supported on the following models:
 
-- **nvidia/nvidia-nemotron-nano-9b-v2**
-- **nvidia/nemotron-3-nano-30b-a3b**
+- **nvidia-nemotron-nano-9b-v2**
+- **nvidia/nemotron-3-nano-30b-a3b** (also accessible as `nvidia/nemotron-3-nano`)
 
 For the latest supported models, refer to the [NIM Thinking Budget Control documentation](https://docs.nvidia.com/nim/large-language-models/latest/thinking-budget-control.html).
 
+> **Note:** The model `nvidia/nemotron-3-nano` is an alias that can be used interchangeably with `nvidia/nemotron-3-nano-30b-a3b`. Both refer to the same underlying model.
+>
+> **Important - Model Naming:**
+> - **For locally deployed NIMs:** Use model name `nvidia/nemotron-3-nano`
+> - **For NVIDIA-hosted models:** Use model name `nvidia/nemotron-3-nano-30b-a3b`
+
 ### Enabling Thinking Budget on RAG
 
 After enabling the reasoning as per the steps mentioned above, enable the thinking budget feature in RAG by including the following parameters in your API request:
@@ -126,13 +132,29 @@ After enabling the reasoning as per the steps mentioned above, enable the thinki
 | `min_thinking_tokens` | 1 | Minimum number of thinking tokens to allocate for reasoning models. |
 | `max_thinking_tokens` | 8192 | Maximum number of thinking tokens to allocate for reasoning models. |
 
-> **Note for `nvidia/nemotron-3-nano-30b-a3b`**
-> This model only uses the `max_thinking_tokens` parameter.
-> - `min_thinking_tokens` is ignored for this model.
+> **Note for `nvidia/nemotron-3-nano-30b-a3b` and `nvidia/nemotron-3-nano`**
+> These models only use the `max_thinking_tokens` parameter.
+> - `min_thinking_tokens` is ignored for these models.
 > - Thinking budget is enabled by passing a positive `max_thinking_tokens` value in the request.
+> - The RAG blueprint automatically handles the model-specific parameter mapping internally (`max_thinking_tokens` → `reasoning_budget`).
+> - Unlike `nvidia-nemotron-nano-9b-v2`, these models return reasoning in a separate `reasoning_content` field rather than using `<think>` tags.
+>
+> **Controlling Reasoning for nemotron-3-nano:**
+> - Set `ENABLE_NEMOTRON_3_NANO_THINKING=true` (default) to enable reasoning/thinking mode
+> - Set `ENABLE_NEMOTRON_3_NANO_THINKING=false` to disable reasoning mode
+> - This controls the `enable_thinking` flag in `chat_template_kwargs`
+>
+> **Model Behavior Differences:**
+>
+> | Model | Reasoning Control | Reasoning Output | Token Budget Parameter |
+> |-------|------------------|------------------|----------------------|
+> | `nvidia-nemotron-nano-9b-v2` | `min_thinking_tokens`, `max_thinking_tokens` | In `content` field with `<think>` tags | `min_thinking_tokens`, `max_thinking_tokens` |
+> | `nvidia/nemotron-3-nano-30b-a3b` | `ENABLE_NEMOTRON_3_NANO_THINKING` env var | In `reasoning_content` field | `reasoning_budget` (mapped from `max_thinking_tokens`) |
+> | `nvidia/llama-3.3-nemotron-super-49b-v1.5` | System prompt (`/think` or `/no_think`) | In `content` field with `<think>` tags | N/A (controlled by prompt) |
 
 **Example API requests:**
 
+**For nvidia-nemotron-nano-9b-v2:**
 ```json
 {
   "messages": [
@@ -143,10 +165,25 @@ After enabling the reasoning as per the steps mentioned above, enable the thinki
   ],
   "min_thinking_tokens": 1,
   "max_thinking_tokens": 8192,
-  "model": "nvidia/nvidia-nemotron-nano-9b-v2"
+  "model": "nvidia-nemotron-nano-9b-v2"
+}
+```
+
+**For nemotron-3-nano (locally deployed):**
+```json
+{
+  "messages": [
+    {
+      "role": "user",
+      "content": "What is the FY2017 operating cash flow ratio for Adobe?"
+    }
+  ],
+  "max_thinking_tokens": 8192,
+  "model": "nvidia/nemotron-3-nano"
 }
 ```
 
+**For nemotron-3-nano (NVIDIA-hosted):**
 ```json
 {
   "messages": [
````

src/nvidia_rag/utils/llm.py

Lines changed: 112 additions & 16 deletions
```diff
@@ -16,10 +16,11 @@
 """The wrapper for interacting with llm models and pre or postprocessing LLM response.
 1. get_prompts: Get the prompts from the YAML file.
 2. get_llm: Get the LLM model. Uses the NVIDIA AI Endpoints or OpenAI.
-3. streaming_filter_think: Filter the think tokens from the LLM response (sync).
-4. get_streaming_filter_think_parser: Get the parser for filtering the think tokens (sync).
-5. streaming_filter_think_async: Filter the think tokens from the LLM response (async).
-6. get_streaming_filter_think_parser_async: Get the parser for filtering the think tokens (async).
+3. extract_reasoning_and_content: Extract reasoning and content from response chunks.
+4. streaming_filter_think: Filter the think tokens from the LLM response (sync).
+5. get_streaming_filter_think_parser: Get the parser for filtering the think tokens (sync).
+6. streaming_filter_think_async: Filter the think tokens from the LLM response (async).
+7. get_streaming_filter_think_parser_async: Get the parser for filtering the think tokens (async).
 """
 
 import logging
@@ -131,8 +132,19 @@ def _bind_thinking_tokens_if_configured(
 ) -> LLM | SimpleChatModel:
     """
     If min_thinking_tokens or max_thinking_tokens are > 0 in kwargs, bind them to the LLM.
-    For models that use a reasoning budget (e.g., nemotron-3-nano-30b-a3b),
-    max_thinking_tokens is mapped to the underlying ChatNVIDIA ``reasoning_budget`` parameter.
+
+    Supports multiple reasoning/thinking model variants:
+
+    1. nvidia-nemotron-nano-9b-v2:
+       - Uses min_thinking_tokens and max_thinking_tokens parameters
+       - Outputs reasoning wrapped in <think></think> tags in the content stream
+
+    2. nemotron-3-nano variants (nemotron-3-nano-30b-a3b, nvidia/nemotron-3-nano):
+       - Uses reasoning_budget parameter (mapped from max_thinking_tokens)
+       - Requires chat_template_kwargs={"enable_thinking": True/False}
+       - Outputs reasoning in a separate 'reasoning_content' field (not in content)
+       - Does NOT use <think> tags
+       - Can be controlled via ENABLE_NEMOTRON_3_NANO_THINKING env var
 
     Raises:
         ValueError: If min_thinking_tokens or max_thinking_tokens is passed but model
@@ -151,13 +163,23 @@ def _bind_thinking_tokens_if_configured(
     if not has_thinking_tokens:
         return llm
 
-    if has_thinking_tokens and "nvidia-nemotron-nano-9b-v2" not in model \
-            and "nemotron-3-nano-30b-a3b" not in model:
-        raise ValueError(
-            "min_thinking_tokens and max_thinking_tokens are only supported for models "
-            "'nvidia-nemotron-nano-9b-v2' and 'nemotron-3-nano-30b-a3b', "
-            f"but got model '{model}'"
-        )
+    # Check if model is a supported reasoning model (various name formats)
+    # Note: For locally hosted models, use "nvidia/nemotron-3-nano"
+    # For NVIDIA-hosted models, use "nvidia/nemotron-3-nano-30b-a3b"
+    is_nano_9b_v2 = model and "nvidia-nemotron-nano-9b-v2" in model
+    is_nemotron_3_nano = model and (
+        "nemotron-3-nano" in model.lower() or
+        "nvidia/nemotron-3-nano" in model or
+        "nemotron-3-nano-30b-a3b" in model
+    )
+
+    if has_thinking_tokens and not (is_nano_9b_v2 or is_nemotron_3_nano):
+        raise ValueError(
+            "min_thinking_tokens and max_thinking_tokens are only supported for models "
+            "'nvidia-nemotron-nano-9b-v2' and nemotron-3-nano variants "
+            "(e.g., 'nemotron-3-nano-30b-a3b', 'nvidia/nemotron-3-nano'), "
+            f"but got model '{model}'"
+        )
 
     # Validate parameter values - must be positive if provided
     if min_think is not None and min_think <= 0:
```
```diff
@@ -170,15 +192,31 @@ def _bind_thinking_tokens_if_configured(
         )
 
     bind_args = {}
-    if "nvidia-nemotron-nano-9b-v2" in model:
+    if is_nano_9b_v2:
+        # nvidia-nemotron-nano-9b-v2: Uses thinking token parameters directly
         if min_think is not None and min_think > 0:
             bind_args["min_thinking_tokens"] = min_think
         if max_think is not None and max_think > 0:
             bind_args["max_thinking_tokens"] = max_think
-    elif "nemotron-3-nano-30b-a3b" in model:
+    elif is_nemotron_3_nano:
+        # nemotron-3-nano variants: Use reasoning_budget and enable_thinking flag
+        # Check environment variable for enable_thinking control
+        enable_thinking_env = os.getenv("ENABLE_NEMOTRON_3_NANO_THINKING", "true").lower()
+        enable_thinking = enable_thinking_env in ("true", "1", "yes")
+
         if max_think is not None and max_think > 0:
             bind_args["reasoning_budget"] = max_think
-            bind_args["chat_template_kwargs"] = {"enable_thinking": True}
+            bind_args["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
+            logger.info(
+                "nemotron-3-nano: Setting reasoning_budget=%d, enable_thinking=%s (from env: %s)",
+                max_think, enable_thinking, enable_thinking_env
+            )
+        # Note: min_thinking_tokens is not supported for nemotron-3-nano variants
+        if min_think is not None and min_think > 0:
+            logger.warning(
+                "min_thinking_tokens is not supported for nemotron-3-nano variants, "
+                "only max_thinking_tokens (mapped to reasoning_budget) is supported"
+            )
 
     if bind_args:
         return llm.bind(**bind_args)
```
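For a nemotron-3-nano request carrying `max_thinking_tokens=8192`, the binding above is equivalent to the following sketch. The `base_url` is a hypothetical local NIM endpoint; only the two `bind` calls reflect what this hunk actually produces:

```python
from langchain_nvidia_ai_endpoints import ChatNVIDIA

# Hypothetical local NIM endpoint; adjust base_url for your deployment.
llm = ChatNVIDIA(model="nvidia/nemotron-3-nano", base_url="http://localhost:8000/v1")

# With ENABLE_NEMOTRON_3_NANO_THINKING unset (default "true"):
llm_thinking = llm.bind(
    reasoning_budget=8192,
    chat_template_kwargs={"enable_thinking": True},
)

# With ENABLE_NEMOTRON_3_NANO_THINKING=false, the budget is still bound,
# but the chat template disables thinking and the model answers directly:
llm_direct = llm.bind(
    reasoning_budget=8192,
    chat_template_kwargs={"enable_thinking": False},
)
```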
```diff
@@ -307,6 +345,64 @@ def get_llm(config: NvidiaRAGConfig | None = None, **kwargs) -> LLM | SimpleChat
     )
 
 
+def extract_reasoning_and_content(chunk) -> tuple[str, str]:
+    """
+    Extract both reasoning and content from a response chunk.
+
+    Different models handle reasoning differently:
+    - nvidia-nemotron-nano-9b-v2: Uses <think> tags in content stream
+    - nemotron-3-nano variants: Uses separate reasoning_content field
+    - llama-3.3-nemotron-super-49b: Uses <think> tags in content stream (controlled by prompt)
+
+    This function is designed to be robust and compatible with future changes:
+    - Checks both reasoning_content and content fields
+    - Returns whichever field has tokens, regardless of model behavior
+    - If both have content, returns both separately
+
+    This ensures that if the model server fixes the issue where reasoning is disabled
+    but content still goes to reasoning_content, the code will still work correctly.
+
+    Args:
+        chunk: A response chunk from ChatNVIDIA or similar LLM interface
+
+    Returns:
+        tuple: (reasoning_text, content_text) - either may be an empty string
+
+    Example:
+        >>> for chunk in llm.stream([HumanMessage(content="question")]):
+        ...     reasoning, content = extract_reasoning_and_content(chunk)
+        ...     if reasoning:
+        ...         print(f"[REASONING: {reasoning}]", end="", flush=True)
+        ...     if content:
+        ...         print(content, end="", flush=True)
+    """
+    reasoning = ""
+    content = ""
+
+    # Check for reasoning_content in additional_kwargs (nemotron-3-nano variants)
+    # This field is populated by nemotron-3-nano models for reasoning output
+    if hasattr(chunk, 'additional_kwargs') and 'reasoning_content' in chunk.additional_kwargs:
+        reasoning = chunk.additional_kwargs.get('reasoning_content', '')
+
+    # Check for regular content
+    # This field is populated by most models for regular output
+    # For nemotron-nano-9b-v2 and llama-49b, this may include <think> tags
+    if hasattr(chunk, 'content') and chunk.content:
+        content = chunk.content
+
+    # Fallback case: if the reasoning field has tokens but content is empty,
+    # the reasoning text may actually be the final answer. This occurs when
+    # enable_thinking=false but the model still routes its output to
+    # reasoning_content instead of content.
+    if reasoning and not content:
+        # The value is intentionally kept in the reasoning field rather than
+        # moved into content, so this stays compatible with a future model
+        # server fix; the caller decides how to handle this case.
+        pass
+
+    return reasoning, content
+
+
 def streaming_filter_think(chunks: Iterable[str]) -> Iterable[str]:
     """
     This generator filters content between think tags in streaming LLM responses.
```
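Putting the new helper together with a streaming call, a caller can separate the two channels as in the sketch below, adapted from the function's docstring example. The `base_url` is an assumed local endpoint, and the `bind` call reproduces what `_bind_thinking_tokens_if_configured` does for this model family:

```python
from langchain_core.messages import HumanMessage
from langchain_nvidia_ai_endpoints import ChatNVIDIA

from nvidia_rag.utils.llm import extract_reasoning_and_content

# Hypothetical local NIM endpoint; adjust for your deployment.
llm = ChatNVIDIA(model="nvidia/nemotron-3-nano", base_url="http://localhost:8000/v1")
llm = llm.bind(reasoning_budget=1024, chat_template_kwargs={"enable_thinking": True})

for chunk in llm.stream([HumanMessage(content="What is 17 * 23?")]):
    reasoning, content = extract_reasoning_and_content(chunk)
    if reasoning:
        # nemotron-3-nano emits reasoning in reasoning_content, not <think> tags
        print(f"[reasoning] {reasoning}", end="", flush=True)
    if content:
        print(content, end="", flush=True)
```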
