
Commit 4005779

Add support for nemotron-3-nano-30b reasoning models
This commit adds full support for nemotron-3-nano-30b-a3b models with reasoning/thinking capabilities, addressing behavioral differences among the 9b-v2, 30b, and 49b model variants identified through comprehensive testing.
Parent: b084586

File tree (5 files changed: +168, -22 lines)

- deploy/compose/docker-compose-rag-server.yaml
- deploy/compose/nvdev.env
- deploy/helm/nvidia-blueprint-rag/values.yaml
- docs/enable-nemotron-thinking.md
- src/nvidia_rag/utils/llm.py

deploy/compose/docker-compose-rag-server.yaml

Lines changed: 5 additions & 0 deletions
```diff
@@ -72,6 +72,11 @@ services:
       LLM_MAX_TOKENS: ${LLM_MAX_TOKENS:-32768}
       LLM_TEMPERATURE: ${LLM_TEMPERATURE:-0}
       LLM_TOP_P: ${LLM_TOP_P:-1.0}
+
+      # Enable/disable thinking/reasoning for nemotron-3-nano models (30b variant)
+      # Set to "true" to enable reasoning mode with reasoning_budget
+      # Set to "false" to disable reasoning and get direct answers
+      ENABLE_NEMOTRON_3_NANO_THINKING: ${ENABLE_NEMOTRON_3_NANO_THINKING:-true}
 
       ##===Query Rewriter Model specific configurations===
       APP_QUERYREWRITER_MODELNAME: ${APP_QUERYREWRITER_MODELNAME:-"nvidia/llama-3.3-nemotron-super-49b-v1.5"}
```
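The flag's value is parsed case-insensitively by the server (see the src/nvidia_rag/utils/llm.py diff below). A minimal Python sketch of the accepted values, mirroring the commit's parsing logic; the helper name here is hypothetical:

```python
import os

def nemotron_thinking_enabled() -> bool:
    # Hypothetical helper mirroring the check in
    # _bind_thinking_tokens_if_configured: default is "true",
    # compared case-insensitively against a small allow-list.
    raw = os.getenv("ENABLE_NEMOTRON_3_NANO_THINKING", "true").lower()
    return raw in ("true", "1", "yes")

# "true", "TRUE", "1", and "yes" all enable thinking; any other
# value ("false", "0", "off", "") disables it.
assert nemotron_thinking_enabled()  # unset -> default enabled
os.environ["ENABLE_NEMOTRON_3_NANO_THINKING"] = "false"
assert not nemotron_thinking_enabled()
```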

deploy/compose/nvdev.env

Lines changed: 3 additions & 0 deletions
```diff
@@ -16,6 +16,9 @@ export NVIDIA_API_KEY=${NGC_API_KEY}
 # === Internally NVIDIA hosted NIM Endpoints (for cloud deployment) ===
 # WAR: Use public endpoint for inference
 export APP_LLM_MODELNAME=nvidia/llama-3.3-nemotron-super-49b-v1.5
+# For nemotron-3-nano models hosted on NVIDIA cloud, use:
+# export APP_LLM_MODELNAME=nvidia/nemotron-3-nano-30b-a3b
+# Note: For locally deployed nemotron-3-nano, use: nvidia/nemotron-3-nano
 export APP_FILTEREXPRESSIONGENERATOR_MODELNAME=nvidia/llama-3.3-nemotron-super-49b-v1.5
 export APP_EMBEDDINGS_MODELNAME=nvdev/nvidia/llama-3.2-nv-embedqa-1b-v2
 # For VLM Embedding Model (Nemoretriever-1b-vlm-embed-v1)
```

deploy/helm/nvidia-blueprint-rag/values.yaml

Lines changed: 5 additions & 0 deletions
```diff
@@ -164,6 +164,11 @@ envVars:
   LLM_TEMPERATURE: "0"
   LLM_TOP_P: "1.0"
 
+  # Enable/disable thinking/reasoning for nemotron-3-nano models (30b variant)
+  # Set to "true" to enable reasoning mode with reasoning_budget
+  # Set to "false" to disable reasoning and get direct answers
+  ENABLE_NEMOTRON_3_NANO_THINKING: "true"
+
   ##===Query Rewriter Model specific configurations===
   APP_QUERYREWRITER_MODELNAME: "nvidia/llama-3.3-nemotron-super-49b-v1.5"
   # URL on which query rewriter model is hosted. If "", Nvidia hosted API is used
```

docs/enable-nemotron-thinking.md

Lines changed: 43 additions & 6 deletions
````diff
@@ -112,11 +112,17 @@ When the thinking budget is enabled, the model monitors the token count within t
 
 As of NIM version 1.12, the Thinking Budget feature is supported on the following models:
 
-- **nvidia/nvidia-nemotron-nano-9b-v2**
-- **nvidia/nemotron-3-nano-30b-a3b**
+- **nvidia-nemotron-nano-9b-v2**
+- **nvidia/nemotron-3-nano-30b-a3b** (also accessible as `nvidia/nemotron-3-nano`)
 
 For the latest supported models, refer to the [NIM Thinking Budget Control documentation](https://docs.nvidia.com/nim/large-language-models/latest/thinking-budget-control.html).
 
+> **Note:** The model `nvidia/nemotron-3-nano` is an alias that can be used interchangeably with `nvidia/nemotron-3-nano-30b-a3b`. Both refer to the same underlying model.
+>
+> **Important - Model Naming:**
+> - **For locally deployed NIMs:** Use model name `nvidia/nemotron-3-nano`
+> - **For NVIDIA-hosted models:** Use model name `nvidia/nemotron-3-nano-30b-a3b`
+
 ### Enabling Thinking Budget on RAG
 
 After enabling the reasoning as per the steps mentioned above, enable the thinking budget feature in RAG by including the following parameters in your API request:
@@ -126,13 +132,29 @@ After enabling the reasoning as per the steps mentioned above, enable the thinki
 | `min_thinking_tokens` | 1 | Minimum number of thinking tokens to allocate for reasoning models. |
 | `max_thinking_tokens` | 8192 | Maximum number of thinking tokens to allocate for reasoning models. |
 
-> **Note for `nvidia/nemotron-3-nano-30b-a3b`**
-> This model only uses the `max_thinking_tokens` parameter.
-> - `min_thinking_tokens` is ignored for this model.
+> **Note for `nvidia/nemotron-3-nano-30b-a3b` and `nvidia/nemotron-3-nano`**
+> These models only use the `max_thinking_tokens` parameter.
+> - `min_thinking_tokens` is ignored for these models.
 > - Thinking budget is enabled by passing a positive `max_thinking_tokens` value in the request.
+> - The RAG blueprint automatically handles the model-specific parameter mapping internally (`max_thinking_tokens` → `reasoning_budget`).
+> - Unlike `nvidia-nemotron-nano-9b-v2`, these models return reasoning in a separate `reasoning_content` field rather than using `<think>` tags.
+>
+> **Controlling Reasoning for nemotron-3-nano:**
+> - Set `ENABLE_NEMOTRON_3_NANO_THINKING=true` (default) to enable reasoning/thinking mode
+> - Set `ENABLE_NEMOTRON_3_NANO_THINKING=false` to disable reasoning mode
+> - This controls the `enable_thinking` flag in `chat_template_kwargs`
+>
+> **Model Behavior Differences:**
+>
+> | Model | Reasoning Control | Reasoning Output | Token Budget Parameter |
+> |-------|------------------|------------------|----------------------|
+> | `nvidia-nemotron-nano-9b-v2` | `min_thinking_tokens`, `max_thinking_tokens` | In `content` field with `<think>` tags | `min_thinking_tokens`, `max_thinking_tokens` |
+> | `nvidia/nemotron-3-nano-30b-a3b` | `ENABLE_NEMOTRON_3_NANO_THINKING` env var | In `reasoning_content` field | `reasoning_budget` (mapped from `max_thinking_tokens`) |
+> | `nvidia/llama-3.3-nemotron-super-49b-v1.5` | System prompt (`/think` or `/no_think`) | In `content` field with `<think>` tags | N/A (controlled by prompt) |
 
 **Example API requests:**
 
+**For nvidia-nemotron-nano-9b-v2:**
 ```json
 {
   "messages": [
@@ -143,10 +165,25 @@ After enabling the reasoning as per the steps mentioned above, enable the thinki
   ],
   "min_thinking_tokens": 1,
   "max_thinking_tokens": 8192,
-  "model": "nvidia/nvidia-nemotron-nano-9b-v2"
+  "model": "nvidia-nemotron-nano-9b-v2"
+}
+```
+
+**For nemotron-3-nano (locally deployed):**
+```json
+{
+  "messages": [
+    {
+      "role": "user",
+      "content": "What is the FY2017 operating cash flow ratio for Adobe?"
+    }
+  ],
+  "max_thinking_tokens": 8192,
+  "model": "nvidia/nemotron-3-nano"
 }
 ```
 
+**For nemotron-3-nano (NVIDIA-hosted):**
 ```json
 {
   "messages": [
````

src/nvidia_rag/utils/llm.py

Lines changed: 112 additions & 16 deletions
```diff
@@ -16,10 +16,11 @@
 """The wrapper for interacting with llm models and pre or postprocessing LLM response.
 1. get_prompts: Get the prompts from the YAML file.
 2. get_llm: Get the LLM model. Uses the NVIDIA AI Endpoints or OpenAI.
-3. streaming_filter_think: Filter the think tokens from the LLM response (sync).
-4. get_streaming_filter_think_parser: Get the parser for filtering the think tokens (sync).
-5. streaming_filter_think_async: Filter the think tokens from the LLM response (async).
-6. get_streaming_filter_think_parser_async: Get the parser for filtering the think tokens (async).
+3. extract_reasoning_and_content: Extract reasoning and content from response chunks.
+4. streaming_filter_think: Filter the think tokens from the LLM response (sync).
+5. get_streaming_filter_think_parser: Get the parser for filtering the think tokens (sync).
+6. streaming_filter_think_async: Filter the think tokens from the LLM response (async).
+7. get_streaming_filter_think_parser_async: Get the parser for filtering the think tokens (async).
 """
 
 import logging
@@ -131,8 +132,19 @@ def _bind_thinking_tokens_if_configured(
 ) -> LLM | SimpleChatModel:
     """
     If min_thinking_tokens or max_thinking_tokens are > 0 in kwargs, bind them to the LLM.
-    For models that use a reasoning budget (e.g., nemotron-3-nano-30b-a3b),
-    max_thinking_tokens is mapped to the underlying ChatNVIDIA ``reasoning_budget`` parameter.
+
+    Supports multiple reasoning/thinking model variants:
+
+    1. nvidia-nemotron-nano-9b-v2:
+       - Uses min_thinking_tokens and max_thinking_tokens parameters
+       - Outputs reasoning wrapped in <think></think> tags in the content stream
+
+    2. nemotron-3-nano variants (nemotron-3-nano-30b-a3b, nvidia/nemotron-3-nano):
+       - Uses reasoning_budget parameter (mapped from max_thinking_tokens)
+       - Requires chat_template_kwargs={"enable_thinking": True/False}
+       - Outputs reasoning in a separate 'reasoning_content' field (not in content)
+       - Does NOT use <think> tags
+       - Can be controlled via ENABLE_NEMOTRON_3_NANO_THINKING env var
 
     Raises:
         ValueError: If min_thinking_tokens or max_thinking_tokens is passed but model
@@ -151,13 +163,23 @@ def _bind_thinking_tokens_if_configured(
     if not has_thinking_tokens:
         return llm
 
-    if has_thinking_tokens and "nvidia-nemotron-nano-9b-v2" not in model \
-            and "nemotron-3-nano-30b-a3b" not in model:
-        raise ValueError(
-            "min_thinking_tokens and max_thinking_tokens are only supported for models "
-            "'nvidia-nemotron-nano-9b-v2' and 'nemotron-3-nano-30b-a3b', "
-            f"but got model '{model}'"
-        )
+    # Check if model is a supported reasoning model (various name formats)
+    # Note: For locally hosted models, use "nvidia/nemotron-3-nano"
+    # For NVIDIA-hosted models, use "nvidia/nemotron-3-nano-30b-a3b"
+    is_nano_9b_v2 = model and "nvidia-nemotron-nano-9b-v2" in model
+    is_nemotron_3_nano = model and (
+        "nemotron-3-nano" in model.lower() or
+        "nvidia/nemotron-3-nano" in model or
+        "nemotron-3-nano-30b-a3b" in model
+    )
+
+    if has_thinking_tokens and not (is_nano_9b_v2 or is_nemotron_3_nano):
+        raise ValueError(
+            "min_thinking_tokens and max_thinking_tokens are only supported for models "
+            "'nvidia-nemotron-nano-9b-v2' and nemotron-3-nano variants "
+            "(e.g., 'nemotron-3-nano-30b-a3b', 'nvidia/nemotron-3-nano'), "
+            f"but got model '{model}'"
+        )
 
     # Validate parameter values - must be positive if provided
     if min_think is not None and min_think <= 0:
```
```diff
@@ -170,15 +192,31 @@ def _bind_thinking_tokens_if_configured(
         )
 
     bind_args = {}
-    if "nvidia-nemotron-nano-9b-v2" in model:
+    if is_nano_9b_v2:
+        # nvidia-nemotron-nano-9b-v2: Uses thinking token parameters directly
         if min_think is not None and min_think > 0:
             bind_args["min_thinking_tokens"] = min_think
         if max_think is not None and max_think > 0:
             bind_args["max_thinking_tokens"] = max_think
-    elif "nemotron-3-nano-30b-a3b" in model:
+    elif is_nemotron_3_nano:
+        # nemotron-3-nano variants: Use reasoning_budget and enable_thinking flag
+        # Check environment variable for enable_thinking control
+        enable_thinking_env = os.getenv("ENABLE_NEMOTRON_3_NANO_THINKING", "true").lower()
+        enable_thinking = enable_thinking_env in ("true", "1", "yes")
+
         if max_think is not None and max_think > 0:
             bind_args["reasoning_budget"] = max_think
-            bind_args["chat_template_kwargs"] = {"enable_thinking": True}
+            bind_args["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
+            logger.info(
+                "nemotron-3-nano: Setting reasoning_budget=%d, enable_thinking=%s (from env: %s)",
+                max_think, enable_thinking, enable_thinking_env
+            )
+        # Note: min_thinking_tokens is not supported for nemotron-3-nano variants
+        if min_think is not None and min_think > 0:
+            logger.warning(
+                "min_thinking_tokens is not supported for nemotron-3-nano variants, "
+                "only max_thinking_tokens (mapped to reasoning_budget) is supported"
+            )
 
     if bind_args:
         return llm.bind(**bind_args)
```
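For a nemotron-3-nano request carrying `max_thinking_tokens=8192`, the binding above is equivalent to the following sketch. The `base_url` is a hypothetical local NIM endpoint; only the two `bind` calls reflect what this hunk actually produces:

```python
from langchain_nvidia_ai_endpoints import ChatNVIDIA

# Hypothetical local NIM endpoint; adjust base_url for your deployment.
llm = ChatNVIDIA(model="nvidia/nemotron-3-nano", base_url="http://localhost:8000/v1")

# With ENABLE_NEMOTRON_3_NANO_THINKING unset (default "true"):
llm_thinking = llm.bind(
    reasoning_budget=8192,
    chat_template_kwargs={"enable_thinking": True},
)

# With ENABLE_NEMOTRON_3_NANO_THINKING=false, the budget is still bound,
# but the chat template disables thinking and the model answers directly:
llm_direct = llm.bind(
    reasoning_budget=8192,
    chat_template_kwargs={"enable_thinking": False},
)
```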
```diff
@@ -307,6 +345,64 @@ def get_llm(config: NvidiaRAGConfig | None = None, **kwargs) -> LLM | SimpleChat
     )
 
 
+def extract_reasoning_and_content(chunk) -> tuple[str, str]:
+    """
+    Extract both reasoning and content from a response chunk.
+
+    Different models handle reasoning differently:
+    - nvidia-nemotron-nano-9b-v2: Uses <think> tags in content stream
+    - nemotron-3-nano variants: Uses separate reasoning_content field
+    - llama-3.3-nemotron-super-49b: Uses <think> tags in content stream (controlled by prompt)
+
+    This function is designed to be robust and compatible with future changes:
+    - Checks both reasoning_content and content fields
+    - Returns whichever field has tokens, regardless of model behavior
+    - If both have content, returns both separately
+
+    This ensures that if the model server fixes the issue where reasoning is disabled
+    but content still goes to reasoning_content, the code will still work correctly.
+
+    Args:
+        chunk: A response chunk from ChatNVIDIA or similar LLM interface
+
+    Returns:
+        tuple: (reasoning_text, content_text) - either may be an empty string
+
+    Example:
+        >>> for chunk in llm.stream([HumanMessage(content="question")]):
+        ...     reasoning, content = extract_reasoning_and_content(chunk)
+        ...     if reasoning:
+        ...         print(f"[REASONING: {reasoning}]", end="", flush=True)
+        ...     if content:
+        ...         print(content, end="", flush=True)
+    """
+    reasoning = ""
+    content = ""
+
+    # Check for reasoning_content in additional_kwargs (nemotron-3-nano variants)
+    # This field is populated by nemotron-3-nano models for reasoning output
+    if hasattr(chunk, 'additional_kwargs') and 'reasoning_content' in chunk.additional_kwargs:
+        reasoning = chunk.additional_kwargs.get('reasoning_content', '')
+
+    # Check for regular content
+    # This field is populated by most models for regular output
+    # For nemotron-nano-9b-v2 and llama-49b, this may include <think> tags
+    if hasattr(chunk, 'content') and chunk.content:
+        content = chunk.content
+
+    # Fallback case: if the reasoning field has tokens but content is empty,
+    # the reasoning text may actually be the final answer. This occurs when
+    # enable_thinking=false but the model still routes its output to
+    # reasoning_content instead of content.
+    if reasoning and not content:
+        # The value is intentionally kept in the reasoning field rather than
+        # moved into content, so this stays compatible with a future model
+        # server fix; the caller decides how to handle this case.
+        pass
+
+    return reasoning, content
+
+
 def streaming_filter_think(chunks: Iterable[str]) -> Iterable[str]:
     """
     This generator filters content between think tags in streaming LLM responses.
```
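Putting the new helper together with a streaming call, a caller can separate the two channels as in the sketch below, adapted from the function's docstring example. The `base_url` is an assumed local endpoint, and the `bind` call reproduces what `_bind_thinking_tokens_if_configured` does for this model family:

```python
from langchain_core.messages import HumanMessage
from langchain_nvidia_ai_endpoints import ChatNVIDIA

from nvidia_rag.utils.llm import extract_reasoning_and_content

# Hypothetical local NIM endpoint; adjust for your deployment.
llm = ChatNVIDIA(model="nvidia/nemotron-3-nano", base_url="http://localhost:8000/v1")
llm = llm.bind(reasoning_budget=1024, chat_template_kwargs={"enable_thinking": True})

for chunk in llm.stream([HumanMessage(content="What is 17 * 23?")]):
    reasoning, content = extract_reasoning_and_content(chunk)
    if reasoning:
        # nemotron-3-nano emits reasoning in reasoning_content, not <think> tags
        print(f"[reasoning] {reasoning}", end="", flush=True)
    if content:
        print(content, end="", flush=True)
```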
