docs/fp8.md (4 additions, 0 deletions)
@@ -11,6 +11,10 @@ This module provides a suite of tools to enable FP8 quantization for large langu
- Uses **TransformerEngine** for linear layer implementation.
- Supports both **DeepSeek-style sub-channel scaling** and **per-tensor scaling**.

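To make the difference between the two scaling modes concrete, here is a minimal pure-Python sketch. It is not TransformerEngine code: the helper names `per_tensor_scale` and `block_scales` are hypothetical, the 2x2 tile size is chosen only to keep the example small, and the only external fact assumed is that the largest finite FP8 E4M3 value is 448.

```python
# Illustrative only: pure-Python scale computation, not TransformerEngine code.
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def per_tensor_scale(matrix):
    """One scale for the whole tensor: amax / FP8 max."""
    amax = max(abs(x) for row in matrix for x in row)
    return amax / FP8_E4M3_MAX


def block_scales(matrix, block_rows, block_cols):
    """One scale per (block_rows x block_cols) tile of the matrix."""
    scales = []
    for r0 in range(0, len(matrix), block_rows):
        row_of_scales = []
        for c0 in range(0, len(matrix[0]), block_cols):
            tile = [
                abs(x)
                for row in matrix[r0:r0 + block_rows]
                for x in row[c0:c0 + block_cols]
            ]
            row_of_scales.append(max(tile) / FP8_E4M3_MAX)
        scales.append(row_of_scales)
    return scales


# A 2x4 matrix with one large outlier in the right half.
weights = [
    [0.5, -1.0, 2.0, 448.0],
    [0.25, 0.75, -4.0, 8.0],
]
print(per_tensor_scale(weights))    # the outlier sets the scale for every element
print(block_scales(weights, 2, 2))  # only the tile holding the outlier gets the large scale
```

With per-tensor scaling, a single outlier stretches the scale for all values, so small entries lose resolution. With sub-channel (block) scaling, the outlier only affects its own tile.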
### Recommended recipe

- For Hopper GPUs, we recommend FP8 (DeepSeek-style) precision for both generation and training, for the best convergence and speedup.
- For Blackwell GPUs, FP8 (DeepSeek-style) with an FP32 scaling factor is not supported in training. We currently recommend FP8 precision for generation and BF16 for training. We are actively exploring other recipes for better performance.

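The recommendation above can be summarized as a small lookup. This is only a sketch: `recommended_precision` and the dictionary keys are hypothetical names for illustration and do not correspond to actual NeMo RL configuration fields.

```python
def recommended_precision(gpu_arch):
    """Map a GPU architecture name to the recommended precision per phase.

    Hypothetical helper; the keys are illustrative, not NeMo RL config fields.
    """
    recipes = {
        # Hopper: FP8 (DeepSeek-style) for both generation and training.
        "hopper": {"generation": "fp8", "training": "fp8"},
        # Blackwell: FP8 for generation, BF16 for training (for now).
        "blackwell": {"generation": "fp8", "training": "bf16"},
    }
    try:
        return recipes[gpu_arch.lower()]
    except KeyError:
        raise ValueError(f"no recommended recipe for architecture {gpu_arch!r}") from None


print(recommended_precision("Hopper"))
print(recommended_precision("Blackwell"))
```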
## Integration with NeMo RL

NeMo RL applies monkey patches to several core `vLLM` components to enable FP8 generation for reinforcement learning.
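
For readers unfamiliar with the technique, the following sketch shows what a monkey patch looks like in general. The `Engine` class and `apply_fp8_patch` function are hypothetical stand-ins; they are not real `vLLM` components and do not reflect which methods NeMo RL actually patches.

```python
import functools


class Engine:
    """Hypothetical stand-in for a third-party class; not a real vLLM component."""

    def load_weights(self, name):
        return f"loaded {name} in bf16"


def apply_fp8_patch():
    """Replace Engine.load_weights at runtime, wrapping the original method."""
    original = Engine.load_weights

    @functools.wraps(original)
    def patched(self, name):
        result = original(self, name)
        return result.replace("bf16", "fp8")  # placeholder for real FP8 handling

    Engine.load_weights = patched


apply_fp8_patch()
print(Engine().load_weights("layer0"))  # loaded layer0 in fp8
```

Because the patch replaces the attribute on the class itself, every instance created before or after the call picks up the new behavior, without any change to the library's source.
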
The `ft_launcher` is provided by `nvidia-resiliency-ext` (included in NeMo RL dependencies) and enables automatic fault tolerance and recovery for distributed training runs.

## Key Arguments

| Argument | Description | Example |
|----------|-------------|---------|
| `--ft-cfg-path` | Path to FT YAML config file | `examples/ft_launcher/ft_config.yaml` |
| `--ft-rank-heartbeat-timeout` | Heartbeat timeout in seconds | `450` |
| `--ft-initial-rank-heartbeat-timeout` | Initial timeout (longer for setup) | `1200` |
| `--max-restarts` | Maximum number of restart attempts | `5` |
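
As a rough illustration of how these arguments fit together, the sketch below assembles an `ft_launcher` command line in Python using the example values from the table. The helper `build_ft_launcher_command` and the script name `train.py` are hypothetical, and a real launch will likely need further arguments (for example, node and process counts) that are not shown here. The check reflects the table's note that the initial timeout is the longer one, to leave room for setup.

```python
import shlex


def build_ft_launcher_command(script, cfg_path, heartbeat_timeout=450,
                              initial_heartbeat_timeout=1200, max_restarts=5):
    """Assemble an ft_launcher command line from the arguments in the table."""
    if initial_heartbeat_timeout <= heartbeat_timeout:
        raise ValueError("initial heartbeat timeout should exceed the regular timeout")
    return [
        "ft_launcher",
        "--ft-cfg-path", cfg_path,
        "--ft-rank-heartbeat-timeout", str(heartbeat_timeout),
        "--ft-initial-rank-heartbeat-timeout", str(initial_heartbeat_timeout),
        "--max-restarts", str(max_restarts),
        script,
    ]


command = build_ft_launcher_command(
    script="train.py",  # hypothetical training script
    cfg_path="examples/ft_launcher/ft_config.yaml",
)
print(shlex.join(command))
```
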
2. **Timeouts**: Set `--ft-initial-rank-heartbeat-timeout` higher than `--ft-rank-heartbeat-timeout` to allow for model loading/setup time.

3. **Restart Policy**: The `any-failed` restart policy will restart the entire job if any rank fails. Look for these log messages to identify when a restart occurs: