`docs/guides/sft.md` (6 additions, 7 deletions)
````diff
@@ -191,7 +191,7 @@ policy:
     use_triton: true # Use Triton-optimized kernels (DTensor v2 path)
 ```
 
-### Parameter Details
+### DTensor (Automodel) Parameter Details
 - **`enabled`** (bool): Whether to enable LoRA training
 - **`target_modules`** (list): Specific module names to apply LoRA to. An empty list with `match_all_linear=true` applies LoRA to all linear layers
 - **`exclude_modules`** (list): Module names to exclude from LoRA
````
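To make the shape of these options concrete, here is a minimal sketch of a DTensor (Automodel) LoRA block under `policy`. The nesting key for the LoRA section (`lora_cfg`) is an assumption for illustration only; the parameter names and the `use_triton`/`match_all_linear` flags come from the guide's config block and parameter list above.

```yaml
policy:
  lora_cfg:                  # assumed section name; follow the full config block in the guide
    enabled: true            # turn LoRA training on
    match_all_linear: true   # with target_modules left empty, adapt every linear layer
    target_modules: []       # or list specific module names instead of using match_all_linear
    exclude_modules: []      # module names to skip
    use_triton: true         # Use Triton-optimized kernels (DTensor v2 path)
```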
````diff
@@ -230,13 +230,13 @@ policy:
     lora_dtype: None # Weight's dtype
 ```
 
-### Parameter Details
+### Megatron Parameter Details
 - **`enabled`** (bool): Whether to enable LoRA training
 - **`target_modules`** (list): Specific module names to apply LoRA to. Defaults to all linear layers if the list is left empty. Example: ['linear_qkv', 'linear_proj', 'linear_fc1', 'linear_fc2'].
-- 'linear_qkv': Apply LoRA to the fused linear layer used for query, key, and value projections in self-attention.
-- 'linear_proj': Apply LoRA to the linear layer used for projecting the output of self-attention.
-- 'linear_fc1': Apply LoRA to the first fully-connected layer in the MLP.
-- 'linear_fc2': Apply LoRA to the second fully-connected layer in the MLP.
+  - 'linear_qkv': Apply LoRA to the fused linear layer used for query, key, and value projections in self-attention.
+  - 'linear_proj': Apply LoRA to the linear layer used for projecting the output of self-attention.
+  - 'linear_fc1': Apply LoRA to the first fully-connected layer in the MLP.
+  - 'linear_fc2': Apply LoRA to the second fully-connected layer in the MLP.
 Target modules can also contain wildcards. For example, you can specify target_modules=['*.layers.0.*.linear_qkv', '*.layers.1.*.linear_qkv'] to add LoRA to only linear_qkv on the first two layers.
 - **`exclude_modules`** (List[str], optional): A list of module names not to apply LoRA to. LoRA will be applied to all nn.Linear and nn.Linear-adjacent modules whose names do not match any string in exclude_modules. If used, target_modules must be an empty list or None.
 - **`dim`** (int): LoRA rank (r). Lower values = fewer parameters but less capacity. Typical: 4, 8, 16, 32, 64
````
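As a hedged illustration of the wildcard syntax described above, here is a Megatron-side sketch that adds LoRA only to `linear_qkv` in the first two layers. The `lora_cfg` key placement is assumed, as in the earlier sketch; the parameter names (`target_modules`, `dim`, `lora_dtype`) are the ones documented in the list.

```yaml
policy:
  lora_cfg:                         # assumed section name, as in the sketch above
    enabled: true
    target_modules:
      - '*.layers.0.*.linear_qkv'   # wildcard: only layer 0's fused QKV projection
      - '*.layers.1.*.linear_qkv'   # wildcard: only layer 1's fused QKV projection
    dim: 16                         # LoRA rank (r)
    lora_dtype: None                # inherit orig_linear's dtype (set explicitly for 4-bit weights)
```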
````diff
@@ -247,7 +247,6 @@ policy:
 - **`lora_B_init`** (str): Initialization method for the low-rank matrix B. Defaults to "zero".
 - **`a2a_experimental`** (bool): Enables the experimental All-to-All (A2A) communication strategy. Defaults to False.
 - **`lora_dtype`** (torch.dtype): Weight dtype. By default it uses orig_linear's dtype, but it must be specified explicitly for quantized weights (e.g. 4-bit).
-only.
 
 ### Megatron Example Usage
 The config uses DTensor by default, so the Megatron backend needs to be explicitly enabled.
````
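The guide's own example follows that sentence in the file. Purely as a hedged sketch of what explicitly switching backends typically looks like, assuming the toggle keys are `dtensor_cfg.enabled` and `megatron_cfg.enabled` (key names not confirmed by this diff):

```yaml
policy:
  dtensor_cfg:
    enabled: false   # assumed key: turn off the default DTensor path
  megatron_cfg:
    enabled: true    # assumed key: opt in to the Megatron backend
```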