Checkpoint Engine provides efficient distributed checkpoint loading for SGLang inference servers, significantly reducing model loading time for large models and multi-node setups.
### Quick Start

**1. Install checkpoint-engine:**

```bash
pip install 'checkpoint-engine[p2p]'
```

**2. Launch SGLang server:**

```bash
python -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --tp 8 \
    --load-format dummy \
    --wait-for-initial-weights
```

**3. Run checkpoint engine:**

```bash
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path $MODEL_PATH \
    --inference-parallel-size 8
```
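The three steps above can be sketched as one end-to-end flow on a single node. This is illustrative only: the model path is a hypothetical placeholder, and running the server in the background with `&` is one convenient pattern, not a requirement.

```bash
# Hypothetical model location -- replace with your checkpoint directory.
export MODEL_PATH=/path/to/model

# Start the server in the background. With --load-format dummy it comes up
# without real weights and, due to --wait-for-initial-weights, does not
# become ready until the checkpoint engine delivers them.
python -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --tp 8 \
    --load-format dummy \
    --wait-for-initial-weights &

# Push the real weights into the waiting server.
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path $MODEL_PATH \
    --inference-parallel-size 8
```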

### Multi-Node Setup

For a two-node setup, run the same commands on both nodes, setting `--host` and the distributed parameters appropriately for each node.
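A two-node launch might look like the sketch below. The IP address and port are placeholders, and the multi-node flags (`--dist-init-addr`, `--nnodes`, `--node-rank`) are assumptions based on SGLang's usual multi-node options; check `python -m sglang.launch_server --help` for your installed version.

```bash
# Node 0 (assumed reachable at 10.0.0.1 -- placeholder address):
python -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --tp 16 \
    --load-format dummy \
    --wait-for-initial-weights \
    --host 0.0.0.0 \
    --dist-init-addr 10.0.0.1:5000 \
    --nnodes 2 \
    --node-rank 0

# Node 1: the same command with --node-rank 1.
```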
### Key Options
185
+
186
+
**SGLang Server:**
187
+
-`--wait-for-initial-weights`: Wait for checkpoint engine before becoming ready
-`--update-method`: Choose `broadcast`, `p2p`, or `all`
192
+
-`--inference-parallel-size`: Number of parallel processes
193
+
-`--checkpoint-path`: Model checkpoint directory
194
+
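For example, the update command from step 3 can be switched from `broadcast` to the `p2p` method, which the optional `[p2p]` extra installed in step 1 supports; this sketch only changes the `--update-method` value, with all other flags as in the Quick Start:

```bash
# Same invocation as step 3, using the p2p update method instead of broadcast.
python -m sglang.srt.checkpoint_engine.update \
    --update-method p2p \
    --checkpoint-path $MODEL_PATH \
    --inference-parallel-size 8
```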

## Limitations and Future Work

- This project is currently tested with vLLM and SGLang. Integration with other frameworks is planned for future releases.
- The perfect three-stage pipeline described in our paper is not yet implemented. It could be useful for architectures where H2D transfers and broadcast do not contend on PCIe.