- Performance baseline for NVIDIA DGX Spark GB10 128G vram #45
- Performance baseline for NVIDIA DGX Spark GB10 128G vram machine-learning#47
- https://github.com/ObrienlabsDev/blog/blob/main/nvidia.md
- Performance baseline for NVIDIA DGX Spark GB10 128G vram blog#144
- Java 25 - see #10
NVIDIA GB10 Arm DGX Spark
- 20251108
- 202510 https://marketplace.nvidia.com/en-us/developer/dgx-spark
- https://www.canadacomputers.com/en/workstations/279129/nvidia-dgx-spark-ai-mini-pc-940-54242-0000-000-940-54242-0000-000.html?srsltid=AfmBOoqBukUC7jk2tNZSeKghJ-I4_1RjpcxHzvZKQrZ_y6Zv99z1FThQ
- https://www.youtube.com/watch?v=AvgZscNZzYw&t=105s
nvidia-smi
```
michael@spark-7d19:~$ nvidia-smi
Sun Nov 9 09:26:12 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GB10 On | 0000000F:01:00.0 On | N/A |
| N/A 45C P0 11W / N/A | Not Supported | 10% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 4174 G /usr/lib/xorg/Xorg 84MiB |
| 0 N/A N/A 4399 G /usr/bin/gnome-shell 125MiB |
| 0 N/A N/A 4761 G .../7182/usr/lib/firefox/firefox 273MiB |
+-----------------------------------------------------------------------------------------+
michael@spark-7d19:~$ free
              total        used        free      shared  buff/cache   available
Mem:      125513720     5088412   118621120       20596     2728844   120425308
Swap:      16777212           0    16777212
```
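The `free` output above is in KiB; a quick sketch converts those figures to GiB to confirm the unified-memory headline (values copied from the output, so this is just arithmetic):

```python
# Convert the KiB figures reported by `free` above into GiB.
# 1 GiB = 1024**2 KiB.
KIB_PER_GIB = 1024 ** 2

mem_total_kib = 125_513_720   # "Mem: total"
mem_avail_kib = 120_425_308   # "Mem: available"
swap_total_kib = 16_777_212   # "Swap: total"

print(f"RAM total:     {mem_total_kib / KIB_PER_GIB:.1f} GiB")   # ~119.7 GiB
print(f"RAM available: {mem_avail_kib / KIB_PER_GIB:.1f} GiB")   # ~114.8 GiB
print(f"Swap:          {swap_total_kib / KIB_PER_GIB:.1f} GiB")  # ~16.0 GiB
```

So the "128G" in the issue title is the marketing figure; the OS sees roughly 119.7 GiB of unified memory, shared between CPU and GPU.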
```
michael@spark-7d19:~$ lscpu
Architecture:             aarch64
  CPU op-mode(s):         64-bit
  Byte Order:             Little Endian
CPU(s):                   20
  On-line CPU(s) list:    0-19
Vendor ID:                ARM
  Model name:             Cortex-X925
    Model:                1
    Thread(s) per core:   1
    Core(s) per socket:   10
    Socket(s):            1
    Stepping:             r0p1
    CPU(s) scaling MHz:   93%
    CPU max MHz:          4004.0000
    CPU min MHz:          1378.0000
    BogoMIPS:             2000.00
    Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti ecv afp wfxt
  Model name:             Cortex-A725
    Model:                1
    Thread(s) per core:   1
    Core(s) per socket:   10
    Socket(s):            1
    Stepping:             r0p1
    CPU(s) scaling MHz:   112%
    CPU max MHz:          2860.0000
    CPU min MHz:          338.0000
    BogoMIPS:             2000.00
    Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti ecv afp wfxt
Caches (sum of all):
  L1d:                    1.3 MiB (20 instances)
  L1i:                    1.3 MiB (20 instances)
  L2:                     25 MiB (20 instances)
  L3:                     24 MiB (2 instances)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-19
Vulnerabilities:
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; __user pointer sanitization
  Spectre v2:             Not affected
  Srbds:                  Not affected
  Tsx async abort:        Not affected
```
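The lscpu output shows a big.LITTLE-style layout: 10 Cortex-X925 performance cores plus 10 Cortex-A725 efficiency cores, one thread per core. A small sketch tallies that topology (figures copied from the output above):

```python
# Tally the CPU topology reported by lscpu above.
# Per-cluster figures are copied from the lscpu output.
clusters = {
    "Cortex-X925": {"cores": 10, "threads_per_core": 1, "max_mhz": 4004.0},
    "Cortex-A725": {"cores": 10, "threads_per_core": 1, "max_mhz": 2860.0},
}

total_cpus = sum(c["cores"] * c["threads_per_core"] for c in clusters.values())
print(f"Logical CPUs: {total_cpus}")  # matches "CPU(s): 20" from lscpu
```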
Java 8 (update to 25)
```
michael@spark-7d19:~$ java -version
openjdk version "1.8.0_462"
OpenJDK Runtime Environment (build 1.8.0_462-8u462-ga~us1-0ubuntu2~24.04.2-b08)
OpenJDK 64-Bit Server VM (build 25.462-b08, mixed mode)
```
NVIDIA PyTorch Training demo
~55 GB of shared VRAM in use, 6144 CUDA cores; 95% GPU utilization, 80 °C, 102 s runtime at 0.62 steps/sec
https://build.nvidia.com/spark/pytorch-fine-tune/instructions
https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/pytorch-fine-tune/assets/Llama3_8B_LoRA_finetuning.py
```
michael@spark-7d19:~/wse_github/ObrienlabsDev$ docker run --gpus all -it --rm --name pytorch --ipc=host -v $HOME/.cache/huggingface:/root/.cache/huggingface -v ${PWD}:/workspace -w /workspace nvcr.io/nvidia/pytorch:25.09-py3
root@f195eff1f920:/workspace/dgx-spark-playbooks/nvidia/pytorch-fine-tune/assets# pip install transformers peft datasets "trl==0.19.1" "bitsandbytes==0.48"
root@f195eff1f920:/workspace/dgx-spark-playbooks/nvidia/pytorch-fine-tune/assets# huggingface-cli login
root@f195eff1f920:/workspace/dgx-spark-playbooks/nvidia/pytorch-fine-tune/assets# git clone https://github.com/NVIDIA/dgx-spark-playbooks
root@f195eff1f920:/workspace/dgx-spark-playbooks/nvidia/pytorch-fine-tune/assets# cd dgx-spark-playbooks/nvidia/pytorch-fine-tune/assets
root@f195eff1f920:/workspace/dgx-spark-playbooks/nvidia/pytorch-fine-tune/assets# python Llama3_3B_full_finetuning.py
LLAMA 3.2 3B FULL FINE-TUNING CONFIGURATION
============================================================
Model: meta-llama/Llama-3.2-3B-Instruct
Training mode: Full SFT
Batch size: 8
Gradient accumulation: 1
Effective batch size: 8
Sequence length: 2048
Number of epochs: 1
Learning rate: 5e-05
Dataset size: 500
Gradient checkpointing: False
Torch compile: False
============================================================
Loading model: meta-llama/Llama-3.2-3B-Instruct
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████| 2/2 [00:38<00:00, 19.16s/it]
Total parameters: 3,212,749,824
Trainable parameters: 3,212,749,824 (100% - Full Fine-tuning)
Loading dataset with 500 samples...
Starting full fine-tuning for 1 epoch(s)...
The model is already on multiple devices. Skipping the move to device specified in `args`.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 128009, 'pad_token_id': 128009}.
{'loss': 2.6122, 'grad_norm': 34.25, 'learning_rate': 5e-05, 'num_tokens': 607.0, 'mean_token_accuracy': 0.4223706126213074, 'epoch': 0.02}
...
{'loss': 1.0151, 'grad_norm': 6.96875, 'learning_rate': 7.936507936507937e-07, 'num_tokens': 52435.0, 'mean_token_accuracy': 0.7538802623748779, 'epoch': 1.0}
{'train_runtime': 101.8642, 'train_samples_per_second': 4.908, 'train_steps_per_second': 0.618, 'train_loss': 1.0988628211475553, 'epoch': 1.0}
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 63/63 [01:41<00:00, 1.62s/it]
============================================================
TRAINING COMPLETED
============================================================
Training runtime: 101.86 seconds
Samples per second: 4.91
Steps per second: 0.62
Train loss: 1.0989
```
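The reported throughput is internally consistent; a quick sketch reproduces the numbers from the run configuration (all figures copied from the log above):

```python
# Sanity-check the throughput figures from the training log above.
runtime_s = 101.8642   # 'train_runtime'
samples = 500          # 'Dataset size'
steps = 63             # progress-bar total (63/63)
batch_size = 8         # 'Effective batch size'

print(f"samples/s: {samples / runtime_s:.3f}")  # ~4.908, matches the log
print(f"steps/s:   {steps / runtime_s:.3f}")    # ~0.618, matches the log

# ceil(500 / 8) = 63 steps per epoch, so the last step is a partial batch.
steps_needed = -(-samples // batch_size)
print(f"steps per epoch: {steps_needed}")
```

At ~52 k trained tokens in ~102 s, this 500-sample run is a short smoke test rather than a full-dataset benchmark, but the steps/sec figure is a usable baseline for the GB10.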
```
michael@spark-7d19:~/wse_github/ObrienlabsDev$ nvidia-smi
Mon Nov 10 14:51:55 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GB10 On | 0000000F:01:00.0 On | N/A |
| N/A 77C P0 76W / N/A | Not Supported | 95% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 3188 G /usr/lib/xorg/Xorg 98MiB |
| 0 N/A N/A 3361 G /usr/bin/gnome-shell 164MiB |
| 0 N/A N/A 4120 G .../7182/usr/lib/firefox/firefox 514MiB |
| 0 N/A N/A 7286 G /usr/bin/gnome-system-monitor 34MiB |
| 0 N/A N/A 23191 G /usr/bin/gnome-control-center 39MiB |
| 0 N/A N/A 31966 C python 56403MiB |
+-----------------------------------------------------------------------------------------+
michael@spark-7d19:~/wse_github/ObrienlabsDev$ nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
pid, process_name, used_gpu_memory [MiB]
31966, python, 56403 MiB
```
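The `--format=csv` output above is easy to post-process for logging or baseline comparisons. A minimal sketch, parsing the captured output with the stdlib `csv` module (the string literal is pasted from the output above, so no GPU is needed to run it):

```python
import csv
import io

# The nvidia-smi --query-compute-apps CSV output captured above.
raw = """pid, process_name, used_gpu_memory [MiB]
31966, python, 56403 MiB"""

# skipinitialspace handles the space after each comma in nvidia-smi's CSV.
rows = list(csv.DictReader(io.StringIO(raw), skipinitialspace=True))
for row in rows:
    mib = int(row["used_gpu_memory [MiB]"].split()[0])
    print(f"PID {row['pid']}: {row['process_name']} using {mib} MiB")
```

In a live setup the same parse can be fed from `subprocess.run(["nvidia-smi", ...], capture_output=True)`; here the string literal keeps the sketch self-contained.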
Install VSCode
```
curl -sSL https://packages.microsoft.com/keys/microsoft.asc | gpg --dearmor > microsoft.gpg
sudo install -o root -g root -m 644 microsoft.gpg /etc/apt/keyrings/microsoft.gpg
sudo sh -c 'echo "deb [arch=arm64 signed-by=/etc/apt/keyrings/microsoft.gpg] https://packages.microsoft.com/repos/vscode stable main" > /etc/apt/sources.list.d/vscode.list'
sudo apt update
sudo apt install code
code
```
GPT-OSS:20b tokens/sec for various Apple M-series and NVIDIA Ampere, Ada and Grace Blackwell GPUs
20260120 - ObrienlabsDev/blog#160