
Performance baseline for NVIDIA DGX Spark GB10 (128 GB unified memory) #45

@obriensystems

Description


Java 25 - see #10


NVIDIA GB10 Arm DGX Spark

nvidia-smi

michael@spark-7d19:~$ nvidia-smi
Sun Nov  9 09:26:12 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0  On |                  N/A |
| N/A   45C    P0             11W /  N/A  | Not Supported          |     10%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            4174      G   /usr/lib/xorg/Xorg                       84MiB |
|    0   N/A  N/A            4399      G   /usr/bin/gnome-shell                    125MiB |
|    0   N/A  N/A            4761      G   .../7182/usr/lib/firefox/firefox        273MiB |
+-----------------------------------------------------------------------------------------+

michael@spark-7d19:~$ free
               total        used        free      shared  buff/cache   available
Mem:       125513720     5088412   118621120       20596     2728844   120425308
Swap:       16777212           0    16777212
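As a quick sanity check (a sketch using the values hardcoded from the `free` output above), the reported total in KiB corresponds to roughly the marketed 128 GB of unified memory:

```python
# Total memory copied from the `free` output above (KiB).
total_kib = 125_513_720

total_gib = total_kib / 2**20        # binary GiB as tools like htop report it
total_gb = total_kib * 1024 / 1e9    # decimal GB, as marketed

print(f"{total_gib:.1f} GiB ({total_gb:.1f} GB)")  # ≈ 119.7 GiB (128.5 GB)
```

The small shortfall from a round 128 GB reflects the KiB-vs-GB unit difference plus memory reserved by firmware and the kernel.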
michael@spark-7d19:~$ lscpu
Architecture:             aarch64
  CPU op-mode(s):         64-bit
  Byte Order:             Little Endian
CPU(s):                   20
  On-line CPU(s) list:    0-19
Vendor ID:                ARM
  Model name:             Cortex-X925
    Model:                1
    Thread(s) per core:   1
    Core(s) per socket:   10
    Socket(s):            1
    Stepping:             r0p1
    CPU(s) scaling MHz:   93%
    CPU max MHz:          4004.0000
    CPU min MHz:          1378.0000
    BogoMIPS:             2000.00
    Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti ecv afp wfxt
  Model name:             Cortex-A725
    Model:                1
    Thread(s) per core:   1
    Core(s) per socket:   10
    Socket(s):            1
    Stepping:             r0p1
    CPU(s) scaling MHz:   112%
    CPU max MHz:          2860.0000
    CPU min MHz:          338.0000
    BogoMIPS:             2000.00
    Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti ecv afp wfxt
Caches (sum of all):      
  L1d:                    1.3 MiB (20 instances)
  L1i:                    1.3 MiB (20 instances)
  L2:                     25 MiB (20 instances)
  L3:                     24 MiB (2 instances)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-19
Vulnerabilities:          
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; __user pointer sanitization
  Spectre v2:             Not affected
  Srbds:                  Not affected
  Tsx async abort:        Not affected
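The cache totals above are consistent with a uniform 64 KiB of L1 per core across the 20 cores (a sketch; the per-core size is an assumption inferred from the lscpu sums, not reported directly):

```python
cores = 10 + 10          # 10x Cortex-X925 + 10x Cortex-A725 (one cluster each)
l1d_per_core_kib = 64    # assumption: uniform 64 KiB L1d per core

l1d_total_mib = cores * l1d_per_core_kib / 1024
# 20 * 64 KiB = 1.25 MiB, which lscpu rounds to the reported "1.3 MiB (20 instances)"
print(f"L1d total: {l1d_total_mib} MiB")
```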

Java 8 (to be updated to 25)

michael@spark-7d19:~$ java -version
openjdk version "1.8.0_462"
OpenJDK Runtime Environment (build 1.8.0_462-8u462-ga~us1-0ubuntu2~24.04.2-b08)
OpenJDK 64-Bit Server VM (build 25.462-b08, mixed mode)

NVIDIA PyTorch Training demo

55 GB shared VRAM, 6144 CUDA cores; 95% GPU utilization, ~80 °C, 102 s at 0.62 steps/s

https://build.nvidia.com/spark/pytorch-fine-tune/instructions
https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/pytorch-fine-tune/assets/Llama3_8B_LoRA_finetuning.py

michael@spark-7d19:~/wse_github/ObrienlabsDev$ docker run --gpus all -it --rm --name pytorch --ipc=host -v $HOME/.cache/huggingface:/root/.cache/huggingface -v ${PWD}:/workspace -w /workspace nvcr.io/nvidia/pytorch:25.09-py3

root@f195eff1f920:/workspace/dgx-spark-playbooks/nvidia/pytorch-fine-tune/assets# pip install transformers peft datasets "trl==0.19.1" "bitsandbytes==0.48"
root@f195eff1f920:/workspace/dgx-spark-playbooks/nvidia/pytorch-fine-tune/assets# huggingface-cli login
root@f195eff1f920:/workspace/dgx-spark-playbooks/nvidia/pytorch-fine-tune/assets# git clone https://github.com/NVIDIA/dgx-spark-playbooks 
root@f195eff1f920:/workspace/dgx-spark-playbooks/nvidia/pytorch-fine-tune/assets# cd dgx-spark-playbooks/nvidia/pytorch-fine-tune/assets
root@f195eff1f920:/workspace/dgx-spark-playbooks/nvidia/pytorch-fine-tune/assets# python Llama3_3B_full_finetuning.py
LLAMA 3.2 3B FULL FINE-TUNING CONFIGURATION
============================================================
Model: meta-llama/Llama-3.2-3B-Instruct
Training mode: Full SFT 
Batch size: 8
Gradient accumulation: 1
Effective batch size: 8
Sequence length: 2048
Number of epochs: 1
Learning rate: 5e-05
Dataset size: 500
Gradient checkpointing: False
Torch compile: False
============================================================

Loading model: meta-llama/Llama-3.2-3B-Instruct
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████| 2/2 [00:38<00:00, 19.16s/it]
Total parameters: 3,212,749,824
Trainable parameters: 3,212,749,824 (100% - Full Fine-tuning)
Loading dataset with 500 samples...

Starting full fine-tuning for 1 epoch(s)...
The model is already on multiple devices. Skipping the move to device specified in `args`.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 128009, 'pad_token_id': 128009}.
{'loss': 2.6122, 'grad_norm': 34.25, 'learning_rate': 5e-05, 'num_tokens': 607.0, 'mean_token_accuracy': 0.4223706126213074, 'epoch': 0.02}
...
{'loss': 1.0151, 'grad_norm': 6.96875, 'learning_rate': 7.936507936507937e-07, 'num_tokens': 52435.0, 'mean_token_accuracy': 0.7538802623748779, 'epoch': 1.0}
{'train_runtime': 101.8642, 'train_samples_per_second': 4.908, 'train_steps_per_second': 0.618, 'train_loss': 1.0988628211475553, 'epoch': 1.0}
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 63/63 [01:41<00:00,  1.62s/it]

============================================================
TRAINING COMPLETED
============================================================
Training runtime: 101.86 seconds
Samples per second: 4.91
Steps per second: 0.62
Train loss: 1.0989
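The reported throughput lines up with the progress bar (a quick check using only numbers from the log above):

```python
steps = 63             # total optimizer steps shown by the progress bar
runtime_s = 101.8642   # train_runtime from the log
samples = 500          # dataset size
tokens = 52_435        # num_tokens at epoch 1.0

steps_per_s = steps / runtime_s        # ≈ 0.618, matches train_steps_per_second
samples_per_s = samples / runtime_s    # ≈ 4.91, matches train_samples_per_second
tokens_per_s = tokens / runtime_s      # ≈ 515 training tokens/s
```

Note that 500 samples / 8 per batch rounds up to 63 steps, so the per-step and per-sample rates are mutually consistent.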

michael@spark-7d19:~/wse_github/ObrienlabsDev$ nvidia-smi
Mon Nov 10 14:51:55 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0  On |                  N/A |
| N/A   77C    P0             76W /  N/A  | Not Supported          |     95%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            3188      G   /usr/lib/xorg/Xorg                       98MiB |
|    0   N/A  N/A            3361      G   /usr/bin/gnome-shell                    164MiB |
|    0   N/A  N/A            4120      G   .../7182/usr/lib/firefox/firefox        514MiB |
|    0   N/A  N/A            7286      G   /usr/bin/gnome-system-monitor            34MiB |
|    0   N/A  N/A           23191      G   /usr/bin/gnome-control-center            39MiB |
|    0   N/A  N/A           31966      C   python                                56403MiB |
+-----------------------------------------------------------------------------------------+
michael@spark-7d19:~/wse_github/ObrienlabsDev$ nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
pid, process_name, used_gpu_memory [MiB]
31966, python, 56403 MiB
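The ~56 GB resident during training is in the right range for full fine-tuning of a 3.2B-parameter model. A rough sketch, assuming bf16 weights and gradients plus fp32 AdamW moments and an fp32 master copy (16 bytes/param; the exact optimizer layout is an assumption, not confirmed by the log):

```python
params = 3_212_749_824                 # total parameters from the training log
bytes_per_param = 2 + 2 + 4 + 4 + 4    # bf16 weights + bf16 grads + fp32 m, v, master

state_gib = params * bytes_per_param / 2**30   # ≈ 47.9 GiB of persistent state
observed_gib = 56_403 / 1024                   # nvidia-smi reported 56403 MiB ≈ 55.1 GiB
activations_gib = observed_gib - state_gib     # ≈ 7 GiB for activations, CUDA context, allocator overhead
```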


Install VSCode

curl -sSL https://packages.microsoft.com/keys/microsoft.asc | gpg --dearmor > microsoft.gpg
sudo install -o root -g root -m 644 microsoft.gpg /etc/apt/keyrings/microsoft.gpg
sudo sh -c 'echo "deb [arch=arm64 signed-by=/etc/apt/keyrings/microsoft.gpg] https://packages.microsoft.com/repos/vscode stable main" > /etc/apt/sources.list.d/vscode.list'
sudo apt update
sudo apt install code
code

GPT-OSS:20b tokens/sec for various Apple M-series and NVIDIA Ampere, Ada, and Grace Blackwell GPUs

20260120 - ObrienlabsDev/blog#160
