I'm running the server like so:
docker run -d --name dia-tts-server -p 8003:8003 -v ./model_cache:/app/model_cache -v ./reference_audio:/app/reference_audio -v ./outputs:/app/outputs -v ./voices:/app/voices --gpus all ghcr.io/devnen/dia-tts-server:latest
I am using the following test request:
curl -X POST "http://localhost:8003/v1/audio/speech" \
-H "Content-Type: application/json" \
-d '{
"input": "Hello, this is a test of the Dia text to speech system.",
"voice": "dialogue",
"response_format": "wav",
"speed": 1.0,
"seed": 42
}' \
--output output.wav
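For comparing the two machines it can help to time the same request from a short script rather than a single curl call. This sketch mirrors the payload and endpoint from the curl command above; the `requests` dependency and the timing helper are my additions, not part of the server:

```python
import time

# Payload mirrors the curl request above.
payload = {
    "input": "Hello, this is a test of the Dia text to speech system.",
    "voice": "dialogue",
    "response_format": "wav",
    "speed": 1.0,
    "seed": 42,
}

def time_speech_request(url="http://localhost:8003/v1/audio/speech"):
    """Return (wall-clock seconds, response size in bytes) for one request."""
    import requests  # third-party: pip install requests
    t0 = time.perf_counter()
    r = requests.post(url, json=payload, timeout=120)
    r.raise_for_status()
    return time.perf_counter() - t0, len(r.content)
```

Running this a few times per machine and discarding the first warm-up call gives a fairer end-to-end comparison than one-off curl timings.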
On an RTX 3090 I get the following speeds:
2025-05-21 05:33:46,389 [INFO] server: Received OpenAI request: voice='dialogue', speed=1.0, format='wav', seed=42
2025-05-21 05:33:46,389 [INFO] utils: Found 0 predefined voices in /app/voices
2025-05-21 05:33:46,437 [INFO] engine: Generating speech (simple method) with params: {'mode': 'dialogue', 'seed': 42, 'split': False, 'chunk_size': 120, 'max_tokens': 'ModelDefault', 'cfg': 3.0, 'temp': 1.3, 'top_p': 0.95, 'top_k': 35, 'speed': 1.0, 'clone_ref': 'N/A', 'transcript_provided': False, 'text_snippet': "'Hello, this is a test of the Dia text to speech system....'"}
2025-05-21 05:33:46,437 [INFO] engine: Using generation seed: 42
2025-05-21 05:33:46,437 [INFO] engine: Text splitting disabled. Processing text as a single chunk.
2025-05-21 05:33:46,437 [INFO] engine: Starting generation loop for 1 chunks using model.generate() per chunk.
2025-05-21 05:33:46,437 [INFO] engine: Processing chunk 1/1 with model.generate()...
Using seed: 42 for generation
generate: data loaded
generate: starting generation loop
generate step 86: speed=88.590 tokens/s, realtime factor=1.030x
generate step 172: speed=92.708 tokens/s, realtime factor=1.078x
generate step 258: speed=92.104 tokens/s, realtime factor=1.071x
2025-05-21 05:33:49,524 [INFO] engine: Chunk 1 generated successfully in 3.09s. Audio shape: (130560,)
2025-05-21 05:33:49,566 [INFO] engine: Concatenated audio shape (simple method): (130560,)
2025-05-21 05:33:49,566 [INFO] engine: Speed factor is 1.0, no speed adjustment needed.
2025-05-21 05:33:49,566 [INFO] engine: Applying final audio post-processing...
2025-05-21 05:33:49,572 [INFO] engine: → No significant changes from final audio post-processing
2025-05-21 05:33:49,572 [INFO] engine: Final audio ready (simple method). Shape: (130560,), dtype: float32
2025-05-21 05:33:49,615 [INFO] utils: Encoded 261164 bytes to wav in 0.000 seconds.
2025-05-21 05:33:49,615 [INFO] server: Successfully generated 261164 bytes in format wav
$ nvidia-smi
Tue May 20 22:38:33 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20 Driver Version: 570.133.20 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:01:00.0 On | N/A |
| 0% 30C P8 25W / 390W | 4254MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1480 G /usr/lib/xorg/Xorg 175MiB |
| 0 N/A N/A 1734 G /usr/bin/gnome-shell 10MiB |
| 0 N/A N/A 29368 C python3 4038MiB |
+-----------------------------------------------------------------------------------------+
On an H100 I get the following, slower speeds:
2025-05-21 05:33:59,450 [INFO] server: Received OpenAI request: voice='dialogue', speed=1.0, format='wav', seed=42
2025-05-21 05:33:59,452 [INFO] utils: Found 43 predefined voices in /app/voices
2025-05-21 05:33:59,557 [INFO] engine: Generating speech (simple method) with params: {'mode': 'dialogue', 'seed': 42, 'split': False, 'chunk_size': 120, 'max_tokens': 'ModelDefault', 'cfg': 3.0, 'temp': 1.3, 'top_p': 0.95, 'top_k': 35, 'speed': 1.0, 'clone_ref': 'N/A', 'transcript_provided': False, 'text_snippet': "'Hello, this is a test of the Dia text to speech system....'"}
2025-05-21 05:33:59,557 [INFO] engine: Using generation seed: 42
2025-05-21 05:33:59,558 [INFO] engine: Text splitting disabled. Processing text as a single chunk.
2025-05-21 05:33:59,558 [INFO] engine: Starting generation loop for 1 chunks using model.generate() per chunk.
2025-05-21 05:33:59,558 [INFO] engine: Processing chunk 1/1 with model.generate()...
Using seed: 42 for generation
generate: data loaded
generate: starting generation loop
generate step 86: speed=38.971 tokens/s, realtime factor=0.453x
generate step 172: speed=37.989 tokens/s, realtime factor=0.442x
generate step 258: speed=38.864 tokens/s, realtime factor=0.452x
2025-05-21 05:34:06,960 [INFO] engine: Chunk 1 generated successfully in 7.40s. Audio shape: (135168,)
2025-05-21 05:34:07,016 [INFO] engine: Concatenated audio shape (simple method): (135168,)
2025-05-21 05:34:07,016 [INFO] engine: Speed factor is 1.0, no speed adjustment needed.
2025-05-21 05:34:07,017 [INFO] engine: Applying final audio post-processing...
2025-05-21 05:34:07,033 [INFO] engine: → No significant changes from final audio post-processing
2025-05-21 05:34:07,034 [INFO] engine: Final audio ready (simple method). Shape: (135168,), dtype: float32
2025-05-21 05:34:07,090 [INFO] utils: Encoded 270380 bytes to wav in 0.001 seconds.
2025-05-21 05:34:07,090 [INFO] server: Successfully generated 270380 bytes in format wav
$ nvidia-smi
Wed May 21 05:37:48 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06 Driver Version: 570.124.06 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 PCIe On | 00000000:06:00.0 Off | 0 |
| N/A 45C P0 86W / 350W | 4306MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 730991 C python3 4296MiB |
+-----------------------------------------------------------------------------------------+
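Pulling the step-258 numbers and per-chunk wall-clock times out of the two logs above, the gap works out to roughly:

```python
# tokens/s and realtime factor at generation step 258, copied from the logs
rtx3090_tps, rtx3090_rtf = 92.104, 1.071
h100_tps, h100_rtf = 38.864, 0.452

# wall-clock seconds for the single chunk, copied from the logs
rtx3090_chunk_s, h100_chunk_s = 3.09, 7.40

speed_ratio = rtx3090_tps / h100_tps
chunk_ratio = h100_chunk_s / rtx3090_chunk_s
print(f"RTX 3090 is {speed_ratio:.2f}x faster per token "
      f"({rtx3090_rtf}x vs {h100_rtf}x realtime); "
      f"chunk took {chunk_ratio:.2f}x longer on the H100")
```

So the 3090 is consistently ~2.4x faster on this workload, both per token and end-to-end per chunk.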
Why is generation faster on the RTX 3090 than on the H100?