Commit dc30520

Merge pull request #444 from Huanshere/refactor

Refactor

2 parents b4fed7c + 7aac71e · commit dc30520
90 files changed: +1559 −1611 lines

Note: large commits hide some content by default, so a few file names below are not shown.

.cursorrules

Lines changed: 7 additions & 0 deletions

@@ -0,0 +1,7 @@
+2. Use
+# ------------
+# comment
+# ------------
+for large block comments
+3. Avoid complex comments inside functions, and do not add type annotations to function parameters
+4. Use English for comments and print statements
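For reference, the block-comment convention rule 2 describes looks like this in Python (example mine, not from the repo):

# ------------
# load settings, then start the pipeline
# ------------
print("starting")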

Dockerfile

Lines changed: 1 addition & 1 deletion

@@ -40,7 +40,7 @@ RUN pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

 # Install dependencies
 COPY requirements.txt .
-RUN pip install --no-cache-dir -r requirements.txt
+RUN pip install -e .

 # Set CUDA-related environment variables
 ENV CUDA_HOME=/usr/local/cuda
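The switch from installing `requirements.txt` to `pip install -e .` installs the repo itself as an editable package (this assumes the commit also adds a `setup.py` or `pyproject.toml`, which is not visible in this excerpt). That is what lets the batch/ and core/ files further down drop their `sys.path.append` hacks in favor of plain package imports, e.g.:

# With the project installed via `pip install -e .`, entry points can
# import project packages directly -- no path manipulation required.
# (Illustrative sketch; mirrors the import style the refactor moves to.)
from core.utils.config_utils import load_key  # was: sys.path.append(...) first

print(load_key("api.model"))  # reads config.yaml, e.g. 'gpt-4.1-2025-04-14'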

README.md

Lines changed: 5 additions & 5 deletions

@@ -77,7 +77,7 @@ https://github.com/user-attachments/assets/47d965b2-b4ab-4a0b-9d08-b49a7bf3508c

 ## Installation

-You don't have to read the whole docs, [**here**](https://share.fastgpt.in/chat/share?shareId=066w11n3r9aq6879r4z0v9rh) is an online AI agent to help you.
+Meet any problem? Chat with our free online AI agent [**here**](https://share.fastgpt.in/chat/share?shareId=066w11n3r9aq6879r4z0v9rh) to help you.

 > **Note:** For Windows users with NVIDIA GPU, follow these steps before installation:
 > 1. Install [CUDA Toolkit 12.6](https://developer.download.nvidia.com/compute/cuda/12.6.0/local_installers/cuda_12.6.0_560.76_windows.exe)

@@ -121,8 +121,8 @@ docker run -d -p 8501:8501 --gpus all videolingo

 ## APIs
 VideoLingo supports OpenAI-Like API format and various TTS interfaces:
-- LLM: `claude-3-5-sonnet-20240620`, `deepseek-chat(v3)`, `gemini-2.0-flash-exp`, `gpt-4o`, ... (sorted by performance)
-- WhisperX: Run whisperX locally or use 302.ai API
+- LLM: `claude-3-5-sonnet`, `gpt-4.1`, `deepseek-v3`, `gemini-2.0-flash`, ... (sorted by performance, be cautious with gemini-2.5-flash...)
+- WhisperX: Run whisperX (large-v3) locally or use 302.ai API
 - TTS: `azure-tts`, `openai-tts`, `siliconflow-fishtts`, **`fish-tts`**, `GPT-SoVITS`, `edge-tts`, `*custom-tts`(You can modify your own TTS in custom_tts.py!)

 > **Note:** VideoLingo works with **[302.ai](https://gpt302.saaslink.net/C2oHR9)** - one API key for all services (LLM, WhisperX, TTS). Or run locally with Ollama and Edge-TTS for free, no API needed!
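Since the README names an OpenAI-Like API format, here is a minimal sketch of that call pattern (my illustration, not project code; it assumes the `openai` Python client, with placeholder credentials and the model from config.yaml below):

from openai import OpenAI

# Placeholder key; base_url/model mirror the api: block in config.yaml.
# Whether the endpoint needs a trailing /v1 depends on the provider --
# treat that as an assumption.
client = OpenAI(api_key="your-api-key", base_url="https://yunwu.ai/v1")
resp = client.chat.completions.create(
    model="gpt-4.1-2025-04-14",
    messages=[{"role": "user", "content": "Reply with one word: ok"}],
)
print(resp.choices[0].message.content)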
@@ -133,13 +133,13 @@ For detailed installation, API configuration, and batch mode instructions, pleas

 1. WhisperX transcription performance may be affected by video background noise, as it uses wav2vac model for alignment. For videos with loud background music, please enable Voice Separation Enhancement. Additionally, subtitles ending with numbers or special characters may be truncated early due to wav2vac's inability to map numeric characters (e.g., "1") to their spoken form ("one").

-2. Using weaker models can lead to errors during intermediate processes due to strict JSON format requirements for responses. If this error occurs, please delete the `output` folder and retry with a different LLM, otherwise repeated execution will read the previous erroneous response causing the same error.
+2. Using weaker models can lead to errors during processes due to strict JSON format requirements for responses (tried my best to prompt llm😊). If this error occurs, please delete the `output` folder and retry with a different LLM, otherwise repeated execution will read the previous erroneous response causing the same error.

 3. The dubbing feature may not be 100% perfect due to differences in speech rates and intonation between languages, as well as the impact of the translation step. However, this project has implemented extensive engineering processing for speech rates to ensure the best possible dubbing results.

 4. **Multilingual video transcription recognition will only retain the main language**. This is because whisperX uses a specialized model for a single language when forcibly aligning word-level subtitles, and will delete unrecognized languages.

-5. **Cannot dub multiple characters separately**, as whisperX's speaker distinction capability is not sufficiently reliable.
+5. **For now, cannot dub multiple characters separately**, as whisperX's speaker distinction capability is not sufficiently reliable.
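Known issue 2 exists because intermediate LLM responses are cached under `output/`: a rerun reads the cached (possibly malformed) reply instead of calling the LLM again. A hedged sketch of that failure mode, with hypothetical paths and a stub in place of the real API call, not VideoLingo's actual code:

import json, os

CACHE = "output/llm_response.json"  # hypothetical cache location

def call_llm(prompt: str) -> str:
    return '{"translation": "..."}'  # stand-in for the real API call

def cached_llm_call(prompt: str) -> dict:
    if os.path.exists(CACHE):
        with open(CACHE, encoding="utf-8") as f:
            return json.load(f)  # a bad cached reply fails here on every rerun
    raw = call_llm(prompt)
    os.makedirs("output", exist_ok=True)
    with open(CACHE, "w", encoding="utf-8") as f:
        f.write(raw)
    return json.loads(raw)

Deleting `output/` clears the cache, which is why the README's recovery step works.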

 ## 📄 License

batch/utils/batch_processor.py

Lines changed: 2 additions & 3 deletions

@@ -1,9 +1,8 @@
-import os, sys
+import os
 import gc
-sys.path.append(os.path.join(os.path.dirname(__file__), '..', '..'))
 from batch.utils.settings_check import check_settings
 from batch.utils.video_processor import process_video
-from core.config_utils import load_key, update_key
+from core.utils.config_utils import load_key, update_key
 import pandas as pd
 from rich.console import Console
 from rich.panel import Panel

batch/utils/settings_check.py

Lines changed: 1 addition & 2 deletions

@@ -1,5 +1,4 @@
-import os, sys
-sys.path.append(os.path.join(os.path.dirname(__file__), '..', '..'))
+import os
 import pandas as pd
 from rich.console import Console
 from rich.panel import Panel

batch/utils/video_processor.py

Lines changed: 21 additions & 21 deletions

@@ -1,12 +1,12 @@
-import os, sys
-sys.path.append(os.path.join(os.path.dirname(__file__), '..', '..'))
-from st_components.imports_and_utils import *
-from core.onekeycleanup import cleanup
-from core.config_utils import load_key
+import os
+from core.st_utils.imports_and_utils import *
+from core.utils.onekeycleanup import cleanup
+from core.utils import load_key
 import shutil
 from functools import partial
 from rich.panel import Panel
 from rich.console import Console
+from core import *

 console = Console()

@@ -22,20 +22,20 @@ def process_video(file, dubbing=False, is_retry=False):

     text_steps = [
         ("🎥 Processing input file", partial(process_input_file, file)),
-        ("🎙️ Transcribing with Whisper", partial(step2_whisperX.transcribe)),
+        ("🎙️ Transcribing with Whisper", partial(_2_asr.transcribe)),
         ("✂️ Splitting sentences", split_sentences),
         ("📝 Summarizing and translating", summarize_and_translate),
         ("⚡ Processing and aligning subtitles", process_and_align_subtitles),
-        ("🎬 Merging subtitles to video", step7_merge_sub_to_vid.merge_subtitles_to_video),
+        ("🎬 Merging subtitles to video", _7_sub_into_vid.merge_subtitles_to_video),
     ]

     if dubbing:
         dubbing_steps = [
             ("🔊 Generating audio tasks", gen_audio_tasks),
-            ("🎵 Extracting reference audio", step9_extract_refer_audio.extract_refer_audio_main),
-            ("🗣️ Generating audio", step10_gen_audio.gen_audio),
-            ("🔄 Merging full audio", step11_merge_full_audio.merge_full_audio),
-            ("🎞️ Merging dubbing to video", step12_merge_dub_to_vid.merge_video_audio),
+            ("🎵 Extracting reference audio", _9_refer_audio.extract_refer_audio_main),
+            ("🗣️ Generating audio", _10_gen_audio.gen_audio),
+            ("🔄 Merging full audio", _11_merge_audio.merge_full_audio),
+            ("🎞️ Merging dubbing to video", _12_dub_to_vid.merge_video_audio),
         ]
         text_steps.extend(dubbing_steps)

@@ -78,8 +78,8 @@ def prepare_output_folder(output_folder):

 def process_input_file(file):
     if file.startswith('http'):
-        step1_ytdlp.download_video_ytdlp(file, resolution=load_key(YTB_RESOLUTION_KEY), cutoff_time=None)
-        video_file = step1_ytdlp.find_video_files()
+        _1_ytdlp.download_video_ytdlp(file, resolution=load_key(YTB_RESOLUTION_KEY))
+        video_file = _1_ytdlp.find_video_files()
     else:
         input_file = os.path.join('batch', 'input', file)
         output_file = os.path.join(OUTPUT_DIR, file)

@@ -88,17 +88,17 @@ def process_input_file(file):
     return {'video_file': video_file}

 def split_sentences():
-    step3_1_spacy_split.split_by_spacy()
-    step3_2_splitbymeaning.split_sentences_by_meaning()
+    _3_1_split_nlp.split_by_spacy()
+    _3_2_split_meaning.split_sentences_by_meaning()

 def summarize_and_translate():
-    step4_1_summarize.get_summary()
-    step4_2_translate_all.translate_all()
+    _4_1_summarize.get_summary()
+    _4_2_translate.translate_all()

 def process_and_align_subtitles():
-    step5_splitforsub.split_for_sub_main()
-    step6_generate_final_timeline.align_timestamp_main()
+    _5_split_sub.split_for_sub_main()
+    _6_gen_sub.align_timestamp_main()

 def gen_audio_tasks():
-    step8_1_gen_audio_task.gen_audio_task_main()
-    step8_2_gen_dub_chunks.gen_dub_chunks()
+    _8_1_audio_task.gen_audio_task_main()
+    _8_2_dub_chunks.gen_dub_chunks()
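The tables above pair a display label with a zero-argument callable (hence `partial(...)` for functions that need arguments). A minimal sketch of the runner loop such a table implies — the actual loop is outside this hunk, so this is an assumption:

from functools import partial

def run_steps(steps):
    # Each entry is (label, zero-arg callable); a real runner could also
    # catch and report errors per step.
    for label, step in steps:
        print(f"{label} ...")
        step()

demo_steps = [
    ("🎥 Processing input file", partial(print, "processing demo.mp4")),
    ("✂️ Splitting sentences", lambda: print("splitting")),
]
run_steps(demo_steps)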

config.yaml

Lines changed: 21 additions & 18 deletions

@@ -1,14 +1,22 @@
 # * Settings marked with * are advanced settings that won't appear in the Streamlit page and can only be modified manually in config.py
 # recommend to set in streamlit page
-version: "2.2.3"
+# -------------------
+# version: "3.0.0"
+# author: "Huanshere"
+# -------------------
+
 ## ======================== Basic Settings ======================== ##
+
 display_language: "zh-CN"

 # API settings
 api:
-  key: 'your_api_key'
-  base_url: 'https://api.302.ai'
-  model: 'gemini-2.0-flash'
+  key: 'your-api-key'
+  base_url: 'https://yunwu.ai'
+  model: 'gpt-4.1-2025-04-14'
+  llm_support_json: false
+  # *Number of LLM multi-threaded accesses, set to 1 if using local LLM
+  max_workers: 4

 # Language settings, written into the prompt, can be described in natural language
 target_language: '简体中文'

@@ -17,22 +25,25 @@ target_language: '简体中文'
 demucs: true

 whisper:
-  # ["medium", "large-v3", "large-v3-turbo"]. Note: for zh model will force to use Belle/large-v3
+  # ["large-v3", "large-v3-turbo"]. Note: for zh model will force to use Belle/large-v3
   model: 'large-v3'
-  # Whisper specified recognition language [en, zh, ...]
+  # Whisper specified recognition language ISO 639-1
   language: 'en'
   detected_language: 'en'
   # Whisper running mode ["local", "cloud", "elevenlabs"]. Specifies where to run, cloud uses 302.ai API
   runtime: 'local'
   # 302.ai API key
   whisperX_302_api_key: 'your_302_api_key'
-  # ElevenLabs API key
+  # ElevenLabs API key (experimental)
   elevenlabs_api_key: 'your_elevenlabs_api_key'

 # Whether to burn subtitles into the video
 burn_subtitles: true

 ## ======================== Advanced Settings ======================== ##
+# *🔬 h264_nvenc GPU acceleration for ffmpeg, make sure your GPU supports it
+ffmpeg_gpu: false
+
 # *Youtube settings
 youtube:
   cookies_path: ''

@@ -49,8 +60,6 @@ subtitle:
 # *Summary length, set low to 2k if using local LLM
 summary_length: 8000

-# *Number of LLM multi-threaded accesses, set to 1 if using local LLM
-max_workers: 4
 # *Maximum number of words for the first rough cut, below 18 will cut too finely affecting translation, above 22 is too long and will make subsequent subtitle splitting difficult to align
 max_split_length: 20

@@ -62,7 +71,7 @@ pause_before_translate: false

 ## ======================== Dubbing Settings ======================== ##
 # TTS selection [sf_fish_tts, openai_tts, gpt_sovits, azure_tts, fish_tts, edge_tts, custom_tts]
-tts_method: 'f5tts'
+tts_method: 'azure_tts'

 # SiliconFlow FishTTS
 sf_fish_tts:

@@ -125,7 +134,8 @@ tolerance: 1.5 # Allowed extension time to the next subtitle



-## ======================== Additional settings 请勿修改 (do not modify) ======================== ##
+## ======================== Additional settings ======================== ##
+
 # Whisper model directory
 model_dir: './_model_cache'

@@ -145,13 +155,6 @@ allowed_audio_formats:
 - 'flac'
 - 'm4a'

-# LLMs that support returning JSON format
-llm_support_json:
-- 'gpt-4o'
-- 'gpt-4o-mini'
-- 'gemini-2.0-flash'
-- 'deepseek-chat'
-
 # Spacy models
 spacy_model_map:
   en: 'en_core_web_md'
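Note that `llm_support_json` moves from a list of model names (removed at the bottom) to a single boolean under `api:`. A hedged sketch of how the flag could be consumed — the real call site is not in this diff:

from core.utils.config_utils import load_key  # project helper used elsewhere in this commit

# If the configured model supports JSON mode, request it in OpenAI style.
# response_format is the OpenAI-compatible kwarg; whether a given
# provider honors it is an assumption.
extra = {}
if load_key("api.llm_support_json"):
    extra["response_format"] = {"type": "json_object"}
# client.chat.completions.create(model=load_key("api.model"), messages=..., **extra)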
(file name hidden in the commit view — likely core/_10_gen_audio.py, since it defines the gen_audio entry point referenced above)

Lines changed: 10 additions & 16 deletions

@@ -1,30 +1,24 @@
 import os
-import sys
 import time
 import shutil
 import subprocess
 from typing import Tuple

 import pandas as pd
 from pydub import AudioSegment
-from rich import print as rprint
 from rich.console import Console
 from rich.progress import Progress
 from concurrent.futures import ThreadPoolExecutor, as_completed

-sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
-from core.config_utils import load_key
-from core.all_whisper_methods.audio_preprocess import get_audio_duration
-from core.all_tts_functions.tts_main import tts_main
+from core.utils import *
+from core.utils.models import *
+from core.asr_backend.audio_preprocess import get_audio_duration
+from core.tts_backend.tts_main import tts_main

 console = Console()

-TEMP_DIR = 'output/audio/tmp'
-SEGS_DIR = 'output/audio/segs'
-TASKS_FILE = "output/audio/tts_tasks.xlsx"
-OUTPUT_FILE = "output/audio/tts_tasks.xlsx"
-TEMP_FILE_TEMPLATE = f"{TEMP_DIR}/{{}}_temp.wav"
-OUTPUT_FILE_TEMPLATE = f"{SEGS_DIR}/{{}}.wav"
+TEMP_FILE_TEMPLATE = f"{_AUDIO_TMP_DIR}/{{}}_temp.wav"
+OUTPUT_FILE_TEMPLATE = f"{_AUDIO_SEGS_DIR}/{{}}.wav"
 WARMUP_SIZE = 5

 def parse_df_srt_time(time_str: str) -> float:

@@ -217,11 +211,11 @@ def gen_audio() -> None:
     rprint("[bold magenta]🚀 Starting audio generation process...[/bold magenta]")

     # 🎯 Step1: Create necessary directories
-    os.makedirs(TEMP_DIR, exist_ok=True)
-    os.makedirs(SEGS_DIR, exist_ok=True)
+    os.makedirs(_AUDIO_TMP_DIR, exist_ok=True)
+    os.makedirs(_AUDIO_SEGS_DIR, exist_ok=True)

     # 📝 Step2: Load task file
-    tasks_df = pd.read_excel(TASKS_FILE)
+    tasks_df = pd.read_excel(_8_1_AUDIO_TASK)
     rprint("[green]📊 Loaded task file successfully[/green]")

     # 🔊 Step3: Generate TTS audio

@@ -231,7 +225,7 @@ def gen_audio() -> None:
     tasks_df = merge_chunks(tasks_df)

     # 💾 Step5: Save results
-    tasks_df.to_excel(OUTPUT_FILE, index=False)
+    tasks_df.to_excel(_8_1_AUDIO_TASK, index=False)
     rprint("[bold green]🎉 Audio generation completed successfully![/bold green]")

 if __name__ == "__main__":
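The two path templates above rely on f-string brace escaping: `{_AUDIO_TMP_DIR}` is interpolated immediately, while `{{}}` survives as a literal `{}` to be filled per segment later with `.format()`. A quick self-contained illustration (the directory value here is made up; the real one comes from core.utils.models):

_AUDIO_SEGS_DIR = "output/audio/segs"  # illustrative value
OUTPUT_FILE_TEMPLATE = f"{_AUDIO_SEGS_DIR}/{{}}.wav"

print(OUTPUT_FILE_TEMPLATE)             # output/audio/segs/{}.wav
print(OUTPUT_FILE_TEMPLATE.format(42))  # output/audio/segs/42.wav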
(file name hidden in the commit view — likely core/_11_merge_audio.py, since it defines the merge_full_audio entry point referenced above)

Lines changed: 9 additions & 20 deletions

@@ -1,19 +1,17 @@
-import sys, os
-sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+import os
 import pandas as pd
 import subprocess
 from pydub import AudioSegment
-from rich import print as rprint
 from rich.progress import Progress, SpinnerColumn, TextColumn, BarColumn, TaskProgressColumn
 from rich.console import Console
+from core.utils import *
+from core.utils.models import *
 console = Console()

-INPUT_EXCEL = 'output/audio/tts_tasks.xlsx'
 DUB_VOCAL_FILE = 'output/dub.mp3'

 DUB_SUB_FILE = 'output/dub.srt'
-SEGS_DIR = 'output/audio/segs'
-OUTPUT_FILE_TEMPLATE = f"{SEGS_DIR}/{{}}.wav"
+OUTPUT_FILE_TEMPLATE = f"{_AUDIO_SEGS_DIR}/{{}}.wav"

 def load_and_flatten_data(excel_file):
     """Load and flatten Excel data"""

@@ -45,7 +43,7 @@ def process_audio_segment(audio_file):
         '-i', audio_file,
         '-ar', '16000',
         '-ac', '1',
-        '-b:a', '128k',
+        '-b:a', '64k',
         temp_file
     ]
     subprocess.run(ffmpeg_cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

@@ -56,12 +54,7 @@ def process_audio_segment(audio_file):
 def merge_audio_segments(audios, new_sub_times, sample_rate):
     merged_audio = AudioSegment.silent(duration=0, frame_rate=sample_rate)

-    with Progress(
-        SpinnerColumn(),
-        TextColumn("[progress.description]{task.description}"),
-        BarColumn(),
-        TaskProgressColumn(),
-    ) as progress:
+    with Progress(SpinnerColumn(), TextColumn("[progress.description]{task.description}"), BarColumn(), TaskProgressColumn()) as progress:
         merge_task = progress.add_task("🎵 Merging audio segments...", total=len(audios))

         for i, (audio_file, time_range) in enumerate(zip(audios, new_sub_times)):

@@ -90,7 +83,7 @@ def merge_audio_segments(audios, new_sub_times, sample_rate):
     return merged_audio

 def create_srt_subtitle():
-    df, lines, new_sub_times = load_and_flatten_data(INPUT_EXCEL)
+    df, lines, new_sub_times = load_and_flatten_data(_8_1_AUDIO_TASK)

     with open(DUB_SUB_FILE, 'w', encoding='utf-8') as f:
         for i, ((start_time, end_time), line) in enumerate(zip(new_sub_times, lines), 1):

@@ -108,7 +101,7 @@ def merge_full_audio():
     console.print("\n[bold cyan]🎬 Starting audio merging process...[/bold cyan]")

     with console.status("[bold cyan]📊 Loading data from Excel...[/bold cyan]"):
-        df, lines, new_sub_times = load_and_flatten_data(INPUT_EXCEL)
+        df, lines, new_sub_times = load_and_flatten_data(_8_1_AUDIO_TASK)
     console.print("[bold green]✅ Data loaded successfully[/bold green]")

     with console.status("[bold cyan]🔍 Getting audio file list...[/bold cyan]"):

@@ -130,11 +123,7 @@ def merge_full_audio():
     with console.status("[bold cyan]💾 Exporting final audio file...[/bold cyan]"):
         merged_audio = merged_audio.set_frame_rate(16000).set_channels(1)
-        merged_audio.export(
-            DUB_VOCAL_FILE,
-            format="mp3",
-            parameters=["-b:a", "64k"]
-        )
+        merged_audio.export(DUB_VOCAL_FILE, format="mp3", parameters=["-b:a", "64k"])
     console.print(f"[bold green]✅ Audio file successfully merged![/bold green]")
     console.print(f"[bold green]📁 Output file: {DUB_VOCAL_FILE}[/bold green]")