⚡ Build structured YouTube datasets for NLP, ML, sentiment analysis & RAG in minutes.
A Python tool for quickly fetching thousands of videos from a YouTube channel, along with structured transcripts and additional metadata. Export the data as CSV, TXT, or JSON.
- Installation
- Quick CLI Usage
- Docker Quick Start
- Features
- Basic Usage (Python API)
- Using Different Fetchers
- Retrieve Different Languages
- Fetching Only Manually Created Transcripts
- Exporting
- Other Methods
- Proxy Configuration
- Advanced HTTP Configuration (Optional)
- CLI (Advanced)
- Contributing
- Running Tests
- Related Projects
- License
- Contributors
Install from PyPI:
pip install ytfetcher
Fetch 50 video transcripts + metadata from a channel and save as JSON:
ytfetcher from_channel -c TheOffice -m 50 -f json
The recommended way to run or develop YTFetcher is using Docker to ensure a clean, stable environment without needing local Python or dependency management.
docker-compose build
Use docker-compose run to execute your desired command inside the container.
docker-compose run ytfetcher poetry run ytfetcher from_channel -c TheOffice -m 20 -f json
YTFetcher comes with a simple CLI so you can fetch data directly from your terminal.
ytfetcher -h
usage: ytfetcher [-h] {from_channel,from_video_ids} ...
Fetch YouTube transcripts for a channel
positional arguments:
{from_channel,from_video_ids}
from_channel Fetch data from channel handle with max_results.
from_playlist_id Fetch data from a specific playlist id.
from_video_ids Fetch data from your custom video ids.
options:
-h, --help show this help message and exit
- Fetch full transcripts from a YouTube channel.
- Get video metadata: title, description, thumbnails, published date.
- Async support for high performance.
- Export fetched data as txt, csv or json.
- CLI support.
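To illustrate why async fetching helps with large channels, here is a minimal sketch of bounded concurrent fetching with asyncio. It does not call ytfetcher: `fetch_one`, `fetch_all`, and the `limit` parameter are hypothetical stand-ins, and the network call is simulated with a short sleep.

```python
import asyncio


async def fetch_one(video_id: str) -> dict:
    # Hypothetical stand-in for a network request; sleeps instead of calling YouTube.
    await asyncio.sleep(0.01)
    return {"video_id": video_id, "text": f"transcript for {video_id}"}


async def fetch_all(video_ids: list[str], limit: int = 5) -> list[dict]:
    # The semaphore caps how many simulated requests run at the same time.
    semaphore = asyncio.Semaphore(limit)

    async def bounded(video_id: str) -> dict:
        async with semaphore:
            return await fetch_one(video_id)

    # gather() runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(bounded(v) for v in video_ids))


results = asyncio.run(fetch_all(["video1", "video2", "video3"]))
print([r["video_id"] for r in results])
```

Because the waits overlap, total time grows much more slowly than the number of videos, which is the general benefit async I/O offers for this kind of workload.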
Note: When specifying the channel, provide the exact channel handle without the @ symbol. Do not pass the channel URL or display name.
For example, use TheOffice instead of @TheOffice or https://www.youtube.com/c/TheOffice.
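If your input comes from users or a spreadsheet, a small pre-processing step can turn common variants into a bare handle. The helper below is a hypothetical sketch, not part of ytfetcher: `normalize_handle` is an invented name, and it assumes that only URLs of the form `https://www.youtube.com/@Handle` contain the handle itself.

```python
from urllib.parse import urlparse


def normalize_handle(value: str) -> str:
    """Return a bare channel handle such as 'TheOffice' (hypothetical helper)."""
    value = value.strip()
    if "://" in value:
        # Assumption: only /@Handle URLs carry the handle; /c/... and /channel/... do not.
        path = urlparse(value).path.strip("/")
        first = path.split("/")[0] if path else ""
        if not first.startswith("@"):
            raise ValueError(f"cannot derive a handle from URL: {value!r}")
        value = first
    handle = value.lstrip("@")
    if not handle or " " in handle:
        # Display names often contain spaces; handles do not.
        raise ValueError(f"not a valid channel handle: {value!r}")
    return handle


print(normalize_handle("@TheOffice"))
print(normalize_handle("https://www.youtube.com/@TheOffice"))
```

Rejecting ambiguous inputs instead of guessing keeps a wrong handle from silently fetching the wrong channel.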
Here’s how to get transcripts and metadata such as channel name, description, and published date from a single channel with the from_channel method:
from ytfetcher import YTFetcher
from ytfetcher.models.channel import ChannelData
import asyncio

fetcher = YTFetcher.from_channel(
    channel_handle="TheOffice",
    max_results=2
)

async def get_channel_data() -> list[ChannelData]:
    channel_data = await fetcher.fetch_youtube_data()
    return channel_data

if __name__ == '__main__':
    data = asyncio.run(get_channel_data())
    print(data)
This will return a list of ChannelData with metadata in DLSnippet objects:
[
    ChannelData(
        video_id='video1',
        transcripts=[
            Transcript(
                text="Hey there",
                start=0.0,
                duration=1.54
            ),
            Transcript(
                text="Happy coding!",
                start=1.56,
                duration=4.46
            )
        ],
        metadata=DLSnippet(
            video_id='video1',
            title='VideoTitle',
            description='VideoDescription',
            url='https://youtu.be/video1',
            duration=120,
            view_count=1000,
            thumbnails=[{'url': 'thumbnail_url'}]
        )
    ),
    # Other ChannelData objects...
]
ytfetcher also supports different fetchers, so you can fetch with a channel_handle, custom video_ids, or a playlist_id.
Here's how you can fetch bulk transcripts from a specific playlist_id using ytfetcher.
from ytfetcher import YTFetcher
import asyncio
fetcher = YTFetcher.from_playlist_id(
    playlist_id="playlistid1254"
)
# The rest is the same ...
Initialize ytfetcher with custom video IDs using the from_video_ids method:
from ytfetcher import YTFetcher
import asyncio
fetcher = YTFetcher.from_video_ids(
    video_ids=['video1', 'video2', 'video3']
)
# The rest is the same ...
You can use the languages param to retrieve your desired language (default: en).
fetcher = YTFetcher.from_video_ids(video_ids=video_ids, languages=["tr", "en"])
Here's a quick CLI command using the languages param:
ytfetcher from_channel -c TheOffice -m 50 -f csv --print --languages tr en
ytfetcher first tries to fetch the Turkish transcript. If it's not available, it falls back to English.
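The fallback order described above can be sketched as a simple priority lookup. This is only an illustration of the behavior, not ytfetcher's internal code; `pick_language` and its arguments are hypothetical names.

```python
from typing import Optional, Sequence


def pick_language(available: Sequence[str], preferred: Sequence[str]) -> Optional[str]:
    """Return the first preferred language code that is available, else None."""
    for code in preferred:
        if code in available:
            return code
    return None


# A video with English and German transcripts, requested as ["tr", "en"]:
print(pick_language(["en", "de"], ["tr", "en"]))  # en
```

The order of the list matters: `["tr", "en"]` prefers Turkish, while `["en", "tr"]` would pick English whenever both exist.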
ytfetcher can fetch only manually created transcripts from a channel, which gives you more precise transcripts.
fetcher = YTFetcher.from_channel(channel_handle="TEDx", manually_created=True)  # Set manually_created flag to True
You can also enable this feature with the --manually-created argument in the CLI.
ytfetcher from_channel -c TEDx -f csv --manually-created
Use the BaseExporter class to export ChannelData in csv, json, or txt:
import asyncio
from ytfetcher.services import JSONExporter  # Or import another exporter: TXTExporter, CSVExporter

# fetcher is the YTFetcher instance created earlier
channel_data = asyncio.run(fetcher.fetch_youtube_data())
exporter = JSONExporter(
    channel_data=channel_data,
    allowed_metadata_list=['title'],  # You can customize this
    timing=True,  # Include transcript start/duration
    filename='my_export',  # Base filename
    output_dir='./exports'  # Optional output directory
)
exporter.write()
When exporting from the CLI, you can also pass arguments to exclude timings and choose which metadata to keep.
ytfetcher from_channel -c TheOffice -m 20 -f json --no-timing --metadata title description
This command excludes timings from transcripts and keeps only title and description as metadata.
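To make the effect of these options concrete, here is a rough sketch that applies the same two choices to a plain dictionary. The record layout mirrors the fields shown in the sample output earlier, but the exact structure of the exported file is an assumption, and `shape_record` is a hypothetical helper rather than a ytfetcher function.

```python
import json
from typing import Any

# Sample record using the field names from the example output (layout assumed).
record: dict[str, Any] = {
    "video_id": "video1",
    "transcripts": [
        {"text": "Hey there", "start": 0.0, "duration": 1.54},
        {"text": "Happy coding!", "start": 1.56, "duration": 4.46},
    ],
    "metadata": {
        "title": "VideoTitle",
        "description": "VideoDescription",
        "view_count": 1000,
    },
}


def shape_record(rec: dict[str, Any], allowed_metadata: list[str], timing: bool) -> dict[str, Any]:
    # Drop start/duration when timing is off; keep only the allowed metadata keys.
    transcripts = [
        dict(seg) if timing else {"text": seg["text"]}
        for seg in rec["transcripts"]
    ]
    metadata = {k: v for k, v in rec["metadata"].items() if k in allowed_metadata}
    return {"video_id": rec["video_id"], "transcripts": transcripts, "metadata": metadata}


shaped = shape_record(record, ["title", "description"], timing=False)
print(json.dumps(shaped, indent=2))
```

Dropping timings and unused metadata can noticeably shrink the output when you only need text for tasks such as sentiment analysis.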
You can also fetch only transcript data or metadata with video IDs using fetch_transcripts and fetch_snippets.
fetcher = YTFetcher.from_channel(channel_handle="TheOffice", max_results=2)

async def get_transcript_data():
    return await fetcher.fetch_transcripts()

data = asyncio.run(get_transcript_data())
print(data)
async def get_snippets():
    return await fetcher.fetch_snippets()

data = asyncio.run(get_snippets())
print(data)
YTFetcher supports proxy usage for fetching YouTube transcripts:
from ytfetcher import YTFetcher
from ytfetcher.config import GenericProxyConfig, WebshareProxyConfig
fetcher = YTFetcher.from_channel(
    channel_handle="TheOffice",
    max_results=3,
    proxy_config=GenericProxyConfig()  # or WebshareProxyConfig()
)
YTFetcher already uses custom headers to mimic real browser behavior, but if you want to change them, you can use a custom HTTPConfig class.
from ytfetcher import YTFetcher
from ytfetcher.config import HTTPConfig
custom_config = HTTPConfig(
    timeout=4.0,
    headers={"User-Agent": "ytfetcher/1.0"}
)
fetcher = YTFetcher.from_channel(
    channel_handle="TheOffice",
    max_results=10,
    http_config=custom_config
)
ytfetcher from_channel -c <CHANNEL_HANDLE> -m <MAX_RESULTS> -f <FORMAT>
ytfetcher from_video_ids -v video_id1 video_id2 ... -f json
ytfetcher from_playlist_id -p playlistid123 -f csv -m 25
ytfetcher from_channel -c <CHANNEL_HANDLE> -f json --webshare-proxy-username "<USERNAME>" --webshare-proxy-password "<PASSWORD>"
ytfetcher from_channel -c <CHANNEL_HANDLE> -f json --http-proxy "http://user:pass@host:port" --https-proxy "https://user:pass@host:port"
ytfetcher from_channel -c <CHANNEL_HANDLE> --http-timeout 4.2 --http-headers "{'key': 'value'}"
git clone https://github.com/kaya70875/ytfetcher.git
cd ytfetcher
poetry install
poetry run pytest
All type checks must pass before you contribute to ytfetcher.
poetry run mypy ytfetcher
This project is licensed under the MIT License. See the LICENSE file for details.
Thanks to everyone who has contributed to ytfetcher ❤️
⭐ If you find this useful, please star the repo or open an issue with feedback!