# add openvid dataset files (#4)
Merged
Commits (7):

- 205b35d: add openvid dataset files (AyushExel)
- a092251: Update HF_DATASET_CARD.md (AyushExel)
- 70a1825: update dataset card (AyushExel)
- 4e7bef1: update (AyushExel)
- 861adcc: update example with av seek (AyushExel)
- 97d7fb1: update uv lock (AyushExel)
- bb553e2: add lance tag (AyushExel)
---
license: cc-by-4.0
task_categories:
- text-to-video
- video-classification
language:
- en
tags:
- text-to-video
- video-search
- lance
pretty_name: openvid-lance
size_categories:
- 100K<n<1M
---
# OpenVid Dataset (Lance Format)

Lance format version of the [OpenVid dataset](https://huggingface.co/datasets/nkp37/OpenVid-1M) with **937,957 high-quality videos** stored as inline video blobs, together with embeddings and rich metadata.

![sample](sample.png)

**Key Features:**

- **Videos stored inline as blobs** - No external files to manage
- **Efficient column access** - Load metadata without touching video data
- **Prebuilt indices available** - IVF_PQ index for similarity search, FTS index on captions
- **Fast random access** - Read any video instantly by index
- **HuggingFace integration** - Load directly from the Hub in streaming mode

## Lance Blob API

Lance stores videos as **inline blobs** - binary data embedded directly in the dataset. This provides:

- **Single source of truth** - Videos and metadata together in one dataset
- **Lazy loading** - Videos only loaded when you explicitly request them
- **Efficient storage** - Optimized encoding for large binary data
- **Transactional consistency** - Query and retrieve in one atomic operation

```python
import lance

ds = lance.dataset("hf://datasets/lance-format/openvid-lance")

# 1. Browse metadata without loading video data
metadata = ds.scanner(
    columns=["caption", "aesthetic_score"],  # No video_blob column!
    filter="aesthetic_score >= 4.5",
    limit=10
).to_table().to_pylist()

# 2. User selects video to watch
selected_index = 3

# 3. Load only that video blob
blob_file = ds.take_blobs("video_blob", ids=[selected_index])[0]
video_bytes = blob_file.read()

# 4. Save to disk
with open("video.mp4", "wb") as f:
    f.write(video_bytes)
```

## Quick Start

```python
import lance

# Load dataset from HuggingFace
ds = lance.dataset("hf://datasets/lance-format/openvid-lance")
print(f"Total videos: {ds.count_rows():,}")
```

> **⚠️ HuggingFace Streaming Note**
>
> When streaming from HuggingFace (as shown above), some operations use minimal parameters to avoid rate limits:
> - `nprobes=1` for vector search (the lowest value)
> - Column selection to reduce I/O
>
> **You may still hit rate limits on HuggingFace's free tier.** For best performance, **download the dataset locally**:
>
> ```bash
> # Download once
> huggingface-cli download lance-format/openvid-lance --repo-type dataset --local-dir ./openvid
> ```
>
> Then load it from disk with `ds = lance.dataset("./openvid")`.
>
> Streaming is recommended only for quick exploration and testing.

## Dataset Schema

Each row contains:
- `video_blob` - Video file as binary blob (inline storage)
- `caption` - Text description of the video
- `embedding` - 1024-dim vector embedding
- `aesthetic_score` - Visual quality score (0-5+)
- `motion_score` - Amount of motion (0-1)
- `temporal_consistency_score` - Frame consistency (0-1)
- `camera_motion` - Camera movement type (pan, zoom, static, etc.)
- `fps`, `seconds`, `frame` - Video properties
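
For a quick check of these columns and their types, you can print the dataset's Arrow schema and fetch a single metadata row. This is a minimal sketch, assuming the dataset is opened as `ds` exactly as in the Quick Start above.

```python
import lance

# Open the dataset (a local path also works, as described in the streaming note)
ds = lance.dataset("hf://datasets/lance-format/openvid-lance")

# Arrow schema: column names and types, including the blob-encoded video column
print(ds.schema)

# One row of lightweight metadata, without reading any video bytes
row = ds.take([0], columns=["caption", "fps", "seconds", "aesthetic_score"]).to_pylist()[0]
print(row["caption"], row["fps"], row["seconds"], row["aesthetic_score"])
```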

## Usage Examples

### 1. Browse Metadata (Fast - No Video Loading)

```python
# Load only metadata without heavy video blobs
scanner = ds.scanner(
    columns=["caption", "aesthetic_score", "motion_score"],
    limit=10
)
videos = scanner.to_table().to_pylist()

for video in videos:
    print(f"{video['caption']} - Quality: {video['aesthetic_score']:.2f}")
```

### 2. Export Videos from Blobs

```python
# Load specific videos by index
indices = [0, 100, 500]
blob_files = ds.take_blobs("video_blob", ids=indices)

# Save to disk
for i, blob_file in enumerate(blob_files):
    with open(f"video_{i}.mp4", "wb") as f:
        f.write(blob_file.read())
```

### 3. Open Inline Videos with PyAV and Seek Directly on the Blob File

```python
import av

selected_index = 123
blob_file = ds.take_blobs("video_blob", ids=[selected_index])[0]

with av.open(blob_file) as container:
    stream = container.streams.video[0]

    for seconds in (0.0, 1.0, 2.5):
        # Convert the target time to the stream's time base and seek to it
        target_pts = int(seconds / stream.time_base)
        container.seek(target_pts, stream=stream)

        # Decode forward until we reach (or pass) the requested timestamp
        frame = None
        for candidate in container.decode(stream):
            if candidate.time is None:
                continue
            frame = candidate
            if frame.time >= seconds:
                break

        print(
            f"Seek {seconds:.1f}s -> {frame.width}x{frame.height} "
            f"(pts={frame.pts}, time={frame.time:.2f}s)"
        )
```

### 4. Vector Similarity Search

```python
import pyarrow as pa

# Find similar videos
ref_video = ds.take([0], columns=["embedding"]).to_pylist()[0]
query_vector = pa.array([ref_video['embedding']], type=pa.list_(pa.float32(), 1024))

results = ds.scanner(
    nearest={
        "column": "embedding",
        "q": query_vector[0],
        "k": 5,
        "nprobes": 1,
        "refine_factor": 1
    }
).to_table().to_pylist()

for video in results[1:]:  # Skip first (query itself)
    print(video['caption'])
```
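
The `nprobes=1` and `refine_factor=1` settings above match the rate-limit-friendly values from the streaming note; when querying a locally downloaded copy, raising `nprobes` (and optionally `refine_factor`) scans more IVF_PQ partitions and improves recall at the cost of some extra latency.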

### 5. Full-Text Search

```python
# Search captions using FTS index
results = ds.scanner(
    full_text_query="sunset beach",
    columns=["caption", "aesthetic_score"],
    limit=10,
    fast_search=True
).to_table().to_pylist()

for video in results:
    print(f"{video['caption']} - {video['aesthetic_score']:.2f}")
```

### 6. Filter by Quality

```python
# Get high-quality videos
high_quality = ds.scanner(
    filter="aesthetic_score >= 4.5 AND motion_score >= 0.3",
    columns=["caption", "aesthetic_score", "camera_motion"],
    limit=20
).to_table().to_pylist()
```

## Dataset Statistics

- **Total videos**: 937,957
- **Embedding dimension**: 1024
- **Video formats**: MP4 (H.264)
- **Index types**: IVF_PQ (vector), FTS
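
If you have the dataset open, these figures can be spot-checked directly; a minimal sketch, assuming `ds` is the opened dataset as above:

```python
# Row count should match the total video count above
print(f"rows: {ds.count_rows():,}")

# Embedding dimension is encoded in the fixed-size list type of the embedding column
print("embedding type:", ds.schema.field("embedding").type)

# Prebuilt indices (IVF_PQ on embedding, FTS on caption) are listed on the dataset
for index in ds.list_indices():
    print(index)
```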

## Citation

```bibtex
@article{nan2024openvid,
  title={OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation},
  author={Nan, Kepan and Xie, Rui and Zhou, Penghao and Fan, Tiehan and Yang, Zhenheng and Chen, Zhijie and Li, Xiang and Yang, Jian and Tai, Ying},
  journal={arXiv preprint arXiv:2407.02371},
  year={2024}
}
```

## License

Please check the original OpenVid dataset license for usage terms.
# OpenVid HF - Lance Format with Video Blobs

This project migrates the OpenVid dataset from LanceDB Enterprise to the Hugging Face Hub, storing videos inline using Lance's Blob API.

## Setup

```bash
cd openvid_hf
uv sync
```

## Usage

### Test with a small batch (5-10 rows)

```bash
uv run python test_direct_ingestion.py
```

### Full migration

```bash
uv run python dataprep.py --limit 1000000
```

### Upload to HF Hub

```bash
uv run python upload_hub.py ./openvid.lance
```

## Schema

| Column | Type | Description |
|--------|------|-------------|
| video_path | string | Original S3 path |
| caption | string | Video caption |
| seconds | float64 | Duration |
| embedding | list[float32, 1024] | Video embedding |
| video | large_binary (blob) | Video bytes |
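
For reference, a schema like this can be declared with PyArrow and written with `lance.write_dataset`. The snippet below is an illustrative sketch rather than the project's actual `dataprep.py`; the `lance-encoding:blob` field metadata is what marks the `video` column for Lance's Blob API, and the sample values are placeholders.

```python
import lance
import pyarrow as pa

# Illustrative schema matching the table above; the blob metadata on the
# `video` field tells Lance to store that column via the Blob API.
schema = pa.schema([
    pa.field("video_path", pa.string()),
    pa.field("caption", pa.string()),
    pa.field("seconds", pa.float64()),
    pa.field("embedding", pa.list_(pa.float32(), 1024)),
    pa.field("video", pa.large_binary(), metadata={"lance-encoding:blob": "true"}),
])

# Tiny placeholder batch; real video bytes would be read from the source files.
table = pa.table(
    {
        "video_path": ["s3://bucket/example.mp4"],
        "caption": ["a sample caption"],
        "seconds": [3.2],
        "embedding": [[0.0] * 1024],
        "video": [b"\x00\x01\x02"],
    },
    schema=schema,
)

lance.write_dataset(table, "openvid_sample.lance")
```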