LTX-2: IC-LoRA training with reference videos #2498
Draft
This pull request adds support for IC-LoRA-style reference video conditioning to the LTX-2 model and pipeline, enabling the use of reference videos during inference and training. The changes include new methods for handling conditioning latents, validation logic, and integration of reference video tokens into the inference and training workflows.
IC-LoRA Reference Video Conditioning Support:
- Added `supports_conditioning_dataset`, `requires_conditioning_latents`, and `prepare_batch_conditions` methods to the LTX-2 model to indicate and validate support for reference video conditioning. These methods ensure that only valid conditioning types and latent inputs are accepted. [1] [2]
- Updated the `model_predict` method to handle reference video conditioning latents, including input validation, packing of reference and target latents, concatenation, and alignment of timesteps and positional encodings. The method now also ensures reference tokens are excluded from the output and not used with incompatible regularizers. [1] [2] [3] [4] [5]

Pipeline Integration and Data Handling:
- Added `_prepare_video_conditioning` to the pipeline for loading, resizing, encoding, and packing reference video frames as latents, along with generating corresponding masks and positional encodings.
- Updated the `__call__` method to accept a `video_conditioning` argument, process and concatenate reference tokens and masks, and ensure correct handling during the denoising loop (including timestep masking and restoration of reference tokens at each step). Reference tokens are removed from the final output. [1] [2] [3] [4] [5] [6] [7]

Utilities and Imports:
- Added `load_video` and `resize_video_frames` to support video loading and preprocessing in both the main and image-to-video pipelines. [1] [2]
- Added `retrieve_latents` to extract latents from VAE encoder output, supporting both sampling and argmax modes.

These changes collectively enable robust and flexible reference video conditioning for LTX-2, improving its capabilities for tasks requiring IC-LoRA-style conditioning.
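
The token handling described above (concatenating reference and target tokens, aligning per-token timesteps, and dropping reference tokens from the output) can be illustrated with a minimal sketch. This is not the code from this pull request: it uses plain Python lists instead of tensors so it runs without dependencies, and every name in it (`concat_reference_and_target`, `strip_reference`, `denoise_step`, `dummy_model`) is hypothetical. It also assumes that reference tokens are treated as clean by giving them timestep 0, and that the target update is a simple Euler-style step; the actual implementation may differ on both points.

```python
from typing import Callable, List, Tuple

Token = List[float]
Model = Callable[[List[Token], List[float]], List[Token]]


def concat_reference_and_target(
    ref_tokens: List[Token], target_tokens: List[Token], timestep: float
) -> Tuple[List[Token], List[float], int]:
    """Place reference tokens in front of target tokens.

    Assumption: reference tokens are clean, so they get timestep 0.0,
    while target tokens share the current denoising timestep.
    """
    sequence = [list(t) for t in ref_tokens] + [list(t) for t in target_tokens]
    timesteps = [0.0] * len(ref_tokens) + [timestep] * len(target_tokens)
    return sequence, timesteps, len(ref_tokens)


def strip_reference(output: List[Token], num_ref: int) -> List[Token]:
    """Drop the leading reference positions so only target tokens remain."""
    return output[num_ref:]


def denoise_step(
    ref_tokens: List[Token],
    target_tokens: List[Token],
    timestep: float,
    model: Model,
    step_size: float,
) -> List[Token]:
    """One illustrative step: only target tokens are updated.

    The reference tokens are re-attached unchanged on every call, which is
    one simple way to keep them fixed across the denoising loop.
    """
    sequence, timesteps, num_ref = concat_reference_and_target(
        ref_tokens, target_tokens, timestep
    )
    prediction = model(sequence, timesteps)
    target_pred = strip_reference(prediction, num_ref)
    # Hypothetical Euler-style update applied to target tokens only.
    return [
        [x - step_size * v for x, v in zip(tok, pred)]
        for tok, pred in zip(target_tokens, target_pred)
    ]


def dummy_model(sequence: List[Token], timesteps: List[float]) -> List[Token]:
    """Stand-in for the transformer: scales each token by its own timestep."""
    return [[x * t for x in tok] for tok, t in zip(sequence, timesteps)]
```

Because the reference positions are sliced off before the update, a loss or regularizer computed on the returned tokens only ever sees target positions, which matches the stated goal of excluding reference tokens from the output.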