
Conversation

@bghira (Owner) commented on Jan 26, 2026

This pull request adds support for IC-LoRA-style reference video conditioning to the LTX-2 model and pipeline, enabling the use of reference videos during inference and training. The changes include new methods for handling conditioning latents, validation logic, and integration of reference video tokens into the inference and training workflows.

IC-LoRA Reference Video Conditioning Support:

  • Added supports_conditioning_dataset, requires_conditioning_latents, and prepare_batch_conditions methods to the LTX-2 model to declare and validate support for reference video conditioning, so that only valid conditioning types and latent inputs are accepted (see the sketch after this list).
  • Enhanced the model_predict method to handle reference video conditioning latents, including input validation, packing of reference and target latents, concatenation, and alignment of timesteps and positional encodings. The method also ensures reference tokens are excluded from the output and are not used with incompatible regularizers.
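
A minimal sketch of what these hooks might look like; the three method names are taken from this PR, but the class body, batch keys, and validation details are assumptions, not the actual SimpleTuner implementation:

```python
import torch


class LTXVideo2Hooks:
    """Hypothetical stand-in for the LTX-2 model class; only the three
    method names below come from this PR, the bodies are illustrative."""

    def supports_conditioning_dataset(self) -> bool:
        # Declare that this model can consume a conditioning dataset.
        return True

    def requires_conditioning_latents(self) -> bool:
        # Conditioning inputs are expected as VAE latents, not raw pixels.
        return True

    def prepare_batch_conditions(self, batch: dict, state: dict) -> dict:
        # Reject malformed conditioning inputs before model_predict runs.
        cond = batch.get("conditioning_latents")
        if cond is not None:
            if not isinstance(cond, torch.Tensor):
                raise TypeError("conditioning_latents must be a tensor")
            if cond.shape[0] != batch["latents"].shape[0]:
                raise ValueError("conditioning/target batch size mismatch")
        return batch
```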

Pipeline Integration and Data Handling:

  • Introduced _prepare_video_conditioning to the pipeline for loading, resizing, encoding, and packing reference video frames as latents, along with generating corresponding masks and positional encodings.
  • Updated the pipeline's __call__ method to accept a video_conditioning argument, process and concatenate reference tokens and masks, and handle them correctly during the denoising loop, including timestep masking and restoration of reference tokens at each step (see the sketch after this list). Reference tokens are removed from the final output.
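
A rough sketch of how that denoising-loop handling could fit together, assuming packed token sequences of shape (batch, tokens, channels); transformer and scheduler_step stand in for the real pipeline components, and all names here are illustrative:

```python
import torch


def denoise_with_reference(
    transformer,               # callable: (tokens, per_token_timesteps) -> prediction
    scheduler_step,            # callable: (prediction, t, latents) -> updated latents
    latents: torch.Tensor,     # noisy target tokens, shape (B, N_tgt, C)
    ref_tokens: torch.Tensor,  # clean packed reference tokens, shape (B, N_ref, C)
    timesteps: torch.Tensor,
) -> torch.Tensor:
    n_ref = ref_tokens.shape[1]
    for t in timesteps:
        # Reference tokens ride along with the target tokens in one sequence.
        model_input = torch.cat([ref_tokens, latents], dim=1)

        # Timestep masking: reference tokens are pinned to timestep 0 so the
        # model treats them as clean context at every step.
        token_t = torch.full(
            (model_input.shape[1],), float(t), device=latents.device
        )
        token_t[:n_ref] = 0.0

        pred = transformer(model_input, token_t)

        # Step only the target slice; the reference slice is re-concatenated
        # unchanged on the next iteration and never reaches the decoder.
        latents = scheduler_step(pred[:, n_ref:], t, latents)
    return latents
```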

Utilities and Imports:

  • Added imports for load_video and resize_video_frames to support video loading and preprocessing in both the main and image-to-video pipelines.
  • Added a utility function retrieve_latents to extract latents from VAE encoder output, supporting both sampling and argmax modes (a conventional implementation is sketched below).
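
retrieve_latents follows a convention that appears throughout diffusers-derived pipelines; a typical implementation looks like the following (the exact signature used in this PR is assumed from the description):

```python
from typing import Optional

import torch


def retrieve_latents(
    encoder_output,
    generator: Optional[torch.Generator] = None,
    sample_mode: str = "sample",
) -> torch.Tensor:
    # "sample" draws from the VAE posterior; "argmax" takes its mode.
    if hasattr(encoder_output, "latent_dist") and sample_mode == "sample":
        return encoder_output.latent_dist.sample(generator)
    if hasattr(encoder_output, "latent_dist") and sample_mode == "argmax":
        return encoder_output.latent_dist.mode()
    if hasattr(encoder_output, "latents"):
        return encoder_output.latents
    raise AttributeError("Could not access latents of provided encoder_output")
```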

Together, these changes enable robust IC-LoRA-style reference video conditioning for LTX-2 in both training and inference.

Copilot AI (Contributor) left a comment


Pull request overview

This PR adds IC-LoRA-style reference video conditioning support to the LTX-2 training model and both the text-to-video and image-to-video pipelines, enabling the model to consume reference videos as latent-space conditions during training and inference.

Changes:

  • Add reference video loading, resizing, VAE encoding, and packing utilities to both LTX-2 pipelines, plus a shared retrieve_latents helper.
  • Extend the LTXVideo2 training model to declare conditioning support, validate conditioning batches, and integrate reference conditioning latents into model_predict (timesteps, RoPE coords, force-keep mask, and output slicing); the force-keep extension is sketched after this list.
  • Wire video_conditioning into the pipelines’ __call__ methods, including token/mask concatenation, timestep masking for reference tokens, and removal of reference tokens from outputs.
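
For the TREAD force-keep mask in particular, the extension presumably amounts to marking every reference token as undroppable. A minimal sketch, assuming reference tokens are prepended to the sequence and the mask is boolean (names are hypothetical):

```python
import torch


def extend_force_keep_mask(force_keep: torch.Tensor, n_ref: int) -> torch.Tensor:
    # force_keep: (batch, n_target) boolean mask; True marks tokens the
    # token-dropping regularizer must never drop.
    ref_keep = torch.ones(
        force_keep.shape[0], n_ref, dtype=torch.bool, device=force_keep.device
    )
    # Reference tokens sit in front of the target tokens, so their entries
    # are prepended and are always True.
    return torch.cat([ref_keep, force_keep], dim=1)
```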

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File-by-file summary:

  • simpletuner/helpers/models/ltxvideo2/pipeline_ltx2.py: Adds reference video conditioning support for text-to-video inference, covering video loading/resizing/encoding, packing into reference tokens/masks/coords, integration of video_conditioning into the denoising loop via sequence concatenation, timestep masking, and coordinate concatenation, and stripping of reference tokens before decoding.
  • simpletuner/helpers/models/ltxvideo2/pipeline_ltx2_image2video.py: Mirrors the reference conditioning path for image-to-video, reusing similar _prepare_video_conditioning logic and adapting the denoising loop to separate reference vs. target tokens in latent space and keep reference tokens fixed across steps.
  • simpletuner/helpers/models/ltxvideo2/model.py: Declares that LTX-2 supports conditioning datasets and requires conditioning latents, validates conditioning inputs in prepare_batch_conditions / model_predict, and augments model_predict to pack reference latents, build per-token timesteps and RoPE coords, extend TREAD force-keep masks, disallow CREPA with reference tokens, and drop reference tokens from the video prediction.
Comments suppressed due to low confidence (1)

simpletuner/helpers/models/ltxvideo2/model.py:1011

  • This assignment to 'prepare_batch_conditions' is unnecessary as it is redefined before this value is used.
    def prepare_batch_conditions(self, batch: dict, state: dict) -> dict:

