[Feature] FastVideo/SFWan2.2-I2V-A14B-Preview-Diffusers acts like T2V rather than I2V. #1086

@JoJoistheBestOne

Motivation

Dear authors,
We recently ran inference with FastVideo/SFWan2.2-I2V-A14B-Preview-Diffusers in a causal setup.
We found that the way the model uses the input image looks very similar to T2V, namely:

  1. The image is first passed through FastVideo/SFWan2.2-I2V-A14B-Preview-Diffusers at timestep 0 to update the kv_cache.
  2. The model then generates each following chunk in what is effectively T2V mode: its input is only the 16-channel noised latent. This differs from Wan2.2 I2V, which takes 16 channels of image latents, 4 mask channels, and 16 channels of noised latents, 36 channels in total.
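To make the channel arithmetic in the two steps above concrete, here is a minimal shape-only sketch. It does not touch the FastVideo or Diffusers code. The channel counts (16, 4, 16) come from the description above; the function names, the `(channels, frames, height, width)` layout, and the order of concatenation are assumptions made for illustration only.

```python
from typing import Tuple

# Assumed layout for this sketch: (channels, frames, height, width).
Shape = Tuple[int, int, int, int]

LATENT_CHANNELS = 16  # noised video latents (count taken from the issue text)
MASK_CHANNELS = 4     # conditioning mask (count taken from the issue text)


def concat_channels(*shapes: Shape) -> Shape:
    """Shape obtained by concatenating tensors along the channel axis."""
    first = shapes[0]
    for s in shapes[1:]:
        # All non-channel dimensions must agree for a channel-wise concat.
        if s[1:] != first[1:]:
            raise ValueError(f"non-channel dims differ: {s[1:]} vs {first[1:]}")
    return (sum(s[0] for s in shapes),) + first[1:]


def t2v_style_input(frames: int, height: int, width: int) -> Shape:
    """Hypothetical helper: the model input is the noised latent only."""
    return (LATENT_CHANNELS, frames, height, width)


def i2v_style_input(frames: int, height: int, width: int) -> Shape:
    """Hypothetical helper: noised latent + mask + image latent, concatenated.

    The concatenation order here is an assumption; only the total matters
    for this illustration.
    """
    noise = (LATENT_CHANNELS, frames, height, width)
    mask = (MASK_CHANNELS, frames, height, width)
    image = (LATENT_CHANNELS, frames, height, width)
    return concat_channels(noise, mask, image)


if __name__ == "__main__":
    # Latent sizes below are arbitrary example values, not model settings.
    print("T2V-style input shape:", t2v_style_input(21, 60, 104))
    print("I2V-style input shape:", i2v_style_input(21, 60, 104))
```

Comparing the first dimension of the two shapes (16 versus 36) is a quick way to tell which conditioning scheme a given forward pass is using.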

So we wonder: was FastVideo/SFWan2.2-I2V-A14B-Preview-Diffusers trained in T2V mode rather than I2V mode?

In addition, is there a way to do I2V-style conditioning (with the image latent and mask channels) in the SF (causal) setting?

Thanks.

Related resources

No response
