Motivation
Dear authors,
Recently, we used FastVideo/SFWan2.2-I2V-A14B-Preview-Diffusers to run inference in causal mode.
We found that the model consumes the input image in a way that is very similar to T2V, namely:
- The image is first passed through FastVideo/SFWan2.2-I2V-A14B-Preview-Diffusers at timestep 0 to update the kv_cache.
- The model then generates the next chunk in T2V mode only: it takes a 16-channel noise latent as input. This differs from Wan2.2 I2V, which takes 16 channels of image latents, 4 mask channels, and 16 channels of noised latents, for 36 channels in total.
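To make the difference concrete, here is a rough sketch of the channel arithmetic we are describing. It only tracks tensor shapes. The function names, the `(C, T, H, W)` layout, and the concatenation order are our own assumptions for illustration and are not taken from the FastVideo or Wan2.2 code; only the channel counts (16, 4, 16, 36) come from what we observed.

```python
# Shape-only sketch of the two input layouts discussed in this issue.
# All names here are hypothetical; no real model code is involved.

NOISE_LATENT_CHANNELS = 16   # noised video latents
IMAGE_LATENT_CHANNELS = 16   # latents of the conditioning image
MASK_CHANNELS = 4            # mask channels


def concat_channels(*shapes):
    """Shape obtained by concatenating (C, T, H, W) tensors along the channel axis."""
    first = shapes[0]
    for shape in shapes[1:]:
        if shape[1:] != first[1:]:
            raise ValueError(f"non-channel dims differ: {shape} vs {first}")
    return (sum(shape[0] for shape in shapes),) + first[1:]


def t2v_style_input_shape(t, h, w):
    """What we observe in the causal model: only the noise latent is fed in."""
    return (NOISE_LATENT_CHANNELS, t, h, w)


def i2v_style_input_shape(t, h, w):
    """Wan2.2 I2V as we understand it: noise, mask, and image latents concatenated.

    The order of the three parts is an assumption; only the total matters here.
    """
    return concat_channels(
        (NOISE_LATENT_CHANNELS, t, h, w),
        (MASK_CHANNELS, t, h, w),
        (IMAGE_LATENT_CHANNELS, t, h, w),
    )


if __name__ == "__main__":
    print("T2V-style input:", t2v_style_input_shape(3, 8, 8))
    print("I2V-style input:", i2v_style_input_shape(3, 8, 8))
```

In the causal model we only see the first layout during chunk generation; the image reaches the model solely through the kv_cache.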
So we wonder: was FastVideo/SFWan2.2-I2V-A14B-Preview-Diffusers trained as a T2V model rather than an I2V model?
In addition, is there also an I2V approach for SF (causal)?
Thanks!
Related resources
No response