Hello there,
Thanks for publishing this excellent work! I have the following questions:
- Following usage 2 in the README, we can obtain the fused feature (from the reference image plus the modified text). How do you calculate the distance between the fused feature and the target image feature? Is it based on cosine similarity or Euclidean distance?
- Is the CLIP embedding used for the target image?
Best,