🎉🎉🎉 This paper has been accepted by the 2024 IEEE International Conference on Systems, Man, and Cybernetics (SMC)!
- Requirements
At least one 24GB RTX 3090 GPU is required for training; sampling runs on CPU only.
- Environment
```bash
conda create -n RATLIP python=3.9
conda activate RATLIP
```
- Clone this repo
```bash
git clone https://github.com/OxygenLu/RATLIP.git
```
- Install the requirements
```bash
cd RATLIP
pip install -r requirements.txt
```
- Install CLIP
```bash
cd ../
git clone https://github.com/openai/CLIP.git
pip install ./CLIP
```
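To confirm the CLIP install works, here is a quick sanity check; ViT-B/32 is just one of the available backbones, not necessarily the one this repo's configs select:
```python
import torch
import clip

# Load a pre-trained CLIP backbone; the weights download on first use.
model, preprocess = clip.load("ViT-B/32", device="cpu")
tokens = clip.tokenize(["a photo of a bird"])
with torch.no_grad():
    text_features = model.encode_text(tokens)
print(text_features.shape)  # torch.Size([1, 512]) for ViT-B/32
```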
- Train
```bash
cd RATLIP/code
bash scripts/train.sh ./cfg/bird.yml
```
To resume training from a checkpoint, change `state_epoch` and load the corresponding weights.
- Test
```bash
bash scripts/test.sh ./cfg/bird.yml
```
- TensorBoard
Training results are stored as TensorBoard event files under `./logs`; view them with:
```bash
tensorboard --logdir your_path --port 8166
```
- Sampling
`sample.ipynb` can be used to draw samples; a rough sketch of the sampling flow is shown below.
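A minimal sketch of what sampling involves, assuming the ViT-B/32 CLIP backbone; `NetG`, its arguments, the noise dimension, and the checkpoint path are hypothetical placeholders for the repo's actual generator class and trained weights, so substitute the real ones from `code/`:
```python
import torch
import clip

device = "cpu"  # sampling needs only a CPU, per the requirements above

# Encode the prompt with the same pre-trained CLIP used during training.
clip_model, _ = clip.load("ViT-B/32", device=device)
tokens = clip.tokenize(["this bird has a red crown and black wings"]).to(device)
with torch.no_grad():
    sent_emb = clip_model.encode_text(tokens).float()

# The generator class, constructor arguments, and checkpoint name below are
# hypothetical; replace them with the actual ones from this repository.
# netG = NetG(...)
# netG.load_state_dict(torch.load("saved_models/netG.pth", map_location=device))
# noise = torch.randn(1, 100)
# with torch.no_grad():
#     image = netG(noise, sent_emb)  # conditioned on the CLIP text embedding
```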
- Results
Comparison of RATLIP with state-of-the-art models on FID (lower is better).
| Model | CUB | CelebA-tiny |
|---|---|---|
| AttnGAN | 23.98 | 125.98 |
| LAFITE | 14.58 | - |
| DF-GAN | 14.81 | 137.60 |
| GALIP | 10.00 | 94.45 |
| Ours | 13.28 | 81.48 |
Comparison of RATLIP with state-of-the-art models on CLIP score (higher is better); a sketch of how this score is computed follows the table.
| Model | CUB | Oxford | CelebA-tiny |
|---|---|---|---|
| AttnGAN | - | 21.15 | - |
| LAFITE | 31.25 | - | - |
| DF-GAN | 29.20 | 26.67 | 24.41 |
| GALIP | 31.60 | 31.77 | 27.95 |
| Ours | 32.03 | 31.94 | 28.91 |
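For context, the CLIP score measures image-text consistency as the cosine similarity between CLIP's image and text embeddings, typically scaled by 100. Below is a minimal sketch of that computation, assuming the ViT-B/32 backbone and a placeholder image file; the paper's exact evaluation protocol may differ:
```python
import torch
import clip
from PIL import Image

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# "generated_bird.png" is a placeholder for a synthesized image.
image = preprocess(Image.open("generated_bird.png")).unsqueeze(0).to(device)
text = clip.tokenize(["this bird has a red crown and black wings"]).to(device)

with torch.no_grad():
    img_emb = model.encode_image(image).float()
    txt_emb = model.encode_text(text).float()

# Cosine similarity of L2-normalized embeddings, scaled by 100.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
print(f"CLIP score: {100 * (img_emb * txt_emb).sum().item():.2f}")
```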
- Citation
```bibtex
@INPROCEEDINGS{11169738,
author={Lin, Chengde and Lu, Xijun and Chen, Guangxi},
booktitle={2024 IEEE International Conference on Systems, Man, and Cybernetics (SMC)},
title={RATLIP: Generative Adversarial CLIP Text-to-Image Synthesis Based on Recurrent Affine Transformations},
year={2024},
pages={2346-2352},
abstract={Synthesizing high-quality photorealistic images with textual descriptions as a condition is very challenging. Generative Adversarial Networks (GANs), a classical model for this task, frequently suffer from low consistency between image and text descriptions and insufficient richness in synthesized images. Recently, conditional affine transformations (CAT), such as conditional batch normalization and instance normalization, have been applied to different layers of GAN to control content synthesis in images. CAT is a multi-layer perceptron that independently predicts data based on batch statistics between neighboring layers, with global textual information unavailable to other layers. To address this issue, we first model CAT and a recurrent neural network (RAT) to ensure that different layers can access global information. We then introduce shuffle attention between RAT to mitigate the characteristic of information forgetting in recurrent neural networks. Moreover, both our generator and discriminator utilize the powerful pre-trained model, CLIP, which has been extensively employed for establishing associations between text and images through the learning of multi-modal representations in latent space. The discriminator utilizes CLIP's ability to comprehend complex scenes to accurately assess the quality of the generated images. Extensive experiments have been conducted on the CUB, Oxford, and CelebA-tiny datasets to demonstrate the superior performance of the proposed model over current state-of-the-art models. The code is available at https://github.com/OxygenLu/RATLIP.},
keywords={Recurrent neural networks;Codes;Text to image;Generative adversarial networks;Generators;Cybernetics;Batch normalization;Photorealistic images},
doi={10.1109/SMC54092.2024.11169738},
month={Oct}
}
```

