This project is an AI-powered animation generator that creates frame-by-frame, Ghibli-style animations from natural language prompts. Using a combination of a large language model (Mistral-7B-Instruct), Stable Diffusion pipelines (txt2img, img2img), and ControlNet, the system produces coherent and stylized animated sequences in the spirit of Studio Ghibli's visual aesthetic.
- Natural Prompt to Animation: Enter a single prompt (e.g. "a serene forest at dawn...") and receive a full animation sequence.
- Frame-by-Frame Generation: Uses Stable Diffusion txt2img to generate the first frame from text, and `img2img` to evolve subsequent frames.
- LLM-Assisted Scene Breakdown: Breaks a high-level idea into a sequence of structured animation steps using Mistral.
- Style Control via LoRA + ControlNet: Ensures stylistic consistency and smooth motion using LoRA fine-tuning and depth-based ControlNet conditioning.
- Smooth Transitions: Automatically blends frames using pixel-wise interpolation for smoother animation and less jitter (see the sketch after this list).
- GIF Output: Compiles frames into a looping animated GIF.
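The pixel-wise blending mentioned above isn't shown in the code later in this README; below is a minimal sketch of one way it could work with PIL. The helper names (`blend_frames`, `interpolate_sequence`) and the choice of one in-between frame per pair are illustrative assumptions, not the project's exact implementation:

```python
from PIL import Image

def blend_frames(frame_a: Image.Image, frame_b: Image.Image, alpha: float = 0.5) -> Image.Image:
    """Pixel-wise blend of two frames: (1 - alpha) * frame_a + alpha * frame_b."""
    return Image.blend(frame_a.convert("RGB"), frame_b.convert("RGB"), alpha)

def interpolate_sequence(frames):
    """Insert one blended in-between frame for every consecutive pair to reduce jitter."""
    out = []
    for a, b in zip(frames, frames[1:]):
        out.extend([a, blend_frames(a, b, 0.5)])
    out.append(frames[-1])
    return out
```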
Tech stack:

- Python
- Stable Diffusion v1.5 (`diffusers`)
- Mistral-7B-Instruct (via Hugging Face)
- ControlNet (Depth)
- LoRA fine-tuning
- Pillow, OpenCV, Torch, Transformers
This section provides a deep dive into how this project leverages Stable Diffusion, LoRA fine-tuning, and ControlNet to produce stylistically consistent Ghibli-themed animation frames from user prompts.
We start with the `runwayml/stable-diffusion-v1-5` model as the base for all image generation tasks:
```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    safety_checker=None,
    torch_dtype=torch.float16
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.to("cuda")
```

This pipeline is later enhanced with LoRA fine-tuned weights to give it a distinct Ghibli-style output capability.
LoRA (Low-Rank Adaptation) allows efficient fine-tuning of large diffusion models by injecting small trainable matrices into attention layers. Instead of modifying the entire UNet weights, LoRA adjusts only a small subset of parameters, significantly reducing compute requirements.
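As a rough illustration of the idea (not the `diffusers` implementation), a LoRA layer keeps the pretrained weight frozen and learns only a small low-rank update. The class below is a hypothetical minimal sketch; the name, rank, and scaling default are illustrative:

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update: W x + (alpha / r) * B(A(x))."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                                  # freeze pretrained weights
        self.lora_a = nn.Linear(base.in_features, r, bias=False)     # A: d_in -> r
        self.lora_b = nn.Linear(r, base.out_features, bias=False)    # B: r -> d_out
        nn.init.zeros_(self.lora_b.weight)                           # start as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```

Only the `lora_a` and `lora_b` matrices are trained, which is why LoRA checkpoints are a few megabytes instead of the full UNet.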
- Captioned images are paired in a CSV (`metadata.csv`).
- Images + captions are wrapped as a Hugging Face `datasets.Dataset` and pushed to the Hub:
```python
import os
import pandas as pd
from datasets import Dataset, Image

# DataFrame setup: metadata.csv pairs each file_name with its caption
df = pd.read_csv("metadata.csv")
df["image"] = df["file_name"].apply(lambda fn: os.path.join(image_folder, fn))
ds = Dataset.from_pandas(df).cast_column("image", Image())
ds.push_to_hub("ibrahim7004/lora-ghibli-images", split="train")
```

Using Hugging Face's `train_text_to_image_lora.py` script, the model is fine-tuned with the custom captions:
```bash
accelerate launch train_text_to_image_lora.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --dataset_name="ibrahim7004/lora-ghibli-images" \
  --caption_column="caption" \
  ...
  --output_dir="./finetune_lora/ghibli"
```

Training runs for 3000 steps, saving the LoRA weights as `pytorch_lora_weights.safetensors`.
Once trained, the model is reloaded with LoRA weights like so:
```python
from huggingface_hub import hf_hub_download

lora_path = hf_hub_download(
    repo_id="ibrahim7004/ghibli-stableDiff-finetuned",
    filename="v2_pytorch_lora_weights.safetensors"
)
pipe.unet.load_attn_procs(lora_path)
```

The animation generation process consists of two parts:
- Frame 0: Generated from scratch via text-to-image
- Frame 1 onward: Generated using img2img with ControlNet for frame consistency
```python
if idx == 0:
    image = pipe(prompt=frame, ...).images[0]
else:
    refined = img2img_pipe(
        prompt=frame,
        image=saved_image,
        control_image=get_depth_map(saved_image),
        ...
    ).images[0]
```

To maintain visual continuity, we use `lllyasviel/sd-controlnet-depth`:
```python
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth")
img2img_pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    base_model,
    controlnet=controlnet,
    ...
)
```

A simple grayscale depth map is generated from the last frame:
```python
import cv2
import numpy as np
from PIL import Image

def get_depth_map(image_pil):
    # Approximate a depth map with a normalized grayscale version of the previous frame
    gray = cv2.cvtColor(np.array(image_pil), cv2.COLOR_RGB2GRAY)
    depth = cv2.normalize(gray, None, 0, 255, cv2.NORM_MINMAX)
    return Image.fromarray(depth).convert("RGB")
```

All saved frames (`frame_0.png`, `frame_1.png`, ...) are stitched into a GIF for easy preview:
```python
def create_gif_from_frames(folder_path="/content/frames", output_path="/content/animation.gif"):
    frames = [Image.open(...)]  # abbreviated; see the full implementation below
    frames[0].save(output_path, format="GIF", save_all=True, append_images=frames[1:], loop=0)
    display(IPyImage(filename=output_path))
```

The scene-breakdown step uses `mistralai/Mistral-7B-Instruct-v0.3` via the Hugging Face Inference API to convert a natural prompt into a Python list of frame contexts.
```python
from huggingface_hub import InferenceClient

client = InferenceClient(provider="hf-inference", api_key="hf_...")

def generate_animation_frames(prompt, steps=5):
    system_message = (
        f"You are an animation assistant to help create ghibli-themed animation frames. "
        f"Each frame must include the word 'ghibli' or describe the frame as 'ghibli-style'. "
        f"Break the given ghibli-themed idea into a smooth {steps}-step animation. "
        f"Return the result as a Python list of {steps} strings."
    )
    completion = client.chat.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.3",
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": prompt}
        ]
    )
    return completion.choices[0].message.content
```

Mistral's output is then parsed and cleaned using a custom parser to return a valid list of frames.
```python
import ast
import re

def normalize_llm_frame_output(raw_output, steps=5):
    # First, try to parse the output directly as a Python list literal
    try:
        parsed = ast.literal_eval(raw_output)
        if isinstance(parsed, list):
            return parsed[:steps]
    except (ValueError, SyntaxError):
        pass
    # Fallback: extract numbered lines like 1. "..." or 1) ...
    pattern = re.compile(r'\d+\s*[\.\)]\s*["“]?(.*?)["”]?(?=\n\d+|\Z)', re.DOTALL)
    matches = pattern.findall(raw_output)
    return [m.strip() for m in matches[:steps]]
```

The first frame is created using Stable Diffusion txt2img, and subsequent frames are generated via img2img using the previously generated image:
```python
def generate_frames(frames):
    for idx, frame in enumerate(frames):
        if idx == 0:
            # Frame 0: plain text-to-image
            image = pipe(frame, num_inference_steps=30).images[0]
            saved_image = save_img(image, idx)
        else:
            # Later frames: img2img conditioned on the previous frame
            refined = img2img_pipe(
                prompt=frame,
                image=saved_image,
                strength=0.7,
                guidance_scale=9.0,
                num_inference_steps=40
            ).images[0]
            saved_image = save_img(refined, idx)
```
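The `save_img` helper referenced above isn't defined in this README. A minimal sketch, assuming it writes each frame as `frame_<idx>.png` under `/content/frames` (to match `create_gif_from_frames` below) and returns the image for the next img2img step:

```python
import os
from PIL import Image

def save_img(image: Image.Image, idx: int, folder_path: str = "/content/frames") -> Image.Image:
    """Save a generated frame as frame_<idx>.png and return it for reuse as the next input frame."""
    os.makedirs(folder_path, exist_ok=True)
    image.save(os.path.join(folder_path, f"frame_{idx}.png"))
    return image
```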
Finally, the saved frames are compiled into a looping GIF and displayed:

```python
import os
from PIL import Image
from IPython.display import Image as IPyImage, display

def create_gif_from_frames(folder_path="/content/frames", output_path="/content/animation.gif", duration=300):
    # Sort frame_<idx>.png files numerically so frames appear in generation order
    frames = sorted(
        [Image.open(os.path.join(folder_path, f)) for f in os.listdir(folder_path) if f.endswith(".png")],
        key=lambda x: int(x.filename.split('_')[-1].split('.')[0])
    )
    frames[0].save(output_path, format="GIF", save_all=True, append_images=frames[1:], duration=duration, loop=0)
    display(IPyImage(filename=output_path))
```

The Ghibli visual style was enhanced using a manually curated and captioned dataset of Ghibli-style image-caption pairs:
- Final dataset: ghibli-images-for-SD1.5
- Download LoRA weights (Ghibli Refined)
- View full LoRA finetuning code (Studio Ghibli Dataset)
- Dataset: 50 images with captions manually written using ChatGPT assistance for consistency
- Earlier experiments: lora-ghibli-images, lora-pak-truck-art

These images were used to fine-tune the LoRA weights applied to the Stable Diffusion UNet.
This system uses `mistralai/Mistral-7B-Instruct-v0.3` via the Hugging Face Inference API to break a single animation idea into a coherent list of frames. Mistral was chosen for its:
- Strong instruction-following ability
- Fast and cost-effective via Hugging Face's hosted API
- Stable and creative for scene decomposition
```python
system_message = (
    f"You are an animation assistant to help create ghibli-themed animation frames. "
    f"Each frame must include the word 'ghibli' or describe the frame as 'ghibli-style'. "
    f"Break the given ghibli-themed idea into a smooth 5-step animation. "
    f"Return the result as a Python list of 5 strings."
)
```

Example output:

```python
[
    "A ghibli-style forest glows under golden sunlight.",
    "Tall ghibli trees sway gently in the wind.",
    "A ghibli cottage appears through the trees.",
    "The ghibli sky fills with birds over the valley.",
    "Sunlight fades as the ghibli village comes into view."
]
```

End-to-end usage:

```python
prompt = "A ghibli forest with flickering sunlight through swaying trees"
result = generate_animation_frames(prompt, steps=5)
frames = normalize_llm_frame_output(result, steps=5)
generate_frames(frames)
create_gif_from_frames()
```


