Enhancing Q-Former for Knowledge-Based Visual Question Answering with Multi-Layer Co-Attention and Question-Aware Prompts

📣 Accepted at Autumn Annual Conference of IEIE, 2024.

In Knowledge-Based Visual Question Answering (KB-VQA), models must interpret both visual content and external knowledge to answer complex questions accurately. Our approach extends Q-Former, whose cross-attention mechanism performs the initial visual feature extraction, by integrating it with MCAN (Modular Co-Attention Network) and incorporating Question-Aware Prompts during fine-tuning. This design not only strengthens the model's understanding of question-image relationships but also leverages past examples to improve answer accuracy.

🧠 Model Architecture Overview

[Figure: Model architecture]

The architecture integrates MCAN and Question-Aware Prompts into the Q-Former framework, adding multi-layer co-attention and richer contextual understanding.

💡 Key Contributions

  • MCAN Integration: multi-layer self-attention and cross-attention for richer image-question interactions.
  • Question-Aware Prompts: answer candidates and past example contexts that enhance reasoning.
  • Improved Accuracy: a 6.9 percentage-point accuracy gain on the OK-VQA and A-OKVQA datasets.

⚙️ Methodology

  1. Q-Former & MCAN Integration: combines Q-Former's initial cross-modal interaction with a deep stack of co-attention layers (see the sketch after this list).
  2. Fine-Tuning with Question-Aware Prompts: uses answer candidates and answer-aware examples to strengthen contextual understanding during the fine-tuning phase.

[Figure: Fine-tuning structure]
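For concreteness, the sketch below shows one way step 1 could look in PyTorch. Everything here (class names, the 768-dim width, the `qformer` call signature) is an illustrative assumption rather than the repository's actual code:

```python
import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    """One MCAN-style block: self-attention over the query/question
    tokens, cross-attention from those tokens to the image features,
    then a feed-forward network, each with a residual connection."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, q: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # q: (B, Nq, D) query/question tokens; v: (B, Nv, D) visual tokens
        q = self.norm1(q + self.self_attn(q, q, q)[0])
        q = self.norm2(q + self.cross_attn(q, v, v)[0])
        return self.norm3(q + self.ffn(q))

class QFormerWithCoAttention(nn.Module):
    """Hypothetical wrapper: the Q-Former performs the initial
    cross-modal interaction, and a stack of co-attention layers
    deepens the image-question interaction on top of it."""
    def __init__(self, qformer: nn.Module, num_layers: int = 6, dim: int = 768):
        super().__init__()
        self.qformer = qformer  # assumed to return (B, Nq, D) query embeddings
        self.layers = nn.ModuleList(
            [CoAttentionLayer(dim) for _ in range(num_layers)]
        )

    def forward(self, image_embeds: torch.Tensor, question_tokens: torch.Tensor):
        q = self.qformer(image_embeds, question_tokens)  # assumed interface
        for layer in self.layers:
            q = layer(q, image_embeds)
        return q
```

Stacking the co-attention block is what turns Q-Former's single cross-attention pass into the deep multi-layer interaction described above.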

📊 Question-Aware Prompt Structure

[Figure: Question-aware prompt structure]

This structure combines Answer Candidates with confidence scores and Answer-Aware Examples from past cases, enhancing the model's reasoning capabilities.
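A minimal sketch of how such a prompt could be assembled is shown below. The template wording and the function name are illustrative assumptions, not the paper's exact format:

```python
def build_question_aware_prompt(
    question: str,
    candidates: list[tuple[str, float]],
    examples: list[tuple[str, str]],
) -> str:
    """Assemble a question-aware prompt from answer candidates (with
    confidence scores) and answer-aware examples drawn from past cases.
    The template below is an assumption for illustration only."""
    example_block = "\n".join(f"Question: {q} Answer: {a}" for q, a in examples)
    candidate_block = ", ".join(f"{a} ({conf:.2f})" for a, conf in candidates)
    return (
        f"{example_block}\n"
        f"Answer candidates: {candidate_block}\n"
        f"Question: {question} Answer:"
    )

# Hypothetical usage:
print(build_question_aware_prompt(
    "What sport is being played?",
    candidates=[("baseball", 0.82), ("softball", 0.11), ("cricket", 0.04)],
    examples=[("What animal is in the picture?", "dog")],
))
```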

🚀 Experiment Results

| Model    | Accuracy (Question-only) | Accuracy (Question-Aware Prompt) |
|----------|--------------------------|----------------------------------|
| Q-Former | 49.2%                    | 55.65%                           |
| MCAN     | 52.56%                   | -                                |
| Ours     | 50%                      | 56.1%                            |
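As a note on evaluation, OK-VQA is typically scored with the soft VQA accuracy metric. The simplified form below is the common approximation and an assumption here, since the scoring code is not shown in this README:

```python
from collections import Counter

def vqa_accuracy(prediction: str, annotator_answers: list[str]) -> float:
    """Simplified soft VQA accuracy: a prediction counts as fully
    correct when at least 3 of the (typically 10) annotators gave the
    same answer. The official metric averages over subsets of 9
    annotators; min(matches / 3, 1) is the common approximation."""
    matches = Counter(a.strip().lower() for a in annotator_answers)
    return min(matches[prediction.strip().lower()] / 3.0, 1.0)

# Example: 4 of 10 annotators answered "baseball"
answers = ["baseball"] * 4 + ["softball"] * 6
print(vqa_accuracy("baseball", answers))  # 1.0 (4 matches >= 3)
print(vqa_accuracy("cricket", answers))   # 0.0 (no matches)
```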

🎯 Conclusion

  • We introduce Question-Aware Prompts during fine-tuning, providing supplementary context from answer candidates and past examples.
  • Combined with multi-layer co-attention, this yields a 6.9 percentage-point accuracy improvement on KB-VQA tasks.
