Controlled Vocabulary & Semantic Auto-Tagging

This repository explores the intersection of Computer Vision, Large Language Models (LLMs), and Controlled Vocabularies.

The primary goal is to develop a "Semantic Auto-tagger" that can classify content (text and images) using complex, professional taxonomies (specifically IPTC Media Topics) while overcoming the technical limitations of current AI structured outputs.

The Problem: The "Vocabulary Gap"

Most commercial Computer Vision models are trained on datasets that are either too generic or too rigid for professional Digital Asset Management (DAM) needs:

COCO: Limited to ~80 object classes (e.g., "Car," "Person").
ImageNet: Noun-heavy and biology-skewed (great for dog breeds, poor for abstract concepts).
LVMs (Large Vision Models): Models like GPT-4V or Gemini Vision offer "Open Vocabulary" capabilities but are prone to "creative interpretation." They might tag a cat as "Feline Companion," which is a valid English phrase, but if the controlled vocabulary requires "Domestic Cat," the tag is effectively invalid.

This creates a compatibility issue where valid descriptions fail to map to the strict schema required by downstream systems.
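To make the gap concrete, here is a minimal sketch of the check a downstream system effectively performs. The three-term vocabulary and the `split_valid_tags` helper are hypothetical and exist only for illustration; they are not taken from this repository.

```python
# Hypothetical mini-vocabulary; a real taxonomy would hold far more terms.
CONTROLLED_VOCABULARY = {"domestic cat", "dog", "car"}


def split_valid_tags(tags, vocabulary):
    """Separate tags into those found in the vocabulary and those that are not."""
    valid, invalid = [], []
    for tag in tags:
        # Normalise case and whitespace so "Domestic Cat" matches "domestic cat".
        normalised = " ".join(tag.lower().split())
        (valid if normalised in vocabulary else invalid).append(tag)
    return valid, invalid


valid, invalid = split_valid_tags(
    ["Domestic Cat", "Feline Companion"], CONTROLLED_VOCABULARY
)
print(valid)    # ['Domestic Cat']
print(invalid)  # ['Feline Companion']
```

Both tags describe the same animal, yet only one survives the lookup, which is why a free-text tagger cannot feed a strict schema directly.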

The Solution

This project demonstrates how to force an LLM (Google Gemini) to strictly adhere to a pre-defined taxonomy using:

Structured Outputs: Constraining the model to return valid JSON that matches a specific schema.
RAG (Retrieval-Augmented Generation): Dynamically retrieving relevant vocabulary terms so that the candidate list fits within the model's context window.
Multimodal Inputs: Processing both text and images.
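As a rough illustration of the structured-output idea, the sketch below builds a JSON Schema whose tag values are limited to an enumerated list of terms, then checks a reply against it. The function names, the sample terms, and the `tags` field are assumptions made for this example; the schema actually used in this repository may differ, and no model is called here.

```python
import json


def build_tag_schema(terms, max_tags=5):
    """Build a JSON Schema whose tag values are restricted to `terms`."""
    return {
        "type": "object",
        "properties": {
            "tags": {
                "type": "array",
                "items": {"type": "string", "enum": sorted(terms)},
                "maxItems": max_tags,
            }
        },
        "required": ["tags"],
    }


def parse_response(raw_json, schema):
    """Parse a model reply and reject any tag outside the schema's enum."""
    allowed = set(schema["properties"]["tags"]["items"]["enum"])
    tags = json.loads(raw_json)["tags"]
    unknown = [t for t in tags if t not in allowed]
    if unknown:
        raise ValueError(f"Tags outside the vocabulary: {unknown}")
    return tags


# Hypothetical terms and a hand-written reply standing in for model output.
schema = build_tag_schema({"domestic cat", "dog", "car"})
print(parse_response('{"tags": ["domestic cat"]}', schema))  # ['domestic cat']
```

Validating the reply on the client side as well is a defensive choice: it catches off-vocabulary tags even if the schema constraint is not enforced upstream.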

Repository Structure

1. Interactive Application

app.py: A Streamlit application that provides a user-friendly interface for the auto-tagging pipeline. It allows users to input text or images and receive IPTC-compliant tags in real-time.

2. Jupyter Notebooks

The notebooks/ directory contains experiments and pipelines demonstrating different techniques:

media_topics_structured_output.ipynb: Demonstrates the baseline technique of using Google Gemini with structured outputs to classify text against the IPTC vocabulary.
media_topics_RAG.ipynb: Addresses the context window limitation by using a Vector Database (ChromaDB) to retrieve only the most relevant parts of the 1,200+ term vocabulary before asking the LLM to classify the text.
media_topics_image_tagging_pipeline.ipynb: Extends the pipeline to Computer Vision. It processes images, generates descriptions, and then maps those visual features to the controlled vocabulary.
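The retrieval step described for the RAG notebook can be sketched without any external services. The example below swaps in a toy bag-of-words similarity in place of a real embedding model and vector database, so it only illustrates the shape of the idea: rank the vocabulary against the input, keep the top few terms, and pass only those to the model. All names and sample terms are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_candidates(text, terms, k=3):
    """Return the k vocabulary terms most similar to the input text."""
    query = embed(text)
    ranked = sorted(terms, key=lambda t: cosine(query, embed(t)), reverse=True)
    return ranked[:k]


# Hypothetical vocabulary terms, not taken from the real taxonomy file.
TERMS = ["election", "football", "interest rates", "weather forecast"]
candidates = retrieve_candidates(
    "The central bank raised interest rates again", TERMS, k=2
)
print(candidates[0])  # interest rates

# Only the shortlisted terms would then be offered to the model.
prompt = "Classify the text using only these terms: " + "; ".join(candidates)
```

A real embedding model would also match terms that share no words with the input, which this word-overlap stand-in cannot do.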

3. Core Modules

gemini.py: Handles interactions with the Google GenAI SDK, including client initialization and model prompting.
media_topics.py: Utilities for downloading, parsing, and managing the IPTC Media Topics JSON taxonomy.
db.py: Interface for database operations (graph/vector interactions).
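For readers unfamiliar with taxonomy files, the following sketch shows one way a hierarchical vocabulary in JSON could be parsed into a lookup table. The field names (`conceptSet`, `qcode`, `prefLabel`, `broader`), the `ex:` codes, and both helper functions are assumptions for illustration only; check them against the actual IPTC file and `media_topics.py` before relying on them.

```python
import json

# Hypothetical two-concept sample; codes and structure are illustrative only.
SAMPLE_JSON = """
{
  "conceptSet": [
    {"qcode": "ex:001", "prefLabel": {"en-GB": "sport"}, "broader": []},
    {"qcode": "ex:002", "prefLabel": {"en-GB": "football"}, "broader": ["ex:001"]}
  ]
}
"""


def load_terms(raw_json, lang="en-GB"):
    """Return {code: {"label", "broader"}} for concepts that have a label in `lang`."""
    terms = {}
    for concept in json.loads(raw_json).get("conceptSet", []):
        label = concept.get("prefLabel", {}).get(lang)
        if label:
            terms[concept["qcode"]] = {
                "label": label,
                "broader": concept.get("broader", []),
            }
    return terms


def label_path(code, terms):
    """Walk 'broader' links upward to build a path from the root to `code`."""
    path, seen = [], set()
    while code in terms and code not in seen:  # `seen` guards against cycles
        seen.add(code)
        path.append(terms[code]["label"])
        parents = terms[code]["broader"]
        code = parents[0] if parents else None
    return " > ".join(reversed(path))


terms = load_terms(SAMPLE_JSON)
print(label_path("ex:002", terms))  # sport > football
```

Keeping the full path alongside each label gives the model disambiguating context when two terms share similar wording.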

Getting Started

Prerequisites

Python 3.10+
A Google AI Studio API Key

Installation

Clone the repository:
```
git clone https://github.com/peterjakubowski/Controlled-Vocabulary.git
cd Controlled-Vocabulary
```

Install dependencies:
```
pip install -r requirements.txt
```
Environment Setup: Create a .env file in the root directory and add your API key:
```
GOOGLE_AI_API_KEY=your_api_key_here
```
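Python does not read a .env file on its own, so the key has to reach the process environment somehow, for example through a loader such as python-dotenv or a shell export; which approach this repository uses is not stated here. The sketch below assumes the variable is already in the environment and fails early with a clear message if it is missing. The `get_api_key` helper is hypothetical.

```python
import os


def get_api_key(env=os.environ):
    """Return the API key from `env`, or raise if it is missing or a placeholder."""
    key = env.get("GOOGLE_AI_API_KEY", "").strip()
    if not key or key == "your_api_key_here":
        raise RuntimeError(
            "GOOGLE_AI_API_KEY is not set; add it to .env or export it."
        )
    return key


# Demonstration with an explicit mapping instead of the real environment.
print(get_api_key({"GOOGLE_AI_API_KEY": " abc123 "}))  # abc123
```

Checking for the placeholder value catches the common mistake of copying the template line without replacing it.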

Usage

Running the Streamlit App:
```
streamlit run app.py
```
