Text Categorisation Tool

A privacy-first, lightweight, efficient tool for categorising unstructured text responses from Excel workbooks into predefined categories using open-source Ollama models.

Overview

This project provides a simple framework for processing and categorising qualitative survey data. It extracts free-text responses from Excel workbooks and uses local Ollama models to categorise them into predefined categories. This is especially useful for analysing open-ended survey questions where responses vary widely.

Key Features

Excel Processing: Extract questions and responses directly from Excel workbooks
Local Inference: Uses Ollama to run powerful language models on your own hardware
Configurable Categories: Easily adapt the tool to your specific categorisation needs
Efficient Batching: Processes texts in parallel for faster throughput
Result Caching: Avoids redundant model calls to save time and resources
Question Tracking: Intelligently identifies and tracks question numbers across sheets
Lightweight: Minimal dependencies, focused on a single task

Project Structure

text_categorizer/
├── categorizer.py        # Text categorization module
├── excel_converter.py    # Excel file processing module
├── integration.py        # Integration of both modules
└── README.md             # This documentation

Prerequisites

Python 3.10 or higher
Ollama installed and running locally
Basic understanding of Excel data formats

Installation

Clone or download this repository:

git clone https://github.com/ai-mindset/text_categorizer.git
cd text_categorizer

Install the required dependencies:
```
uv pip install -e . 
```

Install and start Ollama:

# Install Ollama from https://ollama.com/download

# Start the Ollama server
ollama serve

Pull the default model:
```
ollama pull mistral-nemo
```

Usage

Basic Operation

To process an Excel file and categorise its contents:

python integration.py survey_responses.xlsx -o ./results

This will:

Read all sheets in survey_responses.xlsx
Extract questions and their associated responses
Categorise each response into one of the predefined categories
Save the results to the ./results directory

Command-line Options

python integration.py INPUT_FILE [OPTIONS]

Arguments:
  INPUT_FILE              Path to Excel file containing survey responses

Options:
  -o, --output PATH       Directory to save output files
  -s, --sheet TEXT        Specific sheet to process (processes all sheets if omitted)
  -m, --model TEXT        Ollama model to use (default: mistral-nemo)
  -b, --batch-size INT    Number of responses to process concurrently (default: 3)
  --no-cache              Disable caching of model responses

Examples

Process a specific sheet with a different model:

python integration.py quarterly_survey.xlsx -s "Q2 Responses" -m phi4 -o ./categorized

Process with larger batch size (faster on powerful machines):

python integration.py feedback.xlsx -b 5 -o ./output

Process without using cached responses:

python integration.py new_data.xlsx --no-cache -o ./fresh_results

Input Format

The tool works with standard Excel workbooks (.xlsx files). Each column in a sheet is treated as a separate question, with rows representing individual responses.

For optimal results:

Columns should have headers that include the question text
Question numbering (e.g., "Q1:", "Q5.", etc.) helps with organisation
Each sheet typically represents a separate survey or questionnaire

Example Excel structure:

| Q1: What is your role? | Q2: What challenges do you face? | Q3: What resources would help? |
|------------------------|----------------------------------|--------------------------------|
| Team Lead              | Time management                  | Better training                |
| Developer              | Technical complexity             | More documentation             |
| Manager                | Resource allocation              | Additional funding             |

Output Format

The tool generates JSON files that contain:

Individual question files with categorised responses
Combined files for each sheet with all questions

Example output structure:

{
  "sheet_name": "Survey Responses",
  "results": [
    {
      "question": "What resources would help?",
      "question_id": "question3",
      "categorized_items": [
        {
          "id": "0",
          "description": "Better training",
          "categories": ["Workforce"]
        },
        {
          "id": "1",
          "description": "More documentation",
          "categories": ["Data"]
        },
        {
          "id": "2",
          "description": "Additional funding",
          "categories": ["Funding"]
        }
      ],
      "summary": {
        "total_items": 3,
        "category_counts": {
          "Workforce": 1,
          "Data": 1,
          "Funding": 1
        }
      }
    }
  ]
}

Available Categories

By default, the tool categorises responses into these categories:

Funding
Data
Governance
Workforce
Comms and engagement
Other

To modify these categories, edit the CATEGORIES list in categorizer.py.

Performance Considerations

Model Selection: Smaller models run faster but may be less accurate
Batch Size: Higher values process more texts in parallel but require more memory
Caching: Enables faster re-processing of the same data
File Size: Very large Excel files may require more memory

Troubleshooting

Issue	Solution
Import errors	Ensure all files are in the same directory
"Model not found" errors	Verify Ollama is running (`ollama serve`) and model is installed (`ollama list`)
Slow processing	Try a smaller batch size or use a smaller model
Excel file errors	Check that your Excel file is not corrupted or password-protected
Memory errors	Process individual sheets instead of the entire workbook

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
src/text_categorizer		src/text_categorizer
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Text Categorisation Tool

Overview

Key Features

Project Structure

Prerequisites

Installation

Usage

Basic Operation

Command-line Options

Examples

Input Format

Output Format

Available Categories

Performance Considerations

Troubleshooting

License

About

Uh oh!

Releases

Packages

Uh oh!

Languages

License

ai-mindset/TagSurvey

Folders and files

Latest commit

History

Repository files navigation

Text Categorisation Tool

Overview

Key Features

Project Structure

Prerequisites

Installation

Usage

Basic Operation

Command-line Options

Examples

Input Format

Output Format

Available Categories

Performance Considerations

Troubleshooting

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages