A privacy-first, lightweight, efficient tool for categorising unstructured text responses from Excel workbooks into predefined categories using open-source Ollama models.
This project provides a simple framework for processing and categorising qualitative survey data. It extracts free-text responses from Excel workbooks and uses local Ollama models to categorise them into predefined categories. This is especially useful for analysing open-ended survey questions where responses vary widely.
- Excel Processing: Extract questions and responses directly from Excel workbooks
- Local Inference: Uses Ollama to run powerful language models on your own hardware
- Configurable Categories: Easily adapt the tool to your specific categorisation needs
- Efficient Batching: Processes texts in parallel for faster throughput
- Result Caching: Avoids redundant model calls to save time and resources
- Question Tracking: Intelligently identifies and tracks question numbers across sheets
- Lightweight: Minimal dependencies, focused on a single task
text_categorizer/
├── categorizer.py # Text categorization module
├── excel_converter.py # Excel file processing module
├── integration.py # Integration of both modules
└── README.md # This documentation
- Python 3.10 or higher
- Ollama installed and running locally
- Basic understanding of Excel data formats
- Clone or download this repository:

      git clone https://github.com/ai-mindset/text_categorizer.git
      cd text_categorizer

- Install the required dependencies:

      uv pip install -e .

- Install and start Ollama:

      # Install Ollama from https://ollama.com/download
      # Start the Ollama server
      ollama serve

- Pull the default model:

      ollama pull mistral-nemo
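Before the first run it can help to confirm that the server is reachable and the model has been pulled. The following is a minimal sketch, not part of this repository: it assumes Ollama is listening on its default address (`http://localhost:11434`) and queries Ollama's `/api/tags` endpoint, which lists locally available models. The function names are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama's default local address; adjust if yours differs.
OLLAMA_URL = "http://localhost:11434"


def installed_models(tags_payload: dict) -> set[str]:
    """Return model names from an /api/tags payload, without the ':tag' suffix."""
    names = set()
    for entry in tags_payload.get("models", []):
        name = entry.get("name", "")
        if name:
            names.add(name.split(":", 1)[0])
    return names


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> dict:
    """Fetch the list of locally available models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def check_setup(model: str = "mistral-nemo") -> str:
    """Return a short, human-readable status line for the given model."""
    try:
        payload = fetch_tags()
    except OSError:  # covers connection refused, timeouts and URLError
        return "Ollama is not reachable; start it with `ollama serve`."
    if model in installed_models(payload):
        return f"{model} is installed."
    return f"{model} is missing; run `ollama pull {model}`."
```

Calling `print(check_setup())` after `ollama serve` is running should report whether `mistral-nemo` is available.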
To process an Excel file and categorise its contents:

    python integration.py survey_responses.xlsx -o ./results

This will:
- Read all sheets in survey_responses.xlsx
- Extract questions and their associated responses
- Categorise each response into one of the predefined categories
- Save the results to the ./results directory

    python integration.py INPUT_FILE [OPTIONS]

    Arguments:
      INPUT_FILE              Path to Excel file containing survey responses

    Options:
      -o, --output PATH       Directory to save output files
      -s, --sheet TEXT        Specific sheet to process (processes all sheets if omitted)
      -m, --model TEXT        Ollama model to use (default: mistral-nemo)
      -b, --batch-size INT    Number of responses to process concurrently (default: 3)
      --no-cache              Disable caching of model responses

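The options above map naturally onto a standard argument parser. The sketch below uses `argparse` purely for illustration; this README does not say which CLI library integration.py actually uses, and `build_parser` is a hypothetical name. Only the flags and defaults are taken from the usage text.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented options and defaults."""
    parser = argparse.ArgumentParser(
        prog="integration.py",
        description="Categorise free-text survey responses from an Excel workbook.",
    )
    parser.add_argument("input_file", metavar="INPUT_FILE",
                        help="Path to Excel file containing survey responses")
    parser.add_argument("-o", "--output", metavar="PATH",
                        help="Directory to save output files")
    parser.add_argument("-s", "--sheet", metavar="TEXT", default=None,
                        help="Specific sheet to process (all sheets if omitted)")
    parser.add_argument("-m", "--model", metavar="TEXT", default="mistral-nemo",
                        help="Ollama model to use")
    parser.add_argument("-b", "--batch-size", metavar="INT", type=int, default=3,
                        help="Number of responses to process concurrently")
    parser.add_argument("--no-cache", action="store_true",
                        help="Disable caching of model responses")
    return parser


# Parse an explicit argument list instead of sys.argv, for demonstration.
args = build_parser().parse_args(["survey_responses.xlsx", "-o", "./results", "-b", "5"])
```

Here `args.batch_size` is the integer 5, while `args.model` and `args.sheet` fall back to their documented defaults.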
Process a specific sheet with a different model:

    python integration.py quarterly_survey.xlsx -s "Q2 Responses" -m phi4 -o ./categorized

Process with a larger batch size (faster on powerful machines):

    python integration.py feedback.xlsx -b 5 -o ./output

Process without using cached responses:

    python integration.py new_data.xlsx --no-cache -o ./fresh_results

The tool works with standard Excel workbooks (.xlsx files). Each column in a sheet is treated as a separate question, with rows representing individual responses.
For optimal results:
- Columns should have headers that include the question text
- Question numbering (e.g., "Q1:", "Q5.", etc.) helps with organisation
- Each sheet typically represents a separate survey or questionnaire
Example Excel structure:
| Q1: What is your role? | Q2: What challenges do you face? | Q3: What resources would help? |
|------------------------|----------------------------------|--------------------------------|
| Team Lead | Time management | Better training |
| Developer | Technical complexity | More documentation |
| Manager | Resource allocation | Additional funding |
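To make the column-per-question layout concrete, here is a minimal sketch of how a header row and its data rows could be turned into one record per question. It is an illustration under stated assumptions, not the logic in excel_converter.py: the function names, the regular expression, and the fallback to column position are all hypothetical. The rows are plain Python lists; with openpyxl they could come from `list(ws.iter_rows(values_only=True))`.

```python
import re

# Matches an optional leading label such as "Q1:", "Q5." or "q12 -".
QUESTION_LABEL = re.compile(r"^\s*Q\s*(\d+)\s*[:.\-)]?\s*", re.IGNORECASE)


def parse_header(header: str, position: int) -> tuple[str, str]:
    """Split a column header into (question_id, question_text).

    Falls back to the 1-based column position when the header has no label.
    """
    match = QUESTION_LABEL.match(header)
    if match:
        return f"question{int(match.group(1))}", header[match.end():].strip()
    return f"question{position}", header.strip()


def columns_to_questions(rows: list) -> list[dict]:
    """Turn a header row plus data rows into one record per column."""
    if not rows:
        return []
    header, *data = rows
    questions = []
    for col, title in enumerate(header):
        if title is None or not str(title).strip():
            continue  # skip columns without a header
        question_id, text = parse_header(str(title), col + 1)
        # Keep non-empty cells only; short rows simply contribute nothing.
        responses = [
            str(row[col]).strip()
            for row in data
            if col < len(row) and row[col] is not None and str(row[col]).strip()
        ]
        questions.append(
            {"question_id": question_id, "question": text, "responses": responses}
        )
    return questions
```

Applied to the example table, the third column yields the id `question3`, the text "What resources would help?", and its three responses in row order.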
The tool writes two kinds of JSON file:
- One file per question, containing its categorised responses
- One combined file per sheet, covering all of that sheet's questions
Example output structure:

    {
      "sheet_name": "Survey Responses",
      "results": [
        {
          "question": "What resources would help?",
          "question_id": "question3",
          "categorized_items": [
            {
              "id": "0",
              "description": "Better training",
              "categories": ["Workforce"]
            },
            {
              "id": "1",
              "description": "More documentation",
              "categories": ["Data"]
            },
            {
              "id": "2",
              "description": "Additional funding",
              "categories": ["Funding"]
            }
          ],
          "summary": {
            "total_items": 3,
            "category_counts": {
              "Workforce": 1,
              "Data": 1,
              "Funding": 1
            }
          }
        }
      ]
    }

By default, the tool categorises responses into these categories:
- Funding
- Data
- Governance
- Workforce
- Comms and engagement
- Other
To modify these categories, edit the CATEGORIES list in categorizer.py.
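A model's raw reply will not always match a category name exactly, and the per-question summary in the output is a simple tally. The sketch below shows one way both steps could work. It is an assumption-laden illustration rather than the code in categorizer.py: `normalise_category` and `summarise` are hypothetical helpers, and the `CATEGORIES` list here only mirrors the defaults listed above.

```python
from collections import Counter

# Mirrors the default list above; the real CATEGORIES lives in categorizer.py.
CATEGORIES = [
    "Funding",
    "Data",
    "Governance",
    "Workforce",
    "Comms and engagement",
    "Other",
]


def normalise_category(reply: str, categories: list[str] = CATEGORIES,
                       fallback: str = "Other") -> str:
    """Map a raw model reply onto a known category, ignoring case and stray quotes."""
    cleaned = reply.strip().strip("\"'.").casefold()
    for category in categories:
        if cleaned == category.casefold():
            return category
    return fallback  # anything unrecognised lands in the fallback bucket


def summarise(items: list[dict]) -> dict:
    """Build a summary shaped like the example output: totals plus per-category counts."""
    counts = Counter(cat for item in items for cat in item["categories"])
    return {"total_items": len(items), "category_counts": dict(counts)}
```

If you change the category list, keeping a catch-all entry such as "Other" gives unrecognised replies somewhere to go.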
- Model Selection: Smaller models run faster but may be less accurate
- Batch Size: Higher values process more texts in parallel but require more memory
- Caching: Enables faster re-processing of the same data
- File Size: Very large Excel files may require more memory
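The interaction between batch size and caching can be sketched as follows. This is a minimal illustration, not the tool's implementation: it uses a thread pool as a stand-in for whatever concurrency integration.py actually uses, an in-memory dict as the cache, and a caller-supplied `categorise(model, text)` function in place of a real Ollama call. All names are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


def cache_key(model: str, text: str) -> str:
    """Key results by model and text so switching models never reuses stale labels."""
    return hashlib.sha256(f"{model}\n{text}".encode("utf-8")).hexdigest()


def categorise_all(
    texts: list[str],
    categorise: Callable[[str, str], str],
    model: str = "mistral-nemo",
    batch_size: int = 3,
    cache: Optional[dict] = None,
) -> list[str]:
    """Label every text, calling `categorise` at most once per uncached, distinct text."""
    # Passing cache=None behaves like a fresh run with no stored results.
    cache = {} if cache is None else cache
    # Deduplicate uncached texts while preserving their first-seen order.
    pending = list(dict.fromkeys(t for t in texts if cache_key(model, t) not in cache))
    # batch_size bounds how many model calls are in flight at once.
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        labels = pool.map(lambda t: categorise(model, t), pending)
        for text, label in zip(pending, labels):
            cache[cache_key(model, text)] = label
    return [cache[cache_key(model, t)] for t in texts]
```

A larger `batch_size` keeps more requests in flight, which is where the extra memory pressure comes from; reusing the same `cache` dict across runs is what makes re-processing the same data cheap.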
| Issue | Solution |
|---|---|
| Import errors | Ensure all files are in the same directory |
| "Model not found" errors | Verify that Ollama is running (ollama serve) and that the model is installed (ollama list) |
| Slow processing | Try a smaller batch size or use a smaller model |
| Excel file errors | Check that your Excel file is not corrupted or password-protected |
| Memory errors | Process individual sheets instead of the entire workbook |
MIT