An intelligent qualitative coding system that analyzes market research transcripts, focus groups, and interviews using OpenAI's GPT models to generate actionable insights with research context awareness.
- Context-Aware Analysis - Incorporates research objectives and brand context for targeted insights
- Hierarchical Coding - Generates themes with sub-themes, priorities, and speaker-attributed quotes
- Multi-Format Export - Outputs in JSON, Markdown, Text, and CSV formats
- Market Research Focus - Optimized for competitive analysis, brand perception, and consumer insights
- Speaker Attribution - Tracks and attributes quotes to specific participants
- Project Organization - Structured input/output management with objectives and transcripts
- Smart Chunking - Topic-based text segmentation preserving context
- Optional Embeddings - HuggingFace model integration for similarity search
To set up the environment:

```bash
# Copy environment template
cp .env.template .env

# Edit .env file with your API keys
nano .env
```

Add your OpenAI API key to `.env`:

```
OPENAI_API_KEY=your_openai_api_key_here
```

Install the dependencies:

```bash
pip install -r requirements.txt
```

Organize each project under `inputs/`:

```
inputs/
└── your_project/
    ├── objectives/
    │   ├── objectives.json     # Research objectives
    │   ├── brand_context.json  # Brand positioning
    │   └── research_brief.txt  # Business context
    └── transcripts/
        └── transcript1.txt
```
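The objectives files supply the research context described above. Their exact schema isn't documented in this README, so the following is only an illustrative sketch with assumed field names; `brand_context.json` would hold brand positioning notes in a similar spirit:

```json
{
  "objectives": [
    "Understand how participants perceive the brand versus key competitors",
    "Identify drivers of and barriers to purchase intent"
  ]
}
```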
Then run the analysis:

```bash
# Analyze a market research project
python analyze_market_research.py your_project
```

The repository is organized as follows:

```
quali_codes/
├── src/                          # Source code modules
│   ├── qualitative_coder.py      # Main orchestration
│   ├── code_generator.py         # OpenAI integration
│   ├── preprocessor.py           # Text cleaning
│   ├── chunker.py                # Text segmentation
│   ├── code_postprocessor.py     # Analysis post-processing
│   ├── embeddings.py             # HuggingFace embeddings (optional)
│   ├── local_vector_store.py     # Local similarity search
│   ├── config.py                 # Configuration management
│   └── logger.py                 # Colored logging
├── tests/                        # Test files (for your tests)
├── inputs/                       # Input text files
├── outputs/                      # Analysis results (JSON files)
├── logs/                         # Log files
├── analyze_market_research.py    # Main entry point
├── process_project.py            # Project processor
├── requirements.txt              # Dependencies
├── .env.template                 # Environment template
└── README.md                     # This file
```
Basic usage from Python:

```python
from src import QualitativeCoder

# Initialize the coder
coder = QualitativeCoder()

# Process texts
results = coder.process_texts(
    texts=["Your text data here", "More text..."],
    languages=['en', 'en'],
    cluster_ids=[1, 1],
    store_vectors=True
)

# Save results
coder.save_results(results, "my_analysis.json")
```

```python
# Load from input file
texts = coder.load_texts_from_file("my_data.json")

# Process loaded texts
results = coder.process_texts(texts)
```

```python
# Search for similar texts (requires embeddings)
similar = coder.search_similar_texts("mental health", top_k=5)
```

Supported environment variables:

| Variable | Required | Description |
|---|---|---|
| `OPENAI_API_KEY` | Yes | Your OpenAI API key |
| `HUGGING_FACE_TOKEN` | No | For embeddings/similarity search |
| `CHUNK_SIZE` | No | Max characters per chunk (default: 512) |
| `CHUNK_OVERLAP` | No | Character overlap between chunks (default: 128) |
| `OPENAI_MODEL` | No | OpenAI model to use (default: gpt-4o) |
| `EMBEDDING_MODEL` | No | HuggingFace model for embeddings |
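Putting the table together, a typical `.env` might look like this (all values shown are placeholders):

```
OPENAI_API_KEY=sk-your-key-here
# Optional: only needed for embeddings/similarity search
HUGGING_FACE_TOKEN=hf_your_token_here
CHUNK_SIZE=512
CHUNK_OVERLAP=128
OPENAI_MODEL=gpt-4o
```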
The system uses these directories by default:

- Input files: place your data in `./inputs/`
- Output files: results are saved to `./outputs/`
- Log files: logs are written to `./logs/`

To use different directories, modify the paths in your `.env` file or update the `Config` class in `src/config.py`.
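The variable names for overriding directories aren't listed in this README, so the snippet below is purely hypothetical; confirm the real setting names in `src/config.py` before relying on it:

```
# Hypothetical variable names - check src/config.py for the actual ones
INPUT_DIR=./my_inputs
OUTPUT_DIR=./my_outputs
LOG_DIR=./my_logs
```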
Input files can be plain text or JSON. Place `.txt` files in the `./inputs/` directory:

```
./inputs/interview_data.txt
```

JSON files can use an object with a `texts` key:

```json
{
  "texts": [
    "First interview transcript...",
    "Second interview transcript..."
  ]
}
```

Or a simple array:

```json
[
  "Text 1",
  "Text 2"
]
```

Results are saved as JSON files in `./outputs/` with this structure:
```json
{
  "original_texts": [...],
  "codes": {
    "1": {
      "Theme Name": [
        {"sub_code": "Sub-theme", "priority": "high"}
      ]
    }
  },
  "consolidated_analysis": {...},
  "top_findings": [...],
  "insights": [...],
  "analysis_timestamp": "2025-08-21T18:03:45"
}
```

The system generates:
- Hierarchical Codes - Main themes with sub-themes and priorities
- Key Insights - Analytical observations about priority distribution
- Top Findings - Ranked list of high-priority items
- Consolidated Analysis - Cross-cluster theme analysis
- Code Hierarchy - Structured view for visualization tools
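A quick sketch of consuming these results in Python, assuming the output structure shown above (the file name is an example):

```python
import json

# Load a saved analysis from ./outputs/
with open("outputs/my_analysis.json", encoding="utf-8") as f:
    results = json.load(f)

print("Analyzed at:", results["analysis_timestamp"])

# Walk the hierarchical codes: cluster -> theme -> sub-themes with priorities
for cluster_id, themes in results["codes"].items():
    for theme, sub_codes in themes.items():
        high = [s["sub_code"] for s in sub_codes if s.get("priority") == "high"]
        if high:
            print(f"Cluster {cluster_id} / {theme}: {', '.join(high)}")

# Top findings are already ranked, so print them in order
for finding in results.get("top_findings", []):
    print("-", finding)
```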
To skip embeddings for faster processing:

```python
# Initialize without embeddings for faster processing
coder = QualitativeCoder(use_embeddings=False)
```

To change chunking or model settings, edit the `.env` file:

```
CHUNK_SIZE=1024
CHUNK_OVERLAP=256
OPENAI_MODEL=gpt-5-nano
```
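For intuition about what `CHUNK_SIZE` and `CHUNK_OVERLAP` control, here is a minimal fixed-window sketch. The project's `chunker.py` segments by topic rather than by a fixed window, so this only approximates the size/overlap trade-off:

```python
# Minimal fixed-window chunking sketch (not the project's topic-based chunker).
def sketch_chunks(text: str, chunk_size: int = 1024, overlap: int = 256) -> list:
    step = chunk_size - overlap  # how far the window advances each iteration
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

transcript = "Moderator: ... " * 500  # ~7,500 characters of placeholder text
chunks = sketch_chunks(transcript, chunk_size=1024, overlap=256)
# Larger CHUNK_SIZE -> fewer, longer chunks; larger CHUNK_OVERLAP -> more shared context.
print(len(chunks), "chunks;", len(chunks[0]), "characters in the first chunk")
```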
Troubleshooting common issues:

- Missing OpenAI API Key
  - Ensure `OPENAI_API_KEY` is set in the `.env` file
  - Check that your OpenAI account has credits
- Embedding Model Errors (non-critical)
  - Embeddings are optional; the system works without them
  - Add `HUGGING_FACE_TOKEN` to enable similarity search
- Permission Errors
  - Ensure write permissions for the `outputs/` and `logs/` directories

Common error messages:

- `Missing required environment variables` - check your `.env` file
- `Could not initialize embeddings` - optional feature; the system still works
- `Error saving results` - check directory permissions
Typical workflow:

- Prepare Data: place interview transcripts in `./inputs/`
- Configure: set up `.env` with your OpenAI API key
- Run Analysis: `python main.py` or use a custom script (see the sketch after this list)
- Review Results: check `./outputs/` for the JSON analysis files
- Extract Insights: use the generated codes and insights for your research
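A minimal end-to-end sketch using the calls documented above (file names are examples; adjust to your project):

```python
# End-to-end sketch; file names are examples
from src import QualitativeCoder

coder = QualitativeCoder(use_embeddings=False)  # embeddings are optional; skip for speed

# Load transcripts placed in ./inputs/ and run the analysis
texts = coder.load_texts_from_file("interview_data.json")
results = coder.process_texts(texts)

# Save the analysis; results are written to ./outputs/
coder.save_results(results, "interview_analysis.json")
```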
Key dependencies (see `requirements.txt`):

- `openai>=1.0.0` - OpenAI API integration
- `transformers>=4.21.0` - HuggingFace models (optional)
- `torch>=2.0.0` - Deep learning backend (optional)
- `scikit-learn>=1.3.0` - Local vector operations
- `nltk>=3.8` - Sentence tokenization
- `numpy>=1.24.0` - Numerical operations
- `python-dotenv>=1.0.0` - Environment management
- `termcolor>=2.3.0` - Colored output
This project is for research and educational purposes.
For issues or questions:
- Check the troubleshooting section above
- Review the log files in `./logs/`
- Ensure all dependencies are installed correctly