briefcase.ai is an AI-powered system for extracting, categorizing, and evaluating privacy-related statements from online service documents (such as privacy policies, terms of service, and related legal documents). The project leverages state-of-the-art NLP models to automate the analysis of privacy documents, providing structured bullet points and criticality-based categorization for downstream applications.
- Project Overview
- Features
- Project Structure
- Key Components
- Database Models
- Notebooks
- Website Interface
- Setup & Installation
- License
## Project Overview

briefcase.ai automates the extraction and classification of privacy-related statements from service documents. It uses a fine-tuned SentenceTransformer model to identify and categorize key statements, helping users and organizations better understand privacy risks and obligations in online services. The web interface carries the briefcase.ai branding.
## Features

- Automated Document Processing: Extracts and processes privacy-related documents using advanced NLP techniques.
- Bullet Point Extraction: Identifies and extracts key statements from documents using ChromaDB-based similarity matching.
- Intelligent Categorization: Assigns extracted statements to privacy categories with standardized mappings.
- Multi-Database Support: Supports PostgreSQL, MongoDB, and ChromaDB for flexible data storage.
- Model Training & Evaluation: Comprehensive pipelines for training and evaluating custom SentenceTransformer models.
- Data Pipeline: Complete ETL pipeline for ingesting, processing, and storing privacy document data.
## Project Structure

```
briefcase.ai/
├── data/ # Raw and processed datasets
│ ├── data_all.csv # Complete dataset
│ ├── data_all_cleaned.csv # Cleaned dataset for training
│ └── reviewed_service_ids.txt # Service IDs for processing
├── database/ # Database abstraction layer
│ ├── base.py # SQLAlchemy base configuration
│ ├── chromadb.py # ChromaDB connection and operations
│ ├── mongodb.py # MongoDB connection and operations
│ ├── postgresql.py # PostgreSQL connection and operations
│ ├── config.py # Database configuration
│ ├── models/ # Database models
│ │ ├── policy.py # Policy and PolicyProcessed models
│ │ └── service_id.py # Service ID model
│ └── utils/
│ └── categories.py # Category utility functions
├── src/ # Main source code
│ ├── components/
│ │ ├── data_ingestion/ # Data loading and preprocessing
│ │ │ ├── data_loader.py # Data fetching from ToS;DR APIs
│ │ │ ├── preprocessor.py # Data cleaning and standardization
│ │ │ ├── utils.py # Language detection and text cleaning
│ │ │ └── config/ # Configuration files
│ │ │ ├── api_settings.py # API endpoint configurations
│ │ │ ├── category_maps.py # Category standardization mappings
│ │ │ └── classification_levels.py # Classification level definitions
│ │ ├── model_training/ # Model training pipeline
│ │ │ ├── training.py # SentenceTransformer training logic
│ │ │ └── constants.py # Training hyperparameters
│ │ └── model_evaluation/ # Model evaluation and metrics
│ │ └── evaluation.py # Performance evaluation and reporting
│ └── pipelines/
│ └── data_ingestion.py # Main data pipeline orchestration
└── utils/ # Utility modules
└── chroma_bullet_extractor.py # ChromaDB-based text extraction
```

## Key Components

### Data Ingestion (`src/components/data_ingestion/`)

- `data_loader.py`: Fetches privacy policy data from the ToS;DR APIs and stores it in the configured databases
- `preprocessor.py`: Cleans and standardizes category mappings using predefined transformations (see the sketch below)
- `utils.py`: Language detection, text cleaning, and data validation utilities
- `config/`: API settings, category mappings, and classification level definitions
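Category standardization in `preprocessor.py` can be pictured as a lookup against the mappings defined in `category_maps.py`. The mapping, column names, and labels below are illustrative placeholders, not the repository's actual values:

```python
import pandas as pd

# Illustrative category map; the real mappings live in
# src/components/data_ingestion/config/category_maps.py.
CATEGORY_MAP = {
    "Third Parties": "data_sharing",
    "Personal Data": "data_collection",
    "Right to leave the service": "account_deletion",
}

def standardize_categories(df: pd.DataFrame) -> pd.DataFrame:
    """Map raw category labels to standardized names, leaving unknown labels untouched."""
    df = df.copy()
    df["category"] = df["category"].map(CATEGORY_MAP).fillna(df["category"])
    return df

raw = pd.DataFrame({"category": ["Third Parties", "Personal Data", "Cookies"]})
print(standardize_categories(raw)["category"].tolist())
# ['data_sharing', 'data_collection', 'Cookies']
```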
### Model Training (`src/components/model_training/`)

- `training.py`: Complete SentenceTransformer training pipeline with triplet loss (see the sketch below)
- `constants.py`: Training hyperparameters and model configuration
- Supports checkpoint saving and resuming training
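The fine-tuning step can be sketched with the standard sentence-transformers triplet-loss API. The base model, example triplets, and checkpoint settings below are placeholders; the project's actual hyperparameters live in `constants.py`:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder base model

# Each triplet: (anchor statement, positive from the same category, negative from another)
train_examples = [
    InputExample(texts=[
        "We share your data with third parties.",     # anchor
        "Your information may be sold to partners.",  # positive
        "You can delete your account at any time.",   # negative
    ]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
    checkpoint_path="checkpoints/",   # enables periodic checkpoint saving
    checkpoint_save_steps=500,
)
```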
### Model Evaluation (`src/components/model_evaluation/`)

- `evaluation.py`: Comprehensive model performance analysis
- Generates F1 scores, confusion matrices, and classification reports (see the sketch below)
- Supports query-based predictions for testing
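The reported metrics are the standard scikit-learn ones; a minimal example with made-up labels (the real labels and predictions come from the evaluation pipeline):

```python
from sklearn.metrics import classification_report, confusion_matrix, f1_score

# Hypothetical gold labels and predictions for two categories.
y_true = ["data_sharing", "data_retention", "data_sharing", "data_retention"]
y_pred = ["data_sharing", "data_sharing", "data_sharing", "data_retention"]

print(f1_score(y_true, y_pred, average="macro"))
print(confusion_matrix(y_true, y_pred, labels=["data_sharing", "data_retention"]))
print(classification_report(y_true, y_pred, zero_division=0))
```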
### Database Layer (`database/`)

- Multi-database support: PostgreSQL, MongoDB, and ChromaDB (see the connection sketch below)
- SQLAlchemy models: Policy and ServiceId models for structured data
- Abstraction layer: Unified interface across different database backends
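The backend-specific connection logic lives in `postgresql.py`, `mongodb.py`, and `chromadb.py`; below is an illustrative view of the three clients the abstraction layer wraps. The connection strings and paths are placeholders only:

```python
import chromadb
from pymongo import MongoClient
from sqlalchemy import create_engine

# Placeholder connection settings; the real values come from database/config.py / .env.
pg_engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/briefcase")
mongo_db = MongoClient("mongodb://localhost:27017")["briefcase"]
chroma_client = chromadb.PersistentClient(path="chroma/")
```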
### ChromaDB Bullet Extractor (`utils/chroma_bullet_extractor.py`)

- ChromaDB integration: Vector-based similarity search for document analysis (see the sketch below)
- Smart text chunking: Intelligent document segmentation
- Category-based extraction: Organized extraction by privacy categories
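A minimal sketch of category-based similarity search with the ChromaDB client; the collection name, chunks, and query text are illustrative and not taken from `ChromaBulletExtractor` itself:

```python
import chromadb

client = chromadb.Client()  # in-memory; use PersistentClient(path=...) to persist
collection = client.get_or_create_collection(name="policy_chunks")

# Chunked policy text goes into the collection along with service metadata.
collection.add(
    ids=["chunk-1", "chunk-2"],
    documents=[
        "We may share personal data with advertising partners.",
        "Accounts inactive for twelve months are deleted.",
    ],
    metadatas=[{"service_id": 1}, {"service_id": 1}],
)

# Query a privacy category to pull out the closest statements as bullet points.
results = collection.query(query_texts=["Third-party data sharing"], n_results=2)
print(results["documents"][0])
```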
### Website (`website/`)

- Modern UI: Interactive demo interface branded as briefcase.ai
- Dual functionality: Document finding and text analysis capabilities
- Real-time processing: Live document analysis with visual feedback
## Database Models

- `Policy`: Raw policy data with `point_id`, `category`, `source`, `service_id`, and `service_name` (see the sketch below)
- `PolicyProcessed`: Cleaned and processed policy data with standardized categories
- `ServiceId`: Service identification and metadata management
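A minimal sketch of what the `Policy` model might look like; the field names come from the list above, while the import path, table name, and column types are assumptions:

```python
from sqlalchemy import Column, Integer, String, Text

from database.base import Base  # assumed import path based on the project tree


class Policy(Base):
    __tablename__ = "policies"  # assumed table name

    point_id = Column(Integer, primary_key=True)
    category = Column(String)
    source = Column(Text)
    service_id = Column(Integer)
    service_name = Column(String)
```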
## Notebooks

- `Data Cleaning.ipynb`: Interactive data preprocessing and cleaning workflows
- `Training.ipynb`: Model training experiments and hyperparameter tuning
- `Evaluation.ipynb`: Model performance analysis and visualization
## Website Interface

The project includes a modern web interface branded as briefcase.ai that provides:
- Interactive Demo: Two-tab interface for document finding and text analysis
- Real-time Analysis: Live processing of privacy policies and terms of service
- Category Visualization: Organized display of extracted privacy points
- Responsive Design: Mobile-friendly interface with modern styling
## Setup & Installation

### Prerequisites

- Python 3.12 or higher
- Git

### Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/AryehRotberg/briefcase.ai.git
   cd briefcase.ai
   ```
2. Create a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```
3. Install dependencies:

   ```bash
   pip install -r requirements.txt  # if requirements.txt exists
   # Or install individually based on the modules you need:
   pip install sentence-transformers chromadb sqlalchemy pandas numpy scikit-learn
   pip install psycopg2-binary pymongo  # PostgreSQL / MongoDB support
   pip install aiohttp                  # async data loading
   ```
4. Configure environment:

   ```bash
   cp .env.example .env  # if .env.example exists
   # Edit .env with your database credentials and API keys
   ```
5. Initialize the databases (if using):

   ```bash
   python -c "from database.base import Base; from database.postgresql import engine; Base.metadata.create_all(engine)"
   ```
### Usage

Run the complete data ingestion pipeline:

```bash
python src/pipelines/data_ingestion.py
```

Train a new model:

```python
from src.components.model_training.training import ModelTraining

trainer = ModelTraining()
# Configure and run training
```

Open `website/index.html` in a web browser to access the briefcase.ai interface.

Extract bullet points with ChromaDB:

```python
from utils.chroma_bullet_extractor import ChromaBulletExtractor

# Initialize and use the extractor
```

## License

This project is licensed under the MIT License. See the LICENSE file for details.
