This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Commands

### Testing
- Run all tests: `pytest piedomains/tests/ -v`
- Run tests without ML models: `pytest piedomains/tests/ -v -m "not ml"`
- Run specific test: `pytest piedomains/tests/test_001_pred_domain_text.py`
- Run with coverage: `pytest piedomains/tests/ --cov=piedomains`

### Linting and Code Quality
- Run pylint: `pylint piedomains/` (uses configuration from `pylintrc`)

### Installation and Development
- Install package: `pip install -e .` (from repository root)
- Install with dev dependencies: `pip install -e ".[dev]"`
- Console script: `classify_domains` (entry point defined in pyproject.toml)

### Package Management
- Build package: `python -m build`
- Upload to PyPI: `python -m twine upload dist/*`
- Validate README: `python -c "import docutils.core; docutils.core.publish_doctree(open('README.rst').read())"`

### Documentation
- Build docs: `cd docs && make html`
- Documentation is built with Sphinx and deployed to ReadTheDocs

## Architecture

### v0.3.0+ Modern Architecture

**New API Design (`api.py`)**: Modern, user-friendly interface
- `DomainClassifier`: Main class with intuitive methods
  - `.classify()`: Combined text + image analysis (most accurate)
  - `.classify_by_text()`: Text-only analysis (faster)
  - `.classify_by_images()`: Image-only analysis (visual content)
  - `.classify_batch()`: Batch processing with progress tracking
- `classify_domains()`: Convenience function for quick usage
- Archive.org integration for historical analysis
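
Batch processing with progress tracking, as `.classify_batch()` offers, typically follows the pattern sketched below. This is a generic illustration under assumed names (`process_in_batches`, `handle_batch`, `on_progress` are made up here), not the method's actual signature or internals:

```python
def process_in_batches(items, handle_batch, batch_size=32, on_progress=None):
    """Split `items` into chunks of `batch_size`, process each chunk,
    and report progress after every chunk."""
    results = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        results.extend(handle_batch(batch))
        if on_progress:
            on_progress(min(start + batch_size, len(items)), len(items))
    return results

# Toy example: "classify" five domains two at a time.
seen = []
out = process_in_batches(
    ["cnn.com", "amazon.com", "bbc.com", "etsy.com", "nih.gov"],
    handle_batch=lambda batch: [(d, "placeholder") for d in batch],
    batch_size=2,
    on_progress=lambda done, total: seen.append((done, total)),
)
print(seen)  # [(2, 5), (4, 5), (5, 5)]
```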

**Modular Classifiers (`classifiers/`)**:
- `TextClassifier`: Specialized text content analysis
- `ImageClassifier`: Screenshot-based visual analysis
- `CombinedClassifier`: Ensemble approach combining both modalities

**Content Processors (`processors/`)**:
- `TextProcessor`: HTML parsing, text extraction and cleaning
- `ContentProcessor`: Content fetching and caching logic
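
The file-based caching that `ContentProcessor` performs can be sketched roughly as follows. This is a simplified illustration (the function name, cache layout, and hash key are assumptions, not the package's actual implementation):

```python
import hashlib
import tempfile
from pathlib import Path

def cached_fetch(url, fetch, cache_dir):
    """Return cached content for `url` if present; otherwise call
    `fetch(url)` once and store the result, keyed by a hash of the URL."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    path = cache / (hashlib.sha256(url.encode()).hexdigest() + ".html")
    if path.exists():
        return path.read_text()
    content = fetch(url)
    path.write_text(content)
    return content

# Toy example: the second call is served from the cache.
calls = {"n": 0}
def fake_fetch(url):
    calls["n"] += 1
    return f"<html>content of {url}</html>"

with tempfile.TemporaryDirectory() as d:
    first = cached_fetch("http://cnn.com", fake_fetch, d)
    second = cached_fetch("http://cnn.com", fake_fetch, d)

print(calls["n"], first == second)  # 1 True
```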

**Legacy API (`domain.py`)**: Backward-compatible functions
- `pred_shalla_cat_*()` functions preserved for existing users
- Will show deprecation warnings in future versions

**Core Engine (`piedomain.py`)**: Low-level prediction engine with ML pipeline
- TensorFlow model inference with proper memory management
- Batch processing with configurable sizes
- Resource cleanup with context managers

### Machine Learning Pipeline

1. **Content Fetching**:
   - Live content: HTTP requests with retry logic and connection pooling
   - Historical content: Archive.org integration with `ArchiveFetcher`
   - Caching: Automatic file-based caching for reuse
2. **Text Processing**:
   - HTML parsing with BeautifulSoup
   - Text extraction and cleaning (removing non-English words, stopwords, punctuation)
   - NLTK-based text preprocessing with fallbacks
3. **Image Processing**:
   - Screenshot capture via Selenium WebDriver with proper resource management
   - Image resizing to 254x254 with PIL
   - Tensor preprocessing for CNN model
4. **Model Inference**:
   - TensorFlow 2.11+ models with explicit memory cleanup
   - Batch processing with configurable sizes for scalability
   - Text model calibration using isotonic regression
5. **Ensemble**: Final predictions combine text and image probabilities with equal weighting
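
The equal-weight ensemble in step 5 amounts to averaging the two per-category probability vectors. A minimal sketch (the function name and the toy probabilities are made up for illustration; only the equal weighting itself comes from the pipeline description above):

```python
def combine_predictions(text_probs, image_probs, weight=0.5):
    """Blend per-category probabilities from the text and image models;
    weight=0.5 gives the equal weighting described in the pipeline."""
    return {
        cat: weight * text_probs[cat] + (1 - weight) * image_probs[cat]
        for cat in text_probs
    }

# Toy example with three of the 41 Shallalist categories.
text_probs = {"news": 0.7, "shopping": 0.2, "forum": 0.1}
image_probs = {"news": 0.5, "shopping": 0.4, "forum": 0.1}

combined = combine_predictions(text_probs, image_probs)
label = max(combined, key=combined.get)
print(label, combined[label])  # news 0.6
```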

### Data Flow Architecture
- **Input**: List of domain names or URLs, optional archive dates
- **Fetching**: Modular fetcher system (`LiveFetcher`/`ArchiveFetcher`)
- **Processing**: Separate text and image processing pipelines
- **Inference**: TensorFlow models with batch optimization and memory management
- **Output**: Pandas DataFrame with predictions, probabilities, and comprehensive metadata
- **Cleanup**: Automatic resource cleanup (WebDriver, temp files, tensors)

### Categories
The model predicts among 41 Shallalist categories defined in `constants.py`, including: adv, alcohol, automobile, dating, downloads, drugs, education, finance, forum, gamble, government, news, politics, porn, recreation, shopping, socialnet, etc.

### Model Storage
- **Download**: Models automatically downloaded from Harvard Dataverse on first use
- **Cache Structure**: `model/shallalist/` directory structure
- **Text Model**: `saved_model/piedomains/` (TensorFlow SavedModel format)
- **Image Model**: `saved_model/pydomains_images/` (TensorFlow SavedModel format)
- **Calibrators**: `calibrate/text/*.sav` files (scikit-learn isotonic regression)
- **Version Management**: `latest=True` parameter forces model updates

### Key Dependencies & Architecture
- **TensorFlow 2.11-2.15**: Neural network inference with memory management
- **Selenium 4.8**: WebDriver automation with context manager cleanup
- **NLTK**: Text processing with lazy initialization and fallbacks
- **scikit-learn 1.5**: Model calibration and post-processing
- **BeautifulSoup4**: HTML parsing and content extraction
- **Pillow 10.3**: Image processing and tensor conversion
- **webdriver-manager**: Automatic ChromeDriver management
- **pandas 1.4**: DataFrame output and data manipulation

## Usage Patterns

### Modern API (Recommended)
```python
from piedomains import DomainClassifier

classifier = DomainClassifier()
result = classifier.classify(["cnn.com", "amazon.com"])
```

### Legacy API (Backward Compatible)
```python
from piedomains import pred_shalla_cat

result = pred_shalla_cat(["cnn.com", "amazon.com"])
```

### Archive Analysis
```python
# Historical content from 2020
result = classifier.classify(["facebook.com"], archive_date="20200101")
```

## Performance & Scaling
- **Batch Size**: Default 32, configurable via environment variables
- **Memory Management**: Explicit TensorFlow tensor cleanup in batch operations
- **Resource Cleanup**: Automatic WebDriver and temp file cleanup via context managers
- **Caching**: File-based caching for HTML and images reduces repeated fetching
- **Network**: HTTP connection pooling with session reuse for improved performance
- **Reliability**: Retry logic with exponential backoff and proper error handling
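
Retry with exponential backoff, as described in the Reliability point, follows a standard pattern; a minimal sketch (the function name and delays are illustrative, not the package's actual fetcher internals):

```python
import time

def fetch_with_retry(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `fetch()` up to `retries + 1` times, doubling the delay
    between attempts (1s, 2s, 4s, ...). Re-raises the last error."""
    for attempt in range(retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == retries:
                raise
            sleep(base_delay * (2 ** attempt))

# Toy example: a "network call" that fails twice before succeeding.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "<html>ok</html>"

delays = []
result = fetch_with_retry(flaky, sleep=delays.append)  # record instead of sleeping
print(result, delays)  # <html>ok</html> [1.0, 2.0]
```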

## Critical Quality Assurance

### Security Features
- **Input Sanitization**: Comprehensive validation for URLs/domains and archive dates
- **Path Traversal Protection**: Safe tar extraction in `utils.safe_extract()`
- **Resource Limits**: Configurable timeouts and batch sizes prevent resource exhaustion
- **Error Isolation**: Robust error handling prevents crashes from malformed inputs
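
A path-traversal check of the kind `utils.safe_extract()` performs can be sketched as follows. This is a simplified standalone illustration, not the package's actual implementation:

```python
import io
import os
import tarfile
import tempfile

def safe_extract(tar, dest):
    """Extract only members that resolve inside `dest`; reject names
    like '../../etc/passwd' that would escape the target directory."""
    dest = os.path.realpath(dest)
    for member in tar.getmembers():
        target = os.path.realpath(os.path.join(dest, member.name))
        if os.path.commonpath([dest, target]) != dest:
            raise ValueError(f"blocked path traversal: {member.name}")
    tar.extractall(dest)

# Toy example: an archive containing a traversal entry is rejected.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as t:
    info = tarfile.TarInfo("../evil.txt")
    info.size = 1
    t.addfile(info, io.BytesIO(b"x"))
buf.seek(0)

with tempfile.TemporaryDirectory() as d, tarfile.open(fileobj=buf) as t:
    try:
        safe_extract(t, d)
        blocked = False
    except ValueError:
        blocked = True
print(blocked)  # True
```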

### Performance Monitoring
- **Memory Usage**: TensorFlow tensor cleanup and resource management
- **Network Efficiency**: Connection pooling reduces overhead for batch operations
- **Progress Tracking**: Built-in progress monitoring for long-running operations
- **Cache Optimization**: Intelligent caching reduces redundant network requests

### Testing Strategy
- **Unit Tests**: 14 test modules covering all components
- **Integration Tests**: End-to-end testing with mock and real scenarios
- **Performance Tests**: Memory usage and batch processing validation
- **Security Tests**: Input validation and edge case handling
- **ML Tests**: Marked with `@pytest.mark.ml` for optional model testing