Commit 9850e74

a bit more security first + sandbox implementation
1 parent c363b00 commit 9850e74

37 files changed: +3308 -467 lines changed

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -110,6 +110,8 @@ venv/
 ENV/
 env.bak/
 venv.bak/
+docs_test_env/
+test_fix_env/
 
 # Spyder project settings
 .spyderproject
@@ -139,3 +141,4 @@ piedomains/model/shallalist/
 archive_html_*/
 archive_images_*/
 /cache
+/test_cache

CHANGELOG.md

Lines changed: 23 additions & 5 deletions
@@ -5,6 +5,24 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.4.2] - 2025-12-15
+
+### Fixed
+- **Dependency Management**: Removed `_has_llm` anti-pattern and implemented proper Python dependency management via pyproject.toml
+- **BeautifulSoup Warning**: Fixed deprecation warning by replacing `text=True` with `string=True` in text processor
+- **Pytest Warnings**: Added missing `performance` marker to pytest configuration to eliminate unknown mark warnings
+- **LLM Classifier**: Fixed duplicate `max_tokens` parameter error in connection test
+
+### Changed
+- **Documentation Links**: Updated all references from ReadTheDocs to GitHub Pages (https://themains.github.io/piedomains/)
+- **PyPI Links**: Updated PyPI badge to use current domain (pypi.org instead of pypi.python.org)
+- **README**: Streamlined documentation by removing editorial content and marketing language, focusing on minimal practical instructions
+
+### Improved
+- **Code Quality**: All tests now run without warnings (eliminated 3 targeted warnings)
+- **Package Building**: Resolved build conflicts and ensured clean package compilation
+- **Link Verification**: All documentation and package links verified as working
+
 ## [0.4.0] - 2025-12-15
 
 ### 🚨 Breaking Changes
@@ -20,7 +38,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### 🔧 Changed
 - **Type Hints**: Modernized all type annotations to use Python 3.11+ union syntax (`|`)
 - **Import Structure**: Added `from __future__ import annotations` for cleaner type hints
-- **Project Structure**:
+- **Project Structure**:
   - Moved `piedomains/tests/` → `tests/`
   - Moved `piedomains/notebooks/` → `notebooks/`
 - **Configuration**: Enhanced error handling with proper logging in config validation
@@ -108,7 +126,7 @@ This release represents a major cleanup and modernization of the codebase, remov
 - `test_013_performance_benchmarks.py`: Performance and scalability testing
 - Mock-based testing for reliable CI/CD
 - Performance benchmarking and memory usage monitoring
-- **Improved Documentation**:
+- **Improved Documentation**:
   - New quickstart-focused README with 3-line setup
   - Comprehensive API examples and migration guide
   - `examples/new_api_demo.py`: Interactive demonstration script
@@ -131,7 +149,7 @@ This release represents a major cleanup and modernization of the codebase, remov
 ```python
 # Modern API
 from piedomains import DomainClassifier
-
+
 # New API available
 from piedomains import DomainClassifier
 ```
@@ -216,7 +234,7 @@ This release represents a major cleanup and modernization of the codebase, remov
 - Error details and context
 - **Comprehensive Test Suite**: 6 new test modules added
   - Domain validation tests
-  - Text processing tests
+  - Text processing tests
   - Error handling tests
   - Utility function tests
   - Configuration system tests
@@ -248,4 +266,4 @@ This release represents a major cleanup and modernization of the codebase, remov
 - **Tensor Management**: Proper cleanup of TensorFlow tensors to prevent memory leaks
 
 ## [0.0.19] - Previous Release
-- Legacy version with basic functionality
+- Legacy version with basic functionality
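
The 0.4.2 entry mentions a duplicate `max_tokens` parameter error but the fix itself is not visible in this part of the diff. As a minimal sketch only: the functions below are hypothetical stand-ins, not code from this commit. They illustrate how supplying a keyword both explicitly and through `**kwargs` raises a `TypeError` in Python, and one way to avoid it by merging the settings into a single dict first.

```python
def fake_completion(prompt: str, **params: object) -> dict:
    """Hypothetical stand-in for an LLM client call; it only echoes its inputs."""
    return {"prompt": prompt, **params}


def connection_check_buggy(defaults: dict) -> dict:
    # If `defaults` already holds "max_tokens", the keyword is supplied twice
    # and Python raises TypeError at call time.
    return fake_completion("ping", max_tokens=5, **defaults)


def connection_check_fixed(defaults: dict) -> dict:
    # Merge first so each keyword appears exactly once; the explicit value wins.
    merged = {**defaults, "max_tokens": 5}
    return fake_completion("ping", **merged)


if __name__ == "__main__":
    settings = {"max_tokens": 100, "temperature": 0.0}
    try:
        connection_check_buggy(settings)
    except TypeError as exc:
        print(f"buggy call failed: {exc}")
    print(connection_check_fixed(settings))
```

Merging before the call keeps the override explicit in one place, which is usually easier to audit than scattering keyword arguments across call sites.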

README.md

Lines changed: 55 additions & 129 deletions
@@ -1,199 +1,125 @@
-# piedomains: AI-powered domain content classification
+# piedomains
 
 [![CI](https://github.com/themains/piedomains/actions/workflows/ci.yml/badge.svg)](https://github.com/themains/piedomains/actions/workflows/ci.yml)
-[![PyPI Version](https://img.shields.io/pypi/v/piedomains.svg)](https://pypi.python.org/pypi/piedomains)
-[![Documentation](https://github.com/themains/piedomains/actions/workflows/docs.yml/badge.svg)](https://github.com/themains/piedomains/actions/workflows/docs.yml)
+[![PyPI Version](https://img.shields.io/pypi/v/piedomains.svg)](https://pypi.org/project/piedomains)
 
-**piedomains** predicts website content categories using traditional ML models or modern LLMs (GPT-4, Claude, Gemini). Analyze domain names, text content, and homepage screenshots to classify websites as news, shopping, adult content, education, etc. with high accuracy and flexible custom categories.
+Classify website content categories using machine learning models or LLMs (GPT-4, Claude, Gemini).
 
-## 🚀 Quickstart
+## Installation
 
-Install and classify domains in 3 lines:
-
-```python
+```bash
 pip install piedomains
+```
+
+Requires Python 3.11+
+
+## Basic Usage
 
+```python
 from piedomains import DomainClassifier
-classifier = DomainClassifier()
 
-# Classify current content
+classifier = DomainClassifier()
 result = classifier.classify(["cnn.com", "amazon.com", "wikipedia.org"])
 print(result[['domain', 'pred_label', 'pred_prob']])
 
-# Expected output:
+# Output:
 # domain pred_label pred_prob
 # 0 cnn.com news 0.876543
 # 1 amazon.com shopping 0.923456
 # 2 wikipedia.org education 0.891234
 ```
 
-## 📊 Key Features
-
-- **High Accuracy**: Combines text analysis + visual screenshots for 90%+ accuracy
-- **LLM-Powered**: Use GPT-4o, Claude 3.5, Gemini with custom categories and instructions
-- **Historical Analysis**: Classify websites from any point in time using archive.org
-- **Fast & Scalable**: Batch processing with caching for 1000s of domains
-- **Easy Integration**: Modern Python API with pandas output
-- **Flexible Categories**: 41 default categories or define your own with AI models
-
-## ⚡ Usage Examples
-
-### Basic Classification
+## Classification Methods
 
 ```python
-from piedomains import DomainClassifier
-
-classifier = DomainClassifier()
+# Combined text + image analysis (most accurate)
+result = classifier.classify(["github.com"])
 
-# Combined analysis (most accurate)
-result = classifier.classify(["github.com", "reddit.com"])
-
-# Text-only (faster)
+# Text-only classification (faster)
 result = classifier.classify_by_text(["news.google.com"])
 
-# Images-only (good for visual content)
+# Image-only classification
 result = classifier.classify_by_images(["instagram.com"])
-```
-
-### Historical Analysis
 
-```python
-# Analyze how Facebook looked in 2010 vs today
-old_facebook = classifier.classify(["facebook.com"], archive_date="20100101")
-new_facebook = classifier.classify(["facebook.com"])
-
-print(f"2010: {old_facebook.iloc[0]['pred_label']}")
-print(f"2024: {new_facebook.iloc[0]['pred_label']}")
+# Batch processing
+results = classifier.classify_batch(domains, method="text", batch_size=50)
 ```
 
-### Batch Processing
+## Historical Analysis
 
 ```python
-# Process large lists efficiently
-domains = ["site1.com", "site2.com", ...] # 1000s of domains
-results = classifier.classify_batch(
-    domains,
-    method="text", # text|images|combined
-    batch_size=50, # Process 50 at a time
-    show_progress=True # Progress bar
-)
+# Analyze archived versions from archive.org
+old_result = classifier.classify(["facebook.com"], archive_date="20100101")
 ```
 
-### 🤖 LLM-Powered Classification
-
-Use modern AI models (GPT-4, Claude, Gemini) for flexible, accurate classification:
+## LLM Classification
 
 ```python
-from piedomains import DomainClassifier
-
-classifier = DomainClassifier()
-
-# Configure your preferred AI provider
+# Configure LLM provider
 classifier.configure_llm(
-    provider="openai", # openai, anthropic, google
-    model="gpt-4o", # multimodal model
-    api_key="sk-...", # or set via environment variable
-    categories=["news", "shopping", "social", "tech", "education"]
+    provider="openai",
+    model="gpt-4o",
+    api_key="sk-...",
+    categories=["news", "shopping", "social", "tech"]
 )
 
-# Text-only LLM classification
-result = classifier.classify_by_llm(["cnn.com", "github.com"])
+# LLM-powered classification
+result = classifier.classify_by_llm(["example.com"])
 
-# Multimodal classification (text + screenshots)
-result = classifier.classify_by_llm_multimodal(["instagram.com"])
-
-# Custom classification instructions
+# With custom instructions
 result = classifier.classify_by_llm(
-    ["khanacademy.org", "reddit.com"],
-    custom_instructions="Classify by educational value: educational, entertainment, mixed"
+    ["site.com"],
+    custom_instructions="Classify by educational value"
 )
-
-# Track usage and costs
-stats = classifier.get_llm_usage_stats()
-print(f"API calls: {stats['total_requests']}, Cost: ${stats['estimated_cost_usd']:.4f}")
 ```
 
-**LLM Benefits:**
-- **Custom Categories**: Define your own classification schemes
-- **Multimodal Analysis**: Combines text + visual understanding
-- **Latest AI**: GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro
-- **Cost Tracking**: Built-in usage monitoring and limits
-- **Flexible Prompts**: Customize instructions for specific use cases
-
-**Supported Providers:**
-- **OpenAI**: GPT-4o, GPT-4-turbo, GPT-3.5-turbo
-- **Anthropic**: Claude 3.5 Sonnet, Claude 3 Opus/Haiku
-- **Google**: Gemini 1.5 Pro, Gemini Pro Vision
-- **Others**: Any litellm-supported model
-
+Set API keys via environment variables:
 ```bash
-# Set API keys via environment variables
 export OPENAI_API_KEY="sk-..."
 export ANTHROPIC_API_KEY="sk-ant-..."
 export GOOGLE_API_KEY="..."
 ```
 
-## 🏷️ Supported Categories
-
-News, Finance, Shopping, Education, Government, Adult Content, Gambling, Social Networks, Search Engines, and 32 more categories based on the Shallalist taxonomy.
-
-## 📈 Performance
+## Categories
 
-- **Speed**: ~10-50 domains/minute (depends on method and network)
-- **Accuracy**: 85-95% depending on content type and method
-- **Memory**: <500MB for batch processing
-- **Caching**: Automatic content caching for faster re-runs
+41 categories: news, finance, shopping, education, government, adult content, gambling, social networks, search engines, and others based on Shallalist taxonomy.
 
-## 🔧 Installation
+## Security
 
-**Requirements**: Python 3.11+
+When analyzing unknown domains, use Docker or isolated environments:
 
 ```bash
-# Basic installation
-pip install piedomains
-
-# For development
-git clone https://github.com/themains/piedomains
-cd piedomains
-pip install -e .
-```
-
-## 💡 API Usage
-
-```python
+docker build -t piedomains-sandbox .
+docker run --rm -it piedomains-sandbox python -c "
 from piedomains import DomainClassifier
 classifier = DomainClassifier()
-result = classifier.classify_by_text(["example.com"])
+result = classifier.classify(['example.com'])
+print(result[['domain', 'pred_label']])
+"
 ```
 
-## 📖 Documentation
+For testing, use known-safe domains: `["wikipedia.org", "github.com", "cnn.com"]`
 
-- **API Reference**: https://piedomains.readthedocs.io
-- **Examples**: `/examples` directory
-- **Notebooks**: `/notebooks` (training & analysis)
+## Documentation
 
-## 🤝 Contributing
+- [API Reference](https://themains.github.io/piedomains/)
+- [Examples](examples/)
+- [Security Guide](examples/sandbox/)
+
+## Development
 
 ```bash
-# Setup development environment
 git clone https://github.com/themains/piedomains
 cd piedomains
 pip install -e ".[dev]"
-
-# Run tests
 pytest tests/ -v
-
-# Run linting
-ruff check piedomains/
 ```
 
-## 📄 License
+## License
 
-MIT License - see LICENSE file.
+MIT License
 
-## 📚 Citation
-
-If you use piedomains in research, please cite:
+## Citation
 
 ```bibtex
 @software{piedomains,
@@ -202,4 +128,4 @@ If you use piedomains in research, please cite:
   year={2024},
   url={https://github.com/themains/piedomains}
 }
-```
+```
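
The README now tells readers to set API keys via environment variables, while `configure_llm` also accepts an `api_key` argument. The diff does not show how the library chooses between the two, so the following is only a sketch of one plausible lookup order. `resolve_api_key` and `ENV_VARS` are hypothetical names, not part of piedomains; the variable names are the three listed in the README.

```python
from __future__ import annotations

import os

# Environment variable names as listed in the README; pairing them with
# provider names this way is an assumption made for illustration.
ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def resolve_api_key(provider: str, explicit_key: str | None = None) -> str:
    """Return an explicitly passed key, else fall back to the environment."""
    if explicit_key:
        return explicit_key
    try:
        env_name = ENV_VARS[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None
    key = os.environ.get(env_name)
    if not key:
        raise RuntimeError(f"set {env_name} or pass an API key explicitly")
    return key
```

A caller could then pass `api_key=resolve_api_key("openai")` to `configure_llm`, which keeps secrets out of source files while still failing loudly when nothing is configured.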

examples/README.md

Lines changed: 19 additions & 1 deletion
@@ -20,6 +20,24 @@ python new_api_demo.py
 python llm_demo.py # Requires API key
 ```
 
+## 🔒 Security & Sandbox Examples
+
+**⚠️ Important**: For unknown/suspicious domains, use the sandbox examples to protect your system:
+
+```bash
+# Safe, isolated domain classification
+cd examples/sandbox
+python3 secure_classify.py suspicious-domain.com --text-only
+
+# Interactive secure mode
+python3 secure_classify.py --interactive
+
+# See all sandboxing options
+python3 sandbox_demo.py
+```
+
+See **[`sandbox/`](sandbox/)** directory for complete security examples including Docker isolation, macOS sandboxing, and VM setup guides.
+
 ### LLM Demo Setup
 
 For LLM examples, set your API key:
@@ -34,4 +52,4 @@ export GOOGLE_API_KEY="..." # Google
 Note: These scripts require the piedomains package to be installed:
 ```bash
 pip install -e ..
-```
+```
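
The Security section added to README.md runs the classifier through `docker run ... python -c`. The contents of `secure_classify.py` are not shown here, so the following is a sketch under assumptions rather than that script's logic. `build_sandbox_command` and `CLASSIFY_SNIPPET` are hypothetical names, the image name `piedomains-sandbox` comes from the README, and `-it` is dropped because the sketch targets non-interactive use. It only assembles the command; it does not start Docker.

```python
import shlex

# Inline program to run inside the container. The domain arrives through
# sys.argv instead of string interpolation, so it cannot alter the code.
CLASSIFY_SNIPPET = (
    "import sys; "
    "from piedomains import DomainClassifier; "
    "result = DomainClassifier().classify([sys.argv[1]]); "
    "print(result[['domain', 'pred_label']])"
)


def build_sandbox_command(domain: str, image: str = "piedomains-sandbox") -> list[str]:
    """Build the argv list for a one-off classification inside a container."""
    if not domain or any(ch.isspace() for ch in domain):
        raise ValueError(f"not a plausible domain: {domain!r}")
    return ["docker", "run", "--rm", image, "python", "-c", CLASSIFY_SNIPPET, domain]


if __name__ == "__main__":
    # Print a shell-quoted form for inspection before running it by hand.
    print(shlex.join(build_sandbox_command("example.com")))
```

Passing the result to `subprocess.run` as a list avoids shell quoting issues, which matters when the domain comes from an untrusted source.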
