
Commit 1ca00fd: simplify the API

Parent: 29201d4

25 files changed (+3038, -1330 lines)

README.md

Lines changed: 21 additions & 20 deletions
````diff
@@ -5,12 +5,13 @@
 [![Downloads](https://pepy.tech/badge/piedomains)](https://pepy.tech/project/piedomains)
 [![Documentation](https://img.shields.io/badge/docs-github.io-blue)](https://themains.github.io/piedomains/)
 
-## 🚀 What's New in v0.5.0
+## 🚀 What's New in v0.6.0
 
-- **Playwright Migration**: Complete transition from Selenium to modern Playwright for faster, more reliable web content extraction
-- **12.8x Performance Boost**: Optimized parallel processing (13.2s → 1.0s per domain)
-- **Enhanced Docker Security**: Production-ready containerization with security sandboxing and resource limits
-- **Unified Content Pipeline**: Text and image extraction now use the same Playwright engine for consistency
+- **Streamlined JSON API**: Simple, consistent JSON responses for easy integration with any workflow
+- **Enhanced LLM Support**: Built-in support for OpenAI, Anthropic, and Google AI models with custom category definitions
+- **Advanced Archive Analysis**: Analyze historical website versions from archive.org with intelligent rate limiting
+- **Separated Data Collection**: Collect website content once, run multiple classification approaches (ML + LLM + ensemble)
+- **41 Content Categories**: Comprehensive classification including news, shopping, social media, education, finance, and more
 
 ## Installation
````
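The "(ML + LLM + ensemble)" phrasing in the new bullet list suggests combining per-method predictions into one answer. A toy sketch of such a combination, assuming only the `domain`/`category`/`confidence` record shape the updated README documents; the max-confidence rule and the `method` key are our illustration, not the library's own ensemble:

```python
def ensemble(text_result: dict, image_result: dict) -> dict:
    """Pick the higher-confidence of two per-method predictions.

    Illustrative combination rule; not piedomains' built-in ensemble.
    """
    return max((text_result, image_result), key=lambda r: r["confidence"])

# Hypothetical per-method outputs for the same domain (values are made up).
text_r = {"domain": "cnn.com", "category": "news", "confidence": 0.876, "method": "text"}
image_r = {"domain": "cnn.com", "category": "news", "confidence": 0.640, "method": "images"}

best = ensemble(text_r, image_r)
print(f"{best['domain']}: {best['category']} via {best['method']}")
```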

````diff
@@ -23,17 +24,18 @@ Requires Python 3.11+
 ## Basic Usage
 
 ```python
-from piedomains import DomainClassifier
+from piedomains import DomainClassifier, DataCollector
 
 classifier = DomainClassifier()
-result = classifier.classify(["cnn.com", "amazon.com", "wikipedia.org"])
-print(result[['domain', 'pred_label', 'pred_prob']])
+results = classifier.classify(["cnn.com", "amazon.com", "wikipedia.org"])
+
+for result in results:
+    print(f"{result['domain']}: {result['category']} ({result['confidence']:.3f})")
 
 # Output:
-#           domain pred_label  pred_prob
-# 0        cnn.com       news   0.876543
-# 1     amazon.com   shopping   0.923456
-# 2  wikipedia.org  education   0.891234
+# cnn.com: news (0.876)
+# amazon.com: shopping (0.923)
+# wikipedia.org: education (0.891)
 ```
 
 ## Classification Methods
````
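Because the new API returns plain JSON-style records, results can be post-processed with nothing but the standard library. A minimal sketch, assuming only the `domain`/`category`/`confidence` keys shown in the README diff above; the sample values are illustrative, not real model output:

```python
import json

# Sample records shaped like the README output above; values are made up.
results = [
    {"domain": "cnn.com", "category": "news", "confidence": 0.876},
    {"domain": "amazon.com", "category": "shopping", "confidence": 0.923},
    {"domain": "wikipedia.org", "category": "education", "confidence": 0.891},
]

# Keep only high-confidence predictions and serialize them as JSON lines.
confident = [r for r in results if r["confidence"] >= 0.9]
lines = [json.dumps(r, sort_keys=True) for r in confident]
for line in lines:
    print(line)
```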
````diff
@@ -48,8 +50,10 @@ result = classifier.classify_by_text(["news.google.com"])
 
 # Image-only classification
 result = classifier.classify_by_images(["instagram.com"])
 
-# Batch processing
-results = classifier.classify_batch(domains, method="text", batch_size=50)
+# Batch processing with separated workflow
+collector = DataCollector()
+collection = collector.collect_batch(domains, batch_size=50)
+results = classifier.classify_from_collection(collection, method="text")
 ```
 
 ## Historical Analysis
````
````diff
@@ -60,12 +64,9 @@ old_result = classifier.classify(["facebook.com"], archive_date="20100101")
 
 # Batch processing with archive.org (respects rate limits)
 domains = ["google.com", "wikipedia.org", "cnn.com"]
-historical_results = classifier.classify_batch(
-    domains,
-    archive_date="20050101",
-    method="text",
-    batch_size=10  # Archive.org uses conservative defaults
-)
+collector = DataCollector(archive_date="20050101")
+collection = collector.collect_batch(domains, batch_size=10)  # Archive.org uses conservative defaults
+historical_results = classifier.classify_from_collection(collection, method="text")
 ```
 
 ### Archive.org Rate Limits & Best Practices
````
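The `archive_date` values above are compact `YYYYMMDD` strings. One way to catch malformed dates before any network call is a small stdlib check; this is a sketch, the helper name is ours, and nothing here calls piedomains:

```python
from datetime import datetime


def is_valid_archive_date(value: str) -> bool:
    """Return True if value is a real calendar date in YYYYMMDD form."""
    try:
        datetime.strptime(value, "%Y%m%d")
        return True
    except ValueError:
        return False


print(is_valid_archive_date("20050101"))    # well-formed snapshot date
print(is_valid_archive_date("2005-01-01"))  # dashes are not accepted
```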
Lines changed: 13 additions & 29 deletions
````diff
@@ -1,45 +1,29 @@
-piedomains.classifiers package
-==============================
+piedomains classifiers
+======================
 
-Submodules
-----------
+Domain Classification Modules
+-----------------------------
 
-piedomains.classifiers.combined\_classifier module
---------------------------------------------------
+piedomains.text module
+----------------------
 
-.. automodule:: piedomains.classifiers.combined_classifier
+.. automodule:: piedomains.text
    :members:
    :show-inheritance:
    :undoc-members:
 
-piedomains.classifiers.image\_classifier module
------------------------------------------------
+piedomains.image module
+-----------------------
 
-.. automodule:: piedomains.classifiers.image_classifier
+.. automodule:: piedomains.image
    :members:
    :show-inheritance:
    :undoc-members:
 
-piedomains.classifiers.llm\_classifier module
---------------------------------------------- 
+piedomains.llm module
+---------------------
 
-.. automodule:: piedomains.classifiers.llm_classifier
-   :members:
-   :show-inheritance:
-   :undoc-members:
-
-piedomains.classifiers.text\_classifier module
-----------------------------------------------
-
-.. automodule:: piedomains.classifiers.text_classifier
-   :members:
-   :show-inheritance:
-   :undoc-members:
-
-Module contents
----------------
-
-.. automodule:: piedomains.classifiers
+.. automodule:: piedomains.llm
    :members:
    :show-inheritance:
    :undoc-members:
````

examples/README.md

Lines changed: 46 additions & 15 deletions
````diff
@@ -1,25 +1,56 @@
-# Examples
+# Piedomains Examples
 
-This directory contains example scripts demonstrating piedomains functionality:
+This directory contains examples demonstrating the piedomains library's capabilities.
 
-## Traditional ML Classification
-- `new_api_demo.py`: Modern DomainClassifier API demonstration
-- `archive_demo.py`: Basic archive.org classification demo
-- `archive_functionality_demo.py`: Archive functionality testing
-- `final_archive_demo.py`: Final archive integration test
-- `jupyter_demo.py`: Jupyter notebook demonstration
+## 🚀 Quick Start - New JSON API
 
-## LLM-Powered Classification
-- `llm_demo.py`: LLM-based classification with OpenAI, Anthropic, Google models
+The piedomains library now features a clean JSON-only API that separates data collection from inference:
 
-## Running Examples
+```python
+from piedomains import DomainClassifier
 
-```bash
-cd examples
-python new_api_demo.py
-python llm_demo.py  # Requires API key
+# Simple classification - returns JSON instead of DataFrames
+classifier = DomainClassifier()
+results = classifier.classify(["cnn.com", "github.com"])
+
+for result in results:
+    print(f"{result['domain']}: {result['category']} ({result['confidence']:.3f})")
+    print(f"   Model: {result['model_used']}")
+    print(f"   Data: {result['text_path']}, {result['image_path']}")
+```
+
+## 🔧 Separated Workflow
+
+For advanced use cases, separate data collection from inference:
+
+```python
+from piedomains import DataCollector, DomainClassifier
+
+# Step 1: Collect data (can be reused)
+collector = DataCollector()
+data = collector.collect(["example.com"])
+
+# Step 2: Run inference (try different models on same data)
+classifier = DomainClassifier()
+text_results = classifier.classify_from_collection(data, method="text")
+image_results = classifier.classify_from_collection(data, method="images")
 ```
 
+## 📁 Available Examples
+
+### Core Functionality
+- `json_only_demo.py` - **NEW**: JSON-only API demonstration
+- `separated_workflow_demo.py` - **NEW**: Data collection & inference separation
+- `new_api_demo.py` - Traditional API (now returns JSON)
+- `jupyter_demo.py` - Jupyter notebook examples
+
+### Archive & Historical Analysis
+- `final_archive_demo.py` - Archive.org integration
+- Historical snapshots with `archive_date="20200101"`
+
+### LLM-Powered Classification
+- `llm_demo.py` - LLM-based classification with multiple providers
+
 ## 🔒 Security & Sandbox Examples
 
 **⚠️ Important**: For unknown/suspicious domains, use the sandbox examples to protect your system:
````

examples/json_only_demo.py

Lines changed: 152 additions & 0 deletions
New file:

```python
#!/usr/bin/env python3
"""
Demo of the new JSON-only classification architecture.

This example shows the clean JSON API that replaces DataFrames.
"""

import sys
from pathlib import Path

# Add project root to path
sys.path.insert(0, str(Path(__file__).parent.parent))

try:
    from piedomains import DataCollector, DomainClassifier
except ImportError as e:
    print(f"Import error: {e}")
    print("This demo requires the piedomains package to be installed.")
    sys.exit(1)


def demo_json_api():
    """Demonstrate the new JSON-only API."""
    print("🚀 Piedomains JSON-Only Architecture Demo")
    print("=" * 50)

    # Test domains
    domains = ["example.com", "httpbin.org"]

    # Create classifier
    classifier = DomainClassifier(cache_dir="demo_cache")

    print(f"\n🔤 Testing JSON API with {len(domains)} domains...")
    print("Domains:", domains)

    try:
        # Test the new JSON-only classify method
        results = classifier.classify(domains)

        print("\n✅ Classification complete!")
        print(f"Result type: {type(results)}")
        print(f"Number of results: {len(results)}")

        print("\n📊 Results:")
        for i, result in enumerate(results):
            print(f"\n{i+1}. Domain: {result.get('domain', 'unknown')}")
            print(f"   URL: {result.get('url', 'unknown')}")
            print(f"   Category: {result.get('category', 'unknown')}")
            print(f"   Confidence: {result.get('confidence', 0.0):.3f}")
            print(f"   Model Used: {result.get('model_used', 'unknown')}")
            print(
                f"   Data Collection Time: {result.get('date_time_collected', 'unknown')}"
            )
            print(f"   Text Path: {result.get('text_path', 'none')}")
            print(f"   Image Path: {result.get('image_path', 'none')}")

            if result.get("error"):
                print(f"   ❌ Error: {result['error']}")
            else:
                print("   ✅ Success")

        # Test different classification methods
        print("\n🔤 Testing text-only classification...")
        text_results = classifier.classify_by_text(domains)
        print(f"Text results: {len(text_results)} domains")
        for result in text_results:
            print(
                f"   {result['domain']}: {result.get('category', 'error')} "
                f"({result.get('confidence', 0):.3f}) - {result.get('model_used', 'unknown')}"
            )

        print("\n🖼️ Testing image-only classification...")
        image_results = classifier.classify_by_images(domains)
        print(f"Image results: {len(image_results)} domains")
        for result in image_results:
            print(
                f"   {result['domain']}: {result.get('category', 'error')} "
                f"({result.get('confidence', 0):.3f}) - {result.get('model_used', 'unknown')}"
            )

        # Show JSON structure
        print("\n📋 JSON Schema Example:")
        if results:
            example_result = results[0]
            import json

            print(json.dumps(example_result, indent=2))

        print("\n✅ Demo completed successfully!")
        print("\nKey improvements:")
        print("- 🗂️ Pure JSON output (no pandas dependency)")
        print("- 🔄 Unified data collection → inference pipeline")
        print("- 📁 Clear data file paths for debugging")
        print("- ♻️ Data reuse across multiple classification approaches")
        print("- 🌐 Language-agnostic JSON format")

    except Exception as e:
        print(f"❌ Demo failed: {e}")
        import traceback

        traceback.print_exc()
        print("\nThis is expected if:")
        print("- Dependencies are missing")
        print("- Network is unavailable")
        print("- ML models aren't downloaded")
        print("\nThe demo shows the API structure even without full functionality.")


def demo_separated_workflow():
    """Show the separated data collection and inference workflow."""
    print("\n" + "=" * 50)
    print("🔧 Separated Data Collection & Inference Demo")
    print("=" * 50)

    domains = ["httpbin.org"]

    try:
        print("\n📦 Step 1: Data Collection")
        collector = DataCollector(cache_dir="demo_separated")
        collection_data = collector.collect(domains)

        print("✅ Collection complete!")
        print(f"   Collection ID: {collection_data['collection_id']}")
        print(f"   Successful: {collection_data['summary']['successful']}")
        print(f"   Failed: {collection_data['summary']['failed']}")

        print("\n🧠 Step 2: Classification")
        classifier = DomainClassifier()

        print("Running text classification on collected data...")
        results = classifier.classify_from_collection(collection_data, method="text")

        print("✅ Inference complete!")
        for result in results:
            print(
                f"   {result['domain']}: {result.get('category', 'error')} "
                f"({result.get('confidence', 0):.3f})"
            )

        print("\n♻️ Data Reuse: The same collected data can now be used with:")
        print("   - Different ML model versions")
        print("   - LLM-based classification")
        print("   - Ensemble approaches")
        print("   - External analysis tools")

    except Exception as e:
        print(f"❌ Separated workflow demo failed: {e}")


if __name__ == "__main__":
    demo_json_api()
    demo_separated_workflow()
```
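The demo above reads `collection_id` and a `summary` dict with `successful`/`failed` counts from the collector's output. Those counts can be turned into a quick health check without the library at all; a sketch over a stand-in dict where only the keys the demo accesses are assumed and the values are made up:

```python
# Stand-in for the structure demo_separated_workflow() reads from
# collector.collect(); keys mirror what the demo accesses, values are fake.
collection_data = {
    "collection_id": "demo-0001",
    "summary": {"successful": 2, "failed": 1},
}

summary = collection_data["summary"]
total = summary["successful"] + summary["failed"]
success_rate = summary["successful"] / total if total else 0.0
print(f"{collection_data['collection_id']}: "
      f"{summary['successful']}/{total} collected ({success_rate:.0%})")
```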
