-**piedomains** predicts website content categories using traditional ML models or modern LLMs (GPT-4, Claude, Gemini). Analyze domain names, text content, and homepage screenshots to classify websites as news, shopping, adult content, education, etc. with high accuracy and flexible custom categories.
+Classify website content categories using machine learning models or LLMs (GPT-4, Claude, Gemini).
 
-## 🚀 Quickstart
+## Installation
 
-Install and classify domains in 3 lines:
-
-```python
+```bash
 pip install piedomains
+```
+
+Requires Python 3.11+
+
+## Basic Usage
 
+```python
 from piedomains import DomainClassifier
-classifier = DomainClassifier()
 
-# Classify current content
+classifier = DomainClassifier()
 result = classifier.classify(["cnn.com", "amazon.com", "wikipedia.org"])
-- **Custom Categories**: Define your own classification schemes
-- **Multimodal Analysis**: Combines text + visual understanding
-- **Latest AI**: GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro
-- **Cost Tracking**: Built-in usage monitoring and limits
-- **Flexible Prompts**: Customize instructions for specific use cases
-
-**Supported Providers:**
-- **OpenAI**: GPT-4o, GPT-4-turbo, GPT-3.5-turbo
-- **Anthropic**: Claude 3.5 Sonnet, Claude 3 Opus/Haiku
-- **Google**: Gemini 1.5 Pro, Gemini Pro Vision
-- **Others**: Any litellm-supported model
-
+Set API keys via environment variables:
 ```bash
-# Set API keys via environment variables
 export OPENAI_API_KEY="sk-..."
 export ANTHROPIC_API_KEY="sk-ant-..."
 export GOOGLE_API_KEY="..."
 ```
 
-## 🏷️ Supported Categories
-
-News, Finance, Shopping, Education, Government, Adult Content, Gambling, Social Networks, Search Engines, and 32 more categories based on the Shallalist taxonomy.
-
-## 📈 Performance
+## Categories
 
-- **Speed**: ~10-50 domains/minute (depends on method and network)
-- **Accuracy**: 85-95% depending on content type and method
-- **Memory**: <500MB for batch processing
-- **Caching**: Automatic content caching for faster re-runs
+41 categories: news, finance, shopping, education, government, adult content, gambling, social networks, search engines, and others based on Shallalist taxonomy.
 
-## 🔧 Installation
+## Security
 
-**Requirements**: Python 3.11+
+When analyzing unknown domains, use Docker or isolated environments:
 
 ```bash
-# Basic installation
-pip install piedomains
-
-# For development
-git clone https://github.com/themains/piedomains
-cd piedomains
-pip install -e .
-```
-
-## 💡 API Usage
-
-```python
+docker build -t piedomains-sandbox .
+docker run --rm -it piedomains-sandbox python -c "
 from piedomains import DomainClassifier
 classifier = DomainClassifier()
-result = classifier.classify_by_text(["example.com"])
+result = classifier.classify(['example.com'])
+print(result[['domain', 'pred_label']])
+"
 ```
 
-## 📖 Documentation
+For testing, use known-safe domains: `["wikipedia.org", "github.com", "cnn.com"]`
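
Note on the API key lines in this change: the README asks users to export `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, and `GOOGLE_API_KEY`. A minimal sketch of a pre-flight check that reports which of those variables are unset before a classification run is shown below. It assumes only the three variable names that appear in the diff; the helper `missing_api_keys` is hypothetical and is not part of piedomains, and a given run presumably needs only the key for the provider actually in use.

```python
import os

# Variable names taken from the README change; nothing else is assumed
# about how piedomains reads its configuration.
API_KEY_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY")


def missing_api_keys(env=None):
    """Return the names of API key variables that are unset or empty.

    `env` is any mapping of variable names to values; it defaults to
    the current process environment.
    """
    env = os.environ if env is None else env
    return [name for name in API_KEY_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("All three API keys are set.")
```

Passing a plain dict instead of relying on `os.environ` keeps the check easy to exercise in tests without touching the real environment.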