Commit a53c841

committed
add llm classification
1 parent 00c0849 commit a53c841

26 files changed: 3549 additions & 152 deletions

.github/workflows/python-publish.yml

Lines changed: 6 additions & 3 deletions
@@ -1,4 +1,5 @@
11
# Publish to PyPI when a new release is created or manually triggered
2+
# Uses OpenID Connect trusted publishing for enhanced security
23
# For more information see: https://docs.github.com/en/actions/guides/publishing-python-packages
34
name: python-publish
45
on:
@@ -7,10 +8,11 @@ on:
78
workflow_dispatch: # Enables manual triggering
89
permissions:
910
contents: read
10-
id-token: write
11+
id-token: write # Required for trusted publishing with OIDC
1112
jobs:
1213
publish:
1314
runs-on: ubuntu-latest
15+
environment: pypi # Use PyPI environment for trusted publishing
1416
steps:
1517
- uses: actions/checkout@v4
1618
- name: Set up Python
@@ -25,7 +27,8 @@ jobs:
2527

2628
- name: Build package
2729
run: uv build
30+
2831
- name: Publish to PyPI
2932
uses: pypa/gh-action-pypi-publish@release/v1
30-
with:
31-
password: ${{ secrets.PYPI_API_TOKEN }}
33+
# Trusted publishing with OpenID Connect - no API token needed
34+
# Configured via PyPI project settings: https://pypi.org/manage/account/publishing/

README.md

Lines changed: 57 additions & 2 deletions
@@ -4,7 +4,7 @@
44
[![PyPI Version](https://img.shields.io/pypi/v/piedomains.svg)](https://pypi.python.org/pypi/piedomains)
55
[![Documentation](https://github.com/themains/piedomains/actions/workflows/docs.yml/badge.svg)](https://github.com/themains/piedomains/actions/workflows/docs.yml)
66

7-
**piedomains** predicts website content categories using AI analysis of domain names, text content, and homepage screenshots. Classify domains as news, shopping, adult content, education, etc. with high accuracy.
7+
**piedomains** predicts website content categories using traditional ML models or modern LLMs (GPT-4, Claude, Gemini). Analyze domain names, text content, and homepage screenshots to classify websites as news, shopping, adult content, education, etc. with high accuracy and flexible custom categories.
88

99
## 🚀 Quickstart
1010

@@ -30,10 +30,11 @@ print(result[['domain', 'pred_label', 'pred_prob']])
3030
## 📊 Key Features
3131

3232
- **High Accuracy**: Combines text analysis + visual screenshots for 90%+ accuracy
33+
- **LLM-Powered**: Use GPT-4o, Claude 3.5, Gemini with custom categories and instructions
3334
- **Historical Analysis**: Classify websites from any point in time using archive.org
3435
- **Fast & Scalable**: Batch processing with caching for 1000s of domains
3536
- **Easy Integration**: Modern Python API with pandas output
36-
- **41 Categories**: From news/finance to adult/gambling content
37+
- **Flexible Categories**: 41 default categories or define your own with AI models
3738

3839
## ⚡ Usage Examples
3940

@@ -78,6 +79,60 @@ results = classifier.classify_batch(
7879
)
7980
```
8081

82+
### 🤖 LLM-Powered Classification
83+
84+
Use modern AI models (GPT-4, Claude, Gemini) for flexible, accurate classification:
85+
86+
```python
87+
from piedomains import DomainClassifier
88+
89+
classifier = DomainClassifier()
90+
91+
# Configure your preferred AI provider
92+
classifier.configure_llm(
93+
provider="openai", # openai, anthropic, google
94+
model="gpt-4o", # multimodal model
95+
api_key="sk-...", # or set via environment variable
96+
categories=["news", "shopping", "social", "tech", "education"]
97+
)
98+
99+
# Text-only LLM classification
100+
result = classifier.classify_by_llm(["cnn.com", "github.com"])
101+
102+
# Multimodal classification (text + screenshots)
103+
result = classifier.classify_by_llm_multimodal(["instagram.com"])
104+
105+
# Custom classification instructions
106+
result = classifier.classify_by_llm(
107+
["khanacademy.org", "reddit.com"],
108+
custom_instructions="Classify by educational value: educational, entertainment, mixed"
109+
)
110+
111+
# Track usage and costs
112+
stats = classifier.get_llm_usage_stats()
113+
print(f"API calls: {stats['total_requests']}, Cost: ${stats['estimated_cost_usd']:.4f}")
114+
```
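As a rough illustration of what `categories` and `custom_instructions` could translate to, the sketch below assembles a prompt and validates a JSON reply. The helpers `build_prompt` and `parse_reply`, and the reply schema, are hypothetical and written only for this example; they are not confirmed to match what `piedomains.llm.prompts` or `piedomains.llm.response_parser` actually do.

```python
import json


def build_prompt(domain, categories, custom_instructions=None):
    """Assemble a classification prompt (hypothetical helper, not the piedomains API)."""
    lines = [
        f"Classify the website '{domain}' into exactly one category.",
        "Allowed categories: " + ", ".join(categories),
    ]
    if custom_instructions:
        lines.append("Additional instructions: " + custom_instructions)
    # Assumed reply schema, chosen for this sketch only.
    lines.append('Reply with JSON: {"category": "...", "confidence": 0.0}')
    return "\n".join(lines)


def parse_reply(reply, categories):
    """Return (category, confidence), or None if the reply is malformed
    or names a category outside the allowed list."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or data.get("category") not in categories:
        return None
    try:
        confidence = float(data.get("confidence", 0.0))
    except (TypeError, ValueError):
        return None
    return data["category"], confidence
```

Rejecting replies that name a category outside the configured list keeps free-form model output from leaking into downstream results.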
115+
116+
**LLM Benefits:**
117+
- **Custom Categories**: Define your own classification schemes
118+
- **Multimodal Analysis**: Combines text + visual understanding
119+
- **Latest AI**: GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro
120+
- **Cost Tracking**: Built-in usage monitoring and limits
121+
- **Flexible Prompts**: Customize instructions for specific use cases
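To make the cost-tracking idea concrete, here is a minimal sketch of a tracker that accumulates request counts and an estimated cost and reports when a budget is exceeded. The `UsageTracker` class and the per-1k-token prices passed to it are assumptions for illustration only; the two dictionary keys simply mirror the `stats` keys printed in the usage example above, and the library's real accounting may differ.

```python
from dataclasses import dataclass


@dataclass
class UsageTracker:
    """Hypothetical usage tracker; not the piedomains implementation."""

    max_cost_usd: float
    total_requests: int = 0
    estimated_cost_usd: float = 0.0

    def record(self, prompt_tokens, completion_tokens,
               usd_per_1k_prompt, usd_per_1k_completion):
        """Add one completed request to the running totals."""
        cost = (prompt_tokens * usd_per_1k_prompt
                + completion_tokens * usd_per_1k_completion) / 1000
        self.total_requests += 1
        self.estimated_cost_usd += cost

    @property
    def over_limit(self):
        """True once the accumulated estimate exceeds the budget."""
        return self.estimated_cost_usd > self.max_cost_usd

    def stats(self):
        return {
            "total_requests": self.total_requests,
            "estimated_cost_usd": self.estimated_cost_usd,
        }
```

A caller would check `over_limit` before sending the next batch, so a long run stops at the budget rather than discovering the overrun afterwards.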
122+
123+
**Supported Providers:**
124+
- **OpenAI**: GPT-4o, GPT-4-turbo, GPT-3.5-turbo
125+
- **Anthropic**: Claude 3.5 Sonnet, Claude 3 Opus/Haiku
126+
- **Google**: Gemini 1.5 Pro, Gemini Pro Vision
127+
- **Others**: Any litellm-supported model
128+
129+
```bash
130+
# Set API keys via environment variables
131+
export OPENAI_API_KEY="sk-..."
132+
export ANTHROPIC_API_KEY="sk-ant-..."
133+
export GOOGLE_API_KEY="..."
134+
```
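The fallback from an explicit `api_key` to an environment variable might look like the sketch below. The variable names come from the export lines above; the `resolve_api_key` helper and its error messages are hypothetical, and the library may resolve keys differently (for example by delegating to litellm).

```python
import os

# Environment variable names taken from the export lines above.
ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def resolve_api_key(provider, explicit_key=None):
    """Prefer an explicitly passed key, otherwise read the provider's
    environment variable (hypothetical helper for illustration)."""
    if explicit_key:
        return explicit_key
    if provider not in ENV_VARS:
        raise ValueError(f"unknown provider: {provider!r}")
    var = ENV_VARS[provider]
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"set {var} or pass api_key explicitly")
    return key
```

Reading the key from the environment keeps secrets out of source files and notebooks.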
135+
81136
## 🏷️ Supported Categories
82137

83138
News, Finance, Shopping, Education, Government, Adult Content, Gambling, Social Networks, Search Engines, and 32 more categories based on the Shallalist taxonomy.

TRUSTED_PUBLISHING_FIX.md

Lines changed: 111 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,111 @@
1+
# Trusted Publishing Configuration Fix
2+
3+
## Problem Analysis
4+
5+
The error shows:
6+
- Repository: `themains/know-your-ip`
7+
- Environment: `MISSING` ← **This is the key issue**
8+
- Branch: `master`
9+
- Workflow: `python-publish.yml`
10+
11+
## Required Actions
12+
13+
### 1. Update Workflow File in `themains/know-your-ip`
14+
15+
Replace the `python-publish.yml` file in `themains/know-your-ip` with this exact content:
16+
17+
```yaml
18+
# Publish to PyPI when a new release is created or manually triggered
19+
# Uses OpenID Connect trusted publishing for enhanced security
20+
name: python-publish
21+
22+
on:
23+
release:
24+
types: [published]
25+
workflow_dispatch: # Enables manual triggering
26+
27+
permissions:
28+
contents: read
29+
id-token: write # Required for trusted publishing with OIDC
30+
31+
jobs:
32+
publish:
33+
runs-on: ubuntu-latest
34+
environment: pypi # ← THIS IS CRITICAL - currently missing!
35+
36+
steps:
37+
- uses: actions/checkout@v4
38+
39+
- name: Set up Python
40+
uses: actions/setup-python@v5
41+
with:
42+
python-version: '3.11'
43+
44+
- name: Install uv
45+
uses: astral-sh/setup-uv@v3
46+
with:
47+
version: "latest"
48+
49+
- name: Build package
50+
run: uv build
51+
52+
- name: Publish to PyPI
53+
uses: pypa/gh-action-pypi-publish@release/v1
54+
# Trusted publishing - no API token needed
55+
```
56+
57+
### 2. Configure Trusted Publisher on PyPI
58+
59+
Go to your PyPI project (know-your-ip) settings:
60+
61+
1. **Visit**: https://pypi.org/manage/project/know-your-ip/settings/publishing/
62+
2. **Add trusted publisher** with these **EXACT** values:
63+
64+
- **Repository owner**: `themains`
65+
- **Repository name**: `know-your-ip`
66+
- **Workflow filename**: `python-publish.yml`
67+
- **Environment name**: `pypi`
68+
69+
### 3. Create PyPI Environment in GitHub
70+
71+
In the `themains/know-your-ip` repository:
72+
73+
1. Go to **Settings** → **Environments**
74+
2. Click **New environment**
75+
3. Name it: `pypi`
76+
4. (Optional) Add protection rules like requiring reviews
77+
78+
### 4. Test the Configuration
79+
80+
1. Create a test release in `themains/know-your-ip`
81+
2. Check the workflow logs for:
82+
```
83+
* `environment`: `pypi` # Should no longer be MISSING
84+
```
85+
3. Verify successful publication
86+
87+
## Verification Checklist
88+
89+
- [ ] Workflow file updated with `environment: pypi`
90+
- [ ] PyPI trusted publisher configured for `themains/know-your-ip`
91+
- [ ] GitHub environment `pypi` created
92+
- [ ] Test release triggers workflow successfully
93+
- [ ] Environment claim appears in logs (not MISSING)
94+
95+
## Security Benefits
96+
97+
- **No API tokens to manage**
98+
- **Automatic token rotation**
99+
- **Audit trail through OIDC**
100+
- **Reduced credential exposure**
101+
102+
## Troubleshooting
103+
104+
If you still get errors:
105+
106+
1. **Double-check repository names** (themains/know-your-ip)
107+
2. **Verify branch name** (master vs main)
108+
3. **Confirm PyPI project name** matches exactly
109+
4. **Check environment name** is exactly `pypi`
110+
111+
The key fix is adding `environment: pypi` to the workflow, which will make the environment claim appear in the OIDC token instead of being `MISSING`.

docs/source/conf.py

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -58,7 +58,8 @@
5858
'nltk',
5959
'scikit-learn',
6060
'sklearn',
61-
'joblib'
61+
'joblib',
62+
'litellm'
6263
]
6364

6465
# Source file configuration

docs/source/piedomains.classifiers.rst

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -20,6 +20,14 @@ piedomains.classifiers.image\_classifier module
2020
:show-inheritance:
2121
:undoc-members:
2222

23+
piedomains.classifiers.llm\_classifier module
24+
---------------------------------------------
25+
26+
.. automodule:: piedomains.classifiers.llm_classifier
27+
:members:
28+
:show-inheritance:
29+
:undoc-members:
30+
2331
piedomains.classifiers.text\_classifier module
2432
----------------------------------------------
2533

docs/source/piedomains.llm.rst

Lines changed: 37 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,37 @@
1+
piedomains.llm package
2+
======================
3+
4+
Submodules
5+
----------
6+
7+
piedomains.llm.config module
8+
----------------------------
9+
10+
.. automodule:: piedomains.llm.config
11+
:members:
12+
:show-inheritance:
13+
:undoc-members:
14+
15+
piedomains.llm.prompts module
16+
-----------------------------
17+
18+
.. automodule:: piedomains.llm.prompts
19+
:members:
20+
:show-inheritance:
21+
:undoc-members:
22+
23+
piedomains.llm.response\_parser module
24+
--------------------------------------
25+
26+
.. automodule:: piedomains.llm.response_parser
27+
:members:
28+
:show-inheritance:
29+
:undoc-members:
30+
31+
Module contents
32+
---------------
33+
34+
.. automodule:: piedomains.llm
35+
:members:
36+
:show-inheritance:
37+
:undoc-members:

docs/source/piedomains.rst

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,7 @@ Subpackages
88
:maxdepth: 4
99

1010
piedomains.classifiers
11+
piedomains.llm
1112
piedomains.processors
1213

1314
Submodules

examples/README.md

Lines changed: 18 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,18 +2,35 @@
22

33
This directory contains example scripts demonstrating piedomains functionality:
44

5+
## Traditional ML Classification
6+
- `new_api_demo.py`: Modern DomainClassifier API demonstration
57
- `archive_demo.py`: Basic archive.org classification demo
68
- `archive_functionality_demo.py`: Archive functionality testing
79
- `final_archive_demo.py`: Final archive integration test
810
- `jupyter_demo.py`: Jupyter notebook demonstration
911

12+
## LLM-Powered Classification
13+
- `llm_demo.py`: LLM-based classification with OpenAI, Anthropic, Google models
14+
1015
## Running Examples
1116

1217
```bash
1318
cd examples
14-
python archive_demo.py
19+
python new_api_demo.py
20+
python llm_demo.py # Requires API key
21+
```
22+
23+
### LLM Demo Setup
24+
25+
For LLM examples, set your API key:
26+
```bash
27+
export OPENAI_API_KEY="sk-..." # OpenAI
28+
export ANTHROPIC_API_KEY="sk-ant-..." # Anthropic
29+
export GOOGLE_API_KEY="..." # Google
1530
```
1631

32+
## Installation
33+
1734
Note: These scripts require the piedomains package to be installed:
1835
```bash
1936
pip install -e ..
