Commit 3349d8c

ci/cd
1 parent a3dda9a commit 3349d8c

4 files changed

+299
-75
lines changed

README.rst

Lines changed: 14 additions & 13 deletions
@@ -12,7 +12,7 @@ piedomains: predict the kind of content hosted by a domain based on domain name
 .. image:: https://static.pepy.tech/badge/piedomains
     :target: https://pepy.tech/project/piedomains

-The package infers the kind of content hosted by a domain using the domain name, the textual content, and the screenshot of the homepage.
+The package infers the kind of content hosted by a domain using the domain name or full URL, the textual content, and the screenshot of the homepage.

 We use domain category labels from `Shallalist <https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ZXTQ7V>`__ and build our own training dataset by scraping and taking screenshots of the homepage. The final dataset used to train the model is posted on the `Harvard Dataverse <https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ZXTQ7V>`__. Python notebooks used to build the models can be found `here <https://github.com/themains/piedomains/tree/55cd5ea68ccec58ab2152c5f1d6fb9e6cf5df363/piedomains/notebooks>`__ and the model files can be found `here <https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/YHWCDC>`__

@@ -31,15 +31,15 @@ General API

 - What it does:

-  - Predicts the kind of content hosted by a domain based on the domain name and the HTML of the homepage.
-  - The function can use locally stored HTML files or fetch fresh HTML files.
-  - If you specify a local folder, the function will look for HTML files corresponding to the domain.
+  - Predicts the kind of content hosted by a domain based on the domain name or full URL and the HTML content.
+  - The function can use locally stored HTML files or fetch fresh HTML files from the specified URLs.
+  - If you specify a local folder, the function will look for HTML files corresponding to the domain name.
   - The HTML files must be stored as `domainname.html`.
   - The function returns a pandas dataframe with predicted labels and corresponding probabilities.

 - Inputs:

-  - `input`: list of domains. Either `input` or `html_path` must be specified.
+  - `input`: list of URLs or domain names. Either `input` or `html_path` must be specified.
   - `html_path`: path to the folder where the HTMLs are stored. Either `input` or `html_path` must be specified.
   - `latest`: use the latest model. The default is `True.`
   - Note: The function will by default look for a `html` folder on the same level as model files.
@@ -52,20 +52,21 @@ General API
   ::

     from piedomains import domain
-    domains = [
+    # URLs and domains can be mixed
+    inputs = [
         "forbes.com",
-        "xvideos.com",
+        "https://xvideos.com",
         "last.fm",
-        "facebook.com",
+        "https://facebook.com/news",
         "bellesa.co",
-        "marketwatch.com"
+        "https://marketwatch.com/investing"
     ]
-    # with only domains
-    result = domain.pred_shalla_cat_with_text(domains)
+    # with URLs/domains
+    result = domain.pred_shalla_cat_with_text(inputs)
     # with html path where htmls are stored (offline mode)
     result = domain.pred_shalla_cat_with_text(html_path="path/to/htmls")
-    # with domains and html path, html_path will be used to store htmls
-    result = domain.pred_shalla_cat_with_text(domains, html_path="path/to/htmls")
+    # with URLs/domains and html path, html_path will be used to store htmls
+    result = domain.pred_shalla_cat_with_text(inputs, html_path="path/to/htmls")
     print(result)
 - Sample output:
   ::

piedomains/domain.py

Lines changed: 2 additions & 2 deletions
@@ -13,9 +13,9 @@


 def main(argv=sys.argv[1:]):
-    title = "Predict the category of the domain using the content of the domain and the screenshot of the homepage"
+    title = "Predict the category of URLs or domains using content and homepage screenshots"
     parser = argparse.ArgumentParser(description=title)
-    parser.add_argument("--input", default=None, help="Domain name to classify")
+    parser.add_argument("--input", default=None, help="URL or domain name to classify (e.g., 'example.com' or 'https://example.com/page')")
     args = parser.parse_args(argv)
     print(args)
     if not args.input:
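
The README change states that URLs and bare domain names can be mixed in `input`, but the hunks shown here do not include the code that handles them. As a rough illustration only, the sketch below shows one way mixed inputs could be reduced to a domain name using the standard library. `extract_domain` is a hypothetical helper introduced for this example; it is not confirmed to exist in piedomains, and the package may handle URLs differently (for instance, by keeping the path so that the specific page is fetched).

```python
from urllib.parse import urlparse


def extract_domain(value: str) -> str:
    """Hypothetical helper (not part of piedomains): reduce a URL or a
    bare domain name to its lowercase host name."""
    value = value.strip()
    # urlparse only populates the host when a scheme is present, so bare
    # domains such as "forbes.com" get a placeholder scheme first.
    if "://" not in value:
        value = "http://" + value
    # .hostname is already lowercased and excludes any port or credentials.
    return urlparse(value).hostname or ""


# Mixed inputs in the style of the README example.
inputs = [
    "forbes.com",
    "https://facebook.com/news",
    "https://marketwatch.com/investing",
]
domains = [extract_domain(item) for item in inputs]
print(domains)
```

A helper along these lines would also give a stable key for the `domainname.html` file naming that the README describes, though whether the package derives file names this way is an assumption.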
