-The package infers the kind of content hosted by a domain using the domain name, the textual content, and the screenshot of the homepage.
+The package infers the kind of content hosted by a domain using the domain name or full URL, the textual content, and the screenshot of the homepage.

 We use domain category labels from `Shallalist <https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ZXTQ7V>`__ and build our own training dataset by scraping and taking screenshots of the homepage. The final dataset used to train the model is posted on the `Harvard Dataverse <https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ZXTQ7V>`__. Python notebooks used to build the models can be found `here <https://github.com/themains/piedomains/tree/55cd5ea68ccec58ab2152c5f1d6fb9e6cf5df363/piedomains/notebooks>`__ and the model files can be found `here <https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/YHWCDC>`__.

@@ -31,15 +31,15 @@ General API
 - What it does:

-  - Predicts the kind of content hosted by a domain based on the domain name and the HTML of the homepage.
-  - The function can use locally stored HTML files or fetch fresh HTML files.
-  - If you specify a local folder, the function will look for HTML files corresponding to the domain.
+  - Predicts the kind of content hosted by a domain based on the domain name or full URL and the HTML content.
+  - The function can use locally stored HTML files or fetch fresh HTML files from the specified URLs.
+  - If you specify a local folder, the function will look for HTML files corresponding to the domain name.
   - The HTML files must be stored as `domainname.html`.
   - The function returns a pandas dataframe with predicted labels and corresponding probabilities.

 - Inputs:

-  - `input`: list of domains. Either `input` or `html_path` must be specified.
+  - `input`: list of URLs or domain names. Either `input` or `html_path` must be specified.
   - `html_path`: path to the folder where the HTMLs are stored. Either `input` or `html_path` must be specified.
   - `latest`: use the latest model. The default is `True`.
   - Note: The function will by default look for a `html` folder on the same level as model files.
@@ -52,20 +52,21 @@ General API
 ::

     from piedomains import domain
-    domains = [
+    # URLs and domains can be mixed
+    inputs = [
         "forbes.com",
-        "xvideos.com",
+        "https://xvideos.com",
         "last.fm",
-        "facebook.com",
+        "https://facebook.com/news",
         "bellesa.co",
-        "marketwatch.com"
+        "https://marketwatch.com/investing"
     ]
-    # with only domains
-    result = domain.pred_shalla_cat_with_text(domains)
+    # with URLs/domains
+    result = domain.pred_shalla_cat_with_text(inputs)
     # with html path where htmls are stored (offline mode)
     result = domain.pred_shalla_cat_with_text(html_path="path/to/htmls")
-    # with domains and html path, html_path will be used to store htmls
-    result = domain.pred_shalla_cat_with_text(domains, html_path="path/to/htmls")
+    # with URLs/domains and html path, html_path will be used to store htmls
+    result = domain.pred_shalla_cat_with_text(inputs, html_path="path/to/htmls")
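
The README text in this diff says local HTML files must be stored as `domainname.html`, but it does not say how a full URL such as `https://facebook.com/news` maps to a file name. The sketch below shows one possible way to check, before running in offline mode, which inputs already have a local file. It assumes that `domainname` means the host part of the input; that is an assumption, not something the diff confirms. `domain_of`, `expected_html_file`, and `missing_html` are hypothetical helpers and are not part of piedomains.

```python
from pathlib import Path
from urllib.parse import urlparse


def domain_of(entry: str) -> str:
    """Return the host part of a URL or bare domain (hypothetical helper)."""
    entry = entry.strip()
    # urlparse only fills in the host when a scheme or a leading "//" is present,
    # so bare domains such as "forbes.com" get a "//" prefix first.
    parsed = urlparse(entry if "://" in entry else "//" + entry)
    return parsed.hostname or ""


def expected_html_file(entry: str, html_path: str) -> Path:
    """Path where the HTML for `entry` would live, ASSUMING host-based naming."""
    return Path(html_path) / f"{domain_of(entry)}.html"


def missing_html(entries, html_path):
    """List the inputs that have no local HTML file under `html_path`."""
    return [e for e in entries if not expected_html_file(e, html_path).is_file()]


if __name__ == "__main__":
    inputs = ["forbes.com", "https://facebook.com/news"]
    for item in inputs:
        print(item, "->", expected_html_file(item, "path/to/htmls"))
```

If the package names files differently for URLs with a path, only `domain_of` would need to change; the other two helpers depend on it alone.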