Commit 6e49ea5

Reorganize, dedupe, and shorten README (#349)
* Separate how-to vs. explanation, in README
* Separate how-to vs. explanation, in docstrings

Overhauls the README, nudging it a couple steps in the direction of Diátaxis technical documentation principles.
1 parent 25643e4 commit 6e49ea5

File tree

2 files changed: +161, -220 lines changed

README.md

Lines changed: 105 additions & 186 deletions
@@ -3,250 +3,185 @@
 `tldextract` accurately separates a URL's subdomain, domain, and public suffix,
 using [the Public Suffix List (PSL)](https://publicsuffix.org).
 
-Say you want just the "google" part of https://www.google.com. *Everybody gets
-this wrong.* Splitting on the "." and taking the 2nd-to-last element only works
-for simple domains, e.g. .com. Consider
-[http://forums.bbc.co.uk](http://forums.bbc.co.uk): the naive splitting method
-will give you "co" as the domain, instead of "bbc".
+**Why?** Naive URL parsing like splitting on dots fails for domains like
+`forums.bbc.co.uk` (gives "co" instead of "bbc"). `tldextract` handles the edge
+cases, so you don't have to.
 
-Rather than juggle TLDs,
-gTLDs, ccTLDs, and their exceptions yourself, `tldextract` extracts the currently living public
-suffixes according to [the Public Suffix List](https://publicsuffix.org).
-You can optionally support the Public Suffix List's [private
-domains](#public-vs-private-domains) as well.
-
-> A "public suffix" is one under which Internet users can directly register
-> names.
-
-A public suffix is also sometimes called an effective TLD (eTLD).
-
-## Usage
+## Quick Start
 
 ```python
 >>> import tldextract
 
 >>> tldextract.extract('http://forums.news.cnn.com/')
 ExtractResult(subdomain='forums.news', domain='cnn', suffix='com', is_private=False)
 
->>> tldextract.extract('http://forums.bbc.co.uk/') # United Kingdom
+>>> tldextract.extract('http://forums.bbc.co.uk/')
 ExtractResult(subdomain='forums', domain='bbc', suffix='co.uk', is_private=False)
 
->>> tldextract.extract('http://www.worldbank.org.kg/') # Kyrgyzstan
-ExtractResult(subdomain='www', domain='worldbank', suffix='org.kg', is_private=False)
-```
-
-Note subdomain and suffix are _optional_. Not all URL-like inputs have a
-subdomain or a valid suffix.
-
-```python
->>> tldextract.extract('google.com')
-ExtractResult(subdomain='', domain='google', suffix='com', is_private=False)
-
->>> tldextract.extract('google.notavalidsuffix')
-ExtractResult(subdomain='google', domain='notavalidsuffix', suffix='', is_private=False)
-
->>> tldextract.extract('http://127.0.0.1:8080/deployed/')
-ExtractResult(subdomain='', domain='127.0.0.1', suffix='', is_private=False)
-```
-
-To rejoin the original hostname, if it was indeed a valid, registered hostname:
-
-```python
+>>> # Access the parts you need
 >>> ext = tldextract.extract('http://forums.bbc.co.uk')
+>>> ext.domain
+'bbc'
 >>> ext.top_domain_under_public_suffix
 'bbc.co.uk'
 >>> ext.fqdn
 'forums.bbc.co.uk'
 ```

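The examples deleted above remain instructive: subdomain and suffix are optional, and not every URL-like input has both. A minimal sketch of those cases, using only calls and outputs shown in this diff:

```python
import tldextract

# No subdomain: only domain and suffix are present.
tldextract.extract('google.com')
# ExtractResult(subdomain='', domain='google', suffix='com', is_private=False)

# Unrecognized suffix: the suffix comes back empty.
tldextract.extract('google.notavalidsuffix')
# ExtractResult(subdomain='google', domain='notavalidsuffix', suffix='', is_private=False)

# IP addresses have neither subdomain nor suffix.
tldextract.extract('http://127.0.0.1:8080/deployed/')
# ExtractResult(subdomain='', domain='127.0.0.1', suffix='', is_private=False)
```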
-In addition to the Python interface, there is a command-line interface. Split
-the URL components by space:
-
-```zsh
-$ tldextract 'http://forums.bbc.co.uk'
-forums bbc co.uk
-```
-
 ## Install
 
-Latest release on PyPI:
-
 ```zsh
 pip install tldextract
 ```
 
-Or the latest dev version:
+## How-to Guides
 
-```zsh
-pip install -e 'git://github.com/john-kurkowski/tldextract.git#egg=tldextract'
-```
-
-## Note about caching
+### How to disable HTTP suffix list fetching for production
 
-Beware when first calling `tldextract`, it updates its TLD list with a live HTTP
-request. This updated TLD set is usually cached indefinitely in `$HOME/.cache/python-tldextract`.
-To control the cache's location, set the `TLDEXTRACT_CACHE` environment variable or set the
-`cache_dir` path when constructing a `TLDExtract`.
+```python
+no_fetch_extract = tldextract.TLDExtract(suffix_list_urls=())
+no_fetch_extract('http://www.google.com')
+```

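Per the comment deleted from the old snippet, an extractor built with empty `suffix_list_urls` falls back to the TLD snapshot bundled with the package rather than fetching over HTTP. A minimal usage sketch:

```python
import tldextract

# No URLs to fetch: the extractor relies on the bundled TLD snapshot
# instead of making a live HTTP request on first use.
no_fetch_extract = tldextract.TLDExtract(suffix_list_urls=())

result = no_fetch_extract('http://www.google.com')
print(result.suffix)  # 'com'
```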
-(Arguably runtime bootstrapping like that shouldn't be the default behavior,
-like for production systems. But I want you to have the latest TLDs, especially
-when I haven't kept this code up to date.)
+### How to set a custom cache location
 
+Via environment variable:
 
 ```python
-# extract callable that falls back to the included TLD snapshot, no live HTTP fetching
-no_fetch_extract = tldextract.TLDExtract(suffix_list_urls=())
-no_fetch_extract('http://www.google.com')
+export TLDEXTRACT_CACHE="/path/to/cache"
+```
 
-# extract callable that reads/writes the updated TLD set to a different path
-custom_cache_extract = tldextract.TLDExtract(cache_dir='/path/to/your/cache/')
-custom_cache_extract('http://www.google.com')
+Or in code:
 
-# extract callable that doesn't use caching
-no_cache_extract = tldextract.TLDExtract(cache_dir=None)
-no_cache_extract('http://www.google.com')
+```python
+custom_cache_extract = tldextract.TLDExtract(cache_dir='/path/to/cache/')
 ```

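The old snippet removed here also showed turning caching off entirely. Assuming `cache_dir=None` is still supported (this commit only touches documentation), a minimal sketch:

```python
import tldextract

# cache_dir=None disables reading and writing the cached suffix list.
no_cache_extract = tldextract.TLDExtract(cache_dir=None)
no_cache_extract('http://www.google.com')
```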
-If you want to stay fresh with the TLD definitions--though they don't change
-often--delete the cache file occasionally, or run
+### How to update TLD definitions
+
+Command line:
 
 ```zsh
 tldextract --update
 ```
 
-or:
+Or delete the cache folder:
 
 ```zsh
-env TLDEXTRACT_CACHE="~/tldextract.cache" tldextract --update
+rm -rf $HOME/.cache/python-tldextract
 ```

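The same refresh can be triggered from Python. A sketch assuming the `TLDExtract.update()` method that backs the CLI's `--update` flag; the method and its `fetch_now` parameter are not shown in this README, so treat them as assumptions:

```python
import tldextract

extractor = tldextract.TLDExtract()

# Assumed API: clear the cached suffix list and fetch a fresh copy now.
extractor.update(fetch_now=True)
```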
-It is also recommended to delete the file after upgrading this lib.
-
-## Advanced usage
-
-### Public vs. private domains
-
-The PSL [maintains a concept of "private"
-domains](https://publicsuffix.org/list/).
-
-> PRIVATE domains are amendments submitted by the domain holder, as an
-> expression of how they operate their domain security policy. … While some
-> applications, such as browsers when considering cookie-setting, treat all
-> entries the same, other applications may wish to treat ICANN domains and
-> PRIVATE domains differently.
-
-By default, `tldextract` treats public and private domains the same.
+### How to treat private domains as suffixes
 
 ```python
->>> extract = tldextract.TLDExtract()
->>> extract('waiterrant.blogspot.com')
-ExtractResult(subdomain='waiterrant', domain='blogspot', suffix='com', is_private=False)
+extract = tldextract.TLDExtract(include_psl_private_domains=True)
+extract('waiterrant.blogspot.com')
+# ExtractResult(subdomain='', domain='waiterrant', suffix='blogspot.com', is_private=True)
 ```

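The deleted examples also showed the same opt-in on a per-call basis, leaving the extractor's default untouched. A minimal sketch based on those removed lines:

```python
import tldextract

extract = tldextract.TLDExtract()

# Opt into private domains for this call only.
extract('waiterrant.blogspot.com', include_psl_private_domains=True)
# ExtractResult(subdomain='', domain='waiterrant', suffix='blogspot.com', is_private=True)
```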
-The following overrides this.
+### How to use a local suffix list
 
 ```python
->>> extract = tldextract.TLDExtract()
->>> extract('waiterrant.blogspot.com', include_psl_private_domains=True)
-ExtractResult(subdomain='', domain='waiterrant', suffix='blogspot.com', is_private=True)
+extract = tldextract.TLDExtract(
+    suffix_list_urls=["file:///path/to/your/list.dat"],
+    cache_dir='/path/to/cache/',
+    fallback_to_snapshot=False)
 ```

-To change the default for all extract calls:
+### How to use a remote suffix list
 
 ```python
->>> extract = tldextract.TLDExtract(include_psl_private_domains=True)
->>> extract('waiterrant.blogspot.com')
-ExtractResult(subdomain='', domain='waiterrant', suffix='blogspot.com', is_private=True)
+extract = tldextract.TLDExtract(
+    suffix_list_urls=["https://myserver.com/suffix-list.dat"])
 ```

-The thinking behind the default is, it's the more common case when people
-mentally parse a domain name. It doesn't assume familiarity with the PSL nor
-that the PSL makes a public/private distinction. Note this default may run
-counter to the default parsing behavior of other, PSL-based libraries.
-
-### Specifying your own URL or file for Public Suffix List data
-
-You can specify your own input data in place of the default Mozilla Public Suffix List:
+### How to add extra suffixes
 
 ```python
 extract = tldextract.TLDExtract(
-    suffix_list_urls=["http://foo.bar.baz"],
-    # Recommended: Specify your own cache file, to minimize ambiguities about where
-    # tldextract is getting its data, or cached data, from.
-    cache_dir='/path/to/your/cache/',
-    fallback_to_snapshot=False)
+    extra_suffixes=["foo", "bar.baz"])
 ```

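The deleted prose clarified that `extra_suffixes` entries are merged into whatever public suffix definitions are already in use. A sketch of the effect, where `bar.baz` is the placeholder suffix from the snippet above and `somehost` is a hypothetical hostname:

```python
import tldextract

extract = tldextract.TLDExtract(extra_suffixes=["foo", "bar.baz"])

# "bar.baz" now behaves like any other public suffix.
result = extract('somehost.bar.baz')
print(result.domain, result.suffix)  # expected: somehost bar.baz
```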
-If the cached version of public suffix definitions doesn't exist, such as on
-the first run, the above snippet will request the URLs you specified in order,
-and use the first successful response.
-
-If you want to use input data from your local filesystem, use the `file://`
-protocol with an absolute path:
+### How to validate URLs before extraction
 
 ```python
-extract = tldextract.TLDExtract(
-    suffix_list_urls=["file://" + "/absolute/path/to/your/local/suffix/list/file"],
-    cache_dir='/path/to/your/cache/',
-    fallback_to_snapshot=False)
+from urllib.parse import urlsplit
+
+split_url = urlsplit("https://example.com:8080/path")
+result = tldextract.extract_urllib(split_url)
```
194-
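The FAQ snippet deleted further down showed why this pattern matters: parse the string once, then reuse the parts for both extraction and rebuilding a URL. A sketch adapted from those removed lines, with imports added:

```python
from urllib.parse import urlsplit

import tldextract

extractor = tldextract.TLDExtract()

# Parse once, extract from the parsed result, then rejoin the pieces.
split_url = urlsplit("https://foo.bar.com:8080")
split_suffix = extractor.extract_urllib(split_url)
url_to_crawl = f"{split_url.scheme}://{split_suffix.top_domain_under_public_suffix}:{split_url.port}"
# 'https://bar.com:8080'
```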
This also works via command line update:
114+
## Command Line
195115

196116
```zsh
197-
tldextract --update --suffix_list_url "http://foo.bar.baz"
117+
$ tldextract http://forums.bbc.co.uk
118+
forums bbc co.uk
119+
120+
$ tldextract --update # Update cached suffix list
121+
$ tldextract --help # See all options
198122
```
199123

200-
Using your own URLs could be useful in production when you don't want the delay
201-
with updating the suffix list on first use, or if you are behind a complex
202-
firewall.
124+
## Understanding Domain Parsing
125+
126+
### Public Suffix List
127+
128+
`tldextract` uses the [Public Suffix List](https://publicsuffix.org), a
129+
community-maintained list of domain suffixes. The PSL contains both:
130+
131+
- **Public suffixes**: Where anyone can register a domain (`.com`, `.co.uk`,
132+
`.org.kg`)
133+
- **Private suffixes**: Operated by companies for customer subdomains
134+
(`blogspot.com`, `github.io`)
135+
136+
Web browsers use this same list for security decisions like cookie scoping.
137+
138+
### Suffix vs. TLD
139+
140+
While `.com` is a top-level domain (TLD), many suffixes like `.co.uk` are
141+
technically second-level. The PSL uses "public suffix" to cover both.
203142

204-
You can also specify additional suffixes in the `extra_suffixes` param. These
205-
will be merged into whatever public suffix definitions are already in use by
206-
`tldextract`.
143+
### Default behavior with private domains
144+
145+
By default, `tldextract` treats private suffixes as regular domains:
207146

208147
```python
209-
extract = tldextract.TLDExtract(
210-
extra_suffixes=["foo", "bar", "baz"])
148+
>>> tldextract.extract('waiterrant.blogspot.com')
149+
ExtractResult(subdomain='waiterrant', domain='blogspot', suffix='com', is_private=False)
211150
```
212151

213-
## FAQ
152+
To treat them as suffixes instead, see
153+
[How to treat private domains as suffixes](#how-to-treat-private-domains-as-suffixes).
214154

215-
### Can you add suffix \_\_\_\_? Can you make an exception for domain \_\_\_\_?
155+
### Caching behavior
216156

217-
This project doesn't contain an actual list of public suffixes. That comes from
218-
[the Public Suffix List (PSL)](https://publicsuffix.org/). Submit amendments there.
157+
By default, `tldextract` fetches the latest Public Suffix List on first use and
158+
caches it indefinitely in `$HOME/.cache/python-tldextract`.
219159

220-
In the meantime, you can tell tldextract about your exception by either
221-
forking the PSL and using your fork in the `suffix_list_urls` param, or adding
222-
your suffix piecemeal with the `extra_suffixes` param.
160+
### URL validation
223161

224-
### I see my suffix in [the Public Suffix List (PSL)](https://publicsuffix.org/), but this library doesn't extract it.
162+
`tldextract` accepts any string and is very lenient. It prioritizes ease of use
163+
over strict validation, extracting domains from any string, even partial URLs or
164+
non-URLs.
225165

226-
Check if your suffix is in the private section of the list. See [this
227-
documentation](#public-vs-private-domains).
166+
## FAQ
228167

229-
### If I pass an invalid URL, I still get a result, no error. What gives?
168+
### Can you add/remove suffix \_\_\_\_?
230169

231-
To keep `tldextract` light in LoC & overhead, and because there are plenty of
232-
URL validators out there, this library is very lenient with input. If valid
233-
URLs are important to you, validate them before calling `tldextract`.
170+
`tldextract` doesn't maintain the suffix list. Submit changes to
171+
[the Public Suffix List](https://publicsuffix.org/submit/).
234172

235-
To avoid parsing a string twice, you can pass `tldextract` the output of
236-
[`urllib.parse`](https://docs.python.org/3/library/urllib.parse.html) methods.
237-
For example:
173+
Meanwhile, use the `extra_suffixes` parameter, or fork the PSL and pass it to
174+
this library with the `suffix_list_urls` parameter.
238175

239-
```py
240-
extractor = TLDExtract()
241-
split_url = urllib.parse.urlsplit("https://foo.bar.com:8080")
242-
split_suffix = extractor.extract_urllib(split_url)
243-
url_to_crawl = f"{split_url.scheme}://{split_suffix.top_domain_under_public_suffix}:{split_url.port}"
244-
```
176+
### My suffix is in the PSL but not extracted correctly
177+
178+
Check if it's in the "PRIVATE" section. See
179+
[How to treat private domains as suffixes](#how-to-treat-private-domains-as-suffixes).
245180

246-
`tldextract`'s lenient string parsing stance lowers the learning curve of using
247-
the library, at the cost of desensitizing users to the nuances of URLs. This
248-
could be overhauled. For example, users could opt into validation, either
249-
receiving exceptions or error metadata on results.
181+
### Why does it parse invalid URLs?
182+
183+
See [URL validation](#url-validation) and
184+
[How to validate URLs before extraction](#how-to-validate-urls-before-extraction).
250185

251186
## Contribute
252187

@@ -256,33 +191,17 @@ receiving exceptions or error metadata on results.
256191
2. Change into the new directory.
257192
3. `pip install --upgrade --editable '.[testing]'`
258193

259-
### Running the test suite
260-
261-
Run all tests against all supported Python versions:
262-
263-
```zsh
264-
tox --parallel
265-
```
266-
267-
Run all tests against a specific Python environment configuration:
268-
269-
```zsh
270-
tox -l
271-
tox -e py311
272-
```
273-
274-
### Code Style
275-
276-
Automatically format all code:
194+
### Running tests
277195

278196
```zsh
279-
ruff format .
197+
tox --parallel # Test all Python versions
198+
tox -e py311 # Test specific Python version
199+
ruff format . # Format code
280200
```
281201

282202
## History
283203

284-
This package started by implementing the chosen answer from [this StackOverflow question on
285-
getting the "domain name" from a URL](http://stackoverflow.com/questions/569137/how-to-get-domain-name-from-url/569219#569219).
286-
However, the proposed regex solution doesn't address many country codes like
287-
com.au, or the exceptions to country codes like the registered domain
288-
parliament.uk. The Public Suffix List does, and so does this package.
204+
This package started from a
205+
[StackOverflow answer](http://stackoverflow.com/questions/569137/how-to-get-domain-name-from-url/569219#569219)
206+
about regex-based domain extraction. The regex approach fails for many domains,
207+
so this library switched to the Public Suffix List for accuracy.
