`tldextract` accurately separates a URL's subdomain, domain, and public suffix,
using [the Public Suffix List (PSL)](https://publicsuffix.org).

**Why?** Naive URL parsing, like splitting on dots, fails for domains like
`forums.bbc.co.uk`: splitting gives "co" as the domain instead of "bbc".
`tldextract` handles the edge cases, so you don't have to.

## Quick Start

```python
>>> import tldextract

>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com', is_private=False)

>>> tldextract.extract('http://forums.bbc.co.uk/')
ExtractResult(subdomain='forums', domain='bbc', suffix='co.uk', is_private=False)

>>> # Access the parts you need
>>> ext = tldextract.extract('http://forums.bbc.co.uk')
>>> ext.domain
'bbc'
>>> ext.top_domain_under_public_suffix
'bbc.co.uk'
>>> ext.fqdn
'forums.bbc.co.uk'
```

## Install

```zsh
pip install tldextract
```

## How-to Guides

### How to disable HTTP suffix list fetching for production

By default, the first call fetches the latest suffix list with a live HTTP
request. To skip that and rely on the suffix list snapshot bundled with the
package:

```python
# Falls back to the bundled snapshot; no live HTTP fetching
no_fetch_extract = tldextract.TLDExtract(suffix_list_urls=())
no_fetch_extract('http://www.google.com')
```

### How to set a custom cache location

Via environment variable:

```zsh
export TLDEXTRACT_CACHE="/path/to/cache"
```

Or in code:

```python
custom_cache_extract = tldextract.TLDExtract(cache_dir='/path/to/cache/')
custom_cache_extract('http://www.google.com')
```

### How to update TLD definitions

Command line:

```zsh
tldextract --update
```
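
If you keep the cache in a custom location, point `--update` at it with the
same environment variable:

```zsh
env TLDEXTRACT_CACHE="/path/to/cache" tldextract --update
```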

Or delete the cache folder:

```zsh
rm -rf $HOME/.cache/python-tldextract
```

### How to treat private domains as suffixes

```python
extract = tldextract.TLDExtract(include_psl_private_domains=True)
extract('waiterrant.blogspot.com')
# ExtractResult(subdomain='', domain='waiterrant', suffix='blogspot.com', is_private=True)
```
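
You can also opt in per call on a default extractor:

```python
extract = tldextract.TLDExtract()
extract('waiterrant.blogspot.com', include_psl_private_domains=True)
# ExtractResult(subdomain='', domain='waiterrant', suffix='blogspot.com', is_private=True)
```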

### How to use a local suffix list

Use the `file://` protocol with an absolute path:

```python
extract = tldextract.TLDExtract(
    suffix_list_urls=["file:///path/to/your/list.dat"],
    cache_dir='/path/to/cache/',
    fallback_to_snapshot=False)
```

### How to use a remote suffix list

```python
extract = tldextract.TLDExtract(
    suffix_list_urls=["https://myserver.com/suffix-list.dat"])
```
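
If no cached copy exists, such as on first run, `tldextract` requests the URLs
you specify in order and uses the first successful response.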

### How to add extra suffixes

Extra suffixes are merged into whatever public suffix definitions are already
in use:

```python
extract = tldextract.TLDExtract(
    extra_suffixes=["foo", "bar.baz"])
```
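
For example, with the hypothetical `bar.baz` suffix registered above:

```python
extract('www.example.bar.baz')
# expected: ExtractResult(subdomain='www', domain='example', suffix='bar.baz', is_private=False)
```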

### How to validate URLs before extraction

`tldextract` is lenient with input. If valid URLs are important to you,
validate them before calling it. To avoid parsing a string twice, pass the
output of [`urllib.parse`](https://docs.python.org/3/library/urllib.parse.html)
methods to an extractor:

```python
from urllib.parse import urlsplit

extractor = tldextract.TLDExtract()
split_url = urlsplit("https://example.com:8080/path")
result = extractor.extract_urllib(split_url)
```
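
Continuing the snippet above, the parsed pieces can be recombined, e.g. to
build a URL to crawl:

```python
url_to_crawl = f"{split_url.scheme}://{result.top_domain_under_public_suffix}:{split_url.port}"
# 'https://example.com:8080'
```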

## Command Line

```zsh
$ tldextract http://forums.bbc.co.uk
forums bbc co.uk

$ tldextract --update  # Update cached suffix list
$ tldextract --help    # See all options
```

## Understanding Domain Parsing

### Public Suffix List

`tldextract` uses the [Public Suffix List](https://publicsuffix.org), a
community-maintained list of domain suffixes. The PSL contains both:

- **Public suffixes**: where anyone can register a domain (`.com`, `.co.uk`,
  `.org.kg`)
- **Private suffixes**: operated by companies for customer subdomains
  (`blogspot.com`, `github.io`)

Web browsers use this same list for security decisions like cookie scoping.

### Suffix vs. TLD

While `.com` is a top-level domain (TLD), many suffixes like `.co.uk` are
technically second-level. The PSL uses "public suffix" to cover both.
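
For example, extraction treats the whole `co.uk` pair as the suffix:

```python
>>> tldextract.extract('http://forums.bbc.co.uk/').suffix
'co.uk'
```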

### Default behavior with private domains

By default, `tldextract` treats private suffixes as regular domains:

```python
>>> tldextract.extract('waiterrant.blogspot.com')
ExtractResult(subdomain='waiterrant', domain='blogspot', suffix='com', is_private=False)
```

To treat them as suffixes instead, see
[How to treat private domains as suffixes](#how-to-treat-private-domains-as-suffixes).

### Caching behavior

By default, `tldextract` fetches the latest Public Suffix List on first use and
caches it indefinitely in `$HOME/.cache/python-tldextract`.
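
Caching can be disabled outright by passing `cache_dir=None`:

```python
no_cache_extract = tldextract.TLDExtract(cache_dir=None)
no_cache_extract('http://www.google.com')
```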

### URL validation

`tldextract` accepts any string and is very lenient. It prioritizes ease of use
over strict validation, extracting domains from any string, even partial URLs
or non-URLs.
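
For example, inputs without a subdomain or a known suffix still produce a
result, as does a bare IP address:

```python
>>> tldextract.extract('google.notavalidsuffix')
ExtractResult(subdomain='google', domain='notavalidsuffix', suffix='', is_private=False)

>>> tldextract.extract('http://127.0.0.1:8080/deployed/')
ExtractResult(subdomain='', domain='127.0.0.1', suffix='', is_private=False)
```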

## FAQ

### Can you add/remove suffix \_\_\_\_?

`tldextract` doesn't maintain the suffix list. Submit changes to
[the Public Suffix List](https://publicsuffix.org/submit/).

Meanwhile, use the `extra_suffixes` parameter, or fork the PSL and pass your
fork to this library with the `suffix_list_urls` parameter.

### My suffix is in the PSL but not extracted correctly

Check if it's in the "PRIVATE" section. See
[How to treat private domains as suffixes](#how-to-treat-private-domains-as-suffixes).

### Why does it parse invalid URLs?

See [URL validation](#url-validation) and
[How to validate URLs before extraction](#how-to-validate-urls-before-extraction).

## Contribute

### Setting up

1. `git clone` this repository.
2. Change into the new directory.
3. `pip install --upgrade --editable '.[testing]'`

### Running tests

```zsh
tox --parallel  # Test all Python versions
tox -e py311    # Test specific Python version
ruff format .   # Format code
```

## History

This package started from a
[StackOverflow answer](http://stackoverflow.com/questions/569137/how-to-get-domain-name-from-url/569219#569219)
about regex-based domain extraction. Regexes fail for many suffixes, like
`com.au`, and for exceptions like the registered domain `parliament.uk`, so
this library switched to the Public Suffix List for accuracy.