 options.scope.protocol_must_match = False # Only crawl pages with the same protocol as the startpoint (e.g. only https). Default is False.
 options.scope.subdomain_must_match = True # Only crawl pages with the same subdomain as the startpoint. If the startpoint is not a subdomain, no subdomains will be crawled. Default is True.
 options.scope.hostname_must_match = True # Only crawl pages with the same hostname as the startpoint (e.g. only `finnwea`). Default is True.
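The three scope options above can be read as a filter on every discovered URL. The following is a minimal sketch of that reading, not the crawler's actual implementation: `split_host` and `in_scope` are hypothetical helpers, and the host is split naively (last label is the TLD, second-to-last is the hostname, the rest is the subdomain), which is an assumption that does not hold for multi-label suffixes such as `co.uk`.

```python
from urllib.parse import urlparse


def split_host(host):
    """Naively split a host into (subdomain, hostname). Hypothetical helper."""
    labels = (host or "").lower().split(".")
    if len(labels) < 2:
        return "", labels[0]
    # Assumption: single-label TLD, e.g. "blog.finnwea.com" -> ("blog", "finnwea").
    return ".".join(labels[:-2]), labels[-2]


def in_scope(start_url, candidate_url, protocol_must_match=False,
             subdomain_must_match=True, hostname_must_match=True):
    """Return True if candidate_url passes the enabled scope checks."""
    start = urlparse(start_url)
    candidate = urlparse(candidate_url)

    if protocol_must_match and start.scheme != candidate.scheme:
        return False

    start_sub, start_name = split_host(start.hostname)
    cand_sub, cand_name = split_host(candidate.hostname)

    if subdomain_must_match and start_sub != cand_sub:
        return False
    if hostname_must_match and start_name != cand_name:
        return False
    return True


if __name__ == "__main__":
    start = "https://finnwea.com/"
    for url in ("http://finnwea.com/about",
                "https://blog.finnwea.com/",
                "https://example.com/"):
        print(url, in_scope(start, url))
```

With the defaults shown in the options listing, a startpoint without a subdomain rejects `blog.finnwea.com`, which matches the "no subdomains will be crawled" remark.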
@@ -86,7 +85,7 @@ The English phrase "Everything but the kitchen sink" means "almost anything one
 options.identity.auth = HTTPBasicAuth('user', 'pass') # Or any other authentication (http://docs.python-requests.org/en/master/user/authentication/). Default is None.
 options.misc.debug = False # If debug is enabled extra information will be logged to the console. Default is False.
 options.misc.verify_ssl_certificates = True # If verification is enabled all SSL certificates will be checked for validity. Default is True.
-options.misc.trusted_certificates = None # You can pass the path to a CA_BUNDLE file or directory with certificates of trusted CAs. Default is None.
+options.misc.trusted_certificates = None # You can pass the path to a CA_BUNDLE file (.pem) or directory with certificates of trusted CAs. Default is None.
 options.scope.protocol_must_match = False # Only crawl pages with the same protocol as the startpoint (e.g. only https). Default is False.
 options.scope.subdomain_must_match = True # Only crawl pages with the same subdomain as the startpoint. If the startpoint is not a subdomain, no subdomains will be crawled. Default is True.
 options.scope.hostname_must_match = True # Only crawl pages with the same hostname as the startpoint (e.g. only `finnwea`). Default is True.
 options.identity.auth = HTTPBasicAuth('user', 'pass') # Or any other authentication (http://docs.python-requests.org/en/master/user/authentication/). Default is None.
 options.misc.debug = False # If debug is enabled extra information will be logged to the console. Default is False.
 options.misc.verify_ssl_certificates = True # If verification is enabled all SSL certificates will be checked for validity. Default is True.
 options.misc.trusted_certificates = None # You can pass the path to a CA_BUNDLE file (.pem) or directory with certificates of trusted CAs. Default is None.
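The `verify_ssl_certificates` and `trusted_certificates` options line up with the `verify` argument of the `requests` library, which accepts either a boolean or a path to a CA bundle. The sketch below shows one plausible way to combine the two options into that single value. How the crawler combines them internally is not shown in this diff, so `requests_verify_value` is a hypothetical helper and its precedence rule is an assumption.

```python
def requests_verify_value(verify_ssl_certificates=True, trusted_certificates=None):
    """Combine the two misc options into one value for requests' `verify`.

    Hypothetical helper. Assumed precedence: disabling verification wins,
    otherwise a custom CA bundle path is used when one is given.
    """
    if not verify_ssl_certificates:
        return False
    if trusted_certificates is not None:
        return trusted_certificates
    return True


if __name__ == "__main__":
    # The path below is a placeholder, not a file shipped with the project.
    verify = requests_verify_value(True, "/path/to/ca_bundle.pem")
    print(verify)
    # The value would then be passed on, e.g.:
    #   requests.get("https://example.com", verify=verify)
```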
"""The OptionsRouting class can contain routes that prevent the crawler from crawling similar pages multiple times.
+
+    Attributes:
+        minimum_threshold (int): The minimum number of requests to crawl (matching a certain route) before ignoring the rest. Default is 20.
+        routes (arr): The regular expressions that represent routes that should not be crawled more times than the minimum threshold. Default is an empty array.
+
+    Note:
+        An example would be a news site with URLs like (/news/3443, /news/2132, /news/9475, etc). You can add a regular expression
+        that matches this route so only X requests that match the regular expression will be crawled (where X is the minimum threshold).
+
+    Note:
+        The crawler will only stop crawling requests of certain routes at exactly the minimum threshold if the maximum threads option is set to 1.
+        If the maximum threads option is set to a value higher than 1, the threshold will get a bit higher depending on the number of threads used.
+
+    """
+
+    def __init__(self):
+        """Constructs an OptionsRouting instance."""
+
+        self.minimum_threshold = 20
+        self.routes = []
+
+
 class OptionsMisc(object):
     """The OptionsMisc class contains all kind of misc options.
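The routing behaviour described in the `OptionsRouting` docstring can be illustrated with a small counter per route. This is a sketch under stated assumptions rather than the crawler's own code: `RouteLimiter` and `should_crawl` are hypothetical names, a URL is attributed to the first route it matches, and the example is single-threaded.

```python
import re


class RouteLimiter:
    """Count crawled URLs per route and reject those past the threshold.

    Hypothetical class illustrating `routes` and `minimum_threshold`.
    """

    def __init__(self, routes, minimum_threshold=20):
        self.patterns = [re.compile(route) for route in routes]
        self.minimum_threshold = minimum_threshold
        self.counts = {pattern.pattern: 0 for pattern in self.patterns}

    def should_crawl(self, url):
        """Return True if the URL may still be crawled."""
        for pattern in self.patterns:
            if pattern.search(url):
                if self.counts[pattern.pattern] >= self.minimum_threshold:
                    return False
                self.counts[pattern.pattern] += 1
                return True
        # URLs that match no route are never limited.
        return True


if __name__ == "__main__":
    limiter = RouteLimiter([r"^/news/[0-9]+$"], minimum_threshold=3)
    for path in ("/news/3443", "/news/2132", "/news/9475", "/news/1", "/about"):
        print(path, limiter.should_crawl(path))
```

In this sketch the count is taken when a URL is accepted. With several worker threads, requests that were already accepted or in flight when the threshold is reached would still complete, which is consistent with the docstring's remark that the effective threshold ends up slightly higher when more than one thread is used.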