Description
I am working on a web scraping project using Python's requests library. The goal is to scrape emails from numerous URLs. To handle network delays, I set the timeout parameter to timeout=(10, 10), i.e. a 10-second connect timeout and a 10-second read timeout.
However, when I run the script over multiple URLs, it sometimes gets stuck on a single request and does not appear to respect the timeout settings, so the script hangs indefinitely, especially when scraping a large number of URLs.
Here’s the code snippet I’m using:
import requests

urls = [
    "http://example.com",
    "http://anotherexample.com",
    # ... more URLs
]

HEADERS = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"}

for url in urls:
    try:
        response = requests.get(url, headers=HEADERS, timeout=(10, 10))
        if response.status_code == 200:
            # Extract emails (simplified for demonstration)
            print(f"Emails from {url}: ", response.text)
    except requests.exceptions.Timeout:
        print(f"Timeout occurred for {url}")
    except requests.exceptions.RequestException as e:
        print(f"Error occurred for {url}: {e}")
Despite using the timeout parameter, the script sometimes gets stuck indefinitely and doesn’t proceed to the next URL.
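One workaround I have been considering, but have not tested at scale, is to run each request in a worker thread and enforce a hard overall deadline with concurrent.futures, since the read timeout only seems to limit the gap between received chunks rather than the total response time. Below is a rough sketch of what I mean, reusing urls and HEADERS from the snippet above; fetch, HARD_DEADLINE, and max_workers=4 are just placeholder names and values I picked for illustration.

import concurrent.futures
import requests

HARD_DEADLINE = 30  # max wall-clock seconds I am willing to wait per URL (arbitrary)

def fetch(url):
    # Same request as above; the connect/read timeouts still apply inside the worker.
    return requests.get(url, headers=HEADERS, timeout=(10, 10))

# A small pool so one stuck request does not block the next submission.
pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)
for url in urls:
    future = pool.submit(fetch, url)
    try:
        response = future.result(timeout=HARD_DEADLINE)
        if response.status_code == 200:
            print(f"Emails from {url}: ", response.text)
    except concurrent.futures.TimeoutError:
        # The loop moves on, but the stuck worker thread keeps running in the
        # background and can still delay a clean interpreter shutdown.
        print(f"Hard deadline exceeded for {url}")
    except requests.exceptions.RequestException as e:
        print(f"Error occurred for {url}: {e}")
pool.shutdown(wait=False)

This keeps the loop moving, but it feels like a band-aid, so I would still like to understand why the read timeout alone is not enough.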
Steps Taken:
- Tried reducing the timeout values to (5, 5), but encountered the same issue.
- Ensured that the URLs are valid and accessible.
My Questions:
- Why might the timeout not work as expected in this case?
- How can I ensure that the script doesn't hang indefinitely when scraping a large number of URLs?
Any help or suggestions to resolve this issue would be greatly appreciated.
Environment:
Python version: 3.10.10
requests version: 2.32.3