Timeout Issue When Scraping Emails from Multiple URLs Using Python requests #6862

@mominurr

I am working on a web scraping project using Python's requests library. The goal is to scrape emails from a large number of URLs. To handle network delays, I pass timeout=(10, 10), that is, a 10-second connect timeout and a 10-second read timeout.

However, when I run the script for multiple URLs, I encounter an issue where the program gets stuck on a request and does not respect the timeout settings. This results in the script hanging indefinitely, especially when scraping a large number of URLs.

Here’s the code snippet I’m using:

import requests  

urls = [  
    "http://example.com",  
    "http://anotherexample.com",  
    # ... more URLs  
]  
HEADERS = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"}
for url in urls:  
    try:  
        response = requests.get(url, headers=HEADERS, timeout=(10, 10))  
        if response.status_code == 200:  
            # Extract emails (simplified for demonstration)  
            print(f"Emails from {url}: ", response.text)  
    except requests.exceptions.Timeout:  
        print(f"Timeout occurred for {url}")  
    except requests.exceptions.RequestException as e:  
        print(f"Error occurred for {url}: {e}")  

Despite using the timeout parameter, the script sometimes gets stuck indefinitely and doesn’t proceed to the next URL.

Steps Taken:

  1. Tried reducing the timeout values to (5, 5) but encountered the same issue.

  2. Ensured that the URLs are valid and accessible.

My Questions:

  1. Why might the timeout not work as expected in this case?

  2. How can I ensure that the script doesn't hang indefinitely when scraping a large number of URLs?

Any help or suggestions to resolve this issue would be greatly appreciated.

Environment:

Python version: 3.10.10

requests version: 2.32.3
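
One possible explanation for the first question: the requests documentation notes that timeout is not a limit on the whole response download. The read timeout only fires when no bytes arrive on the socket for that many seconds, so a server that keeps sending small amounts of data can hold a request open far longer than 10 seconds. Below is a minimal sketch of a total time budget per URL, assuming the hang happens while the response body is being read. The names DeadlineExceeded, read_with_deadline, fetch and the deadline_s parameter are hypothetical and not part of requests.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when the total time budget for one response body is used up."""


def read_with_deadline(chunks, deadline_s, clock=time.monotonic):
    """Join byte chunks from an iterable, giving up after deadline_s seconds.

    The clock is checked after each chunk, so the budget can be overshot
    by at most the time it takes to receive one chunk.
    """
    start = clock()
    parts = []
    for chunk in chunks:
        parts.append(chunk)
        if clock() - start > deadline_s:
            raise DeadlineExceeded(f"body not complete after {deadline_s} s")
    return b"".join(parts)


def fetch(url, headers=None, timeout=(10, 10), deadline_s=30):
    """Download url with per-read timeouts plus a total budget for the body.

    Hypothetical helper: deadline_s is an assumed value, tune it as needed.
    """
    # Imported here so read_with_deadline stays usable without requests installed.
    import requests

    # stream=True returns once the headers arrive; the body is read lazily,
    # which lets us check the clock between chunks.
    with requests.get(url, headers=headers, timeout=timeout, stream=True) as response:
        response.raise_for_status()
        return read_with_deadline(response.iter_content(chunk_size=8192), deadline_s)
```

In the loop above, `requests.get(...)` could be swapped for `fetch(url, headers=HEADERS)` with an extra `except DeadlineExceeded:` branch. Each chunk read is still bounded by the read timeout, so the worst case is roughly deadline_s plus one read timeout. This sketch does not cover time spent before the headers arrive, so a hang during that phase would need a different approach.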
