Laurence-Wu/zLibraryScrapper

Z-Library Crawler Project

A comprehensive Python-based web scraping tool for searching, extracting, and downloading books from Z-Library (zh.z-lib.fm). This project provides automated book discovery, metadata extraction, download link generation, and file management capabilities.

🚀 Overview

This project is designed to automate the process of searching for books on Z-Library, extracting detailed metadata, generating download links, and organizing the results. It features advanced filtering capabilities, year-based traversal, book name matching algorithms, and comprehensive configuration options.

✨ Key Features

πŸ” Advanced Search & Filtering

  • Multi-criteria Search: Search by book name, author, language, file type, publication year, and content type
  • Fuzzy Match Control: Option to include or exclude fuzzy search matches
  • File Type Filtering: Support for EPUB, PDF, MOBI, AZW3, TXT, FB2, RTF formats
  • Language Preferences: Target specific languages (e.g., Chinese, English)
  • Publication Year Filtering: Search within specific year ranges
  • Content Type Selection: Filter between books and articles

📊 Data Extraction & Management

  • Comprehensive Metadata: Extract title, author, language, file size, format, and URLs
  • JSON Output: Structured data storage with configurable file naming
  • Batch Processing: Handle multiple search queries and pagination
  • Download Link Generation: Both synchronous and asynchronous methods
  • Session Management: Persistent login and cookie handling

🤖 Automation Features

  • Year Traversal: Automatically search across multiple years (2000-2025)
  • Selenium WebDriver: Headless browser automation with Chrome
  • Rate Limiting: Configurable delays to avoid detection
  • Retry Logic: Automatic retry on failed requests
  • Progress Tracking: Detailed logging and statistics
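
The retry-with-delay pattern behind the last three bullets can be sketched as follows. `REQUEST_DELAY` and `MAX_RETRIES` mirror the config options documented later in this README; the helper name `fetch_with_retry` is illustrative, not the project's actual function:

```python
import time

REQUEST_DELAY = 1   # seconds between attempts (see Performance Settings)
MAX_RETRIES = 5     # give up after this many attempts

def fetch_with_retry(fetch, url, delay=REQUEST_DELAY, retries=MAX_RETRIES):
    """Call fetch(url), sleeping between attempts and retrying on error."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # real code should catch narrower errors
            last_error = exc
            if attempt < retries:
                time.sleep(delay)
    raise last_error
```

The fixed delay doubles as a crude rate limiter; a production crawler would typically add jitter or exponential backoff on top.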

📚 Book Name Matching

  • RapidFuzz Integration: Advanced fuzzy string matching
  • Name Extraction: Extract book names from output JSON files
  • Similarity Scoring: Intelligent book name comparison
  • Duplicate Detection: Identify similar or duplicate entries

πŸ“ Project Structure

SeekHubProject/
├── README.md                           # This file
├── requirements.txt                    # Python dependencies
├── .gitignore                          # Git ignore patterns
├── traversal_year.py                   # Year-based search automation
├── unprocessesd_json_generator.py      # Raw search data generator
├── download_json_generator.py          # Download link generator
├── processesd_json_generator.py        # Processed data generator
├── OS_function_tests.py                # System function tests
├── output/                             # Generated output files
│   ├── json/                           # Search result JSON files
│   ├── auth/                           # Authentication data (cookies)
│   └── downloads/                      # Downloaded book files
└── zlibraryCrowler/                    # Main crawler package
    ├── __init__.py
    ├── main.py                         # Basic web driver setup
    ├── config.py                       # Comprehensive configuration
    ├── .env.example                    # Environment variables template
    ├── login.py                        # Authentication management
    ├── search.py                       # Search functionality
    ├── getSearchDownloadLinks.py       # Download link extraction
    ├── downloadFiles.py                # File download management
    ├── getCookies.py                   # Cookie handling
    ├── textProcess.py                  # Text processing utilities
    └── bookNameMatching/               # Book matching algorithms
        ├── ExtractNamesFromOutputJson.py  # Name extraction
        └── rapidFuzzMatching.py           # Fuzzy matching

πŸ› οΈ Installation & Setup

Prerequisites

  • Python 3.8 or higher
  • Chrome browser (for Selenium WebDriver)
  • Z-Library account credentials

1. Clone the Repository

git clone <repository-url>
cd SeekHubProject

2. Install Dependencies

pip install -r requirements.txt

3. Environment Configuration

# Copy the environment template
cp zlibraryCrowler/.env.example zlibraryCrowler/.env

# Edit the .env file with your credentials
EMAIL=your_email@example.com
PASSWORD=your_password

4. Configure Search Parameters

Edit zlibraryCrowler/config.py to set your preferences:

# Basic search configuration
BOOK_NAME_TO_SEARCH = "Python Programming"
PREFERRED_LANGUAGE = "english"
PREFERRED_FILE_TYPES = ["PDF", "EPUB"]
PREFERRED_YEAR = 2020
MAX_PAGES_TO_SCRAPE = 5

# Advanced options
INCLUDE_FUZZY_MATCHES = False
USE_HEADLESS_BROWSER = True
MAX_CONCURRENT_REQUESTS = 3

🚀 Usage

Basic Search

# Generate unprocessed search results
python unprocessesd_json_generator.py

# Extract download links
python download_json_generator.py

# Process and clean data
python processesd_json_generator.py

Year Traversal (Automated)

# Search across all years from 2000-2025
python traversal_year.py

Book Name Matching

# Extract book names from JSON files
python zlibraryCrowler/bookNameMatching/ExtractNamesFromOutputJson.py

# Perform fuzzy matching
python zlibraryCrowler/bookNameMatching/rapidFuzzMatching.py

βš™οΈ Configuration Options

Search Parameters

| Parameter | Description | Options |
| --- | --- | --- |
| BOOK_NAME_TO_SEARCH | Target book name | String or None |
| PREFERRED_LANGUAGE | Language filter | "chinese", "english", etc. |
| PREFERRED_FILE_TYPES | File format filters | ["EPUB", "PDF", "MOBI", "AZW3", "TXT", "FB2", "RTF"] |
| PREFERRED_YEAR | Publication year | Integer (0 to ignore) |
| PREFERRED_CONTENT_TYPES | Content type filter | ["book", "article"] |
| PREFERRED_ORDER | Result ordering | "popular", "bestmatch", "newest", "oldest" |
| MAX_PAGES_TO_SCRAPE | Maximum pages to process | Integer |
| INCLUDE_FUZZY_MATCHES | Include fuzzy matches | Boolean |

Performance Settings

| Parameter | Description | Default |
| --- | --- | --- |
| USE_HEADLESS_BROWSER | Run browser in background | True |
| MAX_CONCURRENT_REQUESTS | Async request limit | 3 |
| REQUEST_DELAY | Delay between requests (seconds) | 1 |
| MAX_RETRIES | Maximum retry attempts | 5 |
| BROWSER_TIMEOUT | WebDriver timeout (seconds) | 10 |

Output Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| OUTPUT_FOLDERS['json'] | JSON output directory | ./output/json/ |
| OUTPUT_FOLDERS['auth'] | Authentication data directory | ./output/auth/ |
| OUTPUT_FOLDERS['downloads'] | Downloaded files directory | ./output/downloads/ |
| PROCESS_NAME | File naming prefix | "zlibrary_crawler" |

📊 Output Files

JSON Structure

{
  "id": "book_unique_id",
  "title": "Book Title",
  "author": "Author Name",
  "language": "english",
  "file_type": "PDF",
  "file_size": "2.5 MB",
  "book_page_url": "https://zh.z-lib.fm/book/...",
  "download_url": "https://zh.z-lib.fm/dl/...",
  "download_links": [...]
}
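
Records in this structure are plain JSON, so post-processing is a matter of loading and filtering. A minimal sketch (the filter helper is illustrative, not part of the project):

```python
import json

# Parse one record in the structure shown above.
record = json.loads("""{
  "id": "book_unique_id",
  "title": "Book Title",
  "author": "Author Name",
  "language": "english",
  "file_type": "PDF",
  "file_size": "2.5 MB"
}""")

def filter_by_file_type(records, file_type):
    """Keep only records whose file_type matches, case-insensitively."""
    return [r for r in records if r.get("file_type", "").upper() == file_type.upper()]

pdfs = filter_by_file_type([record], "pdf")
```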

File Naming Convention

  • Search Results: {process_name}_{book_name}_{language}_{file_types}_{year}_{hash}_books.json
  • Download Links: {process_name}_{book_name}_{language}_{file_types}_{year}_{hash}_download_links.json
  • Downloaded Files: {process_name}_{original_filename}.{extension}
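
A reconstruction of how a search-results filename could be assembled. How the real project derives `{hash}` is an assumption here (an MD5 digest over the search parameters); only the field order follows the convention above:

```python
import hashlib

def build_results_filename(process_name, book_name, language, file_types, year):
    """Assemble {process_name}_{book_name}_{language}_{file_types}_{year}_{hash}_books.json."""
    params = f"{book_name}|{language}|{','.join(file_types)}|{year}"
    digest = hashlib.md5(params.encode("utf-8")).hexdigest()[:8]  # assumed hash scheme
    safe_name = book_name.replace(" ", "_")
    types = "-".join(file_types)
    return f"{process_name}_{safe_name}_{language}_{types}_{year}_{digest}_books.json"

name = build_results_filename("zlibrary_crawler", "Python Programming",
                              "english", ["PDF", "EPUB"], 2020)
```

The hash makes filenames unique per parameter combination, so repeated runs with different filters never overwrite each other.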

🔧 Advanced Features

Async Download Link Extraction

The project supports both synchronous (Selenium) and asynchronous (aiohttp) methods for extracting download links:

# Enable async extraction in config.py
USE_ASYNC_EXTRACTION = True
MAX_CONCURRENT_REQUESTS = 3
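
The bounded-concurrency pattern behind `USE_ASYNC_EXTRACTION` can be sketched with an `asyncio.Semaphore`. The real project performs the HTTP calls with aiohttp; `fetch_link` below is a stand-in stub so the sketch stays self-contained:

```python
import asyncio

MAX_CONCURRENT_REQUESTS = 3

async def fetch_link(url):
    await asyncio.sleep(0)          # real code: aiohttp GET + parse the page
    return f"{url}/download"

async def extract_all(urls, limit=MAX_CONCURRENT_REQUESTS):
    sem = asyncio.Semaphore(limit)  # at most `limit` requests in flight
    async def bounded(url):
        async with sem:
            return await fetch_link(url)
    return await asyncio.gather(*(bounded(u) for u in urls))

links = asyncio.run(extract_all(["https://zh.z-lib.fm/book/1",
                                 "https://zh.z-lib.fm/book/2"]))
```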

Rate Limiting & Bot Detection Avoidance

  • Configurable delays between requests
  • Browser automation controls
  • User-agent rotation
  • Session persistence
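
The user-agent rotation and delay tactics above amount to a few lines; the UA strings and delay bounds here are illustrative, not the project's actual values:

```python
import random
import time

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def next_request_headers():
    """Pick a fresh User-Agent for each request."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def polite_sleep(base=1.0, jitter=0.5):
    """Sleep for base plus random jitter so request timing is not uniform."""
    time.sleep(base + random.uniform(0, jitter))
```

Randomized timing matters as much as the UA string: perfectly regular request intervals are an easy bot signature.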

Error Handling & Logging

  • Comprehensive error logging
  • Retry mechanisms
  • Progress tracking
  • Statistics reporting

Book Name Matching Algorithms

  • RapidFuzz: Advanced fuzzy string matching
  • Similarity Scoring: Intelligent comparison metrics
  • Batch Processing: Handle multiple book comparisons
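
The project uses RapidFuzz (e.g. `fuzz.ratio`) for these comparisons; as a dependency-free sketch of the same idea, stdlib `difflib` yields a comparable 0-100 score. The 90.0 threshold is illustrative:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity score between two titles, 0-100."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_duplicates(titles, threshold=90.0):
    """Return pairs of titles whose similarity meets the threshold."""
    pairs = []
    for i, a in enumerate(titles):
        for b in titles[i + 1:]:
            if similarity(a, b) >= threshold:
                pairs.append((a, b))
    return pairs
```

RapidFuzz is preferred in practice because it is much faster and offers token-based scorers that tolerate reordered words.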

πŸ›‘οΈ Security & Best Practices

Authentication

  • Store credentials in .env file (never commit to version control)
  • Session cookies are automatically managed and persisted
  • Login status verification on each page navigation
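
Cookie persistence can be sketched like this. Selenium's `driver.get_cookies()` returns a list of dicts; saving and restoring that list lets later runs skip the login form. The path matches the output layout above, and the helper names are illustrative:

```python
import json
from pathlib import Path

COOKIE_FILE = Path("output/auth/cookies.json")

def save_cookies(cookies, path=COOKIE_FILE):
    """Persist a list of cookie dicts (as returned by driver.get_cookies())."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(cookies))

def load_cookies(path=COOKIE_FILE):
    """Load previously saved cookies, or return [] on a first run."""
    if not path.exists():
        return []
    return json.loads(path.read_text())
```

On a later run, each loaded dict would be re-applied with `driver.add_cookie(cookie)` after navigating to the site's domain.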

Rate Limiting

# Recommended settings to avoid being blocked
REQUEST_DELAY = 1          # 1 second between requests
PAGE_LOAD_DELAY = 2        # 2 seconds after page loads
MAX_CONCURRENT_REQUESTS = 3 # Maximum simultaneous requests

Legal Considerations

  • This tool is for educational and research purposes
  • Respect Z-Library's terms of service
  • Use reasonable request rates to avoid server overload
  • Ensure you have proper rights to download content

🚨 Troubleshooting

Common Issues

1. Login Failures

# Check credentials in .env file
EMAIL=your_correct_email@domain.com
PASSWORD=your_correct_password

# Verify Z-Library website availability
ZLIBRARY_BASE_URL = "https://zh.z-lib.fm"

2. WebDriver Issues

# Update Chrome WebDriver
pip install --upgrade webdriver-manager

# Verify Chrome browser installation
google-chrome --version  # Linux
# or check Chrome installation on Windows/Mac

3. Rate Limiting

# Increase delays in config.py
REQUEST_DELAY = 2
PAGE_LOAD_DELAY = 3
MAX_CONCURRENT_REQUESTS = 2

4. Memory Issues with Large Datasets

# Reduce concurrent operations
MAX_CONCURRENT_REQUESTS = 1
MAX_PAGES_TO_SCRAPE = 3

# Process in smaller batches
# Use year traversal for systematic processing

📈 Performance Optimization

For Large-Scale Operations

  1. Use Year Traversal: Process data year by year to manage memory
  2. Enable Async Processing: Use USE_ASYNC_EXTRACTION = True
  3. Optimize Concurrency: Balance MAX_CONCURRENT_REQUESTS vs. rate limits
  4. Monitor Output Sizes: Large JSON files may need processing in chunks

Memory Management

# Recommended settings for large datasets
MAX_PAGES_TO_SCRAPE = 5     # Limit pages per search
MAX_CONCURRENT_REQUESTS = 2  # Reduce concurrent operations
USE_HEADLESS_BROWSER = True  # Save memory

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is for educational and research purposes. Please ensure compliance with Z-Library's terms of service and applicable copyright laws.

🆘 Support

For issues, questions, or contributions:

  1. Check the troubleshooting section above
  2. Review the configuration options
  3. Create an issue with detailed error logs
  4. Include your environment details (Python version, OS, etc.)

⚠️ Disclaimer: This tool is intended for educational and research purposes. Users are responsible for ensuring compliance with all applicable laws and terms of service. The developers are not responsible for any misuse of this software.

About

This is the crawler program for the SeekHub project.
