🚀 Job Scraper Dashboard - Complete Documentation


A full-stack web application for scraping, storing, and managing job listings from LinkedIn

Features • Installation • Usage • API Docs • Contributing

📋 Table of Contents

  • Project Overview
  • Features
  • Tech Stack
  • Architecture
  • Installation
  • Configuration
  • Usage
  • How to Use the Scraper
  • API Documentation
  • Frontend Guide
  • Database Schema
  • Scraping Details
  • Troubleshooting
  • Development
  • Contributing
  • License
  • Acknowledgments
  • Assumptions & Limitations
  • Future Improvements

🎯 Project Overview

What is Job Scraper Dashboard?

Job Scraper Dashboard is a comprehensive full-stack application designed to automate job hunting by:

  • Scraping job listings from LinkedIn in real-time
  • Storing data in PostgreSQL
  • Searching through collected jobs with advanced filters
  • Managing job listings through a modern React dashboard
  • Monitoring scraping progress with live updates

Key Benefits

  • Time-Saving: Automate job search across multiple parameters
  • Centralized Storage: All jobs in one place, searchable and filterable
  • Real-time Updates: Live scraping progress monitoring
  • User-Friendly: Intuitive dashboard with responsive design
  • Scalable: Built with production-ready technologies

✨ Features

Core Features

  • ✅ Real-time LinkedIn Scraping - Live job extraction with progress tracking
  • ✅ Advanced Search - Filter jobs by title, company, location
  • ✅ Job Details View - Complete job descriptions with modal display
  • ✅ Direct Apply Links - One-click access to LinkedIn job postings
  • ✅ Admin Dashboard - Database management and bulk operations
  • ✅ Responsive Design - Works perfectly on mobile and desktop
  • ✅ Background Processing - Non-blocking scraping operations

Dashboard Features

  • Live Scraping Status - Real-time progress monitoring
  • Job Statistics - Total counts and insights
  • Search & Filter - Quick find functionality
  • Delete Operations - Individual and bulk job removal
  • Admin Controls - Database management interface

๐Ÿ—๏ธ Tech Stack

Backend (Python)

| Technology | Version | Purpose |
| --- | --- | --- |
| FastAPI | 0.104.1 | Modern web framework for APIs |
| SQLAlchemy | 2.0.23 | ORM for database interactions |
| Pydantic | 2.5.0 | Data validation and settings |
| Selenium | 4.15.2 | Browser automation for scraping |
| BeautifulSoup4 | 4.12.2 | HTML parsing for job data |
| Uvicorn | 0.24.0 | ASGI server for FastAPI |
| Psycopg2 | 2.9.9 | PostgreSQL adapter |

Frontend (React)

| Technology | Purpose |
| --- | --- |
| React 18 | Frontend UI library |
| React Router | Client-side routing |
| Tailwind CSS | Utility-first CSS framework |
| Lucide React | Icon library |
| Vite | Build tool and dev server |

Database

| Technology | Purpose |
| --- | --- |
| PostgreSQL 14+ | Primary relational database |
| pgAdmin (optional) | Database management GUI |

DevOps & Tools

| Technology | Purpose |
| --- | --- |
| dotenv | Environment variable management |
| WebDriver Manager | Auto ChromeDriver management |
| Chrome Headless | Headless browser for scraping |
| Git | Version control |

๐Ÿ—๏ธ Architecture

System Architecture

┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│  React Frontend  │◄──►│  FastAPI Backend │◄──►│  PostgreSQL DB   │
│  (Dashboard)     │    │  (REST API)      │    │  (Job Storage)   │
└──────────────────┘    └──────────────────┘    └──────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│  User Browser    │    │  Selenium        │    │  Data Models     │
│  (UI)            │    │  (Scraper)       │    │  (ORM)           │
└──────────────────┘    └──────────────────┘    └──────────────────┘

Project Structure

job-scraper-dashboard/
├── backend/                   # FastAPI Backend
│   ├── main.py                # FastAPI app & endpoints
│   ├── database.py            # DB connection setup
│   ├── models.py              # SQLAlchemy models
│   ├── scraper.py             # LinkedIn scraper class
│   ├── requirements.txt       # Python dependencies
│   └── .env                   # Environment variables
│
├── frontend/                  # React Frontend
│   ├── src/
│   │   ├── components/        # React components
│   │   │   ├── Header.jsx
│   │   │   ├── ScrapeForm.jsx
│   │   │   ├── SearchBar.jsx
│   │   │   ├── JobCard.jsx
│   │   │   ├── JobModal.jsx
│   │   │   └── Footer.jsx
│   │   ├── services/
│   │   │   └── api.js         # API service layer
│   │   ├── pages/
│   │   │   ├── App.jsx        # Main dashboard
│   │   │   └── AdminPanel.jsx
│   │   ├── index.css          # Tailwind styles
│   │   └── main.jsx           # React entry point
│   ├── package.json           # Frontend dependencies
│   └── index.html             # HTML template
│
├── README.md                  # This documentation
└── .gitignore

Backend File Details

backend/main.py - FastAPI Application

# Core application setup with routes for:
# - Job management (CRUD operations)
# - Scraping control (start/stop/status)
# - Statistics and admin endpoints
# - CORS middleware configuration
# - Background task processing
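
As a rough sketch of how these pieces typically fit together: the route shapes below come from the API documentation later in this README, while the module names (models, database, get_db) are assumed from the file descriptions above, so the real main.py may differ.

from fastapi import Depends, FastAPI
from fastapi.middleware.cors import CORSMiddleware
from sqlalchemy.orm import Session

import models
from database import engine, get_db

# The README notes that tables are created automatically on startup
models.Base.metadata.create_all(bind=engine)

app = FastAPI(title="Job Scraper API")

# Allow the React dev server (Vite, port 5173) to call the API
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:5173"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

@app.get("/")
def health_check():
    return {"message": "Job Scraper API", "status": "running"}

@app.get("/stats")
def get_stats(db: Session = Depends(get_db)):
    return {"total_jobs": db.query(models.Job).count()}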

backend/database.py - Database Connection

# SQLAlchemy engine and session factory setup
# PostgreSQL connection configuration
# Database session dependency injection
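
A minimal sketch of that setup, assuming python-dotenv and the DATABASE_URL variable shown in the .env examples in this README; the actual file may differ in detail.

import os

from dotenv import load_dotenv
from sqlalchemy import create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

load_dotenv()

DATABASE_URL = os.getenv("DATABASE_URL")

engine = create_engine(DATABASE_URL)
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
Base = declarative_base()

# FastAPI dependency: yield a session and always close it afterwards
def get_db():
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()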

backend/models.py - Data Models

# SQLAlchemy ORM models for Job entity
# Includes fields: title, company, location, description, url, etc.
# Unique constraints to prevent duplicate jobs
# Timestamp fields for tracking
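
A sketch of what such a Job model could look like, using the columns and unique constraint listed in the Database Schema section below (illustrative, not necessarily the exact model in the repository):

from sqlalchemy import Column, DateTime, Integer, String, Text, UniqueConstraint, func

from database import Base

class Job(Base):
    __tablename__ = "jobs"

    id = Column(Integer, primary_key=True, index=True)
    title = Column(String(500), nullable=False)
    company = Column(String(500), nullable=False)
    location = Column(String(500))
    description = Column(Text)
    url = Column(String(1000), nullable=False)
    source = Column(String(100), default="LinkedIn")
    posted_date = Column(DateTime(timezone=True))
    scraped_at = Column(DateTime(timezone=True), server_default=func.now())

    # Prevent the same posting from being stored twice
    __table_args__ = (UniqueConstraint("title", "company", "url", name="uq_job"),)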

backend/scraper.py - LinkedIn Scraper

# Selenium-based web scraper for LinkedIn jobs
# Methods for: page navigation, job extraction, description parsing
# Anti-detection measures and error handling
# Duplicate job checking logic

Frontend File Details

frontend/src/pages/App.jsx - Main Dashboard

// Main application component
// State management for jobs, search, scraping status
// Component composition and layout
// API integration and data fetching

frontend/src/components/ - UI Components

  • Header.jsx - Navigation and branding
  • ScrapeForm.jsx - Scraping controls and form
  • SearchBar.jsx - Search functionality
  • JobCard.jsx - Individual job display
  • JobModal.jsx - Detailed job view
  • Footer.jsx - Footer information

frontend/src/services/api.js - API Service

// Centralized API client
// Methods for all backend interactions
// Error handling and response parsing

🚀 Installation

Prerequisites

  • Python 3.9+ (with pip)
  • Node.js 16+ (with npm)
  • PostgreSQL 14+
  • Chrome/Chromium browser
  • Git (for version control)

Step-by-Step Installation

1. Clone the Repository

git clone https://github.com/uzair-javed-1/LinkedinJobScrapper
cd LinkedinJobScrapper

2. Backend Setup

# Navigate to backend directory
cd backend

# Create virtual environment
python -m venv venv

# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On Mac/Linux:
source venv/bin/activate

# Install Python dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env with your database credentials

3. Database Setup

-- Connect to PostgreSQL
psql -U postgres

-- Create database
CREATE DATABASE job_scraper;

-- Create user (optional)
CREATE USER scraper_user WITH PASSWORD 'your_password';

-- Grant privileges
GRANT ALL PRIVILEGES ON DATABASE job_scraper TO scraper_user;

-- Exit psql
\q

4. Frontend Setup

# Navigate to frontend directory
cd ../frontend

# Install Node.js dependencies
npm install

5. Environment Configuration

Edit backend/.env file:

DATABASE_URL=postgresql://scraper_user:your_password@localhost:5432/job_scraper

For reference, the author's local setup uses:

DATABASE_URL=postgresql://postgres:uzair@localhost:5432/job_scraper

Here postgres is the database user, uzair is that user's password, job_scraper is the database created above, and 5432 is the PostgreSQL server port.

โš™๏ธ Configuration

Backend Configuration

Create backend/.env file with:

# Database Configuration
DATABASE_URL=postgresql://username:password@localhost:5432/job_scraper

# Scraping Configuration
SCRAPING_MAX_PAGES=2
SCRAPING_DELAY_SECONDS=2
SCRAPING_HEADLESS=true

# Application Configuration
DEBUG=true
HOST=0.0.0.0
PORT=8000

# CORS Configuration
CORS_ORIGINS=http://localhost:5173,http://localhost:3000
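
The backend reads these values at startup. A small sketch of how they can be loaded with python-dotenv (variable names are taken from the file above; the defaults and parsing shown here are illustrative):

import os

from dotenv import load_dotenv

load_dotenv()  # reads backend/.env into the process environment

DATABASE_URL = os.getenv("DATABASE_URL")
MAX_PAGES = int(os.getenv("SCRAPING_MAX_PAGES", "2"))
HEADLESS = os.getenv("SCRAPING_HEADLESS", "true").lower() == "true"
CORS_ORIGINS = os.getenv("CORS_ORIGINS", "http://localhost:5173").split(",")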

Frontend Configuration

Edit frontend/src/services/api.js:

const API_URL = 'http://localhost:8000';  // Change if backend runs elsewhere

Database Configuration

The application automatically creates tables. Manual configuration includes:

  1. Start PostgreSQL:

    # Linux
    sudo systemctl start postgresql
    
    # Windows (via Services)
    # Start PostgreSQL service
  2. Verify Connection:

    psql -U scraper_user -d job_scraper -h localhost

🚀 Usage

Running the Application

Development Mode

Terminal 1 - Backend Server:

cd backend
source venv/bin/activate  # or venv\Scripts\activate on Windows
uvicorn main:app --reload --host 0.0.0.0 --port 8000

Terminal 2 - Frontend Server:

cd frontend
npm run dev

Production Mode

# Backend (production)
cd backend
gunicorn -w 4 -k uvicorn.workers.UvicornWorker main:app

# Frontend (build and serve)
cd frontend
npm run build
npm run preview

Access Points

  • Frontend dashboard: http://localhost:5173
  • Backend API: http://localhost:8000
  • Interactive API docs (Swagger UI): http://localhost:8000/docs

🎯 How to Use the Scraper

Via Web Interface

  1. Log in to LinkedIn: sign in to your LinkedIn account and make the browser you are logged in with your default browser.
  2. Open Dashboard: Navigate to http://localhost:5173
  3. Enter Parameters:
    • Keyword: Job title or skill (e.g., "Software Engineer")
    • Location: City, state, or country (e.g., "New York")
  4. Start Scraping: Click "Start Scraping" button
  5. Monitor Progress: Watch real-time updates
  6. Stop Anytime: Click the "Stop" button to cancel. Known issue: the stop button is not yet fully wired up — the backend /stop route exists, but the frontend call is not correctly aligned with it and needs a small React fix. As a workaround, call the /stop route directly or open the /admin page to terminate the scraping process; a proper fix is planned.

Via API

curl -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{
    "keyword": "Data Scientist",
    "location": "San Francisco",
    "max_pages": 2
  }'

Via Python Script

import requests

# Start scraping
response = requests.post(
    "http://localhost:8000/scrape",
    json={
        "keyword": "Marketing Manager",
        "location": "Chicago",
        "max_pages": 3
    }
)
print(f"Scraping started: {response.json()}")

# Check status
status = requests.get("http://localhost:8000/scraping-status").json()
print(f"Current status: {status}")

Scraping Process Details

  1. Initialization:

    • Opens Chrome in headless mode
    • Navigates to LinkedIn jobs search
    • Closes popups and modals
  2. Job Collection:

    • Extracts job cards from each page
    • Gets title, company, location, URL
    • Checks for duplicates in database
  3. Detail Extraction:

    • Visits each job page individually
    • Extracts full job description
    • Captures posting date if available
  4. Data Storage:

    • Saves to PostgreSQL with unique constraints (a sketch of this step follows the list)
    • Updates scraping status
    • Commits transaction
  5. Cleanup:

    • Closes browser
    • Updates final status
    • Releases database connection
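
Steps 2 and 4 above (duplicate checking and saving under the unique constraint) might look roughly like this; an illustrative sketch, not the exact code in scraper.py:

from database import SessionLocal
import models

def save_job_if_new(job_data: dict) -> bool:
    """Insert a scraped job unless an identical posting already exists."""
    db = SessionLocal()
    try:
        exists = (
            db.query(models.Job)
            .filter(
                models.Job.title == job_data["title"],
                models.Job.company == job_data["company"],
                models.Job.url == job_data["url"],
            )
            .first()
        )
        if exists:
            return False  # duplicate, skip it
        db.add(models.Job(**job_data))
        db.commit()
        return True
    finally:
        db.close()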

Scraping Parameters

| Parameter | Default | Description | Recommended Range |
| --- | --- | --- | --- |
| max_pages | 2 | Number of pages to scrape | 1-5 |
| delay | 1-3 s | Delay between requests | 1-5 seconds |
| headless | True | Headless browser mode | True/False |
| timeout | 15 s | Page load timeout | 10-30 seconds |

๐ŸŒ API Documentation

Base URL

http://localhost:8000

Authentication

No authentication required for development. For production, implement JWT or API keys.
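
For example, a lightweight API-key check could be added as a FastAPI dependency in front of the admin routes. This is only an illustrative sketch; the header name and environment variable are made up here and nothing like this exists in the current code:

import os

from fastapi import HTTPException, Security
from fastapi.security import APIKeyHeader

api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)

def require_api_key(api_key: str = Security(api_key_header)) -> str:
    # Compare the request header against a server-side secret
    if api_key != os.getenv("API_KEY"):
        raise HTTPException(status_code=401, detail="Invalid or missing API key")
    return api_key

# Usage: @app.delete("/admin/delete-all", dependencies=[Depends(require_api_key)])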

Endpoints Reference

1. Health Check

GET /

curl http://localhost:8000/

Response:

{
  "message": "Job Scraper API",
  "status": "running"
}

2. Get All Jobs

GET /jobs

Query Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| skip | integer | 0 | Number of records to skip |
| limit | integer | 50 | Maximum records to return |
| search | string | "" | Search term for title/company/location |

Example:

curl "http://localhost:8000/jobs?search=engineer&skip=0&limit=20"

Response:

[
  {
    "id": 1,
    "title": "Senior Software Engineer",
    "company": "Tech Corp Inc.",
    "location": "San Francisco, CA",
    "description": "We are looking for a Senior Software Engineer...",
    "url": "https://linkedin.com/jobs/view/123456",
    "source": "LinkedIn",
    "posted_date": "2023-12-01T10:00:00Z",
    "scraped_at": "2023-12-01T14:30:00Z"
  }
]
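
Under the hood, the search term is matched against title, company, and location. A rough sketch of how such a filter can be built with SQLAlchemy (assuming the Job model described in models.py; the actual query in main.py may differ):

from sqlalchemy import or_

import models

def filter_jobs(db, search: str = "", skip: int = 0, limit: int = 50):
    query = db.query(models.Job)
    if search:
        pattern = f"%{search}%"
        query = query.filter(or_(
            models.Job.title.ilike(pattern),
            models.Job.company.ilike(pattern),
            models.Job.location.ilike(pattern),
        ))
    # Newest scrapes first, with simple offset/limit pagination
    return query.order_by(models.Job.scraped_at.desc()).offset(skip).limit(limit).all()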

3. Get Single Job

GET /jobs/{job_id}

curl http://localhost:8000/jobs/1

4. Delete Job

DELETE /jobs/{job_id}

curl -X DELETE http://localhost:8000/jobs/1

Response:

{
  "message": "Job deleted"
}

5. Start Scraping

POST /scrape

Request Body:

{
  "keyword": "Software Engineer",
  "location": "New York",
  "max_pages": 2
}

Response:

{
  "message": "Scraping started",
  "keyword": "Software Engineer"
}
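
Scraping runs in the background so this request returns immediately. One way to structure such an endpoint is with FastAPI's BackgroundTasks; this is a hedged sketch in which run_scraper is a hypothetical helper that drives LinkedInScraper and stores the results:

from fastapi import BackgroundTasks

@app.post("/scrape")
def start_scrape(request: ScrapeRequest, background_tasks: BackgroundTasks):
    # Hand the long-running scrape off to a background task
    background_tasks.add_task(
        run_scraper, request.keyword, request.location, request.max_pages
    )
    return {"message": "Scraping started", "keyword": request.keyword}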

6. Get Scraping Status

GET /scraping-status

curl http://localhost:8000/scraping-status

Response:

{
  "is_scraping": true,
  "current_page": 1,
  "total_pages": 2,
  "jobs_found": 5,
  "current_job": "Scraping: Senior Backend Engineer at Google"
}

7. Stop Scraping

POST /stop-scraping

curl -X POST http://localhost:8000/stop-scraping

Response:

{
  "message": "Stopping scraper..."
}

8. Get Statistics

GET /stats

curl http://localhost:8000/stats

Response:

{
  "total_jobs": 150
}

9. Delete All Jobs (Admin)

DELETE /admin/delete-all

curl -X DELETE http://localhost:8000/admin/delete-all

Response:

{
  "message": "Deleted 150 jobs",
  "count": 150
}

Error Responses

| Status Code | Description |
| --- | --- |
| 400 | Bad Request - invalid input |
| 404 | Not Found - resource doesn't exist |
| 409 | Conflict - duplicate job |
| 500 | Internal Server Error |
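
These map onto FastAPI's HTTPException. For example, the single-job endpoint can raise a 404 roughly like this (a sketch; variable and model names are assumed from earlier sections):

from fastapi import Depends, HTTPException
from sqlalchemy.orm import Session

@app.get("/jobs/{job_id}")
def get_job(job_id: int, db: Session = Depends(get_db)):
    job = db.query(models.Job).filter(models.Job.id == job_id).first()
    if job is None:
        raise HTTPException(status_code=404, detail="Job not found")
    return job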

Pydantic Models

ScrapeRequest

from pydantic import BaseModel

class ScrapeRequest(BaseModel):
    keyword: str
    location: str
    max_pages: int = 2

JobResponse

from datetime import datetime
from typing import Optional

from pydantic import BaseModel

class JobResponse(BaseModel):
    id: int
    title: str
    company: str
    location: Optional[str]
    description: Optional[str]
    url: str
    source: str
    posted_date: Optional[datetime]
    scraped_at: datetime

💻 Frontend Guide

Component Overview

1. App (Main Dashboard)

Location: frontend/src/pages/App.jsx

  • Main application container
  • Manages global state (jobs, search, scraping status)
  • Renders all other components

2. Header Component

Location: frontend/src/components/Header.jsx

  • Application header with title
  • Navigation to admin panel
  • Responsive design with gradient background

3. ScrapeForm Component

Location: frontend/src/components/ScrapeForm.jsx

  • Form for starting scraping jobs
  • Real-time progress monitoring
  • Start/stop controls with visual feedback

Features:

  • Keyword and location inputs
  • Live scraping status updates
  • Progress bar and job count
  • Success/error notifications

4. SearchBar Component

Location: frontend/src/components/SearchBar.jsx

  • Search functionality for jobs
  • Real-time filtering
  • Clear search option

5. JobCard Component

Location: frontend/src/components/JobCard.jsx

  • Displays individual job listing
  • Compact view with essential info
  • Action buttons (View, Apply, Delete)

Job Card Layout:

┌─────────────────────────────────┐
│ Senior Software Engineer        │ ← Title
│ Google                          │ ← Company
│ 📍 Mountain View, CA            │ ← Location
│                                 │
│ [👁️ View] [📤 Apply]      [🗑️] │ ← Action buttons
└─────────────────────────────────┘

6. JobModal Component

Location: frontend/src/components/JobModal.jsx

  • Modal popup for detailed job view
  • Full job description
  • Direct apply link to LinkedIn

7. Footer Component

Location: frontend/src/components/Footer.jsx

  • Footer with author information
  • Contact details and GitHub link

8. AdminPanel Component

Location: frontend/src/pages/AdminPanel.jsx

  • Administrative interface
  • Statistics display
  • Bulk delete functionality
  • Database management tools

API Service Layer

Location: frontend/src/services/api.js

Methods Available:

// Get jobs with search
api.getJobs(search = '', skip = 0, limit = 50)

// Get single job
api.getJob(id)

// Start scraping
api.scrapeJobs(keyword, location, maxPages = 2)

// Get scraping status
api.getScrapingStatus()

// Stop scraping
api.stopScraping()

// Delete job
api.deleteJob(id)

// Delete all jobs
api.deleteAllJobs()

// Get statistics
api.getStats()

State Management

The application uses React hooks for state management:

// Main state variables
const [jobs, setJobs] = useState([])          // Job listings
const [search, setSearch] = useState('')      // Search term
const [loading, setLoading] = useState(false) // Loading state
const [scraping, setScraping] = useState(false) // Scraping status
const [selectedJob, setSelectedJob] = useState(null) // Selected job for modal

Routing

// React Router setup in main.jsx
<BrowserRouter>
  <Routes>
    <Route path="/" element={<App />} />
    <Route path="/admin" element={<AdminPanel />} />
  </Routes>
</BrowserRouter>

๐Ÿ—„๏ธ Database Schema

Jobs Table

CREATE TABLE jobs (
    id SERIAL PRIMARY KEY,
    title VARCHAR(500) NOT NULL,
    company VARCHAR(500) NOT NULL,
    location VARCHAR(500),
    description TEXT,
    url VARCHAR(1000) NOT NULL,
    source VARCHAR(100) DEFAULT 'LinkedIn',
    posted_date TIMESTAMP WITH TIME ZONE,
    scraped_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    
    -- Unique constraint to prevent duplicates
    UNIQUE(title, company, url)
);

Field Descriptions

| Field | Type | Description | Constraints |
| --- | --- | --- | --- |
| id | SERIAL | Auto-incrementing primary key | PRIMARY KEY |
| title | VARCHAR(500) | Job title | NOT NULL |
| company | VARCHAR(500) | Company name | NOT NULL |
| location | VARCHAR(500) | Job location | NULLABLE |
| description | TEXT | Full job description | NULLABLE |
| url | VARCHAR(1000) | LinkedIn job URL | NOT NULL |
| source | VARCHAR(100) | Source platform | DEFAULT 'LinkedIn' |
| posted_date | TIMESTAMPTZ | Original posting date | NULLABLE |
| scraped_at | TIMESTAMPTZ | When job was scraped | DEFAULT NOW() |

Indexes for Performance

-- Create indexes for faster queries
CREATE INDEX idx_jobs_title ON jobs(title);
CREATE INDEX idx_jobs_company ON jobs(company);
CREATE INDEX idx_jobs_location ON jobs(location);
CREATE INDEX idx_jobs_scraped_at ON jobs(scraped_at DESC);

-- Full-text search index (optional)
CREATE INDEX idx_jobs_search ON jobs 
USING gin(to_tsvector('english', title || ' ' || company || ' ' || COALESCE(location, '')));

Sample Data

INSERT INTO jobs (title, company, location, url, description, source)
VALUES (
    'Senior Software Engineer',
    'Google',
    'Mountain View, CA',
    'https://linkedin.com/jobs/view/123456',
    'Join our team to build scalable systems...',
    'LinkedIn'
);

🤖 Scraping Details

LinkedInScraper Class

Location: backend/scraper.py

Key Methods (a condensed skeleton follows this list):

  1. __init__(self)

    • Sets up Chrome browser with headless options
    • Configures WebDriver with anti-detection settings
    • Initializes WebDriverWait for element waiting
  2. close_popups(self)

    • Closes LinkedIn popups and modals
    • Uses multiple CSS selectors for robustness
    • Handles various popup types
  3. scrape_jobs(self, keyword, location, max_pages=20)

    • Main scraping method
    • Handles pagination and job extraction
    • Returns list of job dictionaries
  4. get_job_description(self, job_url)

    • Visits individual job pages
    • Extracts full job descriptions
    • Handles "Show more" buttons
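
A condensed, hedged skeleton of how these methods fit together. The selectors come from the list further below, but the waits, pagination, and error handling are heavily simplified compared to the real backend/scraper.py:

import time
from urllib.parse import quote_plus

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

class LinkedInScraper:
    def __init__(self, headless: bool = True):
        options = Options()
        if headless:
            options.add_argument("--headless")
        # Basic anti-detection: realistic user agent, hide the automation flag
        options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
        options.add_experimental_option("excludeSwitches", ["enable-automation"])
        self.driver = webdriver.Chrome(
            service=Service(ChromeDriverManager().install()), options=options
        )
        self.wait = WebDriverWait(self.driver, 15)

    def scrape_jobs(self, keyword: str, location: str, max_pages: int = 2):
        url = (
            "https://www.linkedin.com/jobs/search/"
            f"?keywords={quote_plus(keyword)}&location={quote_plus(location)}"
        )
        self.driver.get(url)
        jobs = []
        for _ in range(max_pages):
            # Scroll to load more results (real pagination handling is more involved)
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)  # polite delay while results load
            soup = BeautifulSoup(self.driver.page_source, "html.parser")
            for card in soup.select("div.base-card"):
                title = card.select_one("h3.base-search-card__title")
                company = card.select_one("h4.base-search-card__subtitle")
                link = card.select_one("a.base-card__full-link")
                if title and company and link:
                    jobs.append({
                        "title": title.get_text(strip=True),
                        "company": company.get_text(strip=True),
                        "url": link["href"],
                    })
        return jobs

    def close(self):
        self.driver.quit()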

Scraping Flow

1. Initialize Browser
   ↓
2. Navigate to LinkedIn Jobs Search
   ↓
3. Close Popups
   ↓
4. For each page:
   │   4.1. Scroll page
   │   4.2. Parse HTML with BeautifulSoup
   │   4.3. Extract job cards
   │   4.4. For each job card:
   │       │   4.4.1. Extract basic info
   │       │   4.4.2. Check for duplicates
   │       │   4.4.3. Visit job page for description
   │       │   4.4.4. Save to database
   ↓
5. Cleanup and Close Browser

Selectors Used

# Job card selectors
JOB_CARD_SELECTORS = [
    'div.job-search-card',
    'div.base-card',
    'li.jobs-search-results__list-item'
]

# Title selectors
TITLE_SELECTORS = [
    'h3.base-search-card__title',
    'a.base-card__full-link'
]

# Company selectors
COMPANY_SELECTORS = [
    'h4.base-search-card__subtitle',
    'a.hidden-nested-link'
]

# Description selectors
DESCRIPTION_SELECTORS = [
    'div.show-more-less-html__markup',
    'div.jobs-description__content',
    'div.description__text',
    'section.description'
]

Anti-Detection Measures

  1. User-Agent Rotation: Uses realistic user agent string
  2. Headless Mode: Runs browser in background
  3. Random Delays: Varies timing between requests
  4. Scroll Simulation: Mimics human scrolling behavior
  5. Popup Handling: Closes all interfering popups

Rate Limiting

To avoid LinkedIn blocking:

  • Default delay: 1-3 seconds between requests
  • Max pages per scrape: 2 (configurable)
  • Random delays to mimic human behavior
  • Consider using proxies for production

🔧 Troubleshooting

Common Issues & Solutions

Issue 1: Database Connection Failed

Symptoms:

  • "Could not connect to database" error
  • Jobs not saving to database
  • API returning 500 errors

Solutions:

  1. Check if PostgreSQL is running:

    # Linux
    sudo systemctl status postgresql
    
    # Windows
    # Check Services for PostgreSQL
  2. Verify connection string in .env:

    DATABASE_URL=postgresql://username:password@localhost:5432/database
  3. Test connection manually:

    psql -U username -d database -h localhost

Issue 2: ChromeDriver Not Found

Symptoms:

  • "ChromeDriver executable needs to be in PATH" error
  • Selenium fails to start
  • Browser not opening

Solutions:

  1. Update Chrome and ChromeDriver:

    pip install --upgrade webdriver-manager
  2. Check Chrome installation:

    google-chrome --version
    # or
    chromium --version
  3. Run in non-headless mode for debugging:

    # In scraper.py, comment out:
    # options.add_argument('--headless')

Issue 3: Memory Issues

Symptoms:

  • Application slowing down over time
  • High memory usage in task manager
  • Browser crashes during scraping

Solutions:

  1. Limit scraping pages:

    max_pages=2  # Reduce from higher values
  2. Increase delay between requests:

    time.sleep(3)  # Increase from 1 second
  3. Implement periodic browser restart

Issue 4: LinkedIn Blocking/Throttling

Symptoms:

  • CAPTCHA appears
  • "Access denied" errors
  • No job cards found
  • IP address temporarily blocked

Solutions:

  1. Add longer, random delays:

    import random
    time.sleep(random.randint(3, 7))
  2. Reduce scraping frequency:

    max_pages=1  # Scrape fewer pages at once
  3. Use proxy rotation (advanced)

Issue 5: Frontend Not Connecting to Backend

Symptoms:

  • "Failed to fetch" errors in console
  • API calls timing out
  • Blank dashboard
  • CORS errors

Solutions:

  1. Check if backend is running:

    curl http://localhost:8000
  2. Update API URL in frontend:

    // In frontend/src/services/api.js
    const API_URL = 'http://localhost:8000';
  3. Check CORS configuration:

    # In main.py
    app.add_middleware(
        CORSMiddleware,
        allow_origins=["http://localhost:5173"],
        allow_credentials=True,
        allow_methods=["*"],
        allow_headers=["*"],
    )

Debugging Tips

Enable Debug Logging

import logging

logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('debug.log'),
        logging.StreamHandler()
    ]
)

Check Application Logs

# Backend logs
tail -f uvicorn.log

# Database logs (Linux)
tail -f /var/log/postgresql/postgresql-14-main.log

# Application logs
tail -f debug.log

Test API Endpoints

curl http://localhost:8000/                        # Health check
curl http://localhost:8000/jobs                    # Get jobs
curl http://localhost:8000/stats                   # Get stats

Database Diagnostics

-- Check job count
SELECT COUNT(*) FROM jobs;

-- Check recent jobs
SELECT * FROM jobs ORDER BY scraped_at DESC LIMIT 5;

-- Check for duplicates
SELECT title, company, COUNT(*)
FROM jobs
GROUP BY title, company
HAVING COUNT(*) > 1;

Performance Optimization Tips

  1. Database Indexing:

    CREATE INDEX idx_jobs_combined ON jobs(title, company, location);
    CREATE INDEX idx_jobs_posted_date ON jobs(posted_date DESC);
  2. Connection Pooling:

    engine = create_engine(
        DATABASE_URL,
        pool_size=10,
        max_overflow=20,
        pool_recycle=3600,
        pool_pre_ping=True
    )
  3. Query Optimization:

    from sqlalchemy.orm import selectinload
    jobs = db.query(Job).options(selectinload(Job.tags)).all()

🛠️ Development

Setting Up Development Environment

1. Clone and Setup

# Clone repository
git clone https://github.com/your-username/job-scraper-dashboard.git
cd job-scraper-dashboard

# Setup backend
cd backend
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Setup frontend
cd ../frontend
npm install

2. Development Dependencies

Backend (requirements-dev.txt):

pytest>=7.0.0
pytest-asyncio>=0.20.0
black>=23.0.0
flake8>=6.0.0
mypy>=1.0.0
pre-commit>=3.0.0

Frontend Development:

npm install -D eslint prettier @types/react @types/react-dom

Code Standards

Python Backend

  • Follow PEP 8 guidelines
  • Use type hints for all functions
  • Maximum line length: 100 characters
  • Use docstrings for all public methods

Example:

def get_jobs(
    skip: int = 0,
    limit: int = 50,
    search: Optional[str] = None,
    db: Session = Depends(get_db),
) -> List[JobResponse]:
    """
    Retrieve jobs with optional filtering and pagination.
    
    Args:
        skip: Number of records to skip
        limit: Maximum records to return
        search: Search term for filtering
        db: Database session
    
    Returns:
        List of job objects
    """
    query = db.query(models.Job)
    # ... implementation

React Frontend

  • Use functional components with hooks
  • Follow React naming conventions
  • Use Tailwind CSS for styling
  • Implement prop types or TypeScript

Example:

const JobCard = ({ job, onView, onDelete }) => {
  return (
    <div className="job-card">
      {/* JSX content */}
    </div>
  );
};

JobCard.propTypes = {
  job: PropTypes.object.isRequired,
  onView: PropTypes.func.isRequired,
  onDelete: PropTypes.func.isRequired,
};

Testing

Backend Tests

# tests/test_main.py
from fastapi.testclient import TestClient
from main import app

client = TestClient(app)

def test_get_jobs():
    response = client.get("/jobs")
    assert response.status_code == 200
    assert isinstance(response.json(), list)

Frontend Tests

// JobCard.test.jsx
test('renders job cards', () => {
  render(<JobCard job={mockJob} />);
  expect(screen.getByText(mockJob.title)).toBeInTheDocument();
});

🤝 Contributing

Development Workflow

  1. Fork the Repository
  2. Create Feature Branch
    git checkout -b feature/your-feature-name
  3. Make Changes
    • Follow coding standards
    • Add tests if applicable
    • Update documentation
  4. Commit Changes
    git add .
    git commit -m "Add: Description of changes"
  5. Push and Create PR
    git push origin feature/your-feature-name
    # Create Pull Request on GitHub

Code Standards

  • Use conventional commits: feat:, fix:, docs:, style:, refactor:, test:, chore:
  • Update README for new features
  • Add API documentation for new endpoints
  • Update component documentation

Review Process

  1. Code Review: All PRs require review
  2. Testing: Must pass existing tests
  3. Documentation: Must be updated
  4. CI/CD: Must pass pipeline

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • LinkedIn for providing job data
  • Open source community for libraries and tools
  • Contributors and testers
  • Mentors and advisors

⚠️ Assumptions & Limitations

Assumptions

  1. LinkedIn Structure Stability: LinkedIn's HTML/CSS structure remains relatively unchanged
  2. Public Access: Jobs are accessible without LinkedIn login
  3. English Content: Primary language for job descriptions is English
  4. Geographic Availability: Jobs are available in specified locations
  5. Browser Compatibility: Chrome/Chromium is available on the system
  6. Network Stability: Stable internet connection for scraping

Technical Limitations

  1. Rate Limiting: LinkedIn may block excessive requests
  2. CAPTCHA Challenges: May encounter CAPTCHA during scraping
  3. JavaScript Rendering: Requires Selenium for dynamic content
  4. Memory Usage: Long scraping sessions may use significant memory
  5. Network Dependence: Requires stable internet connection
  6. Browser Updates: ChromeDriver compatibility issues with Chrome updates

Functional Limitations

  1. Single Source: Currently only supports LinkedIn
  2. No Scheduling: Manual scraping only, no automated schedules
  3. Limited Filters: Basic keyword/location filtering only
  4. No User Accounts: Single-user system
  5. No Export: Cannot export data to external formats
  6. No Notifications: No alert system for new jobs

🔮 Future Improvements

Phase 1: Immediate (1-2 Months)

  1. Indeed Integration

    • Add support for Indeed.com scraping
    • Unified job storage
    • Source-specific parsing
  2. Advanced Filters

    • Salary range filtering
    • Job type (full-time, contract, etc.)
    • Experience level filtering
    • Remote/hybrid/onsite options
  3. Export Functionality

    • CSV export
    • Excel export
    • PDF reports
    • JSON API for integration

Phase 2: Short-term (3-6 Months)

  1. User Authentication

    • Multi-user support
    • Role-based access (admin/user)
    • User preferences
    • Saved searches
  2. Email Notifications

    • New job alerts
    • Daily/weekly digests
    • Custom notification rules
    • Unsubscribe options
  3. Scheduling System

    • Automated daily scraping
    • Custom schedule configuration
    • Result notifications
    • Performance monitoring

Phase 3: Medium-term (6-12 Months)

  1. Multiple Job Sources

    • Glassdoor integration
    • Monster integration
    • CareerBuilder support
    • Company career pages
  2. Advanced Analytics

    • Job market trends
    • Salary analysis
    • Company insights
    • Location heatmaps
  3. Resume Matching

    • Resume upload
    • Skills matching
    • Job recommendations
    • Application tracking

Phase 4: Long-term (12+ Months)

  1. AI Features

    • Smart job recommendations
    • Resume optimization
    • Interview preparation
    • Salary negotiation tips
  2. Mobile Application

    • iOS app
    • Android app
    • Push notifications
    • Offline access
  3. Enterprise Features

    • Team collaboration
    • Applicant tracking
    • Reporting dashboard
    • API access for businesses

Technical Improvements

  1. Performance Optimization

    • Database indexing optimization
    • Caching implementation
    • Async processing improvements
    • Load balancing
  2. Security Enhancements

    • JWT authentication
    • Rate limiting
    • Input validation
    • Security headers
  3. Monitoring & Logging

    • Application performance monitoring
    • Error tracking
    • Usage analytics
    • Audit logging

Documentation last updated: December 2025
Project Version: 1.0.0
Maintainer: Uzair Javed
Contact: [email protected]
GitHub: uzair-javed-1
LinkedIn: LinkedIn