isshiki-dev/web-content-extractor

🌐 Web Content Extractor

Extract clean, readable content from any webpage in seconds.

Demo · Documentation · API Reference · MCP Integration


✨ Features

| Feature | Description |
| --- | --- |
| 🚀 One-Click Extraction | Paste URL, click extract, get clean markdown |
| 📝 Multiple Formats | Export as Markdown, HTML, or plain text |
| 🎨 Beautiful UI | Modern interface with dark/light theme support |
| 💾 Local History | Auto-saves extraction history in browser |
| 🤖 MCP Integration | Works with Claude, ChatGPT, and AI agents |
| 🔌 REST API | Programmatic access for automation |
| ⚡ Blazing Fast | Uses Mozilla Readability for instant parsing |

🚀 Quick Start

Prerequisites

  • Node.js 18+
  • pnpm (recommended) or npm

Installation

# Clone the repository
git clone https://github.com/isshiki-dev/web-content-extractor.git
cd web-content-extractor

# Install dependencies
pnpm install

# Start development servers
pnpm dev:full

💡 Tip: pnpm dev:full starts both the frontend (port 5173) and the API server (port 3001) concurrently.

Manual Start

# Terminal 1: Frontend
pnpm dev

# Terminal 2: API Server
pnpm server

📖 Usage

Web Interface

  1. Open http://localhost:5173
  2. Paste any URL in the input field
  3. Click Extract
  4. View, copy, or save the extracted content

Keyboard Shortcuts

| Shortcut | Action |
| --- | --- |
| Ctrl/Cmd + Enter | Extract URL |
| Ctrl/Cmd + C | Copy content |
| Ctrl/Cmd + S | Save as file |
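
The shortcut table can be read as a small mapping from key presses to actions. The following is only a sketch of that mapping, not the project's actual handler: `resolveShortcut`, `ShortcutAction`, and `KeyPress` are hypothetical names introduced here for illustration.

```typescript
// Hypothetical action names; the real app may label these differently.
type ShortcutAction = "extract" | "copy" | "save";

// Minimal subset of the browser's KeyboardEvent, so the function
// can be exercised outside a browser as well.
interface KeyPress {
  key: string;
  ctrlKey: boolean;
  metaKey: boolean;
}

// Map a key press to the action listed in the shortcut table.
// Returns null when no modifier is held or the key is not a shortcut.
function resolveShortcut(e: KeyPress): ShortcutAction | null {
  if (!(e.ctrlKey || e.metaKey)) return null; // Ctrl (Windows/Linux) or Cmd (macOS)
  switch (e.key.toLowerCase()) {
    case "enter":
      return "extract";
    case "c":
      return "copy";
    case "s":
      return "save";
    default:
      return null;
  }
}
```

In a browser, a `keydown` listener could pass the event straight to `resolveShortcut` and call `preventDefault()` when it returns a non-null action.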

🔌 API Reference

Extract Content

POST /api/extract
Content-Type: application/json

{
  "url": "https://example.com/article"
}

Response:

{
  "success": true,
  "data": {
    "title": "Article Title",
    "content": "# Article Title\n\nExtracted content...",
    "textContent": "Plain text version...",
    "excerpt": "Brief summary...",
    "byline": "Author Name",
    "siteName": "Example Site",
    "length": 1234,
    "url": "https://example.com/article"
  }
}
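
A minimal sketch of calling this endpoint from TypeScript, assuming Node.js 18+ (for the global `fetch`) and an API server on `http://localhost:3001` as described in Quick Start. The interface fields mirror the sample response above; the helper names (`buildExtractRequest`, `extractContent`) and the behaviour on failure are assumptions, since the error shape is not documented here.

```typescript
// Fields taken from the sample response above.
interface ExtractedArticle {
  title: string;
  content: string; // Markdown
  textContent: string; // plain text
  excerpt: string;
  byline: string;
  siteName: string;
  length: number;
  url: string;
}

interface ExtractResponse {
  success: boolean;
  data?: ExtractedArticle; // assumed to be absent when success is false
}

// Assumption: the API server started by `pnpm server` listens on port 3001.
const API_BASE = "http://localhost:3001";

interface JsonPost {
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Build the request separately so it can be inspected without a network call.
function buildExtractRequest(url: string): { endpoint: string; init: JsonPost } {
  return {
    endpoint: `${API_BASE}/api/extract`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ url }),
    },
  };
}

// Send the request and return the extracted article, or throw on failure.
async function extractContent(url: string): Promise<ExtractedArticle> {
  const { endpoint, init } = buildExtractRequest(url);
  const res = await fetch(endpoint, init);
  if (!res.ok) throw new Error(`Extract request failed with HTTP ${res.status}`);
  const body = (await res.json()) as ExtractResponse;
  if (!body.success || !body.data) throw new Error("Extraction was not successful");
  return body.data;
}
```

With the server running, `extractContent("https://example.com/article")` would resolve to the `data` object, whose `content` field holds the Markdown.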

Save Content

POST /api/save
Content-Type: application/json

{
  "content": "# Title\n\nContent...",
  "filename": "article.md"
}

List Saved Files

GET /api/files
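
The save and list endpoints can be driven the same way. This is a sketch under assumptions: the base URL comes from the Quick Start tip, the response bodies are not documented here so they are returned as `unknown`, and both endpoints are assumed to answer with JSON. `saveCall`, `listFilesCall`, and `send` are hypothetical helper names.

```typescript
// Assumption: API server on port 3001, as in the Quick Start tip.
const FILES_API_BASE = "http://localhost:3001";

// Request body documented for POST /api/save.
interface SaveRequest {
  content: string;
  filename: string;
}

interface ApiCall {
  endpoint: string;
  method: "GET" | "POST";
  headers: Record<string, string>;
  body?: string;
}

// Describe a POST /api/save call.
function saveCall(req: SaveRequest): ApiCall {
  return {
    endpoint: `${FILES_API_BASE}/api/save`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  };
}

// Describe a GET /api/files call.
function listFilesCall(): ApiCall {
  return { endpoint: `${FILES_API_BASE}/api/files`, method: "GET", headers: {} };
}

// Execute a call. The response is typed as unknown because its
// shape is not documented; JSON is an assumption.
async function send(call: ApiCall): Promise<unknown> {
  const { endpoint, ...init } = call;
  const res = await fetch(endpoint, init);
  if (!res.ok) throw new Error(`${call.method} ${endpoint} failed with HTTP ${res.status}`);
  return res.json();
}
```

A typical flow would be `await send(saveCall({ content, filename: "article.md" }))` followed by `await send(listFilesCall())` to confirm the file appears.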

Get Sitemap

GET /sitemap.xml

🤖 MCP Server

The MCP (Model Context Protocol) server allows AI agents like Claude to extract web content directly.

Starting the MCP Server

pnpm mcp

Available Tools

extract_content

Extract content from a single URL.

{
  "name": "extract_content",
  "arguments": {
    "url": "https://example.com/article",
    "format": "markdown"
  }
}

extract_multiple

Extract content from multiple URLs in parallel.

{
  "name": "extract_multiple",
  "arguments": {
    "urls": [
      "https://example.com/article1",
      "https://example.com/article2"
    ],
    "format": "markdown"
  }
}
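
On the wire, an MCP client wraps tool payloads like the two above in a JSON-RPC 2.0 `tools/call` request. The sketch below only builds that envelope; it does not start or talk to the server. `toolCallRequest` and the type names are hypothetical, and `format` is typed as a plain string because `"markdown"` is the only value shown here.

```typescript
// A tool invocation as shown in the examples above.
interface ToolCall<A> {
  name: string;
  arguments: A;
}

// JSON-RPC 2.0 request envelope used by MCP.
interface JsonRpcRequest<P> {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params: P;
}

let nextRequestId = 1;

// Wrap a tool call in a `tools/call` request with a fresh id.
function toolCallRequest<A>(call: ToolCall<A>): JsonRpcRequest<ToolCall<A>> {
  return { jsonrpc: "2.0", id: nextRequestId++, method: "tools/call", params: call };
}

const singleRequest = toolCallRequest({
  name: "extract_content",
  arguments: { url: "https://example.com/article", format: "markdown" },
});

const batchRequest = toolCallRequest({
  name: "extract_multiple",
  arguments: {
    urls: ["https://example.com/article1", "https://example.com/article2"],
    format: "markdown",
  },
});

console.log(JSON.stringify(singleRequest, null, 2));
console.log(JSON.stringify(batchRequest, null, 2));
```

In practice an MCP client library or host application such as Claude Desktop produces these messages, so building them by hand is mainly useful for debugging.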

Claude Desktop Integration

Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "web-extractor": {
      "command": "npx",
      "args": ["tsx", "/path/to/web-content-extractor/server/mcp.ts"]
    }
  }
}
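
The `/path/to/...` entry above is a placeholder for the absolute location of your checkout. As a convenience, the following sketch generates the same config with the path resolved from a repository root. `buildClaudeConfig` is a hypothetical helper, and it assumes the script is run from the repository root.

```typescript
import * as path from "node:path";

interface McpServerEntry {
  command: string;
  args: string[];
}

interface ClaudeDesktopConfig {
  mcpServers: Record<string, McpServerEntry>;
}

// Build the config shown above, resolving server/mcp.ts to an absolute path.
function buildClaudeConfig(repoRoot: string): ClaudeDesktopConfig {
  const mcpEntry = path.resolve(repoRoot, "server", "mcp.ts");
  return {
    mcpServers: {
      "web-extractor": { command: "npx", args: ["tsx", mcpEntry] },
    },
  };
}

// Assumption: the current working directory is the repository root.
console.log(JSON.stringify(buildClaudeConfig(process.cwd()), null, 2));
```

The printed JSON can be merged into an existing claude_desktop_config.json under its `mcpServers` key.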

🏗️ Project Structure

web-content-extractor/
├── 📁 public/              # Static assets
├── 📁 server/
│   ├── api.ts              # Express REST API
│   └── mcp.ts              # MCP server for AI agents
├── 📁 src/
│   ├── 📁 components/
│   │   ├── ExtractorForm.tsx
│   │   ├── ExtractedContent.tsx
│   │   ├── History.tsx
│   │   ├── MarkdownDisplay.tsx
│   │   ├── SaveDialog.tsx
│   │   └── ui/             # shadcn/ui components
│   ├── 📁 hooks/
│   │   └── useExtractor.ts
│   ├── 📁 lib/
│   │   ├── extractor.ts    # Server extraction logic
│   │   ├── client-extractor.ts
│   │   └── api-handler.ts
│   ├── App.tsx
│   └── main.tsx
├── package.json
├── vite.config.ts
└── tailwind.config.js

🛠️ Tech Stack

  • React 18
  • TypeScript
  • Vite
  • Tailwind
  • Node.js
  • Express

Key Dependencies


📊 Performance

| Metric | Value |
| --- | --- |
| Average extraction time | < 500 ms |
| Bundle size (gzipped) | ~85 kB |
| Lighthouse score | 95+ |

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


Made with ❤️ by isshiki-dev

⭐ Star this repo if you find it useful!

About

Extract clean, readable content from any webpage. Convert to Markdown format. Built with React, Vite, and Mozilla Readability.
