
refactor(scraper): replace Crawlee with hand-rolled fetch+cheerio#67

Merged
ThatXliner merged 14 commits into main from feature/scraper-refactor
Apr 3, 2026


Collaborator

@ThatXliner ThatXliner commented Mar 30, 2026

Summary

  • Drop Crawlee + Playwright — replaced with fetch + cheerio directly, removing ~2100 lines of lockfile and heavy dependencies
  • Unified upsert pipeline — merged upsertBill, upsertGovernmentContent, upsertCourtCase into a single upsertContent(type, data) using a discriminated union
  • Shared utilities — added fetchWithRetry() (retry + backoff + Retry-After + timeout) and log() (timestamped, scraper-prefixed logging)
  • Simplified runner: main.ts is now a loop over Scraper[] objects with yargs for CLI parsing
  • Fixed pre-existing TS errors in google-images.ts
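For readers unfamiliar with the pattern, here is a minimal sketch of how a single upsert entry point can be typed with a discriminated union. It is not the code from this PR: the record shapes, the field names, the in-memory `store`, and the `naturalKey` helper are all hypothetical, and the sketch takes one tagged object rather than the `(type, data)` pair described above so that TypeScript can narrow `data` from `type`.

```typescript
// Hypothetical record shapes; the real columns are not shown in this PR.
interface BillData { billNumber: string; title: string }
interface GovernmentContentData { url: string; title: string }
interface CourtCaseData { caseNumber: string; title: string }

// The `type` field is the discriminant: checking it narrows `data`.
type ContentInput =
  | { type: "bill"; data: BillData }
  | { type: "government_content"; data: GovernmentContentData }
  | { type: "court_case"; data: CourtCaseData };

// In-memory stand-in for the database (assumption, for illustration only).
const store = new Map<string, ContentInput["data"]>();

// Build a per-type unique key. The `never` assignment makes the compiler
// flag this switch if a new content type is added but not handled here.
function naturalKey(input: ContentInput): string {
  switch (input.type) {
    case "bill":
      return `bill:${input.data.billNumber}`;
    case "government_content":
      return `government_content:${input.data.url}`;
    case "court_case":
      return `court_case:${input.data.caseNumber}`;
    default: {
      const unreachable: never = input;
      throw new Error(`Unhandled content type: ${JSON.stringify(unreachable)}`);
    }
  }
}

// Insert the record if its key is new, otherwise overwrite it.
function upsertContent(input: ContentInput): "inserted" | "updated" {
  const key = naturalKey(input);
  const existed = store.has(key);
  store.set(key, input.data);
  return existed ? "updated" : "inserted";
}
```

The benefit of this shape is that one code path handles key derivation and writing, while per-type differences stay confined to a single exhaustive switch.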

Net result: -2100 lines, fewer deps, unified patterns, same behavior.
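As a rough illustration of what "retry + backoff + Retry-After + timeout" can look like, here is a sketch under stated assumptions. It is not the `fetchWithRetry()` added in this PR: the option names, the defaults, the choice to retry only on 429 and 5xx, and the injectable `fetchImpl` are all hypothetical. It relies only on the standard `fetch`, `Response`, and `AbortSignal.timeout` globals available in recent Node versions.

```typescript
// Hypothetical options; the real signature in the PR may differ.
interface RetryOptions {
  retries?: number;        // extra attempts after the first one
  baseDelayMs?: number;    // first backoff delay, doubled on each attempt
  timeoutMs?: number;      // per-attempt timeout
  fetchImpl?: typeof fetch; // injectable for testing
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry-After is either a number of seconds or an HTTP date.
// Returns the wait in milliseconds, or undefined if absent or unparseable.
function parseRetryAfter(header: string | null, now: number = Date.now()): number | undefined {
  if (header === null) return undefined;
  const trimmed = header.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const date = Date.parse(trimmed);
  return Number.isNaN(date) ? undefined : Math.max(0, date - now);
}

async function fetchWithRetry(url: string, options: RetryOptions = {}): Promise<Response> {
  const { retries = 3, baseDelayMs = 500, timeoutMs = 10_000, fetchImpl = fetch } = options;
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    let retryAfterMs: number | undefined;
    try {
      const response = await fetchImpl(url, { signal: AbortSignal.timeout(timeoutMs) });
      // Assumption: only 429 and 5xx are worth retrying; other statuses go to the caller.
      const retryable = response.status === 429 || response.status >= 500;
      if (!retryable) return response;
      lastError = new Error(`HTTP ${response.status} for ${url}`);
      retryAfterMs = parseRetryAfter(response.headers.get("retry-after"));
    } catch (error) {
      lastError = error; // network failure or timeout
    }
    if (attempt < retries) {
      // Prefer the server's Retry-After hint, else exponential backoff.
      await sleep(retryAfterMs ?? baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}
```

A call such as `await fetchWithRetry(url, { retries: 5 })` would then make up to six attempts before rethrowing the last error.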

Test plan

  • cd apps/scraper && npx tsc --noEmit — compiles with only pre-existing @acme/db errors
  • pnpm run start:dev govtrack — fetches bills, logs with [HH:MM:SS] [GovTrack] prefix
  • pnpm run start:dev whitehouse — fetches articles with pagination
  • pnpm run start:dev congress — fetches from Congress.gov API
  • pnpm run start:dev scotus — fetches from CourtListener API
  • pnpm run start:dev --help — shows yargs help output
  • grep -r "crawlee" apps/scraper/src/ — no matches
  • Docker build still works
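To make the expected log output in the test plan concrete, here is a small sketch of a timestamped, scraper-prefixed logger driving a loop over `Scraper[]`. It is an illustration only: the `Scraper` shape, the `formatLogLine` and `runScrapers` names, the use of local time, and the rule that an empty name list runs every scraper are assumptions, and yargs parsing is left out so the block has no dependencies (the `requested` array stands in for the parsed positional arguments).

```typescript
// Hypothetical shape; the real Scraper type in the PR is not shown.
interface Scraper {
  name: string;
  run: () => Promise<void>;
}

// Formats a line as "[HH:MM:SS] [Name] message" using local time (assumption).
function formatLogLine(scraper: string, message: string, at: Date = new Date()): string {
  const pad = (n: number) => String(n).padStart(2, "0");
  const time = `${pad(at.getHours())}:${pad(at.getMinutes())}:${pad(at.getSeconds())}`;
  return `[${time}] [${scraper}] ${message}`;
}

function log(scraper: string, message: string): void {
  console.log(formatLogLine(scraper, message));
}

// Runs the requested scrapers one after another and returns the names that failed.
// Assumption: an empty `requested` list means "run everything".
async function runScrapers(scrapers: Scraper[], requested: string[]): Promise<string[]> {
  const wanted = new Set(requested.map((name) => name.toLowerCase()));
  const selected =
    wanted.size === 0 ? scrapers : scrapers.filter((s) => wanted.has(s.name.toLowerCase()));
  const failed: string[] = [];
  for (const scraper of selected) {
    log(scraper.name, "starting");
    try {
      await scraper.run();
      log(scraper.name, "done");
    } catch (error) {
      // One failing scraper does not stop the others.
      failed.push(scraper.name);
      log(scraper.name, `failed: ${error instanceof Error ? error.message : String(error)}`);
    }
  }
  return failed;
}
```

With this sketch, `runScrapers(all, ["govtrack"])` would run only the scraper named GovTrack and print lines carrying the `[HH:MM:SS] [GovTrack]` prefix.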

🤖 Generated with Claude Code

ThatXliner and others added 11 commits March 30, 2026 14:18
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…upsertContent

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…drop Crawlee

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…t + log

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…+ log

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…, yargs, TS errors

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
vercel bot commented Mar 30, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Actions | Updated (UTC)
billion-nextjs | Ready | Preview, Comment | Apr 3, 2026 5:19pm

Collaborator

@lcai000 lcai000 left a comment


Removing the Crawlee and Playwright dependencies is good; it makes things simpler.
Reduced LOC is good.

TODO: make the logging more descriptive. The current logging prints a lot of random output, much of it unnecessary; clean it up.

@ThatXliner ThatXliner merged commit 6f0ac32 into main Apr 3, 2026
3 of 5 checks passed

