
feat(amp-worker-gc): standalone GC job for control-plane scheduling #1989

Open
mitchhs12 wants to merge 5 commits into main from mitchhs12/gc-job-extraction

@mitchhs12 commented Mar 17, 2026

Summary

Extracts garbage collection from the worker's compaction task into a standalone job type managed by the controller via the job ledger. This decouples GC from compaction so it can run independently regardless of whether materialization jobs are active.

Changes

  • New crate amp-worker-gc: Job descriptor, idempotency key, error types, OTel metrics, and collection algorithm (stream expired files → delete metadata → delete physical files)
  • Worker dispatch: New Gc variant in JobDescriptor enum with execution wiring in job_impl.rs
  • Controller scheduling: Background task that periodically scans active physical table revisions and schedules GC jobs via the job ledger, skipping any location whose last_success_at is more recent than the configured interval
  • Config toggle: GcSchedulingConfig with enabled (default false) and interval (default 60s) — GC scheduling is off by default for safe rollout
  • Integration tests: 3 tests verifying the full GC pipeline against real Postgres + filesystem
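
The three-step collection order named in the first bullet (stream expired files, delete metadata, delete physical files) can be sketched as follows. This is a minimal illustration, not the code in amp-worker-gc: `Store`, `GcStats`, and `collect` are hypothetical names, and an in-memory map and set stand in for Postgres and the filesystem. The counter names are borrowed from the `GcMetrics` fields listed in the commit messages.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical in-memory stand-in for the metadata database and file storage.
#[derive(Default)]
struct Store {
    /// File id -> expiry time in seconds; models the metadata rows.
    metadata: BTreeMap<u64, u64>,
    /// Ids of the files that physically exist.
    files: BTreeSet<u64>,
}

/// Tallies for one GC run (names mirror the GcMetrics fields; the struct is illustrative).
#[derive(Debug, Default, PartialEq)]
struct GcStats {
    expired_files_found: u64,
    metadata_entries_deleted: u64,
    files_deleted: u64,
    files_not_found: u64,
}

/// Runs one collection pass at time `now` and reports what was removed.
fn collect(store: &mut Store, now: u64) -> GcStats {
    let mut stats = GcStats::default();

    // Step 1: gather every file whose expiry time has passed.
    let expired: Vec<u64> = store
        .metadata
        .iter()
        .filter(|(_, expires_at)| **expires_at <= now)
        .map(|(id, _)| *id)
        .collect();
    stats.expired_files_found = expired.len() as u64;

    // Step 2: delete the metadata entries before touching the files.
    for id in &expired {
        if store.metadata.remove(id).is_some() {
            stats.metadata_entries_deleted += 1;
        }
    }

    // Step 3: delete the physical files; a file that is already gone is counted, not an error.
    for id in &expired {
        if store.files.remove(id) {
            stats.files_deleted += 1;
        } else {
            stats.files_not_found += 1;
        }
    }

    stats
}
```

Treating an already-missing file as a counted outcome rather than a failure is an assumption suggested by the `files_not_found` metric; it would let a retried job finish cleanly after a partial run.
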

…eduling

Extract garbage collection from the worker's compaction task into a standalone
job type managed by the controller via the job ledger. This decouples GC from
compaction so it can run independently regardless of whether materialization
jobs are active.

New crate: amp-worker-gc with job descriptor, idempotency key, and collection
algorithm. Controller schedules GC jobs per active physical table revision on
a 60s interval. Workers execute them using the same stream-expired → delete-
metadata → delete-files algorithm as the existing Collector.

Add GcSchedulingConfig with `enabled` (default false) and `interval`
(default 60s) fields so GC scheduling can be toggled without code changes.
The controller only spawns the GC scheduling task when enabled, preventing
unintended GC job creation on deployment.
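
A rough sketch of the toggle described above, using only the standard library. The field names and defaults come from the commit message; the `Default` impl and the `gc_task_interval` helper are assumptions made for illustration, and the real struct presumably also derives deserialization for `config.sample.toml`.

```rust
use std::time::Duration;

/// Illustrative shape of the config; only `enabled` and `interval` are named in the PR.
#[derive(Debug, Clone, PartialEq)]
struct GcSchedulingConfig {
    enabled: bool,
    interval: Duration,
}

impl Default for GcSchedulingConfig {
    fn default() -> Self {
        Self {
            // Off by default so a deployment does not start creating GC jobs.
            enabled: false,
            interval: Duration::from_secs(60),
        }
    }
}

/// Hypothetical helper: returns the tick interval only when the scheduling
/// task should be spawned at all.
fn gc_task_interval(config: &GcSchedulingConfig) -> Option<Duration> {
    config.enabled.then_some(config.interval)
}
```

With this shape, the controller's startup path would spawn the background task only when `gc_task_interval` returns `Some`, which matches the "only spawns when enabled" behaviour described in the commit.
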
…egration tests

- Add GcMetrics (expired_files_found, metadata_entries_deleted, files_deleted,
  files_not_found) with OpenTelemetry counters keyed by location_id
- Add last_success_at check in schedule_gc_jobs() to respect the configured
  interval between GC runs per location (RFC compliance)
- Add 3 integration tests in tests/src/tests/it_gc.rs verifying the full
  collection algorithm against real Postgres + filesystem
- Update config.sample.toml with [gc_scheduling] section
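
The `last_success_at` check mentioned in this commit could look roughly like the function below. `is_gc_due` is a hypothetical name, and the handling of a missing or future timestamp is an assumption; the PR only states that the configured interval between runs is respected per location.

```rust
use std::time::{Duration, SystemTime};

/// Decides whether a location is due for another GC run.
///
/// Assumptions in this sketch: a location that has never succeeded is always
/// due, and a `last_success_at` in the future (clock skew) counts as too recent.
fn is_gc_due(last_success_at: Option<SystemTime>, now: SystemTime, interval: Duration) -> bool {
    match last_success_at {
        // No successful run recorded yet: schedule one.
        None => true,
        Some(last) => match now.duration_since(last) {
            // Due once at least one full interval has elapsed.
            Ok(elapsed) => elapsed >= interval,
            // `last` is later than `now`: skip this round.
            Err(_) => false,
        },
    }
}
```
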

The workspace_crates_match_amp_crates_list test validates that the
hardcoded AMP_CRATES list matches the actual workspace members. Adding the
new amp-worker-gc crate to the workspace requires updating this list.

@mitchhs12 force-pushed the mitchhs12/gc-job-extraction branch from c571586 to 74eb098 on March 18, 2026 15:13

…e_ref tracing

- Add GcSchedulerMetrics to controller scheduler with gc_jobs_dispatched_total,
  gc_jobs_skipped_in_flight_total, and gc_jobs_skipped_too_recent_total counters
- Add table_ref field to GC job execution tracing span, recorded from the
  revision path after lookup
- Pass Meter to Scheduler for metrics initialization
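
To show how the three scheduler counters relate to one scheduling pass, here is a simplified sketch. Plain `u64` fields stand in for the OpenTelemetry counters, and `LocationState` and `schedule_pass` are hypothetical; in particular, checking for an in-flight job before checking recency is an assumed order, not something the PR states.

```rust
/// Plain-integer stand-ins for the three counters named in this commit.
#[derive(Debug, Default, PartialEq)]
struct GcSchedulerMetrics {
    gc_jobs_dispatched_total: u64,
    gc_jobs_skipped_in_flight_total: u64,
    gc_jobs_skipped_too_recent_total: u64,
}

/// Hypothetical per-location view assembled from the job ledger.
struct LocationState {
    location_id: i64,
    /// A GC job for this location is already scheduled or running.
    in_flight: bool,
    /// The last successful GC finished less than one interval ago.
    too_recent: bool,
}

/// Runs one scheduling pass and returns the location ids that would get a new GC job.
/// Each location increments exactly one counter.
fn schedule_pass(locations: &[LocationState], metrics: &mut GcSchedulerMetrics) -> Vec<i64> {
    let mut dispatched = Vec::new();
    for loc in locations {
        if loc.in_flight {
            metrics.gc_jobs_skipped_in_flight_total += 1;
        } else if loc.too_recent {
            metrics.gc_jobs_skipped_too_recent_total += 1;
        } else {
            metrics.gc_jobs_dispatched_total += 1;
            dispatched.push(loc.location_id);
        }
    }
    dispatched
}
```

Because every location lands in exactly one bucket, the three totals sum to the number of locations scanned, which makes the counters easy to sanity-check on a dashboard.
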

@mitchhs12 marked this pull request as ready for review March 18, 2026 16:13