[experimental] Add trace-aware stage summaries + plotting helper for profiling #1168
Open
jioffe502 wants to merge 11 commits into NVIDIA:main from
drobison00 approved these changes on Dec 17, 2025
- Track submission_ts_ns throughout V2 ingest pipeline
- Extract ray_wait_s, in_ray_queue_s, ray_start_ts_s, ray_end_ts_s metrics
- Enhance wall-time visualization with wait and queue time bars
- Add wait/queue time summaries and percentile statistics
- Update documentation with new profiling metrics
…ontainer cpu capture
aea29b2 to
5fe9d49
Description
At the Ray stage we now log every major hop a document takes: queue time before the worker picks it up, the full pdf_extractor resident time, and downstream stages (YOLOX ensembles for tables/charts, OCR/text extraction, metadata construction, embedding, storage, etc.). Those show up in results.wall_time.png and the stage-time bar chart.
Inside the PDF extractor we added sub-spans for the previously opaque rasterization leg—rendering the page via PDFium, copying the bitmap into NumPy, scaling to YOLOX size, and padding. Those spans feed the new PDFium breakdown chart and CSV so we can compare per-document/per-page cost. You can now say “document 2062555.pdf spent ~0.65 s/page in scaling, which is 60% of its PDF extractor time” instead of just “this doc was slow.”
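The per-page cost and share-of-extractor figures quoted above come down to simple arithmetic over the sub-span durations. A minimal sketch, assuming a hypothetical span layout of (document, sub-step, seconds) tuples and made-up numbers; the PR's actual trace and CSV format may differ:

```python
from collections import defaultdict

# Hypothetical sub-span records: (document, sub_step, seconds).
# The layout and the values are illustrative, not the PR's trace format.
spans = [
    ("doc_a.pdf", "render", 0.2),
    ("doc_a.pdf", "bitmap_to_numpy", 0.1),
    ("doc_a.pdf", "scale", 0.6),
    ("doc_a.pdf", "pad", 0.1),
]

def substep_report(spans, pages, extractor_seconds):
    """Return {doc: {sub_step: (seconds_per_page, share_of_extractor_time)}}.

    pages and extractor_seconds are dicts keyed by document name.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for doc, step, seconds in spans:
        totals[doc][step] += seconds

    report = {}
    for doc, steps in totals.items():
        extractor_total = extractor_seconds[doc]
        report[doc] = {
            step: (
                secs / pages[doc],
                secs / extractor_total if extractor_total else 0.0,
            )
            for step, secs in steps.items()
        }
    return report

report = substep_report(
    spans,
    pages={"doc_a.pdf": 2},
    extractor_seconds={"doc_a.pdf": 1.0},
)
per_page, share = report["doc_a.pdf"]["scale"]
print(f"scale: {per_page:.2f} s/page, {share:.0%} of PDF extractor time")
```

Dividing by the extractor's own resident time, rather than by the sum of the sub-spans, keeps the share honest when some extractor work falls outside the instrumented sub-steps.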
The combination of stage-level metrics (queue/wall/resident) plus the PDFium micro-spans gives a holistic view: you can see how much time each document spends waiting in Ray, how much is consumed by the PDF extractor as a whole, and exactly which sub-step dominates inside that extractor.
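As a rough illustration of how the queue and resident figures relate to the recorded timestamps, here is a minimal sketch. The field names follow the commit message (submission_ts_ns, ray_start_ts_s, ray_end_ts_s), but the StageTiming class and the exact definitions of queue and resident time are assumptions, not the PR's implementation:

```python
from dataclasses import dataclass

@dataclass
class StageTiming:
    """Hypothetical container for one document's timestamps at one stage."""
    submission_ts_ns: int   # submission time, in nanoseconds
    ray_start_ts_s: float   # time a Ray worker started the stage, in seconds
    ray_end_ts_s: float     # time the stage finished, in seconds

    @property
    def queue_s(self) -> float:
        # Assumed definition: gap between submission and worker pickup.
        return self.ray_start_ts_s - self.submission_ts_ns / 1e9

    @property
    def resident_s(self) -> float:
        # Assumed definition: time spent inside the stage itself.
        return self.ray_end_ts_s - self.ray_start_ts_s

timing = StageTiming(
    submission_ts_ns=1_000_000_000,
    ray_start_ts_s=1.5,
    ray_end_ts_s=4.0,
)
print(f"queued {timing.queue_s:.2f} s, resident {timing.resident_s:.2f} s")
```

Note the unit mix: the submission timestamp is in nanoseconds while the Ray timestamps are in seconds, so one side has to be converted before subtracting.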
Task List
- Pass enable_traces/trace_output_dir through the test config and e2e case so trace payloads are captured automatically during scripted runs.
- Add trace_summary generation in scripts/tests/cases/e2e.py, writing per-stage aggregates plus per-document totals; run.py now records trace flags in results.json.
- Add scripts/tests/tools/plot_stage_totals.py, a helper that reads any results.json and emits a PNG + textual summary showing cumulative resident seconds per stage (with options to sort, collapse nested entries, filter network noise, etc.).

Testing:
- With ENABLE_TRACES=true, verified that results.json contains the new trace_summary, trace files land under artifacts/.../traces/, and the plotting tool produces the expected charts (*.stage_time.png) using both collapsed and nested views.
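The cumulative-resident-seconds summary produced by the plotting helper can be approximated as follows. This is a sketch only: the nesting of trace_summary under a "documents" key and the stage_totals function are hypothetical, and the real results.json schema and plot_stage_totals.py options may look different:

```python
import json

def stage_totals(results: dict) -> list[tuple[str, float]]:
    """Sum resident seconds per stage across documents, largest first.

    Assumes a hypothetical layout:
    results["trace_summary"]["documents"][doc][stage] -> seconds
    """
    per_doc = results.get("trace_summary", {}).get("documents", {})
    totals: dict[str, float] = {}
    for stages in per_doc.values():
        for stage, seconds in stages.items():
            totals[stage] = totals.get(stage, 0.0) + seconds
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Illustrative payload; the values are made up.
sample = json.loads("""
{
  "trace_summary": {
    "documents": {
      "a.pdf": {"pdf_extractor": 2.0, "embedding": 0.5},
      "b.pdf": {"pdf_extractor": 1.0, "embedding": 0.25}
    }
  }
}
""")

totals = stage_totals(sample)
for stage, seconds in totals:
    print(f"{stage:<15} {seconds:6.2f} s")
```

Sorting by total seconds puts the dominant stage first, which matches the intent of a bar chart meant to show where time goes.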