Skip to content

Conversation

@akeaswaran
Copy link
Collaborator

@akeaswaran akeaswaran commented Sep 15, 2025

Test Game ID: 401754622

Examples of text that previously returned NAs for yds_receiving but now have valid values:

  • Aaron Philo pass to Dean Patterson for 6 yds to the GWEB 48
  • Cole Pennington sacked  by Christian Garrett for a loss of 2 yards to the GWEB 37
  • to Anthony Lowe for 1 yd to the GT 0, for a TD, Nate Hampton pass (Charlie Viorel PAT MISSED)
  • to Anthony Lowe for 6 yds to the GWEB 35, Nate Hampton pass
  • to Brett Seither for 21 yds to the GWEB 32 for a 1ST down

Summary by CodeRabbit

  • New Features

    • Expanded support for 2025 play-by-play text variants to derive receiving yards from more phrasings (e.g., “pass to … for X yards,” trailing “pass,” and cases without an explicit passer).
  • Bug Fixes

    • Improved extraction for receiving yardage including negative (“for a loss of”) and zero-yard (“for no gain”) cases.
    • Per-game validation in EPA/WPA processing: underpopulated games are skipped with a runtime alert.
    • Normalized API output column renamed to season_type.
  • Documentation

    • Package bumped to v2.1.0 and release notes updated.
  • Tests

    • Tests adjusted to use subset/membership checks for column validations.

akeaswaran and others added 10 commits September 3, 2025 14:01
Removed deprecated labels for elapsed time columns in cfbd_drives and added new columns to cfbd_live_plays documentation and tests. Updated test expectations to use expect_in instead of expect_setequal for column checks.
Added new columns to cfbd_play_stats_player output, improved sack player aggregation, handled NULL values, and updated documentation and examples to reflect changes. Also updated cfbd_live_plays documentation to include new columns for average start yard line and deserve to win metrics.
Changed cfbd_pbp_data to assign 3 timeouts per half for offense and defense when timeout data is missing from the API. Updated documentation and examples to reflect this behavior.
Added .groups = "drop" to the dplyr::summarise call in add_play_counts to control grouping behavior and prevent potential warnings in future dplyr versions.
Removed the specific count of variables from the return value description in both R and Rd files to improve maintainability and accuracy as the data frame structure may change.
… function usage in play data functions

Corrected spacing and replaced superseded `dplyr::distinct_all()` with `dplyr::distinct()`, and standardized assignment spacing for improved code readability and consistency.
@vercel
Copy link

vercel bot commented Sep 15, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
cfbfastr Ready Ready Preview Comment Oct 13, 2025 3:26pm

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Sep 15, 2025

Walkthrough

Adds 2025-specific parsing branches to add_yardage yds_receiving logic; changes cfbd_pbp_data to subset and validate per-game plays (skip games with <20 plays and emit alert) for EPA/WPA computation; adjusts multiple tests to check expected columns are present. No public API signature changes.

Changes

Cohort / File(s) Change Summary
Yardage parsing enhancements
R/helper_pbp_add_yardage.R
Extended yds_receiving parsing with multiple 2025-specific text patterns: handle "pass to ... for a loss of ..." → negative yards, "for X yd(s)/yards" → parsed yards, end-of-text "pass"/"PASSER pass" variants, leading "to ..." forms including "for no gain" → 0 and "for a loss" → negative; retains existing patterns and default NA_real_.
Per-game EPA/WPA filtering
R/cfbd_pbp_data.R
Per-game processing now subsets game_plays for each game_id, skips games with fewer than 20 plays (emits cli::cli_alert_danger), and returns/uses game_plays within the map for EPA/WPA computation instead of the accumulating global play_df.
Column rename in stats output
R/cfbd_stats.R
Added renaming of output column seasonTypeseason_type in cfbd_stats_game_advanced post-processing.
Test updates (subset checks & expanded cols)
tests/testthat/test-cfbd_live_scoreboard.R, tests/testthat/test-cfbd_betting_lines.R, tests/testthat/test-cfbd_stats_game_advanced.R
Replaced strict set-equality assertions with containment checks (expect_in) asserting expected columns are present within returned data; expanded expected columns for betting lines to include home_team_id and away_team_id.
Metadata / release notes
DESCRIPTION, NEWS.md, cran-comments.md
Bumped package version to 2.1.0 and added release notes describing cfbd_pbp_data fix and add_yardage improvements; updated CRAN comments accordingly.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Row as play_text (row)
  participant Add as add_yardage()
  participant Parser as yds_receiving logic

  Row->>Add: call with play_text + metadata
  Add->>Parser: evaluate case_when branches

  rect rgb(240,248,255)
    note right of Parser: Existing parsing branches
    Parser-->Parser: classic forms → parse yards
  end

  rect rgb(237,250,240)
    note right of Parser: New 2025-specific branches
    Parser-->Parser: "pass to ... for a loss of N" → -N
    Parser-->Parser: "pass to ... for N yd(s)/yards" → N
    Parser-->Parser: trailing "pass"/"PASSER pass" variants → parse after "for"
    Parser-->Parser: leading "to ..." forms: "for no gain" → 0 / "for a loss" → -N / "for N" → N
  end

  alt matched
    Parser-->>Add: numeric yds_receiving
  else no match
    Parser-->>Add: NA_real_
  end

  Add-->>Row: updated row with yds_receiving
Loading
sequenceDiagram
  autonumber
  participant All as all_game_ids
  participant Proc as cfbd_pbp_data map loop
  participant Game as game_plays (per-game)

  All->>Proc: iterate game_id x
  Proc->>Game: subset plays for x -> game_plays
  alt nrow(game_plays) < 20
    Proc->>Proc: cli::cli_alert_danger("skipping game with <20 plays")
    Proc-->>All: skip / return NULL for x
  else
    Proc->>Proc: compute EPA/WPA using game_plays
    Proc-->>All: return game_plays (for aggregation)
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

I hop through plays where phrases hide,
I sniff "for loss" and numbers wide.
New branches carved, each pattern chased,
Per-game checks keep short games paced.
Thump! The parser finds each yard retraced. 🐇

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title succinctly describes the primary change, namely adding more cases for parsing receiving yards, which aligns with the modifications in the pull request to extend yds_receiving functionality for additional play‐by‐play text patterns.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch AE/2025-receiving-yardage-parse-case

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
R/helper_pbp_add_yardage.R (2)

88-93: Good addition for negative-yard “pass to … for a loss of …”; add a zero-yard case too.

Handle the “for no gain” phrasing for plain “pass to …” to avoid leaving NA for these plays.

Apply this diff to add the zero-yard case (order before other “pass to …” yards rules is fine):

+        .data$pass == 1 &
+          stringr::str_detect(.data$play_text, regex("pass to", ignore_case = TRUE)) &
+          stringr::str_detect(.data$play_text, regex("for no gain", ignore_case = TRUE)) ~ 0,

100-105: Tighten the “for X yards” regex; current y\\w*ds? is overbroad.

Use explicit units and a word boundary to avoid accidental matches.

-          stringr::str_detect(.data$play_text, regex(" for \\d+ y\\w*ds?", ignore_case = TRUE)) ~
+          stringr::str_detect(.data$play_text, regex("\\bfor\\s+\\d+\\s*(?:yds?|yards?)\\b", ignore_case = TRUE)) ~
           as.numeric(stringr::str_extract(
-            stringi::stri_extract_first_regex(.data$play_text, "(?<= for)[^,]+"), "\\d+"
+            stringi::stri_extract_first_regex(.data$play_text, "(?<=\\bfor\\b)[^,]+"), "\\d+"
           )),
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b3312dc and 221e0f7.

📒 Files selected for processing (1)
  • R/helper_pbp_add_yardage.R (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: ubuntu-latest (oldrel-1)
  • GitHub Check: windows-latest (release)
  • GitHub Check: ubuntu-latest (release)
🔇 Additional comments (1)
R/helper_pbp_add_yardage.R (1)

115-126: Fix contradictory regex: require trailing "pass" (optionally with parens) and allow "to" anywhere

File: R/helper_pbp_add_yardage.R (lines ~115-126). The current branches require '^to ' (start-of-string) and also check for 'pass' at the end — these predicates are effectively contradictory, so the branches will rarely fire. (github.com)

-        # 2025 has some plays list "PASSER pass" at the very end of the play_text
-        .data$pass == 1 &
-          stringr::str_detect(.data$play_text, regex("pass \\(\\w", ignore_case = TRUE)) &
-          stringr::str_detect(.data$play_text, regex("^to ", ignore_case = FALSE)) ~ as.numeric(stringr::str_extract(
-            stringi::stri_extract_first_regex(.data$play_text, "(?<= for)[^,]+"), "\\d+"
-          )),
-        .data$pass == 1 &
-          stringr::str_detect(.data$play_text, regex("pass$", ignore_case = TRUE)) &
-          stringr::str_detect(.data$play_text, regex("^to ", ignore_case = FALSE)) ~ as.numeric(stringr::str_extract(
-            stringi::stri_extract_first_regex(.data$play_text, "(?<= for)[^,]+"), "\\d+"
-          )),
+        # 2025: some plays end with "... pass" (optionally with parens); yards still appear after "for"
+        .data$pass == 1 &
+          stringr::str_detect(.data$play_text, regex("\\bpass(?:\\s*\\([^)]*\\))?\\s*$", ignore_case = TRUE)) &
+          stringr::str_detect(.data$play_text, regex("\\bto\\s", ignore_case = TRUE)) ~
+          as.numeric(stringr::str_extract(
+            stringi::stri_extract_first_regex(.data$play_text, "(?<=\\bfor\\b)[^,]+"), "\\d+"
+          )),

Re-run the pipeline against Game ID 401754622 to confirm these plays are now captured.

Added a check to filter out games with fewer than 20 plays in the play-by-play data processing. This helps avoid issues with EPA/WPA models and improves data validation.
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
R/cfbd_pbp_data.R (1)

597-609: LGTM! Per-game processing correctly implemented.

The change from processing and returning the full play_df to per-game game_plays subsets is correct and aligns with the new validation logic. The map_dfr will properly combine the game-specific results.

Note: game_plays is defined on line 591 and reassigned here after piping through the processing functions. While this variable reuse pattern works in R, consider using a different variable name (e.g., processed_plays) for the pipeline result to improve clarity.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 69b981c and 142a7a0.

📒 Files selected for processing (2)
  • R/cfbd_pbp_data.R (2 hunks)
  • tests/testthat/test-cfbd_live_scoreboard.R (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: ubuntu-latest (release)
  • GitHub Check: windows-latest (release)
  • GitHub Check: ubuntu-latest (oldrel-1)
🔇 Additional comments (1)
tests/testthat/test-cfbd_live_scoreboard.R (1)

54-55: LGTM! Test logic corrected.

The inverted argument order now correctly verifies that all expected columns defined in cols are present in the returned data frames, rather than checking the reverse.

Comment on lines +589 to +596
# Note: this should be changed to a complete data validation test in the future
# filter out games with less than 10 plays to avoid issues with EPA/WPA models
game_plays <- play_df %>%
dplyr::filter(.data$game_id == x)
if (nrow(game_plays) < 20) {
cli::cli_alert_danger(glue::glue("Skipping game_id {x} with only {nrow(game_plays)} plays"))
return(NULL)
}
Copy link
Contributor

@coderabbitai coderabbitai bot Oct 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix comment inconsistency with validation threshold.

Line 590 states "less than 10 plays" but line 593 checks for < 20. Update the comment to match the code.

Apply this diff:

-        # filter out games with less than 10 plays to avoid issues with EPA/WPA models
+        # filter out games with less than 20 plays to avoid issues with EPA/WPA models

Consider extracting the magic number.

The threshold of 20 plays is hardcoded. While the comment acknowledges this should be improved with a complete validation test, consider extracting it as a named constant for maintainability.

Example:

MIN_PLAYS_FOR_EPA_WPA <- 20

Then use:

if (nrow(game_plays) < MIN_PLAYS_FOR_EPA_WPA) {
  cli::cli_alert_danger(glue::glue("Skipping game_id {x} with only {nrow(game_plays)} plays"))
  return(NULL)
}
🤖 Prompt for AI Agents
In R/cfbd_pbp_data.R around lines 589 to 596, the inline comment incorrectly
says "less than 10 plays" while the code checks for fewer than 20 plays; update
the comment to reflect the actual threshold (20) and replace the hardcoded magic
number by introducing a named constant (e.g., MIN_PLAYS_FOR_EPA_WPA <- 20) at
top of the function or file, then use that constant in the if check and the
alert message to improve clarity and maintainability.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ this is fair -- @saiemgilani was this minimum supposed to be 20 or 10

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Update package version to 2.1.0. Add release notes for bug fixes in `cfbd_pbp_data()` and improvements to `add_yardage()` handling missing yardage values. Update cran-comments to reflect minor release and changes.
Added normalization for 'seasonType' to 'season_type' in cfbd_stats_game_advanced. Updated tests to check for column inclusion with expect_in instead of expect_setequal, and added team ID columns to betting lines test.
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
R/cfbd_stats.R (1)

269-270: Column renaming for consistency looks good; remove extra blank line.

The addition of the seasonTypeseason_type transformation maintains consistency with the snake_case convention used throughout the codebase. This aligns with the pattern of other column transformations in the function.

However, line 270 introduces an unnecessary blank line.

Apply this diff to remove the extra blank line:

 colnames(df) <- gsub("seasonType", "season_type", colnames(df))
-
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dd8435a and fad6fde.

📒 Files selected for processing (3)
  • R/cfbd_stats.R (1 hunks)
  • tests/testthat/test-cfbd_betting_lines.R (1 hunks)
  • tests/testthat/test-cfbd_stats_game_advanced.R (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-25T01:24:22.513Z
Learnt from: saiemgilani
PR: sportsdataverse/cfbfastR#113
File: R/cfbd_stats.R:381-381
Timestamp: 2025-08-25T01:24:22.513Z
Learning: In cfbfastR's cfbd_stats_season_advanced() function, the gsub("Opportunies", "_opportunities", colnames(df)) is intentional and correct for handling the actual API response column naming from CollegeFootballData.com, not a typo as it might appear.

Applied to files:

  • R/cfbd_stats.R
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: ubuntu-latest (oldrel-1)
  • GitHub Check: ubuntu-latest (release)
  • GitHub Check: windows-latest (release)
🔇 Additional comments (3)
tests/testthat/test-cfbd_stats_game_advanced.R (1)

36-38: Test relaxation is reasonable for external API.

Changing from exact set equality (expect_setequal) to containment checks (expect_in) makes the tests more resilient to API changes. This is appropriate when testing external APIs that may add new fields without breaking existing functionality.

Trade-off: The tests will no longer detect if unexpected additional columns appear in the response, but this is acceptable given the benefits of reduced test brittleness.

tests/testthat/test-cfbd_betting_lines.R (2)

22-23: Addition of team ID columns improves data structure.

Including home_team_id and away_team_id in the expected columns enhances referential integrity and provides stable identifiers for teams, which is preferable to relying solely on team names.


30-31: Test relaxation is consistent with other test updates.

The change from expect_setequal to expect_in follows the same pattern applied in test-cfbd_stats_game_advanced.R, providing consistent test behavior across the test suite when validating external API responses.

@saiemgilani saiemgilani merged commit 42152cb into main Oct 13, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants