Refactor review action to multi-phase orchestrator with model routing #21
Conversation
Found 2 correctness issues. Consider addressing the suggestions in the comments.
claudecode/review_orchestrator.py (outdated)
    raw_result = None
    if hasattr(self.claude_runner, "run_prompt"):
        raw_result = self.claude_runner.run_prompt(repo_dir, prompt, model=model)
    if not (isinstance(raw_result, tuple) and len(raw_result) == 3) and hasattr(self.claude_runner, "run_code_review"):
🤖 Code Review Finding: Model parameter ignored when falling back to run_code_review
Severity: MEDIUM
Category: correctness
Impact: Per-phase model routing will silently fail when the runner lacks run_prompt method, causing all phases to use the default model instead of configured specialized models. This could lead to increased costs (if default is more expensive) or reduced quality (if default is less capable for certain phases).
Recommendation: Either pass the model parameter to run_code_review and modify it to accept a model argument, or raise an explicit error when run_prompt is unavailable but model differs from default.
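A minimal sketch of the second option, which fails loudly instead of silently dropping the route; DEFAULT_MODEL, the runner methods, and the function name are placeholders, not the action's verified API:

    DEFAULT_MODEL = "default"  # hypothetical sentinel for "no per-phase override"

    def run_phase_with_routing(runner, repo_dir, prompt, model):
        """Run a phase, refusing to silently bypass per-phase model routing."""
        if hasattr(runner, "run_prompt"):
            return runner.run_prompt(repo_dir, prompt, model=model)
        if model != DEFAULT_MODEL:
            raise RuntimeError(
                f"Runner has no run_prompt; cannot route this phase to model {model!r}"
            )
        return runner.run_code_review(repo_dir, prompt)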
claudecode/review_orchestrator.py (outdated)
    finding["confidence"] = float(confidence)
    validated.append(finding)

    # Fallback: keep merged findings if validation returned no decisions
🤖 Code Review Finding: Validation fallback logic discards all findings when validation returns decisions but none pass threshold
Severity: MEDIUM
Category: correctness
Impact: Valid findings may be silently dropped if the validation phase returns decisions with confidence scores below 0.8, resulting in missed issues that the specialist phases identified.
Recommendation: Consider keeping findings that were explicitly reviewed but didn't meet the threshold, marking them with a flag, or log a warning when all validated findings are dropped. The current logic at lines 239-240 sets validated=[] when decisions exist but none pass, which may be too aggressive.
Dismissed: New review posted for updated changes.
Found 3 correctness and reliability issues. Consider addressing the suggestions in the comments.
claudecode/review_orchestrator.py (outdated)
    finding["confidence"] = float(confidence)
    validated.append(finding)

    # Fallback: keep merged findings if validation returned no decisions
🤖 Code Review Finding: Contradictory fallback logic may silently drop all findings
Severity: MEDIUM
Category: correctness
Impact: If validation phase returns decisions like [{"finding_index": 0, "keep": false, "confidence": 0.5, "reason": "..."}, ...] for every finding, all findings are dropped. But if validation returns {"validated_findings": []} (empty list), then fallback to all_candidates kicks in and all findings are kept. This inconsistent behavior could either drop legitimate findings or keep invalid ones depending on validation output format.
Recommendation: Clarify the intent: if validation explicitly returns an empty validated_findings array (decisions is non-empty list), that should probably drop all findings. Consider adding a comment documenting the fallback behavior, or refactor to: validated = all_candidates if not decisions else validated
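A sketch of that refactor with logging added, reusing the decisions / validated / all_candidates names from the snippet above (their exact shapes are assumed):

    import logging

    logger = logging.getLogger(__name__)

    def resolve_validated(decisions, validated, all_candidates):
        """Fall back to merged candidates only when the validator returned no
        decisions at all; trust an explicit (possibly empty) decision list."""
        if not decisions:
            logger.warning("Validator returned no decisions; keeping all %d merged findings",
                           len(all_candidates))
            return all_candidates
        if not validated:
            logger.warning("Validator reviewed %d findings and kept none", len(decisions))
        return validated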
claudecode/review_orchestrator.py (outdated)
    raw_result = None
    if hasattr(self.claude_runner, "run_prompt"):
        raw_result = self.claude_runner.run_prompt(repo_dir, prompt, model=model)
    if not (isinstance(raw_result, tuple) and len(raw_result) == 3) and hasattr(self.claude_runner, "run_code_review"):
🤖 Code Review Finding: Model parameter ignored when falling back to run_code_review
Severity: MEDIUM
Category: correctness
Impact: Per-phase model routing (the main feature of this PR) will not work when run_code_review fallback is triggered. All phases would use the default model instead of the configured per-phase model, defeating the purpose of model routing configuration.
Recommendation: Either ensure run_code_review also accepts and uses the model parameter, or log a warning when falling back to indicate model routing is being bypassed. The fallback path seems to exist for backward compatibility, but the silent model parameter drop could cause confusion.
claudecode/github_action_audit.py (outdated)
    continue  # Retry

    # Parse JSON output
    time.sleep(5 * attempt)
🤖 Code Review Finding: First retry attempt has zero sleep time
Severity: LOW
Category: reliability
Impact: When Claude API returns an error, the first retry happens immediately without any delay, which may not give transient issues time to resolve and could hit rate limits in rapid succession.
Recommendation: Use time.sleep(5 * (attempt + 1)) or start the backoff at a non-zero value to ensure meaningful delays between retries.
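A sketch of that backoff wrapped in a helper for illustration; NUM_RETRIES and the call-returns-None-on-failure convention are assumptions:

    import time

    NUM_RETRIES = 3  # assumed to mirror the retry constant in github_action_audit.py

    def call_with_backoff(call):
        """Retry `call` with a linear backoff that is non-zero before the first retry."""
        for attempt in range(NUM_RETRIES):
            result = call()
            if result is not None:
                return result
            if attempt < NUM_RETRIES - 1:
                time.sleep(5 * (attempt + 1))  # 5s, 10s, ... instead of 0s, 5s, ...
        return None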
Dismissed: New review posted for updated changes.
Found 2 compliance and correctness issues. Please address the high-severity issues before merging.
    @@ -0,0 +1,61 @@
    """Utilities for merging and deduplicating findings from multiple phases."""
🤖 Code Review Finding: Missing tests for new functionality
Severity: HIGH
Category: compliance
Impact: Untested code reduces reliability and makes future changes risky, potentially introducing bugs in the findings deduplication logic.
Recommendation: Create test_findings_merge.py with comprehensive tests for the merge_findings function, including edge cases for duplicate detection and priority ranking.
    # If validator did not return decisions at all, preserve candidates.
    # If it explicitly returned validated_findings (including empty list), trust validator output.
    if not has_validation_output:
🤖 Code Review Finding: Inconsistent validation handling between empty and missing validated_findings
Severity: MEDIUM
Category: correctness
Impact: If the validator returns {"validated_findings": []} (explicitly empty), all findings are dropped. If it returns {} (missing key) or fails to parse correctly, all findings are kept. This behavior may lead to unexpected outcomes depending on validator implementation details.
Recommendation: Document this contract clearly in build_validation_prompt() and consider whether empty array should also preserve candidates (fallback to safe behavior) rather than treating it as 'drop all'. At minimum, add logging to make the decision transparent.
Dismissed: New review posted for updated changes.
Found 5 compliance, correctness, and reliability issues. Consider addressing the suggestions in the comments.
    @@ -0,0 +1,61 @@
    """Utilities for merging and deduplicating findings from multiple phases."""
🤖 Code Review Finding: Missing tests for new functionality
Severity: MEDIUM
Category: compliance
Impact: Violates CLAUDE.md requirement that 'Tests required for new functionality'. The merge_findings function is critical for deduplication and could fail silently if edge cases aren't tested (e.g., handling None values, empty lists, duplicate keys with same severity/confidence).
Recommendation: Create claudecode/test_findings_merge.py with comprehensive tests covering: empty input, single finding, duplicate findings with different severities/confidences, edge cases for line number parsing, and invalid input handling.
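A starting point for that file, assuming the module under test is claudecode/findings_merge.py, that merge_findings takes a list of finding dicts, and that duplicates keep the higher severity (all three are unverified assumptions):

    # claudecode/test_findings_merge.py
    from claudecode.findings_merge import merge_findings  # assumed module path

    def test_empty_input():
        assert merge_findings([]) == []

    def test_single_finding_passes_through():
        finding = {"file": "a.py", "line": 3, "severity": "LOW", "description": "x"}
        assert merge_findings([finding]) == [finding]  # assumed pass-through behavior

    def test_duplicates_keep_higher_severity():
        low = {"file": "a.py", "line": 3, "severity": "LOW", "description": "x"}
        high = {"file": "a.py", "line": 3, "severity": "HIGH", "description": "x"}
        merged = merge_findings([low, high])
        assert len(merged) == 1
        assert merged[0]["severity"] == "HIGH"  # assumed dedup policy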
    return """
    lines = pr_diff.splitlines()
    if max_lines > 0 and len(lines) > max_lines:
🤖 Code Review Finding: max-diff-lines=0 no longer suppresses inline diff as documented
Severity: MEDIUM
Category: correctness
Impact: Users who set max-diff-lines=0 expecting no inline diff (pure agentic mode) will instead get the entire diff embedded in every phase prompt, increasing token usage and potentially hitting prompt length limits on large PRs.
Recommendation: Change the truncation condition to also handle the max_lines == 0 case, e.g.: if max_lines == 0: return no-diff message at the top of _build_hybrid_diff_section, or adjust to if max_lines >= 0 and len(lines) > max_lines.
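A sketch of the first option; the function is module-level here for illustration and the placeholder messages are hypothetical:

    def build_hybrid_diff_section(pr_diff: str, max_lines: int) -> str:
        """Return the inline diff section, treating max_lines == 0 as 'no inline diff'."""
        if max_lines == 0:
            return "(inline diff omitted: max-diff-lines=0)"
        lines = pr_diff.splitlines()
        if 0 < max_lines < len(lines):
            lines = lines[:max_lines] + [f"... diff truncated at {max_lines} lines ..."]
        return "\n".join(lines)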
claudecode/review_orchestrator.py (outdated)
    original_count = len([f for f in triage_result.get("findings", []) if isinstance(f, dict)])
    legacy_findings = []
    for finding in triage_result.get("findings", []):
        if isinstance(finding, dict) and not self.github_client._is_excluded(finding.get("file", "")):
🤖 Code Review Finding: Legacy fast-path calls private _is_excluded on github_client without interface guarantee
Severity: MEDIUM
Category: reliability
Impact: If _is_excluded is renamed or its signature changes, the orchestrator will raise AttributeError at runtime during PR reviews, causing the entire review to fail.
Recommendation: Either make _is_excluded a public method (rename to is_excluded) or extract the exclusion check into a shared utility that both modules import.
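A sketch of the shared-utility option; glob matching via fnmatch is an assumption about what _is_excluded actually does:

    from fnmatch import fnmatch
    from typing import Iterable

    def is_excluded(path: str, patterns: Iterable[str]) -> bool:
        """Return True if `path` matches any exclusion glob.

        Both the GitHub client and the orchestrator could import this instead
        of one reaching into the other's private method.
        """
        return any(fnmatch(path, pattern) for pattern in patterns)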
    # Check for error_during_execution that should trigger retry

    if (isinstance(parsed_result, dict) and
🤖 Code Review Finding: error_during_execution only retried on first attempt, silently accepted on later attempts
Severity: MEDIUM
Category: correctness
Impact: If Claude returns error_during_execution on attempt 1+, the orchestrator may receive garbage data (e.g. an error message string parsed as findings), leading to nonsensical review comments posted on the PR.
Recommendation: Retry error_during_execution on all attempts (up to NUM_RETRIES), or explicitly handle it as an error on non-zero attempts by returning failure instead of falling through to the success path.
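A sketch of the retry-on-all-attempts option; the (ok, parsed, error) result shape, NUM_RETRIES, and the invoke/parse callables are assumptions standing in for the real call path:

    import time

    NUM_RETRIES = 3  # assumed constant

    def run_with_retries(invoke, parse):
        """Retry error_during_execution on every attempt and fail loudly on the last."""
        for attempt in range(NUM_RETRIES):
            parsed = parse(invoke())
            is_exec_error = (isinstance(parsed, dict)
                             and parsed.get("type") == "result"
                             and parsed.get("subtype") == "error_during_execution")
            if not is_exec_error:
                return True, parsed, ""
            if attempt < NUM_RETRIES - 1:
                time.sleep(5 * (attempt + 1))
        return False, {}, "error_during_execution persisted across all retries"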
claudecode/review_orchestrator.py (outdated)
| "description": pr_data.get("body", ""), | ||
| } | ||
| kept_findings = validated | ||
| original_count = len(validated) |
🤖 Code Review Finding: original_count is set after validation, not before filtering
Severity: LOW
Category: correctness
Impact: The filtering_summary in the output will undercount excluded_findings and misrepresent the total_original_findings, making it harder to audit how many findings were dropped by the filter vs. the validator.
Recommendation: Move original_count assignment to before validation (e.g. after merge_findings on line 215) to capture the true pre-filtering total.
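A sketch of the intended ordering; the helper name is hypothetical, but the summary keys match the filtering_summary fields this finding refers to:

    def summarize_filtering(all_candidates, kept_findings):
        """Build the filtering summary from a count captured before any filtering."""
        original_count = len(all_candidates)  # captured right after merge_findings
        return {
            "total_original_findings": original_count,
            "excluded_findings": original_count - len(kept_findings),
        }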
Dismissed: New review posted for updated changes.
Found 7 compliance, correctness, reliability, and security issues. Please address the high-severity issues before merging.
    @@ -0,0 +1,299 @@
    """Multi-phase review orchestration for GitHub Action execution."""
🤖 Code Review Finding: Missing tests for new functionality violates CLAUDE.md requirement
Severity: HIGH
Category: compliance
Impact: Untested code increases risk of regressions and bugs. The orchestrator is a critical component coordinating multi-phase review with complex control flow including legacy compatibility paths, error handling, and finding deduplication logic.
Recommendation: Add test_review_orchestrator.py with comprehensive tests covering: ReviewModelConfig.from_env(), ReviewOrchestrator.run() phases (triage, context discovery, compliance, quality, security, validation), legacy single-pass compatibility path, error handling, and finding merge/deduplication logic.
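One example test for the config path, assuming from_env reads per-phase models from environment variables; the variable name and the triage attribute are guesses:

    # claudecode/test_review_orchestrator.py
    from claudecode.review_orchestrator import ReviewModelConfig

    def test_from_env_uses_default_when_unset(monkeypatch):
        monkeypatch.delenv("REVIEW_TRIAGE_MODEL", raising=False)  # hypothetical variable name
        config = ReviewModelConfig.from_env()
        assert config.triage  # assumed attribute; should fall back to a default model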
    @@ -0,0 +1,61 @@
    """Utilities for merging and deduplicating findings from multiple phases."""
🤖 Code Review Finding: Missing tests for new functionality violates CLAUDE.md requirement
Severity: HIGH
Category: compliance
Impact: Untested deduplication logic can lead to findings being incorrectly merged or dropped, causing false negatives (missed issues) or false positives (duplicate issues) in code reviews.
Recommendation: Add test_findings_merge.py with comprehensive tests covering: _finding_key() normalization and key generation, _severity_rank() and _confidence_value() parsing, merge_findings() deduplication logic with various combinations of duplicate and unique findings, edge cases for malformed input.
claudecode/github_action_audit.py (outdated)
    if (isinstance(parsed_result, dict) and
            parsed_result.get('type') == 'result' and
            parsed_result.get('subtype') == 'error_during_execution' and
            attempt == 0):
🤖 Code Review Finding: error_during_execution retry limited to attempt==0 but NUM_RETRIES is 3
Severity: MEDIUM
Category: correctness
Impact: A transient Claude error on the second attempt will be silently treated as a successful response. The orchestrator will try to interpret the error payload as triage/findings JSON, leading to incorrect phase results (e.g., triage may fail to parse skip_review, causing unexpected flow through the orchestrator).
Recommendation: Extend the retry to cover all attempts before the last one (e.g., attempt < NUM_RETRIES - 1) rather than only attempt == 0, or explicitly handle the error_during_execution subtype as a failure on the final attempt.
claudecode/review_orchestrator.py (outdated)
    original_count = len([f for f in triage_result.get("findings", []) if isinstance(f, dict)])
    legacy_findings = []
    for finding in triage_result.get("findings", []):
        if isinstance(finding, dict) and not self.github_client._is_excluded(finding.get("file", "")):
🤖 Code Review Finding: Legacy fast-path can be triggered unintentionally when triage returns findings
Severity: MEDIUM
Category: reliability
Impact: If the triage model returns findings (which the prompt doesn't explicitly forbid), the entire multi-phase review pipeline is bypassed. The legacy path was designed for backward compatibility but can be triggered unintentionally, reducing review quality.
Recommendation: Add explicit validation that triage output matches the expected schema (skip_review, reason, risk_level) before falling into the legacy path. Consider logging a warning when the legacy fast-path is triggered so operators can detect prompt drift.
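A sketch of that schema gate with the warning log; the field names come from this recommendation, and the helper itself is hypothetical:

    import logging

    logger = logging.getLogger(__name__)

    TRIAGE_SCHEMA_KEYS = {"skip_review", "reason", "risk_level"}

    def should_take_legacy_path(triage_result) -> bool:
        """Take the legacy single-pass path only for clearly legacy-shaped output,
        and make the switch visible in the logs."""
        if isinstance(triage_result, dict) and TRIAGE_SCHEMA_KEYS <= triage_result.keys():
            return False  # well-formed triage response: stay on the multi-phase path
        if isinstance(triage_result, dict) and triage_result.get("findings"):
            logger.warning("Triage returned findings; legacy single-pass fast-path triggered")
            return True
        return False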
        self.model_config = model_config
        self.max_diff_lines = max(0, max_diff_lines)

    def _run_phase(self, repo_dir: Path, prompt: str, model: str, phase_name: str) -> Tuple[bool, Dict[str, Any], str]:
🤖 Code Review Finding: _run_phase fallback logic may call run_code_review even after successful run_prompt
Severity: MEDIUM
Category: correctness
Impact: If the runner's run_prompt returns an unexpected format, the same prompt runs twice, first through run_prompt and then through run_code_review, doubling API cost. Because run_code_review also wraps results in _extract_review_findings, the phase output format may differ from what the orchestrator expects.
Recommendation: After calling run_prompt, explicitly return the failure rather than falling through to run_code_review. The fallback to run_code_review should only apply when run_prompt is not available (i.e., use elif instead of a second if).
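A sketch of that control flow, in which the fallback applies only when run_prompt is absent and a malformed run_prompt result is returned as a failure rather than re-run; the function name and failure-tuple shape mirror the snippet above:

    def run_phase(runner, repo_dir, prompt, model):
        if hasattr(runner, "run_prompt"):
            raw_result = runner.run_prompt(repo_dir, prompt, model=model)
            if isinstance(raw_result, tuple) and len(raw_result) == 3:
                return raw_result
            return False, {}, f"unexpected run_prompt result: {type(raw_result).__name__}"
        elif hasattr(runner, "run_code_review"):
            return runner.run_code_review(repo_dir, prompt)
        return False, {}, "runner exposes neither run_prompt nor run_code_review"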
scripts/comment-pr-findings.js (outdated)
    const hasMarker = Boolean(review.body && review.body.includes(REVIEW_MARKER));
    const isBot = review.user && review.user.type === 'Bot';
    const canDismissLegacyReview = isBot && isOwn;
    const canDismissMarkedReview = hasMarker;
🤖 Code Review Finding: canDismissMarkedReview allows dismissing reviews from any user
Severity: MEDIUM
Category: correctness
Impact: A human reviewer's APPROVED or CHANGES_REQUESTED review could be unintentionally dismissed if their review body happens to contain the marker string, either through quoting a previous bot review or other means.
Recommendation: Add an author check to canDismissMarkedReview, e.g., const canDismissMarkedReview = hasMarker && (isBot || review.user.login === expectedBotLogin) to ensure only bot-authored reviews are dismissed.
scripts/comment-pr-findings.js (outdated)
    const hasMarker = Boolean(review.body && review.body.includes(REVIEW_MARKER));
    const isBot = review.user && review.user.type === 'Bot';
    const canDismissLegacyReview = isBot && isOwn;
    const canDismissMarkedReview = hasMarker;
🤖 Code Review Finding: Review dismissal expanded beyond bot-only scope
Severity: MEDIUM
Category: security
Impact: An attacker who can post a PR review containing the marker string can have their review auto-dismissed by the action on the next commit push, removing a legitimate CHANGES_REQUESTED block. This weakens the review-gate: a malicious insider or compromised account could post a blocking review with the marker, then push a commit to trigger dismissal, bypassing the review requirement.
Recommendation: Add isBot check to canDismissMarkedReview so that only Bot-authored reviews with the marker are dismissible: const canDismissMarkedReview = isBot && hasMarker;
Force-pushed from f83d53f to df71d9f.