Conversation


@matej matej commented Feb 9, 2026

Summary

  • replace the single-pass review path with a multi-phase orchestrator (triage, context discovery, compliance, quality, security, validation)
  • add configurable per-phase model routing in the GitHub Action (model-triage, model-compliance, model-quality, model-security, model-validation)
  • switch to hybrid diff behavior: inline diff is an anchor while repository tool-based context reads are required for all PR sizes
  • add finding deduplication and merge utilities and wire orchestrator output to the existing findings and comment pipeline
  • add Claude subagent definitions under .claude/agents and strengthen security evidence requirements in /review
  • update README and action docs for new inputs and workflow semantics

Validation

  • pytest claudecode -q (186 passed)

github-actions bot previously approved these changes Feb 9, 2026

@github-actions github-actions bot left a comment

Found 2 correctness issues. Consider addressing the suggestions in the comments.

raw_result = None
if hasattr(self.claude_runner, "run_prompt"):
raw_result = self.claude_runner.run_prompt(repo_dir, prompt, model=model)
if not (isinstance(raw_result, tuple) and len(raw_result) == 3) and hasattr(self.claude_runner, "run_code_review"):

🤖 Code Review Finding: Model parameter ignored when falling back to run_code_review

Severity: MEDIUM
Category: correctness

Impact: Per-phase model routing will silently fail when the runner lacks run_prompt method, causing all phases to use the default model instead of configured specialized models. This could lead to increased costs (if default is more expensive) or reduced quality (if default is less capable for certain phases).

Recommendation: Either pass the model parameter to run_code_review and modify it to accept a model argument, or raise an explicit error when run_prompt is unavailable but model differs from default.

finding["confidence"] = float(confidence)
validated.append(finding)

# Fallback: keep merged findings if validation returned no decisions

🤖 Code Review Finding: Validation fallback logic discards all findings when validation returns decisions but none pass threshold

Severity: MEDIUM
Category: correctness

Impact: Valid findings may be silently dropped if the validation phase returns decisions with confidence scores below 0.8, resulting in missed issues that the specialist phases identified.

Recommendation: Consider keeping findings that were explicitly reviewed but didn't meet the threshold with a flag, or log a warning when all validated findings are dropped. The current logic at lines 239-240 sets validated=[] when decisions exist but none pass, which may be too aggressive.

@github-actions github-actions bot dismissed their stale review February 9, 2026 14:09

Dismissed: New review posted for updated changes.

github-actions bot previously approved these changes Feb 9, 2026

@github-actions github-actions bot left a comment

Found 3 correctness and reliability issues. Consider addressing the suggestions in the comments.

finding["confidence"] = float(confidence)
validated.append(finding)

# Fallback: keep merged findings if validation returned no decisions

🤖 Code Review Finding: Contradictory fallback logic may silently drop all findings

Severity: MEDIUM
Category: correctness

Impact: If validation phase returns decisions like [{"finding_index": 0, "keep": false, "confidence": 0.5, "reason": "..."}, ...] for every finding, all findings are dropped. But if validation returns {"validated_findings": []} (empty list), then fallback to all_candidates kicks in and all findings are kept. This inconsistent behavior could either drop legitimate findings or keep invalid ones depending on validation output format.

Recommendation: Clarify the intent: if validation explicitly returns an empty validated_findings array (decisions is non-empty list), that should probably drop all findings. Consider adding a comment documenting the fallback behavior, or refactor to: validated = all_candidates if not decisions else validated
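The suggested refactor can be sketched as follows (the decision shape and the 0.8 threshold are taken from this review; function and variable names are otherwise illustrative):

```python
def apply_validation(all_candidates, decisions, threshold=0.8):
    """Apply validator decisions; fall back to the merged candidates only
    when the validator returned no decisions at all."""
    if not decisions:
        # No decisions: preserve the merged candidates unchanged.
        return list(all_candidates)
    validated = []
    for decision in decisions:
        idx = decision.get("finding_index")
        confidence = float(decision.get("confidence", 0.0))
        if (decision.get("keep") and confidence >= threshold
                and isinstance(idx, int) and 0 <= idx < len(all_candidates)):
            finding = dict(all_candidates[idx])
            finding["confidence"] = confidence
            validated.append(finding)
    return validated
```

With this shape, an explicit "drop everything" verdict (decisions present, none kept) returns an empty list, while a missing decisions list preserves the candidates.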

raw_result = None
if hasattr(self.claude_runner, "run_prompt"):
raw_result = self.claude_runner.run_prompt(repo_dir, prompt, model=model)
if not (isinstance(raw_result, tuple) and len(raw_result) == 3) and hasattr(self.claude_runner, "run_code_review"):

🤖 Code Review Finding: Model parameter ignored when falling back to run_code_review

Severity: MEDIUM
Category: correctness

Impact: Per-phase model routing (the main feature of this PR) will not work when run_code_review fallback is triggered. All phases would use the default model instead of the configured per-phase model, defeating the purpose of model routing configuration.

Recommendation: Either ensure run_code_review also accepts and uses the model parameter, or log a warning when falling back to indicate model routing is being bypassed. The fallback path seems to exist for backward compatibility, but the silent model parameter drop could cause confusion.

continue # Retry

# Parse JSON output
time.sleep(5 * attempt)

🤖 Code Review Finding: First retry attempt has zero sleep time

Severity: LOW
Category: reliability

Impact: When Claude API returns an error, the first retry happens immediately without any delay, which may not give transient issues time to resolve and could hit rate limits in rapid succession.

Recommendation: Use time.sleep(5 * (attempt + 1)) or start the backoff at a non-zero value to ensure meaningful delays between retries.
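A minimal sketch of the suggested backoff, assuming a zero-indexed attempt counter (the constant names are illustrative):

```python
import time

BASE_DELAY = 5  # seconds

def backoff_delay(attempt: int) -> int:
    """Linear backoff that is non-zero even before the first retry."""
    return BASE_DELAY * (attempt + 1)

def wait_before_retry(attempt: int) -> None:
    time.sleep(backoff_delay(attempt))
```

Attempt 0 now waits 5s instead of 0s; attempts 1 and 2 wait 10s and 15s.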

@github-actions github-actions bot dismissed their stale review February 10, 2026 11:35

Dismissed: New review posted for updated changes.

@github-actions github-actions bot left a comment

Found 2 compliance and correctness issues. Please address the high-severity issues before merging.

@@ -0,0 +1,61 @@
"""Utilities for merging and deduplicating findings from multiple phases."""

🤖 Code Review Finding: Missing tests for new functionality

Severity: HIGH
Category: compliance

Impact: Untested code reduces reliability and makes future changes risky, potentially introducing bugs in the findings deduplication logic.

Recommendation: Create test_findings_merge.py with comprehensive tests for the merge_findings function, including edge cases for duplicate detection and priority ranking.


# If validator did not return decisions at all, preserve candidates.
# If it explicitly returned validated_findings (including empty list), trust validator output.
if not has_validation_output:

🤖 Code Review Finding: Inconsistent validation handling between empty and missing validated_findings

Severity: MEDIUM
Category: correctness

Impact: If the validator returns {"validated_findings": []} (explicitly empty), all findings are dropped. If it returns {} (missing key) or fails to parse correctly, all findings are kept. This behavior may lead to unexpected outcomes depending on validator implementation details.

Recommendation: Document this contract clearly in build_validation_prompt() and consider whether empty array should also preserve candidates (fallback to safe behavior) rather than treating it as 'drop all'. At minimum, add logging to make the decision transparent.
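The suggested logging could look like this (the validated_findings key follows the contract described above; the logger setup and helper name are illustrative):

```python
import logging

logger = logging.getLogger(__name__)

def resolve_validated(candidates, validator_output):
    """Distinguish 'validator returned nothing' from 'validator explicitly
    returned an empty list', and log whichever path is taken."""
    if "validated_findings" not in validator_output:
        logger.warning("Validator returned no decisions; keeping %d candidates",
                       len(candidates))
        return candidates
    kept = validator_output["validated_findings"]
    if not kept:
        logger.warning("Validator explicitly dropped all %d candidates",
                       len(candidates))
    return kept
```

Either way the decision is now visible in the action log rather than silent.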

@github-actions github-actions bot dismissed their stale review February 10, 2026 11:40

Dismissed: New review posted for updated changes.

github-actions bot previously approved these changes Feb 10, 2026

@github-actions github-actions bot left a comment

Found 5 compliance, correctness, and reliability issues. Consider addressing the suggestions in the comments.

@@ -0,0 +1,61 @@
"""Utilities for merging and deduplicating findings from multiple phases."""

🤖 Code Review Finding: Missing tests for new functionality

Severity: MEDIUM
Category: compliance

Impact: Violates CLAUDE.md requirement that 'Tests required for new functionality'. The merge_findings function is critical for deduplication and could fail silently if edge cases aren't tested (e.g., handling None values, empty lists, duplicate keys with same severity/confidence).

Recommendation: Create claudecode/test_findings_merge.py with comprehensive tests covering: empty input, single finding, duplicate findings with different severities/confidences, edge cases for line number parsing, and invalid input handling.
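A sketch of what claudecode/test_findings_merge.py could cover. The stand-in merge_findings below only illustrates the assumed contract (dedupe by file/line/title, higher severity wins) and is not the real implementation:

```python
SEVERITY_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}

def merge_findings(findings):
    """Stand-in merge: collapse duplicates, keep the highest severity."""
    merged = {}
    for finding in findings:
        key = (finding.get("file"), finding.get("line"), finding.get("title"))
        prev = merged.get(key)
        if prev is None or (SEVERITY_RANK.get(finding.get("severity"), -1)
                            > SEVERITY_RANK.get(prev.get("severity"), -1)):
            merged[key] = finding
    return list(merged.values())

def test_empty_input():
    assert merge_findings([]) == []

def test_duplicate_keeps_higher_severity():
    low = {"file": "a.py", "line": 1, "title": "x", "severity": "LOW"}
    high = {"file": "a.py", "line": 1, "title": "x", "severity": "HIGH"}
    assert merge_findings([low, high]) == [high]

def test_distinct_findings_preserved():
    a = {"file": "a.py", "line": 1, "title": "x", "severity": "LOW"}
    b = {"file": "b.py", "line": 2, "title": "y", "severity": "LOW"}
    assert merge_findings([a, b]) == [a, b]
```

Additional cases worth covering: None values in key fields, unknown severity strings, and confidence ties.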


return """
lines = pr_diff.splitlines()
if max_lines > 0 and len(lines) > max_lines:

🤖 Code Review Finding: max-diff-lines=0 no longer suppresses inline diff as documented

Severity: MEDIUM
Category: correctness

Impact: Users who set max-diff-lines=0 expecting no inline diff (pure agentic mode) will instead get the entire diff embedded in every phase prompt, increasing token usage and potentially hitting prompt length limits on large PRs.

Recommendation: Change the truncation condition to also handle the max_lines == 0 case, e.g.: if max_lines == 0: return no-diff message at the top of _build_hybrid_diff_section, or adjust to if max_lines >= 0 and len(lines) > max_lines.
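One way to restore the documented max-diff-lines=0 behavior (the function name follows the snippet above; the placeholder message is illustrative):

```python
def build_hybrid_diff_section(pr_diff: str, max_lines: int) -> str:
    """Return the inline diff anchor, honoring max_lines == 0 as 'no inline diff'."""
    if max_lines == 0:
        return "(inline diff omitted; read changed files via repository tools)"
    lines = pr_diff.splitlines()
    if 0 < max_lines < len(lines):
        lines = lines[:max_lines] + [f"... diff truncated at {max_lines} lines ..."]
    return "\n".join(lines)
```

Handling the zero case first keeps the truncation condition itself unchanged.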

original_count = len([f for f in triage_result.get("findings", []) if isinstance(f, dict)])
legacy_findings = []
for finding in triage_result.get("findings", []):
if isinstance(finding, dict) and not self.github_client._is_excluded(finding.get("file", "")):

🤖 Code Review Finding: Legacy fast-path calls private _is_excluded on github_client without interface guarantee

Severity: MEDIUM
Category: reliability

Impact: If _is_excluded is renamed or its signature changes, the orchestrator will raise AttributeError at runtime during PR reviews, causing the entire review to fail.

Recommendation: Either make _is_excluded a public method (rename to is_excluded) or extract the exclusion check into a shared utility that both modules import.
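A sketch of the rename-with-alias option (the prefix list and class shape are illustrative, not the real GitHubClient):

```python
class GitHubClient:
    def __init__(self, excluded_prefixes=("vendor/", "dist/")):
        self._excluded_prefixes = tuple(excluded_prefixes)

    def is_excluded(self, path: str) -> bool:
        """Public exclusion check that callers outside the class can rely on."""
        return any(path.startswith(prefix) for prefix in self._excluded_prefixes)

    # Deprecated private alias kept so existing callers do not break.
    _is_excluded = is_excluded
```

The alias lets the orchestrator migrate to the public name without a lockstep change.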


# Check for error_during_execution that should trigger retry

if (isinstance(parsed_result, dict) and

🤖 Code Review Finding: error_during_execution only retried on first attempt, silently accepted on later attempts

Severity: MEDIUM
Category: correctness

Impact: If Claude returns error_during_execution on attempt 1+, the orchestrator may receive garbage data (e.g. an error message string parsed as findings), leading to nonsensical review comments posted on the PR.

Recommendation: Retry error_during_execution on all attempts (up to NUM_RETRIES), or explicitly handle it as an error on non-zero attempts by returning failure instead of falling through to the success path.

"description": pr_data.get("body", ""),
}
kept_findings = validated
original_count = len(validated)

🤖 Code Review Finding: original_count is set after validation, not before filtering

Severity: LOW
Category: correctness

Impact: The filtering_summary in the output will undercount excluded_findings and misrepresent the total_original_findings, making it harder to audit how many findings were dropped by the filter vs. the validator.

Recommendation: Move original_count assignment to before validation (e.g. after merge_findings on line 215) to capture the true pre-filtering total.
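The counting order can be sketched like this (the field names follow the filtering_summary mentioned above; the helper itself is illustrative):

```python
def summarize_filtering(merged, validated, kept):
    """Capture the pre-validation total so validator drops and exclusion
    drops can be audited separately."""
    original_count = len(merged)  # measured before validation, not after
    return {
        "total_original_findings": original_count,
        "dropped_by_validation": original_count - len(validated),
        "excluded_findings": len(validated) - len(kept),
        "kept_findings": len(kept),
    }
```

Measuring before validation makes the two drop counts add up to the true total.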

@github-actions github-actions bot dismissed their stale review February 10, 2026 11:45

Dismissed: New review posted for updated changes.

@github-actions github-actions bot left a comment

Found 7 compliance, correctness, reliability, and security issues. Please address the high-severity issues before merging.

@@ -0,0 +1,299 @@
"""Multi-phase review orchestration for GitHub Action execution."""

🤖 Code Review Finding: Missing tests for new functionality violates CLAUDE.md requirement

Severity: HIGH
Category: compliance

Impact: Untested code increases risk of regressions and bugs. The orchestrator is a critical component coordinating multi-phase review with complex control flow including legacy compatibility paths, error handling, and finding deduplication logic.

Recommendation: Add test_review_orchestrator.py with comprehensive tests covering: ReviewModelConfig.from_env(), ReviewOrchestrator.run() phases (triage, context discovery, compliance, quality, security, validation), legacy single-pass compatibility path, error handling, and finding merge/deduplication logic.

@@ -0,0 +1,61 @@
"""Utilities for merging and deduplicating findings from multiple phases."""

🤖 Code Review Finding: Missing tests for new functionality violates CLAUDE.md requirement

Severity: HIGH
Category: compliance

Impact: Untested deduplication logic can lead to findings being incorrectly merged or dropped, causing false negatives (missed issues) or false positives (duplicate issues) in code reviews.

Recommendation: Add test_findings_merge.py with comprehensive tests covering: _finding_key() normalization and key generation, _severity_rank() and _confidence_value() parsing, merge_findings() deduplication logic with various combinations of duplicate and unique findings, edge cases for malformed input.

if (isinstance(parsed_result, dict) and
parsed_result.get('type') == 'result' and
parsed_result.get('subtype') == 'error_during_execution' and
attempt == 0):

🤖 Code Review Finding: error_during_execution retry limited to attempt==0 but NUM_RETRIES is 3

Severity: MEDIUM
Category: correctness

Impact: A transient Claude error on the second attempt will be silently treated as a successful response. The orchestrator will try to interpret the error payload as triage/findings JSON, leading to incorrect phase results (e.g., triage may fail to parse skip_review, causing unexpected flow through the orchestrator).

Recommendation: Extend the retry to cover all attempts before the last one (e.g., attempt < NUM_RETRIES - 1) rather than only attempt == 0, or explicitly handle the error_during_execution subtype as a failure on the final attempt.
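The widened retry gate could look like this (NUM_RETRIES and the parsed-result shape follow the snippet above; the helper name is illustrative):

```python
NUM_RETRIES = 3

def should_retry_error(parsed_result, attempt: int) -> bool:
    """Retry error_during_execution on every attempt before the last one."""
    is_execution_error = (isinstance(parsed_result, dict)
                          and parsed_result.get("type") == "result"
                          and parsed_result.get("subtype") == "error_during_execution")
    return is_execution_error and attempt < NUM_RETRIES - 1
```

On the final attempt the error should then be surfaced as a failure rather than parsed as findings.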

original_count = len([f for f in triage_result.get("findings", []) if isinstance(f, dict)])
legacy_findings = []
for finding in triage_result.get("findings", []):
if isinstance(finding, dict) and not self.github_client._is_excluded(finding.get("file", "")):

🤖 Code Review Finding: Legacy fast-path can be triggered unintentionally when triage returns findings

Severity: MEDIUM
Category: reliability

Impact: If the triage model returns findings (which the prompt doesn't explicitly forbid), the entire multi-phase review pipeline is bypassed. The legacy path was designed for backward compatibility but can be triggered unintentionally, reducing review quality.

Recommendation: Add explicit validation that triage output matches the expected schema (skip_review, reason, risk_level) before falling into the legacy path. Consider logging a warning when the legacy fast-path is triggered so operators can detect prompt drift.

self.model_config = model_config
self.max_diff_lines = max(0, max_diff_lines)

def _run_phase(self, repo_dir: Path, prompt: str, model: str, phase_name: str) -> Tuple[bool, Dict[str, Any], str]:

🤖 Code Review Finding: _run_phase fallback logic may call run_code_review even after successful run_prompt

Severity: MEDIUM
Category: correctness

Impact: If the runner's run_prompt returns an unexpected format, the same prompt runs twice: once through run_prompt and again through run_code_review, doubling API cost. With run_code_review also wrapping results in _extract_review_findings, the phase output format may differ from what the orchestrator expects.

Recommendation: After calling run_prompt, explicitly return the failure rather than falling through to run_code_review. The fallback to run_code_review should only apply when run_prompt is not available (i.e., use elif instead of a second if).
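The elif shape suggested above can be sketched as follows (the return value mirrors the three-element tuple _run_phase expects; names are otherwise illustrative):

```python
def run_phase(runner, repo_dir, prompt, model):
    """Dispatch to run_prompt when available and never fall through to
    run_code_review after run_prompt has already been invoked."""
    if hasattr(runner, "run_prompt"):
        raw = runner.run_prompt(repo_dir, prompt, model=model)
        if isinstance(raw, tuple) and len(raw) == 3:
            return raw
        # run_prompt was tried: report the failure instead of re-running.
        return (False, {}, "run_prompt returned an unexpected result shape")
    elif hasattr(runner, "run_code_review"):
        # Legacy path only; the per-phase model parameter is not supported here.
        return runner.run_code_review(repo_dir, prompt)
    return (False, {}, "runner exposes neither run_prompt nor run_code_review")
```

The fallback now applies only when run_prompt does not exist at all.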

const hasMarker = Boolean(review.body && review.body.includes(REVIEW_MARKER));
const isBot = review.user && review.user.type === 'Bot';
const canDismissLegacyReview = isBot && isOwn;
const canDismissMarkedReview = hasMarker;

🤖 Code Review Finding: canDismissMarkedReview allows dismissing reviews from any user

Severity: MEDIUM
Category: correctness

Impact: A human reviewer's APPROVED or CHANGES_REQUESTED review could be unintentionally dismissed if their review body happens to contain the marker string, either through quoting a previous bot review or other means.

Recommendation: Add an author check to canDismissMarkedReview, e.g., const canDismissMarkedReview = hasMarker && (isBot || review.user.login === expectedBotLogin) to ensure only bot-authored reviews are dismissed.

const hasMarker = Boolean(review.body && review.body.includes(REVIEW_MARKER));
const isBot = review.user && review.user.type === 'Bot';
const canDismissLegacyReview = isBot && isOwn;
const canDismissMarkedReview = hasMarker;

🤖 Code Review Finding: Review dismissal expanded beyond bot-only scope

Severity: MEDIUM
Category: security

Impact: An attacker who can post a PR review containing the marker string can have their review auto-dismissed by the action on the next commit push, removing a legitimate CHANGES_REQUESTED block. This weakens the review-gate: a malicious insider or compromised account could post a blocking review with the marker, then push a commit to trigger dismissal, bypassing the review requirement.

Recommendation: Add isBot check to canDismissMarkedReview so that only Bot-authored reviews with the marker are dismissible: const canDismissMarkedReview = isBot && hasMarker;

matej added a commit that referenced this pull request Feb 13, 2026
@matej matej force-pushed the matej/refactor-review-orchestrator branch from f83d53f to df71d9f on February 13, 2026 at 09:26