Refactor review action to multi-phase orchestrator with model routing #21
Conversation
Found 2 correctness issues. Consider addressing the suggestions in the comments.
claudecode/review_orchestrator.py (outdated)
    raw_result = None
    if hasattr(self.claude_runner, "run_prompt"):
        raw_result = self.claude_runner.run_prompt(repo_dir, prompt, model=model)
    if not (isinstance(raw_result, tuple) and len(raw_result) == 3) and hasattr(self.claude_runner, "run_code_review"):
🤖 Code Review Finding: Model parameter ignored when falling back to run_code_review
Severity: MEDIUM
Category: correctness
Impact: Per-phase model routing will silently fail when the runner lacks run_prompt method, causing all phases to use the default model instead of configured specialized models. This could lead to increased costs (if default is more expensive) or reduced quality (if default is less capable for certain phases).
Recommendation: Either pass the model parameter to run_code_review and modify it to accept a model argument, or raise an explicit error when run_prompt is unavailable but model differs from default.
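A minimal sketch of the second option, which fails loudly instead of silently dropping the route; DEFAULT_MODEL, the runner methods, and the function name are placeholders, not the action's verified API:

    DEFAULT_MODEL = "default"  # hypothetical sentinel for "no per-phase override"

    def run_phase_with_routing(runner, repo_dir, prompt, model):
        """Run a phase, refusing to silently bypass per-phase model routing."""
        if hasattr(runner, "run_prompt"):
            return runner.run_prompt(repo_dir, prompt, model=model)
        if model != DEFAULT_MODEL:
            raise RuntimeError(
                f"Runner has no run_prompt; cannot route this phase to model {model!r}"
            )
        return runner.run_code_review(repo_dir, prompt)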
claudecode/review_orchestrator.py (outdated)
    finding["confidence"] = float(confidence)
    validated.append(finding)

    # Fallback: keep merged findings if validation returned no decisions
🤖 Code Review Finding: Validation fallback logic discards all findings when validation returns decisions but none pass threshold
Severity: MEDIUM
Category: correctness
Impact: Valid findings may be silently dropped if the validation phase returns decisions with confidence scores below 0.8, resulting in missed issues that the specialist phases identified.
Recommendation: Consider keeping findings that were explicitly reviewed but didn't meet the threshold, marking them with a flag, or log a warning when all validated findings are dropped. The current logic at lines 239-240 sets validated=[] when decisions exist but none pass, which may be too aggressive.
Dismissed: New review posted for updated changes.
Found 3 correctness and reliability issues. Consider addressing the suggestions in the comments.
claudecode/review_orchestrator.py (outdated)
    finding["confidence"] = float(confidence)
    validated.append(finding)

    # Fallback: keep merged findings if validation returned no decisions
🤖 Code Review Finding: Contradictory fallback logic may silently drop all findings
Severity: MEDIUM
Category: correctness
Impact: If validation phase returns decisions like [{"finding_index": 0, "keep": false, "confidence": 0.5, "reason": "..."}, ...] for every finding, all findings are dropped. But if validation returns {"validated_findings": []} (empty list), then fallback to all_candidates kicks in and all findings are kept. This inconsistent behavior could either drop legitimate findings or keep invalid ones depending on validation output format.
Recommendation: Clarify the intent: if validation explicitly returns an empty validated_findings array (decisions is non-empty list), that should probably drop all findings. Consider adding a comment documenting the fallback behavior, or refactor to: validated = all_candidates if not decisions else validated
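A sketch of that refactor with logging added, reusing the decisions / validated / all_candidates names from the snippet above (their exact shapes are assumed):

    import logging

    logger = logging.getLogger(__name__)

    def resolve_validated(decisions, validated, all_candidates):
        """Fall back to merged candidates only when the validator returned no
        decisions at all; trust an explicit (possibly empty) decision list."""
        if not decisions:
            logger.warning("Validator returned no decisions; keeping all %d merged findings",
                           len(all_candidates))
            return all_candidates
        if not validated:
            logger.warning("Validator reviewed %d findings and kept none", len(decisions))
        return validated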
claudecode/review_orchestrator.py (outdated)
    raw_result = None
    if hasattr(self.claude_runner, "run_prompt"):
        raw_result = self.claude_runner.run_prompt(repo_dir, prompt, model=model)
    if not (isinstance(raw_result, tuple) and len(raw_result) == 3) and hasattr(self.claude_runner, "run_code_review"):
🤖 Code Review Finding: Model parameter ignored when falling back to run_code_review
Severity: MEDIUM
Category: correctness
Impact: Per-phase model routing (the main feature of this PR) will not work when run_code_review fallback is triggered. All phases would use the default model instead of the configured per-phase model, defeating the purpose of model routing configuration.
Recommendation: Either ensure run_code_review also accepts and uses the model parameter, or log a warning when falling back to indicate model routing is being bypassed. The fallback path seems to exist for backward compatibility, but the silent model parameter drop could cause confusion.
claudecode/github_action_audit.py (outdated)
    continue  # Retry

    # Parse JSON output
    time.sleep(5 * attempt)
🤖 Code Review Finding: First retry attempt has zero sleep time
Severity: LOW
Category: reliability
Impact: When Claude API returns an error, the first retry happens immediately without any delay, which may not give transient issues time to resolve and could hit rate limits in rapid succession.
Recommendation: Use time.sleep(5 * (attempt + 1)) or start the backoff at a non-zero value to ensure meaningful delays between retries.
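A sketch of that backoff wrapped in a helper for illustration; NUM_RETRIES and the call-returns-None-on-failure convention are assumptions:

    import time

    NUM_RETRIES = 3  # assumed to mirror the retry constant in github_action_audit.py

    def call_with_backoff(call):
        """Retry `call` with a linear backoff that is non-zero before the first retry."""
        for attempt in range(NUM_RETRIES):
            result = call()
            if result is not None:
                return result
            if attempt < NUM_RETRIES - 1:
                time.sleep(5 * (attempt + 1))  # 5s, 10s, ... instead of 0s, 5s, ...
        return None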
Dismissed: New review posted for updated changes.
Found 2 compliance and correctness issues. Please address the high-severity issues before merging.
    @@ -0,0 +1,61 @@
    """Utilities for merging and deduplicating findings from multiple phases."""
🤖 Code Review Finding: Missing tests for new functionality
Severity: HIGH
Category: compliance
Impact: Untested code reduces reliability and makes future changes risky, potentially introducing bugs in the findings deduplication logic.
Recommendation: Create test_findings_merge.py with comprehensive tests for the merge_findings function, including edge cases for duplicate detection and priority ranking.
    # If validator did not return decisions at all, preserve candidates.
    # If it explicitly returned validated_findings (including empty list), trust validator output.
    if not has_validation_output:
🤖 Code Review Finding: Inconsistent validation handling between empty and missing validated_findings
Severity: MEDIUM
Category: correctness
Impact: If the validator returns {"validated_findings": []} (explicitly empty), all findings are dropped. If it returns {} (missing key) or fails to parse correctly, all findings are kept. This behavior may lead to unexpected outcomes depending on validator implementation details.
Recommendation: Document this contract clearly in build_validation_prompt() and consider whether empty array should also preserve candidates (fallback to safe behavior) rather than treating it as 'drop all'. At minimum, add logging to make the decision transparent.
Dismissed: New review posted for updated changes.
Found 5 compliance, correctness, and reliability issues. Consider addressing the suggestions in the comments.
    @@ -0,0 +1,61 @@
    """Utilities for merging and deduplicating findings from multiple phases."""
🤖 Code Review Finding: Missing tests for new functionality
Severity: MEDIUM
Category: compliance
Impact: Violates CLAUDE.md requirement that 'Tests required for new functionality'. The merge_findings function is critical for deduplication and could fail silently if edge cases aren't tested (e.g., handling None values, empty lists, duplicate keys with same severity/confidence).
Recommendation: Create claudecode/test_findings_merge.py with comprehensive tests covering: empty input, single finding, duplicate findings with different severities/confidences, edge cases for line number parsing, and invalid input handling.
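A starting point for that file, assuming the module under test is claudecode/findings_merge.py, that merge_findings takes a list of finding dicts, and that duplicates keep the higher severity (all three are unverified assumptions):

    # claudecode/test_findings_merge.py
    from claudecode.findings_merge import merge_findings  # assumed module path

    def test_empty_input():
        assert merge_findings([]) == []

    def test_single_finding_passes_through():
        finding = {"file": "a.py", "line": 3, "severity": "LOW", "description": "x"}
        assert merge_findings([finding]) == [finding]  # assumed pass-through behavior

    def test_duplicates_keep_higher_severity():
        low = {"file": "a.py", "line": 3, "severity": "LOW", "description": "x"}
        high = {"file": "a.py", "line": 3, "severity": "HIGH", "description": "x"}
        merged = merge_findings([low, high])
        assert len(merged) == 1
        assert merged[0]["severity"] == "HIGH"  # assumed dedup policy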
    return """
    lines = pr_diff.splitlines()
    if max_lines > 0 and len(lines) > max_lines:
🤖 Code Review Finding: max-diff-lines=0 no longer suppresses inline diff as documented
Severity: MEDIUM
Category: correctness
Impact: Users who set max-diff-lines=0 expecting no inline diff (pure agentic mode) will instead get the entire diff embedded in every phase prompt, increasing token usage and potentially hitting prompt length limits on large PRs.
Recommendation: Change the truncation condition to also handle the max_lines == 0 case, e.g.: if max_lines == 0: return no-diff message at the top of _build_hybrid_diff_section, or adjust to if max_lines >= 0 and len(lines) > max_lines.
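A sketch of the first option; the function is module-level here for illustration and the placeholder messages are hypothetical:

    def build_hybrid_diff_section(pr_diff: str, max_lines: int) -> str:
        """Return the inline diff section, treating max_lines == 0 as 'no inline diff'."""
        if max_lines == 0:
            return "(inline diff omitted: max-diff-lines=0)"
        lines = pr_diff.splitlines()
        if 0 < max_lines < len(lines):
            lines = lines[:max_lines] + [f"... diff truncated at {max_lines} lines ..."]
        return "\n".join(lines)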
claudecode/review_orchestrator.py (outdated)
    original_count = len([f for f in triage_result.get("findings", []) if isinstance(f, dict)])
    legacy_findings = []
    for finding in triage_result.get("findings", []):
        if isinstance(finding, dict) and not self.github_client._is_excluded(finding.get("file", "")):
🤖 Code Review Finding: Legacy fast-path calls private _is_excluded on github_client without interface guarantee
Severity: MEDIUM
Category: reliability
Impact: If _is_excluded is renamed or its signature changes, the orchestrator will raise AttributeError at runtime during PR reviews, causing the entire review to fail.
Recommendation: Either make _is_excluded a public method (rename to is_excluded) or extract the exclusion check into a shared utility that both modules import.
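A sketch of the shared-utility option; glob matching via fnmatch is an assumption about what _is_excluded actually does:

    from fnmatch import fnmatch
    from typing import Iterable

    def is_excluded(path: str, patterns: Iterable[str]) -> bool:
        """Return True if `path` matches any exclusion glob.

        Both the GitHub client and the orchestrator could import this instead
        of one reaching into the other's private method.
        """
        return any(fnmatch(path, pattern) for pattern in patterns)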
    # Check for error_during_execution that should trigger retry

    if (isinstance(parsed_result, dict) and
🤖 Code Review Finding: error_during_execution only retried on first attempt, silently accepted on later attempts
Severity: MEDIUM
Category: correctness
Impact: If Claude returns error_during_execution on attempt 1+, the orchestrator may receive garbage data (e.g. an error message string parsed as findings), leading to nonsensical review comments posted on the PR.
Recommendation: Retry error_during_execution on all attempts (up to NUM_RETRIES), or explicitly handle it as an error on non-zero attempts by returning failure instead of falling through to the success path.
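A sketch of the retry-on-all-attempts option; the (ok, parsed, error) result shape, NUM_RETRIES, and the invoke/parse callables are assumptions standing in for the real call path:

    import time

    NUM_RETRIES = 3  # assumed constant

    def run_with_retries(invoke, parse):
        """Retry error_during_execution on every attempt and fail loudly on the last."""
        for attempt in range(NUM_RETRIES):
            parsed = parse(invoke())
            is_exec_error = (isinstance(parsed, dict)
                             and parsed.get("type") == "result"
                             and parsed.get("subtype") == "error_during_execution")
            if not is_exec_error:
                return True, parsed, ""
            if attempt < NUM_RETRIES - 1:
                time.sleep(5 * (attempt + 1))
        return False, {}, "error_during_execution persisted across all retries"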
claudecode/review_orchestrator.py (outdated)
| "description": pr_data.get("body", ""), | ||
| } | ||
| kept_findings = validated | ||
| original_count = len(validated) |
🤖 Code Review Finding: original_count is set after validation, not before filtering
Severity: LOW
Category: correctness
Impact: The filtering_summary in the output will undercount excluded_findings and misrepresent the total_original_findings, making it harder to audit how many findings were dropped by the filter vs. the validator.
Recommendation: Move original_count assignment to before validation (e.g. after merge_findings on line 215) to capture the true pre-filtering total.
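A sketch of the intended ordering; the helper name is hypothetical, but the summary keys match the filtering_summary fields this finding refers to:

    def summarize_filtering(all_candidates, kept_findings):
        """Build the filtering summary from a count captured before any filtering."""
        original_count = len(all_candidates)  # captured right after merge_findings
        return {
            "total_original_findings": original_count,
            "excluded_findings": original_count - len(kept_findings),
        }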
Dismissed: New review posted for updated changes.
Found 7 compliance, correctness, reliability, and security issues. Please address the high-severity issues before merging.
    @@ -0,0 +1,299 @@
    """Multi-phase review orchestration for GitHub Action execution."""
🤖 Code Review Finding: Missing tests for new functionality violates CLAUDE.md requirement
Severity: HIGH
Category: compliance
Impact: Untested code increases risk of regressions and bugs. The orchestrator is a critical component coordinating multi-phase review with complex control flow including legacy compatibility paths, error handling, and finding deduplication logic.
Recommendation: Add test_review_orchestrator.py with comprehensive tests covering: ReviewModelConfig.from_env(), ReviewOrchestrator.run() phases (triage, context discovery, compliance, quality, security, validation), legacy single-pass compatibility path, error handling, and finding merge/deduplication logic.
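One example test for the config path, assuming from_env reads per-phase models from environment variables; the variable name and the triage attribute are guesses:

    # claudecode/test_review_orchestrator.py
    from claudecode.review_orchestrator import ReviewModelConfig

    def test_from_env_uses_default_when_unset(monkeypatch):
        monkeypatch.delenv("REVIEW_TRIAGE_MODEL", raising=False)  # hypothetical variable name
        config = ReviewModelConfig.from_env()
        assert config.triage  # assumed attribute; should fall back to a default model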
    @@ -0,0 +1,61 @@
    """Utilities for merging and deduplicating findings from multiple phases."""
🤖 Code Review Finding: Missing tests for new functionality violates CLAUDE.md requirement
Severity: HIGH
Category: compliance
Impact: Untested deduplication logic can lead to findings being incorrectly merged or dropped, causing false negatives (missed issues) or false positives (duplicate issues) in code reviews.
Recommendation: Add test_findings_merge.py with comprehensive tests covering: _finding_key() normalization and key generation, _severity_rank() and _confidence_value() parsing, merge_findings() deduplication logic with various combinations of duplicate and unique findings, edge cases for malformed input.
claudecode/github_action_audit.py (outdated)
    if (isinstance(parsed_result, dict) and
            parsed_result.get('type') == 'result' and
            parsed_result.get('subtype') == 'error_during_execution' and
            attempt == 0):
🤖 Code Review Finding: error_during_execution retry limited to attempt==0 but NUM_RETRIES is 3
Severity: MEDIUM
Category: correctness
Impact: A transient Claude error on the second attempt will be silently treated as a successful response. The orchestrator will try to interpret the error payload as triage/findings JSON, leading to incorrect phase results (e.g., triage may fail to parse skip_review, causing unexpected flow through the orchestrator).
Recommendation: Extend the retry to cover all attempts before the last one (e.g., attempt < NUM_RETRIES - 1) rather than only attempt == 0, or explicitly handle the error_during_execution subtype as a failure on the final attempt.
claudecode/review_orchestrator.py (outdated)
    original_count = len([f for f in triage_result.get("findings", []) if isinstance(f, dict)])
    legacy_findings = []
    for finding in triage_result.get("findings", []):
        if isinstance(finding, dict) and not self.github_client._is_excluded(finding.get("file", "")):
🤖 Code Review Finding: Legacy fast-path can be triggered unintentionally when triage returns findings
Severity: MEDIUM
Category: reliability
Impact: If the triage model returns findings (which the prompt doesn't explicitly forbid), the entire multi-phase review pipeline is bypassed. The legacy path was designed for backward compatibility but can be triggered unintentionally, reducing review quality.
Recommendation: Add explicit validation that triage output matches the expected schema (skip_review, reason, risk_level) before falling into the legacy path. Consider logging a warning when the legacy fast-path is triggered so operators can detect prompt drift.
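A sketch of that schema gate with the warning log; the field names come from this recommendation, and the helper itself is hypothetical:

    import logging

    logger = logging.getLogger(__name__)

    TRIAGE_SCHEMA_KEYS = {"skip_review", "reason", "risk_level"}

    def should_take_legacy_path(triage_result) -> bool:
        """Take the legacy single-pass path only for clearly legacy-shaped output,
        and make the switch visible in the logs."""
        if isinstance(triage_result, dict) and TRIAGE_SCHEMA_KEYS <= triage_result.keys():
            return False  # well-formed triage response: stay on the multi-phase path
        if isinstance(triage_result, dict) and triage_result.get("findings"):
            logger.warning("Triage returned findings; legacy single-pass fast-path triggered")
            return True
        return False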
        self.model_config = model_config
        self.max_diff_lines = max(0, max_diff_lines)

    def _run_phase(self, repo_dir: Path, prompt: str, model: str, phase_name: str) -> Tuple[bool, Dict[str, Any], str]:
🤖 Code Review Finding: _run_phase fallback logic may call run_code_review even after successful run_prompt
Severity: MEDIUM
Category: correctness
Impact: If the runner's run_prompt returns an unexpected format, the same prompt runs twice, first through run_prompt and then through run_code_review, doubling API cost. Because run_code_review also wraps results in _extract_review_findings, the phase output format may differ from what the orchestrator expects.
Recommendation: After calling run_prompt, explicitly return the failure rather than falling through to run_code_review. The fallback to run_code_review should only apply when run_prompt is not available (i.e., use elif instead of a second if).
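A sketch of that control flow, in which the fallback applies only when run_prompt is absent and a malformed run_prompt result is returned as a failure rather than re-run; the function name and failure-tuple shape mirror the snippet above:

    def run_phase(runner, repo_dir, prompt, model):
        if hasattr(runner, "run_prompt"):
            raw_result = runner.run_prompt(repo_dir, prompt, model=model)
            if isinstance(raw_result, tuple) and len(raw_result) == 3:
                return raw_result
            return False, {}, f"unexpected run_prompt result: {type(raw_result).__name__}"
        elif hasattr(runner, "run_code_review"):
            return runner.run_code_review(repo_dir, prompt)
        return False, {}, "runner exposes neither run_prompt nor run_code_review"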
scripts/comment-pr-findings.js (outdated)
    const hasMarker = Boolean(review.body && review.body.includes(REVIEW_MARKER));
    const isBot = review.user && review.user.type === 'Bot';
    const canDismissLegacyReview = isBot && isOwn;
    const canDismissMarkedReview = hasMarker;
🤖 Code Review Finding: canDismissMarkedReview allows dismissing reviews from any user
Severity: MEDIUM
Category: correctness
Impact: A human reviewer's APPROVED or CHANGES_REQUESTED review could be unintentionally dismissed if their review body happens to contain the marker string, either through quoting a previous bot review or other means.
Recommendation: Add an author check to canDismissMarkedReview, e.g., const canDismissMarkedReview = hasMarker && (isBot || review.user.login === expectedBotLogin) to ensure only bot-authored reviews are dismissed.
scripts/comment-pr-findings.js (outdated)
    const hasMarker = Boolean(review.body && review.body.includes(REVIEW_MARKER));
    const isBot = review.user && review.user.type === 'Bot';
    const canDismissLegacyReview = isBot && isOwn;
    const canDismissMarkedReview = hasMarker;
🤖 Code Review Finding: Review dismissal expanded beyond bot-only scope
Severity: MEDIUM
Category: security
Impact: An attacker who can post a PR review containing the marker string can have their review auto-dismissed by the action on the next commit push, removing a legitimate CHANGES_REQUESTED block. This weakens the review-gate: a malicious insider or compromised account could post a blocking review with the marker, then push a commit to trigger dismissal, bypassing the review requirement.
Recommendation: Add isBot check to canDismissMarkedReview so that only Bot-authored reviews with the marker are dismissible: const canDismissMarkedReview = isBot && hasMarker;
Force-pushed from f83d53f to df71d9f.