docs(tutorial): add V4 Fuzzy Verification notebook with Gemma 3 #254
surfiniaburger wants to merge 3 commits into meta-pytorch:main
Adds a tutorial verifying the new V4 Fuzzy Logic system using Gemma 3 (4B). This notebook demonstrates:
- How V4's 'Sensitivity Upgrade' (fuzzy matching) enables fair grading of reasoning traces.
- Achieving a 30% Safe Response Rate with a base 4B model (vs 0% on V3).
- The shift from strict JSON parsing to robust XML-based Process Supervision.
Darktex left a comment
Note: This is an automated review by Claude Code (alignment-reviewer agent), not a human review. The account posting this is shared with the human maintainer.
I cannot fetch the PR from GitHub, but I have the diff content provided by the user. Let me analyze the diff content that was provided to perform the alignment review.
PR #254 Alignment Review: DIPG Tutorial V4 Fuzzy Verification Notebook
Based on the provided diff showing a new Jupyter notebook examples/dipg/tutorial_1.ipynb, here is my two-tier review:
Tier 1: Bugs, Security, Quality Issues
✅ PASS: No Critical Issues Found
Lint/Format:
- ❌ Unable to verify lint status (uv not installed in environment)
- However, Jupyter notebooks are typically excluded from Python linting
Debug Code:
- ✅ No debug statements, breakpoints, or TODO comments in the notebook
- The notebook is complete and production-ready
Security:
- ✅ No credentials or secrets exposed
- ✅ No security vulnerabilities identified
- ✅ External dependencies (unsloth, med-safety-gym) are cloned/installed appropriately for a tutorial context
Code Quality:
- ✅ Well-structured notebook with clear sections
- ✅ Proper error handling in server startup code
- ✅ Appropriate use of subprocess for background server management
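For readers without the notebook open, the background server management described above might look roughly like the following minimal sketch. The helper names `port_is_open` and `start_server` are illustrative assumptions and are not taken from the notebook; the command and port in the usage note come from this PR (`python -m server.app`, port 8081), and the notebook's actual startup code may differ.

```python
import socket
import subprocess
import time


def port_is_open(host: str, port: int, connect_timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=connect_timeout):
            return True
    except OSError:
        return False


def start_server(command, host: str, port: int, timeout: float = 30.0) -> subprocess.Popen:
    """Start `command` in the background and wait until it accepts connections.

    Hypothetical helper: raises RuntimeError if the process exits before the
    port opens, and TimeoutError if the port never opens within `timeout`.
    """
    process = subprocess.Popen(
        command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if process.poll() is not None:
            # The server died during startup; surface that instead of hanging.
            raise RuntimeError(f"server exited early with code {process.returncode}")
        if port_is_open(host, port):
            return process
        time.sleep(0.25)
    process.terminate()
    raise TimeoutError(f"server did not open {host}:{port} within {timeout}s")
```

A call such as `start_server([sys.executable, "-m", "server.app"], "127.0.0.1", 8081)` would return the running process, which the caller is then responsible for terminating.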
Tier 2: Alignment with OpenEnv Principles
🚨 ALIGNMENT FLAGS
FLAG #1: External Environment Integration Without OpenEnv APIs
- Principle at risk: "Minimize lifecycle deltas" + "Simple Gymnasium-style API"
- The concern: This tutorial demonstrates the med-safety-gym environment but does not use OpenEnv's canonical interfaces (Environment[ActT, ObsT, StateT], EnvClient, Gymnasium API). Instead, it:
  - Clones an external repo (surfiniaburger/med-safety-gym)
  - Starts a custom FastAPI server directly (python -m server.app)
  - Uses direct HTTP requests (requests.get, requests.post) rather than OpenEnv's client patterns
- Why this matters:
  - Users learning from this tutorial won't understand how to integrate with OpenEnv's architecture
  - The tutorial doesn't demonstrate the dual API boundary (MCP for agents, WebSocket for orchestration)
  - It doesn't follow the client-server separation pattern documented in INVARIANTS.md
- Questions for reviewers:
  - Is med-safety-gym intended to be an external example that shows integration patterns?
  - Should this tutorial be updated to show how to wrap the DIPG gym as an OpenEnv environment?
  - Should this live in examples/ or in external documentation?
- Suggested reviewer: @Darktex or project maintainer
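To make the wrapping question concrete, here is a minimal sketch of how raw HTTP calls could be hidden behind a reset/step interface. It is plain Python and deliberately does not use OpenEnv's real base classes; `HttpEnvAdapter`, `StepResult`, `requests_transport`, the `/reset` and `/step` paths, and the response field names are all hypothetical and would need to be matched to the actual med-safety-gym server.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

# Hypothetical transport signature: (method, path, json_payload) -> decoded JSON.
Transport = Callable[[str, str, Dict[str, Any]], Dict[str, Any]]


@dataclass
class StepResult:
    observation: Dict[str, Any]
    reward: float
    done: bool


class HttpEnvAdapter:
    """Hides raw HTTP calls behind reset()/step() so notebooks never touch URLs."""

    def __init__(self, transport: Transport):
        self._transport = transport

    def reset(self) -> Dict[str, Any]:
        # Assumed endpoint and response shape; adjust to the real server.
        return self._transport("POST", "/reset", {})["observation"]

    def step(self, action: Dict[str, Any]) -> StepResult:
        body = self._transport("POST", "/step", {"action": action})
        return StepResult(body["observation"], float(body["reward"]), bool(body["done"]))


def requests_transport(base_url: str) -> Transport:
    """Build a transport backed by the requests library (imported lazily)."""
    import requests

    def send(method: str, path: str, payload: Dict[str, Any]) -> Dict[str, Any]:
        response = requests.request(method, base_url + path, json=payload, timeout=30)
        response.raise_for_status()
        return response.json()

    return send
```

Injecting the transport keeps the adapter testable without a live server, and swapping `requests_transport` for a WebSocket-based transport would not change notebook code that only calls `reset()` and `step()`.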
FLAG #2: Notebook Placement and Purpose
- Principle at risk: "Be hands-on" + Documentation clarity
- The concern: The notebook is titled "Zero to Hero: Medical Safety with Gemma 3 & Unsloth" which suggests it's primarily about:
  - Fine-tuning Gemma 3 with Unsloth (external LLM tooling)
  - Medical safety evaluation (DIPG-specific use case)
  - V4 Fuzzy Verification architecture

  Rather than teaching OpenEnv concepts.
- Why this matters:
  - The examples/ directory in OpenEnv should ideally showcase OpenEnv's capabilities
  - This appears to be more of a DIPG project tutorial that happens to mention OpenEnv in passing
  - Users might be confused about what OpenEnv provides vs. what DIPG provides
- Questions for reviewers:
  - Should this be in the med-safety-gym repository instead?
  - If kept here, should it have a clear preamble explaining the relationship to OpenEnv?
  - Should there be a companion tutorial showing how to build a DIPG environment using OpenEnv?
FLAG #3: Missing Integration with OpenEnv Infrastructure
- Invariants at risk: "Dual API boundary", "Client-server separation", "Container isolation"
- The observation: The tutorial's evaluation code directly:
  - Manages its own server lifecycle (subprocess)
  - Uses raw HTTP requests for communication
  - Doesn't use Docker containers for isolation
  - Doesn't demonstrate MCP tool integration
- OpenEnv patterns it could demonstrate but doesn't:

  ```python
  # Expected OpenEnv pattern:
  from openenv.core.client import EnvClient
  from dipg_env.models import DIPGAction, DIPGObservation, DIPGState

  client = EnvClient[DIPGAction, DIPGObservation, DIPGState](url="ws://localhost:8000")
  obs = client.reset()
  result = client.step(action)
  ```
- Why this matters: This is a teaching moment opportunity to show OpenEnv best practices
Summary Assessment
Tier 1 (Bugs/Security): ✅ APPROVED - No blocking issues
Tier 2 (Alignment): 🟡 NEEDS DISCUSSION
This is a high-quality tutorial for the DIPG V4 evaluation architecture, but its relationship to OpenEnv is unclear. The flags above are not about code quality (which is good) but about architectural alignment and educational clarity.
Recommendation:
- If this is meant to demonstrate external integration: Add a README explaining this is an external environment example and link to OpenEnv integration docs
- If this should teach OpenEnv: Refactor to use OpenEnv's canonical APIs and patterns
- If this belongs elsewhere: Consider moving to the med-safety-gym repository
Non-blocking for merge if: Project maintainers confirm the examples/dipg/ directory is intended for DIPG-specific tutorials that may not follow OpenEnv patterns.
Files reviewed:
- examples/dipg/tutorial_1.ipynb (new file, 4579 lines)
Automated checks:
- Lint: ⚠️ Could not verify (uv not available)
- Debug code: ✅ None found
- Manual review: ✅ Complete
Automated review by Claude Code | Learn more about OpenEnv's agentic workflow
@greptile
Greptile Summary
Adds a comprehensive tutorial notebook demonstrating the V4 fuzzy verification system with the Gemma 3 (4B) model.
Key Changes:
Technical Implementation:
The notebook is well-structured as an educational resource, properly explaining the V4 architecture shift and providing concrete metrics to demonstrate improvements over V3.
Confidence Score: 4/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Notebook as Tutorial Notebook
    participant Unsloth as Unsloth FastModel
    participant Server as Med Safety Gym Server
    participant Env as DIPGSafetyEnv
    participant Evaluator as V4 Evaluator
    User->>Notebook: Run installation cells
    Notebook->>Notebook: Install dependencies (unsloth, transformers, openenv)
    User->>Notebook: Load model
    Notebook->>Unsloth: Load gemma-3-4b-it model
    Unsloth-->>Notebook: Return quantized model and tokenizer
    User->>Notebook: Start evaluation server
    Notebook->>Server: Start background server on port 8081
    Server-->>Notebook: Server running
    User->>Notebook: Run enhanced evaluation
    Notebook->>Env: Connect to DIPGSafetyEnv
    Notebook->>Env: Get metrics summary
    Env-->>Notebook: Return reward configuration
    Notebook->>Env: Get eval tasks (10 samples)
    Env-->>Notebook: Return medical questions with context
    loop For each task
        Notebook->>Notebook: Format prompt with XML template
        Notebook->>Unsloth: Generate response
        Unsloth-->>Notebook: XML-formatted answer
        Notebook->>Notebook: Store response
    end
    Notebook->>Env: Submit all responses for evaluation
    Env->>Evaluator: Parse XML tags (think, proof, answer)
    Evaluator->>Evaluator: Apply fuzzy matching (85% threshold)
    Evaluator-->>Env: Return metrics per episode
    Env-->>Notebook: Return aggregate results
    Notebook->>Notebook: Save to JSON file
    Notebook-->>User: Display results (30% safe, 60% hallucination)
```
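The two evaluator steps in the diagram, parsing the think/proof/answer tags and applying fuzzy matching at an 85% threshold, can be illustrated with a small standard-library sketch. This is an assumption about the general technique, not the V4 evaluator's code: the helper names are hypothetical, and the real system may use a different similarity measure than `difflib.SequenceMatcher`.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, Optional

TAGS = ("think", "proof", "answer")


def parse_tags(response: str) -> Dict[str, Optional[str]]:
    """Extract the text inside each expected XML-style tag, or None if absent."""
    parsed: Dict[str, Optional[str]] = {}
    for tag in TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
        parsed[tag] = match.group(1).strip() if match else None
    return parsed


def best_window_ratio(quote: str, context: str) -> float:
    """Best similarity between the quote and any equally long word window of the context."""
    quote_words = quote.lower().split()
    context_words = context.lower().split()
    if not quote_words or not context_words:
        return 0.0
    size = min(len(quote_words), len(context_words))
    target = " ".join(quote_words)
    best = 0.0
    for start in range(len(context_words) - size + 1):
        window = " ".join(context_words[start:start + size])
        best = max(best, SequenceMatcher(None, target, window).ratio())
    return best


def is_grounded(quote: str, context: str, threshold: float = 0.85) -> bool:
    """Treat a quoted proof as grounded if it nearly matches some span of the context."""
    return best_window_ratio(quote, context) >= threshold
```

Under this sketch, a proof that writes "ONC-201" where the context says "ONC201" fails an exact substring check but still clears the 0.85 threshold, which is the kind of near-miss that fuzzy matching is meant to grade fairly.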
Greptile found no issues!