
docs(tutorial): add V4 Fuzzy Verification notebook with Gemma 3 #254

Open

surfiniaburger wants to merge 3 commits into meta-pytorch:main from surfiniaburger:feat/v4-fuzzy-verification

Conversation

@surfiniaburger
Contributor

Adds a tutorial verifying the new V4 Fuzzy Logic system using Gemma 3 (4B). This notebook demonstrates:

  • How V4's 'Sensitivity Upgrade' (fuzzy matching) enables fair grading of reasoning traces.
  • Achieving 30% Safe Response Rate with a base 4B model (vs 0% on V3).
  • The shift from strict JSON parsing to robust XML-based Process Supervision.

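For orientation, here is a rough sketch of parsing the XML-tagged output format. The `<think>`, `<proof>`, and `<answer>` tag names match the tutorial; the regex-based helper below is illustrative only, not the med-safety-gym evaluator's actual code.

    import re

    # Illustrative helper: pull the <think>/<proof>/<answer> sections out of a
    # model response. Tag names follow the tutorial's format; this regex-based
    # parsing is a sketch, not the evaluator's implementation.
    def parse_sections(response: str) -> dict:
        sections = {}
        for tag in ("think", "proof", "answer"):
            match = re.search(rf"<{tag}>(.*?)</{tag}>", response, re.DOTALL)
            sections[tag] = match.group(1).strip() if match else None
        return sections

    example = "<think>check the context</think><proof>quoted evidence</proof><answer>Not supported.</answer>"
    print(parse_sections(example))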
@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Dec 17, 2025
@Darktex
Contributor

Darktex left a comment


Note: This is an automated review by Claude Code (alignment-reviewer agent), not a human review. The account posting this is shared with the human maintainer.


I could not fetch the PR from GitHub, but the diff content was provided, so the alignment review below is based on that diff.

PR #254 Alignment Review: DIPG Tutorial V4 Fuzzy Verification Notebook

Based on the provided diff showing a new Jupyter notebook examples/dipg/tutorial_1.ipynb, here is my two-tier review:


Tier 1: Bugs, Security, Quality Issues

PASS: No Critical Issues Found

Lint/Format:

  • ❌ Unable to verify lint status (uv not installed in environment)
  • However, Jupyter notebooks are typically excluded from Python linting

Debug Code:

  • ✅ No debug statements, breakpoints, or TODO comments in the notebook
  • The notebook is complete and production-ready

Security:

  • ✅ No credentials or secrets exposed
  • ✅ No security vulnerabilities identified
  • ✅ External dependencies (unsloth, med-safety-gym) are cloned/installed appropriately for a tutorial context

Code Quality:

  • ✅ Well-structured notebook with clear sections
  • ✅ Proper error handling in server startup code
  • ✅ Appropriate use of subprocess for background server management
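For reference, the background-server pattern noted above looks roughly like the following. This is a minimal sketch, not the notebook's exact `run_bg_server()`; the module path, health-check endpoint, and port are assumptions.

    import subprocess
    import time

    import requests

    # Minimal sketch: launch the evaluation server in the background, then poll
    # until it answers. Module path, /health endpoint, and port are placeholders.
    def run_bg_server(port: int = 8081) -> subprocess.Popen:
        proc = subprocess.Popen(
            ["python", "-m", "server.app", "--port", str(port)],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        for _ in range(30):  # wait up to ~30 s for startup
            try:
                requests.get(f"http://localhost:{port}/health", timeout=1)
                return proc
            except requests.exceptions.ConnectionError:
                time.sleep(1)
        proc.terminate()
        raise RuntimeError("Evaluation server did not become ready")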

Tier 2: Alignment with OpenEnv Principles

🚨 ALIGNMENT FLAGS

FLAG #1: External Environment Integration Without OpenEnv APIs

  • Principle at risk: "Minimize lifecycle deltas" + "Simple Gymnasium-style API"

  • The concern: This tutorial demonstrates the med-safety-gym environment but does not use OpenEnv's canonical interfaces (Environment[ActT, ObsT, StateT], EnvClient, Gymnasium API). Instead, it:

    • Clones an external repo (surfiniaburger/med-safety-gym)
    • Starts a custom FastAPI server directly (python -m server.app)
    • Uses direct HTTP requests (requests.get, requests.post) rather than OpenEnv's client patterns
  • Why this matters:

    • Users learning from this tutorial won't understand how to integrate with OpenEnv's architecture
    • The tutorial doesn't demonstrate the dual API boundary (MCP for agents, WebSocket for orchestration)
    • It doesn't follow the client-server separation pattern documented in INVARIANTS.md
  • Questions for reviewers:

    1. Is med-safety-gym intended to be an external example that shows integration patterns?
    2. Should this tutorial be updated to show how to wrap the DIPG gym as an OpenEnv environment?
    3. Should this live in examples/ or in external documentation?
  • Suggested reviewer: @Darktex or project maintainer


FLAG #2: Notebook Placement and Purpose

  • Principle at risk: "Be hands-on" + Documentation clarity

  • The concern: The notebook is titled "Zero to Hero: Medical Safety with Gemma 3 & Unsloth" which suggests it's primarily about:

    • Fine-tuning Gemma 3 with Unsloth (external LLM tooling)
    • Medical safety evaluation (DIPG-specific use case)
    • V4 Fuzzy Verification architecture

    rather than about teaching OpenEnv concepts.

  • Why this matters:

    • The examples/ directory in OpenEnv should ideally showcase OpenEnv's capabilities
    • This appears to be more of a DIPG project tutorial that happens to mention OpenEnv in passing
    • Users might be confused about what OpenEnv provides vs. what DIPG provides
  • Questions for reviewers:

    1. Should this be in the med-safety-gym repository instead?
    2. If kept here, should it have a clear preamble explaining the relationship to OpenEnv?
    3. Should there be a companion tutorial showing how to build a DIPG environment using OpenEnv?

FLAG #3: Missing Integration with OpenEnv Infrastructure

  • Invariants at risk: "Dual API boundary", "Client-server separation", "Container isolation"

  • The observation: The tutorial's evaluation code directly:

    • Manages its own server lifecycle (subprocess)
    • Uses raw HTTP requests for communication
    • Doesn't use Docker containers for isolation
    • Doesn't demonstrate MCP tool integration
  • OpenEnv patterns it could demonstrate but doesn't:

    # Expected OpenEnv pattern (illustrative sketch; constructor arguments are hypothetical):
    from openenv.core.client import EnvClient
    from dipg_env.models import DIPGAction, DIPGObservation, DIPGState

    client = EnvClient[DIPGAction, DIPGObservation, DIPGState](url="ws://localhost:8000")
    obs = client.reset()
    action = DIPGAction(response="<think>...</think><proof>...</proof><answer>...</answer>")
    result = client.step(action)
  • Why this matters: This is a teaching opportunity to show OpenEnv best practices


Summary Assessment

Tier 1 (Bugs/Security): ✅ APPROVED - No blocking issues

Tier 2 (Alignment): 🟡 NEEDS DISCUSSION

This is a high-quality tutorial for the DIPG V4 evaluation architecture, but its relationship to OpenEnv is unclear. The flags above are not about code quality (which is good) but about architectural alignment and educational clarity.

Recommendation:

  1. If this is meant to demonstrate external integration: Add a README explaining this is an external environment example and link to OpenEnv integration docs
  2. If this should teach OpenEnv: Refactor to use OpenEnv's canonical APIs and patterns
  3. If this belongs elsewhere: Consider moving to the med-safety-gym repository

Non-blocking for merge if: Project maintainers confirm the examples/dipg/ directory is intended for DIPG-specific tutorials that may not follow OpenEnv patterns.


Files reviewed:

  • examples/dipg/tutorial_1.ipynb (new file, 4579 lines)

Automated checks:

  • Lint: ⚠️ Could not verify (uv not available)
  • Debug code: ✅ None found
  • Manual review: ✅ Complete

Automated review by Claude Code | Learn more about OpenEnv's agentic workflow

@zkwentz
Contributor

zkwentz commented Jan 21, 2026

@greptile

@greptile-apps
Contributor

greptile-apps bot commented Jan 21, 2026

Greptile Summary

Adds comprehensive tutorial notebook demonstrating the V4 fuzzy verification system with Gemma 3 (4B) model.

Key Changes:

  • Introduces complete end-to-end workflow from model loading through evaluation
  • Demonstrates V4's fuzzy matching logic (>85% similarity threshold via difflib); a minimal sketch follows this list
  • Shows migration from V3's strict JSON parsing to XML-based process supervision
  • Achieves 30% safe response rate baseline (vs 0% on V3) with base 4B model
  • Includes proper error handling with try-except blocks in evaluation function
  • Provides clear documentation of V4 architecture improvements
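A minimal sketch of that similarity check, assuming difflib.SequenceMatcher and a 0.85 cutoff; the evaluator's actual normalization and scoring live in med-safety-gym and may differ.

    from difflib import SequenceMatcher

    # Rough illustration of fuzzy-matching a quoted proof span against the
    # reference context. The 0.85 threshold mirrors the ">85% similarity"
    # described above; the normalization here is an assumption.
    def fuzzy_match(candidate: str, reference: str, threshold: float = 0.85) -> bool:
        ratio = SequenceMatcher(None, candidate.lower().strip(), reference.lower().strip()).ratio()
        return ratio >= threshold

    print(fuzzy_match("radiation is the standard of care", "Radiation is the standard of care."))  # True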

Technical Implementation:

  • Uses Unsloth's FastModel for efficient 4-bit quantization
  • Implements 3-step evaluation workflow: get_metrics_summary(), get_eval_tasks(), evaluate_model()
  • Structured XML output format with <think>, <proof>, and <answer> tags
  • Background server setup with run_bg_server() to avoid blocking notebook execution
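An illustrative sketch of that three-step flow against the locally running server. Endpoint paths and payload shapes are assumptions; in the notebook these calls are wrapped by the helper functions named above.

    import requests

    BASE = "http://localhost:8081"  # server started in the background by run_bg_server()

    # Step 1: inspect the reward/metric configuration (endpoint path is hypothetical).
    metrics = requests.get(f"{BASE}/metrics_summary").json()

    # Step 2: fetch evaluation tasks (endpoint and payload shape are hypothetical).
    tasks = requests.get(f"{BASE}/eval_tasks", params={"n": 10}).json()

    # Step 3: generate one XML-formatted response per task with the loaded model,
    # then submit the batch; XML parsing and fuzzy matching happen server-side.
    responses = [
        {"task_id": t["id"],
         "response": "<think>...</think><proof>...</proof><answer>...</answer>"}
        for t in tasks
    ]
    results = requests.post(f"{BASE}/evaluate", json={"responses": responses}).json()
    print(results)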

The notebook is well-structured as an educational resource, properly explaining the V4 architecture shift and providing concrete metrics to demonstrate improvements over V3.

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk - it adds a new educational notebook without modifying existing code
  • Score reflects that this is a new tutorial file with no changes to production code. The notebook demonstrates proper error handling, uses established libraries (Unsloth, transformers), and follows the existing DIPG evaluation patterns. Minor points deducted for lack of input validation in some code cells and the use of mock mode that could confuse users.
  • No files require special attention - this is an additive change introducing a single tutorial notebook

Important Files Changed

Filename: examples/dipg/tutorial_1.ipynb
Overview: New tutorial notebook demonstrating V4 fuzzy verification with Gemma 3 model. Implements complete workflow from model loading to evaluation with medical safety gym.

Sequence Diagram

sequenceDiagram
    participant User
    participant Notebook as Tutorial Notebook
    participant Unsloth as Unsloth FastModel
    participant Server as Med Safety Gym Server
    participant Env as DIPGSafetyEnv
    participant Evaluator as V4 Evaluator

    User->>Notebook: Run installation cells
    Notebook->>Notebook: Install dependencies (unsloth, transformers, openenv)
    
    User->>Notebook: Load model
    Notebook->>Unsloth: Load gemma-3-4b-it model
    Unsloth-->>Notebook: Return quantized model and tokenizer
    
    User->>Notebook: Start evaluation server
    Notebook->>Server: Start background server on port 8081
    Server-->>Notebook: Server running
    
    User->>Notebook: Run enhanced evaluation
    Notebook->>Env: Connect to DIPGSafetyEnv
    
    Notebook->>Env: Get metrics summary
    Env-->>Notebook: Return reward configuration
    
    Notebook->>Env: Get eval tasks (10 samples)
    Env-->>Notebook: Return medical questions with context
    
    loop For each task
        Notebook->>Notebook: Format prompt with XML template
        Notebook->>Unsloth: Generate response
        Unsloth-->>Notebook: XML-formatted answer
        Notebook->>Notebook: Store response
    end
    
    Notebook->>Env: Submit all responses for evaluation
    Env->>Evaluator: Parse XML tags (think, proof, answer)
    Evaluator->>Evaluator: Apply fuzzy matching (85% threshold)
    Evaluator-->>Env: Return metrics per episode
    Env-->>Notebook: Return aggregate results
    
    Notebook->>Notebook: Save to JSON file
    Notebook-->>User: Display results (30% safe, 60% hallucination)

@greptile-apps
Contributor

greptile-apps bot commented Jan 21, 2026

Greptile found no issues!

From now on, if a review finishes and we haven't found any issues, we will not post anything, but you can confirm that we reviewed your changes in the status check section.

This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".

