
docs(tutorial): add V4 Fuzzy Verification notebook with Gemma 3 #254

Open

surfiniaburger wants to merge 3 commits into meta-pytorch:main from surfiniaburger:feat/v4-fuzzy-verification

Conversation

@surfiniaburger
Contributor

Adds a tutorial verifying the new V4 Fuzzy Logic system using Gemma 3 (4B). This notebook demonstrates:

  • How V4's 'Sensitivity Upgrade' (fuzzy matching) enables fair grading of reasoning traces.
  • Achieving 30% Safe Response Rate with a base 4B model (vs 0% on V3).
  • The shift from strict JSON parsing to robust XML-based Process Supervision.

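For orientation, here is a rough sketch of parsing the XML-tagged output format. The `<think>`, `<proof>`, and `<answer>` tag names match the tutorial; the regex-based helper below is illustrative only, not the med-safety-gym evaluator's actual code.

    import re

    # Illustrative helper: pull the <think>/<proof>/<answer> sections out of a
    # model response. Tag names follow the tutorial's format; this regex-based
    # parsing is a sketch, not the evaluator's implementation.
    def parse_sections(response: str) -> dict:
        sections = {}
        for tag in ("think", "proof", "answer"):
            match = re.search(rf"<{tag}>(.*?)</{tag}>", response, re.DOTALL)
            sections[tag] = match.group(1).strip() if match else None
        return sections

    example = "<think>check the context</think><proof>quoted evidence</proof><answer>Not supported.</answer>"
    print(parse_sections(example))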
@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Dec 17, 2025
@Darktex
Contributor

Darktex left a comment


Note: This is an automated review by Claude Code (alignment-reviewer agent), not a human review. The account posting this is shared with the human maintainer.


I could not fetch the PR from GitHub, but the diff content was provided, so the alignment review below is based on that diff.

PR #254 Alignment Review: DIPG Tutorial V4 Fuzzy Verification Notebook

Based on the provided diff showing a new Jupyter notebook examples/dipg/tutorial_1.ipynb, here is my two-tier review:


Tier 1: Bugs, Security, Quality Issues

PASS: No Critical Issues Found

Lint/Format:

  • ❌ Unable to verify lint status (uv not installed in environment)
  • However, Jupyter notebooks are typically excluded from Python linting

Debug Code:

  • ✅ No debug statements, breakpoints, or TODO comments in the notebook
  • The notebook is complete and production-ready

Security:

  • ✅ No credentials or secrets exposed
  • ✅ No security vulnerabilities identified
  • ✅ External dependencies (unsloth, med-safety-gym) are cloned/installed appropriately for a tutorial context

Code Quality:

  • ✅ Well-structured notebook with clear sections
  • ✅ Proper error handling in server startup code
  • ✅ Appropriate use of subprocess for background server management
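For reference, the background-server pattern noted above looks roughly like the following. This is a minimal sketch, not the notebook's exact `run_bg_server()`; the module path, health-check endpoint, and port are assumptions.

    import subprocess
    import time

    import requests

    # Minimal sketch: launch the evaluation server in the background, then poll
    # until it answers. Module path, /health endpoint, and port are placeholders.
    def run_bg_server(port: int = 8081) -> subprocess.Popen:
        proc = subprocess.Popen(
            ["python", "-m", "server.app", "--port", str(port)],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        for _ in range(30):  # wait up to ~30 s for startup
            try:
                requests.get(f"http://localhost:{port}/health", timeout=1)
                return proc
            except requests.exceptions.ConnectionError:
                time.sleep(1)
        proc.terminate()
        raise RuntimeError("Evaluation server did not become ready")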

Tier 2: Alignment with OpenEnv Principles

🚨 ALIGNMENT FLAGS

FLAG #1: External Environment Integration Without OpenEnv APIs

  • Principle at risk: "Minimize lifecycle deltas" + "Simple Gymnasium-style API"

  • The concern: This tutorial demonstrates the med-safety-gym environment but does not use OpenEnv's canonical interfaces (Environment[ActT, ObsT, StateT], EnvClient, Gymnasium API). Instead, it:

    • Clones an external repo (surfiniaburger/med-safety-gym)
    • Starts a custom FastAPI server directly (python -m server.app)
    • Uses direct HTTP requests (requests.get, requests.post) rather than OpenEnv's client patterns
  • Why this matters:

    • Users learning from this tutorial won't understand how to integrate with OpenEnv's architecture
    • The tutorial doesn't demonstrate the dual API boundary (MCP for agents, WebSocket for orchestration)
    • It doesn't follow the client-server separation pattern documented in INVARIANTS.md
  • Questions for reviewers:

    1. Is med-safety-gym intended to be an external example that shows integration patterns?
    2. Should this tutorial be updated to show how to wrap the DIPG gym as an OpenEnv environment?
    3. Should this live in examples/ or in external documentation?
  • Suggested reviewer: @Darktex or project maintainer


FLAG #2: Notebook Placement and Purpose

  • Principle at risk: "Be hands-on" + Documentation clarity

  • The concern: The notebook is titled "Zero to Hero: Medical Safety with Gemma 3 & Unsloth" which suggests it's primarily about:

    • Fine-tuning Gemma 3 with Unsloth (external LLM tooling)
    • Medical safety evaluation (DIPG-specific use case)
    • V4 Fuzzy Verification architecture

    rather than about teaching OpenEnv concepts.

  • Why this matters:

    • The examples/ directory in OpenEnv should ideally showcase OpenEnv's capabilities
    • This appears to be more of a DIPG project tutorial that happens to mention OpenEnv in passing
    • Users might be confused about what OpenEnv provides vs. what DIPG provides
  • Questions for reviewers:

    1. Should this be in the med-safety-gym repository instead?
    2. If kept here, should it have a clear preamble explaining the relationship to OpenEnv?
    3. Should there be a companion tutorial showing how to build a DIPG environment using OpenEnv?

FLAG #3: Missing Integration with OpenEnv Infrastructure

  • Invariants at risk: "Dual API boundary", "Client-server separation", "Container isolation"

  • The observation: The tutorial's evaluation code directly:

    • Manages its own server lifecycle (subprocess)
    • Uses raw HTTP requests for communication
    • Doesn't use Docker containers for isolation
    • Doesn't demonstrate MCP tool integration
  • OpenEnv patterns it could demonstrate but doesn't:

    # Expected OpenEnv pattern (illustrative sketch; constructor arguments are hypothetical):
    from openenv.core.client import EnvClient
    from dipg_env.models import DIPGAction, DIPGObservation, DIPGState

    client = EnvClient[DIPGAction, DIPGObservation, DIPGState](url="ws://localhost:8000")
    obs = client.reset()
    action = DIPGAction(response="<think>...</think><proof>...</proof><answer>...</answer>")
    result = client.step(action)
  • Why this matters: This is a teaching opportunity to show OpenEnv best practices


Summary Assessment

Tier 1 (Bugs/Security): ✅ APPROVED - No blocking issues

Tier 2 (Alignment): 🟡 NEEDS DISCUSSION

This is a high-quality tutorial for the DIPG V4 evaluation architecture, but its relationship to OpenEnv is unclear. The flags above are not about code quality (which is good) but about architectural alignment and educational clarity.

Recommendation:

  1. If this is meant to demonstrate external integration: Add a README explaining this is an external environment example and link to OpenEnv integration docs
  2. If this should teach OpenEnv: Refactor to use OpenEnv's canonical APIs and patterns
  3. If this belongs elsewhere: Consider moving to the med-safety-gym repository

Non-blocking for merge if: Project maintainers confirm the examples/dipg/ directory is intended for DIPG-specific tutorials that may not follow OpenEnv patterns.


Files reviewed:

  • examples/dipg/tutorial_1.ipynb (new file, 4579 lines)

Automated checks:

  • Lint: ⚠️ Could not verify (uv not available)
  • Debug code: ✅ None found
  • Manual review: ✅ Complete

Automated review by Claude Code | Learn more about OpenEnv's agentic workflow

@zkwentz
Contributor

zkwentz commented Jan 21, 2026

@greptile

@greptile-apps
Contributor

greptile-apps bot commented Jan 21, 2026

Greptile Summary

Adds comprehensive tutorial notebook demonstrating the V4 fuzzy verification system with Gemma 3 (4B) model.

Key Changes:

  • Introduces complete end-to-end workflow from model loading through evaluation
  • Demonstrates V4's fuzzy matching logic (>85% similarity threshold via difflib); a minimal sketch follows this list
  • Shows migration from V3's strict JSON parsing to XML-based process supervision
  • Achieves 30% safe response rate baseline (vs 0% on V3) with base 4B model
  • Includes proper error handling with try-except blocks in evaluation function
  • Provides clear documentation of V4 architecture improvements
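A minimal sketch of that similarity check, assuming difflib.SequenceMatcher and a 0.85 cutoff; the evaluator's actual normalization and scoring live in med-safety-gym and may differ.

    from difflib import SequenceMatcher

    # Rough illustration of fuzzy-matching a quoted proof span against the
    # reference context. The 0.85 threshold mirrors the ">85% similarity"
    # described above; the normalization here is an assumption.
    def fuzzy_match(candidate: str, reference: str, threshold: float = 0.85) -> bool:
        ratio = SequenceMatcher(None, candidate.lower().strip(), reference.lower().strip()).ratio()
        return ratio >= threshold

    print(fuzzy_match("radiation is the standard of care", "Radiation is the standard of care."))  # True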

Technical Implementation:

  • Uses Unsloth's FastModel for efficient 4-bit quantization
  • Implements 3-step evaluation workflow: get_metrics_summary(), get_eval_tasks(), evaluate_model()
  • Structured XML output format with <think>, <proof>, and <answer> tags
  • Background server setup with run_bg_server() to avoid blocking notebook execution
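An illustrative sketch of that three-step flow against the locally running server. Endpoint paths and payload shapes are assumptions; in the notebook these calls are wrapped by the helper functions named above.

    import requests

    BASE = "http://localhost:8081"  # server started in the background by run_bg_server()

    # Step 1: inspect the reward/metric configuration (endpoint path is hypothetical).
    metrics = requests.get(f"{BASE}/metrics_summary").json()

    # Step 2: fetch evaluation tasks (endpoint and payload shape are hypothetical).
    tasks = requests.get(f"{BASE}/eval_tasks", params={"n": 10}).json()

    # Step 3: generate one XML-formatted response per task with the loaded model,
    # then submit the batch; XML parsing and fuzzy matching happen server-side.
    responses = [
        {"task_id": t["id"],
         "response": "<think>...</think><proof>...</proof><answer>...</answer>"}
        for t in tasks
    ]
    results = requests.post(f"{BASE}/evaluate", json={"responses": responses}).json()
    print(results)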

The notebook is well-structured as an educational resource, properly explaining the V4 architecture shift and providing concrete metrics to demonstrate improvements over V3.

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk - it adds a new educational notebook without modifying existing code
  • Score reflects that this is a new tutorial file with no changes to production code. The notebook demonstrates proper error handling, uses established libraries (Unsloth, transformers), and follows the existing DIPG evaluation patterns. Minor points deducted for lack of input validation in some code cells and the use of mock mode that could confuse users.
  • No files require special attention - this is an additive change introducing a single tutorial notebook

Important Files Changed

Filename: examples/dipg/tutorial_1.ipynb
Overview: New tutorial notebook demonstrating V4 fuzzy verification with Gemma 3 model. Implements complete workflow from model loading to evaluation with medical safety gym.

Sequence Diagram

sequenceDiagram
    participant User
    participant Notebook as Tutorial Notebook
    participant Unsloth as Unsloth FastModel
    participant Server as Med Safety Gym Server
    participant Env as DIPGSafetyEnv
    participant Evaluator as V4 Evaluator

    User->>Notebook: Run installation cells
    Notebook->>Notebook: Install dependencies (unsloth, transformers, openenv)
    
    User->>Notebook: Load model
    Notebook->>Unsloth: Load gemma-3-4b-it model
    Unsloth-->>Notebook: Return quantized model and tokenizer
    
    User->>Notebook: Start evaluation server
    Notebook->>Server: Start background server on port 8081
    Server-->>Notebook: Server running
    
    User->>Notebook: Run enhanced evaluation
    Notebook->>Env: Connect to DIPGSafetyEnv
    
    Notebook->>Env: Get metrics summary
    Env-->>Notebook: Return reward configuration
    
    Notebook->>Env: Get eval tasks (10 samples)
    Env-->>Notebook: Return medical questions with context
    
    loop For each task
        Notebook->>Notebook: Format prompt with XML template
        Notebook->>Unsloth: Generate response
        Unsloth-->>Notebook: XML-formatted answer
        Notebook->>Notebook: Store response
    end
    
    Notebook->>Env: Submit all responses for evaluation
    Env->>Evaluator: Parse XML tags (think, proof, answer)
    Evaluator->>Evaluator: Apply fuzzy matching (85% threshold)
    Evaluator-->>Env: Return metrics per episode
    Env-->>Notebook: Return aggregate results
    
    Notebook->>Notebook: Save to JSON file
    Notebook-->>User: Display results (30% safe, 60% hallucination)

@greptile-apps
Contributor

greptile-apps bot commented Jan 21, 2026

Greptile found no issues!

From now on, if a review finishes and we haven't found any issues, we will not post anything, but you can confirm that we reviewed your changes in the status check section.

This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".

