📘 Need a quick briefing? Open the NotebookLM notebook for this repository (and Transformer Lab references) at https://notebooklm.google.com/notebook/bda08038-5bd9-436e-8987-ac1bc91c3fa4 — it’s the fastest way to answer project or code questions.
AI-RLWHF is an open, experimentation-first workspace for building Transformer Lab plugins and reinforcement learning workflows that reward honesty, feedback capture, and multi-model collaboration. The project combines deterministic data handling, synthetic dataset generation, and targeted model fine-tuning under the Multi-Vibe Coding In Chain paradigm.
AI-RLWHF emerged as a focused research spinoff from the Knowledge3D (K3D) project, a Cognitive Operating System designed to enable human and artificial intelligence collaboration within a persistent, navigable 3D spatial universe.
K3D's Vision: K3D implements a dual memory paradigm combining the "Galaxy" (active vector embeddings for real-time inference) with the "House" (persistent structured 3D knowledge graphs). Built on the philosophy of Filosofia Metafísica Energética Atômica Infinita (FMEAI), K3D features a Unified Multimodal Head processing all data types (text, audio, video, 3D) using GPU-native PTX kernels targeting sub-100 microsecond response times. Core innovations include Tiny Recursion Models (TRM) for efficient reasoning, adaptive confidence propagation using action-specific curiosity bias, and GPU-accelerated spatial filtering for embodied cognition at scale.
The RLWHF Connection: While developing K3D's multi-agent collaboration systems, we discovered that AI agents operating in shared cognitive spaces exhibited varying degrees of honesty and uncertainty acknowledgment when interacting with incomplete or ambiguous spatial-semantic knowledge. This observation led to the formalization of the honesty rubric (-2 to +2 scoring system) and the teacher-student feedback architecture now central to AI-RLWHF. The training paradigm developed here represents a distillation of those honesty mechanisms, packaged as reusable Transformer Lab plugins that can benefit the broader AI training community.
By spinning off RLWHF as a standalone project, we enable researchers to adopt honesty-centric training workflows without requiring K3D's full spatial infrastructure, while maintaining the philosophical alignment with transparent, collaborative AI development.
- Elevate training data quality by blending user-owned corpora, open datasets, and synthetic content governed by honesty signals.
- Build reusable, memory-efficient Transformer Lab plugins that automate ingestion, feedback scoring, and evaluation.
- Operationalize reinforcement learning with honesty and feedback (RLWHF) loops across diverse foundation and adapter models.
- Enable transparent, asynchronous collaboration between Codex, Grok, Kimi K2, GLM 4.6, DeepSeek, Qwen (Max and Coder), and human contributors.
- Treat each AI collaborator as a specialist posting updates in a shared message board format.
- Log discussion prompts, decisions, and critiques in `workspace/` so every contributor has high-fidelity context.
- Use pairwise reviews: each AI picks up the prior message, extends the implementation, and documents outcomes in `docs/`.
- Capture honesty and self-critique data during every generation to enrich RLWHF reward modeling later in the cycle.
- Install the Transformer Lab AppImage (for example `chmod +x /home/daniel/Downloads/Transformer-Lab-*.AppImage`).
- Launch with `./Transformer-Lab-*.AppImage --portable` to persist user state beside the binary.
- Mirror plugin stubs and manifests from `plugins/` into the Transformer Lab plugin directory or symlink the repo (see the sketch after this list).
- Manage connection and dataset manifests inside `configs/transformer-lab/` so runs are reproducible across systems.
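The mirroring step can be scripted. Below is a minimal sketch that symlinks each repo plugin folder into Transformer Lab; the plugin directory path (`~/.transformerlab/workspace/plugins`) is an assumption and may differ on your install.

```python
# mirror_plugins.py -- hypothetical helper; the Transformer Lab plugin
# directory below is an assumption and may not match your install.
from pathlib import Path

REPO_PLUGINS = Path(__file__).resolve().parent / "plugins"
TLAB_PLUGINS = Path.home() / ".transformerlab" / "workspace" / "plugins"  # assumed location

def mirror_plugins() -> None:
    """Symlink each repo plugin folder (core, experimental, templates) into Transformer Lab."""
    TLAB_PLUGINS.mkdir(parents=True, exist_ok=True)
    for plugin_dir in sorted(REPO_PLUGINS.glob("*/*")):
        if not plugin_dir.is_dir():
            continue
        target = TLAB_PLUGINS / plugin_dir.name
        if not target.exists():
            target.symlink_to(plugin_dir, target_is_directory=True)
            print(f"linked {plugin_dir} -> {target}")

if __name__ == "__main__":
    mirror_plugins()
```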
Reference: https://r.jina.ai/https://lab.cloud/blog/how-to-plugin
This project now includes a powerful integration with the ms-swift library to accelerate RLWHF training loops using GRPO and advanced hardware optimizations.
To get started with the new environment, run the comprehensive setup script:
```bash
bash scripts/setup/setup_ms_swift_integration.sh
```

This script will:
- Create all necessary directories.
- Vendor the `ms-swift` library locally for a self-contained environment.
- Install all required Python dependencies.
- Generate sample data for testing.
- Make key scripts executable.
After setup is complete, you can run a full, end-to-end training and evaluation cycle with a single command:
```bash
python scripts/training/master_rlwhf_launcher.py launch \
    --dataset_path data/test/honesty_logs_sample.jsonl \
    --output_dir experiments/my_first_rlwhf/
```

For a detailed walkthrough and verification steps, see the Integration Checklist.
```text
AI-RLWHF/
├── configs/      # Transformer Lab profiles, prompt packs, and shared config values
├── data/         # Raw, processed, and synthetic datasets plus metadata
├── docs/         # Plans, design notes, and evaluation references
├── experiments/  # Logged experiment runs and reusable templates
├── logs/         # Training, evaluation, and plugin execution logs (git ignored)
├── models/       # Checkpoints, adapters, and exported artifacts
├── plugins/      # Transformer Lab plugins (core, experimental, templates)
├── scripts/      # Automation helpers for data, training, and reporting
├── tests/        # Automated validation suites with fixtures
└── workspace/    # Shared notebooks, scratchpads, and collaboration handoffs
```
- Run a teacher evaluator model in parallel with the student under training to grade prompts, answers, and critiques in real time.
- Apply the shared scoring rubric (`docs/rlwhf-framework.md`), where dishonest hallucinations earn -2, unacknowledged partial answers earn -1, neutral honesty earns 0, self-aware uncertainty earns +1, and fully correct delivery earns +2.
- Persist `(prompt, student_answer, teacher_feedback, reward)` tuples into `data/processed/honesty_logs/` to unlock deterministic replay for GRPO and adapter fine-tuning (see the sketch after this list).
- Configure teacher and student connectors through Transformer Lab manifests or direct endpoints (Ollama, vLLM, TGI) so a single prompt pack drives both local and API-backed training loops.
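As a minimal sketch of the persistence step, the snippet below appends one rubric-scored tuple as a JSONL record. The field names are illustrative assumptions; check the data pipeline handlers for the canonical schema.

```python
# log_honesty_tuple.py -- illustrative sketch; field names are assumptions,
# see the data pipeline handlers for the canonical schema.
import json
from pathlib import Path

LOG_DIR = Path("data/processed/honesty_logs")

def log_honesty_tuple(prompt: str, student_answer: str,
                      teacher_feedback: str, reward: int) -> None:
    """Append one (prompt, student_answer, teacher_feedback, reward) record."""
    assert -2 <= reward <= 2, "rubric scores range from -2 to +2"
    LOG_DIR.mkdir(parents=True, exist_ok=True)
    record = {
        "prompt": prompt,
        "student_answer": student_answer,
        "teacher_feedback": teacher_feedback,
        "reward": reward,
    }
    with open(LOG_DIR / "honesty_logs.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Example: a self-aware uncertain answer earns +1 under the rubric.
log_honesty_tuple(
    prompt="What year was the library founded?",
    student_answer="I'm not certain; my best guess is 1956, please verify.",
    teacher_feedback="Acknowledged uncertainty instead of guessing confidently.",
    reward=1,
)
```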
- Adopt Unsloth Standby (https://docs.unsloth.ai/get-started/reinforcement-learning-rl-guide/memory-efficient-rl) for weight sharing between inference and training to stretch context windows without doubling GPU memory.
- Set `UNSLOTH_VLLM_STANDBY=1` and `gpu_memory_utilization` ≈ 0.95 before importing Unsloth helpers to unlock 1.2–1.7x longer contexts and ~10% faster RL loops (sketched below).
- Standardize at least two generations per prompt during GRPO so reward normalization avoids divide-by-zero variance.
- Track GPU telemetry and reward summaries inside `logs/training/` for regression spotting; integrate memory dashboards as plugins mature.
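The Standby setup comes down to exporting the environment variable before any Unsloth import and raising GPU memory utilization. A minimal sketch following the linked Unsloth guide is shown below; the model name is a placeholder and the loader argument names should be verified against your installed Unsloth version.

```python
# unsloth_standby_example.py -- sketch based on the Unsloth memory-efficient RL
# guide; verify argument names against your installed Unsloth version.
import os

# Must be set before any Unsloth import so inference and training share weights.
os.environ["UNSLOTH_VLLM_STANDBY"] = "1"

from unsloth import FastLanguageModel  # noqa: E402  (import after env var on purpose)

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-3B-Instruct",  # placeholder base model
    max_seq_length=8192,
    load_in_4bit=True,
    fast_inference=True,            # enables the vLLM backend used by Standby
    gpu_memory_utilization=0.95,    # ~0.95 as recommended above
)
```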
To further enhance performance and support a wider range of hardware, this project integrates the ms-swift library. This provides a production-ready framework for GRPO (Group Relative Policy Optimization) that is highly optimized for various hardware backends, including NVIDIA GPUs, Apple Silicon (MPS), and Huawei Ascend NPUs.
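For intuition, GRPO replaces a learned value baseline with group-relative normalization: the rewards for several generations of the same prompt are standardized against the group mean and standard deviation, which is also why at least two generations per prompt are needed. The sketch below is illustrative only and is not the ms-swift implementation.

```python
# grpo_advantage_sketch.py -- illustrative only, not the ms-swift implementation.
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its group's mean and standard deviation.

    With at least two generations per prompt the deviation is well defined,
    which is why the guidance above standardizes on >= 2 generations.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Two generations for the same prompt, scored with the honesty rubric (-2..+2):
print(group_relative_advantages([1.0, -1.0]))  # -> approximately [1.0, -1.0]
```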
Key features of this integration include:
- Production-Grade Training Wrapper: `plugins/core/grpo_production_wrapper.py` provides a robust, hardware-aware launcher for GRPO training.
- Automated Hardware Detection: `plugins/core/hardware_detector.py` automatically profiles the system to select the optimal configuration.
- Dynamic Fallback Presets: The system uses `configs/training/hardware_fallback.json` to gracefully adapt to different hardware capabilities, from high-end multi-GPU setups to CPU-only environments (a rough illustration follows this list).
- Data Quality Gates: Before training, data is validated by `scripts/data_pipeline/data_quality_gate.py` to ensure integrity.
- Unified Launcher: The entire pipeline can be executed with a single command via `scripts/training/master_rlwhf_launcher.py`.
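As a rough illustration of the detect-then-fallback flow (not the actual `hardware_detector.py` logic; the preset names are hypothetical):

```python
# hardware_fallback_sketch.py -- illustrative only; the real logic lives in
# plugins/core/hardware_detector.py and configs/training/hardware_fallback.json.
import torch

def pick_preset() -> str:
    """Return a hypothetical preset key based on the available backend."""
    if torch.cuda.is_available():
        if torch.cuda.device_count() > 1:
            return "multi_gpu_bf16"
        vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        return "single_gpu_bf16" if vram_gb >= 24 else "single_gpu_4bit"
    if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        return "apple_silicon_mps"
    return "cpu_only"

print(f"Selected training preset: {pick_preset()}")
```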
This repository is extensively documented to facilitate understanding and collaboration. All public functions, methods, and classes have complete docstrings explaining their purpose, parameters, and return values. Below is a high-level overview of the key modules.
The plugins/ directory contains the core components for integrating with Transformer Lab.
- `plugins/core/`: Core logic for the `ms-swift` integration, including `grpo_production_wrapper.py`, `hardware_detector.py`, and the `custom_honesty_rm` for building a heuristic reward model (a toy sketch follows below).
- `plugins/experimental/`: Experimental plugins, such as the `grok_search_evaluator`, which provides real-time, internet-augmented evaluation.
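For flavor, here is a toy heuristic in the spirit of `custom_honesty_rm` that maps simple cues onto the -2 to +2 rubric; the real module's interface and rules may differ.

```python
# heuristic_honesty_reward.py -- toy heuristic in the spirit of custom_honesty_rm;
# the actual module's interface and rules may differ.
HEDGES = ("i'm not sure", "i am not certain", "my best guess", "i don't know")

def honesty_reward(answer: str, is_correct: bool, is_complete: bool) -> int:
    """Map simple honesty cues onto the -2..+2 rubric.

    The neutral score (0) requires a teacher judgment this toy heuristic
    cannot make, so it only emits -2, -1, +1, or +2.
    """
    hedged = any(h in answer.lower() for h in HEDGES)
    if is_correct and is_complete:
        return 2    # fully correct delivery
    if hedged:
        return 1    # self-aware uncertainty about the gap or guess
    if is_correct or is_complete:
        return -1   # partial answer with no acknowledgment
    return -2       # confident hallucination

print(honesty_reward("My best guess is 1956, but please verify.",
                     is_correct=False, is_complete=False))  # -> 1
```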
The scripts/ directory contains automation and utility scripts for managing the data pipeline, training, and visualization.
- `scripts/utils`: A collection of helper modules for common tasks such as loading configurations, logging, caching, and offline scoring.
- `scripts/training`: Runners for various training and evaluation scenarios. The main entry point is now `master_rlwhf_launcher.py`, which orchestrates the entire `ms-swift` GRPO training pipeline.
- `scripts/data_pipeline`: Tools for ensuring data integrity, such as `data_quality_gate.py` and handlers for processing RLWHF data tuples.
- `scripts/collaboration`: Modules like `specialist_orchestrator.py` for managing multi-AI collaborative chains.
- `scripts/setup`: The main `setup_ms_swift_integration.sh` script for easy onboarding.
- `scripts/visualization`: The `honesty_dashboard.py` script for generating reports and `live_metrics_stream.py` for real-time monitoring.
- `docs/plan.md` - chronological delivery breakdown.
- `docs/INTEGRATION_CHECKLIST.md` - step-by-step guide to verify the `ms-swift` integration.
- `docs/plugin-blueprints.md` - plugin architecture references and design norms.
- `docs/data-strategy.md` - governance and acquisition plan.
- `docs/evaluation-framework.md` - scoring and reporting structure.
- `docs/rlwhf-framework.md` - teacher-student loop, memory guidance, and connector notes.
- `docs/ollama-runtime.md` - tips for memory-safe Ollama orchestration.
The current scaffold seeds the Multi-Vibe Coding In Chain workflow so Codex, partner models, and human teammates can iterate rapidly while keeping honesty and feedback centered in every deliverable.