Systematic review enabled - still needs validation

BMLibrarian v0.9.6 Release Notes

Release Date: December 9, 2025
Previous Release: v0.9-alpha (November 11, 2025)
Total Commits: 817+ since v0.9-alpha

This release includes significant new features, extensive bug fixes, and substantial improvements to stability and reliability for the systematic review workflow.


Highlights

  • Complete Systematic Review Workflow - Full end-to-end systematic literature review automation with checkpoint-based resumability
  • Evidence Synthesis Engine - AI-powered citation extraction and narrative synthesis from included papers
  • Professional PDF Export - Publication-quality PDF generation from markdown reports
  • Unified Evaluations Database - PostgreSQL-backed evaluation tracking with full audit trail
  • Model Benchmarking Tool - Compare and evaluate document scoring models
  • Citation-Aware Writing Editor - Academic writing plugin with reference management

New Features

Systematic Review System

The systematic review module has been completely overhauled with production-ready capabilities:

  • Checkpoint-Based Resumability - Save and resume reviews at any phase:

    • Search strategy checkpoint
    • Initial results checkpoint
    • Scoring complete checkpoint
    • Quality assessment checkpoint
    • Full progress history displayed when resuming
  • Evidence Synthesis - New EvidenceSynthesizer component (see the usage sketch after this list):

    • Extracts relevant citations from included papers
    • Generates narrative synthesis answering research questions
    • Configurable citation thresholds and limits
    • Real-time progress callbacks
  • Improved Search & Scoring:

    • Phased execution mode for better progress tracking
    • Per-document progress updates in UI
    • Default inclusion/exclusion criteria
    • Comprehensive excluded paper tracking with reasons
  • Quality Assessment Improvements:

    • Database caching for all quality assessments (PICO, PRISMA, Study Assessment, Paper Weight)
    • Version-tracked cache invalidation
    • Improved study type detection including narrative reviews, scoping reviews, and expert opinions
  • Systematic Review GUI (systematic_review_gui.py):

    • Tabbed interface with report preview
    • Real-time progress visualization
    • Checkpoint browser for resume selection
    • Full activity log with markdown formatting
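
The sketch below shows how the new synthesis step might be driven programmatically. Only the EvidenceSynthesizer name and its (message, current, total) progress-callback signature come from these notes; the import path, method name, and other parameters are assumptions for illustration.

# Hedged usage sketch: the import path, method name, and parameter names are
# assumptions; only the EvidenceSynthesizer name and the
# (message, current, total) callback signature are documented in these notes.
from bmlibrarian.agents import EvidenceSynthesizer  # assumed import path

def on_progress(message: str, current: int, total: int) -> None:
    # v0.9.6 callback signature (see Breaking Changes below)
    print(f"[{current}/{total}] {message}")

included_papers = []  # fill with the papers that passed inclusion/exclusion

synthesizer = EvidenceSynthesizer(progress_callback=on_progress)
synthesis = synthesizer.synthesize(            # method name assumed
    research_question="Does intervention X improve outcome Y?",
    papers=included_papers,
    max_citations=25,                          # hypothetical citation-limit parameter
)
print(synthesis)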

PDF Export System

New professional PDF export using ReportLab:

  • Pure Python - No external binary dependencies (wkhtmltopdf, etc.)
  • Cross-Platform - Works on Windows, macOS, and Linux
  • Publication Quality - Proper fonts, page numbering, headers/footers
  • Full Markdown Support - Headings, lists, tables, code blocks, emphasis
  • Configurable - Page size (A4/Letter), fonts, margins, colors
uv run python export_to_pdf.py report.md -o report.pdf --research-report

Model Benchmarking Tool

New CLI and module for evaluating document scoring models:

uv run python model_benchmark_cli.py benchmark "research question" \
    --models gpt-oss:20b medgemma4B_it_q8:latest \
    --authoritative gpt-oss:120B
  • Compare scoring consistency across models
  • Alignment metrics and statistical analysis
  • Database-backed benchmark run history
  • Visualization of score distributions

Citation-Aware Writing Editor

New Writing plugin for academic document creation:

  • Markdown editor with live preview
  • Automatic References section management
  • Citation insertion from BMLibrarian database
  • Auto-save and document recovery
  • PDF export integration

Audit Trail Validation GUI

New interface for human review of automated evaluations:

uv run python audit_validation_gui.py --user reviewer_name
  • Review and validate AI-generated assessments
  • Incremental mode for unvalidated items only
  • Track validation decisions with explanations

Unified Evaluations Module

New database-backed evaluation tracking system:

  • PostgreSQL evaluations schema for all assessment types
  • Evaluation runs with status tracking (in_progress, completed, failed)
  • Processing time and confidence tracking
  • Full audit trail with timestamps
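
As a rough illustration, the audit trail could be inspected from Python along these lines; the table and column names below are assumptions inferred from the feature list, not the documented schema.

# Hedged sketch: the evaluations.evaluation_runs table and its columns are
# assumed names inferred from the feature list above, not the actual schema.
import psycopg2

conn = psycopg2.connect("dbname=bmlibrarian")  # connection details are illustrative
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT id, status, started_at, completed_at
        FROM evaluations.evaluation_runs       -- assumed table name
        WHERE status = %s
        ORDER BY completed_at DESC
        """,
        ("completed",),
    )
    for row in cur.fetchall():
        print(row)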

Improvements

Study Type Detection

  • Added new study types: narrative_review, scoping_review, expert_opinion
  • Improved LLM prompts for accurate study classification
  • Better handling of review articles that were previously classified as "unknown"

PRISMA Assessment

  • Auto-repair incomplete LLM responses instead of failing
  • Fill missing fields with sensible defaults and clear warnings
  • Track incomplete responses for quality monitoring
  • Include actual invalid values in warning messages for debugging
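
The auto-repair behaviour is roughly in the spirit of this sketch; the helper, field names, and defaults are illustrative rather than the actual implementation.

# Illustrative only: the helper, field names, and defaults are hypothetical;
# the shipped PRISMA agent applies the same idea to its own fields.
import logging

logger = logging.getLogger(__name__)

DEFAULTS = {"protocol_registered": False, "flow_diagram_reported": False}  # example fields

def repair_prisma_response(response: dict) -> dict:
    """Fill missing or invalid PRISMA fields with defaults instead of failing."""
    repaired = dict(response)
    for field, default in DEFAULTS.items():
        value = repaired.get(field)
        if not isinstance(value, bool):
            # The warning includes the actual invalid value to aid debugging
            logger.warning("PRISMA field %r invalid (got %r); defaulting to %r",
                           field, value, default)
            repaired[field] = default
            repaired["incomplete_response"] = True  # tracked for quality monitoring
    return repaired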

Database & Caching

  • Results cache for all quality assessments (study assessment, PICO, PRISMA, paper weight)
  • Version-based cache invalidation
  • Fixed N+1 query patterns in paper retrieval
  • Immediate evaluation persistence (no batch-only saves)
  • DateTimeEncoder for proper JSON serialization
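
The DateTimeEncoder mentioned above presumably follows the standard json.JSONEncoder pattern sketched here; the shipped class may differ in detail.

# Standard-library pattern for a datetime-aware JSON encoder; the actual
# DateTimeEncoder in the cache manager may differ in detail.
import json
from datetime import date, datetime

class DateTimeEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, (datetime, date)):
            return obj.isoformat()  # store timestamps as ISO 8601 strings
        return super().default(obj)

# Example: serializing a cached assessment record
payload = json.dumps({"assessed_at": datetime.now()}, cls=DateTimeEncoder)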

GUI Improvements

  • Cross-platform font support (fixes macOS font warnings)
  • Default page size changed to A4 (international standard)
  • PDF viewer with text selection, search, and fit-width zoom
  • Improved progress bars with per-step tracking
  • Restored progress display when resuming from checkpoints

Bug Fixes

Critical Fixes

  • Fixed checkpoint resume crashes with proper error handling
  • Fixed evaluation data not being saved to database
  • Fixed datetime JSON serialization errors in cache manager
  • Fixed callback signature mismatch in EvidenceSynthesizer
  • Fixed N+1 query pattern causing performance issues

Systematic Review Fixes

  • Fixed InclusionDecision construction with required arguments
  • Fixed InclusionStatus.PENDING to use UNCERTAIN
  • Fixed InitialFilter initialization parameter errors
  • Fixed missing research_question in RelevanceScorer
  • Fixed UnboundLocalError in phased search mode
  • Fixed checkpoint files not being saved during resume
  • Fixed missing final_rank in checkpoint resume
  • Fixed quality gate statistics showing incorrect counts

Assessment Fixes

  • Fixed PaperWeightAssessmentAgent.assess_paper() parameter name
  • Fixed PRISMA None results crashing on .to_dict() calls
  • Fixed PostgreSQL type casting for evaluation functions
  • Fixed study_design field extraction for quality assessment

GUI Fixes

  • Fixed QThread crash on application close
  • Fixed validation status not updating in list views
  • Fixed pipe characters breaking markdown tables
  • Fixed report viewer attribute errors after merge

Breaking Changes

  • EvidenceSynthesizer.progress_callback now expects (message, current, total) signature
  • InclusionDecision now requires stage parameter (not exclusion_stage)
  • Relevance score range changed from (1, 5) to (0, 5) to allow marking irrelevant documents
  • Full-text documents must be chunked/embedded before paper weight assessment
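
For code written against v0.9-alpha, the API changes translate roughly as sketched below; the three-argument callback is also shown in context in the EvidenceSynthesizer sketch earlier, and the remaining names here are illustrative.

# Migration sketch: only the renamed keyword, the callback arity, and the new
# score range come from these notes; the surrounding names are illustrative.

# Progress callbacks passed to EvidenceSynthesizer now receive three arguments:
#   def on_progress(message: str, current: int, total: int) -> None: ...

# InclusionDecision takes `stage` rather than `exclusion_stage`:
#   InclusionDecision(stage="screening", ...)

# Relevance scores now span 0-5, with 0 marking an irrelevant document:
def is_relevant(score: int) -> bool:
    return score > 0  # 0 = irrelevant under the new 0-5 range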

Database Migrations

This release includes new database schemas:

  • evaluations schema for evaluation tracking
  • results_cache schema for quality assessment caching

Run the migration scripts before using new features:

uv run python -m bmlibrarian.database.migrations

Documentation Updates

  • New user guides for evidence synthesis, PDF export, and model benchmarking
  • Updated developer documentation for evaluations module
  • Added golden rules compliance documentation
  • Improved CLAUDE.md with comprehensive project structure

Contributors

This release was developed with significant contributions from Claude Code (Anthropic's AI coding assistant), demonstrating effective human-AI collaboration in complex software development.


Upgrade Instructions

  1. Update dependencies:

    uv sync
  2. Run database migrations:

    uv run python initial_setup_and_download.py your.env --skip-medrxiv --skip-pubmed
  3. Clear any stale caches:

    # In PostgreSQL
    TRUNCATE results_cache.study_assessments CASCADE;

Known Issues

  • Large systematic reviews (>1000 papers) may require increased PostgreSQL connection pool size
  • PRISMA assessment may return incomplete results for some document types (auto-repaired with warnings)
  • Evidence synthesis requires Ollama models with sufficient context window

What's Next

  • Enhanced multi-model query generation
  • Improved inter-rater reliability analysis tools
  • Web-based interface option
  • Enhanced counterfactual analysis for contradictory evidence detection

For detailed documentation, see the doc/ directory or visit the project repository.