
Conversation

@tareknaser
Collaborator

Description

This PR adds the infrastructure to run courseexam benchmark evaluation using the updated data format.

Contributor

Copilot AI left a comment


Pull request overview

This pull request adds complete evaluation infrastructure for the courseexam benchmark, implementing a modernized data format and evaluation pipeline using the Inspect framework.

Key Changes:

  • Implements evaluation infrastructure with task definition, dataset loading, custom metrics, and LLM-as-judge scoring for both ExactMatch and Freeform question types
  • Migrates data format from flat structure with instance_id to nested structure with input/target/metadata fields following Inspect conventions
  • Updates test suite to validate the new data schema and adds automated dataset generation as part of test workflow
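The nested input/target/metadata structure described above can be sketched as a small round-trip through JSONL. The field names input, target, and metadata come from the PR description; the specific metadata keys shown here (exam, question_type, points, choices) and the example values are illustrative assumptions, not the PR's exact schema.

```python
import json

# Hypothetical sample in the nested format described above.
# Keys inside "metadata" are assumptions for illustration only.
sample = {
    "input": "Which xv6 system call creates a new process?",
    "target": "B",
    "metadata": {
        "exam": "example_course_2024_midterm",
        "question_type": "ExactMatch",
        "points": 2,
        "choices": ["exec", "fork", "wait", "pipe"],
    },
}

# A JSONL dataset stores one such record per line.
line = json.dumps(sample)
record = json.loads(line)
print(record["target"])  # prints "B"
```

Compared with the old flat structure keyed by instance_id, nesting question-specific details under metadata keeps the top level aligned with what an Inspect-style loader expects from each sample.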

Reviewed changes

Copilot reviewed 14 out of 18 changed files in this pull request and generated 1 comment.

File Description
courseexam/__init__.py Package initialization exposing main API
courseexam/courseexam.py Main task definition with conditional solver for different question types
courseexam/dataset.py Dataset loading with filtering by exam, type, and tags, plus reference material injection
courseexam/metrics.py Custom metrics for points-based evaluation (accuracy, mean, totals)
courseexam/scorer.py Hybrid scorer using exact match for multiple choice and LLM judge for freeform
courseexam/prepare.py Data preparation script to convert markdown exams to JSONL format
run_eval.py Convenience script for running evaluations with configurable parameters
prepare_dataset.py Entry point script to prepare dataset from raw markdown files
pyproject.toml Project configuration with dependencies and build settings
tests/test_data_schema.py Updated schema validation tests for new data format
data/raw/example_course_2024_midterm/exam.md Example exam updated with new format (choices field, answer as letter index)
data/raw/example_course_2024_midterm/raft_basics.md New reference material file
data/raw/6_1810_operating_system_engineering_fall_2024_final/exam.md Real exam updated with new format
data/raw/6_1810_operating_system_engineering_fall_2024_final/mmap.md New reference material file
README.md Comprehensive documentation updates covering evaluation, data format, and usage
.github/workflows/test.yml CI updates for Python 3.10 and pyproject.toml-based installations
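The filtering that the summary attributes to courseexam/dataset.py (by exam, type, and tags) could look roughly like the sketch below. The function name filter_samples and the metadata keys are illustrative assumptions; the PR's actual API may differ.

```python
# Hedged sketch of metadata-based sample filtering, as described for
# courseexam/dataset.py. All names here are assumptions, not the PR's API.

def filter_samples(samples, exam=None, question_type=None, tags=None):
    """Keep samples whose metadata matches every provided criterion."""
    result = []
    for s in samples:
        meta = s["metadata"]
        if exam is not None and meta.get("exam") != exam:
            continue
        if question_type is not None and meta.get("question_type") != question_type:
            continue
        # Require all requested tags to be present on the sample.
        if tags is not None and not set(tags) <= set(meta.get("tags", [])):
            continue
        result.append(s)
    return result

samples = [
    {"input": "Q1", "target": "A",
     "metadata": {"exam": "midterm", "question_type": "ExactMatch", "tags": ["fs"]}},
    {"input": "Q2", "target": "Raft uses leader election",
     "metadata": {"exam": "final", "question_type": "Freeform", "tags": ["raft"]}},
]

kept = filter_samples(samples, question_type="Freeform")  # keeps only Q2
```

Filtering on metadata rather than on top-level fields is what makes the nested format convenient here: new criteria can be added without changing the sample schema.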
Comments suppressed due to low confidence (8)

benchmarks/courseexam_bench/data/raw/6_1810_operating_system_engineering_fall_2024_final/exam.md:134

  • The "choices" field contains ["A", "B", "C", "D"] instead of the actual option text. For consistency with the documented format (see example_course_2024_midterm/exam.md), this should contain the actual text of the choices, e.g. ["an error because 'b' does not exist", "an error because the symlink was already visited", "an error because 'b' points to itself", "nothing because xv6 will panic"].

benchmarks/courseexam_bench/data/raw/6_1810_operating_system_engineering_fall_2024_final/exam.md:661

  • The "choices" field contains ["A", "B", "C", "D"] instead of the actual option text. For consistency with the documented format, this should contain the actual text of the four choices presented in the question.

benchmarks/courseexam_bench/data/raw/6_1810_operating_system_engineering_fall_2024_final/exam.md:166

  • The "choices" field contains ["A", "B", "C", "D", "E"] instead of the actual option text. For consistency with the documented format, this should contain the actual text of the five choices presented in the question.

benchmarks/courseexam_bench/data/raw/6_1810_operating_system_engineering_fall_2024_final/exam.md:288

  • The "choices" field contains ["A", "B", "C", "D"] instead of the actual option text. For consistency with the documented format, this should contain the actual text of the four choices presented in the question.

benchmarks/courseexam_bench/data/raw/6_1810_operating_system_engineering_fall_2024_final/exam.md:368

  • The "choices" field contains ["A", "B", "C", "D", "E"] instead of the actual option text. For consistency with the documented format, this should contain the actual text of the five choices presented in the question.

benchmarks/courseexam_bench/data/raw/6_1810_operating_system_engineering_fall_2024_final/exam.md:465

  • The "choices" field contains ["A", "B", "C", "D", "E"] instead of the actual option text. For consistency with the documented format, this should contain the actual text of the five choices presented in the question.

benchmarks/courseexam_bench/data/raw/6_1810_operating_system_engineering_fall_2024_final/exam.md:46

  • The "choices" field contains ["A", "B", "C", "D"], but these are just the option letters, not the actual choice content. According to the format used elsewhere, such as Question 1 of the example exam (lines 31-32), the choices should contain the actual text options, e.g. ["Running", "Ready", "Blocked", "Terminated"]. The current format is inconsistent with the documented structure.

benchmarks/courseexam_bench/data/raw/6_1810_operating_system_engineering_fall_2024_final/exam.md:96

  • The same issue with "choices" exists here: it contains ["A", "B", "C", "D"] instead of the actual option text. For consistency with the documented format, this should be ["an inode number", "a block number", "file data", "a bitmap"].
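Since the same "choices" defect recurs across eight questions, a heuristic guard in the schema tests could catch it mechanically. The function below is a hypothetical addition, not part of the PR; its name and placement are assumptions.

```python
import string

def choices_look_like_letters(choices):
    """Heuristic guard: flag a "choices" list that holds only bare
    option letters ("A", "B", ...) instead of the actual option text."""
    return all(
        len(c) == 1 and c in string.ascii_uppercase
        for c in choices
    )

# The defective records in the review would trip the guard:
assert choices_look_like_letters(["A", "B", "C", "D"])
# Properly formatted choices, as documented, would pass:
assert not choices_look_like_letters(
    ["an inode number", "a block number", "file data", "a bitmap"]
)
```

A check like this in tests/test_data_schema.py would fail fast on any future exam conversion that emits letters in place of option text.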


