ai-assessment

Here are 4 public repositories matching this topic...

An evaluation of Claude 4 Sonnet's ability to assess mathematical answers, based on 500 original problems, that reveals JSON-induced errors and systematic patterns in LLM evaluation tasks. The study reports 100% accuracy when judging incorrect answers but only 84.3% when judging correct ones, and attributes the gap to premature decision-making within the JSON output structure.

  • Updated Jul 7, 2025
  • HTML
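
The two figures in the description are per-class accuracies: one for answers that were actually incorrect, one for answers that were actually correct. The sketch below shows one way such a split could be computed, together with two JSON layouts that differ only in whether the verdict comes before or after the reasoning. It is a minimal illustration, not the repository's code: the field names `is_correct` and `reasoning`, the helper names, and the toy records are all assumptions made for this example.

```python
import json
from collections import defaultdict

# Hypothetical response layouts (field names are assumptions, not the repo's schema).
# In the first, the verdict is emitted before any reasoning has been written.
VERDICT_FIRST = '{"is_correct": false, "reasoning": "..."}'
REASONING_FIRST = '{"reasoning": "...", "is_correct": false}'


def parse_verdict(raw):
    """Extract the boolean verdict from a JSON response string."""
    return bool(json.loads(raw)["is_correct"])


def accuracy_by_class(records):
    """Split grading accuracy by ground truth.

    Each record is a (truth, verdict) pair of booleans: whether the answer
    was actually correct, and whether the grader judged it correct.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for truth, verdict in records:
        totals[truth] += 1
        if verdict == truth:
            hits[truth] += 1
    return {truth: hits[truth] / totals[truth] for truth in totals}


# Toy data only, invented for illustration.
toy_records = [(True, True), (True, False), (False, False), (False, False)]
per_class = accuracy_by_class(toy_records)
```

Once parsed, both layouts yield the same data, so any effect of field order would arise while the model generates the response, not when the response is read back.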
