ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions

ToolHaystack is the first test suite for long-term tool use, featuring composable and noisy agent contexts. It provides three main scenarios that rigorously evaluate the capabilities TALMs need in long-term interactions.

[NOTE] The evaluation scripts and the code for the dataset generation pipeline will be updated soon. Stay tuned!

🚀 Overview

Recent Tool-Augmented Language Model (TALM) benchmarks have driven the rapid growth of language model agents, but they are limited to single-turn or short multi-turn scenarios. To address this, we propose the ToolHaystack benchmark, which evaluates TALMs in long contexts where multiple tasks are interleaved and contextual noise significantly distracts the model.

Dataset: ToolHaystack

The ToolHaystack dataset is available on 🤗 Hugging Face (https://huggingface.co/datasets/LangAGI-Lab/ToolHaystack).
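
As a minimal sketch of how the dataset might be pulled locally, the snippet below wraps the standard `load_dataset` call from the Hugging Face `datasets` library. Only the repository ID comes from the link above; the helper names `dataset_url` and `load_toolhaystack` are hypothetical, and the available configuration and split names are not documented here, so both are left as optional arguments to be filled in after checking the dataset page.

```python
from typing import Optional

# Repository ID taken from the dataset link in this README.
DATASET_ID = "LangAGI-Lab/ToolHaystack"


def dataset_url(dataset_id: str = DATASET_ID) -> str:
    """Return the Hugging Face Hub page for the dataset."""
    return f"https://huggingface.co/datasets/{dataset_id}"


def load_toolhaystack(config: Optional[str] = None, split: Optional[str] = None):
    """Load ToolHaystack from the Hugging Face Hub (hypothetical helper).

    `config` and `split` are assumptions: this README does not list the
    configuration or split names, so pass them only once you have checked
    the dataset page. Requires `pip install datasets` and network access.
    """
    # Imported lazily so the module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, name=config, split=split)
```

Calling `load_toolhaystack()` with no arguments would typically return a mapping from split names to datasets. If the repository defines several configurations, the library will ask for one to be named explicitly.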

Scenarios

ToolHaystack provides three main scenarios (Context Recall, Information Shift, and Missing Context) that require robust long-term context understanding from TALMs.
