
Difference in behavior between public_eval and master branch #87

@smirnp

Description


Hi, thanks for your great work!

I'm running some benchmarks and noticed that on the master branch the system sometimes enters an almost infinite continue_chaining loop, performing "search_in_memory" tool calls with slightly different inputs derived from the originally requested user content (it appends a user_message with type=continue_chaining after each iteration of the loop).

In the "public_evaluation" branch, the "search_in_memory" tool is usually called only once, and it is followed by a user_message of type=heartbeat (I guess that is what leads to the final send_message tool call).

Is there any way to affect the behavior of the master branch, so that it finishes the benchmarks with a limited number of tool calls? The code in public_evaluation seems to be quite obsolete.

Thank you!

UPD: the problem was spotted with the GPT-5-nano and GPT-5-mini models.
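As a workaround idea, the loop described above could be capped on the caller's side. The sketch below is purely illustrative and does not use the repository's actual API: `ToyAgent`, `StepResult`, `run_turn`, and the message shapes are all hypothetical stand-ins for the real agent loop. It shows the general pattern: bound the number of chained tool calls per turn and re-inject a heartbeat-style message instead of continue_chaining, so the model is nudged toward a final send_message.

```python
from dataclasses import dataclass

# Hypothetical sketch, NOT the repository's actual API: a hard cap on
# chained tool calls so a benchmark turn cannot loop indefinitely.

@dataclass
class StepResult:
    tool_name: str
    content: str

class ToyAgent:
    """Stand-in for the real agent: it keeps calling search_in_memory
    until it sees a heartbeat, mimicking the continue_chaining loop."""
    def step(self, message):
        if message.get("type") == "heartbeat":
            # A heartbeat nudges the agent to answer instead of chaining.
            return StepResult("send_message", "final answer")
        return StepResult("search_in_memory", "partial result")

def run_turn(agent, user_message, max_tool_calls=5):
    """Drive the agent loop, but stop chaining after max_tool_calls."""
    message = user_message
    calls = 0
    for _ in range(max_tool_calls):
        result = agent.step(message)
        calls += 1
        if result.tool_name == "send_message":
            return result, calls
        # Replace endless continue_chaining with a heartbeat nudge.
        message = {"type": "heartbeat"}
    # Budget exhausted: force a final answer instead of looping further.
    return agent.step({"type": "heartbeat"}), calls

result, calls = run_turn(ToyAgent(), {"type": "user_message", "content": "query"})
print(result.tool_name, calls)  # send_message 2
```

With the toy agent, the first step triggers search_in_memory, the injected heartbeat then produces send_message, so the turn ends after two tool calls instead of looping.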
