fix: commit user turn with STT and realtime #4663
base: main
Conversation
davidzhao left a comment:
Is this true when manual turn mode is used?
I see that if it's not server VAD, we have to manually create the commits: https://platform.openai.com/docs/api-reference/realtime-client-events/input_audio_buffer/commit
I think I found the issue: when used with STT, it actually committed the transcripts to the model and therefore triggered a response. Without STT, it does not respond, which is the expected behavior.
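Per the OpenAI Realtime API reference linked above, when server VAD is disabled the client must end the turn itself: `input_audio_buffer.commit` only creates the user input item, and a separate `response.create` event is what actually triggers model inference. A minimal sketch of the two paths (the helper function is illustrative, not part of any SDK):

```python
# With server-side VAD disabled on the OpenAI Realtime API, committing the
# audio buffer only creates a user input item; it does NOT start a response.
# An explicit response.create event must follow if a reply is wanted.
# (Event names per the API reference above; manual_turn_events is a
# hypothetical helper, not an SDK function.)

def manual_turn_events(want_reply: bool) -> list[dict]:
    """Client events sent to end a user turn in manual mode."""
    events = [{"type": "input_audio_buffer.commit"}]
    if want_reply:
        # Only this event actually triggers model inference.
        events.append({"type": "response.create"})
    return events

# Commit audio without requesting a reply (what manual turn mode wants):
commit_only = manual_turn_events(want_reply=False)
# Commit and explicitly ask the model to answer:
commit_and_reply = manual_turn_events(want_reply=True)
```

This matches the bug description: the commit itself should be response-free, and any response should require a deliberate second step.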
Got it, so the bug is: when STT is used with manual turn detection mode and a realtime model, committing the turn should not trigger a response.
Yep. But it is also somewhat ambiguous what users want if they have an STT configured: do they want it only for the transcripts, or as actual model text input?
In our use case we want STT with manual turn taking both as a way to know when it's safe to commit audio before a user turn has ended (since a user turn can be very long, we cannot wait until the end of the turn to commit), and for its improved post-call transcripts versus OpenAI's server-side transcription. So ideally we'd have STT transcripts available in the local chat context (so that if there is a connection error we can restore the conversation on a fresh session), but the remote OpenAI realtime chat context need not even be aware of them, as it can function perfectly well without a transcript of user utterances. STT also provides a neat signal to the user that the system is working during a long user turn, by letting them view their own transcription as they talk.
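The split described above could be sketched roughly like this: STT transcripts land in a local chat context (for session restore and live captions), while the remote realtime session only ever sees audio commits. All class and method names here are hypothetical illustrations, not livekit-agents APIs:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of keeping STT transcripts local while the remote
# realtime session receives only raw audio commits. None of these names
# come from livekit-agents; they only illustrate the proposed separation.

@dataclass
class LocalChatContext:
    """Holds transcripts for restore-on-reconnect and live captions."""
    messages: list = field(default_factory=list)

    def add_user_transcript(self, text: str) -> None:
        self.messages.append({"role": "user", "content": text})

@dataclass
class RemoteRealtimeSession:
    """Remote side: aware of audio commits, never of transcripts."""
    committed_audio_chunks: int = 0
    items: list = field(default_factory=list)  # transcripts never land here

    def commit_audio(self) -> None:
        self.committed_audio_chunks += 1

local = LocalChatContext()
remote = RemoteRealtimeSession()

# During a long user turn: caption locally, commit audio remotely.
local.add_user_transcript("so what I was thinking is...")
remote.commit_audio()
```

The point of the design is that the remote model can function without user transcripts, so they never need to be forwarded as model text input.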
Force-pushed from 6099024 to a3a206e
Turns out using STT with a realtime model will trigger a response when committing a turn. I didn't notice the interruption when testing, so I wrongly assumed it was working fine.
Example to reproduce:
The model will respond after 20 seconds
cc @bml1g12
Summary by CodeRabbit
Removed the `commit_user_turn` method from the Realtime Session interface. This method was previously unused and non-functional across implementations and has been eliminated to simplify the API surface and reduce unnecessary complexity. Applications that reference this method require updating for compatibility.