Skip to content

Conversation

@shellmayr
Copy link
Member

@shellmayr shellmayr commented Nov 7, 2025

  • Realized in this discussion that we don't handle the case of the single large message well at the moment
  • Truncate single messages to size limit minus overhead.
    • If the message is not the last message, this will still result in the message being truncated.
    • If it is the last message, it will be retained, but truncated

Contributes to TET-1208

@shellmayr shellmayr marked this pull request as ready for review November 7, 2025 08:40
@shellmayr shellmayr requested a review from a team as a code owner November 7, 2025 08:40
@shellmayr shellmayr requested review from a team and ArthurKnaus November 7, 2025 08:40
)

available_content_bytes = max_bytes - overhead_size - 20
message["content"] = content[:available_content_bytes] + "..."
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: In-place Mutation of Message Content Causes Side Effects

The function mutates the original message dict instead of creating a copy before modification. This causes unintended side effects where the caller's input data is modified. Similar to normalize_message_roles, this function should create a copy of the message before modifying its content field.

Fix in Cursor Fix in Web

)

available_content_bytes = max_bytes - overhead_size - 20
message["content"] = content[:available_content_bytes] + "..."
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: UTF-8 Truncation Mismatch Causes Encoding Errors

The truncation uses content[:available_content_bytes] which slices by character index rather than byte length. For multi-byte UTF-8 characters like emoji or non-ASCII text, this produces incorrect results since available_content_bytes is a byte count but string slicing operates on character positions. This can truncate too many or too few bytes, and may slice mid-character causing encoding errors.

Fix in Cursor Fix in Web

@linear
Copy link

linear bot commented Nov 7, 2025

Copy link
Contributor

@alexander-alderman-webb alexander-alderman-webb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good in terms of the general approach.

The message mutation also to be addressed; we should not edit the return value of another library and return the edited value to the user.

return 0


def truncate_messages_by_size(messages, max_bytes=MAX_GEN_AI_MESSAGE_BYTES):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we instead check if the returned array has one element, and if so, truncate the single message?

Let me know if I am missing some edge case that wouldn't be covered that way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants