Skip to content

GH-49614: [C++] Fix silent truncation in base64_decode on invalid input#49660

Open
Reranko05 wants to merge 3 commits intoapache:mainfrom
Reranko05:fix-base64-invalid-input
Open

GH-49614: [C++] Fix silent truncation in base64_decode on invalid input#49660
Reranko05 wants to merge 3 commits intoapache:mainfrom
Reranko05:fix-base64-invalid-input

Conversation

@Reranko05
Copy link
Copy Markdown

@Reranko05 Reranko05 commented Apr 4, 2026

Rationale for this change

arrow::util::base64_decode previously allowed invalid input to be processed, which could result in silently truncated or incorrect output without signaling an error. This can lead to unintended data corruption.

What changes are included in this PR?

  • Change base64_decode to return arrow::Result<std::string> instead of std::string
  • Add validation for:
    • invalid input length
    • invalid base64 characters
    • incorrect padding
  • Return an error (Status::Invalid) for invalid input instead of producing partial output
  • Update all call sites to handle Result<std::string>
  • Add unit tests covering valid and invalid inputs

Are these changes tested?

Yes. Unit tests have been added to verify:

  • valid decoding behavior
  • invalid input length
  • invalid characters
  • incorrect padding handling

Are there any user-facing changes?

  • The API now returns arrow::Result<std::string> instead of std::string
  • Invalid base64 input now results in an error (Status::Invalid) instead of returning partial or incorrect output

This PR contains a "Critical Fix".

This change fixes a correctness issue where invalid base64 input could result in silently truncated output, leading to incorrect data being produced. The fix ensures such inputs are detected and reported properly.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 4, 2026

⚠️ GitHub issue #49614 has been automatically assigned in GitHub to PR creator.

@Reranko05 Reranko05 force-pushed the fix-base64-invalid-input branch from 4670ec5 to 5c7db64 Compare April 4, 2026 20:38
@Reranko05
Copy link
Copy Markdown
Author

Thanks for the feedback. I’ve updated the implementation and tests.

  • Added stricter validation (length, padding placement/count, allowed characters)
  • Removed early termination in the decode loop to avoid silent truncation
  • Expanded test coverage to include invalid inputs and edge cases

All tests pass locally. Please let me know if any further adjustments are needed.

@Reranko05 Reranko05 force-pushed the fix-base64-invalid-input branch from 5c7db64 to 8f053b7 Compare April 4, 2026 21:07
Copy link
Copy Markdown
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use arrow::Result<std::string> return type instead of using ARROW_LOG()?

@kou
Copy link
Copy Markdown
Member

kou commented Apr 5, 2026

FYI: You can run CI on your fork by enabling GitHub Actions on your fork.

@Reranko05
Copy link
Copy Markdown
Author

Could you use arrow::Result<std::string> return type instead of using ARROW_LOG()?

Thanks for the suggestion @kou !

Just to clarify, would you prefer changing the existing base64_decode API to return arrow::Resultstd::string, or introducing a separate checked variant while keeping the current API unchanged?

I want to make sure the approach aligns with existing usage and expectations.

@kou
Copy link
Copy Markdown
Member

kou commented Apr 5, 2026

"changing the existing base64_decode API to return arrow::Resultstd::string".
But I want to know how many changes are required for existing code that use base64_decode().

@Reranko05
Copy link
Copy Markdown
Author

@kou I checked the current usages of base64_decode(), and it appears to be used in a very limited number of places (primarily in tests and one internal call site in flight_test.cc).

Updating to arrow::Result<std::string> would require adjusting those call sites to use ARROW_ASSIGN_OR_RAISE, but the impact seems quite localized and manageable.

I can proceed with the API change and update the affected call sites accordingly.

@Reranko05
Copy link
Copy Markdown
Author

Hi @kou, just following up on this.

I can proceed with updating base64_decode() to return arrow::Result<std::string> and adjust the affected call sites accordingly. Please let me know if this approach looks good, or if you'd prefer any alternative.

Happy to proceed based on your guidance.

@kou
Copy link
Copy Markdown
Member

kou commented Apr 8, 2026

Oh, sorry. I forgot to reply this...

Yes. Let's proceed with arrow::Result<std::string>.

@Reranko05 Reranko05 force-pushed the fix-base64-invalid-input branch from 8f053b7 to 34a388c Compare April 8, 2026 09:53
@Reranko05
Copy link
Copy Markdown
Author

Hi @kou, thanks for confirming!

I’ve updated base64_decode() to return arrow::Result<std::string> and added validation for invalid inputs (length, padding, and non-base64 characters). I also updated the tests accordingly.

All tests are passing locally. Please let me know if you’d like any changes or adjustments.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a correctness issue in Arrow’s C++ base64 decoder by ensuring malformed base64 input is detected instead of producing silently truncated/partial output.

Changes:

  • Adds pre-validation for base64 input (length, padding placement, invalid characters) in base64_decode.
  • Changes base64_decode API to return arrow::Result<std::string> with Status::Invalid on malformed input.
  • Adds unit tests covering valid/invalid decoding cases (and adjusts a couple of ToChars assertions).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

File Description
cpp/src/arrow/vendored/base64.cpp Adds base64 input validation and switches decode to return errors instead of partial output.
cpp/src/arrow/util/base64.h Updates public API signature of base64_decode to return Result<std::string>.
cpp/src/arrow/util/string_test.cc Adds tests for base64 decode validity/error cases and tweaks ToChars expectations.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@Reranko05 Reranko05 force-pushed the fix-base64-invalid-input branch from 34a388c to ed84348 Compare April 8, 2026 17:09
@Reranko05
Copy link
Copy Markdown
Author

Reranko05 commented Apr 8, 2026

Hi @kou, I have addressed all review comments:

  • Removed redundant helpers and reused existing base64_chars
  • Merged validation into decoding loop (single-pass)
  • Fixed padding validation to ensure '=' only appears at the end
  • Cleaned up includes and namespace usage
  • Updated tests to improve coverage and follow Arrow conventions

All tests pass locally.

@kou kou requested a review from Copilot April 8, 2026 21:14
@kou
Copy link
Copy Markdown
Member

kou commented Apr 8, 2026

Could you enable GitHub Actions on your fork to run CI on your fork too?

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@Reranko05 Reranko05 force-pushed the fix-base64-invalid-input branch from 0446f41 to e35212c Compare April 12, 2026 04:20
@Reranko05 Reranko05 requested a review from kou April 12, 2026 05:51
@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Apr 12, 2026
@kou
Copy link
Copy Markdown
Member

kou commented Apr 12, 2026

Could you push apache-arrow-* tags to your fork to fix the CI failure?

e.g.:

git remote add https://github.com/apache/arrow.git
git fetch --all --tags --prune --force
git push --tags origin

@Reranko05 Reranko05 force-pushed the fix-base64-invalid-input branch 3 times, most recently from e658d8f to 22e6c2c Compare April 12, 2026 12:01
@Reranko05
Copy link
Copy Markdown
Author

Hi @kou, I have addressed all review comments. All checks are passing now

@Reranko05 Reranko05 requested a review from kou April 12, 2026 12:51
@kou kou requested a review from Copilot April 12, 2026 20:38
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@Reranko05 Reranko05 force-pushed the fix-base64-invalid-input branch from 22e6c2c to 98e24de Compare April 13, 2026 01:34
@Reranko05
Copy link
Copy Markdown
Author

Hi @kou, I have addressed the remaining feedback by using PARQUET_ASSIGN_OR_THROW and updating error handling to return deterministic, case-specific messages. All CI checks are passing.

@Reranko05 Reranko05 requested a review from kou April 13, 2026 03:05
@kou kou requested a review from Copilot April 13, 2026 04:04
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@dmitry-chirkov-dremio Do you want to review this before we merge this?

@dmitry-chirkov-dremio
Copy link
Copy Markdown
Contributor

dmitry-chirkov-dremio commented Apr 13, 2026

@dmitry-chirkov-dremio Do you want to review this before we merge this?

Yes, give me 24h.

Copy link
Copy Markdown
Contributor

@dmitry-chirkov-dremio dmitry-chirkov-dremio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me however won't approve until conversations are resolved (including Copilot's comments). Leaving couple of comments of my own as well.

Good job, @Reranko05 - almost there.
cc @kou

@Reranko05
Copy link
Copy Markdown
Author

Good job, @Reranko05 - almost there. cc @kou

Thank you @dmitry-chirkov-dremio, I've addressed all review comments and resolved outstanding threads.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants