Add PostgreSQL-Specific Tokenizer for Query Digest Generation #5254
…ting, and dialect-specific comment rules

This change introduces PostgreSQL-aware tokenization by adding support for dollar-quoted strings, PostgreSQL's double-quoted identifiers, and its comment rules. The tokenizer now correctly parses `$$…$$` and `$tag$…$tag$`, treats `"` as an identifier delimiter in PostgreSQL, disables MySQL-only `#` comments, and accepts `--` as a comment starter without requiring a trailing space. All new behavior is fully isolated behind the dialect flag to avoid impacting MySQL parsing.

Add PostgreSQL dollar-quoted strings:
- New parser state: `st_dollar_quote_string`.
- Recognizes `$$ … $$` and `$tag$ … $tag$` sequences.
- Tracks the opening tag and searches for the matching terminator.
- Normalizes the entire literal to `?`.
- Integrated into `get_next_st()` and `stage_1_parsing()`.
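The dollar-quote matching rule described above (track the opening tag, search for the identical closing tag) can be sketched as a standalone helper. This is a simplified illustration, not the PR's actual `st_dollar_quote_string` state machine; `dollar_quote_len` is a hypothetical name.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Returns the length of the full dollar-quoted literal starting at q[pos]
// (e.g. "$$abc$$" or "$tag$abc$tag$"), or 0 if q[pos] does not start one.
static size_t dollar_quote_len(const std::string& q, size_t pos) {
    if (pos >= q.size() || q[pos] != '$') return 0;
    // Scan the opening tag: '$', optional identifier characters, then '$'
    size_t tag_end = pos + 1;
    while (tag_end < q.size() &&
           (std::isalnum((unsigned char)q[tag_end]) || q[tag_end] == '_'))
        tag_end++;
    if (tag_end >= q.size() || q[tag_end] != '$') return 0;
    std::string tag = q.substr(pos, tag_end - pos + 1); // "$$" or "$tag$"
    // The literal ends only at an identical closing tag
    size_t close = q.find(tag, tag_end + 1);
    if (close == std::string::npos) return 0; // unterminated
    return close + tag.size() - pos;
}
```

In the real tokenizer the matched span would then be replaced by a single `?` placeholder, per the normalization rule above.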
- Added `process_pg_typecast()` to handle PostgreSQL type cast syntax (::)
- Recognizes type casts in various contexts: 'value'::type, column::type, etc.
- Added `process_array_literal()` for PostgreSQL array processing
- Handles both ARRAY[] constructor and {} literal syntax
- Processes multi-dimensional arrays and nested array structures
- Added `process_literal_prefix_type()` for PostgreSQL prefixed literals
- Processes E'' escape string constants with backslash escapes
- Handles U&'' Unicode string literals with optional UESCAPE clauses
- Supports x'' hex string literals and b'' bit string literals
- Manages B'' bit strings and bytea literals (\\xDEADBEEF format)
- Added `process_replace_boolean()` for boolean literal replacement
- Replaces TRUE and FALSE literals with parameter placeholders
- Maintains case-insensitive matching (true, True, TRUE, etc.)
- Preserves boolean context in expressions and WHERE clauses
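To illustrate the intended output shape of the normalization steps above (quoted literals and booleans become `?`, while a `::type` suffix survives so the cast remains visible in the digest), here is a toy normalizer. `toy_normalize` is hypothetical and far simpler than the PR's `process_pg_typecast()` and `process_replace_boolean()`.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Toy digest normalizer: replaces single-quoted literals and bare
// TRUE/FALSE (case-insensitive) with '?', copying everything else through.
static std::string toy_normalize(const std::string& q) {
    std::string out;
    for (size_t i = 0; i < q.size(); ) {
        if (q[i] == '\'') {                        // quoted literal -> ?
            size_t end = q.find('\'', i + 1);
            if (end == std::string::npos) end = q.size() - 1;
            out += '?';
            i = end + 1;
        } else if (std::isalpha((unsigned char)q[i])) { // word: boolean check
            size_t end = i;
            while (end < q.size() && std::isalnum((unsigned char)q[end])) end++;
            std::string w = q.substr(i, end - i);
            for (char& c : w) c = (char)std::tolower((unsigned char)c);
            if (w == "true" || w == "false") out += '?';
            else out += q.substr(i, end - i);
            i = end;
        } else {
            out += q[i++];
        }
    }
    return out;
}
```

For example, `SELECT '5'::int WHERE ok = TRUE` normalizes to `SELECT ?::int WHERE ok = ?`, so semantically equivalent queries with different literal values map to the same digest.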
Summary of Changes

This pull request addresses a long-standing issue where PostgreSQL queries were incorrectly normalized due to being processed by a tokenizer designed for MySQL. By introducing a dedicated tokenizer tailored to PostgreSQL's unique syntax, the system can now generate stable and accurate query digests for PostgreSQL, leading to more consistent query grouping and improved monitoring capabilities. This change significantly enhances the system's compatibility and reliability when working with PostgreSQL databases.
Code Review
This pull request introduces a dedicated tokenizer for PostgreSQL to generate more accurate and stable query digests, which is a significant improvement over the previous MySQL-centric logic. The changes are well-structured, with a clear separation between MySQL and PostgreSQL implementations. The use of if constexpr for tokenizer dispatch is a clean and modern approach. The refactoring in PgSQL_Monitor.cpp to build connection strings more robustly is also a welcome enhancement. The new test suite for the PostgreSQL tokenizer is comprehensive, though I've found a couple of minor issues in the new test file that I've commented on. Overall, this is a high-quality contribution that greatly improves PostgreSQL support.
Code Review
This pull request introduces a dedicated tokenizer for PostgreSQL, which is a significant improvement over the previous MySQL-centric logic. The changes are well-structured, with a clear separation between MySQL and PostgreSQL tokenization, enhancing maintainability. The use of if constexpr for dispatching to the correct tokenizer is a great example of modern C++ usage. The accompanying test suite for the new PostgreSQL tokenizer is extensive and covers a wide range of PostgreSQL-specific syntax, which gives high confidence in the correctness of the implementation. Additionally, the changes in PgSQL_Monitor.cpp improve security by properly escaping parameters in the connection string. Overall, this is a high-quality contribution.
* Added crash payload testing
* Fixed handling of unterminated comments
* Removed `is_cmd` (`/*!`) handling
* Proper handling of "keep comments" and first-comment extraction
* Proper handling of nested comments
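The PostgreSQL comment rules referenced above can be sketched as follows. This is an assumed illustration, not the PR's actual code; `skip_pg_comment` is a hypothetical helper. Unlike MySQL, `--` needs no trailing space, `#` is not a comment starter, and `/* */` block comments nest.

```cpp
#include <cstddef>
#include <string>

// Skip a PostgreSQL comment starting at q[pos].
// Returns the index just past the comment, or pos if there is no comment.
static size_t skip_pg_comment(const std::string& q, size_t pos) {
    if (q.compare(pos, 2, "--") == 0) {
        // Line comment: runs to end of line, no trailing space required
        size_t nl = q.find('\n', pos);
        return nl == std::string::npos ? q.size() : nl + 1;
    }
    if (q.compare(pos, 2, "/*") == 0) {
        // Block comment: /* */ pairs nest in PostgreSQL
        int depth = 1;
        size_t i = pos + 2;
        while (i + 1 < q.size() && depth > 0) {
            if (q.compare(i, 2, "/*") == 0)      { depth++; i += 2; }
            else if (q.compare(i, 2, "*/") == 0) { depth--; i += 2; }
            else i++;
        }
        return depth == 0 ? i : q.size(); // unterminated: consume the rest
    }
    return pos; // note: '#' deliberately not treated as a comment
}
```

The nesting depth counter is what distinguishes this from the MySQL rule, where the first `*/` always closes the comment.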
Signed-off-by: René Cannaò <[email protected]>
Summary
This PR introduces a PostgreSQL-aware tokenizer to generate stable and accurate query digests. Previously, PostgreSQL queries were tokenized using MySQL-centric logic, which led to incorrect normalization for many PostgreSQL-specific syntax constructs. The new implementation adds dedicated parsing logic for PostgreSQL while keeping MySQL behavior fully isolated.
Motivation
PostgreSQL syntax differs significantly from MySQL in areas such as type casting, array literals, string literal formats, identifier quoting, and comment handling. Using a MySQL tokenizer for PostgreSQL queries resulted in unstable digests and incorrect query grouping.
This change ensures PostgreSQL queries are tokenized according to PostgreSQL grammar rules, producing consistent digests for semantically equivalent queries.
Key Changes
PostgreSQL-Specific Tokenization
- Type casts (`::`): `process_pg_typecast()` to correctly recognize and normalize PostgreSQL type cast syntax.
- Array literals: `process_array_literal()` to handle `ARRAY[...]` constructors and `{...}` literal syntax.
- Prefixed and binary literals: `process_literal_prefix_type()` to support escape strings (`E'...'`), Unicode strings (`U&'...'` with optional `UESCAPE`), hex strings (`x'...'`), bit strings (`b'...'`, `B'...'`), and `bytea` hex format (`'\x...'`, `E'\\x...'`).
- Boolean normalization: `process_replace_boolean()` to replace `TRUE`/`FALSE` (case-insensitive) with parameter placeholders.
- Dollar-quoted strings: `$$...$$` and `$tag$...$tag$` literals.
- Identifier quoting: double-quoted identifiers (`"identifier"`).
- Comment handling: `--` comments (no trailing space required) and `/* ... */` block comments.

Closes #5253
Closes #5231