@rahim-kanji
Collaborator

Summary

This PR introduces a PostgreSQL-aware tokenizer to generate stable and accurate query digests. Previously, PostgreSQL queries were tokenized using MySQL-centric logic, which led to incorrect normalization for many PostgreSQL-specific syntax constructs. The new implementation adds dedicated parsing logic for PostgreSQL while keeping MySQL behavior fully isolated.

Motivation

PostgreSQL syntax differs significantly from MySQL in areas such as type casting, array literals, string literal formats, identifier quoting, and comment handling. Using a MySQL tokenizer for PostgreSQL queries resulted in unstable digests and incorrect query grouping.

This change ensures PostgreSQL queries are tokenized according to PostgreSQL grammar rules, producing consistent digests for semantically equivalent queries.

Key Changes

PostgreSQL-Specific Tokenization

  • Type casts (::)

    • Added process_pg_typecast() to correctly recognize and normalize PostgreSQL type cast syntax.
  • Array literals

    • Added process_array_literal() to handle:
      • ARRAY[...] constructors
      • {...} literal syntax
      • Nested and multi-dimensional arrays
  • Prefixed and binary literals

    • Added process_literal_prefix_type() to support:
      • Escape string constants (E'...')
      • Unicode strings (U&'...' with optional UESCAPE)
      • SQL-standard hex literals (x'...')
      • Bit string literals (b'...', B'...')
      • PostgreSQL bytea hex format ('\x...', E'\\x...')
  • Boolean normalization

    • Added process_replace_boolean() to replace TRUE / FALSE (case-insensitive) with parameter placeholders.
  • Dollar-quoted strings

    • Added full support for $$...$$ and $tag$...$tag$ literals.
    • Entire literal content is normalized to a placeholder.
  • Identifier quoting

    • Correct handling of PostgreSQL double-quoted identifiers ("identifier").
    • Avoids MySQL-specific backtick (`) assumptions.
  • Comment handling

    • Added support for PostgreSQL comment rules:
      • -- comments (no trailing space required)
      • /* ... */ block comments
      • Nested block comments
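The nested block comment rule can be sketched with a depth counter. This is a minimal illustration, not the code from this PR: the function name `skip_block_comment` is hypothetical, and treating an unterminated comment as running to the end of the input is an assumption of the sketch.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns the index just past the block comment that
// starts at `pos`, honoring PostgreSQL-style nesting. If no comment starts
// at `pos`, returns `pos` unchanged.
std::size_t skip_block_comment(std::string_view sql, std::size_t pos) {
    if (pos + 1 >= sql.size() || sql[pos] != '/' || sql[pos + 1] != '*') {
        return pos;
    }
    std::size_t depth = 0;
    std::size_t i = pos;
    while (i + 1 < sql.size()) {
        if (sql[i] == '/' && sql[i + 1] == '*') {
            ++depth;              // every "/*" opens one more level
            i += 2;
        } else if (sql[i] == '*' && sql[i + 1] == '/') {
            --depth;              // every "*/" closes one level
            i += 2;
            if (depth == 0) return i;
        } else {
            ++i;
        }
    }
    // Assumption of this sketch: an unterminated comment consumes the rest.
    return sql.size();
}
```

For `/* a /* b */ c */ SELECT 1` the helper returns 17, the index right after the outer `*/`; a tokenizer without nesting support would stop at the first `*/` and leak ` c */` into the digest.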

Closes #5253
Closes #5231

…ting, and dialect-specific comment rules

This change introduces PostgreSQL-aware tokenization by adding support for dollar-quoted strings, PostgreSQL’s double-quoted identifiers, and its comment rules. The tokenizer now correctly parses $$…$$ and $tag$…$tag$, treats " as an identifier delimiter in PostgreSQL, disables MySQL-only # comments, and accepts -- as a comment starter without requiring a trailing space. All new behavior is fully isolated behind the dialect flag to avoid impacting MySQL parsing.
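A rough sketch of the dialect-dependent comment rules described above. The `Dialect` enum and `starts_line_comment` are hypothetical names chosen for illustration; the actual dialect flag in the tokenizer may look different.

```cpp
#include <cstddef>
#include <string_view>

enum class Dialect { MySQL, PostgreSQL };  // hypothetical stand-in for the dialect flag

// Reports whether a single-line comment starts at `pos` under the given dialect.
bool starts_line_comment(std::string_view sql, std::size_t pos, Dialect dialect) {
    if (pos >= sql.size()) return false;

    // '#' opens a comment only in MySQL; in PostgreSQL it is an operator.
    if (sql[pos] == '#') return dialect == Dialect::MySQL;

    const bool double_dash =
        sql[pos] == '-' && pos + 1 < sql.size() && sql[pos + 1] == '-';
    if (!double_dash) return false;

    // PostgreSQL: "--" always starts a comment, no trailing space needed.
    if (dialect == Dialect::PostgreSQL) return true;

    // MySQL: the second dash must be followed by whitespace or a control
    // character. Treating end of input as "not a comment" is a choice made
    // by this sketch only.
    if (pos + 2 >= sql.size()) return false;
    return static_cast<unsigned char>(sql[pos + 2]) <= ' ';
}
```

With this rule, `SELECT 1 --x` contains a comment for PostgreSQL but not for MySQL, which is the kind of difference that made shared tokenization produce unstable digests.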

Add PostgreSQL dollar-quoted strings
* New parser state: st_dollar_quote_string.
* Recognizes $$ … $$ and $tag$ … $tag$ sequences.
* Tracks opening tag and searches for matching terminator.
* Normalizes entire literal to ?.
* Integrated into get_next_st() and stage_1_parsing().
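The tag tracking and terminator search listed above might look roughly like this. It is a standalone sketch, not the `st_dollar_quote_string` state itself: `skip_dollar_quoted` and `replace_dollar_quoted` are hypothetical names, only ASCII tags are handled, other token types (quoted strings, identifiers containing `$`) are ignored, and an unterminated literal is assumed to run to the end of the input.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

// If a dollar-quoted literal starts at `pos`, returns the index just past its
// closing delimiter; otherwise returns `pos` (for example for "$1").
std::size_t skip_dollar_quoted(std::string_view sql, std::size_t pos) {
    if (pos >= sql.size() || sql[pos] != '$') return pos;

    // Scan the optional tag between the two '$' of the opening delimiter.
    std::size_t i = pos + 1;
    while (i < sql.size() &&
           (std::isalnum(static_cast<unsigned char>(sql[i])) || sql[i] == '_')) {
        ++i;
    }
    if (i >= sql.size() || sql[i] != '$') return pos;  // no closing '$' for the tag
    // A tag follows identifier rules, so it cannot start with a digit.
    if (i > pos + 1 && std::isdigit(static_cast<unsigned char>(sql[pos + 1]))) {
        return pos;
    }

    // The terminator is the exact same delimiter, tag included.
    const std::string_view delim = sql.substr(pos, i - pos + 1);
    const std::size_t end = sql.find(delim, i + 1);
    if (end == std::string_view::npos) return sql.size();  // sketch's choice
    return end + delim.size();
}

// Replaces every dollar-quoted literal with a single '?' placeholder.
std::string replace_dollar_quoted(std::string_view sql) {
    std::string out;
    std::size_t i = 0;
    while (i < sql.size()) {
        const std::size_t end = skip_dollar_quoted(sql, i);
        if (end != i) {
            out += '?';
            i = end;
        } else {
            out += sql[i++];
        }
    }
    return out;
}
```

Because the terminator must repeat the opening tag, `$fn$ a $$ b $fn$` is one literal: the inner `$$` does not close it.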

- Added `process_pg_typecast()` to handle PostgreSQL type cast syntax (::)
  - Recognizes type casts in various contexts: 'value'::type, column::type, etc.
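To illustrate what recognizing a cast involves, here is a small sketch that locates the `::` operator and the type name after it. It is not `process_pg_typecast()` itself, whose normalization rules are not spelled out in this thread; `scan_typecast` is a hypothetical helper, and multi-word types such as `double precision` are deliberately left out.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// If "::" starts at `pos`, returns the index just past the type name that
// follows it (including an optional "(...)" modifier and "[]" suffixes).
// Otherwise returns `pos`.
std::size_t scan_typecast(std::string_view sql, std::size_t pos) {
    if (pos + 1 >= sql.size() || sql[pos] != ':' || sql[pos + 1] != ':') {
        return pos;
    }
    std::size_t i = pos + 2;
    while (i < sql.size() && sql[i] == ' ') ++i;  // spaces may follow "::"

    // Type name, possibly schema-qualified (ASCII only in this sketch).
    const std::size_t name_start = i;
    while (i < sql.size() &&
           (std::isalnum(static_cast<unsigned char>(sql[i])) ||
            sql[i] == '_' || sql[i] == '.')) {
        ++i;
    }
    if (i == name_start) return pos;  // "::" not followed by a type name

    // Optional type modifier such as "(10,2)".
    if (i < sql.size() && sql[i] == '(') {
        const std::size_t close = sql.find(')', i);
        if (close != std::string_view::npos) i = close + 1;
    }
    // Optional array suffixes such as "[]" or "[][]".
    while (i + 1 < sql.size() && sql[i] == '[' && sql[i + 1] == ']') i += 2;
    return i;
}
```

For `'42'::int4 + 1` the cast spans indices 4 to 10, so a tokenizer can treat `::int4` as one unit instead of splitting on the colons.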

- Added `process_array_literal()` for PostgreSQL array processing
  - Handles both ARRAY[] constructor and {} literal syntax
  - Processes multi-dimensional arrays and nested array structures
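Nested `ARRAY[...]` constructors come down to bracket matching. The sketch below shows one way to find the end of a constructor; `skip_array_constructor` is a hypothetical helper rather than the PR's `process_array_literal()`, it expects `ARRAY[` with no space before the bracket, and it leaves word-boundary checks to the caller.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Case-insensitive match of an upper-case keyword at `pos`.
static bool has_keyword_at(std::string_view sql, std::size_t pos,
                           std::string_view upper_kw) {
    if (pos + upper_kw.size() > sql.size()) return false;
    for (std::size_t k = 0; k < upper_kw.size(); ++k) {
        if (std::toupper(static_cast<unsigned char>(sql[pos + k])) != upper_kw[k]) {
            return false;
        }
    }
    return true;
}

// If an ARRAY[...] constructor starts at `pos`, returns the index just past
// its closing bracket; returns `pos` if there is none or it is unbalanced.
std::size_t skip_array_constructor(std::string_view sql, std::size_t pos) {
    if (!has_keyword_at(sql, pos, "ARRAY[")) return pos;

    std::size_t depth = 0;
    bool in_string = false;
    for (std::size_t i = pos + 5; i < sql.size(); ++i) {  // pos + 5 is the '['
        const char c = sql[i];
        if (in_string) {
            // A doubled quote ('') closes and immediately reopens the string.
            if (c == '\'') in_string = false;
            continue;
        }
        if (c == '\'') {
            in_string = true;
        } else if (c == '[') {
            ++depth;
        } else if (c == ']' && --depth == 0) {
            return i + 1;
        }
    }
    return pos;  // unbalanced brackets: leave the input untouched
}
```

Brackets inside quoted elements are ignored, so `array['a]b', 'c']` is still matched as a single constructor.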

- Added `process_literal_prefix_type()` for PostgreSQL prefixed literals
  - Processes E'' escape string constants with backslash escapes
  - Handles U&'' Unicode string literals with optional UESCAPE clauses
  - Supports x'' hex string literals and b'' bit string literals
  - Manages B'' bit strings and bytea literals (\\xDEADBEEF format)
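A sketch of how the prefixes above could be told apart before the literal body is consumed. `LiteralPrefix`, `classify_literal_prefix`, and `skip_escape_string_body` are hypothetical names for illustration; the real `process_literal_prefix_type()` may be organized differently, and the caller is assumed to have checked that the prefix is not the tail of a longer identifier.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

enum class LiteralPrefix { None, Escape, Unicode, Hex, Bit };

// Classifies the literal prefix at `pos`: E'..', U&'..', x'..' or b'..'
// (prefix letters are case-insensitive).
LiteralPrefix classify_literal_prefix(std::string_view sql, std::size_t pos) {
    if (pos + 1 >= sql.size()) return LiteralPrefix::None;
    const char c = static_cast<char>(
        std::toupper(static_cast<unsigned char>(sql[pos])));
    if (sql[pos + 1] == '\'') {
        switch (c) {
            case 'E': return LiteralPrefix::Escape;
            case 'X': return LiteralPrefix::Hex;
            case 'B': return LiteralPrefix::Bit;
            default:  return LiteralPrefix::None;
        }
    }
    if (c == 'U' && pos + 2 < sql.size() &&
        sql[pos + 1] == '&' && sql[pos + 2] == '\'') {
        return LiteralPrefix::Unicode;
    }
    return LiteralPrefix::None;
}

// Given the index of the opening quote of an E'...' string, returns the index
// just past the closing quote, honoring backslash escapes and doubled quotes.
// An unterminated string is assumed to run to the end of the input.
std::size_t skip_escape_string_body(std::string_view sql, std::size_t quote_pos) {
    std::size_t i = quote_pos + 1;
    while (i < sql.size()) {
        if (sql[i] == '\\') {
            i += 2;                       // skip the escaped character
        } else if (sql[i] == '\'') {
            if (i + 1 < sql.size() && sql[i + 1] == '\'') {
                i += 2;                   // '' is an embedded quote
            } else {
                return i + 1;             // closing quote
            }
        } else {
            ++i;
        }
    }
    return sql.size();
}
```

The escape-string case matters for digests because `E'it\'s'` is one literal, whereas a scanner that ignores backslashes would end it at the escaped quote.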

- Added `process_replace_boolean()` for boolean literal replacement
  - Replaces TRUE and FALSE literals with parameter placeholders
  - Maintains case-insensitive matching (true, True, TRUE, etc.)
  - Preserves boolean context in expressions and WHERE clauses
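The case-insensitive, whole-word matching described above can be sketched as follows. `replace_booleans` is a hypothetical stand-in for `process_replace_boolean()`, and for brevity it does not skip string literals or quoted identifiers, which a real tokenizer has to do.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

static bool is_word_char(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Compares `word` against an upper-case keyword, ignoring case.
static bool equals_ignore_case(std::string_view word, std::string_view upper_kw) {
    if (word.size() != upper_kw.size()) return false;
    for (std::size_t k = 0; k < word.size(); ++k) {
        if (std::toupper(static_cast<unsigned char>(word[k])) != upper_kw[k]) {
            return false;
        }
    }
    return true;
}

// Replaces standalone TRUE / FALSE words with '?' and copies everything else.
std::string replace_booleans(std::string_view sql) {
    std::string out;
    std::size_t i = 0;
    while (i < sql.size()) {
        if (!is_word_char(sql[i])) {
            out += sql[i++];
            continue;
        }
        // Collect the whole word so that "is_true" or "trueval" never match.
        const std::size_t start = i;
        while (i < sql.size() && is_word_char(sql[i])) ++i;
        const std::string_view word = sql.substr(start, i - start);
        if (equals_ignore_case(word, "TRUE") || equals_ignore_case(word, "FALSE")) {
            out += '?';
        } else {
            out.append(word);
        }
    }
    return out;
}
```

With this, `WHERE a = TRUE AND b = false` and `WHERE a = false AND b = True` both normalize to `WHERE a = ? AND b = ?` and land in the same digest.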
@gemini-code-assist

Summary of Changes

Hello @rahim-kanji, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a long-standing issue where PostgreSQL queries were incorrectly normalized due to being processed by a tokenizer designed for MySQL. By introducing a dedicated tokenizer tailored to PostgreSQL's unique syntax, the system can now generate stable and accurate query digests for PostgreSQL, leading to more consistent query grouping and improved monitoring capabilities. This change significantly enhances the system's compatibility and reliability when working with PostgreSQL databases.

Highlights

  • Dedicated PostgreSQL Tokenizer: Introduced a new, PostgreSQL-aware tokenizer to accurately parse and normalize PostgreSQL-specific SQL syntax, replacing the previous MySQL-centric logic.
  • Enhanced PostgreSQL Syntax Support: Added dedicated parsing logic for PostgreSQL features including type casts (::), array literals (ARRAY[...] and {...}), prefixed/binary literals (E'...', U&'...', x'...', b'...'), dollar-quoted strings ($$...$$), and correct handling of double-quoted identifiers.
  • Improved Comment Handling: Implemented support for PostgreSQL comment rules, including both -- style (without requiring a trailing space) and nested /* ... */ block comments.
  • Isolated MySQL Behavior: Ensured that the new PostgreSQL tokenization logic is fully isolated, preserving the existing and correct behavior for MySQL queries.
  • Comprehensive Testing: A new test suite has been added specifically for the PostgreSQL tokenizer, covering a wide range of syntax, edge cases, and digest compression scenarios to ensure accuracy and stability.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a dedicated tokenizer for PostgreSQL to generate more accurate and stable query digests, which is a significant improvement over the previous MySQL-centric logic. The changes are well-structured, with a clear separation between MySQL and PostgreSQL implementations. The use of if constexpr for tokenizer dispatch is a clean and modern approach. The refactoring in PgSQL_Monitor.cpp to build connection strings more robustly is also a welcome enhancement. The new test suite for the PostgreSQL tokenizer is comprehensive, though I've found a couple of minor issues in the new test file that I've commented on. Overall, this is a high-quality contribution that greatly improves PostgreSQL support.
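For readers unfamiliar with the `if constexpr` dispatch mentioned in this review, the pattern looks roughly like the sketch below. Every name in it (`MySQLDialect`, `PgSQLDialect`, `mysql_tokenize`, `pgsql_tokenize`, `compute_digest_text`) is hypothetical and the tokenizers are stubs; only the dispatch mechanism is the point. It requires C++17.

```cpp
#include <string>
#include <string_view>
#include <type_traits>

// Hypothetical tag types selecting the SQL dialect at compile time.
struct MySQLDialect {};
struct PgSQLDialect {};

// Stub tokenizers standing in for the real dialect-specific implementations.
std::string mysql_tokenize(std::string_view q) { return "mysql:" + std::string(q); }
std::string pgsql_tokenize(std::string_view q) { return "pgsql:" + std::string(q); }

// Only the branch matching `Dialect` is instantiated, so the MySQL path
// contains no PostgreSQL code and vice versa.
template <typename Dialect>
std::string compute_digest_text(std::string_view q) {
    if constexpr (std::is_same_v<Dialect, PgSQLDialect>) {
        return pgsql_tokenize(q);
    } else {
        static_assert(std::is_same_v<Dialect, MySQLDialect>,
                      "unsupported dialect");
        return mysql_tokenize(q);
    }
}
```

Calling `compute_digest_text<PgSQLDialect>("SELECT 1")` resolves to the PostgreSQL tokenizer at compile time, with no runtime branch on the hot path.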


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a dedicated tokenizer for PostgreSQL, which is a significant improvement over the previous MySQL-centric logic. The changes are well-structured, with a clear separation between MySQL and PostgreSQL tokenization, enhancing maintainability. The use of if constexpr for dispatching to the correct tokenizer is a great example of modern C++ usage. The accompanying test suite for the new PostgreSQL tokenizer is extensive and covers a wide range of PostgreSQL-specific syntax, which gives high confidence in the correctness of the implementation. Additionally, the changes in PgSQL_Monitor.cpp improve security by properly escaping parameters in the connection string. Overall, this is a high-quality contribution.
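As background on the connection-string escaping mentioned here: in libpq keyword/value connection strings, a value may be wrapped in single quotes, and single quotes and backslashes inside it must be escaped with a backslash. The sketch below applies that rule; `quote_conninfo_value` and `build_conninfo` are hypothetical helpers and are not taken from PgSQL_Monitor.cpp.

```cpp
#include <string>
#include <string_view>

// Wraps a value in single quotes and backslash-escapes embedded single
// quotes and backslashes, following the libpq keyword/value quoting rule.
std::string quote_conninfo_value(std::string_view value) {
    std::string out = "'";
    for (const char c : value) {
        if (c == '\'' || c == '\\') out += '\\';
        out += c;
    }
    out += '\'';
    return out;
}

// Builds a minimal conninfo string; the parameter set is illustrative only.
std::string build_conninfo(std::string_view host, std::string_view user,
                           std::string_view password) {
    std::string s = "host=";
    s += quote_conninfo_value(host);
    s += " user=";
    s += quote_conninfo_value(user);
    s += " password=";
    s += quote_conninfo_value(password);
    return s;
}
```

Without the quoting, a password such as `p@ss word` would be split at the space and the remainder parsed as a separate, attacker-influenced keyword.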

@rahim-kanji rahim-kanji force-pushed the v3.0_pgsql-query-digest-gen-5253 branch from a49da0e to fd53642 Compare December 16, 2025 19:56
* Added crash payload testing
* Fixed unterminated comments handling
* Removed is_cmd (/*!) handling
* Proper handling of Keep Comments and First comment extraction
* Proper handling for nested comments
@rahim-kanji rahim-kanji force-pushed the v3.0_pgsql-query-digest-gen-5253 branch from 239223b to 5e75264 Compare December 18, 2025 21:24
@rahim-kanji rahim-kanji force-pushed the v3.0_pgsql-query-digest-gen-5253 branch from 25b51ba to febb650 Compare December 19, 2025 06:47
@rahim-kanji rahim-kanji marked this pull request as ready for review December 19, 2025 07:58
@renecannao renecannao merged commit efd87ae into v3.0 Dec 19, 2025
1 check passed
@sonarqubecloud

Quality Gate failed

Failed conditions
1 Security Hotspot
E Reliability Rating on New Code (required ≥ A)


3 participants