Add PostgreSQL-Specific Tokenizer for Query Digest Generation #5254
…ting, and dialect-specific comment rules

This change introduces PostgreSQL-aware tokenization by adding support for dollar-quoted strings, PostgreSQL's double-quoted identifiers, and its comment rules. The tokenizer now correctly parses `$$…$$` and `$tag$…$tag$`, treats `"` as an identifier delimiter in PostgreSQL, disables MySQL-only `#` comments, and accepts `--` as a comment starter without requiring a trailing space. All new behavior is fully isolated behind the dialect flag to avoid impacting MySQL parsing.

Add PostgreSQL dollar-quoted strings:
- New parser state: `st_dollar_quote_string`.
- Recognizes `$$ … $$` and `$tag$ … $tag$` sequences.
- Tracks the opening tag and searches for the matching terminator.
- Normalizes the entire literal to `?`.
- Integrated into `get_next_st()` and `stage_1_parsing()`.
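The dollar-quote matching rule described above (track the opening tag, search for the identical closing tag) can be sketched as a standalone helper. This is a simplified illustration, not the PR's actual `st_dollar_quote_string` state machine; `dollar_quote_len` is a hypothetical name.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Returns the length of the full dollar-quoted literal starting at q[pos]
// (e.g. "$$abc$$" or "$tag$abc$tag$"), or 0 if q[pos] does not start one.
static size_t dollar_quote_len(const std::string& q, size_t pos) {
    if (pos >= q.size() || q[pos] != '$') return 0;
    // Scan the opening tag: '$', optional identifier characters, then '$'
    size_t tag_end = pos + 1;
    while (tag_end < q.size() &&
           (std::isalnum((unsigned char)q[tag_end]) || q[tag_end] == '_'))
        tag_end++;
    if (tag_end >= q.size() || q[tag_end] != '$') return 0;
    std::string tag = q.substr(pos, tag_end - pos + 1); // "$$" or "$tag$"
    // The literal ends only at an identical closing tag
    size_t close = q.find(tag, tag_end + 1);
    if (close == std::string::npos) return 0; // unterminated
    return close + tag.size() - pos;
}
```

In the real tokenizer the matched span would then be replaced by a single `?` placeholder, per the normalization rule above.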
- Added `process_pg_typecast()` to handle PostgreSQL type cast syntax (::)
- Recognizes type casts in various contexts: 'value'::type, column::type, etc.
- Added `process_array_literal()` for PostgreSQL array processing
- Handles both ARRAY[] constructor and {} literal syntax
- Processes multi-dimensional arrays and nested array structures
- Added `process_literal_prefix_type()` for PostgreSQL prefixed literals
- Processes E'' escape string constants with backslash escapes
- Handles U&'' Unicode string literals with optional UESCAPE clauses
- Supports x'' hex string literals and b'' bit string literals
- Manages B'' bit strings and bytea literals (\\xDEADBEEF format)
- Added `process_replace_boolean()` for boolean literal replacement
- Replaces TRUE and FALSE literals with parameter placeholders
- Maintains case-insensitive matching (true, True, TRUE, etc.)
- Preserves boolean context in expressions and WHERE clauses
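To illustrate the intended output shape of the normalization steps above (quoted literals and booleans become `?`, while a `::type` suffix survives so the cast remains visible in the digest), here is a toy normalizer. `toy_normalize` is hypothetical and far simpler than the PR's `process_pg_typecast()` and `process_replace_boolean()`.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Toy digest normalizer: replaces single-quoted literals and bare
// TRUE/FALSE (case-insensitive) with '?', copying everything else through.
static std::string toy_normalize(const std::string& q) {
    std::string out;
    for (size_t i = 0; i < q.size(); ) {
        if (q[i] == '\'') {                        // quoted literal -> ?
            size_t end = q.find('\'', i + 1);
            if (end == std::string::npos) end = q.size() - 1;
            out += '?';
            i = end + 1;
        } else if (std::isalpha((unsigned char)q[i])) { // word: boolean check
            size_t end = i;
            while (end < q.size() && std::isalnum((unsigned char)q[end])) end++;
            std::string w = q.substr(i, end - i);
            for (char& c : w) c = (char)std::tolower((unsigned char)c);
            if (w == "true" || w == "false") out += '?';
            else out += q.substr(i, end - i);
            i = end;
        } else {
            out += q[i++];
        }
    }
    return out;
}
```

For example, `SELECT '5'::int WHERE ok = TRUE` normalizes to `SELECT ?::int WHERE ok = ?`, so semantically equivalent queries with different literal values map to the same digest.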
Summary of Changes

This pull request addresses a long-standing issue where PostgreSQL queries were incorrectly normalized due to being processed by a tokenizer designed for MySQL. By introducing a dedicated tokenizer tailored to PostgreSQL's unique syntax, the system can now generate stable and accurate query digests for PostgreSQL, leading to more consistent query grouping and improved monitoring capabilities. This change significantly enhances the system's compatibility and reliability when working with PostgreSQL databases.
Code Review
This pull request introduces a dedicated tokenizer for PostgreSQL to generate more accurate and stable query digests, which is a significant improvement over the previous MySQL-centric logic. The changes are well-structured, with a clear separation between MySQL and PostgreSQL implementations. The use of if constexpr for tokenizer dispatch is a clean and modern approach. The refactoring in PgSQL_Monitor.cpp to build connection strings more robustly is also a welcome enhancement. The new test suite for the PostgreSQL tokenizer is comprehensive, though I've found a couple of minor issues in the new test file that I've commented on. Overall, this is a high-quality contribution that greatly improves PostgreSQL support.
Code Review
This pull request introduces a dedicated tokenizer for PostgreSQL, which is a significant improvement over the previous MySQL-centric logic. The changes are well-structured, with a clear separation between MySQL and PostgreSQL tokenization, enhancing maintainability. The use of if constexpr for dispatching to the correct tokenizer is a great example of modern C++ usage. The accompanying test suite for the new PostgreSQL tokenizer is extensive and covers a wide range of PostgreSQL-specific syntax, which gives high confidence in the correctness of the implementation. Additionally, the changes in PgSQL_Monitor.cpp improve security by properly escaping parameters in the connection string. Overall, this is a high-quality contribution.
* Added crash payload testing
* Fixed handling of unterminated comments
* Removed `is_cmd` (`/*!`) handling
* Proper handling of "keep comments" and first-comment extraction
* Proper handling of nested comments
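The PostgreSQL comment rules referenced above can be sketched as follows. This is an assumed illustration, not the PR's actual code; `skip_pg_comment` is a hypothetical helper. Unlike MySQL, `--` needs no trailing space, `#` is not a comment starter, and `/* */` block comments nest.

```cpp
#include <cstddef>
#include <string>

// Skip a PostgreSQL comment starting at q[pos].
// Returns the index just past the comment, or pos if there is no comment.
static size_t skip_pg_comment(const std::string& q, size_t pos) {
    if (q.compare(pos, 2, "--") == 0) {
        // Line comment: runs to end of line, no trailing space required
        size_t nl = q.find('\n', pos);
        return nl == std::string::npos ? q.size() : nl + 1;
    }
    if (q.compare(pos, 2, "/*") == 0) {
        // Block comment: /* */ pairs nest in PostgreSQL
        int depth = 1;
        size_t i = pos + 2;
        while (i + 1 < q.size() && depth > 0) {
            if (q.compare(i, 2, "/*") == 0)      { depth++; i += 2; }
            else if (q.compare(i, 2, "*/") == 0) { depth--; i += 2; }
            else i++;
        }
        return depth == 0 ? i : q.size(); // unterminated: consume the rest
    }
    return pos; // note: '#' deliberately not treated as a comment
}
```

The nesting depth counter is what distinguishes this from the MySQL rule, where the first `*/` always closes the comment.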
Signed-off-by: René Cannaò <[email protected]>
Summary
This PR introduces a PostgreSQL-aware tokenizer to generate stable and accurate query digests. Previously, PostgreSQL queries were tokenized using MySQL-centric logic, which led to incorrect normalization for many PostgreSQL-specific syntax constructs. The new implementation adds dedicated parsing logic for PostgreSQL while keeping MySQL behavior fully isolated.
Motivation
PostgreSQL syntax differs significantly from MySQL in areas such as type casting, array literals, string literal formats, identifier quoting, and comment handling. Using a MySQL tokenizer for PostgreSQL queries resulted in unstable digests and incorrect query grouping.
This change ensures PostgreSQL queries are tokenized according to PostgreSQL grammar rules, producing consistent digests for semantically equivalent queries.
Key Changes
PostgreSQL-Specific Tokenization
- Type casts (`::`): `process_pg_typecast()` to correctly recognize and normalize PostgreSQL type cast syntax.
- Array literals: `process_array_literal()` to handle `ARRAY[...]` constructors and `{...}` literal syntax.
- Prefixed and binary literals: `process_literal_prefix_type()` to support escape strings (`E'...'`), Unicode strings (`U&'...'` with optional `UESCAPE`), hex strings (`x'...'`), bit strings (`b'...'`, `B'...'`), and `bytea` hex format (`'\x...'`, `E'\\x...'`).
- Boolean normalization: `process_replace_boolean()` to replace `TRUE`/`FALSE` (case-insensitive) with parameter placeholders.
- Dollar-quoted strings: `$$...$$` and `$tag$...$tag$` literals.
- Identifier quoting: double-quoted identifiers (`"identifier"`).
- Comment handling: `--` comments (no trailing space required) and `/* ... */` block comments.

Closes #5253
Closes #5231