@Kmeakin Kmeakin commented Jan 18, 2026

Responding to reviews in #148438

@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Jan 18, 2026
Kmeakin and others added 10 commits January 18, 2026 02:32
The `ucd_parse` crate offers a function to get the Unicode version from
the readme, so we don't need to reimplement it ourselves.
Instead of `include_str!()`ing `range_search.rs`, just make it a normal
module under `core::unicode`. This means the same source code doesn't
have to be checked in twice, and it plays nicer with IDEs.

Also rename it to `rt` because it will also include runtime functions
for case foldings in the next commit.
Instead of writing the body of `to_lower()` and `to_upper()` in a string
literal and pasting it into the final `unicode_data.rs`, extract the common
logic into a helper function in the `rt` module and then call the helper
from the generated code.

Same motivation as previous commit (better IDE support, less duplicate
code checked into git).
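
As a rough illustration of the split described above, the sketch below keeps the lookup logic in an ordinary function and leaves only data plus a thin call for the generated side. All names here (`lookup_mapping`, `LOWERCASE_TABLE`, `UPPERCASE_TABLE`) and the one-to-one `char` mapping are assumptions made for the example; the actual helper in `rt` and the real table layout may differ.

```rust
/// Hypothetical shared helper, the kind of function that could live in `rt`.
/// Binary-searches a table sorted by its first element and returns the mapped
/// character, or the input unchanged when there is no entry.
fn lookup_mapping(c: char, table: &[(char, char)]) -> char {
    match table.binary_search_by_key(&c, |&(from, _)| from) {
        Ok(i) => table[i].1,
        Err(_) => c,
    }
}

// What the generated code could then reduce to: tables plus one-line wrappers.
// The tables are tiny hypothetical excerpts, sorted by the first element.
const LOWERCASE_TABLE: &[(char, char)] = &[('\u{c0}', '\u{e0}'), ('\u{100}', '\u{101}')];
const UPPERCASE_TABLE: &[(char, char)] = &[('\u{e0}', '\u{c0}'), ('\u{101}', '\u{100}')];

fn to_lower(c: char) -> char {
    lookup_mapping(c, LOWERCASE_TABLE)
}

fn to_upper(c: char) -> char {
    lookup_mapping(c, UPPERCASE_TABLE)
}
```

Because the helper is normal source code rather than text inside a string literal, an IDE can type-check and navigate it like any other function.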
Remove `#[rustfmt::skip]` from all the generated modules in
`unicode_data.rs`. This means we won't have to worry so much about
getting indentation and formatting right when generating code.

For now, the case folding tables are exempted, because they would be too
long when formatted by `rustfmt`.
This check became redundant (it is always true) once we removed
all ASCII characters from the tables
(a8c6694).
To make the final output code easier to see:
* Get rid of the unnecessary line-noise of `.unwrap()`ing calls to
  `write!()` by moving the `.unwrap()` into a macro.
* Join consecutive `write!()` calls using a single multiline format
  string.
* Replace `.push()` and `.push_str(format!())` with `write!()`.
* If after doing all of the above, there is only a single `write!()`
  call in the function, just construct the string directly with
  `format!()`.
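
The cleanup steps listed above can be sketched roughly as follows. The macro name `emit!` and the function `render_table` are hypothetical and chosen only for this example; the generator's actual macro and output format are not shown in this conversation.

```rust
use std::fmt::Write;

/// Hypothetical wrapper around `write!` that unwraps the result, so call
/// sites do not need a trailing `.unwrap()`. Writing to a `String` does not
/// fail, so the unwrap is not expected to panic here.
macro_rules! emit {
    ($dst:expr, $($arg:tt)*) => {
        write!($dst, $($arg)*).unwrap()
    };
}

/// Renders a hypothetical table declaration as source text.
fn render_table(name: &str, values: &[u32]) -> String {
    let mut out = String::new();
    // One `emit!` call with a format string instead of several `push_str` calls.
    emit!(out, "pub static {name}: [u32; {len}] = [\n", len = values.len());
    for v in values {
        emit!(out, "    {v},\n");
    }
    emit!(out, "];\n");
    out
}
```

For instance, `render_table("T", &[1, 2])` yields a three-element-line declaration of `T` with one value per line.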
In the `case_mapping` tables, print the data in hexadecimal.
This makes the relationship between each character and the character it is
mapped to more obvious.

Do the same for `cascading_map`: there we inspect the high byte of the
input character, so it makes more sense to compare against a hexadecimal
literal.
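
A small sketch of why hexadecimal output helps, using hypothetical helper names (`format_entry`, `high_byte_literal`) that are not taken from the generator itself:

```rust
/// Formats one (from, to) mapping entry in hexadecimal.
/// `{:#06x}` pads to four hex digits after the `0x` prefix.
fn format_entry(from: char, to: char) -> String {
    format!("({:#06x}, {:#06x})", from as u32, to as u32)
}

/// Formats the bits above the low byte of a code point as a hex literal,
/// the kind of value a match on the high byte would compare against.
fn high_byte_literal(c: char) -> String {
    format!("{:#04x}", (c as u32) >> 8)
}
```

With these, the pair U+0100 and U+0101 prints as `(0x0100, 0x0101)` rather than `(256, 257)`, and the high byte of U+1E00 prints as `0x1e`, which can be read straight off the code point.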
The preferred way to run `assert`s at compile time is to put them in an
unnamed constant (`const _: () = { ... };`). This avoids the asserts
being evaluated on every call by tools like Miri.
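
A minimal sketch of the pattern, with a hypothetical `TABLE` and a hypothetical `is_sorted` check standing in for whatever invariants the generated code actually asserts:

```rust
/// Hypothetical lookup table that is expected to be strictly ascending.
const TABLE: [u32; 4] = [0x00c0, 0x0100, 0x0400, 0x1e00];

/// Returns true if the slice is strictly ascending.
/// Uses `while` because `for` loops are not allowed in `const fn`.
const fn is_sorted(table: &[u32]) -> bool {
    let mut i = 1;
    while i < table.len() {
        if table[i - 1] >= table[i] {
            return false;
        }
        i += 1;
    }
    true
}

// Evaluated once during compilation: a failing assert is a build error,
// and nothing is re-checked when the lookup functions run.
const _: () = {
    assert!(is_sorted(&TABLE));
    assert!(TABLE.len() <= u16::MAX as usize);
};
```

The unnamed constant has no runtime cost, in contrast to an `assert!` placed inside the function body that uses the table.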
Instead of generating a standalone executable to test `unicode_data`,
generate normal tests in `coretests`. This ensures tests are always
generated, and will be run as part of the normal testsuite.

Also change the generated tests to loop over lookup tables, rather than
generating a separate `assert_eq!()` statement for every codepoint. The
old approach produced a massive file (more than 20,000 lines) which took
minutes to compile!
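
The table-driven style described above might look roughly like this. The table `TO_UPPER_CASES` and the function `check_to_upper` are hypothetical; in a real test suite the function would carry `#[test]` and the table would be generated from the Unicode data files.

```rust
/// Hypothetical excerpt of a generated table of (input, expected uppercase) pairs.
const TO_UPPER_CASES: &[(char, char)] = &[
    ('\u{e0}', '\u{c0}'),
    ('\u{101}', '\u{100}'),
    ('\u{450}', '\u{400}'),
];

/// Checks every table entry in one loop instead of one `assert_eq!` per
/// code point, and returns how many entries were checked.
fn check_to_upper() -> usize {
    for &(input, expected) in TO_UPPER_CASES {
        let mut upper = input.to_uppercase();
        assert_eq!(upper.next(), Some(expected), "input U+{:04X}", input as u32);
        assert_eq!(upper.next(), None, "input U+{:04X}", input as u32);
    }
    TO_UPPER_CASES.len()
}
```

The compiler then sees one small loop plus a data table, instead of thousands of separately expanded assertion macros.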