paste: support multi-byte delimiters and GNU escape sequences#10840
paste: support multi-byte delimiters and GNU escape sequences#10840sylvestre merged 2 commits intouutils:mainfrom
Conversation
5865c10 to
321cb08
Compare
|
Seems like github is down? |
This comment was marked as outdated.
This comment was marked as outdated.
321cb08 to
b084aaa
Compare
|
I think you can re-run 1 job only instead of force-pushing, |
b084aaa to
1650f0e
Compare
| grep -rl '\$abs_path_dir_' tests/*/*.sh | xargs -r "${SED}" -i "s|\$abs_path_dir_|${UU_BUILD_DIR//\//\\/}|g" | ||
| # Some tests use $abs_top_builddir/src for shebangs: point them to the uutils build dir. | ||
| grep -rl '\$abs_top_builddir/src' tests/*/*.sh tests/*/*.pl | xargs -r "${SED}" -i "s|\$abs_top_builddir/src|${UU_BUILD_DIR//\//\\/}|g" | ||
| grep -rl '\$ENV{abs_top_builddir}/src' tests/*/*.pl | xargs -r "${SED}" -i "s|\$ENV{abs_top_builddir}/src|${UU_BUILD_DIR//\//\\/}|g" |
There was a problem hiding this comment.
Whoops that is unrelated but it changes two skips to passes, was working on that locally and it got added
There was a problem hiding this comment.
But does not mean that we are not using uutils binaries at here if we don't sed?
There was a problem hiding this comment.
According to the logs I think we deleted the gnu coreutils binaries from that env so it means that it just skips because its unable to find a binary.
There was a problem hiding this comment.
Can we simply symlink our bins to abs_top_builddir for all tests at once?
|
GNU testsuite comparison: |
1650f0e to
143fec3
Compare
|
GNU testsuite comparison: |
|
oh, nice |
| /// Returns 1 for empty, invalid, or incomplete sequences. | ||
| pub fn mb_char_len(bytes: &[u8]) -> usize { | ||
| if bytes.is_empty() { | ||
| return 1; |
There was a problem hiding this comment.
returning 1 when empty is confusing
can you please document why ?
src/uu/paste/src/paste.rs
Outdated
| translate!("paste-error-delimiter-unescaped-backslash", "delimiters" => delimiters), | ||
| )); | ||
| fn parse_delimiters(delimiters: &OsString) -> UResult<Box<[Box<[u8]>]>> { | ||
| let bytes = uucore::os_string_to_vec(delimiters.clone())?; |
There was a problem hiding this comment.
can we use OsStr::as_bytes() on Unix or borrowing instead of cloning the OsString ?
| _ => { | ||
| // Unknown escape: strip backslash, use the following character(s) | ||
| let len = mb_char_len(&bytes[i..]); | ||
| vec.push(Box::from(&bytes[i..i + len])); |
There was a problem hiding this comment.
Potential panic if mb_char_len returns non-zero but bytes[i..] is shorter than returned length, no?
|
GNU testsuite comparison: |
…#10840) * paste: support multi-byte delimiters and GNU escape sequences * paste: address review comments - avoid OsString clone and guard mb_char_len
This PR is fairly large in scope and has a few design decisions that probably warrant some additional discussion. I have been prototyping many approaches to how to deal with the issue of getting the multi-byte character sequence lengths and all of the solutions I could think of that would call the libc implementation ultimately would take a bunch of unsafe code and require FFI's and still create compatibility issues when it comes to different platforms not supporting that API.
To work backwards from what the goal was, I went through all of the glibc locales to see which ones actually had multi-byte decoding rules that were unique from UTF-8 and how many of those overlapped with one another. That led to to discover that currently there are only 5 different rule sets for this calculation and that you can determine which encoding to use from the env variables for which locale to use. Then it appears that some of the locales are hardcoded to specific encodings and the only two that I could find when searching all of the glibc locales was
"zh_CN" | "zh_SG"and"zh_TW" | "zh_HK"There were two new GNU tests that were related to this type of locale stuff: paste.pl and multi-byte.sh that this PR addresses.
This can help with the other issues in the queue:
#10184
#9712
#7600
#3075
#5831
#3997