require-ascii doesn’t do what it says on the tin #104

@Jayman2000

According to the README:

require-ascii

What it does

Requires that text files have ascii-encoding, including the
extended ascii set.
This is useful to detect files that have unicode characters.

require-ascii will fail on files that are encoded in extended ASCII if:

  1. the file uses bytes in the 128–255 range, and
  2. those bytes aren’t followed by other bytes that coincidentally make the sequence valid UTF-8 (see this table).
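
To illustrate the two conditions, here is a minimal sketch that works on in-memory bytes. It assumes that require-ascii's failure corresponds to the data not being valid UTF-8, which is my reading of the behaviour rather than something taken from its source. `is_valid_utf8` is a hypothetical helper, not part of require-ascii.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Condition 1: every single byte in the 128-255 range is valid cp437...
lone_high_bytes = [bytes([value]) for value in range(128, 256)]
for lone_byte in lone_high_bytes:
    lone_byte.decode("cp437")  # never raises: cp437 maps all 256 byte values

# ...but none of them is valid UTF-8 on its own.
invalid_alone = [b for b in lone_high_bytes if not is_valid_utf8(b)]
print(len(invalid_alone))

# Condition 2: some multi-byte sequences are valid in both encodings.
# 0xC3 0xA9 is two characters in cp437, but also the UTF-8 encoding of U+00E9.
coincidence = b"\xc3\xa9"
print(is_valid_utf8(coincidence), len(coincidence.decode("cp437")))
```

So a cp437 file only slips through when its high bytes happen to line up as well-formed UTF-8 sequences.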

This script will generate a bunch of files that contain valid extended ASCII but fail when tested by require-ascii:

# The README links to <https://theasciicode.com.ar/>. There are many different
# ways you could extend ASCII, but that site in particular says "In 1981,
# IBM developed an extension of 8-bit ASCII code, called 'code page 437'..."
extended_ascii = "cp437"

for code_point in range(128, 256):
	# Create a file that should pass require-ascii, but won't.
	with open(f"{code_point}.cp437.txt", mode='wb') as file:
		file.write(code_point.to_bytes(1, 'little'))
	# Make sure that that file really does contain valid extended ASCII.
	with open(f"{code_point}.cp437.txt", mode='rt', encoding=extended_ascii) as file:
		# This should cause a UnicodeDecodeError if file contains
		# invalid extended ASCII.
		file.read()

A more accurate description of require-ascii would be:

require-ascii

What it does

Requires that text files use UTF-8 and only use code points ≤ 255.
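
For comparison, a check matching that proposed description could be sketched like this. This is not require-ascii's actual code; the function name and structure are assumptions made purely to show what "UTF-8 with code points ≤ 255" means in practice.

```python
def uses_utf8_with_code_points_up_to_255(data: bytes) -> bool:
    """Hypothetical check: data is valid UTF-8 and every code point is <= 255."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # Not UTF-8 at all, e.g. a lone high byte from an extended ASCII file.
        return False
    return all(ord(character) <= 255 for character in text)


# Plain ASCII passes.
print(uses_utf8_with_code_points_up_to_255(b"hello"))
# U+00E9 encoded as UTF-8 passes, since 0xE9 <= 255.
print(uses_utf8_with_code_points_up_to_255("\u00e9".encode("utf-8")))
# The same character encoded as cp437 is a lone high byte and is rejected.
print(uses_utf8_with_code_points_up_to_255("\u00e9".encode("cp437")))
# U+20AC encoded as UTF-8 is rejected, since its code point is above 255.
print(uses_utf8_with_code_points_up_to_255("\u20ac".encode("utf-8")))
```

Under this reading, what matters is the decoded code point, not the raw byte value, which is why the same character passes in one encoding and is rejected in the other.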
