QWK.NET ‐ Encoding and CP437

Encoding and CP437

This page explains character encoding in QWK packets, focusing on CP437 (Code Page 437), the historical DOS character set used by BBS systems. Understanding encoding is crucial for correctly reading and preserving QWK packet content.

What Is CP437 and Why It Matters

CP437 (Code Page 437, also known as "DOS Latin US" or "OEM-US") is the original IBM PC character encoding from the DOS era. It was the standard encoding for IBM PC-compatible systems and most BBS software from the 1980s and 1990s.

Why CP437 Matters in QWK

QWK packets were created by DOS-based BBS systems that used CP437 as their native encoding. This means:

BBS names often contain box-drawing characters (borders, frames)
Message content may include accented characters, mathematical symbols, or Greek letters
Line terminators use byte 0xE3, which is the π (pi) character in CP437
Extended ASCII characters (0x80-0xFF) have specific meanings in CP437 that differ from other encodings

Using the wrong encoding when reading QWK packets can result in:

Box-drawing characters appearing as accented letters or symbols
Message line breaks not being recognised
Accented characters displaying incorrectly
Loss of visual formatting in BBS names and messages

CP437, Extended ASCII, and Unicode

CP437 Character Set

CP437 is a single-byte encoding with 256 characters:

0x00-0x7F: Standard ASCII characters (same as Unicode U+0000-U+007F)
0x80-0xFF: Extended characters, assigned differently in CP437 than in other code pages:
- Box-drawing characters (single and double lines, corners)
- Block elements (full blocks, shades)
- Accented characters (Western European)
- Mathematical symbols
- Greek letters (including π at 0xE3)
- Special symbols

Key CP437 Characters in QWK:

Byte	CP437 Character	Unicode	Common Usage
0xC4	`─`	U+2500	Horizontal line (box drawing)
0xB3	`│`	U+2502	Vertical line (box drawing)
0xDB	`█`	U+2588	Full block
0xE3	`π`	U+03C0	QWK line terminator
0xB0	`░`	U+2591	Light shade
0xB1	`▒`	U+2592	Medium shade
0xB2	`▓`	U+2593	Dark shade

Extended ASCII

"Extended ASCII" is a loose term for 8-bit encodings that keep ASCII in 0x00-0x7F and assign additional characters to bytes with the high bit set (0x80-0xFF). The meaning of those bytes depends on the encoding:

CP437: Byte 0xE3 = π (Greek pi)
Latin-1 (ISO-8859-1): Byte 0xE3 = ã (a-tilde, U+00E3)
Windows-1252: Byte 0xE3 = ã (a-tilde, U+00E3)

This is why encoding matters: the same byte value represents different characters in different encodings.
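
The comparison above can be checked with a short Python sketch. It relies only on Python's built-in codecs (cp437, latin-1, cp1252) and does not use the QWK.NET API; the variable names are illustrative.

```python
# Decode the same single byte under three single-byte encodings.
raw = bytes([0xE3])

as_cp437 = raw.decode("cp437")      # Greek small letter pi
as_latin1 = raw.decode("latin-1")   # a with tilde
as_cp1252 = raw.decode("cp1252")    # a with tilde

for name, ch in [("cp437", as_cp437), ("latin-1", as_latin1), ("cp1252", as_cp1252)]:
    print(f"{name:8} 0xE3 -> {ch!r} (U+{ord(ch):04X})")
```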

Unicode

Unicode is a universal character set that assigns a code point to every character; encodings such as UTF-8 and UTF-16 store those code points as bytes. Each CP437 character maps to a specific Unicode code point:

CP437 byte 0xE3 → Unicode U+03C0 (π)
CP437 byte 0xC4 → Unicode U+2500 (─)
CP437 byte 0xB3 → Unicode U+2502 (│)

When QWK.NET decodes CP437 bytes to strings, it converts them to their Unicode equivalents. When encoding strings back to bytes, it converts Unicode characters back to CP437 byte values.
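
As a rough illustration of this decode/encode symmetry, the sketch below checks the key bytes from the table earlier on this page using Python's built-in cp437 codec. It is not QWK.NET code; `EXPECTED` and `roundtrip_ok` are hypothetical names introduced for the example.

```python
# Key CP437 bytes and the Unicode characters they should decode to.
EXPECTED = {
    0xC4: "\u2500",  # box drawings light horizontal
    0xB3: "\u2502",  # box drawings light vertical
    0xDB: "\u2588",  # full block
    0xE3: "\u03c0",  # Greek small letter pi
    0xB0: "\u2591",  # light shade
    0xB1: "\u2592",  # medium shade
    0xB2: "\u2593",  # dark shade
}

def roundtrip_ok(byte_value: int, expected_char: str) -> bool:
    """Decode one byte, compare it, then encode it back to the same byte."""
    raw = bytes([byte_value])
    decoded = raw.decode("cp437")
    return decoded == expected_char and decoded.encode("cp437") == raw

results = {hex(b): roundtrip_ok(b, ch) for b, ch in EXPECTED.items()}
print(results)
```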

Line Endings: The 0xE3 Byte

QWK message bodies use byte 0xE3 as the line terminator rather than CR (\r) or LF (\n). This is a distinctive characteristic of the QWK format.

The 0xE3 Character Mapping Issue

Byte 0xE3 has different Unicode representations depending on encoding interpretation:

CP437: 0xE3 → π (U+03C0, Greek pi) - Historically correct for DOS BBS systems
Latin-1/Windows-1252: 0xE3 → ã (U+00E3, a-tilde) - Byte identity mapping

Both approaches produce byte 0xE3 in the file, which meets the QWK specification requirement. However, using the correct CP437 mapping (π) preserves the historical character representation.
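
A small Python sketch of this equivalence, using the built-in cp437 and latin-1 codecs rather than QWK.NET itself. It also shows that the two views cannot be mixed: CP437 has no ã and Latin-1 has no π. The helper `can_encode` is a hypothetical name.

```python
pi_form = "Hello\u03c0"      # terminator written as pi (CP437 view)
tilde_form = "Hello\u00e3"   # terminator written as a-tilde (Latin-1 view)

cp437_bytes = pi_form.encode("cp437")
latin1_bytes = tilde_form.encode("latin-1")

# Both strings end in the same terminator byte on disk.
same_bytes = cp437_bytes == latin1_bytes == b"Hello\xe3"

def can_encode(text: str, codec: str) -> bool:
    """Return True if every character in text has a byte in the codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# Crossing the string form with the other codec fails in both directions.
crossed = (can_encode(tilde_form, "cp437"), can_encode(pi_form, "latin-1"))
print(same_bytes, crossed)
```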

QWK.NET Line Ending Handling

QWK.NET supports both approaches for compatibility:

CP437 mode: Uses π (U+03C0) which encodes to 0xE3 in CP437
ASCII/Latin-1 mode: Uses the character U+00E3 (ã), whose Latin-1 byte value is also 0xE3 (byte identity)

The library automatically detects and handles 0xE3 terminators when parsing message bodies, converting them to standard line breaks for the Lines property whilst preserving the original bytes.

Line Ending Corruption Patterns

Common issues seen in real-world packets:

Mixed line endings: Some messages use 0xE3, others use CR/LF or LF
Missing terminators: Last line of message may lack 0xE3
Extra terminators: Multiple 0xE3 bytes in sequence
QWKE variation: QWKE extensions may use CR (\r) instead of 0xE3

QWK.NET handles these variations gracefully:

Detects QWKE-style CR terminators automatically
Preserves mixed line endings in raw text
Converts 0xE3 to platform-native line endings for Lines property
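
The following Python sketch shows one way a tolerant line splitter could cope with the variations listed above. It is an assumption-laden illustration, not QWK.NET's parser: `split_qwk_body` is a hypothetical function, and unlike the library it does not keep the original bytes.

```python
def split_qwk_body(raw: bytes) -> list:
    """Split a raw QWK message body into CP437-decoded lines.

    0xE3 is the primary terminator; CR LF, CR, and LF are also accepted
    so that mixed or QWKE-style bodies still split sensibly.
    """
    # Unify alternative terminators at the byte level (CR LF first).
    unified = (raw.replace(b"\r\n", b"\xe3")
                  .replace(b"\r", b"\xe3")
                  .replace(b"\n", b"\xe3"))
    # Drop trailing record padding (spaces or NULs). This also trims
    # trailing spaces from an unterminated last line.
    unified = unified.rstrip(b" \x00")
    parts = unified.split(b"\xe3")
    # A final terminator leaves one empty trailing element; remove it.
    if parts and parts[-1] == b"":
        parts.pop()
    return [part.decode("cp437") for part in parts]

print(split_qwk_body(b"one\xe3two\xe3   "))
```

Consecutive terminators are kept as blank lines in this sketch, and a missing final terminator still yields the last line.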

How QWK.NET Handles Encoding

Default Behaviour

QWK.NET uses CP437 encoding by default for all text fields:

CONTROL.DAT fields (BBS name, sysop name, etc.)
Message headers (From, To, Subject)
Message bodies
Optional files (WELCOME, NEWS, GOODBYE)
QWKE extension files (TOREADER.EXT, TODOOR.EXT)

Encoding During Parsing

When reading QWK packets:

Raw bytes are read from the archive without conversion
CP437 decoding converts bytes to Unicode strings for API access
Original bytes are preserved in raw properties for round-trip fidelity
Line terminators (0xE3) are detected and converted to line breaks

Example:

// Bytes: [0xC4, 0xC4, 0x20, 0x53, 0x74, 0x61, 0x72, 0x4C, 0x69, 0x6E, 0x6B, 0x20, 0xC4, 0xC4]
// CP437 decode: "── StarLink ──"
// Unicode: U+2500, U+2500, U+0020, U+0053, U+0074, U+0061, U+0072, U+004C, U+0069, U+006E, U+006B, U+0020, U+2500, U+2500
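
To make the "decoded string plus preserved raw bytes" idea concrete, here is a minimal Python sketch using a byte sequence similar to the example above, with trailing space padding added. `DecodedField` and `decode_field` are hypothetical names, not QWK.NET types.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DecodedField:
    raw: bytes   # bytes exactly as read from the packet
    text: str    # CP437-decoded view, trailing padding removed

def decode_field(raw: bytes) -> DecodedField:
    """Decode a space-padded field while keeping the original bytes."""
    return DecodedField(raw=raw, text=raw.decode("cp437").rstrip(" "))

name = decode_field(
    bytes([0xC4, 0xC4, 0x20]) + b"StarLink" + bytes([0x20, 0xC4, 0xC4]) + b"   "
)
code_points = [f"U+{ord(ch):04X}" for ch in name.text]
print(name.text, code_points[:3])
```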

Encoding During Preservation

When writing REP packets or generating QWK output:

Unicode strings are encoded back to CP437 bytes
Line terminators are converted to 0xE3 bytes
128-byte records are padded with spaces (0x20)
Round-trip fidelity ensures QWK → REP → QWK preserves all bytes
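
The writing steps above can be sketched in Python as follows. This is an illustrative assumption about how the steps combine, not the QWK.NET writer; `encode_qwk_body` and `BLOCK_SIZE` are hypothetical names.

```python
BLOCK_SIZE = 128  # QWK message data is stored in 128-byte records

def encode_qwk_body(lines: list) -> bytes:
    """Encode lines to CP437, terminate each with 0xE3, pad with spaces."""
    body = b"".join(line.encode("cp437") + b"\xe3" for line in lines)
    remainder = len(body) % BLOCK_SIZE
    if remainder:
        # Fill the final record with 0x20 up to the record boundary.
        body += b" " * (BLOCK_SIZE - remainder)
    return body

out = encode_qwk_body(["Hello", "World"])
print(len(out), out[:12])
```

A character with no CP437 byte raises UnicodeEncodeError here, which roughly corresponds to a strict policy.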

Fallback Policies

QWK.NET provides configurable fallback policies for unmappable bytes or characters:

Strict: Throws exception on unmappable content (prevents silent data loss)
ReplacementQuestion: Replaces unmappable bytes with ?
ReplacementUnicode: Uses encoding's default replacement character
BestEffort: Uses encoding's default fallback behaviour

Default is Strict to prevent silent corruption. Applications can choose more permissive policies when processing damaged packets.
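
Python's codec error handlers offer a loose analogy for these policies: "strict" raises and "replace" substitutes a placeholder. The sketch below uses only built-in codecs; the mapping to QWK.NET's policy names is an assumption made for illustration, and `try_encode` is a hypothetical helper.

```python
from typing import Optional

def try_encode(text: str, errors: str) -> Optional[bytes]:
    """Encode to CP437, returning None if the error handler raises."""
    try:
        return text.encode("cp437", errors=errors)
    except UnicodeEncodeError:
        return None

sample = "Price: 5\u20ac"  # the euro sign has no CP437 byte

strict_result = try_encode(sample, "strict")     # refused, so None
question_result = try_encode(sample, "replace")  # euro becomes "?"

# Decoding side: every byte is assigned in Python's cp437 codec, so ASCII
# is used here to show an unmappable byte becoming U+FFFD.
replaced = b"caf\x82".decode("ascii", errors="replace")
print(strict_result, question_result, replaced)
```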

Symptoms of Encoding Issues

Incorrect Character Display

Symptom: Box-drawing characters appear as accented letters or symbols.

Example:

Expected: ─══ StarLink BBS ══─
Seen: ÄÍÍ StarLink BBS ÍÍÄ (bytes 0xC4 and 0xCD decoded as Latin-1 or Windows-1252)

Cause: Packet read with wrong encoding (e.g., Latin-1 instead of CP437)

Solution: Ensure CP437 encoding is used when reading packets
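
If text has already been decoded as Latin-1, the damage is usually reversible because Latin-1 maps every byte to the code point of the same value. A minimal Python sketch, with `repair_latin1_misread` as a hypothetical helper name:

```python
def repair_latin1_misread(garbled: str) -> str:
    """Recover CP437 text that was mistakenly decoded as Latin-1."""
    return garbled.encode("latin-1").decode("cp437")

intended = "\u2500\u2550\u2550 StarLink BBS \u2550\u2550\u2500"
packet_bytes = intended.encode("cp437")

garbled = packet_bytes.decode("latin-1")    # what the wrong decoder shows
repaired = repair_latin1_misread(garbled)   # back to box-drawing characters
print(garbled, "->", repaired)
```

The same trick is less reliable for text misread as Windows-1252, since that code page leaves a few byte values unassigned.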

Line Breaks Not Recognised

Symptom: Message body appears as one long line or lines split incorrectly.

Example:

Expected: Multi-line message with proper breaks
Seen: Single line or breaks in wrong places

Cause: 0xE3 terminators not being recognised, or wrong line ending mode

Solution: QWK.NET handles this automatically, but check LineEndingMode if custom processing is used

Accented Characters Wrong

Symptom: Names with accents display incorrectly.

Example:

Expected: José Rodriguez
Seen: Jos‚ Rodriguez (CP437 byte 0x82 decoded as Windows-1252)

Cause: CP437 accented characters decoded with wrong encoding

Solution: Use CP437 encoding for all text fields

Validation Warnings

Symptom: Validation reports warnings about encoding or character issues.

Common warnings:

"Unmappable byte detected" - Byte cannot be decoded with current encoding
"Invalid character in field" - Character cannot be encoded back to CP437

Solution: Check validation report for specific issues. Consider using Lenient mode for packets with encoding variations.

Round-Trip Failures

Symptom: QWK → REP → QWK cycle produces different bytes.

Cause: Encoding conversion during round-trip (e.g., UTF-8 normalisation)

Solution: QWK.NET preserves bytes by default. Ensure no intermediate encoding conversions occur.
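
A short Python sketch of why an intermediate UTF-8 step breaks the round trip: the single terminator byte becomes a two-byte sequence, so the output no longer matches the input. Built-in codecs only; the sample bytes are made up for illustration.

```python
original = b"Line one\xe3Line two\xe3"

text = original.decode("cp437")

# Correct round trip: encode with the same codec that decoded the bytes.
back_to_cp437 = text.encode("cp437")

# Faulty round trip: saving the decoded text as UTF-8 instead.
resaved_utf8 = text.encode("utf-8")

print(back_to_cp437 == original)          # identical bytes
print(len(original), len(resaved_utf8))   # UTF-8 copy is two bytes longer
print(b"\xe3" in resaved_utf8)            # terminator byte no longer present
```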

Common Corruption Patterns

Encoding-Related Corruption

UTF-8 BOM: Some tools add UTF-8 BOM (0xEF 0xBB 0xBF) to text files
- Impact: Breaks CP437 parsing, causes validation errors
- QWK.NET: Handles by detecting and skipping BOM if present
Line Ending Conversion: Tools that convert CR/LF to LF or vice versa
- Impact: May corrupt 0xE3 terminators or introduce CR/LF where 0xE3 expected
- QWK.NET: Preserves original line endings, handles mixed formats
Character Normalisation: Unicode normalisation (NFC/NFD) applied to text
- Impact: Changes byte representation, breaks round-trip fidelity
- QWK.NET: No normalisation applied by default
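
As an illustration of the BOM case, the Python sketch below strips a leading UTF-8 BOM before decoding and shows what the three BOM bytes look like if they are decoded as CP437 instead. `strip_utf8_bom` is a hypothetical helper, not a QWK.NET function.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (EF BB BF) if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

with_bom = codecs.BOM_UTF8 + b"StarLink BBS"
cleaned = strip_utf8_bom(with_bom)

# Left in place, the BOM decodes to three stray CP437 characters.
stray = codecs.BOM_UTF8.decode("cp437")
print(cleaned, [f"U+{ord(ch):04X}" for ch in stray])
```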

Storage Media Issues

Floppy Disk Corruption: Bit flips in extended ASCII range
- Impact: Box-drawing characters become other extended ASCII
- QWK.NET: Preserves bytes as-is, validation may report issues
Transfer Errors: Incomplete file transfers or network errors
- Impact: Truncated messages, missing line terminators
- QWK.NET: Salvage mode attempts recovery

Best Practices

Reading Packets

Use default CP437 encoding - QWK.NET handles this automatically
Check validation reports - Encoding issues may appear as warnings
Preserve raw bytes - Access RawText or raw properties when byte fidelity is critical

Writing Packets

Encode to CP437 - Use Cp437Encoding.Encode() for all text output
Use 0xE3 terminators - Convert line endings to QWK format before encoding
Avoid UTF-8 conversion - Do not re-save CP437 packet data as UTF-8; extended characters become multi-byte sequences and the 0xE3 terminators are lost

Handling Encoding Variations

Detect encoding - Use ByteClassifier to detect extended ASCII presence
Choose fallback policy - Strict for preservation, BestEffort for damaged packets
Preserve original - Keep raw bytes alongside decoded strings when possible
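
Detecting extended bytes needs nothing more than a scan for values of 0x80 and above. The Python sketch below is a stand-in written for illustration; it does not reproduce QWK.NET's ByteClassifier, and the function names and category labels are assumptions.

```python
def classify_bytes(data: bytes) -> dict:
    """Count printable ASCII, control, and extended (high-bit) bytes."""
    counts = {"printable_ascii": 0, "control": 0, "extended": 0}
    for value in data:
        if value >= 0x80:
            counts["extended"] += 1
        elif value < 0x20 or value == 0x7F:
            counts["control"] += 1
        else:
            counts["printable_ascii"] += 1
    return counts

def has_extended(data: bytes) -> bool:
    """Return True if any byte has the high bit set."""
    return any(value >= 0x80 for value in data)

sample = b"Hi\r\n\xc4\xe3"
print(classify_bytes(sample), has_extended(sample))
```

A body with no extended bytes decodes identically under CP437, Latin-1, and Windows-1252, so the encoding choice only matters once `has_extended` is true.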
