
QWK.NET ‐ Encoding and CP437

Agent 57951 edited this page Jan 19, 2026 · 4 revisions

Encoding and CP437

This page explains character encoding in QWK packets, focusing on CP437 (Code Page 437), the historical DOS character set used by BBS systems. Understanding encoding is crucial for correctly reading and preserving QWK packet content.

What Is CP437 and Why It Matters

CP437 (Code Page 437, also known as "DOS Latin US" or "OEM-US") is the original IBM PC character encoding from the DOS era. It was the standard encoding for IBM PC-compatible systems and most BBS software from the 1980s and 1990s.

Why CP437 Matters in QWK

QWK packets were created by DOS-based BBS systems that used CP437 as their native encoding. This means:

  • BBS names often contain box-drawing characters (borders, frames)
  • Message content may include accented characters, mathematical symbols, or Greek letters
  • Line terminators use byte 0xE3, which is the π (pi) character in CP437
  • Extended ASCII characters (0x80-0xFF) have specific meanings in CP437 that differ from other encodings

Using the wrong encoding when reading QWK packets can result in:

  • Box-drawing characters appearing as accented letters or symbols
  • Message line breaks not being recognised
  • Accented characters displaying incorrectly
  • Loss of visual formatting in BBS names and messages

CP437, Extended ASCII, and Unicode

CP437 Character Set

CP437 is a single-byte encoding with 256 characters:

  • 0x00-0x7F: Standard ASCII characters (same as Unicode U+0000-U+007F)
  • 0x80-0xFF: Extended characters, whose meaning is specific to CP437:
    • Box-drawing characters (single and double lines, corners)
    • Block elements (full blocks, shades)
    • Accented characters (Western European)
    • Mathematical symbols
    • Greek letters (including π at 0xE3)
    • Special symbols

Key CP437 Characters in QWK:

Byte | CP437 Character | Unicode | Common Usage
0xC4 | ─ | U+2500 | Horizontal line (box drawing)
0xB3 | │ | U+2502 | Vertical line (box drawing)
0xDB | █ | U+2588 | Full block
0xE3 | π | U+03C0 | QWK line terminator
0xB0 | ░ | U+2591 | Light shade
0xB1 | ▒ | U+2592 | Medium shade
0xB2 | ▓ | U+2593 | Dark shade

Extended ASCII

"Extended ASCII" is a general term referring to any byte with the high bit set (0x80-0xFF). However, the meaning of these bytes depends on the encoding:

  • CP437: Byte 0xE3 = π (Greek pi)
  • Latin-1 (ISO-8859-1): Byte 0xE3 = ã (a-tilde, U+00E3)
  • Windows-1252: Byte 0xE3 = ã (a-tilde, U+00E3)

This is why encoding matters: the same byte value represents different characters in different encodings.
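
As a small illustration of the point above, the following Python sketch decodes the single byte 0xE3 with three codecs from Python's standard library. It is independent of QWK.NET's API; the variable names are chosen only for this example.

```python
# Decode the same byte with three single-byte encodings.
raw = bytes([0xE3])

decoded = {name: raw.decode(name) for name in ("cp437", "latin-1", "cp1252")}

# Report the resulting code points (ASCII-only output, safe on any console).
for name, ch in decoded.items():
    print(f"{name:8} 0x{raw[0]:02X} -> U+{ord(ch):04X}")

# cp437    0xE3 -> U+03C0  (π, Greek pi)
# latin-1  0xE3 -> U+00E3  (ã, a-tilde)
# cp1252   0xE3 -> U+00E3  (ã, a-tilde)
```

The byte never changes; only the table used to interpret it does.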

Unicode

Unicode is a character set rather than a byte encoding: it assigns a code point to every character, and encodings such as UTF-8 and UTF-16 store those code points as one or more bytes. CP437 characters map to specific Unicode codepoints:

  • CP437 byte 0xE3 → Unicode U+03C0 (π)
  • CP437 byte 0xC4 → Unicode U+2500 (─)
  • CP437 byte 0xB3 → Unicode U+2502 (│)

When QWK.NET decodes CP437 bytes to strings, it converts them to their Unicode equivalents. When encoding strings back to bytes, it converts Unicode characters back to CP437 byte values.
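
The decode and encode directions can be sketched with Python's built-in cp437 codec. This only illustrates the mapping itself and is not QWK.NET code; the variable names are illustrative.

```python
# Four bytes from the table of key characters: ─ │ █ π
sample = bytes([0xC4, 0xB3, 0xDB, 0xE3])

# Decode: CP437 bytes -> Unicode string
text = sample.decode("cp437")
code_points = [f"U+{ord(ch):04X}" for ch in text]

# Encode: Unicode string -> CP437 bytes
round_tripped = text.encode("cp437")

# CP437 assigns a character to every byte value, so decoding and
# re-encoding all 256 bytes returns the original bytes unchanged.
all_bytes = bytes(range(256))
lossless = all_bytes.decode("cp437").encode("cp437") == all_bytes

print(code_points)             # ['U+2500', 'U+2502', 'U+2588', 'U+03C0']
print(round_tripped == sample)  # True
print(lossless)                 # True
```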

Line Endings: The 0xE3 Byte

QWK message bodies use byte 0xE3 as the line terminator, not CR (\r) or LF (\n). This is a unique characteristic of the QWK format.

The 0xE3 Character Mapping Issue

Byte 0xE3 has different Unicode representations depending on encoding interpretation:

  • CP437: 0xE3 → π (U+03C0, Greek pi) - Historically correct for DOS BBS systems
  • Latin-1/Windows-1252: 0xE3 → ã (U+00E3, a-tilde) - Byte identity mapping

Both approaches produce byte 0xE3 in the file, which meets the QWK specification requirement. However, using the correct CP437 mapping (π) preserves the historical character representation.
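
A short Python sketch of the two mappings, using the standard cp437 and latin-1 codecs (not QWK.NET's API; `can_encode` is a helper invented for this example). It also shows that each character only works with its own encoding.

```python
QWK_EOL = b"\xe3"

# π encoded as CP437 and ã encoded as Latin-1 both yield byte 0xE3.
pi_cp437 = "\u03c0".encode("cp437")
a_tilde_latin1 = "\u00e3".encode("latin-1")


def can_encode(ch: str, encoding: str) -> bool:
    """Return True if the character exists in the given encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(pi_cp437 == QWK_EOL, a_tilde_latin1 == QWK_EOL)  # True True
print(can_encode("\u03c0", "latin-1"))  # False: Latin-1 has no π
print(can_encode("\u00e3", "cp437"))    # False: CP437 has no ã
```

Mixing the two, for example writing π through a Latin-1 encoder, therefore fails rather than producing 0xE3.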

QWK.NET Line Ending Handling

QWK.NET supports both approaches for compatibility:

  • CP437 mode: Uses π (U+03C0) which encodes to 0xE3 in CP437
  • ASCII/Latin-1 mode: Uses ã (U+00E3), whose Latin-1 encoding is the single byte 0xE3 (byte identity)

The library automatically detects and handles 0xE3 terminators when parsing message bodies, converting them to standard line breaks for the Lines property whilst preserving the original bytes.

Line Ending Corruption Patterns

Common issues seen in real-world packets:

  1. Mixed line endings: Some messages use 0xE3, others use CR/LF or LF
  2. Missing terminators: Last line of message may lack 0xE3
  3. Extra terminators: Multiple 0xE3 bytes in sequence
  4. QWKE variation: QWKE extensions may use CR (\r) instead of 0xE3

QWK.NET handles these variations gracefully:

  • Detects QWKE-style CR terminators automatically
  • Preserves mixed line endings in raw text
  • Converts 0xE3 to platform-native line endings for Lines property
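
One way to tolerate the variations listed above is to split on any of the terminators at the byte level before decoding. The Python sketch below is an assumption about how such handling could look, not a description of QWK.NET's implementation; `split_body` is a hypothetical helper, and it assumes trailing spaces and NUL bytes are block padding.

```python
import re
from typing import List

# CR LF is listed first so it counts as one break; then lone 0xE3, CR or LF.
_TERMINATOR = re.compile(rb"\r\n|[\xe3\r\n]")


def split_body(raw: bytes, encoding: str = "cp437") -> List[str]:
    """Split a raw message body into decoded lines."""
    raw = raw.rstrip(b" \x00")          # drop assumed block padding
    parts = _TERMINATOR.split(raw)
    if parts and parts[-1] == b"":      # a final terminator adds no extra line
        parts.pop()
    return [part.decode(encoding) for part in parts]


body = b"one\xe3two\r\nthree\xe3\xe3four" + b" " * 5
print(split_body(body))  # ['one', 'two', 'three', '', 'four']
```

Consecutive terminators are kept as blank lines, and a last line without a terminator is still returned. Because the split happens on bytes, a literal π in the message text cannot be told apart from a terminator.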

How QWK.NET Handles Encoding

Default Behaviour

QWK.NET uses CP437 encoding by default for all text fields:

  • CONTROL.DAT fields (BBS name, sysop name, etc.)
  • Message headers (From, To, Subject)
  • Message bodies
  • Optional files (WELCOME, NEWS, GOODBYE)
  • QWKE extension files (TOREADER.EXT, TODOOR.EXT)

Encoding During Parsing

When reading QWK packets:

  1. Raw bytes are read from the archive without conversion
  2. CP437 decoding converts bytes to Unicode strings for API access
  3. Original bytes are preserved in raw properties for round-trip fidelity
  4. Line terminators (0xE3) are detected and converted to line breaks

Example:

// Bytes: [0xC4, 0xC4, 0x20, 0x53, 0x74, 0x61, 0x72, 0x4C, 0x69, 0x6E, 0x6B, 0xC4, 0xC4]
// CP437 decode: "── StarLink──"
// Unicode: U+2500, U+2500, U+0020, U+0053, U+0074, U+0061, U+0072, U+004C, U+0069, U+006E, U+006B, U+2500, U+2500
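
Steps 1 to 3 of the parsing process can be sketched in Python as follows. The `DecodedField` type and `decode_field` function are hypothetical names for this illustration and are not part of QWK.NET.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodedField:
    raw: bytes   # original bytes, kept untouched for round-trip fidelity
    text: str    # CP437-decoded view for display and searching


def decode_field(raw: bytes) -> DecodedField:
    """Keep the raw bytes and attach a decoded string alongside them."""
    return DecodedField(raw=raw, text=raw.decode("cp437"))


field = decode_field(bytes([0xC4, 0xC4, 0x20, 0x53, 0x74, 0x61, 0x72,
                            0x4C, 0x69, 0x6E, 0x6B, 0xC4, 0xC4]))

print([f"U+{ord(ch):04X}" for ch in field.text[:3]])  # ['U+2500', 'U+2500', 'U+0020']
print(field.text.encode("cp437") == field.raw)        # True
```

Keeping `raw` next to `text` means a writer can emit the original bytes even if the decoded string is later edited or displayed with a different font.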

Encoding During Preservation

When writing REP packets or generating QWK output:

  1. Unicode strings are encoded back to CP437 bytes
  2. Line terminators are converted to 0xE3 bytes
  3. 128-byte records are padded with spaces (0x20)
  4. Round-trip fidelity ensures QWK → REP → QWK preserves all bytes
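
The first three writing steps can be sketched in Python like this. `encode_body` is a hypothetical helper, not QWK.NET's API, and it assumes every line is terminated with 0xE3, including the last.

```python
from typing import Iterable

BLOCK_SIZE = 128  # message bodies are stored in 128-byte records


def encode_body(lines: Iterable[str], encoding: str = "cp437") -> bytes:
    """Encode lines, terminate each with 0xE3, and pad to a full record."""
    body = b"".join(line.encode(encoding) + b"\xe3" for line in lines)
    padding = -len(body) % BLOCK_SIZE   # bytes needed to reach a block boundary
    return body + b" " * padding


encoded = encode_body(["Hello", "World"])
print(len(encoded))   # 128
print(encoded[:12])   # b'Hello\xe3World\xe3'
```

A body that already ends on a block boundary receives no padding. A line that itself contains π would also encode to 0xE3 and be read back as a line break, so callers may want to check for it before encoding.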

Fallback Policies

QWK.NET provides configurable fallback policies for unmappable bytes or characters:

  • Strict: Throws exception on unmappable content (prevents silent data loss)
  • ReplacementQuestion: Replaces unmappable bytes with ?
  • ReplacementUnicode: Uses encoding's default replacement character
  • BestEffort: Uses encoding's default fallback behaviour

Default is Strict to prevent silent corruption. Applications can choose more permissive policies when processing damaged packets.
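
The difference between a strict and a replacing policy can be shown with the error handlers of Python's codecs. This is only an analogy: the handler names `strict` and `replace` belong to Python, not to QWK.NET, and `try_encode` is a helper invented for this example.

```python
from typing import Optional

# é exists in CP437 (byte 0x82); the snowman U+2603 does not.
text = "Caf\u00e9 \u2603"


def try_encode(value: str, errors: str) -> Optional[bytes]:
    """Encode to CP437, returning None if the handler raises."""
    try:
        return value.encode("cp437", errors=errors)
    except UnicodeEncodeError:
        return None


strict_result = try_encode(text, "strict")   # raises internally -> None
replaced = try_encode(text, "replace")       # unmappable character becomes '?'

# On the decoding side, byte 0x81 is undefined in Python's cp1252 codec,
# so a replacing handler yields U+FFFD instead of raising.
undefined_byte = b"\x81".decode("cp1252", errors="replace")

print(strict_result)                    # None
print(replaced)                         # b'Caf\x82 ?'
print(f"U+{ord(undefined_byte):04X}")   # U+FFFD
```

The strict path surfaces the problem immediately, whilst the replacing path succeeds but can no longer be reversed to the original text.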

Symptoms of Encoding Issues

Incorrect Character Display

Symptom: Box-drawing characters appear as accented letters or symbols.

Example:

  • Expected: ─══ StarLink BBS ══─
  • Seen: ÄÍÍ StarLink BBS ÍÍÄ (bytes 0xC4 and 0xCD decoded as Latin-1 or Windows-1252)

Cause: Packet read with wrong encoding (e.g., Latin-1 instead of CP437)

Solution: Ensure CP437 encoding is used when reading packets

Line Breaks Not Recognised

Symptom: Message body appears as one long line or lines split incorrectly.

Example:

  • Expected: Multi-line message with proper breaks
  • Seen: Single line or breaks in wrong places

Cause: 0xE3 terminators not being recognised, or wrong line ending mode

Solution: QWK.NET handles this automatically, but check LineEndingMode if custom processing is used

Accented Characters Wrong

Symptom: Names with accents display incorrectly.

Example:

  • Expected: José Rodriguez
  • Seen: Jos‚ Rodriguez (CP437 byte 0x82 shown as a low quotation mark under Windows-1252) or Jos? Rodriguez

Cause: CP437 accented characters decoded with wrong encoding

Solution: Use CP437 encoding for all text fields
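
The display symptoms described above can be reproduced with Python's standard codecs, which helps when diagnosing which wrong encoding was applied. The sketch is independent of QWK.NET and the variable names are illustrative.

```python
# Accented name as a DOS BBS would have stored it.
name_bytes = "Jos\u00e9 Rodriguez".encode("cp437")   # é is byte 0x82 in CP437

as_cp1252 = name_bytes.decode("cp1252")    # 0x82 -> U+201A (low quotation mark)
as_latin1 = name_bytes.decode("latin-1")   # 0x82 -> U+0082 (invisible C1 control)

# Box-drawing bytes read as Latin-1 turn into accented capitals.
border_bytes = "\u2500\u2550".encode("cp437")    # ─ ═ are bytes 0xC4 0xCD
border_latin1 = border_bytes.decode("latin-1")   # Ä Í

print(f"0x{name_bytes[3]:02X}")                          # 0x82
print(f"U+{ord(as_cp1252[3]):04X}")                      # U+201A
print(f"U+{ord(as_latin1[3]):04X}")                      # U+0082
print([f"U+{ord(ch):04X}" for ch in border_latin1])      # ['U+00C4', 'U+00CD']
```

In every case the stored bytes are intact; decoding them again as CP437 restores the expected text.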

Validation Warnings

Symptom: Validation reports warnings about encoding or character issues.

Common warnings:

  • "Unmappable byte detected" - Byte cannot be decoded with current encoding
  • "Invalid character in field" - Character cannot be encoded back to CP437

Solution: Check validation report for specific issues. Consider using Lenient mode for packets with encoding variations.

Round-Trip Failures

Symptom: QWK → REP → QWK cycle produces different bytes.

Cause: Encoding conversion during round-trip (e.g., transcoding through UTF-8 or applying Unicode normalisation)

Solution: QWK.NET preserves bytes by default. Ensure no intermediate encoding conversions occur.

Common Corruption Patterns

Encoding-Related Corruption

  1. UTF-8 BOM: Some tools add UTF-8 BOM (0xEF 0xBB 0xBF) to text files
    • Impact: Breaks CP437 parsing, causes validation errors
    • QWK.NET: Handles by detecting and skipping BOM if present

  2. Line Ending Conversion: Tools that convert CR/LF to LF or vice versa
    • Impact: May corrupt 0xE3 terminators or introduce CR/LF where 0xE3 expected
    • QWK.NET: Preserves original line endings, handles mixed formats

  3. Character Normalisation: Unicode normalisation (NFC/NFD) applied to text
    • Impact: Changes byte representation, breaks round-trip fidelity
    • QWK.NET: No normalisation applied by default
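
Two of these patterns are easy to demonstrate in Python. The sketch below shows what a UTF-8 BOM looks like when read as CP437, a hypothetical `strip_utf8_bom` helper, and why NFD normalisation breaks CP437 encoding. It does not describe QWK.NET's internals.

```python
import codecs
import unicodedata


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (0xEF 0xBB 0xBF) if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Read as CP437, the BOM shows up as three stray characters: ∩ ╗ ┐
bom_as_cp437 = codecs.BOM_UTF8.decode("cp437")

# NFD splits é into "e" plus a combining accent (U+0301), which CP437 lacks.
nfc = "Jos\u00e9"
nfd = unicodedata.normalize("NFD", nfc)
try:
    nfd.encode("cp437")
    nfd_encodable = True
except UnicodeEncodeError:
    nfd_encodable = False

print(strip_utf8_bom(b"\xef\xbb\xbfBBS"))            # b'BBS'
print([f"U+{ord(ch):04X}" for ch in bom_as_cp437])   # ['U+2229', 'U+2557', 'U+2510']
print(nfc.encode("cp437"), nfd_encodable)            # b'Jos\x82' False
```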

Storage Media Issues

  1. Floppy Disk Corruption: Bit flips in extended ASCII range
    • Impact: Box-drawing characters become other extended ASCII
    • QWK.NET: Preserves bytes as-is, validation may report issues

  2. Transfer Errors: Incomplete file transfers or network errors
    • Impact: Truncated messages, missing line terminators
    • QWK.NET: Salvage mode attempts recovery

Best Practices

Reading Packets

  • Use default CP437 encoding - QWK.NET handles this automatically
  • Check validation reports - Encoding issues may appear as warnings
  • Preserve raw bytes - Access RawText or raw properties when byte fidelity is critical

Writing Packets

  • Encode to CP437 - Use Cp437Encoding.Encode() for all text output
  • Use 0xE3 terminators - Convert line endings to QWK format before encoding
  • Avoid UTF-8 conversion - Do not store packet text as UTF-8; characters above U+007F become multi-byte sequences that no longer match the original CP437 bytes

Handling Encoding Variations

  • Detect encoding - Use ByteClassifier to detect extended ASCII presence
  • Choose fallback policy - Strict for preservation, BestEffort for damaged packets
  • Preserve original - Keep raw bytes alongside decoded strings when possible
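
Detecting extended bytes is a simple scan for values of 0x80 and above. The Python sketch below shows the idea only; the function names are hypothetical and it makes no claim about how ByteClassifier works.

```python
from collections import Counter


def has_extended_bytes(data: bytes) -> bool:
    """Return True if any byte has the high bit set (0x80-0xFF)."""
    return any(b >= 0x80 for b in data)


def extended_byte_counts(data: bytes) -> Counter:
    """Count each extended byte value, useful for spotting patterns."""
    return Counter(b for b in data if b >= 0x80)


sample = b"\xc4\xc4 StarLink \xc4\xc4"
print(has_extended_bytes(b"plain ASCII"))    # False
print(has_extended_bytes(sample))            # True
print(dict(extended_byte_counts(sample)))    # {196: 4}
```

A field with no extended bytes decodes identically under CP437, Latin-1 and Windows-1252, so the choice of encoding and fallback policy only matters for fields where this check returns True.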
