QWK.NET ‐ Encoding and CP437
This page explains character encoding in QWK packets, focusing on CP437 (Code Page 437), the historical DOS character set used by BBS systems. Understanding encoding is crucial for correctly reading and preserving QWK packet content.
CP437 (Code Page 437, also known as "DOS Latin US" or "OEM-US") is the original IBM PC character encoding from the DOS era. It was the standard encoding for IBM PC-compatible systems and most BBS software from the 1980s and 1990s.
QWK packets were created by DOS-based BBS systems that used CP437 as their native encoding. This means:
- BBS names often contain box-drawing characters (borders, frames)
- Message content may include accented characters, mathematical symbols, or Greek letters
- Line terminators use byte 0xE3, which is the π (pi) character in CP437
- Extended ASCII characters (0x80-0xFF) have specific meanings in CP437 that differ from other encodings
Using the wrong encoding when reading QWK packets can result in:
- Box-drawing characters appearing as accented letters or symbols
- Message line breaks not being recognised
- Accented characters displaying incorrectly
- Loss of visual formatting in BBS names and messages
CP437 is a single-byte encoding with 256 characters:
- 0x00-0x7F: Standard ASCII characters (same as Unicode U+0000-U+007F)
- 0x80-0xFF: Extended characters unique to CP437:
  - Box-drawing characters (single and double lines, corners)
  - Block elements (full blocks, shades)
  - Accented characters (Western European)
  - Mathematical symbols
  - Greek letters (including π at 0xE3)
  - Special symbols
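The two ranges above can be told apart with a simple high-bit test. The following is a minimal Python sketch, independent of QWK.NET; `count_byte_classes` is a hypothetical helper name used only for illustration.

```python
def count_byte_classes(data: bytes) -> dict:
    """Count ASCII (0x00-0x7F) and extended (0x80-0xFF) bytes in a buffer."""
    extended = sum(1 for b in data if b >= 0x80)
    return {"ascii": len(data) - extended, "extended": extended}


# "── Hi" as CP437 bytes: two box-drawing bytes followed by three ASCII bytes.
sample = bytes([0xC4, 0xC4, 0x20, 0x48, 0x69])
print(count_byte_classes(sample))  # {'ascii': 3, 'extended': 2}
```

A buffer with no extended bytes decodes identically under CP437, Latin-1 and Windows-1252, so the choice of encoding only matters once the extended count is non-zero.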
Key CP437 Characters in QWK:
| Byte | CP437 Character | Unicode | Common Usage |
|---|---|---|---|
| 0xC4 | ─ | U+2500 | Horizontal line (box drawing) |
| 0xB3 | │ | U+2502 | Vertical line (box drawing) |
| 0xDB | █ | U+2588 | Full block |
| 0xE3 | π | U+03C0 | QWK line terminator |
| 0xB0 | ░ | U+2591 | Light shade |
| 0xB1 | ▒ | U+2592 | Medium shade |
| 0xB2 | ▓ | U+2593 | Dark shade |
"Extended ASCII" is a general term referring to any byte with the high bit set (0x80-0xFF). However, the meaning of these bytes depends on the encoding:
- CP437: Byte 0xE3 = π (Greek pi)
- Latin-1 (ISO-8859-1): Byte 0xE3 = ã (a-tilde, U+00E3)
- Windows-1252: Byte 0xE3 = ã (a-tilde, U+00E3)
This is why encoding matters: the same byte value represents different characters in different encodings.
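As a quick illustration of the point above, the sketch below decodes the single byte 0xE3 with three of Python's built-in codecs. It uses only the Python standard library and does not involve QWK.NET.

```python
raw = bytes([0xE3])

# Decode the same byte under each encoding and show the resulting code point.
for codec in ("cp437", "latin-1", "cp1252"):
    char = raw.decode(codec)
    print(f"{codec:8} 0xE3 -> {char} (U+{ord(char):04X})")
```

The first line of output shows π (U+03C0); the other two show ã (U+00E3).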
Unicode is a universal character set that assigns a code point to every character; encodings such as UTF-8 and UTF-16 then store those code points as bytes. Each CP437 byte maps to a specific Unicode code point:
- CP437 byte 0xE3 → Unicode U+03C0 (π)
- CP437 byte 0xC4 → Unicode U+2500 (─)
- CP437 byte 0xB3 → Unicode U+2502 (│)
When QWK.NET decodes CP437 bytes to strings, it converts them to their Unicode equivalents. When encoding strings back to bytes, it converts Unicode characters back to CP437 byte values.
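The decode and encode directions described above can be checked with Python's built-in `cp437` codec. This is a sketch of the mapping itself, not of QWK.NET's internals, which may differ in detail (for example in how control bytes are treated).

```python
# Mappings listed on this page: CP437 byte -> Unicode code point.
listed = {0xE3: 0x03C0, 0xC4: 0x2500, 0xB3: 0x2502}

for byte, code_point in listed.items():
    char = bytes([byte]).decode("cp437")          # bytes -> str
    assert ord(char) == code_point
    assert char.encode("cp437") == bytes([byte])  # str -> bytes
    print(f"0x{byte:02X} <-> U+{code_point:04X} {char}")

# Python's cp437 codec assigns a character to every byte value, so the full
# 0x00-0xFF range survives a decode/encode cycle unchanged.
all_bytes = bytes(range(256))
assert all_bytes.decode("cp437").encode("cp437") == all_bytes
```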
QWK message bodies use byte 0xE3 as the line terminator, not CR (\r) or LF (\n). This is a unique characteristic of the QWK format.
Byte 0xE3 has different Unicode representations depending on encoding interpretation:
- CP437: 0xE3 → π (U+03C0, Greek pi) - Historically correct for DOS BBS systems
- Latin-1/Windows-1252: 0xE3 → ã (U+00E3, a-tilde) - Byte identity mapping
Both approaches produce byte 0xE3 in the file, which meets the QWK specification requirement. However, using the correct CP437 mapping (π) preserves the historical character representation.
QWK.NET supports both approaches for compatibility:
- CP437 mode: Uses π (U+03C0) which encodes to 0xE3 in CP437
- ASCII/Latin-1 mode: Uses character 0xE3 (ã, U+00E3) for byte identity
The library automatically detects and handles 0xE3 terminators when parsing message bodies, converting them to standard line breaks for the Lines property whilst preserving the original bytes.
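To make the terminator handling concrete, here is a minimal Python sketch that splits a raw body on 0xE3 and decodes each line. `body_to_lines` is a hypothetical helper written for this page; it assumes trailing space padding should be discarded and is not how QWK.NET necessarily implements its `Lines` property.

```python
QWK_EOL = bytes([0xE3])  # QWK line terminator (decodes to π in CP437)


def body_to_lines(body: bytes) -> list:
    """Split a raw message body on 0xE3 and decode each line as CP437."""
    # Splitting on the raw byte keeps the result independent of the decoder.
    parts = body.rstrip(b" ").split(QWK_EOL)
    # A trailing terminator leaves one empty element at the end; drop it.
    if parts and parts[-1] == b"":
        parts.pop()
    return [part.decode("cp437") for part in parts]


body = b"Hello\xe3World\xe3   "
print(body_to_lines(body))  # ['Hello', 'World']
```

The raw `body` value is left untouched, so the original bytes remain available alongside the decoded lines.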
Common issues seen in real-world packets:
- Mixed line endings: Some messages use 0xE3, others use CR/LF or LF
- Missing terminators: Last line of message may lack 0xE3
- Extra terminators: Multiple 0xE3 bytes in sequence
- QWKE variation: QWKE extensions may use CR (\r) instead of 0xE3
QWK.NET handles these variations gracefully:
- Detects QWKE-style CR terminators automatically
- Preserves mixed line endings in raw text
- Converts 0xE3 to platform-native line endings for the `Lines` property
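One way to tolerate the variations listed above is to treat every terminator style as equivalent when splitting. The sketch below is an assumption about how such a splitter could look, written in Python for illustration; `split_mixed` is a hypothetical name and the code is not taken from QWK.NET.

```python
import re

# Treat 0xE3, CR LF, lone CR and lone LF as equivalent line terminators.
# CR LF is listed first so the pair counts as one break, not two.
_TERMINATOR = re.compile(rb"\r\n|\xe3|\r|\n")


def split_mixed(body: bytes) -> list:
    """Split a body that mixes terminator styles, then decode as CP437."""
    lines = _TERMINATOR.split(body)
    # Ignore the empty element produced by a terminator at the very end.
    if lines and lines[-1] == b"":
        lines.pop()
    return [line.decode("cp437") for line in lines]


print(split_mixed(b"one\xe3two\r\nthree\rfour\n\xe3five"))
# ['one', 'two', 'three', 'four', '', 'five']
```

Consecutive terminators produce an empty string, which keeps intentional blank lines; a missing final terminator is harmless because the last fragment is returned as-is.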
QWK.NET uses CP437 encoding by default for all text fields:
- `CONTROL.DAT` fields (BBS name, sysop name, etc.)
- Message headers (From, To, Subject)
- Message bodies
- Optional files (WELCOME, NEWS, GOODBYE)
- QWKE extension files (TOREADER.EXT, TODOOR.EXT)
When reading QWK packets:
- Raw bytes are read from the archive without conversion
- CP437 decoding converts bytes to Unicode strings for API access
- Original bytes are preserved in raw properties for round-trip fidelity
- Line terminators (0xE3) are detected and converted to line breaks
Example:

    // Bytes: [0xC4, 0xC4, 0x20, 0x53, 0x74, 0x61, 0x72, 0x4C, 0x69, 0x6E, 0x6B, 0x20, 0xC4, 0xC4]
    // CP437 decode: "── StarLink ──"
    // Unicode: U+2500, U+2500, U+0020, U+0053, U+0074, U+0061, U+0072, U+004C, U+0069, U+006E, U+006B, U+0020, U+2500, U+2500

When writing REP packets or generating QWK output:
- Unicode strings are encoded back to CP437 bytes
- Line terminators are converted to 0xE3 bytes
- 128-byte records are padded with spaces (0x20)
- Round-trip fidelity ensures QWK → REP → QWK preserves all bytes
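The writing steps above can be sketched in a few lines of Python. `lines_to_blocks` is a hypothetical helper; it covers only the body text (encoding, 0xE3 terminators, space padding) and leaves out the message header record, so treat it as an illustration and not as QWK.NET's writer.

```python
BLOCK_SIZE = 128  # QWK records are 128 bytes long


def lines_to_blocks(lines: list) -> bytes:
    """Encode lines as CP437, end each with 0xE3, pad to whole records."""
    body = b"".join(line.encode("cp437") + b"\xe3" for line in lines)
    # Number of spaces needed to reach the next multiple of BLOCK_SIZE.
    padding = -len(body) % BLOCK_SIZE
    return body + b" " * padding


blocks = lines_to_blocks(["── StarLink ──", "Hello"])
print(len(blocks))  # 128
```

Because `str.encode("cp437")` raises on characters that CP437 cannot represent, unmappable input fails loudly here instead of being written as corrupted bytes.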
QWK.NET provides configurable fallback policies for unmappable bytes or characters:
- Strict: Throws exception on unmappable content (prevents silent data loss)
- ReplacementQuestion: Replaces unmappable bytes with `?`
- ReplacementUnicode: Uses encoding's default replacement character
- BestEffort: Uses encoding's default fallback behaviour
Default is Strict to prevent silent corruption. Applications can choose more permissive policies when processing damaged packets.
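The difference between a strict and a replacing policy can be seen with Python's codec error handlers, which are only a loose analogue of the QWK.NET policies named above. Note that Python's `cp437` codec maps every byte to a character, so in this sketch only the encoding direction can fail.

```python
text = "Price: 5€"  # the euro sign has no CP437 equivalent

# Comparable to a strict policy: refuse to encode instead of losing data.
try:
    text.encode("cp437", errors="strict")
except UnicodeEncodeError as exc:
    print(f"strict: cannot encode {text[exc.start]!r} at index {exc.start}")

# Comparable to a '?' replacement policy: lossy, but never raises.
print(text.encode("cp437", errors="replace"))  # b'Price: 5?'
```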
Symptom: Box-drawing characters appear as accented letters or symbols.
Example:
- Expected: `─══ StarLink BBS ══─`
- Seen: `ÄÍÍ StarLink BBS ÍÍÄ` (the same bytes, 0xC4 and 0xCD, decoded as Latin-1)
Cause: Packet read with wrong encoding (e.g., Latin-1 instead of CP437)
Solution: Ensure CP437 encoding is used when reading packets
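The symptom is easy to reproduce, and text that was decoded as Latin-1 by mistake can usually be repaired because Latin-1 preserves every byte value. The Python sketch below illustrates both steps; it is a general technique, not a QWK.NET feature.

```python
# "─══ StarLink BBS ══─" as CP437 bytes (0xC4 = ─, 0xCD = ═).
raw = bytes([0xC4, 0xCD, 0xCD]) + b" StarLink BBS " + bytes([0xCD, 0xCD, 0xC4])

correct = raw.decode("cp437")
wrong = raw.decode("latin-1")
print(correct)  # ─══ StarLink BBS ══─
print(wrong)    # ÄÍÍ StarLink BBS ÍÍÄ

# Latin-1 maps each byte to the code point of the same value, so the
# original bytes can be recovered and decoded again with the right codec.
repaired = wrong.encode("latin-1").decode("cp437")
assert repaired == correct
```

The repair only works while the mis-decoded string is intact; once characters have been replaced with `?` or normalised, the original bytes are gone.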
Symptom: Message body appears as one long line or lines split incorrectly.
Example:
- Expected: Multi-line message with proper breaks
- Seen: Single line or breaks in wrong places
Cause: 0xE3 terminators not being recognised, or wrong line ending mode
Solution: QWK.NET handles this automatically, but check LineEndingMode if custom processing is used
Symptom: Names with accents display incorrectly.
Example:
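- Expected: `José Rodriguez`
- Seen: `José Rodriguez` or `Jos? Rodriguez`

Cause: CP437 accented characters decoded with the wrong encoding. The two-character `Ã©` pattern specifically indicates text that was converted to UTF-8 and then read back as a single-byte encoding; `?` indicates a character replaced by a fallback policy.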
Solution: Use CP437 encoding for all text fields
Symptom: Validation reports warnings about encoding or character issues.
Common warnings:
- "Unmappable byte detected" - Byte cannot be decoded with current encoding
- "Invalid character in field" - Character cannot be encoded back to CP437
Solution: Check validation report for specific issues. Consider using Lenient mode for packets with encoding variations.
Symptom: QWK → REP → QWK cycle produces different bytes.
Cause: Encoding conversion during round-trip (e.g., UTF-8 normalisation)
Solution: QWK.NET preserves bytes by default. Ensure no intermediate encoding conversions occur.
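Unicode normalisation is one intermediate conversion that silently breaks the cycle. The Python sketch below shows the effect on a single accented name; it assumes nothing about QWK.NET beyond the CP437 mapping itself.

```python
import unicodedata

original = bytes([0x4A, 0x6F, 0x73, 0x82])  # "José" in CP437 (0x82 = é)
text = original.decode("cp437")

# Untouched text re-encodes to the identical bytes.
assert text.encode("cp437") == original

# NFD splits é into "e" plus U+0301 (combining acute accent), which has no
# CP437 byte, so the original bytes can no longer be reproduced.
decomposed = unicodedata.normalize("NFD", text)
print(len(text), len(decomposed))  # 4 5
print(decomposed.encode("cp437", errors="replace"))  # b'Jose?'

# Re-composing with NFC before encoding restores the original bytes here.
assert unicodedata.normalize("NFC", decomposed).encode("cp437") == original
```

Comparing the re-encoded bytes against the preserved raw bytes, as the first assertion does, is a cheap way to detect this kind of drift before writing a packet.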
- UTF-8 BOM: Some tools add UTF-8 BOM (0xEF 0xBB 0xBF) to text files
  - Impact: Breaks CP437 parsing, causes validation errors
  - QWK.NET: Handles by detecting and skipping BOM if present
- Line Ending Conversion: Tools that convert CR/LF to LF or vice versa
  - Impact: May corrupt 0xE3 terminators or introduce CR/LF where 0xE3 expected
  - QWK.NET: Preserves original line endings, handles mixed formats
- Character Normalisation: Unicode normalisation (NFC/NFD) applied to text
  - Impact: Changes byte representation, breaks round-trip fidelity
  - QWK.NET: No normalisation applied by default
- Floppy Disk Corruption: Bit flips in extended ASCII range
  - Impact: Box-drawing characters become other extended ASCII
  - QWK.NET: Preserves bytes as-is, validation may report issues
- Transfer Errors: Incomplete file transfers or network errors
  - Impact: Truncated messages, missing line terminators
  - QWK.NET: Salvage mode attempts recovery
- Use default CP437 encoding - QWK.NET handles this automatically
- Check validation reports - Encoding issues may appear as warnings
- Preserve raw bytes - Access `RawText` or raw properties when byte fidelity is critical
- Encode to CP437 - Use `Cp437Encoding.Encode()` for all text output
- Use 0xE3 terminators - Convert line endings to QWK format before encoding
- Avoid UTF-8 conversion - Do not convert CP437 strings to UTF-8 for storage
- Detect encoding - Use `ByteClassifier` to detect extended ASCII presence
- Choose fallback policy - Strict for preservation, BestEffort for damaged packets
- Preserve original - Keep raw bytes alongside decoded strings when possible
- Validation Modes - How validation modes handle encoding issues
- QWK Format Notes - Real-world encoding variations and quirks
- API Overview - Encoding-related API methods and options