UTF-8 number parsing fails for cultures using NBSP or Narrow NBSP as group separators #124016

@artl93

Description

UTF-8 number parsing fails for cultures that use NBSP (U+00A0) or Narrow NBSP (U+202F) as group/decimal separators. The IsSpaceReplacingChar function in Number.Parsing.Common.cs operates on Unicode codepoints, but UTF-8 encodes these characters as multi-byte sequences that the parser cannot recognize.

This issue was identified while investigating #120283 and #123783, which fix the UTF-16 (char) parsing path but cannot address the UTF-8 issue without architectural changes.


Root Cause

The IsSpaceReplacingChar function operates on Unicode codepoints:

private static bool IsSpaceReplacingChar(uint c) => (c == '\u00a0') || (c == '\u202f');

But IUtfChar<byte>.CastToUInt32(byte) returns the raw byte value (0-255), not a decoded Unicode codepoint, so each byte of a multi-byte sequence is compared on its own. UTF-8 encodes these characters as multi-byte sequences:

| Character | Unicode | Codepoint | UTF-8 Bytes | Parser Sees | Result |
|---|---|---|---|---|---|
| NBSP | U+00A0 | 160 | C2 A0 | 194, then 160 | ❌ Fails (194 ≠ 160) |
| Narrow NBSP | U+202F | 8239 | E2 80 AF | 226, 128, 175 | ❌ Fails (226 ≠ 8239) |
| Regular Space | U+0020 | 32 | 20 | 32 | ✅ Works |
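
The byte values in the table can be checked with a few lines of C#. This is only an illustrative sketch: `Utf8Hex` is a hypothetical helper name introduced here, not something from the runtime.

```csharp
using System;
using System.Text;

// Hypothetical helper: UTF-8 encoding of one code point as space-separated hex bytes.
string Utf8Hex(int codePoint)
{
    byte[] bytes = Encoding.UTF8.GetBytes(char.ConvertFromUtf32(codePoint));
    return BitConverter.ToString(bytes).Replace('-', ' ');
}

Console.WriteLine($"NBSP  U+00A0 -> {Utf8Hex(0x00A0)}"); // C2 A0
Console.WriteLine($"NNBSP U+202F -> {Utf8Hex(0x202F)}"); // E2 80 AF
Console.WriteLine($"Space U+0020 -> {Utf8Hex(0x0020)}"); // 20

// A byte-by-byte comparison sees the lead byte, not the code point.
uint firstByte = Encoding.UTF8.GetBytes("\u00a0")[0];
Console.WriteLine(firstByte == 0x00A0); // False: 194 is compared against 160
```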

Suggested Fix

The MatchChars function needs to detect UTF-8 mode (TChar = byte) and recognize:

  • Byte sequence C2 A0 as equivalent to 20 (NBSP → space)
  • Byte sequence E2 80 AF as equivalent to 20 (NNBSP → space)

This requires careful handling to maintain byte-by-byte iteration performance.
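
As a rough sketch of the idea (not the runtime's implementation), the byte-sequence check could look like the following. `SpaceLikeLength` and `MatchSeparator` are hypothetical names, and treating space, NBSP, and NNBSP as interchangeable whenever the culture's separator is space-like is an assumption of this sketch. Switching on the first byte keeps the common ASCII case to a single comparison.

```csharp
using System;

// Hypothetical helper: number of bytes taken by a space-like character
// (space, NBSP, or NNBSP) at the start of the input, or 0 if there is none.
int SpaceLikeLength(ReadOnlySpan<byte> input)
{
    if (input.IsEmpty) return 0;
    switch (input[0])
    {
        case 0x20: // U+0020 space
            return 1;
        case 0xC2: // possible U+00A0 (C2 A0)
            return input.Length >= 2 && input[1] == 0xA0 ? 2 : 0;
        case 0xE2: // possible U+202F (E2 80 AF)
            return input.Length >= 3 && input[1] == 0x80 && input[2] == 0xAF ? 3 : 0;
        default:
            return 0;
    }
}

// Hypothetical helper: bytes of input consumed by the culture's separator, or 0.
int MatchSeparator(ReadOnlySpan<byte> input, ReadOnlySpan<byte> separator)
{
    if (separator.IsEmpty) return 0;
    if (input.StartsWith(separator)) return separator.Length; // exact match
    // Assumption: a space-like separator accepts any space-like sequence.
    bool separatorIsSpaceLike = SpaceLikeLength(separator) == separator.Length;
    return separatorIsSpaceLike ? SpaceLikeLength(input) : 0;
}

byte[] nbspSeparator = { 0xC2, 0xA0 };
byte[] spaceThenDigit = { 0x20, 0x32 };
Console.WriteLine(MatchSeparator(spaceThenDigit, nbspSeparator)); // 1
```

A real fix would also have to decide whether NBSP and NNBSP should match each other, which this sketch simply assumes.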


Reproduction

using System.Globalization;
using System.Numerics;
using System.Text;

var culture = new CultureInfo("uk-UA");
string input = "1\u00a0234\u00a0567"; // NBSP as thousands separator
byte[] utf8Input = Encoding.UTF8.GetBytes(input);

// UTF-16 parsing works (after #123783)
BigInteger.Parse(input, NumberStyles.AllowThousands, culture); // ✅

// UTF-8 parsing fails
BigInteger.Parse(utf8Input, NumberStyles.AllowThousands, culture); // ❌ FormatException

Exhaustive List of Affected Cultures

Enumerated via CultureInfo.GetCultures(CultureTypes.AllCultures) on .NET 10.
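
A minimal sketch of how such an enumeration can be done is shown below. `CulturesWithGroupSeparator` is a hypothetical helper name, and the counts it reports depend on the runtime version and the globalization data (ICU or NLS) installed on the machine, so they may not match the lists in this issue exactly.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helper: names of all cultures whose NumberGroupSeparator equals the given string.
string[] CulturesWithGroupSeparator(string separator) =>
    CultureInfo.GetCultures(CultureTypes.AllCultures)
        .Where(c => c.NumberFormat.NumberGroupSeparator == separator)
        .Select(c => c.Name)
        .OrderBy(name => name, StringComparer.Ordinal)
        .ToArray();

string[] nbspCultures = CulturesWithGroupSeparator("\u00a0");
string[] nnbspCultures = CulturesWithGroupSeparator("\u202f");

Console.WriteLine($"NBSP (U+00A0): {nbspCultures.Length} cultures");
Console.WriteLine($"Narrow NBSP (U+202F): {nnbspCultures.Length} cultures");
```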

Cultures using NBSP (U+00A0) as NumberGroupSeparator (195 cultures)

Culture Code Language
af Afrikaans
af-NA Afrikaans (Namibia)
af-ZA Afrikaans (South Africa)
agq Aghem
agq-CM Aghem (Cameroon)
bas Basaa
bas-CM Basaa (Cameroon)
be Belarusian
be-BY Belarusian (Belarus)
bg Bulgarian
bg-BG Bulgarian (Bulgaria)
blo Anii
blo-BJ Anii (Benin)
br Breton
br-FR Breton (France)
cs Czech
cs-CZ Czech (Czechia)
cv Chuvash
cv-RU Chuvash (Russia)
dje Zarma
dje-NE Zarma (Niger)
dua Duala
dua-CM Duala (Cameroon)
dyo Jola-Fonyi
dyo-SN Jola-Fonyi (Senegal)
en-AL English (Albania)
en-BG English (Bulgaria)
en-CV English (Cape Verde)
en-CZ English (Czechia)
en-EE English (Estonia)
en-FI English (Finland)
en-HU English (Hungary)
en-LT English (Lithuania)
en-LV English (Latvia)
en-NO English (Norway)
en-PL English (Poland)
en-PT English (Portugal)
en-RU English (Russia)
en-SE English (Sweden)
en-SK English (Slovakia)
en-UA English (Ukraine)
en-ZA English (South Africa)
eo Esperanto
eo-001 Esperanto (world)
et Estonian
et-EE Estonian (Estonia)
ewo Ewondo
ewo-CM Ewondo (Cameroon)
ff Fula
ff-Latn Fula (Latin)
ff-Latn-BF Fula (Latin, Burkina Faso)
ff-Latn-CM Fula (Latin, Cameroon)
ff-Latn-GH Fula (Latin, Ghana)
ff-Latn-GM Fula (Latin, Gambia)
ff-Latn-GN Fula (Latin, Guinea)
ff-Latn-GW Fula (Latin, Guinea-Bissau)
ff-Latn-LR Fula (Latin, Liberia)
ff-Latn-MR Fula (Latin, Mauritania)
ff-Latn-NE Fula (Latin, Niger)
ff-Latn-NG Fula (Latin, Nigeria)
ff-Latn-SL Fula (Latin, Sierra Leone)
ff-Latn-SN Fula (Latin, Senegal)
fi Finnish
fi-FI Finnish (Finland)
fr-CA French (Canada)
hu Hungarian
hu-HU Hungarian (Hungary)
hy Armenian
hy-AM Armenian (Armenia)
ie Interlingue
ie-EE Interlingue (Estonia)
ka Georgian
ka-GE Georgian (Georgia)
kab Kabyle
kab-DZ Kabyle (Algeria)
kea Kabuverdianu
kea-CV Kabuverdianu (Cape Verde)
khq Koyra Chiini
khq-ML Koyra Chiini (Mali)
kk Kazakh
kk-Cyrl Kazakh (Cyrillic)
kk-Cyrl-KZ Kazakh (Cyrillic, Kazakhstan)
kk-KZ Kazakh (Kazakhstan)
ksf Bafia
ksf-CM Bafia (Cameroon)
ksh Colognian
ksh-DE Colognian (Germany)
ky Kyrgyz
ky-KG Kyrgyz (Kyrgyzstan)
lt Lithuanian
lt-LT Lithuanian (Lithuania)
lv Latvian
lv-LV Latvian (Latvia)
mfe Morisyen
mfe-MU Morisyen (Mauritius)
nb Norwegian Bokmål
nb-NO Norwegian Bokmål (Norway)
nb-SJ Norwegian Bokmål (Svalbard & Jan Mayen)
nmg Kwasio
nmg-CM Kwasio (Cameroon)
nn Norwegian Nynorsk
nn-NO Norwegian Nynorsk (Norway)
no Norwegian
nr South Ndebele
nr-ZA South Ndebele (South Africa)
nso Northern Sotho
nso-ZA Northern Sotho (South Africa)
oc Occitan
oc-ES Occitan (Spain)
oc-FR Occitan (France)
os Ossetic
os-GE Ossetic (Georgia)
os-RU Ossetic (Russia)
pl Polish
pl-PL Polish (Poland)
prg Prussian
prg-PL Prussian (Poland)
pt-AO Portuguese (Angola)
pt-CH Portuguese (Switzerland)
pt-CV Portuguese (Cape Verde)
pt-FR Portuguese (France)
pt-GQ Portuguese (Equatorial Guinea)
pt-GW Portuguese (Guinea-Bissau)
pt-LU Portuguese (Luxembourg)
pt-MO Portuguese (Macao)
pt-MZ Portuguese (Mozambique)
pt-PT Portuguese (Portugal)
pt-ST Portuguese (São Tomé & Príncipe)
pt-TL Portuguese (Timor-Leste)
ru Russian
ru-BY Russian (Belarus)
ru-KG Russian (Kyrgyzstan)
ru-KZ Russian (Kazakhstan)
ru-MD Russian (Moldova)
ru-RU Russian (Russia)
ru-UA Russian (Ukraine)
sah Sakha
sah-RU Sakha (Russia)
se North Sámi
se-FI North Sámi (Finland)
se-NO North Sámi (Norway)
se-SE North Sámi (Sweden)
ses Koyraboro Senni
ses-ML Koyraboro Senni (Mali)
shi Tachelhit
shi-Latn Tachelhit (Latin)
shi-Latn-MA Tachelhit (Latin, Morocco)
shi-Tfng Tachelhit (Tifinagh)
shi-Tfng-MA Tachelhit (Tifinagh, Morocco)
sk Slovak
sk-SK Slovak (Slovakia)
smn Inari Sami
smn-FI Inari Sami (Finland)
sq Albanian
sq-AL Albanian (Albania)
sq-MK Albanian (North Macedonia)
sq-XK Albanian (Kosovo)
ss Swati
ss-SZ Swati (Eswatini)
ss-ZA Swati (South Africa)
sv Swedish
sv-AX Swedish (Åland Islands)
sv-FI Swedish (Finland)
sv-SE Swedish (Sweden)
szl Silesian
szl-PL Silesian (Poland)
tg Tajik
tg-TJ Tajik (Tajikistan)
tk Turkmen
tk-TM Turkmen (Turkmenistan)
tok Toki Pona
tok-001 Toki Pona (world)
ts Tsonga
ts-ZA Tsonga (South Africa)
tt Tatar
tt-RU Tatar (Russia)
twq Tasawaq
twq-NE Tasawaq (Niger)
tzm Central Atlas Tamazight
tzm-MA Central Atlas Tamazight (Morocco)
uk Ukrainian
uk-UA Ukrainian (Ukraine)
uz Uzbek
uz-Cyrl Uzbek (Cyrillic)
uz-Cyrl-UZ Uzbek (Cyrillic, Uzbekistan)
uz-Latn Uzbek (Latin)
uz-Latn-UZ Uzbek (Latin, Uzbekistan)
ve Venda
ve-ZA Venda (South Africa)
xh Xhosa
xh-ZA Xhosa (South Africa)
yav Yangben
yav-CM Yangben (Cameroon)
zgh Tamazight, Standard Moroccan
zgh-MA Tamazight, Standard Moroccan (Morocco)

Cultures using Narrow NBSP (U+202F) as NumberGroupSeparator (47 cultures)

Culture Code Language
en-FR English (France)
es-HT Spanish (Haiti)
fr French
fr-BE French (Belgium)
fr-BF French (Burkina Faso)
fr-BI French (Burundi)
fr-BJ French (Benin)
fr-BL French (St. Barthélemy)
fr-CD French (Congo - Kinshasa)
fr-CF French (Central African Republic)
fr-CG French (Congo - Brazzaville)
fr-CI French (Côte d'Ivoire)
fr-CM French (Cameroon)
fr-DJ French (Djibouti)
fr-DZ French (Algeria)
fr-FR French (France)
fr-GA French (Gabon)
fr-GF French (French Guiana)
fr-GN French (Guinea)
fr-GP French (Guadeloupe)
fr-GQ French (Equatorial Guinea)
fr-HT French (Haiti)
fr-KM French (Comoros)
fr-MC French (Monaco)
fr-MF French (St. Martin)
fr-MG French (Madagascar)
fr-ML French (Mali)
fr-MQ French (Martinique)
fr-MR French (Mauritania)
fr-MU French (Mauritius)
fr-NC French (New Caledonia)
fr-NE French (Niger)
fr-PF French (French Polynesia)
fr-PM French (St. Pierre & Miquelon)
fr-RE French (Réunion)
fr-RW French (Rwanda)
fr-SC French (Seychelles)
fr-SN French (Senegal)
fr-SY French (Syria)
fr-TD French (Chad)
fr-TG French (Togo)
fr-TN French (Tunisia)
fr-VU French (Vanuatu)
fr-WF French (Wallis & Futuna)
fr-YT French (Mayotte)
vec Venetian
vec-IT Venetian (Italy)

Related Issues

Credit

Issue analysis and affected culture enumeration by @artl93 with GitHub Copilot assistance.
