Description
UTF-8 number parsing fails for cultures that use NBSP (U+00A0) or Narrow NBSP (U+202F) as group/decimal separators. The IsSpaceReplacingChar function in Number.Parsing.Common.cs operates on Unicode codepoints, but UTF-8 encodes these characters as multi-byte sequences that the parser cannot recognize.
This issue was identified while investigating #120283 and #123783, which fix the UTF-16 (char) parsing path but cannot address the UTF-8 issue without architectural changes.
Root Cause
The IsSpaceReplacingChar function operates on Unicode codepoints:

    private static bool IsSpaceReplacingChar(uint c) => (c == '\u00a0') || (c == '\u202f');

But IUtfChar<byte>.CastToUInt32(byte) returns the raw byte value (0-255), not a decoded Unicode codepoint. UTF-8 encodes these characters as multi-byte sequences:
| Character | Unicode | Codepoint | UTF-8 Bytes | Parser Sees | Result |
|---|---|---|---|---|---|
| NBSP | U+00A0 | 160 | C2 A0 | 194, then 160 | ❌ Fails (194 ≠ 160) |
| Narrow NBSP | U+202F | 8239 | E2 80 AF | 226, 128, 175 | ❌ Fails (226 ≠ 8239) |
| Regular Space | U+0020 | 32 | 20 | 32 | ✅ Works |
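The mismatch in the table can be reproduced with a few lines of C#. This is only an illustrative sketch: `Utf8Hex` and `IsSpaceReplacingCodePoint` are hypothetical helpers written for this example, and the latter is a stand-in for the codepoint check, not the runtime's actual source.

```csharp
using System;
using System.Text;

// Hypothetical helper: UTF-8 bytes of a string as uppercase hex.
static string Utf8Hex(string s) => Convert.ToHexString(Encoding.UTF8.GetBytes(s));

// Hypothetical stand-in for a codepoint-based check (assumption, not runtime code).
static bool IsSpaceReplacingCodePoint(uint c) => c == 0x00A0 || c == 0x202F;

foreach (string s in new[] { "\u00a0", "\u202f", " " })
{
    byte firstByte = Encoding.UTF8.GetBytes(s)[0];
    // The codepoint passes the check for NBSP/NNBSP, but the first UTF-8 byte never does.
    Console.WriteLine(
        $"U+{(int)s[0]:X4}: bytes={Utf8Hex(s)}, " +
        $"codepoint check={IsSpaceReplacingCodePoint(s[0])}, " +
        $"first-byte check={IsSpaceReplacingCodePoint(firstByte)}");
}
```

For both NBSP and Narrow NBSP the codepoint satisfies the check while the leading byte (194 or 226) does not, which is the situation a byte-at-a-time parser is in.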
Suggested Fix
The MatchChars function needs to detect UTF-8 mode (TChar = byte) and recognize:
- Byte sequence C2 A0 as equivalent to 20 (NBSP → space)
- Byte sequence E2 80 AF as equivalent to 20 (NNBSP → space)
This requires careful handling to maintain byte-by-byte iteration performance.
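One possible shape for such a check is sketched below. It is not the runtime's MatchChars and makes a simplifying assumption: NBSP, Narrow NBSP, and the regular space are all folded to the same comparison unit, so any of the three matches any other. `ReadUnit` and `MatchSeparatorUtf8` are hypothetical names introduced for this example.

```csharp
using System;

// Hypothetical helper (not runtime code): reads one comparison unit from UTF-8 bytes.
// NBSP (C2 A0) and NNBSP (E2 80 AF) are folded to 0x20; any other byte is returned as-is.
static uint ReadUnit(ReadOnlySpan<byte> s, out int consumed)
{
    if (s.Length >= 2 && s[0] == 0xC2 && s[1] == 0xA0) { consumed = 2; return 0x20; }
    if (s.Length >= 3 && s[0] == 0xE2 && s[1] == 0x80 && s[2] == 0xAF) { consumed = 3; return 0x20; }
    consumed = 1;
    return s[0];
}

// Hypothetical matcher: returns the number of input bytes consumed when `input`
// starts with `separator` under the folding above, or -1 when it does not match.
static int MatchSeparatorUtf8(ReadOnlySpan<byte> input, ReadOnlySpan<byte> separator)
{
    if (separator.IsEmpty) return -1;
    int i = 0, j = 0;
    while (j < separator.Length)
    {
        if (i >= input.Length) return -1;
        uint a = ReadUnit(input.Slice(i), out int inputBytes);
        uint b = ReadUnit(separator.Slice(j), out int separatorBytes);
        if (a != b) return -1;
        i += inputBytes;
        j += separatorBytes;
    }
    return i;
}

byte[] nbsp = { 0xC2, 0xA0 };
Console.WriteLine(MatchSeparatorUtf8(new byte[] { 0x20, 0x32 }, nbsp));       // 1: regular space in input
Console.WriteLine(MatchSeparatorUtf8(new byte[] { 0xC2, 0xA0, 0x32 }, nbsp)); // 2: NBSP in input
Console.WriteLine(MatchSeparatorUtf8(new byte[] { 0x2C, 0x32 }, nbsp));       // -1: comma does not match
```

A real fix would likely keep the plain byte comparison as the fast path and only fall back to sequence detection when a lead byte of C2 or E2 is seen, so that ASCII-only input pays no extra cost. It would also need to decide whether the broader folding used here is acceptable.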
Reproduction

    using System.Globalization;
    using System.Numerics;
    using System.Text;

    var culture = new CultureInfo("uk-UA");
    string input = "1\u00a0234\u00a0567"; // NBSP as thousands separator
    byte[] utf8Input = Encoding.UTF8.GetBytes(input);

    // UTF-16 parsing works (after #123783)
    BigInteger.Parse(input, NumberStyles.AllowThousands, culture); // ✅

    // UTF-8 parsing fails
    BigInteger.Parse(utf8Input, NumberStyles.AllowThousands, culture); // ❌ FormatException

Exhaustive List of Affected Cultures
Enumerated via CultureInfo.GetCultures(CultureTypes.AllCultures) on .NET 10.
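An enumeration along these lines can be sketched as follows. `CulturesWithGroupSeparator` is a hypothetical helper, not the script that produced the lists below. The counts it prints depend on the globalization data (ICU or NLS version, invariant mode) of the machine it runs on, so they may differ from the numbers in this issue.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helper: names of all cultures whose NumberGroupSeparator equals `separator`.
static string[] CulturesWithGroupSeparator(string separator) =>
    CultureInfo.GetCultures(CultureTypes.AllCultures)
        .Where(c => c.NumberFormat.NumberGroupSeparator == separator)
        .Select(c => c.Name)
        .OrderBy(name => name, StringComparer.Ordinal)
        .ToArray();

string[] nbspCultures = CulturesWithGroupSeparator("\u00a0");
string[] nnbspCultures = CulturesWithGroupSeparator("\u202f");

Console.WriteLine($"NBSP (U+00A0): {nbspCultures.Length} cultures");
Console.WriteLine($"Narrow NBSP (U+202F): {nnbspCultures.Length} cultures");
```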
Cultures using NBSP (U+00A0) as NumberGroupSeparator (177 cultures)
| Culture Code | Language |
|---|---|
| af | Afrikaans |
| af-NA | Afrikaans (Namibia) |
| af-ZA | Afrikaans (South Africa) |
| agq | Aghem |
| agq-CM | Aghem (Cameroon) |
| bas | Basaa |
| bas-CM | Basaa (Cameroon) |
| be | Belarusian |
| be-BY | Belarusian (Belarus) |
| bg | Bulgarian |
| bg-BG | Bulgarian (Bulgaria) |
| blo | Anii |
| blo-BJ | Anii (Benin) |
| br | Breton |
| br-FR | Breton (France) |
| cs | Czech |
| cs-CZ | Czech (Czechia) |
| cv | Chuvash |
| cv-RU | Chuvash (Russia) |
| dje | Zarma |
| dje-NE | Zarma (Niger) |
| dua | Duala |
| dua-CM | Duala (Cameroon) |
| dyo | Jola-Fonyi |
| dyo-SN | Jola-Fonyi (Senegal) |
| en-AL | English (Albania) |
| en-BG | English (Bulgaria) |
| en-CV | English (Cape Verde) |
| en-CZ | English (Czechia) |
| en-EE | English (Estonia) |
| en-FI | English (Finland) |
| en-HU | English (Hungary) |
| en-LT | English (Lithuania) |
| en-LV | English (Latvia) |
| en-NO | English (Norway) |
| en-PL | English (Poland) |
| en-PT | English (Portugal) |
| en-RU | English (Russia) |
| en-SE | English (Sweden) |
| en-SK | English (Slovakia) |
| en-UA | English (Ukraine) |
| en-ZA | English (South Africa) |
| eo | Esperanto |
| eo-001 | Esperanto (world) |
| et | Estonian |
| et-EE | Estonian (Estonia) |
| ewo | Ewondo |
| ewo-CM | Ewondo (Cameroon) |
| ff | Fula |
| ff-Latn | Fula (Latin) |
| ff-Latn-BF | Fula (Latin, Burkina Faso) |
| ff-Latn-CM | Fula (Latin, Cameroon) |
| ff-Latn-GH | Fula (Latin, Ghana) |
| ff-Latn-GM | Fula (Latin, Gambia) |
| ff-Latn-GN | Fula (Latin, Guinea) |
| ff-Latn-GW | Fula (Latin, Guinea-Bissau) |
| ff-Latn-LR | Fula (Latin, Liberia) |
| ff-Latn-MR | Fula (Latin, Mauritania) |
| ff-Latn-NE | Fula (Latin, Niger) |
| ff-Latn-NG | Fula (Latin, Nigeria) |
| ff-Latn-SL | Fula (Latin, Sierra Leone) |
| ff-Latn-SN | Fula (Latin, Senegal) |
| fi | Finnish |
| fi-FI | Finnish (Finland) |
| fr-CA | French (Canada) |
| hu | Hungarian |
| hu-HU | Hungarian (Hungary) |
| hy | Armenian |
| hy-AM | Armenian (Armenia) |
| ie | Interlingue |
| ie-EE | Interlingue (Estonia) |
| ka | Georgian |
| ka-GE | Georgian (Georgia) |
| kab | Kabyle |
| kab-DZ | Kabyle (Algeria) |
| kea | Kabuverdianu |
| kea-CV | Kabuverdianu (Cape Verde) |
| khq | Koyra Chiini |
| khq-ML | Koyra Chiini (Mali) |
| kk | Kazakh |
| kk-Cyrl | Kazakh (Cyrillic) |
| kk-Cyrl-KZ | Kazakh (Cyrillic, Kazakhstan) |
| kk-KZ | Kazakh (Kazakhstan) |
| ksf | Bafia |
| ksf-CM | Bafia (Cameroon) |
| ksh | Colognian |
| ksh-DE | Colognian (Germany) |
| ky | Kyrgyz |
| ky-KG | Kyrgyz (Kyrgyzstan) |
| lt | Lithuanian |
| lt-LT | Lithuanian (Lithuania) |
| lv | Latvian |
| lv-LV | Latvian (Latvia) |
| mfe | Morisyen |
| mfe-MU | Morisyen (Mauritius) |
| nb | Norwegian Bokmål |
| nb-NO | Norwegian Bokmål (Norway) |
| nb-SJ | Norwegian Bokmål (Svalbard & Jan Mayen) |
| nmg | Kwasio |
| nmg-CM | Kwasio (Cameroon) |
| nn | Norwegian Nynorsk |
| nn-NO | Norwegian Nynorsk (Norway) |
| no | Norwegian |
| nr | South Ndebele |
| nr-ZA | South Ndebele (South Africa) |
| nso | Northern Sotho |
| nso-ZA | Northern Sotho (South Africa) |
| oc | Occitan |
| oc-ES | Occitan (Spain) |
| oc-FR | Occitan (France) |
| os | Ossetic |
| os-GE | Ossetic (Georgia) |
| os-RU | Ossetic (Russia) |
| pl | Polish |
| pl-PL | Polish (Poland) |
| prg | Prussian |
| prg-PL | Prussian (Poland) |
| pt-AO | Portuguese (Angola) |
| pt-CH | Portuguese (Switzerland) |
| pt-CV | Portuguese (Cape Verde) |
| pt-FR | Portuguese (France) |
| pt-GQ | Portuguese (Equatorial Guinea) |
| pt-GW | Portuguese (Guinea-Bissau) |
| pt-LU | Portuguese (Luxembourg) |
| pt-MO | Portuguese (Macao) |
| pt-MZ | Portuguese (Mozambique) |
| pt-PT | Portuguese (Portugal) |
| pt-ST | Portuguese (São Tomé & Príncipe) |
| pt-TL | Portuguese (Timor-Leste) |
| ru | Russian |
| ru-BY | Russian (Belarus) |
| ru-KG | Russian (Kyrgyzstan) |
| ru-KZ | Russian (Kazakhstan) |
| ru-MD | Russian (Moldova) |
| ru-RU | Russian (Russia) |
| ru-UA | Russian (Ukraine) |
| sah | Sakha |
| sah-RU | Sakha (Russia) |
| se | North Sámi |
| se-FI | North Sámi (Finland) |
| se-NO | North Sámi (Norway) |
| se-SE | North Sámi (Sweden) |
| ses | Koyraboro Senni |
| ses-ML | Koyraboro Senni (Mali) |
| shi | Tachelhit |
| shi-Latn | Tachelhit (Latin) |
| shi-Latn-MA | Tachelhit (Latin, Morocco) |
| shi-Tfng | Tachelhit (Tifinagh) |
| shi-Tfng-MA | Tachelhit (Tifinagh, Morocco) |
| sk | Slovak |
| sk-SK | Slovak (Slovakia) |
| smn | Inari Sami |
| smn-FI | Inari Sami (Finland) |
| sq | Albanian |
| sq-AL | Albanian (Albania) |
| sq-MK | Albanian (North Macedonia) |
| sq-XK | Albanian (Kosovo) |
| ss | Swati |
| ss-SZ | Swati (Eswatini) |
| ss-ZA | Swati (South Africa) |
| sv | Swedish |
| sv-AX | Swedish (Åland Islands) |
| sv-FI | Swedish (Finland) |
| sv-SE | Swedish (Sweden) |
| szl | Silesian |
| szl-PL | Silesian (Poland) |
| tg | Tajik |
| tg-TJ | Tajik (Tajikistan) |
| tk | Turkmen |
| tk-TM | Turkmen (Turkmenistan) |
| tok | Toki Pona |
| tok-001 | Toki Pona (world) |
| ts | Tsonga |
| ts-ZA | Tsonga (South Africa) |
| tt | Tatar |
| tt-RU | Tatar (Russia) |
| twq | Tasawaq |
| twq-NE | Tasawaq (Niger) |
| tzm | Central Atlas Tamazight |
| tzm-MA | Central Atlas Tamazight (Morocco) |
| uk | Ukrainian |
| uk-UA | Ukrainian (Ukraine) |
| uz | Uzbek |
| uz-Cyrl | Uzbek (Cyrillic) |
| uz-Cyrl-UZ | Uzbek (Cyrillic, Uzbekistan) |
| uz-Latn | Uzbek (Latin) |
| uz-Latn-UZ | Uzbek (Latin, Uzbekistan) |
| ve | Venda |
| ve-ZA | Venda (South Africa) |
| xh | Xhosa |
| xh-ZA | Xhosa (South Africa) |
| yav | Yangben |
| yav-CM | Yangben (Cameroon) |
| zgh | Tamazight, Standard Moroccan |
| zgh-MA | Tamazight, Standard Moroccan (Morocco) |
Cultures using Narrow NBSP (U+202F) as NumberGroupSeparator (47 cultures)
| Culture Code | Language |
|---|---|
| en-FR | English (France) |
| es-HT | Spanish (Haiti) |
| fr | French |
| fr-BE | French (Belgium) |
| fr-BF | French (Burkina Faso) |
| fr-BI | French (Burundi) |
| fr-BJ | French (Benin) |
| fr-BL | French (St. Barthélemy) |
| fr-CD | French (Congo - Kinshasa) |
| fr-CF | French (Central African Republic) |
| fr-CG | French (Congo - Brazzaville) |
| fr-CI | French (Côte d'Ivoire) |
| fr-CM | French (Cameroon) |
| fr-DJ | French (Djibouti) |
| fr-DZ | French (Algeria) |
| fr-FR | French (France) |
| fr-GA | French (Gabon) |
| fr-GF | French (French Guiana) |
| fr-GN | French (Guinea) |
| fr-GP | French (Guadeloupe) |
| fr-GQ | French (Equatorial Guinea) |
| fr-HT | French (Haiti) |
| fr-KM | French (Comoros) |
| fr-MC | French (Monaco) |
| fr-MF | French (St. Martin) |
| fr-MG | French (Madagascar) |
| fr-ML | French (Mali) |
| fr-MQ | French (Martinique) |
| fr-MR | French (Mauritania) |
| fr-MU | French (Mauritius) |
| fr-NC | French (New Caledonia) |
| fr-NE | French (Niger) |
| fr-PF | French (French Polynesia) |
| fr-PM | French (St. Pierre & Miquelon) |
| fr-RE | French (Réunion) |
| fr-RW | French (Rwanda) |
| fr-SC | French (Seychelles) |
| fr-SN | French (Senegal) |
| fr-SY | French (Syria) |
| fr-TD | French (Chad) |
| fr-TG | French (Togo) |
| fr-TN | French (Tunisia) |
| fr-VU | French (Vanuatu) |
| fr-WF | French (Wallis & Futuna) |
| fr-YT | French (Mayotte) |
| vec | Venetian |
| vec-IT | Venetian (Italy) |
Related Issues
- System.Numerics.Tests.parseTest.RunParseToStringTests(culture: uk-UA) test failures #120283 - Original test failure report for uk-UA
- Fix BigInteger char parsing with Ukrainian culture NBSP handling (bidirectional) #123783 - Fixes UTF-16 parsing (char path) for NBSP handling
Credit
Issue analysis and affected culture enumeration by @artl93 with GitHub Copilot assistance.