
Jeong-Min-Cho/unicode-escaper


unicode-escaper

A robust, zero-dependency Unicode escape/unescape library for JavaScript and TypeScript. Supports multiple escape formats, bidirectional conversion, and streaming for large files.


Features

  • Multiple escape formats: \uXXXX, \u{XXXXX}, \xNN, &#xNNNN;, &#NNNN;, U+XXXX
  • Bidirectional: Both escape and unescape in one package
  • Streaming support: Process large files efficiently with Node.js and Web Streams
  • Full Unicode support: Handles BMP, supplementary planes, surrogate pairs, and emoji
  • Zero dependencies: Lightweight and fast
  • TypeScript-first: Written in TypeScript with strict types
  • Dual ESM/CJS: Works with both module systems
  • Customizable filters: Control exactly which characters to escape

Installation

npm install unicode-escaper
# or
pnpm add unicode-escaper
# or
yarn add unicode-escaper

Quick Start

import { escape, unescape } from "unicode-escaper";

// Escape non-ASCII characters
escape("Hello 世界");
// => 'Hello \u4E16\u754C'

// Unescape back to original
unescape("Hello \\u4E16\\u754C");
// => 'Hello 世界'

Escape Formats

| Format | Example | Description |
| --- | --- | --- |
| unicode | \u4E16 | Standard JavaScript Unicode escape (default) |
| unicode-es6 | \u{4E16} | ES6 Unicode escape (supports full range) |
| hex | \xE9 | Hex escape (0x00-0xFF only) |
| html-hex | &#x4E16; | HTML hexadecimal entity |
| html-decimal | &#19990; | HTML decimal entity |
| codepoint | U+4E16 | Unicode code point notation |
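To make the differences between the formats concrete, here is a minimal sketch that renders a single code point in each notation using only built-in string methods. `formatCodePoint` and the `Format` type are hypothetical names for illustration and are not part of the unicode-escaper API; the library's actual output (padding, casing) may differ.

```typescript
// Hypothetical helper for illustration only; not unicode-escaper's implementation.
type Format =
  | "unicode"
  | "unicode-es6"
  | "hex"
  | "html-hex"
  | "html-decimal"
  | "codepoint";

function formatCodePoint(cp: number, format: Format): string {
  const hex = cp.toString(16).toUpperCase();
  switch (format) {
    case "unicode": {
      // \uXXXX holds one UTF-16 code unit, so code points above 0xFFFF
      // are written as a surrogate pair (two escapes).
      const units = String.fromCodePoint(cp);
      let out = "";
      for (let i = 0; i < units.length; i++) {
        out += "\\u" + units.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0");
      }
      return out;
    }
    case "unicode-es6":
      return `\\u{${hex}}`;
    case "hex":
      if (cp > 0xff) throw new RangeError("\\xNN covers 0x00-0xFF only");
      return "\\x" + hex.padStart(2, "0");
    case "html-hex":
      return `&#x${hex};`;
    case "html-decimal":
      return `&#${cp};`;
    case "codepoint":
      return "U+" + hex.padStart(4, "0");
  }
}

formatCodePoint(0x4e16, "unicode"); // => '\u4E16'
formatCodePoint(0x1f600, "unicode"); // => '\uD83D\uDE00'
formatCodePoint(0x1f600, "unicode-es6"); // => '\u{1F600}'
```

Only the `unicode` format needs surrogate pairs; the other notations carry the full code point in a single token.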

API Reference

Core Functions

escape(input, options?)

Escapes Unicode characters in a string.

import { escape } from "unicode-escaper";

// Default: preserve ASCII, escape everything else
escape("Café 世界 😀");
// => 'Caf\u00E9 \u4E16\u754C \uD83D\uDE00'

// Use ES6 format for emoji (cleaner output)
escape("Hello 😀", { format: "unicode-es6" });
// => 'Hello \u{1F600}'

// HTML entities
escape("Café", { format: "html-hex" });
// => 'Caf&#xE9;'

escape("Café", { format: "html-decimal" });
// => 'Caf&#233;'

// Escape everything (including ASCII)
escape("Hi", { preserveAscii: false });
// => '\u0048\u0069'

// Preserve Latin-1 characters
escape("Café 世界", { preserveLatin1: true });
// => 'Café \u4E16\u754C'

// Lowercase hex digits
escape("世", { uppercase: false });
// => '\u4e16'

unescape(input, options?)

Unescapes Unicode sequences back to characters.

import { unescape } from "unicode-escaper";

// Automatically detects and unescapes all formats
unescape("\\u4E16"); // => '世'
unescape("\\u{1F600}"); // => '😀'
unescape("\\xE9"); // => 'é'
unescape("&#x4E16;"); // => '世'
unescape("&#19990;"); // => '世'
unescape("U+4E16"); // => '世'

// Handle surrogate pairs
unescape("\\uD83D\\uDE00"); // => '😀'

// Only unescape specific formats
unescape("\\u4E16 &#x4E16;", { formats: ["unicode"] });
// => '世 &#x4E16;'

// Strict mode (throws on invalid sequences)
unescape("\\uZZZZ", { lenient: false });
// => throws Error
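The surrogate-pair case above relies on standard UTF-16 arithmetic. The sketch below shows that arithmetic and a naive `\uXXXX` decoder built on `String.prototype.replace`. Both function names are hypothetical and only illustrate the idea; they are not how unicode-escaper is known to implement `unescape`.

```typescript
// Hypothetical helpers illustrating UTF-16 surrogate handling; not the library's code.
function combineSurrogates(high: number, low: number): number {
  if (high < 0xd800 || high > 0xdbff) throw new RangeError("not a high surrogate");
  if (low < 0xdc00 || low > 0xdfff) throw new RangeError("not a low surrogate");
  // Each surrogate contributes 10 bits; the result is offset by 0x10000.
  return 0x10000 + ((high - 0xd800) << 10) + (low - 0xdc00);
}

// Decoding each \uXXXX escape to one UTF-16 code unit is enough in JavaScript:
// adjacent high and low units form the supplementary character on their own.
function decodeUnicodeEscapes(input: string): string {
  return input.replace(/\\u([0-9A-Fa-f]{4})/g, (_match, hex: string) =>
    String.fromCharCode(parseInt(hex, 16)),
  );
}

const smiley = String.fromCodePoint(combineSurrogates(0xd83d, 0xde00)); // "😀"
const decoded = decodeUnicodeEscapes("\\uD83D\\uDE00"); // "😀"
```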

Convenience Functions

import {
  escapeToUnicode, // \uXXXX format
  escapeToUnicodeES6, // \u{XXXXX} format
  escapeToHex, // \xNN format
  escapeToHtmlHex, // &#xNNNN; format
  escapeToHtmlDecimal, // &#NNNN; format
  escapeToCodePoint, // U+XXXX format
  escapeAll, // Escape all characters
  escapeNonPrintable, // Escape control chars and non-ASCII
} from "unicode-escaper";

escapeToUnicodeES6("😀"); // => '\u{1F600}'
escapeToHtmlHex("世"); // => '&#x4E16;'
escapeAll("Hi"); // => '\u0048\u0069'

import {
  unescapeUnicode, // Only \uXXXX
  unescapeUnicodeES6, // Only \u{XXXXX}
  unescapeHex, // Only \xNN
  unescapeHtmlHex, // Only &#xNNNN;
  unescapeHtmlDecimal, // Only &#NNNN;
  unescapeCodePoint, // Only U+XXXX
  unescapeHtml, // Both HTML formats
  unescapeJs, // All JavaScript formats
} from "unicode-escaper";

Custom Filters

Control which characters to escape using filter functions:

import { escape, isNotAscii, isNotBmp, and, or, oneOf } from "unicode-escaper";

// Escape only non-ASCII (default behavior)
escape("Hello 世界", { filter: isNotAscii });

// Escape only emoji (non-BMP characters)
escape("Hello 世界 😀", { filter: isNotBmp });
// => 'Hello 世界 \uD83D\uDE00'

// Escape vowels
escape("Hello", { filter: oneOf("aeiouAEIOU") });
// => 'H\u0065ll\u006F'

// Combine filters
escape("Test", { filter: and(isNotAscii, isNotBmp) });

Available filters:

  • isAscii / isNotAscii - ASCII range (0x00-0x7F)
  • isLatin1 / isNotLatin1 - Latin-1 range (0x00-0xFF)
  • isBmp / isNotBmp - Basic Multilingual Plane (0x0000-0xFFFF)
  • isPrintableAscii / isNotPrintableAscii - Printable ASCII (0x20-0x7E)
  • isControl - Control characters
  • isWhitespace - Whitespace characters
  • isSurrogate / isHighSurrogate / isLowSurrogate - Surrogate code points
  • inRange(start, end) / notInRange(start, end) - Custom range
  • oneOf(chars) / noneOf(chars) - Character set
  • and(...filters) / or(...filters) / not(filter) - Combinators
  • all / none - Always true/false
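As a rough mental model for how the combinators in this list compose, the sketch below defines stand-in versions of `inRange`, `not`, `and`, and `or`. It assumes a filter is a `(char, codePoint) => boolean` function, matching the `FilterFunction` signature used elsewhere in this README, and that ranges are inclusive. The `*Sketch` names are hypothetical and are not the library's exports.

```typescript
// Assumed filter shape: returns true when the character should be escaped.
type Filter = (char: string, codePoint: number) => boolean;

// Stand-ins for the documented combinators (inclusive ranges are an assumption).
const inRangeSketch = (start: number, end: number): Filter =>
  (_char, cp) => cp >= start && cp <= end;
const notSketch = (f: Filter): Filter => (c, cp) => !f(c, cp);
const andSketch = (...fs: Filter[]): Filter => (c, cp) => fs.every((f) => f(c, cp));
const orSketch = (...fs: Filter[]): Filter => (c, cp) => fs.some((f) => f(c, cp));

// Example policy: escape C0/C1 control characters and anything outside Latin-1.
const isControlSketch = orSketch(inRangeSketch(0x00, 0x1f), inRangeSketch(0x7f, 0x9f));
const shouldEscape = orSketch(isControlSketch, notSketch(inRangeSketch(0x00, 0xff)));

shouldEscape("\n", 0x0a); // => true (control character)
shouldEscape("é", 0xe9); // => false (printable Latin-1)
shouldEscape("世", 0x4e16); // => true (outside Latin-1)
```

Because every combinator returns another filter, policies like `shouldEscape` can be built up from small named pieces and passed wherever a filter is accepted.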

Utility Functions

import {
  getCodePoint, // Get code point of a character
  fromCodePoint, // Create character from code point
  getCharInfo, // Get detailed character information
  toCodePoints, // Convert string to code point array
  fromCodePoints, // Convert code point array to string
  codePointLength, // Get length in code points (not UTF-16)
  toHex, // Convert code point to hex string
  parseHex, // Parse hex string to code point
  isValidUnicode, // Check for unpaired surrogates
  normalizeNFC, // Normalize to NFC
  normalizeNFD, // Normalize to NFD
  unicodeEquals, // Compare Unicode equivalence
} from "unicode-escaper";

// Get code point
getCodePoint("😀"); // => 128512 (0x1F600)

// Character info
getCharInfo("😀");
// => {
//   char: '😀',
//   codePoint: 128512,
//   hex: '1F600',
//   isAscii: false,
//   isBmp: false,
//   isLatin1: false,
//   isHighSurrogate: false,
//   isLowSurrogate: false,
//   utf16Length: 2
// }

// Code point length (differs from string.length for emoji)
"😀".length; // => 2 (UTF-16 code units)
codePointLength("😀"); // => 1 (code points)

// Parse various formats
parseHex("U+1F600"); // => 128512
parseHex("0x4E16"); // => 19990
parseHex("\\u{4E16}"); // => 19990
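For readers who want to see what these utilities boil down to, here is a sketch of three of them written against built-in string iteration. The `*Sketch` and `hasLoneSurrogate` names are hypothetical; the library's own functions may behave differently at the edges.

```typescript
// String iteration walks code points, not UTF-16 code units.
const toCodePointsSketch = (s: string): number[] =>
  Array.from(s, (ch) => ch.codePointAt(0)!);

const codePointLengthSketch = (s: string): number => Array.from(s).length;

// Returns true if the string contains a surrogate without its partner.
function hasLoneSurrogate(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      const next = s.charCodeAt(i + 1); // NaN past the end, which fails the range check
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the low half of a valid pair
        continue;
      }
      return true;
    }
    if (unit >= 0xdc00 && unit <= 0xdfff) return true;
  }
  return false;
}

toCodePointsSketch("a😀"); // => [97, 128512]
codePointLengthSketch("😀"); // => 1
hasLoneSurrogate("\uD83D"); // => true
```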

Streaming Support

Process large files efficiently without loading everything into memory:

Node.js Streams

import { createReadStream, createWriteStream } from "fs";
import { pipeline } from "stream/promises";
import { EscapeStream, UnescapeStream } from "unicode-escaper";

// Escape a file
await pipeline(
  createReadStream("input.txt", "utf8"),
  new EscapeStream({ escapeOptions: { format: "unicode-es6" } }),
  createWriteStream("escaped.txt")
);

// Unescape a file
await pipeline(
  createReadStream("escaped.txt", "utf8"),
  new UnescapeStream(),
  createWriteStream("output.txt")
);
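One detail any streaming unescaper has to handle is an escape sequence split across two chunks, for example `\u4E` at the end of one chunk and `16` at the start of the next. The sketch below shows one way to hold back such a tail until the next chunk arrives. `PartialEscapeBuffer` is a hypothetical class, it only considers the `\uXXXX` format, and it does not treat escaped backslashes specially; it is not a description of how `UnescapeStream` works internally.

```typescript
// Hypothetical buffer that withholds a possibly incomplete \uXXXX tail.
class PartialEscapeBuffer {
  private carry = "";

  // Returns the part of (carry + chunk) that is safe to decode now.
  push(chunk: string): string {
    const text = this.carry + chunk;
    // A trailing "\", "\u", or "\u" plus up to three hex digits
    // may still be completed by the next chunk.
    const tail = /\\(?:u[0-9A-Fa-f]{0,3})?$/.exec(text);
    const cut = tail ? tail.index : text.length;
    this.carry = text.slice(cut);
    return text.slice(0, cut);
  }

  // Call at end of input: whatever is left never became a full escape.
  flush(): string {
    const rest = this.carry;
    this.carry = "";
    return rest;
  }
}

const buffer = new PartialEscapeBuffer();
const first = buffer.push("Hello \\u4E"); // "Hello " (tail held back)
const second = buffer.push("16 done"); // "\u4E16 done" (now complete)
const leftover = buffer.flush(); // ""
```

A transform stream could call `push` for each incoming chunk, decode the returned text, and call `flush` when the input ends.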

Web Streams API

import {
  createWebEscapeStream,
  createWebUnescapeStream,
} from "unicode-escaper";

// Works in browsers and modern Node.js
const response = await fetch("data.txt");
if (!response.body) throw new Error("Response has no body");
const escaped = response.body
  .pipeThrough(new TextDecoderStream())
  .pipeThrough(createWebEscapeStream({ format: "html-hex" }))
  .pipeThrough(new TextEncoderStream());

Detection Utilities

import { hasEscapeSequences, countEscapeSequences } from "unicode-escaper";

hasEscapeSequences("\\u4E16"); // => true
hasEscapeSequences("Hello"); // => false

countEscapeSequences("\\u4E16\\u754C"); // => 2

// Filter by format
hasEscapeSequences("\\u4E16", ["unicode"]); // => true
hasEscapeSequences("\\u4E16", ["html-hex"]); // => false
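Detection of this kind can be approximated with one regular expression per format. The patterns and the `countSketch` function below are assumptions made for illustration; the library may recognise a stricter or looser syntax for each format.

```typescript
// Hypothetical per-format patterns; the library's real patterns are not documented here.
const PATTERNS = {
  unicode: /\\u[0-9A-Fa-f]{4}/g,
  "unicode-es6": /\\u\{[0-9A-Fa-f]{1,6}\}/g,
  hex: /\\x[0-9A-Fa-f]{2}/g,
  "html-hex": /&#x[0-9A-Fa-f]+;/g,
  "html-decimal": /&#[0-9]+;/g,
  codepoint: /U\+[0-9A-Fa-f]{4,6}/g,
} as const;

type FormatName = keyof typeof PATTERNS;

// Counts matches for the requested formats (all formats by default).
function countSketch(
  input: string,
  formats: FormatName[] = Object.keys(PATTERNS) as FormatName[],
): number {
  let total = 0;
  for (const format of formats) {
    total += (input.match(PATTERNS[format]) ?? []).length;
  }
  return total;
}

countSketch("\\u4E16\\u754C"); // => 2
countSketch("\\u4E16", ["html-hex"]); // => 0
```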

TypeScript Support

Full TypeScript support with strict types:

import type {
  EscapeFormat,
  EscapeOptions,
  UnescapeOptions,
  FilterFunction,
  CharacterInfo,
  EscapeResult,
} from "unicode-escaper";

// Type-safe options
const options: EscapeOptions = {
  format: "unicode-es6",
  preserveAscii: true,
  uppercase: true,
};

// Custom filter with proper typing
const myFilter: FilterFunction = (char, codePoint) => {
  return codePoint > 0x7f;
};

Comparison with escape-unicode

| Feature | escape-unicode | unicode-escaper |
| --- | --- | --- |
| Escape formats | \uXXXX only | 6 formats |
| Unescape | Separate package | Built-in |
| Streaming | No | Yes |
| Web Streams | No | Yes |
| ESM + CJS | CJS only | Both |
| Browser support | Node only | Both |
| TypeScript | Yes | Yes (strict) |
| Zero deps | Yes | Yes |

International Language Support

Fully tested with diverse Unicode scripts:

| Language | Script | Example | Escaped |
| --- | --- | --- | --- |
| Korean | Hangul | 안녕하세요 | \uC548\uB155\uD558\uC138\uC694 |
| Japanese | Hiragana/Katakana/Kanji | こんにちは | \u3053\u3093\u306B\u3061\u306F |
| Arabic | Arabic | مرحبا | \u0645\u0631\u062D\u0628\u0627 |
| Thai | Thai | สวัสดี | \u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35 |
| Russian | Cyrillic | Привет | \u041F\u0440\u0438\u0432\u0435\u0442 |
| Hindi | Devanagari | नमस्ते | \u0928\u092E\u0938\u094D\u0924\u0947 |
| Chinese | Han | 你好 | \u4F60\u597D |
| Vietnamese | Latin Extended | Xin chào | Xin ch\u00E0o |
| French | Latin Extended | Café | Caf\u00E9 |
| Turkish | Latin Extended | Türkçe | T\u00FCrk\u00E7e |
| Spanish | Latin Extended | ¡Hola! | \u00A1Hola! |
| Portuguese | Latin Extended | São Paulo | S\u00E3o Paulo |

import { escape, unescape } from "unicode-escaper";

// Korean
escape("안녕하세요"); // => '\uC548\uB155\uD558\uC138\uC694'

// Japanese (mixed scripts)
escape("東京 とうきょう トウキョウ");

// Arabic (RTL)
escape("مرحبا"); // => '\u0645\u0631\u062D\u0628\u0627'

// Thai (with tone marks)
escape("สวัสดี");

// Russian
escape("Привет"); // => '\u041F\u0440\u0438\u0432\u0435\u0442'

// Hindi (with combining marks)
escape("नमस्ते"); // => '\u0928\u092E\u0938\u094D\u0924\u0947'

// Chinese
escape("你好世界"); // => '\u4F60\u597D\u4E16\u754C'

// Vietnamese (with diacritics)
escape("Xin chào"); // => 'Xin ch\u00E0o'

// Turkish (special i variants)
escape("İstanbul"); // => '\u0130stanbul'

// Spanish (inverted punctuation)
escape("¡Hola!"); // => '\u00A1Hola!'

// Portuguese (tildes and cedilla)
escape("São Paulo"); // => 'S\u00E3o Paulo'

// Mixed multi-language content
const mixed = "Hello 안녕 こんにちは 你好 مرحبا สวัสดี Привет नमस्ते";
unescape(escape(mixed)) === mixed; // => true

Supported Features

  • Combining characters: Thai tone marks, Arabic diacritics, Hindi matras/virama, Vietnamese diacritics
  • Bidirectional text: RTL markers, mixed LTR/RTL content
  • Native numerals: Thai ๒๐๒๔, Arabic ٢٠٢٤, Devanagari २०२४
  • Conjunct consonants: Hindi samyuktakshar (क्ष, त्र, ज्ञ)
  • Supplementary planes: Emoji, ancient scripts, mathematical symbols
  • Normalization: Handles NFC/NFD forms correctly
  • Extended Latin: French accents, Turkish special i (ı İ), Spanish ñ, Portuguese ã/õ
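The normalization point matters in practice because the same visible text can consist of different code points, and therefore escape differently. The sketch below uses only the built-in `String.prototype.normalize`; `hexPoints` is a hypothetical helper for display.

```typescript
// List the code points of a string as 4-digit uppercase hex.
const hexPoints = (s: string): string[] =>
  Array.from(s, (ch) => ch.codePointAt(0)!.toString(16).toUpperCase().padStart(4, "0"));

const nfc = "Caf\u00E9"; // precomposed é (U+00E9)
const nfd = nfc.normalize("NFD"); // "e" followed by combining acute (U+0301)

hexPoints(nfc); // => ['0043', '0061', '0066', '00E9']
hexPoints(nfd); // => ['0043', '0061', '0066', '0065', '0301']

// Both forms render the same and round-trip through normalization.
nfd.normalize("NFC") === nfc; // => true
```

If escaped output needs to be stable across inputs, normalizing to one form before escaping is a reasonable precaution.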

Browser Support

Works in all modern browsers that support ES2022. For older browsers, you may need polyfills for:

  • String.prototype.codePointAt
  • String.fromCodePoint
  • Web Streams API (if using streaming)
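A quick way to find out whether polyfills are needed is to feature-detect the three items above at startup. This is a generic sketch using standard globals; the `support` object is a hypothetical name and nothing here comes from unicode-escaper itself.

```typescript
// Feature detection for the APIs listed above.
const support = {
  codePointAt: typeof String.prototype.codePointAt === "function",
  fromCodePoint: typeof String.fromCodePoint === "function",
  // Read through globalThis so the check does not throw where the global is missing.
  webStreams:
    typeof (globalThis as { TransformStream?: unknown }).TransformStream === "function",
};

// Names of the features that would need a polyfill in this environment.
const missing = Object.entries(support)
  .filter(([, available]) => !available)
  .map(([name]) => name);
```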

License

Apache-2.0