Implement comment syntax and sorting rules for txt dictionaries #1014
Closed
* Add WASM demo scaffold and project notes
* Add OpenCC WASM demo with converter UI and test runner
  - Add usage notes for the WASM build output in front-end JS
* Polish WASM demo UI and paths, run tests, and streamline converter export
* Add wasm-based OpenCC package and update demo to consume it
* Add wasm-based OpenCC package, static demo bundle, and benchmarking page
* Add copyright notice and LICENSE
…eparation

This commit enhances the opencc-wasm library with TypeScript support and implements a cleaner build architecture with semantic separation between intermediate build artifacts and publishable distribution.

TypeScript Support:
- Add comprehensive type definitions (index.d.ts) with full JSDoc documentation
- Define interfaces: ConverterOptions, ConverterFunction, OpenCCNamespace, etc.
- Provide complete type safety for better IDE support and developer experience

Build Architecture Redesign (semantic separation):
- build/ - Intermediate WASM artifacts (gitignored, for tests/development)
  * build/opencc-wasm.esm.js - ESM WASM glue
  * build/opencc-wasm.cjs - CJS WASM glue
  * build/opencc-wasm.wasm - WASM binary
- dist/ - Publishable distribution (committed, for npm)
  * dist/esm/ - ESM package entry
  * dist/cjs/ - CJS package entry
  * dist/data/ - OpenCC config and dictionary files

Invariants and Semantics:
- Tests import source (index.js) → loads from build/
- Published package exports dist/ only
- build/ = internal intermediate artifacts
- dist/ = publishable artifacts
- Clear separation ensures tests validate actual build output

Enhanced .gitignore:
- Add build/ to gitignore (intermediate artifacts)
- Add node_modules/, logs, OS-specific files (.DS_Store, Thumbs.db)
- Exclude editor configurations (.vscode/, .idea/)
- Add cache and temporary file exclusions

Two-Stage Build Process:

Stage 1 (build.sh):
- Compiles C++ to WASM using Emscripten
- Outputs to build/ directory

Stage 2 (build-api.js):
- Copies WASM artifacts from build/ to dist/
- Transforms source paths for production
- Generates API wrappers for ESM and CJS
- Copies data files

Package Configuration (package.json):
- Add "types" field pointing to index.d.ts
- Update "main" and "module" to point to API wrappers in dist/
- Add comprehensive "exports" map:
  * "." - Main API (ESM/CJS wrappers)
  * "./wasm" - Direct access to WASM glue for advanced users
  * "./dist/*" - Wildcard for flexible file access
- Include LICENSE and NOTICE in published files

Documentation:
- Add comprehensive README section explaining build architecture
- Document project structure with invariants
- Explain semantic separation between build/ and dist/

Benefits:
- Better TypeScript integration and IDE autocomplete
- Cleaner, more maintainable directory structure
- Tests validate actual build output, not stale dist files
- Clear semantic separation between internal and publishable artifacts
- Professional project setup following modern npm best practices
- Long-term maintainability through clear invariants
…cases.json (#10)

- add refresh_assets.sh to rebuild/copy only config-referenced .ocd2 files and testcases.json
- convert wasm-lib tests to consume the new `{cases:[...]}` JSON format
- update bundled .ocd2 dictionaries and testcases.json fixtures

----

* wasm-lib: refresh assets script and switch tests to consolidated testcases.json
  - add refresh_assets.sh to rebuild/copy only config-referenced .ocd2 files and testcases.json
  - convert wasm-lib tests to consume the new `{cases:[...]}` JSON format
  - update bundled .ocd2 dictionaries and testcases.json fixtures
* Rebuild the wasm-lib and update the documentation
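A minimal sketch of how a test might consume the consolidated `{cases:[...]}` format. Only the top-level `cases` array comes from the commit message; the per-case fields `id`, `input`, and `expected` (a map from config name to expected output) and the `convert` callback are assumptions made for illustration.

```js
// Hypothetical case shape: { id, input, expected: { <configName>: <output> } }.
// `convert(configName, input)` is a stand-in for whatever converter the tests use.
function runCases(testcases, convert) {
  const failures = [];
  for (const testCase of testcases.cases) {
    for (const [config, expected] of Object.entries(testCase.expected)) {
      const actual = convert(config, testCase.input);
      if (actual !== expected) {
        failures.push({ id: testCase.id, config, expected, actual });
      }
    }
  }
  return failures; // an empty array means every case passed
}
```

Returning the failures instead of throwing on the first mismatch lets one run report every broken case across all configs.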
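Add a complete contribution guide covering:
- How to add dictionary entries (stressing that fields are separated by a Tab character)
- How to use the sorting tool to keep dictionaries correctly sorted
- How to install Bazel and run the tests
- How to write test cases (test-driven development workflow)
- Special considerations for Simplified-to-Traditional conversion (multiple configs must be tested)

Written in Taiwan Traditional Chinese.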
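1. Add a document analyzing the algorithm and its theoretical limitations
   - Describe the forward maximum matching segmentation algorithm in detail
   - Analyze the conversion chain mechanism and the dictionary system
   - Discuss theoretical limitations (one-to-many ambiguity, lack of contextual understanding, maintenance burden)
   - Compare with modern approaches (statistical models, neural networks)
2. Update AGENTS.md
   - Add a "Further Reading" section
   - Link to the technical documents and the contribution guide
3. Add Claude Code configuration
   - .claude/hooks/session_start.sh - shows project information at session start
   - .claude/skills/opencc-dict-edit.md - dictionary editing skill
   - .claude/skills/opencc-algorithm-explain.md - algorithm explanation skill

These configurations help AI agents better understand the OpenCC project architecture and development workflow.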
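For readers unfamiliar with forward maximum matching, a toy sketch follows. It is not OpenCC's C++ implementation: the function names and the three-entry dictionary are invented for illustration, and the dictionary is a plain `Map` instead of a compiled .ocd2 file.

```js
// Toy forward maximum matching: at each position take the longest dictionary
// key that matches, falling back to a single character when nothing matches.
function segmentMaxForward(text, dict) {
  const chars = Array.from(text); // split by code point, not UTF-16 unit
  let maxLen = 1;
  for (const key of dict.keys()) maxLen = Math.max(maxLen, Array.from(key).length);

  const segments = [];
  let i = 0;
  while (i < chars.length) {
    let take = 1;
    for (let len = Math.min(maxLen, chars.length - i); len > 1; len--) {
      if (dict.has(chars.slice(i, i + len).join(""))) {
        take = len;
        break;
      }
    }
    segments.push(chars.slice(i, i + take).join(""));
    i += take;
  }
  return segments;
}

// Convert each segment through the dictionary; unknown segments pass through.
function convertWithDict(text, dict) {
  return segmentMaxForward(text, dict).map((s) => dict.get(s) ?? s).join("");
}

// Illustrative entries only.
const toyDict = new Map([["头发", "頭髮"], ["头", "頭"], ["发", "發"]]);
console.log(convertWithDict("头发发白", toyDict)); // 頭髮發白
```

The example also shows the one-to-many limitation named above: 发 becomes 髮 only because the longer phrase 头发 happens to be in the dictionary.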
🚨 BREAKING CHANGE: New distribution layout
The .wasm files have been moved to be co-located with their corresponding
glue code files, fixing loading issues and enabling proper CDN usage.
New layout:
  dist/
    esm/
      opencc-wasm.js
      opencc-wasm.wasm   ← Now here (same directory)
    cjs/
      opencc-wasm.cjs
      opencc-wasm.wasm   ← Now here (same directory)
    opencc-wasm.wasm     ← Kept for legacy compatibility
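Why co-location helps, as a rough sketch: when the .wasm file sits next to its glue file, its URL can be derived from the glue file's own URL, which also works on a CDN. The helper name and the example host below are hypothetical; the actual glue code may locate the binary differently.

```js
// Resolve the .wasm file relative to the URL of the glue script that loads it.
// `glueUrl` would typically be import.meta.url in an ES module.
function wasmUrlFor(glueUrl, wasmName = "opencc-wasm.wasm") {
  return new URL(wasmName, glueUrl).href;
}

// Hypothetical CDN location, for illustration only.
console.log(wasmUrlFor("https://cdn.example.com/pkg/dist/esm/opencc-wasm.js"));
// https://cdn.example.com/pkg/dist/esm/opencc-wasm.wasm
```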
Features:
- ✅ CDN support: Can now import directly from jsDelivr/unpkg
- ✅ Fixed WASM loading in various bundlers and environments
- ✅ Comprehensive test suite with CDN usage tests
- ✅ Complete documentation (CDN_USAGE.md, TESTING.md, CHANGELOG.md)
Test suite:
- npm test → Run all tests (core + CDN)
- npm run test:core → Run 56 core functionality tests
- npm run test:cdn → Run CDN usage tests
All 56 core tests + CDN tests pass successfully.
Usage example:
```js
import OpenCC from "https://cdn.jsdelivr.net/npm/[email protected]/dist/esm/index.js";
const converter = OpenCC.Converter({ from: "cn", to: "t" });
const result = await converter("简体中文");
```
Co-authored-by: Claude <[email protected]>
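- Add a "Project Description" section at the top, stating that this project is a fork of BYVoid/OpenCC
- Explain its two main goals: a WASM implementation and vocabulary expansion
- Add a "Background" subsection describing the maintenance status of existing third-party implementations and where this project fits
- Keep the original README content intact below a divider for reference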
This commit adds significant improvements to opencc-wasm:
**API Enhancements:**
- Add `config` parameter to Converter() as intuitive alternative to `from`/`to`
- Support direct OpenCC config file names (e.g., `{ config: "s2twp" }`)
- Expand CONFIG_MAP to support all conversion types and aliases
- Maintain backward compatibility with `from`/`to` parameters
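A sketch of how the two option styles could resolve to one config name. The map entries, the function name, and the rule that `config` wins when both styles are given are assumptions for illustration; the real CONFIG_MAP in opencc-wasm may differ.

```js
// Hypothetical subset of a locale-pair → config-name map.
const CONFIG_MAP = {
  "cn->t": "s2t",
  "t->cn": "t2s",
  "cn->twp": "s2twp",
};

// Accept either { config: "s2twp" } or { from: "cn", to: "t" }.
function resolveConfigName(options) {
  if (options.config) {
    // Assumption: an explicit config name takes precedence over from/to.
    return options.config.replace(/\.json$/, "");
  }
  const pair = `${options.from}->${options.to}`;
  const name = CONFIG_MAP[pair];
  if (!name) throw new Error(`unsupported conversion: ${pair}`);
  return name;
}
```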
**Documentation Improvements:**
- Consolidate all API documentation into comprehensive README.md
- Add Traditional Chinese README (README.zh-TW.md) with Taiwan localization
- Emphasize "zero configuration" and "3-line start" features
- Include practical examples for React, Vue, Node.js, and Web Workers
- Add best practices and FAQ sections
- Create interactive demo (test/demo-out-of-box.html)
**User Experience:**
- Clarify auto-loading of configs and dictionaries from CDN
- Show both API methods side-by-side for user choice
- Provide TypeScript usage examples
All 56 core tests + new config parameter tests passing.
…'方程式' See [ByVoid Issue BYVoid#714](BYVoid#714).
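Add a Traditional/Simplified conversion mode based on the Table of General Standard Chinese Characters (《通用規範漢字表》, 2013), which converts the various Traditional Chinese standards to the Traditional forms standardized by the Chinese government.

1. **t2cngov.json** - Traditional to government standard (full conversion)
   - Normalizes Traditional variants: 溼 → 濕
   - Converts Simplified to standard Traditional: 湿 → 濕
   - Partial Traditional-to-Simplified conversion: 淨 → 净
2. **t2cngov_keep_simp.json** - Traditional to government standard (Simplified preserved)
   - Preserves Simplified characters used intentionally in the source text
   - Converts only Traditional variant characters

Third-party dictionary source:
- Author: TerryTian-tech
- License: Apache License 2.0
- Reference standard: 《通用規範漢字表》 (2013)

Dictionary files:
- TGCharacters.txt (37KB → 45KB ocd2) - about 4,000 character mappings
- TGCharacters_keep_simp.txt (13KB → 21KB ocd2) - keeps Simplified variants
- TGPhrases.txt (1.1MB → 911KB ocd2) - about 7,000 phrase mappings

- data/CMakeLists.txt: build the cngov dictionaries (flat naming, hierarchical install)
- test/CMakeLists.txt: integrate the test cases
- data/dictionary/cngov/BUILD.bazel: build rules for the cngov dictionaries
- data/config/BUILD.bazel: add cngov_validation_test
- test/testcases/BUILD.bazel: add the cngov_testcases filegroup
- test/CommandLineConvertTest.cpp: add the ConvertCNGovFromJson test function
- test/testcases/cngov_testcases.json: 5 dedicated test cases
- data/config/CNGovValidationTest.cpp: standalone Bazel test
- Test commands:
  * bazel test //data/config:cngov_validation_test
  * bazel test //data/...
- wasm-lib/data/dict/cngov/*.ocd2: compiled dictionaries
- wasm-lib/test/cngov_testcases.json: test cases
- wasm-lib/test/cngov.test.js: Node.js test code
- wasm-lib/scripts/refresh_assets.sh: updated to support subdirectories and cngov
- README.md: add usage notes for the CN Government Standard Mode
- wasm-lib/README.md & README.zh.md: add a t2cngov entry to the config table
- data/dictionary/cngov/README.txt: dictionary source and copyright notice

```bash
echo "盫" | opencc -c t2cngov.json # → 盦
echo "简体混杂繁體" | opencc -c t2cngov.json # → 簡體混雜繁體
echo "潮溼的露臺" | opencc -c t2cngov.json # → 潮濕的露臺
echo "一乾二淨" | opencc -c t2cngov.json # → 一乾二净
```

- Subdirectory isolation: third-party dictionaries live in data/dictionary/cngov/
- Independent tests: avoids merge conflicts with the upstream testcases.json
- Dual build systems: both CMake and Bazel are supported
- Complete metadata: the JSON configs include author, license, and contributor information
- Dictionary compression: the ocd2 format reduces size by 70-80%

Based on the research of TerryTian-tech; the integration complies with the Apache License 2.0.

Contributors: TerryTian-tech, Yi Jianpeng, Hu Xinmei, Duan Yatong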
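To make the difference between the two configs concrete, here is a toy character-level sketch using only the mappings quoted above (溼 → 濕, 湿 → 濕). The function and table names are invented; the real configs chain compiled phrase and character dictionaries, which this does not model.

```js
// Toy tables built from the examples in the commit message.
const VARIANT_TO_STANDARD = new Map([["溼", "濕"]]);    // Traditional variant → standard form
const SIMPLIFIED_TO_STANDARD = new Map([["湿", "濕"]]); // Simplified → standard Traditional

// keepSimplified = false mimics the idea of t2cngov.json,
// keepSimplified = true mimics the idea of t2cngov_keep_simp.json.
function toGovStandard(text, keepSimplified) {
  return Array.from(text, (ch) => {
    if (VARIANT_TO_STANDARD.has(ch)) return VARIANT_TO_STANDARD.get(ch);
    if (!keepSimplified && SIMPLIFIED_TO_STANDARD.has(ch)) {
      return SIMPLIFIED_TO_STANDARD.get(ch);
    }
    return ch;
  }).join("");
}

console.log(toGovStandard("潮溼", false)); // 潮濕
console.log(toGovStandard("潮湿", false)); // 潮濕
console.log(toGovStandard("潮湿", true));  // 潮湿 (Simplified character preserved)
```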
Ensures that the build is always run before publishing to npm, preventing the publication of stale build artifacts.
This commit adds detailed Chinese-language documentation analyzing the two critical security vulnerabilities fixed in the previous commit.

## Documentation Files

### 1. doc/ISSUE_997_ANALYSIS.md

Comprehensive analysis of the MaxMatchSegmentation buffer overflow (GitHub Issue BYVoid#997):
- Problem description and crash location
- Root cause analysis with step-by-step execution trace
- Detailed explanation of integer underflow mechanism
- Comparison: why normal text doesn't trigger vs. malicious input
- Solution design and correctness proof
- Test case documentation
- Security impact assessment (CVSS ~7.5)
- Best practices and lessons learned
- Prevention strategies for similar issues

Key sections:
- Actual demonstration of the bug with hex output
- Multi-layer defense architecture explanation
- Reference to related CVE/CWE entries

### 2. doc/CONVERSION_INFORMATION_DISCLOSURE.md

In-depth security analysis of the Conversion.cpp information disclosure vulnerability (more severe than BYVoid#997):
- Complete vulnerability description
- Attack scenario with memory layout diagrams
- Step-by-step exploit demonstration showing heap data leakage
- Direct comparison with Issue BYVoid#997 (why this is worse)
- Exploitability analysis with test results
- Information that could be leaked (keys, passwords, etc.)
- Security impact: CWE-125, CWE-200, CVSS ~8.6
- Detailed fix explanation with multi-layer defense
- Why normal usage was not affected
- CVE recommendation and scoring rationale

Key highlights:
- Demonstrates actual heap memory leakage (0xAA bytes, "ABC" strings)
- Shows that leaked data IS OUTPUT to conversion result
- Explains ASLR bypass potential
- Documents test cases that would fail with old code
- Provides defensive programming recommendations

## Documentation Quality

Both documents include:
- Complete technical analysis in Chinese
- Code snippets with annotations
- Before/after comparisons
- Security risk assessments
- Prevention recommendations
- References to standards (CWE, CVSS, OWASP)

These documents serve as:
- Security disclosure materials
- Educational resources for similar vulnerability patterns
- Reference for CVE submission
- Internal security audit documentation

Total additions: ~860 lines of detailed security analysis
* Refresh wasm-lib assets before build
* Install Bazel before refreshing wasm assets
Update package.json; add --provenance to wasm-lib-publish.yml
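This commit fully implements the comment syntax and sorting rules for txt dictionaries, including a backward-compatible API design and command-line tool support.

## Comment syntax support

**Basic syntax:**
- Comment line: a whole line starting with #
- Dictionary record line: a tab-separated key/value pair
- Blank line: contains no visible characters

**Comment block categories:**
- Header block: the comment block at the start of the file (everything before the last blank line that precedes the first dictionary record)
- Footer block: the comment block at the end of the file (after the last dictionary record)
- Attached block: a comment block directly adjacent to a dictionary record line (no blank line in between)
- Floating block: a free-standing comment block (any comment block that does not meet the attach condition)

**Sorting rules:**
- The smallest sorting unit is a dictionary record plus its attached comment block
- Header/Footer blocks stay fixed at the start/end of the file
- Only the keys of dictionary records are sorted, using a stable sort
- After sorting, each floating block is inserted at its anchor position

## Backward-compatible design

**Default behavior (preserveComments=false):**
- Fully compatible with older versions
- A line starting with # throws an exception (the original behavior)
- The comment structure is neither parsed nor stored

**New behavior (preserveComments=true):**
- Lines starting with # are recognized as comments and raise no error
- The comment block structure is stored for sorting and serialization

## API changes

**Core API:**
- Lexicon::ParseLexiconFromFile(FILE* fp, bool preserveComments = false)
- TextDict::NewFromFile(FILE* fp, bool preserveComments = false)
- TextDict::NewFromSortedFile(FILE* fp, bool preserveComments = false)
- ConvertDictionary(..., bool preserveComments = false)

**Command-line tool:**

opencc_dict gains a -p, --preserve-comments option.

Usage example:

```bash
# Default behavior (backward compatible) - reports an error for files with comments
opencc_dict -i input.txt -o output.txt -f text -t text
# Preserve comments and sort
opencc_dict -i input.txt -o output.txt -f text -t text --preserve-comments
```

## Implementation details

**Data structures:**
- CommentBlock: comment block structure
- AnnotatedEntry: a dictionary entry with its comments
- Lexicon now stores header/footer/annotated/floating blocks

**Core logic:**
- Rewrite ParseLexiconFromFile to support both parsing modes
- Implement SortWithAnnotations so that comment blocks move with their entries
- Modify TextDict::SerializeToFile to output comment blocks and blank lines correctly

## Tests

Added full test coverage (LexiconAnnotationTest):
- ParseCommentLines: parsing comment lines
- ParseAttachedComment: parsing attached comments
- ParseFloatingComment: parsing floating comments
- ParseFooterComment: parsing footer comments
- SerializeWithAnnotations: serialization with comments
- SortWithAnnotations: sorting with comments
- DefaultBehaviorIgnoresComments: verifies the default behavior
- DefaultBehaviorRejectsCommentLines: verifies backward compatibility

All 8 tests pass. Manual testing confirms the command-line tool works as expected.

## Other improvements

Updated .gitignore to correctly ignore build artifacts (CMake-generated files, compiled binaries, etc.).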
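The block classification described in this commit can be sketched outside C++ as follows. This is an illustration, not the code in Lexicon.cpp. It assumes a comment block attaches only to the record directly below it, that a floating block is anchored to the next record after it, and that the output shape (`header`, `entries`, `floating`, `footer`) is invented for the example.

```js
// Sketch: split a txt dictionary into header / attached / floating / footer blocks.
function parseAnnotatedLexicon(text) {
  const lexicon = { header: [], entries: [], floating: [], footer: [] };
  let pending = []; // comment lines of the block currently being read
  let closed = [];  // blocks already terminated by a blank line

  for (const line of text.split(/\r?\n/)) {
    if (line.trim() === "") {
      // A blank line ends the current comment block, so it can no longer attach.
      if (pending.length > 0) {
        closed.push(pending);
        pending = [];
      }
    } else if (line.startsWith("#")) {
      pending.push(line);
    } else {
      const tab = line.indexOf("\t");
      if (tab < 0) throw new Error(`record line has no tab: ${line}`);
      const index = lexicon.entries.length; // index of the record being added
      for (const block of closed) {
        if (index === 0) lexicon.header.push(block);              // before the first record
        else lexicon.floating.push({ anchor: index, lines: block }); // assumed anchor: next record
      }
      lexicon.entries.push({
        key: line.slice(0, tab),
        value: line.slice(tab + 1),
        comments: pending, // attached block (no blank line in between)
      });
      pending = [];
      closed = [];
    }
  }
  // Everything left after the last record is the footer.
  if (pending.length > 0) closed.push(pending);
  lexicon.footer = closed;
  return lexicon;
}
```

With `preserveComments=false` the real parser rejects `#` lines instead; this sketch models only the comment-preserving mode.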
- Keep annotated text dictionaries when converting text->text with --preserve-comments
- Re-anchor floating comment blocks after sorting
- Add missing Bazel test targets and enable annotation parsing in dictionary tests
- Use UTF-8/Unicode-safe open and FileNotFound for text dict input
- Avoid RTTI by static casting when formatFrom is text
- Update dictionary scripts to follow the txt comment/format spec
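A sketch of sorting while keeping comments in place, to illustrate the re-anchoring of floating blocks mentioned above. It is not the C++ SortWithAnnotations: the input shape is invented, floating blocks are assumed to be anchored by the original index of the record that followed them, and keys are compared by code point as a stand-in for the tool's actual ordering.

```js
// Compare two strings by Unicode code point.
function compareCodePoints(a, b) {
  const ca = Array.from(a);
  const cb = Array.from(b);
  const n = Math.min(ca.length, cb.length);
  for (let i = 0; i < n; i++) {
    const d = ca[i].codePointAt(0) - cb[i].codePointAt(0);
    if (d !== 0) return d;
  }
  return ca.length - cb.length;
}

// lexicon: { header: [[line]], entries: [{key, value, comments}],
//            floating: [{anchor, lines}], footer: [[line]] }
function sortAndSerialize(lexicon) {
  const { entries } = lexicon;
  // Array.prototype.sort is stable, so equal keys keep their original order.
  const order = entries
    .map((_, i) => i)
    .sort((i, j) => compareCodePoints(entries[i].key, entries[j].key));

  const out = [];
  for (const block of lexicon.header) out.push(...block, "");
  for (const index of order) {
    // Re-insert floating blocks just before the record they are anchored to.
    for (const block of lexicon.floating) {
      if (block.anchor !== index) continue;
      if (out.length > 0 && out[out.length - 1] !== "") out.push("");
      out.push(...block.lines, "");
    }
    const entry = entries[index];
    out.push(...entry.comments, `${entry.key}\t${entry.value}`);
  }
  for (const block of lexicon.footer) out.push("", ...block);
  return out.join("\n") + "\n";
}
```

The blank lines written around floating blocks keep them from being read as attached blocks when the file is parsed again.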
Add standardized headers listing the official config usage for each top-level dictionary file, and note that TWPhrasesIT/Name/Other are merged into TWPhrases.txt.
No description provided.