A comprehensive text analysis and cleaning toolkit for maintaining code quality and text hygiene across your projects.
- Analyze code for security vulnerabilities, code quality issues, and technical debt
- Detect encoding problems, mojibake, and Unicode normalization issues
- Custom pattern detection with configurable severity levels
- Generate detailed reports in HTML, JSON, Markdown, CSV, or text formats
- Scoring system with health grades (A-F) for your codebase
- Normalize line endings (CRLF, LF, CR, Mixed) with auto-detection
- Remove trailing whitespace and fix indentation (tabs vs spaces)
- Smart quote normalization (10+ Unicode quote types)
- Dash standardization (6+ Unicode dash variants)
- Empty line normalization with configurable limits
- Detect hardcoded passwords, API keys, and sensitive data
- Find database connection strings and credentials
- Identify security vulnerabilities with pattern matching
- Configurable ignore patterns to reduce false positives
- Encoding Detection: Auto-detect UTF-8, UTF-16, UTF-32, Latin-1, Windows-1252
- Mojibake Detection: Find and auto-fix character encoding corruption
- Normalization: Analyze and convert between NFC, NFD, NFKC, NFKD forms
- Invisible Characters: Detect zero-width and control characters
- Character Frequency: Analyze entropy and detect unusual patterns
- Check for debug code, TODOs, FIXMEs, and technical debt markers
- Detect magic numbers and hardcoded values
- Line length validation and complexity analysis
- Naming convention checks
- Git integration for pre-commit hooks
go install github.com/marcho78/text-janitor/cmd/text-janitor@latest
Or build from source:
git clone https://github.com/marcho78/text-janitor
cd text-janitor
go build ./cmd/text-janitor
Audit your project for all types of issues:
# Comprehensive audit of current directory
text-janitor audit .
# Audit with detailed report
text-janitor audit . --detailed
# Generate HTML report (default)
text-janitor audit . --output-file report.html
# Generate JSON for CI/CD integration
text-janitor audit . --output-format json --output-file report.json
# Check specific issue types only
text-janitor audit . --checks security,encoding,mojibake
Clean and normalize text files:
# Preview changes without modifying files
text-janitor clean . --dry-run
# Clean with backup creation
text-janitor clean . --backup
# Selective cleaning
text-janitor clean . --normalize-line-endings --remove-trailing-whitespace
# Convert tabs to spaces
text-janitor clean . --tabs-to-spaces --tab-width 4
Search for patterns across files:
# Search with regex
text-janitor scan . --pattern "TODO|FIXME"
# Find hardcoded passwords
text-janitor scan . --find-passwords
# Find URLs and emails
text-janitor scan . --find-urls --find-emails
# Case-insensitive search
text-janitor scan . --case-insensitive "error"
The audit command is the most powerful one, performing a thorough analysis of your codebase:
text-janitor audit [path] [flags]
Available Checks:
- security: Detect security vulnerabilities (CRITICAL)
- code_quality/hygiene: Analyze code quality issues (HIGH)
- encoding: Find encoding inconsistencies (HIGH)
- unicode: Detect Unicode problems (MEDIUM)
- mojibake: Find character encoding corruption (MEDIUM)
- whitespace: Check whitespace issues (LOW)
- line_endings: Verify line ending consistency (LOW)
- invisible: Find invisible characters (MEDIUM)
- emoji: Detect emojis (CRITICAL in code, LOW in docs)
- frequency: Analyze character frequency anomalies (LOW)
- normalization: Check Unicode normalization (LOW)
- custom_pattern: Search for user-defined patterns (CONFIGURABLE)
Key Flags:
- --checks: Specific checks to run
- --all-checks: Run all available checks (default: true)
- --detailed: Show all issues with full details
- --output-format: Output format: html, json, markdown, csv, text
- --output-file: Save report to file
- --min-severity: Minimum severity to report
- --show-passed: Include passed checks in report
- --top-issues: Show only top N issues
- --group-by-file: Group issues by file instead of category
- --max-per-category: Maximum issues per category
- --min-score: Minimum acceptable score (for CI/CD)
- --max-critical: Maximum critical issues allowed
- --max-high: Maximum high severity issues allowed
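To make the custom_pattern check concrete, here is a minimal Python sketch of how a regex, a severity, and a message can be combined into findings. It is an illustration only, not Text Janitor's implementation (the tool itself is written in Go, whose regex engine differs from Python's in some details). The names Finding, RULES, and scan are hypothetical; the two regexes are the ones used in this README's configuration example.

```python
import re
from dataclasses import dataclass


@dataclass
class Finding:
    line: int       # 1-based line number of the match
    severity: str   # e.g. "critical", "medium"
    message: str    # human-readable description


# Hypothetical rule set: (compiled regex, severity, message) triples.
RULES = [
    (re.compile(r"password\s*=\s*[\"'][^\"']+[\"']"), "critical",
     "Hardcoded password detected"),
    (re.compile(r"TODO|FIXME|HACK"), "medium", "Technical debt marker"),
]


def scan(text, rules=RULES):
    """Return one Finding per (line, rule) pair where the rule's regex matches."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for regex, severity, message in rules:
            if regex.search(line):
                findings.append(Finding(lineno, severity, message))
    return findings
```

Counting the findings per severity is then enough to apply a threshold in the spirit of --max-critical or --max-high.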
Clean and normalize text files with 28+ types of fixes:
text-janitor clean [path] [flags]
Key Flags:
- --dry-run: Preview changes without modifying files
- --backup: Create backup files
- --normalize-line-endings: Fix line endings
- --line-ending: Target line ending (lf, crlf, cr, auto)
- --remove-trailing-whitespace: Remove trailing spaces
- --normalize-tabs: Fix tabs and spaces
- --tabs-to-spaces: Convert tabs to spaces
- --tab-width: Tab width for conversion
- --reduce-multiple-spaces: Reduce consecutive spaces
- --normalize-empty-lines: Fix consecutive empty lines
- --max-empty-lines: Maximum consecutive empty lines
- --normalize-quotes: Fix quotation marks
- --straighten-quotes: Convert smart to straight quotes
- --normalize-dashes: Fix dashes and hyphens
- --standardize-dashes: Convert to ASCII dashes
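As a rough illustration of what line-ending detection and whitespace cleanup involve, the following Python sketch classifies a text as CRLF, LF, CR, or mixed, then normalizes it. This is an assumption about the general technique, not a description of how Text Janitor implements these fixes; the function names are hypothetical.

```python
import re
from collections import Counter

# CRLF is listed first so it is counted once, not as a CR followed by an LF.
_EOL = re.compile(r"\r\n|\r|\n")


def detect_line_endings(text: str) -> str:
    """Classify line endings as 'crlf', 'lf', 'cr', 'mixed', or 'none'."""
    counts = Counter(_EOL.findall(text))
    if not counts:
        return "none"
    if len(counts) > 1:
        return "mixed"
    return {"\r\n": "crlf", "\n": "lf", "\r": "cr"}[next(iter(counts))]


def normalize_line_endings(text: str, target: str = "\n") -> str:
    """Rewrite every CRLF, CR, or LF as the target line ending."""
    return _EOL.sub(target, text)


def clean(text: str) -> str:
    """Normalize to LF, then strip spaces and tabs at the end of each line."""
    text = normalize_line_endings(text, "\n")
    return re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
```

When trying this on real files, open them with newline="" so that Python does not translate line endings before the check runs.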
Analyze code hygiene and quality metrics:
text-janitor hygiene [path] [flags]
Key Flags:
- --check-debug: Check for debug code
- --check-secrets: Check for hardcoded secrets
- --check-todos: Check for TODO/FIXME comments
- --check-magic-numbers: Check for magic numbers
- --check-line-length: Check line length limits
- --auto-fix: Automatically fix simple issues
- --report-format: Output format (text, html, csv, markdown)
- --install-hooks: Install Git pre-commit hooks
Detect and convert file encodings:
text-janitor encoding [path] [flags]
Key Flags:
- --from: Source encoding
- --to: Target encoding
- --convert: Perform conversion
- --mixed-detection: Detect mixed encodings
- --detailed: Show detailed analysis
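For readers unfamiliar with encoding detection, one common heuristic is to check for a byte-order mark first, then try strict UTF-8, and only then fall back to a single-byte encoding. The Python sketch below shows that idea under those assumptions; it is not Text Janitor's detection algorithm, and guess_encoding and convert are hypothetical names.

```python
import codecs

# UTF-32 BOMs come first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data: bytes) -> str:
    """Return a Python codec name that can decode data (heuristic, not proof)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    for candidate in ("utf-8", "cp1252"):
        try:
            data.decode(candidate)
            return candidate
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so it never fails to decode.
    return "latin-1"


def convert(data: bytes, target: str = "utf-8") -> bytes:
    """Decode with the guessed encoding and re-encode as target."""
    return data.decode(guess_encoding(data)).encode(target)
```

Single-byte encodings cannot be told apart reliably from the bytes alone, which is why an explicit source encoding such as the one passed to --from is still useful.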
Detect and fix mojibake (character encoding corruption):
text-janitor mojibake [path] [flags]
Key Flags:
- --auto-fix: Automatically fix detected issues
- --show-context: Show context around issues
- --backup: Create backup files
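Mojibake typically arises when UTF-8 bytes are decoded with the wrong single-byte encoding, so "café" turns into "cafÃ©". A minimal Python sketch of the classic repair, assuming the wrong decoder was Windows-1252, reverses that one step. This illustrates the concept only and is not the tool's fixer; both function names are hypothetical.

```python
def fix_mojibake(text: str) -> str:
    """Undo one round of UTF-8 bytes having been decoded as Windows-1252.

    Returns the text unchanged when the round trip is not possible,
    which is the case for most text that was never corrupted.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def looks_like_mojibake(text: str) -> bool:
    """True if reversing the cp1252 step yields a different, valid string."""
    return fix_mojibake(text) != text
```

The heuristic has limits: it misses corruption involving the five byte values that Windows-1252 leaves undefined, and a string that legitimately contains a sequence such as "Ã©" would be rewritten. Previewing changes or keeping a backup before an automatic fix is therefore a sensible precaution.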
Other commands:
- emoji: Find emoji usage in code
- frequency: Analyze character frequency and entropy
- invisible: Detect invisible and zero-width characters
- normalize: Analyze and convert Unicode normalization
- scan: Search for patterns with smart detection
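The ideas behind the invisible, normalize, and frequency commands can be sketched with Python's standard unicodedata module. The snippet below is a hedged illustration of the underlying concepts (Unicode general categories, NFC comparison, Shannon entropy), not a port of Text Janitor's analysis; all function names are hypothetical, and it assumes Python 3.9 or later.

```python
import math
import unicodedata
from collections import Counter


def find_invisible(text: str) -> list[tuple[int, str]]:
    """Return (index, name) for format (Cf) and control (Cc) characters.

    Tab, LF, and CR are skipped because they are expected in text files.
    """
    hits = []
    for i, ch in enumerate(text):
        if ch in "\t\n\r":
            continue
        if unicodedata.category(ch) in ("Cf", "Cc"):
            # Control characters have no Unicode name, so fall back to U+XXXX.
            hits.append((i, unicodedata.name(ch, f"U+{ord(ch):04X}")))
    return hits


def is_nfc(text: str) -> bool:
    """True if the text is already in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in Counter(text).values())
```

A precomposed "é" (U+00E9) passes is_nfc, while "e" followed by a combining acute accent (U+0301) does not, even though the two render identically.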
Create a .text-janitor.yml file for complex configurations:
# Audit configuration
audit:
  enabled_checks:
    - security
    - code_quality
    - encoding
    - mojibake
    - unicode
  # Check-specific configurations
  check_configs:
    security:
      enabled: true
      ignore_patterns:
        - '\\(\\?i\\)jdbc:' # Ignore regex patterns
    mojibake:
      enabled: true
      ignore_patterns:
        - '%[0-9]*[dxXsfF]' # Printf format specifiers
  # Performance settings
  workers: 8
  max_file_size: 10485760 # 10MB
  max_depth: 5
  # File filters
  extensions: [".go", ".js", ".py", ".java"]
  exclude_patterns: ["vendor/", "node_modules/", ".git/"]
  # CI/CD thresholds
  min_score: 70
  max_critical: 0
  max_high: 5
  # Custom patterns
  patterns:
    - pattern: "password\\s*=\\s*[\"'][^\"']+[\"']"
      severity: critical
      message: "Hardcoded password detected"
    - pattern: "TODO|FIXME|HACK"
      severity: medium
      message: "Technical debt marker"
# Clean configuration
clean:
  normalize_line_endings: true
  line_ending: "lf"
  remove_trailing_whitespace: true
  normalize_quotes: true
  tabs_to_spaces: true
  tab_width: 4
  max_empty_lines: 2
# Hygiene configuration
hygiene:
  max_line_length: 100
  check_todos: true
  check_secrets: true
  check_magic_numbers: true
name: Code Audit
on: [push, pull_request]
jobs:
  audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Go
        uses: actions/setup-go@v4
        with:
          go-version: '1.21'
      - name: Install Text Janitor
        run: go install github.com/marcho78/text-janitor/cmd/text-janitor@latest
      - name: Run Audit
        run: |
          text-janitor audit . \
            --output-format json \
            --output-file audit-report.json \
            --max-critical 0 \
            --max-high 5
      - name: Upload Report
        uses: actions/upload-artifact@v3
        if: always()
        with:
          name: audit-report
          path: audit-report.json
#!/bin/sh
# .git/hooks/pre-commit
# Audit the repository before each commit; block on critical or high issues
text-janitor audit . --max-critical 0 --max-high 0
if [ $? -ne 0 ]; then
  echo "❌ Commit blocked: Critical or high severity issues found"
  echo "Run 'text-janitor audit . --detailed' for details"
  exit 1
fi
HTML: Interactive reports with:
- Overall health score and grade (A-F)
- Issues grouped by file, severity, and category
- Search and filter functionality
- Detailed descriptions and remediation advice
- Summary dashboard with statistics
JSON: Machine-readable format:
{
  "summary": {
    "overall_score": 84.6,
    "grade": "B",
    "total_issues": 45,
    "critical_count": 0,
    "high_count": 5
  },
  "file_results": [...],
  "categories": {...}
}
- Markdown: GitHub-friendly format with tables
- CSV: Spreadsheet-compatible for analysis
- Text: Simple text output for terminals
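Because the JSON report is machine-readable, a script can apply its own quality gate on top of it. The Python sketch below reads only the summary fields shown in the JSON example (overall_score, critical_count, high_count); the function name check_report and its default thresholds are assumptions chosen to mirror the CI/CD settings in this README, not part of the tool.

```python
import json


def check_report(report: dict, min_score: float = 70,
                 max_critical: int = 0, max_high: int = 5) -> list[str]:
    """Return a list of human-readable threshold violations (empty if none)."""
    summary = report["summary"]
    failures = []
    if summary["overall_score"] < min_score:
        failures.append(
            f"score {summary['overall_score']} is below minimum {min_score}")
    if summary["critical_count"] > max_critical:
        failures.append(
            f"{summary['critical_count']} critical issues exceed {max_critical}")
    if summary["high_count"] > max_high:
        failures.append(
            f"{summary['high_count']} high issues exceed {max_high}")
    return failures


# Summary values copied from the JSON example; other report keys are omitted.
sample = json.loads("""
{"summary": {"overall_score": 84.6, "grade": "B", "total_issues": 45,
             "critical_count": 0, "high_count": 5}}
""")
print(check_report(sample))  # prints []
```

In a real pipeline the dictionary would come from json.load on the file written with --output-file, and a non-empty result would translate into a non-zero exit code.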
# Security-only audit with an HTML report
text-janitor audit . \
  --checks security \
  --max-critical 0 \
  --output-format html \
  --output-file security-report.html
# Clean selected source files, keeping backups
text-janitor clean ./src \
  --extensions .go,.js,.py \
  --normalize-line-endings \
  --remove-trailing-whitespace \
  --tabs-to-spaces \
  --backup
# Report technical debt markers as Markdown, grouped by file
text-janitor audit . \
  --pattern "TODO|FIXME|HACK|XXX" \
  --pattern-severity medium \
  --pattern-message "Technical debt" \
  --group-by-file \
  --output-format markdown
# Quality gate with score and severity thresholds
text-janitor audit ./src \
  --checks security,code_quality,encoding \
  --min-score 80 \
  --max-critical 0 \
  --max-high 5
The project maintains high test coverage:
- Overall: 84.6%
- Audit Package: 84.4%
- Scanner Package: 83.5%
- Unicode Package: 84.2%
- Patterns Package: 85.4%
- Mojibake Package: 92.2%
text-janitor/
├── cmd/text-janitor/ # CLI application
├── internal/ # Core packages
│ ├── audit/ # Comprehensive auditing
│ ├── cleaner/ # Text cleaning engine
│ ├── config/ # Configuration management
│ ├── encoding/ # Encoding detection
│ ├── hygiene/ # Code quality checks
│ ├── mojibake/ # Encoding corruption fixes
│ ├── patterns/ # Pattern matching
│ ├── scanner/ # File scanning
│ └── unicode/ # Unicode analysis
├── pkg/types/ # Shared types
├── command-reference.html # Complete documentation
└── CHANGELOG.md # Version history
License: MIT
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
For issues and feature requests, please use the GitHub issue tracker.
Developed by: Martin Mkrtchian
Text Janitor is built with Go and leverages the power of concurrent processing for fast, efficient text analysis across large codebases.