Text Janitor is a comprehensive text analysis and cleaning toolkit developed by Martin Mkrtchian for maintaining code quality and text hygiene across your projects.

Text Janitor

A comprehensive text analysis and cleaning toolkit for maintaining code quality and text hygiene across your projects.

Features

🔍 Comprehensive Auditing

  • Analyze code for security vulnerabilities, code quality issues, and technical debt
  • Detect encoding problems, mojibake, and Unicode normalization issues
  • Custom pattern detection with configurable severity levels
  • Generate detailed reports in HTML, JSON, Markdown, CSV, or text formats
  • Scoring system with health grades (A-F) for your codebase

🧹 Text Cleaning Engine

  • Normalize line endings (CRLF, LF, CR, Mixed) with auto-detection
  • Remove trailing whitespace and fix indentation (tabs vs spaces)
  • Smart quote normalization (10+ Unicode quote types)
  • Dash standardization (6+ Unicode dash variants)
  • Empty line normalization with configurable limits
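
As a rough illustration of what line-ending normalization and trailing-whitespace removal involve, here is a minimal Go sketch. The function names `normalizeLF` and `trimTrailing` are hypothetical and are not taken from Text Janitor's source; the tool's own implementation may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeLF converts CRLF and lone CR line endings to LF.
// CRLF is replaced first so that it does not turn into two newlines.
func normalizeLF(s string) string {
	s = strings.ReplaceAll(s, "\r\n", "\n")
	return strings.ReplaceAll(s, "\r", "\n")
}

// trimTrailing removes trailing spaces and tabs from every line.
func trimTrailing(s string) string {
	lines := strings.Split(s, "\n")
	for i, line := range lines {
		lines[i] = strings.TrimRight(line, " \t")
	}
	return strings.Join(lines, "\n")
}

func main() {
	in := "alpha  \r\nbeta\t\rgamma\n"
	fmt.Printf("%q\n", trimTrailing(normalizeLF(in))) // "alpha\nbeta\ngamma\n"
}
```

Replacing CRLF before lone CR is the one ordering detail that matters here; doing it the other way round would double every Windows line break.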

🔐 Security Scanning

  • Detect hardcoded passwords, API keys, and sensitive data
  • Find database connection strings and credentials
  • Identify security vulnerabilities with pattern matching
  • Configurable ignore patterns to reduce false positives

📊 Unicode Analysis Suite

  • Encoding Detection: Auto-detect UTF-8, UTF-16, UTF-32, Latin-1, Windows-1252
  • Mojibake Detection: Find and auto-fix character encoding corruption
  • Normalization: Analyze and convert between NFC, NFD, NFKC, NFKD forms
  • Invisible Characters: Detect zero-width and control characters
  • Character Frequency: Analyze entropy and detect unusual patterns
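
The README does not say how character-frequency entropy is computed. A common choice is Shannon entropy over the characters of the text, sketched below in Go. The function name `runeEntropy` is an assumption made for this example.

```go
package main

import (
	"fmt"
	"math"
)

// runeEntropy returns the Shannon entropy of s in bits per character.
// An empty string has entropy 0.
func runeEntropy(s string) float64 {
	counts := make(map[rune]int)
	total := 0
	for _, r := range s {
		counts[r]++
		total++
	}
	if total == 0 {
		return 0
	}
	h := 0.0
	for _, c := range counts {
		p := float64(c) / float64(total)
		h -= p * math.Log2(p)
	}
	return h
}

func main() {
	fmt.Printf("%.2f\n", runeEntropy("aaaa")) // one symbol only: 0.00
	fmt.Printf("%.2f\n", runeEntropy("abcd")) // four equally likely symbols: 2.00
}
```

Unusually high entropy in a string literal can hint at an embedded key or token, while very low entropy suggests padding or repeated filler.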

💯 Code Quality Metrics

  • Check for debug code, TODOs, FIXMEs, and technical debt markers
  • Detect magic numbers and hardcoded values
  • Line length validation and complexity analysis
  • Naming convention checks
  • Git integration for pre-commit hooks

Installation

go install github.com/marcho78/text-janitor/cmd/text-janitor@latest

Or build from source:

git clone https://github.com/marcho78/text-janitor
cd text-janitor
go build ./cmd/text-janitor

Quick Start

Basic Audit

Audit your project for all types of issues:

# Comprehensive audit of current directory
text-janitor audit .

# Audit with detailed report
text-janitor audit . --detailed

# Generate HTML report (HTML is the default format)
text-janitor audit . --output-file report.html

# Generate JSON for CI/CD integration
text-janitor audit . --output-format json --output-file report.json

# Check specific issue types only
text-janitor audit . --checks security,encoding,mojibake

Text Cleaning

Clean and normalize text files:

# Preview changes without modifying files
text-janitor clean . --dry-run

# Clean with backup creation
text-janitor clean . --backup

# Selective cleaning
text-janitor clean . --normalize-line-endings --remove-trailing-whitespace

# Convert tabs to spaces
text-janitor clean . --tabs-to-spaces --tab-width 4

Pattern Search

Search for patterns across files:

# Search with regex
text-janitor scan . --pattern "TODO|FIXME"

# Find hardcoded passwords
text-janitor scan . --find-passwords

# Find URLs and emails
text-janitor scan . --find-urls --find-emails

# Case-insensitive search
text-janitor scan . --case-insensitive "error"

Commands

audit - Comprehensive Code Analysis

Runs the selected checks against a path and produces a scored report. All checks run by default:

text-janitor audit [path] [flags]

Available Checks:

  • security - Detect security vulnerabilities (CRITICAL)
  • code_quality/hygiene - Analyze code quality issues (HIGH)
  • encoding - Find encoding inconsistencies (HIGH)
  • unicode - Detect Unicode problems (MEDIUM)
  • mojibake - Find character encoding corruption (MEDIUM)
  • whitespace - Check whitespace issues (LOW)
  • line_endings - Verify line ending consistency (LOW)
  • invisible - Find invisible characters (MEDIUM)
  • emoji - Detect emojis (CRITICAL in code, LOW in docs)
  • frequency - Analyze character frequency anomalies (LOW)
  • normalization - Check Unicode normalization (LOW)
  • custom_pattern - Search for user-defined patterns (CONFIGURABLE)

Key Flags:

  • --checks - Specific checks to run
  • --all-checks - Run all available checks (default: true)
  • --detailed - Show all issues with full details
  • --output-format - Output format: html, json, markdown, csv, text
  • --output-file - Save report to file
  • --min-severity - Minimum severity to report
  • --show-passed - Include passed checks in report
  • --top-issues - Show only top N issues
  • --group-by-file - Group issues by file instead of category
  • --max-per-category - Maximum issues per category
  • --min-score - Minimum acceptable score (for CI/CD)
  • --max-critical - Maximum critical issues allowed
  • --max-high - Maximum high severity issues allowed

clean - Text Normalization

Clean and normalize text files with 28+ types of fixes:

text-janitor clean [path] [flags]

Key Flags:

  • --dry-run - Preview changes without modifying
  • --backup - Create backup files
  • --normalize-line-endings - Fix line endings
  • --line-ending - Target line ending (lf, crlf, cr, auto)
  • --remove-trailing-whitespace - Remove trailing spaces
  • --normalize-tabs - Fix tabs and spaces
  • --tabs-to-spaces - Convert tabs to spaces
  • --tab-width - Tab width for conversion
  • --reduce-multiple-spaces - Reduce consecutive spaces
  • --normalize-empty-lines - Fix consecutive empty lines
  • --max-empty-lines - Maximum consecutive empty lines
  • --normalize-quotes - Fix quotation marks
  • --straighten-quotes - Convert smart to straight quotes
  • --normalize-dashes - Fix dashes and hyphens
  • --standardize-dashes - Convert to ASCII dashes

hygiene - Code Quality Check

Analyze code hygiene and quality metrics:

text-janitor hygiene [path] [flags]

Key Flags:

  • --check-debug - Check for debug code
  • --check-secrets - Check for hardcoded secrets
  • --check-todos - Check for TODO/FIXME comments
  • --check-magic-numbers - Check for magic numbers
  • --check-line-length - Check line length limits
  • --auto-fix - Automatically fix simple issues
  • --report-format - Output format (text, html, csv, markdown)
  • --install-hooks - Install Git pre-commit hooks

encoding - Encoding Detection and Conversion

Detect and convert file encodings:

text-janitor encoding [path] [flags]

Key Flags:

  • --from - Source encoding
  • --to - Target encoding
  • --convert - Perform conversion
  • --mixed-detection - Detect mixed encodings
  • --detailed - Show detailed analysis

mojibake - Fix Encoding Corruption

Detect and fix mojibake (character encoding corruption):

text-janitor mojibake [path] [flags]

Key Flags:

  • --auto-fix - Automatically fix detected issues
  • --show-context - Show context around issues
  • --backup - Create backup files
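
For readers unfamiliar with mojibake, the most common case is UTF-8 bytes that were decoded as Latin-1, so that "é" shows up as "Ã©". The Go sketch below reverses one round of that specific corruption. It illustrates the general technique and is not Text Janitor's algorithm; the function name `fixLatin1Mojibake` is hypothetical. Windows-1252 corruption needs an extra mapping for the 0x80-0x9F range, which this sketch does not cover.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// fixLatin1Mojibake tries to undo one round of "UTF-8 bytes read as Latin-1".
// It returns the repaired string and true on success, or the input and false
// when the text does not look like that kind of corruption.
func fixLatin1Mojibake(s string) (string, bool) {
	buf := make([]byte, 0, len(s))
	sawHigh := false
	for _, r := range s {
		if r > 0xFF {
			return s, false // not representable in Latin-1
		}
		if r >= 0x80 {
			sawHigh = true
		}
		buf = append(buf, byte(r))
	}
	// Pure ASCII needs no repair; invalid UTF-8 means the guess was wrong.
	if !sawHigh || !utf8.Valid(buf) {
		return s, false
	}
	return string(buf), true
}

func main() {
	// "caf" followed by U+00C3 U+00A9, i.e. the mojibake form of "café".
	fixed, ok := fixLatin1Mojibake("caf\u00C3\u00A9")
	fmt.Println(fixed, ok) // café true

	// Correctly encoded text is left alone.
	same, ok := fixLatin1Mojibake("caf\u00E9")
	fmt.Println(same, ok) // café false
}
```

The validity check is what keeps the repair conservative: correctly encoded accented text does not round-trip into valid UTF-8, so it is returned unchanged.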

Other Commands

  • emoji - Find emoji usage in code
  • frequency - Analyze character frequency and entropy
  • invisible - Detect invisible and zero-width characters
  • normalize - Analyze and convert Unicode normalization
  • scan - Search for patterns with smart detection
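
As an illustration of what detecting invisible characters can look like, the Go sketch below flags Unicode format characters (category Cf, which includes zero-width spaces and joiners) and control characters other than tab, newline, and carriage return. The `Finding` type and `findInvisible` function are hypothetical names for this example. The set of characters the real command reports may be different.

```go
package main

import (
	"fmt"
	"unicode"
)

// Finding records one invisible character and where it occurs.
type Finding struct {
	Offset int // byte offset into the input
	Rune   rune
}

// findInvisible reports format characters (category Cf) and control
// characters, skipping ordinary tab, newline, and carriage return.
func findInvisible(s string) []Finding {
	var out []Finding
	for i, r := range s {
		switch {
		case r == '\t' || r == '\n' || r == '\r':
			continue
		case unicode.Is(unicode.Cf, r) || unicode.IsControl(r):
			out = append(out, Finding{Offset: i, Rune: r})
		}
	}
	return out
}

func main() {
	// Contains a zero-width space (U+200B) and a BEL control character.
	for _, f := range findInvisible("ab\u200Bcd\x07\n") {
		fmt.Printf("offset %d: %U\n", f.Offset, f.Rune)
	}
	// offset 2: U+200B
	// offset 7: U+0007
}
```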

Configuration File

Create a .text-janitor.yml file for complex configurations:

# Audit configuration
audit:
  enabled_checks:
    - security
    - code_quality
    - encoding
    - mojibake
    - unicode

  # Check-specific configurations
  check_configs:
    security:
      enabled: true
      ignore_patterns:
        - '\(\?i\)jdbc:'  # Skip lines that are themselves regex definitions, e.g. (?i)jdbc:

    mojibake:
      enabled: true
      ignore_patterns:
        - '%[0-9]*[dxXsfF]'  # Printf format specifiers

  # Performance settings
  workers: 8
  max_file_size: 10485760  # 10 MiB (10 * 1024 * 1024 bytes)
  max_depth: 5

  # File filters
  extensions: [".go", ".js", ".py", ".java"]
  exclude_patterns: ["vendor/", "node_modules/", ".git/"]

  # CI/CD thresholds
  min_score: 70
  max_critical: 0
  max_high: 5

  # Custom patterns
  patterns:
    - pattern: "password\\s*=\\s*[\"'][^\"']+[\"']"
      severity: critical
      message: "Hardcoded password detected"

    - pattern: "TODO|FIXME|HACK"
      severity: medium
      message: "Technical debt marker"

# Clean configuration
clean:
  normalize_line_endings: true
  line_ending: "lf"
  remove_trailing_whitespace: true
  normalize_quotes: true
  tabs_to_spaces: true
  tab_width: 4
  max_empty_lines: 2

# Hygiene configuration
hygiene:
  max_line_length: 100
  check_todos: true
  check_secrets: true
  check_magic_numbers: true

CI/CD Integration

GitHub Actions

name: Code Audit
on: [push, pull_request]

jobs:
  audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Go
        uses: actions/setup-go@v4
        with:
          go-version: '1.21'

      - name: Install Text Janitor
        run: go install github.com/marcho78/text-janitor/cmd/text-janitor@latest

      - name: Run Audit
        run: |
          text-janitor audit . \
            --output-format json \
            --output-file audit-report.json \
            --max-critical 0 \
            --max-high 5

      - name: Upload Report
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: audit-report
          path: audit-report.json

Pre-commit Hook

#!/bin/sh
# .git/hooks/pre-commit

# Audit the working tree (this scans the whole directory, not only staged files)
if ! text-janitor audit . --max-critical 0 --max-high 0; then
    echo "❌ Commit blocked: Critical or high severity issues found"
    echo "Run 'text-janitor audit . --detailed' for details"
    exit 1
fi

Report Formats

HTML Report

Interactive reports with:

  • Overall health score and grade (A-F)
  • Issues grouped by file, severity, and category
  • Search and filter functionality
  • Detailed descriptions and remediation advice
  • Summary dashboard with statistics

JSON Report

Machine-readable format:

{
  "summary": {
    "overall_score": 84.6,
    "grade": "B",
    "total_issues": 45,
    "critical_count": 0,
    "high_count": 5
  },
  "file_results": [...],
  "categories": {...}
}
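
If you want to consume the JSON report from your own tooling, a small Go program can decode the `summary` object and apply thresholds. The sketch below relies only on the fields shown in the sample above. The `Summary`, `Report`, and `gate` names are hypothetical, and the elided `file_results` and `categories` sections are deliberately left undecoded.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Summary covers only the fields shown in the sample report.
type Summary struct {
	OverallScore  float64 `json:"overall_score"`
	Grade         string  `json:"grade"`
	TotalIssues   int     `json:"total_issues"`
	CriticalCount int     `json:"critical_count"`
	HighCount     int     `json:"high_count"`
}

// Report ignores every top-level key other than "summary".
type Report struct {
	Summary Summary `json:"summary"`
}

// gate returns an error when the report misses any threshold.
func gate(data []byte, minScore float64, maxCritical, maxHigh int) error {
	var r Report
	if err := json.Unmarshal(data, &r); err != nil {
		return fmt.Errorf("parse report: %w", err)
	}
	s := r.Summary
	switch {
	case s.OverallScore < minScore:
		return fmt.Errorf("score %.1f below minimum %.1f", s.OverallScore, minScore)
	case s.CriticalCount > maxCritical:
		return fmt.Errorf("%d critical issues, at most %d allowed", s.CriticalCount, maxCritical)
	case s.HighCount > maxHigh:
		return fmt.Errorf("%d high issues, at most %d allowed", s.HighCount, maxHigh)
	}
	return nil
}

func main() {
	data := []byte(`{"summary":{"overall_score":84.6,"grade":"B","total_issues":45,"critical_count":0,"high_count":5}}`)
	fmt.Println(gate(data, 70, 0, 5)) // <nil>
	fmt.Println(gate(data, 90, 0, 5)) // score 84.6 below minimum 90.0
}
```

The thresholds in the first call match the `min_score`, `max_critical`, and `max_high` values from the sample configuration file.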

Other Formats

  • Markdown: GitHub-friendly format with tables
  • CSV: Spreadsheet-compatible for analysis
  • Text: Simple text output for terminals

Examples

Security Audit

text-janitor audit . \
  --checks security \
  --max-critical 0 \
  --output-format html \
  --output-file security-report.html

Clean Code Files

text-janitor clean ./src \
  --extensions .go,.js,.py \
  --normalize-line-endings \
  --remove-trailing-whitespace \
  --tabs-to-spaces \
  --backup

Find Technical Debt

text-janitor audit . \
  --pattern "TODO|FIXME|HACK|XXX" \
  --pattern-severity medium \
  --pattern-message "Technical debt" \
  --group-by-file \
  --output-format markdown

Pre-deployment Check

text-janitor audit ./src \
  --checks security,code_quality,encoding \
  --min-score 80 \
  --max-critical 0 \
  --max-high 5

Test Coverage

The project maintains high test coverage:

  • Overall: 84.6%
  • Audit Package: 84.4%
  • Scanner Package: 83.5%
  • Unicode Package: 84.2%
  • Patterns Package: 85.4%
  • Mojibake Package: 92.2%

Project Structure

text-janitor/
├── cmd/text-janitor/     # CLI application
├── internal/             # Core packages
│   ├── audit/           # Comprehensive auditing
│   ├── cleaner/         # Text cleaning engine
│   ├── config/          # Configuration management
│   ├── encoding/        # Encoding detection
│   ├── hygiene/         # Code quality checks
│   ├── mojibake/        # Encoding corruption fixes
│   ├── patterns/        # Pattern matching
│   ├── scanner/         # File scanning
│   └── unicode/         # Unicode analysis
├── pkg/types/           # Shared types
├── command-reference.html # Complete documentation
└── CHANGELOG.md         # Version history

License

MIT

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

Support

For issues and feature requests, please use the GitHub issue tracker.

Developer

Developed by: Martin Mkrtchian

Acknowledgments

Text Janitor is written in Go and processes files concurrently, which keeps analysis fast on large codebases.
