pgEdge Anonymizer

Documentation:

pgEdge Anonymizer is a command-line tool for anonymizing personally identifiable information (PII) in PostgreSQL databases. The tool replaces sensitive data with realistic fake values that you can use for development and testing, while maintaining data consistency and referential integrity.

Features

  • 100+ built-in patterns for common PII types across 19 countries
  • Consistent replacement - same input produces same output within a run
  • Foreign key awareness - automatically handles CASCADE relationships
  • Large database support - efficient batch processing with server-side cursors
  • Format preservation - maintains original data formatting where possible
  • Single transaction - all changes committed atomically or rolled back
  • Extensible - define custom patterns using date, number, or mask formats
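The consistent-replacement feature can be pictured as a lookup table that lives for the duration of one run. The following is a minimal sketch of that idea, not pgedge-anonymizer's actual implementation: the `replacer` type, its methods, and the `userN@example.com` fake values are all hypothetical names chosen for illustration.

```go
package main

import "fmt"

// replacer is a hypothetical helper (not part of pgedge-anonymizer).
// It remembers every original value it has seen during one run.
type replacer struct {
	seen map[string]string // original value -> fake value
	next int               // counter used to build distinct fake values
}

func newReplacer() *replacer {
	return &replacer{seen: make(map[string]string)}
}

// replace returns the same fake value each time it is given the same input,
// and a new fake value the first time it sees an input.
func (r *replacer) replace(original string) string {
	if fake, ok := r.seen[original]; ok {
		return fake
	}
	r.next++
	fake := fmt.Sprintf("user%d@example.com", r.next)
	r.seen[original] = fake
	return fake
}

func main() {
	r := newReplacer()
	fmt.Println(r.replace("alice@corp.com")) // user1@example.com
	fmt.Println(r.replace("bob@corp.com"))   // user2@example.com
	fmt.Println(r.replace("alice@corp.com")) // user1@example.com again
}
```

Because the mapping is held only in memory here, a second run would start from an empty table, which matches the "within a run" wording above.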

Quick Start

Anonymizer lets you create an experimental data set that preserves the shape and integrity of a Postgres database in just three steps:

  1. Create a configuration file that specifies the replacement patterns for your columns.
  2. Build and run pgedge-anonymizer to anonymize your columns.
  3. Review the results.

Before running pgedge-anonymizer, you need to create a configuration file named pgedge-anonymizer.yaml; the file should contain:

  • a database section, with connection details for your database.
  • a columns section, listing the fully-qualified columns that you wish to anonymize (in schema_name.table_name.column_name format).
  • a pattern property for each column that specifies the form the replacement content will take.

For example:

database:
  host: localhost
  port: 5432
  database: myapp
  user: anonymizer

columns:
  - column: public.users.email
    pattern: EMAIL

  - column: public.users.phone
    pattern: US_PHONE

  - column: public.users.ssn
    pattern: US_SSN
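Each `column` entry above uses the schema_name.table_name.column_name format. As a rough illustration of what that format implies, the sketch below splits such a string into its three parts and rejects anything else. The `columnRef` type and `parseColumnRef` function are hypothetical and are not taken from the pgedge-anonymizer source; the sketch also assumes unquoted identifiers that contain no dots.

```go
package main

import (
	"fmt"
	"strings"
)

// columnRef is a hypothetical type holding the three parts of a
// fully-qualified column name.
type columnRef struct {
	Schema string
	Table  string
	Column string
}

// parseColumnRef splits "schema.table.column" into its parts.
// Assumption: identifiers are unquoted and contain no embedded dots.
func parseColumnRef(s string) (columnRef, error) {
	parts := strings.Split(s, ".")
	if len(parts) != 3 {
		return columnRef{}, fmt.Errorf("column %q: want schema.table.column, got %d part(s)", s, len(parts))
	}
	for _, p := range parts {
		if p == "" {
			return columnRef{}, fmt.Errorf("column %q: empty name component", s)
		}
	}
	return columnRef{Schema: parts[0], Table: parts[1], Column: parts[2]}, nil
}

func main() {
	for _, s := range []string{"public.users.email", "users.email"} {
		ref, err := parseColumnRef(s)
		if err != nil {
			fmt.Println("invalid:", err)
			continue
		}
		fmt.Printf("schema=%s table=%s column=%s\n", ref.Schema, ref.Table, ref.Column)
	}
}
```

The second input omits the schema, so the sketch reports it as invalid rather than guessing a default.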

After creating a configuration file, run the anonymizer:

pgedge-anonymizer run

As pgedge-anonymizer runs, it reports progress for each column and then displays summary statistics:

Processing public.users.email (est. 50000 rows)...
  10000 rows processed
  20000 rows processed
  30000 rows processed
  40000 rows processed
  50000 rows processed
  Completed: 50000 rows, 48234 values anonymized

=== Anonymization Statistics ===
Total columns processed: 1
Total rows processed:    50000
Total values anonymized: 48234
Total duration:          2.34s
Throughput:              21367 rows/sec

Developer Notes

Prerequisites

  • Go 1.24 or later
  • PostgreSQL (for integration tests)
  • Python 3.12+ (for documentation)

Use the following command to build pgedge-anonymizer:

make build        # Build binary

Use the following command to run the Anonymizer test suite:

make test

Use the following command to run the Go linter:

make lint

Use the following command to format the code:

make fmt

Support

License

This project is licensed under the PostgreSQL License.
