PAQ8PX – Experimental Lossless Data Compressor & Entropy Estimator

About

PAQ is a family of experimental, high-end lossless data compression programs. paq8px is one of the longest-running branches of PAQ, started by Jan Ondrus in 2009 with major contributions from Márcio Pais and Zoltán Gotthardt (see Contribution Timeline).

paq8px consistently achieves state-of-the-art compression ratios on various data compression benchmarks (see Benchmark Results). This performance comes at the cost of speed and memory usage, which makes it impractical for production use or long-term storage. However, it is particularly well-suited for file entropy estimation and as a reference for compression research.

For detailed history and ongoing development discussions, see the paq8px thread on encode.su.

Quick start

paq8px is portable software – no installation required.

Get the latest binary for Windows (x64) from the paq8px thread on encode.su, or build it from source for your platform – see below.

Command line interface

paq8px does not include a graphical user interface (GUI). All operations are performed from the command line.

Open a terminal and run paq8px with the desired options to compress your file (such as paq8px -8 file.txt).
Start with a small file – compression takes time.

Example output (on Windows):

c:\>paq8px.exe -8 file.txt
paq8px archiver v210 (c) 2026, Matt Mahoney et al.

Creating archive file.txt.paq8px210 in single file mode...

Filename: file.txt (111261 bytes)
Block segmentation:
 0           | text             |    111261 bytes [0 - 111260]
-----------------------
Total input size     : 111261
Total archive size   : 19595

Time 19.58 sec, used 2163 MB (2268982538 bytes) of memory
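
As a rough illustration of entropy estimation, the two sizes reported in the output above can be converted into a compression ratio and a bits-per-byte figure. This is a minimal sketch; the helper name `bits_per_byte` is hypothetical and not part of paq8px. The archive size includes some container overhead, so treat the result as an estimate, not an exact entropy.

```python
def bits_per_byte(original_size: int, archive_size: int) -> float:
    """Estimate information content as compressed bits per input byte."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return archive_size * 8 / original_size


# Sizes taken from the example output above (file.txt compressed with -8).
original, archive = 111261, 19595
print(f"compressed to {archive / original:.2%} of the original size")
print(f"estimate: {bits_per_byte(original, archive):.3f} bits per byte")
```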

Note

The output archive extension is versioned (e.g., .paq8px210).

Note

You can place the binary anywhere and reference inputs/outputs by path.

Some examples

Compress a file at level 8 (balanced speed and compression ratio):

paq8px.exe -8 filename_to_compress

Compress at the maximum level with LSTM modeling included (-12L):

paq8px.exe -12L filename_to_compress

Warning

This mode is extremely slow and memory-intensive. Make sure you have 32 GB+ RAM.

Getting help

To view the available options, run paq8px without arguments. To view the available options together with the detailed help pages, run paq8px -help.

Full paq8px help:

paq8px archiver v210 (c) 2026, Matt Mahoney et al.
Free under GPL, http://www.gnu.org/licenses/gpl.txt

Usage:
  to compress       ->   paq8px -LEVEL[FLAGS] [OPTIONS] INPUT [OUTPUT]
  to decompress     ->   paq8px -d INPUT.paq8px210 [OUTPUT]
  to test           ->   paq8px -t INPUT.paq8px210 [OUTPUT]
  to list contents  ->   paq8px -l INPUT.paq8px210

LEVEL:
  -1 -2 -3 -4          | Compress using less memory (529, 543, 572, 630 MB)
  -5 -6 -7 -8          | Use more memory (747, 980, 1446, 2377 MB)
  -9 -10 -11 -12       | Use even more memory (4241, 7968, 15421, 29305 MB)
  -0                   | Segment and transform only, no compression
  -0L                  | Segment and transform then LSTM-only compression (alternative: -lstmonly)

FLAGS:
  L                    | Enable LSTM model (+24 MB per block type)
  A                    | Use adaptive learning rate
  S                    | Skip RGB color transform (images)
  B                    | Brute-force DEFLATE detection
  E                    | Pre-train x86/x64 model
  T                    | Pre-train text models (dictionary-based)

  Example: paq8px -8LA file.txt   <- Level 8 + LSTM + adaptive learning rate

Block detection control (compression-only):
  -forcebinary         | Force generic (binary) mode
  -forcetext           | Force text mode

LSTM snapshots (expert-only):
  -savelstm:text FILE  | Save learned LSTM model weights after compression
  -loadlstm:text FILE  | Load LSTM model weights before compression/decompression

Misc options:
  -v                   | Verbose output
  -log FILE            | Append compression results to log file
  -simd MODE           | Override SIMD detection - expert only (NONE|SSE2|AVX2|AVX512|NEON)

Notes:
  INPUT may be FILE, PATH/FILE, or @FILELIST
  OUTPUT is optional: FILE, PATH, PATH/FILE
  The archive is created in the current folder with .paq8px210 extension if OUTPUT omitted
  FLAGS are case-insensitive and only needed for compression; they may appear in any order
  INPUT must precede OUTPUT; all other OPTIONS may appear anywhere

=============
Detailed Help
=============

---------------
 1. Compression
---------------

  Compression levels control the amount of memory used during both compression and decompression.
  Higher levels generally improve compression ratio at the cost of higher memory usage and slower speed.
  Specifying the compression level is needed only for compression - no need to specify it for decompression.
  Approximately the same amount of memory will be used during compression and decompression.

  The listed memory usage for each LEVEL (-1 = 529 MB .. -12 = 29305 MB) is typical/indicative for compressing binary
  files with no preprocessing. Actual memory use is lower for text files and higher when a preprocessing step
  (segmentation and transformations) requires temporary memory. When special file types are detected, special models
  (image, jpg, audio) will be used and thus will require extra RAM.

------------------
 2. Special Levels
------------------

  -0   Only block type detection (segmentation) and block transformations are performed.
       The data is copied (verbatim or transformed); no compression happens.
       This mode is similar to a preprocessing-only tool like precomp.
       Uses approximately 3-7 MB total.

  -0L  Uses only a single LSTM model for prediction which is shared across all block types.
       Uses approximately 20-24 MB total RAM.
       Alternative: -lstmonly

---------------------
 3. Compression Flags
---------------------

  Compression flags are single-letter, case-insensitive, and appended directly to the level.
  They are valid only during compression. No need to specify them for decompression.

  L   Enable the LSTM (Long Short-Term Memory) model.
      Uses a fixed-size model, independent of compression level.

      At level -0L (also: -lstmonly) a single LSTM model is used for prediction for all detected block types.
      Block detection and segmentation are still performed, but no context mixing or Secondary Symbol
      Estimation (SSE) stage is used.

      At higher levels (-1L .. -12L) the LSTM model is included as a submodel in Context Mixing and its predictions
      are mixed with the other models.
      When special block types are detected, for each block type an individual LSTM model is created dynamically and
      used within that block type. Each such LSTM model adds approximately 24 MB to the total memory use.

  A   Enable adaptive learning rate in the CM mixer.
      May improve compression for some files.

  S   Skip RGB color transform for 24/32-bit images.
      Useful when the transform worsens compression.
      This flag has no effect when no image block types are detected.

  B   Enable brute-force DEFLATE stream detection.
      Slower but may improve detection of compressed streams.

  E   Pre-train the x86/x64 executable model.
      This option pre-trains the EXE model using the paq8px.exe binary itself.
      Archives created with a different paq8px.exe executable (even when built from the same source and build options)
      will differ. To decompress an archive created with -E, you must use the exact same executable that created it.

  T   Pre-train text-oriented models using a dictionary and expression list.
      The word list (english.dic) and expression list (english.exp) are used only to pre-train models before
      compression and they are not stored in the archive.
      You must have these same files available to decompress archives created with -T.

---------------------------
 4. Block Detection Control
---------------------------

  Block detection and segmentation always happen regardless of the memory level or other options - except when forced:

  -forcebinary

      Disable block detection; the whole file is considered as a single binary block and only the generic (binary)
      model set will be used.
      Useful when block detection produces false positives.

  -forcetext

      Disable block detection; consider the whole file as a single text block and use the text model set only.
      Useful when text data is misclassified as binary or fragments in a text file are incorrectly detected as some
      other block type.

---------------------------------------
 5. LSTM Snapshot Options (expert-only)
---------------------------------------

  -savelstm:text FILE

      Saves the LSTM model's learned parameters as a lossless snapshot to the specified file when compression finishes.
      Only the model used for text block(s) will be saved.
      It's not possible to save a snapshot from other block types. This is an experimental feature.

  -loadlstm:text FILE

      Loads the LSTM model's learned parameters from the specified file (which was saved earlier
      by the -savelstm:text option) before compression starts. The LSTM model will use this loaded
      snapshot to bootstrap its predictions.
      At levels -1L .. -12L only text blocks are affected.
      At level -0L all blocks are affected (because a single LSTM model is used for all block types).
      Critical: The same snapshot file MUST be used during decompression or the original content cannot be recovered.

----------------------
 6. Archive Operations
----------------------

  -d  Decompress an archive.
      In single-file mode the content is decompressed, the name of the output is the name of the archive without
      the .paq8px210 extension.
      In multi-file mode first the @LISTFILE is extracted then the rest of the files. Any required folders will
      be created recursively, all files will be extracted with their original names.
      If the output file or files already exist they will be overwritten.

      Example: to decompress file.txt to the current folder:
      paq8px -d file.txt.paq8px210

  -t  Test archive contents by decompressing to memory and comparing with the original data on-the-fly.
      If a file fails the test, the first mismatched position will be printed to screen.

      Example: to test archive contents:
      paq8px -t file.txt.paq8px210

  -l  List archive contents.
      Extracts the embedded @FILELIST (if present) and prints it.
      Applicable only to multi-file archives.

      Example: to list the file list (when the archive was created using @files):
      paq8px -l files.paq8px210

----------------------------------
 7. INPUT and OUTPUT Specification
----------------------------------

  INPUT may be:

  * A single file
  * A path/file
  * A [path/]@FILELIST

  In multi-file mode (i.e. when @FILELIST is provided) only file names, file contents and file sizes are stored
  in the archive. Timestamps, permissions, attributes or any other metadata are not preserved unless stored
  separately and manually by the user in the FILELIST.

  OUTPUT is optional:

    For compression:

    * If omitted, the archive is created in the current directory.
      The name of the archive: INPUT + paq8px210 extension appended.
    * If a filename is given, it is used as the archive name.
    * If a directory is given, the archive is created inside it.
    * If the archive file already exists, it will be overwritten.

    For decompression:

    * If an output filename is not provided, the output will be named the same as the archive without
      the paq8px210 extension.
    * If a filename is given, it is used as the output name.
    * If a directory is given, the restored file will be created inside it (the directory must exist).
    * If the output file(s) already exist, they will be overwritten.

  Examples:

  To create data.txt.paq8px210 in current directory:
  paq8px -8 data.txt

  To create archive.paq8px210 in current directory:
  paq8px -8 data.txt archive.paq8px210

  To create data.txt.paq8px210 in results/ directory:
  paq8px -8 data.txt results/

---------------------------------
 8. @FILELIST Format and Behavior
---------------------------------

  When a @FILELIST is provided, the FILELIST file itself is compressed as the first file in the archive and
  automatically extracted during decompression.

  The FILELIST is a tab-separated text file with this structure:

    Column 1:  Filenames and optional relative paths (required, used by compressor)
    Column 2+: Arbitrary metadata - timestamps, ownership, etc. (optional, preserved but ignored)

    First line: Header (preserved but ignored during processing the file list)

  Only the first column is used by the compressor and decompressor.
  All other columns are preserved but ignored.
  Paths must be relative to the FILELIST location.

  Using this mechanism allows full restoration of file metadata with third-party tools after decompression.


-------------------------
 9. Miscellaneous Options
-------------------------

  -v

    Enable verbose output.

  -log FILE

    Append compression results to a tab-separated log file.
    Logging applies only to compression.

  -simd MODE

    Normally, the highest usable SIMD instruction set is detected and used automatically for the CM mixer and
    neural network operations (LSTM model).
    This option overrides the detected SIMD instruction set. Intended for expert use and benchmarking.
    Supported values (case-insensitive):
       NONE
       SSE2, AVX2, AVX512 (on x64)
       NEON (on ARM)

----------------------
 10. Argument Ordering
----------------------

  Command-line arguments may appear in any order with the following exception:
  INPUT must always precede OUTPUT.

  Example: the following two are equivalent:

    paq8px -v -simd sse2 enwik8 -log results.txt output/ -8
    paq8px -8 enwik8 -log results.txt output/ -v -simd sse2

  Further examples:

    paq8px -8 file.txt         | Compress using ~2.3 GB RAM
    paq8px -12L enwik8         | Compress 'enwik8' with maximum compression (~29 GB RAM), use the LSTM model as well
    paq8px -4 image.jpg        | Compress the 'image.jpg' file - using less memory, even faster
    paq8px -8ba b64sample.xml  | Compress 'b64sample.xml' faster and using less memory
                                 Put more effort into finding and transforming DEFLATE blocks
                                 Use adaptive learning rate.
    paq8px -8s rafale.bmp      | Compress the 'rafale.bmp' image file
                                 Skip color transform - this file compresses better without it

Compatibility & archive basics

A paq8px archive stores one or more files in a highly compressed format.

Note

Files and archives larger than 2 GB are not supported.

Note

paq8px archives are not compatible across different paq8px releases (past or future).

Note

A paq8px archive may contain multiple files, but once created, you cannot add to or remove files from the archive.

How to recognize it

The file extension reflects the exact paq8px version that created it (e.g., .paq8px210).
You can also check the header: if the first bytes read "paq8px", it is likely a paq8px archive.
Exact version information cannot be inferred from the archive content: the archive header does not encode the specific paq8px version used. Only the file extension reflects the version.
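
The two recognition hints above can be sketched in a few lines. This is an illustrative helper, not part of paq8px: the function names are hypothetical, and the extension pattern assumes the version suffix consists only of letters and digits (as in .paq8px210).

```python
import re
from pathlib import Path

MAGIC = b"paq8px"  # per the text above: the first bytes read "paq8px"


def looks_like_paq8px(path) -> bool:
    """Return True if the file starts with the ASCII bytes 'paq8px'."""
    with open(path, "rb") as f:
        return f.read(len(MAGIC)) == MAGIC


def version_from_extension(path):
    """Return the version suffix of a '.paq8pxNNN' extension, or None.

    Assumption: the suffix is alphanumeric. The archive header does not
    encode the version, so the extension is the only available hint.
    """
    match = re.fullmatch(r"\.paq8px(\w+)", Path(path).suffix)
    return match.group(1) if match else None
```

A positive header check only suggests a paq8px archive; a renamed archive loses its version information entirely.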

Single file vs multiple file modes

In single-file mode, only file contents are stored – no paths, names, timestamps, attributes, permissions, or other metadata.

In multi-file mode, you may preserve such metadata via the @FILELIST mechanism (see the help screen for details).
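
A FILELIST can be generated with a short script. The sketch below follows the format described in the help screen (tab-separated, header on the first line, file names in the first column, free-form metadata in the remaining columns). The function name, the header labels, and the choice of size and modification time as metadata are assumptions made for illustration; the UTF-8 encoding and newline convention are assumptions as well.

```python
import os
from pathlib import Path


def write_filelist(root, files, listname="files"):
    """Write a tab-separated FILELIST into `root` and return its path.

    Column 1 holds paths relative to the FILELIST location. The extra
    columns (size, mtime) are metadata that paq8px preserves but ignores;
    a third-party tool could use them to restore timestamps later.
    """
    root = Path(root)
    lines = ["name\tsize\tmtime"]  # first line is a header (ignored by paq8px)
    for rel in files:
        st = os.stat(root / rel)
        lines.append(f"{rel}\t{st.st_size}\t{int(st.st_mtime)}")
    out = root / listname
    out.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out
```

With the list written to a file named files, the archive would then be created with a command such as paq8px -8 @files.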

Notes on pre-training

Warning

Archives created with pre-training options (-E, -T, -R) are fragile: decompression requires the same binary and/or the same external files.

EXE pre-training (-E):
This option pre-trains the EXE model using the paq8px.exe binary itself.
Archives created with a different paq8px.exe binary (even when built from the same source and build options) will differ.
To decompress an archive created with -E, you must use the exact same executable that created it.

Text pre-training (-T):
The word list (english.dic) and expression list (english.exp) are used only to pre-train models before compression and they are not stored in the archive.
You must have these same files available to decompress archives created with -T.

LSTM pre-trained weight repositories (-R):
If you use pre-trained LSTM repositories, ensure the same RNN weight files (english.rnn, x86_64.rnn) are available during decompression.

Warning

The LSTM repositories are temporarily unavailable in the latest release due to the refactoring of the model. The latest version supporting this feature was v209.

How to compile

Building paq8px requires a C++17-capable compiler:
https://en.cppreference.com/w/cpp/compiler_support#cpp17

Windows:
On Windows, you can download a prebuilt executable instead of compiling. Just grab the latest executable from the https://encode.su/threads/342-paq8px thread.
If you would like to build an executable yourself, you may use the Visual Studio solution file or, for MinGW-w64, the build-mingw-w64-generic-publish.cmd batch file in the build subfolder.

Linux/macOS:
The ./build folder already contains helper scripts.
You may use the following commands to build with cmake:

sudo apt-get install build-essential zlib1g-dev cmake make
cd build
./build-linux-with-cmake.sh

Testing in a Linux VM

Get a Linux VM (such as Lubuntu 25.04 Plucky Puffin), then install the required compilers and tools with the following commands:

sudo apt update
sudo apt install gcc clang gcc-aarch64-linux-gnu g++-aarch64-linux-gnu build-essential cmake zlib1g-dev

Sample build scripts are provided in the build/ folder:

build/build-linux-with-cmake.sh
build/build-linux-with-gcc.sh
build/build-linux-with-clang.sh
build/build-linux-cross-compile-aarch64.sh

Tested toolchains

The following compiler/OS combinations have been tested successfully:

Version	OS	Compiler/IDE
v210	Windows	Visual Studio 2022 Community Edition 17.14.14
v210	Windows	Microsoft (R) C/C++ Optimizing Compiler Version 19.44.35216
v210	Windows	MinGW-w64 13.0.0 (gcc-15.2.0)
v210	Lubuntu 25.04 Plucky Puffin	gcc (Ubuntu 14.2.0-19ubuntu2) 14.2.0
v210	Lubuntu 25.04 Plucky Puffin	Ubuntu clang version 20.1.2 (0ubuntu1), Target: x86_64-pc-linux-gnu
v210	Lubuntu 25.04 Plucky Puffin	aarch64-linux-gnu-gcc (Ubuntu 14.2.0-19ubuntu2) 14.2.0

Other modern C++17 compilers may also work but are not routinely tested.

Note

We build and test 64-bit releases. 32-bit releases are seldom built or tested.
A known limitation of 32-bit releases is the 2 GB memory barrier. As a consequence, compression and decompression with 32-bit releases may not work ("out of memory") on level 8 and above.
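
The level limit follows directly from the memory figures in the help text. The sketch below looks up the highest level whose listed memory use fits a given budget; the function name is hypothetical, and the figures are the typical values for binary files, so actual use may be higher (for example with special models or the LSTM flag).

```python
# Typical memory use per level in MB, copied from the LEVEL table in the help text.
LEVEL_MEMORY_MB = {
    1: 529, 2: 543, 3: 572, 4: 630,
    5: 747, 6: 980, 7: 1446, 8: 2377,
    9: 4241, 10: 7968, 11: 15421, 12: 29305,
}


def highest_level_within(budget_mb: int):
    """Return the highest level whose listed memory fits the budget, or None."""
    fitting = [level for level, mb in LEVEL_MEMORY_MB.items() if mb <= budget_mb]
    return max(fitting) if fitting else None


# A 32-bit process can address at most 2 GB (2048 MB) here.
print(highest_level_within(2048))
```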

Release checklist

When you make a new release:

Please update the version number in the "Versioning" section in the paq8px.cpp source file.
Please append a short description of your modifications to the CHANGELOG file.
Please carry out some basic tests. Run your tests with asserts on (remove the NDEBUG preprocessor directive).
Please verify that paq8px can be properly built on different platforms (i.e., test all the build scripts).
Update README.md, especially the Benchmark results.

References

Get Visual Studio 2022 Community Edition from: https://visualstudio.microsoft.com/vs/community/
Get MinGW-w64 for Windows from: https://winlibs.com/
zlib source files in the zlib folder originate from: https://github.com/madler/zlib
Get Lubuntu 25.04 Plucky Puffin for testing the build from: https://www.osboxes.org/lubuntu/

How it works

paq8px compresses files bit by bit using a technique called context mixing: multiple models make probabilistic predictions for the next bit, and a mixer combines them into a single, more accurate probability, which is then encoded with an arithmetic coder.

This approach is computationally intensive but highly adaptive, making paq8px especially effective for entropy estimation, compressibility testing and research purposes.
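
The mixing step described above can be illustrated with a toy logistic mixer. This is a simplified floating-point sketch of PAQ-style mixing, not the paq8px implementation: the class and function names, the learning rate, the initial weights, and the two constant "models" are all assumptions chosen for the demonstration, and the ideal code length stands in for the arithmetic coder.

```python
import math


def stretch(p: float) -> float:
    """Logit: maps a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


def squash(x: float) -> float:
    """Logistic function, the inverse of stretch."""
    return 1.0 / (1.0 + math.exp(-x))


def code_length_bits(p_one: float, bit: int) -> float:
    """Ideal cost in bits of coding `bit` when P(bit = 1) is `p_one`."""
    return -math.log2(p_one if bit else 1.0 - p_one)


class Mixer:
    """Toy logistic mixer: weighted sum of stretched model predictions."""

    def __init__(self, n_inputs: int, lr: float = 0.02):
        self.w = [0.3] * n_inputs
        self.lr = lr
        self.x = [0.0] * n_inputs
        self.p = 0.5

    def mix(self, probs):
        # Clamp inputs so that stretch() never sees exactly 0 or 1.
        self.x = [stretch(min(max(p, 1e-6), 1 - 1e-6)) for p in probs]
        self.p = squash(sum(w * x for w, x in zip(self.w, self.x)))
        return self.p

    def update(self, bit: int):
        # Prediction error; the update is a gradient step that lowers coding cost.
        err = bit - self.p
        self.w = [w + self.lr * err * x for w, x in zip(self.w, self.x)]


# Two hypothetical models: one is usually right about this stream, one is wrong.
mixer = Mixer(n_inputs=2)
total_bits = 0.0
for bit in [1] * 200:
    p = mixer.mix([0.9, 0.1])
    total_bits += code_length_bits(p, bit)
    mixer.update(bit)
print(f"final P(1) = {mixer.p:.3f}, weights = {mixer.w}, cost = {total_bits:.1f} bits")
```

The mixer starts undecided and gradually shifts weight toward the model that predicts the stream well, which is the adaptivity the paragraph above refers to.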

For an in-depth technical explanation, see the DOC file.

Benchmark results

Benchmark results are provided on various corpora for comparison with other compressors.
Rankings are based solely on compression ratio, not on speed or memory usage; they show the reference compressed sizes achievable on these datasets.
Results are drawn from official listings where available, or from community testing when benchmarks are no longer maintained.
Results last verified: Sept 21, 2025.

Summary:

Corpus / Benchmark	Version	Rank
Calgary	v210	#2
Canterbury	v210	#2
Silesia	v210	#1
Lossless Photo Compression Benchmark (LPCB)	v206	#1
Large Text Compression Benchmark (LTCB)	v206	#10
Darek's corpus (DBA)	v207fix1	#1
MaximumCompression benchmark	v207fix1	#1
fenwik9 benchmark by Sportman	v210	#1
World English Bible benchmark by Sportman	v208fix1	#1

For the Calgary, Canterbury, Silesia and MaximumCompression benchmarks, see Darek's post in the paq8px thread, which tracks the evolution of paq8px results up to paq8px_v207fix1.

Calgary corpus

The Calgary corpus does not have an officially maintained ranking, and most published results do not include modern experimental compressors.

Below are compressed sizes for paq8px v210 under various options, compared with cmix v21.

File	-8	-12L	-12LT	(v209) -12RT	cmix v21 (reference)
bib	19595	19520	17492	17376	17180
book1	183318	181492	175722	163431	173709
book2	113979	113143	108844	106668	105918
geo	42475	42255	42265	42367	42760
news	83023	82681	78490	77166	76389
obj1	7063	6982	6841	6892	7053
obj2	40934	40129	39820	39950	40139
paper1	12360	12317	11041	10749	10831
paper2	19538	19467	17478	16589	17169
pic	19624	19666	19669	19677	21883
progc	8870	8804	8206	8189	8193
progl	9512	9449	8876	8864	8788
progp	6378	6296	6061	6097	6126
trans	10977	10939	10056	10045	9990
Total compressed size	577'646	573'140	550'861	534'060	546'128
Compression time (approx. sec)	307	864	1231	1567	n/a

With fair options (-12LT), paq8px v210 achieves results close to cmix v21.
With unfair options (-12RT), results surpass cmix, but these should be excluded (see Benchmarking Notes).
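
To put "close to" and "surpass" into numbers, the totals from the table above can be compared as relative differences. A minimal sketch; the helper name is hypothetical.

```python
def relative_diff_percent(size: int, reference: int) -> float:
    """Signed size difference relative to the reference, in percent."""
    return (size - reference) / reference * 100.0


# Total compressed sizes from the Calgary table above.
cmix_v21 = 546_128
for label, total in [("-12LT", 550_861), ("(v209) -12RT", 534_060)]:
    print(f"{label}: {relative_diff_percent(total, cmix_v21):+.2f}% vs cmix v21")
```

A positive value means the paq8px total is larger than the cmix v21 total; a negative value means it is smaller.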

At the time of writing, paq8px v210 likely ranks #2 on Calgary behind cmix v21.

Canterbury corpus

The same general notes apply to the Canterbury corpus as to the Calgary corpus.

Below are compressed sizes for paq8px v210 under various options, compared with cmix v21.

File	-8	-12L	-12LT	(v209) -12RT	cmix v21 (reference)
alice29.txt	33065	32851	31138	28317	31076
asyoulik.txt	31512	31423	29601	28062	29434
cp.html	5405	5389	4740	4720	4746
fields.c	2027	2017	1856	1848	1909
grammar.lsp	861	862	750	732	771
kennedy.xls	8137	7849	7850	7972	7955
lcet10.txt	79119	78807	74655	72594	73365
plrabn12.txt	117451	116694	112546	108648	112263
ptt5	19624	19666	19669	19677	21883
sum	6825	6798	6657	6679	6870
xargs.1	1295	1293	1097	1061	1123
Total compressed size	305'321	303'649	290'559	280'310	291'395
Compression time (approx. sec)	256	707	1015	1352	n/a

At the time of writing, paq8px v210 likely ranks #2 on Canterbury behind cmix v21.

Silesia corpus

paq8px v210 ranked #1 in The Silesia Open Source Compression Benchmark at the time of writing.

Results for paq8px v210 together with cmix v21 as a reference:

File	-12L	cmix v21 (reference)
dickens	1'860'023	1'802'071
mozilla	6'129'742	6'634'210
mr	1'852'494	1'828'423
nci	776'723	781'325
ooffice	1'218'806	1'221'977
osdb	1'968'252	1'963'597
reymont	699'456	704'817
samba	1'589'315	1'588'875
sao	3'723'922	3'726'502
webster	4'402'064	4'271'915
xml	245'824	233'696
x-ray	3'521'286	3'503'686
Total compressed size	27'987'907	28'261'094
Compression time (approx. sec)	68'837	n/a

Here paq8px outperformed cmix v21 overall, though performance varies per file.

Lossless Photo Compression Benchmark (LPCB)

paq8px v206 ranked #1 in the Lossless Photo Compression Benchmark.

The benchmark has not been rerun for later versions.

Large Text Compression Benchmark (LTCB)

paq8px v206 ranked #10 in the Large Text Compression Benchmark at the time of writing.
Note that, unlike paq8px, most higher-ranked compressors are tuned specifically for enwik8/enwik9 and often apply enwik-specific preprocessing (e.g., word replacement, article reordering).

The benchmark has not been rerun for later versions.

Darek's corpus (DBA)

Darek's benchmark is no longer actively maintained.
This is not an exhaustive benchmark – it targets only high-end compressors.

See the latest results, including those for v207fix1, in Darek's 2022 post on the encode.su forum.

paq8px v207fix1 ranked #1 at that time.

MaximumCompression benchmark

The MaximumCompression benchmark is no longer actively maintained and has no up-to-date official listing.
The official site was last updated in 2011. At that time paq8px was ranked #1.

See paq8px evolution on the MaximumCompression benchmark up until paq8px v207fix1 in Darek's post to the encode.su forum from 2022.

Compressed sizes for v210 with compression option -12L (-12Ls for rafale.bmp).

File	-12L
A10.jpg	624023
acrord32.exe	786553
english_mc.dic	333089
FlashMX.pdf	1289571
fp.log	199933
mso97.dll	1121228
ohs.doc	452209
rafale.bmp	463390
vcfiu.hlp	245448
world95.txt	309236
Total compressed size	5'824'680
Compression time (sec)	19'384

To the best of our knowledge, paq8px's latest version, v210, would still rank #1 at the time of writing.

fenwik9 benchmark

paq8px v210 ranks #1 in the fenwik9 benchmark.
This is a non-standard but exhaustive single-file benchmark maintained by Sportman.

World English Bible benchmark (WEB)

paq8px v208fix1 ranked #1 in the World English Bible benchmark.
This is a non-standard but exhaustive single-file benchmark maintained by Sportman.

Benchmarking Notes

Warning

Using -R to load pre-trained LSTM weight repositories is unfair if the target file to be compressed was part of the training data.
Benchmarks and leaderboards change over time – rankings may shift.
Hardware does not affect compression ratio or memory use, but it does affect runtime; reported times are approximate and for context only.

PAQ8PX contribution timeline

paq8px is a branch of the PAQ compressor series, descended from earlier versions such as PAQ7 and the PAQ8 variants (e.g., PAQ8A-PAQ8P).

Development began in 2009 and remains active, supported by a global community of contributors.

Work has focused on expanding model coverage (images, audio, executables, text) with emphasis on compression ratio.

The table below highlights milestones, contributors, and notable changes over the years.

Year	Versions	Contributors & Highlights
Pre-2009	PAQ roots	Matt Mahoney: Original PAQ author. Early branches (`paq8hp`, `paq8fthis`, `paq8p3`, `lpaq1`) introduced context maps with 16-bit checksums, probabilistic state tables, specialized models (JPEG, sparse, DMC, distance-based), exe model/filter. Added directory compression and drag-and-drop (PAQ8A), BMP/PGM/JPEG/WAV support, APM/StateMap optimizations.
2009	v0–v67	Jan Ondrus: Founded `paq8px`, adding TGA/TIFF/AIFF/MOD/S3M models, PPM/PBM compression, CD sector transform, exe filters, recursive sub-blocks, WAV-model improvements. Simon Berger: TGA 24/8-bit, TIFF/AIFF improvements, MSVC fixes, compression pipeline rewrite. LovePimple: Portability fixes.
2010	v68–v69	Jan Ondrus: Added `-l` listing option, fix for multi-path file compression.
2016	v70–v75	Jan Ondrus: Add zlib recompression (initially unstable), PDF image support, Base64 transform, GIF recompression, and paq8pxd model updates (incl. im8bitModel), plus multiple bugfixes (zlib header/progress display, Base64, GIF).
2017	v76–v127	Márcio Pais: JPEG upgrades (subsampling, thumbnails, MJPEG), record/BMP models, grayscale detection, XML model, x86/x64 pre-training, PNG recompression, DEFLATE MTF + brute force, dBASE parsing, adaptive learning rate, English stemmer. Jan Ondrus: JPEG tweaks, PAM format detection, block handling, PDF 4-bit fix. Zoltán Gotthardt: Fixes, MSVC/Array/`ilog2` fixes, faster JPEG learning rate, IO improvements. Mauro Vezzosi: Bug reports, dmcModel patch.
2018	v128–v173	Márcio Pais: Extended text modeling (English/French/German stemmers, language detection, SparseMatchModel, SSE refinements, RLE/EOL transforms), 8bpp/24–32bpp image model improvements, JPEG tweaks, pre-training refinements. Zoltán Gotthardt: New CLI and file handling, DMC enhancements, hashing improvements, charGroupModel, compiler/portability fixes. Andrew Epstein: AVX2 optimizations, macOS build fixes.
2019	v174–v183	Márcio Pais: Added linearPredictionModel, audio8bModel, audio16bModel, new image/GIF/TIFF handling, text model with word embeddings. Zoltán Gotthardt: refactoring (global scope cleanup, model factory, Shared struct), improved WordModel (PDF text extraction, pre-training), enhancements to StateMap, ContextMap2, MatchModel, and NormalModel.
2020	v184–v200	Andrew Epstein: Code cleanup, modularization, Doxygen docs. Moisés Cardona: ARM/NEON support, base64 fix, SIMD work. Zoltán Gotthardt: Refactoring (predictor separation, RNG, ContextMap), Sparse/SparseBit/Indirect model improvements, fixes, cleanup. Márcio Pais: LSTM model (pre-training, retraining, x86/64 optimizations), DEC Alpha transform/model, new SSE stages. Surya Kandau: JPEG model refinements.
2021	v201–v206	Zoltán Gotthardt: Improved IndirectContext/MatchModel, added high-precision arithmetic encoder & APMPost, introduced ChartModel, MRB detection, metadata modeling, separate mixers per block type, refined text detection, and `-skipdetection` option.
2022	v207	Zoltán Gotthardt: PNG filtering moved to transform layer; DEC-Alpha detection via object signature; TAR detection/transform; base85 filter (from paq8pxd); structured-text WordModel (linemodel) enhancements; separate LSTM per main context.
2023	v208	Zoltán Gotthardt: TAR detection fixes; new -forcetext option; enhanced 1-bit image model; shifted contexts (fewer in IndirectModel, added to WordModel for TEXT); refactors; Pavel Rosický: AVX512 detection
2025	v209	Zoltán Gotthardt: Model tweaks (initialized mixer weights; corrected matchmodel context); TEXT detection fixes; build/toolchain updates
2026	v210	Zoltán Gotthardt: LSTM model enhancements

This timeline is not exhaustive; for details, see CHANGELOG.

Notable borrows

paq8px incorporates ideas and code from a range of sources, often adapted and extended to fit the project’s design:

UTF-8 detection – based on Bjoern Hoehrmann's UTF decoder DFA; integrated by Zoltán Gotthardt
Base64 transform – from paq8pxd by Kaido Orav; integrated by Jan Ondrus
Base85 transform – from paq8pxd by Kaido Orav; integrated by Zoltán Gotthardt
MRB detection – from paq8pxd by Kaido Orav; integrated with enhancements by Zoltán Gotthardt
zlib recompression – from AntiZ; integrated by Jan Ondrus
Text modeling with stemming – based on the Porter/Porter2 stemmers; integrated by Márcio Pais
Audio modeling ideas – based on 'An asymptotically Optimal Predictor for Stereo Lossless Audio Compression' by Florin Ghido; integrated with enhancements by Márcio Pais
Image modeling ideas – from Emma by Márcio Pais
EXE model – incorporates ideas from DisFilter by Fabian Giesen; integrated with enhancements by Márcio Pais
ChartModel – from paq8kx7; integrated with enhancements by Zoltán Gotthardt
MatchModel – ideas from Emma; integrated by Márcio Pais
MatchModel – improvements from paq8gen; integrated by Zoltán Gotthardt
LSTM model – adapted from cmix by Byron Knoll; integrated with enhancements by Márcio Pais, further enhancements based on ligru-compress by Zoltán Gotthardt
OLS predictor – by Sebastian Lehmann; integrated by Márcio Pais

Similar compressors

paq8pxd by Kaido Orav
cmix by Byron Knoll

Copyright

Copyright (C) 2009-2026 Matt Mahoney, Serge Osnach, Alexander Ratushnyak, Bill Pettis, Przemyslaw Skibinski, Matthew Fite, wowtiger, Andrew Paterson, Jan Ondrus, Andreas Morphis, Pavel L. Holoborodko, Kaido Orav, Simon Berger, Neill Corlett, Márcio Pais, Andrew Epstein, Mauro Vezzosi, Zoltán Gotthardt, Moisés Cardona and others.

We would like to express our gratitude for the endless support of the many contributors who encouraged paq8px development with ideas, testing, compiling, and debugging: LovePimple, Skymmer, Darek, Stephan Busch, m^2, Christian Schneider, pat357, Rugxulo, Gonzalo, a902cd23, pinguin2, Luca Biondi, and the broader community at encode.su.

License

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

See the GNU General Public License for more details at http://www.gnu.org/copyleft/gpl.html.

A summary in plain language is available at https://tldrlegal.com/license/gnu-general-public-license-v2.

