Huffman coding library

(And a test project and test files/script and fancy benchmarks with charts.)

Note that this is not intended to be used in real world systems, though you are free to. The design is fairly okay, though I'm sure improvements could be made.

License: CC-BY-SA 4.0 International. Attribution to: lucb1e

Benchmarks

Hardware: Intel Core i7 3rd gen and an SSD doing ~3500mbps r/w.

Encoding (two-pass system) first has to obtain character frequencies, which runs at roughly 500mbps, then goes on to encode data at around 100mbps.

Decoding (one pass system) works at about 160 mbps.

Tests were performed with a 150MB chunk of a virtual machine hard disk. For both encoding and decoding, memory usage of the Java process is around 10MB (+/- 0.5MB).

Compression ratio of 250 lines of Java code is about 65% (the encoded version is 65% the size of the original). The data size was 8KB.

Compression ratio of the first paragraph of Wikipedia's Huffman coding article, 4 times repeated, is about 65%. The data length was 4KB.

Todo

The InputWorkaround system is just ugly (hence the name).
Check whether concatenation of bits isn't quicker than the current method in ByteAssembler.writeBits().
Optimize the header futher (at least use a variable-length number instead of a long, and perhaps a byte for the header length?).

Hint: I probably won't actually ever do these things.

Header format

I've thought about how to do this for a while. In the end I ended up with this:

For each character that occurs in the original file, write one byte for the character and 8 bytes (long) for its frequency. Denote end with a zero-count character (since this would never occur). Example:

B00000001A00000003N00000002000000000
^----------------------------------- character (B)
 ^^^^^^^^--------------------------- count (1)
         ^-------------------------- character (A)
          ^^^^^^^^------------------ count (3)
		          ^----------------- character (N)
				   ^^^^^^^^--------- count (2)
				           ^-------- character (NULL)
						    ^^^^^^^^ count (0, denoting header end)

This format supports my principle of supporting practically infinite filesizes (2^63 bytes) but, on the downside, wastes a lot of space. A variable-length number would be better (e.g. use 1 byte for numbers <=63, 2 bytes for numbers <=8191, etc.). Another improvement is prepending a byte to indicate how long the header will actually be (i.e. how many elements it contains). This is probably still suboptimal, more bit packing could be done, but at least makes it feel as if less space is wasted!

Then there is the other problem of how to know when the file ends. At first I used a trailer byte, a byte that was appended to each file and contained the count of padded zeros. If an encoded stream is 14 bits in length, it would (excluding the header) be 3 bytes on disk: 14 bits, 2 bits padding (because you can't write half bytes) and a final byte with the value '2' to indicate that we padded two zero bits.

This worked, but afterwards I thought of a better solution: if you simply add up all frequencies from the header, you end up with the original file length. Therefore we know how many bytes we must output before we have reached the end of the file, without even knowing how many bits the stream contains. After having outputted that many bits, the rest of the data stream is simply ignored. Hint: this means you can append arbitrary data (PKZIP polygots anyone?).

Project keywords: JCF JCF41 Huffman Java Collections Framework

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
HuffmanCodingDemo		HuffmanCodingDemo
hufflib		hufflib
testfiles		testfiles
README.md		README.md
huffman-benchmarks.ods		huffman-benchmarks.ods
test.sh		test.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Huffman coding library

Contents

Benchmarks

Todo

Header format

About

Uh oh!

Releases

Packages

Languages

lucb1e/huffmancoding

Folders and files

Latest commit

History

Repository files navigation

Huffman coding library

Contents

Benchmarks

Todo

Header format

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages