This project implements an algorithm in C to compress a given data buffer of bytes. It includes a decompression algorithm to restore the buffer to its original form.
- Tested on macOS and Windows 11
- Requires the gcc compiler and make
- On macOS, run `xcode-select --install` in the terminal to install gcc and make.
- On Windows:
  - Download the MSYS2 installer from msys2.org and run it.
  - Open the MSYS2 MSYS terminal.
  - Run `pacman -S --needed base-devel mingw-w64-x86_64-toolchain`. Press Enter when prompted for which toolchains to install, then press Y to start the installation.
  - Open the MSYS2 MinGW 64-bit terminal.
  - Verify the installation by running `gcc --version` and `make --version`.
- Clone the project to a local directory. Open Terminal (macOS) or the MSYS2 MinGW 64-bit terminal (Windows) and change directory to the project folder.
- Run the following make commands:
```
make clean
make all
```
- To run the main application:
```
./main_compress
```
- To run unit tests on byte_compress:
```
./test_compression
```
- To run unit tests on array_compaction:
```
./test_array_compaction
```
- The problem statement mentions that it is common for the data in the buffer to have the same value repeated in series, and that each byte in the data buffer will contain a number from 0x00 to 0x7F.
- The design focuses on removing most of the repeated bytes. Since values larger than 0x7F never occur in the input, they can be used as a differentiator between normal bytes (which are <= 0x7F) and bytes that are modified for compression (which are > 0x7F).
- Three pieces of information are critical: the value of the repeated byte, the number of times the value repeats, and a differentiator indicating that these bytes must be processed to restore the original buffer.
1. Traverse the data in the buffer starting from index 0.
2. Find the sequences where a value repeats more than once.
3. Once such a sequence is found, add 0x80 to the repeating value (making it > 0x7F) and place the result at the index where the sequence of repeated bytes started. Because the resulting value is greater than 0x7F, the decompression algorithm knows this is not a normal data byte, and subtracting 0x80 from it recovers the value of the repeated byte.
4. Place the number of times the byte repeated at the index next to where (0x80 + repeated value) was placed.
5. The remaining repeated bytes in the sequence can be removed, which compresses the buffer and decreases its resulting size.
6. If a byte repeats more than 0xFF times, stop counting at 0xFF, perform Steps 3 to 5, and restart counting for the remaining repeated bytes. The data buffer can only hold 8-bit values and the maximum 8-bit unsigned value is 0xFF, so the sequence must be split into chunks each of size <= 0xFF.
7. A `delete_table` struct array is used to compact the buffer. Each element of this table holds two values: the starting index of the sequence that needs to be deleted, and the number of bytes to delete. Add the starting index of the remaining repeated bytes and their count to the delete table.
8. Traverse the entire data buffer, filling the delete table with all the repetitive sequences that can be removed. A worked example follows below.
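As a worked illustration of the scheme above (byte values chosen arbitrarily):

```
Input : 0x03 0x03 0x03 0x03 0x64 0x64 0x70    (7 bytes)
Marked: 0x83 0x04 [xx] [xx] 0xE4 0x02 0x70    (runs encoded in place; [xx] marked for deletion)
Output: 0x83 0x04 0xE4 0x02 0x70              (5 bytes after compaction)
```

The run of four 0x03 bytes becomes 0x83 0x04 with two bytes to delete; the run of two 0x64 bytes becomes 0xE4 0x02 with nothing to delete.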
1. Shift the bytes that need to be kept into the indices marked for deletion, compressing the array in place.
2. Traverse the delete table. Get the first sequence of elements that need to be kept and copy this sequence to the appropriate indices in the same data buffer.
3. Update the index where the next section to keep can be copied. This index is where the copying of the previous section ended.
4. Get the next sequence that needs to be kept and copy it at the index updated in Step 3. This in effect shifts the elements to the left, overwriting the indices that need to be removed.
5. Repeat Steps 3 and 4 until all the indices marked for deletion are overwritten or fall outside the final size.
6. Save the new size of the array.
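Continuing the example above, the marked buffer yields a single delete-table entry, and compaction proceeds like this:

```
delete_table     : { start_index = 2, len_to_delete = 2 }
Keep indices 0-1 : already in place at write index 0
Keep indices 4-6 : copied to write index 2
New size         : 5
```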
Pseudo-code for the byte_compress algorithm
```
byte_compress(data_ptr, data_size):
    prev_element = data_ptr[0]
    count_of_repeated_bytes = 1
    traverse data_ptr from index 1 till data_size:
        if (data_ptr[data_index] == prev_element) and (count_of_repeated_bytes < 0xFF):
            count_of_repeated_bytes++
        else:
            if (count_of_repeated_bytes >= 2):
                data_ptr[data_index - count_of_repeated_bytes] = 0x80 + prev_element
                data_ptr[(data_index - count_of_repeated_bytes) + 1] = count_of_repeated_bytes
                store [(data_index - count_of_repeated_bytes) + 2] as the starting index and
                      [count_of_repeated_bytes - 2] as the length to delete in the delete table
            count_of_repeated_bytes = 1
        prev_element = data_ptr[data_index]
    encode the final run the same way once the loop ends
    new_size = compact_array(data_ptr)
    return new_size
```
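A compilable C sketch of the algorithm above. The names (`delete_entry_t`, `MAX_DELETE_ENTRIES`, `encode_run`) and the explicit size parameter passed to `compact_array` are assumptions made for illustration; the repository's actual code may differ in detail.

```c
#include <stdint.h>

#define MAX_DELETE_ENTRIES (65535u / 3u) /* a run must span >= 3 bytes to yield an entry */

typedef struct {
    uint16_t start_index;   /* first byte of the span to remove */
    uint8_t  len_to_delete; /* number of bytes to remove        */
} delete_entry_t;

static delete_entry_t delete_table[MAX_DELETE_ENTRIES];
static uint16_t delete_count;

static uint16_t compact_array(uint8_t *data_ptr, uint16_t data_size); /* see next sketch */

/* Encode one completed run in place and record the bytes to drop. */
static void encode_run(uint8_t *data_ptr, uint16_t run_end, uint8_t value, uint8_t count)
{
    uint16_t run_start = run_end - count;

    if (count < 2u) {
        return; /* single bytes stay untouched */
    }
    data_ptr[run_start]      = (uint8_t)(0x80u + value); /* marker + repeated value */
    data_ptr[run_start + 1u] = count;                    /* repeat count            */
    if (count > 2u) { /* bytes 3..count of the run are now dead */
        delete_table[delete_count].start_index   = run_start + 2u;
        delete_table[delete_count].len_to_delete = (uint8_t)(count - 2u);
        delete_count++;
    }
}

uint16_t byte_compress(uint8_t *data_ptr, uint16_t data_size)
{
    uint8_t count = 1u;

    delete_count = 0u;
    if (data_size == 0u) {
        return 0u;
    }
    for (uint16_t i = 1u; i < data_size; i++) {
        /* data_ptr[i - count] is the first, still-unmodified byte of the current run */
        if (data_ptr[i] == data_ptr[i - count] && count < 0xFFu) {
            count++;
        } else {
            encode_run(data_ptr, i, data_ptr[i - count], count);
            count = 1u; /* a new run starts at data_ptr[i] */
        }
    }
    encode_run(data_ptr, data_size, data_ptr[data_size - count], count); /* final run */
    return compact_array(data_ptr, data_size);
}
```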
Pseudo-code for the array_compaction algorithm
```
array_compaction(data_ptr):
    data_index = 0    // next write position in data_ptr
    keep_start = 0    // start of the next section to keep
    traverse delete_table:
        copy data_ptr[keep_start .. delete_table[i].start_index) into data_ptr starting at data_index
        advance data_index by the number of bytes copied
        keep_start = delete_table[i].start_index + delete_table[i].len_to_delete
    copy the tail section starting at keep_start to data_index
    return data_index as the new size of the compressed array
```
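A matching C sketch of the compaction pass. It reuses the `delete_table`, `delete_count`, and `delete_entry_t` declarations from the byte_compress sketch above, and takes the buffer's pre-compaction size as an extra parameter, which is an assumption of this sketch.

```c
#include <stdint.h>
#include <string.h>

static uint16_t compact_array(uint8_t *data_ptr, uint16_t data_size)
{
    uint16_t write_index = 0u; /* next position to write a kept byte      */
    uint16_t keep_start  = 0u; /* start of the next section worth keeping */

    for (uint16_t i = 0u; i < delete_count; i++) {
        uint16_t keep_len = delete_table[i].start_index - keep_start;

        memmove(&data_ptr[write_index], &data_ptr[keep_start], keep_len);
        write_index += keep_len;
        keep_start = delete_table[i].start_index + delete_table[i].len_to_delete;
    }
    /* Copy the tail that follows the last deleted span. */
    memmove(&data_ptr[write_index], &data_ptr[keep_start], (size_t)(data_size - keep_start));
    write_index += data_size - keep_start;

    return write_index; /* the new, compressed size */
}
```

With both sketches in one file, `byte_compress(buffer, size)` compresses `buffer` in place and returns its new size.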
- I knew beforehand that zipping of files/folders involves removing sequences of consecutive repeated bytes. The problem statement also mentioned that the data buffer will commonly contain such sequences, so I decided to focus on that.
- I leveraged the fact that the data in the buffer contains numbers only from 0x00 to 0x7F. All the byte values from 0x80 upward are unused, so I could reserve them for the compression algorithm.
- At first, I thought of compressing each repeated sequence into 3 bytes. The first byte would be 0x80, indicating that the next 2 bytes are special bytes; this information is used by the decompression algorithm. The second byte would be the number of times the value repeated, and the third byte would be the repeated byte itself. The decompression algorithm would then repeat the value given by the third byte the number of times given by the second byte.
- To further improve the solution, I combined the information carried by the first and third bytes into one byte, compressing each repeated sequence into 2 bytes. For the combination, I add 0x80 to the value of the repeated byte. As the maximum value of the repeated byte is 0x7F, adding 0x80 to it still stays in bounds for an 8-bit unsigned integer (see the example after this list).
- If a value repeats more than 0xFF times, the sequence is split into chunks, each of size <= 0xFF.
- The next challenge was compacting the buffer. The problem statement says that on compression, the data in the buffer is modified in place and the size of the modified buffer is returned.
- My first thought was to use a temporary array: copy the elements to keep into it first, then copy the temporary array back into the original data buffer. But this would mean a space complexity of O(n).
- The more efficient way is to perform the shift in place and use a separate table-style data structure to store the indices and lengths of the sequences to be deleted. The space complexity is then O(d), where d is the number of sequences to delete; d is less than n, so this uses strictly less space than a full temporary copy.
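As a quick check on the two-byte encoding, even the largest legal data byte stays within 8 bits once the marker offset is added:

```
0x7F (max data value) + 0x80 (marker offset) = 0xFF (max 8-bit unsigned value)
```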
- The data buffer holds 8-bit unsigned integers.
- Maximum size of data buffer is 65535 (max 16-bit unsigned value).
- Indices are 0-based
- If there are no consecutive repeated numbers in the array, this algorithm will not compress the array.
- If all the sequences of consecutive repeated numbers are of size 2, the algorithm will effectively not compress the array.
- Identify further patterns in the numbers - for example, two numbers repeating as a pair, or consecutive numbers differing by a constant, or some other mathematical pattern. If such patterns are detected, a similar algorithm could be used to compress those sequences too.
- The delete_table array is defined as a global with a fixed size. If a pointer to this delete_table could be passed to the byte_compress() function, the size of the delete_table could be changed to match the size of the data buffer. This would save space when the data buffer is not of the maximum possible size. Since the problem statement fixes the signature of byte_compress() to take only data_ptr and data_size, I used the same signature in my implementation; a possible alternative is sketched below.
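A hypothetical sketch of that alternative signature (this is not the signature required by the problem statement; `delete_entry_t` is the illustrative type from the earlier sketch):

```c
/* Hypothetical: the caller supplies a delete table sized to its own buffer. */
uint16_t byte_compress_ex(uint8_t *data_ptr, uint16_t data_size,
                          delete_entry_t *delete_table, uint16_t table_capacity);
```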
- Unity is used for unit testing the byte_compress function (16 cases) and the array_compaction function (9 cases).
- Edge cases such as an input array of size 0, or of the maximum possible size of 65535, were tested.
- Both functions were tested with different types of inputs, for example arrays whose values are all 0x00 (min) or all 0x7F (max), and repetitive sequences at the beginning, middle, or end of the array.
- As the open role pertains to the embedded systems domain, I refrained from using malloc in my code. In embedded systems it is recommended to avoid malloc because memory usage cannot be predicted before run-time and dynamic allocation risks fragmenting memory.