Skip to content

Conversation

@vlerkin
Copy link
Collaborator

@vlerkin vlerkin commented Apr 1, 2025

#45

This PR adds a new class that is responsible for compression of the log files. It reads a log file from disk in chunks and writes to a temporary compressed file on a disk (an alternative would be to use an in-memory buffer which is not suitable for large log files, so in sake of user experience it uses disk). Upload function then uses the compression method to upload a file to the remote storage. Depending on the preferred compression method, a log file gets an appropriate extension because of the mapping that is added as a class attribute to the LIbcloud class.

vlerkin added 2 commits April 1, 2025 13:57
… of bytes and compresses file to gzip beforehand using a temporary file in case it has to handle large files (not in-memory buffer)
…esses files to different formats; use the instance and a compress method inside the upload method
@vlerkin vlerkin linked an issue Apr 1, 2025 that may be closed by this pull request
@vlerkin vlerkin marked this pull request as ready for review April 2, 2025 07:37
@vlerkin vlerkin requested a review from wvengen April 2, 2025 07:38
Copy link
Member

@wvengen wvengen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, you made it with all the different compression methods, way to go 🚀 :)

One thing I am missing, is not to use compression (compression method none, which may be the default, for backward compatibility).

Do we need to update the dependencies for brotli (nice btw to make it optional)?

@wvengen
Copy link
Member

wvengen commented Apr 3, 2025

p.s. opened #53 as a follow-up issue.

@vlerkin
Copy link
Collaborator Author

vlerkin commented Apr 3, 2025

Just to be clear because in the ticket you wrote that gzip should be default.

One thing I am missing, is not to use compression (compression method none, which may be the default, for backward compatibility).

would you like me to make default as None? So no compression at all?

@wvengen
Copy link
Member

wvengen commented Apr 3, 2025

Just to be clear because in the ticket you wrote that gzip should be default.

Ah, yes, gzip as default compression method. But I think it would be helpful to keep compression off by default, then we can later transition to gzip if we want to. So yes please, default none.

@vlerkin
Copy link
Collaborator Author

vlerkin commented Apr 4, 2025

Sorry, one more question to clarify, when you say

gzip as default compression method

Do you mean that if the provided by user compression method is not listed as implemented, then instead of raising an exception we would proceed with gzip and write a log that explains why we defaulted to gzip? Is this the behavior you have in mind?

@wvengen
Copy link
Member

wvengen commented Apr 4, 2025

Thanks for asking. I think it makes sense to have default compression method none - and with multiple compression methods, I don't think there is another default, so you can ignore that sentence now :) (when choosing a compression method without allowing a user to choose, then we'd best have gzip, but that's not the case here - well done).

… as None; add brotli as a dependency but still treat as optional; make a compression chunk size of 1024 a class constant
Copy link
Member

@wvengen wvengen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great, thank you!
Please let me know when you think it's ready to merge :)

Can you please verify that when passing multiple configuration files, it is still possible to reset the compression_method to the empty value? If not, then we may need to introduce none as an explicit value.

@wvengen
Copy link
Member

wvengen commented Apr 4, 2025

Perhaps having an explicit compression method of none is useful, while the empty value would bring up the default (this behaviour may be different, if in the future we would opt for a default compression method).

@vlerkin
Copy link
Collaborator Author

vlerkin commented Apr 4, 2025

This is possible, sure, would you like me to add None as an option for the compression_method? The parameter then always should be provided in the config with a desired value.

@wvengen
Copy link
Member

wvengen commented Apr 4, 2025

That would be more explicit. Then we can use none as the default value, so it does not need to be in the config file.
This change distinguishes "whatever the default is" from "no compression" as a configuration value.

@wvengen
Copy link
Member

wvengen commented Apr 4, 2025

So, yes please!

…ping compression of the log files; update config to reflect the current implementation
@vlerkin
Copy link
Collaborator Author

vlerkin commented Apr 4, 2025

For now I haven't provided any default if the compression_method is not provided in the config file, but when we extract the compression method (with "none" included) there might be a default None value. So if you want to extend the method later, you could add something you would consider as default when the parameter is not listed in the config. I hope I understood you correctly, otherwise, I can add something else.

Copy link
Member

@wvengen wvengen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! I think we're almost there, just two suggestions that could make the code clearer.
Also, please review the use of single vs. double quotes, I think it's almost consistent.

Comment on lines 176 to 189
if self.compression_method and self.compression_method != 'none' and compressed_file_path != local_path:
logger.info(
f"Successfully uploaded compressed file '{object_name}' to container '{self._container_name}'.")
else:
logger.info(f"Successfully uploaded file '{object_name}' to container '{self._container_name}'.")
except (ObjectError, ContainerDoesNotExistError, InvalidContainerNameError) as e:
logger.exception(f"Error uploading the file '{object_name}': {e}")
except Exception as e:
logger.exception(f"An unexpected error occurred while uploading '{object_name}': {e}")
finally:
# Remove temporary file even if upload fails
if compressed_file_path and compressed_file_path != local_path and os.path.exists(compressed_file_path):
os.remove(compressed_file_path)
logger.debug(f"Removed temporary compressed file '{compressed_file_path}'.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it the case the compressed_file_path is only set when the file is compressed, and stored compressed with a different extension? In that case, we don't need to test for compressed_file_path != local_path - we can just check if compressed_file_path is set, right? That would make the code more explicit, in my view.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this makes sense, thanks for noticing

Comment on lines 42 to 48
COMPRESSION_EXTENSIONS = {
'gzip': 'gz',
'bzip2': 'bz2',
'lzma': 'xz',
'brotli': 'br'
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about moving this to the Compressor, e.g. providing a method to return an extension for a compression method, or even giving a compressed filename for an uncompressed filename? Then we only have to adapt one file when we'd add a compression method.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a good point to move to the compression class but not a good idea to move responsibility of naming the final upload destination to the compression class since it is fully the responsibility of the upload method in the libcloud class. Would be better to stick to direct functions of each class. I can add a method to provide an appropriate file extension or something like this and move this mapping to the compression class.

Copy link
Member

@wvengen wvengen Apr 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm in my mind, providing the path of the compressed file (based on the uncompressed path) is somewhat related to compression, so it would fit there. E.g. most CLI tools that do compression, rename the file as well.

But if you think it belongs where it is now, fine to leave it as it is.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does already return the compressed file path which is then used by the upload function but we also need an extension type to compose the destination in the remote storage for the upload. That is why it seems good to add a new method to get the appropriate extension instead of, for example, parsing the returned compressed file name or something less transparent and maintainable. I am adding the changes, so you can see the exact thing I am referring here to.

Copy link
Member

@wvengen wvengen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super, thank you!

@wvengen wvengen merged commit 6b7bb8b into q-m:main Apr 11, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Compression for joblogs

2 participants