Description
Allow storage of parquet leaf files not as files on disk, but as a list of files that map to each partition.
This is covered neither by a single leaf parquet file like Norder=7/Dir=0/Npix=12.parquet, nor by directory leaves like Norder=7/Dir=0/Npix=12/part*.parquet.
Enrique Utrilla from ESA brought this up in the context of providing an adapter for GAIA DR4 parquet files. For local file systems, creating symbolic links to the GAIA paths is not a problem. However, once these files are in S3 or other object storage, the notion of a symlink is different and is not expected to work as simply.
@hombit seconded the idea with other use cases.
I see at least two ways this could be implemented, and we would want to converge on a preferred approach.
- Single map file - this would contain ALL of the pixels and their respective files in a single file.
/partition_file_map.csv
Norder,Npix,path
3,530,partition_003_00530_01.parquet
3,530,partition_003_00530_02.parquet
4,637,partition_004_00637_01.parquet
Pros:
- Just one file to read
Cons:
- File would grow pretty huge, since we're adding a big old string field to what was previously a very narrow CSV file
- Difficult to update, in the case of incremental data additions
- Map file as leaf file - this would contain a list of files that correspond to the particular data partition.
/Norder=3/Dir=0/Npix=530/partition_file_map.csv
Norder,Npix,path
3,530,partition_003_00530_01.parquet
3,530,partition_003_00530_02.parquet
/Norder=4/Dir=0/Npix=637/partition_file_map.csv
Norder,Npix,path
4,637,partition_004_00637_01.parquet
Pros (inverse of cons above =D ):
- Each file is very small
- Easy to append to the text file for incremental catalog additions
Cons:
- Reading many small files on cloud storage incurs a significant overhead (though this only happens when reading the data partitions, and not when opening catalog high-level metadata)
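To make the comparison concrete, here is a rough sketch of how a reader might consume either layout, using only the Python standard library. The function names (`parse_partition_file_map`, `leaf_map_path`, `merge_leaf_maps`) are hypothetical and not part of any existing API; the CSV columns and the `partition_file_map.csv` file name are taken from the examples above. The `Dir` value is passed in by the caller, since how it is derived is not covered in this issue.

```python
import csv
import io
from collections import defaultdict


def parse_partition_file_map(text):
    """Group file paths by (Norder, Npix) from CSV text with columns Norder,Npix,path."""
    mapping = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        key = (int(row["Norder"]), int(row["Npix"]))
        mapping[key].append(row["path"])
    return dict(mapping)


def leaf_map_path(norder, directory, npix):
    """Location of a per-partition map file (option 2), following the layout above."""
    return f"/Norder={norder}/Dir={directory}/Npix={npix}/partition_file_map.csv"


def merge_leaf_maps(texts):
    """Combine several per-partition map files (option 2) into one mapping."""
    merged = defaultdict(list)
    for text in texts:
        for key, paths in parse_partition_file_map(text).items():
            merged[key].extend(paths)
    return dict(merged)


# Option 1: a single map file holding every pixel.
SINGLE_MAP = """\
Norder,Npix,path
3,530,partition_003_00530_01.parquet
3,530,partition_003_00530_02.parquet
4,637,partition_004_00637_01.parquet
"""

# Option 2: one small map file per partition.
LEAF_MAPS = [
    """\
Norder,Npix,path
3,530,partition_003_00530_01.parquet
3,530,partition_003_00530_02.parquet
""",
    """\
Norder,Npix,path
4,637,partition_004_00637_01.parquet
""",
]

single = parse_partition_file_map(SINGLE_MAP)
per_leaf = merge_leaf_maps(LEAF_MAPS)
```

Both layouts produce the same in-memory mapping, so the choice is mostly about I/O: option 1 needs one read of a potentially large file, while option 2 needs one small read per partition at a path like `leaf_map_path(3, 0, 530)`.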