Consider partition leaf lists #19

@delucchi-cmu

Allow storage of parquet leaf files not as files on disk, but as a list of files that map to each partition.

Neither of the current layouts covers this: a single leaf parquet file like Norder=7/Dir=0/Npix=12.parquet, or directory leaves like Norder=7/Dir=0/Npix=12/part*.parquet.

Enrique Utrilla from ESA brought this up in the context of providing an adapter for GAIA DR4 parquet files. For local file systems, creating symbolic links to the GAIA paths is not a problem. However, once these files are in S3 or other object storage, the notion of a symlink is different, and links are not expected to work as simply.

@hombit seconded the idea with other use cases.

I see at least two ways this could be implemented, and we would want to converge on a preferred approach.

  1. Single map file - this would contain ALL of the pixels and their respective files in a single file.
/partition_file_map.csv
    Norder,Npix,path
    3,530,partition_003_00530_01.parquet
    3,530,partition_003_00530_02.parquet
    4,637,partition_004_00637_01.parquet

Pros:

  • Just one file to read

Cons:

  • File would grow pretty huge, since we're adding a big old string field to what was previously a very narrow CSV file
  • Difficult to update, in the case of incremental data additions
  2. Map file as leaf file - this would contain a list of files that correspond to the particular data partition.
/Norder=3/Dir=0/Npix=530/partition_file_map.csv
    Norder,Npix,path
    3,530,partition_003_00530_01.parquet
    3,530,partition_003_00530_02.parquet
/Norder=4/Dir=0/Npix=637/partition_file_map.csv
    Norder,Npix,path
    4,637,partition_004_00637_01.parquet

Pros (inverse of cons above =D ):

  • Each file is very small
  • Easy to append to the text file for incremental catalog additions

Cons:

  • Reading many small files on cloud storage incurs a significant overhead (though this only happens when reading the data partitions, and not when opening catalog high-level metadata)
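
To make the comparison concrete, here is a minimal Python sketch of how a reader might resolve either layout into the same in-memory mapping from (Norder, Npix) to a list of parquet paths. It assumes the `Norder,Npix,path` CSV columns shown above. The function names `parse_partition_map` and `merge_leaf_maps` are hypothetical and not part of any existing library, and the CSV contents are held in strings here rather than read from disk or object storage.

```python
import csv
import io
from collections import defaultdict


def parse_partition_map(text):
    """Group parquet paths by (Norder, Npix) from one map file's CSV text.

    Hypothetical helper: assumes the header row "Norder,Npix,path".
    """
    mapping = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        key = (int(row["Norder"]), int(row["Npix"]))
        mapping[key].append(row["path"])
    return dict(mapping)


def merge_leaf_maps(leaf_texts):
    """Combine several per-partition map files (option 2) into one mapping."""
    merged = defaultdict(list)
    for text in leaf_texts:
        for key, paths in parse_partition_map(text).items():
            merged[key].extend(paths)
    return dict(merged)


# Option 1: one catalog-level map file listing every pixel.
SINGLE_MAP = """Norder,Npix,path
3,530,partition_003_00530_01.parquet
3,530,partition_003_00530_02.parquet
4,637,partition_004_00637_01.parquet
"""

# Option 2: one small map file per leaf directory.
LEAF_MAPS = [
    """Norder,Npix,path
3,530,partition_003_00530_01.parquet
3,530,partition_003_00530_02.parquet
""",
    """Norder,Npix,path
4,637,partition_004_00637_01.parquet
""",
]

single = parse_partition_map(SINGLE_MAP)
per_leaf = merge_leaf_maps(LEAF_MAPS)

# Both layouts carry the same information; they differ in how many
# files must be fetched and how easy they are to update.
assert single == per_leaf
```

With option 1 the reader fetches one file and filters it; with option 2 it fetches only the map files for the partitions it needs, at the cost of one extra small read per partition.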
