Store parquet volume metadata on the parquet tokencounts file #40

For the last year I've been happily using feather as a backend, with the volume metadata stored on the files themselves.

But parquet is smaller than feather, and *also* supports volume and chunk-level metadata.

So rather than storing metadata in a separate `.json` file, I propose storing it in the footer of the parquet file, and reading (and caching) it when needed with [pyarrow.parquet.read_metadata](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_metadata.html).

Reasons:
1. Fewer files. The big one for me--taking up 2x as many inodes is a problem in an HPC environment.
2. More portable

Downsides:
1. Requires a pyarrow dependency for parquet, rather than flexibility about using fastparquet instead.
2. Requires either breaking existing pyarrow installations, or supporting multiple formats indefinitely.
