Develop benchmarking criteria for consistent comparison across format options #2

@asteiker

Description

Candidate criteria:

  • Formats / chunking schemes to compare (re-chunking and Kerchunk sketches follow this list)
    • Re-chunked HDF5
    • Cloud-optimized HDF5
    • GeoParquet
    • Zarr
    • Kerchunk JSON reference files
    • h5coro
  • Environment
    • CryoCloud - Small instance
    • Assume we'll store all example files in CryoCloud (i.e., in Sync or shared_public)
  • Libraries or clients used to open/read data
  • For each format option:
    • Dataset(s)
      • Based on community feedback/discussion, initial focus on ATL03
    • Files
      • Single and multiple? File sizes can vary by several GB; ideally, produce and test 10 files
    • Variable(s)
    • Spatial subset(s) (see the subsetting sketch after this list)
    • Temporal subset(s)
    • Aggregation
    • End-to-end wall clock time (a timing-harness sketch follows this list)
      • Time to re-chunk or reformat
      • Time to open/read file
        • Multiple tools/libraries/clients to compare per format option?
          • GeoPandas, xarray
          • Should we also consider Dask DataFrames?
    • Compute cost
    • Do we include a real-world example?
      • Time series of a 60-day repeat cycle
      • Real-world example tie-in: Jakobshavn surface height
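
Below are a few hedged sketches, all in Python, of what individual benchmark steps could look like. First, re-chunking one ATL03 variable with h5py. The file paths, the variable path, and the chunk shape are placeholders to be replaced with whatever the benchmark settles on; a full run would walk every group/variable in the granule.

```python
import h5py

# Placeholders: a local ATL03 granule and one photon-rate variable.
SRC = "ATL03_example.h5"
DST = "ATL03_rechunked.h5"
VAR = "gt1l/heights/h_ph"      # example beam/group/variable path
NEW_CHUNK = (1_000_000,)       # candidate chunk shape to benchmark

with h5py.File(SRC, "r") as src, h5py.File(DST, "w") as dst:
    # Copy one 1-D dataset with a new chunk shape and compression.
    dst.create_dataset(VAR, data=src[VAR][...], chunks=NEW_CHUNK,
                       compression="gzip", compression_opts=4)
```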
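
For the Kerchunk JSON option, a minimal sketch of generating a reference set for one granule. The S3 URL is a placeholder, and kerchunk, fsspec, and s3fs are assumed to be installed in the CryoCloud image.

```python
import json
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

URL = "s3://example-bucket/ATL03_example.h5"  # placeholder location

with fsspec.open(URL, "rb", anon=True) as f:
    # Scan the HDF5 file and emit a Zarr-style reference dict.
    refs = SingleHdf5ToZarr(f, URL, inline_threshold=300).translate()

with open("ATL03_example.json", "w") as out:
    json.dump(refs, out)
```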
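
The reference set can then be opened through xarray's zarr engine, which also gives one way to exercise the variable, spatial, and temporal subset criteria. The variable names follow the ATL03 photon-data layout (lat_ph/lon_ph under each beam's heights group), and the Jakobshavn bounding box is a rough placeholder.

```python
import xarray as xr

# Open one beam group through the Kerchunk references from the previous step.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    group="gt1l/heights",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "ATL03_example.json",   # reference file from above
            "remote_protocol": "s3",
            "remote_options": {"anon": True},
        },
    },
)

# Example spatial subset around Jakobshavn (placeholder bounds).
subset = ds.where(
    (ds.lat_ph > 68.5) & (ds.lat_ph < 69.5) &
    (ds.lon_ph > -51.0) & (ds.lon_ph < -48.0),
    drop=True,
)
```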
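
Finally, a tiny timing harness for the wall-clock comparisons; each entry would be one format/library pairing. The paths and the h5netcdf phony_dims workaround for plain (non-netCDF) HDF5 are assumptions, not a fixed design.

```python
import time
import h5py
import xarray as xr

def timed(fn):
    """Run fn once and return (result, elapsed seconds)."""
    t0 = time.perf_counter()
    out = fn()
    return out, time.perf_counter() - t0

# One entry per format/library pairing; paths are placeholders.
cases = {
    "h5py / re-chunked HDF5":
        lambda: h5py.File("ATL03_rechunked.h5", "r")["gt1l/heights/h_ph"][...],
    "xarray / re-chunked HDF5":
        lambda: xr.load_dataset("ATL03_rechunked.h5", engine="h5netcdf",
                                group="gt1l/heights", phony_dims="sort"),
}

for name, fn in cases.items():
    _, seconds = timed(fn)
    print(f"{name}: {seconds:.2f} s")
```

Repeating each case several times and separating cold-cache from warm-cache runs would make the wall-clock numbers more comparable across formats.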
