Commit 86ae141

Update zarr docs for zarr-python>3 and to include Zarr version 3 (#172)
1 parent 4771b52 commit 86ae141

File tree

3 files changed

+924
-396
lines changed

zarr/environment.yml

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+name: coguide-zarr
+channels:
+- conda-forge
+dependencies:
+- python=3.12
+- adlfs
+- aiohttp
+- dask
+- fsspec
+- ipykernel
+- jupyterlab
+- planetary-computer
+- pystac
+- requests
+- rich
+- xarray
+- zarr

zarr/intro.qmd

Lines changed: 20 additions & 17 deletions
@@ -7,36 +7,37 @@ subtitle: Chunked, Compressed N-Dimensional Arrays

 [Zarr](https://zarr.dev/), despite its name, is not a scary format. It's designed for data that is too big for users' machines, but Zarr makes data small and organizes it in a way where users can take just the bits they need or distribute the load of processing lots of those bits (stored as chunks) across many machines.

-The Zarr data format is a community-maintained format for large-scale n-dimensional data. A Zarr store consists of compressed and chunked n-dimensional arrays. Zarr's flexible indexing and compatibility with object storage lends itself to parallel processing.
+The Zarr data format is a community-maintained format for large-scale n-dimensional data. A Zarr store consists of chunked and compressed n-dimensional arrays. Zarr's flexible indexing and compatibility with object storage lend themselves to parallel processing.

-A Zarr chunk is Zarr's unit of data storage. Each chunk of a Zarr array is an equally-sized block of the array within a larger Zarr store comprised of one or more arrays and array groups. These blocks or chunks of data are stored separately to make reading and updating small chunks more efficient.
+A Zarr chunk is Zarr's unit of data storage. Each chunk of a Zarr array is an equally-sized block of the array within a larger Zarr store. The larger Zarr store consists of one or more arrays and array groups. The Zarr chunks are normally stored in separate objects in object storage to make reading and updating individual chunks more efficient.

-Read more in the official tutorial: [Zarr Tutorial](https://zarr.readthedocs.io/en/stable/tutorial.html)
+Read more in the official zarr-python user guide: [Zarr User Guide](https://zarr.readthedocs.io/en/stable/user-guide/)

 ## Zarr Version 2 and Version 3

 ::: {.callout-important}
-Zarr Version 3 is underway but not released yet, so all the examples in this guide are for Zarr Version 2 data. The concepts in this page are consistent across both Zarr Version 2 and Zarr Version 3, however some metadata field names and organization are changing from Version 2 to version 3.
+Zarr Version 3 represents a new specification of the same array-based data model. The concepts remain largely the same; however, some metadata field names and organization have changed. Zarr Version 3 support is included in the canonical Python implementation -- zarr-python -- as of January 2025 (read more in the [release blog post](https://zarr.dev/blog/zarr-python-3-release/)). Zarr Version 2 data is still usable in newer versions of the zarr-python library. The examples in this guide use zarr-python >= 3 and Zarr Version 3 data unless otherwise specified.
 :::

-Version 3 changes from Version 2:
+Zarr Version 3 specification changes from Version 2:

 * `dtype` has been renamed to `data_type`,
 * `chunks` has been replaced with `chunk_grid`,
 * `dimension_separator` has been replaced with `chunk_key_encoding`,
 * `order` has been replaced by the [transpose](https://zarr-specs.readthedocs.io/en/latest/v3/codecs/transpose/v1.0.html#transpose-codec-v1) codec,
+* multiple chunks can be stored within a single object on object storage (via the [sharding](https://zarr-specs.readthedocs.io/en/latest/v3/codecs/sharding-indexed/index.html) codec),
 * the separate `filters` and `compressor` fields have been combined into the single `codecs` field.
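
As a rough illustration of the renames listed above, the sketch below translates a Zarr Version 2 `.zarray` document into a Version 3 style `zarr.json` document using only the Python standard library. The helper name `sketch_v2_to_v3` and the small dtype lookup table are hypothetical, filters are not handled, and mapping Fortran order to an axis-reversing transpose is an assumption. It is not how zarr-python itself converts metadata.

```python
import json

# Hypothetical, deliberately tiny lookup from v2 NumPy-style dtype strings
# to v3 data type names; a real converter would need to cover far more cases.
V2_TO_V3_DTYPE = {"<i4": "int32", "<f4": "float32", "<f8": "float64"}


def sketch_v2_to_v3(zarray: dict) -> dict:
    """Map the v2 fields named in the list above onto their v3 counterparts."""
    if zarray.get("filters"):
        raise NotImplementedError("filters are outside the scope of this sketch")

    codecs = []
    # `order` becomes an optional transpose codec (assumption: "F" maps to
    # a permutation that reverses the axes).
    if zarray.get("order", "C") == "F":
        ndim = len(zarray["shape"])
        codecs.append(
            {"name": "transpose", "configuration": {"order": list(range(ndim))[::-1]}}
        )
    # An array-to-bytes codec is required; the dtypes in the lookup table
    # above are all little-endian.
    codecs.append({"name": "bytes", "configuration": {"endian": "little"}})
    # The v2 `compressor` becomes one more entry in `codecs`.
    compressor = zarray.get("compressor")
    if compressor:
        config = {k: v for k, v in compressor.items() if k != "id"}
        codecs.append({"name": compressor["id"], "configuration": config})

    return {
        "zarr_format": 3,
        "node_type": "array",
        "shape": zarray["shape"],
        "data_type": V2_TO_V3_DTYPE[zarray["dtype"]],
        "chunk_grid": {
            "name": "regular",
            "configuration": {"chunk_shape": zarray["chunks"]},
        },
        "chunk_key_encoding": {
            "name": "v2",
            "configuration": {"separator": zarray.get("dimension_separator", ".")},
        },
        "fill_value": zarray["fill_value"],
        "codecs": codecs,
    }


# Illustrative v2 metadata, not taken from a real dataset.
v2_metadata = {
    "zarr_format": 2,
    "shape": [1000, 1000],
    "chunks": [100, 100],
    "dtype": "<f4",
    "order": "C",
    "fill_value": 0.0,
    "filters": None,
    "compressor": {"id": "gzip", "level": 5},
    "dimension_separator": ".",
}
v3_metadata = sketch_v2_to_v3(v2_metadata)
print(json.dumps(v3_metadata, indent=2))
```

The sketch keeps the `v2` chunk key encoding so that existing chunk keys would stay valid; newly written Version 3 arrays typically use the `default` encoding instead.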

 Read more:

-* [Zarr specification version 2](https://zarr.readthedocs.io/en/stable/spec/v2.html)
-* [Zarr specification version 3.0](https://zarr.readthedocs.io/en/stable/spec/v3.html)
+* [Zarr specification version 2](https://zarr-specs.readthedocs.io/en/latest/v2/v2.0.html)
+* [Zarr specification version 3](https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html)

 ## Zarr Data Organization

 ### Arrays

-Zarr arrays are similar to numpy arrays, but chunked and compressed. We will add details about chunking and compression to this guide soon.
+Zarr arrays are similar to NumPy arrays, but chunked and compressed.

 ### Hierarchy via Groups

@@ -48,37 +49,39 @@ A Zarr array has zero or more dimensions. A Zarr array's shape is the tuple of t

 ### Coordinates and Indexes

-Zarr indexing supports array subsetting (both reading and writing) without loading the whole array into memory. Advanced indexing operations, such as block indexing, are detailed in the Zarr tutorial: [Advanced indexing](https://zarr.readthedocs.io/en/stable/tutorial.html#advanced-indexing).
+Zarr indexing supports array subsetting (both reading and writing) without loading the whole array into memory. Advanced indexing operations, such as block indexing, are detailed in the zarr-python user guide: [Advanced indexing](https://zarr.readthedocs.io/en/stable/user-guide/arrays.html#advanced-indexing).

 ::: {.callout-note}
 The Zarr format is language-agnostic, but this indexing reference is specific to Python.
 :::

-The [Xarray](https://docs.xarray.dev/) library provides a rich API for slicing and subselecting data. In addition to providing a positional index to subselect data, xarray supports label-based indexing. Labels, or coordinates, in the case of geospatial data, often include latitude and longitude (or y and x). These coordinates (also called names or labels) can be used to read and write data when the position is unknown.
+The [Xarray](https://docs.xarray.dev/) library provides a rich API for slicing and subselecting data. In addition to providing a positional index to subselect data, Xarray supports label-based indexing. Labels, or coordinates, in the case of geospatial data, often include latitude and longitude (or y and x). Another common coordinate for data cubes is time. These coordinates (also called names or labels) can be used to read and write data without needing to know the positional index value.
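
To make the difference between positional and label-based indexing concrete, the standard-library sketch below turns a latitude and longitude label into the positional indices an array read would need. The helper `nearest_index` and the half-degree grid are hypothetical; in Xarray the same lookup is done through methods such as `.sel`, so this only illustrates the idea and is not Xarray's implementation.

```python
from bisect import bisect_left


def nearest_index(coords: list[float], label: float) -> int:
    """Return the position of the coordinate value closest to `label`.

    Assumes `coords` is sorted in ascending order.
    """
    i = bisect_left(coords, label)
    if i == 0:
        return 0
    if i == len(coords):
        return len(coords) - 1
    before, after = coords[i - 1], coords[i]
    return i if (after - label) < (label - before) else i - 1


# Hypothetical half-degree global grid.
lat = [-90.0 + 0.5 * k for k in range(361)]   # -90.0 ... 90.0
lon = [-180.0 + 0.5 * k for k in range(720)]  # -180.0 ... 179.5

# Label-based request: "the value nearest to 38.9 N, 77.0 W".
row = nearest_index(lat, 38.9)
col = nearest_index(lon, -77.0)

# The positional indices can now be used against the underlying array.
print(row, col, lat[row], lon[col])
```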

 ### Consolidated Metadata

-Every Zarr array has its own metadata. When considering cloud storage options, where latency is high so total requests should be limited, it is important to consolidate metadata so all metadata can be read from one object.
+Every Zarr group and every Zarr array has its own metadata. On cloud storage, where request latency is high and the total number of requests should be kept low, it is important to consolidate metadata at the root of the Zarr store so that all metadata can be read from one object.

-Read more on [consolidating metadata](https://zarr.readthedocs.io/en/stable/tutorial.html#consolidating-metadata).
+Read more on [consolidated metadata](https://zarr.readthedocs.io/en/main/user-guide/consolidated_metadata.html).

 ## Zarr Data Storage

 ### Storage

-Zarr can be stored in memory, on disk, in Zip files, and in object storage like S3.
+At its core, Zarr is a very flexible format that does not have any requirements regarding the actual storage system. Zarr can be stored in memory, on disk, in Zip files, and in any key-value store, such as object storage like S3. Learn more in the [Storage section of the Zarr specification](https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#id25).

 ::: {.callout-note}
-Any backend that implements `MutableMapping` interface from the Python `collections` module can be used to store Zarr. Learn more and see all the options on the [`Storage (zarr.storage)`](https://zarr.readthedocs.io/en/stable/api/storage.html) documentation page.
+Zarr data chunks do not necessarily need to be stored in the same storage system as the Zarr metadata. This is what enables virtual Zarr stores (kerchunk, icechunk) where the metadata references data in legacy chunked data formats (such as NetCDF and HDF5).
 :::

-As of Zarr version 2.5, Zarr store URLs can be passed to fsspec and it will create a MutableMapping automatically.
-
 ### Chunking

 Chunking is the process of dividing the data arrays into smaller pieces. This allows for parallel processing and efficient storage.

-Once data is chunked, applications may read in 1 or many chunks. Because the data is compressed, within-chunk reads are not possible.
+Once data is chunked, applications may read in one or many chunks. Because the data is compressed at the chunk level, a chunk must be read and decompressed as a whole, so within-chunk reads are not possible.
+
+::: {.callout-note}
+Traditionally, each chunk is stored in a separate object in object storage, but with the [sharding](https://zarr-specs.readthedocs.io/en/latest/v3/codecs/sharding-indexed/index.html) codec in Zarr Version 3, several chunks can be stored within a single object. This is an important enhancement because it prevents Zarr hierarchies from containing so many objects that they become impractical to manage.
+:::
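
The relationship between a selection, the chunks it touches, and the objects those chunks live in can be sketched with a little integer arithmetic. The example below assumes a regular chunk grid and the default Version 3 chunk key encoding (a `c` prefix with `/` as separator); the function names, the 100 x 100 chunk shape, and the 4 x 4 chunks-per-shard layout are hypothetical choices for illustration.

```python
from itertools import product


def chunk_indices_for_selection(selection, chunk_shape):
    """Return the grid index of every chunk that overlaps a selection.

    `selection` holds one half-open (start, stop) pair per dimension.
    """
    per_dim = []
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size
        last = (stop - 1) // size
        per_dim.append(range(first, last + 1))
    return list(product(*per_dim))


def object_key(index):
    """Format a grid index as a key, assuming the default v3 key encoding."""
    return "c/" + "/".join(str(i) for i in index)


def shard_index(chunk_index, chunks_per_shard):
    """Return the grid index of the shard that contains a given chunk."""
    return tuple(i // n for i, n in zip(chunk_index, chunks_per_shard))


# Rows 150..249 and columns 0..99 of an array stored in 100 x 100 chunks.
selection = ((150, 250), (0, 100))
touched = chunk_indices_for_selection(selection, chunk_shape=(100, 100))

# Without sharding, every touched chunk is its own object.
unsharded_keys = [object_key(idx) for idx in touched]

# With 4 x 4 chunks per shard, both chunks fall inside the same shard object.
sharded_keys = sorted({object_key(shard_index(idx, (4, 4))) for idx in touched})

print(unsharded_keys, sharded_keys)
```

Even a small selection must fetch and decompress every chunk it overlaps in full, which is why the chunk shape is usually chosen to match the expected access pattern.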

 ### Compression
