Merge branch 'main' into gpu-codecs

TomAugspurger · web-flow · commit 202efae34fd9 · 2026-02-11T10:06:35.000-06:00
diff --git a/changes/3695.bugfix.md b/changes/3695.bugfix.md
@@ -0,0 +1 @@
+Raise error when trying to encode :class:`numpy.dtypes.StringDType` with `na_object` set.
diff --git a/docs/contributing.md b/docs/contributing.md
@@ -131,7 +131,7 @@ The hooks can be installed locally by running:
 prek install
 ```
 
-This would run the checks every time a commit is created locally. The checks will by default only run on the files modified by a commit, but the checks can be triggered for all the files by running:
+This will run the checks every time a commit is created locally. The checks will by default only run on the files modified by a commit, but the checks can be triggered for all the files by running:
 
 ```bash
 prek run --all-files
@@ -249,33 +249,33 @@ Pull requests submitted by an external contributor should be reviewed and approv
 
 Pull requests should not be merged until all CI checks have passed (GitHub Actions, Codecov) against code that has had the latest main merged in.
 
-Before merging the milestone must be set either to decide whether a PR will be in the next patch, minor, or major release. The next section explains which types of changes go in each release.
+Before merging, the milestone must be set to decide whether a PR will be in the next patch, minor, or major release. The next section explains which types of changes go in each release.
 
 ## Compatibility and versioning policies
 
 ### Versioning
 
-Versions of this library are identified by a triplet of integers with the form `<major>.<minor>.<patch>`, for example `3.0.4`. A release of `zarr-python` is associated with a new version identifier. That new identifier is generated by incrementing exactly one of the components of the previous version identifier by 1. When incrementing the `major` component of the version identifier, the `minor` and `patch` components is reset to 0. When incrementing the minor component, the patch component is reset to 0.
+Versions of this library are identified by a triplet of integers with the form `<major>.<minor>.<patch>`, for example `3.0.4`. A release of `zarr-python` is associated with a new version identifier. That new identifier is generated by incrementing exactly one of the components of the previous version identifier by 1. When incrementing the `major` component of the version identifier, the `minor` and `patch` components are reset to 0. When incrementing the minor component, the patch component is reset to 0.
 
 Releases are classified by the library changes contained in that release. This classification determines which component of the version identifier is incremented on release.
 
 * **major** releases (for example, `2.18.0` -> `3.0.0`) are for changes that will require extensive adaptation efforts from many users and downstream projects. For example, breaking changes to widely-used user-facing APIs should only be applied in a major release.
 
   Users and downstream projects should carefully consider the impact of a major release before adopting it. In advance of a major release, developers should communicate the scope of the upcoming changes, and help users prepare for them.
 
-* **minor** releases (for example, `3.0.0` -> `3.1.0`) are for changes that do not require significant effort from most users or downstream downstream projects to respond to. API changes are possible in minor releases if the burden on users imposed by those changes is sufficiently small.
+* **minor** releases (for example, `3.0.0` -> `3.1.0`) are for changes that do not require significant effort from most users or downstream projects to respond to. API changes are possible in minor releases if the burden on users imposed by those changes is sufficiently small.
 
   For example, a recently released API may need fixes or refinements that are breaking, but low impact due to the recency of the feature. Such API changes are permitted in a minor release.
 
   Minor releases are safe for most users and downstream projects to adopt.
 
 * **patch** releases (for example, `3.1.0` -> `3.1.1`) are for changes that contain no breaking or behaviour changes for downstream projects or users. Examples of changes suitable for a patch release are bugfixes and documentation improvements.
 
-  Users should always feel safe upgrading to a the latest patch release.
+  Users should always feel safe upgrading to the latest patch release.
 
 Note that this versioning scheme is not consistent with [Semantic Versioning](https://semver.org/). Contrary to SemVer, the Zarr library may release breaking changes in `minor` releases, or even `patch` releases under exceptional circumstances. But we should strive to avoid doing so.
 
-A better model for our versioning scheme is [Intended Effort Versioning](https://jacobtomlinson.dev/effver/), or "EffVer". The guiding principle off EffVer is to categorize releases based on the *expected effort required to upgrade to that release*.
+A better model for our versioning scheme is [Intended Effort Versioning](https://jacobtomlinson.dev/effver/), or "EffVer". The guiding principle of EffVer is to categorize releases based on the *expected effort required to upgrade to that release*.
 
 Zarr developers should make changes as smooth as possible for users. This means making backwards-compatible changes wherever possible. When a backwards-incompatible change is necessary, users should be notified well in advance, e.g. via informative deprecation warnings.
 
@@ -288,12 +288,12 @@ If an existing Zarr format version changes, or a new version of the Zarr format
 ## Release procedure
 
 Open an issue on GitHub announcing the release using the release checklist template:
-[https://github.com/zarr-developers/zarr-python/issues/new?template=release-checklist.md](https://github.com/zarr-developers/zarr-python/issues/new?template=release-checklist.md>). The release checklist includes all steps necessary for the release.
+[https://github.com/zarr-developers/zarr-python/issues/new?template=release-checklist.md](https://github.com/zarr-developers/zarr-python/issues/new?template=release-checklist.md). The release checklist includes all steps necessary for the release.
 
 ## Benchmarks
 
 Zarr uses [pytest-benchmark](https://pytest-benchmark.readthedocs.io/en/latest/) for running
-performance benchmarks as part of our test suite. The benchmarks can be are found in `tests/benchmarks`.
+performance benchmarks as part of our test suite. The benchmarks are found in `tests/benchmarks`.
 By default pytest is configured to run these benchmarks as plain tests (i.e., no benchmarking). To run
 a benchmark with timing measurements, use the `--benchmark-enable` when invoking `pytest`.
 
diff --git a/docs/quick-start.md b/docs/quick-start.md
@@ -1,4 +1,4 @@
-This section  will help you get up and running with
+This section will help you get up and running with
 the Zarr library in Python to efficiently manage and analyze multi-dimensional arrays.
 
 ### Creating an Array
@@ -92,7 +92,7 @@ spam[:] = np.arange(10)
 print(root.tree())
 ```
 
-This creates a group with two datasets: `foo` and `bar`.
+This creates a group hierarchy with a group (`foo`) and two arrays (`bar` and `spam`).
 
 #### Batch Hierarchy Creation
 
diff --git a/docs/user-guide/arrays.md b/docs/user-guide/arrays.md
@@ -72,7 +72,7 @@ print(z[:, 0])
 print(z[:])
 ```
 
-Read more about NumPy-style indexing can be found in the
+More information about NumPy-style indexing can be found in the
 [NumPy documentation](https://numpy.org/doc/stable/user/basics.indexing.html).
 
 ## Persistent arrays
@@ -297,7 +297,7 @@ array without loading the entire array into memory.
 Note that although this functionality is similar to some of the advanced
 indexing capabilities available on NumPy arrays and on h5py datasets, **the Zarr
 API for advanced indexing is different from both NumPy and h5py**, so please
-read this section carefully.  For a complete description of the indexing API,
+read this section carefully. For a complete description of the indexing API,
 see the documentation for the [`zarr.Array`][] class.
 
 ### Indexing with coordinate arrays
diff --git a/docs/user-guide/extending.md b/docs/user-guide/extending.md
@@ -29,7 +29,7 @@ of the array data. Examples include compression codecs, such as
 
 Custom codecs for Zarr are implemented by subclassing the relevant base class, see
 [`zarr.abc.codec.ArrayArrayCodec`][], [`zarr.abc.codec.ArrayBytesCodec`][] and
-[`zarr.abc.codec.BytesBytesCodec`][]. Most custom codecs should implemented the
+[`zarr.abc.codec.BytesBytesCodec`][]. Most custom codecs should implement the
 `_encode_single` and `_decode_single` methods. These methods operate on single chunks
 of the array data. Alternatively, custom codecs can implement the `encode` and `decode`
 methods, which operate on batches of chunks, in case the codec is intended to implement
diff --git a/docs/user-guide/groups.md b/docs/user-guide/groups.md
@@ -13,7 +13,7 @@ root = zarr.create_group(store=store)
 print(root)
 ```
 
-Groups have a similar API to the Group class from [h5py](https://www.h5py.org/).  For example, groups can contain other groups:
+Groups have a similar API to the Group class from [h5py](https://www.h5py.org/). For example, groups can contain other groups:
 
 ```python exec="true" session="groups" source="above"
 foo = root.create_group('foo')
diff --git a/docs/user-guide/storage.md b/docs/user-guide/storage.md
@@ -91,8 +91,8 @@ print(group)
 
 ## Explicit Store Creation
 
-In some cases, it may be helpful to create a store instance directly. Zarr-Python offers four
-built-in store: [`zarr.storage.LocalStore`][], [`zarr.storage.FsspecStore`][],
+In some cases, it may be helpful to create a store instance directly. Zarr-Python offers
+built-in stores: [`zarr.storage.LocalStore`][], [`zarr.storage.FsspecStore`][],
 [`zarr.storage.ZipStore`][], [`zarr.storage.MemoryStore`][], and [`zarr.storage.ObjectStore`][].
 
 ### Local Store
diff --git a/docs/user-guide/v3_migration.md b/docs/user-guide/v3_migration.md
@@ -20,7 +20,7 @@ so we can improve this guide.
 
 The goals described above necessitated some breaking changes to the API (hence the
 major version update), but where possible we have maintained backwards compatibility
-in the most widely used parts of the API. This in the [`zarr.Array`][] and
+in the most widely used parts of the API. This includes the [`zarr.Array`][] and
 [`zarr.Group`][] classes and the "top-level API" (e.g. [`zarr.open_array`][] and
 [`zarr.open_group`][]).
 
diff --git a/src/zarr/core/dtype/npy/string.py b/src/zarr/core/dtype/npy/string.py
@@ -742,6 +742,43 @@ class VariableLengthUTF8(UTF8Base[np.dtypes.StringDType]):  # type: ignore[type-
 
         dtype_cls = np.dtypes.StringDType
 
+        @classmethod
+        def from_native_dtype(cls, dtype: TBaseDType) -> Self:
+            """
+            Create an instance of this data type from a compatible NumPy data type.
+            We reject NumPy StringDType instances that have the `na_object` field set,
+            because this is not representable by the Zarr `string` data type.
+
+            Parameters
+            ----------
+            dtype : TBaseDType
+                The native data type.
+
+            Returns
+            -------
+            Self
+                An instance of this data type.
+
+            Raises
+            ------
+            DataTypeValidationError
+                If the input is not compatible with this data type.
+            ValueError
+                If the input is `numpy.dtypes.StringDType` and has `na_object` set.
+            """
+            if cls._check_native_dtype(dtype):
+                if hasattr(dtype, "na_object"):
+                    msg = (
+                        f"Zarr data type resolution from {dtype} failed. "
+                        "Attempted to resolve a zarr data type from a `numpy.dtypes.StringDType` "
+                        "with `na_object` set, which is not supported."
+                    )
+                    raise ValueError(msg)
+                return cls()
+            raise DataTypeValidationError(
+                f"Invalid data type: {dtype}. Expected an instance of {cls.dtype_cls}"
+            )
+
         def to_native_dtype(self) -> np.dtypes.StringDType:
             """
             Create a NumPy string dtype from this VariableLengthUTF8 ZDType.
diff --git a/src/zarr/core/dtype/registry.py b/src/zarr/core/dtype/registry.py
@@ -161,6 +161,10 @@ def match_dtype(self, dtype: TBaseDType) -> ZDType[TBaseDType, TBaseScalar]:
             raise ValueError(msg)
         matched: list[ZDType[TBaseDType, TBaseScalar]] = []
         for val in self.contents.values():
+            # DataTypeValidationError means "this dtype doesn't match me", which is
+            # expected and suppressed. Other exceptions (e.g. ValueError for a dtype
+            # that matches the type but has an invalid configuration) are propagated
+            # to the caller.
             with contextlib.suppress(DataTypeValidationError):
                 matched.append(val.from_native_dtype(dtype))
         if len(matched) == 1:
diff --git a/tests/test_dtype_registry.py b/tests/test_dtype_registry.py
@@ -15,9 +15,11 @@
     get_data_type_from_json,
 )
 from zarr.core.dtype.common import unpack_dtype_json
+from zarr.core.dtype.npy.string import _NUMPY_SUPPORTS_VLEN_STRING
 from zarr.dtype import (  # type: ignore[attr-defined]
     Bool,
     FixedLengthUTF32,
+    VariableLengthUTF8,
     ZDType,
     data_type_registry,
     parse_data_type,
@@ -74,6 +76,16 @@ def test_match_dtype(
         data_type_registry_fixture.register(wrapper_cls._zarr_v3_name, wrapper_cls)
         assert isinstance(data_type_registry_fixture.match_dtype(np.dtype(dtype_str)), wrapper_cls)
 
+    @pytest.mark.skipif(not _NUMPY_SUPPORTS_VLEN_STRING, reason="requires numpy with T dtype")
+    @staticmethod
+    def test_match_dtype_string_na_object_error(
+        data_type_registry_fixture: DataTypeRegistry,
+    ) -> None:
+        data_type_registry_fixture.register(VariableLengthUTF8._zarr_v3_name, VariableLengthUTF8)  # type: ignore[arg-type]
+        dtype: np.dtype[Any] = np.dtypes.StringDType(na_object=None)  # type: ignore[call-arg]
+        with pytest.raises(ValueError, match=r"Zarr data type resolution from StringDType.*failed"):
+            data_type_registry_fixture.match_dtype(dtype)
+
     @staticmethod
     def test_unregistered_dtype(data_type_registry_fixture: DataTypeRegistry) -> None:
         """

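
Note on the `match_dtype` change: the new comment in `src/zarr/core/dtype/registry.py` relies on `contextlib.suppress` swallowing only the exception type it is given, so the `ValueError` raised by `VariableLengthUTF8.from_native_dtype` reaches the caller while ordinary mismatches are skipped. The following is a minimal standalone sketch of that pattern, not zarr's code: `MismatchError`, `IntType`, `StringType`, `match`, and the dict-based "dtype" are hypothetical stand-ins, and the handling of zero or multiple matches is an assumption made only to keep the example complete.

```python
import contextlib


class MismatchError(Exception):
    """Hypothetical stand-in for zarr's DataTypeValidationError."""


class IntType:
    @classmethod
    def from_native(cls, dtype: dict) -> "IntType":
        # A plain mismatch: signalled with the exception the loop suppresses.
        if dtype.get("kind") != "int":
            raise MismatchError(f"{dtype} is not an int dtype")
        return cls()


class StringType:
    @classmethod
    def from_native(cls, dtype: dict) -> "StringType":
        if dtype.get("kind") != "string":
            raise MismatchError(f"{dtype} is not a string dtype")
        if "na_object" in dtype:
            # Right kind, unsupported configuration: raise a different
            # exception type so the matching loop does not swallow it.
            raise ValueError(f"resolution from {dtype} failed: na_object is set")
        return cls()


def match(dtype: dict, candidates=(IntType, StringType)):
    """Return the single candidate that accepts `dtype` (hypothetical helper)."""
    matched = []
    for candidate in candidates:
        # Only MismatchError is suppressed; ValueError propagates.
        with contextlib.suppress(MismatchError):
            matched.append(candidate.from_native(dtype))
    if len(matched) == 1:
        return matched[0]
    raise ValueError(f"expected exactly one match for {dtype}, got {len(matched)}")
```

In this sketch, `match({"kind": "string"})` returns a `StringType` instance, while `match({"kind": "string", "na_object": None})` raises the `ValueError` from `StringType.from_native` instead of silently reporting no match. That is the behaviour the new test `test_match_dtype_string_na_object_error` checks against the real registry.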