No extension type registry — consumers must manually decode FixedSizeBinary #423

@rustyconover

Description

Summary

Arrow JS has no mechanism to register custom getters for Arrow extension types. Columns with ARROW:extension:name and ARROW:extension:metadata field metadata always return raw bytes from get(). Every consumer must independently check metadata and decode values.

Background

The Arrow Extension Type spec (format docs) allows producers to annotate fields with semantic type information via metadata:

  • ARROW:extension:name — type identifier (e.g., "arrow.uuid", "arrow.opaque")
  • ARROW:extension:metadata — serialized type parameters (e.g., {"type_name": "hugeint", "vendor_name": "DuckDB"})
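
In Arrow JS a field's metadata is exposed as a `Map<string, string>`, so checking for these keys looks roughly like the sketch below (`getExtensionInfo` is a hypothetical helper for illustration, not part of any existing Arrow API):

```javascript
// Inspect a field's extension annotations. Arrow JS exposes `field.metadata`
// as a Map<string, string>; the Map is passed in directly so the helper is
// library-agnostic.
function getExtensionInfo(metadata) {
  const name = metadata.get('ARROW:extension:name');
  if (name === undefined) return null; // not an extension type
  const raw = metadata.get('ARROW:extension:metadata');
  let params = null;
  if (raw) {
    // The spec does not require JSON here, so fall back to the raw string.
    try { params = JSON.parse(raw); } catch { params = raw; }
  }
  return { name, params };
}

// Example: the metadata DuckDB attaches to a HUGEINT column
const metadata = new Map([
  ['ARROW:extension:name', 'arrow.opaque'],
  ['ARROW:extension:metadata', '{"type_name": "hugeint", "vendor_name": "DuckDB"}'],
]);
const info = getExtensionInfo(metadata);
// info.name === 'arrow.opaque', info.params.type_name === 'hugeint'
```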

Other Arrow implementations provide extension type registration:

  • Arrow C++: arrow::ExtensionType — register a subclass with RegisterExtensionType(), and IPC deserialization automatically produces typed arrays with custom accessors
  • Arrow Python: pyarrow.ExtensionType — register with register_extension_type(), custom __arrow_ext_deserialize__ decodes IPC data into Python objects
  • Arrow Rust: arrow::datatypes::ExtensionType trait

Arrow JS has no equivalent. Extension types are preserved in field metadata but get() returns the raw storage value (e.g., Uint8Array for FixedSizeBinary).

Impact

DuckDB with arrow_lossless_conversion=true serializes several types as Arrow extension types:

| DuckDB Type | Arrow Storage | Extension Name | Bytes |
| --- | --- | --- | --- |
| HUGEINT | FixedSizeBinary[16] | arrow.opaque | 16-byte two's complement signed int |
| UHUGEINT | FixedSizeBinary[16] | arrow.opaque | 16-byte unsigned int |
| TIME WITH TIME ZONE | FixedSizeBinary[8] | arrow.opaque | packed micros + offset |
| UUID | FixedSizeBinary[16] | arrow.uuid | 16 raw bytes |
| BIGNUM | Binary | arrow.opaque | 3-byte header + big-endian magnitude |
| VARINT | Binary | arrow.opaque | same as BIGNUM |
| BIT | Binary | arrow.opaque | padding byte + bit data |

For each of these, consumers must:

  1. Check field.metadata.get("ARROW:extension:metadata")
  2. Parse the JSON to get type_name
  3. Read raw bytes from column.data[0].values at the correct offset
  4. Interpret the binary encoding (two's complement, packed bitfields, etc.)

This is ~100 lines of manual decoding in our codebase, repeated by every consumer that reads DuckDB Arrow output.
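
For HUGEINT, steps 3 and 4 come down to something like the following sketch (`decodeHugeint` is our own helper, and the `values` array stands in for `column.data[0].values`):

```javascript
// Manual decode of one HUGEINT value: read 16 bytes from the storage buffer
// at the row's offset and interpret them as a little-endian two's-complement
// 128-bit integer. `values` is the Uint8Array backing a FixedSizeBinary[16]
// column in Arrow JS.
function decodeHugeint(values, index) {
  const dv = new DataView(values.buffer, values.byteOffset + index * 16, 16);
  const lo = dv.getBigUint64(0, true);  // low 64 bits, little-endian
  const hi = dv.getBigInt64(8, true);   // high 64 bits carry the sign
  return (hi << 64n) | lo;
}

// Example: -1 is sixteen 0xff bytes
const bytes = new Uint8Array(16).fill(0xff);
decodeHugeint(bytes, 0); // → -1n
```

Multiply this by every encoding in the table above and the ~100 lines add up quickly.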

Proposal

Add an extension type registry, similar to C++/Python:

import { registerExtensionType } from 'apache-arrow';

registerExtensionType({
  name: 'arrow.opaque',         // matches ARROW:extension:name
  match: (metadata) => {        // optional: filter by extension metadata
    const parsed = JSON.parse(metadata);
    return parsed.type_name === 'hugeint';
  },
  get: (data, index) => {       // custom getter, replaces default
    const dv = new DataView(data.values.buffer, data.values.byteOffset + index * 16, 16);
    const lo = dv.getBigUint64(0, true);
    const hi = dv.getBigUint64(8, true);
    const raw = lo | (hi << 64n);
    if (raw & (1n << 127n)) {
      const mask = (1n << 128n) - 1n;
      return -(((raw ^ mask) + 1n) & mask);
    }
    return raw;
  },
});

After registration, vector.get(i) on a HUGEINT column would return a BigInt directly instead of a Uint8Array.

This could also support a serialize method for the write path, making round-trip extension types fully supported.
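
For the HUGEINT example above, a hypothetical `serialize` hook would simply invert the proposed getter (a sketch only — `registerExtensionType` and its `serialize` option do not exist in Arrow JS today):

```javascript
// Hypothetical serialize counterpart for the 'arrow.opaque' hugeint getter:
// encode a BigInt back into 16 little-endian two's-complement bytes.
function serializeHugeint(value) {
  const mask = (1n << 128n) - 1n;
  const raw = value & mask;          // two's-complement wrap into 128 bits
  const out = new Uint8Array(16);
  const dv = new DataView(out.buffer);
  dv.setBigUint64(0, raw & 0xffffffffffffffffn, true); // low 64 bits
  dv.setBigUint64(8, raw >> 64n, true);                // high 64 bits
  return out;
}

serializeHugeint(-1n); // → sixteen 0xff bytes
```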

Alternatives

  • Do nothing: consumers continue to manually decode. Works, but fragile and duplicated.
  • Vendor-specific packages: e.g., @duckdb/arrow-extensions that monkey-patches Arrow's visitor. Feasible but hacky.
  • Local fork of get.mjs: what we currently do via Vite alias. Maintenance burden.

Context

We maintain a DuckDB WASM frontend that displays query results through Arrow IPC. Every DuckDB extension type requires custom byte-level decoding because Arrow JS can't be taught about them. The same decoding logic would need to be written by anyone consuming DuckDB, Spark, or other engines that use Arrow extension types in JS.
