Summary
Arrow JS has no mechanism to register custom getters for Arrow extension types. Columns carrying `ARROW:extension:name` and `ARROW:extension:metadata` field metadata always return raw storage bytes from `get()`, so every consumer must independently check the metadata and decode values.
Background
The Arrow Extension Type spec (format docs) allows producers to annotate fields with semantic type information via metadata:
- `ARROW:extension:name` — type identifier (e.g., `"arrow.uuid"`, `"arrow.opaque"`)
- `ARROW:extension:metadata` — serialized type parameters (e.g., `{"type_name": "hugeint", "vendor_name": "DuckDB"}`)
Other Arrow implementations provide extension type registration:
- Arrow C++: `arrow::ExtensionType` — register a subclass with `RegisterExtensionType()`; IPC deserialization then automatically produces typed arrays with custom accessors
- Arrow Python: `pyarrow.ExtensionType` — register with `register_extension_type()`; a custom `__arrow_ext_deserialize__` decodes IPC data into Python objects
- Arrow Rust: the `arrow::datatypes::ExtensionType` trait
Arrow JS has no equivalent. Extension types are preserved in field metadata, but `get()` returns the raw storage value (e.g., a `Uint8Array` for `FixedSizeBinary`).
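Today the only recourse is for each consumer to inspect the field metadata itself. A minimal sketch of that detection step (the `extensionInfo` helper is ours, not part of Arrow JS; `field.metadata` is the `Map<string, string>` Arrow JS exposes on `Field`):

```javascript
// Detect an extension column by inspecting Arrow field metadata.
// The {type_name, vendor_name} shape of the parsed JSON follows
// DuckDB's arrow.opaque usage; other producers may differ.
function extensionInfo(field) {
  const name = field.metadata?.get('ARROW:extension:name');
  if (!name) return null; // not an extension column
  const raw = field.metadata.get('ARROW:extension:metadata');
  let params = null;
  if (raw) {
    try { params = JSON.parse(raw); } catch { /* non-JSON metadata: leave null */ }
  }
  return { name, params };
}

// A plain object standing in for an Arrow JS Field:
const field = {
  metadata: new Map([
    ['ARROW:extension:name', 'arrow.opaque'],
    ['ARROW:extension:metadata', '{"type_name": "hugeint", "vendor_name": "DuckDB"}'],
  ]),
};
extensionInfo(field); // → { name: 'arrow.opaque', params: { type_name: 'hugeint', … } }
```

This only identifies the column; decoding the bytes is still a separate, per-type effort.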
Impact
DuckDB with `arrow_lossless_conversion=true` serializes several types as Arrow extension types:
| DuckDB Type | Arrow Storage | Extension Name | Bytes |
| --- | --- | --- | --- |
| HUGEINT | FixedSizeBinary[16] | arrow.opaque | 16-byte two's-complement signed int |
| UHUGEINT | FixedSizeBinary[16] | arrow.opaque | 16-byte unsigned int |
| TIME WITH TIME ZONE | FixedSizeBinary[8] | arrow.opaque | packed micros + offset |
| UUID | FixedSizeBinary[16] | arrow.uuid | 16 raw bytes |
| BIGNUM | Binary | arrow.opaque | 3-byte header + big-endian magnitude |
| VARINT | Binary | arrow.opaque | same as BIGNUM |
| BIT | Binary | arrow.opaque | padding byte + bit data |
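Even the simplest entry, `arrow.uuid`, needs hand-written formatting. A sketch (`formatUuid` is our illustrative helper, operating on the `Uint8Array` that `get()` returns today):

```javascript
// Format an arrow.uuid value by hand: 16 raw bytes → canonical UUID string.
function formatUuid(bytes) {
  const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join('');
  // Canonical grouping: 8-4-4-4-12 hex digits.
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20),
  ].join('-');
}

formatUuid(new Uint8Array(16)); // → '00000000-0000-0000-0000-000000000000'
```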
For each of these, consumers must:
- Check `field.metadata.get("ARROW:extension:metadata")`
- Parse the JSON to get `type_name`
- Read raw bytes from `column.data[0].values` at the correct offset
- Interpret the binary encoding (two's complement, packed bitfields, etc.)
This is ~100 lines of manual decoding in our codebase, repeated by every consumer that reads DuckDB Arrow output.
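Applied to HUGEINT, those steps look roughly like this (`decodeHugeint` is illustrative; the 16-byte little-endian two's-complement layout matches the table above):

```javascript
// Manual HUGEINT decode — the kind of code every consumer currently repeats.
// `values` stands in for column.data[0].values: a Uint8Array of packed
// 16-byte little-endian two's-complement integers.
function decodeHugeint(values, index) {
  const dv = new DataView(values.buffer, values.byteOffset + index * 16, 16);
  const lo = dv.getBigUint64(0, true); // low 64 bits, little-endian
  const hi = dv.getBigInt64(8, true);  // high 64 bits carry the sign
  return (hi << 64n) | lo;
}

// -1 is sixteen 0xff bytes in two's complement:
decodeHugeint(new Uint8Array(16).fill(0xff), 0); // → -1n
```

Reading the high word with `getBigInt64` makes the sign handling implicit, but every consumer still has to know that trick.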
Proposal
Add an extension type registry, similar to C++/Python:
```js
import { registerExtensionType } from 'apache-arrow';

registerExtensionType({
  name: 'arrow.opaque',        // matches ARROW:extension:name
  match: (metadata) => {       // optional: filter by extension metadata
    const parsed = JSON.parse(metadata);
    return parsed.type_name === 'hugeint';
  },
  get: (data, index) => {      // custom getter, replaces the default
    const dv = new DataView(data.values.buffer, data.values.byteOffset + index * 16, 16);
    const lo = dv.getBigUint64(0, true);
    const hi = dv.getBigUint64(8, true);
    const raw = lo | (hi << 64n);
    if (raw & (1n << 127n)) {
      const mask = (1n << 128n) - 1n;
      return -(((raw ^ mask) + 1n) & mask);
    }
    return raw;
  },
});
```
After registration, `vector.get(i)` on a HUGEINT column would return a `BigInt` directly instead of a `Uint8Array`.
This could also support a `serialize` method for the write path, making round-trip extension types fully supported.
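For instance, a `serialize` hook for HUGEINT could wrap a `BigInt` back into its 16 storage bytes. A sketch only (the hook name and signature are assumptions of this proposal; the byte layout mirrors the decode direction):

```javascript
// Hypothetical write-path counterpart: BigInt → 16-byte little-endian
// two's-complement HUGEINT storage.
function serializeHugeint(value) {
  const mask = (1n << 128n) - 1n;
  const raw = value & mask; // two's-complement wrap into 128 bits
  const out = new Uint8Array(16);
  const dv = new DataView(out.buffer);
  dv.setBigUint64(0, raw & 0xffffffffffffffffn, true); // low 64 bits
  dv.setBigUint64(8, raw >> 64n, true);                // high 64 bits
  return out;
}

serializeHugeint(-1n); // → sixteen 0xff bytes
```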
Alternatives
- Do nothing: consumers continue to manually decode. Works, but fragile and duplicated.
- Vendor-specific packages: e.g., a `@duckdb/arrow-extensions` that monkey-patches Arrow's visitor. Feasible but hacky.
- Local fork of `get.mjs`: what we currently do via a Vite alias. Maintenance burden.
Context
We maintain a DuckDB WASM frontend that displays query results through Arrow IPC. Every DuckDB extension type requires custom byte-level decoding because Arrow JS can't be taught about them. The same decoding logic would need to be written by anyone consuming DuckDB, Spark, or other engines that use Arrow extension types in JS.