
feat: add array_sum, array_product, array_avg functions and list_min alias #21376

Open
crm26 wants to merge 4 commits into apache:main from crm26:feat/array-aggregate-functions

crm26 commented Apr 4, 2026

Which issue does this PR close?

Closes gaps in DataFusion's array function coverage compared to DuckDB (list_sum, list_aggregate) and Trino (reduce).

Rationale for this change

DataFusion has array_min and array_max but no array_sum, array_product, or array_avg. These are common operations on array columns that currently require verbose workarounds (UNNEST + aggregate + ARRAY_AGG).

What changes are included in this PR?

New functions:

  • array_sum / list_sum: sum of all elements in an array (the return type matches the element type)
  • array_product / list_product: product of all elements (rejects Decimal types, because multiplying the raw integer representations ignores the scale and produces incorrect results)
  • array_avg / list_avg: arithmetic mean, always returns Float64
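
To illustrate why array_product rejects Decimal input, here is a minimal standalone sketch of the scale problem. It does not use Arrow or DataFusion types: `Dec`, `naive_product`, and `scaled_product` are hypothetical names introduced only for this example, and the sketch assumes a decimal is stored as a raw integer plus a scale.

```rust
/// A fixed-point decimal stored as a raw integer plus a scale, so the
/// represented value is `raw / 10^scale`. Hypothetical type, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Dec {
    raw: i128,
    scale: u32,
}

/// Multiplies only the raw integers and keeps the scale of the first input.
/// This is the naive behaviour that gives wrong results for decimals.
fn naive_product(values: &[Dec]) -> Option<Dec> {
    let first = values.first()?;
    let raw: i128 = values.iter().map(|d| d.raw).product();
    Some(Dec { raw, scale: first.scale })
}

/// Multiplies the raw integers and adds the scales, which is what exact
/// decimal multiplication requires.
fn scaled_product(values: &[Dec]) -> Option<Dec> {
    if values.is_empty() {
        return None;
    }
    let raw: i128 = values.iter().map(|d| d.raw).product();
    let scale: u32 = values.iter().map(|d| d.scale).sum();
    Some(Dec { raw, scale })
}
```

For 1.50 and 2.00 stored with scale 2 (raw values 150 and 200), `naive_product` yields raw 30000 at scale 2, which reads as 300.00, while `scaled_product` yields raw 30000 at scale 4, which reads as 3.0000. The result scale also grows with the number of elements, so it cannot be a fixed return type for arrays of varying length.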

Bug fixes:

  • Added missing list_min alias to ArrayMin (parity with list_max on ArrayMax)
  • Extended convert_to_f64_array in vector_math.rs to handle Int8, Int16, UInt8, UInt16, UInt32, UInt64

Implementation:

  • array_sum and array_product use the same downcast_primitive! + offset-window pattern as array_min/array_max for zero-copy performance
  • array_avg converts elements to Float64 via the shared convert_to_f64_array primitive
  • All functions: NULL elements skipped, all-NULL/empty arrays return NULL, List and LargeList supported, FixedSizeList coerced automatically
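
The offset-window pattern and the NULL rules above can be sketched in plain Rust as follows. This is not the code in this PR: it models a list column as a flat `values` slice plus an `offsets` slice instead of using Arrow arrays, the names `list_sum` and `list_avg` are illustrative, and it assumes the average divides by the count of non-NULL elements.

```rust
/// Sums each row of a flattened list column.
///
/// `values` is the flat child buffer (`None` marks a NULL element) and
/// `offsets` has one more entry than there are rows: row `i` covers
/// `values[offsets[i]..offsets[i + 1]]`.
fn list_sum(values: &[Option<i64>], offsets: &[usize]) -> Vec<Option<i64>> {
    offsets
        .windows(2)
        .map(|w| {
            // Borrow the row's window directly; no elements are copied.
            let row = &values[w[0]..w[1]];
            let mut sum = 0i64;
            let mut seen = false;
            // `flatten` skips NULL elements.
            for v in row.iter().flatten() {
                sum += *v;
                seen = true;
            }
            // An empty or all-NULL row produces NULL.
            if seen { Some(sum) } else { None }
        })
        .collect()
}

/// Averages each row after converting elements to f64.
/// Assumption: the divisor is the number of non-NULL elements.
fn list_avg(values: &[Option<i64>], offsets: &[usize]) -> Vec<Option<f64>> {
    offsets
        .windows(2)
        .map(|w| {
            let row = &values[w[0]..w[1]];
            let mut total = 0.0f64;
            let mut count = 0usize;
            for v in row.iter().flatten() {
                total += *v as f64;
                count += 1;
            }
            if count == 0 { None } else { Some(total / count as f64) }
        })
        .collect()
}
```

With `values = [1, 2, 3, 4, NULL, 5, NULL, NULL]` and `offsets = [0, 4, 6, 6, 8]`, the four rows are `[1, 2, 3, 4]`, `[NULL, 5]`, `[]`, and `[NULL, NULL]`. `list_sum` returns `[10, 5, NULL, NULL]` and `list_avg` returns `[2.5, 5.0, NULL, NULL]`.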

Dependencies: Builds on #21371 (vector distance + array math functions) for shared vector_math.rs primitives.

Are these changes tested?

Yes — 41 new sqllogictest cases covering:

  • Integer, float, unsigned integer inputs
  • NULL element handling (skip NULLs, all-NULL → NULL)
  • Empty array and NULL input
  • LargeList support
  • Multi-row queries
  • Alias tests (list_sum, list_product, list_avg, list_min)
  • Type preservation (arrow_typeof assertions)
  • Error cases (string input, Decimal rejection, no arguments)

All existing tests pass (79 unit tests + 2 doctests + sqllogictests).

Are there any user-facing changes?

Yes — three new SQL functions available:

SELECT array_sum([1, 2, 3, 4]);        -- 10
SELECT array_product([2, 3, 4]);       -- 24
SELECT array_avg([1, 2, 3, 4]);        -- 2.5
SELECT list_min([3, 1, 4, 2]);         -- 1 (new alias)

crm26 and others added 4 commits April 4, 2026 16:24
Add 6 new scalar functions to datafusion-functions-nested:
- cosine_distance(array, array) — cosine distance (1 - cosine similarity)
- inner_product(array, array) — dot product
- array_normalize(array) — L2 unit normalization
- array_add(array, array) — element-wise addition
- array_subtract(array, array) — element-wise subtraction
- array_scale(array, float) — scalar multiplication

Shared math primitives (dot_product, magnitude, sum_of_squares) extracted
into vector_math.rs to avoid duplication across functions.

Includes aliases (list_*, dot_product), 29 unit tests, and a sqllogictest
file with vector search pattern coverage.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Adds cosine_distance, inner_product, array_normalize, array_add,
array_subtract, and array_scale to datafusion-functions-nested.

Shared primitives in vector_math.rs (dot_product_f64, magnitude_f64,
sum_of_squares_f64, convert_to_f64_array) are reused across all
functions and the existing array_distance. Consolidates the duplicate
convert_to_f64_array from distance.rs into the shared module.

Functions:
  cosine_distance(a, b) → float64    (aliases: list_cosine_distance)
  inner_product(a, b) → float64      (aliases: list_inner_product, dot_product)
  array_normalize(a) → list(float64) (aliases: list_normalize)
  array_add(a, b) → list(float64)    (aliases: list_add)
  array_subtract(a, b) → list(float64) (aliases: list_subtract)
  array_scale(a, f) → list(float64)  (aliases: list_scale)

Enables vector search in standard SQL:
  SELECT id, cosine_distance(embedding, ARRAY[0.1, 0.2, ...]) as dist
  FROM documents ORDER BY dist LIMIT 10

79 tests, sqllogictest coverage, clippy clean.

…st_min alias

Adds three new array aggregate scalar functions to datafusion-functions-nested:

- array_sum / list_sum: sum of all elements in an array
- array_product / list_product: product of all elements (rejects Decimal types
  where scale adjustment would produce incorrect results)
- array_avg / list_avg: arithmetic mean, always returns Float64

Also adds the missing list_min alias to ArrayMin for parity with list_max.

Extends convert_to_f64_array in vector_math.rs to handle all numeric types
(Int8, Int16, UInt8, UInt16, UInt32, UInt64) in addition to existing coverage.

Includes comprehensive sqllogictest coverage: NULL handling, type preservation,
alias tests, error cases, LargeList, multi-row, UInt8, and Decimal rejection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and functions (Changes to functions implementation) labels Apr 4, 2026
