feat: add array_sum, array_product, array_avg functions and list_min alias #21376
Open
crm26 wants to merge 4 commits into apache:main
Add 6 new scalar functions to datafusion-functions-nested:

- cosine_distance(array, array): cosine distance (1 - cosine similarity)
- inner_product(array, array): dot product
- array_normalize(array): L2 unit normalization
- array_add(array, array): element-wise addition
- array_subtract(array, array): element-wise subtraction
- array_scale(array, float): scalar multiplication

Shared math primitives (dot_product, magnitude, sum_of_squares) extracted into vector_math.rs to avoid duplication across functions.

Includes aliases (list_*, dot_product), 29 unit tests, and a sqllogictest file with vector search pattern coverage.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds cosine_distance, inner_product, array_normalize, array_add, array_subtract, and array_scale to datafusion-functions-nested.

Shared primitives in vector_math.rs (dot_product_f64, magnitude_f64, sum_of_squares_f64, convert_to_f64_array) are reused across all functions and the existing array_distance. Consolidates the duplicate convert_to_f64_array from distance.rs into the shared module.

Functions:

- cosine_distance(a, b) → float64 (aliases: list_cosine_distance)
- inner_product(a, b) → float64 (aliases: list_inner_product, dot_product)
- array_normalize(a) → list(float64) (aliases: list_normalize)
- array_add(a, b) → list(float64) (aliases: list_add)
- array_subtract(a, b) → list(float64) (aliases: list_subtract)
- array_scale(a, f) → list(float64) (aliases: list_scale)

Enables vector search in standard SQL:

    SELECT id, cosine_distance(embedding, ARRAY[0.1, 0.2, ...]) as dist
    FROM documents
    ORDER BY dist
    LIMIT 10

79 tests, sqllogictest coverage, clippy clean.
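For readers who want to see how cosine_distance can be composed from the shared primitives, here is a minimal sketch over plain f64 slices. The helper names mirror those listed in the commit message, but the signatures and bodies are illustrative assumptions, not the code in vector_math.rs. Returning None for a length mismatch or a zero-magnitude input is also an assumption made for this sketch; the PR text does not say how those cases are handled.

```rust
/// Dot product of two slices (pairs beyond the shorter slice are ignored).
/// Assumed signature; the real helper may operate on Arrow arrays instead.
fn dot_product_f64(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Sum of the squared elements of a slice.
fn sum_of_squares_f64(a: &[f64]) -> f64 {
    a.iter().map(|x| x * x).sum()
}

/// L2 norm (Euclidean length) of a slice.
fn magnitude_f64(a: &[f64]) -> f64 {
    sum_of_squares_f64(a).sqrt()
}

/// Cosine distance, defined as 1 - cosine similarity.
/// Assumption: length mismatches and zero-magnitude inputs yield None.
fn cosine_distance(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.len() != b.len() {
        return None;
    }
    let denom = magnitude_f64(a) * magnitude_f64(b);
    if denom == 0.0 {
        return None;
    }
    Some(1.0 - dot_product_f64(a, b) / denom)
}
```

With these definitions, cosine_distance(&[1.0, 0.0], &[0.0, 1.0]) gives Some(1.0) for orthogonal vectors, and two identical vectors such as [3.0, 4.0] give Some(0.0).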
…st_min alias

Adds three new array aggregate scalar functions to datafusion-functions-nested:

- array_sum / list_sum: sum of all elements in an array
- array_product / list_product: product of all elements (rejects Decimal types where scale adjustment would produce incorrect results)
- array_avg / list_avg: arithmetic mean, always returns Float64

Also adds the missing list_min alias to ArrayMin for parity with list_max.

Extends convert_to_f64_array in vector_math.rs to handle all numeric types (Int8, Int16, UInt8, UInt16, UInt32, UInt64) in addition to existing coverage.

Includes comprehensive sqllogictest coverage: NULL handling, type preservation, alias tests, error cases, LargeList, multi-row, UInt8, and Decimal rejection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
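One way to see why array_product rejects Decimal input: a fixed-point decimal is stored as an unscaled integer plus a scale, and multiplying two values adds their scales. The sketch below uses a hypothetical Decimal struct (not an Arrow or DataFusion type) to contrast a naive product that keeps the input scale with a scale-aware one. It only illustrates the arithmetic and makes no claim about how the PR detects or reports the error.

```rust
/// Hypothetical fixed-point decimal: the real value is `unscaled / 10^scale`.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Decimal {
    unscaled: i128,
    scale: u32,
}

/// Naive product: multiplies the raw integers but keeps the scale of the
/// first input, which is the mistake the Decimal rejection guards against.
fn naive_product(values: &[Decimal]) -> Option<Decimal> {
    if values.is_empty() {
        return None;
    }
    let unscaled: i128 = values.iter().map(|d| d.unscaled).product();
    Some(Decimal { unscaled, scale: values[0].scale })
}

/// Scale-aware product: the scales add up under multiplication.
fn scaled_product(values: &[Decimal]) -> Option<Decimal> {
    if values.is_empty() {
        return None;
    }
    let unscaled: i128 = values.iter().map(|d| d.unscaled).product();
    let scale: u32 = values.iter().map(|d| d.scale).sum();
    Some(Decimal { unscaled, scale })
}

/// Converts to f64 for display only; large values lose precision.
fn to_f64(d: Decimal) -> f64 {
    d.unscaled as f64 / 10f64.powi(d.scale as i32)
}
```

For 1.50 and 2.00 stored at scale 2 (unscaled 150 and 200), the raw product is 30000. Read at scale 2 that is 300.00, while the scale-aware result at scale 4 is 3.0000. The result scale also grows with the number of elements, so it cannot be fixed in a column type, which is a plausible reason to reject Decimal outright.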
Which issue does this PR close?
Closes gaps in DataFusion's array function coverage compared to DuckDB (list_sum, list_aggregate) and Trino (reduce).
Rationale for this change
DataFusion has array_min and array_max but no array_sum, array_product, or array_avg. These are common operations on array columns that currently require verbose workarounds (UNNEST + aggregate + ARRAY_AGG).
What changes are included in this PR?
New functions:
- array_sum / list_sum: sum of all elements in an array (same return type as the element type)
- array_product / list_product: product of all elements (rejects Decimal types, where raw integer multiplication produces incorrect results due to scale)
- array_avg / list_avg: arithmetic mean, always returns Float64
Bug fixes:
- Add the missing list_min alias to ArrayMin (parity with list_max on ArrayMax)
- Extend convert_to_f64_array in vector_math.rs to handle Int8, Int16, UInt8, UInt16, UInt32, UInt64
Implementation:
- array_sum and array_product use the same downcast_primitive! + offset-window pattern as array_min/array_max for zero-copy performance
- array_avg converts elements to Float64 via the shared convert_to_f64_array primitive
Dependencies: Builds on #21371 (vector distance + array math functions) for shared vector_math.rs primitives.
Are these changes tested?
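Yes, 41 new sqllogictest cases covering:
- Aliases (list_sum, list_product, list_avg, list_min)
- Type preservation (arrow_typeof assertions)
- NULL handling
- Error cases, including Decimal rejection
- LargeList, multi-row, and UInt8 inputs
All existing tests pass (79 unit tests + 2 doctests + sqllogictests).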
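As a rough model of what these cases exercise, the sketch below walks a list column the way the offset-window pattern does: each row is a window into one flat values buffer, so nothing is copied. ListColumn is a hypothetical stand-in for an Arrow list array, fixed to i64 elements, whereas the PR dispatches over element types with downcast_primitive!. The NULL rules here are assumptions for illustration only: NULL elements are skipped, and NULL rows, empty arrays, and all-NULL arrays produce NULL. Overflow handling is left out.

```rust
/// Hypothetical, simplified list column: one flat values buffer, row
/// boundaries in `offsets` (length = rows + 1), and a validity flag per row.
struct ListColumn {
    values: Vec<Option<i64>>,
    offsets: Vec<usize>,
    valid: Vec<bool>,
}

impl ListColumn {
    fn num_rows(&self) -> usize {
        self.valid.len()
    }

    /// Borrows row `i` as the window values[offsets[i]..offsets[i + 1]].
    /// Returns None when the row itself is NULL.
    fn row(&self, i: usize) -> Option<&[Option<i64>]> {
        if self.valid[i] {
            Some(&self.values[self.offsets[i]..self.offsets[i + 1]])
        } else {
            None
        }
    }
}

/// Per-row sum that keeps the element type (i64 in, i64 out).
fn array_sum(col: &ListColumn) -> Vec<Option<i64>> {
    (0..col.num_rows())
        .map(|i| {
            let mut seen = false;
            let mut total = 0i64;
            // `flatten` skips NULL elements inside the row.
            for v in col.row(i)?.iter().flatten() {
                seen = true;
                total += *v;
            }
            if seen { Some(total) } else { None }
        })
        .collect()
}

/// Per-row arithmetic mean; always produces f64, whatever the element type.
fn array_avg(col: &ListColumn) -> Vec<Option<f64>> {
    (0..col.num_rows())
        .map(|i| {
            let mut count = 0usize;
            let mut total = 0.0f64;
            for v in col.row(i)?.iter().flatten() {
                count += 1;
                total += *v as f64;
            }
            if count > 0 { Some(total / count as f64) } else { None }
        })
        .collect()
}
```

For a column with the rows [1, 2, 3], NULL, [], and [4, NULL], this sketch yields sums of 6, NULL, NULL, 4 and averages of 2.0, NULL, NULL, 4.0 under the assumptions above.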
Are there any user-facing changes?
Yes, three new SQL functions are available: