perf: Implement groups accumulator count distinct primitive types#21561
perf: Implement groups accumulator count distinct primitive types#21561coderfender wants to merge 20 commits intoapache:mainfrom
Conversation
|
Requesting to run benchmarks TPCH , TPCDS , clickbench etc to see how groups accumulator impl performs in count(distinct) use case |
|
run benchmark tpch tpcds clickbench_partitiomed clickbemch_extended |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing implement_groups_accumulator_count_distinct_primitive_types (f0b2a4a) to 16e578d (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing implement_groups_accumulator_count_distinct_primitive_types (f0b2a4a) to 16e578d (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing implement_groups_accumulator_count_distinct_primitive_types (f0b2a4a) to 16e578d (merge-base) diff File an issue against this benchmark runner |
|
🤖 Criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing implement_groups_accumulator_count_distinct_primitive_types (f0b2a4a) to 16e578d (merge-base) diff File an issue against this benchmark runner |
|
Benchmark for this request failed. Last 20 lines of output: Click to expandFile an issue against this benchmark runner |
1 similar comment
|
Benchmark for this request failed. Last 20 lines of output: Click to expandFile an issue against this benchmark runner |
5692893 to
ee0c865
Compare
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
|
@Dandandan could you retrigger the benchmarks again please? |
|
run benchmark tpch tpcds clickbench_partitioned clickbench_extended |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing implement_groups_accumulator_count_distinct_primitive_types (ee0c865) to 16e578d (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing implement_groups_accumulator_count_distinct_primitive_types (ee0c865) to 16e578d (merge-base) diff using: clickbench_extended File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing implement_groups_accumulator_count_distinct_primitive_types (ee0c865) to 16e578d (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing implement_groups_accumulator_count_distinct_primitive_types (ee0c865) to 16e578d (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_extended — base (merge-base)
clickbench_extended — branch
File an issue against this benchmark runner |
5b7f593 to
067a6b6
Compare
|
Comparison with main |
|
@Dandandan true! I also benchmarked through |
2aa6549 to
cbfee35
Compare
cbfee35 to
c2bc9d3
Compare
|
@Dandandan , could you re run the benchmarks please ? I do see a consistent speedup with Q9 and seems like this PR is ready for review |
|
run benchmark clickbench_partitioned |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing implement_groups_accumulator_count_distinct_primitive_types (47110a8) to 882be98 (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
Thank you for the approval @Dandandan . but it doesn't seem like the benchmarks ran correctly. Could you please trigger it again and perhaps even toch tpcds and other relevant queries please? |
| EmitTo::First(n) => n, | ||
| }; | ||
|
|
||
| let mut group_values: Vec<Vec<T::Native>> = vec![Vec::new(); num_emitted]; |
There was a problem hiding this comment.
Something like this (Claude-generated) probably is faster?
// Prefix-sum counts[..num_emitted] into offsets
let mut offsets = Vec::with_capacity(num_emitted + 1);
offsets.push(0i32);
let mut total = 0i32;
for &c in &self.counts[..num_emitted] {
total += c as i32;
offsets.push(total);
}
let mut all_values = vec![T::Native::default(); total as usize];
let mut cursors: Vec<i32> = offsets[..num_emitted].to_vec();
if matches!(emit_to, EmitTo::All) {
for (group_idx, value) in self.seen.drain() {
let pos = cursors[group_idx] as usize;
all_values[pos] = value;
cursors[group_idx] += 1;
}
self.counts.clear();
} else {
let mut remaining = HashSet::default();
for (group_idx, value) in self.seen.drain() {
if group_idx < num_emitted {
let pos = cursors[group_idx] as usize;
all_values[pos] = value;
cursors[group_idx] += 1;
} else {
remaining.insert((group_idx - num_emitted, value));
}
}
self.seen = remaining;
let _ = emit_to.take_needed(&mut self.counts);
There was a problem hiding this comment.
Ah yes . I tried having one vector but that my implementation more complex than I thought I could maintain . But let me give this a try and get back to you to see if that helps with benchmarks
| let list_array = values[0].as_list::<i32>(); | ||
|
|
||
| for (row_idx, &group_idx) in group_indices.iter().enumerate() { | ||
| let inner = list_array.value(row_idx); |
There was a problem hiding this comment.
Something like this should be slightly faster:
let inner_values = inner.values();
let offsets = list_array.offsets();
for (row_idx, &group_idx) in group_indices.iter().enumerate() {
let start = offsets[row_idx] as usize;
let end = offsets[row_idx + 1] as usize;
for &value in &inner_values[start..end] {
if self.seen.insert((group_idx, value)) {
self.counts[group_idx] += 1;
}
}
}
| offsets.push(all_values.len() as i32); | ||
| } | ||
|
|
||
| let values_array = Arc::new(PrimitiveArray::<T>::from_iter_values(all_values)); |
There was a problem hiding this comment.
| let values_array = Arc::new(PrimitiveArray::<T>::from_iter_values(all_values)); | |
| let values_array = Arc::new(PrimitiveArray::<T>::new(ScalarBuffer::from(all_values), None)); |
There was a problem hiding this comment.
This is faster as well @coderfender (avoids a copy).
|
run benchmarks |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing implement_groups_accumulator_count_distinct_primitive_types (066504d) to 5a427cb (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing implement_groups_accumulator_count_distinct_primitive_types (066504d) to 5a427cb (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing implement_groups_accumulator_count_distinct_primitive_types (066504d) to 5a427cb (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
@Dandandan , pushed both optimizations. |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
|
run benchmarks |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing implement_groups_accumulator_count_distinct_primitive_types (617e790) to 5a427cb (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing implement_groups_accumulator_count_distinct_primitive_types (617e790) to 5a427cb (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing implement_groups_accumulator_count_distinct_primitive_types (617e790) to 5a427cb (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
Which issue does this PR close?
Evaluate perf with group accumulators for count distinct
Related : #21575 benchmark PR
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?