perf: pool zstd encoders via sync.Pool and reduce allocations #380
jpkrohling wants to merge 1 commit into splunk:main from
Conversation
Add zstd encoder pooling to amortize encoder creation cost across writer instances, matching the pattern used by OTLP exporters.

Changes:

- Add sync.Pool for *zstd.Encoder with WithEncoderConcurrency(1)
- Add Close() to FrameEncoder to return encoders to the pool
- Add Close() to generated writers (template + regenerated code)
- Clear the stale writer reference before returning an encoder to the pool
- Fix constructor error paths to avoid leaking pooled encoders
- Pre-allocate uvarintBuf to avoid an allocation per frame restart
- Fix exporter Shutdown to flush, close the writer under the mutex, and nil-guard the flusher to prevent a panic after shutdown

Benchmark (repeated writer create/write/close, the exporter pattern):

| FrameEncoderZstd/pooled | before | after | change |
|---|---|---|---|
| time/op | 1059µs | 3.7µs | -99.65% |
| B/op | 19.1MB | 391B | -100.00% |
| allocs/op | 55 | 6 | -89.09% |

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
For context: I compared STEF with other options, and the performance of STEF+zstd was quite bad in terms of CPU and memory. That seemed unreasonable to me, so I compared how this uses zstd with how we use it in the Collector and found that STEF isn't pooling the encoders. This change should bring the performance on par with OTLP gRPC/HTTP with zstd.
@jpkrohling I would like to understand your use case better before we go ahead with this. STEF streams are long-lasting, and each stream creates only one zstd encoder per stream lifetime, so the rate of zstd encoder creation should be very low. I don't see how this can have a significant performance impact unless you are somehow creating and destroying a lot of STEF streams at a high rate. If that's what you are doing, I want to understand why. The Collector exporter/receiver should not be doing that; the streams are supposed to be kept open for several minutes.
Good to challenge that! I had tried to isolate the STEF parts in my benchmark when comparing with OTLP, and that didn't reflect the actual behavior of the exporter. I refactored the benchmark so that it's closer to reality, and the results are really great. I'll try to get this implemented in the next few days in an internal component here and report back. In any case, I'd appreciate some feedback about the benchmark. Does it look sane to you? Am I missing something else?
Summary
- sync.Pool for zstd encoders with WithEncoderConcurrency(1) to amortize encoder creation cost across writer instances
- Close() method on FrameEncoder and all generated writers to return encoders to the pool
- Pre-allocated uvarintBuf to avoid an allocation per frame restart
- Shutdown flushes/closes the writer under the mutex, with a nil-guard in the flusher

Benchmark
Repeated writer create/write/close cycle (the exporter pattern):
Test plan
- Tests pass (go/pkg, go/otel, go/pdata, otelcol, all examples)
- BenchmarkFrameEncoderZstd validates pooling performance
- Close() is idempotent (nil-guarded, safe to call multiple times)
- Generated code regenerated (stefc)

Resolves E-1535
🤖 Generated with Claude Code