Implement adaptive sampling instead of probabilistic sampling #1967

@zmalik

Description

On high-throughput clusters, the current static dataSamplingRate configuration requires manual tuning and cannot respond to changing traffic conditions. Users must guess an appropriate sampling rate: a rate that is too aggressive causes unnecessary data loss during quiet periods, while a rate that is too permissive causes buffer overflow during traffic spikes.

Because the sampling rate is fixed, it cannot adapt to load changes. When retina_lost_events_total starts climbing, there is no automatic mechanism to reduce event volume.

Describe the solution you'd like
Related to feature request #1966.

Implement adaptive sampling using BPF ring buffer back-pressure (requires kernel 5.8+). With ring buffers, bpf_ringbuf_reserve() returns NULL when the buffer is full, providing natural back-pressure without explicit sampling logic.

This approach provides:

  • No added sampling overhead while the buffer has capacity: no random number generation or map lookups per packet
  • Automatic adaptation: events are dropped only when the buffer is actually full
  • Configurable capacity: users tune buffer size rather than sampling rate
  • Predictable behavior: buffer size directly controls memory usage and burst capacity

This should be implemented alongside the BPF ring buffer feature request, as it depends on BPF_MAP_TYPE_RINGBUF (kernel 5.8+).
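
To make the drop-on-full behavior concrete, here is a minimal userspace model in plain C. It is a sketch, not BPF code and not the kernel's implementation: `model_ring`, `model_reserve`, `model_consume`, `model_produce_burst`, and `MODEL_SLOTS` are hypothetical names, capacity is counted in fixed-size events rather than bytes, and reserve and submit are collapsed into one step. The only property it tries to mirror is that a reserve attempt fails when the buffer is full, and that this failure is the entire drop decision.

```c
#include <stdint.h>
#include <stddef.h>

#define MODEL_SLOTS 8u /* hypothetical capacity, counted in events */

/* Hypothetical event layout, for illustration only. */
struct model_event {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint64_t bytes;
};

struct model_ring {
    struct model_event slots[MODEL_SLOTS];
    uint64_t head; /* total events produced */
    uint64_t tail; /* total events consumed */
    uint64_t lost; /* plays the role of a lost-events counter */
};

/* Returns a slot to fill, or NULL when the ring is full (the event is dropped). */
static struct model_event *model_reserve(struct model_ring *rb)
{
    if (rb->head - rb->tail == MODEL_SLOTS) {
        rb->lost++;
        return NULL;
    }
    return &rb->slots[rb->head++ % MODEL_SLOTS];
}

/* Copies the oldest event into *out. Returns 1 on success, 0 if the ring is empty. */
static int model_consume(struct model_ring *rb, struct model_event *out)
{
    if (rb->tail == rb->head)
        return 0;
    *out = rb->slots[rb->tail++ % MODEL_SLOTS];
    return 1;
}

/* Produces n events and returns how many were accepted. */
static unsigned model_produce_burst(struct model_ring *rb, unsigned n)
{
    unsigned accepted = 0;
    for (unsigned i = 0; i < n; i++) {
        struct model_event *e = model_reserve(rb);
        if (!e)
            continue; /* no RNG, no rate lookup: a full buffer is the only drop condition */
        e->src_ip = i;
        e->dst_ip = 0;
        e->bytes = 64;
        accepted++;
    }
    return accepted;
}
```

With an empty ring of 8 slots, a burst of 5 is fully accepted, a second burst of 5 accepts 3 and drops 2, and draining one event frees exactly one slot for the next producer.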

Describe alternatives you've considered

  1. BPF map-based rate control - Userspace monitors load and writes sampling rate to a BPF map that the BPF program reads per-packet. Adds map lookup overhead and has feedback delay between userspace detection and BPF adjustment.

  2. Token bucket in BPF - Implement rate limiting entirely in BPF using per-CPU maps. Complex to implement correctly with per-CPU state management and token refill logic.
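
For comparison, the token bucket alternative can be sketched as follows. This is a plain C model of the refill logic only, under stated assumptions: `bucket`, `bucket_allow`, and the parameter names are hypothetical, and the caller supplies the timestamp. In an actual BPF program the state would presumably live in a per-CPU map (for example BPF_MAP_TYPE_PERCPU_ARRAY) and the timestamp would come from a helper such as bpf_ktime_get_ns(), and that wiring is the part the issue calls complex.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ull

/* Hypothetical per-CPU bucket state. */
struct bucket {
    uint64_t tokens;  /* tokens currently available */
    uint64_t last_ns; /* time up to which refill has been accounted */
};

/*
 * Returns 1 if the event may be emitted, 0 if it should be dropped.
 * No overflow guard: assumes (now_ns - last_ns) * rate_per_sec fits in 64 bits.
 */
static int bucket_allow(struct bucket *b, uint64_t now_ns,
                        uint64_t rate_per_sec, uint64_t burst)
{
    uint64_t elapsed = now_ns - b->last_ns;
    uint64_t refill = elapsed * rate_per_sec / NSEC_PER_SEC;

    if (refill > 0) {
        b->tokens += refill;
        if (b->tokens > burst)
            b->tokens = burst;
        /* Advance only by the time that produced whole tokens,
         * so fractional progress toward the next token is kept. */
        b->last_ns += refill * NSEC_PER_SEC / rate_per_sec;
    }

    if (b->tokens == 0)
        return 0;
    b->tokens--;
    return 1;
}
```

Even in this reduced form, every event pays for a timestamp read, a multiply and divide, and a state update, which is the per-packet cost the ring buffer approach avoids.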

Additional context

This feature is tied to the BPF ring buffer implementation. Ring buffers provide natural back-pressure that eliminates the need for explicit adaptive sampling logic. The buffer itself becomes the adaptation mechanism. Users configure buffer size based on their memory budget and acceptable burst capacity, and the system automatically drops events only when that capacity is exceeded.
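
As a rough aid for choosing a buffer size, the arithmetic can be sketched in plain C. The helper names (`record_bytes`, `ring_size_for_burst`, `burst_capacity`) are hypothetical. The sketch assumes the ring buffer size must be a power of two and at least one page, and that each record carries an 8-byte header and is padded to an 8-byte boundary. The header and padding figures reflect my understanding of the current kernel implementation and should be checked against the target kernel.

```c
#include <stdint.h>

#define RECORD_HDR_BYTES 8u /* assumed per-record header size */

/* Bytes one event occupies in the ring: payload padded to 8 bytes, plus header. */
static uint64_t record_bytes(uint64_t payload_bytes)
{
    return ((payload_bytes + 7u) & ~7ull) + RECORD_HDR_BYTES;
}

/* Smallest power of two >= v. Returns 0 if the result would not fit in 64 bits. */
static uint64_t round_up_pow2(uint64_t v)
{
    uint64_t p = 1;
    if (v > (1ull << 63))
        return 0;
    while (p < v)
        p <<= 1;
    return p;
}

/* Ring size in bytes needed to absorb `events` records with no consumer progress.
 * page_size is assumed to be a power of two (e.g. 4096). */
static uint64_t ring_size_for_burst(uint64_t events, uint64_t payload_bytes,
                                    uint64_t page_size)
{
    uint64_t need = events * record_bytes(payload_bytes);
    if (need < page_size)
        need = page_size;
    return round_up_pow2(need);
}

/* Number of records a ring of ring_bytes can hold before reserve starts failing. */
static uint64_t burst_capacity(uint64_t ring_bytes, uint64_t payload_bytes)
{
    return ring_bytes / record_bytes(payload_bytes);
}
```

For example, under these assumptions a 40-byte event occupies 48 bytes, so absorbing a burst of 1000 events needs 48000 bytes, which rounds up to a 64 KiB ring that holds 1365 such events.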

Reference: https://nakryiko.com/posts/bpf-ringbuf/ "BPF ring buffer provides a special BPF_RB_NO_WAKEUP flag that can be used to avoid waking up user-space when buffer space is available, as well as BPF_RB_FORCE_WAKEUP to force wake-up."

Related to #655, which kickstarted our internal investigation.
