Here are details on one of many hosts that exhibit this issue, and what results from it:
- Every 15-second window, one process (`calico-felix`) `fork()`s ~6000 times over a 6-second window, then immediately `exec()`s `ipset` on top of those `fork()`'ed copies another ~6000 times. `lightswitch` is not running for this time range, and we don't see anything in the BPF kernel logs (a sketch of a bpftrace one-liner that produces counts like these follows the output below):
13:33:13
13:33:14
13:33:15
13:33:16
13:33:17
13:33:18
13:33:19
@[tracepoint:sched:sched_process_exec, ipset]: 425
@[tracepoint:sched:sched_process_fork, calico-felix]: 426
13:33:20
@[tracepoint:sched:sched_process_fork, calico-felix]: 1269
@[tracepoint:sched:sched_process_exec, ipset]: 1269
13:33:21
@[tracepoint:sched:sched_process_fork, calico-felix]: 1261
@[tracepoint:sched:sched_process_exec, ipset]: 1261
13:33:22
@[tracepoint:sched:sched_process_fork, calico-felix]: 1209
@[tracepoint:sched:sched_process_exec, ipset]: 1210
13:33:23
@[tracepoint:sched:sched_process_fork, calico-felix]: 1172
@[tracepoint:sched:sched_process_exec, ipset]: 1172
13:33:24
@[tracepoint:sched:sched_process_fork, calico-felix]: 480
@[tracepoint:sched:sched_process_exec, ipset]: 481
13:33:25
13:33:26
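For reference, a minimal bpftrace sketch that produces per-second counts in this shape; the exact invocation used for the output above may have differed, and the comm filters are an assumption based on the two processes involved:

```
# Sketch only: count fork()s attributed to calico-felix and exec()s of ipset,
# printing and clearing the counts once per second. The comm filters are
# assumptions; drop them to count all processes.
bpftrace -e '
  tracepoint:sched:sched_process_fork /comm == "calico-felix"/ { @[probe, comm] = count(); }
  tracepoint:sched:sched_process_exec /comm == "ipset"/        { @[probe, comm] = count(); }
  interval:s:1 { time("%H:%M:%S\n"); print(@); clear(@); }'
```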
- Now, `lightswitch` is started. During the above events, you will now see the following data drops in the BPF kernel log; again, only if `lightswitch` is running:
CPU:70 [LOST 160 EVENTS]
CPU:10 [LOST 82 EVENTS]
CPU:64 [LOST 80 EVENTS]
CPU:64 [LOST 81 EVENTS]
CPU:64 [LOST 82 EVENTS]
CPU:52 [LOST 81 EVENTS]
CPU:52 [LOST 81 EVENTS]
CPU:26 [LOST 84 EVENTS]
- Confirmed with bpftrace that the number of `exit()` tracepoints firing matches the number of `fork()` tracepoints firing whether `lightswitch` is running or not, so eBPF itself isn't missing any exit events.
- Since these processes are so short-lived, `lightswitch` is only capturing stacks for a small fraction of them. But it's enough to overload things.
- The resource consumption logs indicate that:
  - The number of new short-lived processes we've detected via incoming samples is fairly high, as one would expect
  - But we're only detecting the `exit()` of a fraction of the processes that we know fired the `exit()` tracepoint, per a separate, simultaneous bpftrace run
- Example: After adding debug messages to the new-process detection code as well as the process exit code, we see messages like the following. They indicate that if the process is short-lived enough, the process_exit tracepoint fires before we've had a chance to add the new process to the `exec_mappings` map, such that the BPF `process_is_known()` (distinct from the Rust/userspace `process_is_known()` function) cannot detect that the process is being tracked before the process `exit()` has already occurred:

2025-10-11T13:53:25.645911Z DEBUG lightswitch::profiler: 55 new processes this session, 1493 new processes since profiler startup
2025-10-11T13:53:25.645912Z DEBUG lightswitch::profiler: 0 process exits this session; 80 process exits since profiler startup

<...>-188302 [042] d..h1.. 2149412.246371: bpf_trace_printk: [debug] new PID 188302 DETECTION queued
ipset-188302 [042] ....2.1 2149412.246681: bpf_trace_printk: [debug] exit tracer PID 188302 is NOT KNOWN
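To make the race concrete, here is a minimal bpftrace sketch (separate from lightswitch itself) that histograms how long these ipset processes live between exec() and exit(); the premise is that if a process's lifetime is shorter than the time userspace needs to insert it into `exec_mappings`, the BPF-side `process_is_known()` lookup in the exit path can only miss:

```
# Sketch only: histogram of ipset lifetimes from exec() to exit(), in microseconds.
# Lifetimes shorter than the userspace round-trip that populates exec_mappings
# would explain the "exit tracer PID ... is NOT KNOWN" messages above.
bpftrace -e '
  tracepoint:sched:sched_process_exec /comm == "ipset"/ { @start[pid] = nsecs; }
  tracepoint:sched:sched_process_exit /@start[pid]/ {
    @lifetime_us = hist((nsecs - @start[pid]) / 1000);
    delete(@start[pid]);
  }'
```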
- It's not yet clear exactly where the events are being lost; continuing to investigate.