High Process fork()/exec() rates cause drops and missing process exits #324

@gmarler

Description

Here are details on one of many hosts that exhibit this issue, and what results from it:

  • In every 15-second window, one process (calico-felix) fork()s ~6000 times over a 6-second span, then immediately exec()s ipset on top of those fork()ed copies, another ~6000 times. lightswitch is not running during this time range, and nothing appears in the BPF kernel logs:
13:33:13

13:33:14

13:33:15

13:33:16

13:33:17

13:33:18

13:33:19
@[tracepoint:sched:sched_process_exec, ipset]: 425
@[tracepoint:sched:sched_process_fork, calico-felix]: 426
13:33:20
@[tracepoint:sched:sched_process_fork, calico-felix]: 1269
@[tracepoint:sched:sched_process_exec, ipset]: 1269
13:33:21
@[tracepoint:sched:sched_process_fork, calico-felix]: 1261
@[tracepoint:sched:sched_process_exec, ipset]: 1261
13:33:22
@[tracepoint:sched:sched_process_fork, calico-felix]: 1209
@[tracepoint:sched:sched_process_exec, ipset]: 1210
13:33:23
@[tracepoint:sched:sched_process_fork, calico-felix]: 1172
@[tracepoint:sched:sched_process_exec, ipset]: 1172
13:33:24
@[tracepoint:sched:sched_process_fork, calico-felix]: 480
@[tracepoint:sched:sched_process_exec, ipset]: 481
13:33:25

13:33:26

  • Now lightswitch is started. During the same fork()/exec() burst, the BPF kernel log reports the following data drops; these appear only while lightswitch is running:
CPU:70 [LOST 160 EVENTS]
CPU:10 [LOST 82 EVENTS]
CPU:64 [LOST 80 EVENTS]
CPU:64 [LOST 81 EVENTS]
CPU:64 [LOST 82 EVENTS]
CPU:52 [LOST 81 EVENTS]
CPU:52 [LOST 81 EVENTS]
CPU:26 [LOST 84 EVENTS]
  • Confirmed with bpftrace that the number of exit() tracepoints firing matches the number of fork() tracepoints firing whether lightswitch is running or not, so eBPF itself isn't missing any exit events.
  • Since these processes are so short lived, lightswitch is only capturing stacks for a small fraction of them. But it's enough to overload things.
  • The resource consumption logs indicate that:
    • The number of new short-lived processes we've detected via incoming samples is fairly high, as one would expect
    • But we're only detecting the exit() of a fraction of the processes that we know fired the exit() tracepoint, per a separate, simultaneous bpftrace session
      Example:
      2025-10-11T13:53:25.645911Z DEBUG lightswitch::profiler: 55 new processes this session, 1493 new processes since profiler startup
      2025-10-11T13:53:25.645912Z DEBUG lightswitch::profiler: 0 process exits this session; 80 process exits since profiler startup
      
      After adding debug messages to both the new-process detection code and the process-exit code, we see messages like the following. They indicate that if the process is short-lived enough, the process_exit tracepoint fires before we've had a chance to add the new process to the exec_mappings map. As a result, the BPF process_is_known() (distinct from the Rust/userspace process_is_known() function) cannot find the process before its exit() has already occurred:
            <...>-188302  [042] d..h1.. 2149412.246371: bpf_trace_printk: [debug] new PID 188302 DETECTION queued
            ipset-188302  [042] ....2.1 2149412.246681: bpf_trace_printk: [debug] exit tracer PID 188302 is NOT KNOWN
      
      
  • It's not yet clear exactly where the events are being lost; the investigation is ongoing.
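To make the suspected race above concrete, here is a minimal userspace sketch of it. Everything here is a hypothetical stand-in, not lightswitch's actual implementation: a HashSet behind a Mutex models the BPF exec_mappings map, timed sleeps model the userspace insertion latency and the process lifetime, and the function name simulate_short_lived_exit is invented for illustration.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

const PID: u32 = 188302;

/// Simulate the race: userspace inserts a newly detected PID into a
/// stand-in for the exec_mappings map after `insert_delay`, while the
/// exit tracer checks membership after `exit_delay`. Returns whether
/// the exit tracer found the PID (the process_is_known() outcome).
fn simulate_short_lived_exit(insert_delay: Duration, exit_delay: Duration) -> bool {
    let exec_mappings: Arc<Mutex<HashSet<u32>>> = Arc::new(Mutex::new(HashSet::new()));

    // Userspace side: the new-process event is queued immediately, but
    // the actual map insert only lands after some processing latency.
    let map = Arc::clone(&exec_mappings);
    let detector = thread::spawn(move || {
        thread::sleep(insert_delay);
        map.lock().unwrap().insert(PID);
    });

    // Kernel side: the sched_process_exit tracepoint fires after
    // `exit_delay` and checks whether the PID is already tracked.
    let map = Arc::clone(&exec_mappings);
    let exit_tracer = thread::spawn(move || {
        thread::sleep(exit_delay);
        map.lock().unwrap().contains(&PID)
    });

    let known = exit_tracer.join().unwrap();
    detector.join().unwrap();
    known
}

fn main() {
    // An ipset-like child execs and exits within ~0.3 ms, while the
    // userspace insert takes ~5 ms to land: the exit wins the race.
    let known = simulate_short_lived_exit(
        Duration::from_millis(5),
        Duration::from_micros(300),
    );
    println!(
        "[debug] exit tracer PID {PID} is {}",
        if known { "KNOWN" } else { "NOT KNOWN" }
    );
}
```

With a long-lived process (exit check well after the insert) the same function returns true, which matches the observation that only sufficiently short-lived processes are missed.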
