Description
When triggering multiple container starts simultaneously via Durable Objects, some containers report running = true but never actually execute their entrypoint script. This results in builds getting stuck indefinitely.
Environment
- Wrangler version: 4.61.1
- Node version: v22.x
- OS: macOS (development), Cloudflare Workers (production)
- Container instance_type: basic (1/4 vCPU, 1 GiB memory)
- max_instances: 20
Configuration

```toml
[[containers]]
class_name = "BuilderContainer"
image = "../../fly-builder/Dockerfile"
max_instances = 20
instance_type = "basic"
rollout_step_percentage = [100]

[[queues.consumers]]
queue = "guidebook-build-queue"
max_batch_size = 1
max_batch_timeout = 30
max_retries = 3
max_concurrency = 2  # Variable in testing
```

Reproduction Steps
- Set up a Durable Object that starts containers via ctx.container.start()
- Use a Queue consumer to process build jobs (a minimal consumer sketch follows this list)
- Trigger 16 build jobs simultaneously
- Observe that some containers never execute despite start() succeeding
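For context, the consumer side looks roughly like the sketch below. The BUILDER_CONTAINER binding name, the /start route, and the BuildParams shape are illustrative assumptions rather than the exact production code; the Durable Object's fetch handler (not shown in the report) routes the /start request to startBuild().

```ts
// Illustrative queue consumer: one Durable Object (and therefore one container)
// per build job. Binding name, route, and payload shape are assumptions.
interface Env {
  BUILDER_CONTAINER: DurableObjectNamespace;
}

interface BuildParams {
  buildId: string;
  env?: Record<string, string>;
}

export default {
  async queue(batch: MessageBatch<BuildParams>, env: Env): Promise<void> {
    // With max_batch_size = 1 each batch carries a single build job.
    for (const message of batch.messages) {
      const id = env.BUILDER_CONTAINER.idFromName(message.body.buildId);
      const stub = env.BUILDER_CONTAINER.get(id);

      const res = await stub.fetch("https://builder/start", {
        method: "POST",
        body: JSON.stringify(message.body),
      });

      // Defer to the queue's retry handling (max_retries = 3) on failure.
      if (res.ok) {
        message.ack();
      } else {
        message.retry();
      }
    }
  },
} satisfies ExportedHandler<Env, BuildParams>;
```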
Observed Behavior
| Queue max_concurrency | Success Rate | Stuck Containers |
|---|---|---|
| 1 | 16/16 (100%) | 0 |
| 2 | 13/16 (81%) | 3 |
| 3 | 11/16 (69%) | 5 |
| 10 | 11/16 (69%) | 5 |
Key observation: ctx.container.start() completes without error, and ctx.container.running returns true, but the container process (entrypoint script) never executes. No logs are produced, no files are written to R2.
Mitigations Attempted (all ineffective for concurrent starts)
- Burst avoidance delay: random 0-5 second delay before start()
- Retry logic: up to 3 retries with exponential backoff
- Running state verification: wait up to 10 seconds for running = true
- Monitor tracking: ctx.container.monitor() for lifecycle events

The mitigation code itself runs correctly, but stuck containers still occur whenever max_concurrency > 1.
Durable Object Code
```ts
import { DurableObject } from "cloudflare:workers";

const MAX_START_RETRIES = 3;
const START_RETRY_DELAY_MS = 1000;
const RUNNING_CHECK_TIMEOUT_MS = 10000;
const BURST_AVOIDANCE_MAX_DELAY_MS = 5000;

export class BuilderContainer extends DurableObject {
  private async startBuild(request: Request): Promise<Response> {
    // BuildParams: the build job payload (shape not included in this report).
    const params = await request.json() as BuildParams;
    // envVars is built from the job params (derivation elided here).
    const envVars = params.env ?? {};

    // Burst avoidance: spread simultaneous starts over a random 0-5 s window.
    const burstDelay = Math.floor(Math.random() * BURST_AVOIDANCE_MAX_DELAY_MS);
    await this.sleep(burstDelay);

    for (let attempt = 1; attempt <= MAX_START_RETRIES; attempt++) {
      await this.ctx.container.start({
        enableInternet: true,
        env: envVars,
      });

      // Track lifecycle events; the promise resolves when the container exits.
      this.ctx.container.monitor()
        .then(() => console.log(`Container exited normally`))
        .catch((err) => console.error(`Container error:`, err));

      const isRunning = await this.waitForRunning();
      if (isRunning) {
        // SUCCESS: but the container may still never execute its entrypoint!
        return new Response(JSON.stringify({ status: 'started' }));
      }

      // Backoff before the next start attempt.
      await this.sleep(START_RETRY_DELAY_MS * attempt);
    }

    throw new Error('Failed to start container after retries');
  }

  private async waitForRunning(): Promise<boolean> {
    const startTime = Date.now();
    while (Date.now() - startTime < RUNNING_CHECK_TIMEOUT_MS) {
      if (this.ctx.container?.running) return true;
      await this.sleep(100);
    }
    return false;
  }

  // Helper elided from the original snippet.
  private sleep(ms: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }
}
```

Expected Behavior
All containers should execute their entrypoint when start() succeeds and running returns true.
Actual Behavior
With concurrent container starts (max_concurrency > 1), approximately 20-30% of containers never execute despite successful start() and running = true.
Workaround
Setting max_concurrency = 1 achieves 100% success rate, but severely limits throughput.
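For reference, the workaround is just the consumer block from the configuration above with concurrency pinned to 1:

```toml
[[queues.consumers]]
queue = "guidebook-build-queue"
max_batch_size = 1
max_batch_timeout = 30
max_retries = 3
max_concurrency = 1  # 16/16 containers execute; any higher value loses 20-30% of builds
```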
Questions
- Is there an undocumented limit on concurrent container starts?
- Is this a known issue with the Beta scheduler?
- Are there any recommended patterns for reliable concurrent container starts?
Related Issues
- Containers Beta: Provisioning Failure - Instances Stuck in Inactive / Unknown State workers-sdk#9877 - Different issue (max_instances defaulting to 0)