Containers Beta: Concurrent container starts fail silently - some containers never execute #5996

@kondo-masaki

Description

When triggering multiple container starts simultaneously via Durable Objects, some containers report running = true but never actually execute their entrypoint script. This results in builds getting stuck indefinitely.

Environment

  • Wrangler version: 4.61.1
  • Node version: v22.x
  • OS: macOS (development), Cloudflare Workers (production)
  • Container instance_type: basic (1/4 vCPU, 1 GiB memory)
  • max_instances: 20

Configuration

[[containers]]
class_name = "BuilderContainer"
image = "../../fly-builder/Dockerfile"
max_instances = 20
instance_type = "basic"
rollout_step_percentage = [100]

[[queues.consumers]]
queue = "guidebook-build-queue"
max_batch_size = 1
max_batch_timeout = 30
max_retries = 3
max_concurrency = 2  # varied during testing (1, 2, 3, 10)

Reproduction Steps

  1. Set up a Durable Object that starts containers via ctx.container.start()
  2. Use a Queue consumer to process build jobs
  3. Trigger 16 build jobs simultaneously
  4. Observe that some containers never execute despite start() succeeding
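
For anyone reproducing step 3, a minimal sketch of a burst producer follows. It is not the code used in this report: `QueueLike` is a local structural type assumed to mirror the `sendBatch()` method of a queue binding, and `BuildJob` with its `buildId` field is a hypothetical payload, since the real `BuildParams` shape is not shown here.

```typescript
// Minimal structural type for the producer side of a queue binding.
// Assumption: mirrors the sendBatch() method on a Cloudflare Queue binding.
interface QueueLike<T> {
  sendBatch(messages: Iterable<{ body: T }>): Promise<void>;
}

// Hypothetical job payload; the real BuildParams shape is not part of this issue.
interface BuildJob {
  buildId: string;
}

// Enqueue `count` build jobs in a single batch so that consumers
// pick them up at roughly the same time.
async function enqueueBurst(
  queue: QueueLike<BuildJob>,
  count: number,
): Promise<BuildJob[]> {
  const jobs: BuildJob[] = Array.from({ length: count }, (_, i) => ({
    buildId: `repro-${i + 1}`,
  }));
  await queue.sendBatch(jobs.map((body) => ({ body })));
  return jobs;
}
```

Calling `enqueueBurst(env.BUILD_QUEUE, 16)` from a Worker (where `BUILD_QUEUE` is an assumed binding name) would then correspond to the 16 simultaneous jobs in step 3.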

Observed Behavior

Queue max_concurrency | Success Rate | Stuck Containers
1                     | 16/16 (100%) | 0
2                     | 13/16 (81%)  | 3
3                     | 11/16 (69%)  | 5
10                    | 11/16 (69%)  | 5
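
The stuck rates implied by these counts can be checked with a few lines of arithmetic. This is only a sketch over the numbers reported above; the names `TOTAL_JOBS`, `results`, and `stuckPercent` are illustrative.

```typescript
// Counts taken from the results above: [max_concurrency, successful builds].
const TOTAL_JOBS = 16;
const results: Array<[number, number]> = [
  [1, 16],
  [2, 13],
  [3, 11],
  [10, 11],
];

// Percentage of jobs whose container never ran its entrypoint.
function stuckPercent(succeeded: number, total: number): number {
  return ((total - succeeded) / total) * 100;
}

for (const [concurrency, succeeded] of results) {
  const pct = stuckPercent(succeeded, TOTAL_JOBS).toFixed(2);
  console.log(`max_concurrency=${concurrency}: ${pct}% stuck`);
}
```

For these inputs the stuck rate is 0% at max_concurrency = 1, 18.75% at 2, and 31.25% at 3 and 10.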

Key observation: ctx.container.start() completes without error and ctx.container.running returns true, but the container process (entrypoint script) never executes. No logs are produced and no files are written to R2.

Mitigations Attempted (all ineffective for concurrent starts)

  1. Burst avoidance delay: Random 0-5 second delay before start()
  2. Retry logic: Up to 3 retries with exponential backoff
  3. Running state verification: Wait up to 10 seconds for running = true
  4. Monitor tracking: ctx.container.monitor() for lifecycle events

Each mitigation behaves as designed (the delays, retries, and checks all run), but none prevents stuck containers when max_concurrency > 1.
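
Because running = true does not imply that the entrypoint ran, one further check that could be tried is to wait for an application-level signal from the entrypoint itself. The sketch below is an untested idea, not something verified against this bug: `probe` is a hypothetical callback (for example, "does a heartbeat object written by the entrypoint exist in R2?"), and `waitForEntrypoint` and `Liveness` are illustrative names.

```typescript
// Outcome of waiting for an application-level signal from the entrypoint.
type Liveness = "alive" | "stuck";

// Polls `probe` until it resolves to true or `timeoutMs` elapses.
// `probe` is a hypothetical, caller-supplied check that only succeeds
// once the entrypoint script has actually started doing work.
async function waitForEntrypoint(
  probe: () => Promise<boolean>,
  timeoutMs: number,
  intervalMs: number,
): Promise<Liveness> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    // A rejected probe is treated the same as "not alive yet".
    const ok = await probe().catch(() => false);
    if (ok) return "alive";
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  return "stuck";
}
```

In the Durable Object this would run after the running = true check. A "stuck" result would at least make the failure detectable, so the caller could tear the container down and retry instead of waiting indefinitely.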

Durable Object Code

import { DurableObject } from "cloudflare:workers";

const MAX_START_RETRIES = 3;
const START_RETRY_DELAY_MS = 1000;
const RUNNING_CHECK_TIMEOUT_MS = 10000;
const BURST_AVOIDANCE_MAX_DELAY_MS = 5000;

// Note: the BuildParams type and the construction of envVars (derived from
// params) are omitted from this excerpt.
export class BuilderContainer extends DurableObject {
  private async startBuild(request: Request): Promise<Response> {
    const params = await request.json() as BuildParams;

    // Burst avoidance: random 0-5 second delay before start()
    const burstDelay = Math.floor(Math.random() * BURST_AVOIDANCE_MAX_DELAY_MS);
    await this.sleep(burstDelay);

    for (let attempt = 1; attempt <= MAX_START_RETRIES; attempt++) {
      await this.ctx.container.start({
        enableInternet: true,
        env: envVars,
      });

      // Track lifecycle events; resolves on normal exit, rejects on error
      this.ctx.container.monitor()
        .then(() => console.log(`Container exited normally`))
        .catch((err) => console.error(`Container error:`, err));

      const isRunning = await this.waitForRunning();
      if (isRunning) {
        // SUCCESS: But container may still not execute!
        return new Response(JSON.stringify({ status: 'started' }));
      }

      // Back off before the next attempt; the delay grows with each attempt
      await this.sleep(START_RETRY_DELAY_MS * attempt);
    }
    throw new Error('Failed to start container after retries');
  }

  // Poll ctx.container.running every 100 ms for up to RUNNING_CHECK_TIMEOUT_MS
  private async waitForRunning(): Promise<boolean> {
    const startTime = Date.now();
    while (Date.now() - startTime < RUNNING_CHECK_TIMEOUT_MS) {
      if (this.ctx.container?.running) return true;
      await this.sleep(100);
    }
    return false;
  }

  // Presumed implementation of the sleep helper used above (not in the original excerpt)
  private sleep(ms: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }
}

Expected Behavior

All containers should execute their entrypoint when start() succeeds and running returns true.

Actual Behavior

With concurrent container starts (max_concurrency > 1), roughly 19-31% of containers (3 to 5 of 16 in the tests above) never execute despite a successful start() and running = true.

Workaround

Setting max_concurrency = 1 achieves a 100% success rate (16/16), but it serializes builds and severely limits throughput.

Questions

  1. Is there an undocumented limit on concurrent container starts?
  2. Is this a known issue with the Beta scheduler?
  3. Are there any recommended patterns for reliable concurrent container starts?
