Description
When triggering multiple container starts simultaneously via Durable Objects, some containers report running = true but never actually execute their entrypoint script. This results in builds getting stuck indefinitely.
Environment
- Wrangler version: 4.61.1
- Node version: v22.x
- OS: macOS (development), Cloudflare Workers (production)
- Container instance_type: basic (1/4 vCPU, 1 GiB memory)
- max_instances: 20
Configuration

```toml
[[containers]]
class_name = "BuilderContainer"
image = "../../fly-builder/Dockerfile"
max_instances = 20
instance_type = "basic"
rollout_step_percentage = [100]

[[queues.consumers]]
queue = "guidebook-build-queue"
max_batch_size = 1
max_batch_timeout = 30
max_retries = 3
max_concurrency = 2  # Variable in testing
```

Reproduction Steps
- Set up a Durable Object that starts containers via ctx.container.start()
- Use a Queue consumer to process build jobs (a minimal consumer sketch follows this list)
- Trigger 16 build jobs simultaneously
- Observe that some containers never execute despite start() succeeding
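For context, the consumer side looks roughly like the sketch below. The BUILDER_CONTAINER binding name, the /start route, and the BuildParams shape are illustrative assumptions rather than the exact production code; the Durable Object's fetch handler (not shown in the report) routes the /start request to startBuild().

```ts
// Illustrative queue consumer: one Durable Object (and therefore one container)
// per build job. Binding name, route, and payload shape are assumptions.
interface Env {
  BUILDER_CONTAINER: DurableObjectNamespace;
}

interface BuildParams {
  buildId: string;
  env?: Record<string, string>;
}

export default {
  async queue(batch: MessageBatch<BuildParams>, env: Env): Promise<void> {
    // With max_batch_size = 1 each batch carries a single build job.
    for (const message of batch.messages) {
      const id = env.BUILDER_CONTAINER.idFromName(message.body.buildId);
      const stub = env.BUILDER_CONTAINER.get(id);

      const res = await stub.fetch("https://builder/start", {
        method: "POST",
        body: JSON.stringify(message.body),
      });

      // Defer to the queue's retry handling (max_retries = 3) on failure.
      if (res.ok) {
        message.ack();
      } else {
        message.retry();
      }
    }
  },
} satisfies ExportedHandler<Env, BuildParams>;
```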
Observed Behavior
| Queue max_concurrency | Success Rate | Stuck Containers |
|---|---|---|
| 1 | 16/16 (100%) | 0 |
| 2 | 13/16 (81%) | 3 |
| 3 | 11/16 (69%) | 5 |
| 10 | 11/16 (69%) | 5 |
Key observation: ctx.container.start() completes without error, and ctx.container.running returns true, but the container process (entrypoint script) never executes. No logs are produced, no files are written to R2.
Mitigations Attempted (all ineffective for concurrent starts)
- Burst avoidance delay: random 0-5 second delay before start()
- Retry logic: up to 3 retries with exponential backoff
- Running state verification: wait up to 10 seconds for running = true
- Monitor tracking: ctx.container.monitor() for lifecycle events

The mitigation code itself runs correctly, but stuck containers still occur whenever max_concurrency > 1.
Durable Object Code
```ts
import { DurableObject } from "cloudflare:workers";

const MAX_START_RETRIES = 3;
const START_RETRY_DELAY_MS = 1000;
const RUNNING_CHECK_TIMEOUT_MS = 10000;
const BURST_AVOIDANCE_MAX_DELAY_MS = 5000;

export class BuilderContainer extends DurableObject {
  private async startBuild(request: Request): Promise<Response> {
    // BuildParams: the build job payload (shape not included in this report).
    const params = await request.json() as BuildParams;
    // envVars is built from the job params (derivation elided here).
    const envVars = params.env ?? {};

    // Burst avoidance: spread simultaneous starts over a random 0-5 s window.
    const burstDelay = Math.floor(Math.random() * BURST_AVOIDANCE_MAX_DELAY_MS);
    await this.sleep(burstDelay);

    for (let attempt = 1; attempt <= MAX_START_RETRIES; attempt++) {
      await this.ctx.container.start({
        enableInternet: true,
        env: envVars,
      });

      // Track lifecycle events; the promise resolves when the container exits.
      this.ctx.container.monitor()
        .then(() => console.log(`Container exited normally`))
        .catch((err) => console.error(`Container error:`, err));

      const isRunning = await this.waitForRunning();
      if (isRunning) {
        // SUCCESS: but the container may still never execute its entrypoint!
        return new Response(JSON.stringify({ status: 'started' }));
      }

      // Backoff before the next start attempt.
      await this.sleep(START_RETRY_DELAY_MS * attempt);
    }

    throw new Error('Failed to start container after retries');
  }

  private async waitForRunning(): Promise<boolean> {
    const startTime = Date.now();
    while (Date.now() - startTime < RUNNING_CHECK_TIMEOUT_MS) {
      if (this.ctx.container?.running) return true;
      await this.sleep(100);
    }
    return false;
  }

  // Helper elided from the original snippet.
  private sleep(ms: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }
}
```

Expected Behavior
All containers should execute their entrypoint when start() succeeds and running returns true.
Actual Behavior
With concurrent container starts (max_concurrency > 1), approximately 20-30% of containers never execute despite successful start() and running = true.
Workaround
Setting max_concurrency = 1 achieves 100% success rate, but severely limits throughput.
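For reference, the workaround is just the consumer block from the configuration above with concurrency pinned to 1:

```toml
[[queues.consumers]]
queue = "guidebook-build-queue"
max_batch_size = 1
max_batch_timeout = 30
max_retries = 3
max_concurrency = 1  # 16/16 containers execute; any higher value loses 20-30% of builds
```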
Questions
- Is there an undocumented limit on concurrent container starts?
- Is this a known issue with the Beta scheduler?
- Are there any recommended patterns for reliable concurrent container starts?
Related Issues
- Containers Beta: Provisioning Failure - Instances Stuck in Inactive / Unknown State workers-sdk#9877 - Different issue (max_instances defaulting to 0)