@wezell commented Jan 29, 2026

ref: #34442

Below is documentation:

dotCMS Environment Cloning

This feature allows you to use the dotCMS Docker image as an init container to clone data and assets from another running dotCMS environment. This is useful for:

  • Setting up development environments from production/staging
  • Creating test environments with real data
  • Initializing new dotCMS instances with existing content

How It Works

The 10-import-env.sh script runs at container startup (before dotCMS itself starts) and:

  1. Downloads the database backup from the source environment via the Maintenance API
  2. Downloads the assets archive from the source environment
  3. Imports the database into PostgreSQL
  4. Extracts assets to the shared data directory
  5. Exits the container so it can be restarted to run dotCMS.

Once the import has completed and the container restarts, the script detects the completion marker, skips re-importing, and lets dotCMS start. This enables the "init container" pattern in Kubernetes: the import runs once, then the main dotCMS container starts with the imported data.

Environment Variables

Required

| Variable | Description |
| --- | --- |
| DOT_IMPORT_ENVIRONMENT | URL of the source dotCMS environment (e.g., https://demo.dotcms.com) |
| DOT_IMPORT_API_TOKEN | API token for authentication (Bearer token) |
| DOT_IMPORT_USERNAME_PASSWORD | Alternative: Basic auth credentials in user:password format |

Note: Either DOT_IMPORT_API_TOKEN or DOT_IMPORT_USERNAME_PASSWORD is required.

Database Connection

| Variable | Description |
| --- | --- |
| DB_BASE_URL | JDBC URL for the target PostgreSQL (e.g., jdbc:postgresql://host/dbname) |
| DB_USERNAME | PostgreSQL username |
| DB_PASSWORD | PostgreSQL password |

Optional

| Variable | Default | Description |
| --- | --- | --- |
| DOT_IMPORT_DROP_DB | false | Drop the existing database schema before import |
| DOT_IMPORT_MAX_ASSET_SIZE | 100mb | Maximum asset file size to download |
| DOT_IMPORT_ALL_ASSETS | false | Include non-live (working/archived) assets |
| DOT_IMPORT_IGNORE_ASSET_ERRORS | true | Continue if asset extraction has errors |

Usage Examples

Docker Standalone

# Create environment file
cat > app.env << 'EOF'
DOT_IMPORT_ENVIRONMENT=https://demo.dotcms.com
DOT_IMPORT_USERNAME_PASSWORD=admin@dotcms.com:admin
DOT_IMPORT_DROP_DB=true

DB_BASE_URL=jdbc:postgresql://localhost:5432/dotcms
DB_USERNAME=dotcmsdbuser
DB_PASSWORD=password
EOF

# Run dotCMS with environment cloning
docker run --env-file app.env \
  -v ./data:/data \
  -p 8080:8082 \
  dotcms/dotcms:latest

Kubernetes Init Container

apiVersion: v1
kind: Pod
metadata:
  name: dotcms
spec:
  initContainers:
    - name: clone-environment
      image: dotcms/dotcms:latest
      env:
        - name: DOT_IMPORT_ENVIRONMENT
          value: "https://source.dotcms.com"
        - name: DOT_IMPORT_API_TOKEN
          valueFrom:
            secretKeyRef:
              name: dotcms-secrets
              key: import-api-token
        - name: DB_BASE_URL
          value: "jdbc:postgresql://postgres:5432/dotcms"
        - name: DB_USERNAME
          valueFrom:
            secretKeyRef:
              name: dotcms-secrets
              key: db-username
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: dotcms-secrets
              key: db-password
      volumeMounts:
        - name: shared-data
          mountPath: /data/shared
  containers:
    - name: dotcms
      image: dotcms/dotcms:latest
      # ... main dotCMS container config
      volumeMounts:
        - name: shared-data
          mountPath: /data/shared
  volumes:
    - name: shared-data
      persistentVolumeClaim:
        claimName: dotcms-data

Note: You can also skip the init container and let dotCMS perform the clone on its first startup. In that case, adjust your probes and add an appropriate startup delay so the pod is not cycled before the import finishes (tolerable in development, but not recommended for production).
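
As a rough illustration of that approach, a generous startup probe keeps Kubernetes from restarting the pod mid-import. The probe path below is an assumption (use whatever endpoint your deployment already targets), and the threshold should be tuned to your import size:

startupProbe:
  httpGet:
    path: /api/v1/probes/alive   # assumed probe endpoint
    port: 8082
  periodSeconds: 30
  failureThreshold: 60           # allows up to ~30 minutes for the clone and first boot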

Behavior Details

Idempotency

  • The script creates an import_complete.txt marker file after a successful import
  • Subsequent container starts skip the import if this marker exists (see the sketch below)
  • Delete the marker file to force a re-import
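
A minimal sketch of that check, assuming the marker lives under $IMPORT_DATA_DIR (the exact paths and names in 10-import-env.sh may differ):

IMPORT_COMPLETE="$IMPORT_DATA_DIR/import_complete.txt"
if [ -f "$IMPORT_COMPLETE" ]; then
  echo "Import already completed, starting dotCMS normally"
  exit 0   # exit code 0: no import needed
fi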

Locking

  • A lock.txt file prevents concurrent imports (important for Kubernetes)
  • Lock files older than 30 minutes are considered stale and removed
  • Pods wait 3 minutes and exit if a lock is held by another process

Database Safety

  • The script checks whether the target database already contains data (via an inode count; see the sketch below)
  • Import is skipped if data exists (unless DOT_IMPORT_DROP_DB=true)
  • Use DOT_IMPORT_DROP_DB=true to wipe and reimport
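
Illustratively (DB_HOST and DB_NAME are assumed to be derived from DB_BASE_URL, and the actual query in the script may differ):

INODES=$(psql -h "$DB_HOST" -d "$DB_NAME" -U "$DB_USERNAME" -qtAX \
  -c "SELECT count(*) FROM inode" 2>/dev/null || echo 0)
if [ "$INODES" -gt 0 ] && [ "$DOT_IMPORT_DROP_DB" != "true" ]; then
  echo "Target database already contains data ($INODES inodes), skipping import"
  exit 0
fi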

Downloaded Files

  • Database and asset backups are cached in $IMPORT_DATA_DIR
  • File names include an MD5 hash of the source URL (see the sketch below)
  • Delete cached files to force a re-download
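
The cache naming works roughly like this sketch (illustrative; the real script may differ, but the troubleshooting globs further down match this pattern):

URL_HASH=$(echo -n "$DOT_IMPORT_ENVIRONMENT" | md5sum | cut -d' ' -f1)
DB_BACKUP="$IMPORT_DATA_DIR/${URL_HASH}_dotcms_db.sql.gz"
ASSETS_ZIP="$IMPORT_DATA_DIR/${URL_HASH}_assets.zip"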

Exit Codes

| Code | Meaning |
| --- | --- |
| 0 | No import needed (already complete or not configured) |
| 1 | Error during import |
| 13 | Import completed successfully (signals init container completion) |
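
Because 13 is non-zero, anything invoking the script directly must translate it. A hypothetical wrapper for a Kubernetes init container (where only exit 0 counts as success; the script path is assumed):

/srv/utils/10-import-env.sh
rc=$?
if [ "$rc" -eq 13 ]; then
  exit 0    # import finished: report success to Kubernetes
fi
exit "$rc"  # 0 = nothing to do, anything else = a real failure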

Troubleshooting

Import stuck or failed

  1. Check for stale lock file: ls -la /data/shared/import/lock.txt
  2. Remove lock if stale: rm /data/shared/import/lock.txt

Force reimport

  1. Remove the completion marker: rm /data/shared/import/import_complete.txt
  2. Optionally remove cached downloads to re-download:
    rm /data/shared/import/*_assets.zip
    rm /data/shared/import/*_dotcms_db.sql.gz

Authentication failures

  • Verify DOT_IMPORT_API_TOKEN or DOT_IMPORT_USERNAME_PASSWORD is correct (a quick check is sketched below)
  • Ensure the user has access to the Maintenance API endpoints:
    • /api/v1/maintenance/_downloadAssets
    • /api/v1/maintenance/_downloadDb
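
A quick way to sanity-check the credential (illustrative; whether these endpoints answer HEAD requests is an assumption, so fall back to a plain GET and interrupt the download if not):

# 200 means the token is accepted; 401/403 indicates an auth problem
curl -sS -o /dev/null -w "%{http_code}\n" -I \
  -H "Authorization: Bearer $DOT_IMPORT_API_TOKEN" \
  "$DOT_IMPORT_ENVIRONMENT/api/v1/maintenance/_downloadDb"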

Source Environment Requirements

The source dotCMS environment must:

  1. Be accessible over HTTPS/HTTP from the target environment

Remove the unconditional `exit 0` so the entrypoint continues to source
startup scripts, clarify the import script’s exit-13 success path, and
install `libarchive-tools` to support asset unpacking during imports.

ref: #34442

@wezell linked an issue Jan 29, 2026 that may be closed by this pull request: [FEATURE] Make it easy for dotCMS to clone another environment.
@spbolton commented
I am fine with this. The lack of a Helm chart to help with templating this init container makes integrating it a pain, but it is fine for starting with a few instances. There are a couple of concerns that may need addressing and clarifying, taking into account how this would effectively work when there is more than one replica in the StatefulSet; it is more of an issue when trying to run this during an upgrade, where pods are being replaced while there are still active connections.

PR #34443 Review - dotCMS Environment Cloning Feature

Summary

This PR adds environment cloning functionality using an init container pattern. The implementation works for initial deployment scenarios with EFS shared storage and OrderedReady pod management, but has critical limitations that should be addressed before production use.

Overall Assessment: ⚠️ Request Changes - Works for intended use case but needs improvements for robustness


✅ What Works (With EFS + OrderedReady)

  • Per-pod volume isolation: Eliminated (EFS is shared)
  • Lock file race condition: Mitigated (OrderedReady prevents concurrent execution)
  • Database import race: Mitigated (sequential pod startup)
  • Initial deployment: Works correctly for first-time setup

🚨 Critical Issues (Must Fix)

1. Lock File Race Condition - Fundamental Flaw

Problem: The lock file mechanism has a race condition. It only "works" because OrderedReady prevents concurrent execution, not because the lock is correct.

Current Code (10-import-env.sh lines 178-195):

# Check if lock exists
if [ -f "$IMPORT_IN_PROCESS" ]; then
  # ... check lock age ...
fi

# Create lock (NOT ATOMIC with check above)
mkdir -p $IMPORT_DATA_DIR && touch $IMPORT_IN_PROCESS

Why It's Flawed:

  • Check-then-create is not atomic
  • Two processes can both see "no lock" and both create it
  • Only works because OrderedReady prevents Pod-1 from starting until Pod-0 is Ready

Impact: If podManagementPolicy: Parallel is used (or OrderedReady fails), multiple pods will import simultaneously → database corruption

Recommendation: Use atomic directory creation:

LOCK_DIR="$IMPORT_DATA_DIR/.lock"
if mkdir "$LOCK_DIR" 2>/dev/null; then
  # We got the lock
  trap "rmdir '$LOCK_DIR' 2>/dev/null" EXIT
else
  # Lock exists, check if stale
  # ... existing stale lock logic ...
fi

Alternative: Use PostgreSQL advisory locks (automatically released on connection close)
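
A sketch of that alternative (hypothetical: the lock key 4242 is arbitrary, run-import.sh is a placeholder for the import steps, and the session must stay open while the import runs, which is exactly what makes the lock self-cleaning if the pod dies):

psql -h "$DB_HOST" -d "$DB_NAME" -U "$DB_USERNAME" <<'SQL'
-- blocks until no other session holds the lock
SELECT pg_advisory_lock(4242);
-- run the import from inside the session (hypothetical helper)
\! /srv/utils/run-import.sh
-- released here, or automatically if the session dies
SELECT pg_advisory_unlock(4242);
SQL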


2. Doesn't Work on Existing StatefulSets with Multiple Replicas

Problem: Cannot safely run on existing StatefulSets during rolling updates. Old pods remain active while new pods run import, causing database conflicts.

Scenario:

Rolling Update:
  Pod-2 (old) deleted
  Pod-2 (new) created
    ├─ Init container runs 10-import-env.sh
    ├─ Sees import_complete.txt doesn't exist (or was deleted)
    ├─ Starts importing database
    └─ Pod-0 and Pod-1 (old) still running, connected to DB
    → CONFLICT: Database operations conflict with active connections

Impact:

  • Database corruption risk
  • Active sessions killed
  • Service disruption
  • Data loss

Recommendation: Add active connection check before import:

# Note: DB_HOST and DB_NAME are assumed to be derived from DB_BASE_URL
check_active_connections() {
  ACTIVE=$(psql -h "$DB_HOST" -d "$DB_NAME" -U "$DB_USERNAME" -qtAX -c \
    "SELECT count(*) FROM pg_stat_activity 
     WHERE datname = '$DB_NAME' 
     AND pid != pg_backend_pid()
     AND state != 'idle'" 2>/dev/null || echo "0")
  
  if [ "$ACTIVE" -gt 0 ]; then
    echo "ERROR: Database has $ACTIVE active connections"
    echo "Cannot import while database is in use"
    echo "This script is designed for initial deployment only"
    echo "Stop all pods before performing refresh"
    exit 1
  fi
}

# Call before import
check_active_connections || exit 1

Also: Document this limitation clearly in the PR description and script comments


⚠️ High Priority Issues (Should Fix)

3. Stale Lock File Handling

Problem: If Pod-0 crashes during import, lock file remains for up to 30 minutes, blocking progress.

Current Behavior:

  • Pod restarts, sees lock (< 30 min old), waits 3 min, exits
  • Repeats until lock is 30 minutes old
  • No way to manually recover

Recommendation:

  • Reduce timeout from 30 to 10 minutes
  • Add trap handler for cleanup on exit
  • Consider database advisory locks (auto-released on connection close)

4. Partial Import on Failure

Problem: If import fails partway (e.g., asset extraction fails), database may be imported but assets not extracted, leaving inconsistent state.

Recommendation:

  • Add cleanup on failure (trap ERR; see the sketch below)
  • Only create import_complete.txt if all steps succeed
  • Document recovery procedures
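
A minimal sketch of the first two points (variable names taken from the script excerpts above; bash-specific):

# remove partial state so a retry starts clean
cleanup_on_failure() {
  echo "Import failed, removing partial import state" >&2
  rm -f "$IMPORT_DATA_DIR/import_complete.txt" "$IMPORT_IN_PROCESS"
}
set -Ee   # exit on error; -E lets functions inherit the ERR trap
trap cleanup_on_failure ERR
# ... import steps ...
touch "$IMPORT_DATA_DIR/import_complete.txt"   # only reached if all steps succeed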

📝 Documentation & Clarity Issues

5. OrderedReady Dependency Not Documented

Issue: The script depends on OrderedReady but this isn't clearly documented.

Recommendation: Add to PR description:

Important: This feature requires podManagementPolicy: OrderedReady (default). If Parallel is used, multiple pods may import simultaneously, causing database corruption.

6. Exit Code 13 - Non-Standard

Issue: Exit code 13 is non-standard (typically EACCES). Monitoring systems may misinterpret as error.

Recommendation:

  • Document exit code 13 clearly
  • Or use exit code 0 and check for marker file in entrypoint
  • Add to troubleshooting section

7. EFS Requirement Not Explicit

Issue: Script assumes shared storage but doesn't validate.

Recommendation:

  • Document EFS requirement clearly
  • Add validation or at least warning if per-pod volumes detected
  • Add to "Requirements" section

✅ Positive Aspects

  1. Idempotency: Good use of import_complete.txt marker
  2. Error Handling: Basic error handling present
  3. Documentation: PR description is comprehensive
  4. Use Case: Solves real problem (environment cloning)
  5. Init Container Pattern: Correct approach for production

🔧 Recommended Changes

Must Fix (Before Merge)

  1. ✅ Fix lock file race condition (atomic operations)
  2. ✅ Add active connection check before import
  3. ✅ Document OrderedReady requirement
  4. ✅ Document EFS requirement

Should Fix (Before Production)

  1. ⚠️ Improve stale lock handling (shorter timeout, better cleanup)
  2. ⚠️ Add partial import cleanup on failure
  3. ⚠️ Document exit code 13 behavior

🧪 Testing Recommendations

Must Test:

  • Initial deployment with 3 replicas (sequential startup)
  • Pod restart during import (stale lock handling)
  • Rolling update on existing StatefulSet (should fail gracefully with active connection check)
  • Per-pod volumes scenario (should detect/warn)

Should Test:

  • Large database import (timeout handling)
  • Network failure during download (recovery)
  • Asset extraction failure (cleanup)

📊 Risk Assessment

| Scenario | Risk Level | Works? | Notes |
| --- | --- | --- | --- |
| Initial deployment (EFS + OrderedReady) | ✅ Low | Yes | Works correctly |
| Existing StatefulSet (rolling update) | 🚨 High | No | Conflicts with active connections |
| Without OrderedReady | 🚨 High | No | Lock race condition causes failures |
| Per-pod volumes | 🚨 High | No | Each pod imports independently |

💡 Additional Suggestions

  1. Consider database advisory locks instead of file-based locks (more reliable, auto-cleanup)
  2. Add pod ordinal check for extra safety (only pod-0 imports on initial deployment; see the sketch below)
  3. Add structured logging for better observability
  4. Add metrics/telemetry for import operations
  5. Consider resume capability for failed imports (checkpoint progress)
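
For suggestion 2, a sketch relying on the StatefulSet hostname convention (pods are named <statefulset>-<ordinal>):

ORDINAL="${HOSTNAME##*-}"   # e.g. dotcms-1 -> 1
if [ "$ORDINAL" != "0" ]; then
  echo "Pod ordinal $ORDINAL: deferring import to pod 0"
  exit 0
fi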

Final Recommendation

Request Changes - Address critical issues before merge:

  1. Fix lock file race condition (atomic operations)
  2. Add active connection check (prevent conflicts on existing StatefulSets)
  3. Document dependencies (OrderedReady, EFS)

The feature is useful and the init container pattern is correct, but the implementation needs these fixes for production robustness.


Context Notes

  • Storage: Assumes /data/shared is EFS (shared across pods)
  • Pod Management: Requires OrderedReady (default StatefulSet policy)
  • Use Cases: Initial deployment or total refresh (intentional database/filesystem replacement)
  • Pattern: Init container (as documented in PR description)
