
[DEFECT] Telemetry leaves idle connections open, affecting Evergreen customers #34478

@erickgonzalez

Description


Problem Statement

Telemetry is leaving idle database connections open, leaking database resources in Evergreen environments.
The issue started with the deployment of dotcms/dotcms:26.01.02-01_b8b3476 and affects all customers running Evergreen environments.

This behavior increases the risk of connection pool exhaustion and overall platform instability.
As an immediate workaround, telemetry has been disabled for affected environments, which reduces observability and is not a sustainable long-term solution.

Initial investigation points to the experiment-related telemetry collectors, specifically CountVariantsInAllArchivedExperimentsMetricType.

More context and operational impact details are available in Slack under #eng-critical-incidents.
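The link between idle connections and pool exhaustion can be illustrated in isolation. This is a minimal sketch, not dotCMS code: a `Semaphore` stands in for a fixed-size connection pool, and `BoundedPool`, `Handle`, `leakyCollect` and `safeCollect` are hypothetical names. It also assumes, without confirmation from the investigation so far, that the leak comes from a collector that borrows a connection and never closes it. The leaky collector drains a 3-connection pool after three runs, while the try-with-resources version returns its connection on every run.

```java
import java.util.concurrent.Semaphore;

public class LeakDemo {

    // Hypothetical fixed-size pool: each permit represents one database connection.
    static final class BoundedPool {
        private final Semaphore permits;

        BoundedPool(int maxSize) {
            this.permits = new Semaphore(maxSize);
        }

        // Returns a handle, or null when the pool is exhausted.
        Handle tryBorrow() {
            return permits.tryAcquire() ? new Handle(permits) : null;
        }

        int available() {
            return permits.availablePermits();
        }
    }

    // Hypothetical connection handle; close() returns the permit exactly once.
    static final class Handle implements AutoCloseable {
        private final Semaphore permits;
        private boolean closed;

        Handle(Semaphore permits) {
            this.permits = permits;
        }

        @Override
        public void close() {
            if (!closed) {
                closed = true;
                permits.release();
            }
        }
    }

    // Leaky collector: the handle is never closed, so its permit is never released.
    static boolean leakyCollect(BoundedPool pool) {
        Handle handle = pool.tryBorrow();
        // ... metric query would run here ...
        return handle != null;
    }

    // Fixed collector: try-with-resources closes the handle on every exit path,
    // including exceptions; a null resource is allowed and simply not closed.
    static boolean safeCollect(BoundedPool pool) {
        try (Handle handle = pool.tryBorrow()) {
            // ... metric query would run here ...
            return handle != null;
        }
    }

    public static void main(String[] args) {
        BoundedPool leakyPool = new BoundedPool(3);
        int leakyOk = 0;
        for (int run = 0; run < 5; run++) {
            if (leakyCollect(leakyPool)) {
                leakyOk++;
            }
        }
        System.out.println("leaky: " + leakyOk + " of 5 runs got a connection, available=" + leakyPool.available());

        BoundedPool safePool = new BoundedPool(3);
        int safeOk = 0;
        for (int run = 0; run < 5; run++) {
            if (safeCollect(safePool)) {
                safeOk++;
            }
        }
        System.out.println("safe: " + safeOk + " of 5 runs got a connection, available=" + safePool.available());
    }
}
```

Running `main` prints `leaky: 3 of 5 runs got a connection, available=0` followed by `safe: 5 of 5 runs got a connection, available=3`. In a real pool the leaked connections would show up as the long-lived idle sessions described in this issue.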

Steps to Reproduce

  1. Deploy or run an Evergreen environment using image dotcms/dotcms:26.01.02-01_b8b3476 or newer.
  2. Ensure telemetry is enabled (the default configuration).
  3. Allow the telemetry job to run; to make it run sooner, adjust its schedule with TELEMETRY_SAVE_SCHEDULE.
  4. Observe database connection pool metrics, for example by listing idle PostgreSQL sessions:

     SELECT pid,
            datname AS database,
            usename AS user,
            state,
            backend_start AS connection_established,
            state_change AS last_activity,
            now() - state_change AS idle_duration,
            query AS last_query
     FROM pg_stat_activity
     WHERE state IN ('idle', 'idle in transaction')
     ORDER BY idle_duration DESC;

  5. Notice that idle connections remain open and are not released over time.
  6. Disable telemetry and observe that the idle connection behavior stops.

Acceptance Criteria

• Telemetry collectors do not leave idle database connections open.
• Connection lifecycle is properly managed and connections are released after telemetry execution.
• The collector CountVariantsInAllArchivedExperimentsMetricType is reviewed and corrected if necessary.
• Telemetry can remain enabled in Evergreen environments without causing connection leaks.
• The fix is validated in Evergreen and confirmed not to regress other telemetry metrics.
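One way to check the first two criteria in a unit test is to give the collector a connection source that counts borrows and closes, then assert that the count returns to its starting value. This is a sketch under stated assumptions: it uses `java.lang.reflect.Proxy` to build stub `java.sql.Connection` objects whose only working methods are `close()` and `isClosed()` (plus the `Object` methods), and the `Collector` interface, `trackedConnection` and `leakedBy` are hypothetical; they do not reflect how dotCMS metric types actually obtain connections. Against a real database, the same idea could wrap the pool's actual connections instead of stubs.

```java
import java.lang.reflect.Proxy;
import java.sql.Connection;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class ConnectionLeakCheck {

    // Connections handed out by trackedConnection() and not yet closed.
    static final AtomicInteger OPEN = new AtomicInteger();

    // Hypothetical collector shape: it pulls connections from a supplier.
    interface Collector {
        void collect(Supplier<Connection> connections) throws Exception;
    }

    // Stub Connection built with a dynamic proxy; only close/isClosed and the
    // Object methods are implemented, every other call throws.
    static Connection trackedConnection() {
        OPEN.incrementAndGet();
        boolean[] closed = {false};
        return (Connection) Proxy.newProxyInstance(
                ConnectionLeakCheck.class.getClassLoader(),
                new Class<?>[] {Connection.class},
                (proxy, method, args) -> {
                    switch (method.getName()) {
                        case "close":
                            if (!closed[0]) { // count each connection at most once
                                closed[0] = true;
                                OPEN.decrementAndGet();
                            }
                            return null;
                        case "isClosed":
                            return closed[0];
                        case "hashCode":
                            return System.identityHashCode(proxy);
                        case "equals":
                            return proxy == args[0];
                        case "toString":
                            return "TrackedConnection(closed=" + closed[0] + ")";
                        default:
                            throw new UnsupportedOperationException(method.getName());
                    }
                });
    }

    // Runs the collector once and returns how many connections it left open.
    static int leakedBy(Collector collector) {
        int before = OPEN.get();
        try {
            collector.collect(ConnectionLeakCheck::trackedConnection);
        } catch (Exception e) {
            throw new IllegalStateException("collector failed", e);
        }
        return OPEN.get() - before;
    }

    public static void main(String[] args) {
        // Borrows a connection and never closes it.
        Collector leaky = connections -> connections.get();
        // Closes the connection via try-with-resources.
        Collector safe = connections -> {
            try (Connection c = connections.get()) {
                // metric query would run here
            }
        };
        System.out.println("leaky leaked " + leakedBy(leaky));
        System.out.println("safe leaked " + leakedBy(safe));
    }
}
```

Running `main` prints `leaky leaked 1` and then `safe leaked 0`; a collector that passes this kind of check releases every connection it borrows.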

dotCMS Version

26.01.02-01_b8b3476 and all later versions

Severity

High - Major functionality broken

Links

https://dotcms.slack.com/archives/C096E0QAM7G/p1770055843362409

Metadata


Assignees: No one assigned
Status: New
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests
