Conversation

@norrisng-bc (Contributor) commented Nov 21, 2025

Description

This PR introduces changes to better handle large sync jobs, which typically occur when a recursive sync is requested:

  • Enqueuing sync jobs is now split into multiple INSERT queries whenever the number of bind parameters would exceed PostgreSQL's hard limit of 65,535 (see the chunking sketch after the description below)
    • Previously, everything was enqueued in one giant INSERT, which would often fail for recursive sync requests on buckets with a very deeply nested folder structure.
  • The Knex pool size has been reduced by half. There are 2 database connection pools in COMS: one managed by Knex inside each pod, and another managed by PgBouncer above the pods. Shrinking the Knex pool (keeping in mind that each pod holds its own, unshareable pool) reduces the likelihood of hitting the Postgres max_connections limit.
    • UPDATE (Dec 1): constraining the Knex pool turned out to be overly restrictive; limiting connections via the pod count instead works better
  • UPDATE (Dec 1): reduce max autoscaling from 8 pods to 6
    • This still reduces the number of Knex-reserved connections, but not by so much that it results in under-utilization
  • UPDATE (Jan 7): increase Knex database connection acquisition timeout from 60 seconds (default) to 90 seconds
    • This value may be tweaked via a new environment variable: DB_ACQUIRE_CONNECTION_TIMEOUT (db.acquireConnectionTimeout); see the configuration sketch under "Additional tidbits" below
  • UPDATE (Jan 7): syncVersions() now throws an error if retrieving either the S3 or COMS object versions fails
    • This ensures the entire sync job is terminated instead of causing a de-sync
    • See "syncVersions() atomicity" under "further comments" for finer implementation details
  • UPDATE (Jan 7): more granular error handling for utils.getBucket() (a sketch of the mapping follows the description below)
    • 404s are now returned only if the bucketId isn't found in the database
    • 504s are returned on database connection timeouts
    • 500s are returned for all other errors

Why? Some users have very deeply nested folder structures. A recursive sync on the root of one results in a very large sync job, which causes failed sync requests and a denial of service for everyone else currently using COMS (even for non-sync-related features) while COMS works its way through the sync queue.
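
As a rough illustration of the chunking, here is a minimal sketch; the table name, column set, and helper name are illustrative, not the actual COMS schema:

// Enqueue sync jobs in batches that stay under Postgres' 65,535 bind-parameter limit.
// Each inserted row consumes one bind parameter per column.
const MAX_PARAMS = 65535;

async function enqueueSyncJobs(trx, jobs) {
  if (!jobs.length) return;
  const paramsPerRow = Object.keys(jobs[0]).length;
  const chunkSize = Math.floor(MAX_PARAMS / paramsPerRow);

  for (let i = 0; i < jobs.length; i += chunkSize) {
    await trx('object_queue').insert(jobs.slice(i, i + chunkSize));
  }
}

Knex's built-in knex.batchInsert(table, rows, chunkSize) would achieve the same thing, provided chunkSize is derived from the column count rather than hard-coded.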
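
And a sketch of the intended getBucket() error mapping, assuming Knex's exported KnexTimeoutError for pool-acquisition timeouts; problem() and bucketService.read() are hypothetical stand-ins, not the actual COMS helpers:

const { KnexTimeoutError } = require('knex');

// Hypothetical helper: build an error carrying an HTTP status for the route handler to return
const problem = (status, detail) => Object.assign(new Error(detail), { status });

async function getBucket(bucketId) {
  let bucket;
  try {
    // hypothetical lookup; resolves to undefined when no matching row exists
    bucket = await bucketService.read(bucketId);
  } catch (err) {
    // 504 only when Knex could not acquire a pooled connection in time; 500 for everything else
    if (err instanceof KnexTimeoutError) throw problem(504, 'Timed out acquiring a database connection');
    throw problem(500, err.message);
  }
  // 404 strictly means "this bucketId is not in the database"
  if (!bucket) throw problem(404, `Bucket ${bucketId} not found`);
  return bucket;
}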

Additional tidbits included:

  • Knex connections now include the hostname (i.e. the OpenShift pod name) in the application_name field (a configuration sketch follows this list)
    • This makes it easier to tell how many connections each pod is making to the database (see the query in "Further comments" below)
    • It is needed because PgBouncer pooling makes it impossible to tell pod connections apart by IP address
  • Bugfix: Don't copy bucket permissions during a recursive bucket sync if the current user is SYSTEM_USER
    • Recursive sync using basic auth would fail, since SYSTEM_USER doesn't have any permissions (i.e. explicit entries in the bucket_permission table) on the parent bucket to copy over to its children
  • UPDATE (Dec 1): Enable horizontal pod autoscaling in DEV and DEV-PR to facilitate load testing in DEV
  • UPDATE (Dec 1): Remove pod CPU/memory limits
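
A minimal sketch of what the corresponding knexfile entries could look like. The application_name and acquireConnectionTimeout options are standard node-postgres / Knex settings; the surrounding keys and defaults are assumptions, not the exact COMS configuration:

const os = require('os');

module.exports = {
  client: 'pg',
  connection: {
    host: process.env.DB_HOST,
    port: process.env.DB_PORT,
    user: process.env.DB_USERNAME,
    password: process.env.DB_PASSWORD,
    database: process.env.DB_DATABASE,
    // Tag every connection with the pod name so pg_stat_activity can attribute it to a pod
    application_name: os.hostname()
  },
  pool: {
    min: parseInt(process.env.DB_POOL_MIN ?? '2', 10),
    max: parseInt(process.env.DB_POOL_MAX ?? '10', 10)
  },
  // How long Knex waits to acquire a connection from its pool before erroring (Knex default: 60000 ms)
  acquireConnectionTimeout: parseInt(process.env.DB_ACQUIRE_CONNECTION_TIMEOUT ?? '90000', 10)
};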

https://apps.nrs.gov.bc.ca/int/jira/browse/SHOWCASE-4099
https://apps.nrs.gov.bc.ca/int/jira/browse/SHOWCASE-4104
https://apps.nrs.gov.bc.ca/int/jira/browse/SHOWCASE-4105

Types of changes

Bug fix (non-breaking change which fixes an issue)

Checklist

  • I have read the CONTRIBUTING doc
  • I have checked that unit tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

Pod autoscaling and database connection limits

In theory, with a DB_POOL_MAX of 10 and an autoscaling.maxReplicas of 8, COMS would only consume 80 connections at any time, out of a PostgreSQL max_connections of 100 (we haven't changed it from the default value). However, there appear to be at least 16 other connections made by the following processes, and 80 + 16 = 96 is awfully close to 100:

  • Metabase (1)
  • Patroni heartbeat (1)
  • Crunchy replicas (2)
  • PgBouncer (2)
  • Unknown background/system processes (5)
    • These show up as largely NULL rows with an empty application_name
  • Additional unidentified connections under the app username, beyond what the app pools account for (5)

(as observed with 3 pods running at the time)

To list all of the current connections on the DB using psql:

SELECT * FROM pg_stat_activity;
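
To see how many connections each pod is holding (using the application_name set per pod), a grouped variant such as the following works; all columns are standard pg_stat_activity fields:

SELECT application_name, usename, count(*) AS connections
FROM pg_stat_activity
GROUP BY application_name, usename
ORDER BY connections DESC;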

syncVersions() atomicity

Original:

// Check for COMS and S3 Version statuses
const [comsVersions, s3VersionsRaw] = await Promise.allSettled([
  versionService.list(comsObject.id, trx),
  storageService.listAllObjectVersions({ filePath: comsObject.path, bucketId: comsObject.bucketId })
]).then(settled => settled.map(promise => promise.value));

This PR:

// Check for COMS and S3 Version statuses
const [comsVersions, s3VersionsRaw] = await Promise.all([
  versionService.list(comsObject.id, trx),
  storageService.listAllObjectVersions({ filePath: comsObject.path, bucketId: comsObject.bucketId })
]);

list() and listAllObjectVersions() return promises that can reject. With Promise.allSettled(), a rejected promise settles with a reason property instead of a value, so mapping the results to .value silently yields undefined.

Continuing to sync versions when either call fails can cause a de-sync between the COMS DB and the actual S3 object. Replacing Promise.allSettled() with Promise.all() ensures that an error is thrown right then and there, which kills the sync job (arguably the safer outcome) instead of causing a de-sync.
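
A standalone illustration of the difference (not COMS code; the rejection message is made up):

async function demo() {
  const bad = () => Promise.reject(new Error('S3 ListObjectVersions failed'));

  // allSettled never rejects: the failed promise becomes { status: 'rejected', reason },
  // so mapping to .value silently yields undefined and a sync would carry on with missing data.
  const settled = await Promise.allSettled([Promise.resolve(['v1', 'v2']), bad()])
    .then(results => results.map(r => r.value));
  console.log(settled); // [ [ 'v1', 'v2' ], undefined ]

  // all rejects as soon as either input rejects, so the error surfaces immediately
  // and the caller can abort the whole sync job.
  try {
    await Promise.all([Promise.resolve(['v1', 'v2']), bad()]);
  } catch (err) {
    console.log(err.message); // S3 ListObjectVersions failed
  }
}
demo();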

github-actions bot commented Nov 21, 2025

Coverage Report

Totals Coverage
Statements: 55.38% ( 3080 / 5562 )
Methods: 45.52% ( 330 / 725 )
Lines: 62.03% ( 1854 / 2989 )
Branches: 48.48% ( 896 / 1848 )

@TimCsaky (Contributor) left a comment

I guess you will have monitored the number of connections during a big sync job to see how this works.

DB_PORT: "5432"
# DB_POOL_MIN: "2"
# DB_POOL_MAX: "10"
DB_POOL_MAX: "5"
Contributor

Opposite of what I was suggesting (LOL), but this does make sense.
Hopefully Knex will queue the queries within this limit.

Contributor Author (norrisng-bc)

Seems to work almost too perfectly - I've noticed during testing that it keeps the CPU and memory usage so low that it doesn't trigger the HorizontalPodAutoscaler to do its thing. Probably needs a bit more tweaking.

Contributor Author (norrisng-bc)

Ah, I thought you were referring to the number of pods and not the number of Knex connections. These two simultaneous DB pools make things so confusing 😅

@TimCsaky (Contributor) commented Dec 1, 2025

I think the problem only happens when the COMS pods scale up to 8. How about:

  • keep the Knex max pool at 10 (allowing for better performance when the pod count is 3, which is nearly always)
  • adjust the HPA so that the max pod count is 6
  • also, as explained here, remove the limit on DB CPU/memory for all environments (update Helm values like this one)

The only other idea I have is to add a small interval in the processNextJob() call to slow things down (a hypothetical sketch follows).
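
Purely as a hypothetical sketch of that last suggestion (the polling loop and queue service shown here are illustrative, not the actual COMS queue implementation):

const JOB_INTERVAL_MS = 250; // small pause between jobs to avoid hammering the DB and S3

async function pollQueue() {
  for (;;) {
    await queueService.processNextJob(); // existing per-job processing
    await new Promise(resolve => setTimeout(resolve, JOB_INTERVAL_MS));
  }
}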

@norrisng-bc force-pushed the bug/large-sync-jobs branch 3 times, most recently from 0f0bfe6 to 8f0d340 on December 2, 2025 00:45
@TimCsaky (Contributor) left a comment

Guess we can merge this?

@norrisng-bc force-pushed the bug/large-sync-jobs branch 3 times, most recently from dd6e31c to 3281caa on January 8, 2026 17:40
Postgres has a hard limit of 65,535 params in a prepared statement, so if there are too many params (due to the number of sync jobs), split the enqueue into smaller queries that don't exceed that limit.
PgBouncer pooling makes it impossible to differentiate connections via pod IP addresses; this will make it easier to debug DB connections.
If the current user is SYSTEM_USER (i.e. S3 or basic auth), the current user won't have any permissions over the current bucket (according to the bucket_permission table, that is).
As a result, any attempt to copy over these (non-existent) permissions to any child buckets will fail.
The current user is now checked first: only if it isn't SYSTEM_USER are the bucket permissions copied over to child buckets.
* Decrease COMS max pod count to avoid saturating the shared Postgres connection pool
* Remove CPU/memory limits on Postgres pods
* Enable pod autoscaling in DEV and PR
* syncVersions now atomic - if listing either S3 or COMS versions fails, throw an error and don't continue
* Increase DB acquire connection timeout to 1.5 mins (configurable via new envar: db.acquireConnectionTimeout)
* Separate out getBucket() errors instead of throwing everything as a 404 (including DB connection-related errors)
This cuts down on the creation of new connections, which can overwhelm the S3 server under high load; the adjusted timeouts and keep-alive values also help it cope with that load.
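
For context, connection reuse and timeouts of this kind are typically configured on the S3 client's request handler. A minimal sketch, assuming the AWS SDK v3 S3 client; the specific timeout and socket values here are illustrative, not the values used by COMS:

const https = require('https');
const { S3Client } = require('@aws-sdk/client-s3');
const { NodeHttpHandler } = require('@smithy/node-http-handler');

// endpoint, region, and credentials omitted for brevity
const s3 = new S3Client({
  requestHandler: new NodeHttpHandler({
    // Reuse sockets across requests instead of opening a new TCP/TLS connection each time
    httpsAgent: new https.Agent({ keepAlive: true, maxSockets: 50 }),
    connectionTimeout: 3000, // ms allowed to establish a connection
    requestTimeout: 30000    // ms to wait for a response on an established connection
  })
});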
@TimCsaky force-pushed the bug/large-sync-jobs branch from 0a557e2 to 6c1bc76 on January 21, 2026 00:03
@TimCsaky merged commit 1a5422a into master on Jan 21, 2026
13 checks passed