Conversation

@norrisng-bc (Contributor) commented Nov 21, 2025

Description

This PR introduces changes to better handle large sync jobs, which typically occur when a recursive sync is requested:

  • Enqueuing sync jobs is now split into multiple INSERT queries whenever the number of bind parameters would exceed PostgreSQL's hard limit of 65,535 (see the chunking sketch after the description below)
    • Previously, everything was enqueued in one giant INSERT, which would often fail for recursive sync requests on buckets with a very deeply nested folder structure.
  • The Knex pool size has been reduced by half. There are 2 database connection pools in COMS: one managed by Knex inside each pod, and another managed by PgBouncer above the pods. Shrinking the Knex pool (keeping in mind that each pod holds its own, unshareable pool) reduces the likelihood of hitting the Postgres max_connections limit.
    • UPDATE (Dec 1): constraining the Knex pool turned out to be overly restrictive; limiting connections via the pod count instead works better
  • UPDATE (Dec 1): reduce max autoscaling from 8 pods to 6
    • This still reduces the number of Knex-reserved connections, but not by so much that it results in under-utilization
  • UPDATE (Jan 7): increase Knex database connection acquisition timeout from 60 seconds (default) to 90 seconds
    • This value may be tweaked via a new environment variable: DB_ACQUIRE_CONNECTION_TIMEOUT (db.acquireConnectionTimeout); see the configuration sketch under "Additional tidbits" below
  • UPDATE (Jan 7): syncVersions() now throws an error if retrieving either the S3 or COMS object versions fails
    • This ensures the entire sync job is terminated instead of causing a de-sync
    • See "syncVersions() atomicity" under "further comments" for finer implementation details
  • UPDATE (Jan 7): more granular error handling for utils.getBucket() (a sketch of the mapping follows the description below)
    • 404s are now returned only if the bucketId isn't found in the database
    • 504s are returned on database connection timeouts
    • 500s are returned for all other errors

Why? Some users have very deeply nested folder structures. A recursive sync on the root of one results in a very large sync job, which causes failed sync requests and a denial of service for everyone else currently using COMS (even for non-sync-related features) while COMS works its way through the sync queue.
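
As a rough illustration of the chunking, here is a minimal sketch; the table name, column set, and helper name are illustrative, not the actual COMS schema:

// Enqueue sync jobs in batches that stay under Postgres' 65,535 bind-parameter limit.
// Each inserted row consumes one bind parameter per column.
const MAX_PARAMS = 65535;

async function enqueueSyncJobs(trx, jobs) {
  if (!jobs.length) return;
  const paramsPerRow = Object.keys(jobs[0]).length;
  const chunkSize = Math.floor(MAX_PARAMS / paramsPerRow);

  for (let i = 0; i < jobs.length; i += chunkSize) {
    await trx('object_queue').insert(jobs.slice(i, i + chunkSize));
  }
}

Knex's built-in knex.batchInsert(table, rows, chunkSize) would achieve the same thing, provided chunkSize is derived from the column count rather than hard-coded.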
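
And a sketch of the intended getBucket() error mapping, assuming Knex's exported KnexTimeoutError for pool-acquisition timeouts; problem() and bucketService.read() are hypothetical stand-ins, not the actual COMS helpers:

const { KnexTimeoutError } = require('knex');

// Hypothetical helper: build an error carrying an HTTP status for the route handler to return
const problem = (status, detail) => Object.assign(new Error(detail), { status });

async function getBucket(bucketId) {
  let bucket;
  try {
    // hypothetical lookup; resolves to undefined when no matching row exists
    bucket = await bucketService.read(bucketId);
  } catch (err) {
    // 504 only when Knex could not acquire a pooled connection in time; 500 for everything else
    if (err instanceof KnexTimeoutError) throw problem(504, 'Timed out acquiring a database connection');
    throw problem(500, err.message);
  }
  // 404 strictly means "this bucketId is not in the database"
  if (!bucket) throw problem(404, `Bucket ${bucketId} not found`);
  return bucket;
}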

Additional tidbits included:

  • Knex connections now include the hostname (i.e. the OpenShift pod name) in the application_name field (a configuration sketch follows this list)
    • This makes it easier to tell how many connections each pod is making to the database (see the query in "Further comments" below)
    • It is needed because PgBouncer pooling makes it impossible to tell pod connections apart by IP address
  • Bugfix: Don't copy bucket permissions during a recursive bucket sync if the current user is SYSTEM_USER
    • Recursive sync using basic auth would fail, since SYSTEM_USER doesn't have any permissions (i.e. explicit entries in the bucket_permission table) on the parent bucket to copy over to its children
  • UPDATE (Dec 1): Enable horizontal pod autoscaling in DEV and DEV-PR to facilitate load testing in DEV
  • UPDATE (Dec 1): Remove pod CPU/memory limits
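
A minimal sketch of what the corresponding knexfile entries could look like. The application_name and acquireConnectionTimeout options are standard node-postgres / Knex settings; the surrounding keys and defaults are assumptions, not the exact COMS configuration:

const os = require('os');

module.exports = {
  client: 'pg',
  connection: {
    host: process.env.DB_HOST,
    port: process.env.DB_PORT,
    user: process.env.DB_USERNAME,
    password: process.env.DB_PASSWORD,
    database: process.env.DB_DATABASE,
    // Tag every connection with the pod name so pg_stat_activity can attribute it to a pod
    application_name: os.hostname()
  },
  pool: {
    min: parseInt(process.env.DB_POOL_MIN ?? '2', 10),
    max: parseInt(process.env.DB_POOL_MAX ?? '10', 10)
  },
  // How long Knex waits to acquire a connection from its pool before erroring (Knex default: 60000 ms)
  acquireConnectionTimeout: parseInt(process.env.DB_ACQUIRE_CONNECTION_TIMEOUT ?? '90000', 10)
};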

https://apps.nrs.gov.bc.ca/int/jira/browse/SHOWCASE-4099
https://apps.nrs.gov.bc.ca/int/jira/browse/SHOWCASE-4104
https://apps.nrs.gov.bc.ca/int/jira/browse/SHOWCASE-4105

Types of changes

Bug fix (non-breaking change which fixes an issue)

Checklist

  • I have read the CONTRIBUTING doc
  • I have checked that unit tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

Pod autoscaling and database connection limits

In theory, with a DB_POOL_MAX of 10 and an autoscaling.maxReplicas of 8, COMS would only consume 80 connections at any time, out of a PostgreSQL max_connections of 100 (we haven't changed it from the default value). However, there appear to be at least 16 other connections made by the following processes, and 80 + 16 = 96 is awfully close to 100:

  • Metabase (1)
  • Patroni heartbeat (1)
  • Crunchy replicas (2)
  • PgBouncer (2)
  • Unknown background/system processes (5)
    • These show up as largely NULL rows with an empty application_name
  • Additional unidentified connections under the app username, beyond what the app pools account for (5)

(as observed with 3 pods running at the time)

To list all of the current connections on the DB using psql:

SELECT * FROM pg_stat_activity;
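
To see how many connections each pod is holding (using the application_name set per pod), a grouped variant such as the following works; all columns are standard pg_stat_activity fields:

SELECT application_name, usename, count(*) AS connections
FROM pg_stat_activity
GROUP BY application_name, usename
ORDER BY connections DESC;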

syncVersions() atomicity

Original:

// Check for COMS and S3 Version statuses
const [comsVersions, s3VersionsRaw] = await Promise.allSettled([
  versionService.list(comsObject.id, trx),
  storageService.listAllObjectVersions({ filePath: comsObject.path, bucketId: comsObject.bucketId })
]).then(settled => settled.map(promise => promise.value));

This PR:

// Check for COMS and S3 Version statuses
const [comsVersions, s3VersionsRaw] = await Promise.all([
  versionService.list(comsObject.id, trx),
  storageService.listAllObjectVersions({ filePath: comsObject.path, bucketId: comsObject.bucketId })
]);

list() and listAllObjectVersions() return promises that can reject. With Promise.allSettled(), a rejected promise settles with a reason property instead of a value, so mapping the results to .value silently yields undefined.

Continuing to sync versions when either call fails can cause a de-sync between the COMS DB and the actual S3 object. Replacing Promise.allSettled() with Promise.all() ensures that an error is thrown right then and there, which kills the sync job (arguably the safer outcome) instead of causing a de-sync.
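
A standalone illustration of the difference (not COMS code; the rejection message is made up):

async function demo() {
  const bad = () => Promise.reject(new Error('S3 ListObjectVersions failed'));

  // allSettled never rejects: the failed promise becomes { status: 'rejected', reason },
  // so mapping to .value silently yields undefined and a sync would carry on with missing data.
  const settled = await Promise.allSettled([Promise.resolve(['v1', 'v2']), bad()])
    .then(results => results.map(r => r.value));
  console.log(settled); // [ [ 'v1', 'v2' ], undefined ]

  // all rejects as soon as either input rejects, so the error surfaces immediately
  // and the caller can abort the whole sync job.
  try {
    await Promise.all([Promise.resolve(['v1', 'v2']), bad()]);
  } catch (err) {
    console.log(err.message); // S3 ListObjectVersions failed
  }
}
demo();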

github-actions bot commented Nov 21, 2025

Coverage Report

Totals Coverage
Statements: 55.38% ( 3080 / 5562 )
Methods: 45.52% ( 330 / 725 )
Lines: 62.03% ( 1854 / 2989 )
Branches: 48.48% ( 896 / 1848 )

@TimCsaky (Contributor) left a comment

I guess you will have monitored the number of connections during a big sync job to see how this works.

DB_PORT: "5432"
# DB_POOL_MIN: "2"
# DB_POOL_MAX: "10"
DB_POOL_MAX: "5"
Contributor

Opposite of what I was suggesting (LOL), but this does make sense.
Hopefully Knex will queue the queries within this limit.

Contributor Author (norrisng-bc)

Seems to work almost too perfectly - I've noticed during testing that it keeps the CPU and memory usage so low that it doesn't trigger the HorizontalPodAutoscaler to do its thing. Probably needs a bit more tweaking.

Contributor Author (norrisng-bc)

Ah, I thought you were referring to the number of pods and not the number of Knex connections. These two simultaneous DB pools make things so confusing 😅

@TimCsaky (Contributor) commented Dec 1, 2025

I think the problem only happens when the COMS pods scale up to 8. How about:

  • keep the Knex max pool at 10 (allowing for better performance when the pod count is 3, which is nearly always)
  • adjust the HPA so that the max pod count is 6
  • also, as explained here, remove the limit on DB CPU/memory for all environments (update Helm values like this one)

The only other idea I have is to add a small interval in the processNextJob() call to slow things down (a hypothetical sketch follows).
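
Purely as a hypothetical sketch of that last suggestion (the polling loop and queue service shown here are illustrative, not the actual COMS queue implementation):

const JOB_INTERVAL_MS = 250; // small pause between jobs to avoid hammering the DB and S3

async function pollQueue() {
  for (;;) {
    await queueService.processNextJob(); // existing per-job processing
    await new Promise(resolve => setTimeout(resolve, JOB_INTERVAL_MS));
  }
}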

@norrisng-bc force-pushed the bug/large-sync-jobs branch 3 times, most recently from 0f0bfe6 to 8f0d340 on December 2, 2025 00:45
@TimCsaky (Contributor) left a comment

Guess we can merge this?

@norrisng-bc force-pushed the bug/large-sync-jobs branch 3 times, most recently from dd6e31c to 3281caa on January 8, 2026 17:40
Postgres has a hard limit of 65,535 params in a prepared statement, so if there are too many params (due to the number of sync jobs), split the enqueue into smaller queries that don't exceed that limit.
PgBouncer pooling makes it impossible to differentiate connections via pod IP addresses; this will make it easier to debug DB connections.
If the current user is SYSTEM_USER (i.e. S3 or basic auth), the current user won't have any permissions over the current bucket (according to the bucket_permission table, that is).
As a result, any attempt to copy over these (non-existent) permissions to any child buckets will fail.
The current user is now checked first: only if it isn't SYSTEM_USER are the bucket permissions copied over to child buckets.
* Decrease COMS max pod count to avoid saturating the shared Postgres connection pool
* Remove CPU/memory limits on Postgres pods
* Enable pod autoscaling in DEV and PR
* syncVersions now atomic - if listing either S3 or COMS versions fails, throw an error and don't continue
* Increase DB acquire connection timeout to 1.5 mins (configurable via new envar: db.acquireConnectionTimeout)
* Separate out getBucket() errors instead of throwing everything as a 404 (including DB connection-related errors)
This cuts down on the creation of new connections, which can overwhelm the S3 server under high load; the adjusted timeouts and keep-alive values also help it cope with that load.
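
For context, connection reuse and timeouts of this kind are typically configured on the S3 client's request handler. A minimal sketch, assuming the AWS SDK v3 S3 client; the specific timeout and socket values here are illustrative, not the values used by COMS:

const https = require('https');
const { S3Client } = require('@aws-sdk/client-s3');
const { NodeHttpHandler } = require('@smithy/node-http-handler');

// endpoint, region, and credentials omitted for brevity
const s3 = new S3Client({
  requestHandler: new NodeHttpHandler({
    // Reuse sockets across requests instead of opening a new TCP/TLS connection each time
    httpsAgent: new https.Agent({ keepAlive: true, maxSockets: 50 }),
    connectionTimeout: 3000, // ms allowed to establish a connection
    requestTimeout: 30000    // ms to wait for a response on an established connection
  })
});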
@TimCsaky force-pushed the bug/large-sync-jobs branch from 0a557e2 to 6c1bc76 on January 21, 2026 00:03
@TimCsaky merged commit 1a5422a into master on Jan 21, 2026
13 checks passed