Improved handling of large sync jobs #315
Conversation
TimCsaky left a comment
I guess you will have monitored the number of connections during a big sync job to see how this works.
charts/coms/values.yaml (outdated)
```yaml
  DB_PORT: "5432"
  # DB_POOL_MIN: "2"
  # DB_POOL_MAX: "10"
  DB_POOL_MAX: "5"
```
Opposite of what I was suggesting (LOL), but this does make sense.
Hopefully Knex will queue the queries within this limit.
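For context, a minimal sketch of how these Helm values could feed into the Knex pool settings; the exact env var wiring in COMS's config layer is an assumption here, but the queueing behaviour (queries wait for a free connection until `acquireConnectionTimeout` expires) is standard Knex:

```js
// Sketch (assumed wiring): map the Helm/env values onto Knex's pool settings.
// Knex queues query requests when all pooled connections are busy, and only
// errors out if a connection cannot be acquired within acquireConnectionTimeout.
const knex = require('knex')({
  client: 'pg',
  connection: {
    host: process.env.DB_HOST,          // assumed env var names
    port: process.env.DB_PORT || 5432,
    user: process.env.DB_USERNAME,
    password: process.env.DB_PASSWORD,
    database: process.env.DB_DATABASE
  },
  pool: {
    min: parseInt(process.env.DB_POOL_MIN || '2', 10),
    max: parseInt(process.env.DB_POOL_MAX || '5', 10) // e.g. the "5" from values.yaml above
  },
  // How long a query waits in the queue for a free connection before failing
  acquireConnectionTimeout: parseInt(process.env.DB_ACQUIRE_CONNECTION_TIMEOUT || '90000', 10)
});
```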
Seems to work almost too perfectly - I've noticed during testing that it keeps the CPU and memory usage so low that it doesn't trigger the HorizontalPodAutoscaler to do its thing. Probably needs a bit more tweaking.
Ah, I thought you were referring to the number of pods and not the number of Knex connections. These two simultaneous DB pools make things so confusing 😅
I think the problem only happens when the COMS pods scale up to 8.
How about:
- keep the Knex max pool at 10 (allowing for better performance when the pod count is 3, which it nearly always is)
- adjust the HPA so that the max pod count is 6
- also, as explained here, remove the limit on DB CPU/memory for all environments (update Helm values like this one)

The only other idea I have is to add a small interval in the `processNextJob()` call to slow things down.
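If that route were taken, the pacing could be as simple as a short sleep between queue polls. A purely hypothetical sketch; the real `processNextJob()` signature and queue loop in COMS may look quite different:

```js
// Hypothetical pacing sketch: wait a little between sync jobs so a huge queue
// doesn't monopolize the DB pool. The loop structure and names are assumptions.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function drainQueue(processNextJob, intervalMs = 250) {
  // processNextJob() is assumed to return something falsy once the queue is empty
  while (await processNextJob()) {
    await sleep(intervalMs); // small breather between jobs
  }
}
```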
TimCsaky left a comment
Guess we can merge this?
Postgres has a hard limit of 65,535 params in a prepared statement, so if there are too many params due to how many sync jobs there are, split the insert into smaller queries that don't exceed that limit.
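A minimal sketch of that chunking approach, assuming the rows are inserted with Knex; the helper name and table are placeholders, not the PR's actual code:

```js
// Stay under PostgreSQL's 65,535-parameter limit by splitting one big INSERT
// into several smaller ones. Table and row shapes here are placeholders.
const PG_MAX_PARAMS = 65535;

async function insertInChunks(knex, table, rows) {
  if (!rows.length) return;
  const paramsPerRow = Object.keys(rows[0]).length;
  // Leave a little headroom below the hard limit
  const rowsPerChunk = Math.max(1, Math.floor((PG_MAX_PARAMS - 1) / paramsPerRow));
  // knex.batchInsert() issues one INSERT per chunk instead of a single huge statement
  await knex.batchInsert(table, rows, rowsPerChunk);
}
```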
PgBouncer pooling makes it impossible to differentiate connections via pod IP addresses; setting the application_name field on each connection will make it easier to debug DB connections.
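As an illustration (the hostname-based naming scheme is an assumption), `application_name` can be set directly in the `pg` connection settings that Knex passes through, so each pod's connections are identifiable in `pg_stat_activity`, subject to PgBouncer's parameter tracking settings:

```js
// Sketch: tag each pod's connections with an application_name so they can be
// told apart in pg_stat_activity. The naming convention is an assumption.
const os = require('os');

const knex = require('knex')({
  client: 'pg',
  connection: {
    host: process.env.DB_HOST,
    database: process.env.DB_DATABASE,
    user: process.env.DB_USERNAME,
    password: process.env.DB_PASSWORD,
    // node-postgres forwards this to the server; it appears in pg_stat_activity.application_name
    application_name: `coms-${os.hostname()}`
  }
});
```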
If the current user is SYSTEM_USER (i.e. S3 or basic auth), the current user won't have any permissions on the current bucket (according to the bucket_permissions table, that is). As a result, any attempt to copy these (non-existent) permissions over to any child buckets will fail. The current user is now checked first: only if it isn't SYSTEM_USER are the bucket permissions copied over to child buckets.
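A rough sketch of that guard, with hypothetical service and helper names; the actual COMS implementation likely differs:

```js
// Hypothetical sketch: skip copying bucket permissions when the request was made
// as SYSTEM_USER (S3 credentials / basic auth), since there is nothing to copy.
const SYSTEM_USER = '00000000-0000-0000-0000-000000000000'; // placeholder value

async function copyParentPermissions(permissionService, parentBucketId, childBucketIds, currentUserId) {
  if (currentUserId === SYSTEM_USER) return []; // no explicit bucket permission rows to copy

  // Hypothetical service calls: read the parent's permissions, then re-grant them on each child
  const parentPerms = await permissionService.listBucketPermissions(parentBucketId, currentUserId);
  return Promise.all(
    childBucketIds.map((bucketId) => permissionService.addBucketPermissions(bucketId, parentPerms))
  );
}
```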
* Decrease COMS max pod count to avoid saturating the shared Postgres connection pool
* Remove CPU/memory limits on Postgres pods
* Enable pod autoscaling in DEV and PR
* syncVersions now atomic: if listing either S3 or COMS versions fails, throw an error and don't continue
* Increase DB acquire connection timeout to 1.5 mins (configurable via new envar: db.acquireConnectionTimeout)
* Separate out getBucket() errors instead of throwing everything as a 404 (including DB connection-related errors)
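A hedged sketch of what that error separation in getBucket() could look like; the table/column names, error checks, and return shapes are assumptions, not the PR's actual code:

```js
// Sketch: distinguish "bucket not found" (404) from DB connection timeouts (504)
// and everything else (500). Table/column names and the status field are assumptions.
async function getBucket(knex, bucketId) {
  try {
    const bucket = await knex('bucket').where({ bucketId }).first();
    if (!bucket) {
      throw Object.assign(new Error(`Bucket ${bucketId} not found`), { status: 404 });
    }
    return bucket;
  } catch (err) {
    if (err.status === 404) throw err;
    // Knex raises a KnexTimeoutError when acquireConnectionTimeout is exceeded
    if (err.name === 'KnexTimeoutError') {
      throw Object.assign(new Error('Database connection timed out'), { status: 504 });
    }
    throw Object.assign(new Error('Unexpected database error'), { status: 500 });
  }
}
```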
This cuts down on the creation of new connections, which can overwhelm the S3 server during high load. The adjusted timeouts and keep-alive values also deal with this high load.
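For illustration, this is roughly how connection reuse and timeouts can be configured on an AWS SDK v3 S3 client; the specific values, and whether COMS uses this SDK version, are assumptions:

```js
// Sketch: reuse S3 connections via a keep-alive agent and bound the timeouts,
// so bursts of sync traffic don't open a flood of new TLS connections.
const https = require('https');
const { S3Client } = require('@aws-sdk/client-s3');
const { NodeHttpHandler } = require('@smithy/node-http-handler');

const s3 = new S3Client({
  region: process.env.AWS_REGION || 'us-east-1', // assumed configuration
  requestHandler: new NodeHttpHandler({
    httpsAgent: new https.Agent({
      keepAlive: true, // reuse sockets instead of opening a new one per request
      maxSockets: 50   // cap concurrent connections to the S3 server
    }),
    connectionTimeout: 5000, // ms to establish a connection
    requestTimeout: 60000    // ms to wait for a response on an open socket
  })
});
```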
Description
This PR introduces changes to better handle large sync jobs, which typically occur when a recursive sync is requested:
- Large `INSERT` queries are now split into smaller ones if the number of parameters exceeds PostgreSQL's hard limit of 65,535. Previously a single large `INSERT` was used, which would often fail with recursive sync requests on buckets that have a very deep nested folder structure.
- The Knex pool size has been reduced by half. There are 2 database connection pools in COMS: one managed by Knex inside each pod, and another by PgBouncer above the pods. Shrinking the Knex pool (keep in mind that there's a separate "unshareable" pool per pod) will reduce the likelihood of hitting the Postgres `max_connections` limit.
- The DB acquire connection timeout is now configurable via a new envar: `DB_ACQUIRE_CONNECTION_TIMEOUT` (`db.acquireConnectionTimeout`).
- `syncVersions()` now throws an error if retrieving either the S3 or COMS object versions fails. See "`syncVersions()` atomicity" under "Further comments" for finer implementation details.
- `utils.getBucket()` error handling:
  - 404's are now only returned if the `bucketId` isn't found in the database
  - 504's are returned on database connection timeouts
  - 500's are returned for all other errors

Why? Some users have very deep nested folder structures. A recursive sync on the root of one will result in a very large sync job, which will cause failed sync requests and a denial of service for others currently using COMS (even for non-sync-related features), as COMS works its way through the sync queue.
Additional tidbits included:
- DB connections now set the `application_name` field, making them easier to tell apart behind PgBouncer.
- Bucket permissions are no longer copied to child buckets when the current user is `SYSTEM_USER`, since `SYSTEM_USER` doesn't have any permissions (i.e. explicit entries in the `bucket_permission` table) on the parent bucket to copy over to its children.

https://apps.nrs.gov.bc.ca/int/jira/browse/SHOWCASE-4099
https://apps.nrs.gov.bc.ca/int/jira/browse/SHOWCASE-4104
https://apps.nrs.gov.bc.ca/int/jira/browse/SHOWCASE-4105
Types of changes
Bug fix (non-breaking change which fixes an issue)
Checklist
Further comments
Pod autoscaling and database connection limits
In theory, with a `DB_POOL_MAX` of 10 and an `autoscaling.maxReplicas` of 8, this would only consume 80 connections at any time, out of a PostgreSQL `max_connections` of 100 (we haven't changed it from the default value). However, there appear to be at least 16 other connections being made for the following processes (and 80 + 16 = 96 is awfully close to 100):

- `NULL` rows with an empty `application_name`
- connections under the `app` username, after accounting for the app connections (5)

(as observed with 3 pods running at the time)
All of the current connections on the DB can be listed by querying `pg_stat_activity` with `psql`.

syncVersions() atomicity
`list()` and `listAllObjectVersions()` are promises that can potentially fail (and therefore settle with a `reason` instead of a `value`). Attempting to continue syncing versions when either fails can cause a de-sync between the COMS DB and the actual S3 object; replacing `Promise.allSettled()` with `Promise.all()` ensures that an error is thrown right then and there, which kills the sync job (which arguably is safer) instead of causing a de-sync.
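A hedged before/after sketch of that swap; the surrounding function shape and variable names are assumptions, and only the `Promise.allSettled()` to `Promise.all()` change is the point:

```js
// Before (assumed shape): allSettled never rejects, so a failed listing could be
// silently skipped and the sync would carry on with incomplete version data.
async function syncVersionsBefore(s3, comsDb, objectId) {
  const [s3Result, comsResult] = await Promise.allSettled([
    s3.listAllObjectVersions(objectId),
    comsDb.list(objectId)
  ]);
  // ...continues even if s3Result.status === 'rejected' or comsResult.status === 'rejected'
}

// After (assumed shape): Promise.all rejects as soon as either listing fails,
// which aborts the sync job instead of risking a DB/S3 de-sync.
async function syncVersionsAfter(s3, comsDb, objectId) {
  const [s3Versions, comsVersions] = await Promise.all([
    s3.listAllObjectVersions(objectId),
    comsDb.list(objectId)
  ]);
  // ...proceed knowing both listings succeeded
}
```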