Right-size opendatahub cluster pool: smaller workers, larger pool #76484
macgregor wants to merge 1 commit into openshift:main from
@macgregor: GitHub didn't allow me to request PR reviews from the following users: opendatahub. Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: macgregor. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files: Approvers can indicate their approval by writing
[REHEARSALNOTIFIER] Note: If this PR includes changes to step registry files (
@macgregor: all tests passed! Full PR test history. Your PR dashboard.
/pj-rehearse
@macgregor: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
@macgregor: no rehearsable tests are affected by this change
  platform:
    aws:
-     type: m5.2xlarge
+     type: m5.xlarge
Are you sure? Are we checking total utilization, or the resources requested in the cluster? (Using m5.xlarge, I hit the limit in odh-gitops: https://github.com/opendatahub-io/odh-gitops/blob/main/.tekton/helm-chart-validation-ocp-4.19.yaml#L42.) It also depends on the number of nodes.
Going to hold off on this until we have a more complete worker utilization dataset.
Right-sizes the opendatahub-ocp-4-19-amd64-aws cluster pool based on 90 days of CI utilization data. Workers are heavily overprovisioned while the pool is too small, causing a 37-41% cluster claim failure rate. Smaller workers partially offset the cost of the larger pool.
/cc @opendatahub
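The cost trade-off in the description can be sketched with some rough arithmetic. This is only an illustration: the pool sizes, worker count, and control-plane cost below are hypothetical placeholders, not values from this PR, and costs are in relative units. It assumes an m5.xlarge (4 vCPU, 16 GiB) costs about half of an m5.2xlarge (8 vCPU, 32 GiB).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolConfig:
    pool_size: int             # clusters kept ready in the pool (hypothetical)
    workers_per_cluster: int   # worker nodes per cluster (hypothetical)
    worker_cost: float         # relative hourly cost of one worker node
    control_plane_cost: float  # relative hourly cost of the rest of a cluster (hypothetical)


def pool_cost_per_hour(cfg: PoolConfig) -> float:
    """Relative hourly cost of keeping the whole pool running."""
    per_cluster = cfg.control_plane_cost + cfg.workers_per_cluster * cfg.worker_cost
    return cfg.pool_size * per_cluster


# Assumption: relative prices only; m5.xlarge is treated as half of m5.2xlarge.
M5_2XLARGE = 1.0
M5_XLARGE = 0.5

# All sizes below are made up for illustration.
before = PoolConfig(pool_size=4, workers_per_cluster=3,
                    worker_cost=M5_2XLARGE, control_plane_cost=3.0)
larger_pool_only = PoolConfig(pool_size=6, workers_per_cluster=3,
                              worker_cost=M5_2XLARGE, control_plane_cost=3.0)
after = PoolConfig(pool_size=6, workers_per_cluster=3,
                   worker_cost=M5_XLARGE, control_plane_cost=3.0)

for name, cfg in [("before", before),
                  ("larger pool only", larger_pool_only),
                  ("larger pool + smaller workers", after)]:
    print(f"{name}: {pool_cost_per_hour(cfg)}")
```

With these placeholder numbers, growing the pool alone raises the relative cost from 24 to 36, while also halving the worker size brings it to 27, so the smaller workers recover most but not all of the increase. Whether the smaller instances leave enough headroom is a separate question that depends on resource requests, as raised in the review comment.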