Skip to content

BCDA-8777: Update alert destinations for workflows and lambdas#1175

Merged
laurenkrugen-navapbc merged 13 commits intomainfrom
lauren/BCDA-8777
Aug 8, 2025
Merged

BCDA-8777: Update alert destinations for workflows and lambdas#1175
laurenkrugen-navapbc merged 13 commits intomainfrom
lauren/BCDA-8777

Conversation

@laurenkrugen-navapbc
Copy link
Contributor

@laurenkrugen-navapbc laurenkrugen-navapbc commented Aug 4, 2025

🎫 Ticket

https://jira.cms.gov/browse/BCDA-8777

🛠 Changes

Admin Lambdas:

  • Updated slack alerts to be triggered in the handle function
  • All status updates will go to operations channel

Workflows:

  • Deploy related workflows have had destination channels updated to deploy channel

General:

  • Added slack utils package and constants for slack messages to reduce duplicate code

ℹ️ Context

Alerting channels are currently getting status and informational messages that don't pertain to alerts. To better manage alerts and critical infrastructure related monitoring, our alerting channel should be free of messages that do not indicate an issue. Deploy related updates will only be in the deploy channel and admin lambdas will report status updates to the operations channel, including run failures, since those are expected.

🧪 Validation

Integration tests are passing, workflows in expected channels. Looking to verify:

  • lambdas that run on a schedule (non-admin) route failures to the alerts channel but route success to operations
  • admin lambdas route all statuses to operations
  • deploy workflows and related workflows route only to deploy channel
  • Starting.. messages removed to reduce noise

SlackToken string
}

type Notifier interface {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this was only in use so it could be mocked during tests. I moved the slack postMessage function to the handler call to simplify things so this was no longer needed.

}

func handleACODenies(ctx context.Context, conn PgxConnection, data payload, notifier Notifier) error {
_, _, err := notifier.PostMessageContext(ctx, slackChannel, slack.MsgOptionText(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the notifier.PostMessageContext + the error check has been removed for each lambda and replaced with SendSlackMessage. Mostly done for readability and removing duplicate lines.

@laurenkrugen-navapbc laurenkrugen-navapbc requested a review from a team August 5, 2025 19:40
@laurenkrugen-navapbc laurenkrugen-navapbc marked this pull request as ready for review August 5, 2025 19:40
@laurenkrugen-navapbc laurenkrugen-navapbc requested a review from a team as a code owner August 5, 2025 19:40
color := "danger"
if status {
color = "good"
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Dont think we need to have Slack in the name of the function (SendSlackMessage -> SendMessage).

Nit: Not a big fan of status (bool) param type, it seems to exist just to differentiate between the color? I know its more verbose but Id prefer a SendSuccess(...) function and a SendFailure(...) function but interested to hear what other people think. I think at the very least we could change the param name to something more appropriate.

Overall this feels like a fairly thin wrapper around the actual slack client and its interface. To some degree I feel like its a bit excessive but it does also slim various use cases down a fair amount. I think this concern is shown by what happens if we need to add a bunch of customization/configuration to the messages we want to send. This wrapper then gets built up until we more or less just have the slack client all over again. However as our use case is now, this does save us quite a bit of verbose references elsewhere.

Part of the goal of notifier was to decouple the thing we want to happen (notification of success/failure) from the implementation tool (slack). If/when we need to change implementations it should be a bit more straightforward. I dont think the notifier was a good implementation of this, but maybe theres a good middle ground? One possibility would be to change this slack_utils package to a notifier package, with a slack implementation/wrapper.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I agree that the wrapper is pretty thin and if we needed something more verbose in the future, then we would have to modify it to suit the customizations, etc. I mostly created it because the notifier statements with the error handling were repetitive and added a lot of noise to the functions.

I ultimately decided against making it a little more flexible for future changes because 1) I don't really see this changing much in the future and 2) this wrapper is so light that a modification to it would have been equivalent in lift to modifying every notifier statement that we had or reverting it would been similar.

I did the green/red bool because I removed the informational "starting.." messages but if we want the neutral, warning, etc, we can always updated. I'd be happy to break it out into two functions, or I can modify so the color is optional.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just adding my 2 cents here, but we wouldn't want a whole lot of customization even in the future. If we do need it at a particular time, we can update it then. Doing it now feels a bit premature optimization and against the principle of Yagni

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I assumed we weren't going to be needing a lot of changes in the future which is why I went with the bool red/green strategy.

@carlpartridge - what are we thinking in terms of leaving this wrapper in here? Seemed like a good way to enforce consistent slack messages and clean up redundant code, but I am open if we feel strongly that this isn't the best approach.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think things are fine as is. I think one or some of the following could help though:

  • rename bool var (status -> success_msg? color?)
  • change bool var to string and just pass in the color option
  • default bool value to success instead of danger
  • create a SendSuccess() and a SendFailure() function that handle the bool
  • instead of passing a bool in pass in an options object, with color as one of the options (again this is rewriting the slack package so in addition to being more work is probably not worth it)

err = handleCreateACO(ctx, conn, data, id, slackClient)
err = handleCreateACO(ctx, conn, data, id)
if err != nil {
slUtls.SendSlackMessage(slackClient, slUtls.OperationsChannel, fmt.Sprintf("%s: Create ACO lambda in %s env.", slUtls.FailureMsg, os.Getenv("ENV")), false)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One concern about moving notification messages out of the business logic (eg handleCreateACO) is that we can no longer test/verify that we are sending notifications as part of the process. Im not sure if that matters too much overall.

My general thinking around lambdas is that the handler should do as little as possible since its very difficult to properly test. Basically it should be setting up the requirements/environment in which to run the actual business logic. This allows us to verify as much as possible with unit tests and such.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's fair - there's definitely a gap in testing with slack if we are to make changes in the future, some even outside the scope of testing here. Ex: while moving some of these alerts to a different channel, the slackbot didn't have access to the channel yet. The error was logged, but that was it.

possible solutions:

  • revert these changes altogether and put the notifiers back in
  • do some error handling for the slack messages (which didn't exist before - just logging it like it is now)
  • manual review every so often of alerts to verify that they aren't missed from our critical services
  • nothing; if modifications are made to these lambdas, reviewers and PR owner should be validating that integration tests work, and lambda runs aren't logging errors, etc

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the reason we only log an error when slack fails to send a message is that we dont want to fail the function on an issue with slack. Overall I see the slack messages in these lambda contexts as a nice to have. Keeps us more aware of whats happening, but losing some isnt the end of the world.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, and we will still get logs if it fails - I was able to debug because it failed to send the message to the new slack channel yesterday. Let me know if you'd like me to revert -- I made a call to keep things as they are because the impact would be super low and I don't expect many changes to this in the future. The mock we used in testing also assumed a happy path, so I felt like it was no risk to remove it.

I'm open to any changes!

Copy link
Collaborator

@carlpartridge carlpartridge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like some tests for slack_utils, other than that things look good!

@laurenkrugen-navapbc
Copy link
Contributor Author

I would like some tests for slack_utils, other than that things look good!

sounds good, adding tests now!

carlpartridge
carlpartridge previously approved these changes Aug 8, 2025
@laurenkrugen-navapbc laurenkrugen-navapbc merged commit 5b3a227 into main Aug 8, 2025
11 of 12 checks passed
@laurenkrugen-navapbc laurenkrugen-navapbc deleted the lauren/BCDA-8777 branch August 8, 2025 17:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants