BCDA-8777: Update alert destinations for workflows and lambdas#1175
BCDA-8777: Update alert destinations for workflows and lambdas#1175laurenkrugen-navapbc merged 13 commits intomainfrom
Conversation
| SlackToken string | ||
| } | ||
|
|
||
| type Notifier interface { |
There was a problem hiding this comment.
Seems like this was only in use so it could be mocked during tests. I moved the slack postMessage function to the handler call to simplify things so this was no longer needed.
| } | ||
|
|
||
| func handleACODenies(ctx context.Context, conn PgxConnection, data payload, notifier Notifier) error { | ||
| _, _, err := notifier.PostMessageContext(ctx, slackChannel, slack.MsgOptionText( |
There was a problem hiding this comment.
the notifier.PostMessageContext + the error check has been removed for each lambda and replaced with SendSlackMessage. Mostly done for readability and removing duplicate lines.
bcda/slack/slack.go
Outdated
| color := "danger" | ||
| if status { | ||
| color = "good" | ||
| } |
There was a problem hiding this comment.
Nit: Dont think we need to have Slack in the name of the function (SendSlackMessage -> SendMessage).
Nit: Not a big fan of status (bool) param type, it seems to exist just to differentiate between the color? I know its more verbose but Id prefer a SendSuccess(...) function and a SendFailure(...) function but interested to hear what other people think. I think at the very least we could change the param name to something more appropriate.
Overall this feels like a fairly thin wrapper around the actual slack client and its interface. To some degree I feel like its a bit excessive but it does also slim various use cases down a fair amount. I think this concern is shown by what happens if we need to add a bunch of customization/configuration to the messages we want to send. This wrapper then gets built up until we more or less just have the slack client all over again. However as our use case is now, this does save us quite a bit of verbose references elsewhere.
Part of the goal of notifier was to decouple the thing we want to happen (notification of success/failure) from the implementation tool (slack). If/when we need to change implementations it should be a bit more straightforward. I dont think the notifier was a good implementation of this, but maybe theres a good middle ground? One possibility would be to change this slack_utils package to a notifier package, with a slack implementation/wrapper.
There was a problem hiding this comment.
yeah, I agree that the wrapper is pretty thin and if we needed something more verbose in the future, then we would have to modify it to suit the customizations, etc. I mostly created it because the notifier statements with the error handling were repetitive and added a lot of noise to the functions.
I ultimately decided against making it a little more flexible for future changes because 1) I don't really see this changing much in the future and 2) this wrapper is so light that a modification to it would have been equivalent in lift to modifying every notifier statement that we had or reverting it would been similar.
I did the green/red bool because I removed the informational "starting.." messages but if we want the neutral, warning, etc, we can always updated. I'd be happy to break it out into two functions, or I can modify so the color is optional.
There was a problem hiding this comment.
Just adding my 2 cents here, but we wouldn't want a whole lot of customization even in the future. If we do need it at a particular time, we can update it then. Doing it now feels a bit premature optimization and against the principle of Yagni
There was a problem hiding this comment.
Yeah, I assumed we weren't going to be needing a lot of changes in the future which is why I went with the bool red/green strategy.
@carlpartridge - what are we thinking in terms of leaving this wrapper in here? Seemed like a good way to enforce consistent slack messages and clean up redundant code, but I am open if we feel strongly that this isn't the best approach.
There was a problem hiding this comment.
I think things are fine as is. I think one or some of the following could help though:
- rename bool var (status -> success_msg? color?)
- change bool var to string and just pass in the color option
- default bool value to success instead of danger
- create a SendSuccess() and a SendFailure() function that handle the bool
- instead of passing a bool in pass in an options object, with color as one of the options (again this is rewriting the slack package so in addition to being more work is probably not worth it)
bcda/lambda/admin_create_aco/main.go
Outdated
| err = handleCreateACO(ctx, conn, data, id, slackClient) | ||
| err = handleCreateACO(ctx, conn, data, id) | ||
| if err != nil { | ||
| slUtls.SendSlackMessage(slackClient, slUtls.OperationsChannel, fmt.Sprintf("%s: Create ACO lambda in %s env.", slUtls.FailureMsg, os.Getenv("ENV")), false) |
There was a problem hiding this comment.
One concern about moving notification messages out of the business logic (eg handleCreateACO) is that we can no longer test/verify that we are sending notifications as part of the process. Im not sure if that matters too much overall.
My general thinking around lambdas is that the handler should do as little as possible since its very difficult to properly test. Basically it should be setting up the requirements/environment in which to run the actual business logic. This allows us to verify as much as possible with unit tests and such.
There was a problem hiding this comment.
that's fair - there's definitely a gap in testing with slack if we are to make changes in the future, some even outside the scope of testing here. Ex: while moving some of these alerts to a different channel, the slackbot didn't have access to the channel yet. The error was logged, but that was it.
possible solutions:
- revert these changes altogether and put the notifiers back in
- do some error handling for the slack messages (which didn't exist before - just logging it like it is now)
- manual review every so often of alerts to verify that they aren't missed from our critical services
- nothing; if modifications are made to these lambdas, reviewers and PR owner should be validating that integration tests work, and lambda runs aren't logging errors, etc
There was a problem hiding this comment.
I believe the reason we only log an error when slack fails to send a message is that we dont want to fail the function on an issue with slack. Overall I see the slack messages in these lambda contexts as a nice to have. Keeps us more aware of whats happening, but losing some isnt the end of the world.
There was a problem hiding this comment.
yup, and we will still get logs if it fails - I was able to debug because it failed to send the message to the new slack channel yesterday. Let me know if you'd like me to revert -- I made a call to keep things as they are because the impact would be super low and I don't expect many changes to this in the future. The mock we used in testing also assumed a happy path, so I felt like it was no risk to remove it.
I'm open to any changes!
carlpartridge
left a comment
There was a problem hiding this comment.
I would like some tests for slack_utils, other than that things look good!
sounds good, adding tests now! |
🎫 Ticket
https://jira.cms.gov/browse/BCDA-8777
🛠 Changes
Admin Lambdas:
handlefunctionWorkflows:
General:
ℹ️ Context
Alerting channels are currently getting status and informational messages that don't pertain to alerts. To better manage alerts and critical infrastructure related monitoring, our alerting channel should be free of messages that do not indicate an issue. Deploy related updates will only be in the deploy channel and admin lambdas will report status updates to the operations channel, including run failures, since those are expected.
🧪 Validation
Integration tests are passing, workflows in expected channels. Looking to verify:
Starting..messages removed to reduce noise