Contributor

@gpop63 gpop63 commented Sep 16, 2025

Overview

This PR introduces the first set of alert rule templates for key AWS data streams. For each stream, we selected the two most critical metrics to monitor.

ec2_metrics

  • High CPU Utilization
    • This alarm is used to detect high CPU utilization.
  • Status Check Failed
    • This alarm is used to detect underlying problems with instances, including both system status check failures and instance status check failures.

lambda

  • High Number of Throttles
    • The alarm helps detect a high number of throttled invocation requests for a Lambda function.
  • High Number of Errors
    • The alarm helps detect high error counts in function invocations.

sqs

  • Oldest Message Age is Too High
    • This alarm is used to detect whether the age of the oldest message in the QueueName queue is too high. The threshold depends on the situation.
  • High Number of Visible Messages
    • This alarm is used to detect whether the message count of the active queue is too high, meaning that consumers are slow to process the messages or there are not enough consumers to process them. The threshold depends on the situation.

sns

  • Any Message Delivery Fails
    • This alarm helps you proactively find issues with the delivery of notifications and take appropriate actions to address them. The threshold depends on the situation.
  • Number of Notifications Filtered Out - Invalid Attributes
    • The alarm is used to detect if the published messages are not valid or if inappropriate filters have been applied to a subscriber.

Checklist

  • I have reviewed tips for building integrations and this pull request is aligned with them.
  • I have verified that all data streams collect metrics or logs.
  • I have added an entry to my package's changelog.yml file.
  • I have verified that Kibana version constraints are current according to guidelines.
  • I have verified that any added dashboard complies with Kibana's Dashboard good practices

Author's Checklist

  • [ ]

How to test this PR locally

Related issues

  • Closes elastic/obs-integration-team/issues/536

Screenshots

@gpop63 gpop63 requested review from a team as code owners September 16, 2025 14:09

cla-checker-service bot commented Sep 16, 2025

💚 CLA has been signed

@gpop63 gpop63 force-pushed the add_aws_alert_rule_templates branch from ef16f46 to bbb5db6 Compare September 16, 2025 14:12
@gpop63 gpop63 self-assigned this Sep 16, 2025
@gpop63 gpop63 added Integration:aws AWS enhancement New feature or request labels Sep 16, 2025
@andrewkroh andrewkroh added the Team:obs-ds-hosted-services Observability Hosted Services team [elastic/obs-ds-hosted-services] label Sep 16, 2025
@ishleenk17
Member

@gpop63: The template will be usable from 9.2 onwards, right?
If yes, let's mark the PR as DON'T MERGE.

Can you please share a screenshot of what a particular alert looks like? Also, are we not adding any information about alert support in the READMEs?

@muthu-mps muthu-mps changed the title [AWS] Introduce initial alert rule templates [AWS] Introduce initial alert rule templates - DO NOT MERGE Sep 17, 2025
"type": "alerting_rule_template",
"attributes": {
"name": "EC2 High CPU Utilization",
"tags": [],
Contributor

Can we add tags?

What tags were you thinking of, Muthu?

Contributor

The tags can include the service name and the alert metric name, similar to what I added for Azure AI Foundry,
e.g., [AWS EC2, AWS EC2 CPU Utilization].

@@ -0,0 +1,37 @@
{
"id": "b6513de4-6c36-499a-8f0a-98431cd4dbee",
Contributor

Should the id match the file name of the rule template? The check currently fails with:
Error: defines non-matching ID
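
A check like the one behind this error could be sketched locally as follows. This is only an illustration, assuming the convention is that a template stored as `<id>.json` must contain `"id": "<id>"`; the function name and the directory passed to it are hypothetical and not taken from the PR or from any Elastic tooling.

```python
import json
from pathlib import Path


def find_mismatched_ids(template_dir):
    """Return (file name, id) pairs where the JSON "id" differs from the file stem."""
    mismatches = []
    for path in sorted(Path(template_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            template = json.load(handle)
        # Assumed convention: <id>.json must contain "id": "<id>".
        if template.get("id") != path.stem:
            mismatches.append((path.name, template.get("id")))
    return mismatches
```

Calling `find_mismatched_ids(...)` on the directory that holds the rule templates would return an empty list when every file name matches its id.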

Contributor Author

gpop63 commented Sep 17, 2025

@ishleenk17 Right now the support is not fully there; we only see them under assets and in saved objects.

image

@muthu-mps muthu-mps requested a review from agithomas September 18, 2025 05:52
"groupBy": "all",
"termSize": 5,
"sourceFields": [],
"timeField": "event.ingested",
Contributor

Can the time field be @timestamp? Is there a reason for choosing event.ingested instead of @timestamp?

Contributor Author

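I tried using the @timestamp field, but it wasn't generating alerts. For some AWS data streams, @timestamp is the time the metric was actually recorded in AWS, so a document may already be older than the rule's time window by the time it is ingested.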
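
The effect described here can be illustrated with a small sketch. The 10-minute measurement delay and 1-minute ingest delay are made-up numbers, and `in_window` is a hypothetical helper that only mimics a look-back window; it is not how Kibana evaluates rules, and the thread does not confirm the exact delays involved.

```python
from datetime import datetime, timedelta, timezone


def in_window(field_time, now, window):
    """True if field_time falls inside the look-back window (now - window, now]."""
    return now - window < field_time <= now


now = datetime(2025, 9, 17, 12, 0, tzinfo=timezone.utc)
window = timedelta(minutes=5)  # matches timeWindowSize=5, timeWindowUnit="m"

# Hypothetical document: measured 10 minutes ago, ingested 1 minute ago.
doc = {
    "@timestamp": now - timedelta(minutes=10),
    "event.ingested": now - timedelta(minutes=1),
}

matched_by_timestamp = in_window(doc["@timestamp"], now, window)
matched_by_ingested = in_window(doc["event.ingested"], now, window)
print(matched_by_timestamp, matched_by_ingested)  # False True
```

Under these assumptions, the document is skipped when the window is applied to `@timestamp` but is picked up when it is applied to `event.ingested`.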

"esql": "FROM metrics-aws.ec2_metrics-default\n| STATS cpuutilization=avg(aws.ec2.metrics.CPUUtilization.avg) by cloud.account.id, cloud.region, aws.dimensions.InstanceId\n| WHERE cpuutilization >= 80"
},
"aggType": "count",
"groupBy": "all",
Contributor

Is this groupBy not applicable when using an ESQL query?

Contributor Author

The actual grouping of the data happens in the ESQL query itself; this groupBy is a separate property that the alert has to have.
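
To make the point concrete, here is a rough Python emulation of what the `STATS ... by ... | WHERE cpuutilization >= 80` part of the query does. The sample rows are invented and `breaching_groups` is a hypothetical name; this is a sketch of the grouping and threshold logic only, not of how Elasticsearch executes the query.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows: (account id, region, instance id, CPU utilization %).
rows = [
    ("111", "us-east-1", "i-a", 90.0),
    ("111", "us-east-1", "i-a", 85.0),
    ("111", "us-east-1", "i-b", 20.0),
    ("222", "eu-west-1", "i-c", 70.0),
    ("222", "eu-west-1", "i-c", 95.0),
]


def breaching_groups(rows, threshold=80.0):
    """Emulate `STATS avg(...) BY keys | WHERE avg >= threshold`."""
    groups = defaultdict(list)
    for account, region, instance, cpu in rows:
        groups[(account, region, instance)].append(cpu)
    averages = {key: mean(values) for key, values in groups.items()}
    return {key: avg for key, avg in averages.items() if avg >= threshold}


result = breaching_groups(rows)
```

Each key left in `result` corresponds to one row returned by the query, which is why both the grouping and the threshold live inside the query rather than in the rule's `groupBy` and `threshold` properties.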

"thresholdComparator": ">",
"size": 100,
"esqlQuery": {
"esql": "FROM metrics-aws.ec2_metrics-default\n| STATS cpuutilization=avg(aws.ec2.metrics.CPUUtilization.avg) by cloud.account.id, cloud.region, aws.dimensions.InstanceId\n| WHERE cpuutilization >= 80"
Contributor

Applying a dataset filter would help fetch only the data relevant to the alerting metrics. WDYT?

Contributor Author

How do we do that? Also, this ESQL query already targets documents from a specific data stream/index (metrics-aws.ec2_metrics-default).

Contributor

We can ignore this, as the query directly targets a specific data stream.

"searchType": "esqlQuery",
"timeWindowSize": 5,
"timeWindowUnit": "m",
"threshold": [
Contributor

Similar to groupBy: please check whether the threshold value is applied directly from the ESQL query and not from here.

Contributor Author

The threshold is set in the ESQL query; this is a different property of the alert.

@elastic-vault-github-plugin-prod

🚀 Benchmarks report

To see the full report comment with /test benchmark fullreport

@elastic-sonarqube

Quality Gate failed

Failed conditions
0.7% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube

@muthu-mps muthu-mps changed the title [AWS] Introduce initial alert rule templates - DO NOT MERGE [AWS] Introduce initial alert rule templates Sep 26, 2025
@daniela-elastic daniela-elastic Sep 30, 2025

Where do we declare which service (entity) this alert template applies to? Something like resource : aws.ec2

Contributor Author

I have included the service name in the name of the alert rule template. I suppose Kibana should allow us to filter by tags or by partial matches on the title of the alert rule template.

@muthu-mps muthu-mps requested a review from MichelLosier October 9, 2025 09:08
"thresholdComparator": ">",
"size": 100,
"esqlQuery": {
"esql": "FROM metrics-aws.ec2_metrics-default\n| STATS cpuutilization=avg(aws.ec2.metrics.CPUUtilization.avg) by cloud.account.id, cloud.region, aws.dimensions.InstanceId\n| WHERE cpuutilization >= 80"
Contributor

The field aws.ec2.metrics.CPUUtilization.avg is renamed here. Do you think this field should be changed to host.cpu.usage field?

Contributor Author

@gpop63 gpop63 Oct 9, 2025

Both metrics are present, but switching to the ECS field seems like the better option. Switched to host.cpu.usage in d28dd85. I had to multiply it by 100 to get percentages.

Contributor

@MichelLosier MichelLosier left a comment

Configuration looks good to me!

@gpop63 gpop63 marked this pull request as draft October 10, 2025 10:35
Comment on lines 8 to 10
"schedule": {
"interval": "1m"
},
Contributor

This is applicable for all the configurations.

Should we run this so frequently? I suggest setting it equal to the default period for metrics ingestion. That helps avoid no-data alerts (if the user decides to extend the configuration).

Contributor Author

Should we set timeWindowSize to match the integration period? That way, for example, every 5 minutes we’d check for alerts in documents from the past 5 minutes.

Contributor

I think that's a reasonable thing to do. The impact, I assume, is that instead of an alert being notified within period + 1m, it will be notified within 2 x period. Here the period is 5m for most AWS services.

@tommyers-elastic , what would be your recommendation?
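
The estimate above can be written out as simple arithmetic. This is a deliberately simplified model, assuming the worst case is one full collection period followed by one full rule interval; it ignores ingestion latency and rule execution time, and `worst_case_delay_minutes` is a hypothetical helper, not part of any Elastic API.

```python
def worst_case_delay_minutes(collection_period, rule_interval):
    """Simplified upper bound: wait for the next collection, then the next rule run."""
    return collection_period + rule_interval


# With the 5m period mentioned above:
print(worst_case_delay_minutes(5, 1))  # 6, i.e. "period + 1m"
print(worst_case_delay_minutes(5, 5))  # 10, i.e. "2 x period"
```

Under this model, matching the rule interval to the collection period trades a few minutes of extra worst-case delay for fewer rule runs that see no new data.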

Contributor

i don't think we have any way to couple configs in agent policy templates with these rule configurations, so whatever we choose will have to be always added by hand.

my only thinking here is that it doesn't make sense to run a rule more frequently than the integration collection period. matching the rule frequency with the collection period seems sensible to me.

Contributor

it's a shame there's no way to put hints in the form such that we could have something that shows up and says "should match the integration collection period" or something. if we think it's worthwhile we could suggest this as a feature.

@muthu-mps muthu-mps marked this pull request as ready for review November 5, 2025 10:38
  subscription: basic
  kibana:
-   version: "^8.19.0 || ^9.1.0"
+   version: "^9.2.1"
Contributor

@elastic/security-service-integrations team, this feature is supported starting from the 9.2.1 release, so the minimum stack version is raised to 9.2.1. Since the AWS integrations are co-owned, could you confirm whether the stack version upgrade is fine for the integrations managed by the security team?

Contributor Author

gpop63 commented Nov 7, 2025

/test


elasticmachine commented Nov 9, 2025

💔 Build Failed

Failed CI Steps

History

cc @gpop63

Labels

enhancement New feature or request Integration:aws AWS Team:obs-ds-hosted-services Observability Hosted Services team [elastic/obs-ds-hosted-services]

10 participants