5 changes: 5 additions & 0 deletions packages/aws/changelog.yml
@@ -1,4 +1,9 @@
# newer versions go on top
- version: "3.17.0"
  changes:
    - description: Add ec2_metrics, lambda, sqs and sns alert rule templates.
      type: enhancement
      link: https://github.com/elastic/integrations/pull/999
- version: "3.16.0"
  changes:
    - description: Map `recipient_account_id` to `cloud.account.id` for AWS CloudTrail.

@daniela-elastic, Sep 30, 2025

Where do we declare which service (entity) this alert template applies to? Something like `resource: aws.ec2`?

Contributor Author

I have included the service name in the name of the alert rule template. I suppose Kibana should allow us to filter by tags or by partial matches on the title of the alert rule template.

@@ -0,0 +1,37 @@
{
"id": "ec2-high-cpu-utilization",
"type": "alerting_rule_template",
"attributes": {
"name": "EC2 High CPU Utilization",
"tags": [],
Contributor

Can we add tags?

What tags were you thinking of, Muthu?

Contributor

The tags can include the service name and the alert metric name, similar to what I added for Azure AI Foundry,
e.g., [AWS EC2, AWS EC2 CPU Utilization].
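
As a rough illustration of the tagging convention suggested above, the sketch below derives the two tags from a service name and a metric name. The helper `build_tags` and the "AWS " prefix rule are assumptions made for illustration; they are not part of the integration or of any Kibana API.

```python
def build_tags(service: str, metric: str) -> list[str]:
    """Return [service tag, service + metric tag] for a rule template.

    Hypothetical helper: the naming convention is inferred from the
    example tags in this thread, not taken from Kibana.
    """
    service_tag = f"AWS {service}"
    return [service_tag, f"{service_tag} {metric}"]


print(build_tags("EC2", "CPU Utilization"))
```

The resulting list could then be pasted into the template's `"tags"` array by hand.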

"ruleTypeId": ".es-query",
"schedule": {
"interval": "1m"
},
"params": {
"searchType": "esqlQuery",
"timeWindowSize": 5,
"timeWindowUnit": "m",
"threshold": [
Contributor

Similar to the groupBy comment: please check whether the threshold value is applied directly from the ES|QL query rather than from here.

Contributor Author

The metric threshold is set in the ES|QL query; this is a separate property of the alert rule.
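
To make the distinction concrete, here is a minimal sketch of how the two thresholds interact. It assumes that an ES|QL-based `.es-query` rule compares the number of rows returned by the query against `threshold` using `thresholdComparator`; the function names and sample rows are hypothetical.

```python
import operator

COMPARATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def esql_where(rows: list[dict], cpu_threshold: float = 80.0) -> list[dict]:
    """Mimic `| WHERE cpuutilization >= 80`: the metric threshold lives in the query."""
    return [r for r in rows if r["cpuutilization"] >= cpu_threshold]


def rule_fires(matched_rows: list[dict], threshold: list[int], comparator: str) -> bool:
    """Mimic the rule-level check (assumed): row count compared against threshold[0]."""
    return COMPARATORS[comparator](len(matched_rows), threshold[0])


# Hypothetical per-instance averages, as produced by the STATS step.
rows = [
    {"aws.dimensions.InstanceId": "i-aaa", "cpuutilization": 91.5},
    {"aws.dimensions.InstanceId": "i-bbb", "cpuutilization": 12.0},
]
matched = esql_where(rows)
print(rule_fires(matched, threshold=[0], comparator=">"))
```

Under this reading, `"threshold": [0]` with `">"` only means "alert if the query returned at least one row".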

0
],
"thresholdComparator": ">",
"size": 100,
"esqlQuery": {
"esql": "FROM metrics-aws.ec2_metrics-default\n| STATS cpuutilization=avg(aws.ec2.metrics.CPUUtilization.avg) by cloud.account.id, cloud.region, aws.dimensions.InstanceId\n| WHERE cpuutilization >= 80"
Contributor

Applying a dataset filter would help fetch only the data relevant to the alerting metric. WDYT?

Contributor Author

How do we do that? Also, this ES|QL query targets documents from a specific data stream/index (metrics-aws.ec2_metrics-default).

Contributor

We can ignore this, since the query directly targets a specific data stream.

Contributor

The field aws.ec2.metrics.CPUUtilization.avg is renamed here. Do you think the query should use the host.cpu.usage field instead?

Contributor Author

@gpop63, Oct 9, 2025

Both metrics are present but switching to the ECS field seems like the better option. Switched to host.cpu.usage in d28dd85. I had to multiply it by 100 to use percentages.
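
The factor of 100 follows from the units: ECS defines `host.cpu.usage` as a ratio between 0 and 1, while the template's threshold of 80 is a percentage. A small sketch of the conversion follows; the function name and sample values are made up, and the actual query from d28dd85 is not reproduced here.

```python
def exceeds_cpu_threshold(host_cpu_usage: float, threshold_pct: float = 80.0) -> bool:
    """Convert a 0..1 usage ratio to a percentage and compare with the threshold."""
    cpu_pct = host_cpu_usage * 100  # ratio -> percent, the scaling the comment describes
    return cpu_pct >= threshold_pct


for usage in (0.42, 0.93):  # hypothetical averaged host.cpu.usage values
    print(usage, exceeds_cpu_threshold(usage))
```

Without the scaling, a ratio could never reach 80, so the rule would never match.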

},
"aggType": "count",
"groupBy": "all",
Contributor

Is this groupBy not applicable when using an ES|QL query?

Contributor Author

The actual grouping of the data happens in the ES|QL query itself; this groupBy has to be set as a property of the alert rule.

"termSize": 5,
"sourceFields": [],
"timeField": "event.ingested",
Contributor

Can the time field be @timestamp? Is there a reason for choosing event.ingested instead of @timestamp?

Contributor Author

I tried using the @timestamp field, but it wasn't generating alerts. For some AWS data streams, @timestamp is the time the metric was actually recorded in AWS rather than the time it was ingested.
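
A small sketch of why the choice of time field matters: if a metric's `@timestamp` lags its ingestion time by more than the rule's 5-minute window, a window evaluated on `@timestamp` never sees the document, while one evaluated on `event.ingested` does. The 7-minute lag and the timestamps below are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def in_window(doc: dict, time_field: str, now: datetime, window: timedelta) -> bool:
    """True if the document's time_field falls inside (now - window, now]."""
    return now - window < doc[time_field] <= now


now = datetime(2025, 10, 9, 12, 0, tzinfo=timezone.utc)
window = timedelta(minutes=5)  # timeWindowSize=5, timeWindowUnit="m"

# Hypothetical document: measured in AWS 7 minutes ago, ingested 1 minute ago.
doc = {
    "@timestamp": now - timedelta(minutes=7),
    "event.ingested": now - timedelta(minutes=1),
}

print(in_window(doc, "@timestamp", now, window))      # False: outside the window
print(in_window(doc, "event.ingested", now, window))  # True: inside the window
```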

"excludeHitsFromPreviousRun": true
},
"alertDelay": {
"active": 1
}
},
"managed": true,
"coreMigrationVersion": "8.8.0",
"typeMigrationVersion": "10.1.0"
}
@@ -0,0 +1,37 @@
{
"id": "ec2-status-check-failed",
"type": "alerting_rule_template",
"attributes": {
"name": "EC2 Status Check Failed",
"tags": [],
"ruleTypeId": ".es-query",
"schedule": {
"interval": "1m"
},
"params": {
"searchType": "esqlQuery",
"timeWindowSize": 5,
"timeWindowUnit": "m",
"threshold": [
0
],
"thresholdComparator": ">",
"size": 100,
"esqlQuery": {
"esql": "FROM metrics-aws.ec2_metrics-default\n| STATS statusfailed=max(aws.ec2.metrics.StatusCheckFailed.avg) by cloud.account.id, cloud.region, aws.dimensions.InstanceId\n| WHERE statusfailed > 0"
},
"aggType": "count",
"groupBy": "all",
"termSize": 5,
"sourceFields": [],
"timeField": "event.ingested",
"excludeHitsFromPreviousRun": true
},
"alertDelay": {
"active": 1
}
},
"managed": true,
"coreMigrationVersion": "8.8.0",
"typeMigrationVersion": "10.1.0"
}
37 changes: 37 additions & 0 deletions packages/aws/kibana/alerting_rule_template/lambda-errors.json
@@ -0,0 +1,37 @@
{
"id": "lambda-errors",
"type": "alerting_rule_template",
"attributes": {
"name": "Lambda Errors",
"tags": [],
"ruleTypeId": ".es-query",
"schedule": {
"interval": "1m"
},
"params": {
"searchType": "esqlQuery",
"timeWindowSize": 5,
"timeWindowUnit": "m",
"threshold": [
0
],
"thresholdComparator": ">",
"size": 100,
"esqlQuery": {
"esql": "FROM metrics-aws.lambda-default\n| STATS errors=avg(aws.lambda.Errors.avg) by cloud.account.id, cloud.region, aws.dimensions.FunctionName\n| WHERE errors > 0"
},
"aggType": "count",
"groupBy": "all",
"termSize": 5,
"sourceFields": [],
"timeField": "event.ingested",
"excludeHitsFromPreviousRun": true
},
"alertDelay": {
"active": 1
}
},
"managed": true,
"coreMigrationVersion": "8.8.0",
"typeMigrationVersion": "10.1.0"
}
@@ -0,0 +1,37 @@
{
"id": "lambda-throttles",
"type": "alerting_rule_template",
"attributes": {
"name": "Lambda Throttles",
"tags": [],
"ruleTypeId": ".es-query",
"schedule": {
"interval": "1m"
},
"params": {
"searchType": "esqlQuery",
"timeWindowSize": 5,
"timeWindowUnit": "m",
"threshold": [
0
],
"thresholdComparator": ">",
"size": 100,
"esqlQuery": {
"esql": "FROM metrics-aws.lambda-default\n| STATS throttles=avg(aws.lambda.Throttles.avg) by cloud.account.id, cloud.region, aws.dimensions.FunctionName\n| WHERE throttles > 0"
},
"aggType": "count",
"groupBy": "all",
"termSize": 5,
"sourceFields": [],
"timeField": "event.ingested",
"excludeHitsFromPreviousRun": true
},
"alertDelay": {
"active": 1
}
},
"managed": true,
"coreMigrationVersion": "8.8.0",
"typeMigrationVersion": "10.1.0"
}
@@ -0,0 +1,37 @@
{
"id": "sns-notifications-failed",
"type": "alerting_rule_template",
"attributes": {
"name": "SNS Notifications Failed",
"tags": [],
"ruleTypeId": ".es-query",
"schedule": {
"interval": "1m"
},
"params": {
"searchType": "esqlQuery",
"timeWindowSize": 5,
"timeWindowUnit": "m",
"threshold": [
0
],
"thresholdComparator": ">",
"size": 100,
"esqlQuery": {
"esql": "FROM metrics-aws.sns-default\n| STATS notificationsfailed=avg(aws.sns.NumberOfNotificationsFailed.sum) by cloud.account.id, cloud.region, aws.dimensions.TopicName\n| WHERE notificationsfailed > 0"
},
"aggType": "count",
"groupBy": "all",
"termSize": 5,
"sourceFields": [],
"timeField": "event.ingested",
"excludeHitsFromPreviousRun": true
},
"alertDelay": {
"active": 1
}
},
"managed": true,
"coreMigrationVersion": "8.8.0",
"typeMigrationVersion": "10.1.0"
}
@@ -0,0 +1,37 @@
{
"id": "sns-notifications-filtered-out",
"type": "alerting_rule_template",
"attributes": {
"name": "SNS Notifications Filtered Out",
"tags": [],
"ruleTypeId": ".es-query",
"schedule": {
"interval": "1m"
},
"params": {
"searchType": "esqlQuery",
"timeWindowSize": 5,
"timeWindowUnit": "m",
"threshold": [
0
],
"thresholdComparator": ">",
"size": 100,
"esqlQuery": {
"esql": "FROM metrics-aws.sns-default\n| STATS notificationsfilteredout=avg(aws.sns.`NumberOfNotificationsFilteredOut-InvalidAttributes`.sum) by cloud.account.id, cloud.region, aws.dimensions.TopicName\n| WHERE notificationsfilteredout > 0"
},
"aggType": "count",
"groupBy": "all",
"termSize": 5,
"sourceFields": [],
"timeField": "event.ingested",
"excludeHitsFromPreviousRun": true
},
"alertDelay": {
"active": 1
}
},
"managed": true,
"coreMigrationVersion": "8.8.0",
"typeMigrationVersion": "10.1.0"
}
@@ -0,0 +1,37 @@
{
"id": "sqs-messages-visible",
"type": "alerting_rule_template",
"attributes": {
"name": "SQS Messages Visible",
"tags": [],
"ruleTypeId": ".es-query",
"schedule": {
"interval": "1m"
},
Comment on lines 8 to 10
Contributor

This applies to all the configurations.

Should we run this so frequently? I suggest setting it equal to the default period for metrics ingestion. That would help avoid no-data alerts (if the user decides to extend the configuration).

Contributor Author

Should we set timeWindowSize to match the integration period? That way, for example, every 5 minutes we’d check for alerts in documents from the past 5 minutes.

Contributor

I think that's a reasonable thing to do. The impact, I assume, is that instead of being notified at period + 1m, the alert will be notified at 2 x period. The period is 5m for most AWS services.

@tommyers-elastic, what would be your recommendation?

Contributor

i don't think we have any way to couple configs in agent policy templates with these rule configurations, so whatever we choose will have to be always added by hand.

my only thinking here is that it doesn't make sense to run a rule more frequently than the integration collection period. matching the rule frequency with the collection period seems sensible to me.

Contributor

it's a shame there's no way to put hints in the form such that we could have something that shows up and says "should match the integration collection period" or something. if we think it's worthwhile we could suggest this as a feature.
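
The trade-off discussed in this thread can be put into numbers with a simplified model. The sketch below assumes the worst case is one full collection period plus one full rule interval; the function name is hypothetical, and the 5-minute period is the value quoted above for most AWS services.

```python
def worst_case_delay_minutes(collection_period: int, rule_interval: int) -> int:
    """Upper bound on the time from a condition starting to the rule noticing it.

    Simplified model (an assumption): the condition may begin right after a
    collection, and the resulting document may land right after a rule run.
    """
    return collection_period + rule_interval


period = 5  # minutes
for interval in (1, period):
    delay = worst_case_delay_minutes(period, interval)
    print(f"interval={interval}m -> up to {delay}m")
```

With a 1m interval the bound is period + 1m; matching the interval to the period raises it to 2 x period, in exchange for running the rule a fifth as often.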

"params": {
"searchType": "esqlQuery",
"timeWindowSize": 5,
"timeWindowUnit": "m",
"threshold": [
0
],
"thresholdComparator": ">",
"size": 100,
"esqlQuery": {
"esql": "FROM metrics-aws.sqs-default\n| STATS msgsvisible=max(aws.sqs.messages.visible) by cloud.account.id, cloud.region, aws.dimensions.QueueName\n| WHERE msgsvisible >= 1000"
},
"aggType": "count",
"groupBy": "all",
"termSize": 5,
"sourceFields": [],
"timeField": "event.ingested",
"excludeHitsFromPreviousRun": true
},
"alertDelay": {
"active": 1
}
},
"managed": true,
"coreMigrationVersion": "8.8.0",
"typeMigrationVersion": "10.1.0"
}
@@ -0,0 +1,37 @@
{
"id": "sqs-oldest-message",
"type": "alerting_rule_template",
"attributes": {
"name": "SQS Oldest Message",
"tags": [],
"ruleTypeId": ".es-query",
"schedule": {
"interval": "1m"
},
"params": {
"searchType": "esqlQuery",
"timeWindowSize": 5,
"timeWindowUnit": "m",
"threshold": [
0
],
"thresholdComparator": ">",
"size": 100,
"esqlQuery": {
"esql": "FROM metrics-aws.sqs-default\n| STATS oldestmsgage=max(aws.sqs.oldest_message_age.sec) by cloud.account.id, cloud.region, aws.dimensions.QueueName\n| WHERE oldestmsgage >= 300"
},
"aggType": "count",
"groupBy": "all",
"termSize": 5,
"sourceFields": [],
"timeField": "event.ingested",
"excludeHitsFromPreviousRun": true
},
"alertDelay": {
"active": 1
}
},
"managed": true,
"coreMigrationVersion": "8.8.0",
"typeMigrationVersion": "10.1.0"
}
4 changes: 2 additions & 2 deletions packages/aws/manifest.yml
@@ -1,7 +1,7 @@
-format_version: 3.3.2
+format_version: 3.5.0
name: aws
title: AWS
-version: 3.16.0
+version: 3.17.0
description: Collect logs and metrics from Amazon Web Services (AWS) with Elastic Agent.
type: integration
categories: