Author: Will Rose, Senior Cloud Engineer at TrueMark
📅 Published: April 15, 2025
One of the biggest challenges I see our customers face—and cloud-first organizations generally—is the complexity of maintaining cloud observability across their expanding service catalogs. Every minute of service degradation translates directly into revenue loss and potential damage to your reputation.
I know I’m preaching to the choir. Even so, why do cloud observability and monitoring remain such a persistent challenge for experienced engineering teams?
Let’s talk about it.
The complexity stems from the distributed nature of modern cloud architectures. You’re dealing with:
Standardizing alerting philosophy and implementation is the goal, but if we're being honest, the logistical effort required to build the tools and automation to achieve that goal usually ends in a myriad of disagreements, diverted developer resources, and never-ending back-and-forth. Ultimately, these challenges culminate in:
Is this a problem or a nightmare? The kind that keeps you up at night (oh, is that PagerDuty calling at 1 AM?). I see you; I know your pain.
While I enjoy IT group therapy (being vulnerable and talking about the fever dreams that keep us up at night), here at TrueMark Technologies we're in the business of providing solutions. So, let me offer you one.
On behalf of TrueMark Technologies, allow me to introduce you to a load off your shoulders. We call it AutoAlarm.
To address the logistical nightmares associated with standardized monitoring and alerting, we built an event-driven observability tool that dynamically manages CloudWatch and Amazon Managed Prometheus alarms via simple resource tagging.
We have abstracted the complexity of backend AWS configuration into a straightforward tag schema. When tags are applied or modified, or when resources are decommissioned, AutoAlarm intelligently updates or removes the corresponding monitoring in seconds.
I’m talking about:
In implementing AutoAlarm across our customers' environments, we've seen:
We currently have AutoAlarm deployed across hundreds of complex distributed services and dozens of AWS accounts, all managing thousands of alarms for our customer base.
Let’s get into a bit more technical detail and talk specifics about AutoAlarm’s open source magic.
Wait, did I say open source? Yes, I did! You can view all the code and documentation at https://github.com/truemark/autoalarm.
We built this automated observability management suite from the ground up for ease of deployment, performance and reliability, user simplicity, hands-off maintenance, and common-sense functionality.
Below, I've crafted a simple flowchart that abstracts away most of the backend complexity so you have a high-level frame of reference for how events and alarm management flow through AutoAlarm:
Now that you have a simple reference point for how events move through AutoAlarm, let's take a tour of the features and architectural design that make AutoAlarm such a performant and robust cloud observability suite.
At its core, AutoAlarm is an event-driven application, reacting in real time to resource lifecycle events (state changes, tag modifications, and system health events). Its core architecture shines in its simplicity:
For those managing hundreds of microservices at scale and who need instant and efficient monitoring, we've baked in several performance and reliability features so you hit the ground running:
The result? A monitoring system that scales with your infrastructure without becoming part of the problem.
As we built out AutoAlarm's feature set, we aimed to create a tool simple enough to eliminate 99% of the operational overhead of standardized implementation across teams, yet robust enough to provide comprehensive monitoring for engineers who know the services they maintain and know best what needs to be monitored and how. We really wanted you to be able to have your cake and eat it too. To achieve this goal, AutoAlarm comes with:
At the time of writing, AutoAlarm supports automated tag-based alarm management for the following services:
We are aggressively developing and adding features to AutoAlarm, so if you don’t see a service that fits your use case and needs, please feel free to contact us and let us know.
Now that we’ve done a technical deep dive into how AutoAlarm is built, let’s accelerate and I’ll show you how to build and deploy AutoAlarm in a matter of minutes.
The following instructions cover manual installation and deployment via the CLI, but we recommend using pipelines for larger multi-account deployments.
First, you’ll want to make sure that you have the following prerequisites set up before beginning deployment:
Next, let’s grab the latest release and deploy:
git clone https://github.com/truemark/autoalarm.git
cd autoalarm
pnpm install
export AWS_REGION="your-region"
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_SESSION_TOKEN="session-token-if-applicable"
cdk bootstrap
pnpm -r build
cd cdk ; cdk deploy
That's it! In less than 10 minutes, you'll have automated observability management up and running in your AWS environment.
Now that we’ve deployed AutoAlarm to AWS, let’s go over how to effectively use all of AutoAlarm’s features. In this section, we will go over:
There may be instances when you want instant no-touch monitoring. While observability is not a one-size-fits-all solution, we have provided default settings that will create multiple alarms per resource tagged with sane baseline values whenever AutoAlarm is enabled.
To enable monitoring, set the following tag on a supported service:
autoalarm:enabled=true
To disable all alarms (both default and configured) you can simply remove that tag or change the value to `false`.
This is standard across all supported services.
Each service uses a standardized, straightforward pattern for alarm management. The pattern is as follows:
autoalarm:<metric>=<warning threshold>/<critical threshold>/<period>/<evaluation periods>/<statistic>/<datapoints to alarm>/<comparison operator>/<missing data treatment>
Let's break down each part of the tag value.
Tag Value Component | Explanation | Example Value |
---|---|---|
Warning Threshold | Numeric threshold at which a warning alarm triggers (typically lower-severity alerts). | 80% CPU utilization |
Critical Threshold | Numeric threshold at which a critical alarm triggers (higher-severity alerts, indicating immediate action). | 95% CPU utilization |
Period | The duration (in seconds) of each evaluation interval. Valid values: 10 or 30 seconds, or any multiple of 60 seconds (e.g., 60, 120, 300). | 60 |
Evaluation Periods | Number of consecutive periods over which a metric must breach thresholds before triggering the alarm. | 5 |
Statistic | How CloudWatch aggregates the data for each alarm evaluation period, e.g., Average, Maximum, Minimum, Sum, or a percentile (pXX). | Maximum |
Datapoints To Alarm | How many of the evaluated data points (periods) must breach the threshold to trigger an alarm (usually equal to or less than Evaluation Periods). | 5 |
Comparison Operator | Logical operator that determines how CloudWatch interprets the threshold. | GreaterThanThreshold |
Missing Data Treatment | Defines alarm behavior when data points are missing. Possible values: missing, ignore, breaching, or notBreaching. | ignore |
In practice, setting a CPU utilization alarm on an RDS cluster might look like:
autoalarm:cpu=80/95/60/5/Maximum/5/GreaterThanThreshold/ignore
Let’s break down the tag values once again for our practical example.
Tag Value Component | Example | Meaning |
---|---|---|
Warning Threshold | 80 | Warning triggered at 80% CPU usage. |
Critical Threshold | 95 | Critical alarm triggered at or above 95% CPU usage. |
Period | 60 seconds | CloudWatch checks CPU utilization every 60 seconds. |
Evaluation Periods | 5 | CloudWatch evaluates the CPU utilization over the last 5 periods of 60 seconds each (total of 5 minutes). |
Statistic | Maximum | CloudWatch uses the maximum reported CPU usage in each period. |
Datapoints To Alarm | 5 | All 5 consecutive data points must breach the threshold to trigger this alarm (consistent elevated CPU usage required). |
Comparison Operator | GreaterThanThreshold | CloudWatch triggers the alarm if the CPU usage is greater than the threshold. |
Missing Data Treatment | ignore | If CPU utilization data is missing for one or more periods, CloudWatch simply ignores that missing data period, rather than triggering or preventing an alarm. |
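To make the positional mapping concrete, here is a small bash sketch (plain shell, nothing AutoAlarm-specific) that splits the example tag value into its named components in schema order:

```shell
# Split the example tag value on "/" into its positional components.
# Variable names follow the schema order described above.
value="80/95/60/5/Maximum/5/GreaterThanThreshold/ignore"
IFS='/' read -r warning critical period evals stat datapoints operator missing <<< "$value"

echo "warning threshold:  $warning"
echo "critical threshold: $critical"
echo "period:             ${period}s"
echo "statistic:          $stat"
echo "missing data:       $missing"
```

Any tooling that consumes these tags can rely on this fixed field order, which is exactly why position matters in the schema.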
You can see how simple tagging resources can be using the unified tagging schema, but typing out a long string of values can be cumbersome. Let me show you how to shorthand tag values to save keystrokes and time.
Consider an EC2 instance with default monitoring enabled via autoalarm:enabled=true. This automatically creates Warning and Critical alarms for CPU, memory, and storage utilization. Now imagine you want to make the following custom adjustments while keeping the other default values:
In this scenario, we can use empty/undefined values to keep the defaults and define the values we want to change. Here is an example:
autoalarm:cpu=77//120/3///LessThanOrEqualToThreshold/missing
autoalarm:memory=87//120/4
autoalarm:storage=75//90/2//3//notBreaching
Now all three alarms are uniquely configured according to your use case.
AutoAlarm uses implicit defaults when a value is undefined or invalid (each value is separated by a /). Order is important in your tag definition: provide either a defined value or an empty placeholder for every position up to the last value you want to change. If you put values in the wrong position, AutoAlarm falls back to the defaults for those positions, so make sure you follow the positional order.
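As an illustration of how empty positions fall back to defaults, here is a bash sketch using the memory tag from the example above. The fallback numbers (75/90/60/5) are placeholders chosen for this illustration, not AutoAlarm's actual defaults:

```shell
# Parse the shorthand memory tag from the example above.
tag="autoalarm:memory=87//120/4"
value="${tag#*=}"                      # strip the "autoalarm:memory=" prefix
IFS='/' read -r warning critical period evals <<< "$value"

# Empty fields keep their (hypothetical) defaults via ${var:-default}.
echo "warning=${warning:-75} critical=${critical:-90} period=${period:-120} evals=${evals:-5}"
```

Here the empty second field means the critical threshold stays at its default, while the explicitly set warning, period, and evaluation-period values override theirs.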
The null character "-" tells AutoAlarm either to skip creating an alarm or to remove one that already exists. Let's walk through another short practical example of how to use the null character:
For CloudFront distributions, you might want anomaly detection for 5xx errors, but only need Critical alerts with a custom critical threshold while keeping all other default alarm configs:
autoalarm:5xx-errors-anomaly=-/5
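A tiny bash sketch of how that value reads (the interpretation logic here is illustrative, mirroring the behavior described above, not AutoAlarm's actual code):

```shell
# "-" in the warning position suppresses that alarm; the critical value is set.
value="-/5"
IFS='/' read -r warning critical <<< "$value"

if [ "$warning" = "-" ]; then
  echo "warning alarm: suppressed/removed"
fi
echo "critical threshold: $critical"
```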
Pretty simple, right? By now, you can see how concise and easy setting tags can be.
To wrap this tagging schema tutorial, let’s pull everything together and demonstrate how short hand tagging with AutoAlarm’s unified tagging schema can make managing alarms a breeze in a final practical example:
Say you have an application you’re testing on an EC2 instance but you only want critical CPU alarms with a custom threshold, duration, evaluation periods and data treatment. You can easily implement this by setting the following tags:
autoalarm:enabled=true
autoalarm:cpu=-/99/180/1/Average///ignore
autoalarm:memory=-/-
autoalarm:storage=-/-
All it took was enabling AutoAlarm and setting one tag for each of these alarms to get exactly what we need without clicking through the AWS console, figuring out SDK syntax, or tracking down an indent, curly brace, or semicolon in your IaC…
If you need quick-and-dirty monitoring for this use case on a new EC2 instance, you can also use the AWS CLI to set it up in seconds and tear it down just as fast:
aws ec2 create-tags --resources i-1234567890abcdef0 --tags \
Key=autoalarm:enabled,Value=true \
Key=autoalarm:cpu,Value="-/99/180/1/Average///ignore" \
Key=autoalarm:memory,Value="-/-" \
Key=autoalarm:storage,Value="-/-"
# To tear down all the alarms, we only need to set autoalarm:enabled to false:
aws ec2 create-tags --resources i-1234567890abcdef0 --tags Key=autoalarm:enabled,Value=false
AutoAlarm integrates with AWS’s CloudWatch Service. Any alarm created or managed by AutoAlarm will populate in the CloudWatch console along with any other monitoring you have configured.
AutoAlarm-managed alarms can easily be identified by their naming convention. Each alarm is named using the following pattern:
AutoAlarm-<Service Type (e.g. EC2)>-<Resource ID (e.g. ARN or resource name)>-<MetricName>-<Warning|Critical>
Here are a few examples for reference:
AutoAlarm-EC2-i-00e123456789bcdef-CPUUtilization-Warning
AutoAlarm-EC2-i-00e123456789bcdef-CPUUtilization-Critical
AutoAlarm-RDS-veryfast-performant-dbname-DBLoad-anomaly-Warning
AutoAlarm-RDSCluster-larger-rdscluster1-SwapUsage-anomaly-Critical
AutoAlarm-OS-opensearch-clustername-JVMMemoryPressure-Critical
This naming convention eliminates confusion and provides immediate context about which service and monitoring dimension requires attention when an alarm triggers.
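Composing one of these names is simple string concatenation. A quick bash sketch using the components of the first example above:

```shell
# Build an alarm name following the AutoAlarm naming convention above.
service="EC2"
resource_id="i-00e123456789bcdef"
metric="CPUUtilization"
severity="Warning"

alarm_name="AutoAlarm-${service}-${resource_id}-${metric}-${severity}"
echo "$alarm_name"
```

Because the prefix is predictable, you can also filter on it, for example with CloudWatch's alarm-name-prefix filtering, to list only AutoAlarm-managed alarms for a given resource.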
Often, critical alerts are missed due to various reasons such as alert fatigue, maintenance confusion, tired on-call engineers, or a NOC technician otherwise occupied watching funny cat videos on YouTube. Regardless of the cause, missed alerts can result in catastrophic outcomes and unnecessary downtime.
By default, ReAlarm runs on a 2-hour interval without requiring any configuration. It identifies CloudWatch alarms in an "ALARM" state and resets them to "OK," which allows them to retrigger if the condition persists. This creates a subsequent alert notification for follow-up.
ReAlarm intelligently excludes alarms associated with AutoScaling actions from this reset process. This prevents interference with critical scaling activities for mission-critical and redundant services.
ReAlarm offers two simple configuration options via resource tagging:
// Disable ReAlarm for Specific Resource Alarms:
autoalarm:re-alarm-enabled=false
// Customize the Reset Interval for Specific Resource Alarms
autoalarm:re-alarm-minutes=30
In cases where you know you are addressing an issue and do not need the extra noise or in instances where ReAlarm is too aggressive or not aggressive enough, you can use these tags to configure ReAlarm according to your needs.
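To put the interval tag in perspective, here is some quick arithmetic (plain bash) comparing the 2-hour default against a custom autoalarm:re-alarm-minutes=30 setting for an alarm that stays in the ALARM state all day:

```shell
# Re-alarm resets per day for an alarm stuck in ALARM state.
default_minutes=120   # ReAlarm's 2-hour default interval
custom_minutes=30     # via autoalarm:re-alarm-minutes=30

default_resets=$(( 24 * 60 / default_minutes ))
custom_resets=$(( 24 * 60 / custom_minutes ))
echo "default: ${default_resets} resets/day"
echo "custom:  ${custom_resets} resets/day"
```

That is the dial you are turning: a shorter interval means more repeat notifications for an unresolved condition, a longer one means less noise.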
Now, you are equipped with everything you need to stand up AutoAlarm in AWS and roll out either ad hoc monitoring or mass observability included with your next IaC deployment.
If you made it this far, I offer my sincerest thank you. AutoAlarm represents countless hours of building and solving difficult problems at scale to make sure this tool is everything our customers need it to be.
As mentioned, AutoAlarm is publicly available for free. Please download the project, give it a try and let us know what you think.
If AutoAlarm aligns perfectly with your cloud observability needs, contact TrueMark Technologies to discuss deployment. If AutoAlarm meets most—but not all—of your requirements, we welcome your feedback. Our team prioritizes customer-driven enhancements and would be glad to discuss tailored features or support options to ensure AutoAlarm fully supports your business goals.
Here’s a quick usage and reference guide for convenience when you’re getting started.
autoalarm:enabled=true
autoalarm:<metric>=<warning threshold>/<critical threshold>/<period>/<evaluation periods>/<statistic>/<datapoints to alarm>/<comparison operator>/<missing data treatment>
autoalarm:cpu=80/95/60/5/Maximum/5/GreaterThanThreshold/ignore
autoalarm:cpu=-/99/180/1/Average///ignore
autoalarm:memory=-/-
autoalarm:storage=-/-
autoalarm:re-alarm-enabled=false
autoalarm:re-alarm-minutes=30