Technical Articles

Massively Automate Cloud Observability Through Tag-Based Alarm Management

Author: Will Rose, Senior Cloud Engineer at TrueMark

📅 Published: April 15, 2025

One of the biggest challenges I see our customers face (and other cloud-first organizations generally) is the complexity of maintaining and servicing cloud observability across their expanding service catalogs. Every minute of service degradation translates directly to revenue loss and potential damage to your reputation.

I know I’m preaching to the choir. Even so, why do cloud observability and monitoring remain such a persistent challenge for experienced engineering teams?

The Real Problem: Managing Observability Across Distributed Systems at Scale

Let’s talk about it.

The complexity stems from the distributed nature of modern cloud architectures. You’re dealing with:

  • Multiple engineering teams managing separate service domains
  • Offset deployment cycles across microservices (which should be fine because you're using microservices that "shouldn't impact that other stuff", right…?)
  • Varying product priorities and release cadences
  • Cross-team coordination of observability and monitoring initiatives

Standardizing alerting philosophies and implementation is the goal, but if we're being honest, the logistical effort required to build out tools and automation to achieve that goal usually ends in a myriad of disagreements, diverted developer resources, and never-ending back-and-forth. Ultimately, these challenges culminate in:

  • Missed Critical Alerts: Outdated, missing, or incorrectly configured alarms can result in unnoticed issues and inevitable downtime surprises.
  • Configuration Drift: Manual maintenance often leads to inconsistent alerting philosophy and outdated definitions that do not reflect the current state of your service catalog and infrastructure.
  • Difficulty Standardizing: Establishing and maintaining standardized alerts across various AWS environments (EC2, RDS, SQS, etc.) and teams is notoriously difficult.
  • Operational Overhead: Manual alarm configurations and "just this once until we build out automation" monitoring changes create tech debt over time, burden operations teams, and divert resources from more critical revenue-driving tasks.

Is this a problem or a nightmare? The kind that keeps you up at night (oh, is that PagerDuty calling at 1 AM?). I see you; I know your pain.

While I enjoy IT group therapy, being vulnerable and talking about the fever dreams that keep us up at night, here at TrueMark Technologies, we’re in the business of providing solutions. So, let me offer you one.

Introducing AutoAlarm: Tag-Based Observability Automation

On behalf of TrueMark Technologies, allow me to introduce you to a load off your shoulders. We call it AutoAlarm.

To address the logistical nightmares associated with standardized monitoring and alerting, we built an event-driven observability tool that dynamically manages CloudWatch and Amazon Managed Prometheus alarms via simple resource tagging.

We have abstracted the complexity of backend AWS configuration into a straightforward tag schema: when tags are applied or modified, or when resources are decommissioned, AutoAlarm intelligently updates or removes the corresponding monitoring in seconds.

I’m talking about:

  • Minimal Integration Overhead: Manually tag a resource or add a tagging definition inline in your IaC resource declarations for instant observability at deployment. That's it.
  • Default Alarms and Granular Customization: Pre-configured monitoring templates with granular customization options
  • Standardized Monitoring: Maintain consistent alarm standards across your entire AWS infrastructure and across team boundaries
  • Built-in Failsafes to Prevent Missed Critical Alerts: Proactive alert failsafes via no-touch automated alarm resets that can also be configured as needed
  • Standard Self-Monitoring: AutoAlarm monitors itself so that if there is ever a failure, you are alerted with specific and actionable context.

Real World Production Outcomes

In implementing AutoAlarm across our customers' environments, we've seen:

  • Deployment to customers' accounts in less than 10 minutes
  • Blocking errors on RDS instances caught and remediated before a major incident
  • Observability gaps closed across several services with single tag additions
  • Critical system errors caught immediately after maintenance, resulting in faster patching before major production impact
  • Proactive system-load alerting that allowed TrueMark Operations to preemptively scale resources during major sales and promotions
  • New and lurking issues found on services, giving the customer valuable context and enabling them to resolve bugs and implement resource sizing before imminent, major disruption
  • AutoAlarm, alongside other TrueMark managed services, replacing expensive and complex observability tools and saving our customers hundreds of thousands of dollars

AutoAlarm is currently deployed across hundreds of complex distributed services and dozens of AWS accounts, managing thousands of alarms for our customer base.

Let’s get into a bit more technical detail and talk specifics about AutoAlarm’s open source magic.

Wait, did I say open source? Yes, I did! You can view all the code and documentation at https://github.com/truemark/autoalarm.

AutoAlarm Architecture and Features

We built this automated observability management suite from the ground up for ease of deployment, performance and reliability, user simplicity, hands-off maintenance, and common-sense functionality.

Below, I've crafted a simple flowchart that abstracts away most of the backend complexity so you have a high-level frame of reference for how events and alarm management flow through AutoAlarm:

AutoAlarm Architecture Diagram

Now that you have a simple reference point for how events move through AutoAlarm, let’s give you a tour of the features and architectural design that makes AutoAlarm such a performant and robust Cloud Observability suite.

Core Building Blocks

At its core, AutoAlarm is an event-driven application, reacting in real time to resource lifecycle events (state changes, tag modifications, and system health events); a small sketch of that event flow follows the list below. Its core architecture shines in its simplicity:

  • IAM Roles and Policies: Out of the box, AutoAlarm provides the access and permissions it needs to securely interact with resource APIs, CloudWatch, and other services.
  • Lambda Functions: Serve as the core engine that implements alarm creation, updates, and removal logic based on predefined rules and tag configurations.
  • EventBridge Rules: Capture specific events and state changes from Amazon EC2, Elastic Load Balancing, SQS, and more, triggering the Lambda functions to automate alarm management responses.
  • CloudWatch Logs: Centralized logging and troubleshooting data.
  • SQS Queues: Dead-letter queues handle event-processing failures, and FIFO queues have robust built-in retry logic, making AutoAlarm resilient to transient failures.
  • CloudWatch Alarms for Queue and Lambda Monitoring: AutoAlarm monitors itself, so your core observability tool alerts you in real time when there is an issue.
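To make that event-driven flow concrete, here is a minimal, illustrative sketch of the kind of tag-change rule involved. This is not the rule definition the CDK stack deploys for you (those ship with AutoAlarm); the rule name is a placeholder, and the pattern only shows the general shape of matching tag-change events:

# Illustrative only: match tag-change events whose changed keys start with "autoalarm:"
aws events put-rule \
  --name example-autoalarm-tag-change \
  --event-pattern '{
    "source": ["aws.tag"],
    "detail-type": ["Tag Change on Resource"],
    "detail": {
      "changed-tag-keys": [{"prefix": "autoalarm:"}]
    }
  }'
# The deployed stack wires rules like this to the alarm-management Lambda for you.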

Performance and Reliability First Design

For those managing hundreds of microservices at scale and who need instant and efficient monitoring, we've baked in several performance and reliability features so you hit the ground running:

  • API Rate Throttle Handling: We've implemented exponential backoff in addition to layered, dynamically intervalled retries for large configuration deployments
  • Smart Queue Management: FIFO ordering with optimized batching ensures your events process efficiently without sacrificing temporal accuracy. We also have built-in retry logic and categorical sorting.
  • Addressing Queue Blocking and Snowball Anti-Patterns: FIFO queues must be deployed with care and efficient error handling to prevent queue-leader blocking and compounding snowball processing during failures. AutoAlarm handles these issues so your queue items never get stuck in retry limbo.
  • Resilient Error Handling: When things do go sideways (and in distributed systems, they always do) our failure handling, dependency awareness and self-monitoring provide actionable context, not just generic failures.
  • Efficient Algorithmic SQS Message Parsing: Lambda invocations share resources, so it's critical that complex, multi-faceted message and event parsing is flexible and resilient across dozens of payload types while remaining resource-responsible across hundreds of concurrent executions. AutoAlarm accomplishes this.

The result? A monitoring system that scales with your infrastructure without becoming part of the problem.

Features and Ease of Use

As we built out the feature set for AutoAlarm, we did so with the intent of creating a tool that is simple enough to eliminate 99% of the operational overhead of standardized implementation across teams, yet robust enough to provide comprehensive monitoring for engineers who know the services they maintain and know best what needs to be monitored and how. We really wanted you to be able to have your cake and eat it too. To achieve this goal, AutoAlarm comes with:

  • Comprehensive CloudWatch Integration: This is not a watered-down abstraction. Any possible CloudWatch alarm configuration can be easily applied with a single tag on a resource.
  • AWS Managed Prometheus: Supported services can be configured using the same unified tagging schema to automatically configure and deploy Prometheus monitoring in AWS.
  • Unified Tagging Schema: Every service supported by AutoAlarm can easily be configured for monitoring by following a simple tagging pattern.
  • ReAlarm: ReAlarm is a powerful feature included with AutoAlarm that automatically re-triggers active alarms, preventing missed alerts due to alert fatigue or human error.

Supported AWS Services

At the time of writing, AutoAlarm supports automated tag-based alarm management for the following services:

  • EC2
  • Application Load Balancer (ALB)
  • CloudFront
  • AWS OpenSearch
  • Relational Database Service (RDS):
    • RDS Instances
    • RDS Clusters
  • Simple Queue Service (SQS)
  • Target Groups
  • Step Functions
  • Route53Resolver
  • Transit Gateway (TGW)
  • AWS VPN

We are aggressively developing and adding features to AutoAlarm, so if you don’t see a service that fits your use case and needs, please feel free to contact us and let us know.

Installation and Deployment

Now that we’ve done a technical deep dive into how AutoAlarm is built, let’s accelerate and I’ll show you how to build and deploy AutoAlarm in a matter of minutes.

The following instructions are for manual install and deployment via the CLI, but we recommend using pipelines for larger multi-account deployments.

First, you'll want to make sure that you have the following prerequisites set up before beginning deployment (a quick way to sanity-check them follows the list):

  • AWS CLI (configured)
  • AWS CDK installed
  • Node (22.x+)
  • Git
  • pnpm (version 9.1.4+)
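A quick round of version checks confirms everything is in place before you start; the pnpm install command is shown in case you don't have it yet:

# Verify the prerequisites are installed and on your PATH
aws --version          # AWS CLI (make sure credentials are configured)
git --version
node --version         # should report v22.x or newer
cdk --version          # AWS CDK CLI
pnpm --version         # 9.1.4 or newer; if missing: npm install -g pnpm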

Next, let’s grab the latest release and deploy:

  1. Clone the repository:
git clone https://github.com/truemark/autoalarm.git
cd autoalarm
  2. Install dependencies:
pnpm install
  3. Configure your AWS region and credentials:
export AWS_REGION="your-region"
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_SESSION_TOKEN="session-token-if-applicable"
  4. Bootstrap the account:
cdk bootstrap
  5. Build the project:
pnpm -r build
  6. Deploy the stack:
cd cdk && cdk deploy

That's it! In less than 10 minutes, you'll have automated observability management up and running in your AWS environment.

Usage and Implementation

Now that we’ve deployed AutoAlarm to AWS, let’s go over how to effectively use all of AutoAlarm’s features. In this section, we will go over:

  1. Quick Start: Enabling AutoAlarm and Default Alarms
  2. How to use AutoAlarm's Unified Tagging Schema to dynamically and granularly manage CloudWatch alarms when defaults are not enough
  3. Short-hand techniques to simplify alarm management even further using undefined, null and implicit values
  4. Where to access AutoAlarms and how to identify them
  5. How to configure ReAlarm with a single tag to re-trigger alerting so that you never miss critical alarms again

Quick Start: Enabling AutoAlarm with Default Monitoring

There may be instances when you want instant no-touch monitoring. While observability is not a one-size-fits-all solution, we have provided default settings that will create multiple alarms per resource tagged with sane baseline values whenever AutoAlarm is enabled.

  • Pro Tip: For a complete reference of all default alarm configurations, review the “AutoAlarm Tag Configuration for Supported Resources” section in the project README

To enable monitoring, set the following tag on a supported service:

autoalarm:enabled=true

To disable all alarms (both default and configured), you can simply remove that tag or change the value to `false`.

This is standard across all supported services.
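For example, enabling default monitoring on an existing EC2 instance from the CLI is a single command; the instance ID below is a placeholder:

# Turn on AutoAlarm default monitoring for this instance
aws ec2 create-tags \
  --resources i-1234567890abcdef0 \
  --tags Key=autoalarm:enabled,Value=true

# Removing the tag (or setting it to false) removes all AutoAlarm-managed alarms
aws ec2 delete-tags \
  --resources i-1234567890abcdef0 \
  --tags Key=autoalarm:enabled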

Advanced Usage and Configuration: Using AutoAlarm’s Unified Tagging Schema

Each service uses a standardized, straightforward pattern for alarm management. The pattern is as follows:

autoalarm:<metric>=<warning threshold>/<critical threshold>/<period>/<evaluation periods>/<statistic>/<datapoints to alarm>/<comparison operator>/<missing data treatment>

Let's break down each part of the tag value.

  • Warning Threshold: Numeric threshold at which a warning alarm triggers (typically lower-severity alerts). Example: 80% CPU utilization
  • Critical Threshold: Numeric threshold at which a critical alarm triggers (higher-severity alerts indicating immediate action). Example: 95% CPU utilization
  • Period: The duration (in seconds) of each evaluation interval. Valid values are 10, 30, or any multiple of 60 seconds (e.g., 60, 120, 300). Example: 60
  • Evaluation Periods: Number of consecutive periods over which a metric must breach thresholds before triggering the alarm. Example: 5
  • Statistic: How CloudWatch aggregates the data for each evaluation period, e.g., Average, Maximum, Minimum, Sum, or a percentile (pXX). Example: Maximum
  • Datapoints To Alarm: How many of the evaluated data points (periods) must breach the threshold to trigger an alarm (usually equal to or less than Evaluation Periods). Example: 5
  • Comparison Operator: Logical operator that determines how CloudWatch interprets the threshold. Example: GreaterThanThreshold
  • Missing Data Treatment: Defines alarm behavior when data points are missing. Possible values: missing, ignore, breaching, or notBreaching. Example: ignore

In practice, setting a CPU utilization alarm on an RDS cluster might look like:

autoalarm:cpu=80/95/60/5/Maximum/5/GreaterThanThreshold/ignore

Let’s break down the tag values once again for our practical example.

  • Warning Threshold (80): A warning alarm is triggered at 80% CPU usage.
  • Critical Threshold (95): A critical alarm is triggered at or above 95% CPU usage.
  • Period (60 seconds): CloudWatch checks CPU utilization every 60 seconds.
  • Evaluation Periods (5): CloudWatch evaluates CPU utilization over the last 5 periods of 60 seconds each (5 minutes total).
  • Statistic (Maximum): CloudWatch uses the maximum reported CPU usage in each period.
  • Datapoints To Alarm (5): All 5 consecutive data points must breach the threshold to trigger this alarm (consistent elevated CPU usage required).
  • Comparison Operator (GreaterThanThreshold): CloudWatch triggers the alarm when CPU usage is greater than the threshold.
  • Missing Data Treatment (ignore): If CPU utilization data is missing for one or more periods, CloudWatch simply ignores those periods rather than treating them as breaching or not breaching.
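If you prefer the CLI over the console or IaC, applying this exact configuration to an RDS cluster might look like the following sketch; the cluster ARN is a placeholder:

# Enable AutoAlarm and set the custom CPU alarm on the cluster
aws rds add-tags-to-resource \
  --resource-name arn:aws:rds:us-east-1:123456789012:cluster:my-example-cluster \
  --tags Key=autoalarm:enabled,Value=true \
         Key=autoalarm:cpu,Value="80/95/60/5/Maximum/5/GreaterThanThreshold/ignore"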

Short-hand Tagging Using Undefined, Null and Implicit Tag Values

You can see how simple tagging resources can be using the unified tagging schema, but typing out a long string of values can be cumbersome. Let me show you how you can short-hand tag values to save keystrokes and time.

Practical Example #1

Consider an EC2 instance with default monitoring enabled via autoalarm:enabled=true. This automatically creates Warning and Critical alarms for CPU, memory, and storage utilization. Now imagine you want to make the following custom adjustments while keeping the other default values:

  • Warning threshold
    • cpu: 77
    • memory: 87
    • storage: 75
  • Comparison Operator
    • cpu: LessThanOrEqualToThreshold
  • Data Points to Alarm
    • storage: 3
  • Missing data treatment
    • cpu: missing
    • storage: notbreaching

In this scenario, we can use empty/undefined values to keep the defaults and define the values we want to change. Here is an example:

autoalarm:cpu=77//120/3///LessThanOrEqualToThreshold/missing
autoalarm:memory=87//120/4
autoalarm:storage=75//90/2//3//notbreaching

Now all three alarms are uniquely configured according to your use case.
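Applied from the CLI, the whole Practical Example #1 configuration is still a single command; the instance ID is a placeholder:

aws ec2 create-tags --resources i-1234567890abcdef0 --tags \
  Key=autoalarm:enabled,Value=true \
  Key=autoalarm:cpu,Value="77//120/3///LessThanOrEqualToThreshold/missing" \
  Key=autoalarm:memory,Value="87//120/4" \
  Key=autoalarm:storage,Value="75//90/2//3//notbreaching"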

Understanding Implicit Values

AutoAlarm uses implicit defaults when a value is undefined or invalid (each value is separated by a /). Order matters in your tag definition: provide either a defined value or an empty placeholder for every position up to the last value you want to change. If you skip positional values, AutoAlarm falls back to defaults for anything out of position, so make sure you follow the positional order.

Using the Null Character to Remove Alarms

The null character “-” tells AutoAlarm to either ignore an alarm or to remove one that already exists. Let’s walk through another short practical example of how to use the null character:

Practical Example #2

For CloudFront distributions, you might want anomaly detection for 5xx errors but only need Critical alerts with a custom critical threshold, keeping all other default alarm configs:

autoalarm:5xx-errors-anomaly=-/5
  • Note: For anomaly detection alarms, Warning and Critical thresholds represent standard deviations outside the normal band range. Specifics can be found in the "Customizing Alarms with Tags" section of the project README under "Anomaly Detection Alarms".

Pretty simple, right? By now, you can see how concise and easy setting tags can be.
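And if you are tagging the distribution from the CLI rather than from IaC, a sketch using the CloudFront tagging API might look like this; the distribution ARN is a placeholder:

# Enable AutoAlarm and request a Critical-only 5xx anomaly alarm on the distribution
aws cloudfront tag-resource \
  --resource arn:aws:cloudfront::123456789012:distribution/EDFDVBD6EXAMPLE \
  --tags 'Items=[{Key=autoalarm:enabled,Value=true},{Key=autoalarm:5xx-errors-anomaly,Value=-/5}]'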

To wrap up this tagging-schema tutorial, let's pull everything together and demonstrate how short-hand tagging with AutoAlarm's unified tagging schema can make managing alarms a breeze in a final practical example:

Practical Example #3

Say you have an application you’re testing on an EC2 instance but you only want critical CPU alarms with a custom threshold, duration, evaluation periods and data treatment. You can easily implement this by setting the following tags:

autoalarm:enabled=true
autoalarm:cpu=-/99/180/1/Average///ignore
autoalarm:memory=-/-
autoalarm:storage=-/-

All it took was enabling AutoAlarm and setting one tag for each of these alarms to get exactly what we need, without clicking through the AWS console, figuring out SDK syntax, or tracking down an indent, curly brace, or semicolon in your IaC…

If you need to set up quick and dirty monitoring for this use case on a new EC2 instance, you can also use the AWS CLI to set it up in seconds and tear it down just as quickly:

aws ec2 create-tags --resources i-1234567890abcdef0 --tags \
Key=autoalarm:enabled,Value=true \
Key=autoalarm:cpu,Value="-/99/180/1/Average///ignore" \
Key=autoalarm:memory,Value="-/-" \
Key=autoalarm:storage,Value="-/-"

# To tear down all the alarms, we only need to set autoalarm:enabled to 'false'

aws ec2 create-tags --resources i-1234567890abcdef0 --tags Key=autoalarm:enabled,Value=false

How to Access Monitoring

AutoAlarm integrates with AWS’s CloudWatch Service. Any alarm created or managed by AutoAlarm will populate in the CloudWatch console along with any other monitoring you have configured.

AutoAlarms can easily be identified by naming convention. Each AutoAlarm is named using the following pattern:

AutoAlarm-<Service Type (e.g., EC2)>-<Resource ID (e.g., ARN or resource name)>-<Metric Name>-<Warning|Critical>

Here are a few examples for reference:

AutoAlarm-EC2-i-00e123456789bcdef-CPUUtilization-Warning
AutoAlarm-EC2-i-00e123456789bcdef-CPUUtilization-Critical
AutoAlarm-RDS-veryfast-performant-dbname-DBLoad-anomaly-Warning
AutoAlarm-RDSCluster-larger-rdscluster1-SwapUsage-anomaly-Critical
AutoAlarm-OS-opensearch-clustername-JVMMemoryPressure-Critical

This naming convention eliminates confusion and provides immediate context about which service and monitoring dimension requires attention when an alarm triggers.
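Because every managed alarm shares the AutoAlarm- prefix, you can also pull a quick inventory of what AutoAlarm is managing straight from the CLI:

# List every AutoAlarm-managed metric alarm in the current account and region
aws cloudwatch describe-alarms \
  --alarm-name-prefix "AutoAlarm-" \
  --query 'MetricAlarms[].AlarmName' \
  --output table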

ReAlarm - Automated Alerting Failsafe

Critical alerts are often missed for various reasons: alert fatigue, maintenance confusion, tired on-call engineers, or a NOC technician otherwise occupied watching funny cat videos on YouTube. Regardless of the cause, missed alerts can result in catastrophic outcomes and unnecessary downtime.

How ReAlarm Works

By default, ReAlarm runs on a 2-hour interval without requiring any configuration. It identifies CloudWatch alarms in an "ALARM" state and resets them to "OK," which allows them to retrigger if the condition persists. This creates a subsequent alert notification for follow-up.
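Conceptually, the reset ReAlarm performs is the same operation you could run by hand with the CLI; ReAlarm simply does it for you on a schedule. The alarm name below is taken from the naming examples above and stands in for any alarm you manage:

# Manually reset an alarm so it re-evaluates and fires again if the condition persists
aws cloudwatch set-alarm-state \
  --alarm-name "AutoAlarm-EC2-i-00e123456789bcdef-CPUUtilization-Critical" \
  --state-value OK \
  --state-reason "Manual reset to force re-evaluation"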

AutoScaling and ReAlarm

ReAlarm intelligently excludes alarms associated with AutoScaling actions from this reset process. This prevents interference with critical scaling activities for mission-critical and redundant services.

Configuring ReAlarm

ReAlarm offers two simple configuration options via resource tagging:

// Disable ReAlarm for Specific Resource Alarms:
autoalarm:re-alarm-enabled=false

// Customize the Reset Interval for Specific Resource Alarms
autoalarm:re-alarm-minutes=30

In cases where you know you are addressing an issue and do not need the extra noise or in instances where ReAlarm is too aggressive or not aggressive enough, you can use these tags to configure ReAlarm according to your needs.
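As with everything else in AutoAlarm, these are ordinary resource tags, so they can be set from the CLI as well; the instance ID is a placeholder:

# Run ReAlarm every 30 minutes for this instance's alarms
aws ec2 create-tags \
  --resources i-1234567890abcdef0 \
  --tags Key=autoalarm:re-alarm-minutes,Value=30

# Or opt this instance's alarms out of ReAlarm entirely
aws ec2 create-tags \
  --resources i-1234567890abcdef0 \
  --tags Key=autoalarm:re-alarm-enabled,Value=false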

Now, you are equipped with everything you need to stand up AutoAlarm in AWS and roll out either ad hoc monitoring or mass observability included with your next IaC deployment.

Final Parting Thoughts

If you made it this far, I offer my sincerest thank you. AutoAlarm represents countless hours of building and solving difficult at-scale problems to make sure this tool is everything our customers need it to be.

As mentioned, AutoAlarm is publicly available for free. Please download the project, give it a try and let us know what you think.

If AutoAlarm aligns perfectly with your cloud observability needs, contact TrueMark Technologies to discuss deployment. If AutoAlarm meets most—but not all—of your requirements, we welcome your feedback. Our team prioritizes customer-driven enhancements and would be glad to discuss tailored features or support options to ensure AutoAlarm fully supports your business goals.

Usage Guide and Reference

Here’s a quick usage and reference guide for convenience when you’re getting started.

  • Quickly enable and create default monitoring on supported services by setting the AutoAlarm Enabled tag:
    • autoalarm:enabled=true
  • Using AutoAlarm’s Unified Tagging Schema, you can quickly, efficiently and granularly manage CloudWatch alarms for services by following the pattern and example outlined below:
    • autoalarm:<metric>=<warning threshold>/<critical threshold>/<period>/<evaluation periods>/<statistic>/<datapoints to alarm>/<comparison operator>/<missing data treatment>
    • autoalarm:cpu=80/95/60/5/Maximum/5/GreaterThanThreshold/ignore
  • Use undefined, implicit, and null values in short-hand tagging to dynamically remove categories of alarms (Warning and Critical), set custom alert config values where you want them, and keep default values where they don't need to change.
    • autoalarm:cpu=-/99/180/1/Average///ignore
    • autoalarm:memory=-/-
    • autoalarm:storage=-/-
  • AutoAlarms can be found in CloudWatch and identified by the naming convention "AutoAlarm-Service Type-Service Identifier-Metric-Criticality"
    • AutoAlarm-EC2-i-00e123456789bcdef-CPUUtilization-Critical
  • By default, ReAlarm resets alarms not associated with AutoScaling actions every two hours so they retrigger. ReAlarm can be disabled for a resource's alarms by setting a disable tag, or configured to run on a custom schedule by setting a schedule tag on the desired service.
    • autoalarm:re-alarm-enabled=false
    • autoalarm:re-alarm-minutes=30

Notes

  • AutoAlarm's CloudWatch integration is account-agnostic, but the extended Prometheus integration is for TrueMark customers using a custom AWS Managed Prometheus configuration.
  • As you implement AutoAlarm, be sure to reference the README in our project repository. It contains neatly categorized tables that clearly list every supported service, alarm, and available configuration.
  • Feedback is welcome. If you see a feature opportunity, run into issues or have questions, please contact us and let us know.