This post is not a tutorial on creating CloudWatch alarms; you can find those easily on Google. Instead, I am going to talk about how to set up alarms that let you sleep well at night.
It is the story of how I set up alarms for my company’s website. It may not be a brilliant solution, but I think it is simple enough to give you some insight without getting into complex architectures.
My company had a web server that went down whenever there was a traffic spike. The website was not important enough to deserve a dedicated support team or the money to revamp it, but it still needed to stay up for people to browse.
Instead of waiting for customers to report problems, we needed to make sure that when things went wrong, we found out before our customers did. That’s where the journey starts.
To track the state of the web server, I use the CloudWatch log agent to ship the Apache access log to CloudWatch, then use a metric filter to count HTTP 5XX errors. I am not going to cover this part in detail. Instead, I want to focus on the alarm strategy. If you are interested, here are the tutorials from AWS:
Installing the CloudWatch Agent Using the Command Line
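For orientation only, here is a rough sketch of what a metric filter that counts 5XX responses might look like when expressed as arguments for the CloudWatch Logs `put_metric_filter` call. The filter name, metric name, and namespace are made up for illustration, and the pattern assumes the Apache common log format; a different log format would need a different pattern.

```python
def build_5xx_metric_filter(log_group_name: str) -> dict:
    """Build keyword arguments for the CloudWatch Logs put_metric_filter call."""
    return {
        "logGroupName": log_group_name,
        "filterName": "http-5xx-count",  # hypothetical name
        # Space-delimited pattern; assumes Apache common log format.
        # It matches every line whose status code starts with 5.
        "filterPattern": "[ip, id, user, timestamp, request, status_code=5*, size]",
        "metricTransformations": [
            {
                "metricName": "Http5xxCount",    # hypothetical metric name
                "metricNamespace": "MyWebsite",  # hypothetical namespace
                "metricValue": "1",              # each matching line counts as one error
                "defaultValue": 0,               # report 0 when ingested lines do not match
            }
        ],
    }
```

With boto3 installed and credentials configured, the result could be passed along as `boto3.client("logs").put_metric_filter(**build_5xx_metric_filter("my-log-group"))`, where the log group name is again a placeholder.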
Stage 1: Setting up email notification
For starters, the easiest way to set up an alarm is to send an email notification when things go wrong. AWS knows that too, so when you create an alarm, the configuration page has a box for your email address, and it creates the SNS topic and subscription automatically.
Here is my first setup. Whenever the server generates 10+ errors in 15 minutes, the alarm is triggered, and it sends an email to me. Then, it’s my turn to figure out what’s happening and fix it.
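I created this alarm in the console, but the same setup can be sketched as arguments for the CloudWatch `put_metric_alarm` call. The alarm name, metric name, and namespace below are placeholders, and treating missing data as "not breaching" is my assumption for a metric that only has data when log lines arrive.

```python
def build_error_alarm(sns_topic_arn: str) -> dict:
    """Build keyword arguments for the CloudWatch put_metric_alarm call."""
    return {
        "AlarmName": "website-5xx-errors",  # hypothetical name
        "Namespace": "MyWebsite",           # hypothetical, must match the metric filter
        "MetricName": "Http5xxCount",       # hypothetical, must match the metric filter
        "Statistic": "Sum",                 # total errors within each period
        "Period": 15 * 60,                  # 15 minutes, in seconds
        "EvaluationPeriods": 1,
        "Threshold": 10,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",  # "10+ errors"
        "TreatMissingData": "notBreaching",  # assumption: no data means no errors
        "AlarmActions": [sns_topic_arn],     # SNS topic with the email subscription
    }
```

With boto3 available, this could be applied with `boto3.client("cloudwatch").put_metric_alarm(**build_error_alarm(topic_arn))`.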
As you might expect, a problem quickly emerged. Every week or two, I received an alarm email. In the beginning, I would check whether something was wrong with the server each time. But after several weeks, I got used to the emails and sometimes just ignored them, assuming they were false alarms.
Sometimes, the alarm was triggered when I was hanging out with my friends, or when I was sleeping. I could not fix the problem immediately, or even look into it.
Stage 2: Automate remediation action
After several wake-up calls, I found out that most of the time, the error could be solved simply by restarting the MySQL daemon (the web app was so poorly designed that it left many zombie connections open and drained the memory).
If this simple action can fix the error most of the time, why don’t we make it automatic?
I created another SNS subscription that invokes a Lambda function instead of sending an email to me. The Lambda function then uses SSM to run a shell command on the web server to restart the MySQL server.
Again, I am not going to talk about how to run shell commands from SSM. If you are interested, here is the tutorial:
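As a rough sketch of the Lambda side only, the handler below reads the alarm message delivered by SNS and asks SSM to run a restart command through the `AWS-RunShellScript` document. The instance ID, the restart command, and the `ssm_client` parameter (added so the function can be tried without AWS access) are assumptions for illustration, not my exact code.

```python
import json

INSTANCE_ID = "i-0123456789abcdef0"                # hypothetical instance ID
RESTART_COMMAND = "sudo systemctl restart mysqld"  # assumption: systemd host, service named mysqld


def restart_mysql(ssm_client, instance_id: str) -> str:
    """Ask SSM Run Command to restart MySQL and return the command ID."""
    response = ssm_client.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [RESTART_COMMAND]},
        Comment="Restart MySQL after a 5XX alarm",
    )
    return response["Command"]["CommandId"]


def lambda_handler(event, context, ssm_client=None):
    """Handle an SNS event carrying a CloudWatch alarm notification."""
    if ssm_client is None:
        import boto3  # available in the Lambda Python runtime
        ssm_client = boto3.client("ssm")

    command_ids = []
    for record in event["Records"]:
        alarm = json.loads(record["Sns"]["Message"])
        # Only act when the alarm has entered the ALARM state.
        if alarm.get("NewStateValue") == "ALARM":
            command_ids.append(restart_mysql(ssm_client, INSTANCE_ID))
    return {"command_ids": command_ids}
```

For this to work, the Lambda execution role would need permission to call `ssm:SendCommand`, and the instance must be managed by SSM.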
Stage 3: Human intervention if automation fails
After setting up the automation, things got more annoying. Before automation, I knew the server needed me whenever there was an alarm. But now, the alarms just woke me up to say “Hey! You know what, the server was down. But I’ve just handled it. I just wanted to wake you up and tell you that!” After several of these pointless calls, I ignored the emails completely, regardless of what was happening.
I could have turned off the email notification, but that made me nervous too. With no email at all, I would lie awake every night wondering whether the server was fine, afraid that the automation had not fixed an incident.
My solution to this dilemma is a 2-tier alarm. If the server goes wrong, the Lambda function tries to restart the MySQL server first. If the problem persists, the second alarm calls me in to rescue the world (no, just the server).
To keep things clear, I call the first alarm, which triggers Lambda, the Remediate alarm, and the second alarm, which sends an email to me, the Notification alarm.
The Notification alarm can be implemented with the Datapoints to alarm setting, which is found under the Additional configuration section.
This setting specifies how many data points within the evaluation periods must breach the threshold before the alarm is triggered.
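In API terms, the console setting corresponds to the `DatapointsToAlarm` and `EvaluationPeriods` parameters of `put_metric_alarm`. The sketch below shows how the Notification alarm might be expressed; the alarm name, metric name, and namespace are placeholders, and the missing-data behaviour is my assumption.

```python
def build_notification_alarm(email_topic_arn: str) -> dict:
    """Build put_metric_alarm arguments for a "2 out of 2" Notification alarm."""
    return {
        "AlarmName": "website-5xx-notify",  # hypothetical name
        "Namespace": "MyWebsite",           # hypothetical namespace
        "MetricName": "Http5xxCount",       # hypothetical metric name
        "Statistic": "Sum",
        "Period": 15 * 60,                  # each data point covers 15 minutes
        "EvaluationPeriods": 2,             # look at the last 2 data points
        "DatapointsToAlarm": 2,             # both of them must breach
        "Threshold": 10,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # assumption: no data means no errors
        "AlarmActions": [email_topic_arn],   # SNS topic that emails me
    }
```

The Remediate alarm would use the same metric with both values set to 1 and the Lambda topic as its action.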
For example, suppose I set the alarm to trigger when 2 out of 2 data points are breaching. If there are 10 or more errors in the first 15-minute period but fewer than 10 (or none) in the second, the Notification alarm is not triggered, because only 1 out of 2 data points is breaching.
In most cases, when the server goes wrong, the Remediate alarm triggers Lambda to fix it. If the problem is fixed, there is no error afterwards, and the Notification alarm will not be triggered.
If the Lambda function cannot fix the problem, the server keeps generating errors, so the threshold is breached again in the next 15-minute period.
In this case, because 2 out of 2 data points are breached, the Notification alarm will be triggered, and I will receive an email.
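The two scenarios can be replayed with a small simulation. This is a simplified model of the "M out of N" rule, not CloudWatch's actual evaluation logic (it ignores missing data, for instance), and it assumes a breach means 10 or more errors in a period, following the "10+ errors" rule from Stage 1.

```python
def alarm_fires(error_counts, datapoints_to_alarm, evaluation_periods, threshold=10):
    """Return True if enough of the most recent periods breach the threshold.

    error_counts holds one error total per 15-minute period, oldest first.
    """
    recent = error_counts[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return False  # not enough data points collected yet
    breaching = sum(1 for count in recent if count >= threshold)
    return breaching >= datapoints_to_alarm


# Remediate alarm (1 out of 1): fires as soon as one period breaches.
remediate = alarm_fires([12], datapoints_to_alarm=1, evaluation_periods=1)

# Notification alarm (2 out of 2), Lambda fixed the problem: second period is quiet.
fixed = alarm_fires([12, 0], datapoints_to_alarm=2, evaluation_periods=2)

# Notification alarm (2 out of 2), Lambda did not help: errors continue.
still_broken = alarm_fires([12, 15], datapoints_to_alarm=2, evaluation_periods=2)

print(remediate, fixed, still_broken)  # True False True
```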
After setting up these 2 alarms, the Lambda function takes care of the server most of the time. If a severe problem happens, I am notified so that I can fix it manually.
Stage 4: Alarm when there are too many alarms
“Too good to be true”
If I haven’t received any Notification alarm for a long time, it’s “too good to be true”. But which part of my system is too good? The remediation action or the server itself?
If the Lambda function can fix the server’s problem every time, I am happy. But I would much rather the server itself were healthy and didn’t need the Lambda function to fix problems frequently.
If the Remediate alarm is triggered frequently, there must be something wrong with the server that I need to investigate. To track how often the Remediate alarm is triggered, we can set up a third alarm based on the NumberOfMessagesPublished metric of the Remediate alarm’s SNS topic.
In the above graph, we can see that the Remediate alarm of my server is triggered between 0 and 2 times a day. I can set up an alarm based on this metric: if the number exceeds 5, it alerts me to investigate whether any significant change to the server is making it fail more frequently.
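A sketch of this third alarm as `put_metric_alarm` arguments is shown below. `NumberOfMessagesPublished` is a standard metric in the `AWS/SNS` namespace with a `TopicName` dimension; the alarm name, the one-day period, and the missing-data behaviour are my assumptions based on the daily counts described above.

```python
def build_frequency_alarm(remediate_topic_name: str, notify_topic_arn: str) -> dict:
    """Build put_metric_alarm arguments that watch how often remediation runs."""
    return {
        "AlarmName": "website-remediation-too-frequent",  # hypothetical name
        "Namespace": "AWS/SNS",
        "MetricName": "NumberOfMessagesPublished",
        "Dimensions": [{"Name": "TopicName", "Value": remediate_topic_name}],
        "Statistic": "Sum",
        "Period": 24 * 60 * 60,     # assumption: count messages per day
        "EvaluationPeriods": 1,
        "Threshold": 5,
        "ComparisonOperator": "GreaterThanThreshold",  # "exceeds 5"
        "TreatMissingData": "notBreaching",  # assumption: a quiet day is a good day
        "AlarmActions": [notify_topic_arn],  # SNS topic that emails me
    }
```

Note that the metric counts every message published to the topic, so this only reflects remediation runs if the Remediate alarm is the topic's sole publisher and publishes only when it enters the ALARM state.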
At first, I just wanted to set up a simple alarm. I didn’t expect that over the course of this journey, I would create a 3-tier alarm system:
- Auto remediation
- Human intervention
- Incident frequency tracking
“Everything Fails All the Time”
— Werner Vogels
Some of my implementations may not be good, but the idea is that failure is inevitable, and this trial-and-error journey will never end.
When things go wrong, how can we handle them smoothly? How can we minimize the frequency of failure? These questions are not easy to answer, but without insight into how our systems perform, we can never find the answers.
It’s never too late to start improving the reliability of your systems. All you need to do is take the first step.