Datadog 101 - The Right Alerts

# Creating the Right Alerts
---
# Two Rules for Effective Alerting
1.  Alert liberally; page judiciously
1.  Page on symptoms, rather than causes
--
# What does that mean?
* Create lots of alerts for everything.
* Alerts become the living history of your infrastructure
* But only notify people about the **Work Metrics** going awry
---
# Paged Alerts Should Always Be Easily Actionable
## Alerts should be:
* Grokkable at 3AM, drunk, with one eye closed
* Filled with all the info you need
  * Including who to wake up if you have trouble
* Consumable by non-experts
---
# Levels of Alerting Urgency
## Alerts as Records (low severity)
Use to document the system. Helpful when trying to troubleshoot later.
## Alerts as Notifications (moderate severity)
These are things that require intervention, but not right away
## Alerts as Pages (high severity)
Wake the right people up and address it immediately!
---
# How to Determine the Right Level of Urgency?
## Is the issue real?
Don't notify on things that aren't real problems, like:
* Test environments
* System down during planned upgrade
--
# How to Determine the Right Level of Urgency?
## Does the issue require attention?
* If you can automate a response, do it
* The cost of calling someone away from work, sleep, or personal time is significant. Avoid it if you can.
* If it's **real** and **requires attention**, notify and let the engineer prioritize.
--
# How to Determine the Right Level of Urgency?
## Is the issue urgent?
* If and only if the issue is **real** AND **requires attention** AND is **urgent**, generate a page.
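
The three questions above can be read as a small decision procedure. The sketch below is only an illustration: the `Issue` flags and the `classify` helper are hypothetical names, not part of Datadog, and how you decide each flag is up to your team.

```python
from dataclasses import dataclass
from enum import Enum


class Urgency(Enum):
    RECORD = "record"              # low severity: kept for later troubleshooting
    NOTIFICATION = "notification"  # moderate severity: needs attention, not right away
    PAGE = "page"                  # high severity: wake the right people up


@dataclass
class Issue:
    # All three flags are hypothetical inputs you would derive yourself.
    is_real: bool             # not a test environment or a planned upgrade
    requires_attention: bool  # no automated response already handles it
    is_urgent: bool           # cannot wait until someone is back at work


def classify(issue: Issue) -> Urgency:
    """Map the three questions onto the three levels of alerting urgency."""
    if not (issue.is_real and issue.requires_attention):
        # Still worth recording, but nobody gets notified.
        return Urgency.RECORD
    if issue.is_urgent:
        return Urgency.PAGE
    return Urgency.NOTIFICATION
```

For example, `classify(Issue(is_real=True, requires_attention=True, is_urgent=False))` yields a notification rather than a page.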
---
# Page on Symptoms
## **Work metrics** not Resource metrics
![](../../images/monitoring101/workresourceevents-4.png)
---
<br><br><br>
# ...except when it's an early warning...
---
# Early Warning Signs
Also page on the early warning signs that come before really bad things:
* If you are about to run out of disk
* If you are about to hit a quota limit
* etc.
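
One way to reason about "about to run out of disk" is a straight-line projection from two usage samples. This is a minimal sketch under that assumption; the function names and the 24-hour lead time are illustrative choices, not Datadog behavior.

```python
from typing import Optional


def hours_until_full(used_before_gb: float, used_now_gb: float,
                     hours_between: float, capacity_gb: float) -> Optional[float]:
    """Project when the disk fills, assuming usage keeps growing linearly."""
    growth_per_hour = (used_now_gb - used_before_gb) / hours_between
    if growth_per_hour <= 0:
        return None  # usage is flat or shrinking, so there is nothing to warn about
    return (capacity_gb - used_now_gb) / growth_per_hour


def should_page(hours_left: Optional[float], lead_time_hours: float = 24.0) -> bool:
    """Page only when the projected time to full is inside the lead time."""
    return hours_left is not None and hours_left <= lead_time_hours
```

A disk that grew from 400 GB to 420 GB over 10 hours on a 500 GB volume is projected to fill in 40 hours, which is outside the 24-hour lead time used here.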
---
# How To Build an Alert
## https://app.datadoghq.com/monitors#/create
![](https://cl.ly/360B3g2C2s3s/Image%202017-09-25%20at%2011.01.33%20AM.public.png)
--
# Eight Types

* **Host** - notify on the status of the agent heartbeat
* **Metric** - metrics collected by agent or API can trigger alerts
* **Integration** - same as metrics above, applied to specific integrations
* **Process** - check if a process is running or not
* **Network Service** - check if a network endpoint is active or not
* **Custom Check** - run a custom script and alert on the results
* **Event** - trigger alerts if the quantity of events goes over a threshold
* **Outlier** - detect when a member of a group is different from the rest
--
# Define the Metric
![](https://cl.ly/1x443n10051i/Image%202017-09-25%20at%2011.04.31%20AM.public.png)
--
# Set the conditions
![](https://cl.ly/1X0D1P1j3a1d/Image%202017-09-25%20at%2011.08.41%20AM.public.png)
--
# Preview What the Monitor Sees
![](https://cl.ly/1T3R1Q1x1g1h/Image%202017-09-25%20at%2011.09.20%20AM.public.png)
--
# Enter a Message
![](../../images/therightalerts/say.png)
--
# Make it Dynamic
![](../../images/therightalerts/makeitdynamic.png)
--
# Choose Who To Notify
![](../../images/therightalerts/notify.png)
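
The same steps (metric, conditions, message, who to notify) can also be written down as a monitor definition. The sketch below only builds a dictionary shaped like a Datadog monitor; it does not call the API. Treat the field names as something to check against the Datadog monitor API documentation. The threshold, the runbook placeholder, and both notification handles are hypothetical.

```python
import json


def build_disk_monitor(threshold: float = 0.9,
                       notify: str = "@pagerduty",
                       escalation: str = "@oncall-lead@example.com") -> dict:
    """Build a monitor definition; all default values here are illustrative."""
    lines = [
        "{{#is_alert}}",
        "Disk usage on {{host.name}} is above " + format(threshold, ".0%") + ".",
        "Runbook: <link to your runbook>",  # placeholder, fill in your own
        "If you get stuck, escalate to " + escalation + ".",
        "{{/is_alert}}",
        "{{#is_recovery}}Disk usage on {{host.name}} has recovered.{{/is_recovery}}",
        notify,
    ]
    return {
        "type": "metric alert",
        "name": "Disk usage is high on {{host.name}}",
        # Doubled braces in the f-string produce literal { and } in the query.
        "query": f"avg(last_5m):avg:system.disk.in_use{{*}} by {{host}} > {threshold}",
        "message": "\n".join(lines),
        "options": {"thresholds": {"critical": threshold}},
    }


if __name__ == "__main__":
    print(json.dumps(build_disk_monitor(), indent=2))
```

The message carries what the earlier slides ask for: what is wrong, where to look, and who to wake up if you have trouble.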
---
# View the Triggered Monitors
https://app.datadoghq.com/monitors/triggered
![](https://cl.ly/1u0y1F3W201P/Image%202017-09-25%20at%2011.12.34%20AM.public.png)
--
# Click on One
![](https://cl.ly/3r040L3G3Z3V/Image%202017-09-25%20at%2011.13.55%20AM.public.png)
--
# Schedule Downtime
![](https://cl.ly/36081A1U100H/Image%202017-09-25%20at%2011.15.01%20AM.public.png)
