[ Datadog ] Utilizing Datadog for AIOps #1 - Key Features of Watchdog

Z CARE Modified on: Wednesday, June 11, 2025 at 11:44 AM


Watchdog is Datadog’s AI engine that provides algorithmic capabilities for APM, Infrastructure, and Logs. It continuously monitors trends and patterns in metrics and logs to identify anomalous behavior and automatically detect potential issues.

Check Watchdog Alerts

First, open the Watchdog menu in the Datadog console to see the list of alerts that Datadog has detected automatically.

Log : Detects the appearance of warning or error logs, or a sudden increase in such logs
APM & USM : Detects anomalies in Error Rate, Latency, and Hits (Request Rate)
Infra : Detects anomalies in infrastructure metrics

* Reference Docs : Watchdog Alerts coverage

By default, Watchdog detects anomalies based on existing data. Logs require a minimum of 24 hours of data, and metrics require two weeks. Anomaly detection begins only after this minimum amount of data has accumulated.

More data and longer time periods improve the accuracy of anomaly detection.
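The minimum-history rule can be sketched in a few lines of Python. This is only an illustration of the thresholds stated above (24 hours for logs, two weeks for metrics); the names MIN_HISTORY, detection_ready, and detection_start are hypothetical and are not part of any Datadog API.

```python
from datetime import datetime, timedelta, timezone

# Minimum history taken from the text above; the dictionary keys are illustrative.
MIN_HISTORY = {
    "logs": timedelta(hours=24),
    "metrics": timedelta(weeks=2),
}


def detection_start(source: str, first_data_at: datetime) -> datetime:
    """Earliest time at which enough history exists for the given source."""
    return first_data_at + MIN_HISTORY[source]


def detection_ready(source: str, first_data_at: datetime, now: datetime) -> bool:
    """Return True once the source has accumulated its minimum history."""
    return now >= detection_start(source, first_data_at)


# Example: data started flowing on June 1, and it is now June 3.
first_seen = datetime(2025, 6, 1, tzinfo=timezone.utc)
current = datetime(2025, 6, 3, tzinfo=timezone.utc)
print(detection_ready("logs", first_seen, current))     # 48 hours of logs collected
print(detection_ready("metrics", first_seen, current))  # only 2 days of metrics collected
```

In this example the log source has passed its 24-hour minimum, while the metric source still needs data until June 15 before anomaly detection can begin.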

If you want to receive alerts for automatically detected events, click New monitor to create a monitor.

For how to configure a Watchdog Monitor, refer to [ Datadog ] Utilizing Datadog for AIOps #2 - Watchdog Alarm Configuration.

Watchdog Impact Analysis

If you use both APM and RUM, the Watchdog menu shows Watchdog Impact Analysis whenever an APM-related event occurs.

It displays affected services, views, and users (see item #6 in the image below).

Watchdog alert cards provide the following information:

Status : ongoing, resolved, or expired (an alert is expired once it has been ongoing for over 48 hours)
Timeline : Describes the time period over which the event occurs
Message : Describes the anomaly
Graph : Visual representation of the event
Tags : Shows the scope of the event
Impact : Summary of the effect caused by the event ( Watchdog Impact Analysis )

Watchdog Insights

You can also view Watchdog Insights from the explorer screens of Infra, APM, and Log.

Clicking a log-related card shown in Watchdog Insights reveals the following:

Time series of the error logs that contain the flagged field
Tags frequently associated with the error logs
Comprehensive list of log patterns

Watchdog RCA (Root Cause Analysis)

Watchdog Root Cause Analysis (RCA) helps reduce Mean Time to Recovery (MTTR) by automating preliminary investigations during incident triage.
The Watchdog AI engine identifies interdependencies between application performance anomalies and related components in order to derive causal relationships between symptoms.
Whenever Watchdog detects an APM anomaly, it initiates a root cause analysis to provide deeper insight into both the cause and the effects of the anomaly.

This feature requires APM, along with the env, service, and version tags configured through Unified Service Tagging.
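As a quick pre-flight check, the sketch below verifies that the three Unified Service Tagging values are present in a process environment. DD_ENV, DD_SERVICE, and DD_VERSION are the standard Datadog environment variables for these tags; the helper name missing_unified_tags is hypothetical, and the check assumes the tags are supplied through environment variables rather than through labels or Agent configuration.

```python
import os

# Standard Datadog environment variables for the env, service, and version tags.
REQUIRED_TAG_VARS = ("DD_ENV", "DD_SERVICE", "DD_VERSION")


def missing_unified_tags(environ=None):
    """Return the required tag variables that are unset or empty.

    `environ` defaults to the current process environment; passing a dict
    makes the check easy to test.
    """
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_TAG_VARS if not environ.get(name)]


# Example with a hypothetical environment that lacks a version tag.
sample_env = {"DD_ENV": "prod", "DD_SERVICE": "checkout"}
missing = missing_unified_tags(sample_env)
if missing:
    print("Missing Unified Service Tagging variables:", ", ".join(missing))
```

Running a check like this at service startup or in a deployment pipeline makes it easier to catch a missing version tag before it affects features that rely on it.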

Watchdog RCA considers the following data sources during analysis:

APM metrics such as error rate, latency, and request rate
APM deployment traces
APM traces
Agent-based infrastructure metrics such as CPU usage, memory usage, and disk usage
AWS instance status check metrics
Log pattern anomalies

Watchdog Automatic Faulty Deployment Detection

Automatic faulty deployment detection reduces Mean Time to Detect (MTTD) by identifying faulty code deployments within minutes.
Each time new code is deployed, Watchdog compares its performance with the previous version and detects newly introduced error types or increases in error rates.
If Watchdog determines that the new deployment is faulty, detailed information will be displayed on the APM service page and on the resource page for the affected endpoint.
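The comparison described above can be illustrated with a simplified Python sketch. This is not Watchdog's actual algorithm, which is not documented here; it only shows the general idea of flagging a version that introduces new error types or a higher error rate. The VersionStats class, the compare_versions function, and the 5-percentage-point threshold are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class VersionStats:
    """Aggregated request and error counts for one deployed version (hypothetical shape)."""
    requests: int
    errors: Counter  # error type -> number of occurrences

    @property
    def error_rate(self) -> float:
        return sum(self.errors.values()) / self.requests if self.requests else 0.0


def compare_versions(previous: VersionStats, current: VersionStats,
                     rate_increase_threshold: float = 0.05) -> dict:
    """Flag the current version if it adds error types or raises the error rate.

    The threshold is an arbitrary illustrative value, not a Datadog setting.
    """
    new_error_types = sorted(set(current.errors) - set(previous.errors))
    rate_delta = current.error_rate - previous.error_rate
    return {
        "new_error_types": new_error_types,
        "error_rate_delta": rate_delta,
        "faulty": bool(new_error_types) or rate_delta > rate_increase_threshold,
    }


# Example: the new version introduces KeyError and a higher overall error rate.
v1 = VersionStats(requests=1000, errors=Counter({"TimeoutError": 10}))
v2 = VersionStats(requests=1000, errors=Counter({"TimeoutError": 12, "KeyError": 80}))
print(compare_versions(v1, v2))
```

A real detector would also need to account for traffic volume and normal variance between versions, which this sketch deliberately ignores.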

If Watchdog finds an issue in the currently active version, a pink banner will appear at the top of the service details page, as shown in the screenshot.
The deployment table at the bottom of the page shows the deployment history for the service, including any past versions flagged as faulty.

Clicking “View details” in the banner opens a slide-out panel containing more information about the faulty deployment, including:

Graph showing the increase in error rate
Error types newly detected in the deployment
Affected endpoints
HTTP status codes

Watchdog Automatic Faulty Cloud & SaaS API Detection

Automatic detection of faulty cloud and SaaS APIs helps reduce Mean Time to Detect (MTTD) by identifying issues with third-party vendors (e.g., payment gateways, cloud providers) within minutes.
Watchdog uses APM telemetry to continuously monitor error rates in requests to external vendors such as AWS, OpenAI, Slack, and Stripe, and detects service degradation as soon as it begins.

This proactive detection helps you identify and mitigate issues before they escalate, significantly reducing time spent on root cause analysis and improving response time.

When Watchdog identifies a problem with an external provider in use, it shows which services are affected and the scope of the outage.
This helps distinguish between internal and external issues.
Datadog also provides direct links to the provider’s status page and support channels for quick follow-up if needed.