[ Datadog ] Utilizing Datadog for AIOps #1 - Key Features of Watchdog


Watchdog is Datadog’s AI engine that provides algorithmic capabilities for APM, Infrastructure, and Logs. It continuously monitors trends and patterns in metrics and logs to identify anomalous behavior and automatically detect potential issues. 


Check Watchdog Alerts 


First, you can check the list of alerts automatically detected by Datadog from the Datadog Console > Watchdog menu. 

  • Log : Detects new warning or error logs, or a sudden increase in such logs
  • APM & USM : Detects anomalies in Error Rate, Latency, and Hits (Request Rate)
  • Infra : Detects anomalies in infrastructure metrics

* Reference Docs : Watchdog Alerts coverage 

By default, Watchdog detects anomalies based on existing data. For logs, a minimum of 24 hours of data is required,
and for metrics, two weeks of data are necessary. Anomaly detection begins after this minimum data is accumulated.

More data and longer time periods improve the accuracy of anomaly detection.
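As a rough illustration of the minimum-history rule described above, the following sketch checks whether a data source has been reporting long enough for anomaly detection to begin. The names MIN_HISTORY and has_enough_history are hypothetical and are not part of any Datadog API; only the 24-hour and two-week thresholds come from this article.

```python
from datetime import datetime, timedelta

# Minimum history quoted in this article. The names below are hypothetical
# and only illustrate the rule; they are not part of a Datadog API.
MIN_HISTORY = {
    "logs": timedelta(hours=24),
    "metrics": timedelta(weeks=2),
}


def has_enough_history(source: str, first_seen: datetime, now: datetime) -> bool:
    """Return True once `source` has accumulated the minimum history."""
    return now - first_seen >= MIN_HISTORY[source]


now = datetime(2024, 1, 15, 12, 0)
logs_ready = has_enough_history("logs", datetime(2024, 1, 14, 12, 0), now)      # 24 hours of data
metrics_ready = has_enough_history("metrics", datetime(2024, 1, 8, 12, 0), now)  # only 7 days of data
```

In this example the log source has just reached the 24-hour minimum, while the metric source, with only one week of data, has not yet reached the two-week minimum.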



If you want to receive alerts for automatically detected events, click New monitor to create a monitor.

For how to configure a Watchdog Monitor, refer to [ Datadog ] Utilizing Datadog for AIOps #2 - Watchdog Alarm Configuration.


Watchdog Impact Analysis


When both APM and RUM are in use, the Watchdog menu shows Watchdog Impact Analysis whenever an APM-related event occurs.

It displays the affected services, views, and users (see item 6 in the list below).

Watchdog alert cards provide the following information:

  1. Status : ongoing, resolved, or expired (expired means the event has lasted over 48 hours)
  2. Timeline : Describes the duration of the event 
  3. Message : Describes the phenomenon 
  4. Graph : Visual representation of the event 
  5. Tags : Shows the scope of the event 
  6. Impact : Summary of the effect caused by the event ( Watchdog Impact Analysis )


Watchdog Insights


You can also view Watchdog Insights from the explorer screens of Infra, APM, and Log.


Clicking on a card shown in Watchdog Insights reveals the following: 

  • Time series of error logs containing the field 
  • Tags frequently associated with the error logs 
  • Comprehensive list of log patterns 


Watchdog RCA (Root Cause Analysis)


Watchdog Root Cause Analysis (RCA) helps reduce Mean Time to Recovery (MTTR) by automating preliminary investigations during incident triage.
The Watchdog AI engine identifies interdependencies between the components involved in an application performance anomaly and derives causal relationships between the symptoms.
Whenever Watchdog detects an APM anomaly, it starts a root cause analysis to provide deeper insight into both the causes and the effects of the anomaly.

This feature requires the use of APM and mandatory configuration of env, service, and version tags (Unified Service Tagging).
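Unified Service Tagging is commonly configured through the DD_ENV, DD_SERVICE, and DD_VERSION environment variables. The sketch below is a minimal pre-flight check, assuming that convention, that reports which of the three are missing. The function name missing_unified_tags is hypothetical and is not part of the Datadog Agent or tracing libraries.

```python
import os

# Environment variables conventionally used for Unified Service Tagging.
REQUIRED_TAG_VARS = ("DD_ENV", "DD_SERVICE", "DD_VERSION")


def missing_unified_tags(environ=None):
    """Return the names of required variables that are unset or empty.

    `environ` defaults to the current process environment; a plain dict
    can be passed instead, which makes the check easy to test.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_TAG_VARS if not env.get(name)]


# Example: a service that sets env and service but forgets the version tag.
example = missing_unified_tags({"DD_ENV": "prod", "DD_SERVICE": "checkout"})
```

A check like this could run at application start-up or in a deployment pipeline, so that a service missing one of the tags is caught before RCA silently has nothing to correlate.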


Watchdog RCA considers the following data sources during analysis: 

  • APM metrics such as error rate, latency, and request rate

  • APM deployment traces

  • APM traces

  • Agent-based infrastructure metrics such as CPU usage, memory usage, and disk usage

  • AWS instance status check metrics

  • Log pattern anomalies




Watchdog Automatic Faulty Deployment Detection


Automatic faulty deployment detection reduces Mean Time to Detect (MTTD) by identifying faulty code deployments within minutes.
Each time new code is deployed, Watchdog compares its performance with the previous version and detects newly introduced error types or increases in error rates.
If Watchdog determines that the new deployment is faulty, detailed information will be displayed on the APM service page and on the resource page for the affected endpoint.
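To make the version comparison concrete, here is a deliberately simplified sketch of the idea: compare the error rate and the set of error types between the previous and the new version. This is not Watchdog's actual algorithm, which is not described in this article. VersionStats, compare_versions, and the rate_margin threshold are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class VersionStats:
    """Aggregated request statistics for one deployed version (hypothetical shape)."""
    requests: int
    errors: int
    error_types: set = field(default_factory=set)

    @property
    def error_rate(self) -> float:
        # Guard against division by zero for versions with no traffic yet.
        return self.errors / self.requests if self.requests else 0.0


def compare_versions(previous: VersionStats, current: VersionStats,
                     rate_margin: float = 0.02) -> dict:
    """Flag the current version if it introduces new error types or its
    error rate exceeds the previous one by more than `rate_margin`
    (an arbitrary illustrative threshold)."""
    new_types = current.error_types - previous.error_types
    rate_increase = current.error_rate - previous.error_rate
    return {
        "new_error_types": sorted(new_types),
        "rate_increase": rate_increase,
        "flagged": bool(new_types) or rate_increase > rate_margin,
    }


previous = VersionStats(requests=1000, errors=10, error_types={"TimeoutError"})
current = VersionStats(requests=1000, errors=50, error_types={"TimeoutError", "KeyError"})
result = compare_versions(previous, current)
```

Here the new version is flagged for two reasons: KeyError did not occur in the previous version, and the error rate rose from 1% to 5%.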

If Watchdog finds an issue in the currently active version, a pink banner will appear at the top of the service details page, as shown in the screenshot.
The deployment table at the bottom of the page shows the deployment history for the service, including any past versions flagged as faulty.

Clicking “View details” in the banner opens a slide-out panel containing more information about the faulty deployment, including:

  • Graph showing the increase in error rate

  • Error types newly detected in the deployment

  • Affected endpoints

  • HTTP status codes



Watchdog Automatic Faulty Cloud & SaaS API Detection


Automatic detection of faulty cloud and SaaS APIs helps reduce Mean Time to Detect (MTTD) by identifying issues in third-party vendors (e.g., payment gateways, cloud providers) within minutes.
Watchdog uses APM telemetry to continuously monitor error rates in requests to external vendors such as AWS, OpenAI, Slack, and Stripe,
and detects service degradation as soon as it begins.
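The per-vendor error rate that this kind of detection relies on can be sketched as follows. This is only an illustration of the metric, not Watchdog's implementation. The function name error_rate_by_vendor, the (vendor, status_code) input shape, and the choice to count only HTTP 5xx responses as errors are all assumptions.

```python
from collections import defaultdict


def error_rate_by_vendor(calls):
    """Compute the error rate per external vendor.

    `calls` is an iterable of (vendor, http_status_code) pairs.
    Only 5xx responses count as errors here, which is an assumption
    made for this sketch.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for vendor, status in calls:
        totals[vendor] += 1
        if status >= 500:
            errors[vendor] += 1
    return {vendor: errors[vendor] / totals[vendor] for vendor in totals}


calls = [
    ("stripe", 200), ("stripe", 503),
    ("slack", 200), ("slack", 200),
]
rates = error_rate_by_vendor(calls)
```

In this sample, half of the requests to stripe failed while requests to slack were all successful, which points to a problem on the vendor side rather than in the calling service.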

This proactive detection helps you identify and mitigate issues before they escalate, significantly reducing time spent on root cause analysis and improving response time.

When Watchdog identifies a problem with an external provider in use, it shows which services are affected and the scope of the outage.
This helps distinguish between internal and external issues.
Datadog also provides direct links to the provider’s status page and support channels for quick follow-up if needed.
