[Datadog] Subnet Idle IP Monitoring Guide


Subnet Inspection 

This guide explains how to prevent failures during cloud service provider (CSP) platform updates in subnets that lack available IP addresses. It shows how to identify the subnets at risk and keep a minimum number of IP addresses available in each.

Utilizing Subnet IP Availability Metrics 

  • If Datadog is integrated with a CSP such as AWS or Azure, you can use its subnet IP availability metrics.

  • Set an alert that notifies the relevant stakeholders when the number of available IP addresses drops to 5 or fewer.
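The alert rule above comes down to a per-subnet comparison, which is what a grouped (multi-alert) threshold monitor evaluates for each subnet. The Python sketch below is a simplified local model, not Datadog's evaluation engine. It assumes the subnet IDs are hypothetical and that each subnet's data points are already limited to the evaluation window and averaged, as with the "average / last 5 minutes" setting used later.

```python
from statistics import mean

# Threshold from this guide: alert when 5 or fewer IP addresses remain available.
ALERT_THRESHOLD = 5


def subnets_to_alert(points_by_subnet: dict, threshold: float = ALERT_THRESHOLD) -> list:
    """Return the subnets whose average available-IP count is at or below the threshold.

    points_by_subnet maps a subnet ID (hypothetical values) to the data points
    already restricted to the evaluation window. Subnets without data are
    skipped, as with the "Do not notify" Nodata option.
    """
    alerting = []
    for subnet, points in points_by_subnet.items():
        if not points:
            continue  # no data in the window: nothing to compare
        if mean(points) <= threshold:  # "below or equal to" comparison
            alerting.append(subnet)
    return sorted(alerting)


if __name__ == "__main__":
    window = {"subnet-a": [4, 5, 6], "subnet-b": [40, 42], "subnet-c": []}
    print(subnets_to_alert(window))  # ['subnet-a']
```

Each subnet is judged independently, so one exhausted subnet triggers its own alert even when the other subnets have plenty of room.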

Subnet Metric Collection by CSP 

AWS

Metric | Description | Notes
--- | --- | ---
aws.vpc.subnet.available_ip_address_count | Number of available IP addresses in the subnet | Main metric
aws.vpc.subnet.total_ip_address_count | Total number of IP addresses in the subnet |
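The main metric is an absolute count. Combined with the total count, it also gives a usage ratio, which can help judge how close a subnet is to exhaustion relative to its size. The following is a minimal sketch. It assumes both values are read for the same subnet at the same point in time, and the function name is hypothetical.

```python
def subnet_ip_usage_percent(available: float, total: float) -> float:
    """Share of a subnet's addresses that are no longer available, in percent.

    available: aws.vpc.subnet.available_ip_address_count
    total:     aws.vpc.subnet.total_ip_address_count
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= available <= total:
        raise ValueError("available must be between 0 and total")
    return 100.0 * (total - available) / total


if __name__ == "__main__":
    print(subnet_ip_usage_percent(5, 256))  # 98.046875
```

The alert in this guide still uses the absolute count. The percentage is only additional context.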

Prerequisite: metric collection must be enabled

  • To collect aws.vpc.subnet.* metrics, submit a ticket to Datadog to enable collection of these metrics for the account.

  • These metrics are collected via EC2 Describe* API calls, so EC2 resource access must be enabled. 

Metrics Reference Guide: Amazon VPC

Azure

Metric | Description | Notes
--- | --- | ---
azure.network_virtualnetworks.subnets.available_addresses | Number of available addresses in the subnet | Main metric
azure.network_virtualnetworks.subnets.signed_addresses | Number of assigned addresses in the subnet |

Metrics Reference Guide: Microsoft Azure Virtual Network

Alert Configuration 

Create alerts that compare the subnet IP availability metric against a static threshold.

Metric Monitor Configuration 


  1. Choose the Detection Method
    : Select Threshold Alert for subnet monitoring. 

    • Threshold Alert: Compares metric values to static thresholds.

    • Change Alert: Compares changes in metric values over time.

    • Anomaly Detection: Detects unusual behavior based on historical data.

    • Outliers Alert: Detects outliers among a group of resources.

    • Forecast Alert: Predicts future metric values and compares them with thresholds.

    • Watchdog: Datadog's AI automatically detects issues.

  2. Define the metric
    : Select a metric and configure query/formula, grouping, and evaluation period.

    AWS Metric - aws.vpc.subnet.available_ip_address_count


    Azure Metric - azure.network_virtualnetworks.subnets.available_addresses

    • Specify the metric.

    • For group by, select subnet for AWS or subnet_name for Azure.
      You can also add tags (such as account or subscription) to include additional information you want to see when an alert is triggered.

    • Select the aggregation function and evaluation window.
      For example, with average / last 5 minutes, the average of the last 5 minutes of data is calculated every minute and compared with the threshold.
      ※ How often the monitor is evaluated depends on the selected evaluation window:

      • Less than 24 hours → every 1 minute

      • 24 hours to less than 48 hours → every 10 minutes

      • 48 hours or more → every 30 minutes

  3. Set Alert Conditions 

    • Set the comparison operator for threshold evaluation. For subnet monitoring, set it to "below or equal to". 

      • Supported operators: below, above, below or equal to, above or equal to

    • Set the alert threshold to 5. 

    • For Nodata alerts, choose between "Do not notify" and "Notify".

    • If set to "Notify", a Nodata alert is triggered when no data is received.

    • You can configure a separate recovery threshold for alerts triggered by the alert threshold.
      If none is set, the alert recovers once the value rises above the alert threshold (for this monitor, above 5).

    • You can configure the alert to resolve automatically.
      If the alert condition persists, the alert is resolved automatically after the specified time.
      If the condition is still met after the alert has been resolved, the alert is triggered again.

    • Set a waiting period before the monitor starts evaluating newly added targets.

    • Set an evaluation delay time to account for collection intervals and network delays. 

      • AWS: 10-minute delay

      • Azure: 2-minute delay

    • Choose whether to evaluate only when data for the entire evaluation window has been collected.

      • Do not require: evaluation proceeds if there is at least one data point in the window.

      • Require: evaluation proceeds only if the data in the evaluation window is complete.

  4. Notify your team: Notification Settings 

    • Alert Title: This is the subject of the message sent when an alert is triggered.
      - Example: [Warning] Subnet is running low on available IPs 

    • Alert Message
      - This is the content of the message sent when an alert is triggered.
      - Example:

      {{#is_alert}}

      Triggered at (KST): {{local_time 'last_triggered_at' 'Asia/Seoul'}}

      ## Available IPs in {{subnet.name}} under {{aws.account.name}} are less than or equal to 5. Please check.

      {{/is_alert}}

      {{#is_alert_recovery}}

      Triggered at (KST): {{local_time 'last_triggered_at' 'Asia/Seoul'}}

      ## [Recovered] Available IPs in {{subnet.name}} under {{aws.account.name}} have increased to {{value}} (above 5).

      {{/is_alert_recovery}}

    • Use Message Template Variables
      Template variables can be used in both the alert title and the message body.
      For the list of available variables, see: https://docs.datadoghq.com/monitors/notify/variables/?tab=is_alert

    • Notify your services and your team members
      Integrated channels such as Opsgenie, Slack, Teams, webhook, and email are listed here.
      Select the channels or email addresses to notify when the alert is triggered. 

    • Content displayed (message content settings)
      You can choose whether to include automatically attached content such as the query or snapshot in the alert message. 

    • Include Triggering tags in notification title
      This adds the tags of the affected resource to the alert title in the notification. 

    • Aggregation Settings
      If a group by was selected when defining the metric, the monitor is automatically set as a multi-alert, which triggers a separate alert for each group.

    • Renotification Settings
      If the alert (or Nodata) condition continues, renotification will be sent at the interval you select. 

    • Tags Settings
      You can set tags for monitors that are used in the Monitor list, Downtime scheduling, etc. 

    • Priority Settings
      Set the severity (importance) of the alert from P1 to P5.
      Set the priority according to your organization's standard criteria.

  5. Define permission and audit notifications: Set monitor edit permissions and change notifications

    • Restrict editing

      • Set the permission to edit the alert. 

      • When you select a role, all users with that role will have permission to edit. 

    • Test Notifications

      • Click this button to send a test alert with the current settings to the selected channels. 

    • Create

      • Click this button to save the configured settings. 
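The settings from steps 1 to 5 can also be expressed as a monitor definition for the Datadog Monitors API (POST /api/v1/monitor). The Python sketch below only builds the request body and does not call the API. The field names follow the v1 monitor API as we understand it. The tags, priority, and renotification interval are hypothetical placeholders. The query shape (average over the last 5 minutes, grouped by subnet) is an assumption based on step 2. Check the API reference before using this sketch.

```python
import json

# Per-CSP values taken from this guide (metric, group-by tag, evaluation delay).
CSP_SETTINGS = {
    "aws": {
        "metric": "aws.vpc.subnet.available_ip_address_count",
        "group_by": "subnet",
        "evaluation_delay_s": 10 * 60,  # AWS: 10-minute delay
    },
    "azure": {
        "metric": "azure.network_virtualnetworks.subnets.available_addresses",
        "group_by": "subnet_name",
        "evaluation_delay_s": 2 * 60,  # Azure: 2-minute delay
    },
}


def evaluation_frequency_minutes(window_minutes: int) -> int:
    """How often a monitor with this evaluation window is evaluated (rule from step 2)."""
    if window_minutes < 24 * 60:
        return 1
    if window_minutes < 48 * 60:
        return 10
    return 30


def build_query(csp: str, threshold: int = 5, window: str = "last_5m") -> str:
    """Threshold query: average over the window, one group per subnet, <= threshold."""
    s = CSP_SETTINGS[csp]
    return (f"avg({window}):avg:{s['metric']}{{*}} "
            f"by {{{s['group_by']}}} <= {threshold}")


def build_monitor_payload(csp: str, message: str, threshold: int = 5,
                          notify_no_data: bool = False) -> dict:
    """Request body for creating the monitor; values marked hypothetical are placeholders."""
    s = CSP_SETTINGS[csp]
    return {
        "name": "[Warning] Subnet is running low on available IPs",
        "type": "metric alert",
        "query": build_query(csp, threshold),
        "message": message,
        "tags": [f"csp:{csp}", "team:network"],  # hypothetical tags
        "priority": 2,  # hypothetical; pick P1-P5 per your criteria
        "options": {
            "thresholds": {"critical": threshold},
            "notify_no_data": notify_no_data,
            "evaluation_delay": s["evaluation_delay_s"],  # seconds
            "require_full_window": False,  # "Do not require"
            "include_tags": True,  # add triggering tags to the title
            "renotify_interval": 60,  # minutes; hypothetical
        },
    }


if __name__ == "__main__":
    body = build_monitor_payload("aws", "Available IPs in {{subnet.name}} are <= 5.")
    print(json.dumps(body, indent=2))
```

To create the monitor, this body would be sent with the DD-API-KEY and DD-APPLICATION-KEY headers. Keeping the per-CSP values in one table makes the AWS and Azure monitors differ only where this guide says they should.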

AWS Integration

EC2 Filtering 

After AWS integration, data is also collected from EC2 instances that do not have the Datadog Agent installed, which can incur charges. Tag-based filtering is available to avoid this.

Additional Billing after AWS Integration 

  • EC2 instances collected through AWS Integration are subject to billing, but instances with the Datadog Agent installed will not be billed twice.

  • Billing may also apply for Fargate and Lambda resources.

Reference Link

Limit Metric Collection to Specific Resources

  • AWS metrics collected from specific services, such as EC2 and Lambda, can be filtered by tag.

  • Target Services: EC2, Lambda, ELB, Application ELB, Network ELB, RDS, SQS, CloudWatch custom metrics

  • Settings Path: Integrations > Amazon Web Services > Select Account > Metric Collection Tab

  • How to Configure: Filtering can use either a blacklist or a whitelist.

    • Blacklist: Excludes resources that have the specified tags.
      Example: !datadog:no

    • Whitelist: Collects only resources that have the specified tags. Multiple tags are combined with OR.
      Example: datadog:monitored,env:production,instance-type:c1.*

    • You can also use a mix of blacklist and whitelist filtering. 

    • Uppercase letters are converted to lowercase, and spaces are replaced with underscores (_).
      Example: Tag Team:Frontend App should be applied as team:frontend_app.
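The filtering rules above can be modeled locally to check which resources a filter string would keep. This Python sketch is an approximation, not Datadog's implementation. It makes several assumptions: entries are comma-separated, a leading ! marks an exclusion, * is a shell-style wildcard, and an exclusion wins when both an exclusion and an inclusion match. The function names are hypothetical.

```python
import fnmatch


def normalize_tag(tag: str) -> str:
    """Documented normalization: uppercase -> lowercase, spaces -> underscores."""
    return tag.lower().replace(" ", "_")


def is_collected(resource_tags: list, filter_expr: str) -> bool:
    """Approximate blacklist/whitelist evaluation for one resource (assumed semantics)."""
    tags = {normalize_tag(t) for t in resource_tags}
    rules = [r.strip() for r in filter_expr.split(",") if r.strip()]
    excludes = [r[1:] for r in rules if r.startswith("!")]
    includes = [r for r in rules if not r.startswith("!")]

    def matches(pattern: str) -> bool:
        # Shell-style wildcard match against any of the resource's normalized tags.
        return any(fnmatch.fnmatchcase(t, pattern) for t in tags)

    if any(matches(p) for p in excludes):
        return False  # assumption: a blacklist match always excludes
    if includes:
        return any(matches(p) for p in includes)  # whitelist entries are OR-ed
    return True  # only exclusions given: everything else is collected


if __name__ == "__main__":
    print(normalize_tag("Team:Frontend App"))  # team:frontend_app
    print(is_collected(["instance-type:c1.medium"],
                       "datadog:monitored,env:production,instance-type:c1.*"))  # True
```

Because the tags are normalized before matching, the filter itself must be written in the normalized form, as in the team:frontend_app example.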
