[Datadog] Automation using Datadog #2 - Disk Management


This article introduces automation with Datadog's Workflow Automation.

Workflows can use the data collected and the events triggered while monitoring with Datadog.


The sample scenario here covers disk monitoring: when disk usage is high, files need to be deleted.

The workflow selects deletion candidates from the directories where files are routinely deleted, requests approval for the deletion, and shares the deletion list with the person who will carry it out.


Requirements before creating the workflow:


1. Set up monitoring for the folder where files are usually deleted, so that you can check the list of files in the folder and their sizes.

    - Reference (Datadog docs): Directory Integration


2. Create a monitor to observe disk usage. In an ORG created by SK C&C, a default monitor is available: [Warning] Disk usage of {{host.name}} server's {{device_name.name}} is high. You can use this monitor as is, or clone it and modify it as needed.

If the default monitor is not available, create a monitor that detects usage from system.disk.used and system.disk.total.
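
As a rough illustration of the arithmetic such a monitor relies on, the sketch below derives a usage percentage from the two metric values and compares it with the 95% threshold used later in this article. The function names are hypothetical and exist only for this example; they are not part of Datadog.

```javascript
// Hypothetical helper: percentage of the disk in use, computed from the
// values of system.disk.used and system.disk.total (both in bytes).
function diskUsagePercent(usedBytes, totalBytes) {
  if (!(totalBytes > 0)) {
    throw new RangeError("totalBytes must be greater than zero");
  }
  return (usedBytes * 100) / totalBytes;
}

// Hypothetical helper: true when usage exceeds the alert threshold.
// The default of 95 mirrors the threshold used in this article's scenario.
function isDiskUsageHigh(usedBytes, totalBytes, thresholdPercent = 95) {
  return diskUsagePercent(usedBytes, totalBytes) > thresholdPercent;
}
```

For example, `isDiskUsageHigh(96, 100)` is true, while `isDiskUsageHigh(95, 100)` is false because the threshold must be exceeded, not just reached.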


3. The Slack integration must be completed in advance, including adding the channel that will receive the messages.



Creating and setting up the workflow:


1. In Datadog Console, go to Actions > Workflow Automation, and click New Workflow to create a new workflow.

Click on the workflow name to edit it. 

 

2. Select the trigger for the workflow.

    (For details on triggers, see the Datadog docs: Trigger a workflow)

- For this workflow, we will use a monitor as the trigger.



3. Define the handle name to be called from the monitor.



Enter this handle name under Configure notifications & automations in the monitor settings so that the monitor calls the workflow when an event occurs.


Alternatively, click @ Add mention or Add workflow in the Trigger workflow section to add it.
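
To make the idea of a handle concrete, here is a minimal sketch that appends a workflow mention to a monitor message. It assumes the mention takes the form `@workflow-` followed by the handle name; confirm the exact handle shown in your workflow's trigger settings. The function names and the allowed character set are assumptions made for this example.

```javascript
// Hypothetical helper: build the mention for a workflow handle.
// Assumption: mentions look like "@workflow-<handle>".
function workflowMention(handle) {
  const name = String(handle).trim().replace(/^@workflow-/, "");
  // Assumed character set for this sketch; adjust to your naming rules.
  if (!/^[A-Za-z0-9_-]+$/.test(name)) {
    throw new Error(`Unexpected characters in handle: ${handle}`);
  }
  return `@workflow-${name}`;
}

// Hypothetical helper: append the mention to the monitor's message text.
function monitorMessage(text, handle) {
  return `${text} ${workflowMention(handle)}`;
}
```

Calling `monitorMessage("Disk usage is high.", "disk-cleanup")` would yield a message ending in `@workflow-disk-cleanup`, where `disk-cleanup` is a made-up handle.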


4. Plan the workflow, then place and configure the matching actions.

1)  Plan the structure first. It helps to write it out as text or draw a diagram beforehand.
     The workflow in this example is structured as follows:

  • When disk usage exceeds 95%, the alarm triggers the workflow.
  • Tag information (host and device) is extracted from the triggering monitor.
  • A query is written to fetch data from the directory metrics.
    (Extract about five of the largest files in the folder.)
  • Logic based on the directory metric data identifies the files that are actual deletion targets.
  • Request approval for the deletion list via Slack.
  • If approved, send the deletion list to the operator via Slack.
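
The branching in the plan above can be summarized as a small decision function. This is only a sketch for reasoning about the flow before building it; the outcome names and the shape of the input object are hypothetical, and the real branching is done with workflow actions rather than a single function.

```javascript
// Hypothetical summary of the planned branches.
// usagePercent: disk usage reported by the monitor
// seriesCount:  number of series returned by the directory metric query
// approved:     result of the Slack approval request
function decideOutcome({ usagePercent, seriesCount, approved }) {
  if (usagePercent <= 95) return "not_triggered"; // the alarm fires only above 95%
  if (seriesCount === 0) return "no_data";        // nothing to propose for deletion
  return approved ? "approved" : "rejected";      // outcome of the Slack decision
}
```

For instance, `decideOutcome({ usagePercent: 97, seriesCount: 5, approved: true })` returns `"approved"`, which corresponds to sending the deletion list to the operator.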

 2) Once the configuration is planned, locate and arrange the matching actions.

  • Before building, note how to use the data generated when the trigger and each action run.
    Details are given in the explanation of each action below.
    • When a workflow is triggered by a monitor, data about the monitor and the alert is passed in.
      This data is in JSON format; use {{ }} to reference data generated in earlier blocks.
    • After selecting the trigger, click Source in the right panel to see the approximate JSON structure passed by the monitor.

      To copy a variable path, hover over a tag or value and click the icon that appears.

      Use {{copied path}} wherever the value is needed. (Typing {{ also opens path selection.)

    • To transform or modify data, use a JavaScript action, or apply quick adjustments inline with ${logic}.
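
To show how a {{ }} reference maps onto the JSON passed between blocks, the sketch below resolves dotted paths against a plain object. It is an illustration only and not Datadog's implementation; the function names are hypothetical, and the sample context reuses only the `Source.monitor.event.host` path that appears in this article.

```javascript
// Look up a dotted path such as "Source.monitor.event.host" in a context object.
function getByPath(context, path) {
  return path
    .split(".")
    .reduce((value, key) => (value == null ? undefined : value[key]), context);
}

// Replace every {{ path }} placeholder with the value found at that path.
// Placeholders that cannot be resolved are left unchanged.
function renderTemplate(template, context) {
  return template.replace(/\{\{\s*([\w.]+)\s*\}\}/g, (match, path) => {
    const value = getByPath(context, path);
    return value === undefined ? match : String(value);
  });
}

// Made-up sample data shaped like the paths used in this article.
const sampleContext = {
  Source: { monitor: { event: { host: "web-01" } } },
};
```

With this sample, `renderTemplate("host:{{ Source.monitor.event.host }}", sampleContext)` produces `host:web-01`.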

  • When configured according to 4-1), the workflow looks like the following.

    Each Action will be explained in detail.

    • From the monitor alert passed by the trigger, extract the tag information of the target.

      The next action fetches metric data, so use the extracted tag information to build its query.

      Use a JS Function action to write the query from the retrieved data.

      If you prefer to write the query directly in the next action using the monitor values, you can skip this step. The next action covers this in more detail.
      * If you are not familiar with JavaScript, the "Write code with AI" feature above the script editor can help.
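
A minimal sketch of what such a query-building function might look like is shown here. It produces the same query shape as the direct query given for the next action, and returns an object with a `query` field so that a later step could read it as `data.query`. The function names are hypothetical, and the `event` argument stands in for the monitor event data (host and `tag_value.device_name`); inside a real JS Function action you would read those values from the workflow context instead.

```javascript
// Hypothetical helper: build a "top N files" query for a directory metric.
// Note that there is no space between a tag name and its value.
function buildDirectoryQuery(
  host,
  device,
  metric = "system.disk.directory.file.bytes",
  limit = 5
) {
  const scope = `host:${host}, dirfilename:${device}*`;
  return (
    `top(avg:${metric}{${scope}} by {dirfilename,host}` +
    `.rollup(max, 3600), ${limit}, 'max', 'desc')`
  );
}

// Hypothetical step body: "event" mimics the monitor event fields used in this article.
function getQuery(event) {
  return { query: buildDirectoryQuery(event.host, event.tag_value.device_name) };
}
```

Passing `"system.disk.directory.file.created_sec_ago"` as the third argument of `buildDirectoryQuery` gives the variant that ranks files by age instead of size.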

    • Using the query from the JS Function, fetch directory monitoring info with the Datadog Get timeseries point Action.

      Directory monitoring is configured through the conf.d/directory.d/conf.yaml file in the Datadog agent.

      Directory monitoring allows checking file count, size, and age in the designated folder.

      With this data, you can set criteria (size, age) for identifying files to delete.


      To use the query from the preceding Get Query step, enter {{ Steps.Get_Query.data.query }}.
      To use Get timeseries point directly from the trigger, without the JS Function, enter this in the Query field:

      top(avg:system.disk.directory.file.bytes{host:{{ Source.monitor.event.host }}, dirfilename:{{ Source.monitor.event.tag_value.device_name }}*} by {dirfilename,host}.rollup(max, 3600), 5, 'max', 'desc')

      (This query retrieves the 5 largest files by size.)
      You can build the query in the Metric Explorer or copy it from a dashboard.
      Click the </> icon to copy the query as a string; click it again to switch back to block mode.

      When entering a condition in the from clause, make sure there is no space between the tag and its value.

      • To fetch the oldest files instead, use this query:

        top(avg:system.disk.directory.file.created_sec_ago{host:{{ Source.monitor.event.host }}, dirfilename:{{ Source.monitor.event.tag_value.device_name }}*} by {dirfilename,host}.rollup(max, 3600), 5, 'max', 'desc')

      • This Action behaves like Datadog's Query timeseries points API.
    • After retrieving the data, use an If Condition action to branch on whether data exists.
      • Check the series field in the output of the Get timeseries point step.

        Since series is an array, check its length to verify that data is present.
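
The length check described above amounts to the following, shown as a sketch. The function name is hypothetical, and `result` stands for the output of the Get timeseries point step, assumed to carry a `series` array.

```javascript
// True only when the step output has a non-empty "series" array.
function hasSeriesData(result) {
  return Array.isArray(result?.series) && result.series.length > 0;
}
```

Guarding with `Array.isArray` means a missing or malformed `series` field is treated as the no-data case rather than raising an error.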

    • If data exists, use a JS Function to build the file deletion list for Slack from the series data.

      The code in this step is only a sample; the Slack message content can be built with whatever logic you need.
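
As one possible sketch of this step, the code below turns series data into a numbered list of files with their sizes. It assumes each series resembles the Query timeseries points API response, with a `tag_set` array of `key:value` strings and a `pointlist` of `[timestamp, value]` pairs; check the actual step output in your workflow before relying on these field names. All function names are hypothetical.

```javascript
// Return the value of a "key:value" tag from a tag list, or null when absent.
function tagValue(tags, key) {
  const prefix = `${key}:`;
  const tag = (tags || []).find((t) => t.startsWith(prefix));
  return tag ? tag.slice(prefix.length) : null;
}

// Largest numeric value in a pointlist of [timestamp, value] pairs, or null.
function maxPoint(pointlist) {
  const values = (pointlist || [])
    .map(([, value]) => value)
    .filter((value) => typeof value === "number");
  return values.length ? Math.max(...values) : null;
}

// Format a byte count with binary units, one decimal place.
function formatBytes(bytes) {
  const units = ["B", "KiB", "MiB", "GiB", "TiB"];
  let value = bytes;
  let i = 0;
  while (value >= 1024 && i < units.length - 1) {
    value /= 1024;
    i += 1;
  }
  return `${value.toFixed(1)} ${units[i]}`;
}

// Build a numbered list, largest file first, ready to place in a Slack message.
function buildDeleteFileList(series) {
  return series
    .map((s) => ({
      file: tagValue(s.tag_set, "dirfilename"),
      bytes: maxPoint(s.pointlist),
    }))
    .filter((entry) => entry.file !== null && entry.bytes !== null)
    .sort((a, b) => b.bytes - a.bytes)
    .map((entry, i) => `${i + 1}. ${entry.file} (${formatBytes(entry.bytes)})`)
    .join("\n");
}
```

Series without a `dirfilename` tag or without any numeric point are skipped, so the approval message lists only files whose size is known.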


         

    • Once the file list is ready, use Slack's Make a decision Action to request approval.
      • The Slack workspace must be integrated in advance via Datadog console > Integration > Slack.

        After the workspace is integrated, the channel must also be added before it can be selected.

      • Use {{ Steps.Get_delete_file_list.data }} to output the file list in the message.
      • You can style Slack messages using the formatting tools provided.
      • To change the icons on buttons or in messages, use Slack's :icon_name: format.
      • For example, :x: is displayed as a reject icon in the message.
    • Use the Send message action to inform operators or approvers of the outcome (no data, approved, or rejected).
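
The three outcomes could each map to a message like the ones in this sketch, which uses Slack's :icon_name: format for the icons. The outcome names, wording, and function name are hypothetical; in the workflow itself each message would normally be entered in its own Send message action.

```javascript
// Hypothetical helper: message text for each outcome of the workflow.
function outcomeMessage(outcome, host, fileList = "") {
  switch (outcome) {
    case "no_data":
      return `:information_source: No deletion candidates found on ${host}.`;
    case "approved":
      return `:white_check_mark: Deletion approved on ${host}. Files to delete:\n${fileList}`;
    case "rejected":
      return `:x: Deletion rejected on ${host}. No files will be removed.`;
    default:
      throw new Error(`Unknown outcome: ${outcome}`);
  }
}
```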


Once the workflow is added to the monitor, it runs automatically whenever a monitor alert occurs.

This sample workflow demonstrates commonly used actions.


Other actions can control resources (create, delete, start, stop, restart) or retrieve data from AWS, Azure, and more.

Most configurations can be completed via selection and input.

Adding logic extends what a workflow can do.


In particular, Datadog provides AI assistance for writing logic, which makes configuration easier.


If you have difficulty writing workflows, please contact us via the support portal. We'll be happy to assist you.
