Dash 2021

Welcome to the Shoreline Datadog integration workshop at Dash 2021!

We're excited to welcome you to this workshop! This handbook walks you through the hands-on workshop activities. There's a lot to cover, so feel free to reference this guide both during and after the workshop.

Intro to Shoreline

Production ops is the new engineering bottleneck
Despite tools, diagnosis & repair remain manual
The virtuous cycle of Incident Automation

Installation

Log In to Shoreline

  1. Please enter your assigned workshop attendee email or ID in the field below.

    We'll use this value throughout the content to provide accurate credentials, links, and information.

  2. Click here to navigate to your Shoreline cluster, or manually copy and paste the URL below.
  3. Log in with your assigned Shoreline email and password.

  4. When prompted, change your password after logging in for the first time.

  5. Click here to navigate back to your Shoreline cluster and log in with your new password.

Connect Datadog to Shoreline

Shoreline needs a valid Datadog API key and application key to establish a connection and create your workshop environment.
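
If you'd like to sanity-check your assigned keys before pasting them in, you can hit the Datadog API from any terminal. This is optional; the sketch below assumes the default datadoghq.com site and uses placeholder key values.

    # Confirm the API key is valid (returns {"valid": true} on success)
    curl -s -H "DD-API-KEY: <your-api-key>" \
      "https://api.datadoghq.com/api/v1/validate"

    # Confirm the application key works by listing monitors (requires both keys)
    curl -s -H "DD-API-KEY: <your-api-key>" \
         -H "DD-APPLICATION-KEY: <your-application-key>" \
      "https://api.datadoghq.com/api/v1/monitor" | head -c 200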

  1. Click here to open the Shoreline Integrations page.
  2. Click the Configure button in the Datadog section.

    datadog-integration-card
  3. Paste your assigned Datadog API key into the Datadog API key field.

  4. Paste your assigned Datadog Application key into the Datadog Application key field.

    configure-datadog-page
  5. Click the Apply button to save the configuration.

Log In to Datadog

  1. Click here to navigate to the Datadog login page.

  2. Log in with your assigned Datadog email and password.

Install the Shoreline widget

Remediation Dashboard

With the Shoreline widget installed, the Remediation Dashboard is the primary way to access Shoreline functionality directly inside the Datadog UI.

Monitors

The Remediation Dashboard displays all of your Datadog monitors, including each monitor's name, status, tags, and Shoreline auto-remediation status.

  1. Click the icon at the top-right of the Shoreline widget.

    The widget editor screen shows the current list of Datadog monitors along with configurable options at the bottom.

  2. Scroll down to the Widget options section.

  3. The Shoreline URL field should contain your assigned Shoreline cluster URL. You should not need to edit this value.

  4. (Optional) Enter a query in the Monitors query field to filter the monitors displayed in this Shoreline widget.

    You can filter by the monitor's status or a specific tag using the <type>:<value> format. Consider the following examples.

    Value             Description
    status:alert      Displays alerting monitors
    status:ok         Displays OK monitors
    tag:shoreline     Displays monitors with a shoreline tag
    tag:"env:prod"    Displays monitors with an env:prod tag
  5. (Optional) Enter a title in the Widget title field.

  6. Click the button to save your changes and return to your dashboard.

Debug Tab

Shoreline's WebCLI is integrated into the remediation panel, allowing you to execute Op commands to debug your entire fleet.

  1. To get started, click the button at the top of the remediation panel.

    The Debug panel initially displays a single WebCLI cell.

    webcli
  2. Click inside the Op cell, which emulates a normal Op CLI terminal window.

  3. Type a valid Op command and press the Enter key to execute the command.

    For example, the most basic command, hosts, returns all active host Resources.

    op>
    hosts
  4. Use the pods Op command to list all pods.

    op>
    pods
  5. Enter the following Resource query to return only bookstore pods.

    op>
    pods | namespace="java-bookstore"

    The pods serving the bookstore app carry a namespace="java-bookstore" tag. Op uses familiar bash-like syntax: the pipe operator passes the result of the previous command to the next command, just as it does in a Unix shell.

  6. Create a Metric shorthand called pod_cpu.

    op>
    metric pod_cpu = ddmq("avg:kubernetes.cpu.usage.total{*}by{host,pod_name} / 1000000") | window(20s)

    The ddmq command ingests a built-in Datadog metric directly into Shoreline.

  7. Get your pods' CPU usage using the ingested Datadog kubernetes.cpu.usage.total metric.

    op>
    pods | pod_cpu
  8. Use the namespace tag to filter the above Metric query to only bookstore pods.

    op>
    pods | namespace="java-bookstore" | pod_cpu
    Resource and Metric Queries under the hood
  9. Create a named Resource to simplify future commands, so that typing bookstore invokes the full hosts | pods | namespace="java-bookstore" Resource query.

    op>
    resource bookstore = hosts | pods | namespace="java-bookstore"
    op>
    bookstore

    Shoreline also supports a feature called Dynamic Filters. Simple Resource queries retrieve Resources by type or tag; Dynamic Filters let you select Resources based on live metric values instead.

  10. Try using a Dynamic Filter to show all host Resources with over 15% CPU usage.

    op>
    hosts | filter(cpu_usage > 15)
    dynamic-filter
  11. Execute the top Linux command on the bookstore Resources.

    op>
    bookstore | `top -n 1 -b`

    The output dialogs show any stdout, stderr, and exit status.
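
These building blocks compose freely. For example, you could narrow the bookstore Resource with a Dynamic Filter before running a Linux command, so the command only executes on busy pods. The sketch below assumes the cpu_usage metric used above is also available on pod Resources.

    op>
    bookstore | filter(cpu_usage > 15) | `top -n 1 -b`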

Automate Tab

The Automate tab is where you create and manage the Actions associated with your Datadog monitors.

During this workshop, we'll focus on the creation and execution of Actions triggered by an alerting Datadog monitor.

  1. Click the button at the top of the remediation panel.

    The Automate tab is where you associate an Action with the selected Datadog monitor.

  2. Click the button to start creating your custom Action.

    The Automate panel is divided into sections that adjust various aspects of the current Action. Click the optional section headers below to learn more.

  1. Enter generate_load in the name field of your Action.

    This first Action will increase the load on the bookstore pods.

  2. (Optional) In the Description field, enter a value to describe this Action elsewhere in the Datadog UI.

  1. Enter `curl localhost:8000/bookstore` in the Op Statement field.

    The Op Statement field defines the core command you'll use to remediate the Datadog monitor's alerting status. Just like standard Actions, this command executes on all Resources associated with this monitor.

  1. Click the button.

    This saves your Action and performs any necessary background jobs (e.g., creating a Bot, associating it with the Datadog monitor, uploading artifacts, etc.).
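
As an aside, the same Action can conceptually be defined from the WebCLI in the Debug tab, in the same style as the metric and resource shorthands you created earlier. The statement below is a sketch rather than something you need to run; the form-based flow above is the supported path for this workshop.

    op>
    action generate_load = `curl localhost:8000/bookstore`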

Debug and Repair

Create a bookstore Monitor

Set up the monitor
  1. Open the Datadog monitor creation page.
  2. Select the Metric monitor type.
  3. Open the Choose a detection method section and click the button. Alternatively, click on this inline button to open the metric monitor creation page.

Configure the metric

  1. Open the Define a Metric panel.

  2. Click in the Metric field and choose the targeted metric.

    For example, here we're monitoring Kubernetes total CPU usage by selecting the kubernetes.cpu.usage.total metric: monitor-define-the-metric-metric

  3. Click in the from field to select what resources to monitor.

    In this case, find the pods with the java-bookstore namespace: monitor-from-pod-name-bookstore

  4. Click in the field next to avg by to adjust the aggregation group tags.

    Here, we're grouping by the host, pod_name, and kube_container_name tags: monitor-define-the-metric-aggregate

  5. Click the Advanced... link on the right side.

  6. Set the metric query calculation in the Express these queries as field to a / 1000000.

    This ensures our threshold units are easier to work with. Learn more by clicking the header below. (The full monitor query these steps produce is sketched after the threshold settings that follow.)

Adjust thresholds

  1. Open the Set alert conditions panel.

  2. In the Alert threshold field, enter a value of 800, which is the millicore CPU usage that'll trigger an alert status.

  3. Enter 600, 500, and 400 values in the Warning threshold, Alert recovery threshold, and Warning recovery threshold fields, respectively.

    monitor-thresholds
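
Putting the metric, grouping, unit conversion, and alert threshold together, the monitor query Datadog assembles from these steps should look roughly like the following. This is a sketch: the exact namespace tag key (for example, kube_namespace versus namespace) and the evaluation window depend on your account and the defaults Datadog picks.

    avg(last_5m):avg:kubernetes.cpu.usage.total{kube_namespace:java-bookstore} by {host,pod_name,kube_container_name} / 1000000 > 800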

Integrate Shoreline

  1. Open the Say what's happening panel.

  2. Enter java_bookstore_monitor in the Example Monitor Name field.

  3. Enter @webhook-undefined in the Example Monitor Message field.

  4. Click the button to finalize your monitor.

Generate load

In this section we'll generate some load on the java-bookstore namespaced pods using the generate_load Action.

  1. Click the button at the top of the remediation panel.

  2. Execute the generate_load Action you created earlier.

    op>
    bookstore | generate_load

    You should see stdout output indicating that the request was received.

    The above uses both the previously-defined bookstore named Resource and generate_load Action. However, you could also substitute one or both of those with their base query definitions, e.g.:

    op>
    bookstore | `curl localhost:8000/bookstore`
    op>
    pod | namespace="java-bookstore" | `curl localhost:8000/bookstore`
  3. Execute the pod_cpu metric query a few times over the next few seconds and you'll see the pod CPU usage climb dramatically.

    op>
    bookstore | pod_cpu

    After a few moments you'll see the Datadog monitor enter an Alert state.

Debug and repair the monitor

The next step is to debug and manually repair the issue.

  1. Click the button at the top of the remediation panel.

  2. Use Linux commands against the bookstore named Resource to help find the issue.

    op>
    bookstore | `top -n 1 -b`
    Using the top command, we're able to see the java process using 100% of the CPU.
    op>
    bookstore | `top -n 1 -b -H -p 1`
  3. We can also check the Java garbage collection logs to get more insight into the issue.

    op>
    bookstore | `cat /tmp/gc_logs.txt | tail -n 500 | grep "Pause Full"`
  4. We've pre-loaded a custom jvm_dumps.sh shell script that exports all garbage collection, heap, thread, and deadlock data to AWS S3 and kills the problematic process. (A rough sketch of what such a script might contain follows these steps.)

    Enter the following Op command to execute the jvm_dumps.sh script and resolve the issue.

    op>
    bookstore | `cd /java && ./jvm_dumps.sh`
  5. After a moment, verify that CPU usage is reduced, as expected.

    op>
    bookstore | pod_cpu

    Additionally, after a few moments the Datadog monitor should resolve its alarm status.
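
For reference, a minimal sketch of a script like jvm_dumps.sh appears below. The workshop environment already ships its own version, so you don't need to write or run this; the S3 bucket name and the process-selection logic are assumptions, while the JDK tools (jcmd, jstack, jstat, jmap) match the output you'll see in the event stream later.

    #!/bin/bash
    # Sketch only -- the pre-loaded workshop script may differ.
    set -euo pipefail

    PID=$(pgrep -f java | head -n 1)    # assumes a single java process
    TS=$(date +%Y%m%d-%H%M%S)
    OUT=/tmp/jvm-dumps-$TS
    mkdir -p "$OUT"

    jcmd "$PID" GC.heap_dump "$OUT/heap.hprof"    # heap dump
    jstack -l "$PID" > "$OUT/threads.txt"         # thread and deadlock info
    jstat -gc "$PID" > "$OUT/gc.txt"              # garbage collection stats
    jmap -histo "$PID" > "$OUT/heap_histo.txt"    # heap object histogram

    # Upload the dumps; the bucket name is a placeholder
    aws s3 cp --recursive "$OUT" "s3://<your-bucket>/jvm-dumps/$TS/"

    # Stop the problematic process
    kill -9 "$PID"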

Shoreline Deep Dive

Interactive debugging and repair with Shoreline and Datadog
Shoreline architecture

Automate

Now that you've seen how to do things manually, the final step is to automate the entire remediation process.

Create an Action and Associated Bot

We'll create a new Action that automatically executes the jvm_dumps.sh shell script when the monitor enters an alert state.

  1. Click the button at the top of the remediation panel.
  2. Click the button to create a new Action.
  1. Enter jvm_remediator in the Action name field.
  2. (Optional) In the Description field, enter a value to describe this Action elsewhere in the Datadog UI.
  1. Enter `cd /java && ./jvm_dumps.sh` in the Op Statement field.

  1. Tick the Associate and execute automatically checkbox.
  1. Click the button to create the Action.

    By ticking Associate and execute automatically, Shoreline creates a unique Bot that associates the Datadog monitor with this new jvm_remediator Action.
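
Under the hood, a Bot is simply a rule that links an alarm (here, the Datadog monitor's alert) to an Action. Conceptually it looks something like the Op sketch below; the alarm name is a placeholder, and the actual Bot is created and wired up for you automatically, so there's nothing to run here.

    op>
    bot jvm_remediator_bot = if java_bookstore_monitor_alarm then jvm_remediator fi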

Generate load

Once again, execute the generate_load Action to test the incident automation you just created.

  1. Click the button at the top of the remediation panel.

  2. Execute the generate_load Action you created earlier.

    op>
    bookstore | generate_load

    You should see stdout output indicating that the request was received.

  3. Execute the pod_cpu metric query a few times over the next few seconds and you'll see the pod CPU usage climb dramatically.

    op>
    bookstore | pod_cpu
  4. Wait a few moments for the Datadog monitor to enter an Alert state.

Automated remediation with Shoreline and Datadog
Software control loops
  5. After a few more minutes, check back to confirm that the incident automation loop executed and the Datadog monitor is resolved.

Event Stream

You can view Shoreline-related events in the Datadog event stream.

  1. Click here to open the event stream.

  2. Enter jvm_remediator in the Search events... field and press Enter.

    This should find the jvm_remediator Action execution event generated by the automated Shoreline Bot.

    The event output should look something like the following:

    Shoreline bot executed an Op statement to remediate the Datadog monitor, "bookstore_monitor" : 
    "host | instance_id="i-0d6cb759ed0c13bab" | pod | pod_name="java-bookstore" | container | container_name="java-bookstore" | jvm_remediator()"
    #monitor_id:41486260
    
    Stdout: using jcmd to take heap dump.
    using jstack to take thread dump.
    using jstat to get gc info.
    using jmap to get heap stats.
    
    ExitStatus: 0
    Stderr:
    

    The top line shows the name of the triggering Datadog monitor and the full Op statement that was executed. We also see any stdout, exit status, and stderr values returned by the Action execution.
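
If you'd rather pull the same events programmatically, the Datadog events API accepts the keys you configured earlier. The sketch below fetches the past hour of events filtered by the monitor_id tag shown above; your monitor's ID will differ, and the tag filter itself is an assumption you may need to adjust.

    # Fetch the past hour of events (start/end are Unix epoch seconds)
    NOW=$(date +%s)
    curl -s -G "https://api.datadoghq.com/api/v1/events" \
      -H "DD-API-KEY: <your-api-key>" \
      -H "DD-APPLICATION-KEY: <your-application-key>" \
      --data-urlencode "start=$((NOW - 3600))" \
      --data-urlencode "end=$NOW" \
      --data-urlencode "tags=monitor_id:41486260"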