Metrics

In the Resources article, we learned how to discover resources, filter them, drill down for more detail, and explore their relationships. But beyond discovering resources, we also want to know how they are operating: host CPU, disk, memory, and network utilization, as well as service latency and error rates. To get this information, we need metrics. Op metrics are time-series data tagged with a name, a resource, and other metadata. Let's query a metric right now and get the average CPU utilization for each of our hosts over the last 30 seconds:

op>
host | cpu_usage | window(30s) | resolution=10 | mean(3)
 ID | TYPE | NAME                | TIMESTAMPS          | CPU_USAGE
 1  | HOST | i-08442999c268bb61d | 2021/07/02 12:27:30 |      4.69
 2  | HOST | i-08269143cfca5afb4 | 2021/07/02 12:27:30 |     11.21
 3  | HOST | i-0714d77e82ae5486e | 2021/07/02 12:27:30 |      7.75

Op resources and metrics naturally mesh together: prefixing a metric query with a resource query narrows the results to only the metrics associated with the returned resources.
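
The first query above already does this: host is a resource query, and piping it into cpu_usage restricts the metric to our three hosts. Assuming the name filter used later with list metrics also works on resource queries, we could narrow it further to a single host from the sample output:

op>
host | name="i-08442999c268bb61d" | cpu_usage | window(30s)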

Creating New Metrics

As with resources, we can define and save useful metric queries for later use. Let's define a metric for average CPU usage over a two-minute window:

op>
metric cpu_2_min = cpu_usage | window(120s)

As shown above, definition statements are fully parameterized. Op's macro system allows complex substitutions to be expressed with a familiar syntax.
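
Purely as an illustration of what a parameterized definition could look like (the duration parameter and syntax here are hypothetical; the Op Commands Glossary documents the exact form), the window length could itself become a parameter:

op>
metric cpu_avg(duration) = cpu_usage | window(duration)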

Op also lets users create custom derived metrics. Let's create a new metric called "cpu_usage_new":

op>
metric cpu_usage_new = (100 - 100 * (metric_query(metric_names="node_cpu_seconds_total", tags={"mode":"idle"}) | irate(2) | group() | mean)) | lower_bound(0) | upper_bound(100)

If you need to update the formula, simply overwrite it with the following:

op>
cpu_usage_new.val = [new formula]
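
For example, assuming you wanted to drop the upper bound from the original cpu_usage_new formula, the overwrite could look like this:

op>
cpu_usage_new.val = (100 - 100 * (metric_query(metric_names="node_cpu_seconds_total", tags={"mode":"idle"}) | irate(2) | group() | mean)) | lower_bound(0)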

To set the unit of measurement (uom) for the metric:

op>
cpu_usage_new.units = "percent"

Now we can leverage both of the Op commands we have built up. Let's get the average CPU utilization over the last two minutes for each of our hosts:

op>
host | cpu_2_min

The above examples show the power of Op: multiple layers of substitution let you express a complex query very succinctly. And in the heat of a Sev 1 or similarly critical incident, you can easily recall the commands and execute them efficiently.
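
For instance, because cpu_2_min was defined as cpu_usage | window(120s), the macro expansion makes the query above equivalent to writing out the full pipeline:

op>
host | cpu_usage | window(120s)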

List Existing Metrics

Op allows the user to list all previously defined metrics. Imagine you are interactively debugging your cluster with Op and want to apply the cpu_2_min metric defined above to each of your hosts, but you don't remember its name. Let's list our existing metrics:

op>
list metrics

To view a single metric by name:

op>
list metrics | name="cpu_usage_new"

In the full listing we see an entry for our cpu_2_min metric. We know what this metric represents, since we defined it and gave it an appropriate name. But what if someone new to the ops team wants more information about it? Let's add a description to our metric:

op>
cpu_2_min.description = "real time cpu usage over last 2 min"

Now, if we list the metrics again:

op>
list metrics

The new team member will now see our metric description alongside the metric name and formula. Op supports standard CRUD (create, read, update, and delete) operations over metric definitions; please refer to the Op Commands Glossary for more information on syntax and supported operations. These features let an operations team rapidly build up a shared bank of commonly used metric statements, spreading operational knowledge and making it possible to gather valuable system information quickly without reinventing or misremembering common metric formulas.
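
For example, assuming a saved definition is no longer needed, it can be removed with a delete statement (refer to the glossary for the exact form):

op>
delete cpu_usage_new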

Metrics Exporter Support

Op plugs directly into the Prometheus exporter ecosystem: it can pull metrics from any Prometheus exporter as well as from Prometheus itself. Exporters such as Envoy and cAdvisor are auto-discovered and ingested by the Shoreline agent.
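
For example, metrics exposed by any exporter can be queried by name with metric_query, just as in the cpu_usage_new definition above (assuming node_cpu_seconds_total is being collected in your environment):

op>
host | metric_query(metric_names="node_cpu_seconds_total", tags={"mode":"idle"}) | window(60s)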

For further information on Envoy, please see Envoy Overview.

For examples of cAdvisor metrics, please see Monitoring container metrics using cAdvisor.