Data Metrics, Collection, and Visualization

Key Performance Indicators

Key performance indicators (KPIs) help determine how well (or how poorly) a practice or initiative is performing. Unlike the metrics that support (or refute) a hypothesis, KPIs are the hypothesis, and they are ideally defined at the initiation of the project. KPIs are statements that concentrate on business goals and the impact of improvements to the value stream.

Good Key Performance Indicators (KPIs) should not be created in a vacuum. They should represent the goals of the business as well as the teams, and be prominently displayed for all to see. But what makes a good KPI, and how is one established? How many should each team have? What should be included or excluded? Are they different from goals?

One key point to remember is that people tend to change behavior when they are measured, often by finding the shortest path to meeting the requirement. Sadly, this can have unintended side effects. KPIs are not just for management. When setting goals, the aim is to drive toward the desired change while minimizing unintended consequences. From a management perspective, two of the most important missions are setting priorities and providing the resources required to accomplish them. KPIs are then a crucial way to verify that those priorities are achieved.

KPIs go by different names. Informally they may be referred to as metrics, which can be accurate, but they represent an entirely different type of metric than those commonly discussed. KPIs used to examine individual performance are sometimes referred to as OKRs (Objectives and Key Results), a term popularized by the venture capitalist John Doerr.

Title Running Time Description Persona
Measuring DevOps: The Key Metric That Matters 29m 31s Having the right goals, asking the right questions, and learning by doing are paramount to achieving success with DevOps. Having specific milestones and shared KPIs play a critical role in guiding your DevOps adoption and lead to continuous improvement—toward realizing true agility, improved quality, and faster time to market throughout your organization. This session will walk you through a practical framework for implementing measurement and tracking of your DevOps efforts and software delivery performance that will provide you with data you can act on! Management
DevOps Quality Metrics that Matter 48m 35s The way that we develop and deliver software has changed dramatically in the past 5 years—but the metrics we use to measure quality remain largely the same. Every other aspect of application delivery has been scrutinized and optimized as we transform our processes for DevOps. Why not put quality metrics under the microscope as well? Tricentis commissioned Forrester to research the topic. Forrester analyzed how DevOps leaders use and value 94 common quality metrics—then identified which metrics matter most for DevOps success. Management
If You Don’t Know Where You’re Going, It Doesn’t Matter How Fast You Get There 33m 30s The best-performing organizations have the highest quality, throughput, and reliability while also delivering value. They are able to achieve this by focusing on a few key measurement principles, which Nicole and Jez will outline in this talk. These include knowing your outcome measuring it, capturing metrics in tension, and collecting complementary measures… along with a few others. Nicole and Jez explain the importance of knowing how (and what) to measure—ensuring you catch successes and failures when they first show up, not just when they’re epic, so you can course correct rapidly. Measuring progress lets you focus on what’s important and helps you communicate this progress to peers, leaders, and stakeholders, and arms you for important conversations around targets such as SLOs. Great outcomes don’t realize themselves, after all, and having the right metrics gives us the data we need to be great SREs and move performance in the right direction. Management
Identifying key performance indicators 3m 35s It is important, in business and in our application development, that we have a clear set of measurable and obtainable objectives. These items we call key performance indicators, or KPIs and should be ways we can measure the effectiveness and success of our applications. Management
Oracle BI 11g: Scorecarding and Strategy Management This course teaches you how to create and use KPIs and scorecards, which are components of the Business Intelligence Foundation Suite, a complete, open, and integrated solution for all enterprise business intelligence needs, including reporting, ad hoc queries, OLAP, dashboards, and scorecards. Management
KPIs What They Are and Why Your Organization Needs Them 32m 24s In this 30-minute session, we will review some of the fundamental questions that you need to address in order to get started. Management
Chapter 19: Creating KPIs (Practice of Cloud System Administration) 27m 0s Setting KPIs is quite possibly the most important thing that a manager does. It is often said that a manager has two responsibilities: setting priorities and providing the resources to get those priorities done. Setting KPIs is an important way to verify that those priorities are being met. The effectiveness of the KPI itself must be evaluated by making measurements before and after introducing it and then observing the differences. This changes management from a loose set of guesses into a set of scientific methods. We measure the quality of our system, set or change policies, and then measure again to see their effect. This is more difficult than it sounds. Management
12 DevOps KPIs you should track to gauge improvement 6m 0s It’s no small task to transform an IT organization to integrate development, operations and quality assurance teams. A DevOps methodology requires team and process changes and then, once everything is in place, the onus is on IT to create DevOps KPIs and measure the outcomes. Everyone
How to Use Value Stream Mapping in DevOps 17m 37s At a high-level, there are some basic metrics that will be collected as part of the Value Stream Mapping exercise. Value added (VA): Value added time is the amount of time that a team actually spends working on the project (as opposed to, for example, the time that a project or request sits in the queue). Whenever there is no change in the product, it is considered non-value added time. Lead time (LT): Lead time represents the total time it takes a person or team to complete a task—it is the combination of value added and non-value added. % Complete/accurate (%C/A): This is the percentage of information-based work that is complete and accurate the first time and requires no re-work by downstream processes. There are numerous other metrics that can be collected for a VSM exercise and will depend on value stream being mapped. Management

Service Level Indicators/Service Level Objectives

SLIs are, among other things, the critical measure of a system’s availability. At the same time, SLOs can be considered the goals we set for how much availability we can expect out of a system. Together, these two values help engineering teams make better decisions. They provide critical information about how hard you can push your system and whether code improvements have the desired effect.

Today’s systems are complex, with hundreds, if not thousands, of nodes comprising everything from databases to web servers. And with so many nodes, the idea of a system boundary becomes blurred, making it even more imperative to measure the performance of individual components, be they physical or virtual.

Before you can apply the concepts of SLIs to a system, you need a plain-language definition of availability and a description of system boundaries that everyone can agree with. Remember that these definitions will change over time as new systems arrive and old systems are retired, and as business needs and operational realities change. Once arrived at, SLIs become broad proxies for availability: metrics that will help determine the health of an active system. Most services consider request latency (how long it takes to return a response to a request) as a key SLI. Other common SLIs include the error rate, often expressed as a fraction of all requests received, and system throughput, typically measured in requests per second. The measurements are often aggregated: raw data is collected over a measurement window and then turned into a rate, average, or percentile. Ideally, the SLI directly measures a service level of interest, but sometimes only a proxy is available because the desired measure may be hard to obtain or interpret. For example, client-side latency is often the more user-relevant metric, but it might only be possible to measure latency at the server.
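To make the aggregation step concrete, here is a minimal Python sketch that turns raw per-request measurements into the three SLIs mentioned above: error rate, 99th-percentile latency, and throughput. The record format and window size are illustrative assumptions, not taken from any particular monitoring system.

```python
# A minimal sketch: each request is recorded as (latency_ms, status_code).
from statistics import quantiles

def compute_slis(requests, window_seconds):
    """Aggregate raw measurements from one window into SLI values."""
    latencies = [latency for latency, _ in requests]
    errors = sum(1 for _, status in requests if status >= 500)

    return {
        # Error rate as a fraction of all requests received in the window.
        "error_rate": errors / len(requests),
        # 99th-percentile request latency (n=100 yields percentile cut points).
        "latency_p99_ms": quantiles(latencies, n=100)[98],
        # Throughput expressed as requests per second.
        "throughput_rps": len(requests) / window_seconds,
    }

# Example: a 60-second window of (latency_ms, status) samples.
window = [(120, 200), (95, 200), (300, 500), (88, 200), (250, 200)]
print(compute_slis(window, window_seconds=60))
```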

An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound. Choosing an appropriate SLO is complex. To begin with, you don’t always get to choose its value! For incoming HTTP requests from the outside world to your service, the queries per second (QPS) metric is primarily determined by the desires of your users, and you cannot set an SLO for that. Choosing and publishing SLOs to users sets expectations about how a service will perform. This strategy can reduce unfounded complaints to service owners about, for example, the service being slow. Without an explicit SLO, users often develop their own beliefs about desired performance, which may be unrelated to the views held by the people designing and operating the service.
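Building on the SLI sketch above, the following sketch shows the "lower bound ≤ SLI ≤ upper bound" structure as a simple check. The target values are invented for illustration only, not recommendations.

```python
# Invented example targets; None means "no bound on this side".
SLOS = {
    "error_rate":     (None, 0.01),  # at most 1% of requests may fail
    "latency_p99_ms": (None, 400),   # p99 latency no worse than 400 ms
    "throughput_rps": (None, None),  # measured, but no target: users set QPS
}

def evaluate_slos(slis, slos):
    """Return, for each SLI, whether it falls within its SLO bounds."""
    results = {}
    for name, value in slis.items():
        lower, upper = slos.get(name, (None, None))
        ok = (lower is None or value >= lower) and (upper is None or value <= upper)
        results[name] = ok
    return results

slis = {"error_rate": 0.2, "latency_p99_ms": 300, "throughput_rps": 0.08}
print(evaluate_slos(slis, SLOS))  # error_rate exceeds its target here
```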

This track will discuss various SLIs, how to determine SLOs, and practical examples to better cement the idea of how the two work together.

Title Running Time Description Persona
Building a Culture of Metrics 3m 19s This video explains how to select the right metrics Everyone
SLAs and SLOs 5m 21s This video explains what SLAs and SLOs are and why these are important for business needs, with examples. Everyone
Defining reliability 2m 34s Let’s start by defining the basic building block for measuring the reliability of a service, the service-level indicator. A service-level indicator, or SLI for short is an indicator of the level of service you provide via your service, ideally expressed as a ratio of two numbers. Monitoring Operations
Implementing measurements 2m 6s Once you have determined what about your service makes users happy and have a specific assessment in mind for that behavior, you then need to find a concrete way to implement the assessment. This concrete measurement of an SLI specification is referred to as an SLI implementation. SLIs should be specific and measurable. Monitoring Operations
Common measurements 4m 19s One common way to get started with service level indicators is to think about your system abstractly and segment it into pieces based on common component types. Typically, components of a system will fall into one of three categories; one, request-driven systems also referred to as user-facing serving systems; two, big data systems also referred to as data processing pipelines; and three, storage systems. Monitoring Operations
Measurement and calculation 4m 8s The most important question you can ask to generate SLIs is what do your users care about? That said, often what users care about is difficult or impossible to measure. So, you’ll need to think critically about how to approximate the users’ needs. Monitoring Operations
Objectives vs. indicators 2m 59s A service-level objective builds off of a service-level indicator to provide a target level of reliability for a service as customers. It takes the SLI, and adds both a threshold and a time window, making it a metric that can be evaluated at a set cadence. Setting service-level objectives frames service-performance expectations. Monitoring Operations
Making measurements meaningful 3m 57s Once you have a high quality service level indicator, there are two things needed to turn it into a service level objective. A time window, and a threshold. Monitoring Operations
Documenting SLOs 1m 10s Service level objectives that are defined and have received stakeholder buy in should be properly documented in a centralized easy to update space where other teams and stakeholders can review them. Monitoring Operations
25 Examples of a Service Level Objective 5m 3s A service level objective is a criteria that is used to evaluate the performance of a business or technology service. In many cases, service level objectives are specified in a contract such as a master service agreement. This article contains common examples of service level objectives Monitoring Operations
Service Level Indicators in Practice 8m 0s Service level indicators are literally the most important piece you need in order to apply SRE principles. Even if you think you have them, they might not be high quality enough for you to accurately gauge your customer’s experience, and they will mislead you. Management/Monitoring Operations
SLOs & You 12m 30s Whether you’re just getting started with DevOps or you’re a seasoned pro, goals are critical to your growth and success. They indicate an endpoint, describe a purpose, or more simply, define success. But how do you ensure you’re on the right track to achieve your goals? Monitoring Operations
Setting SLOs and SLIs in the Real World 38m 36s Clearly defined and well-measured Service Level Objectives (SLOs) and Service Level Indicators (SLIs) are a key pillar of any reliability program. SLOs allow organizations and teams to make smart, data-driven decisions about risk and the right balance of investment between reliability and product velocity. Monitoring Operations
Real World SLOs and SLIs: A Deep Dive 37m 02s If you’ve read almost anything about SRE best practices, you’ve probably come across the idea that clearly defined and well-measured Service Level Objectives (SLOs) and Service Level Indicators (SLIs) are a key pillar of any reliability program. SLOs allow organizations and teams to make smart, data-driven decisions about risk and the right balance of investment between reliability and product velocity. But in the real world, SLOs and SLIs can be challenging to define and implement. In this talk, we’ll dive into the nitty-gritty of how to define SLOs that support different reliability strategies and modalities of service failure. We’ll start by looking at key questions to consider when defining what “reliability” means for your organization and platform. Then we’ll dig into how those choices translate into specific SLI/SLO measurement strategies in the context of different architectures (for example, hard-sharded vs. stateless random-workload systems) and availability goals. Management
Latency SLOs Done Right 27m 12s Latency is a key indicator of service quality, and important to measure and track. However, measuring latency correctly is not easy. In contrast to familiar metrics like CPU utilization or request counts, the “latency” of a service is not easily expressed in numbers. Percentile metrics have become a popular means to measure the request latency, but have several shortcomings, especially when it comes to aggregation. The situation is particularly dire if we want to use them to specify Service Level Objectives (SLOs) that quantify the performance over a longer time horizons. In the talk we will explain these pitfalls, and suggest three practical methods how to implement effective Latency SLOs. Operations
The Map Is Not the Territory: How SLOs Lead Us Astray, and What We Can Do about It 45m 21s SLOs are a wonderfully intuitive concept: a quantitative contract that describes expected service behavior. These are often used in order to build feedback loops that prioritize reliability, communicate expected behavior when taking on a new dependency, and synchronize priorities across teams with specialized responsibilities when problems occur, among other use cases. However, SLOs are built on an implicit model of service behavior, with a raft of simplifying assumptions that don’t universally hold. Both
Automatic Metric Screening for Service Diagnosis 18m 35s When a service is experiencing an incident, the oncall engineers need to quickly identify the root cause in order to stop the loss as soon as possible. The procedure of diagnosis usually consists of examining a bunch of metrics, and heavily depends on the engineers’ knowledge and experience. As the scale and complexity of the services grow, there could be hundreds or even thousands of metrics to investigate and the procedure becomes more and more tedious and error-prone. Operations
Performance Checklists for SREs ?? The Netflix example of a checklist and its associated performance metrics (dated?) Operations
Using deemed SLIs to measure customer reliability ?? If you do run a platform, it’s going to break sooner or later. Some breakages are large and easy to understand, such as no one being able to reach websites hosted on your platform while your company’s failure is frequently mentioned on social media. However, other kinds of breakage may be less obvious to you—but not to your customers. What if you’ve accidentally dropped all inbound network traffic from Kansas, for example? Both
Why Percentiles Don’t Work the Way you Think 14m 15s They’re not asking for the 99th percentile of a metric, they’re asking for a metric of 99th percentile. This is very common in systems like Graphite, and it doesn’t achieve what people sometimes think it does. This blog post explains how percentiles might trick you, the degree of the mistake or problem (it depends), and what you can do if percentile metrics aren’t right for you. Both
Monitoring Isn’t Observability 6m 4s Observability is all the rage, an emerging term that’s trending up very quickly in certain circles even while it remains unknown in others. As such, there isn’t a single widely understood meaning for the term, and much confusion is inevitably following. What is observability? What does it mean? Perhaps just as importantly, what is observability NOT? Both
The Problem with Pre-aggregated Metrics: Part 1, the “Pre” 2m 48s Pre-aggregated, or write-time, metrics are efficient to store, fast to query, simple to understand… and almost always fall short of being able to answer new questions about your system. This is fine when you know the warning signs in your system, can predict those failure modes, and can track those canary-in-a-coal-mine metrics. Both
The Problem with Pre-aggregated Metrics: Part 2, the “aggregated” 3m 34s The nature of pre-aggregated time series is such that they all ultimately rely on the same general steps for storage: a multidimensional set of keys and values comes in the door, that individual logical “event” (say, an API request or a database operation) gets broken down into its constituent parts, and attributes are carefully sliced in order to increment discrete counters. Operations
The Problem with Pre-aggregated Metrics: Part 3, the “metrics” 3m 45s Finally, we arrive at discussing “metrics.” Terminology in the data and monitoring space is incredibly overloaded, so for our purposes, “metrics” means: a single measurement, used to track and assess something over a period of time. Both
It All Adds Up 8m 50s Statistical analysis is a critical – but often complicated – component in determining your ideal Service Level Objectives (SLOs). So, a “deep-dive” on the subject requires much more detail than can be explored in a blog post. Monitoring Operations
Quantifying your SLOs 9m 19s Service Level Objectives (SLOs) are essential performance indicators for organizations that want a real understanding of how their systems are performing. However, these indicators are driven by vast amounts of raw data and information. That being said, how do we make sense of it all and quantify our SLOs? Monitoring Operations
Measuring and evaluating Service Level Objectives (SLOs) 4m 18s Managing services is hard for both service owners and stakeholders. To make things easier for everyone, define a clear set of expectations from the beginning. This helps measure and evaluate the health of services easier. Monitoring Operations
Chapter 14: Create Telemetry to Enable Seeing and Solving (DevOps Handbook) 28m 33s The Microsoft Operations Framework (MOF) study in 2001 found that organizations with the highest service levels rebooted their servers twenty times less frequently than average and had five times fewer “blue screens of death.” In other words, they found that the best-performing organizations were much better at diagnosing and fixing service incidents, in what Kevin Behr, Gene Kim, and George Spafford called a “culture of causality” in The Visible Ops Handbook. High performers used a disciplined approach to solving problems, using production telemetry to understand possible contributing factors to focus their problem solving, as opposed to lower performers who would blindly reboot servers. Monitoring Operations
Chapter 15: Analyze Telemetry to Better Anticipate Problems and Achieve Goals (DevOps Handbook) 15m 41s A great example of analyzing telemetry to proactively find and fix problems before customers are impacted can be seen at Netflix, a global provider of streaming films and television series. Netflix had revenue of $6.2 billion from seventy-five million subscribers in 2015. One of their goals is to provide the best experience to those watching videos online around the world, which requires a robust, scalable, and resilient delivery infrastructure. Roy Rapoport describes one of the challenges of managing the Netflix cloud-based video delivery service: “Given a herd of cattle that should all look and act the same, which cattle look different from the rest? Or more concretely, if we have a thousand-node stateless compute cluster, all running the same software and subject to the same approximate traffic load, our challenge is to find any nodes that don’t look like the rest of the nodes.” Monitoring Operations
Chapter 6: Numbers Lead the Way (Achieving DevOps) 17m 31s + 39m 8s [First Section, and first section in Behind the Story] Dashboarding and monitoring should have been far more prominent in the team’s journey; in this chapter, they finally begin to pay more attention to the numbers that matter most to their business partners, which reflect global value. Why is this important? And how do feature flags help enable a more continuous flow of value without increasing risk? Monitoring Operations
Chapter 16: Monitoring Fundamentals (Practice of Cloud System Administration) 22m 24s You can observe a lot by just watching. —Yogi Berra - Monitoring is the primary way we gain visibility into the systems we run. It is the process of observing information about the state of things for use in both short-term and long-term decision making. The operational goal of monitoring is to detect the precursors of outages so they can be fixed before they become actual outages, to collect information that aids decision making in the future, and to detect actual outages. Monitoring is difficult. Organizations often monitor the wrong things and sometimes do not monitor the important things. Monitoring Operations
Chapter 17: Monitoring Architecture and Practice (Practice of Cloud System Administration) 38m 58s A monitoring system has many different parts. A measurement flows through a pipeline of steps. Each step receives its configuration from the configuration base and uses the storage system to read and write metrics and results. Monitoring Operations
Appendix 3: Monitoring and Metrics (Practice of Cloud System Administration) 3m 9s Monitoring and Metrics covers collecting and using data to make decisions. Monitoring collects data about a system. Metrics uses that data to measure a quantifiable component of performance. This includes technical metrics such as bandwidth, speed, or latency; derived metrics such as ratios, sums, averages, and percentiles; and business goals such as the efficient use of resources or compliance with a service level agreement (SLA). Monitoring Operations
Chapter 4: Service Level Objectives (SRE Book) 18m 59s It’s impossible to manage a service correctly, let alone well, without understanding which behaviors really matter for that service and how to measure and evaluate those behaviors. To this end, we would like to define and deliver a given level of service to our users, whether they use an internal API or a public product. Management/Monitoring Operations
Chapter 2: Implementing SLOs (SRE Workbook) 41m 44s It’s impossible to manage a service correctly, let alone well, without understanding which behaviors really matter for that service and how to measure and evaluate those behaviors. To this end, we would like to define and deliver a given level of service to our users, whether they use an internal API or a public product. Monitoring Operations
Chapter 3: SLO Engineering Case Studies (SRE Workbook) 30m 11s SLOs are fundamental to the SRE model. Since we launched the Customer Reliability Engineering (CRE) team—a group of experienced SREs who help Google Cloud Platform (GCP) customers build more reliable services—almost every customer interaction starts and ends with SLOs. Monitoring Operations
Best Practices for Setting SLOs and SLIs for Modern, Complex Systems 12m 16s At New Relic, defining and setting Service Level Indicators (SLIs) and Service Level Objectives (SLOs) is an increasingly important aspect of our site reliability engineering (SRE) practice. It’s not news that SLIs and SLOs are an important part of high-functioning reliability practices, but planning how to apply them within the context of a real-world, complex modern software architecture can be challenging, especially figuring out what to measure and how to measure it. Everyone

Monitoring

Monitoring is the collecting, processing, aggregating, and displaying of real-time quantitative data about a system, such as query counts and types, error counts and types, processing times, and server lifetimes. Monitoring can include many types of data, including metrics, text logging, structured event logging, distributed tracing, and event introspection. While all of these approaches are useful in their own right, this chapter mostly addresses metrics and structured logging. These two data sources are best suited to fundamental monitoring needs.

At the most basic level, monitoring allows you to gain visibility into a system, which is a core requirement for judging service health and diagnosing a service when things go wrong.

There are two broad types of monitoring. White-box monitoring is based on metrics exposed by the internals of the system, including logs, interfaces like the Java Virtual Machine Profiling Interface, or an HTTP handler that emits internal statistics. Black-box monitoring tests externally visible behavior as a user would see it.
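As a rough illustration of the difference, the sketch below contrasts a white-box handler that exposes internal counters with a black-box probe that exercises the service from the outside, as a user would. The endpoint, counter names, and threshold are illustrative assumptions.

```python
import time
import urllib.request

# White-box: the service itself exposes internal counters through a handler
# such as /metrics, which the monitoring system scrapes.
INTERNAL_STATS = {"requests_total": 0, "errors_total": 0}

def metrics_handler():
    """Return internal statistics in a simple text exposition format."""
    return "\n".join(f"{name} {value}" for name, value in INTERNAL_STATS.items())

# Black-box: probe the service from the outside, exactly as a user would.
def blackbox_probe(url, timeout=5):
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            healthy = resp.status == 200
    except OSError:
        healthy = False
    return {"up": healthy, "latency_s": time.monotonic() - start}
```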

Alerting

After setting up the monitoring solution, establishing the values to collect, and ensuring that the monitoring solution is receiving the appropriate data, the next obvious question is: now what? Most teams monitor their equipment to ensure that it is up and operational, and to alert them when it fails or performs outside established thresholds. If monitoring is the act of observation, then alerting is one method of delivering the data. It is not the only one, and it should not be the first. The problem with alerting is getting it right, making it effective, and ensuring it is not ignored. And it is so easy to ignore an alert.

Alerts fall into two categories. The first is the "for your information" alert: no immediate action is required, but someone should be informed. "The backup job failed" is an FYI type of alert. Someone needs to investigate, but the urgency does not rise to "drop everything you are doing." The second type of alert is exactly that: drop everything you are doing. This is an alert meant to wake people up in the dead of night[1]. Occasionally a middle-range alert is created, somewhere between "the system is down" and "get to it when you can." Resist the temptation to create this middle type of alert. Either the problem demands immediate action, or it does not. Too often, middle-of-the-road alerts lead to alerts being ignored. Use these criteria to evaluate your alerts.

The question then is around the strategy for delivery and resolution of alerts.

First, stop using email for alerts. Most DevOps/SREs get too much email. If the alert is of the FYI variety, send it to a group notification (chat) system for follow-up. If the alert requires immediate action, choose the method that works best for a quick response: SMS, PagerDuty, and so on. Not everyone uses or pays attention to SMS alerts during the day or at night, so ensure the team has an agreed-upon method for how these alerts will be sent and acted upon. One shop used a flashing red light in the middle of the work area. Use what works. It is also important to log all alerts for later reporting. A unique sequence number that can be referred back to in reports, knowledge base articles, and post mortems is critical; it will also aid in SLA reporting. Finally, ensure that you have a runbook/checklist. The Checklist Manifesto details the value of having a checklist, and what makes a good one, whether you are in surgery, trying to land a plane, or troubleshooting a problem.
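A minimal sketch of that delivery strategy might look like the following: FYI alerts go to the team chat, urgent alerts go to the paging system, and every alert gets a sequence number and is logged for later reporting. The chat and paging helpers are hypothetical stand-ins for whatever integrations your team actually uses.

```python
import itertools
import logging

logging.basicConfig(level=logging.INFO)
alert_ids = itertools.count(1)

def notify_chat(text):   # hypothetical stand-in for a chat webhook call
    print("chat:", text)

def page_oncall(text):   # hypothetical stand-in for a paging integration
    print("page:", text)

def route_alert(severity, message):
    alert_id = next(alert_ids)
    # Log every alert with its ID so reports, KB articles, and post mortems
    # can refer back to it.
    logging.info("ALERT-%05d [%s] %s", alert_id, severity, message)
    if severity == "fyi":
        notify_chat(f"ALERT-{alert_id:05d}: {message}")
    else:
        page_oncall(f"ALERT-{alert_id:05d}: {message}")
    return alert_id

route_alert("fyi", "The backup job failed")
route_alert("urgent", "Service unavailable for all users")
```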

A checklist of everyday items will get people focused on the job at hand and prevent the inevitable issue of "but I thought…." Make sure to keep your checklists updated as systems change. If you find that the runbook/checklist provides the actual answer, then automate the runbook! Self-healing should be the first response of any alerting system in the modern age. If a human has to respond to an alert, it is more than likely already too late.

Another useful practice is to delete alerts or tune them for value. Do not be afraid of removing alerts. If an alert is being ignored, and ignoring it is not causing a system issue, evaluate why the alert was created in the first place. Is it still relevant? Is it still needed? Threshold alerts, in particular, should be reviewed frequently and with a critical eye. Just because an alert fires on a threshold does not mean the threshold is valid. If an alert triggers on disk utilization at 90% capacity, is there an underlying problem if the disk goes from zero to 90% in an unreasonably short amount of time? Is the monitoring system triggering on that sort of issue? Should it? When establishing threshold alerts, discuss and evaluate their reason for existence, but also consider other what-if scenarios, which may deserve a higher risk rating for alerting. Reducing alert fatigue will lead to more effective responses and fewer false alarms.
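One way to cover that kind of what-if scenario is to alert on the rate of change as well as on the threshold itself. The sketch below assumes utilization is sampled periodically; the threshold and growth-rate numbers are illustrative only, not recommendations.

```python
def evaluate_disk_alert(prev_pct, curr_pct, interval_minutes,
                        threshold_pct=90, max_growth_pct_per_min=1.0):
    """Return alert messages for an absolute threshold and for rapid growth."""
    alerts = []
    # Classic threshold alert: disk is (almost) full right now.
    if curr_pct >= threshold_pct:
        alerts.append(f"disk utilization at {curr_pct}% (threshold {threshold_pct}%)")
    # Rate-of-change alert: utilization is climbing unusually fast.
    growth_rate = (curr_pct - prev_pct) / interval_minutes
    if growth_rate > max_growth_pct_per_min:
        alerts.append(f"disk utilization growing at {growth_rate:.1f}%/min")
    return alerts

# A disk that jumped from 20% to 70% in ten minutes fires the growth alert
# long before the simple 90% threshold would have.
print(evaluate_disk_alert(prev_pct=20, curr_pct=70, interval_minutes=10))
```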

It should be common sense to disable or toggle alerting during a maintenance window, but more often than not, spurious alerts are generated. Again, this can lead to alert fatigue and, perhaps, the misdiagnosis that because system X is under maintenance, any alert generated by that system is related to the maintenance, when in fact it may not be and should be investigated.

This course looks at monitoring as a system requirement, and how to turn collected metrics into viable dashboards for effective data visualization and system alerting.

Title Running Time Description Persona
DevOps Metrics and Dashboards 18m 40s This DevOps tutorial explains the various DevOps metrics that need to be monitored and measured, along with the DevOps dashboards used to do so. While you develop software products, you need some mechanism, some way of measuring or validating that what you are doing meets customer expectations; that is where DevOps metrics come into play. Operations
What is Monitoring 4m 5s The dictionary definition of monitoring is to observe and check the progress or quality of something over a period of time; keep under systematic review. - In tech that something can be a service you provide, pieces of hardware and software that are part of the service, user activity, your bill process, or any other activity you can observe. Everyone
Observability in a DevOps World 4m 41s The hot new term in monitoring is observability. Nearly every DevOps monitoring company is using it in their marketing material nowadays but what does it really mean? Everyone
GitOps Part 3 - Observability 11m 59s In this post: if developers can learn to love testing, then they can learn to love user happiness in production. Observability helps achieve this. Git provides a source of truth for the desired state of the system, and Observability provides a source of truth for the actual production state of the running system. In GitOps we use both to manage our applications. Both
Monitoring: What does it all mean? 3m 11s Monitoring is a magic and it’s hard. It’s a data science that requires logical rigor. If you think taking statistics in school was a waste, think again. - We’ve found that using established engineering definitions for our monitoring work helps us to keep straight about what’s going on within our systems. Monitoring Operations
Monitoring: Math is required 3m 53s Understanding your exact metrics and what they mean is a big part of monitoring. - As Mark Twain said, “Facts are stubborn things, but statistics are pliable.” Monitoring Operations
Modeling your system 4m 55s There is a saying in Bulgaria, the wolf changes its coats, but not its habits. Complex systems may be unique, but they all obey the same rules. With this in mind, let’s explore some ways to model your monitoring system effectively. Monitoring Operations
Increasing Deployment Safety with Metric Based Canaries 9m 20s Key KPIs can be adapted for measuring deployment success: in particular, deployment time, or how long it takes to do a deployment from beginning to successful end. Another metric is the number of failed deployments, regardless of cause. You can use canary deployments to gather additional metrics related to the deployment process. A canary is a deployment process in which a change is partially rolled out and then evaluated against the current deployment. Deployment Operations
Chapter 6: Monitoring Distributed Systems 21m 14s Monitoring and alerting enables a system to tell us when it’s broken, or perhaps to tell us what’s about to break. When the system isn’t able to automatically fix itself, we want a human to investigate the alert, determine if there’s a real problem at hand, mitigate the problem, and determine the root cause of the problem. System monitoring is also helpful in supplying raw input into business analytics and in facilitating analysis of security breaches. Management/Monitoring Operations
Chapter 4: Monitoring (SRE Workbook) 22m 18s Monitoring can include many types of data, including metrics, text logging, structured event logging, distributed tracing, and event introspection. While all of these approaches are useful in their own right, this chapter mostly addresses metrics and structured logging. In our experience, these two data sources are best suited to SRE’s fundamental monitoring needs. At the most basic level, monitoring allows you to gain visibility into a system, which is a core requirement for judging service health and diagnosing your service when things go wrong. Operations
Code Coverage 9m 20s In computer science, test coverage is a measure used to describe the degree to which the source code of a program is executed when a particular test suite runs. A program with high test coverage, measured as a percentage, has had more of its source code executed during testing, which suggests it has a lower chance of containing undetected software bugs compared to a program with low test coverage.[1][2] Many different metrics can be used to calculate test coverage; some of the most basic are the percentage of program subroutines and the percentage of program statements called during execution of the test suite. Operations Management
About Code Coverage 4m 54s Code coverage is the percentage of code which is covered by automated tests. Code coverage measurement simply determines which statements in a body of code have been executed through a test run, and which statements have not. In general, a code coverage system collects information about the running program and then combines that with source information to generate a report on the test suite’s code coverage. Operations Management
What I Wish I Knew before Going On-call 1h 25m 28s Firefighting a broken system is time-sensitive and stressful but becomes even more challenging as teams and systems evolve. As an on-call engineer, scaling processes among humans is an important problem to solve. How do we ramp up new engineers effectively? How can we bring existing on-call engineers up to speed? In this workshop we’ll share common myths among new on-call engineers and the Do’s and Don’ts of on-call onboarding, as well as run through hands-on activities you can take back to work and apply directly to your own on-call processes. Both

Types of Monitoring Instrumentation

For convenience, monitoring instrumentation is broken out from monitoring operations. The act of instrumenting an application generally falls to the development team, while instrumenting a system is typically the responsibility of the operations or infrastructure team. Both groups will learn something from this series.
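As an example of what application-side instrumentation can look like, here is a small sketch using the Python prometheus_client library (assumed to be installed; the metric names and port are illustrative). The development team adds the counters and histograms, and the operations team scrapes the exposed /metrics endpoint.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled")
LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

def handle_request():
    REQUESTS.inc()                             # count every request
    with LATENCY.time():                       # record how long the work took
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for the monitoring system to scrape
    while True:
        handle_request()
```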

Title Running Time Description Persona
Instrumentation Is About Making People Awesome 3m 12s The nuts and bolts of metrics, events, and logs are really interesting to me. So interesting, perhaps that I get mired in these technical bits. I keep thinking of ways to process more data, allow for more fields or finer precision. I think about this so much that I drift in to worrying more about the work than the outcome. Both
Making Instrumentation Extensible 6m 21s Observability-driven development requires both rich query capabilities and sufficient instrumentation in order to capture the nuances of developers’ intention and useful dimensions of cardinality. When our systems are running in containers, we need an equivalent to our local debugging tools that is as easy to use as Printf and as powerful as gdb. We should empower developers to write instrumentation by ensuring that it’s easy to add context to our data, and requires little maintenance work to add or replace telemetry providers after the fact. Instead of thinking about individual counters or log lines in isolation, we need to consider how the telemetry we might want to transmit fits into a wider whole. Both
Instrumentation: The First Four Things You Measure 2m 56s This is the very basic outline of instrumentation for services. The idea is to be able to quickly identify which components are affected and/or responsible for All The Things Being Broken. The purpose isn’t to shift blame, it’s just to have all the information you need to see who’s involved and what’s probably happening in the outage. Developer Management
Instrumentation: What does ‘uptime’ mean? 4m 33s Everybody talks about uptime, and any SLA you have probably guarantees some degree of availability. But what does it really mean, and how do you measure it? Both
Instrumentation: Worst case performance matters 2m 52s After a few false starts and blaming the network as per SOP[2], we decided to take a look at sample-based CPU profiling, which confirmed that 15% of time was going to user data record decompression – largely in line with our past experience. User data records include a long list of “segments.” These are used for targeted advertising, like “people who want to buy a plane ticket to Europe soon;” this adds up to a lot of data, so it’s stored in a proprietary compressed format. Developer Management
Instrumentation: Measuring Capacity Through Utilization 3m 54s One of my favorite concepts when thinking about instrumenting a system to understand its overall performance and capacity is what I call “time utilization”. Both
Our monitoring system 1m 17s Optional Let’s do an overview of the system and application we’re going to use throughout our demos. Monitoring Operations
Synthetic monitoring: Is it up? 5m 3s Synthetic monitoring dates back to the prehistoric era, when cavemen threw rocks and poked things with sticks. It’s a simple and reliable way to tell if something’s alive or not. And one of the most basic things we need our monitoring to tell us is whether our service is even up. Management/Monitoring Operations/Development
Synthetic monitoring in action 6m 34s Optional Using Pingdom. Monitoring Operations/Development
End user monitoring: What do users see? 5m 8s Even though our servers and apps are up, real user experience can vary by geolocation, browser, and diverse input from real users. But is user experience really a DevOps concern? Management/Monitoring Operations/Development
End user monitoring instrumentation 7m 18s In this segment we’ll cover how to capture real user data, and what are useful things to measure about the user experience. Management/Monitoring Operations/Development
End user monitoring in action 7m 8s Now that our end user sessions are instrumented let’s dive into the end user monitoring data and review a few key things to measure. Everyone
System monitoring: See the box 4m 53s System monitoring is where a lot of sys admins are tempted to start, with the almighty CPU and memory graph Everyone
System monitoring in action 8m 22s Optional System Monitoring with Datadog Monitoring Operations
Network monitoring 5m 35s Modern systems are heavily interconnected and without network visibility, we’re blind to communications related issues, which are pretty common. Management/Monitoring Operations/Development
Software metrics: What’s that doing? 3m 22s While system stats are all well and good, the systems are there to run software. And most software surfaces its own metrics more willingly than what can be extracted from it at the OS level. Development/Monitoring Operations
Software metrics in action 6m 28s A review of systems, functions, and the metrics they transmit. Monitoring Operations
Application monitoring 5m 18s I believe it’s a Bulgarian proverb that says, “If an application falls over, and no one monitors it, does it make a sound?” Yup. It’s the sound of your business screeching to a halt. Development/Monitoring Operations
Application monitoring in action 9m 8s So let’s take a look at some basic app performance analysis and monitoring techniques. Development/Monitoring Operations
Log monitoring 5m 44s We generally think of logs as containing events, rather than metrics. Events can carry a lot more information. - You can also emit metrics into a log file and use it as a slightly less efficient ingestion channel. Monitoring Operations
Log monitoring in action 5m 50s Log monitoring with Splunk SaaS Everyone

Monitoring Technique

Monitoring is a skill that has to be learned. This series of videos will expose teams to proper technique and provide a brush-up for those who already have practical experience.

Title Running Time Description Persona
Implementing monitoring 4m 46s Let’s talk about implementing monitoring the pragmatic way. We should start with a Bulgarian proverb. A united band can lift a mountain. So in the spirit of dev ops, we’ll focus on people first. The goal of all these tools is to assist human operators in ensuring that a service is functioning. What type of skills and culture do we need to build and organize a monitoring practice? Management/Monitoring Operations
Using monitors: Visualization 4m 31s Alright, now you’ve implemented all kinds of great monitoring instrumentation, how do you use it? There are several common ways of consuming monitoring data, graphs and alerts I’m sure come immediately to mind. Let’s talk visualization. While most monitoring tools will show you loads of graphs, that may not be the best way to get the job done. Monitoring Operations
Using monitors: Alerting 5m 10s You don’t want to just sit there gathering data and thinking about it. You want it to summon you to action, and that’s where alerting comes in. On the other hand, to many people, monitoring is often misused to just mean the sending of alerts. And while proper monitoring is much more than that, good quality alerts are a very important part of life. Monitoring Operations/Response Operations/Development
Monitoring challenges 5m 14s Let’s address the system, human, and tools obstacles you’ll encounter on your observability journey. This is the epoch of the observer, the observed, and the observatory. Let’s start with the observed, or the system we’re trying to monitor. Monitoring Operations

Thread Tracing

Thread tracing is not accomplished by a download or a program. Distributed tracing requires that software developers add instrumentation to the code of an application, or to the frameworks used in the application. Distributed tracing, also called distributed request tracing, is a method used to profile and monitor applications, especially those built using a microservices architecture. Distributed tracing helps pinpoint where failures occur and what causes poor performance.
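To show what "adding instrumentation to the code" can look like in practice, here is a minimal sketch using the OpenTelemetry Python API and SDK (assumed to be installed; the span and attribute names are illustrative). The same instrumentation pattern applies whether spans are exported to Zipkin, Jaeger, or, as here, simply printed to the console.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Configure a tracer provider that prints finished spans to the console.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def lookup_user(user_id):
    # Each unit of work becomes a span; spans from different services are
    # stitched into one trace by propagating the trace context between them.
    with tracer.start_as_current_span("lookup_user") as span:
        span.set_attribute("user.id", user_id)
        with tracer.start_as_current_span("db.query"):
            return {"id": user_id, "name": "example"}

lookup_user("42")
```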

Title Running Time Description Persona
Tracing for Observability 1m 20s You have implemented a microservices architecture to scale better. But now it’s hard to diagnose problems because of the additional complexity. This course will cover the techniques required to debug, and find problems. Development Manager
Why do we need distributed tracing now? 4m 2s Every minute of performance degradation or downtime in your applications could cost your business millions of dollars. Weaving the right diagnostic capabilities into your apps is key to allowing rapid triage of problems. This course will discuss what distributed tracing is and why it’s a critical part of system observability. Distributed tracing is a relatively new technique compared to the canonical logs and application KPI metrics that most people use for monitoring. Distributed tracing is becoming a necessity nowadays with the increasing complexity of apps built on top of middleware and microservices. Development Manager
Tracing libraries and agents 5m 6s An external library loads with your app and introspects the code at run time. The agent discovers the code executing in a process, with the ability to configure what calls are recorded and how. It determines what calls to record and what metadata to extract. It propagates context for distributed calls. There’s no open standard, so different technologies implement these three steps differently. Development Manager
Overview of Zipkin and Jaeger 4m 3s Zipkin and Jaeger are the two leading open projects on distributed tracing. Development Manager
Persisting trace data for long term analysis 3m 49s Setting up Zipkin and Jaeger with in memory storage is an easy way to get started. In production environments, it is common to configure the Zipkin client to receive data from Kafka and persist into a Cassandra cluster. Here are the steps to configure the Zipkin client with Kafka and Cassandra. The instructions assume you already have the Kafka broker and Cassandra clusters set up. Development Manager
Integrating trace data with the rest of your monitoring systems 2m 56s Trace data is a great source for diagnostics and will augment your monitoring capability. As adoption of distributed tracing grows in your organization, you may end up with a variety of instrumentation points, because different projects have different requirements. Development Manager

Data Visualization and Data Driven Decision Making

Data visualization is a critical skill to master, and not one that is commonly taught. Anyone can make a pie chart, but making an effective pie chart is a skill that has to be learned over time. Before we throw numbers on a spreadsheet, let’s look at how to do it effectively. This track is foundational for anyone who has to develop or maintain dashboards. Dashboard consumers may find some of the sections interesting, either for conversations with dashboard maintainers or for other data presentation aspects of their work. We will review how data should be visualized, explain how visualization can enhance monitoring and improve response to issues, and show how to improve data visualization overall.

Title Running Time Description Persona
Data Visualization (The Quick Course) 5m 30s Data visualizations help you understand analytics, what’s working and what isn’t working. This video focuses on the impact of marketing, but is a good overview of how to make an impact with visual data. Managers/Monitoring Operations
Visibility Drives Data-Driven Decision-Making 4m 22s A common data fabric provides the objective measurement and shared visibility critical for a data-driven DevOps approach. With comprehensive and continuous visibility into key performance measures, provided through a shared data fabric, DevOps teams can isolate “waste,” detect and correct slowdowns, and deliver applications faster. Correlate test and QA outcomes to find more problems sooner and improve code quality. React faster to detect and address problems that do get through to production, and use real-time insight to measure business impact and iterate faster on good changes. Manager
Straighten the Flow of DevOps with Data 7m 4s DevOps practitioners are finding more in common with the cowboys of the Old West than modern-day, process-obsessed enterprise architects. These DevOps practitioners or the “shoot from the hip” cowboys of IT, base their decisions on speed, gut-feelings and output, which has been largely successful thus far. Management
Actionable Metrics for Data-driven Decision Making 1m 29s Whether your IT organization operates like a well-oiled machine or you’ve got some inefficiencies, the same holds true: what you don’t measure, you can’t manage. But in software development, what do you measure; how often; and how do you synthesize the data into something that can move your team forward? Management
Using Monitors: Visualization 4m 31s You always have to start with asking yourself, what is the goal of a specific visualization? Is it detecting issues, troubleshooting issues, capacity planning, SLA reporting, and who’s the consumer? Is it for you, for other engineers, business users, customers? You need to display the same metrics in fundamentally different ways, depending on what it is you’re trying to show. Monitoring Operations
How to design and build a great dashboard 5m 49s You don’t need to be a designer to build a dashboard that clearly communicates your key goals and metrics. Whether you’re just getting started or have a dashboard in need of a rethink, our checklist will help you achieve the results you’re after. Monitoring Operations
How Dashboards Are Changing Human Behavior in DevOps 8m 4s What is a dashboard? Something readily visible to everyone that needs it that shows how something is functioning or progressing. When used right — whether on a car, aircraft or DevOps team — a dashboard is simple to read at a glance and a very powerful way to know you’re heading in the right direction. Monitoring Operations
The Difference Between Capacity and Scalability Planning 3m 7s The reason we invest time in capacity and scalability planning is simple: We want to make sure that system resources such as compute, storage and memory are not the cause of an application outage. It is a response to expected and unexpected increases in application usage, as well as the steady growth of application adoption. Development Management
Data Driven Decision Making 18m 04s How do you know if you are making the right decisions on your product or project? How do you know that you’re measuring the correct metrics to determine success? In this talk, two Cloud Foundry Product Managers will describe how to design the correct measurable indicators to build an understanding of your product, how to monitor those metrics over time, and how to build feedback loops to course-correct and determine the success of short-term initiatives and features. Management

Data Visualization Tips and Tricks Course

Once you understand how vital data visualization is, actually visualizing the data is another skill that has to be learned. This series of videos covers data visualization in general, not just monitoring data. This series is useful whether you have to build dashboards or include charts in other business-related reports.

Title Running Time Description Persona
The Rules of Effective Data Visualization 2h 41m There are many tools out there that allow you to create beautiful, interactive data visualizations that help you to see and understand your data. Make the right choice, and the visualization can enhance the understanding of your data. The entire course is recommended for all who have to visualize data. Managers/Monitoring Operations
Why visualize data? 6m 34s Everybody’s used to summary statistics, things like what’s the average height of a class, what’s the average sales per customer. These numbers can be a useful summary of a data set but they can also hide detail within the data. There is a danger that relying on summaries can lead to misleading or even incorrect answers. Management/Operations
What kind of visualization should you make? 3m 50s There are many ways to visualize your data, and the choice is going to depend largely on your data and what you’re trying to achieve; that is always going to be the starting point. No matter what data you have and what visualization you decide to use, there are some general guidelines that will stand you in good stead. Broadly, there are two kinds of data visualization: exploratory dashboards and infographics. Monitoring Operations/Dashboard Builders
Visualize comparisons in data 10m 10s There are many ways to compare data, depending on the kind of data we have and the type of questions we are asked. Three great ways to compare data sets are bar charts, line charts, and highlight tables. Monitoring Operations
Bar charts across categories 2m 46s When it comes to comparing categorical data, it’s really hard to beat the bar chart. Although it’s a very simple chart and often used, it’s one of the best ways of visualizing data. Monitoring Operations/Dashboard Builders
Line charts over time 4m 52s For comparing changes over time the line chart is the best choice. This works by plotting the date on the x-axis and one or more measures that you’re interested in on the y-axis. So let’s look at this in action. Monitoring Operations/Dashboard Builders
Spark lines for important events 4m 48s Most of the time, when we’re visualizing data, we’re interested in details within our data set. However, sometimes we need a higher-level overview of what’s going on. Sometimes too much information’s just too much. Monitoring Operations/Dashboard Builders
Gantt charts and time difference 4m 39s Durations and time are among the trickiest things to visualize, especially when we’re trying to compare durations across a dimension. Typically we’re looking to see how long a process took compared to all the other processes. Now this could be achieved by doing some simple date difference calculations. Monitoring Operations/Dashboard Builders
Tree maps for long-tail data 5m 6s One of the reasons why visualizing data is so effective is that it enables us to look at large volumes of data that would have been impossible to analyze in a traditional spreadsheet. We can see trends, outliers, and we can compare measures across a range of dimensions. Monitoring Operations/Dashboard Builders
Highlight tables and heat maps 5m 58s At first glance a table might not look like an effective data visualization. In fact many people wonder whether it’s one at all. However, under the right circumstances a table can be a really useful vis type. The reason is, it gives us immediate access to the underlying data. Monitoring Operations/Dashboard Builders
Slope charts for change between dates 3m 12s Line charts are a really effective way of visualizing time-based data. In this example, we’re looking at the amount of sales per month across all of our products. We can see that, overall, sales look like they are increasing, as the general trend is going up, but there is a lot of clutter in between. Monitoring Operations/Dashboard Builders
Optimize dashboard layout with small multiples 3m 31s It’s often tempting, when we’re trying to compare data that changes across dimensions, to keep adding them into a single view, using things like color, shape, and size to encode extra information. However, sometimes this can make for a very unattractive view that is also very difficult to understand. Monitoring Operations/Dashboard Builders
Visualize relationships in data 6m 24s Analyzing relationships within a data set enables us to look for patterns and clusters. How does an increase in one value affect the others? Let’s see this in action. A good data vis always starts with a question. Monitoring Operations/Dashboard Builders
Compare multiple variables within scatter plots 5m 27s One of the best ways of visualizing the relationship between measures is to use a scatter plot. A scatter plot works by looking at the relationship between two different measures and subdividing it by one or more different categories or dimensions. Monitoring Operations/Dashboard Builders
Visualize data distributions 5m 39s When we talk about distributions in terms of data, what we usually mean is how many times a value appears or its frequency. One of the most common ways to visualize a distribution is with a histogram. A histogram is a plot that lets you show the underlying frequency distribution or shape of a continuous data set. Monitoring Operations/Dashboard Builders
Histograms for a single measure 5m 17s When it comes to looking at distributions in our data, what we’re looking at is actually the frequency of a particular value as it appears in our data set. Now a really common and good way of doing that is with a histogram, which is basically a kind of modified bar chart. Monitoring Operations/Dashboard Builders
Box plots for multiple dimensions 6m 29s A really common way of visualizing a distribution is using a histogram. Histograms are great for showing an overview, but they’re only really good at showing a single dimension; we can’t look at multiple dimensions and how they affect our distribution. If we want to look at a distribution across multiple dimensions, and in a bit more detail, then we can utilize a box plot. Monitoring Operations/Dashboard Builders
Visualize data composition 4m 19s When it comes to looking at the composition of a data set there’s one visualization that is used more than any other. And it’s probably the most misused one as well. And that’s the pie chart. Now it’s fair to say there’s not a lot of love for pie charts. It’s been called a bad data visualization. I don’t think that’s true. Pie charts are not bad, bad pie charts are bad. Monitoring Operations/Dashboard Builders
Improve the use of pie charts 5m 44s Of all the ways to visualize the composition of a data set, the pie chart is probably the most commonly used. But it’s also one of the most abused visualization types. When used correctly, it can be effective, but far too often it gets misused. So let’s look at some examples of both good and bad pie charts and some of the things you should avoid if you are going to use them. Monitoring Operations/Dashboard Builders
Stacked bar charts 4m 53s The default way of showing the composition of a data set, tends to be the pie chart. However, there are some big problems when we use pie charts to compare changes across different categories. Let’s look at an example of this in action then see how the stacked bar chart is a much better option. Monitoring Operations/Dashboard Builders
100% stacked bars 5m 9s Although the pie chart is a common way of visualizing the composition of a dataset, it doesn’t work too well if we’re comparing across multiple categories. For example, this pie chart is looking at the breakdown of sales according to the segments of our customers, so we can see that the consumer segment accounts for pretty much half of all of our sales, but what if we wanted to look at, say, regional variation. Monitoring Operations/Dashboard Builders
Stacked area chart 3m 26s One of the ways of visualizing changes in composition over time is to use a stacked area chart. Now, at first glance, a stacked area chart looks very similar to a line chart, but there are some important differences to understand between the two. Monitoring Operations/Dashboard Builders
100% stacked area chart 2m 24s When we want to look at changes over time, we usually use a line chart; however, in some cases, a 100% stacked area chart can be a better choice if we’re looking at part-to-whole relationships and how each individual element contributes to the whole. Monitoring Operations/Dashboard Builders
Visualize geographic data 9m 16s Whenever people have geographic values in their data, there’s a big temptation to create a map. Maps are a great-looking visualization, but are they always the right choice for our data? In some cases, yes, it’s absolutely the right decision. But depending on the data and the question we’re asking, it can also be the wrong visualization. Monitoring Operations/Dashboard Builders
When to map geographic data 4m 21s There’s a temptation, whenever we have geographic data, to put it into a map. In some cases that can be a really good choice and can enhance your visualization, but other times it can actually be a hindrance to understanding what’s going on in the data. Monitoring Operations/Dashboard Builders
Compare filled maps and symbol maps 2m 21s Once you’ve decided you’re going to make a map, the next choice is what kind of map you’re going to use. Commonly, there’s either a filled map or a symbol map. Both of them have their pros and cons and which one you decide to use largely depends on both the data and the kind of question that you’re asking. Monitoring Operations/Dashboard Builders
Alternative techniques with tile maps 4m 33s There are two common ways of mapping data. First is the filled map, and second a symbol map. Both of them have their pros and cons. There’s a third map type that isn’t quite as common, but can be really useful, and that’s a tiled map. Monitoring Operations/Dashboard Builders

Grafana

This video series looks at how Grafana, a platform for monitoring and metric analysis, can be used to create and analyze beautiful graphs and other visualizations of your systems, providing many of the features of proprietary monitoring solutions without the cost. With these skills in your toolkit, you will walk away with everything you need to deepen your understanding of your systems and their business value.

Title Running Time Description Persona
Grafana basics 3m 34s Grafana basics. In this video, we will take a first look at Grafana, a platform for monitoring and metric analysis, and preview the dashboards and visualizations we will build throughout this series. Monitoring Operations
Installing Grafana 2m 23s Optional Installing Grafana. In this video, we will install Grafana on our Ubuntu 16.04 virtual machine. Assuming we are logged into our machine, we will first add a key to our apt installation’s list of trusted keys, which will allow us to download the Grafana package. Operations
Grafana security basics 3m 21s Optional Grafana Security Basics. Once Grafana is installed we will want to implement a few security measures to our setup right away. This includes changing the administrative login password and ensuring that new account registration and anonymous access are turned off. Operations
Adding data sources 4m 13s Adding Data Sources. In order to visualize metrics, Grafana will need to get data from a storage back end, which it refers to as a data source. In this video, we’ll cover how to add our first data source (a scripted alternative using Grafana’s HTTP API appears after this table). Monitoring Operations
Creating dashboards 4m 37s Creating dashboards. With our first data source added it’s time to explore creating dashboards. If you are following along and used Docker Compose in the previous video, you can terminate that setup now using the command docker-compose down and hit enter. Monitoring Operations
Additional dashboard configurations 3m 42s Additional dashboard configurations. In this session, we’ll continue configuring our dashboard. For the panel we’re on, we may also want to specify the units associated with our graph. For example, a duration such as milliseconds, or a percent from zero to 100. Monitoring Operations
Deep dive: Grafana panel types 4m 54s Deep dive: Grafana panel types. In this section we will review the basic panel types offered by Grafana in greater detail. As described in previous videos, Grafana offers seven panel types: graph, single-stat, dashboard list, table, text block, heatmap, and alert list. Monitoring Operations
High-availability Grafana 3m 35s Optional High-availability Grafana. In this video, we will examine the configuration changes that are necessary if we want to create a highly available Grafana infrastructure. This includes configuring Grafana to use an external database so that multiple instances of Grafana can use the same database as well as managing user sessions appropriately. Operations

  1. This type of alert should cause heads to pop up and people to move immediately to fix the problem. A wind-up air-raid siren sound is a great alert tone for this sort of issue.  ↩