Key performance indicators (KPIs) help determine how well or how poorly a practice or initiative is performing. Unlike the metrics that support (or refute) a hypothesis, KPIs are the hypothesis, and they are ideally defined at the initiation of the project. KPIs are statements that concentrate on business goals and the impact of improvements to the value stream.
Good KPIs should not be made in a vacuum. They should represent the goals of the business as well as of the teams, and they should be prominently displayed for all to see. But what makes a good KPI, and how is one established? How many should each team have? What should be included or excluded? Are KPIs different from goals?
One key point to remember is that people tend to change their behavior when they are measured, typically by finding the shortest path to meeting the requirement. Sadly, this has unintended side effects. KPIs are not just for management. When setting goals, the aim is to drive toward the desired change while minimizing the unintended consequences. From a management perspective, two of the most important missions are setting priorities and providing the resources required to accomplish those priorities. KPIs are then a crucial way to verify that the priorities are being achieved.
KPIs go by different names. Informally they can be referred to as metrics, which may be accurate, but they represent an entirely different type of metric than the ones commonly discussed. When used to examine individual performance, KPIs are sometimes referred to as OKRs (Objectives and Key Results), a term popularized by the venture capitalist John Doerr.
Title | Running Time | Description | Persona |
---|---|---|---|
Measuring DevOps: The Key Metric That Matters | 29m 31s | Having the right goals, asking the right questions, and learning by doing are paramount to achieving success with DevOps. Having specific milestones and shared KPIs play a critical role in guiding your DevOps adoption and lead to continuous improvement—toward realizing true agility, improved quality, and faster time to market throughout your organization. This session will walk you through a practical framework for implementing measurement and tracking of your DevOps efforts and software delivery performance that will provide you with data you can act on! | Management |
DevOps Quality Metrics that Matter | 48m 35s | The way that we develop and deliver software has changed dramatically in the past 5 years—but the metrics we use to measure quality remain largely the same. Every other aspect of application delivery has been scrutinized and optimized as we transform our processes for DevOps. Why not put quality metrics under the microscope as well? Tricentis commissioned Forrester to research the topic. Forrester analyzed how DevOps leaders use and value 94 common quality metrics—then identified which metrics matter most for DevOps success. | Management |
If You Don’t Know Where You’re Going, It Doesn’t Matter How Fast You Get There | 33m 30s | The best-performing organizations have the highest quality, throughput, and reliability while also delivering value. They are able to achieve this by focusing on a few key measurement principles, which Nicole and Jez will outline in this talk. These include knowing your outcome and measuring it, capturing metrics in tension, and collecting complementary measures… along with a few others. Nicole and Jez explain the importance of knowing how (and what) to measure—ensuring you catch successes and failures when they first show up, not just when they’re epic, so you can course correct rapidly. Measuring progress lets you focus on what’s important and helps you communicate this progress to peers, leaders, and stakeholders, and arms you for important conversations around targets such as SLOs. Great outcomes don’t realize themselves, after all, and having the right metrics gives us the data we need to be great SREs and move performance in the right direction. | Management |
Identifying key performance indicators | 3m 35s | It is important, in business and in our application development, that we have a clear set of measurable and obtainable objectives. These items we call key performance indicators, or KPIs, and they should be ways we can measure the effectiveness and success of our applications. | Management |
Oracle BI 11g: Scorecarding and Strategy Management | ?? | This course teaches you how to create and use KPIs and scorecards, which are components of the Business Intelligence Foundation Suite, a complete, open, and integrated solution for all enterprise business intelligence needs, including reporting, ad hoc queries, OLAP, dashboards, and scorecards. | Management |
KPIs What They Are and Why Your Organization Needs Them | 32m 24s | In this 30-minute session, we will review some of the fundamental questions that you need to address in order to get started. | Management |
Chapter 19: Creating KPIs (Practice of Cloud System Administration) | 27m 0s | Setting KPIs is quite possibly the most important thing that a manager does. It is often said that a manager has two responsibilities: setting priorities and providing the resources to get those priorities done. Setting KPIs is an important way to verify that those priorities are being met. The effectiveness of the KPI itself must be evaluated by making measurements before and after introducing it and then observing the differences. This changes management from a loose set of guesses into a set of scientific methods. We measure the quality of our system, set or change policies, and then measure again to see their effect. This is more difficult than it sounds. | Management |
12 DevOps KPIs you should track to gauge improvement | 6m 0s | It’s no small task to transform an IT organization to integrate development, operations and quality assurance teams. A DevOps methodology requires team and process changes and then, once everything is in place, the onus is on IT to create DevOps KPIs and measure the outcomes. | Everyone |
How to Use Value Stream Mapping in DevOps | 17m 37s | At a high level, there are some basic metrics that will be collected as part of the Value Stream Mapping exercise. Value added (VA): Value added time is the amount of time that a team actually spends working on the project (as opposed to, for example, the time that a project or request sits in the queue). Whenever there is no change in the product, it is considered non-value added time. Lead time (LT): Lead time represents the total time it takes a person or team to complete a task—it is the combination of value added and non-value added. % Complete/accurate (%C/A): This is the percentage of information-based work that is complete and accurate the first time and requires no re-work by downstream processes. There are numerous other metrics that can be collected for a VSM exercise, and they will depend on the value stream being mapped. | Management |
Service level indicators (SLIs) are, among other things, the critical measure of a system’s availability. Service level objectives (SLOs), in turn, can be considered the goals we set for how much availability we can expect out of a system. Together, these two values help engineering teams make better decisions. They provide critical information about how hard you can push your system and whether code improvements have the desired effects.
Today’s systems are complex, with hundreds, if not thousands, of nodes comprising everything from databases to web servers. With so many nodes, the idea of a system boundary becomes blurred, making it even more imperative to measure the performance of individual components, whether they are physical or virtual.
Before you can apply the concept of SLIs to a system, you need a plain-language definition of availability and a description of system boundaries that everyone can agree with. Remember that these definitions will change over time as new systems arrive and old systems are retired, and as business needs and operational realities change. Once arrived at, SLIs become broad proxies for availability and are metrics that will help determine the health of an active system. Most services consider request latency—how long it takes to return a response to a request—as a key SLI. Other common SLIs include the error rate, often expressed as a fraction of all requests received, and system throughput, typically measured in requests per second. The measurements are often aggregated: i.e., raw data is collected over a measurement window and then turned into a rate, average, or percentile. Ideally, the SLI directly measures a service level of interest, but sometimes only a proxy is available because the desired measure may be hard to obtain or interpret. For example, client-side latency is often the more user-relevant metric, but it might only be possible to measure latency at the server.
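To make the aggregation step concrete, here is a minimal sketch in Python of turning the raw requests from one measurement window into the SLIs mentioned above: error rate as a fraction of requests, an aggregated latency percentile, and throughput in requests per second. The field names and the 60-second window are assumptions for illustration, not part of any particular monitoring product.

```python
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code returned to the caller

def compute_slis(window: list[Request], window_seconds: float = 60.0) -> dict:
    """Aggregate the raw requests from one measurement window into SLI values."""
    total = len(window)
    errors = sum(1 for r in window if r.status >= 500)
    latencies = sorted(r.latency_ms for r in window)
    # Simple nearest-rank approximation of the 99th-percentile latency.
    p99 = latencies[min(total - 1, int(0.99 * total))]
    return {
        "error_rate": errors / total,          # fraction of requests that failed
        "latency_p99_ms": p99,                 # aggregated latency SLI
        "throughput_rps": total / window_seconds,
    }

# A tiny synthetic window; in practice the data would come from server-side logs,
# which (as noted above) may only approximate the latency users actually experienced.
window = [Request(95, 200), Request(120, 200), Request(480, 500), Request(150, 200)]
print(compute_slis(window))
```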
An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound. Choosing an appropriate SLO is complex. To begin with, you don’t always get to choose its value! For incoming HTTP requests from the outside world to your service, the queries per second (QPS) metric is primarily determined by the desires of your users, and you cannot set an SLO for that. Choosing and publishing SLOs to users sets expectations about how a service will perform. This strategy can reduce unfounded complaints to service owners that, for example, the service is slow. Without an explicit SLO, users often develop their own beliefs about desired performance, which may be unrelated to the views held by the people designing and operating the service.
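Expressed in code, the lower bound ≤ SLI ≤ upper bound structure is simply a comparison against the published targets. The sketch below reuses the hypothetical SLI names from the previous example and checks two illustrative objectives (a 0.1% error budget and a 400 ms latency target); the numbers are assumptions, not recommendations.

```python
def slo_met(sli_value: float,
            lower_bound: float = float("-inf"),
            upper_bound: float = float("inf")) -> bool:
    """An SLO has the natural structure: lower bound <= SLI <= upper bound."""
    return lower_bound <= sli_value <= upper_bound

slis = {"error_rate": 0.0007, "latency_p99_ms": 310.0}

# Hypothetical published objectives: at most 0.1% of requests may fail,
# and 99th-percentile latency must stay under 400 ms.
print(slo_met(slis["error_rate"], upper_bound=0.001))      # True
print(slo_met(slis["latency_p99_ms"], upper_bound=400.0))  # True
```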
This track will discuss various SLIs, how to determine SLOs, and practical examples to better cement the idea of how the two work together.
Title | Running Time | Description | Persona |
---|---|---|---|
Building a Culture of Metrics | 3m 19s | This video explains how to select the right metrics | Everyone |
SLAs and SLOs | 5m 21s | This video explains what SLAs and SLOs are and why these are important for business needs, with examples. | Everyone |
Defining reliability | 2m 34s | Let’s start by defining the basic building block for measuring the reliability of a service, the service-level indicator. A service-level indicator, or SLI for short is an indicator of the level of service you provide via your service, ideally expressed as a ratio of two numbers. | Monitoring Operations |
Implementing measurements | 2m 6s | Once you have determined what about your service makes users happy and have a specific assessment in mind for that behavior, you then need to find a concrete way to implement the assessment. This concrete measurement of an SLI specification is referred to as an SLI implementation. SLIs should be specific and measurable. | Monitoring Operations |
Common measurements | 4m 19s | One common way to get started with service level indicators is to think about your system abstractly and segment it into pieces based on common component types. Typically, components of a system will fall into one of three categories; one, request-driven systems also referred to as user-facing serving systems; two, big data systems also referred to as data processing pipelines; and three, storage systems. | Monitoring Operations |
Measurement and calculation | 4m 8s | The most important question you can ask to generate SLIs is what do your users care about? That said, often what users care about is difficult or impossible to measure. So, you’ll need to think critically about how to approximate the users’ needs. | Monitoring Operations |
Objectives vs. indicators | 2m 59s | A service-level objective builds off of a service-level indicator to provide a target level of reliability for a service as customers. It takes the SLI, and adds both a threshold and a time window, making it a metric that can be evaluated at a set cadence. Setting service-level objectives frames service-performance expectations. | Monitoring Operations |
Making measurements meaningful | 3m 57s | Once you have a high quality service level indicator, there are two things needed to turn it into a service level objective. A time window, and a threshold. | Monitoring Operations |
Documenting SLOs | 1m 10s | Service level objectives that are defined and have received stakeholder buy in should be properly documented in a centralized easy to update space where other teams and stakeholders can review them. | Monitoring Operations |
25 Examples of a Service Level Objective | 5m 3s | A service level objective is a criterion used to evaluate the performance of a business or technology service. In many cases, service level objectives are specified in a contract such as a master service agreement. This article contains common examples of service level objectives. | Monitoring Operations |
Service Level Indicators in Practice | 8m 0s | Service level indicators are literally the most important piece you need in order to apply SRE principles. Even if you think you have them, they might not be high quality enough for you to accurately gauge your customer’s experience, and they will mislead you. | Management/Monitoring Operations |
SLOs & You | 12m 30s | Whether you’re just getting started with DevOps or you’re a seasoned pro, goals are critical to your growth and success. They indicate an endpoint, describe a purpose, or more simply, define success. But how do you ensure you’re on the right track to achieve your goals? | Monitoring Operations |
Setting SLOs and SLIs in the Real World | 38m 36s | Clearly defined and well-measured Service Level Objectives (SLOs) and Service Level Indicators (SLIs) are a key pillar of any reliability program. SLOs allow organizations and teams to make smart, data-driven decisions about risk and the right balance of investment between reliability and product velocity. | Monitoring Operations |
Real World SLOs and SLIs: A Deep Dive | 37m 02s | If you’ve read almost anything about SRE best practices, you’ve probably come across the idea that clearly defined and well-measured Service Level Objectives (SLOs) and Service Level Indicators (SLIs) are a key pillar of any reliability program. SLOs allow organizations and teams to make smart, data-driven decisions about risk and the right balance of investment between reliability and product velocity. But in the real world, SLOs and SLIs can be challenging to define and implement. In this talk, we’ll dive into the nitty-gritty of how to define SLOs that support different reliability strategies and modalities of service failure. We’ll start by looking at key questions to consider when defining what “reliability” means for your organization and platform. Then we’ll dig into how those choices translate into specific SLI/SLO measurement strategies in the context of different architectures (for example, hard-sharded vs. stateless random-workload systems) and availability goals. | Management |
Latency SLOs Done Right | 27m 12s | Latency is a key indicator of service quality, and important to measure and track. However, measuring latency correctly is not easy. In contrast to familiar metrics like CPU utilization or request counts, the “latency” of a service is not easily expressed in numbers. Percentile metrics have become a popular means to measure request latency, but have several shortcomings, especially when it comes to aggregation. The situation is particularly dire if we want to use them to specify Service Level Objectives (SLOs) that quantify performance over longer time horizons. In the talk we will explain these pitfalls and suggest three practical methods for implementing effective Latency SLOs. | Operations |
The Map Is Not the Territory: How SLOs Lead Us Astray, and What We Can Do about It | 45m 21s | SLOs are a wonderfully intuitive concept: a quantitative contract that describes expected service behavior. These are often used in order to build feedback loops that prioritize reliability, communicate expected behavior when taking on a new dependency, and synchronize priorities across teams with specialized responsibilities when problems occur, among other use cases. However, SLOs are built on an implicit model of service behavior, with a raft of simplifying assumptions that don’t universally hold. | Both |
Automatic Metric Screening for Service Diagnosis | 18m 35s | When a service is experiencing an incident, the oncall engineers need to quickly identify the root cause in order to stop the loss as soon as possible. The procedure of diagnosis usually consists of examining a bunch of metrics, and heavily depends on the engineers’ knowledge and experience. As the scale and complexity of the services grow, there could be hundreds or even thousands of metrics to investigate and the procedure becomes more and more tedious and error-prone. | Operations |
Performance Checklists for SREs | ?? | The Netflix example of a checklist and its associated performance metrics (possibly dated). | Operations |
Using deemed SLIs to measure customer reliability | ?? | If you do run a platform, it’s going to break sooner or later. Some breakages are large and easy to understand, such as no one being able to reach websites hosted on your platform while your company’s failure is frequently mentioned on social media. However, other kinds of breakage may be less obvious to you—but not to your customers. What if you’ve accidentally dropped all inbound network traffic from Kansas, for example? | Both |
Why Percentiles Don’t Work the Way you Think | 14m 15s | They’re not asking for the 99th percentile of a metric, they’re asking for a metric of 99th percentile. This is very common in systems like Graphite, and it doesn’t achieve what people sometimes think it does. This blog post explains how percentiles might trick you, the degree of the mistake or problem (it depends), and what you can do if percentile metrics aren’t right for you. | Both |
Monitoring Isn’t Observability | 6m 4s | Observability is all the rage, an emerging term that’s trending up very quickly in certain circles even while it remains unknown in others. As such, there isn’t a single widely understood meaning for the term, and much confusion is inevitably following. What is observability? What does it mean? Perhaps just as importantly, what is observability NOT? | Both |
The Problem with Pre-aggregated Metrics: Part 1, the “Pre” | 2m 48s | Pre-aggregated, or write-time, metrics are efficient to store, fast to query, simple to understand… and almost always fall short of being able to answer new questions about your system. This is fine when you know the warning signs in your system, can predict those failure modes, and can track those canary-in-a-coal-mine metrics. | Both |
The Problem with Pre-aggregated Metrics: Part 2, the “aggregated” | 3m 34s | The nature of pre-aggregated time series is such that they all ultimately rely on the same general steps for storage: a multidimensional set of keys and values comes in the door, that individual logical “event” (say, an API request or a database operation) gets broken down into its constituent parts, and attributes are carefully sliced in order to increment discrete counters. | Operations |
The Problem with Pre-aggregated Metrics: Part 3, the “metrics” | 3m 45s | Finally, we arrive at discussing “metrics.” Terminology in the data and monitoring space is incredibly overloaded, so for our purposes, “metrics” means: a single measurement, used to track and assess something over a period of time. | Both |
It All Adds Up | 8m 50s | Statistical analysis is a critical – but often complicated – component in determining your ideal Service Level Objectives (SLOs). So, a “deep-dive” on the subject requires much more detail than can be explored in a blog post. | Monitoring Operations |
Quantifying your SLOs | 9m 19s | Service Level Objectives (SLOs) are essential performance indicators for organizations that want a real understanding of how their systems are performing. However, these indicators are driven by vast amounts of raw data and information. That being said, how do we make sense of it all and quantify our SLOs? | Monitoring Operations |
Measuring and evaluating Service Level Objectives (SLOs) | 4m 18s | Managing services is hard for both service owners and stakeholders. To make things easier for everyone, define a clear set of expectations from the beginning. This helps measure and evaluate the health of services easier. | Monitoring Operations |
Chapter 14: Create Telemetry to Enable Seeing and Solving (DevOps Handbook) | 28m 33s | The Microsoft Operations Framework (MOF) study in 2001 found that organizations with the highest service levels rebooted their servers twenty times less frequently than average and had five times fewer “blue screens of death.” In other words, they found that the best-performing organizations were much better at diagnosing and fixing service incidents, in what Kevin Behr, Gene Kim, and George Spafford called a “culture of causality” in The Visible Ops Handbook. High performers used a disciplined approach to solving problems, using production telemetry to understand possible contributing factors to focus their problem solving, as opposed to lower performers who would blindly reboot servers. | Monitoring Operations |
Chapter 15: Analyze Telemetry to Better Anticipate Problems and Achieve Goals (DevOps Handbook) | 15m 41s | A great example of analyzing telemetry to proactively find and fix problems before customers are impacted can be seen at Netflix, a global provider of streaming films and television series. Netflix had revenue of $6.2 billion from seventy-five million subscribers in 2015. One of their goals is to provide the best experience to those watching videos online around the world, which requires a robust, scalable, and resilient delivery infrastructure. Roy Rapoport describes one of the challenges of managing the Netflix cloud-based video delivery service: “Given a herd of cattle that should all look and act the same, which cattle look different from the rest? Or more concretely, if we have a thousand-node stateless compute cluster, all running the same software and subject to the same approximate traffic load, our challenge is to find any nodes that don’t look like the rest of the nodes.” | Monitoring Operations |
Chapter 6: Numbers Lead the Way (Achieving DevOps) | 17m 31s + 39m 8s | [First Section, and first section in Behind the Story] Dashboarding and monitoring should have been far more prominent in the team’s journey; in this chapter, they finally begin to pay more attention to the numbers that matter most to their business partners, which reflect global value. Why is this important? And how do feature flags help enable a more continuous flow of value without increasing risk? | Monitoring Operations |
Chapter 16: Monitoring Fundamentals (Practice of Cloud System Administration) | 22m 24s | You can observe a lot by just watching. —Yogi Berra - Monitoring is the primary way we gain visibility into the systems we run. It is the process of observing information about the state of things for use in both short-term and long-term decision making. The operational goal of monitoring is to detect the precursors of outages so they can be fixed before they become actual outages, to collect information that aids decision making in the future, and to detect actual outages. Monitoring is difficult. Organizations often monitor the wrong things and sometimes do not monitor the important things. | Monitoring Operations |
Chapter 17: Monitoring Architecture and Practice (Practice of Cloud System Administration) | 38m 58s | A monitoring system has many different parts. A measurement flows through a pipeline of steps. Each step receives its configuration from the configuration base and uses the storage system to read and write metrics and results. | Monitoring Operations |
Appendix 3: Monitoring and Metrics (Practice of Cloud System Administration) | 3m 9s | Monitoring and Metrics covers collecting and using data to make decisions. Monitoring collects data about a system. Metrics uses that data to measure a quantifiable component of performance. This includes technical metrics such as bandwidth, speed, or latency; derived metrics such as ratios, sums, averages, and percentiles; and business goals such as the efficient use of resources or compliance with a service level agreement (SLA). | Monitoring Operations |
Chapter 4: Service Level Objectives (SRE Book) | 18m 59s | It’s impossible to manage a service correctly, let alone well, without understanding which behaviors really matter for that service and how to measure and evaluate those behaviors. To this end, we would like to define and deliver a given level of service to our users, whether they use an internal API or a public product. | Management/Monitoring Operations |
Chapter 2: Implementing SLOs (SRE Workbook) | 41m 44s | It’s impossible to manage a service correctly, let alone well, without understanding which behaviors really matter for that service and how to measure and evaluate those behaviors. To this end, we would like to define and deliver a given level of service to our users, whether they use an internal API or a public product. | Monitoring Operations |
Chapter 3: SLO Engineering Case Studies (SRE Workbook) | 30m 11s | SLOs are fundamental to the SRE model. Since we launched the Customer Reliability Engineering (CRE) team—a group of experienced SREs who help Google Cloud Platform (GCP) customers build more reliable services—almost every customer interaction starts and ends with SLOs. | Monitoring Operations |
Best Practices for Setting SLOs and SLIs for Modern, Complex Systems | 12m 16s | At New Relic, defining and setting Service Level Indicators (SLIs) and Service Level Objectives (SLOs) is an increasingly important aspect of our site reliability engineering (SRE) practice. It’s not news that SLIs and SLOs are an important part of high-functioning reliability practices, but planning how to apply them within the context of a real-world, complex modern software architecture can be challenging, especially figuring out what to measure and how to measure it. | Everyone |
Monitoring is the collecting, processing, aggregating, and displaying of real-time quantitative data about a system, such as query counts and types, error counts and types, processing times, and server lifetimes. Monitoring can include many types of data, including metrics, text logging, structured event logging, distributed tracing, and event introspection. While all of these approaches are useful in their own right, this chapter mostly addresses metrics and structured logging. These two data sources are best suited to fundamental monitoring needs.
At the most basic level, monitoring allows you to gain visibility into a system, which is a core requirement for judging service health and diagnosing a service when things go wrong.
There are several types of monitoring. White-box monitoring is based on metrics exposed by the internals of the system, including logs, interfaces like the Java Virtual Machine Profiling Interface, or an HTTP handler that emits internal statistics. Black-box monitoring tests externally visible behavior as a user would see it.
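The distinction is easiest to see in a small probe sketch. The Python example below is illustrative only: the black-box probe exercises the service exactly as a user would and records only externally visible behavior, while the white-box probe reads an internal statistics endpoint, whose path here is purely hypothetical.

```python
import json
import time
import urllib.request

def black_box_probe(url: str, timeout: float = 5.0) -> dict:
    """Black-box check: observe only what a user could see (status and latency)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except OSError:
        status = None
    return {"up": status == 200, "latency_s": time.monotonic() - start}

def white_box_probe(url: str) -> dict:
    """White-box check: read statistics the service exposes about its internals,
    e.g. queue depth or cache hit rate emitted by an internal HTTP handler."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

print(black_box_probe("https://example.com/"))
# white_box_probe("https://example.com/internal/statusz")  # hypothetical endpoint
```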
After setting up the monitoring solution, establishing the values to collect, and ensuring that the monitoring solution is receiving the appropriate data, the next obvious question is: now what? Most teams monitor their equipment to ensure that it is up and operational, and to alert them when it fails or performs outside established thresholds. If monitoring is the act of observation, then alerting is one method of delivering the data. It is not the only one, and it should not be the first. The problem with alerting is getting it right, making it effective, and ensuring it is not ignored. And it is so easy to ignore an alert.
Alerts fall into two categories. The first is the “for your information” (FYI) alert, where no immediate action is required but someone should be informed. “The backup job failed” is an FYI alert: someone needs to investigate, but the urgency does not rise to the level of dropping everything. The second type of alert is the “drop everything you are doing” alert, the kind that is meant to wake people up in the dead of night. Occasionally a middle-range alert is created between “the system is down” and “get to it when you can.” Resist the temptation to create this middle type of alert. Either a problem demands immediate action, or it does not. Too often, the middle-of-the-road alert leads to alerts being ignored. Use these criteria to evaluate your alerts.
The question, then, is what strategy to use for delivering and resolving alerts.
First, stop using email for alerts. Most DevOps/SRE engineers get too much email. If the alert is of the FYI variety, send it to a group notification (chat) system for follow-up. If the alert requires immediate action, choose the method that works best for a quick response: SMS, a paging service such as PagerDuty, and so on. Not everyone uses or pays attention to SMS alerts during the day or at night, so ensure the team has an agreed-upon method for how these alerts will be sent and acted upon. One shop used a flashing red light in the middle of the work area. Use what works. It is also important to log all alerts for later reporting. A unique sequence number that can be referred back to in reports, knowledge base articles, and postmortems is critical; it will also aid in SLA reporting. Ensure that you have a runbook/checklist. In The Checklist Manifesto, Atul Gawande details the value of having a checklist and what makes a good one, whether you are performing surgery, trying to land a plane, or troubleshooting a problem.
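As a sketch of that delivery strategy, the Python example below routes FYI alerts to a chat channel, pages the on-call for anything urgent, and stamps every alert with a unique sequence number that is written to a log for later reference. The function names and channels are hypothetical stand-ins for whatever chat webhook or paging integration a team actually uses.

```python
import itertools
import logging

logging.basicConfig(filename="alerts.log", level=logging.INFO)
_sequence = itertools.count(1)

def send_to_chat(alert_id: str, alert: dict) -> None:
    print(f"[chat] {alert_id}: {alert['summary']}")   # stand-in for a chat webhook

def page_on_call(alert_id: str, alert: dict) -> None:
    print(f"[page] {alert_id}: {alert['summary']}")   # stand-in for a paging integration

def dispatch(alert: dict) -> None:
    """Route an alert by severity and log it with a unique sequence number
    so it can be referenced in reports, knowledge base articles, and postmortems."""
    alert_id = f"ALERT-{next(_sequence):06d}"
    logging.info("%s severity=%s summary=%s", alert_id, alert["severity"], alert["summary"])
    if alert["severity"] == "fyi":
        send_to_chat(alert_id, alert)
    else:
        page_on_call(alert_id, alert)

dispatch({"severity": "fyi", "summary": "Nightly backup job failed"})
dispatch({"severity": "critical", "summary": "Checkout error rate above threshold"})
```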
The checklist should cover the everyday items. This will get people focused on the job at hand and prevent the inevitable issue of “but I thought….” Make sure to keep your checklists updated as systems change. If you find that the runbook/checklist provides the actual answer, then automate the runbook! Self-healing should be the first response of any alerting system in the modern age. If a human has to respond to an alert, it is more than likely already too late.
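A minimal sketch of that idea, assuming a systemd-managed service and a hypothetical alert name: the runbook steps are encoded as commands, attempted automatically when the alert fires, and a human is paged only if they fail.

```python
import subprocess

# A hypothetical automated runbook: the same steps a human would have followed,
# encoded so the alerting system can attempt them before waking anyone up.
RUNBOOK = {
    "web-service-down": [
        ["systemctl", "restart", "web-service"],
        ["systemctl", "is-active", "web-service"],
    ],
}

def self_heal(alert_name: str) -> bool:
    """Attempt the automated runbook for an alert; return True if it succeeded."""
    steps = RUNBOOK.get(alert_name)
    if not steps:
        return False               # no runbook known for this alert; escalate
    for step in steps:
        try:
            result = subprocess.run(step, capture_output=True, text=True)
        except OSError:
            return False           # command unavailable on this host; escalate
        if result.returncode != 0:
            return False           # a step failed; escalate to a human
    return True

if not self_heal("web-service-down"):
    print("Self-healing failed or no runbook found; paging the on-call engineer.")
```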
Another useful alerting practice is to delete alerts or tune them for value. Do not be afraid of removing alerts. If an alert is being ignored, and ignoring the alert is not causing a system issue, consider evaluating why the alert was created in the first place. Is it still relevant? Is it still needed? Threshold alerts, in particular, should be reviewed frequently and with a critical eye. Just because an alert fires on a threshold does not mean the threshold is valid. If an alert triggers on disk utilization at 90% capacity, is there an underlying problem if the disk goes from zero to 90% in an unreasonably short amount of time? Is the monitoring system triggering on that sort of issue? Should it? When establishing threshold alerts, their reason for existence should be discussed and evaluated, but other what-if scenarios should also be considered, and potentially rated a higher risk for alerting. Reducing alert fatigue will lead to more effective responses and fewer false alarms.
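The zero-to-90%-too-fast scenario can be expressed as a second rule alongside the static threshold. The sketch below assumes utilization samples arrive as (hours, fraction-used) pairs; the 90% and 30%-per-hour thresholds are placeholders to be tuned for a real system.

```python
def disk_alerts(samples: list[tuple[float, float]],
                capacity_threshold: float = 0.90,
                growth_threshold: float = 0.30) -> list[str]:
    """Evaluate disk-utilization samples of (timestamp_hours, fraction_used).

    Fires on the classic static threshold, and also on an unusually fast
    rate of change, which is often the more interesting signal."""
    alerts = []
    latest_time, latest_used = samples[-1]
    prev_time, prev_used = samples[-2]
    if latest_used >= capacity_threshold:
        alerts.append(f"disk at {latest_used:.0%}, above {capacity_threshold:.0%}")
    growth_per_hour = (latest_used - prev_used) / (latest_time - prev_time)
    if growth_per_hour >= growth_threshold:
        alerts.append(f"disk growing at {growth_per_hour:.0%} per hour")
    return alerts

# Utilization jumped from 10% to 91% within an hour: both rules fire.
print(disk_alerts([(0.0, 0.10), (1.0, 0.91)]))
```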
It should be common sense to disable or toggle alerting during a maintenance window, but more often than not, spurious alerts are generated. Again, this can lead to alert fatigue, and perhaps to the misdiagnosis that, because system X is under maintenance, any alert generated by the system is related to that maintenance, when in fact it may not be and should be investigated.
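A small sketch of that suppression logic, with hypothetical system names and window times: only alerts originating from the system actually under maintenance are silenced, so an unrelated failure elsewhere still gets investigated.

```python
from datetime import datetime

# Hypothetical maintenance windows, keyed by the system under maintenance.
MAINTENANCE_WINDOWS = {
    "system-x": (datetime(2024, 1, 6, 2, 0), datetime(2024, 1, 6, 4, 0)),
}

def should_page(source_system: str, alert_time: datetime) -> bool:
    """Suppress paging only when the alert's own system is inside a window."""
    window = MAINTENANCE_WINDOWS.get(source_system)
    if window and window[0] <= alert_time <= window[1]:
        return False
    return True

during = datetime(2024, 1, 6, 3, 0)
print(should_page("system-x", during))  # False: expected noise from the maintenance
print(should_page("system-y", during))  # True: unrelated system, investigate it
```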
This course looks at monitoring as a system requirement, and how to turn collected metrics into viable dashboards for effective data visualization and system alerting.
Title | Running Time | Description | Persona |
---|---|---|---|
DevOps Metrics and Dashboards | 18m 40s | This DevOps Tutorial explains what are the various DevOps metrics that need to be monitored and measured along with the various DevOps dashboards to do the same. While you develop software products, you need some mechanism , some way of measuring or validating and what you are doing to meet customer expectations, that is where DevOps Metrics come in to place. | Operations |
What is Monitoring | 4m 5s | The dictionary definition of monitoring is to observe and check the progress or quality of something over a period of time; keep under systematic review. - In tech, that something can be a service you provide, pieces of hardware and software that are part of the service, user activity, your billing process, or any other activity you can observe. | Everyone |
Observability in a DevOps World | 4m 41s | The hot new term in monitoring is observability. Nearly every DevOps monitoring company is using it in their marketing material nowadays but what does it really mean? | Everyone |
GitOps Part 3 - Observability | 11m 59s | In this post: if developers can learn to love testing, then they can learn to love user happiness in production. Observability helps achieve this. Git provides a source of truth for the desired state of the system, and Observability provides a source of truth for the actual production state of the running system. In GitOps we use both to manage our applications. | Both |
Monitoring: What does it all mean? | 3m 11s | Monitoring is magic, and it’s hard. It’s a data science that requires logical rigor. If you think taking statistics in school was a waste, think again. - We’ve found that using established engineering definitions for our monitoring work helps us to keep straight about what’s going on within our systems. | Monitoring Operations |
Monitoring: Math is required | 3m 53s | Understanding your exact metrics and what they mean is a big part of monitoring. - As Mark Twain said, “Facts are stubborn things, but statistics are pliable.” | Monitoring Operations |
Modeling your system | 4m 55s | There is a saying in Bulgaria, the wolf changes its coats, but not its habits. Complex systems may be unique, but they all obey the same rules. With this in mind, let’s explore some ways to model your monitoring system effectively. | Monitoring Operations |
Increasing Deployment Safety with Metric Based Canaries | 9m 20s | Key KPIs can be adapted for measuring deployment success. In particular, deployment time, or how long it takes to do a deployment from beginning to successful end. Another metric is the number of failed deployments, regardless of cause. You can use canary deployments to gather additional metrics related to the deployment process. A canary is a deployment process in which a change is partially rolled out and then evaluated against the current deployment. | Deployment Operations |
Chapter 6: Monitoring Distributed Systems | 21m 14s | Monitoring and alerting enables a system to tell us when it’s broken, or perhaps to tell us what’s about to break. When the system isn’t able to automatically fix itself, we want a human to investigate the alert, determine if there’s a real problem at hand, mitigate the problem, and determine the root cause of the problem. System monitoring is also helpful in supplying raw input into business analytics and in facilitating analysis of security breaches. | Management/Monitoring Operations |
Chapter 4: Monitoring (SRE Workbook) | 22m 18s | Monitoring can include many types of data, including metrics, text logging, structured event logging, distributed tracing, and event introspection. While all of these approaches are useful in their own right, this chapter mostly addresses metrics and structured logging. In our experience, these two data sources are best suited to SRE’s fundamental monitoring needs. At the most basic level, monitoring allows you to gain visibility into a system, which is a core requirement for judging service health and diagnosing your service when things go wrong. | Operations |
Code Coverage | 9m 20s | In computer science, test coverage is a measure used to describe the degree to which the source code of a program is executed when a particular test suite runs. A program with high test coverage, measured as a percentage, has had more of its source code executed during testing, which suggests it has a lower chance of containing undetected software bugs compared to a program with low test coverage.[1][2] Many different metrics can be used to calculate test coverage; some of the most basic are the percentage of program subroutines and the percentage of program statements called during execution of the test suite. | Operations Management |
About Code Coverage | 4m 54s | Code coverage is the percentage of code which is covered by automated tests. Code coverage measurement simply determines which statements in a body of code have been executed through a test run, and which statements have not. In general, a code coverage system collects information about the running program and then combines that with source information to generate a report on the test suite’s code coverage. | Operations Management |
What I Wish I Knew before Going On-call | 1h 25m 28s | Firefighting a broken system is time-sensitive and stressful but becomes even more challenging as teams and systems evolve. As an on-call engineer, scaling processes among humans is an important problem to solve. How do we ramp up new engineers effectively? How can we bring existing on-call engineers up to speed? In this workshop we’ll share common myths among new on-call engineers and the Do’s and Don’ts of on-call onboarding, as well as run through hands-on activities you can take back to work and apply directly to your own on-call processes. | Both |
For convenience, monitoring instrumentation is broken out from monitoring operations. The act of instrumenting an application generally falls to the development team, while instrumenting a system is typically the responsibility of the operations or infrastructure team. Both groups will learn something from this series.
Title | Running Time | Description | Persona |
---|---|---|---|
Instrumentation Is About Making People Awesome | 3m 12s | The nuts and bolts of metrics, events, and logs are really interesting to me. So interesting, perhaps that I get mired in these technical bits. I keep thinking of ways to process more data, allow for more fields or finer precision. I think about this so much that I drift in to worrying more about the work than the outcome. | Both |
Making Instrumentation Extensible | 6m 21s | Observability-driven development requires both rich query capabilities and sufficient instrumentation in order to capture the nuances of developers’ intention and useful dimensions of cardinality. When our systems are running in containers, we need an equivalent to our local debugging tools that is as easy to use as Printf and as powerful as gdb. We should empower developers to write instrumentation by ensuring that it’s easy to add context to our data, and requires little maintenance work to add or replace telemetry providers after the fact. Instead of thinking about individual counters or log lines in isolation, we need to consider how the telemetry we might want to transmit fits into a wider whole. | Both |
Instrumentation: The First Four Things You Measure | 2m 56s | This is the very basic outline of instrumentation for services. The idea is to be able to quickly identify which components are affected and/or responsible for All The Things Being Broken. The purpose isn’t to shift blame, it’s just to have all the information you need to see who’s involved and what’s probably happening in the outage. | Developer Management |
Instrumentation: What does ‘uptime’ mean? | 4m 33s | Everybody talks about uptime, and any SLA you have probably guarantees some degree of availability. But what does it really mean, and how do you measure it? | Both |
Instrumentation: Worst case performance matters | 2m 52s | After a few false starts and blaming the network as per SOP, we decided to take a look at sample-based CPU profiling, which confirmed that 15% of time was going to user data record decompression – largely in line with our past experience. User data records include a long list of “segments.” These are used for targeted advertising, like “people who want to buy a plane ticket to Europe soon;” this adds up to a lot of data, so it’s stored in a proprietary compressed format. | Developer Management |
Instrumentation: Measuring Capacity Through Utilization | 3m 54s | One of my favorite concepts when thinking about instrumenting a system to understand its overall performance and capacity is what I call “time utilization”. | Both |
Our monitoring system | 1m 17s | Optional Let’s do an overview of the system and application we’re going to use throughout our demos. | Monitoring Operations |
Synthetic monitoring: Is it up? | 5m 3s | Synthetic monitoring dates back to the prehistoric era, when cavemen threw rocks and poked things with sticks. It’s a simple and reliable way to tell if something’s alive or not. And one of the most basic things we need our monitoring to tell us is whether our service is even up. | Management/Monitoring Operations/Development |
Synthetic monitoring in action | 6m 34s | Optional Using Pingdom. | Monitoring Operations/Development |
End user monitoring: What do users see? | 5m 8s | Even though our servers and apps are up, real user experience can vary by geolocation, browser, and diverse input from real users. But is user experience really a DevOps concern? | Management/Monitoring Operations/Development |
End user monitoring instrumentation | 7m 18s | In this segment we’ll cover how to capture real user data, and what are useful things to measure about the user experience. | Management/Monitoring Operations/Development |
End user monitoring in action | 7m 8s | Now that our end user sessions are instrumented let’s dive into the end user monitoring data and review a few key things to measure. | Everyone |
System monitoring: See the box | 4m 53s | System monitoring is where a lot of sys admins are tempted to start, with the almighty CPU and memory graph | Everyone |
System monitoring in action | 8m 22s | Optional System Monitoring with Datadog | Monitoring Operations |
Network monitoring | 5m 35s | Modern systems are heavily interconnected and without network visibility, we’re blind to communications related issues, which are pretty common. | Management/Monitoring Operations/Development |
Software metrics: What’s that doing? | 3m 22s | While system stats are all well and good, the systems are there to run software. And most software surfaces metrics more willingly than what can be extracted at the OS level. | Development/Monitoring Operations |
Software metrics in action | 6m 28s | A review of systems, functions, and the metrics they transmit. | Monitoring Operations |
Application monitoring | 5m 18s | I believe it’s a Bulgarian proverb that says, “If an application falls over, and no one monitors it, does it make a sound?” Yup. It’s the sound of your business screeching to a halt. | Development/Monitoring Operations |
Application monitoring in action | 9m 8s | So let’s take a look at some basic app performance analysis and monitoring techniques. | Development/Monitoring Operations |
Log monitoring | 5m 44s | We generally think of logs as containing events, rather than metrics. Events can carry a lot more information. - You can also emit metrics into a log file and use it as a slightly less efficient ingestion channel. | Monitoring Operations |
Log monitoring in action | 5m 50s | Log monitoring with Splunk SaaS | Everyone |
Monitoring is a skill that has to be learned. This series of videos will expose teams to proper technique and provide a brush up for those with practical experience already.
Title | Running Time | Description | Persona |
---|---|---|---|
Implementing monitoring | 4m 46s | Let’s talk about implementing monitoring the pragmatic way. We should start with a Bulgarian proverb. A united band can lift a mountain. So in the spirit of dev ops, we’ll focus on people first. The goal of all these tools is to assist human operators in ensuring that a service is functioning. What type of skills and culture do we need to build and organize a monitoring practice? | Management/Monitoring Operations |
Using monitors: Visualization | 4m 31s | Alright, now you’ve implemented all kinds of great monitoring instrumentation, how do you use it? There are several common ways of consuming monitoring data, graphs and alerts I’m sure come immediately to mind. Let’s talk visualization. While most monitoring tools will show you loads of graphs, that may not be the best way to get the job done. | Monitoring Operations |
Using monitors: Alerting | 5m 10s | You don’t want to just sit there gathering data and thinking about it. You want it to summon you to action, and that’s where alerting comes in. On the other hand, to many people, monitoring is often misused to just mean the sending of alerts. And while proper monitoring is much more than that, good quality alerts are a very important part of life. | Monitoring Operations/Response Operations/Development |
Monitoring challenges | 5m 14s | Let’s address the system, human, and tools obstacles you’ll encounter on your observability journey. This is the epoch of the observer, the observed, and the observatory. Let’s start with the observed, or the system we’re trying to monitor. | Monitoring Operations |
Distributed tracing is not accomplished simply by downloading and running a program. It requires that software developers add instrumentation to the code of an application, or to the frameworks used in the application. Distributed tracing, also called distributed request tracing, is a method used to profile and monitor applications, especially those built using a microservices architecture. It helps pinpoint where failures occur and what causes poor performance.
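To illustrate what adding instrumentation to application code looks like, here is a minimal sketch using the OpenTelemetry Python API. It assumes the opentelemetry-api and opentelemetry-sdk packages are installed; the service and span names are hypothetical. Each unit of work becomes a span, nested calls become child spans, and in a real microservices deployment the trace context would be propagated across service boundaries in request headers.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire up a tracer that prints finished spans to the console.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")  # hypothetical service name

def lookup_price(item_id: str) -> float:
    # A child span; in a microservices system this might wrap an RPC to
    # another service, carrying the trace context along with the request.
    with tracer.start_as_current_span("lookup_price") as span:
        span.set_attribute("item.id", item_id)
        return 9.99

def handle_order(item_id: str) -> float:
    with tracer.start_as_current_span("handle_order"):
        return lookup_price(item_id)

handle_order("sku-123")
```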
Title | Running Time | Description | Persona |
---|---|---|---|
Tracing for Observability | 1m 20s | You have implemented a microservices architecture to scale better. But now it’s hard to diagnose problems because of the additional complexity. This course will cover the techniques required to debug, and find problems. | Development Manager |
Why do we need distributed tracing now? | 4m 2s | Every minute of performance degradation or downtime in your applications, could cost millions of dollars to your business. Weaving in the right diagnostic capabilities in your apps is key, to allow for rapid triage of problems. This course will discuss what distributed tracing is and why it’s a critical part of system observability. Distributed tracing is a relatively new technique compared to the canonical log and application KPI metrics, that most people use for monitoring. Distributed tracing is becoming a necessity nowadays with the increasing complexity of apps, built on top of middleware and microservices. | Development Manager |
Tracing libraries and agents | 5m 6s | An external library loads with your app and introspects the code at run time. The agent discovers the code executing in a process, with the ability to configure what and how calls are recorded. It determines which calls to record and what metadata to extract. It propagates context for distributed calls. There’s no open standard, so different technologies implement these three steps differently. | Development Manager |
Overview of Zipkin and Jaeger | 4m 3s | Zipkin and Jaeger are the two leading open source projects for distributed tracing. | Development Manager |
Persisting trace data for long term analysis | 3m 49s | Setting up Zipkin and Jaeger with in memory storage is an easy way to get started. In production environments, it is common to configure the Zipkin client to receive data from Kafka and persist into a Cassandra cluster. Here are the steps to configure the Zipkin client with Kafka and Cassandra. The instructions assume you already have the Kafka broker and Cassandra clusters set up. | Development Manager |
Integrating trace data with the rest of your monitoring systems | 2m 56s | Trace data is a great source for diagnostics and will augment your monitoring capability. As adoption of distributed tracing grows in your organization, you may end up with a variety of instrumentation points, because different projects have different requirements. | Development Manager |
Data visualization is a critical skill to master, and not one that is taught. Anyone can make a pie chart, but making an effective pie chart is a skill that has to be learned over time. Before we throw numbers on a spreadsheet, let’s look at how to do it effectively. This track is foundational for anyone who has to develop or maintain dashboards. Dashboard consumers may find some of the sections interesting, either for conversations with dashboard maintainers or for other data presentation aspects of their work. We will review how data should be visualized, explain how visualization can enhance monitoring and improve response to issues, and show how to improve data visualization.
Title | Running Time | Description | Persona |
---|---|---|---|
Data Visualization (The Quick Course) | 5m 30s | Data visualizations help you understand analytics, what’s working and what isn’t working. This video focuses on the impact of marketing, but is a good overview of how to make an impact with visual data. | Managers/Monitoring Operations |
Visibility Drives Data-Driven Decision-Making | 4m 22s | A common data fabric provides the objective measurement and shared visibility critical for a data-driven DevOps approach. With comprehensive and continuous visibility into key performance measures, provided through a shared data fabric, DevOps teams can isolate “waste,” detect and correct slowdowns, and deliver applications faster. Correlate test and QA outcomes to find more problems sooner and improve code quality. React faster to detect and address problems that do get through to production and use real-time insight to measure business impact and iterate faster on good change | Manager |
Straighten the Flow of DevOps with Data | 7m 4s | DevOps practitioners are finding more in common with the cowboys of the Old West than modern-day, process-obsessed enterprise architects. These DevOps practitioners, or the “shoot from the hip” cowboys of IT, base their decisions on speed, gut feelings, and output, which has been largely successful thus far. | Management |
Actionable Metrics for Data-driven Decision Making | 1m 29s | Whether your IT organization operates like a well-oiled machine or you’ve got some inefficiencies, the same holds true: what you don’t measure, you can’t manage. But in software development, what do you measure; how often; and how do you synthesize the data into something that can move your team forward? | Management |
Using Monitors: Visualization | 4m 31s | You always have to start with asking yourself, what is the goal of a specific visualization? Is it detecting issues, troubleshooting issues, capacity planning, SLA reporting, and who’s the consumer? Is it for you, for other engineers, business users, customers? You need to display the same metrics in fundamentally different ways, depending on what it is you’re trying to show. | Monitoring Operations |
How to design and build a great dashboard | 5m 49s | You don’t need to be a designer to build a dashboard that clearly communicates your key goals and metrics. Whether you’re just getting started or have a dashboard in need of a rethink, our checklist will help you achieve the results you’re after. | Monitoring Operations |
How Dashboards Are Changing Human Behavior in DevOps | 8m 4s | What is a dashboard? Something readily visible to everyone that needs it that shows how something is functioning or progressing. When used right — whether on a car, aircraft or DevOps team — a dashboard is simple to read at a glance and a very powerful way to know you’re heading in the right direction. | Monitoring Operations |
The Difference Between Capacity and Scalability Planning | 3m 7s | The reason we invest time in capacity and scalability planning is simple: We want to make sure that system resources such as compute, storage and memory are not the cause of an application outage. It is a response to expected and unexpected increases in application usage, as well as the steady growth of application adoption. | Development Management |
Data Driven Decision Making | 18m 04s | How do you know if you are making the right decisions on your product or project? How do you know that you’re measuring the correct metrics to determine success? In this talk, two Cloud Foundry Product Managers will describe how to design the correct measurable indicators to build an understanding of your product, how to monitor those metrics over time, and how to build feedback loops to course-correct and determine the success of short-term initiatives and features. | Management |
Once you understand how vital data visualization is, actually visualizing the data is another skill that has to be learned. This series of videos covers data visualization in general, not just monitoring data. This series is useful whether you have to build dashboards or include charts in other business-related reports.
Title | Running Time | Description | Persona |
---|---|---|---|
The Rules of Effective Data Visualization | 2h 41m | There are many tools out there that allow you to create beautiful, interactive data visualizations that help you to see and understand your data. Make the right choice, and the visualization can enhance the understanding of your data. The entire course is recommended for all who have to visualize data. | Managers/Monitoring Operations |
Why visualize data? | 6m 34s | Everybody’s used to summary statistics, things like what’s the average height of a class, what’s the average sales per customer. These numbers can be a useful summary of a data set but they can also hide detail within the data. There is a danger that relying on summaries can lead to misleading or even incorrect answers. | Management/Operations |
What kind of visualization should you make? | 3m 50s | There are many ways to visualize your data, and the choice of it is going to largely depend on your data and what you’re trying to achieve and that’s always going to be the starting point. No matter what data you have, and what visualization you decide to use, there’s some general guidelines that will stand you in good stead. Now broadly, there are two kinds of data visualization, exploratory dashboards and infographics. | Monitoring Operations/Dashboard Builders |
Visualize comparisons in data | 10m 10s | There are many ways to compare data, depending on the kind of data we have and the type of questions we are asked. Three great ways to compare data sets are bar charts, line charts, and highlight tables. | Monitoring Operations |
Bar charts across categories | 2m 46s | When it comes to comparing categorical data, it’s really hard to beat the bar chart. Although it’s a very simple chart and often used, it’s one of the best ways of visualizing data. | Monitoring Operations/Dashboard Builders |
Line charts over time | 4m 52s | For comparing changes over time the line chart is the best choice. This works by plotting the date on the x-axis and one or more measures that you’re interested in on the y-axis. So let’s look at this in action. | Monitoring Operations/Dashboard Builders |
Spark lines for important events | 4m 48s | Most of the time, when we’re visualizing data, we’re interested in details within our data set. However, sometimes we need a higher-level overview of what’s going on. Sometimes too much information’s just too much. | Monitoring Operations/Dashboard Builders |
Gantt charts and time difference | 4m 39s | Durations and time are one of the trickiest things to visualize, especially when we’re trying to compare durations across a dimension. Typically we’re looking to see how long a process took compared to all the other processes. This could be achieved by doing some simple date difference calculations. | Monitoring Operations/Dashboard Builders |
Tree maps for long-tail data | 5m 6s | One of the reasons why visualizing data is so effective is that it enables us to look at large volumes of data that would have been impossible to analyze in a traditional spreadsheet. We can see trends, outliers, and we can compare measures across a range of dimensions. | Monitoring Operations/Dashboard Builders |
Highlight tables and heat maps | 5m 58s | At first glance a table might not look like an effective data visualization. In fact many people wonder whether it’s one at all. However, under the right circumstances a table can be a really useful vis type. The reason is, it gives us immediate access to the underlying data. | Monitoring Operations/Dashboard Builders |
Slope charts for change between dates | 3m 12s | Line charts are a really effective way of visualizing time-based data. In this example, we’re looking at the amount of sales per month across all of our products. We can see that, overall, sales appear to be increasing, as the general trend is going up, but there is a lot of clutter in between. | Monitoring Operations/Dashboard Builders |
Optimize dashboard layout with small multiples | 3m 31s | It’s often tempting, when we’re trying to compare data that changes across dimensions, to keep adding them into a single view, using things like color, shape, and size to encode extra information. However, this can sometimes make for a very unattractive and very difficult-to-understand view. | Monitoring Operations/Dashboard Builders |
Visualize relationships in data | 6m 24s | Analyzing relationships within a data set enables us to look for patterns and clusters. How does an increase in one value affect the others? Let’s see this in action. A good data visualization always starts with a question. | Monitoring Operations/Dashboard Builders |
Compare multiple variables within scatter plots | 5m 27s | One of the best ways of visualizing the relationship between measures is to use a scatter plot. A scatter plot works by looking at the relationship between two different measures and subdividing it by one or more categories or dimensions. | Monitoring Operations/Dashboard Builders |
Visualize data distributions | 5m 39s | When we talk about distributions in terms of data, what we usually mean is how many times a value appears or its frequency. One of the most common ways to visualize a distribution is with a histogram. A histogram is a plot that lets you show the underlying frequency distribution or shape of a continuous data set. | Monitoring Operations/Dashboard Builders |
Histograms for a single measure | 5m 17s | When it comes to looking at distributions in our data, what we’re looking at is actually the frequency of a particular value as it appears in our data set. Now a really common and good way of doing that is with a histogram, which is basically a kind of modified bar chart. | Monitoring Operations/Dashboard Builders |
Box plots for multiple dimensions | 6m 29s | A really common way of visualizing a distribution is using a histogram. Histograms are great for showing an overview, but they’re only really good at showing a single dimension; we can’t look at multiple dimensions and how they affect our distribution. If we want to look at a distribution across multiple dimensions in a bit more detail, then we can use a box plot. | Monitoring Operations/Dashboard Builders |
Visualize data composition | 4m 19s | When it comes to looking at the composition of a data set there’s one visualization that is used more than any other. And it’s probably the most misused one as well. And that’s the pie chart. Now it’s fair to say there’s not a lot of love for pie charts. It’s been called a bad data visualization. I don’t think that’s true. Pie charts are not bad, bad pie charts are bad. | Monitoring Operations/Dashboard Builders |
Improve the use of pie charts | 5m 44s | Of all the ways to visualize the composition of a data set, the pie chart is probably the most commonly used. It is also one of the most abused visualization types: used correctly it can be effective, but far too often it is misused. So let’s look at some examples of both good and bad pie charts and some of the things you should avoid if you are going to use them. | Monitoring Operations/Dashboard Builders |
Stacked bar charts | 4m 53s | The default way of showing the composition of a data set tends to be the pie chart. However, there are some big problems when we use pie charts to compare changes across different categories. Let’s look at an example of this in action, then see how the stacked bar chart is a much better option. | Monitoring Operations/Dashboard Builders |
100% stacked bars | 5m 9s | Although the pie chart is a common way of visualizing the composition of a data set, it doesn’t work too well if we’re comparing across multiple categories. For example, this pie chart is looking at the breakdown of sales according to the segments of our customers, so we can see that the consumer segment accounts for pretty much half of all of our sales; but what if we wanted to look at, say, regional variation? | Monitoring Operations/Dashboard Builders |
Stacked area chart | 3m 26s | One of the ways of visualizing changes in composition over time is to use a stacked area chart. At first glance, a stacked area chart looks very similar to a line chart, but there are some important differences to understand between the two. | Monitoring Operations/Dashboard Builders |
100% stacked area chart | 2m 24s | When we want to look at changes over time, we usually use a line chart. However, in some cases a 100% stacked area chart can be a better choice, if we’re looking at part-to-whole relationships and how each individual element contributes to the whole. | Monitoring Operations/Dashboard Builders |
Visualize geographic data | 9m 16s | Whenever people have geographic values in their data, there’s a big temptation to create a map. Maps are a great-looking visualization, but is it always the right choice for our data? In some cases, yes, it’s absolutely the right decision. But depending on the data and the question we’re asking, it can also be the wrong visualization. | Monitoring Operations/Dashboard Builders |
When to map geographic data | 4m 21s | There’s a temptation, whenever we have geographic data, to put it into a map. In some cases that can be a really good choice and can enhance your visualization, but at other times it can actually be a hindrance to understanding what’s going on in the data. | Monitoring Operations/Dashboard Builders |
Compare filled maps and symbol maps | 2m 21s | Once you’ve decided you’re going to make a map, the next choice is what kind of map you’re going to use. Commonly, there’s either a filled map or a symbol map. Both of them have their pros and cons and which one you decide to use largely depends on both the data and the kind of question that you’re asking. | Monitoring Operations/Dashboard Builders |
Alternative techniques with tile maps | 4m 33s | There are two common ways of mapping data. First is the filled map, and second a symbol map. Both of them have their pros and cons. There’s a third map type that isn’t quite as common, but can be really useful, and that’s a tiled map. | Monitoring Operations/Dashboard Builders |
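To make the chart-type guidance above concrete, here is a minimal Python sketch using matplotlib and NumPy with an entirely made-up sales data set (none of the names or numbers come from the videos). It lays out a bar chart, a line chart, a histogram, and a scatter plot as small multiples, one simple chart per question rather than one overloaded view.

```python
# Illustrative only: hypothetical data, not taken from the videos.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Hypothetical sales data set
categories = ["Furniture", "Office Supplies", "Technology"]
sales_by_category = [4200, 3100, 5600]                          # compare categories
months = np.arange(1, 13)
sales_by_month = 1000 + 50 * months + rng.normal(0, 80, 12)     # change over time
order_values = rng.gamma(shape=2.0, scale=150.0, size=500)      # distribution
discount = rng.uniform(0, 0.5, 200)
profit = 100 - 180 * discount + rng.normal(0, 15, 200)          # relationship

# Small multiples: one panel per question instead of one overloaded chart.
fig, axes = plt.subplots(2, 2, figsize=(10, 7))

axes[0, 0].bar(categories, sales_by_category)
axes[0, 0].set_title("Comparison across categories (bar chart)")

axes[0, 1].plot(months, sales_by_month, marker="o")
axes[0, 1].set_title("Change over time (line chart)")
axes[0, 1].set_xlabel("Month")

axes[1, 0].hist(order_values, bins=30)
axes[1, 0].set_title("Distribution of a single measure (histogram)")

axes[1, 1].scatter(discount, profit, alpha=0.5)
axes[1, 1].set_title("Relationship between two measures (scatter plot)")
axes[1, 1].set_xlabel("Discount")
axes[1, 1].set_ylabel("Profit")

fig.tight_layout()
plt.show()
```

The same layout discipline applies to monitoring dashboards: each panel should answer one question, and related panels should sit side by side so they can be compared at a glance.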
This video series looks at how Grafana, a platform for monitoring and metric analysis, can be used to create and analyze beautiful graphs and other visualizations of your systems, providing many of the features of proprietary monitoring solutions without the cost. With these skills in your toolkit, you will walk away with everything you need to deepen your understanding of your systems and their business value.
Title | Running Time | Description | Persona |
---|---|---|---|
Grafana basics | 3m 34s | | Monitoring Operations |
Installing Grafana | 2m 23s | Optional. In this video, we will install Grafana on our Ubuntu 16.04 virtual machine. Assuming we are logged into our machine, we will first add a key to our apt installation’s list of trusted keys, which will allow us to download the Grafana package. | Operations |
Grafana security basics | 3m 21s | Optional. Once Grafana is installed, we will want to apply a few security measures to our setup right away. This includes changing the administrative login password and ensuring that new account registration and anonymous access are turned off. | Operations |
Adding data sources | 4m 13s | In order to visualize metrics, Grafana will need to get data from a storage back end, which it refers to as a data source. In this video, we’ll cover how to add our first data source (a scripted alternative using Grafana’s HTTP API is sketched after this table). | Monitoring Operations |
Creating dashboards | 4m 37s | With our first data source added, it’s time to explore creating dashboards. If you are following along and used Docker Compose in the previous video, you can terminate that setup now by running `docker-compose down`. | Monitoring Operations |
Additional dashboard configurations | 3m 42s | In this session, we’ll continue configuring our dashboard. For the panel we’re on, we may also want to specify the units associated with our graph, for example a duration such as milliseconds, or a percentage from zero to 100. | Monitoring Operations |
Deep dive: Grafana panel types | 4m 54s | In this section we will review the basic panel types offered by Grafana in greater detail. As described in previous videos, Grafana offers seven panel types: graph, single-stat, dashboard list, table, text block, heatmap, and alert list. | Monitoring Operations |
High-availability Grafana | 3m 35s | Optional. In this video, we will examine the configuration changes necessary to create a highly available Grafana infrastructure. This includes configuring Grafana to use an external database, so that multiple instances of Grafana can share the same database, as well as managing user sessions appropriately. | Operations |
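The Grafana videos above add data sources and build dashboards through the web UI. The sketch below shows the same two steps done programmatically against Grafana’s HTTP API, which becomes useful once dashboards need to be version-controlled or provisioned repeatably. It is a minimal sketch only: the Grafana URL, API token, Prometheus address, and the `http_requests_total` query are placeholder assumptions you would replace with your own values.

```python
# Minimal sketch: create a data source and a one-panel dashboard via
# Grafana's HTTP API. All URLs, tokens, and metric names are placeholders.
import requests

GRAFANA_URL = "http://localhost:3000"        # assumption: local Grafana instance
API_TOKEN = "replace-with-your-api-token"    # an API key created in Grafana
HEADERS = {
    "Authorization": f"Bearer {API_TOKEN}",
    "Content-Type": "application/json",
}

# 1. Add a Prometheus data source (the storage back end Grafana reads from).
datasource = {
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://localhost:9090",          # assumption: local Prometheus
    "access": "proxy",
}
resp = requests.post(f"{GRAFANA_URL}/api/datasources", json=datasource, headers=HEADERS)
resp.raise_for_status()

# 2. Create a minimal dashboard containing a single graph panel.
dashboard = {
    "dashboard": {
        "id": None,
        "title": "Service Overview",
        "panels": [
            {
                "title": "Request rate",
                "type": "graph",
                "datasource": "Prometheus",
                "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
                "targets": [{"expr": "rate(http_requests_total[5m])"}],
            }
        ],
    },
    "overwrite": True,
}
resp = requests.post(f"{GRAFANA_URL}/api/dashboards/db", json=dashboard, headers=HEADERS)
resp.raise_for_status()
print("Dashboard created at:", resp.json().get("url"))
```

Treating dashboards as JSON payloads like this also pairs naturally with the high-availability setup described above, since every Grafana instance sharing the external database sees the same provisioned dashboards.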
This type of alert should cause heads to pop up and people to move immediately to fix the problem. A wind-up air-raid siren is a great alert tone for this sort of issue. ↩