Introduction
For Platform Engineers and SREs, the health of your Kubernetes control plane can mean the difference between a smoothly running platform and a cascade of failures. While Amazon EKS abstracts away much of the complexity, understanding control plane observability remains critical - your users’ workloads can significantly impact control plane performance, potentially degrading cluster-wide stability. This is particularly evident in scenarios where poorly configured workloads overwhelm the API server’s rate limits or when resource-intensive operations impact etcd’s performance - challenges that persist even in managed Kubernetes environments.
At the upcoming KubeCon, we will be showcasing strategies for implementing control plane observability. We will demonstrate both open-source tools (Prometheus, Grafana) and AWS-native services like Amazon CloudWatch Container Insights. This post is a sneak peek into the topic and will help you understand the basics of control plane observability.
If you are attending KubeCon, we invite you to stop by booth F1, where you can learn about Kubernetes best practices, strategies, and our latest innovations from AWS experts. The AWS booth will feature live, interactive product demonstrations focused on cost optimization, observability, security, governance, data and AI/ML, and platform strategy. A complete list of our sessions and talks can be found at AWS at KubeCon + CloudNativeCon North America 2024.
Demystifying the Kubernetes Control Plane
The Kubernetes control plane consists of several critical components that work in concert to manage the cluster:
- kube-apiserver: The front-end interface for the control plane that exposes the Kubernetes API
- kube-scheduler: Assigns pods to nodes based on resource requirements and constraints
- kube-controller-manager: Runs controller processes (node controller, replication controller, endpoints controller, etc.)
- cloud-controller-manager: Manages cloud-specific control logic
- etcd: The distributed key-value store that maintains cluster state
The control plane components interact with each other and with the worker nodes to manage the cluster. If any of these components fails or becomes unresponsive, the result can be degraded performance, unavailable services, or even a cluster-wide outage.
The EKS architecture is designed to eliminate single points of failure and takes care of scalability and high availability of the control plane components. It is also designed to take care of lifecycle activities such as etcd backup and compaction.
What could possibly go wrong?
While Amazon EKS greatly simplifies the operational overhead of managing the Kubernetes control plane, it is still important to monitor the control plane components for performance, resource utilization, and potential issues.
Here are some examples of issues we want to get ahead of:
- HTTP 429 errors: Rate limiting errors from the API server, which could be caused by a poorly written controller that overloads the API server. An API server request can be throttled either because the total API server rate limit is exceeded or because the request is subject to an API Priority and Fairness (APF) rate limit. APF subdivides the total rate limit among different classes of requests using the PriorityLevelConfiguration and FlowSchema resources. A PriorityLevelConfiguration defines a priority level and its rate limit. A FlowSchema configures how inbound requests are mapped to the available priority levels. Monitoring helps to identify whether requests are being dropped due to an APF rate limit or the overall rate limit.
- High API server latency: Slow response times from the API server can impact the responsiveness of the cluster, resulting in pod scheduling timeouts.
- Cluster going into a read-only state: When the etcd database size limit is exceeded, the cluster becomes read-only. Amazon EKS has a built-in auto-recovery workflow for the no space alarm, but it is best to get ahead of the problem before it happens.
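As a rough illustration of how these signals surface in metrics, the sketch below parses a snippet of Prometheus text-format output, such as what the API server's /metrics endpoint returns, and summarizes throttling and storage usage. The sample values are made up, the sample lines are simplified, and the 8 GiB storage limit is an assumption passed as a parameter, so verify the limit that applies to your cluster. The metric names (apiserver_request_total, apiserver_flowcontrol_rejected_requests_total, apiserver_storage_size_bytes) are upstream Kubernetes names that can vary by version. In practice you would alert on rates over time rather than on raw counter values.

```python
import re

# Illustrative sample in the Prometheus text exposition format. The values are
# invented for this sketch and the label sets are simplified.
SAMPLE_METRICS = """\
# TYPE apiserver_request_total counter
apiserver_request_total{code="200",resource="pods",verb="LIST"} 9500
apiserver_request_total{code="429",resource="pods",verb="LIST"} 40
apiserver_request_total{code="429",resource="configmaps",verb="GET"} 10
# TYPE apiserver_flowcontrol_rejected_requests_total counter
apiserver_flowcontrol_rejected_requests_total{flow_schema="service-accounts",priority_level="workload-low",reason="queue-full"} 35
# TYPE apiserver_storage_size_bytes gauge
apiserver_storage_size_bytes 6442450944
"""

# metric_name{optional labels} value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simple parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        yield name, labels, float(value)


def summarize(text, etcd_limit_bytes=8 * 1024**3):
    """Summarize throttling and storage usage.

    etcd_limit_bytes is an assumed limit (8 GiB), not a value read from
    the cluster.
    """
    http_429_total = 0.0
    apf_rejected_total = 0.0
    storage_bytes = 0.0
    for name, labels, value in parse_metrics(text):
        if name == "apiserver_request_total" and labels.get("code") == "429":
            http_429_total += value
        elif name == "apiserver_flowcontrol_rejected_requests_total":
            apf_rejected_total += value
        elif name == "apiserver_storage_size_bytes":
            storage_bytes = max(storage_bytes, value)
    return {
        "http_429_total": http_429_total,
        "apf_rejected_total": apf_rejected_total,
        "etcd_used_fraction": storage_bytes / etcd_limit_bytes,
    }


if __name__ == "__main__":
    for key, value in summarize(SAMPLE_METRICS).items():
        print(f"{key}: {value}")
```

Reporting the overall 429 count next to the APF rejection count makes it easier to tell whether throttling is concentrated in a particular flow schema or priority level, which is usually the first question to answer when a controller misbehaves.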
The Amazon EKS Best Practices Guide provides guidance on the control plane metrics to pay close attention to. Next, let us look at how to implement a solution to scrape, store and visualize these metrics.
Solution overview
Kubernetes exposes a rich set of metrics that are useful for monitoring the control plane. There are two options to scrape, store and visualize the metrics: an open-source approach or an AWS-native approach.
The open-source approach uses AWS Distro for OpenTelemetry (ADOT) to scrape metrics, Amazon Managed Service for Prometheus to store the metrics and Amazon Managed Grafana for visualization.
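Once the metrics land in a Prometheus-compatible store, the issues described earlier map onto a handful of PromQL expressions. The sketch below collects three such expressions and builds URLs for the standard Prometheus HTTP query endpoint (/api/v1/query). The base URL is a hypothetical placeholder, the 5-minute rate window and the exclusion of WATCH requests are illustrative choices, and the code only builds URLs without sending them. Amazon Managed Service for Prometheus additionally requires SigV4-signed requests, which this sketch does not cover.

```python
from urllib.parse import urlencode

# PromQL expressions for throttling, latency, and etcd storage size.
# The metric names are upstream Kubernetes names and can vary by version.
QUERIES = {
    "throttled_requests_per_second": (
        'sum(rate(apiserver_request_total{code="429"}[5m]))'
    ),
    "api_latency_p99_seconds": (
        "histogram_quantile(0.99, sum(rate("
        'apiserver_request_duration_seconds_bucket{verb!="WATCH"}[5m]'
        ")) by (le, verb))"
    ),
    "etcd_storage_bytes": "max(apiserver_storage_size_bytes)",
}


def build_query_url(base_url, expr):
    """Return an instant-query URL for a Prometheus-compatible HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


if __name__ == "__main__":
    # Hypothetical endpoint, used only to show the resulting URL shape.
    base = "https://prometheus.example.com"
    for name, expr in QUERIES.items():
        print(name, build_query_url(base, expr))
```

The same expressions can be pasted into a Grafana panel or an alerting rule, so keeping them in one place helps dashboards and alerts stay consistent.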
The second, AWS-native approach uses Amazon CloudWatch to store, query and visualize your observability data. To enable this, we use the Amazon CloudWatch Observability EKS add-on. This add-on installs the CloudWatch agent to send infrastructure metrics from the cluster, installs Fluent Bit to send pod logs, and also enables CloudWatch Application Signals to send application performance telemetry. Note that for this discussion on control plane observability, we are primarily interested in the first part - the Container Insights metrics. Fluent Bit and Application Signals are more relevant for a discussion around application observability.
Control plane logs serve as a complementary tool for diving deeper into the cause of errors or spikes observed through the metrics. These audit and diagnostic logs can be sent to Amazon CloudWatch Logs. Additional detail about the types of logs and how to enable or disable control plane logs is available in the Amazon EKS documentation.
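Control plane logging can also be toggled programmatically. The sketch below builds the logging payload accepted by the EKS UpdateClusterConfig API and, optionally, applies it with boto3. The helper names and the cluster name used in the usage example are hypothetical, boto3 is a third-party dependency that is imported only when the update is actually applied, and the call needs suitable AWS credentials and IAM permissions. Treat it as one possible approach rather than the recommended procedure.

```python
# The five control plane log types that Amazon EKS can export.
LOG_TYPES = ("api", "audit", "authenticator", "controllerManager", "scheduler")


def build_logging_config(enabled_types):
    """Build the 'logging' payload: listed types enabled, the rest disabled."""
    requested = set(enabled_types)
    unknown = requested - set(LOG_TYPES)
    if unknown:
        raise ValueError(f"unknown log types: {sorted(unknown)}")
    enabled = [t for t in LOG_TYPES if t in requested]
    disabled = [t for t in LOG_TYPES if t not in requested]
    cluster_logging = []
    if enabled:
        cluster_logging.append({"types": enabled, "enabled": True})
    if disabled:
        cluster_logging.append({"types": disabled, "enabled": False})
    return {"clusterLogging": cluster_logging}


def apply_logging_config(cluster_name, enabled_types, region_name=None):
    """Send the update to EKS. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    client = boto3.client("eks", region_name=region_name)
    return client.update_cluster_config(
        name=cluster_name,
        logging=build_logging_config(enabled_types),
    )


if __name__ == "__main__":
    # Print the payload only; "my-cluster" would be a hypothetical name for
    # apply_logging_config("my-cluster", ["api", "audit"]).
    print(build_logging_config(["api", "audit"]))
```

Enabling only the log types you need, for example api and audit while investigating throttling, keeps CloudWatch Logs ingestion costs in check.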
Conclusion
Implementing comprehensive control plane observability requires understanding both the metrics exposed by Kubernetes components (/metrics endpoints) and the interconnected nature of control plane services. While Amazon EKS handles the operational complexity, the responsibility for monitoring and responding to performance degradation patterns remains with platform teams. As we’ve explored, monitoring the control plane components can help you proactively identify and address issues such as API server overload, high latency, or etcd storage limits. By leveraging either open-source tools like Prometheus and Grafana or AWS-native services like Amazon CloudWatch, you can implement robust observability solutions for your Kubernetes control plane. These tools provide valuable insights into the performance and health of your cluster, enabling you to maintain optimal operations and quickly troubleshoot any issues that may arise.
If you’re attending KubeCon, don’t forget to visit the AWS booth (F1) for live demonstrations and expert insights on Kubernetes best practices, including control plane observability.
Further reading
- Amazon EKS Best Practices Guide is the best place to start for understanding best practices for Day 2 operations of Amazon EKS clusters. The reliability and scalability sections are the most relevant to this topic.
- Monitor Amazon EKS Control Plane metrics using AWS Open Source monitoring service
- Managing etcd database size on Amazon EKS clusters
- Amazon EKS Workshop helps users learn about Amazon EKS features and integrations with popular open-source projects
- AWS Observability Accelerator for CDK