Metrics collection, cleanup, downsampling, and aggregation are the core requirements for the Magalix AI pipeline and the resource usage visualization.
The Magalix Agent collects metrics about pods, containers, and nodes using the components installed by default inside Kubernetes. It depends mainly on cAdvisor, the Kubelet endpoints, and the Metrics Server. The Agent does not install any other external components to collect or clean up metrics. If the --trace flag is enabled in the agent's arguments, the agent generates detailed logs listing all metrics collected (see the agent's about page).
The agent currently collects metrics at a 1-minute frequency and sends them as is to our backend.
Magalix Agent does not yet support any kind of local data buffering. We are working on a buffering feature to make the pipeline tolerant to network partitioning.
The 1-minute resolution provides a close look at the previous few hours and the next two hours to show any short-term patterns or changes. To capture hourly, daily, and seasonal patterns, the Magalix backend downsamples the metrics into lower resolutions and stores them along with statistical indicators that the AI uses for predictions and decision-making. The following resolutions are accessible from the Magalix console:
- 1 minute
- 5 minutes
- 30 minutes
- 1 hour
- 12 hours
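The downsampling step described above can be illustrated with a minimal sketch. It assumes samples arrive as (Unix timestamp, value) pairs and that the stored statistical indicators are count, min, max, and mean; the function name `downsample` and the choice of indicators are hypothetical, not the actual Magalix backend implementation.

```python
from collections import defaultdict
from statistics import mean


def downsample(samples, bucket_seconds):
    """Group (unix_timestamp, value) samples into fixed-width time buckets.

    Returns one summary dict per bucket, sorted by bucket start time.
    The indicators kept here (count, min, max, mean) are an assumption.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each timestamp to the start of its bucket.
        bucket_start = ts - (ts % bucket_seconds)
        buckets[bucket_start].append(value)
    return [
        {
            "start": start,
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
        }
        for start, values in sorted(buckets.items())
    ]


# Ten 1-minute CPU samples (millicores), one every 60 seconds from t=0.
readings = [100, 120, 110, 130, 140, 200, 210, 190, 220, 180]
one_minute = [(i * 60, v) for i, v in enumerate(readings)]

# Downsample to the 5-minute resolution (300-second buckets).
five_minute = downsample(one_minute, 300)
```

Keeping min and max next to the mean preserves short spikes that a plain average would hide at the lower resolutions.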
The Magalix backend aggregates metrics from the container level all the way up to the namespace and cluster levels. Aggregate metrics provide a high-level overview of consumption trends and resource balance inside namespaces and clusters. All container-level metrics (memory, CPU, disk, network, etc.) are aggregated to reflect resource consumption inside the connected clusters. All node metrics are aggregated to reflect utilization of available capacity.
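As a rough illustration of this roll-up, the sketch below sums container-level memory usage per namespace and for the whole cluster, and computes node utilization as used capacity divided by total capacity. The record fields (`namespace`, `memory_mib`, `used_mib`, `capacity_mib`) and function names are hypothetical, and summation is only one possible aggregation.

```python
from collections import defaultdict


def aggregate_containers(container_metrics):
    """Sum container memory usage per namespace and for the whole cluster."""
    namespace_totals = defaultdict(float)
    cluster_total = 0.0
    for metric in container_metrics:
        namespace_totals[metric["namespace"]] += metric["memory_mib"]
        cluster_total += metric["memory_mib"]
    return dict(namespace_totals), cluster_total


def node_utilization(node_metrics):
    """Return used capacity as a fraction of total capacity across nodes."""
    used = sum(node["used_mib"] for node in node_metrics)
    capacity = sum(node["capacity_mib"] for node in node_metrics)
    return used / capacity if capacity else 0.0


# Hypothetical container-level samples for one point in time.
containers = [
    {"namespace": "web", "container": "nginx", "memory_mib": 256.0},
    {"namespace": "web", "container": "app", "memory_mib": 512.0},
    {"namespace": "jobs", "container": "worker", "memory_mib": 1024.0},
]
# Hypothetical node-level samples for the same point in time.
nodes = [
    {"name": "node-a", "used_mib": 1024.0, "capacity_mib": 4096.0},
    {"name": "node-b", "used_mib": 3072.0, "capacity_mib": 4096.0},
]

per_namespace, cluster_total = aggregate_containers(containers)
utilization = node_utilization(nodes)
```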
Metrics prediction is essential for proactive scalability decisions. The Magalix backend builds its initial prediction model from millions of training points that our machine learning team has accumulated over time. These models are retrained frequently enough to reflect changes in the patterns of different metrics. For example, when the 1-minute prediction model exceeds the error threshold, it is automatically recalibrated and retrained on the latest data points from the collected metrics. Magalix currently provides predictions of these metrics:
- Disk consumption
- KPI metric (Pro plan only)
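The recalibration trigger mentioned above (retraining once a model exceeds its error threshold) might look something like the following sketch. The error measure (mean absolute percentage error), the 20% threshold, and the function names are all assumptions chosen for illustration; the source does not state which error metric or threshold Magalix uses.

```python
def mean_absolute_percentage_error(actual, predicted):
    """Average of |actual - predicted| / |actual|, skipping zero actuals."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        return 0.0
    return sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)


def needs_retraining(actual, predicted, threshold=0.2):
    """Return True when the recent prediction error exceeds the threshold.

    The default threshold of 0.2 (20%) is a hypothetical value.
    """
    return mean_absolute_percentage_error(actual, predicted) > threshold


# Recent measured values compared with two sets of predictions.
measured = [100, 100, 100, 100]
good_predictions = [90, 110, 100, 100]  # 5% average error
poor_predictions = [50, 150, 100, 100]  # 25% average error

retrain_good = needs_retraining(measured, good_predictions)
retrain_poor = needs_retraining(measured, poor_predictions)
```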
Sample predicted CPU and memory metrics. The blue line is the measured metric. The shaded area is the predicted lower and upper range.
Initial predictions are generated after 300 to 500 historic data points have accumulated, so the waiting time depends on the metrics resolution. For example, CPU metrics predictions for the 1-minute resolution will kick in 300 to 500 minutes (roughly 5 to 8 hours) after the cluster is connected or the container is created. The 1-hour resolution requires 300 to 500 hours of historical data, which is around 12 to 21 days of operations.
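The warm-up arithmetic generalizes to every resolution listed earlier: multiply the number of required data points by the resolution. The sketch below assumes the same 300 to 500 point range applies to all resolutions, which the text confirms only for the 1-minute and 1-hour cases; the names `RESOLUTION_MINUTES` and `warmup_hours` are hypothetical.

```python
# Console resolutions expressed in minutes.
RESOLUTION_MINUTES = {
    "1 minute": 1,
    "5 minutes": 5,
    "30 minutes": 30,
    "1 hour": 60,
    "12 hours": 720,
}


def warmup_hours(resolution_minutes, min_points=300, max_points=500):
    """Return the (shortest, longest) wait in hours before first predictions."""
    shortest = min_points * resolution_minutes / 60
    longest = max_points * resolution_minutes / 60
    return shortest, longest


# Print the expected waiting range for each resolution.
for name, minutes in RESOLUTION_MINUTES.items():
    low, high = warmup_hours(minutes)
    print(f"{name}: {low:.1f} to {high:.1f} hours "
          f"({low / 24:.1f} to {high / 24:.1f} days)")
```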