Run Data Analytics on Kubernetes 2X Faster

Shay Naeh
May 12, 2022

Shay Naeh (Cloudify), Petar Torre (Intel)

Introduction

This blog describes how to provision cloud infrastructure, including Kubernetes clusters, using Infrastructure as Code (IaC). On top of the provisioned infrastructure we orchestrate Kubernetes workloads, optimized for the Intel architecture, to achieve performance gains.

Cloudify, as an orchestrator, can accept Service Level Objective (SLO) requirements, such as a workload's CPU, network, or memory requirements.

Cloudify configures the underlying Kubernetes clusters according to workload characteristics, such as whether the workload is compute or network intensive. It matches each workload to the hardware features that suit it best, so that the workload runs faster on the underlying Intel hardware architecture.

The figure below shows how Cloudify uses hardware-based features like vector instructions, encryption offloading and network acceleration to make workloads run faster.

This blog focuses on Intel® Advanced Vector Extensions 512 (Intel® AVX-512), a hardware acceleration feature available in Intel Xeon processors, and presents how it is used in Kubernetes-based environments.

How it Works

We will describe in detail how a Monte Carlo simulation application benefits from Intel AVX-512, and how such containerized applications can be deployed automatically on Kubernetes clusters across multiple clouds.

The figure below shows a high-level view of the solution.

Intel Xeon processors with Intel AVX-512 support 512-bit vector instructions.

The Kubernetes nodes are labeled by Node Feature Discovery (NFD), https://github.com/kubernetes-sigs/node-feature-discovery. Monte Carlo pods are matched to nodes via a node selector, defined in the Kubernetes pod deployment YAML manifest.
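To make the matching rule concrete, here is a minimal Python sketch of how a node selector is evaluated: a pod may be scheduled on a node only if every key/value pair in its selector appears among the node's labels. The label keys follow the naming NFD uses for CPUID features; the node names and the helper functions are hypothetical and exist only for illustration. This is not Kubernetes scheduler code.

```python
from typing import Dict, List

# Label keys in the style NFD publishes for CPU features.
AVX512_LABEL = "feature.node.kubernetes.io/cpu-cpuid.AVX512F"
AVX2_LABEL = "feature.node.kubernetes.io/cpu-cpuid.AVX2"

# Hypothetical cluster: node names and label sets are made up.
NODES: Dict[str, Dict[str, str]] = {
    "node-a": {AVX512_LABEL: "true", AVX2_LABEL: "true"},
    "node-b": {AVX2_LABEL: "true"},
}


def node_matches(node_labels: Dict[str, str], node_selector: Dict[str, str]) -> bool:
    """Return True if every selector pair is present in the node's labels."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


def schedulable_nodes(nodes: Dict[str, Dict[str, str]],
                      node_selector: Dict[str, str]) -> List[str]:
    """List the names of nodes that satisfy the selector."""
    return [name for name, labels in nodes.items()
            if node_matches(labels, node_selector)]


if __name__ == "__main__":
    print(schedulable_nodes(NODES, {AVX512_LABEL: "true"}))
    print(schedulable_nodes(NODES, {AVX2_LABEL: "true"}))
```

In this sketch a pod that requests the AVX-512 label can only land on the node that carries it, while a pod that requests AVX2 can land on either node.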

The system architecture is shown in the figure below. We used a Terraform plugin to create an AWS EKS cluster whose nodes run Intel Xeon processors that support both Intel AVX-512 and AVX2 vector instructions.

The Monte Carlo simulation code is hosted on GitHub and compiled with the Intel® oneAPI C++ Compiler. It can run with either AVX-512 or AVX2 instructions, which allows a performance comparison.
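For readers unfamiliar with this kind of workload, the sketch below shows the general shape of a Monte Carlo computation in Python. It is not the benchmark code used here (that code is C++), and pure Python does not use AVX instructions. It estimates pi as a stand-in problem, chosen only because it is short. The point is that every sample is independent of the others, which is the property that lets a vectorizing compiler process many samples per instruction.

```python
import random


def estimate_pi(samples: int, seed: int = 0) -> float:
    """Estimate pi by sampling points uniformly in the unit square."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        # Each sample is independent of all the others; this loop body
        # is the part a compiled, vectorized version would batch.
        if x * x + y * y <= 1.0:
            inside += 1
    # The fraction of points inside the quarter circle approximates pi / 4.
    return 4.0 * inside / samples


if __name__ == "__main__":
    print(estimate_pi(100_000))
```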

We ran the tests on two EC2 instance generations: C6i, with 3rd generation Intel Xeon processors, and C5, with the previous generation of Intel Xeon processors. On each of them we ran the simulation with both AVX-512 and AVX2 instructions.

The metrics are pushed from each processor to a Prometheus PushGateway, from there scraped by Prometheus, and presented in a Grafana dashboard, as shown in the figure below.
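As a rough illustration of the push step, the following Python sketch builds the HTTP request that a client would send to a Pushgateway: the grouping key (job and instance) goes in the URL path, and the body carries the metric in the Prometheus text exposition format. The gateway address, job name, instance name and metric name are all hypothetical, and the helper function is ours; the article does not show the code that the simulation pods use.

```python
import urllib.parse
import urllib.request


def build_push_request(gateway: str, job: str, instance: str,
                       metric: str, value: float) -> urllib.request.Request:
    """Build (but do not send) a Pushgateway request for one gauge value."""
    # Grouping key is encoded in the path: /metrics/job/<job>/instance/<instance>
    url = "{}/metrics/job/{}/instance/{}".format(
        gateway.rstrip("/"),
        urllib.parse.quote(job, safe=""),
        urllib.parse.quote(instance, safe=""),
    )
    # Text exposition format; the body must end with a newline.
    body = f"# TYPE {metric} gauge\n{metric} {value}\n".encode("utf-8")
    # PUT replaces all metrics previously pushed under this grouping key.
    return urllib.request.Request(url, data=body, method="PUT")


if __name__ == "__main__":
    request = build_push_request(
        "http://pushgateway.example:9091",  # hypothetical address
        "montecarlo",                       # hypothetical job name
        "c6i-avx512",                       # hypothetical instance name
        "montecarlo_runtime_seconds",       # hypothetical metric name
        1.5,                                # placeholder value, not a measurement
    )
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```

Prometheus then scrapes the Pushgateway like any other target, so the pushed value becomes available to Grafana without the simulation pod having to stay alive.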

Performance is nearly doubled with AVX-512 compared to AVX2 vector instructions, and the current C6i instances are faster than the older C5 instances. This Monte Carlo simulation clearly shows the benefit of AVX-512 instructions on both Xeon processor generations.

Cloudify Orchestration — Blueprints and Deployments

Below you can see the Cloudify catalog with multiple blueprints:

  • IaC blueprint that creates the Kubernetes EKS cluster on AWS
  • NFD blueprint that labels the Kubernetes nodes that support vector processing (AVX-512)
  • Monte Carlo simulation blueprint that runs the Monte Carlo simulation app

The next figure shows the actual deployment of the different components and their execution task graph. Green boxes mean everything was executed successfully.

Cloudify matches the Monte Carlo pods to the right Kubernetes nodes, as defined in the pod deployment specification. Monte Carlo pods with acceleration are matched to AVX-512 nodes using a nodeSelector field, as explained in detail here. An example of a deployment spec is shown below.
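The following is a minimal sketch of such a spec, not the one used in this demo. It builds a Kubernetes Deployment manifest as a Python dictionary and prints it as JSON, which kubectl accepts in place of YAML. The deployment name, container image and helper function are hypothetical; the nodeSelector key follows the label naming that NFD uses for the AVX-512 CPUID feature.

```python
import json
from typing import Any, Dict

# Label key in the style NFD publishes for AVX-512 capable nodes.
AVX512_LABEL = "feature.node.kubernetes.io/cpu-cpuid.AVX512F"


def monte_carlo_deployment(name: str, image: str,
                           node_selector: Dict[str, str]) -> Dict[str, Any]:
    """Build a Deployment manifest whose pods are pinned by nodeSelector."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": 1,
            # The selector must match the pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Pods are only scheduled on nodes carrying these labels.
                    "nodeSelector": node_selector,
                    "containers": [{"name": name, "image": image}],
                },
            },
        },
    }


if __name__ == "__main__":
    manifest = monte_carlo_deployment(
        "montecarlo-avx512",                    # hypothetical name
        "registry.example/monte-carlo:latest",  # hypothetical image
        {AVX512_LABEL: "true"},
    )
    print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and passing it to kubectl apply -f would create the Deployment, assuming the image exists and at least one node carries the label.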

The next figure shows the TOSCA (Cloudify’s DSL for blueprint definition) topology of the deployed components. The topology has four components, one for each combination of C6i or C5 instances and AVX-512 or AVX2 instructions.

Results are shown below in a Grafana dashboard. There are four lines, one for each combination of instance generation and AVX instruction set. For the Monte Carlo application, performance is nearly doubled with AVX-512 compared to AVX2 on both instance generations. The current C6i generation is also faster than the previous C5 generation.

Summary

In this demo we showed how to fully automate cloud resource provisioning, configure the resources correctly, and orchestrate Kubernetes workloads (Monte Carlo simulations) to achieve performance gains.

Specifically we did the following:

1. Using Terraform configuration files, we created an AWS EKS managed Kubernetes cluster, with managed worker nodes.

2. Configured the cluster with Node Feature Discovery (NFD), which labels the nodes with detected hardware features and system configuration details.

3. Deployed a “Reporting pod” with containers for Prometheus Pushgateway, Prometheus and Grafana, plus an additional container for configuring the required metrics bundle and dashboard visualization.

4. Deployed the Monte Carlo simulation pods on the right Kubernetes nodes, using node selectors. We have four nodes covering two Xeon processor generations, each running with AVX-512 or AVX2 instructions.

5. Results show that for this Monte Carlo simulation the performance is nearly doubled with AVX-512 compared to AVX2, and that the current C6i instance generation with 3rd Gen Xeon is faster than the previous C5 generation of Xeon processors.
