Introducing KDP, a big data platform on Kubernetes
Introduction
The Hadoop ecosystem has been widely adopted by companies as their preferred big data platform since its open-source release in 2006. As usage has grown, however, its traditional infrastructure has shown its limits: deploying new applications is difficult, sharing resources among big data applications is awkward, and maintaining a data analytics platform is inefficient. With business applications migrating to Kubernetes, it is increasingly clear that operating a big data platform outside of K8s results in wasted resources and unnecessary complexity.
Here we present KDP (Kubernetes Data Platform), a big data platform on Kubernetes, developed at LinkTimeCloud. KDP offers a standard way to deploy, manage, and run big data applications such as Hive, Spark, Flink, HDFS, Kafka, and Minio. The platform has built-in observability, enabling data engineers to quickly identify and resolve issues, and offers robust security measures including enterprise-level authentication and authorization.
The problems of the traditional big data platform
The traditional data platform, built on the Hadoop ecosystem and deployed on bare-metal or virtual machines, typically consists of systems such as Hadoop, Hive, Spark, Flume, and Kafka. However, this platform has several limitations:
- Implementing resource isolation in a multi-tenant environment is challenging. While YARN can use cgroups to limit the CPU usage of containers, this depends on Linux kernel support and is not available on every operating system.
- Deploying new applications can be difficult due to complex library dependencies. For instance, adding a new application that is not supported in packages such as CDH or HDP can consume a considerable amount of time.
- The resource utilization in the traditional data platform is low. To avoid resource competition, Hadoop clusters are often dedicated to only running HDFS and YARN. While batch processing tasks occupy most resources during peak hours, little to no resources are used during off-peak hours.
- Managing the traditional data platform requires extensive manual effort. For instance, an HDFS datanode cannot be added on demand in response to capacity pressure; each addition requires manual provisioning and configuration.
As a result, there is a need for an alternative technology to process big data and overcome these limitations effectively.
Big Data on Kubernetes
Kubernetes (K8s) was open-sourced by Google in 2014, leading to the formation of the Cloud Native Computing Foundation (CNCF). Today, K8s is widely supported by major cloud providers and has a thriving ecosystem, with 149 projects and 183K contributors from 187 countries. Some of these projects, focused on areas such as databases, messaging, orchestration, and storage, will form the foundation of the future cloud-native big data platform.
In 2021, two significant advancements were made in the Cloud Native big data platform arena. March saw the release of Apache Spark 3.1, making Spark on Kubernetes generally available. In May, Confluent Inc. introduced Confluent for Kubernetes, bringing Cloud Native capabilities to data streams. As a result, more big data applications are being migrated to Kubernetes. It is expected that MapReduce will be replaced by Spark and YARN will be replaced by Kubernetes. While object storage may serve as an alternative to HDFS in some cases, fully retiring HDFS may take longer. Nonetheless, it is now possible to run all big data workloads on Kubernetes.
Introducing KDP: a Big Data Platform on K8s
The key components of KDP include KDP Core, Storage System, Processing Engines, and Scheduler. The KDP Core comprises an Integration Framework, Publish Service, Observability Service, and Cluster Manager. As shown in the above figure, components developed from scratch by the LinkTimeCloud engineering team are indicated in green, while those built on top of open-source software are in orange.
KDP Core
Running big data applications on K8s can be achieved with open-source operators and helm charts available in the community, such as the Spark operator by Google and the Kafka operator by Strimzi. However, these solutions do not address crucial production issues for K8s. They often lack enterprise-level authentication and authorization support, integration with an observability system, and a fine-grained scheduling system beyond the default K8s scheduler. Additionally, they do not natively support multi-tenancy, making it difficult to allocate resources automatically for new users.
The LinkTimeCloud engineering team created KDP Core to standardize big data application integration on K8s. It streamlines the deployment, upgrading, and management of big data applications by using a single configuration file and integrates those applications with other KDP services such as user management, data security, and observability services. The KDP Core includes the following key components:
- Integration Framework: The Integration Framework streamlines the integration of big data applications on K8s by centralizing each application's integration details, covering the Publish Service, Observability Service, and Cluster Manager, into a single configuration file per application. It manages open-source operators and helm charts through KDP-standard configuration files, allowing customizable deployment options such as target namespaces and inter-application dependencies, and provides standardized monitoring, alerting, and logging configurations (a sketch of such a file follows this list).
- Publish Service: The Publish Service offers API-based deployment and management for big data applications in KDP, leveraging their standard configuration files. It streamlines the deployment process for big data applications by handling additional tasks such as loading monitoring dashboards in Grafana, mounting Kerberos keytabs as K8s secrets for authentication, establishing default authorization policies in Apache Ranger, and initializing resources for user accounts. The Publish Service also enables management operations for these applications, including version upgrades, pod scaling, backing up and restoring data, etc.
- Observability Service: The Observability Service optimizes bug resolution, performance optimization, failure analysis, and security monitoring in KDP by collecting metrics, logs, and traces of big data applications. It is built using open-source software components like Prometheus, Grafana, and Loki. Integrating a big data application with this service requires defining relevant details in its configuration file.
- Cluster Manager: The Cluster Manager is a web-based interface that provides a central view and management of all big data applications in KDP. It organizes big data applications into catalogs, such as one catalog for Hive (with applications like Hive Metastore and Hive Server2) and one for HDFS (with components like namenodes and datanodes). The information displayed includes both static information from configurations and real-time information collected from the Kubernetes control plane. The Cluster Manager simplifies DevOps operations for users by allowing them to perform tasks like adding more Kafka brokers, scaling the HDFS cluster, and adjusting resource requests/limits for applications, all without the need for running any kubectl commands.
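To make this concrete, here is a sketch of what a KDP-standard application configuration file might look like. The schema below is purely illustrative, an assumption for the sake of the example rather than KDP's actual format; it simply mirrors the responsibilities described above: deployment options, inter-application dependencies, and observability hooks.

```yaml
# Hypothetical KDP-standard configuration file; all field names are
# illustrative assumptions, not the actual KDP schema.
name: hdfs
version: 3.1.1
namespace: kdp-storage            # customizable deployment namespace
dependencies:                     # inter-application dependencies
  - zookeeper
deploy:
  chart: kubernetes-hdfs          # underlying helm chart (or operator) to manage
  values:
    datanode:
      replicas: 3
      persistence:
        storageClass: openebs-hostpath
        size: 100Gi
observability:                    # consumed by the Observability Service
  metrics:
    port: 9864
    grafanaDashboard: hdfs-overview
  logs:
    lokiLabels:
      app: hdfs
  alerts:
    - name: datanode-down
      expr: up{app="hdfs-datanode"} == 0
      severity: critical
```

Under this model, the Publish Service would consume a file like this to deploy the application, while the Observability Service and Cluster Manager would read the same file to wire up dashboards, alerts, and catalog entries.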
Storage System
In KDP, HDFS serves as the primary storage system for structured and semi-structured data. YARN has been omitted, as K8s' default scheduler can handle resource allocation and container orchestration. To make the solution more cloud-native, the Kubernetes-HDFS helm chart on GitHub was optimized for stability, with persistent volumes used for data storage and K8s' virtual network used as the network solution.
Alongside HDFS, KDP includes Minio, an S3-compatible object storage system that ships with a K8s operator and provides data redundancy through Erasure Coding. Erasure coding yields a significantly higher usable capacity: in a typical layout of 12 data and 4 parity shards, 12/16 = 75% of raw capacity is usable, versus 33.3% for HDFS's default three-way replication, which stores three full copies of every block. In KDP, users can run Hive SQL on tables stored on either HDFS or Minio.
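The erasure-coding arithmetic can be illustrated with a minimal Minio operator Tenant manifest. The sketch below assumes the Minio operator is installed and uses illustrative names and sizes: 4 servers with 4 volumes each give 16 drives, and with Minio's default parity of EC:4 each object is striped as 12 data + 4 parity shards, which is where the 75% figure comes from.

```yaml
# Minimal Minio Tenant sketch (assumes the Minio operator is installed).
# 4 servers x 4 volumes = 16 drives; with default parity EC:4 this yields
# 12 data + 4 parity shards, i.e. 12/16 = 75% usable capacity.
apiVersion: minio.min.io/v2
kind: Tenant
metadata:
  name: kdp-minio                          # illustrative name
  namespace: kdp-storage
spec:
  pools:
    - name: pool-0
      servers: 4                           # number of Minio server pods
      volumesPerServer: 4                  # PVCs per pod
      volumeClaimTemplate:
        metadata:
          name: data
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: openebs-hostpath   # see the provisioner below
          resources:
            requests:
              storage: 1Ti
```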
KDP's stateful applications, such as HDFS, Kafka, and Minio, rely on persistent volumes (PVs) for storage. These applications already provide high availability at the application layer, so no additional HA measures are needed at the PV layer. To provision PVs dynamically for these applications, KDP uses OpenEBS Local PV. Our tests indicate no significant performance difference between accessing data on OpenEBS Local PVs and on HostPath volumes.
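For reference, an OpenEBS Local PV storage class for these applications looks roughly like the following; the base path is deployment-specific and given here as an assumption.

```yaml
# OpenEBS Local PV storage class using the hostpath engine.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-hostpath
  annotations:
    openebs.io/cas-type: local
    cas.openebs.io/config: |
      - name: StorageType
        value: hostpath
      - name: BasePath
        value: /var/openebs/local   # node-local directory; deployment-specific assumption
provisioner: openebs.io/local
reclaimPolicy: Delete
# Delay binding until a pod is scheduled, so the PV lands on the pod's node.
volumeBindingMode: WaitForFirstConsumer
```

`WaitForFirstConsumer` matters for node-local storage: the PV can only be created after the scheduler has picked the node the consuming pod will run on, so the data ends up on the right host.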
Processing Engines
In KDP, Hive serves as the data warehouse solution: it manages schemas for structured data stored in HDFS or Minio and provides a SQL interface to manipulate that data. Multiple versions of Hive can be deployed in KDP to accommodate different needs. To keep resource usage efficient, a separate Hive Server2 instance is deployed in each user group's namespace, while the Hive Metastore is shared among all groups. Hive SQL can be submitted through Hue or the Beeline client.
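This layout can be sketched with plain K8s objects: one Hive Server2 Deployment per group namespace, all pointing at one shared metastore Service. The manifest below is a hypothetical illustration (namespace, service name, and image are assumptions); `hive.metastore.uris` is the standard Hive property for locating the metastore.

```yaml
# Hypothetical per-group Hive Server2 pointing at a shared Hive Metastore.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hive-server2
  namespace: group-a                  # one Hive Server2 per user-group namespace
spec:
  replicas: 1
  selector:
    matchLabels: {app: hive-server2}
  template:
    metadata:
      labels: {app: hive-server2}
    spec:
      containers:
        - name: hive-server2
          image: apache/hive:3.1.3    # illustrative image choice
          ports:
            - containerPort: 10000    # HiveServer2 Thrift port
          env:
            - name: SERVICE_NAME
              value: hiveserver2
            # All groups share one metastore; the service name is an assumption.
            - name: SERVICE_OPTS
              value: "-Dhive.metastore.uris=thrift://hive-metastore.kdp-shared:9083"
```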
KDP can run Hive SQL queries on the highly efficient Spark computing framework. It uses an optimized version of GCP's Spark on K8s operator, with data security features such as authentication and authorization enabled. Users can access Spark either through an interactive JupyterLab interface or by submitting code through KDP's HTTP API.
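KDP's optimizations to the operator are internal, but the underlying resource is the Spark operator's `SparkApplication` CRD. A minimal example, essentially the operator's canonical SparkPi job, looks like this:

```yaml
# Minimal SparkApplication for the Spark on K8s operator (the stock SparkPi job).
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: group-a
spec:
  type: Scala
  mode: cluster
  image: gcr.io/spark-operator/spark:v3.1.1
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar
  sparkVersion: "3.1.1"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark     # needs RBAC to create executor pods
  executor:
    cores: 1
    instances: 2
    memory: 512m
```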
In KDP, an optimized version of the Flink operator from the Apache community is used to deploy and manage Flink jobs. Flink provides a unified engine for real-time and batch processing: it treats real-time workloads as unbounded streams and batch workloads as bounded streams, executing tasks in memory. With these optimizations, KDP gives users a powerful and versatile solution for processing both real-time and batch data.
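Flink jobs are likewise described declaratively. A minimal `FlinkDeployment` for the Apache Flink Kubernetes Operator, close to the operator's stock basic example, is shown below; the image tag and jar path are the sample values shipped with Flink:

```yaml
# Minimal FlinkDeployment for the Apache Flink Kubernetes Operator.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: basic-example
spec:
  image: flink:1.16
  flinkVersion: v1_16
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"
  serviceAccount: flink
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"
      cpu: 1
  job:
    # Stock streaming example shipped with the Flink image.
    jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
    parallelism: 2
    upgradeMode: stateless
```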
Additionally, to secure sensitive data in Kafka streams, authentication and authorization are enabled in the Kafka cluster deployed on KDP. Users can manage and monitor their Kafka clusters and perform DevOps operations, such as adding or removing brokers, scaling the cluster, and adjusting resources, through the Kafka Manager web UI.
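With the Strimzi operator that KDP builds on, authentication and authorization are declared directly on the `Kafka` resource. The sketch below enables SCRAM-SHA-512 authentication and ACL-based authorization; the listener choice, sizes, and names are assumptions for the example:

```yaml
# Strimzi Kafka cluster with SCRAM authentication and simple ACL authorization.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: kdp-kafka
spec:
  kafka:
    replicas: 3
    listeners:
      - name: tls
        port: 9093
        type: internal
        tls: true
        authentication:
          type: scram-sha-512   # clients must present SCRAM credentials
    authorization:
      type: simple              # ACL-based authorization
    storage:
      type: persistent-claim
      size: 100Gi
      class: openebs-hostpath
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 10Gi
      class: openebs-hostpath
  entityOperator:
    userOperator: {}
    topicOperator: {}
```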
Scheduler
In KDP, we use Volcano as a fine-grained scheduler for big data and AI jobs that require more advanced scheduling capabilities than what the default K8s scheduler can provide. Volcano, which has been accepted as the first and only official container batch scheduling project by the CNCF, supports a range of scheduling policies, including gang scheduling, fair-share scheduling, queue scheduling, and preemption scheduling.
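To illustrate these policies, the sketch below defines a weighted Volcano queue and a gang-scheduled job: with `minAvailable: 3`, none of the three worker pods starts until all three can be placed, avoiding partial (and deadlock-prone) startups. Names, image, and resources are illustrative:

```yaml
# A Volcano queue plus a gang-scheduled job.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: batch
spec:
  weight: 1             # fair-share weight relative to other queues
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: batch-example
spec:
  schedulerName: volcano
  queue: batch
  minAvailable: 3       # gang scheduling: start only when all 3 pods can run
  tasks:
    - replicas: 3
      name: worker
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: worker
              image: busybox:1.36
              command: ["sh", "-c", "echo processing && sleep 60"]
```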
Why use KDP
KDP not only standardizes the deployment and management of big data applications on K8s, but also delivers stability and efficiency through a number of platform optimizations:
- KDP optimizes cluster resource utilization by running both real-time and batch-processing workloads in a single cluster. Spark is utilized for batch processing, while Flink handles real-time processing. Volcano, a cloud-native scheduler, schedules both Spark and Flink jobs, ensuring efficient use of resources. KDP consistently achieves a resource utilization rate of over 60% for our customers, significantly higher than the average utilization rate of 30% in traditional systems.
- KDP simplifies big data cluster management and reduces DevOps costs by standardizing maintenance procedures. With KDP, operations like deployment, upgrading, pod scaling, and data backup/restore can be executed as kubectl commands, eliminating the need for DevOps engineers to learn different procedures for each application. Additionally, KDP’s Cluster Manager features a user-friendly Web UI, enabling DevOps engineers to easily perform routine operations with just a few clicks.
- KDP ensures efficient big data processing on K8s by optimizing open-source big data applications. For instance, the Spark operator was optimized to address two issues: data locality with HDFS and sticky Spark sessions. With data locality enabled, task scheduling prioritizes placing Spark executors on the nodes that hold the HDFS blocks they read, so executors access local data instead of retrieving it over the network. To avoid the frequent starting and termination of Spark driver and executor pods, KDP implements sticky Spark sessions in the Spark Thrift Server: when a Spark application completes, its session in the Thrift Server remains active, allowing subsequent applications to reuse it instead of creating new driver and executor pods.
Run Big Data applications on K8s
The LinkTimeCloud engineering team has shared a simplified KDP implementation on GitHub, allowing developers to deploy big data applications like HDFS, Hive, Spark, and Kafka on K8s for experimentation. Although it doesn’t include KDP Core and Scheduler, developers can still use the optimized operators and helm charts. Please refer to this blog to run big data on AWS EKS.