Run big data on AWS EKS

Wenxin Song
5 min read · Mar 14, 2023


Introduction

With the rapid development of K8s technology, running big data on cloud-native infrastructure has become a popular trend. This is largely due to the following reasons:

  • Scalability: K8s allows you to quickly scale up or down your big data cluster as needed, making it easier to handle growing data volumes.
  • Resource Management: K8s automates resource allocation and management, reducing the operational overhead of managing a big data cluster.
  • Portability: Running big data stacks on K8s allows you to run your applications on any cloud or on-premises infrastructure, making it easier to move your applications to new environments.
  • Improved Availability: K8s provides automatic failover and self-healing capabilities, reducing downtime and improving availability.
  • Simplified Management: K8s provides a centralized management interface for your big data stack, making deploying, managing, and monitoring your applications much easier.
  • Improved Security: K8s provides robust security features that can be used to secure your big data stacks, such as role-based access control, encryption, and network segmentation.

Overall, running big data applications on K8s can greatly improve the performance, scalability, and manageability of data analytics platforms. In this blog, you'll learn how to seamlessly deploy and integrate HDFS, Hive, and Spark on AWS EKS to maximize their potential.

Prerequisite

With the AWS EKS service, you can create a K8s cluster in just a few minutes, assuming you already have an AWS account. This video will guide you through the necessary preparation steps:

  1. Install command-line tools such as aws and kubectl on your laptop.
  2. Create an EKS cluster (the control plane).
  3. Use the kubectl command to access your cluster (a few of these commands are sketched after this list).
  4. Add worker nodes to your cluster.
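
For reference, here is a minimal sketch of the commands behind steps 1 and 3 (the last command will also show the worker nodes added in step 4); <your-region> and <your-cluster-name> are placeholders to replace with your own values.

aws --version
kubectl version --client

aws eks update-kubeconfig --region <your-region> --name <your-cluster-name>
kubectl get nodes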

Note that step 4 requires a few additional actions. First, add the following policies to the IAM role associated with the node group:

AmazonEC2ContainerRegistryReadOnly
AmazonEKSWorkerNodePolicy
AmazonEKS_CNI_Policy
AmazonEBSCSIDriverPolicy

While the recommended video suggests assigning the first three policies, it’s crucial to include the fourth one as well, as it enables pods to create persistent volumes dynamically.
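
If you prefer to manage IAM from the command line instead of the console, the same four policies can be attached to the node group role with the AWS CLI. This is only a sketch; <your-node-role-name> is a placeholder for the actual IAM role of your node group.

aws iam attach-role-policy --role-name <your-node-role-name> --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
aws iam attach-role-policy --role-name <your-node-role-name> --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
aws iam attach-role-policy --role-name <your-node-role-name> --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
aws iam attach-role-policy --role-name <your-node-role-name> --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy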

Sufficient resources are essential to run big data applications effectively. We suggest using a node group of four EC2 instances, each with at least 4 vCPUs, roughly 8 GB of memory, and 20 GiB of disk space (for example, instance type c4.xlarge).

Figure 1: Get more add-ons for EKS cluster

The final preparation step is to add the Amazon EBS CSI Driver as an add-on for your cluster. Simply click "Get more add-ons" on your cluster's homepage and select "Amazon EBS CSI Driver" on the following page.

Figure 2: Add CSI Driver for EKS cluster
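
The same add-on can also be enabled from the command line. A minimal sketch, assuming your cluster is named my-eks-cluster:

aws eks create-addon --cluster-name my-eks-cluster --addon-name aws-ebs-csi-driver

aws eks describe-addon --cluster-name my-eks-cluster --addon-name aws-ebs-csi-driver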

Start your big data journey on EKS

To run big data applications on EKS, we start our journey by downloading the deployment scripts from this project on GitHub. It enables developers to deploy an experimental big data platform with key components like HDFS, Hive, Spark, and Kafka on K8s.

curl -L -o big-data-on-k8s-Release-1.0.zip https://github.com/linktimecloud/big-data-on-k8s/archive/refs/tags/Release-1.0.zip

unzip big-data-on-k8s-Release-1.0.zip

cd big-data-on-k8s-Release-1.0

Deploy HDFS cluster

We first deploy an HDFS cluster with one namenode and one datanode. The Helm chart used here comes from the kubernetes-HDFS project but has been extended to support Kerberos authentication and to use persistent volumes for storing data blocks on the datanodes.

bash hdfs-on-k8s/deploy.sh

To verify that the HDFS cluster is deployed successfully, we run a port-forwarding command.

kubectl port-forward my-hdfs-namenode-0 50070:9870

Then we open a browser at http://127.0.0.1:50070/dfshealth.html#tab-datanode. We should see that the datanode is running normally.
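
As an alternative to the web UI, you can query the namenode directly for a datanode report; a quick check, assuming the hdfs client is on the container's PATH:

kubectl exec -it my-hdfs-namenode-0 -- hdfs dfsadmin -report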

For a highly available HDFS cluster with two namenodes, execute the deployment command after modifying the hdfs-on-k8s/charts/hdfs-k8s/values.yaml file as follows.

global:
  ......
  namenodeHAEnabled: true

tags:
  ha: true
  kerberos: false
  simple: false
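
After updating values.yaml, re-run the deployment script and confirm that two namenode pods come up (a quick check, assuming the chart's default pod naming):

bash hdfs-on-k8s/deploy.sh

kubectl get pods | grep namenode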

We can also modify the hdfs-on-k8s/charts/hdfs-k8s/values.yaml file as shown below to run an HDFS cluster with multiple datanodes (e.g., 3 datanodes).

hdfs-datanode-k8s:
  ......
  replicas: 3

global:
  ......
  dfsReplication: 3
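
After redeploying, you can confirm the effective replication factor from inside the namenode pod; again, this assumes the hdfs client is on the container's PATH:

kubectl exec my-hdfs-namenode-0 -- hdfs getconf -confKey dfs.replication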

Deploy Hive

We enhanced Apache Hive (version 4.0.0-SNAPSHOT) with the ability to execute Hive SQL on K8s and provide Docker images for deployment.

We first deploy MySQL and utilize it to store Hive metadata, followed by deploying Hive Server2 and Hive Metastore.

bash mysql-on-k8s/deploy.sh

bash hive-on-k8s/deploy.sh

To verify that the Hive service is deployed successfully, we open a shell in the pod linktime-hms-0.

kubectl exec --stdin --tty linktime-hms-0 -- /bin/bash

Then start a beeline client in the shell.

/opt/hive/bin/beeline -u 'jdbc:hive2://linktime-hs2-0.linktime-hs2.default.svc.cluster.local:10000/;'

In beeline client, we run the following statements:

create table if not exists student(id int, name string) partitioned by(month string, day string);

set hive.spark.client.server.connect.timeout=270000ms;

insert into table student partition(month="202003", day="13")
values (1, "student1"), (2, "student2"), (3, "student3"), (4, "student4"), (5, "student5");

select * from student;
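
Optionally, we can also confirm that the partition was created with standard HiveQL; it should list a partition like month=202003/day=13.

show partitions student;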

If everything is OK, the select statement returns the five rows we inserted. To exit beeline, we type "!q". Finally, we exit the shell by typing "exit".

Deploy Spark

The Spark operator used in this deployment is an enhanced version of GCP's Spark on K8s operator, with added data security features such as authentication and authorization.

To deploy the Spark operator, we simply run the following command.

bash spark-on-k8s/deploy.sh

To verify that the Spark operator is working properly, we first copy two files into the pod linktime-hms-0.

kubectl cp spark-on-k8s/demo.py linktime-hms-0:/hms/.

kubectl cp spark-on-k8s/spark-submit.sh linktime-hms-0:/hms/.

Then we get into the shell of pod linktime-hms-0.

kubectl exec --stdin --tty linktime-hms-0 -- /bin/bash

We run the following commands in the shell to submit a Spark application that executes Spark SQL.

/opt/hadoop/bin/hdfs dfs -mkdir /upload

/opt/hadoop/bin/hdfs dfs -put demo.py /upload/.

bash spark-submit.sh
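
The operator turns this submission into a SparkApplication custom resource, which you can list to confirm the job was accepted; this assumes the CRD name used by the upstream GCP operator:

kubectl get sparkapplications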

To see whether the Spark application has started, we first find its driver pod name.

kubectl get pods | grep spark-schedule-driver

If this pod is running, then we do port forwarding on it.

kubectl port-forward sparkapplication-xxxxxx-spark-schedule-driver 54040:4040

After the port forwarding is in place, we open a browser at http://localhost:54040 to check the status of the Spark application.
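
You can also follow the driver logs to watch the SQL job progress, using the same driver pod name found above:

kubectl logs -f sparkapplication-xxxxxx-spark-schedule-driver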

More

The project demonstrated here is a simplified version of Kubernetes Data Platform (KDP), a production-ready data platform that provides a unified way to deploy, manage, and operate big data applications, including HDFS, Hive, Spark, Flink, Kafka, and MinIO, on K8s. If you're interested in learning more about KDP, please refer to this blog.


Written by Wenxin Song

Developer, Passionate about K8s, Big Data, and Gen AI.
