In this blog, we are going to cover Apache Spark Architecture Fundamentals. Apache Spark is an open-source computing framework and unified analytics engine for large-scale data processing and machine learning.
Apache Spark is used not only for real-time processing but also for batch processing, and it can deliver up to a hundred times faster performance than the traditional MapReduce that ships with Hadoop.
Topics we’ll cover:
- What Are Spark Architecture Fundamentals
- What is Azure Databricks Spark Cluster
- How To Create a Cluster
- Understanding the Architecture of a Spark Job
- Jobs and Stages in Spark Architecture
- Cluster Management in Spark Architecture
What Are Spark Architecture Fundamentals
Apache Spark is a parallel processing framework that supports in-memory processing to boost the performance of big-data analytics applications, and the resources of the Apache Spark core engine are managed by YARN. In simple words, Spark Architecture is known for its speed and efficiency. The key features it possesses are:
- It processes data in memory, which saves time on read and write operations.
- It supports real-time streaming, which makes it a widely used technology in today's big data world.
NOTE: YARN is a resource manager that allocates and manages resources for your Spark applications.
NOTE: Different libraries sit on top of the Spark Core engine, and you add them as your analytics require. Some examples are Spark SQL, Spark MLlib, and GraphX.
Spark Architecture is based on two important abstractions: the Resilient Distributed Dataset (RDD) and the Directed Acyclic Graph (DAG).
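To make these two abstractions concrete, here is a minimal PySpark sketch (assuming a local Spark installation; the app name is just a placeholder). Transformations on an RDD are lazy and only build up the DAG; the DAG is executed when an action such as collect() is called.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-dag-demo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 11))         # create an RDD from a Python range
squares = numbers.map(lambda x: x * x)         # transformation: recorded in the DAG, nothing runs yet
evens = squares.filter(lambda x: x % 2 == 0)   # another transformation: extends the DAG

print(evens.collect())                         # action: Spark now schedules the DAG and returns results

spark.stop()
```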
Also Read: Our blog post on Azure Synapse.
What Is Azure Databricks Spark Cluster
Azure Databricks is an Apache Spark-based big-data analytics platform optimized for the Microsoft Azure cloud. It launches and manages Apache Spark clusters in your Azure subscription and provides automated cluster management. Combining Databricks with the Microsoft cloud service, i.e. Microsoft Azure, gives us the Azure Databricks service. An Apache Spark cluster is a group of computers that behaves as a single computer and handles the execution of the commands issued from notebooks. Apache Spark uses a master-worker architecture, with a master process and worker processes that run across multiple machines.
On the master node, you have the driver program that drives the application, so the code you write behaves as the driver program. Inside the driver program, the first thing we do is create a Spark context. Think of the Spark context as a gateway to all Spark functionality, much like a database connection. Worker nodes are the slave nodes whose job is to execute the tasks.
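As a hedged sketch of what driver-program code looks like: the script itself acts as the driver, and the SparkContext it creates is the gateway through which work is sent to the worker nodes. The app name, master URL, and input path below are placeholders.

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("my-app").setMaster("local[2]")  # driver configuration (placeholder values)
sc = SparkContext(conf=conf)                                   # the gateway to all Spark functionality

# The driver ships this computation to the workers as tasks.
word_counts = (sc.textFile("data.txt")                         # placeholder input path
                 .flatMap(lambda line: line.split())
                 .map(lambda w: (w, 1))
                 .reduceByKey(lambda a, b: a + b))

print(word_counts.take(5))                                     # action: the driver collects a few results
sc.stop()
```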
Also Check: Our blog post on Azure Data Lake.
How To Create Cluster
Now let's have a look at what happens after you create the Azure Databricks service: a Databricks appliance is deployed in your subscription as an Azure resource. You specify the types and sizes of the virtual machines (VMs) to use for both the worker nodes and the driver node at cluster creation time, but Azure Databricks manages all other aspects of the cluster.
Internally, Azure Kubernetes Service (AKS), a fully managed service that offers a serverless Kubernetes experience with integrated continuous integration and deployment, is used to run the Azure Databricks control plane and data plane through containers running on the latest generation of Azure hardware. Once the services in the managed resource group are ready, you can manage the Databricks cluster through the Azure Databricks UI, with auto-scaling and auto-termination handled for you.
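Besides the portal UI, a cluster can also be created programmatically. Below is a hedged Python sketch using the Databricks Clusters REST API; the workspace URL, token, runtime version, and VM sizes are placeholders you would replace with values available in your own subscription.

```python
import requests

workspace_url = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace URL
token = "dapiXXXXXXXXXXXXXXXX"                                        # placeholder personal access token

cluster_spec = {
    "cluster_name": "demo-cluster",
    "spark_version": "13.3.x-scala2.12",                 # a Databricks runtime version; pick one available to you
    "node_type_id": "Standard_DS3_v2",                    # worker VM size, chosen at creation time
    "driver_node_type_id": "Standard_DS3_v2",             # driver VM size
    "autoscale": {"min_workers": 2, "max_workers": 8},    # Azure Databricks handles auto-scaling
    "autotermination_minutes": 30,                        # auto-termination when the cluster sits idle
}

resp = requests.post(
    f"{workspace_url}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
print(resp.json())  # returns the new cluster_id on success
```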
Also Check: Our blog post on Azure Data Factory Interview Questions.
Architecture Of Spark Job
Before we dive deeper, let us summarize the fundamentals of Spark Architecture.
As we know, Spark is a distributed computing environment, and the unit of distribution is a Spark cluster. Every cluster has one driver and one or more executors. When work is submitted to the cluster, it is split into as many independent jobs as needed; this is how work is distributed across the cluster's nodes. Each job is then subdivided into tasks, and the data it operates on is divided into one or more partitions. These partitions are the unit of work for each slot, so partitions may need to be shared over the network.
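A small sketch of partitions as the unit of work, assuming a SparkSession named `spark` already exists (as it does in a Databricks notebook):

```python
# 8 partitions -> up to 8 tasks can run in parallel, one per slot
rdd = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)
print(rdd.getNumPartitions())

# More partitions mean more (smaller) tasks, but repartitioning moves data over the network
repartitioned = rdd.repartition(16)
print(repartitioned.getNumPartitions())
```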
Components Of Spark Architecture
In the above diagram, each executor has two slots available. The Driver is the Java Virtual Machine (JVM) in which the application runs. Parallelization happens at two levels: the executor and the slot. Each executor node has slots, which are logical execution units for running tasks in parallel. When work is to be executed, the driver sends tasks to the respective slots on the executors.
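As a hedged illustration of where the slot count comes from: each executor is given a number of cores, and executors multiplied by cores is the number of tasks that can run at once. The values below are illustrative; on Azure Databricks the cluster configuration sets them for you, and `spark.executor.instances` is honored on resource managers such as YARN rather than in local mode.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("slots-demo")
         .config("spark.executor.instances", "2")   # 2 executors
         .config("spark.executor.cores", "2")       # 2 cores (slots) per executor
         .getOrCreate())

# With this configuration the driver can schedule up to 2 * 2 = 4 tasks in parallel.
```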
1.) Jobs And Stages In Spark Architecture
A job is a parallel computation consisting of multiple tasks; each distributed action triggers a job. A stage is a group of tasks that execute the same computation in parallel on multiple executors. Each job gets broken into sets of tasks (stages) that depend on each other.
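Here is a short sketch (again assuming an existing SparkSession `spark`) of how one action becomes one job with multiple stages: reduceByKey needs a shuffle, so Spark splits the work into a stage before the shuffle and a stage after it, which you can see in the Spark UI for the job triggered by collect().

```python
pairs = spark.sparkContext.parallelize([("a", 1), ("b", 1), ("a", 1)], 4)
counts = pairs.reduceByKey(lambda x, y: x + y)   # shuffle boundary -> a new stage
result = counts.collect()                        # action -> one job, two stages
print(result)
```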
2.) Cluster Management In Spark Architecture
The cluster manager is the component that provides resources to all the worker nodes as per their needs and coordinates those nodes accordingly; in other words, the cluster manager is the mode in which we run Spark.
Apache Spark supports three types of cluster managers (a short sketch of selecting one follows the list below):
- Standalone Cluster Manager
- Hadoop YARN
- Apache Mesos
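A hedged sketch of pointing the same application at each of the cluster managers listed above by changing only the master URL; the hosts and ports are placeholders.

```python
from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("cluster-manager-demo")

# Standalone cluster manager
# spark = builder.master("spark://master-host:7077").getOrCreate()

# Hadoop YARN (resource manager details come from the Hadoop configuration on the machine)
# spark = builder.master("yarn").getOrCreate()

# Apache Mesos
# spark = builder.master("mesos://mesos-master:5050").getOrCreate()
```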
Azure Databricks builds on the capabilities of Spark by providing fully managed Spark clusters and a platform for powering Spark-based applications.
Related/References
- Microsoft Certified Azure Data Engineer Associate | DP 203 | Step By Step Activity Guides (Hands-On Labs)
- Exam DP-203: Data Engineering on Microsoft Azure
- Azure Data Lake For Beginners: All you Need To Know
Next Task For You
In our Azure Data Engineer training program, we cover 28 hands-on labs. If you want to begin your journey towards becoming a Microsoft Certified: Azure Data Engineer Associate, check out our FREE CLASS.