In this post, we cover what Apache Spark is, its features, its components and architecture, and much more.
If you are just starting out in Big Data & Hadoop, then I highly recommend you go through the posts below:
- Big Data & Hadoop key points & things you must know to start learning Big Data & Hadoop, check here
- Big Data & Hadoop overview, concepts, and architecture, including the Hadoop Distributed File System (HDFS), check here
Also, please check my previous blog on Apache Spark vs. Hadoop MapReduce.
What is Apache Spark?
Apache Spark is a fast, in-memory big data processing engine with built-in machine learning capabilities that can run workloads up to 100 times faster than Apache Hadoop MapReduce. It is a unified engine built around the concept of ease of use.
Before we learn about Apache Spark, its use cases, or how to use it, let's look at the reasons behind its invention.
- Exploding data: We are aware that today huge amounts of data are generated everywhere from various sources. This data is either stored intentionally in a structured way or generated by machines. But data is of no use until we mine it and perform some kind of analysis on it, in order to come up with actions based on the analysis outcomes. The act of gathering and storing information for eventual analysis is ages old, but it has never involved data volumes as large as those we see today. There is a specific term for such voluminous data: "Big Data".
Big data is a term that describes huge volumes of data, which can be structured, semi-structured, or unstructured.
This problem can be solved by a framework that not only provides a way to store all kinds of data (structured, semi-structured, or unstructured) but also an efficient way of analyzing it according to business needs. One such widely used framework is Hadoop. But Hadoop has several limitations, which is why Apache Spark was created.
- Data manipulation speed: Now we have a solution for storage and an efficient way of analyzing data of any size, which means we can make business decisions after analyzing data. But there is another challenge: a decision based on insight from a huge dataset might no longer be relevant by the time the analysis completes. In such cases the insight is of no use, because the deadline for action has passed.
Processing data at a larger scale with Hadoop's processing framework, MapReduce (MR), is far better than with traditional systems, but it is still not fast enough for organizations to make all their decisions on time, because Hadoop relies on batch processing of data, which leads to high latency.
Several other shortcomings of Hadoop are:
- Strict adherence to its MapReduce programming model
- Limited programming language API options
- Not a good fit for iterative algorithms, such as machine learning algorithms
- Pipelining of tasks is not easy
Apache Spark was developed as a solution to the above-mentioned limitations of Hadoop. Spark has several advantages compared to other big data and MapReduce technologies like Hadoop:
- Spark is faster than MapReduce and offers lower latency due to reduced disk input and output operations.
- Spark is capable of in-memory computation, which makes data processing much faster than disk-based MapReduce.
- Unlike Hadoop, Spark keeps intermediate results in memory rather than writing every intermediate output to disk.
- This hugely cuts down the execution time of an operation, making a task run up to 100x faster than a standard MapReduce job.
Apache Spark can also hold data on disk. When data crosses the threshold of available memory, it is spilled to disk. In this way Spark acts as an extension of MapReduce rather than a pure in-memory replacement.
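As a minimal sketch of this spill behaviour (the input path here is hypothetical), the snippet below persists an RDD with the MEMORY_AND_DISK storage level, so partitions that no longer fit in memory are spilled to disk:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object SpillSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("spill-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // MEMORY_AND_DISK keeps partitions in memory and spills to disk
    // only when they no longer fit, matching the behaviour described above.
    val data = sc.textFile("hdfs:///data/events.log") // hypothetical path
      .persist(StorageLevel.MEMORY_AND_DISK)

    println(data.count())                              // first action materializes the cache
    println(data.filter(_.contains("ERROR")).count()) // reuses the cached partitions

    spark.stop()
  }
}
```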
Apache Spark has other features, such as:
- It leverages distributed cluster memory for computations, increasing speed and data-processing throughput.
- Spark enables applications in Hadoop clusters to run up to 100 times faster in memory and up to 10 times faster even when running on disk.
- It is well suited to real-time decision making with big data.
- It runs on top of an existing Hadoop cluster and can access the Hadoop data store (HDFS); it can also process data stored in HBase.
- Apache Spark can be integrated with various data sources such as SQL and NoSQL databases, Amazon S3, HDFS, the local file system, and so on (a short sketch follows this list).
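Here is a minimal sketch of reading from several of these sources. All paths and the bucket name are hypothetical, and the S3 read additionally assumes the hadoop-aws module and credentials are configured:

```scala
import org.apache.spark.sql.SparkSession

object SourcesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("sources-sketch").master("local[*]").getOrCreate()

    // Local file system (hypothetical path)
    val localCsv = spark.read.option("header", "true").csv("file:///tmp/input.csv")

    // HDFS (hypothetical path)
    val hdfsJson = spark.read.json("hdfs:///data/records.json")

    // Amazon S3 via the s3a connector (hypothetical bucket; needs hadoop-aws + credentials)
    val s3Parquet = spark.read.parquet("s3a://my-bucket/events/")

    localCsv.show(5)
    spark.stop()
  }
}
```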
Hadoop and Apache Spark
Hadoop as a big data processing technology has proven to be the go-to solution for processing large datasets. MapReduce is a great solution for computations that need only one pass over the data, but it is not very efficient for use cases that require multi-pass computations and algorithms.
Each stage in the data processing workflow has one Map phase and one Reduce phase, and to leverage MapReduce we have to convert our use case into this pattern. Each job's output data has to be stored in the file system before the next step can begin, so this approach is slow, due to replication and disk input/output operations.
If you want to run an iterative job, you have to stitch together a sequence of MapReduce jobs and execute them in order. Each of those jobs has high latency, and each depends on the completion of the previous stage. Spark avoids this by keeping the working set in memory between passes, as the sketch below shows.
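To see the contrast, here is a minimal, illustrative sketch of an iterative job in Spark: a toy gradient descent that converges to the mean of some made-up numbers. The input is cached once, and every pass reads it from memory instead of re-reading it from disk the way a chain of MapReduce jobs would:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("iterative-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Cache the dataset so each iteration is an in-memory job,
    // not a full disk round-trip between chained MapReduce jobs.
    val points = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0)).cache()

    var estimate = 0.0
    for (_ <- 1 to 20) {
      // Gradient of the squared error; each pass reuses the cached RDD
      val gradient = points.map(p => estimate - p).mean()
      estimate -= 0.5 * gradient
    }
    println(s"Converged estimate (the mean): $estimate")

    sc.stop()
  }
}
```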
Spark can run on top of the Hadoop Distributed File System (HDFS) to leverage its distributed, replicated storage.
Spark can be used alongside MapReduce in the same Hadoop cluster or on its own as a processing framework. Apache Spark is an alternative to Hadoop MapReduce rather than a replacement for Hadoop.
Apache Spark Components and Architecture
SparkContext is the entry point through which a Spark application runs over a cluster. It gives you a handle to the distributed cluster so that your job can use the resources of the distributed machines. The application program that creates the SparkContext object is known as the driver program. To run on a cluster, the SparkContext connects to one of several types of cluster managers (Spark's own standalone cluster manager, Apache Mesos, or Hadoop's YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster; these are processes that run computations and store data for your application. Next, it sends your application code to the executors. Finally, the SparkContext sends tasks to the executors to run.
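As a minimal sketch of a driver program (using a local master as a stand-in for a real cluster manager such as YARN or Mesos), the code below creates a SparkContext and runs a trivial job whose tasks are executed by the executors:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverSketch {
  def main(args: Array[String]): Unit = {
    // Creating the SparkContext connects the driver to the cluster manager;
    // "local[*]" is a stand-in here, on a real cluster this might be "yarn".
    val conf = new SparkConf().setAppName("driver-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // This computation is shipped as tasks to the executors
    val total = sc.parallelize(1 to 1000).sum()
    println(s"Sum: $total")

    sc.stop()
  }
}
```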
Following are the most important takeaways of the architecture:
- Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This makes each application independent of the others, on both the scheduling side (each driver schedules its own tasks) and the executor side (tasks from different applications run in different JVMs).
- Spark is independent of the cluster manager, which means it can be coupled with any supported cluster manager and leverage that cluster's resources.
- Because the driver schedules tasks on the cluster, it should run as close to the worker nodes as possible.
Spark Ecosystem Components
Spark Core is the base of the overall Spark project. It is responsible for distributed task dispatching, parallelism, scheduling, and basic I/O functionality.
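For a feel of the core API, here is a minimal sketch built directly on Spark Core's RDD abstraction. The transformations (filter, map) are lazy; the final action (reduce) is what triggers task dispatch:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoreSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("core-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Lazy transformations: these only describe the computation
    val squaresOfEvens = sc.parallelize(1 to 100)
      .filter(_ % 2 == 0)
      .map(n => n * n)

    // The action triggers scheduling and execution on the executors
    val sum = squaresOfEvens.reduce(_ + _)
    println(s"Sum of squares of evens: $sum")

    sc.stop()
  }
}
```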
Besides the Spark Core API, there are additional useful and powerful libraries in the Spark ecosystem that add capabilities in the big data analytics and machine learning areas. These libraries include:
- Spark Streaming
Spark Streaming is a useful addition to the core Spark API. It is used for processing real-time streaming data.
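Here is a minimal sketch of the classic streaming word count, assuming a plain-text source on localhost port 9999 (which you could simulate with `nc -lk 9999`):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    // At least two local threads: one to receive data, one to process it
    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches

    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print() // print each batch's word counts

    ssc.start()
    ssc.awaitTermination()
  }
}
```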
Spark SQL, DataFrames, and Datasets:
- SQL: Spark SQL exposes Spark APIs for running SQL-like computations on large data. A Spark user can perform ad-hoc queries and near-real-time ETL on different types of data sources (such as JSON, Parquet, or databases).
- DataFrames: A DataFrame can be considered a distributed collection of data organized into named columns.
- Datasets: A Dataset is a newer addition to the list of Spark libraries. It is an experimental interface added in Spark 1.6 that tries to combine the benefits of RDDs with the benefits of Spark SQL's optimized execution engine. (A short sketch of all three follows this list.)
- Spark MLlib and ML: MLlib is a collection of handy and useful machine learning algorithms and data cleaning and processing approaches, including classification, clustering, regression, feature extraction, dimensionality reduction, etc., as well as underlying optimization primitives such as SGD and L-BFGS.
- Spark GraphX: GraphX is the Spark API for graphs and graph-parallel computation. GraphX extends the Spark RDD abstraction by introducing the Resilient Distributed Property Graph (sketched below).
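To make the Spark SQL pieces concrete, here is a minimal sketch (with made-up names and rows) that builds a DataFrame, runs an ad-hoc SQL query against it, and then views the same data as a typed Dataset:

```scala
import org.apache.spark.sql.SparkSession

object SqlSketch {
  // Case class backing the typed Dataset view
  case class Person(name: String, age: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("sql-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // DataFrame: distributed data organized into named columns
    val df = Seq(("Alice", 34L), ("Bob", 45L)).toDF("name", "age")

    // Ad-hoc SQL query against a temporary view
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 40").show()

    // Dataset: a typed view of the same data
    val ds = df.as[Person]
    ds.filter(_.age > 40).show()

    spark.stop()
  }
}
```

And here is a minimal GraphX sketch (hypothetical vertices and edges; requires the spark-graphx dependency) that builds a Resilient Distributed Property Graph from two RDDs:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.graphx.{Edge, Graph}

object GraphXSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("graphx-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Vertex property: a name; edge property: a relationship label
    val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

    // The property graph carries data on both vertices and edges
    val graph = Graph(vertices, edges)
    println(s"Vertices: ${graph.numVertices}, edges: ${graph.numEdges}")

    spark.stop()
  }
}
```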
You will get to know all of this, and deep-dive into each concept related to Big Data & Hadoop, once you enroll in our Big Data Hadoop Administration Training.
Another question that might come to your mind: what are all the things you will get when you enroll?
We are glad to tell you that:
Things you will get!!
- Live Instructor-led Online Interactive Sessions
- FREE unlimited retakes for the next 1 year
- FREE on-job support for the next 1 year
- Training Material (Presentation + Step by Step Hands-on Guide)
- Recording of Live Interactive Session for Lifetime Access
- 100% Money Back Guarantee (if you attend sessions, practice, and don't get results, we'll issue a full REFUND; check our Refund Policy)
If you are looking for commonly asked interview questions for Big Data Hadoop Administration, then just click below to get them in your inbox, or join our private Facebook group dedicated to Big Data Hadoop members only.