In this blog, we are going to cover an introduction to PySpark, the uses of PySpark, the features of PySpark, and how to install and run PySpark in a Jupyter Notebook on Windows.
Topics we’ll cover:
- Introduction to PySpark
- Uses of PySpark
- Why Python for PySpark
- Features of PySpark
- Installing and running PySpark in a Jupyter Notebook on Windows
Introduction To PySpark
PySpark provides an interface to Apache Spark in Python. It not only lets you write Spark applications using Python APIs, but also provides the PySpark shell for interactively analyzing your data in a distributed environment. Using PySpark, we can run applications in parallel on a distributed cluster.
Spark itself is written in Scala, but due to industry adoption its Python API, PySpark, was later released using Py4J. To run PySpark, you also need Java installed along with Python and Apache Spark.
Spark can run operations on very large datasets across distributed clusters up to 100 times faster than traditional Python applications.
Uses Of PySpark
PySpark is mainly used for processing structured and semi-structured datasets. It also provides an optimized API that can read data from different data sources in different file formats, and you can process the data using SQL as well as Hive Query Language (HQL), as sketched below.
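For instance, here is a minimal sketch of reading a few common formats and querying them with SQL; the file paths and column names are illustrative placeholders, not examples from the original post:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataSourcesDemo").getOrCreate()

# Read structured and semi-structured data in different formats
csv_df = spark.read.option("header", True).csv("data/sales.csv")
json_df = spark.read.json("data/events.json")
parquet_df = spark.read.parquet("data/users.parquet")

# Register a DataFrame as a temporary view and query it with SQL
csv_df.createOrReplaceTempView("sales")
totals = spark.sql("SELECT product, SUM(amount) AS total FROM sales GROUP BY product")
totals.show()
```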
Why Python For PySpark
Python is an interpreted, high-level, general-purpose programming language with dynamic semantics, and it is considered one of the easiest programming languages for a beginner to learn.
- Python is easy to learn and implement.
- It provides a simple API.
- It provides various options for data visualization, which is difficult in Scala or Java.
- It is backed up by a huge and active community.
Features of PySpark
- Provides built-in optimization when using DataFrames
- Can be used with many cluster managers, such as Spark Standalone, YARN, Mesos, and Kubernetes
- In-memory computation
- Fault tolerance
- Immutability
- Cache and persistence (see the sketch after this list)
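As a small illustration of cache and persistence, here is a minimal sketch; the file paths are placeholders:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CacheDemo").getOrCreate()

df = spark.read.parquet("data/users.parquet")
df.cache()   # keep the DataFrame in memory after the first action computes it
df.count()   # first action materializes the cache
df.count()   # later actions reuse the cached data

# persist() lets you pick a storage level explicitly
events = spark.read.json("data/events.json")
events.persist(StorageLevel.MEMORY_AND_DISK)
```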
PySpark Architecture
Apache Spark works in a master-slave architecture where the master is called the “Driver” and the slaves are called “Workers”. When you run a Spark application, the Spark Driver creates a context that is the entry point to your application; all operations (transformations and actions) are executed on worker nodes, and the resources are managed by the Cluster Manager.
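As a small illustration, the sketch below creates the entry point in the Driver and runs a transformation followed by an action; the local master URL is just a placeholder for a real cluster URL:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")          # all local cores; a cluster URL would go here
    .appName("ArchitectureDemo")
    .getOrCreate()
)

# Transformations are lazy and only build the execution plan
rdd = spark.sparkContext.parallelize(range(10))
squared = rdd.map(lambda x: x * x)

# Actions trigger execution on the worker nodes
print(squared.collect())

spark.stop()
```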
Install And Run PySpark In Jupyter Notebook On Windows
A. Items needed
1. First of all, download the Spark distribution from spark.apache.org.
2. Download Python and Jupyter Notebook. You can get both by installing the Python 3.x version of the Anaconda distribution from https://www.anaconda.com/products/individual
3. winutils.exe (a Hadoop binary for Windows) from Steve Loughran’s GitHub repo. Go to the folder for the Hadoop version matching your Spark distribution and find winutils.exe under /bin. For example, https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe
4. The findspark Python module, which can be installed by running python -m pip install findspark in either the Windows command prompt or Git Bash (assuming the Python from item 2 is installed). You can find the command prompt by searching for cmd in the search box.
5. If you don’t have Java, or your Java version is 7.x or lower, download and install Java from Oracle. I recommend getting the latest JDK from https://www.oracle.com/java/technologies/downloads/#jdk17-windows
B. Installing PySpark
1. Unpack the downloaded .tgz file (for example, to D:\spark\spark-2.2.1-bin-hadoop2.7).
2. Move the winutils.exe downloaded in step A3 to the \bin folder of the Spark distribution. For example, D:\spark\spark-2.2.1-bin-hadoop2.7\bin\winutils.exe
3. Add environment variables: these let Windows find the files when we start the PySpark kernel, typically SPARK_HOME pointing at the unpacked Spark folder and HADOOP_HOME pointing at the folder that contains \bin\winutils.exe. You can find the environment variable settings by putting “environment variable” in the search box.
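As an alternative sketch, you can also set these variables for the current session from Python before initializing findspark; the paths below assume the example layout from step B2:

```python
import os

# Set for this Python session only; use the Windows dialog for a permanent setting
os.environ["SPARK_HOME"] = r"D:\spark\spark-2.2.1-bin-hadoop2.7"
os.environ["HADOOP_HOME"] = r"D:\spark\spark-2.2.1-bin-hadoop2.7"  # folder containing \bin\winutils.exe
```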
4. In the same environment variable settings window, look for the Path or PATH variable, click Edit, and add D:\spark\spark-2.2.1-bin-hadoop2.7\bin to it. In Windows 7, you need to separate the values in Path with semicolons.
C. Running PySpark in Jupyter Notebook
1. To run Jupyter Notebook, open the Windows command prompt or Git Bash and run jupyter notebook. Once inside Jupyter, open a Python 3 notebook.
2. In the notebook, run a short snippet like the one below to check your PySpark installation.
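A minimal verification sketch, assuming findspark from step A4 is installed and SPARK_HOME is set as in step B3:

```python
import findspark
findspark.init()   # locates Spark using SPARK_HOME

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("InstallCheck").getOrCreate()
print(spark.version)   # prints the Spark version if everything is wired up

# A tiny DataFrame round-trip as a smoke test
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()
```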
3. When you run the cell, it might trigger a Windows Firewall pop-up. You can press Cancel on the pop-up, as blocking the connection doesn’t affect PySpark.
4. When the cell runs without errors and prints the Spark version and the small DataFrame, you have successfully installed PySpark on your Windows system.
Next Task For You
Interested in increasing your knowledge of the Big Data landscape? This course is for those new to data science who want to understand how the Big Data era came to be. If you want to begin your journey towards becoming a Big Data Engineer, register for our FREE CLASS.