Jupyter is one of the most powerful tools for development. However, it does not support Spark development out of the box, and Python developers are often forced to switch to Scala to write Spark code. This article aims to simplify that and enable users to develop Spark code in Jupyter itself, with the help of PySpark. Follow the steps below to get this working and enjoy the power of Spark from the comfort of Jupyter. This exercise takes approximately 30 minutes.
PySpark requires Java version 7 or later and Python version 2.6 or later.
1. Install Java
Java is used by many other applications, so it is quite possible that a required version (in our case, version 7 or later) is already available on your computer. To check whether Java is available and find its version, open a Command Prompt and type the following command.

java -version
If Java is installed and configured to work from a Command Prompt, running the above command should print the information about the Java version to the console. For example, I got the following output on my laptop.
java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode)
If instead you get a message like
'java' is not recognized as an internal or external command, operable program or batch file.
it means you need to install Java. Please reach out to your IT team to get it installed.
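If you want to automate this check, here is a minimal Python sketch that parses the first line of the java -version output (which Java prints to stderr). The exact output format varies between Java vendors and versions, so treat this as an illustration, not a robust detector.

```python
import re
import subprocess

def java_major_version(version_line):
    """Extract the major Java version from a line like: java version "1.8.0_92"."""
    match = re.search(r'"(\d+)\.(\d+)', version_line)
    if not match:
        return None
    major, minor = int(match.group(1)), int(match.group(2))
    # Pre-Java-9 releases report as 1.x (e.g. "1.8.0_92" is Java 8);
    # Java 9 and later report the major version directly (e.g. "11.0.2").
    return minor if major == 1 else major

def installed_java_version():
    """Run `java -version` and parse its first output line."""
    try:
        out = subprocess.run(["java", "-version"],
                             capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None  # Java is missing or not on the PATH
    return java_major_version(out.stderr.splitlines()[0])
```

For example, java_major_version('java version "1.8.0_92"') returns 8, which satisfies the "version 7 or later" requirement mentioned above.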
2. Install Anaconda (for Python)
To check if Python is available, open a Command Prompt and type the following command.

python --version
If Python is installed and configured to work from a Command Prompt, running the above command should print the information about the Python version to the console. For example, I got the following output on my laptop.
Python 3.6.5 :: Anaconda, Inc.
If instead you get a message like
'python' is not recognized as an internal or external command, operable program or batch file.
it means you need to install Python. Please install Anaconda, which will install Python along with all the necessary packages.
After the installation is complete, close the Command Prompt if it was already open, reopen it, and check that you can successfully run the python --version command.
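You can also confirm the interpreter version from inside Python itself, which is handy once you are in a notebook. A quick sketch against the "Python 2.6 or later" requirement stated above:

```python
import sys

def python_ok(version_info=sys.version_info):
    """Return True if this interpreter meets PySpark's Python 2.6+ requirement."""
    return version_info >= (2, 6)

# On a recent Anaconda install this reports a modern Python 3.
print(sys.version)
print(python_ok())
```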
3. Install Apache Spark
1. Go to the Spark download page.
2. For Choose a Spark release, select the latest stable release (2.4.0 as of 13-Dec-2018) of Spark.
3. For Choose a package type, select a version that is pre-built for the latest version of Hadoop such as Pre-built for Hadoop 2.7 and later.
4. Click the link next to Download Spark to download the spark-2.4.0-bin-hadoop2.7.tgz file.
5. In order to install Apache Spark, there is no need to run any installer. You can extract the files from the downloaded archive using a tool such as WinZip or 7-Zip (right-click the downloaded file and click Extract Here).
6. Make sure that the folder path and the folder name containing Spark files do not contain any spaces.
Now, create a folder called "Spark" on your desktop and unzip the file that you downloaded into it as a folder called spark-2.4.0-bin-hadoop2.7. All Spark files will then be in a folder called C:\Users\<your_user_name>\Desktop\Spark\spark-2.4.0-bin-hadoop2.7. From now on, we shall refer to this folder as SPARK_HOME in this document.
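The "no spaces" rule above is easy to get wrong (for example, C:\Program Files contains a space), so here is a tiny sketch that checks a candidate SPARK_HOME path. The example paths are hypothetical.

```python
def path_has_spaces(path):
    """True if any part of the path contains a space, which breaks Spark on Windows."""
    return " " in path

# Hypothetical example paths:
print(path_has_spaces(r"C:\Users\alice\Desktop\Spark\spark-2.4.0-bin-hadoop2.7"))  # False: safe
print(path_has_spaces(r"C:\Program Files\Spark"))  # True: avoid this location
```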
To test whether your installation was successful, open an Anaconda Prompt, change to the SPARK_HOME directory and type bin\pyspark. This should start the PySpark shell, which can be used to work interactively with Spark. You should see messages like the following in the console after running the bin\pyspark command.
Type sc.version in the shell. It should print the version of Spark. You can exit from the PySpark shell the same way you exit from any Python shell, by typing exit().
4. Install winutils.exe
Let's download winutils.exe and configure our Spark installation to find it.
1. Create a hadoop\bin folder inside the SPARK_HOME folder which we already created in Step 3 above.
2. Download the winutils.exe built for the version of Hadoop against which your Spark installation was built. Download the winutils.exe for Hadoop 2.7.1 (in this case) and copy it to the hadoop\bin folder in the SPARK_HOME folder.
Note: Steps 3 and 4 below require admin access.
3. Create a system environment variable in Windows called SPARK_HOME that points to the SPARK_HOME folder path. This needs admin access, so if you don't have it, please get this done with the help of your IT support team.
4. Create another system environment variable in Windows called HADOOP_HOME that points to the hadoop folder inside the SPARK_HOME folder. Since the hadoop folder is inside the SPARK_HOME folder, it is better to create the HADOOP_HOME environment variable with a value of %SPARK_HOME%\hadoop. That way you don't have to change HADOOP_HOME whenever SPARK_HOME is updated.
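Once both variables are set, you can sanity-check them from Python before launching Jupyter. A minimal sketch, assuming the layout described above (hadoop inside SPARK_HOME, winutils.exe in hadoop\bin):

```python
import os

def check_spark_env(env=os.environ):
    """Return a list of problems found with the SPARK_HOME/HADOOP_HOME setup."""
    problems = []
    spark_home = env.get("SPARK_HOME")
    hadoop_home = env.get("HADOOP_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    if not hadoop_home:
        problems.append("HADOOP_HOME is not set")
    if spark_home and hadoop_home:
        # HADOOP_HOME should resolve to the hadoop folder inside SPARK_HOME,
        # and winutils.exe should live in its bin subfolder.
        expected = os.path.join(spark_home, "hadoop")
        if os.path.normcase(hadoop_home) != os.path.normcase(expected):
            problems.append("HADOOP_HOME does not point to the hadoop folder inside SPARK_HOME")
        if not os.path.isfile(os.path.join(hadoop_home, "bin", "winutils.exe")):
            problems.append("winutils.exe not found in HADOOP_HOME bin folder")
    return problems

print(check_spark_env())  # an empty list means everything is in place
```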
5. Using Spark from Jupyter
1. Click on Windows and search for "Anaconda Prompt". Open the Anaconda Prompt and type "python -m pip install findspark". This package is necessary to run Spark from a Jupyter notebook.
2. Now, from the same Anaconda Prompt, type "jupyter notebook" and hit Enter. This opens a Jupyter notebook in your browser. From the Jupyter notebook page, select New > Python 3.
3. Upon selecting Python 3, a new notebook opens which we can use to run Spark with PySpark. In the notebook, run the code below to verify that Spark was installed successfully. Once this is done, you can use our very own Jupyter notebook to run Spark using PySpark.
4. Now let us test whether our installation was successful using Test 1 and Test 2 below.
Test 1

import findspark
findspark.init()
Test 2 (run this only after Test 1 completes without errors)
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.sql("select 'spark' as hello ")
df.show()
If you are able to display hello spark as above, you have successfully installed Spark and can now use PySpark for development. Please experiment with other PySpark commands and see if you can successfully use Spark from Jupyter.