This guide shows you how to install Apache Spark and run spark-shell and PySpark on Ubuntu 22.04.
Apache Spark is one of the most popular open-source frameworks for distributed big data processing. It provides development APIs in Java, Scala, Python, and R. PySpark is the Python API for Apache Spark. With PySpark, you can perform real-time, large-scale data processing in a distributed environment using Python. It also provides a PySpark shell for analyzing your data interactively.
Follow the steps below to install Apache Spark and run PySpark on Ubuntu 22.04.
Install Apache Spark and Run PySpark on Ubuntu 22.04
To complete this guide, you must have access to your server as a non-root user with sudo privileges. To set this up, you can follow the guide on Initial Server Setup with Ubuntu 22.04.
Step 1 – Install Java on Ubuntu 22.04
To set up Apache Spark, you must have Java installed on your server. First, update your system by using the command below:
sudo apt update
Then, use the following command to install Java:
sudo apt install default-jdk -y
Verify your Java installation by checking its version:
java --version
Output
openjdk 11.0.19 2023-04-18
OpenJDK Runtime Environment (build 11.0.19+7-post-Ubuntu-0ubuntu122.04.1)
OpenJDK 64-Bit Server VM (build 11.0.19+7-post-Ubuntu-0ubuntu122.04.1, mixed mode, sharing)
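The Spark 3.4 documentation lists Java 8, 11, and 17 as supported, so it can be worth checking the major version before you continue. The following is a minimal sketch, not part of Java or Spark: java_major is a hypothetical helper name, and it only parses a version string such as the one shown in the output above.

```shell
# Hypothetical helper: print the major Java version from a version string.
# Handles the legacy "1.8.0_372" scheme as well as the modern "11.0.19" scheme.
java_major() {
  version="$1"
  major="${version%%.*}"          # text before the first dot
  if [ "$major" = "1" ]; then
    # Legacy scheme: 1.8.0_372 means Java 8
    rest="${version#*.}"          # drop the leading "1."
    major="${rest%%.*}"
  fi
  printf '%s\n' "$major"
}
```

With the output format shown above, java_major "$(java --version | awk 'NR==1 {print $2}')" would print 11.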
Step 2 – Download Apache Spark on Ubuntu 22.04
First, you need to install some required packages by using the command below:
sudo apt install mlocate git scala -y
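Note: The hadoop3 suffix in the file name means this Spark package is prebuilt with the Hadoop 3 client libraries. Hadoop is a framework for distributed storage and processing of large data sets; you do not need a separate Hadoop installation to run Spark on a single machine.
Next, download the Spark 3.4.0 package with the wget command:
wget https://dlcdn.apache.org/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz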
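Apache publishes a SHA-512 checksum next to each release archive, and comparing it with the downloaded file guards against a corrupted download. The following is a minimal sketch, assuming sha512sum from GNU coreutils is available (it is on a stock Ubuntu system). verify_sha512 is a hypothetical helper name, and the expected digest must be copied from the Apache Spark download page.

```shell
# Hypothetical helper: succeed only if FILE's SHA-512 digest equals EXPECTED.
verify_sha512() {
  file="$1"
  expected="$2"
  # sha512sum prints "<digest>  <file>"; keep only the digest
  actual="$(sha512sum "$file" | awk '{print $1}')"
  # Lower-case the expected digest, since published digests may be upper-case
  expected_lc="$(printf '%s' "$expected" | tr 'A-F' 'a-f')"
  if [ "$actual" = "$expected_lc" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file" >&2
    return 1
  fi
}
```

You would call it as verify_sha512 spark-3.4.0-bin-hadoop3.tgz followed by the published digest in quotes. If it reports a mismatch, delete the archive and download it again.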
When the download completes, extract the archive with the following command:
sudo tar xvf spark-3.4.0-bin-hadoop3.tgz
Move the extracted directory to /opt/spark with the command below. Writing to /opt requires sudo:
sudo mv spark-3.4.0-bin-hadoop3 /opt/spark
Step 3 – How To Configure Spark Environment?
At this point, you need to add the Spark environment variables to your bashrc file. Open the file with your preferred text editor; here we use the vi editor. The file belongs to your own user, so sudo is not needed:
vi ~/.bashrc
At the end of the file, add the following content:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Note: If you installed Spark somewhere other than /opt/spark, set SPARK_HOME to that directory instead.
When you are done, save and close the file.
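If you prefer not to edit the file by hand, the same lines can be appended from the command line. The following is a minimal sketch: append_once is a hypothetical helper that skips a line that is already present, so running it twice does not create duplicate entries. The /opt/spark path in the commented example is the one used in this guide.

```shell
# Hypothetical helper: append LINE to FILE only if that exact line is absent.
append_once() {
  file="$1"
  line="$2"
  touch "$file"
  # -x matches whole lines only, -F treats the pattern as a fixed string
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example calls (single quotes keep $PATH and $SPARK_HOME unexpanded in the file):
# append_once "$HOME/.bashrc" 'export SPARK_HOME=/opt/spark'
# append_once "$HOME/.bashrc" 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin'
```

The single quotes matter here: the variables should be expanded each time the shell starts, not once when the line is written.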
Next, source your bashrc file to apply the changes to the current session. Because source is a shell builtin, run it without sudo:
source ~/.bashrc
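Before launching the shell, it can help to confirm that the variables took effect. The following is a minimal sketch, assuming the SPARK_HOME and PATH settings described in this step; spark_env_ok is a hypothetical helper name.

```shell
# Hypothetical helper: check that SPARK_HOME is set and its bin directory is on PATH.
spark_env_ok() {
  if [ -z "${SPARK_HOME:-}" ]; then
    echo "SPARK_HOME is not set" >&2
    return 1
  fi
  # Wrap PATH in colons so the match cannot hit part of a longer directory name
  case ":$PATH:" in
    *":$SPARK_HOME/bin:"*)
      echo "Spark environment looks fine: $SPARK_HOME"
      ;;
    *)
      echo "$SPARK_HOME/bin is not on PATH" >&2
      return 1
      ;;
  esac
}
```

If the check fails in a new terminal, the export lines were probably not saved to the bashrc file of the user you are logged in as.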
Step 4 – How To Run Spark Shell on Ubuntu 22.04?
At this point, you can verify your Spark installation by running the Spark shell command:
spark-shell
If everything is ok, you should get the following output:
Output
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.4.0
      /_/

Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 11.0.19)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
Step 5 – How To Run PySpark on Ubuntu 22.04?
If you want to use Python instead of Scala, you can run PySpark on your Ubuntu server with the command below:
pyspark
In your output, you should see:
Output
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.4.0
      /_/

Using Python version 3.10.4 (main, Jun 29 2022 12:14:53)
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1686575411463).
SparkSession available as 'spark'.
>>>
From the PySpark shell, you can write code and run it interactively. For example, entering spark.range(5).count() at the >>> prompt returns 5.
At this point, you have learned to install Apache Spark and run spark-shell and PySpark on Ubuntu 22.04. With PySpark, you can analyze your data interactively.
Hope you enjoy it.