Install Apache Spark and Run PySpark on AlmaLinux 9
This tutorial shows you how to install Apache Spark and run spark-shell and PySpark on AlmaLinux 9 from the command line. Before starting the installation, let's get familiar with Apache Spark and PySpark.
What Is Apache Spark?
Apache Spark is a popular open-source framework for distributed big data processing. It provides development APIs in Java, Scala, Python, and R.
What Is PySpark?
PySpark is the Python API for Apache Spark. With PySpark you can perform real-time, large-scale data processing in a distributed environment using Python.
How To Install Apache Spark and Run PySpark on AlmaLinux 9?
To set up Apache Spark, you need access to your server as a non-root user with sudo privileges. For this, you can follow this guide on Initial Server Setup with AlmaLinux 9.
Also, you must have the Java JDK installed on your server. For this purpose, you can visit this guide on How To Install Java with DNF on AlmaLinux 9.
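Before you continue, you can confirm that Java is available by checking its version:
java -version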
Now proceed to the following steps to complete this guide.
Step 1 – Download Apache Spark on AlmaLinux 9
First, install some helper packages by using the command below:
sudo dnf install mlocate git -y
Then, visit the Apache Spark Downloads page and get the latest release of Apache Spark pre-built for Apache Hadoop by using the following wget command:
Note: Hadoop is a common foundation of big data architectures; it is responsible for storing and processing data. This Spark package ships pre-built against Hadoop 3.
sudo wget https://dlcdn.apache.org/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
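Optionally, you can verify the integrity of the download. Apache publishes a checksum file next to each release artifact; the URL below assumes the usual .sha512 naming pattern and that the file is in sha512sum format:
wget https://dlcdn.apache.org/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz.sha512
sha512sum -c spark-3.4.1-bin-hadoop3.tgz.sha512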
When the Apache Spark download is completed on AlmaLinux 9, extract the downloaded archive with the following command:
sudo tar xvf spark-3.4.1-bin-hadoop3.tgz
Move the extracted directory to /opt/spark with the command below:
sudo mv spark-3.4.1-bin-hadoop3 /opt/spark
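You can list the new directory to confirm the move; you should see subdirectories such as bin, sbin, and conf:
ls /opt/spark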
Step 2 – Configure Spark Environment Variables on AlmaLinux 9
At this point, you need to add the Spark environment variables to your bashrc file. Open the file with your desired text editor; here we use the vi editor (no sudo is needed, since the file belongs to your own user):
vi ~/.bashrc
At the end of the file, add the following content to the file:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Note: Remember to set your actual Spark installation directory as the value of export SPARK_HOME= if it differs from /opt/spark.
When you are done, save and close the file.
Next, source your bashrc file so the changes take effect in your current shell. Note that source is a shell builtin, so it does not work with sudo:
source ~/.bashrc
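To confirm that the variables are set, print SPARK_HOME and check that the Spark binaries are now on your PATH:
echo $SPARK_HOME
which spark-shell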
Step 3 – How To Run Spark Shell on AlmaLinux 9?
At this point, you can verify your Spark installation by running the Spark shell command:
spark-shell
If everything is ok, you should get the following output:
Output
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.4.1
      /_/
Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 11.0.19)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
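To confirm the shell works end to end, you can evaluate a small expression at the scala> prompt. This minimal sketch sums the integers 1 to 100 through the Spark context sc that the shell created for you:
scala> sc.parallelize(1 to 100).sum()
res0: Double = 5050.0
Type :quit to leave the Spark shell.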
Step 4 – How To Run PySpark on AlmaLinux 9?
If you want to use Python instead of Scala, you can easily run PySpark on your AlmaLinux server with the command below:
pyspark
In your output, you should see:
Output
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.4.1
      /_/
Using Python version 3.9.16 (main, May 29 2023 00:00:00)
Spark context Web UI available at http://37.120.247.7:4040
Spark context available as 'sc' (master = local[*], app id = local-1689750671688).
SparkSession available as 'spark'.
>>>
From your PySpark shell, you can write Python code and execute it interactively.
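For example, here is a minimal sketch you can type at the >>> prompt; the column names and row values are made up for illustration. It builds a small DataFrame from in-memory data using the spark session the shell created for you, then filters it:
>>> data = [("alice", 34), ("bob", 29), ("carol", 41)]
>>> df = spark.createDataFrame(data, ["name", "age"])
>>> df.filter(df.age > 30).show()
+-----+---+
| name|age|
+-----+---+
|alice| 34|
|carol| 41|
+-----+---+
To leave the PySpark shell, type exit() or press Ctrl-D.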
Conclusion
At this point, you have learned how to install Apache Spark and run spark-shell and PySpark on AlmaLinux 9 from the command line. We hope you found it useful.