Install Apache Spark and Run PySpark on AlmaLinux 9

This tutorial teaches you how to install Apache Spark and run spark-shell and PySpark on AlmaLinux 9 from the command line. Before starting the installation, let's get familiar with Apache Spark and PySpark.

What Is Apache Spark?

Apache Spark is a popular open-source framework for distributed big data processing. It provides development APIs in Java, Scala, Python, and R.

What Is PySpark?

PySpark is the Python API for Apache Spark. With PySpark, you can perform real-time, large-scale data processing in a distributed environment using Python.

How To Install Apache Spark and Run PySpark on AlmaLinux 9?

To set up Apache Spark, you must have access to your server as a non-root user with sudo privileges. To do this, you can follow this guide on Initial Server Setup with AlmaLinux 9.

Also, you must have the Java JDK installed on your server. For this purpose, you can visit this guide on How To Install Java with DNF on AlmaLinux 9.
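You can quickly confirm that Java is available by checking its version:

java -version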

Now proceed to the following steps to complete this guide.

Step 1 – Download Apache Spark on AlmaLinux 9

First, install a few required packages, including wget, which you will use to download Spark:

sudo dnf install mlocate git wget -y

Then, visit the Apache Spark Downloads page and get the latest release of Apache Spark, prebuilt for Apache Hadoop, by using the following wget command:

Note: Hadoop is a foundational big data framework for storing and processing data. This Spark package ships prebuilt with the Hadoop libraries Spark needs to work with Hadoop-compatible storage such as HDFS.

sudo wget https://dlcdn.apache.org/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
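Optionally, before extracting, you can verify the integrity of the download by computing its SHA-512 checksum and comparing it against the value published for this release on the Apache Spark downloads page:

sha512sum spark-3.4.1-bin-hadoop3.tgz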

When your Apache Spark download is complete, extract the archive with the following command:

sudo tar xvf spark-3.4.1-bin-hadoop3.tgz

Move the extracted directory to /opt/spark with the command below:

sudo mv spark-3.4.1-bin-hadoop3 /opt/spark
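You can confirm the files are in place by listing the new directory; you should see subdirectories such as bin, sbin, and conf:

ls /opt/spark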

Step 2 – Configure Spark Environment Variables on AlmaLinux 9

At this point, you need to add the Spark environment variables to your bashrc file. Open the file with your preferred text editor; here we use vi:

vi ~/.bashrc

Add the following lines at the end of the file:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Note: If you installed Spark somewhere other than /opt/spark, set that path as the value of SPARK_HOME.

When you are done, save and close the file.

Next, source your bashrc file:

source ~/.bashrc
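To confirm that the variables are set, print SPARK_HOME; the output should be /opt/spark:

echo $SPARK_HOME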

Step 3 – How To Run Spark Shell on AlmaLinux 9?

At this point, you can verify your Spark installation by running the Spark shell command:

spark-shell

If everything is OK, you should see output similar to the following:

Output
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.4.1
      /_/

Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 11.0.19)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
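To leave the Scala shell and return to your terminal, type:

:quit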

Step 4 – How To Run PySpark on AlmaLinux 9?

If you want to use Python instead of Scala, you can easily run PySpark on your AlmaLinux server with the command below:

pyspark

In your output, you should see:

Output
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.4.1
      /_/

Using Python version 3.9.16 (main, May 29 2023 00:00:00)
Spark context Web UI available at http://37.120.247.7:4040
Spark context available as 'sc' (master = local[*], app id = local-1689750671688).
SparkSession available as 'spark'.
>>>

From the PySpark shell, you can write Python code and execute it interactively.
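For example, this minimal sketch (the sample data is purely illustrative) uses the ready-made spark session to build a small DataFrame and filter it:

data = [("alice", 34), ("bob", 29), ("carol", 41)]  # illustrative sample rows
df = spark.createDataFrame(data, ["name", "age"])  # build a two-column DataFrame
df.filter(df.age > 30).show()  # keep only rows where age is greater than 30

When you are done, exit the shell with exit() or Ctrl-D.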

Conclusion

At this point, you have learned how to install Apache Spark and run spark-shell and PySpark on AlmaLinux 9 from the command line. We hope you found it useful.

You may be interested in these articles too:

How To Copy Files and Folders on AlmaLinux

How To Configure Rsyslog in AlmaLinux
