In this article, we will explain the necessary steps to install and configure Apache Spark on Ubuntu 20.04 LTS. Before continuing with this tutorial, make sure you are logged in as a user with sudo privileges. All the commands in this tutorial should be run as a non-root user.
Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.
Install Apache Spark on Ubuntu 20.04
Step 1. First, before you start installing any package on your Ubuntu server, we always recommend making sure that all system packages are updated.
sudo apt update
sudo apt upgrade
Step 2. Install Java.
Spark is based on Java, so we need to install it on our Ubuntu system:
sudo apt install default-jdk
Check the Java version with the command below:
java --version
Step 3. Install Scala.
Apache Spark is implemented in the Scala programming language, so we have to install Scala to run Apache Spark:
sudo apt install scala
Verify the Scala installation:
scala -version
Step 4. Install Apache Spark on Ubuntu system.
Download the latest release of Apache Spark from the downloads page. At the time of writing, the latest version is 3.0.0:
cd /opt
sudo wget https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz
The next step is to extract the Apache Spark tarball files:
sudo tar -xzvf spark-3.0.0-bin-hadoop2.7.tgz
Step 5. Configuring Apache Spark Environment.
Before starting a master server, you need to configure environment variables. There are a few Spark home paths you need to add to the user profile:
nano ~/.bashrc
Add the two lines below at the end of the file:
export SPARK_HOME=/opt/spark-3.0.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
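Before relying on ~/.bashrc, you can set the same variables directly in the current shell session and confirm the values. This is a minimal sketch; the path assumes the Spark 3.0.0 tarball was extracted under /opt as in Step 4:

```shell
# Set the Spark variables for the current session (same values as in ~/.bashrc);
# the path assumes the tarball from Step 4 was extracted in /opt
export SPARK_HOME=/opt/spark-3.0.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

# Print the values to confirm they are in place
echo "$SPARK_HOME"
echo "$PATH"
```

After editing ~/.bashrc itself, run `source ~/.bashrc` (or log out and back in) so that new shells pick up the variables.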
You can now start a standalone master server using the start-master.sh command:
start-master.sh
To view the Spark Web user interface, open a web browser and enter the localhost IP address on port 8080:
http://127.0.0.1:8080/
The URL for Spark Master is the name of your device on port 8080. In our case, this is ubuntu1:8080. So, there are three possible ways to load Spark Master’s Web UI:
- 127.0.0.1:8080
- localhost:8080
- deviceName:8080
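These three forms differ only in the host part; the Web UI port of a standalone master defaults to 8080. A small shell sketch that builds the address from a hostname (ubuntu1 is the example device name used in this tutorial, so substitute your own):

```shell
# Hostname of the machine running the Spark master (example value from this tutorial)
DEVICE_NAME="ubuntu1"
# The standalone master's Web UI listens on port 8080 by default
UI_PORT=8080

# Build the Web UI address; 127.0.0.1 or localhost work the same way locally
UI_URL="http://$DEVICE_NAME:$UI_PORT/"
echo "$UI_URL"
```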
Next, start the Spark worker process:
The Spark master service is listening on spark://ubuntu1:7077, so we will use this address to start the Spark worker process with the command below:
start-slave.sh spark://ubuntu1:7077
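The argument passed to start-slave.sh is the master URL, which always has the form spark://&lt;host&gt;:&lt;port&gt;, with 7077 being the master's default port. A small shell sketch splitting the example URL into its host and port parts (ubuntu1 is the example hostname from this tutorial):

```shell
# Master URL in the form spark://<host>:<port> (example value from this tutorial)
MASTER_URL="spark://ubuntu1:7077"

# Strip the scheme, then split on the colon
HOST_PORT=${MASTER_URL#spark://}   # -> ubuntu1:7077
MASTER_HOST=${HOST_PORT%%:*}       # -> ubuntu1
MASTER_PORT=${HOST_PORT##*:}       # -> 7077

echo "master host: $MASTER_HOST, master port: $MASTER_PORT"
```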
Finally, verify the worker service by refreshing the Spark master's Web UI in the browser; the new worker should appear in the Workers list.
That’s all you need to do to install Apache Spark on Ubuntu 20.04 Focal Fossa. I hope you find this quick tip helpful. If you have questions or suggestions, feel free to leave a comment below.