How to Install Hadoop on Ubuntu 20.04

In this article, we will explain the necessary steps to install and configure Hadoop on Ubuntu 20.04 LTS. Before continuing with this tutorial, make sure you are logged in as a user with sudo privileges. All the commands in this tutorial should be run as a non-root user.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

Install Hadoop on Ubuntu 20.04

Step 1. First, before you start installing any package on your Ubuntu server, we always recommend making sure that all system packages are updated.

sudo apt update
sudo apt upgrade

Step 2. Install Java.

You can install OpenJDK from the default apt repositories:

sudo apt install default-jdk default-jre

After successfully installing Java on Ubuntu 20.04, confirm the installed version with the following command:

java -version
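
If more than one Java version is installed, you can check which one is the system default with the standard Ubuntu alternatives tool (a general Ubuntu tip, not specific to Hadoop):

update-alternatives --display java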

Step 3. Create a Hadoop User.

Run the following commands to create a new user named hadoop and add it to the sudo group:

sudo adduser hadoop
sudo usermod -aG sudo hadoop

Next, switch to the hadoop user and generate an SSH public/private key pair (press Enter to accept the default options):

su - hadoop
ssh-keygen -t rsa

Then, append the generated public key from id_rsa.pub to authorized_keys and set the correct permissions:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys 
chmod 640 ~/.ssh/authorized_keys

Verify that you can SSH to localhost using the added key:

ssh localhost

Step 4. Install Hadoop on the Ubuntu system.

Go to the official Apache Hadoop downloads page and pick the version of Hadoop you want to install (this tutorial uses 3.3.2), then download and extract it as the hadoop user:

su - hadoop 
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.2/hadoop-3.3.2.tar.gz
tar -xvzf hadoop-3.3.2.tar.gz 
mv hadoop-3.3.2 hadoop
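
Optionally, confirm that the archive was extracted into the expected location before moving on:

ls -l ~/hadoop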

Next, you will need to configure Hadoop and Java Environment Variables on the Ubuntu system:

nano ~/.bashrc

Add the following lines. Adjust JAVA_HOME if your JDK lives in a different directory; with the default-jdk package on Ubuntu 20.04 it is typically the path shown below:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Once done, activate the environment variables:

source ~/.bashrc
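
To confirm that the new variables are picked up and that the Hadoop binaries are on your PATH, you can run, for example:

echo $HADOOP_HOME
hadoop version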

Next, open the Hadoop environment variable file and set JAVA_HOME there as well:

nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Uncomment and update the JAVA_HOME line so it matches the path used in ~/.bashrc:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

Step 5. Configure Hadoop.

Now create the namenode and datanode directories inside the hadoop user's home directory:

mkdir -p ~/hadoopdata/hdfs/namenode 
mkdir -p ~/hadoopdata/hdfs/datanode

Next, edit the core-site.xml file and update it with your system hostname:

nano $HADOOP_HOME/etc/hadoop/core-site.xml

Update the configuration block as follows, replacing hadoop.tecadmin.com with your own hostname:

<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://hadoop.tecadmin.com:9000</value>
        </property>
</configuration>
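
If you are not sure what your system hostname is, you can print it and use that value in the fs.defaultFS setting above:

hostname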

Then, edit the hdfs-site.xml file:

nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Update the configuration as follows (dfs.replication is set to 1 because this is a single-node setup):

<configuration>
        <property>
                <name>dfs.replication</name>
                <value>1</value>
        </property>
        <property>
                <name>dfs.namenode.name.dir</name>
                <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
        </property>
        <property>
                <name>dfs.datanode.data.dir</name>
                <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
        </property>
</configuration>

Next, edit the mapred-site.xml file:

nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Make the following changes:

<configuration>
        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
        </property>
</configuration>

Next, edit the yarn-site.xml file:

nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Make the following changes:

<configuration>
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
</configuration>

Step 6. Start Hadoop Cluster.

Now run the following commands to format the Hadoop Namenode and start the HDFS service:

hdfs namenode -format 
start-dfs.sh

Then, start the YARN service using the following command:

start-yarn.sh

Type this simple command to check if all the daemons are active and running as Java processes:

jps
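
On a single-node setup like this one, the output should include the HDFS and YARN daemons, something along these lines (the process IDs will differ on your system):

4802 NameNode
4951 DataNode
5162 SecondaryNameNode
5421 ResourceManager
5561 NodeManager
5789 Jps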

Step 7. Configure Firewall.

If your server runs firewalld, run the following commands to allow connections to the Hadoop web interfaces through the firewall:

sudo firewall-cmd --permanent --add-port=9870/tcp 
sudo firewall-cmd --permanent --add-port=8088/tcp 
sudo firewall-cmd --reload
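
Ubuntu 20.04 ships with UFW rather than firewalld by default; if UFW is what you are using, the equivalent rules are:

sudo ufw allow 9870/tcp
sudo ufw allow 8088/tcp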

Step 8. Accessing Hadoop.

Use your preferred browser and navigate to your localhost URL or IP. The default port number 9870 gives you access to the Hadoop NameNode UI:

http://your-ip-address:9870
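
The YARN ResourceManager web UI is served on port 8088 (the second port opened in the firewall step), so it can be reached the same way:

http://your-ip-address:8088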

That’s all you need to do to install Hadoop on Ubuntu 20.04 LTS Focal Fossa. I hope you find this quick tip helpful. For further reading on Apache Hadoop, please refer to the official documentation. If you have questions or suggestions, feel free to leave a comment below.