In this article, we explain the necessary steps to install and configure Hadoop on Ubuntu 20.04 LTS. Before continuing with this tutorial, make sure you are logged in as a user with sudo privileges. All the commands in this tutorial should be run as a non-root user.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
Install Hadoop on Ubuntu 20.04
Step 1. First, before you start installing any package on your Ubuntu server, we always recommend making sure that all system packages are updated.
sudo apt update
sudo apt upgrade
Step 2. Install Java.
You can install OpenJDK from the default apt repositories:
sudo apt install default-jdk default-jre
After successfully installing Java on Ubuntu 20.04, confirm the installed version with the following command:
java -version
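Since the default-jdk package also ships the Java compiler, you can optionally confirm it as well:

javac -version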
Step 3. Create a Hadoop User.
Run the following commands to create a new user named hadoop and add it to the sudo group:
sudo adduser hadoop
sudo usermod -aG sudo hadoop
Next, switch to the hadoop user, then run the following command to generate a public/private key pair (Hadoop's scripts use SSH to localhost, so key-based login must work for this user):

su - hadoop
ssh-keygen -t rsa
Then, append the generated public key from id_rsa.pub to authorized_keys and set the correct permissions:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 640 ~/.ssh/authorized_keys
Verify that you can SSH to localhost using the added key:
ssh localhost
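If key-based authentication is set up correctly, you should be logged in without a password prompt (you may be asked to confirm the host key on the first connection). Type the following to close the session and return to your previous shell:

exit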
Step 4. Install Hadoop on the Ubuntu system.
Go to the official Apache Hadoop project page and select the version of Hadoop you want to install; this tutorial uses Hadoop 3.3.2:
su - hadoop
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.2/hadoop-3.3.2.tar.gz
tar -xvzf hadoop-3.3.2.tar.gz
mv hadoop-3.3.2 hadoop
Next, you will need to configure Hadoop and Java Environment Variables on the Ubuntu system:
nano ~/.bashrc
Add the following lines:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Once done, activate the environment variables:
source ~/.bashrc
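To verify that the new variables are in effect, you can check that the hadoop command is now on your PATH; the reported version should match the release downloaded above:

hadoop version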
Next, open the Hadoop environment variable file:

nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Set the JAVA_HOME in it as well:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
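If you are unsure which path to use for JAVA_HOME, you can derive it from the java binary installed earlier; a quick check, assuming OpenJDK was installed from the apt repositories as in Step 2:

readlink -f /usr/bin/java | sed "s:/bin/java::"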
Step 5. Configure Hadoop.
Now create the namenode and datanode directories inside the hadoop user's home directory:
mkdir -p ~/hadoopdata/hdfs/namenode
mkdir -p ~/hadoopdata/hdfs/datanode
Next, edit the core-site.xml file and update it with your system hostname:
nano $HADOOP_HOME/etc/hadoop/core-site.xml
Update the file with the following configuration, replacing hadoop.tecadmin.com with your system hostname:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop.tecadmin.com:9000</value>
  </property>
</configuration>
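The value above assumes the hostname hadoop.tecadmin.com resolves to this machine. You can check what your system reports and, if the name does not resolve, map it in /etc/hosts; a minimal sketch (the IP address below is a placeholder to adjust):

hostname -f
# Example /etc/hosts entry (adjust the IP and hostname to your system):
# 192.168.1.100   hadoop.tecadmin.com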
Then, edit the hdfs-site.xml file:
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Make the following changes:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>
Next, edit the mapred-site.xml file:
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Make the following changes:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
Finally, edit the yarn-site.xml file:
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
Make the following changes:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
Step 6. Start Hadoop Cluster.
Now run the following commands to format the Hadoop NameNode and start the HDFS service:

hdfs namenode -format
start-dfs.sh
Then, start the YARN service using the following command:
start-yarn.sh
Type this simple command to check if all the daemons are active and running as Java processes:
jps
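If everything came up correctly, the output should list entries similar to the following (jps prefixes each name with a process ID, which will differ on your system):

NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
Jps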
Step 7. Configure Firewall.
If a firewall is running, allow the Hadoop web UI ports through it. The commands below use firewalld:
sudo firewall-cmd --permanent --add-port=9870/tcp
sudo firewall-cmd --permanent --add-port=8088/tcp
sudo firewall-cmd --reload
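On a default Ubuntu 20.04 installation the active firewall front-end is usually UFW rather than firewalld; if that is the case on your system, the equivalent commands are:

sudo ufw allow 9870/tcp
sudo ufw allow 8088/tcp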
Step 8. Accessing Hadoop.
Open your preferred browser and navigate to your server's hostname or IP address. The default port 9870 gives you access to the Hadoop NameNode UI:
http://your-ip-address:9870
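The YARN ResourceManager web UI listens on port 8088 (opened in the firewall step above) and can be reached the same way:

http://your-ip-address:8088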
That’s all you need to do to install Hadoop on Ubuntu 20.04 LTS Focal Fossa. I hope you find this quick tip helpful. For further reading on Apache Hadoop, please refer to the official documentation. If you have questions or suggestions, feel free to leave a comment below.