Apache Pig and Hive Installation on a Single Node Machine
The Apache Hadoop software library is a framework that allows distributed processing of data across clusters of computers using a simple programming model called MapReduce. It uses a cluster of machines to offer local computation and storage in an efficient way.
Hadoop solutions normally include clusters that are hard to manage and maintain. In many scenarios, they require integration with other tools like MySQL, Mahout, etc. Hadoop works as a series of MapReduce jobs; each of these jobs is high-latency and depends on the previous one, so no job can start until the previous job has finished successfully.
A Hadoop admin is responsible for the implementation and maintenance of the Hadoop environment. "Hadoop admin" itself can be a title that covers many roles inside the big data world. A Hadoop administrator may, in addition, perform DBA-like tasks on databases and warehouses such as HBase and Hive, along with security administration and cluster administration. The admin deploys, manages, monitors, and secures the full Hadoop cluster.
Map Reduce:
MapReduce is Hadoop's programming model for processing HDFS data. Apache Hadoop can run MapReduce programs written in different languages like Java, Ruby, and Python. MapReduce programs execute efficiently in parallel across the cluster. A job works in the following phases:
- Map phase
- Reduce phase
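As a small worked illustration (a WordCount-style job over the line "to be or not to be"), the data moves between the phases as key-value pairs:
Map phase input: (offset, "to be or not to be")
Map phase output: (to,1) (be,1) (or,1) (not,1) (to,1) (be,1)
Shuffle/sort (done by the framework): (be,[1,1]) (not,[1]) (or,[1]) (to,[1,1])
Reduce phase output: (be,2) (not,1) (or,1) (to,2)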
Tools in Hadoop:
HDFS (Hadoop Distributed File System) is the basic storage for Hadoop.
Apache Pig is an ETL (Extract, Transform and Load) tool.
MapReduce is the programmatic model/engine to execute MR jobs.
Apache Hive is a data warehouse tool used to work on historical data using HQL.
Apache Sqoop is a tool to import and export data from RDBMS to HDFS and vice versa.
Apache Oozie is a job scheduling tool to control applications over the cluster.
Apache HBase is a NoSQL database based on the CAP (Consistency, Availability, Partition tolerance) theorem.
Master-slave architecture
Hadoop 1.x has 5 daemons/services:
Name Node – the master node; holds the metadata information.
Secondary Name Node – maintains logs and checkpoint information about cluster activities.
Data Node – the slave node where the actual data resides.
Job Tracker – receives a request to perform a task and divides it into subtasks.
Task Tracker – performs tasks using MR on an individual Data Node and sends heartbeat signals to the Job Tracker.
Hadoop 2.x YARN (Yet Another Resource Negotiator) has 6 daemons:
Name Node – the master node; holds the metadata information.
Secondary Name Node – maintains logs and checkpoint information about cluster activities.
Data Node – the slave node where the actual data resides.
Resource Manager – receives and runs applications on the cluster.
Job History Server – maintains job status and history.
Node Manager – manages resources and deployment on each node of the cluster; launches the containers where the actual jobs get executed.
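Once the daemons are started (for example via start-dfs.sh and start-yarn.sh), a quick way to confirm what is running on a single-node machine is the jps command. The output below is only a sample; process IDs will differ, and the JobHistoryServer appears only if you have started it separately:
Terminal > jps
2401 NameNode
2563 DataNode
2789 SecondaryNameNode
2954 ResourceManager
3120 NodeManager
3342 JobHistoryServer
3510 Jps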
Spark is a framework that does in-memory computation and works with Hadoop; it is built on Scala and Java. In Hadoop, the input to each phase is a set of key-value pairs. We are going to see a small program with steps: how to create a mapper and a reducer class to achieve this, and how to submit the Hadoop job through the terminal on the cluster (see the sketch after the architecture key-terms below).
Let’s see some key-terms used in architecture first:
Physical architecture: Master-Slave architecture
Cluster – a group of computers connected with each other.
Daemons – background running processes (services/threads running in the background).
The Hadoop 1.x series framework has 5 daemons:
Name node, Secondary name node, Data node, Job tracker, Task tracker
The architecture can be divided into the following:
Storage/HDFS architecture – Name node, Secondary name node, Data node, and Blocks
Master:
Name node – the master node which handles and manages all metadata information and status control; think of the Name node as the manager.
Secondary name node – the assistant manager; a helper node to the NN which maintains metadata in the FSImage file, handles log generation, etc. It is not a backup of the NN.
Slave:
Data Node – a slave node where the data resides; data is stored using a block system.
Blocks have a size which is configurable:
Hadoop 1.x default block size is 64 MB
Hadoop 2.x default block size is 128 MB
The default replication factor is 3.
The replication factor is nothing but how many copies of the data will be created over the cluster.
Ex: 1 TB of data ⇒ 3 TB on the cluster
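Both settings can be overridden in hdfs-site.xml. The snippet below is only an illustration; dfs.replication and dfs.blocksize are the standard Hadoop 2.x property names, and the values simply restate the defaults mentioned above:
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Number of copies of each block kept on the cluster</description>
</property>
<property>
<name>dfs.blocksize</name>
<value>134217728</value>
<description>Default block size in bytes (128 MB)</description>
</property>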
Process architecture
Job tracker – the master; receives a client request to perform an operation over the cluster.
Example: a query such as select count(*) from emp; over a file Emp.txt is submitted, and the Job Tracker assigns it a JobID (e.g. 00001100).
Task Tracker – the slave; performs the task using the MR phases and sends heartbeats to the JT.
In MR 2.x the Job Tracker is divided into three services:
Resource Manager – a persistent YARN service that receives and runs applications on the cluster. A MapReduce job is an application.
JobHistoryServer – provides information about completed jobs.
Application Master – manages each MR job and terminates when the job is completed.
The Task Tracker is replaced by the Node Manager, which manages resources and deployment on a node. It is also responsible for launching containers, each of which runs an MR task.
Speculative Execution Mechanism – if a Task Tracker fails or is running too slowly, the JT will assign the same task to another available TT.
Rack awareness algorithm – governs how HDFS distributes file blocks over the network.
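Here is the mapper/reducer program promised earlier. It is a minimal WordCount-style sketch (the class name, jar name and HDFS paths are illustrative, not part of the original walkthrough) using the standard org.apache.hadoop.mapreduce API:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: read each line of input and emit (word, 1) pairs
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts received for each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory (must not already exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Package it as a jar and submit it from the terminal, for example:
Terminal > hadoop jar wordcount.jar WordCount /user/sachin/input /user/sachin/output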
We now have Apache Pig and Hive; both tools are abstractions over MapReduce and are used for ETL (Extract, Transform and Load) and data warehousing. Let's now see how to install them on a single node machine.
Pig Installation:
- Go to https://pig.apache.org/releases.html
OR Direct download from the mirror:
http://mirrors.fibergrid.in/apache/pig/pig-0.17.0/
This will download pig-0.17.0.tar.gz
- Untar/Extract this file in the Hadoop installation directory.
In our case the location is /home/sachin/hadoop-2.7.7/pig-0.17.0
- Now we need to configure Pig by adding entries to the “.bashrc” file. To edit this file, execute the command below:
> sudo gedit ~/.bashrc
And in this file we need to add the following:
#Pig Setting
export PATH=$PATH:/home/sachin/hadoop-2.7.7/pig-0.17.0/bin
export PIG_HOME=/home/sachin/hadoop-2.7.7/pig-0.17.0
export PIG_CLASSPATH=$HADOOP_PREFIX/conf
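After saving the file, reload it so the new variables take effect in the current shell:
Terminal > source ~/.bashrc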
- Terminal > pig -version
Output: Apache Pig version 0.17.0 (r1797386)
compiled Jun 02 2017, 15:41:58
If you get the error “java home not set”, then:
Terminal > sudo gedit ~/.bashrc and add this line (adjust the path to your installed JDK):
export JAVA_HOME=/usr/lib/jvm/java-8-oracle/jre/
- Start Pig: Local Mode: Terminal > pig -x local
MapReduce Mode: Terminal > pig OR pig -x mapreduce
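To quickly verify the installation, you can run a couple of Pig Latin statements in the grunt shell. This is just a sketch that assumes a small local text file /tmp/sample.txt exists:
Terminal > pig -x local
grunt> lines = LOAD '/tmp/sample.txt' AS (line:chararray);
grunt> first5 = LIMIT lines 5;
grunt> DUMP first5;
grunt> quit;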
——————————————————————————————
Hive Installation:
- Download Hive from : https://hive.apache.org/downloads.html –> Download release now
Directly from : https://www-eu.apache.org/dist/hive/hive-2.3.4/
- Untar/Extract it into our Hadoop directory location: /home/sachin/hadoop-2.7.7
- Edit the “.bashrc” file to update the environment variables for user.
Terminal > sudo gedit ~/.bashrc
Add the following at the end:
#Hive Setting
export PATH=$PATH:/home/sachin/hadoop-2.7.7/apache-hive-2.3.4-bin/bin
export HIVE_HOME=/home/sachin/hadoop-2.7.7/apache-hive-2.3.4-bin
export HIVE_CLASSPATH=$HADOOP_PREFIX/conf
- Check the version: Terminal > hive --version
- Create Hive directories within HDFS. The directory ‘warehouse’ is the location to store the tables and data related to Hive. Before this, make sure all Hadoop services are running.
Terminal > hdfs dfs -mkdir -p /user/hive/warehouse
Terminal > hdfs dfs -mkdir -p /tmp
Terminal > hdfs dfs -chmod g+w /user/hive/warehouse
Terminal > hdfs dfs -chmod g+w /tmp
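You can confirm the directories and their group-write permission with:
Terminal > hdfs dfs -ls /user/hive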
- Set Hadoop path in hive-env.sh
Terminal > cd /home/sachin/hadoop-2.7.7/apache-hive-2.3.4-bin/conf
> sudo cp hive-env.sh.template hive-env.sh
> sudo gedit hive-env.sh
Add the following lines at the end:
export HADOOP_HEAPSIZE=512
export HADOOP_HOME=/home/sachin/hadoop-2.7.7
export HIVE_CONF_DIR=/home/sachin/hadoop-2.7.7/apache-hive-2.3.4-bin/conf
- Set up hive-site.xml to configure the metastore
** Before the following step, make sure MySQL is installed. We are not going to use Derby as it does not support multiple sessions.
Or install it: Terminal > sudo apt-get update
> sudo apt-get install mysql-server
If Error: Could not get lock on apt/dpkg
Terminal > ps -aux | grep apt
Terminal > sudo kill -9 <processId>
Terminal > cd /home/sachin/hadoop-2.7.7/apache-hive-2.3.4-bin/conf
> sudo cp hive-default.xml.template hive-site.xml
In real-time setups MySQL is used as the metastore, since it supports multiple users.
MySQL work: go to your MySQL shell
> sudo mysql -u root -p
> create database hiveMetaStore; -- any name
> use hiveMetaStore;
> SOURCE /home/sachin/hadoop-2.7.7/apache-hive-2.3.4-bin/scripts/metastore/upgrade/mysql/hive-schema-2.3.0.mysql.sql; -- this script is as per the installed Hive version
> create user 'hiveuser'@'%' identified by 'hive123';
> GRANT ALL ON *.* TO 'hiveuser'@'localhost' IDENTIFIED BY 'hive123';
> flush privileges;
> show tables;
——————————————————————————————
In a multi-node setup with Sqoop, we did the following because we were getting an error:
Terminal > sudo gedit /etc/mysql/mysql.conf.d/mysqld.cnf
Replace the bind-address with the master's address, for example:
bind-address = 192.168.60.176
Save the file.
Terminal > sudo service mysql restart
Terminal > mysql -u root -p
mysql > GRANT ALL ON *.* TO root@'%' IDENTIFIED BY 'root';
mysql > GRANT ALL ON *.* TO hiveuser@'%' IDENTIFIED BY 'hiveuser123';
——————————————————————————————
Now in hive-site.xml:
Terminal > sudo cp hive-default.xml.template hive-site.xml
Terminal > sudo gedit hive-site.xml
Add these lines. Note: remove/replace the existing Derby properties with the properties below.
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/hiveMetaStore?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hiveuser</value>
<description>Username for connecting to mysql server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive123</value>
<description>Password for connecting to mysql server</description>
</property>
<property>
<name>hive.querylog.location</name>
<value>/tmp/hive</value>
<description>Location of Hive run time structured log file</description>
</property>
<property>
<name>hive.exec.local.scratchdir</name>
<value>/tmp/hive</value>
<description>Local scratch space for Hive jobs</description>
</property>
<property>
<name>hive.downloaded.resources.dir</name>
<value>/tmp/hive</value>
<description>Temporary local directory for added resources in the remote file system.</description>
</property>
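Optionally, you can also point Hive explicitly at the warehouse directory created earlier. /user/hive/warehouse is already Hive's default, so this extra property only makes the setting explicit:
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
<description>Location of default database for the warehouse</description>
</property>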
Last step: copy the MySQL JDBC connector jar into Hive's lib directory.
Terminal > sudo cp /home/sachin/Desktop/mysql-connector-java-5.1.22-bin.jar /home/sachin/hadoop-2.7.7/apache-hive-2.3.4-bin/lib
Check the permissions of the connector after the copy; it is supposed to be readable (open) and not locked,
i.e. mysql-connector-java-5.1.28-bin.jar OR 5.1.22, otherwise we get an “Error code 1 Fail”.
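For example, to make the jar readable (same path as used in the copy above):
Terminal > sudo chmod 644 /home/sachin/hadoop-2.7.7/apache-hive-2.3.4-bin/lib/mysql-connector-java-5.1.22-bin.jar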
Terminal > hive
hive> show databases;
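As a quick test of the metastore and warehouse setup (the database and table names here are only examples):
hive> create database testdb;
hive> use testdb;
hive> create table emp (id int, name string) row format delimited fields terminated by ',';
hive> show tables;
hive> drop table emp;
hive> drop database testdb;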
Author:
Mr. Sachin Patil (Hadoop Trainer and Coordinator Exp: 12+ years)
At Sevenmentor Pvt. Ltd.
© Copyright 2019 | Sevenmentor Pvt Ltd.