Introducing R with Hadoop
R is an open source software package for performing statistical analysis on data. R is a programming language used by data scientists, statisticians, and others who need to perform statistical analysis of data and glean key insights from it using mechanisms such as regression, clustering, classification, and text analysis. R is distributed under the GNU General Public License. R was developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently maintained by the R Development Core Team. It can be considered a different implementation of S, developed by John Chambers at Bell Labs. There are some important differences, but much of the code written for S runs unaltered under the R interpreter engine.
R provides a wide variety of statistical and machine-learning techniques (linear and nonlinear modeling, classic statistical tests, time-series analysis, classification, clustering) as well as graphical techniques, and it is highly extensible. R has various built-in as well as extended functions for statistical, machine-learning, and visualization tasks such as:
- Data extraction
- Data cleaning
- Data loading
- Data transformation
- Statistical analysis
- Predictive modeling
- Data visualization
It is one of the most popular open source statistical analysis packages available on the market today. It is cross-platform, has very wide community support, and has a large and ever-growing user community that adds new packages every day. With its growing list of packages, R can now connect with other data stores, such as MySQL, SQLite, MongoDB, and Hadoop, for data storage activities.
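As a small illustration of these tasks, the following sketch uses only base R and a built-in dataset (airquality is just an example) to load, clean, transform, analyze, and visualize data:
# Data loading: a built-in dataset
data(airquality)
# Data cleaning: drop rows with missing values
aq <- na.omit(airquality)
# Data transformation: add temperature in Celsius
aq$TempC <- (aq$Temp - 32) * 5 / 9
# Statistical analysis and predictive modeling: summary statistics and a linear model
summary(aq$Ozone)
fit <- lm(Ozone ~ TempC + Wind, data = aq)
summary(fit)
# Data visualization: scatter plot with the fitted simple regression line
plot(aq$TempC, aq$Ozone, xlab = "Temperature (C)", ylab = "Ozone")
abline(lm(Ozone ~ TempC, data = aq))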
Understanding features of R
Let’s see different useful features of R:
- Effective programming language
- Relational database support
- Data analytics
- Data visualization
- Extension through the huge library of R packages
Studying the popularity of R
R enables data analytics through various statistical and machine-learning operations, such as the following:
- Regression
- Classification
- Clustering
- Recommendation
- Text mining
Introducing Big Data
Big Data has to deal with large and complex datasets that can be structured, semi-structured, or unstructured and will typically not fit into memory to be processed. They have to be processed in place, which means that computation has to be done where the data resides. When we talk to developers, the people actually building Big Data systems and applications, we get a better idea of what they mean. They generally talk about the 3Vs model of Big Data, which are velocity, volume, and variety.
Volume refers to the size of the dataset; it may be in KB, MB, GB, TB, or PB, based on the type of application that generates or receives the data. Velocity refers to the low latency, real-time speed at which the analytics need to be applied.
Variety refers to the various formats of the data that may exist, for example, text, audio, video, and photos.
Big Data usually includes datasets with sizes beyond the ability of conventional systems to process within the timeframe mandated by the business. Big Data volumes are a constantly moving target; as of 2012, they ranged from a few dozen terabytes to many petabytes of data in a single dataset. Faced with this seemingly insurmountable challenge, entirely new platforms, called Big Data platforms, have emerged.
Getting information about popular organizations that hold Big Data
Some of the popular organizations that hold Big Data are as follows:
- Facebook: It has 40 PB of data and captures 100 TB/day
- Yahoo!: It has 60 PB of data
- Twitter: It captures 8 TB/day
- eBay: It has 40 PB of data and captures 50 TB/day
How much data is considered Big Data differs from company to company. Though it is true that one company's Big Data is another's small data, there is something common: it does not fit in memory or on a single disk, it has a rapid influx of data that needs to be processed, and it would benefit from a distributed software stack. For some companies, 10 TB of data would be considered Big Data, and for others, 1 PB would be Big Data. So only you can determine whether the data is really Big Data. It is sufficient to say that it would start in the low terabyte range. Also, a question well worth asking is: do you think you do not have a Big Data problem now only because you are not capturing and retaining enough of your data? In some scenarios, companies literally discard data because there was no cost-effective way to store and process it. With platforms such as Hadoop, it is possible to start capturing and storing all of that data.
Introducing Hadoop
Apache Hadoop is an open source Java framework for processing and querying vast amounts of data on large clusters of commodity hardware. Hadoop is a top-level Apache project, initiated and led by Yahoo! and Doug Cutting. It relies on an active community of contributors from all over the world for its success. With a significant technology investment by Yahoo!, Apache Hadoop has become an enterprise-ready cloud computing technology. It is becoming the industry de facto framework for Big Data processing. Hadoop changes the economics and the dynamics of large-scale computing. Its impact can be boiled down to four salient characteristics: Hadoop enables scalable, cost-effective, flexible, and fault-tolerant solutions.
Exploring Hadoop features
Apache Hadoop has two main features:
- HDFS (Hadoop Distributed File System)
- MapReduce
Studying Hadoop components
Hadoop includes an ecosystem of other products built over the core HDFS and MapReduce layer to enable various types of operations on the platform.
A few popular Hadoop components are as follows:
- Mahout: This is an extensive library of machine learning algorithms.
- Pig: Pig is a high-level data-flow language (comparable to Perl) for analyzing large datasets, with its own syntax for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
- Hive: Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad hoc queries, and the analysis of large datasets stored in HDFS. It has its own SQL-like query language called Hive Query Language (HQL), which is used to issue query commands to Hadoop.
- HBase: HBase (Hadoop Database) is a distributed, column-oriented database. HBase uses HDFS for the underlying storage. It supports both batch-style computations using MapReduce and atomic queries (random reads).
- Sqoop: Apache Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured relational databases. Sqoop is an abbreviation of SQL-to-Hadoop.
- ZooKeeper: ZooKeeper is a centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services, which are very useful for a variety of distributed systems.
Learning RHadoop
RHadoop is an open source software framework of R for performing data analytics on the Hadoop platform via R functions. RHadoop has been developed by Revolution Analytics, a leading commercial provider of software and services based on the open source R project for statistical computing.
The RHadoop project has three different R packages: rhdfs, rmr, and rhbase. All these packages are implemented and tested on the Cloudera Hadoop distributions CDH3 and CDH4 with R 2.15.0, and also with the R version 4.3, 5.0, and 6.0 distributions of Revolution Analytics. These three R packages are designed around Hadoop's two main features, HDFS and MapReduce:
- rmr: This is an R package that provides Hadoop MapReduce interfaces to R. With the help of this package, Mappers and Reducers can easily be developed in R.
- rhdfs: This is an R package for providing all Hadoop HDFS access to R. All distributed files can be managed with R functions.
- rhbase: This is an R package for handling data in the HBase distributed database through R.
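As a minimal sketch of the rmr interface (assuming the rmr2 package, the current implementation of rmr, is installed and the local backend is used so that no running cluster is required), the following job squares a set of numbers in the Mapper and sums the squares in the Reducer:
library(rmr2)
# Local backend: runs the job inside the R session instead of on a Hadoop cluster
rmr.options(backend = "local")
# Write the input numbers to the (local/HDFS) file system
small.ints <- to.dfs(1:100)
# MapReduce job: Mapper emits squares, Reducer sums them under a single key
result <- mapreduce(
  input = small.ints,
  map = function(k, v) keyval(1, v^2),
  reduce = function(k, vv) keyval(k, sum(vv)))
# Read the result back into R
from.dfs(result)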
Understanding Big Data Analysis with Machine Learning
Introduction to machine learning
Machine learning is a branch of artificial intelligence that allows us to make our application intelligent without being explicitly programmed.
A combination of machine learning and data mining is often used to develop spam mail detectors, self-driving cars, speech recognition, face recognition, and online transactional fraud detection.
Several popular organizations use machine-learning algorithms to make their services or products understand the needs of their users and provide services according to their behavior.
Google has its intelligent web search engine, which provides a leading search experience, spam classification in Google Mail, and news labeling in Google News, while Amazon is known for its recommender systems.
There are several open source frameworks available for developing these kinds of applications, such as R, Python, Apache Mahout, and Weka.
Supervised machine-learning algorithms
In this section, we will be learning about supervised machine-learning algorithms.
The algorithms are as follows:
- Linear regression
- Logistic regression
Linear regression is mainly used for predicting and forecasting values based on historical information.
Regression is a supervised machine-learning technique to identify the linear relationship between target variables and explanatory variables. We can say it is used for predicting the target variable values in numeric form.
In the following sections, we will learn about linear regression with R and linear regression with R and Hadoop.
Here, the variables that are going to be predicted are considered as target variables and the variables that are going to help predict the target variables are called explanatory variables. With the linear relationship, we can identify the impact of a change in explanatory variables on the target variable. In mathematics, regression can be formulated as follows:
y = ax + e
Other formulae include:
- The slope of the regression line is given by: a = (NΣxy - (Σx)(Σy)) / (NΣx² - (Σx)²)
- The intercept point of regression is given by: e = (Σy - a(Σx)) / N
Here, x and y are variables that form a dataset and N is the total numbers of values. Suppose we have the data shown in the following table:
| X | Y |
| --- | --- |
| 63 | 3.1 |
| 64 | 3.6 |
| 65 | 3.8 |
| 66 | 4.0 |
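As a quick check of the formulas, the following sketch computes the slope a and the intercept e for the table above in plain R and then predicts y for a new x (the new value x = 67 is only illustrative):
x <- c(63, 64, 65, 66)
y <- c(3.1, 3.6, 3.8, 4)
N <- length(x)
# Slope: a = (N*sum(x*y) - sum(x)*sum(y)) / (N*sum(x^2) - sum(x)^2)
a <- (N * sum(x * y) - sum(x) * sum(y)) / (N * sum(x^2) - sum(x)^2)
# Intercept: e = (sum(y) - a*sum(x)) / N
e <- (sum(y) - a * sum(x)) / N
# Predict y for a new value x = 67 using y = a*x + e
a * 67 + e
For this data, a is about 0.29 and e is about -15.08, giving a predicted y of roughly 4.35 for x = 67.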
If we have a new value of x, we can get the corresponding value of y with the help of the regression formula. Applications of linear regression include:
- Sales forecasting
- Predicting optimum product price
- Predicting the next online purchase from various sources and campaigns
Linear regression with R
Now we will see how to perform linear regression in R. We can use the built-in lm() function to build a linear regression model with R.
Model <- lm(target ~ ex_var1, data = train_dataset)
It will build a regression model based on the provided dataset and store all of the coefficients and model parameters in the model variable, which are then used for prediction and for identifying patterns in the data.
# Defining data variables
X = matrix(rnorm(2000), ncol = 10)
y = as.matrix(rnorm(200))
# Bundling data variables into a data frame
train_data <- data.frame(X, y)
# Training the model used for generating predictions
lmodel <- lm(y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10, data = train_data)
summary(lmodel)
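Once the model is built, fitted values for new data can be obtained with the built-in predict() function; a minimal sketch with a randomly generated new observation (the column names X1 to X10 match those created by data.frame above):
# A new observation with the same ten explanatory variables
new_obs <- data.frame(matrix(rnorm(10), ncol = 10))
colnames(new_obs) <- paste0("X", 1:10)
# Predicting the target value for the new observation
predict(lmodel, newdata = new_obs)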
Linear Regression with R and Hadoop
Let’s see how to perform regression analysis with R and Hadoop data technologies.
# Loading the RHadoop rmr2 package (provides to.dfs and mapreduce)
library(rmr2)
# Defining the datasets with Big Data matrix X
X = matrix(rnorm(20000), ncol = 10)
X.index = to.dfs(cbind(1:nrow(X), X))
y = as.matrix(rnorm(2000))
Here, the Sum() reducer function is reusable across both MapReduce jobs, as shown in the following code:
# Function defined to be used as the reducer
Sum =
  function(., YY)
    keyval(1, list(Reduce('+', YY)))
The outline of the linear regression algorithm is as follows:
- Calculating the XtX value with MapReduce job 1.
- Calculating the Xty value with MapReduce job 2.
- Deriving the coefficient values with solve(XtX, Xty).
Let’s understand these steps one by one.
The first step is to calculate the XtX value with MapReduce job 1.
- The big matrix is passed to the Mapper in chunks of complete rows. Smaller cross-products are computed for these submatrices and passed on to a single Reducer, which sums them together. Since we have a single key, a Combiner is mandatory and since the matrix sum is associative and commutative, we certainly can use it here.
XtX =
  values(
    # For loading HDFS data into R
    from.dfs(
      # MapReduce job to produce t(X) %*% X
      mapreduce(
        input = X.index,
        # Mapper – calculates and emits the partial cross-product t(Xi) %*% Xi
        map =
          function(., Xi) {
            yi = y[Xi[,1],]
            Xi = Xi[,-1]
            keyval(1, list(t(Xi) %*% Xi))},
        # Reducer – reduces the Mapper output by summing the partial products
        reduce = Sum,
        combine = TRUE)))[[1]]
- When we have a large amount of data stored in Hadoop Distributed File System (HDFS), we need to pass its path value to the input parameters in the MapReduce method.
- In the preceding code, we saw that X is the design matrix, which has been created with the following function:
X = matrix(rnorm(20000), ncol = 10)
Calculating the Xty value with MapReduce job 2 is pretty much the same; the vector y is available to the nodes according to normal scope rules.
Xty =
  values(
    # For loading HDFS data into R
    from.dfs(
      # MapReduce job to produce t(X) %*% y
      mapreduce(
        input = X.index,
        # Mapper – calculates and emits the partial product t(Xi) %*% yi
        map = function(., Xi) {
          yi = y[Xi[,1],]
          Xi = Xi[,-1]
          keyval(1, list(t(Xi) %*% yi))},
        # Reducer – reduces the Mapper output by summing the partial products
        reduce = Sum,
        combine = TRUE)))[[1]]
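With both cross-products available in memory, the third step of the outline needs no MapReduce job; the coefficients follow directly from the base R solve() function:
# Deriving the regression coefficients from the two reduced matrices
beta <- solve(XtX, Xty)
beta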
Logistic regression
In statistics, logistic regression or logit regression is a type of probabilistic classification model.
Logistic regression is used extensively in numerous disciplines, including the medical and social science fields.
It can be binomial or multinomial.
Binomial logistic regression deals with situations in which the outcome of a dependent variable can have two possible types.
Multinomial logistic regression deals with situations where the outcome can have three or more possible types.
Logistic regression can be implemented using the logistic function, whose formulas are listed here.
- To predict the log odds ratios, use the following formula:
logit(p) = β0 + β1 × x1 + β2 × x2 + … + βn × xn
- The probability formula is as follows:
p = e^logit(p) / (1 + e^logit(p))
logit(p) is a linear function of the explanatory variables X (x1, x2, x3, ..., xn), which is similar to linear regression. The output of the logistic function, the probability p, will always be in the range 0 to 1.
In a majority of the cases, if the score is greater than 0.5, it will be considered as 1, otherwise 0. Also, we can say it provides a classification boundary to classify the outcome variable.
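As a small illustration of these formulas, the following sketch (with made-up coefficient values) converts a logit score into a probability and applies the 0.5 cut-off:
# Illustrative coefficients and a single explanatory value (assumed, not from a fitted model)
beta0 <- -1.5
beta1 <- 0.8
x1 <- 2.3
# Linear part: logit(p) = beta0 + beta1 * x1
logit_p <- beta0 + beta1 * x1
# Probability: p = e^logit(p) / (1 + e^logit(p)); equivalently plogis(logit_p)
p <- exp(logit_p) / (1 + exp(logit_p))
# Classification boundary at 0.5
ifelse(p > 0.5, 1, 0)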
Logistic regression with R
To perform logistic regression with R, we will use the iris dataset and the glm() function.
#loading iris dataset
data(iris)
# Setting up target variable
target <- data.frame(isSetosa = (iris$Species == "setosa"))
# Adding target to iris and creating new dataset
inputdata <- cbind(target,iris)
# Defining the logistic regression formula
formula <- isSetosa ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
# running Logistic model via glm()
logisticModel <- glm(formula, data = inputdata, family = "binomial")
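The fitted model can then be used to score observations with the built-in predict() function; a minimal sketch (here simply scoring the training rows, with type = "response" returning probabilities rather than logits):
# Predicted probabilities of being setosa
probabilities <- predict(logisticModel, newdata = inputdata, type = "response")
# Apply the 0.5 cut-off and compare against the actual labels
table(predicted = probabilities > 0.5, actual = inputdata$isSetosa)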
Logistic regression with R and Hadoop
To perform logistic regression with R and Hadoop, we will use RHadoop with rmr2.
The outline of the logistic regression algorithm is as follows:
- Defining the lr.map Mapper function
- Defining the lr.reduce Reducer function
- Defining the logistic.regression MapReduce function
Let’s understand them one by one.
We will first define the logistic regression function with gradient descent. Multivariate regression can be performed by forming the independent variables into a matrix data format. For factor variables, we can translate them into binary variables before fitting the model. This function will ask for input, iterations, dims, and alpha as input parameters.
- lr.map: This stands for the logistic regression Mapper, which will compute the contribution of subset points to the gradient.
# Mapper – computes the contribution of a subset of points to the gradient.
lr.map = function(., M)
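The body of lr.map is not shown above. A minimal sketch of how it might be completed, assuming the first column of each input chunk M holds the label Y (coded as +1/-1), the remaining columns hold the features X, and the current plane and the sigmoid g are visible to the Mapper through scope rules (as the vector y was in the linear regression example):
lr.map =
  function(., M) {
    # Split the chunk into labels and features (assumed layout)
    Y = M[, 1]
    X = M[, -1]
    # Contribution of this subset of points to the gradient
    keyval(1, Y * X * g(-Y * as.numeric(X %*% t(plane))))}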
- lr.reduce: This stands for the logistic regression Reducer, which performs just a big sum of all the values for key 1.
# Reducer – Perform sum operation over Mapper output.
lr.reduce = function(k, Z)
  keyval(k, t(as.matrix(apply(Z, 2, sum))))
- logistic.regression: This mainly defines the logistic.regression MapReduce function with the following input parameters. Calling this function will start executing the MapReduce-based logistic regression.
- input: This is an input dataset
- iterations: This is the fixed number of iterations for calculating the gradient
- dims: This is the dimension of input variables
- alpha: This is the learning rate
Let’s see how to develop the logistic regression function.
# MapReduce job – Defining the MapReduce function for executing logistic regression
logistic.regression =
  function(input, iterations, dims, alpha){
    # Start from a zero separating plane and use the sigmoid as the link function
    plane = t(rep(0, dims))
    g = function(z) 1/(1 + exp(-z))
    for (i in 1:iterations) {
      gradient =
        values(
          from.dfs(
            mapreduce(
              input,
              map = lr.map,
              reduce = lr.reduce,
              combine = T)))
      # Gradient descent update of the plane with learning rate alpha
      plane = plane + alpha * gradient }
    plane }
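A minimal sketch of how this function might be invoked on toy data, assuming rmr2 is loaded, lr.map and lr.reduce are defined as above, the first input column holds labels coded as +1/-1, and the local backend is used for a quick test:
library(rmr2)
rmr.options(backend = "local")
# Toy input: first column is the +1/-1 label, the remaining ten columns are features
M <- cbind(sample(c(-1, 1), 200, replace = TRUE), matrix(rnorm(2000), ncol = 10))
input <- to.dfs(M)
# Ten iterations of gradient descent with learning rate 0.05
plane <- logistic.regression(input, iterations = 10, dims = 10, alpha = 0.05)
plane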
Please feel free to leave your comments in the comment box so that we can improve this guide and serve you better. Also, follow sevenmentor.com to get updates on new blogs.
If you wish to learn Big Data Hadoop tools such as Hive, Pig, HBase, Sqoop, Spark, and R, then check out our course on sevenmentor.com.
For Free, Demo classes Call: 8983120543
Registration Link: Click Here!
Author –
Harshal Patil
Data Science Trainer
https://www.sevenmentor.com/data-science-with-r-training-in-pune.php
Call the Trainer and Book your free demo Class now!!!
© Copyright 2019 | Sevenmentor Pvt Ltd.