# Introduction to R for Data Science :: Session 7 [Multiple Linear Regression Model in R + C…

Welcome to Introduction to R for Data Science Session 7: Multiple Regression + Dummy Coding, Partial and Part Correlations [Multiple Linear Regression in R. Dummy coding: various ways to do it in R. Factors. Inspecting the multiple regression model: regression coefficients and their interpretation, confidence intervals, predictions. Introducing {lattice} plots + ggplot2. Assumptions: multicollinearity and testing it with the {car} package. Predictive models with categorical and continuous predictors. Influence plot. Partial and part (semi-partial) correlation in R.]

The course is co-organized by Data Science Serbia and Startit. You will find all course material (R scripts, data sets, SlideShare presentations, readings) on these pages.

Check out the Course Overview to access the learning material presented thus far.

## Lecturers

Goran S. Milovanović and Branko Kovač

Summary of Session 7, 9 June 2016 :: Multiple Regression + Dummy Coding, Partial and Part Correlations.

## R script :: Session 7

```r
########################################################
# Introduction to R for Data Science
# SESSION 7 :: 9 June, 2016
# Multiple Linear Regression in R
# Data Science Community Serbia + Startit
# :: Goran S. Milovanović and Branko Kovač ::
########################################################

# clear workspace
rm(list = ls())

#### libraries
library(datasets)
library(broom)
library(ggplot2)
library(lattice)
library(QuantPsyc)

#### load data
data(iris)
str(iris)

#### simple linear regression: Sepal Length vs Petal Length
# Predictor vs Criterion {ggplot2}
ggplot(data = iris,
       aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(size = 2, colour = "black") +
  geom_point(size = 1, colour = "white") +
  geom_smooth(aes(colour = "black"),
              method = 'lm') +
  ggtitle("Sepal Length vs Petal Length") +
  xlab("Sepal Length") + ylab("Petal Length") +
  theme(legend.position = "none")
```
```r
# What is wrong here?
# let's see...
reg <- lm(Petal.Length ~ Sepal.Length, data = iris)
summary(reg)
# Hm, everything seems fine to me...

# And now for something completely different (but in R)...

#### Problems with linear regression in iris
# Predictor vs Criterion {ggplot2} - group separation
ggplot(data = iris,
       aes(x = Sepal.Length,
           y = Petal.Length,
           color = Species)) +
  geom_point(size = 2) +
  ggtitle("Sepal Length vs Petal Length") +
  xlab("Sepal Length") + ylab("Petal Length")
```
```r
# Predictor vs Criterion {ggplot2} - separate regression lines
ggplot(data = iris,
       aes(x = Sepal.Length,
           y = Petal.Length,
           colour = Species)) +
  geom_smooth(method = lm) +
  geom_point(size = 2) +
  ggtitle("Sepal Length vs Petal Length") +
  xlab("Sepal Length") + ylab("Petal Length")
```
```r
### Ooops...
### overview and considerations
plot(iris[, c(1, 3, 5)],
     main = "Inspect: Sepal vs. Petal Length\nfollowing the discovery of the Species...",
     cex.main = .75,
     cex = .6)
```
```r
### better... {lattice}
xyplot(Petal.Length ~ Sepal.Length | Species, # {lattice} xyplot
       data = iris,
       xlab = "Sepal Length", ylab = "Petal Length")
```
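If you also want the per-group regression lines drawn inside the lattice panels, `xyplot()` can overlay them directly. A minimal sketch, reusing the same formula: `type = "r"` adds a simple least-squares line to each panel.

```r
# {lattice} panels with a regression line per Species - a sketch
xyplot(Petal.Length ~ Sepal.Length | Species,
       data = iris,
       type = c("p", "r"), # "p" = points, "r" = least-squares line per panel
       xlab = "Sepal Length", ylab = "Petal Length")
```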
```r
### Petal.Length and Sepal.Length: EDA and distributions
par(mfcol = c(2, 2))
# boxplot: Petal.Length
boxplot(iris$Petal.Length,
        horizontal = TRUE,
        xlab = "Petal Length")
# histogram: Petal.Length
hist(iris$Petal.Length,
     main = "",
     xlab = "Petal Length",
     prob = T)
lines(density(iris$Petal.Length),
      lty = "dashed",
      lwd = 2.5,
      col = "red")
# boxplot: Sepal.Length
boxplot(iris$Sepal.Length,
        horizontal = TRUE,
        xlab = "Sepal Length")
# histogram: Sepal.Length
hist(iris$Sepal.Length,
     main = "",
     xlab = "Sepal Length",
     prob = T)
lines(density(iris$Sepal.Length),
      lty = "dashed",
      lwd = 2.5,
      col = "blue")
```
```r
# Petal Length and Sepal Length: Conditional Densities
densityplot(~ Petal.Length | Species, # {lattice} densityplot
            data = iris,
            plot.points = FALSE,
            xlab = "Petal Length", ylab = "Density",
            main = "P(Petal Length|Species)",
            col.line = 'red')
```
```r
densityplot(~ Sepal.Length | Species, # {lattice} densityplot
            data = iris,
            plot.points = FALSE,
            xlab = "Sepal Length", ylab = "Density",
            main = "P(Sepal Length|Species)",
            col.line = 'blue')
```
```r
# Linear regression in subgroups
species <- unique(iris$Species)
w1 <- which(iris$Species == species[1]) # setosa
reg <- lm(Petal.Length ~ Sepal.Length, data = iris[w1, ])
tidy(reg)
w2 <- which(iris$Species == species[2]) # versicolor
reg <- lm(Petal.Length ~ Sepal.Length, data = iris[w2, ])
tidy(reg)
w3 <- which(iris$Species == species[3]) # virginica
reg <- lm(Petal.Length ~ Sepal.Length, data = iris[w3, ])
tidy(reg)

#### Dummy Coding: Species in the iris dataset
is.factor(iris$Species)
levels(iris$Species)
reg <- lm(Petal.Length ~ Species, data = iris)
tidy(reg)
glance(reg)
# Never forget what the regression coefficient for a dummy variable means:
# it tells us about the effect of moving from the baseline towards the respective reference level!
# Here: baseline = setosa (cmp. levels(iris$Species) vs. the output of tidy(reg))
# NOTE: watch for the order of levels!
levels(iris$Species) # Levels: setosa versicolor virginica
iris$Species <- factor(iris$Species,
                       levels = c("versicolor",
                                  "virginica",
                                  "setosa"))
levels(iris$Species) # baseline is now: versicolor
reg <- lm(Petal.Length ~ Species, data = iris)
tidy(reg)
# The regression coefficients (!): figure out what has happened!

### another way to do dummy coding
rm(iris); data(iris) # ...just to fix the order of Species back to default
levels(iris$Species)
contrasts(iris$Species) = contr.treatment(3, base = 1)
contrasts(iris$Species) # this is probably what you remember from your stats class...
iris$Species <- factor(iris$Species,
                       levels = c("virginica", "versicolor", "setosa"))
levels(iris$Species)
contrasts(iris$Species) = contr.treatment(3, base = 1) # baseline is now: virginica
contrasts(iris$Species) # consider carefully what you need to do

### Petal.Length ~ Species (Dummy Coding) + Sepal.Length
rm(iris); data(iris) # ...just to fix the order of Species back to default
reg <- lm(Petal.Length ~ Species + Sepal.Length, data = iris)
# BTW: since is.factor(iris$Species) == T, R does the dummy coding in lm() for you
regSum <- summary(reg)
regSum$r.squared
regSum$coefficients
# compare w. Simple Linear Regression
reg <- lm(Petal.Length ~ Sepal.Length, data = iris)
regSum <- summary(reg)
regSum$r.squared
regSum$coefficients

### Comparing nested models
reg1 <- lm(Petal.Length ~ Sepal.Length, data = iris)
reg2 <- lm(Petal.Length ~ Species + Sepal.Length, data = iris)
# reg1 is nested under reg2
# terminology: reg2 is a "full model"
# this terminology will be used quite often in Logistic Regression

# NOTE: Nested models
# There is a set of coefficients for the nested model (reg1) such that it
# can be expressed in terms of the full model (reg2); in our case it is simple.
# HOMEWORK: figure it out.
```
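Before calling `anova()` below, it may help to see what the partial F-test actually computes. A minimal sketch, assuming only `reg1` and `reg2` from above: the test compares the drop in residual sum of squares against the residual variance of the full model.

```r
# Partial F-test "by hand" - a sketch; compare with anova(reg1, reg2) below
rss1 <- sum(residuals(reg1)^2) # RSS of the nested (restricted) model
rss2 <- sum(residuals(reg2)^2) # RSS of the full model
df1 <- reg1$df.residual        # residual d.f. of the nested model
df2 <- reg2$df.residual        # residual d.f. of the full model
Fstat <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
Fstat
pf(Fstat, df1 - df2, df2, lower.tail = FALSE) # p-value
```

The F statistic and p-value should match the `anova(reg1, reg2)` output that follows.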
```r
anova(reg1, reg2) # partial F-test; Species certainly has an effect beyond Sepal.Length
# NOTE: for the partial F-test, see:
# http://pages.stern.nyu.edu/~gsimon/B902301Page/CLASS02_24FEB10/PartialFtest.pdf

# Influence Plot
regFrame <- augment(reg2)
# influence data
infReg <- as.data.frame(influence.measures(reg2)$infmat)
# data.frame for ggplot2
plotFrame <- data.frame(residual = regFrame$.std.resid,
                        leverage = regFrame$.hat,
                        cookD = regFrame$.cooksd)
ggplot(plotFrame,
       aes(y = residual,
           x = leverage)) +
  geom_point(size = plotFrame$cookD * 100, shape = 1) +
  ggtitle("Influence Plot\nSize of the circle corresponds to Cook's distance") +
  theme(plot.title = element_text(size = 8, face = "bold")) +
  ylab("Standardized Residual") + xlab("Leverage")
```
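If you prefer a ready-made influence plot, the {car} package (loaded below for VIF in any case) provides one. A one-line sketch:

```r
# {car} alternative to the hand-rolled influence plot above - a sketch
library(car)
influencePlot(reg2) # circle area scales with Cook's distance; noteworthy cases are returned
```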
```r
#### Multiple Regression - by the book
# Following: http://www.r-tutor.com/elementary-statistics/multiple-linear-regression
# (that's from your reading list, to remind you...)
data(stackloss)
str(stackloss)
# Data set description
# URL: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/stackloss.html
# Air.Flow represents the rate of operation of the plant.
# Water.Temp is the temperature of cooling water circulated through coils in the absorption tower.
# Acid.Conc. is the concentration of the acid circulating.
# stack.loss (the dependent variable) is 10 times the percentage of the ingoing ammonia to
# the plant that escapes from the absorption column unabsorbed;
# that is, an (inverse) measure of the overall efficiency of the plant.
stacklossModel <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
                     data = stackloss)

# let's see:
summary(stacklossModel)
glance(stacklossModel) # {broom}
tidy(stacklossModel) # {broom}

# predict new data
obs <- data.frame(Air.Flow = 72, Water.Temp = 20, Acid.Conc. = 85)
predict(stacklossModel, obs)

# confidence intervals
confint(stacklossModel, level = .95) # 95% CI
confint(stacklossModel, level = .99) # 99% CI
# 95% CI for Acid.Conc. only
confint(stacklossModel, "Acid.Conc.", level = .95)

# default regression plots in R
plot(stacklossModel)
```
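`predict()` can also return interval estimates for new observations. A small sketch, reusing the `obs` data frame defined above: `interval = "confidence"` gives a CI for the mean response, while `interval = "prediction"` gives the wider interval for a single new case.

```r
# Interval estimates for the prediction - a sketch, reusing obs from above
predict(stacklossModel, obs, interval = "confidence", level = .95) # CI for the mean response
predict(stacklossModel, obs, interval = "prediction", level = .95) # PI for a single new case
```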
```r
# multicollinearity
library(car) # John Fox's {car} package
VIF <- vif(stacklossModel)
VIF
sqrt(VIF)
# Variance Inflation Factor (VIF):
# the increase in the ***variance*** of a regression coefficient due to collinearity
# NOTE: sqrt(VIF) = how much larger the ***SE*** of a reg. coeff. is vs. what it would be
# if there were no correlations with the other predictors in the model
# NOTE: lower_bound(VIF) = 1; no upper bound; VIF > 2 --> (Concerned == TRUE)
Tolerance <- 1/VIF # obviously, tolerance and VIF are redundant
Tolerance
# NOTE: you can also inspect multicollinearity in the multiple regression model
# by conducting a Principal Component Analysis over the predictors;
# when the time is right.

#### R for partial and part (semi-partial) correlations
library(ppcor) # a good one; there are many ways to do this in R

#### partial correlation in R
dataSet <- iris
str(dataSet)
dataSet$Species <- NULL
irisPCor <- pcor(dataSet, method = "pearson")
irisPCor$estimate # partial correlations
irisPCor$p.value # results of significance tests
irisPCor$statistic # t-test on n-2-k degrees of freedom; k = num. of variables conditioned on
# partial correlation between x and y while controlling for z
partialCor <- pcor.test(dataSet$Sepal.Length, dataSet$Petal.Length,
                        dataSet$Sepal.Width,
                        method = "pearson")
partialCor$estimate
partialCor$p.value
partialCor$statistic

# NOTE:
# Formally, the partial correlation between X and Y given a set of n
# controlling variables Z = {Z1, Z2, ..., Zn}, written ρXY·Z, is the
# correlation between the residuals RX and RY resulting from the
# linear regression of X with Z and of Y with Z, respectively.
# The first-order partial correlation (i.e. when n = 1) is the difference
# between a correlation and the product of the removable correlations
# divided by the product of the coefficients of alienation of the
# removable correlations. [source: https://en.wikipedia.org/wiki/Partial_correlation]
# NOTE: coefficient of alienation = 1 - R2 (R2 = "r-squared")
# coefficient of alienation = the proportion of variance "unaccounted for"

#### semi-partial correlation in R
# NOTE: semi-partial correlation is the correlation of two variables
# with variation from a third or more other variables removed only
# from the ***second variable***
# NOTE: the first variable <- rows, the second variable <- columns
# cf. ppcor: An R Package for a Fast Calculation to Semi-partial Correlation Coefficients (2015)
# Seongho Kim, Biostatistics Core, Karmanos Cancer Institute, Wayne State University
# URL: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4681537/
irisSPCor <- spcor(dataSet, method = "pearson")
irisSPCor$estimate
irisSPCor$p.value
irisSPCor$statistic

partCor <- spcor.test(dataSet$Sepal.Length, dataSet$Petal.Length,
                      dataSet$Sepal.Width,
                      method = "pearson")
# NOTE: this is the correlation of dataSet$Sepal.Length w. dataSet$Petal.Length
# when the variance of dataSet$Petal.Length (the 2nd variable) due to dataSet$Sepal.Width
# is removed!
partCor$estimate
partCor$p.value
partCor$statistic

# NOTE: in multiple regression, this is the semi-partial (or part) correlation
# that you need to inspect:
# assume a model with X1, X2, X3 as predictors, and Y as a criterion.
# You need a semi-partial of X1 and Y following the removal of X2 and X3 from Y.
# It goes like this: in Step 1, you perform a multiple regression Y ~ X2 + X3;
# in Step 2, you take the residuals of Y, call them RY; in Step 3, you regress (correlate)
# RY ~ X1: the correlation coefficient that you get from Step 3 is the part correlation
# that you're looking for.

# Give a thought to the following discussion on categorical predictors:
# http://stats.stackexchange.com/questions/133203/partial-correlation-and-multiple-regression-controlling-for-categorical-variable
# What's your take on this?
```
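To make the residual-based definitions above concrete, here is a minimal sketch that reproduces both `pcor.test()` and `spcor.test()` from plain `lm()` residuals, assuming `dataSet`, `partialCor`, and `partCor` as defined above, and following the remove-from-the-second-variable convention noted for {ppcor}:

```r
# Partial and part correlation from lm() residuals - a sketch
resX <- residuals(lm(Sepal.Length ~ Sepal.Width, data = dataSet)) # X with Z removed
resY <- residuals(lm(Petal.Length ~ Sepal.Width, data = dataSet)) # Y with Z removed
cor(resX, resY) # partial correlation: compare with partialCor$estimate
cor(dataSet$Sepal.Length, resY) # part correlation: compare with partCor$estimate
```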
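And a callback to the multicollinearity section: VIF itself is just a function of the R² obtained by regressing one predictor on the others. A sketch for Air.Flow in the stackloss model:

```r
# VIF "by hand" - a sketch; compare with vif(stacklossModel) above
r2 <- summary(lm(Air.Flow ~ Water.Temp + Acid.Conc., data = stackloss))$r.squared
1 / (1 - r2) # VIF for Air.Flow = 1 / (1 - R^2 of Air.Flow ~ the other predictors)
```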