# Principal Components Regression, Pt.1: The Standard Method

## Principal Components Regression

In principal components regression (PCR), we use principal components analysis (PCA) to decompose the independent (x) variables into an orthogonal basis (the principal components), and select a subset of those components as the variables with which to predict y. PCR and PCA are useful techniques for dimensionality reduction when modeling, and are especially useful when the independent variables are highly collinear.

Generally, one selects the principal components with the highest variance — that is, the components with the largest singular values — because the subspace defined by these principal components captures most of the variation in the data, and thus represents a smaller space that we believe captures most of the qualities of the data. Note, however, that standard PCA is an "x-only" decomposition, and as Jolliffe (1982) shows through examples from the literature, sometimes lower-variance components can be critical for predicting y, and conversely, high-variance components are sometimes not important.
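As a quick sketch of the standard procedure (on throwaway simulated data, not the example built below), one computes the components with `prcomp`, ranks them by variance, and regresses y on the top few:

```r
# Standard PCR sketch: decompose x, keep the top-variance components,
# regress y on their scores. Placeholder data for illustration only.
set.seed(1)
x <- matrix(rnorm(100 * 5), ncol = 5)
y <- x[, 1] + rnorm(100)

pca <- prcomp(x, center = TRUE, scale. = TRUE)
# fraction of total variance captured by each component
varFrac <- pca$sdev^2 / sum(pca$sdev^2)

k <- 2  # keep the two highest-variance components
scores <- as.data.frame(pca$x[, 1:k])
model <- lm(y ~ ., data = cbind(scores, y = y))
```

Note that this selection rule looks only at the variance fractions in `varFrac`; it never consults y, which is exactly the weakness Jolliffe points out.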

> Mosteller and Tukey (1977, pp. 397-398) argue similarly that the components with small variance are unlikely to be important in regression, apparently on the basis that nature is "tricky, but not downright mean". We shall see in the examples below that without too much effort we can find examples where nature is "downright mean". — Jolliffe (1982)

The remainder of this note presents principal components analysis in the context of PCR and predictive modeling in general. We will show some of the issues in using an x-only technique like PCA for dimensionality reduction. In a follow-up note, we'll discuss some y-aware approaches that address these issues.

First, let’s build our example. In this sort of teaching we insist on toy or synthetic problems so we actually know the right answer, and can therefore tell which procedures are better at modeling the truth.

In this data set, there are two (unobservable) processes: one that produces the output `yA` and one that produces the output `yB`. We only observe the mixture of the two: `y = yA + yB + eps`, where `eps` is a noise term. Think of `y` as measuring some notion of success and the `x` variables as noisy estimates of two different factors that can each drive success. We'll set things up so that the first five variables (`x.01` through `x.05`) have all the signal. The odd-numbered variables correspond to one process (`yB`) and the even-numbered variables correspond to the other (`yA`).

Then, to simulate the difficulties of real-world modeling, we'll add lots of pure noise variables (`noise*`). The noise variables are unrelated to our y of interest, but are related to other "y-style" processes that we are not interested in. As is common with good statistical counterexamples, the example looks like something that should not happen or that can be easily avoided. Our point is that the data analyst is usually working with data just like this.

Data tends to come from databases that must support many different tasks, so it is exactly the case that there may be columns or variables that are correlated to unknown and unwanted additional processes. The reason PCA can't filter out these noise variables is that, without use of y, standard PCA has no way of knowing what portion of the variation in each variable is important to the problem at hand and should be preserved. This can be fixed through domain knowledge (knowing which variables to use), variable pruning, and y-aware scaling. Our next article will discuss these procedures; in this article we will orient ourselves with a demonstration of what both a good analysis and a bad analysis look like.

All the variables are also deliberately mis-scaled to model some of the difficulties of working with under-curated real-world data.

```r
# build example where even and odd variables are bringing in noisy images
# of two different signals.
mkData <- function(n) {
  for(group in 1:10) {
    # y is the sum of two effects yA and yB
    yA <- rnorm(n)
    yB <- rnorm(n)
    if(group==1) {
      d <- data.frame(y=yA+yB+rnorm(n))
      code <- 'x'
    } else {
      code <- paste0('noise',group-1)
    }
    yS <- list(yA,yB)
    # these variables are correlated with y in group 1,
    # but only to each other (and not y) in other groups
    for(i in 1:5) {
      vi <- yS[[1+(i%%2)]] + rnorm(nrow(d))
      d[[paste(code,formatC(i,width=2,flag=0),sep='.')]] <- ncol(d)*vi
    }
  }
  d
}
```

Notice that the copy of y in the data frame has additional "unexplainable variance," so only about 66% of the variation in y is predictable.
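To see where the 66% figure comes from: `y = yA + yB + eps` with each term of unit variance, so the predictable fraction of the variance is `var(yA + yB) / var(y) = 2/3`. A quick standalone check (simulating the three unit-variance terms directly, independent of `mkData`):

```r
# verify the predictable fraction of variance is about 2/3
set.seed(2)
n <- 100000
yA  <- rnorm(n)
yB  <- rnorm(n)
eps <- rnorm(n)
y   <- yA + yB + eps
fracExplainable <- var(yA + yB) / var(y)  # close to 2/3
```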

```r
# make data
set.seed(23525)
dTrain <- mkData(1000)
dTest <- mkData(1000)
```

```r
summary(dTrain[, c("y", "x.01", "x.02",
                   "noise1.01", "noise1.02")])
```

```
##        y                 x.01               x.02
##  Min.   :-5.08978   Min.   :-4.94531   Min.   :-9.9796
##  1st Qu.:-1.01488   1st Qu.:-0.97409   1st Qu.:-1.8235
##  Median : 0.08223   Median : 0.04962   Median : 0.2025
##  Mean   : 0.08504   Mean   : 0.02968   Mean   : 0.1406
##  3rd Qu.: 1.17766   3rd Qu.: 0.93307   3rd Qu.: 1.9949
##  Max.   : 5.84932   Max.   : 4.25777   Max.   :10.0261
##    noise1.01          noise1.02
##  Min.   :-30.5661   Min.   :-30.4412
##  1st Qu.: -5.6814   1st Qu.: -6.4069
##  Median :  0.5278   Median :  0.3031
##  Mean   :  0.1754   Mean   :  0.4145
##  3rd Qu.:  5.9238   3rd Qu.:  6.8142
##  Max.   : 26.4111   Max.   : 31.8405
```