
Statistics Fundamentals Succinctly®
by Katie Kormanik


CHAPTER 9

Linear Regression



Predict one variable with another

We’ve looked at tests that determine whether two or more groups of values are significantly different or independent. With ANOVA, you can determine whether the dependent variable differs significantly across values of the categorical independent variable(s).

The remainder of this e-book addresses how to test the association between one or more independent variables and a dependent variable—in other words, how to predict the amount by which the dependent variable will change when an independent variable changes by a given amount, with all else held constant. This allows us to extrapolate—predict values beyond the range of the observed data—and interpolate—estimate values of one variable within the observed range based on values of another. For example, given how the world population has been changing over the last 10 years, what might we expect the world population to be in the year 2050 (extrapolation)? Or, given the relationship between house price and square feet in a certain location, what might we expect to be the price of a 2,000-square-foot home (interpolation)?

We do this by finding a model that fits the trends in the data, one that inputs specified values of the independent variable(s) and outputs the predicted value of the dependent variable. This test is called regression, and the model we derive is the regression model. While there are many types of regression models (e.g., logistic, quadratic), we will cover only linear models, which are the simplest.

Because regression looks at the change in the dependent variable associated with a change in the independent variable(s), both the independent and dependent variables must be numeric. (Categorical variables can still be used if they are coded numerically, as we’ll see in the multiple regression example.)

Correlation

When performing linear regression, we first visualize trends in the data with a scatter plot. (Note that we can only do this with one independent variable.) Scatter plots show values of the independent variable on the x-axis (horizontal axis) and values of the dependent variable on the y-axis (vertical axis). For this reason, we’ll call the independent variable “x” and the dependent variable “y.” This visualization can quickly tell us whether or not the relationship between x and y is strong (points form a straighter line), weak (points are more variable), positive (greater values of the independent variable are associated with greater values of the dependent variable), or negative (greater values of the independent variable are associated with smaller values of the dependent variable).


Figure 41: The scatter plot on the left shows a positive, strong relationship; the scatter plot on the right shows a positive, weak relationship.


Figure 42: The scatter plot on the left shows a negative, strong relationship; the scatter plot on the right shows a negative, weak relationship.
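If you’d like to reproduce plots like these yourself, here is a minimal sketch using simulated (made-up) data; the variable names and noise levels are chosen purely for illustration:

> x <- rnorm(100) #100 simulated values of the independent variable
> y.strong <- 2*x + rnorm(100, sd = 0.5) #strong positive relationship (little scatter)
> y.weak <- 2*x + rnorm(100, sd = 3) #weak positive relationship (lots of scatter)
> par(mfrow = c(1, 2)) #displays two plots side by side
> plot(x, y.strong)
> plot(x, y.weak)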

We can quantify the strength and direction of a relationship with a statistic called the correlation coefficient, which is denoted by r. While the sign of r indicates the direction, the distance r is from 0 indicates the relationship’s strength.

Note that r ranges from -1 to 1, where -1 is a perfect negative relationship (i.e., the points form a straight line sloping downward), and 1 is a perfect positive relationship. When r = 0, there is no linear relationship between x and y (although a nonlinear relationship may still exist).

To calculate r, we first find the covariance of x and y, which measures the association that a change in x has with a change in y, by averaging the products of each x-value’s and y-value’s deviations from their respective means:

covx,y = Σ(xi - x̄)(yi - ȳ) / (n - 1)

If there is no relationship between x and y, then some of these products will be negative and some will be positive, and they will cancel each other out, resulting in a covariance closer to 0.

You can see from this equation that a positive relationship between x and y means that most points are to the lower left and the upper right of (x̄, ȳ), so that (xi - x̄) and (yi - ȳ) are usually either both positive or both negative. This will result in the products being mostly positive (so the sum will also be positive). Similarly, a negative relationship between x and y means that most coordinates are to the upper left and lower right of (x̄, ȳ), resulting in a negative covariance.
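As a quick sanity check, here is a minimal sketch with made-up numbers showing that this formula matches R’s built-in cov() function:

> x <- c(1, 2, 3, 4, 5)
> y <- c(2, 4, 5, 4, 5)
> sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1) #covariance by hand; returns 1.5
> cov(x, y) #R's built-in covariance; also returns 1.5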

To find r, we divide the covariance by the product of the standard deviation of x and the standard deviation of y:

r = covx,y / (sx)(sy)

Note: Because the standard deviation is always positive, the covariance determines whether r is positive or negative and therefore is the statistic responsible for describing the direction.

The product (sx)(sy) will always be greater than or equal to the absolute value of covx,y. If you visualize each, you can see that (sx)(sy) is the product of squares, which maximize area, and covx,y is the product of rectangles.


Figure 43: The covariance is the area of the average blue rectangle, while (sx)(sy) is the standard length of the orange squares multiplied by the standard height of the orange squares (where “standard” is the square root of the area of the average orange square).

When r equals 1 or -1, the covariance is equal to the product of the standard deviations (r = 1) or the negative product of the standard deviations (r = -1).
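To see the formula in action, here is a minimal sketch using the same made-up numbers as before (redefined so the snippet stands alone), checked against R’s built-in cor() function:

> x <- c(1, 2, 3, 4, 5)
> y <- c(2, 4, 5, 4, 5)
> cov(x, y) / (sd(x) * sd(y)) #correlation coefficient by hand; about 0.77
> cor(x, y) #R's built-in correlation; same value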

As in previous chapters, when we calculate the correlation, we want to do a hypothesis test for significance. This test helps us decide—based on our calculation of r—if the true correlation of the population (denoted ρ) is significantly different from 0.

H0: ρ = 0
Ha: ρ ≠ 0 (two-tailed test)

Again, this is a type of t-test. We will not address how to calculate the t-statistic; the important thing is that you understand the principles and can interpret the results.

Let’s do a correlation test in R between SES and income in 2011 from the NCES data.

Code Listing 17

> plot(ses, income2011) #creates a scatter plot with “ses” on the x-axis and “income2011” on the y-axis

> cor.test(ses, income2011)

     Pearson's product-moment correlation

data:  ses and income2011

t = 13.4525, df = 8245, p-value < 2.2e-16

alternative hypothesis: true correlation is not equal to 0

95 percent confidence interval:

 0.1253655 0.1676048

sample estimates:

      cor

0.1465519

You can see from this test that while r is small (0.15), the results are significant, meaning we can be quite confident that the true correlation ρ is different from 0. Also note that R gives us the 95% confidence interval for ρ, the entire range of which is positive.

Line of best fit

After determining that a relationship does indeed exist between the independent and dependent variables, the next step is to predict how much the dependent variable will change when one or more of the independent variables changes by a certain amount. We do this using a regression line, or line of best fit, so named because it minimizes the sum of squared residuals, the vertical distances between each observed y-value (yi) and the predicted value (ŷi) for the corresponding observed x-value (xi). The sum of squared residuals is equal to Σ(yi - ŷi)². We’ll first go through simple linear regression, which involves only one independent variable.

Figure 44: The red dotted lines visualize the residuals (yi - ŷi). Each point visualizes the observed values (xi, yi), and the line of best fit shows the predicted values of y (ŷ) for any value of x. This line minimizes the sum of squared residuals.
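To produce a plot like this yourself, here is a minimal sketch with made-up data; lm() fits the line, abline() draws it, and resid() extracts the residuals:

> x <- c(1, 2, 3, 4, 5)
> y <- c(2, 4, 5, 4, 5)
> fit <- lm(y ~ x) #fits the line of best fit
> plot(x, y) #scatter plot of the observed values
> abline(fit) #draws the line of best fit on the plot
> sum(resid(fit)^2) #the (minimized) sum of squared residuals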

The general equation for the regression line is ŷi = b0 + b1xi, where b0 and b1 are called the regression coefficients. Coefficient b0 is the predicted value of y when x = 0; coefficient b1 is the amount by which y is expected to change when x changes by one unit.

We can determine the values of b0 and b1 by using calculus to minimize Σ(yi - ŷi)², setting ŷi equal to b0 + b1xi, inputting each value of xi and yi, and knowing that the regression line will always pass through the point (x̄, ȳ).

Therefore, our linear regression equation is ŷi = b0 + b1xi, with coefficients:

b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²

b0 = ȳ - b1x̄
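Here is a minimal sketch with made-up numbers confirming that these formulas match the coefficients produced by R’s lm() function:

> x <- c(1, 2, 3, 4, 5)
> y <- c(2, 4, 5, 4, 5)
> b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2) #slope; 0.6
> b0 <- mean(y) - b1 * mean(x) #intercept; 2.2
> coef(lm(y ~ x)) #R's built-in fit; returns the same coefficients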

We won’t bother calculating this by hand; instead, we’ll do it in R. And R will perform another hypothesis test to determine if the slope b1 is significantly different from 0. In other words, we want to know if a change in x is indeed associated with a change in y.

Let’s first execute a linear regression analysis with income2011 as the dependent variable and only SES as the independent variable. Then we’ll perform a multiple regression analysis to predict how a change in multiple independent variables would lead to a change in income2011.

Code Listing 18

> lm.income1 = lm(income2011 ~ ses) #assigns the name lm.income1 to the regression analysis

> summary(lm.income1) #outputs the results of the regression analysis

Call:

lm(formula = income2011 ~ ses)

Residuals:

   Min     1Q Median     3Q    Max

-35519 -16468  -2624  10514 230221

Coefficients:

            Estimate Std. Error t value Pr(>|t|)   

(Intercept)  26693.0      271.1   98.48   <2e-16 ***

ses           4903.4      364.5   13.45   <2e-16 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 24270 on 8245 degrees of freedom

Multiple R-squared:  0.02148, Adjusted R-squared:  0.02136

F-statistic:   181 on 1 and 8245 DF,  p-value: < 2.2e-16

The results of this test show that the coefficient of SES is 4,903.4, meaning that an increase in SES of 1 is associated with an increase of $4,903.40 in 2011 salary. This is significant at p < 0.001, meaning that the association between SES and 2011 income very likely holds in the true population.

Note: Because the units of SES don’t contain much meaning, it’s helpful to analyze the mean, standard deviation, range, and distribution so that we can see what a one-unit increase in SES means. If we create a histogram of SES, we can see that SES is approximately normally distributed with mean 0.12 and standard deviation 0.73. We can then find the z-score of 1.12 (the mean plus an increase in SES of 1), which is 1.34. Therefore, a one-unit increase in SES moves a student from about average to roughly the 91st percentile (the top 10%).
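Here is a minimal sketch of that check in R, assuming the NCES variables are loaded as in the earlier listings; pnorm() returns the proportion of a normal distribution falling below a given z-score:

> hist(ses) #histogram of SES; approximately normal
> mean(ses) #approximately 0.12
> sd(ses) #approximately 0.73
> z <- (1.12 - mean(ses)) / sd(ses) #z-score of one unit above the mean
> pnorm(z) #proportion below this value; approximately 0.91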

With multiple regression, we have n independent variables, and our general model is
ŷ = b0 + b1x1 + b2x2 + … + bnxn. Coefficient b0 is the predicted value of y when all independent variables equal 0, and each coefficient b1…bn tells us how much y is expected to change when the corresponding independent variable changes by one unit, with the others held constant. Our null hypothesis states that each true population coefficient β0, β1, …, βn is equal to 0 (for the slopes, that a change in the respective independent variable is not associated with a change in the dependent variable), and the alternative states that the coefficient is significantly different from 0.

Let’s do an example in R. We can use the cor.test() function to find that the standardized test score (“test”) has a significant correlation with income in 2011 (r = 0.17, p < 0.001). Perhaps we want to predict 2011 income with the test score as well as with demographic variables race, gender, and SES.

Code Listing 19

> lm.income2 = lm(income2011 ~ test + race + gender + ses) #assigns the name lm.income2 to the regression analysis

> summary(lm.income2) #outputs the results of the regression analysis

Call:

lm(formula = income2011 ~ test + race + gender + ses)

Residuals:

   Min     1Q Median     3Q    Max

-42251 -15636  -2389  10599 237684

Coefficients:

            Estimate Std. Error t value Pr(>|t|)   

(Intercept) 13357.78    1717.06   7.779 8.17e-15 ***

test          332.02      31.26  10.621  < 2e-16 ***

race         -285.83     141.05  -2.026   0.0427 * 

gender      -6626.44     526.92 -12.576  < 2e-16 ***

ses          2612.20     404.27   6.462 1.10e-10 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 23860 on 8242 degrees of freedom

Multiple R-squared:  0.05444, Adjusted R-squared:  0.05399

F-statistic: 118.6 on 4 and 8242 DF,  p-value: < 2.2e-16

The results of this test show that all coefficients are significant—“test,” “gender,” and “ses” at p < 0.001 and “race” at p < 0.05. The associations “test” and “ses” have with “income2011” are easy to interpret because they’re numeric and continuous—a one-unit increase in test score is associated with a $332.02 increase in 2011 salary (with all else constant), and a one-unit increase in SES is associated with a $2,612.20 increase in salary. Note that the coefficient for SES was $4,903 when we executed simple linear regression. It changes here because the model now accounts for the other variables.

The coefficients for “race” and “gender” are a little trickier because these variables are categorical. For “race,” we mostly care about the sign (positive or negative) and the reference value (“White,” because we assigned it a value of 0). This means any increase in this variable (i.e. moving from White to non-White) is associated with a smaller 2011 income.

Because “male” is the reference value for the variable “gender” (“male” is coded 0, “female” is coded 1), moving from “male” to “female” is associated with a decrease of $6,626.44 in 2011 income.
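Once the model is fit, you can also use it to generate predictions for new data. Here is a minimal sketch, assuming lm.income2 from Code Listing 19 is still in memory; the values for the new student are hypothetical:

> new.student <- data.frame(test = 55, race = 0, gender = 1, ses = 0.5) #hypothetical student
> predict(lm.income2, newdata = new.student) #predicted 2011 income for this student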

Feel free to test other multiple regressions using different independent variables, such as the relationship between standardized test score (“test”) and whether or not students played sports, watched television, got good grades, etc. Perhaps you’ll find some quantitative evidence for which behaviors are associated with higher test scores, which may come in handy when convincing a recalcitrant student to study!

