CHAPTER 7
In Chapter 6 we explored methods for visualizing the association between two variables. In this chapter, we will provide the complement to those analyses by discussing statistical procedures for describing bivariate associations. This chapter includes some of the most familiar inferential statistics: correlation, regression, t-tests for two samples and paired observations, the one-factor analysis of variance, tests to compare multiple proportions, and the chi-squared test for independence.
Correlations are a mainstay of statistical analysis. R provides several approaches to correlations and the external packages provide an almost unlimited choice of procedures. In this section, we’ll start with cor(), the default correlation function in R. We’ll use the swiss data from R’s datasets package. This data contains fertility and socio-economic variables for 47 French-speaking provinces of Switzerland in 1888. (See ?swiss for more information.)
Sample: sample_7_1.R
# LOAD DATA require(“datasets”) # Load the datasets package. data(swiss) |
Once the data are loaded into R, we can simply call the cor() function on the entire data frame. Also, to make the printout less busy, we can wrap the cor() function with the round() function to round the output to two decimal places.
# CORRELATION MATRIX cor(swiss) round(cor(swiss), 2) # Rounded to 2 decimals |
The first few rows and columns of that output are:
Fertility Agriculture Examination Education Catholic Fertility 1.00 0.35 -0.65 -0.66 0.46 Agriculture 0.35 1.00 -0.69 -0.64 0.40 |
Missing from this printout are probability values. R does not have a built-in function to get p-values—or even asterisks indicating statistical significance—for a correlation matrix. Instead, we can test each correlation separately with cor.test(). This function will give the correlation value, a hypothesis test, and a confidence interval for the correlation.
# INFERENTIAL TEST FOR SINGLE CORRELATION cor.test(swiss$Fertility, swiss$Education) Pearson's product-moment correlation data: swiss$Fertility and swiss$Education t = -5.9536, df = 45, p-value = 3.659e-07 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: -0.7987075 -0.4653206 sample estimates: cor -0.6637889 |
This is useful information but it is cumbersome to test correlations one at a time. The external package Hmisc provides an alternative, making it possible to get two matrices at once: one matrix with correlation values, and a second matrix with two-tailed p-values. The first step is to install and load the Hmisc package.
# INSTALL AND LOAD "HMISC" PACKAGE install.packages("Hmisc") require("Hmisc") |
When you load Hmisc, you will receive message that R is also loading several other packages—grid, lattice, survival, splines, and Formula—that Hmisc relies on. You may also receive a message that a few functions, such as format.pval, round.POSIXt, trunc.POSIXt, and units, were masked from package:base. This happens because Hmisc is temporarily overriding those functions, which is supposed to happen, so you can ignore those messages.
Before we can run the rcorr() function from Hmisc, however, we need to make a change to our data. The swiss data are a data frame but Hmisc operates on matrices. All that we need to do is wrap swiss with the as.matrix() function, which can then be nested within rcorr().
# COERCE DATA FRAME TO MATRIX # GET R & P MATRICES rcorr(as.matrix(swiss)) |
Here are the first few rows and columns of each resulting matrix. The first matrix consists of correlation coefficients and is identical to our rounded matrix from R’s cor() function.
Fertility Agriculture Examination Education Catholic Fertility 1.00 0.35 -0.65 -0.66 0.46 Agriculture 0.35 1.00 -0.69 -0.64 0.40 |
The second matrix consists of two-tailed probability values, with blanks on the diagonal:
P Fertility Agriculture Examination Education Catholic Fertility 0.0149 0.0000 0.0000 0.0010 Agriculture 0.0149 0.0000 0.0000 0.0052 |
Using the standard criterion of p < .05, all of these correlations are statistically significant.
For more information on Hmisc, enter help(package = "Hmisc"). And, as before, when we are finished, we should unload the external packages and clear out any variables or objects we created.
# CLEAN UP detach("package:datasets", unload=TRUE) # Unload the package. detach("package:Hmisc", unload=TRUE) # Unload the package. rm(list = ls()) # Remove the objects. |
Bivariate regression—that is, linear regression between two variables—is a very common, flexible, and powerful procedure. R has excellent built-in functions for bivariate regression that can provide an enormous amount of information.
For this example, we will use the trees data from R’s datasets package. (See ?trees for more information.)
Sample: sample_7_2.R
# LOAD DATA require(“datasets”) # Load the datasets package. data(trees) # Load data into the workspace. trees[1:5, ] # Show the first 5 lines. Girth Height Volume 1 8.3 70 10.3 2 8.6 65 10.3 3 8.8 63 10.2 4 10.5 72 16.4 5 10.7 81 18.8 |
We will use bivariate regression to see how a tree’s girth can predict its height. But first, it is always a good idea to check the variables graphically to see how well they meet the assumptions of our intended procedure.
# GRAPHICAL CHECK hist(trees$Height) hist(trees$Girth) plot(trees$Girth, trees$Height) |
To save space, I won’t print those graphs here but I will mention that both variables are approximately normally distributed, with some positive skew on girth, and the scatter plot appears to be approximately linear. The variables appear well suited for regression analysis.
The command for linear regression is lm(). It requires only two arguments: the outcome variable and the predictor variable, with the tilde operator, ~, between them. For this example, it appears as lm(Height ~ Girth, data = trees). The third argument here, data = trees, is optional. It is there to specify the data set from which the variables are drawn. As a note, it is also possible to give the full address for each variable, like trees$Height and trees$Girth. The separate data statement is just a convenience.
In this example, I am saving the regression model into reg1. By saving the model into an object, I am then able to call several functions on the model without having to specify it again. In the code that follows, I first run lm() and then get information on the model with summary().
# BASIC REGRESSION MODEL reg1 <- lm(Height ~ Girth, data = trees) # Save the model. summary(reg1) # Get regression output. Call: lm(formula = Height ~ Girth, data = trees) Residuals: Min 1Q Median 3Q Max -12.5816 -2.7686 0.3163 2.4728 9.9456 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 62.0313 4.3833 14.152 1.49e-14 *** Girth 1.0544 0.3222 3.272 0.00276 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 5.538 on 29 degrees of freedom Multiple R-squared: 0.2697, Adjusted R-squared: 0.2445 F-statistic: 10.71 on 1 and 29 DF, p-value: 0.002758 |
The summary() function gives nearly everything that is needed for most analyses: the regression coefficients, their statistical significance, the R2 and adjusted R2, the F-value and p-value for the entire model, and information about residuals. (See ?lm and ?summary for more information.)
It is also possible to get confidence intervals for the regression coefficients with the confint() function:
# CONFIDENCE INTERVALS confint(reg1) 2.5 % 97.5 % (Intercept) 53.0664541 70.996174 Girth 0.3953483 1.713389 |
Other functions that can be useful when evaluating a regression model are:
We can finish by unloading packages and cleaning the workspace:
# CLEAN UP detach("package:datasets", unload = TRUE) # Unloads data sets package. rm(list = ls()) # Remove all objects from workspace. |
If you wish to compare the means of two different groups, then the most common choice is the two-sample t-test. R provides several options for this test and several ways to organize the data for analysis. For this example we will use the sleep data from R’s datasets package. This data set, which was gathered by the creator of the t-test, William Gosset, who published under the pseudonym Student, shows the effect of two soporific drugs on 10 patients. The data set has three columns: extra, which is the increase in hours of sleep; group, which indicates the drug given; and ID, which is the patient ID. To make the analysis simpler, we will open the data, remove the ID variable, and save it as a new data frame called sd for “sleep data.”
Sample: sample_7_3.R
# LOAD DATA & SELECT VARIABLES require(“datasets”) # Load the datasets package. sleep[1:5, ] # Show the first 5 cases. sd <- sleep[, 1:2] # Save just the first two variables. sd[1:5, ] # Show the first 5 cases. |
Once the data are loaded into the new data frame, we can create some basic graphs, a histogram and a boxplot, to check on the shape of the distribution and how well it meets the statistical assumptions of the t-test.
# GRAPHICAL CHECKS hist(sd$extra) # Histogram of extra sleep. boxplot(extra ~ group, data = sd) # Boxplots by group. |
To save space I will not reprint the graphics here, but I will say that the histogram is approximately normal and the boxplots, which are separated by group, do not show any outliers. With this preliminary check done, we can move ahead to the default t-test, using R’s t.test() function.
# TWO-SAMPLE T-TEST WITH DEFAULTS t.test(extra ~ group, data = sd) Welch Two Sample t-test data: extra by group t = -1.8608, df = 17.776, p-value = 0.07939 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -3.3654832 0.2054832 sample estimates: mean in group 1 mean in group 2 0.75 2.33 |
These results show that while there is a difference between the two group’s means—2.33 vs. 0.75—the difference does not meet the conventional levels of statistical significance, as the observed p-value of .08 is greater than the standard cut-off of .05. As a note, R uses the Welch two-sample t-test by default, which is often used for samples with unequal variances. One consequence of this choice is that the degree of freedom is a fractional value, 17.776 in this case. For the more common, equal-variance t-test, just add the argument var.equal = TRUE to the function call (see ?t.test for more information).
R’s t.test() function also provides a number of options, such as one-tailed tests and different confidence levels, as show in the following code.
# TWO-SAMPLE T-TEST WITH OPTIONS t.test(extra ~ group, # Same: Specifies variables. data = sd, # Same: Specifies data set. alternative = "less", # One-tailed test. conf.level = 0.80) # 80% CI (vs. 95%) Welch Two Sample t-test data: extra by group t = -1.8608, df = 17.776, p-value = 0.0397 alternative hypothesis: true difference in means is less than 0 80 percent confidence interval: -Inf -0.8478191 sample estimates: mean in group 1 mean in group 2 0.75 2.33 |
As expected, when a one-tailed or directional hypothesis test is applied to the same data, the results are statistically significant with a p-value of .04 (exactly half of the p-value for the two-tailed test).
In the previous examples with the sleep data, the outcome variable was in one column and the grouping variable was in a second column. It is, however, also possible to have the data for each group in separate columns. In the following example, we will create simulated data using R’s rnorm() function, which draws data from a normal distribution. One advantage of using simulated data is that it is possible to specify the differences between the groups exactly. It is also important to note that the data in this example will be slightly different every time the code is run, so your results should not match these exactly.
In this case, where the data for the two groups are in difference variables, the function call is simpler: just give the names of the two variables as the arguments to t.test(), as shown in the following code.
# SIMULATED DATA IN TWO VARIABLES x <- rnorm(30, mean = 20, sd = 5) y <- rnorm(30, mean = 24, sd = 6) t.test(x, y) Welch’s Two Sample t-test data: x and y t = -2.6208, df = 56.601, p-value = 0.01124 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -6.8721567 -0.9186296 sample estimates: mean of x mean of y 20.18359 24.07898 |
The results in this case indicate a statistically significant difference with a p-value of .01. Finally, we should tidy up by removing the packages and objects used in this section.
# CLEAN UP detach("package:datasets", unload = TRUE) # Unloads data sets package. rm(list = ls()) # Remove all objects from workspace. |
The t-test is a flexible test with several variations. In the previous section we used the t-test to compare the means of two groups. In this section, we will use the t-test to examine how scores change over time for a single group of subjects. This is known as the paired t-test because each participant’s observation at time 2 is paired with their own observation at time 1. The paired t-test is also known as the matched-subjects t-test or the within-subjects t-test.
In this example, we will use simulated data. In the code that follows, a variable called t1, for time 1, is created using the rnorm() function to generate normally distributed data with a specified mean and standard deviation (see ?rnorm for more information). To simulate changes over time, another random variable, called dif for difference, is generated. These two variables are then summed to yield t2, the scores at time 2 for each subject.
Sample: sample_7_4.R
# CREATE SIMULATED DATA t1 <- rnorm(50, mean = 52, sd = 6) # Time 1 dif <- rnorm(50, mean = 6, sd = 12) # Difference t2 <- t1 + dif # Time 2 |
Once the data are generated, the t.test() function can be called. The arguments in this case are the variables with the data at the two times, as well as paired = TRUE, which indicates that the paired t-test should be used.
# PAIRED T-TEST WITH DEFAULTS t.test(t2, t1, paired = TRUE) # Must specify "paired" Paired t-test data: t2 and t1 t = 4.457, df = 49, p-value = 4.836e-05 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 4.078233 10.775529 sample estimates: mean of the differences 7.426881 |
In this example, the change in scores over time was statistically significant. Note that your own results will be different because the data are generated at random. This effect can be seen in the following code, were the t-test is run again with several options: a non-zero null value is specified, a one-tailed test is called, and a different confidence level is used. The mean difference is smaller in this example (5.81 vs. 7.43) because the data were generated again before running this code.
# PAIRED T-TEST WITH OPTIONS t.test(t2, t1, paired = TRUE, mu = 6, # Specify a non-0 null value. alternative = "greater", # One-tailed test conf.level = 0.99) # 99% CI (vs. 95%) Paired t-test data: t2 and t1 t = -0.1027, df = 49, p-value = 0.5407 alternative hypothesis: true difference in means is greater than 6 99 percent confidence interval: 1.529402 Inf sample estimates: mean of the differences 5.816891 |
In this case the difference was not statistically significant, but that is due primarily to different null value used—6 vs. 0—and not to the variations in the datat.
We can finish by clearing the workspace.
# CLEAN UP rm(list = ls()) # Remove all objects from the workspace. |
The one-factor analysis of variance, or ANOVA, is used to compare the means of several different groups on a single, quantitative outcome variable. In this example we will again use simulated data from R’s rnorm() function. As before, this means that your data will be slightly different, but the overall pattern should be the same. After the four variables are created with one variable for each group, they are combined into a single data frame using the data.frame() and cbind() functions. After that, the data need to be reorganized. The four separate variables are then stacked into a single outcome variable called values with the stack() function, which also creates a column called ind to indicate their source variable; this variable functions as the grouping variable.
Sample: sample_7_5.R
# CREATE SIMULATION DATA # Step 1: Each group in a separate variable. x1 <- rnorm(30, mean = 40, sd = 8) # Group 1, mean = 40 x2 <- rnorm(30, mean = 41, sd = 8) # Group 1, mean = 41 x3 <- rnorm(30, mean = 44, sd = 8) # Group 1, mean = 44 x4 <- rnorm(30, mean = 45, sd = 8) # Group 1, mean = 45 # Step 2: Combine vectors into a single data frame. xdf <- data.frame(cbind(x1, x2, x3, x4)) # xdf = "x data frame" # Step 3: Stack data to get the outcome column and group column. xs <- stack(xdf) # xs = "x stack" |
At this point, the data are ready to be analyzed with R’s aov() function. The default specification is simple, with just two required arguments: the outcome variable, which is values in this case, and the grouping variable, which is ind in this case. The two variables are joined by the tilde operator, ~. (The argument data = xs simply indicates the data frame that holds the variables.)
# ONE-FACTOR ANOVA anova1 <- aov(values ~ ind, data = xs) # Basic model. summary(anova1) # Model summary. Df Sum Sq Mean Sq F value Pr(>F) ind 3 579 192.97 3.188 0.0264 * Residuals 116 7022 60.53 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 |
These results indicate that the omnibus analysis of variance is statistically significant, with a p-value of .03. However, this test does not indicate where the group differences lie. To find them, we need a post-hoc comparison. The default post-hoc procedure in R is Tukey’s Honestly Significant Difference (HSD) test or TukeyHSD().[19] Because we saved the ANOVA model into an object named anova1, we can just use that object as the argument in the code that follows:
# POST-HOC COMPARISONS TukeyHSD(anova1) Tukey multiple comparisons of means 95% family-wise confidence level Fit: aov(formula = values ~ ind, data = xs) $ind diff lwr upr p adj x2-x1 1.132450 -4.1039279 6.368829 0.9425906 x3-x1 5.502549 0.2661705 10.738927 0.0354399 x4-x1 4.004412 -1.2319658 9.240791 0.1964255 x3-x2 4.370098 -0.8662799 9.606477 0.1362299 x4-x2 2.871962 -2.3644162 8.108340 0.4835521 x4-x3 -1.498136 -6.7345146 3.738242 0.8782931 |
The TukeyHSD() function reports observed differences between paired groups, as well as the lower and upper bounds for confidence intervals and adjusted p-values. What’s interesting in this particular situation is that the only statistically significant difference is between groups 1 and 3. This is surprising because the commands that generated the random data actually called for a larger difference between groups 1 and 4. However, given the vicissitudes of random data generation, such quirks are to be expected. This is also an indication that sample size of 30 scores per group is probably not sufficiently large enough to give stable results.[20] In future research, it would be beneficial to at least double the sample size.
We can finish by cleaning the workspace of the variables and objects we generated in this exercise.
# CLEAN UP rm(list = ls()) # Remove all objects from workspace |
In the last five procedures that we examined—correlations, regression, two-sample t-tests, paired t-tests, and ANOVA—the outcome variable was quantitative (i.e., an interval or ratio level of measurement). This made it possible to calculate means and standard deviations, which were then used in the inferential tests. However, there are situations where the outcome variable is categorical (i.e., a nominal or possibly ordinal level of measurement.) The last two sections in this chapter address those situations. In the case of a dichotomous outcome—that is, an outcome with only two possible values, such as yes or no, on or off, and click or don’t click, then a proportion test can be an ideal way to compare the performance of two or more groups.
In this example, we will again use simulated data. R’s prop.test() function needs two pieces of information for each group: the number of trials or total observations, usually referred to as n, and the number of trials with positive outcome, usually referred to as X. In the code that follows, I create a vector called n5 of five 100s for the number of trials by using R’s rep() function (see ?rep for more information). A second vector called x5 is then created with the number of positive outcomes for each group.
Sample: sample_7_6.R
# CREATE SIMULATION DATA # Two vectors are needed: # One specifies the total number of people in each group. # This creates a vector with 5 100s in it, for 5 groups. n5 <- c(rep(100, 5)) # Another specifies the number of cases with positive outcomes. x5 <- c(65, 60, 60, 50, 45) |
The next step is to call R’s prop.test() function with two arguments: the variable with the X data and the variable with the n data.
# PROPORTION TEST prop.test(x5, n5) 5-sample test for equality of proportions without continuity correction data: x5 out of n5 X-squared = 10.9578, df = 4, p-value = 0.02704 alternative hypothesis: two.sided sample estimates: prop 1 prop 2 prop 3 prop 4 prop 5 0.65 0.60 0.60 0.50 0.45 |
The output includes a chi-squared test—shown here as X-squared—and the p-value, which in this case is below the standard cutoff of .05. It also gives the observed sample proportions.
In the case where there are only two groups being compared, R will also give a confidence interval for the difference between the groups’ proportions. The default value is .95 but that can be changed with the conf.level argument.
# CREATE SIMULATION DATA FOR 2 GROUPS n2 <- c(40, 40) # 40 trials x2 <- c(30, 20) # Number of positive outcomes # PROPORTION TEST FOR 2 GROUPS prop.test(x2, n2, conf.level = .80) # With CI 2-sample test for equality of proportions with continuity correction data: x2 out of n2 X-squared = 4.32, df = 1, p-value = 0.03767 alternative hypothesis: two.sided 80 percent confidence interval: 0.09097213 0.40902787 sample estimates: prop 1 prop 2 0.75 0.50 |
In this case, despite the relatively small samples, the difference between the sample proportions is statistically significant, as shown by both the p-value and the confidence interval.
We can finish by clearing the workspace.
# CLEAN UP rm(list = ls()) # Remove all objects from the workspace. |
The final bivariate procedure that we will discuss is the chi-squared inferential test for cross-tabulated data. In this example we will use the Titanic data from R’s datasets package. This data contains the survival data from the wreck of the Titanic, broken down by sex, age (child or adult), and class of passenger (1st, 2nd, 3rd, or crew). See ?Titanic for more information. The data are in a tabular format that, when displayed, consists of four 4 x 2 tables, although it is also possible to display it as a “flat” contingency table with the ftable() function.
Sample: sample_7_7.R
# LOAD DATA require(“datasets”) # Load the datasets package. Titanic # Show complete data in tables. ftable(Titanic) # Display a "flat" contingency table. Survived No Yes Class Sex Age 1st Male Child 0 5 Adult 118 57 Female Child 0 1 Adult 4 140 2nd Male Child 0 11 Adult 154 14 Female Child 0 13 Adult 13 80 3rd Male Child 35 13 Adult 387 75 Female Child 17 14 Adult 89 76 Crew Male Child 0 0 Adult 670 192 Female Child 0 0 Adult 3 20 |
While the tabular format is a convenient way of storing and displaying the data, we need to restructure the data for our analysis so there is one row per observation. We can do this using a string of nested commands that we saw in sample_1_1.R:
# RESTRUCTURE DATA tdf <- as.data.frame(lapply(as.data.frame.table(Titanic), function(x)rep(x, as.data.frame.table(Titanic)$Freq)))[, -5] tdf[1:5, ] # Check the first five rows of data. Class Sex Age Survived 1 3rd Male Child No 2 3rd Male Child No 3 3rd Male Child No 4 3rd Male Child No 5 3rd Male Child No |
For this example, we will focus on just two of the four variables: Class and Survived. To do this, we will create a reduced data set by using the table() function and save it into a new object, ttab for Titanic Table.
# CREATE REDUCED TABLE ttab <- table(tdf$Class, tdf$Survived) # Select two variables. ttab # Show the new table. No Yes 1st 122 203 2nd 167 118 3rd 528 178 Crew 673 212 |
The raw frequencies are important, but it easier to understand the pattern by looking at the row percentages. We can get these by using the prop.table() function and requesting the first column. In the code the follows, I also wrap these commands in the round() function to reduce the output to two decimal places, then multiply by 100 to get percentages.
# PERCENTAGES rowp <- round(prop.table(ttab, 1), 2) * 100 # row % colp <- round(prop.table(ttab, 2), 2) * 100 # column % totp <- round(prop.table(ttab), 2) * 100 # total % rowp No Yes 1st 38 62 2nd 59 41 3rd 75 25 Crew 76 24 |
From these results it is clear that first-class passengers had a much higher survival rate at 62%, while third-class passengers and crew had a much lower rate at 25% and 24%, respectively. The chi-squared test for independence, or R’s chisq.test() function, can then test these results against a null hypothesis:
# CHI-SQUARED TEST FOR INDEPENDENCE tchi <- chisq.test(ttab) # Compare two variables in ttab tchi Pearson's Chi-squared test data: ttab X-squared = 190.4011, df = 3, p-value < 2.2e-16 |
Not surprisingly, these group differences are highly significant, with a p-value that is nearly zero. In addition to this basic output, R is able to give several other detailed tables from the saved chi-squared object:
Finally, we can unload the packages and clear the workspace before moving on.
# CLEAN UP detach("package:datasets", unload = TRUE) # Unloads the datasets package. rm(list = ls()) # Remove all objects from the workspace. |