R Succinctly^®
by Barton Poulson

CHAPTER 7

Statistics for Associations

In Chapter 6 we explored methods for visualizing the association between two variables. In this chapter, we will provide the complement to those analyses by discussing statistical procedures for describing bivariate associations. This chapter includes some of the most familiar inferential statistics: correlation, regression, t-tests for two samples and paired observations, the one-factor analysis of variance, tests to compare multiple proportions, and the chi-squared test for independence.

Correlations

Correlations are a mainstay of statistical analysis. R provides several approaches to correlations and the external packages provide an almost unlimited choice of procedures. In this section, we’ll start with cor(), the default correlation function in R. We’ll use the swiss data from R’s datasets package. This data contains fertility and socio-economic variables for 47 French-speaking provinces of Switzerland in 1888. (See ?swiss for more information.)

Sample: sample_7_1.R

# LOAD DATA

require(“datasets”) # Load the datasets package.

data(swiss)

Once the data are loaded into R, we can simply call the cor() function on the entire data frame. Also, to make the printout less busy, we can wrap the cor() function with the round() function to round the output to two decimal places.

# CORRELATION MATRIX

cor(swiss)

round(cor(swiss), 2) # Rounded to 2 decimals

The first few rows and columns of that output are:

Fertility Agriculture Examination Education Catholic

Fertility 1.00 0.35 -0.65 -0.66 0.46

Agriculture 0.35 1.00 -0.69 -0.64 0.40

Missing from this printout are probability values. R does not have a built-in function to get p-values—or even asterisks indicating statistical significance—for a correlation matrix. Instead, we can test each correlation separately with cor.test(). This function will give the correlation value, a hypothesis test, and a confidence interval for the correlation.

# INFERENTIAL TEST FOR SINGLE CORRELATION

cor.test(swiss$Fertility, swiss$Education)

Pearson's product-moment correlation

data: swiss$Fertility and swiss$Education

t = -5.9536, df = 45, p-value = 3.659e-07

alternative hypothesis: true correlation is not equal to 0

95 percent confidence interval:

-0.7987075 -0.4653206

sample estimates:

cor

-0.6637889

This is useful information but it is cumbersome to test correlations one at a time. The external package Hmisc provides an alternative, making it possible to get two matrices at once: one matrix with correlation values, and a second matrix with two-tailed p-values. The first step is to install and load the Hmisc package.

# INSTALL AND LOAD "HMISC" PACKAGE

install.packages("Hmisc")

require("Hmisc")

When you load Hmisc, you will receive message that R is also loading several other packages—grid, lattice, survival, splines, and Formula—that Hmisc relies on. You may also receive a message that a few functions, such as format.pval, round.POSIXt, trunc.POSIXt, and units, were masked from package:base. This happens because Hmisc is temporarily overriding those functions, which is supposed to happen, so you can ignore those messages.

Before we can run the rcorr() function from Hmisc, however, we need to make a change to our data. The swiss data are a data frame but Hmisc operates on matrices. All that we need to do is wrap swiss with the as.matrix() function, which can then be nested within rcorr().

# COERCE DATA FRAME TO MATRIX

# GET R & P MATRICES

rcorr(as.matrix(swiss))

Here are the first few rows and columns of each resulting matrix. The first matrix consists of correlation coefficients and is identical to our rounded matrix from R’s cor() function.

Fertility Agriculture Examination Education Catholic

Fertility 1.00 0.35 -0.65 -0.66 0.46

Agriculture 0.35 1.00 -0.69 -0.64 0.40

The second matrix consists of two-tailed probability values, with blanks on the diagonal:

Fertility Agriculture Examination Education Catholic

Fertility 0.0149 0.0000 0.0000 0.0010

Agriculture 0.0149 0.0000 0.0000 0.0052

Using the standard criterion of p < .05, all of these correlations are statistically significant.

For more information on Hmisc, enter help(package = "Hmisc"). And, as before, when we are finished, we should unload the external packages and clear out any variables or objects we created.

# CLEAN UP

detach("package:datasets", unload=TRUE) # Unload the package.

detach("package:Hmisc", unload=TRUE) # Unload the package.

rm(list = ls()) # Remove the objects.

Bivariate regression

Bivariate regression—that is, linear regression between two variables—is a very common, flexible, and powerful procedure. R has excellent built-in functions for bivariate regression that can provide an enormous amount of information.

For this example, we will use the trees data from R’s datasets package. (See ?trees for more information.)

Sample: sample_7_2.R

# LOAD DATA

require(“datasets”) # Load the datasets package.

data(trees) # Load data into the workspace.

trees[1:5, ] # Show the first 5 lines.

Girth Height Volume

1 8.3 70 10.3

2 8.6 65 10.3

3 8.8 63 10.2

4 10.5 72 16.4

5 10.7 81 18.8

We will use bivariate regression to see how a tree’s girth can predict its height. But first, it is always a good idea to check the variables graphically to see how well they meet the assumptions of our intended procedure.

# GRAPHICAL CHECK

hist(trees$Height)

hist(trees$Girth)

plot(trees$Girth, trees$Height)

To save space, I won’t print those graphs here but I will mention that both variables are approximately normally distributed, with some positive skew on girth, and the scatter plot appears to be approximately linear. The variables appear well suited for regression analysis.

The command for linear regression is lm(). It requires only two arguments: the outcome variable and the predictor variable, with the tilde operator, ~, between them. For this example, it appears as lm(Height ~ Girth, data = trees). The third argument here, data = trees, is optional. It is there to specify the data set from which the variables are drawn. As a note, it is also possible to give the full address for each variable, like trees$Height and trees$Girth. The separate data statement is just a convenience.

In this example, I am saving the regression model into reg1. By saving the model into an object, I am then able to call several functions on the model without having to specify it again. In the code that follows, I first run lm() and then get information on the model with summary().

# BASIC REGRESSION MODEL

reg1 <- lm(Height ~ Girth, data = trees) # Save the model.

summary(reg1) # Get regression output.

Call:

lm(formula = Height ~ Girth, data = trees)

Residuals:

Min 1Q Median 3Q Max

-12.5816 -2.7686 0.3163 2.4728 9.9456

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 62.0313 4.3833 14.152 1.49e-14 ***

Girth 1.0544 0.3222 3.272 0.00276 **

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.538 on 29 degrees of freedom

Multiple R-squared: 0.2697, Adjusted R-squared: 0.2445

F-statistic: 10.71 on 1 and 29 DF, p-value: 0.002758

The summary() function gives nearly everything that is needed for most analyses: the regression coefficients, their statistical significance, the R² and adjusted R², the F-value and p-value for the entire model, and information about residuals. (See ?lm and ?summary for more information.)

It is also possible to get confidence intervals for the regression coefficients with the confint() function:

# CONFIDENCE INTERVALS

confint(reg1)

2.5 % 97.5 %

(Intercept) 53.0664541 70.996174

Girth 0.3953483 1.713389

Other functions that can be useful when evaluating a regression model are:

predict(), which gives the predicted values on the outcome variable for different value on the predictor (see ?predict).
lm.influence(), which gives basic quantities used in forming a wide variety of diagnostics for checking the quality of regression fits (see ?lm.influence).
influence.measures(), which gives information on residuals, dispersion, hat values, and others (see ?influence.measures).

We can finish by unloading packages and cleaning the workspace:

# CLEAN UP

detach("package:datasets", unload = TRUE) # Unloads data sets package.

rm(list = ls()) # Remove all objects from workspace.

Two-sample t-test

If you wish to compare the means of two different groups, then the most common choice is the two-sample t-test. R provides several options for this test and several ways to organize the data for analysis. For this example we will use the sleep data from R’s datasets package. This data set, which was gathered by the creator of the t-test, William Gosset, who published under the pseudonym Student, shows the effect of two soporific drugs on 10 patients. The data set has three columns: extra, which is the increase in hours of sleep; group, which indicates the drug given; and ID, which is the patient ID. To make the analysis simpler, we will open the data, remove the ID variable, and save it as a new data frame called sd for “sleep data.”

Sample: sample_7_3.R

# LOAD DATA & SELECT VARIABLES

require(“datasets”) # Load the datasets package.

sleep[1:5, ] # Show the first 5 cases.

sd <- sleep[, 1:2] # Save just the first two variables.

sd[1:5, ] # Show the first 5 cases.

Once the data are loaded into the new data frame, we can create some basic graphs, a histogram and a boxplot, to check on the shape of the distribution and how well it meets the statistical assumptions of the t-test.

# GRAPHICAL CHECKS

hist(sd$extra) # Histogram of extra sleep.

boxplot(extra ~ group, data = sd) # Boxplots by group.

To save space I will not reprint the graphics here, but I will say that the histogram is approximately normal and the boxplots, which are separated by group, do not show any outliers. With this preliminary check done, we can move ahead to the default t-test, using R’s t.test() function.

# TWO-SAMPLE T-TEST WITH DEFAULTS

t.test(extra ~ group, data = sd)

Welch Two Sample t-test

data: extra by group

t = -1.8608, df = 17.776, p-value = 0.07939

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-3.3654832 0.2054832

sample estimates:

mean in group 1 mean in group 2

0.75 2.33

These results show that while there is a difference between the two group’s means—2.33 vs. 0.75—the difference does not meet the conventional levels of statistical significance, as the observed p-value of .08 is greater than the standard cut-off of .05. As a note, R uses the Welch two-sample t-test by default, which is often used for samples with unequal variances. One consequence of this choice is that the degree of freedom is a fractional value, 17.776 in this case. For the more common, equal-variance t-test, just add the argument var.equal = TRUE to the function call (see ?t.test for more information).

R’s t.test() function also provides a number of options, such as one-tailed tests and different confidence levels, as show in the following code.

# TWO-SAMPLE T-TEST WITH OPTIONS

t.test(extra ~ group, # Same: Specifies variables.

data = sd, # Same: Specifies data set.

alternative = "less", # One-tailed test.

conf.level = 0.80) # 80% CI (vs. 95%)

Welch Two Sample t-test

data: extra by group

t = -1.8608, df = 17.776, p-value = 0.0397

alternative hypothesis: true difference in means is less than 0

80 percent confidence interval:

-Inf -0.8478191

sample estimates:

mean in group 1 mean in group 2

0.75 2.33

As expected, when a one-tailed or directional hypothesis test is applied to the same data, the results are statistically significant with a p-value of .04 (exactly half of the p-value for the two-tailed test).

In the previous examples with the sleep data, the outcome variable was in one column and the grouping variable was in a second column. It is, however, also possible to have the data for each group in separate columns. In the following example, we will create simulated data using R’s rnorm() function, which draws data from a normal distribution. One advantage of using simulated data is that it is possible to specify the differences between the groups exactly. It is also important to note that the data in this example will be slightly different every time the code is run, so your results should not match these exactly.

In this case, where the data for the two groups are in difference variables, the function call is simpler: just give the names of the two variables as the arguments to t.test(), as shown in the following code.

# SIMULATED DATA IN TWO VARIABLES

x <- rnorm(30, mean = 20, sd = 5)

y <- rnorm(30, mean = 24, sd = 6)

t.test(x, y)

Welch’s Two Sample t-test

data: x and y

t = -2.6208, df = 56.601, p-value = 0.01124

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-6.8721567 -0.9186296

sample estimates:

mean of x mean of y

20.18359 24.07898

The results in this case indicate a statistically significant difference with a p-value of .01. Finally, we should tidy up by removing the packages and objects used in this section.

# CLEAN UP

detach("package:datasets", unload = TRUE) # Unloads data sets package.

rm(list = ls()) # Remove all objects from workspace.

Paired t-test

The t-test is a flexible test with several variations. In the previous section we used the t-test to compare the means of two groups. In this section, we will use the t-test to examine how scores change over time for a single group of subjects. This is known as the paired t-test because each participant’s observation at time 2 is paired with their own observation at time 1. The paired t-test is also known as the matched-subjects t-test or the within-subjects t-test.

In this example, we will use simulated data. In the code that follows, a variable called t1, for time 1, is created using the rnorm() function to generate normally distributed data with a specified mean and standard deviation (see ?rnorm for more information). To simulate changes over time, another random variable, called dif for difference, is generated. These two variables are then summed to yield t2, the scores at time 2 for each subject.

Sample: sample_7_4.R

# CREATE SIMULATED DATA

t1 <- rnorm(50, mean = 52, sd = 6) # Time 1

dif <- rnorm(50, mean = 6, sd = 12) # Difference

t2 <- t1 + dif # Time 2

Once the data are generated, the t.test() function can be called. The arguments in this case are the variables with the data at the two times, as well as paired = TRUE, which indicates that the paired t-test should be used.

# PAIRED T-TEST WITH DEFAULTS

t.test(t2, t1, paired = TRUE) # Must specify "paired"

Paired t-test

data: t2 and t1

t = 4.457, df = 49, p-value = 4.836e-05

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

4.078233 10.775529

sample estimates:

mean of the differences

7.426881

In this example, the change in scores over time was statistically significant. Note that your own results will be different because the data are generated at random. This effect can be seen in the following code, were the t-test is run again with several options: a non-zero null value is specified, a one-tailed test is called, and a different confidence level is used. The mean difference is smaller in this example (5.81 vs. 7.43) because the data were generated again before running this code.

# PAIRED T-TEST WITH OPTIONS

t.test(t2, t1,

paired = TRUE,

mu = 6, # Specify a non-0 null value.

alternative = "greater", # One-tailed test

conf.level = 0.99) # 99% CI (vs. 95%)

Paired t-test

data: t2 and t1

t = -0.1027, df = 49, p-value = 0.5407

alternative hypothesis: true difference in means is greater than 6

99 percent confidence interval:

1.529402 Inf

sample estimates:

mean of the differences

5.816891

In this case the difference was not statistically significant, but that is due primarily to different null value used—6 vs. 0—and not to the variations in the datat.

We can finish by clearing the workspace.

# CLEAN UP

rm(list = ls()) # Remove all objects from the workspace.

One-factor ANOVA

The one-factor analysis of variance, or ANOVA, is used to compare the means of several different groups on a single, quantitative outcome variable. In this example we will again use simulated data from R’s rnorm() function. As before, this means that your data will be slightly different, but the overall pattern should be the same. After the four variables are created with one variable for each group, they are combined into a single data frame using the data.frame() and cbind() functions. After that, the data need to be reorganized. The four separate variables are then stacked into a single outcome variable called values with the stack() function, which also creates a column called ind to indicate their source variable; this variable functions as the grouping variable.

Sample: sample_7_5.R

# CREATE SIMULATION DATA

# Step 1: Each group in a separate variable.

x1 <- rnorm(30, mean = 40, sd = 8) # Group 1, mean = 40

x2 <- rnorm(30, mean = 41, sd = 8) # Group 1, mean = 41

x3 <- rnorm(30, mean = 44, sd = 8) # Group 1, mean = 44

x4 <- rnorm(30, mean = 45, sd = 8) # Group 1, mean = 45

# Step 2: Combine vectors into a single data frame.

xdf <- data.frame(cbind(x1, x2, x3, x4)) # xdf = "x data frame"

# Step 3: Stack data to get the outcome column and group column.

xs <- stack(xdf) # xs = "x stack"

At this point, the data are ready to be analyzed with R’s aov() function. The default specification is simple, with just two required arguments: the outcome variable, which is values in this case, and the grouping variable, which is ind in this case. The two variables are joined by the tilde operator, ~. (The argument data = xs simply indicates the data frame that holds the variables.)

# ONE-FACTOR ANOVA

anova1 <- aov(values ~ ind, data = xs) # Basic model.

summary(anova1) # Model summary.

Df Sum Sq Mean Sq F value Pr(>F)

ind 3 579 192.97 3.188 0.0264 *

Residuals 116 7022 60.53

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

These results indicate that the omnibus analysis of variance is statistically significant, with a p-value of .03. However, this test does not indicate where the group differences lie. To find them, we need a post-hoc comparison. The default post-hoc procedure in R is Tukey’s Honestly Significant Difference (HSD) test or TukeyHSD().[19] Because we saved the ANOVA model into an object named anova1, we can just use that object as the argument in the code that follows:

# POST-HOC COMPARISONS

TukeyHSD(anova1)

Tukey multiple comparisons of means

95% family-wise confidence level

Fit: aov(formula = values ~ ind, data = xs)

$ind

diff lwr upr p adj

x2-x1 1.132450 -4.1039279 6.368829 0.9425906

x3-x1 5.502549 0.2661705 10.738927 0.0354399

x4-x1 4.004412 -1.2319658 9.240791 0.1964255

x3-x2 4.370098 -0.8662799 9.606477 0.1362299

x4-x2 2.871962 -2.3644162 8.108340 0.4835521

x4-x3 -1.498136 -6.7345146 3.738242 0.8782931

The TukeyHSD() function reports observed differences between paired groups, as well as the lower and upper bounds for confidence intervals and adjusted p-values. What’s interesting in this particular situation is that the only statistically significant difference is between groups 1 and 3. This is surprising because the commands that generated the random data actually called for a larger difference between groups 1 and 4. However, given the vicissitudes of random data generation, such quirks are to be expected. This is also an indication that sample size of 30 scores per group is probably not sufficiently large enough to give stable results.[20] In future research, it would be beneficial to at least double the sample size.

We can finish by cleaning the workspace of the variables and objects we generated in this exercise.

# CLEAN UP

rm(list = ls()) # Remove all objects from workspace

Comparing proportions

In the last five procedures that we examined—correlations, regression, two-sample t-tests, paired t-tests, and ANOVA—the outcome variable was quantitative (i.e., an interval or ratio level of measurement). This made it possible to calculate means and standard deviations, which were then used in the inferential tests. However, there are situations where the outcome variable is categorical (i.e., a nominal or possibly ordinal level of measurement.) The last two sections in this chapter address those situations. In the case of a dichotomous outcome—that is, an outcome with only two possible values, such as yes or no, on or off, and click or don’t click, then a proportion test can be an ideal way to compare the performance of two or more groups.

In this example, we will again use simulated data. R’s prop.test() function needs two pieces of information for each group: the number of trials or total observations, usually referred to as n, and the number of trials with positive outcome, usually referred to as X. In the code that follows, I create a vector called n5 of five 100s for the number of trials by using R’s rep() function (see ?rep for more information). A second vector called x5 is then created with the number of positive outcomes for each group.

Sample: sample_7_6.R

# CREATE SIMULATION DATA

# Two vectors are needed:

# One specifies the total number of people in each group.

# This creates a vector with 5 100s in it, for 5 groups.

n5 <- c(rep(100, 5))

# Another specifies the number of cases with positive outcomes.

x5 <- c(65, 60, 60, 50, 45)

The next step is to call R’s prop.test() function with two arguments: the variable with the X data and the variable with the n data.

# PROPORTION TEST

prop.test(x5, n5)

5-sample test for equality of proportions without continuity

correction

data: x5 out of n5

X-squared = 10.9578, df = 4, p-value = 0.02704

alternative hypothesis: two.sided

sample estimates:

prop 1 prop 2 prop 3 prop 4 prop 5

0.65 0.60 0.60 0.50 0.45

The output includes a chi-squared test—shown here as X-squared—and the p-value, which in this case is below the standard cutoff of .05. It also gives the observed sample proportions.

In the case where there are only two groups being compared, R will also give a confidence interval for the difference between the groups’ proportions. The default value is .95 but that can be changed with the conf.level argument.

# CREATE SIMULATION DATA FOR 2 GROUPS

n2 <- c(40, 40) # 40 trials

x2 <- c(30, 20) # Number of positive outcomes

# PROPORTION TEST FOR 2 GROUPS

prop.test(x2, n2, conf.level = .80) # With CI

2-sample test for equality of proportions with continuity

correction

data: x2 out of n2

X-squared = 4.32, df = 1, p-value = 0.03767

alternative hypothesis: two.sided

80 percent confidence interval:

0.09097213 0.40902787

sample estimates:

prop 1 prop 2

0.75 0.50

In this case, despite the relatively small samples, the difference between the sample proportions is statistically significant, as shown by both the p-value and the confidence interval.

We can finish by clearing the workspace.

# CLEAN UP

rm(list = ls()) # Remove all objects from the workspace.

Cross-tabulations

The final bivariate procedure that we will discuss is the chi-squared inferential test for cross-tabulated data. In this example we will use the Titanic data from R’s datasets package. This data contains the survival data from the wreck of the Titanic, broken down by sex, age (child or adult), and class of passenger (1^st, 2^nd, 3^rd, or crew). See ?Titanic for more information. The data are in a tabular format that, when displayed, consists of four 4 x 2 tables, although it is also possible to display it as a “flat” contingency table with the ftable() function.

Sample: sample_7_7.R

# LOAD DATA

require(“datasets”) # Load the datasets package.

Titanic # Show complete data in tables.

ftable(Titanic) # Display a "flat" contingency table.

Survived No Yes

Class Sex Age

1st Male Child 0 5

Adult 118 57

Female Child 0 1

Adult 4 140

2nd Male Child 0 11

Adult 154 14

Female Child 0 13

Adult 13 80

3rd Male Child 35 13

Adult 387 75

Female Child 17 14

Adult 89 76

Crew Male Child 0 0

Adult 670 192

Female Child 0 0

Adult 3 20

While the tabular format is a convenient way of storing and displaying the data, we need to restructure the data for our analysis so there is one row per observation. We can do this using a string of nested commands that we saw in sample_1_1.R:

# RESTRUCTURE DATA

tdf <- as.data.frame(lapply(as.data.frame.table(Titanic),

function(x)rep(x, as.data.frame.table(Titanic)$Freq)))[, -5]

tdf[1:5, ] # Check the first five rows of data.

Class Sex Age Survived

1 3rd Male Child No

2 3rd Male Child No

3 3rd Male Child No

4 3rd Male Child No

5 3rd Male Child No

For this example, we will focus on just two of the four variables: Class and Survived. To do this, we will create a reduced data set by using the table() function and save it into a new object, ttab for Titanic Table.

# CREATE REDUCED TABLE

ttab <- table(tdf$Class, tdf$Survived) # Select two variables.

ttab # Show the new table.

No Yes

1st 122 203

2nd 167 118

3rd 528 178

Crew 673 212

The raw frequencies are important, but it easier to understand the pattern by looking at the row percentages. We can get these by using the prop.table() function and requesting the first column. In the code the follows, I also wrap these commands in the round() function to reduce the output to two decimal places, then multiply by 100 to get percentages.

# PERCENTAGES

rowp <- round(prop.table(ttab, 1), 2) * 100 # row %

colp <- round(prop.table(ttab, 2), 2) * 100 # column %

totp <- round(prop.table(ttab), 2) * 100 # total %

rowp

No Yes

1st 38 62

2nd 59 41

3rd 75 25

Crew 76 24

From these results it is clear that first-class passengers had a much higher survival rate at 62%, while third-class passengers and crew had a much lower rate at 25% and 24%, respectively. The chi-squared test for independence, or R’s chisq.test() function, can then test these results against a null hypothesis:

# CHI-SQUARED TEST FOR INDEPENDENCE

tchi <- chisq.test(ttab) # Compare two variables in ttab

tchi

Pearson's Chi-squared test

data: ttab

X-squared = 190.4011, df = 3, p-value < 2.2e-16

Not surprisingly, these group differences are highly significant, with a p-value that is nearly zero. In addition to this basic output, R is able to give several other detailed tables from the saved chi-squared object:

tchi$observed gives the observed frequencies, which is the same as data in ttab.
tchi$expected gives the expected frequencies from the null distribution.
tchi$residuals give the Pearson's residuals.
tchi$stdres gives the standardized residuals based on the residual cell variance.

Finally, we can unload the packages and clear the workspace before moving on.

# CLEAN UP

detach("package:datasets", unload = TRUE) # Unloads the datasets package.

rm(list = ls()) # Remove all objects from the workspace.

Build apps 2X faster

using Syncfusion Essential Studio^® suite

1800+ high-performance UI components.
Includes popular controls such as Grid, Chart, Scheduler, and more.
24x5 unlimited support by developers.

Get Your Free Trial Now

Statistics for Associations

Correlations

Bivariate regression

Two-sample t-test

Paired t-test

One-factor ANOVA

Comparing proportions

Cross-tabulations

DISCLAIMER: Web reader is currently in beta. Please report any issues through our support system. PDF and Kindle format files are also available for download.