CHAPTER 2
Arguably the five most fundamental objects in the R language are vectors, lists, arrays, matrices, and data frames. We should note, however, that R uses terminology somewhat differently than most other programming languages. An R vector corresponds to what is called an array in other languages, and R arrays and matrices correspond to an array-of-arrays style data structure in other languages. Being able to work with the five basic objects and write functions that operate on them are key skills.
Figure 19’s screenshot gives you an idea of where this chapter is headed.

Figure 19: Vectors and Functions Demo
In section 2.1, we’ll look at the five basic collection objects in R (vectors, lists, arrays, matrices, and data frames) and you’ll learn how to write program-defined functions that work with these collection objects. In section 2.2, we'll see how to perform a sequential search on a vector using built-in functions such as which.max() and is.element() and also by using program-defined functions.
In section 2.3, you'll learn how to perform a binary search using a program-defined function. In section 2.4, we’ll examine how to sort vectors using the built-in sort() function and how to write a program-defined sorting function. Finally, in section 2.5, you'll learn how to shuffle the values in a vector and how to select samples of the values in a vector.
An R vector has a fixed number of cells, and each cell holds a value of the same data type (integer, character, etc.). A list can hold values with different types, and the length of a list can change during runtime. A matrix has a fixed number of cells in exactly two dimensions (rows and columns), and each cell holds the same data type.
An array has a fixed number of cells in one, two, three, or more dimensions, and each cell holds the same data type. A data.frame has columns in which the number of values and the data types can be different.
Code Listing 2: The Five Basic Data Structures in R
# vectorsVsArrays.R # R 3.2.4 cat(“\nBegin vectors vs. arrays demo \n\n”) cat(“Creating three demo vectors \n\n”) v <- c(1:3) # [1 2 3] cat(v, “\n\n”) v <- vector(mode=“numeric”, 4) # [0.0 0.0 0.0 0.0] cat(v, “\n\n”) v <- c(“a”, “b”, “x”) cat(v, “\n\n”) cat(“Creating two demo lists \n\n”) ls <- list(“a”, 2.2) ls[3] <- as.integer(3) print(ls) cat(“\n”) cat(“Cell [2] is: “, ls[[2]], “\n\n”) ls <- list(lname=“Smith”, age=22) cat(“Cells by cell names are: “, ls$lname, “-”, ls$age) cat(“\n\n”) cat(“Creating a 2x3 matrix \n\n”) m <- matrix(0.0, nrow=2, ncol=3) print(m) cat(“\n”) cat(“Creating 1 and 2-dim arrays \n\n”) arr <- array(0.0, 3) # [0.0 0.0 0.0] print(arr) cat(“\n”) arr <- array(0.0, c(2,3)) # 2x3 matrix print(arr) cat(“\n”) # arr = array(0.0, c(2,5,4)) # 2x5x4 n-array # print(arr) # 40 values displayed # cat(“\n”) cat(“Creating a data frame \n\n”) people <- c(“Alex”, “Barb”, “Carl”) ages <- c(19, 29, 39) df <- data.frame(people, ages) names(df) <- c(“NAME”, “AGE”) print(df) cat(“\nEnd vectors vs. arrays demo \n\n”) |
> setwd(“C:\\Succinctly\\Ch2”) > rm(list=ls()) > source(“vectorsVsArrays.R”) Begin vectors vs. arrays demo Creating three demo vectors 1 2 3 0 0 0 0 a b x Creating two demo lists [[1]] [1] “a” [[2]] [1] 2.2 [[3]] [1] 3 Cell [2] is: 2.2 Cells by cell names are: Smith - 22 Creating a 2x3 matrix [,1] [,2] [,3] [1,] 0 0 0 [2,] 0 0 0 Creating 1 and 2-dim arrays [1] 0 0 0 [,1] [,2] [,3] [1,] 0 0 0 [2,] 0 0 0 Creating a data frame NAME AGE 1 Alex 19 2 Barb 29 3 Carl 39 End vectors vs. arrays demo |
The demo program is launched by issuing a setwd() command to point to the source code directory, calling the rm() command to remove all existing objects in the current context and using the source() command to begin program execution. Next, the demo creates and displays three vectors:
v <- c(1:3) # [1 2 3]
cat(v, “\n\n”)
v <- vector(mode=“numeric”, 4) # [0.0 0.0 0.0 0.0]
cat(v, “\n\n”)
v <- c(“a”, “b”, “x”)
cat(v, “\n\n”
The c() function’s name is short for “combine,” and this function can be used to create a vector when you know the initial cell values.
The cell data type (integer, numeric / double, character, logical, complex, raw) of a vector is inferred by the values passed to the c() function. If you want to explicitly specify the cell type, you can use the vector() function with a mode parameter value as shown.
Vector cell values are accessed using 1-based indexing—for example, v[1] <- 3.14 or x <- v[3]. Next, the demo program shows how to create, manipulate, and access a simple list:
ls <- list(“a”, 2.2)
ls[3] <- as.integer(3)
print(ls)
cat(“Cell [2] is: “, ls[[2]], “\n\n”)
The first statement creates a list with two values, the character a and the numeric 2.2. The second statement dynamically extends the length of the list by adding a new integer value of 3 at cell index [3]. As with almost all R composite objects, lists use 1-based indexing rather than 0-based indexing.
Notice that in order to access the value in cell [2] of the list, the demo uses double square brackets, ls[[2]], rather than single square brackets (which would return a list object with one value). However, in order to store a value in cell [3] of the list, the demo uses single square brackets: ls[3] <- as.integer(3). Two rules of thumb that are usually, but not always, the correct methods—first, use double square brackets on the right side of an assignment statement or when standing alone; and second, use single square brackets on the left side of an assignment statement or when accessing an item in the list.
Next, the demo creates a list in which cell values can be accessed by name as well as by index:
ls <- list(lname=“Smith”, age=22)
cat(“Cells by cell names are: “, ls$lname, “-”, ls$age)
The list named ls represents a person by storing the last name and age. Notice that the cell values are accessed using the $ token. As another general rule of thumb, for clarity’s sake it’s preferable to access by cell name rather than by index when using a list to store related values.
Next, the demo creates and displays a 2x3 numeric matrix:
m <- matrix(0.0, nrow=2, ncol=3)
print(m)
The matrix function has several optional parameters, but the demo example uses the most common pattern. Next, the demo program illustrates the R array type:
arr <- array(0.0, 3)
print(arr)
arr <- array(0.0, c(2,3))
print(arr)
The first example creates a one-dimensional array that has three cells, each of which is initialized to 0.0. The second example creates a two-dimensional array that has two rows and three columns, with each of the six cells initialized to 0.0. Notice that a one-dimensional array can be used in situations in which a vector can be used, and a two-dimensional array can be used in situations in which a matrix can be used.
An array can have three or more dimensions. For example, arr <- array(0.0, c(2,5,4)) creates an array with three dimensions—the first with two cells, the second with five cells, and the third with four cells, for a total of 40 cells.
Next, the demo program creates a data.frame object:
cat(“Creating a data frame \n\n”)
people <- c(“Alex”, “Barb”, “Carl”)
ages <- c(19, 29, 39)
df <- data.frame(people, ages)
names(df) <- c(“NAME”, “AGE”)
print(df)
Although R data frames are somewhat similar to table objects in other languages, data frames are created manually by column rather than by row. The first two statements create a column vector with the names of three people and a second column vector with their ages. The data.frame() function has several optional parameters that give great flexibility for creating a data-frame object.
After the data frame is created, the demo supplies header names of “NAME” and AGE” for the two columns. Header names are optional, but they are useful because many built-in R functions use them.
In many situations, rather than programmatically constructing a data frame from column vectors, you’ll want to create a data frame from the data stored in a text file using the read.table() function or the closely related read.csv() or read.delim() functions.
In summary, an R vector corresponds to what is called an array in most other programming languages. Vectors are accessed using 1-based indexing. An R matrix has exactly two dimensions. An R array can have one, two, or more dimensions and therefore can function like a vector or a matrix. An R list can hold values of different types and can change size dynamically. When accessing a list value using indexing, you should typically use double square brackets rather than single square brackets. An R data.frame is a table-like object, which often has columns that contain different data types.
For additional details about creating and using vectors, see:
https://stat.ethz.ch/R-manual/R-devel/library/base/html/vector.html.
For additional details about creating and using lists, see:
https://stat.ethz.ch/R-manual/R-devel/library/base/html/list.html.
For additional details about creating and using matrices, see:
https://stat.ethz.ch/R-manual/R-devel/library/base/html/matrix.html.
For additional details about creating and using arrays, see:
https://stat.ethz.ch/R-manual/R-devel/library/base/html/array.html.
For additional details about creating and using data frames, see:
https://stat.ethz.ch/R-manual/R-devel/library/base/html/data.frame.html.
The R language has several built-in ways to search a vector for a target value, including the match(), is.element(), and which.max() functions along with the %in% operator. In some situations with numeric vectors, you’ll want to write a program-defined, sequential search function.
Code Listing 3: Sequential Search
# seqsearching.R # R 3.2.4 seq_search = function(v, t, eps) { # search vector v for target value t # eps is epsilon tolerance for equality n <- length(v) for (i in 1:n) { if (abs(v[i] - t) <= eps) { return(i) } } return(0) } my_print = function(v, dec) { n <- length(v) for (i in 1:n) { x <- v[i] xx <- formatC(x, digits=dec, format=“f”) cat(xx, “ “) } cat(“\n”) } cat(“\nBegin sequential search demo \n\n”) vec <- c(1.0, 5.0, 2.0, 3.0, 4.0) target <- 2.0 epsilon <- 1.0e-5 cat(“Vector is: “) my_print(vec, dec=2) cat(“\n”) cat(“Target is “) cat(formatC(target, digits=1, format=“f”), “\n”) cat(“Epsilon is “, epsilon, “\n”) cat(“Search using program-defined seq_search() \n”) idx <- seq_search(vec, target, epsilon) cat(“idx = “, idx, “\n\n”) cat(“Search using base %in% operator \n”) there <- target %in% vec cat(“there = “, there, “\n\n”) cat(“Search using base which.max() \n”) idx <- which.max(vec) cat(“idx of largest = “, idx, “\n\n”) target <- c(5.0, 3.0) cat(“Target is “, target, “\n”) cat(“Search using base match() \n”) matches <- match(vec, target) cat(“matches = “, matches, “\n\n”) cat(“Search using base is.element() \n”) matches <- is.element(vec, target) cat(“matches = “, matches, “\n\n”) cat(“End demo\n”) |
> source(“seqsearching.R”) Begin sequential search demo Vector is: 1.0 5.0 2.0 3.0 4.0 Target is 2.0 Epsilon is 1e-05 Search using program-defined seq_search() idx = 3 Search using base %in% operator there = TRUE Search using base which.max() idx of largest = 2 Target is 5 3 Search using base match() matches = NA 1 NA 2 NA Search using base is.element() matches = FALSE TRUE FALSE TRUE FALSE End demo |
The simplest way to search a vector for a value is to use the %in% operator. In the demo program, the key code is:
vec <- c(1.0, 5.0, 2.0, 3.0, 4.0)
target <- 2.0
cat(“Search using base %in% operator \n”)
there <- target %in% vec
cat(“there = “, there, “\n\n”)
The return result stored into the variable there is TRUE because the target value 2.0 is in the source vector at index [3]. Using the %in% operator is simple and effective in most situations. However, none of the R built-in search mechanisms give you control over exactly what equality means when you are comparing two floating-point values.
Because floating-point values are only approximations, comparing two floating-point values for exact equality can be very tricky. For example:
> x <- 0.15 + 0.15 # 0.30
> y <- 0.20 + 0.10 # 0.30
> if (x == y) { cat(“equal”) } else { cat(“NOT equal”) }
> “NOT equal” # what the heck?
Instead of comparing two floating-point values for exact equality, it’s usually preferable to use the built-in abs() (absolute value) function to check whether or not the two values are very close. The demo program defines a custom search function that uses this approach:
seq_search = function(v, t, eps) {
n <- length(v)
for (i in 1:n) {
if (abs(v[i] - t) <= eps) {
return(i)
}
}
return(0)
}
The function walks through each cell of the source vector v using a for-loop. If the current value being checked is close to within a small value eps (epsilon), the function returns the index of the target value. If none of the cell values is close enough to the target value, the function returns 0, which is a nonvalid vector index.
Here is the key calling code:
vec <- c(1.0, 5.0, 2.0, 3.0, 4.0)
target <- 2.0
epsilon <- 1.0e-5
cat(“Target is “)
cat(formatC(target, digits=1, format=“f”), “\n”)
cat(“Epsilon is “, epsilon, “\n”)
cat(“Search using program-defined seq_search() \n”)
idx <- seq_search(vec, target, epsilon)
cat(“idx = “, idx, “\n\n”)
Different problem domains tend to use different values of epsilon, but the value of 0.00001 used in the demo is quite common in machine learning. There are several ways to display floating-point values with nice formatting. My preference is to use the built-in formatC() function.
The built-in which.max() function returns the index that contains the largest value in a vector. For example:
vec <- c(1.0, 5.0, 2.0, 3.0, 4.0)
idx <- which.max(vec)
These statements return a value of 2 into variable idx because the largest value in vec is 5.0, which is located at cell [2]. There is a similar built-in which.min() function, too.
The built-in match() function is useful for searching a vector for multiple values. For example:
vec <- c(1.0, 5.0, 2.0, 3.0, 4.0)
target <- c(5.0, 3.0)
matches <- match(vec, target)
Here the return value stored into variable matches is the list (NA, 1, NA, 2, NA) because the first value in vec doesn’t match any value in target, the second value in vec matches the value at [1] in target, and so on.
The built-in is.element() function is very similar to the match() function, except that the return list contains Boolean values rather than integer indices. For example:
vec <- c(1.0, 5.0, 2.0, 3.0, 4.0)
target <- c(5.0, 3.0)
matches <- is.element(vec, target)
These statements return a list with values (FALSE, TRUE, FALSE, TRUE, FALSE) because the cells in vec at [2] and [4] have values that match a value in target.
In summary, if you want to search a vector that contains floating-point values and you want to control the epsilon tolerance for equality, you can easily write a program-defined function. When searching a vector that contains values that are not floating-point types, using the %in% operator is simple and effective. In order to find the location of the largest value in a vector, you can use the which.max() function. When searching a vector for multiple values, the built-in match() and is.element() functions are good choices.
For additional details about the match function and the %in% operator, see:
https://stat.ethz.ch/R-manual/R-devel/library/base/html/match.html.
For more information about the which.max() and which.min() functions, see:
https://stat.ethz.ch/R-manual/R-devel/library/base/html/which.min.html.
When searching a sorted vector or list that has many cells, a binary search algorithm is much faster than a sequential search. The base R language doesn’t have a binary search algorithm. Instead, you can use the binsearch() function from the gtools package or write your own binary search function.
Code Listing 4: Binary Search
# binsearching.R # R 3.2.4 bin_search = function(v, t, eps) { # search sorted vector v for target value t # eps is epsilon tolerance for equality lo <- 1 hi <- length(v) while (lo <= hi) { mid <- as.integer(round((lo + hi) / 2)) # always even! cat(“lo, mid, hi = “, lo, mid, hi, “\n”)
if (abs(v[mid] - t) <= eps) { return(mid) } else if (v[mid] < t) { # C format OK in a function lo <- mid + 1 } else { hi <- mid - 1 } } return(0) } cat(“\nBegin binary search demo \n\n”) vec <- c(1.5, 3.5, 5.5, 7.5, 9.5, 11.5, 13.5, 15.5, 17.5, 19.5) target <- 17.5 epsilon <- 1.0e-5 cat(“Vector is: \n”) print(vec) cat(“Target is “, target, “\n”) cat(“Epsilon is “, epsilon, “\n\n”) cat(“Begin search \n\n”) idx <- bin_search(vec, target, epsilon) if (idx == 0) { cat(“\nTarget not found \n\n”) } else { cat(“\nTarget found at cell index “, idx, “\n\n”) } cat(“End demo \n”) |
> source(“binsearching.R”) Begin binary search demo Vector is: [1] 1.5 3.5 5.5 7.5 9.5 11.5 13.5 15.5 17.5 19.5 Target is 17.5 Epsilon is 1e-05 Begin search lo, mid, hi = 1 6 10 lo, mid, hi = 7 8 10 lo, mid, hi = 9 10 10 lo, mid, hi = 9 9 9 Target found at cell index 9 End demo |
The demo program begins by setting up a vector with 10 floating-point values, a target value to search for, and an epsilon tolerance that determines how close two floating-point values must be in order to be evaluated as equal:
vec <- c(1.5, 3.5, 5.5, 7.5, 9.5, 11.5, 13.5, 15.5, 17.5, 19.5)
target <- 17.5
epsilon <- 1.0e-5
cat(“Vector is: \n”)
print(vec)
cat(“Target is “, target, “\n”)
cat(“Epsilon is “, epsilon, “\n\n”)
The program-defined binary search function bin_search() is called this way:
idx <- bin_search(vec, target, epsilon)
if (idx == 0) {
cat(“\nTarget not found \n\n”)
} else {
cat(“\nTarget found at cell index “, idx, “\n\n”)
}
The return value is either 0 if the target value is not found, or it is the index of the first occurrence of the target value if the target is in the vector.
The structure of the bin_search() function looks like this:
bin_search = function(v, t, eps)
lo <- 1
hi <- length(v)
while (lo <= hi) {
# compute a midpoint
# if target at midpoint return index
# otherwise search left or sight part of v
}
return(0)
}
Although simple in principle, the binary search algorithm is quite subtle and has many possible implementation variations. In order to compute a midpoint between indices lo and hi, the demo function uses this statement:
mid <- as.integer(round((lo + hi) / 2))
Compared to many other programming languages, the R round() function is unusual because it uses IEEE 754-2008 “round ties to even.” Here is the heart of the binary search implementation:
if (abs(v[mid] - t) <= eps) { # found it!
return(mid)
}
else if (v[mid] < t) { # search left side
lo <- mid + 1
}
else { # search right side
hi <- mid - 1
}
The Wikipedia entry on the binary search algorithm addresses several alternative variations of the core search-left, search-right implementation.
In summary, the base R language doesn’t have a binary search function. The gtools package has a binsearch() function, but it doesn’t allow you to control the epsilon tolerance for floating-point equality, so in some situations you’ll want to write a custom binary search function.
For information about the gtools binsearch() function, see:
http://svitsrv25.epfl.ch/R-doc/library/gtools/html/binsearch.html.
For details about the surprisingly tricky round(), ceiling(), and related functions, see:
https://stat.ethz.ch/R-manual/R-devel/library/base/html/Round.html.
The base R language has a sort() function that can be used to arrange the values in a vector. The sort() function can use three different sorting algorithms: shell sort, quicksort, and radix sort. You can also write a custom sorting function for special situations.
Code Listing 5: Sorting
# sorting.R # R 3.2.4 bubb_sort = function(v) { # ----- exchange = function(ii, jj) { tmp <<- v[ii] v[ii] <<- v[jj] v[jj] <<- tmp } # ----- n <- length(v) for (i in 1:(n-1)) { for (j in (i+1):n) { if (v[i] > v[j]) { exchange(i, j) # tmp = v[i]; v[i] = v[j]; v[j] = tmp } } } return(v) } my_print = function(v, h, t) { n <- length(v) cat(“[1] “) cat(head(v, h), “\n”) cat(“ . . . “, “\n”) idx <- n-t+1 cat(“[“, idx, “] “, sep=““) cat(tail(v, t), “\n\n”) } cat(“\nBegin sorting demo \n\n”) cat(“Generating 5,000 random values \n”) set.seed(0) vec <- rnorm(5000) cat(“Vector to sort is : \n”) my_print(vec, 4, 4) cat(“\n”) cat(“Sorting vector using built-in shell sort() \n”) start_t <- proc.time() sorted <- sort(vec, method=“shell”) end_t <- proc.time() times <- end_t - start_t cat(“Sorted vector is \n”) my_print(sorted, 4, 4) e_time <- formatC(times[3], digits=2, format=“f”) cat(“Elapsed time =“, e_time, “sec. \n”) cat(“\n”) cat(“Sorting with user-defined bubb_sort() \n”) start_t <- proc.time() sorted <- bubb_sort(vec) end_t <- proc.time() times <- end_t - start_t cat(“Sorted vector is \n”) my_print(sorted, 4, 4) cat(“Elapsed time =“, times[3], “sec.\n”) cat(“\n”) cat(“Verifying result is sorted using is.sorted() \n”) unsorted <- is.unsorted(sorted) if (unsorted == T) { cat(“Error. Result not sorted \n\n”) } else { cat(“Result verified sorted \n\n”) } cat(“End demo \n”) |
> source(“sorting.R”) Begin sorting demo Generating 5,000 random values Vector to sort is : [1] 1.262954 -0.3262334 1.329799 1.272429 . . . [4997] 2.162363 1.175956 0.6190771 0.008463216 Sorting vector using built-in shell sort() Sorted vector is [1] -3.703236 -3.236386 -3.115391 -3.082364 . . . [4997] 2.961743 3.023597 3.039033 3.266415 Elapsed time = 0.00 sec. Sorting with user-defined bubb_sort() Sorted vector is [1] -3.703236 -3.236386 -3.115391 -3.082364 . . . [4997] 2.961743 3.023597 3.039033 3.266415 Elapsed time = 27.57 sec. Verifying result is sorted using is.sorted() Result verified sorted End demo |
The demo program begins by creating a vector that has 5,000 random values:
set.seed(0)
vec <- rnorm(5000)
cat(“Vector to sort is : \n”)
my_print(vec, 4, 4)
Unlike some programming languages that allow you to create a random number generation object, R has one global random number generator. Calling the set.seed() function allows you to get reproducible results.
By default, the rnorm() function returns values that are normal (Gaussian) distributed with mean = 0.0 and standard deviation = 1.0, which means the vast majority of the 5,000 demo values will be between -4.0 and +4.0.
When working with very large vectors, you will typically find it impractical to display all the cell values. It’s often useful to use the built-in head() or tail() function to display a few of the values at the beginning or end of the vector. The demo program defines a custom my_print() function that allows you to see a few cells at both the beginning and end of a vector.
The demo program calls the sort() function this way:
start_t <- proc.time()
sorted <- sort(vec, method=“shell”)
end_t <- proc.time()
times <- end_t - start_t
The demo sandwiches the call to sort() with calls to the proc.time() function in order to determine how long the sorting will take. Notice that because R functions call by value rather than reference, the result is stored into a new vector object named sorted.
The proc.time() function returns a vector with three values. The value for determining an elapsed time is the value at cell [3]:
e_time <- formatC(times[3], digits=2, format=“f”)
cat(“Elapsed time =“, e_time, “sec. \n”)
Interestingly, if a method parameter value is not supplied to the sort() function, the default algorithm depends on the data type of the cell values. For vectors that contain integers, Boolean values, or factor values, the default algorithm is the radix sort. For all other vector types (such as floating-point values and character values), the default is the shell sort algorithm.
The demo has a program-defined function that implements a crude bubble sort algorithm. In general, you’ll only want to implement a custom sort function in unusual scenarios. Because the bubble sort algorithm is the slowest reasonable algorithm, it is sometimes used as a baseline.
bubb_sort = function(v) {
# -----
exchange = function(ii, jj) {
tmp <<- v[ii]; v[ii] <<- v[jj]; v[jj] <<- tmp
}
# -----
n = length(v)
for (i in 1:(n-1)) {
for (j in (i+1):n) {
if (v[i] > v[j]) { exchange(i, j) }
}
}
return(v)
}
The custom bubb_sort() function has a nested function named exchange() that swaps the values to be sorted in the array. Notice that exchange() has access to parameter v, but in order to change v the special <<- operator must be used. The bubb_sort() function is called this way:
sorted <- bubb_sort(vec)
unsorted <- is.unsorted(sorted)
if (unsorted == T) { cat(“Error. Result not sorted \n\n”) }
In summary, the base R language can sort vectors using the built-in sort() function, which can use the shell sort algorithm, the quicksort algorithm, or the radix sort algorithm. You can time function calls in R using the value in cell [3] of the return result from the proc.time() function. The R language supports nested-function definitions, but in my opinion those definitions are rarely worth the trouble. The built-in is.unsorted() function can be used to test whether or not a vector is sorted.
For detailed information about the base sort() function, see:
https://stat.ethz.ch/R-manual/R-devel/library/base/html/sort.html.
For details about the rnorm() function and related dnorm(), pnorm(), and qnorm(), see:
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/Normal.html.
The base R language has a versatile sample() function that can select a subset of numbers from a given range and also shuffle the values in a vector to a random order. You can also write custom functions to sample and shuffle vector values when you want specialized behavior.
Code Listing 6: Sampling and Shuffling
# sampling.R # R 3.2.4 my_sample = function(N, k) { # select k random ints between 1 and N result <- c(1:k) # [1, 2, . . k] for (i in (k+1):N) { # reservoir sampling algorithm j <- floor(runif(1, min=1.0, max=i+1)) if (j <= k) { result[j] = i } } return(result) } my_shuffle = function(v) { # Fisher-Yates shuffle n = length(v) for (i in 1:n) { ri <- floor(runif(1, min=i, max=n+1)) t <- v[i]; v[i] <- v[ri]; v[ri] <- t } return(v) } cat(“\nBegin vector sampling demo \n\n”) set.seed(20) # arbitrary N <- 9 k <- 4 cat(“N = “, N, “\n”) cat(“k = “, k, “\n\n”) cat(“Sampling 4 items using program-defined my_sample() \n”) samp <- my_sample(N, k) cat(“Sample = “, samp, “\n\n”) cat(“Sampling 4 items using built-in sample() \n”) samp <- sample(N, k) cat(“Sample = “, samp, “\n\n”) cat(“Sampling 4 items using built-in sample(replace=TRUE) \n”) samp <- sample(N, k, replace=T) cat(“Sample = “, samp, “\n\n”) cat(“Shuffling (1,2,3,4,5,6) using built-in sample() \n”) v <- c(1:6) vv <- sample(v) cat(“Shuffled v = “, vv, “\n\n”) cat(“Shuffling (1,2,3,4,5,6) using program-defined my_shuffle() \n”) v <- c(1:6) vv <- my_shuffle(v) cat(“Shuffled v = “, vv, “\n\n”) cat(“End demo\n”) |
> source(“sampling.R”) Begin vector sampling demo N = 9 k = 4 Sampling 4 items using program-defined my_sample() Sample = 1 7 3 4 Sampling 4 items using built-in sample() Sample = 9 1 8 2 Sampling 4 items using built-in sample(replace=TRUE) Sample = 4 7 7 1 Shuffling (1,2,3,4,5,6) using built-in sample() Shuffled v = 5 1 2 6 3 4 Shuffling (1,2,3,4,5,6) using program-defined my_shuffle() Shuffled v = 5 4 3 1 2 6 End demo |
The demo program shows how to select a random subset of a range of values using the built-in sample() function:
set.seed(20)
N <- as.integer(9)
k <- as.integer(4)
samp <- my_sample(N, k)
cat(“Sample = “, samp, “\n\n”)
The set.seed() function is used so that the demo results will be reproducible. The seed value of 20 is essentially arbitrary. Variable N is set to 9 and is the size of the source data. Variable k, set to 4, is the subset size. The call to sample(N, k) returns four random (by default, all different) integers that have values from 1 to 9, inclusive.
The demo program defines a custom-sampling function named my_sample() that uses a very clever technique called reservoir sampling:
my_sample = function(N, k) {
result <- c(1:k) # [1, 2, . . k]
for (i in (k+1):N) {
j <- floor(runif(1, min=1.0, max=i+1))
if (j <= k) {
result[j] = i
}
}
return(result)
}
Selecting k random and unique values from a set of values is surprisingly tricky. The reservoir algorithm starts with (1, 2, . . , k), then it probabilistically replaces these values. The runif() function (random uniform) call returns one random floating-point value from 1.0 to i+1 (exclusive). The floor() function strips away the decimal part of the return value from the runif() function.
The built-in sample() function can return a subset in which duplicate values are allowed by passing logical True to the replace parameter:
samp <- sample(N, k, replace=T)
cat(“Sample = “, samp, “\n\n”)
The behavior of the built-in sample() function is overloaded so that the function can also act as a shuffle function. For example:
cat(“Shuffling (1,2,3,4,5,6) using built-in sample() \n”)
v <- c(1:6)
vv <- sample(v)
cat(“Shuffled v = “, vv, “\n\n”)
If you pass sample() a vector but no subset size, the function will sample all values in the vector, which essentially performs a shuffle operation. This technique is useful when you want to traverse all the cells in a vector in a random order (a common task in many machine-learning algorithms). For example:
m <- matrix(0.0, nrow=9, ncol=3)
# populate the matrix with data
indices <- sample(1:9)
for (i in 1:9) {
idx <- indices[i]
# process row at [idx]
The demo program defines a custom function named my_shuffle() that can shuffle the values in a vector:
my_shuffle = function(v) {
n = length(v)
for (i in 1:n) {
ri <- floor(runif(1, min=i, max=n+1))
t <- v[i]; v[i] <- v[ri]; v[ri] <- t
}
return(v)
}
The custom function uses a neat mini-algorithm called the Fisher-Yates shuffle. Shuffling the values in a vector into random order can be challenging, and it’s very easy to write code that produces results that appear random but in fact are heavily biased toward certain patterns. Fortunately, the Fisher-Yates algorithm is very simple and effective.
The demo program calls my_shuffle()this way:
cat(“Shuffling (1,2,3,4,5,6) using program-defined my_shuffle() \n”)
v <- c(1:6)
vv <- my_shuffle(v)
cat(“Shuffled v = “, vv, “\n\n”)
Recall that R functions pass parameters by value rather than by reference, which means that although function my_shuffle() appears to manipulate its input parameter v, behind the scenes a copy of v is made so that the actual argument passed to my_shuffle() does not change. If you want to shuffle the source vector, you can use this calling pattern:
v <- my_shuffle(v)
In summary, the built-in sample() function can generate a random subset with no duplicate values, a random subset with duplicate values allowed, or a random subset of all source values with no duplicates, which is a shuffle. If you need to define a function with custom sampling behavior, you can use the reservoir-sampling algorithm, and if you need to define a function with custom shuffling behavior, you can use the Fisher-Yates mini-algorithm.
For detailed information about the sample() function, see:
https://stat.ethz.ch/R-manual/R-devel/library/base/html/sample.html.
For information about runif() and three related random functions, see:
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/Uniform.html.
For more information about the reservoir-sampling algorithm, see:
https://en.wikipedia.org/wiki/Reservoir_sampling.