Which summary functions can you use to preview data frames in R? Select all that apply.

Teaching: 45 min
Exercises: 0 min

Questions

  • What are the different data types in R?

  • What are the different data structures in R?

  • How do I access data within the various data structures?

Objectives

  • Expose learners to the different data types in R and show how these data types are used in data structures.

  • Learn how to create vectors of different types.

  • Be able to check the type of vector.

  • Learn about missing data and other special values.

  • Get familiar with the different data structures (lists, matrices, data frames).

To make the best of the R language, you’ll need a strong understanding of the basic data types and data structures and how to operate on them.

Data structures are very important to understand because these are the objects you will manipulate on a day-to-day basis in R. Dealing with object conversions is one of the most common sources of frustration for beginners.

Everything in R is an object.

R has 6 basic data types. (In addition to the five listed below, there is also raw which will not be discussed in this workshop.)

  • character
  • numeric (real or decimal)
  • integer
  • logical
  • complex

Elements of these data types may be combined to form data structures, such as atomic vectors. When we call a vector atomic, we mean that the vector only holds data of a single data type. Below are examples of atomic character vectors, numeric vectors, integer vectors, etc.

  • character: "a", "swc"
  • numeric: 2, 15.5
  • integer: 2L (the L tells R to store this as an integer)
  • logical: TRUE, FALSE
  • complex: 1+4i (complex numbers with real and imaginary parts)

R provides many functions to examine features of vectors and other objects, for example

  • class() - what kind of object is it (high-level)?
  • typeof() - what is the object’s data type (low-level)?
  • length() - how long is it? What about two dimensional objects?
  • attributes() - does it have any metadata?

# Example
x <- "dataset"
typeof(x)

R has many data structures. These include

  • atomic vector
  • list
  • matrix
  • data frame
  • factors

Vectors

A vector is the most common and basic data structure in R and is pretty much the workhorse of R. Technically, vectors can be one of two types:

  • atomic vectors
  • lists

although the term “vector” most commonly refers to the atomic types, not to lists.

The Different Vector Modes

A vector is a collection of elements that are most commonly of mode character, logical, integer or numeric.

You can create an empty vector with vector(). (By default the mode is logical. You can be more explicit as shown in the examples below.) It is more common to use direct constructors such as character(), numeric(), etc.

vector() # an empty 'logical' (the default) vector

vector("character", length = 5) # a vector of mode 'character' with 5 elements

character(5) # the same thing, but using the constructor directly

numeric(5) # a numeric vector with 5 elements

logical(5) # a logical vector with 5 elements

[1] FALSE FALSE FALSE FALSE FALSE

You can also create vectors by directly specifying their content. R will then guess the appropriate mode of storage for the vector. For instance:

x <- c(1, 2, 3)

will create a vector x of mode numeric. These are the most common kind, and are treated as double precision real numbers. If you want to create integers explicitly, add an L to each element (or coerce to the integer type using as.integer()).

Using TRUE and FALSE will create a vector of mode logical:

y <- c(TRUE, TRUE, FALSE, FALSE)

While using quoted text will create a vector of mode character:

z <- c("Sarah", "Tracy", "Jon")

Examining Vectors

The functions typeof(), length(), class() and str() provide useful information about your vectors and R objects in general.

chr [1:3] "Sarah" "Tracy" "Jon"
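As a minimal sketch, these functions can be applied to the character vector z from above (redefined here so the example runs on its own):

```r
z <- c("Sarah", "Tracy", "Jon")

typeof(z)  # low-level storage type: "character"
length(z)  # number of elements: 3
class(z)   # high-level class: "character"
str(z)     # compact summary: chr [1:3] "Sarah" "Tracy" "Jon"
```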

Adding Elements

The function c() (for combine) can also be used to add elements to a vector.

[1] "Sarah" "Tracy" "Jon" "Annette"

[1] "Greg" "Sarah" "Tracy" "Jon" "Annette"
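The two outputs above are consistent with appending one name and then prepending another. A sketch, assuming z holds the three names defined earlier:

```r
z <- c("Sarah", "Tracy", "Jon")

z <- c(z, "Annette")  # append an element to the end
z

z <- c("Greg", z)     # prepend an element to the beginning
z
```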

Vectors from a Sequence of Numbers

You can create vectors as a sequence of numbers.

seq(from = 1, to = 10, by = 0.1)

 [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4
[16]  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9
[31]  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4
[46]  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
[61]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3  8.4
[76]  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7  9.8  9.9
[91] 10.0

Missing Data

R supports missing data in vectors. They are represented as NA (Not Available) and can be used for all the vector types covered in this lesson:

x <- c(0.5, NA, 0.7)
x <- c(TRUE, FALSE, NA)
x <- c("a", NA, "c", "d", "e")
x <- c(1+5i, 2-3i, NA)

The function is.na() indicates the elements of the vectors that represent missing data, and the function anyNA() returns TRUE if the vector contains any missing values:

x <- c("a", NA, "c", "d", NA)
y <- c("a", "b", "c", "d", "e")
is.na(x)

[1] FALSE  TRUE FALSE FALSE  TRUE

is.na(y)

[1] FALSE FALSE FALSE FALSE FALSE
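A short sketch of anyNA() on the same two vectors, together with two common ways of combining is.na() with other base functions (sum() and which() are additions for illustration):

```r
x <- c("a", NA, "c", "d", NA)
y <- c("a", "b", "c", "d", "e")

anyNA(x)         # TRUE: x contains at least one NA
anyNA(y)         # FALSE: y has no missing values
sum(is.na(x))    # 2: number of missing elements in x
which(is.na(x))  # 2 5: positions of the missing elements
```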

Other Special Values

Inf is infinity. You can have either positive or negative infinity.

NaN means Not a Number. It’s an undefined value.
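These special values typically arise from arithmetic. The sketch below shows a few expressions that produce them and the base R predicates that detect them:

```r
1 / 0      # Inf
-1 / 0     # -Inf
0 / 0      # NaN
Inf - Inf  # NaN

vals <- c(1, Inf, NaN, NA)
is.finite(vals)  # TRUE FALSE FALSE FALSE
is.nan(vals)     # FALSE FALSE TRUE FALSE
is.na(vals)      # FALSE FALSE TRUE TRUE (NaN also counts as missing)
```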

R will create a resulting vector with a mode that can most easily accommodate all the elements it contains. This conversion between modes of storage is called “coercion”. When R converts the mode of storage based on its content, it is referred to as “implicit coercion”. For instance, can you guess what the following do (without running them first)?

xx <- c(1.7, "a")
xx <- c(TRUE, 2)
xx <- c("a", TRUE)

You can also control how vectors are coerced explicitly using the as.<class_name>() functions:
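A few illustrative calls to the as.*() family follow. The input values are chosen for this sketch and are not taken from the lesson:

```r
as.numeric("1")         # 1
as.character(1:2)       # "1" "2"
as.logical(c(0, 1, 2))  # FALSE TRUE TRUE (zero is FALSE, anything else TRUE)
as.integer(3.9)         # 3 (the fractional part is dropped)

# Coercion that cannot succeed yields NA, normally with a warning
suppressWarnings(as.numeric("a"))  # NA
```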

Do you see a property that’s common to all these vectors above?

All vectors are one-dimensional and each element is of the same type.

Object Attributes

Objects can have attributes. Attributes are part of the object. These include:

  • names
  • dimnames
  • dim
  • class
  • attributes (contain metadata)

You can also glean other attribute-like information such as length (works on vectors and lists) or number of characters (for character strings).

nchar("Software Carpentry")

Matrix

In R, matrices are an extension of numeric or character vectors. They are not a separate type of object but simply an atomic vector with dimensions: the number of rows and columns. As with atomic vectors, the elements of a matrix must be of the same data type.

m <- matrix(nrow = 2, ncol = 2)
m

     [,1] [,2]
[1,]   NA   NA
[2,]   NA   NA

You can check that matrices are vectors with a class attribute of matrix by using class() and typeof().

m <- matrix(c(1:3))
class(m)

While class() shows that m is a matrix, typeof() shows that fundamentally the matrix is an integer vector.
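The comparison can be sketched as follows. Note that in R 4.0.0 and later, class() on a matrix returns two values, "matrix" and "array", whereas older versions return only "matrix":

```r
m <- matrix(c(1:3))

class(m)       # "matrix" "array" in R >= 4.0.0; "matrix" in older versions
typeof(m)      # "integer": the underlying storage is an integer vector
dim(m)         # 3 1: three rows, one column
attributes(m)  # the dim attribute is what makes this vector a matrix
```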

Consider the following matrix:

FOURS <- matrix( c(4, 4, 4, 4), nrow = 2, ncol = 2)

Given that typeof(FOURS[1]) returns "double", what would you expect typeof(FOURS) to return? How do you know this is the case even without running this code?

Hint: Can matrices be composed of elements of different data types?

We know that typeof(FOURS) will also return "double" since matrices are made of elements of the same data type. Note that you could do something like as.character(FOURS) if you needed the elements of FOURS as characters.

Matrices in R are filled column-wise.

m <- matrix(1:6, nrow = 2, ncol = 3)
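Printing the matrix makes the column-wise fill visible: the values 1 to 6 run down the first column before moving on to the next.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
m
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

m[, 1]  # first column: 1 2
```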

Other ways to construct a matrix

m <- 1:10
dim(m) <- c(2, 5)

This takes a vector and transforms it into a matrix with 2 rows and 5 columns.

Another way is to bind columns or rows using rbind() and cbind() (“row bind” and “column bind”, respectively).

x <- 1:3
y <- 10:12
cbind(x, y)

     x  y
[1,] 1 10
[2,] 2 11
[3,] 3 12

rbind(x, y)

  [,1] [,2] [,3]
x    1    2    3
y   10   11   12

You can also use the byrow argument to specify how the matrix is filled. From R’s own documentation:

mdat <- matrix(c(1, 2, 3, 11, 12, 13),
               nrow = 2,
               ncol = 3,
               byrow = TRUE)
mdat

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]   11   12   13

Elements of a matrix can be referenced by specifying the index along each dimension (e.g. “row” and “column”) in single square brackets.
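A minimal sketch of matrix indexing, reusing the mdat matrix defined above. Leaving an index empty selects everything along that dimension:

```r
mdat <- matrix(c(1, 2, 3, 11, 12, 13), nrow = 2, ncol = 3, byrow = TRUE)

mdat[2, 3]  # row 2, column 3: 13
mdat[1, ]   # entire first row: 1 2 3
mdat[, 2]   # entire second column: 2 12
```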

List

In R, lists act as containers. Unlike atomic vectors, the contents of a list are not restricted to a single mode and can encompass any mixture of data types. Lists are sometimes called generic vectors, because the elements of a list can be any type of R object, even lists containing further lists. This property makes them fundamentally different from atomic vectors.

A list is a special type of vector. Each element can be a different type.

Create lists using list() or coerce other objects using as.list(). An empty list of the required length can be created using vector().

x <- list(1, "a", TRUE, 1+4i)
x

[[1]]
[1] 1

[[2]]
[1] "a"

[[3]]
[1] TRUE

[[4]]
[1] 1+4i

x <- vector("list", length = 5) # empty list
length(x)

The content of elements of a list can be retrieved by using double square brackets.

Vectors can be coerced to lists as follows:

x <- 1:10
x <- as.list(x)
length(x)

  1. What is the class of x[1]?
  2. What is the class of x[[1]]?

Elements of a list can be named (i.e. lists can have the names attribute)

xlist <- list(a = "Karthik Ram", b = 1:10, data = head(mtcars))
xlist

$a
[1] "Karthik Ram"

$b
 [1]  1  2  3  4  5  6  7  8  9 10

$data
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

  1. What is the length of this object?
  2. What is its structure?
List of 3
 $ a   : chr "Karthik Ram"
 $ b   : int [1:10] 1 2 3 4 5 6 7 8 9 10
 $ data:'data.frame': 6 obs. of 11 variables:
  ..$ mpg : num [1:6] 21 21 22.8 21.4 18.7 18.1
  ..$ cyl : num [1:6] 6 6 4 6 8 6
  ..$ disp: num [1:6] 160 160 108 258 360 225
  ..$ hp  : num [1:6] 110 110 93 110 175 105
  ..$ drat: num [1:6] 3.9 3.9 3.85 3.08 3.15 2.76
  ..$ wt  : num [1:6] 2.62 2.88 2.32 3.21 3.44 ...
  ..$ qsec: num [1:6] 16.5 17 18.6 19.4 17 ...
  ..$ vs  : num [1:6] 0 0 1 1 0 1
  ..$ am  : num [1:6] 1 1 1 0 0 0
  ..$ gear: num [1:6] 4 4 4 3 3 3
  ..$ carb: num [1:6] 4 4 1 1 2 1

Lists can be extremely useful inside functions. Because the functions in R are able to return only a single object, you can “staple” together lots of different kinds of results into a single object that a function can return.
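As a sketch of this idea, the hypothetical helper summarize_values() below (not part of the lesson) bundles several results of different types into one list that the function returns:

```r
# summarize_values() is a hypothetical helper used only for illustration
summarize_values <- function(v) {
  list(
    n = length(v),         # an integer
    average = mean(v),     # a single number
    range = range(v),      # a numeric vector of length 2
    label = "summary of v" # a character string
  )
}

res <- summarize_values(c(2, 4, 9))
res$n        # 3
res$average  # 5
res$range    # 2 9
```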

A list does not print to the console like a vector. Instead, each element of the list starts on a new line.

Elements are indexed by double brackets. Single brackets will still return a(nother) list. If the elements of a list are named, they can be referenced by the $ notation (i.e. xlist$data).
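The difference between the three forms of indexing can be sketched with the xlist object from above (redefined here so the example is self-contained):

```r
xlist <- list(a = "Karthik Ram", b = 1:10, data = head(mtcars))

class(xlist[1])    # "list": single brackets return a list of length 1
class(xlist[[1]])  # "character": double brackets return the element itself
xlist[["b"]]       # elements can also be extracted by name
identical(xlist$b, xlist[["b"]])  # TRUE: $ is shorthand for [[ ]] with a name
```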

Data Frame

A data frame is a very important data type in R. It’s pretty much the de facto data structure for most tabular data and what we use for statistics.

A data frame is a special type of list where every element of the list has the same length (i.e. a data frame is a “rectangular” list).

Data frames can have additional attributes such as rownames(), which can be useful for annotating data, like subject_id or sample_id. But most of the time they are not used.

Some additional information on data frames:

  • Usually created by read.csv() and read.table(), i.e. when importing the data into R.
  • Assuming all columns in a data frame are of the same type, the data frame can be converted to a matrix with data.matrix() (preferred) or as.matrix(). Otherwise type coercion will be enforced and the results may not always be what you expect.
  • A new data frame can also be created with the data.frame() function.
  • Find the number of rows and columns with nrow(dat) and ncol(dat), respectively.
  • Rownames are often automatically generated and look like 1, 2, …, n. Consistency in numbering of rownames may not be honored when rows are reshuffled or subset.

Creating Data Frames by Hand

To create data frames by hand:

dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)
dat

   id  x  y
1   a  1 11
2   b  2 12
3   c  3 13
4   d  4 14
5   e  5 15
6   f  6 16
7   g  7 17
8   h  8 18
9   i  9 19
10  j 10 20

Useful Data Frame Functions

  • head() - shows first 6 rows
  • tail() - shows last 6 rows
  • dim() - returns the dimensions of data frame (i.e. number of rows and number of columns)
  • nrow() - number of rows
  • ncol() - number of columns
  • str() - structure of data frame - name, type and preview of data in each column
  • names() or colnames() - both show the names attribute for a data frame
  • sapply(dataframe, class) - shows the class of each column in the data frame
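A quick sketch of these functions applied to the dat data frame created above. The n argument to tail() is used here only to show that the default of 6 rows can be overridden:

```r
dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)

head(dat)           # first 6 rows
tail(dat, n = 3)    # last 3 rows
dim(dat)            # 10 3
nrow(dat)           # 10
ncol(dat)           # 3
str(dat)            # name, type, and preview of each column
names(dat)          # "id" "x" "y"
sapply(dat, class)  # class of each column
```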

See that it is actually a special list:

Because data frames are rectangular, elements of data frame can be referenced by specifying the row and the column index in single square brackets (similar to matrix).

As data frames are also lists, it is possible to refer to columns (which are elements of such list) using the list notation, i.e. either double square brackets or a $.

[1] 11 12 13 14 15 16 17 18 19 20

[1] 11 12 13 14 15 16 17 18 19 20
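The following sketch, again using dat, checks that a data frame is a list and shows the matrix-style and list-style ways of reaching the same column:

```r
dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)

is.list(dat)  # TRUE: a data frame is a list of equal-length columns
class(dat)    # "data.frame"

dat[1, 3]     # row 1, column 3: 11
dat[["y"]]    # column y via double square brackets
dat$y         # the same column via $
```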

The following table summarizes the one-dimensional and two-dimensional data structures in R in relation to diversity of data types they can contain.

Dimensions   Homogeneous     Heterogeneous
1-D          atomic vector   list
2-D          matrix          data frame

Lists can contain elements that are themselves multi-dimensional (e.g. a list can contain data frames or other types of objects). Lists can also contain elements of any length, so a list does not necessarily have to be “rectangular”. However, for a list to qualify as a data frame, the length of each element has to be the same.

Knowing that data frames are lists, can columns be of different type?

What type of structure do you expect to see when you explore the structure of the PlantGrowth data frame? Hint: Use str().

The weight column is numeric while group is a factor. Lists can have elements of different types. Since a Data Frame is just a special type of list, it can have columns of differing type (although, remember that type must be consistent within each column!).

'data.frame': 30 obs. of 2 variables:
 $ weight: num 4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
 $ group : Factor w/ 3 levels "ctrl","trt1",..: 1 1 1 1 1 1 1 1 1 1 ...

  • R’s basic data types are character, numeric, integer, complex, and logical.

  • R’s basic data structures include the vector, list, matrix, data frame, and factors. Some of these structures require that all members be of the same data type (e.g. vectors, matrices) while others permit multiple data types (e.g. lists, data frames).

  • Objects may have attributes, such as name, dimension, and class.


Page 2

Teaching: 15 min
Exercises: 0 min

Questions

  • What is the call stack, and how does R know what order to do things in?

  • How does scope work in R?

Objectives

  • Explain how stack frames are created and destroyed as functions are called.

  • Correctly identify the scope of a function’s local variables.

  • Explain variable scope in terms of the call stack.

Let’s take a closer look at what happens when we call fahrenheit_to_kelvin(32). To make things clearer, we’ll start by putting the initial value 32 in a variable and store the final result in one as well:

original <- 32
final <- fahrenheit_to_kelvin(original)

The diagram below shows what memory looks like after the first line has been executed:


When we call fahrenheit_to_kelvin, R doesn’t create the variable temp_F right away. Instead, it creates something called a stack frame to keep track of the variables defined by fahrenheit_to_kelvin. Initially, this stack frame only holds the value of temp_F:


When we call fahrenheit_to_celsius inside fahrenheit_to_kelvin, R creates another stack frame to hold fahrenheit_to_celsius’s variables:


It does this because there are now two variables in play called temp_F: the argument to fahrenheit_to_celsius, and the argument to fahrenheit_to_kelvin. Having two variables with the same name in the same part of the program would be ambiguous, so R (and every other modern programming language) creates a new stack frame for each function call to keep that function’s variables separate from those defined by other functions.

When the call to fahrenheit_to_celsius returns a value, R throws away fahrenheit_to_celsius’s stack frame and creates a new variable in the stack frame for fahrenheit_to_kelvin to hold the temperature in Celsius:


It then calls celsius_to_kelvin, which means it creates a stack frame to hold that function’s variables:


Once again, R throws away that stack frame when celsius_to_kelvin is done and creates the variable temp_K in the stack frame for fahrenheit_to_kelvin:


Finally, when fahrenheit_to_kelvin is done, R throws away its stack frame and puts its result in a new variable called final that lives in the stack frame we started with:


This final stack frame is always there; it holds the variables we defined outside the functions in our code. What it doesn’t hold is the variables that were in the various stack frames. If we try to get the value of temp_F after our functions have finished running, R tells us that there’s no such thing:

Error in eval(expr, envir, enclos): object 'temp_F' not found
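The walk-through above can be reproduced with the sketch below. The three function definitions are assumptions reconstructed from the description (the lesson refers to them but does not show them here); only the variable names temp_F, temp_C, and temp_K come from the text.

```r
# Assumed definitions, reconstructed from the description above
fahrenheit_to_celsius <- function(temp_F) {
  temp_C <- (temp_F - 32) * 5 / 9
  return(temp_C)
}

celsius_to_kelvin <- function(temp_C) {
  temp_K <- temp_C + 273.15
  return(temp_K)
}

fahrenheit_to_kelvin <- function(temp_F) {
  temp_C <- fahrenheit_to_celsius(temp_F)  # new stack frame, discarded on return
  temp_K <- celsius_to_kelvin(temp_C)      # another new stack frame
  return(temp_K)
}

original <- 32
final <- fahrenheit_to_kelvin(original)
final             # 273.15

# temp_F existed only inside the functions' stack frames, which are gone now
exists("temp_F")  # FALSE
```

Using exists() instead of typing temp_F directly lets the check run without stopping the script with the error shown above.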

The explanation of the stack frame above was very general and the basic concept will help you understand most languages you try to program with. However, R has some unique aspects that can be exploited when performing more complicated operations. We will not be writing anything that requires knowledge of these more advanced concepts. In the future when you are comfortable writing functions in R, you can learn more by reading the R Language Manual or this chapter from Advanced R Programming by Hadley Wickham. For context, R uses the terminology “environments” instead of frames.

Why go to all this trouble? Well, here’s a function called span that calculates the difference between the minimum and maximum values in an array:

span <- function(a) {
  diff <- max(a) - min(a)
  return(diff)
}

dat <- read.csv(file = "data/inflammation-01.csv", header = FALSE)
# span of inflammation data
span(dat)

Notice span assigns a value to variable called diff. We might very well use a variable with the same name (diff) to hold the inflammation data:

diff <- read.csv(file = "data/inflammation-01.csv", header = FALSE)
# span of inflammation data
span(diff)

We don’t expect the variable diff to have the value 20 after this function call, so the name diff cannot refer to the same variable defined inside span as it does in the main body of our program (which R refers to as the global environment). And yes, we could probably choose a different name than diff for our variable in this case, but we don’t want to have to read every line of code of the R functions we call to see what variable names they use, just in case they change the values of our variables.
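A self-contained sketch of the same point, assuming a small numeric vector as a stand-in for the inflammation data so that no CSV file is needed:

```r
span <- function(a) {
  diff <- max(a) - min(a)  # this diff lives in span's own stack frame
  return(diff)
}

# Stand-in for the inflammation data (values chosen for illustration)
diff <- c(0, 5, 20, 3)

span(diff)  # 20
diff        # still 0 5 20 3: the assignment inside span did not touch it
```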

The big idea here is encapsulation, and it’s the key to writing correct, comprehensible programs. A function’s job is to turn several operations into one so that we can think about a single function call instead of a dozen or a hundred statements each time we want to do something. That only works if functions don’t interfere with each other; if they do, we have to pay attention to the details once again, which quickly overloads our short-term memory.

We previously wrote functions called highlight and edges. Draw a diagram showing how the call stack changes when we run the following:

inner_vec <- "carbon"
outer_vec <- "+"
result <- edges(highlight(inner_vec, outer_vec))

  • R keeps track of active function calls using a call stack comprised of stack frames.

  • Only global variables and variables in the current stack frame can be accessed directly.


Page 3

Teaching: 30 min
Exercises: 0 min

Questions

  • How can I do the same thing multiple times more efficiently in R?

  • What is vectorization?

  • Should I use a loop or an apply statement?

Objectives

  • Compare loops and vectorized operations.

  • Use the apply family of functions.

In R you have multiple options when repeating calculations: vectorized operations, for loops, and apply functions.

This lesson is an extension of Analyzing Multiple Data Sets. In that lesson, we introduced how to run a custom function, analyze, over multiple data files:

analyze <- function(filename) {
  # Plots the average, min, and max inflammation over time.
  # Input is character string of a csv file.
  dat <- read.csv(file = filename, header = FALSE)
  avg_day_inflammation <- apply(dat, 2, mean)
  plot(avg_day_inflammation)
  max_day_inflammation <- apply(dat, 2, max)
  plot(max_day_inflammation)
  min_day_inflammation <- apply(dat, 2, min)
  plot(min_day_inflammation)
}

filenames <- list.files(path = "data", pattern = "inflammation-[0-9]{2}.csv", full.names = TRUE)

Vectorized Operations

A key difference between R and many other languages is a topic known as vectorization. When you wrote the total function, we mentioned that R already has sum to do this; sum is much faster than the interpreted for loop because sum is coded in C to work with a vector of numbers. Many of R’s functions work this way; the loop is hidden from you in C. Learning to use vectorized operations is a key skill in R.

For example, to add pairs of numbers contained in two vectors:

a <- 1:10
b <- 1:10

You could loop over the pairs adding each in turn, but that would be very inefficient in R.

Instead of using i in a to make our loop variable, we use the function seq_along to generate indices for each element a contains.

res <- numeric(length = length(a))
for (i in seq_along(a)) {
  res[i] <- a[i] + b[i]
}
res

[1] 2 4 6 8 10 12 14 16 18 20

Instead, + is a vectorized function which can operate on entire vectors at once

res2 <- a + b
all.equal(res, res2)

Vector Recycling

When performing vector operations in R, it is important to know about recycling. If you perform an operation on two or more vectors of unequal length, R will recycle elements of the shorter vector(s) to match the longest vector. For example:

[1] 2 4 6 8 10 7 9 11 13 15

The elements of a and b are added together starting from the first element of both vectors. When R reaches the end of the shorter vector b, it starts again at the first element of b and continues until it reaches the last element of the longest vector a. This behaviour may seem crazy at first glance, but it is very useful when you want to perform the same operation on every element of a vector. For example, say we want to multiply every element of our vector a by 5:

[1] 5 10 15 20 25 30 35 40 45 50

Remember there are no scalars in R, so the value 5 is actually a vector of length 1; in order to multiply every element of a by it, R recycles it to match the length of a.

When the length of the longer object is a multiple of the shorter object length (as in our example above), the recycling occurs silently. When the longer object length is not a multiple of the shorter object length, a warning is given:

Warning in a + b: longer object length is not a multiple of shorter object length

[1] 2 4 6 8 10 12 14 9 11 13
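The three outputs in this section are consistent with the sketch below. The vector definitions are assumptions inferred from the printed results (a as 1:10, and shorter vectors of length 5, 1, and 7):

```r
a <- 1:10

# Length 5 divides length 10, so the shorter vector is reused twice, silently
half <- 1:5
sum_half <- a + half
sum_half  # 2 4 6 8 10 7 9 11 13 15

# A length-1 vector is recycled for every element of a
times_five <- a * 5
times_five  # 5 10 15 20 25 30 35 40 45 50

# Length 7 does not divide length 10: R still recycles, but emits a warning
seven <- 1:7
sum_seven <- a + seven
sum_seven  # 2 4 6 8 10 12 14 9 11 13
```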

for or apply?

A for loop is used to apply the same function calls to a collection of objects. R has a family of functions, the apply family, which can be used in much the same way. You’ve already used one of the family, apply in the first lesson. The apply family members include

  • apply - apply over the margins of an array (e.g. the rows or columns of a matrix)
  • lapply - apply over an object and return list
  • sapply - apply over an object and return a simplified object (an array) if possible
  • vapply - similar to sapply but you specify the type of object returned by the iterations
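A small sketch contrasting the four functions. The matrix m and the character vector words are illustrative inputs, not data from the lesson:

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
apply(m, 1, sum)  # over rows (margin 1): 9 12
apply(m, 2, sum)  # over columns (margin 2): 3 7 11

words <- c("a", "swc", "dataset")
lapply(words, nchar)  # always a list, one element per input
sapply(words, nchar)  # simplified to a named integer vector: 1 3 7

# vapply requires the expected shape of each result up front
# and stops with an error if a result does not match it
vapply(words, nchar, FUN.VALUE = integer(1))
```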

Each of these has an argument FUN which takes a function to apply to each element of the object. Instead of looping over filenames and calling analyze, as you did earlier, you could sapply over filenames with FUN = analyze:

sapply(filenames, FUN = analyze)

Deciding whether to use for or one of the apply family is really personal preference. Using an apply family function forces you to encapsulate your operations as a function rather than as separate calls within a for loop. for loops are often more natural in some circumstances; for several related operations, a for loop saves you from having to pass a lot of extra arguments to your function.

Loops in R Are Slow

No, they are not! If you follow some golden rules:

  1. Don’t use a loop when a vectorized alternative exists
  2. Don’t grow objects (via c, cbind, etc) during the loop - R has to create a new object and copy across the information just to add a new element or row/column
  3. Allocate an object to hold the results and fill it in during the loop

As an example, we’ll create a new version of analyze that will return the mean inflammation per day (column) of each file.

analyze2 <- function(filenames) {
  for (f in seq_along(filenames)) {
    fdata <- read.csv(filenames[f], header = FALSE)
    res <- apply(fdata, 2, mean)
    if (f == 1) {
      out <- res
    } else {
      # The loop is slowed by this call to cbind that grows the object
      out <- cbind(out, res)
    }
  }
  return(out)
}

system.time(avg2 <- analyze2(filenames))

   user  system elapsed
  0.027   0.000   0.026

Note how we add a new column to out at each iteration? This is a cardinal sin of writing a for loop in R.

Instead, we can create an empty matrix with the right dimensions (rows/columns) to hold the results. Then we loop over the files but this time we fill in the fth column of our results matrix out. This time there is no copying/growing for R to deal with.

analyze3 <- function(filenames) {
  out <- matrix(ncol = length(filenames), nrow = 40) # assuming 40 here from files
  for (f in seq_along(filenames)) {
    fdata <- read.csv(filenames[f], header = FALSE)
    out[, f] <- apply(fdata, 2, mean)
  }
  return(out)
}

system.time(avg3 <- analyze3(filenames))

   user  system elapsed
  0.024   0.000   0.024

In this simple example there is little difference in the compute time of analyze2 and analyze3. This is because we are only iterating over 12 files and hence we only incur 12 copy/grow operations. If we were doing this over more files or the data objects we were growing were larger, the penalty for copying/growing would be much larger.
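To see the penalty without the inflammation files, the sketch below compares the two patterns on an in-memory task. The functions grow() and prealloc() and the size n are hypothetical choices for illustration, and timings will vary by machine, so none are shown:

```r
n <- 10000

# Growing: c() creates a new, longer vector and copies into it on every iteration
grow <- function(n) {
  out <- c()
  for (i in seq_len(n)) {
    out <- c(out, i^2)
  }
  out
}

# Preallocating: the result vector is created once and filled in place
prealloc <- function(n) {
  out <- numeric(n)
  for (i in seq_len(n)) {
    out[i] <- i^2
  }
  out
}

system.time(res_grow <- grow(n))
system.time(res_pre <- prealloc(n))
all.equal(res_grow, res_pre)  # TRUE: same result either way
```

Increasing n should widen the gap between the two timings, since each call to c() in grow() copies an ever longer vector.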

Note that apply handles these memory allocation issues for you, but then you have to write the loop part as a function to pass to apply. At its heart, apply is just a for loop with extra convenience.

  • Where possible, use vectorized operations instead of for loops to make code faster and more concise.

  • Use functions such as apply instead of for loops to operate on the values in a data structure.