To make the best of the R language, you’ll need a strong understanding of the basic data types and data structures and how to operate on them. Data structures are very important to understand because these are the objects you will manipulate on a day-to-day basis in R. Dealing with object conversions is one of the most common sources of frustration for beginners.

Everything in R is an object. R has 6 basic data types: character, numeric (real or decimal), integer, logical, complex, and raw. Only the first five are covered here; raw will not be discussed in this workshop.
Elements of these data types may be combined to form data structures, such as atomic vectors. When we call a vector atomic, we mean that the vector only holds data of a single data type. Below are examples of atomic character vectors, numeric vectors, integer vectors, etc.
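As a rough illustration of the atomic vectors described above, the sketch below builds one vector of each of the five types and checks it with typeof(). The variable names are made up for this example and do not come from the lesson.

```r
# Illustrative atomic vectors; every name here is hypothetical
chr_vec <- c("a", "b", "c")       # character
num_vec <- c(1.5, 2, 3.25)        # numeric (stored as double)
int_vec <- c(1L, 2L, 3L)          # integer (note the L suffix)
lgl_vec <- c(TRUE, FALSE, TRUE)   # logical
cpl_vec <- c(1+2i, 3-1i)          # complex

# Collect the low-level type of each vector
types <- c(typeof(chr_vec), typeof(num_vec), typeof(int_vec),
           typeof(lgl_vec), typeof(cpl_vec))
types
```

Note that typeof() reports "double" for a numeric vector, which is the storage type R uses for real numbers.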
R provides many functions to examine features of vectors and other objects, for example typeof(), class(), length() and str():

    # Example
    x <- "dataset"
    typeof(x)

R has many data structures. These include atomic vectors, lists, matrices, and data frames.
Vectors

A vector is the most common and basic data structure in R and is pretty much the workhorse of R. Technically, vectors can be one of two types, atomic vectors or lists, although the term “vector” most commonly refers to the atomic types, not to lists.

The Different Vector Modes

A vector is a collection of elements that are most commonly of mode character, logical, integer or numeric.

You can create an empty vector with vector(). (By default the mode is logical. You can be more explicit as shown in the examples below.) It is more common to use direct constructors such as character(), numeric(), etc.

    vector() # an empty 'logical' (the default) vector
    vector("character", length = 5) # a vector of mode 'character' with 5 elements
    character(5) # the same thing, but using the constructor directly
    numeric(5)   # a numeric vector with 5 elements
    logical(5)   # a logical vector with 5 elements

    [1] FALSE FALSE FALSE FALSE FALSE

You can also create vectors by directly specifying their content. R will then guess the appropriate mode of storage for the vector. For instance:

    x <- c(1, 2, 3)

will create a vector x of mode numeric. These are the most common kind, and are treated as double precision real numbers. If you want to create integers explicitly, add an L to each element, as in c(1L, 2L, 3L), or coerce to the integer type using as.integer().

Using TRUE and FALSE will create a vector of mode logical:

    y <- c(TRUE, TRUE, FALSE, FALSE)

While using quoted text will create a vector of mode character:

    z <- c("Sarah", "Tracy", "Jon")

Examining Vectors

The functions typeof(), length(), class() and str() provide useful information about your vectors and R objects in general.

    str(z)

    chr [1:3] "Sarah" "Tracy" "Jon"

Adding Elements

The function c() (for combine) can also be used to add elements to a vector.

    z <- c(z, "Annette")
    z

    [1] "Sarah" "Tracy" "Jon" "Annette"

    z <- c("Greg", z)
    z

    [1] "Greg" "Sarah" "Tracy" "Jon" "Annette"

Vectors from a Sequence of Numbers

You can create vectors as a sequence of numbers.
    seq(from = 1, to = 10, by = 0.1)

     [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4
    [16]  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9
    [31]  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4
    [46]  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
    [61]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3  8.4
    [76]  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7  9.8  9.9
    [91] 10.0

Missing Data

R supports missing data in vectors. Missing values are represented as NA (Not Available) and can be used for all the vector types covered in this lesson:

    x <- c(0.5, NA, 0.7)
    x <- c(TRUE, FALSE, NA)
    x <- c("a", NA, "c", "d", "e")
    x <- c(1+5i, 2-3i, NA)

The function is.na() indicates the elements of the vectors that represent missing data, and the function anyNA() returns TRUE if the vector contains any missing values:

    x <- c("a", NA, "c", "d", NA)
    y <- c("a", "b", "c", "d", "e")
    is.na(x)

    [1] FALSE  TRUE FALSE FALSE  TRUE

    is.na(y)

    [1] FALSE FALSE FALSE FALSE FALSE

Other Special Values

Inf is infinity. You can have either positive or negative infinity. NaN means Not a Number. It’s an undefined value.

When you mix types inside a vector, R will create a resulting vector with a mode that can most easily accommodate all the elements it contains. This conversion between modes of storage is called “coercion”. When R converts the mode of storage based on its content, it is referred to as “implicit coercion”. For instance, can you guess what the following do (without running them first)?

    xx <- c(1.7, "a")
    xx <- c(TRUE, 2)
    xx <- c("a", TRUE)

You can also control how vectors are coerced explicitly using the as.<class_name>() functions:
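A minimal sketch of explicit coercion, using a few of the as.*() functions from base R. The input values and variable names are chosen purely for illustration.

```r
# Explicit coercion with the as.*() family (illustrative values)
as_num <- as.numeric("1")          # character -> numeric
as_chr <- as.character(1:2)        # integer -> character
as_lgl <- as.logical(c(0, 1, 2))   # numeric -> logical: 0 is FALSE, non-zero is TRUE

# When a value cannot be converted, R returns NA for it and emits a warning;
# the warning is suppressed here only to keep the example quiet
failed <- suppressWarnings(as.numeric(c("1", "b")))
failed
```

The last call shows why coercion deserves care: the string "b" has no numeric interpretation, so it silently becomes NA unless you pay attention to the warning.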
Object Attributes

Objects can have attributes. Attributes are part of the object. These include:
You can also glean other attribute-like information such as length (works on vectors and lists) or number of characters (for character strings).

    nchar("Software Carpentry")

Matrix

In R, matrices are an extension of the numeric or character vectors. They are not a separate type of object but simply an atomic vector with dimensions: the number of rows and columns. As with atomic vectors, the elements of a matrix must be of the same data type.

    m <- matrix(nrow = 2, ncol = 2)
    m

         [,1] [,2]
    [1,]   NA   NA
    [2,]   NA   NA

You can check that matrices are vectors with a class of matrix by using class() and typeof().

    m <- matrix(c(1:3))
    class(m)
    typeof(m)

While class() shows that m is a matrix, typeof() shows that fundamentally the matrix is an integer vector.
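To make the "vector with dimensions" idea concrete, here is a small sketch that inspects the attributes of a matrix and then strips them off again. The variable names are hypothetical.

```r
# A matrix is an atomic vector plus a dim attribute
m <- matrix(1:6, nrow = 2, ncol = 3)

attrs <- attributes(m)    # a list holding only the dim attribute
base_type <- typeof(m)    # the underlying storage type
flat <- as.vector(m)      # dropping the attributes leaves a plain vector

attrs$dim
base_type
flat
```

Because the only thing separating m from flat is the dim attribute, is.matrix(m) is TRUE while is.matrix(flat) is FALSE.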
Matrices in R are filled column-wise.

    m <- matrix(1:6, nrow = 2, ncol = 3)

Other ways to construct a matrix:

    m <- 1:10
    dim(m) <- c(2, 5)

This takes a vector and transforms it into a matrix with 2 rows and 5 columns.

Another way is to bind columns or rows using rbind() and cbind() (“row bind” and “column bind”, respectively).

    x <- 1:3
    y <- 10:12
    cbind(x, y)

         x  y
    [1,] 1 10
    [2,] 2 11
    [3,] 3 12

    rbind(x, y)

      [,1] [,2] [,3]
    x    1    2    3
    y   10   11   12

You can also use the byrow argument to specify how the matrix is filled. From R’s own documentation:

    mdat <- matrix(c(1, 2, 3, 11, 12, 13),
                   nrow = 2,
                   ncol = 3,
                   byrow = TRUE)
    mdat

         [,1] [,2] [,3]
    [1,]    1    2    3
    [2,]   11   12   13

Elements of a matrix can be referenced by specifying the index along each dimension (e.g. “row” and “column”) in single square brackets, for example mdat[2, 3].

List

In R, lists act as containers. Unlike atomic vectors, the contents of a list are not restricted to a single mode and can encompass any mixture of data types. Lists are sometimes called generic vectors, because the elements of a list can be of any type of R object, even lists containing further lists. This property makes them fundamentally different from atomic vectors.

A list is a special type of vector. Each element can be a different type.

Create lists using list() or coerce other objects using as.list(). An empty list of the required length can be created using vector().

    x <- list(1, "a", TRUE, 1+4i)
    x

    [[1]]
    [1] 1

    [[2]]
    [1] "a"

    [[3]]
    [1] TRUE

    [[4]]
    [1] 1+4i

    x <- vector("list", length = 5) # empty list
    length(x)

The content of elements of a list can be retrieved by using double square brackets.

Vectors can be coerced to lists as follows:

    x <- 1:10
    x <- as.list(x)
    length(x)
Elements of a list can be named (i.e. lists can have the names attribute):

    xlist <- list(a = "Karthik Ram", b = 1:10, data = head(mtcars))
    xlist

    $a
    [1] "Karthik Ram"

    $b
     [1]  1  2  3  4  5  6  7  8  9 10

    $data
                       mpg cyl disp  hp drat    wt  qsec vs am gear carb
    Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
    Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
    Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
    Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
    Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
    Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
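The three ways of reaching into a named list can be compared side by side. This is a small sketch with a made-up list (results, and the names inside it, are hypothetical) rather than the xlist shown above.

```r
# A hypothetical named list mixing two types
results <- list(name = "trial", values = 1:3)

as_sublist <- results["values"]     # single brackets: still a list, of length 1
as_vector  <- results[["values"]]   # double brackets: the element itself
via_dollar <- results$values        # $ notation: same as double brackets by name

class(as_sublist)
class(as_vector)
```

Single brackets keep the container, which is handy when you want a shorter list; double brackets and $ unwrap the element so you can compute with it directly.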
Lists can be extremely useful inside functions. Because the functions in R are able to return only a single object, you can “staple” together lots of different kinds of results into a single object that a function can return.

A list does not print to the console like a vector. Instead, each element of the list starts on a new line.

Elements are indexed by double brackets. Single brackets will still return a(nother) list. If the elements of a list are named, they can be referenced by the $ notation (i.e. xlist$data).

Data Frame

A data frame is a very important data type in R. It’s pretty much the de facto data structure for most tabular data and what we use for statistics.

A data frame is a special type of list where every element of the list has the same length (i.e. a data frame is a “rectangular” list).

Data frames can have additional attributes such as rownames(), which can be useful for annotating data, like subject_id or sample_id. But most of the time they are not used.

Some additional information on data frames:
Creating Data Frames by Hand

To create data frames by hand:

    dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)
    dat

       id  x  y
    1   a  1 11
    2   b  2 12
    3   c  3 13
    4   d  4 14
    5   e  5 15
    6   f  6 16
    7   g  7 17
    8   h  8 18
    9   i  9 19
    10  j 10 20
See that it is actually a special list:

    is.list(dat)

    [1] TRUE

Because data frames are rectangular, elements of a data frame can be referenced by specifying the row and the column index in single square brackets (similar to a matrix), for example dat[1, 3].

As data frames are also lists, it is possible to refer to columns (which are elements of such a list) using the list notation, i.e. either double square brackets or a $.

    dat[["y"]]

    [1] 11 12 13 14 15 16 17 18 19 20

    dat$y

    [1] 11 12 13 14 15 16 17 18 19 20

The following table summarizes the one-dimensional and two-dimensional data structures in R in relation to the diversity of data types they can contain.

    Dimensions   Homogeneous     Heterogeneous
    1-D          atomic vector   list
    2-D          matrix          data frame
Let’s take a closer look at what happens when we call fahrenheit_to_kelvin(32). To make things clearer, we’ll start by putting the initial value 32 in a variable and store the final result in one as well:

    original <- 32
    final <- fahrenheit_to_kelvin(original)

After the first line has been executed, memory holds a single variable, original, with the value 32.

When we call fahrenheit_to_kelvin, R doesn’t create the variable temp_F right away. Instead, it creates something called a stack frame to keep track of the variables defined by fahrenheit_to_kelvin. Initially, this stack frame only holds the value of temp_F.

When we call fahrenheit_to_celsius inside fahrenheit_to_kelvin, R creates another stack frame to hold fahrenheit_to_celsius’s variables. It does this because there are now two variables in play called temp_F: the argument to fahrenheit_to_celsius, and the argument to fahrenheit_to_kelvin. Having two variables with the same name in the same part of the program would be ambiguous, so R (like most modern programming languages) creates a new stack frame for each function call to keep that function’s variables separate from those defined by other functions.

When the call to fahrenheit_to_celsius returns a value, R throws away fahrenheit_to_celsius’s stack frame and creates a new variable in the stack frame for fahrenheit_to_kelvin to hold the temperature in Celsius.

It then calls celsius_to_kelvin, which means it creates a stack frame to hold that function’s variables. Once again, R throws away that stack frame when celsius_to_kelvin is done and creates the variable temp_K in the stack frame for fahrenheit_to_kelvin.

Finally, when fahrenheit_to_kelvin is done, R throws away its stack frame and puts its result in a new variable called final that lives in the stack frame we started with.

This final stack frame is always there; it holds the variables we defined outside the functions in our code. What it doesn’t hold is the variables that were in the various stack frames.
If we try to get the value of temp_F after our functions have finished running, R tells us that there’s no such thing:

    temp_F

    Error in eval(expr, envir, enclos): object 'temp_F' not found
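The walk-through above can be reproduced with a short sketch. The three function bodies below are assumptions: the definitions are not shown in this section, so they are written here in the obvious way from the function and variable names. The check at the end uses exists() instead of triggering the error.

```r
# Assumed definitions; the originals are not shown in this section
fahrenheit_to_celsius <- function(temp_F) {
  temp_C <- (temp_F - 32) * 5 / 9
  return(temp_C)
}

celsius_to_kelvin <- function(temp_C) {
  temp_K <- temp_C + 273.15
  return(temp_K)
}

fahrenheit_to_kelvin <- function(temp_F) {
  temp_C <- fahrenheit_to_celsius(temp_F)  # new stack frame, discarded on return
  temp_K <- celsius_to_kelvin(temp_C)      # another frame, also discarded
  return(temp_K)
}

original <- 32
final <- fahrenheit_to_kelvin(original)

# None of the function-local variables survive in the calling environment
leaked <- c(temp_F = exists("temp_F"),
            temp_C = exists("temp_C"),
            temp_K = exists("temp_K"))
leaked
```

Only original and final (plus the functions themselves) remain once the calls have returned, which is exactly what the error message above reports for temp_F.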
Why go to all this trouble? Well, here’s a function called span that calculates the difference between the minimum and maximum values in an array:

    span <- function(a) {
      diff <- max(a) - min(a)
      return(diff)
    }

    dat <- read.csv(file = "data/inflammation-01.csv", header = FALSE)
    # span of inflammation data
    span(dat)

Notice that span assigns a value to a variable called diff. We might very well use a variable with the same name (diff) to hold the inflammation data:

    diff <- read.csv(file = "data/inflammation-01.csv", header = FALSE)
    # span of inflammation data
    span(diff)

We don’t expect the variable diff to have the value 20 after this function call, so the name diff cannot refer to the same variable defined inside span as it does in the main body of our program (which R refers to as the global environment). And yes, we could probably choose a different name than diff for our variable in this case, but we don’t want to have to read every line of code of the R functions we call to see what variable names they use, just in case they change the values of our variables.

The big idea here is encapsulation, and it’s the key to writing correct, comprehensible programs. A function’s job is to turn several operations into one so that we can think about a single function call instead of a dozen or a hundred statements each time we want to do something. That only works if functions don’t interfere with each other; if they do, we have to pay attention to the details once again, which quickly overloads our short-term memory.
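The same point can be checked without the inflammation file. In this sketch the data are a made-up numeric vector standing in for the CSV contents, and the names result and unchanged are hypothetical.

```r
span <- function(a) {
  diff <- max(a) - min(a)   # this diff lives only in span's stack frame
  return(diff)
}

# Made-up stand-in for the inflammation data
diff <- c(0, 3, 20, 7)

result <- span(diff)

# The global diff still holds the data, not the value computed inside span
unchanged <- identical(diff, c(0, 3, 20, 7))
result
unchanged
```

The assignment inside span never touches the global diff, so the caller does not need to know which variable names the function uses internally.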
In R you have multiple options when repeating calculations: vectorized operations, for loops, and apply functions.

This lesson is an extension of Analyzing Multiple Data Sets. In that lesson, we introduced how to run a custom function, analyze, over multiple data files:

    analyze <- function(filename) {
      # Plots the average, min, and max inflammation over time.
      # Input is character string of a csv file.
      dat <- read.csv(file = filename, header = FALSE)
      avg_day_inflammation <- apply(dat, 2, mean)
      plot(avg_day_inflammation)
      max_day_inflammation <- apply(dat, 2, max)
      plot(max_day_inflammation)
      min_day_inflammation <- apply(dat, 2, min)
      plot(min_day_inflammation)
    }

    filenames <- list.files(path = "data", pattern = "inflammation-[0-9]{2}.csv", full.names = TRUE)

Vectorized Operations

A key difference between R and many other languages is a topic known as vectorization. When you wrote the total function, we mentioned that R already has sum to do this; sum is much faster than the interpreted for loop because sum is coded in C to work with a vector of numbers. Many of R’s functions work this way; the loop is hidden from you in C. Learning to use vectorized operations is a key skill in R.

For example, to add pairs of numbers contained in two vectors

    a <- 1:10
    b <- 1:10

you could loop over the pairs adding each in turn, but that would be very inefficient in R.

Instead of using i in a to make our loop variable, we use the function seq_along to generate indices for each element a contains.

    res <- numeric(length = length(a))
    for (i in seq_along(a)) {
      res[i] <- a[i] + b[i]
    }
    res

    [1]  2  4  6  8 10 12 14 16 18 20

Instead, + is a vectorized function which can operate on entire vectors at once:

    res2 <- a + b
    all.equal(res, res2)

Vector Recycling

When performing vector operations in R, it is important to know about recycling. If you perform an operation on two or more vectors of unequal length, R will recycle elements of the shorter vector(s) to match the longest vector.
For example:

    a <- 1:10
    b <- 1:5
    a + b

    [1]  2  4  6  8 10  7  9 11 13 15

The elements of a and b are added together starting from the first element of both vectors. When R reaches the end of the shorter vector b, it starts again at the first element of b and continues until it reaches the last element of the longest vector a.

This behaviour may seem strange at first glance, but it is very useful when you want to perform the same operation on every element of a vector. For example, say we want to multiply every element of our vector a by 5:

    a <- 1:10
    b <- 5
    a * b

    [1]  5 10 15 20 25 30 35 40 45 50

Remember there are no scalars in R, so b is actually a vector of length 1; in order to multiply every element of a by its value, it is recycled to match the length of a.

When the length of the longer object is a multiple of the shorter object length (as in our examples above), the recycling occurs silently. When the longer object length is not a multiple of the shorter object length, a warning is given:

    a <- 1:10
    b <- 1:7
    a + b

    Warning in a + b: longer object length is not a multiple of shorter object
    length

    [1]  2  4  6  8 10 12 14  9 11 13

for or apply?

A for loop is used to apply the same function calls to a collection of objects. R has a family of functions, the apply family, which can be used in much the same way. You’ve already used one of the family, apply, in the first lesson. The apply family members include apply, lapply, sapply, vapply, tapply and mapply.
Each of these has an argument FUN which takes a function to apply to each element of the object. Instead of looping over filenames and calling analyze, as you did earlier, you could sapply over filenames with FUN = analyze:

    sapply(filenames, FUN = analyze)

Deciding whether to use for or one of the apply family is really personal preference. Using an apply family function forces you to encapsulate your operations as a function rather than separate calls with for. for loops are more natural in some circumstances; for several related operations, a for loop saves you from having to pass a lot of extra arguments to your function.

Loops in R Are Slow

No, they are not! If you follow some golden rules:

1. Don’t use a loop when a vectorized alternative exists.
2. Don’t grow objects (via c, cbind, etc.) during the loop; R has to create a new object and copy the information across just to add a new element or row/column.
3. Allocate an object to hold the results and fill it in during the loop.
As an example, we’ll create a new version of analyze that will return the mean inflammation per day (column) of each file.

    analyze2 <- function(filenames) {
      for (f in seq_along(filenames)) {
        fdata <- read.csv(filenames[f], header = FALSE)
        res <- apply(fdata, 2, mean)
        if (f == 1) {
          out <- res
        } else {
          # The loop is slowed by this call to cbind that grows the object
          out <- cbind(out, res)
        }
      }
      return(out)
    }

    system.time(avg2 <- analyze2(filenames))

       user  system elapsed
      0.027   0.000   0.026

Note how we add a new column to out at each iteration? This is a cardinal sin of writing a for loop in R.

Instead, we can create an empty matrix with the right dimensions (rows/columns) to hold the results. Then we loop over the files but this time we fill in the fth column of our results matrix out. This time there is no copying/growing for R to deal with.

    analyze3 <- function(filenames) {
      out <- matrix(ncol = length(filenames), nrow = 40) # assuming 40 here from files
      for (f in seq_along(filenames)) {
        fdata <- read.csv(filenames[f], header = FALSE)
        out[, f] <- apply(fdata, 2, mean)
      }
      return(out)
    }

    system.time(avg3 <- analyze3(filenames))

       user  system elapsed
      0.024   0.000   0.024

In this simple example there is little difference in the compute time of analyze2 and analyze3. This is because we are only iterating over 12 files and hence we only incur 12 copy/grow operations. If we were doing this over more files or the data objects we were growing were larger, the penalty for copying/growing would be much larger.

Note that apply handles these memory allocation issues for you, but then you have to write the loop part as a function to pass to apply. At its heart, apply is just a for loop with extra convenience.
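To see the preallocated loop and an apply-family call side by side, here is a sketch that does not need the inflammation files. The datasets list is a made-up stand-in (three small random data frames), and vapply is used here as one possible apply-family choice, not necessarily the one the lesson intends.

```r
# Made-up stand-ins for the data files: three data frames of 4 rows x 5 columns
set.seed(1)
datasets <- lapply(1:3, function(i) {
  as.data.frame(matrix(runif(20), nrow = 4, ncol = 5))
})

# for loop filling a preallocated matrix, one column per data set
out_loop <- matrix(nrow = 5, ncol = length(datasets))
for (f in seq_along(datasets)) {
  out_loop[, f] <- apply(datasets[[f]], 2, mean)
}

# The same computation with vapply, which allocates the result for us;
# FUN.VALUE declares that each call returns 5 numeric values
out_apply <- vapply(datasets,
                    function(d) apply(d, 2, mean),
                    FUN.VALUE = numeric(5))

# unname() drops the row names vapply carries over from the column names
all.equal(out_loop, unname(out_apply))
```

Both versions produce a matrix with one row per column of the input and one column per data set; the loop makes the allocation explicit, while vapply hides it behind the function call.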