Data structures

Warm-up

Type the following into your console:

# Create a vector in R
x <- c(5, 29, 13, 87)
x

## [1]  5 29 13 87

Two important ideas:
- Commenting (we will come back to this)
- Assignment
  - The <- symbol means assign x the value c(5, 29, 13, 87).
  - Could use = instead of <- but this is discouraged.
  - All assignments take the same form: object_name <- value.
  - c() means “combine”: it concatenates its arguments into a single vector.
  - Type x into the console to print its value.
Note: the [1] tells us that 5 is the first element of the vector.

# Create a vector in R
x <- 1:50
x

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

Data structures

Dimensions	Homogeneous	Heterogeneous
1d	Atomic vector	List
2d	Matrix	Data frame
nd	Array	-

Almost all other objects are built upon these foundations.
R has no 0-dimensional, or scalar, types: individual numbers or strings are vectors of length one.
The best way to understand what data structures an object is composed of is str() (short for structure).

x <- c(5, 29, 13, 87)
str(x)

##  num [1:4] 5 29 13 87

Vector

Two flavors:
- atomic vectors,
- lists.
Three common properties:
- Type, typeof(), what it is.
- Length, length(), how many elements it contains.
- Attributes, attributes(), additional arbitrary metadata.
Main difference: elements of an atomic vector must be the same type, whereas those of a list can have different types.
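
As a minimal sketch of the three properties above (the objects v and l are made up for illustration), compare an atomic vector with a list:

```r
# An atomic vector: every element has the same type
v <- c(a = 1.5, b = 2.5)
typeof(v)       # "double"
length(v)       # 2
attributes(v)   # a list holding the names attribute

# A list: elements may have different types
l <- list(1L, "a", TRUE)
typeof(l)       # "list"
length(l)       # 3
attributes(l)   # NULL: no metadata attached
```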

Atomic vectors

Four primary types of atomic vectors: logical, integer, double, and character (which contains strings).
Integer and double vectors are known as numeric vectors.
There are two rare types: complex and raw (won’t be discussed further).

Scalars

Special syntax to create an individual value, AKA a scalar:

Logicals:
- In full (TRUE or FALSE),
- Abbreviated (T or F).
Doubles:
- Decimal (0.1234), scientific (1.23e4), or hexadecimal (0xcafe) form.
- Special values unique to doubles: Inf, -Inf, and NaN (not a number).
Integers:
- Similar to doubles but
  - must be followed by L (1234L, 1e4L, or 0xcafeL),
  - and can not contain fractional values.
Strings:
- Surrounded by " ("hi") or ' ('bye').
- Special characters escaped with \; see ?Quotes for details.
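
A quick way to check the literal syntax above is to ask typeof() what each form produces. This is only an illustrative sketch; the literals are the ones listed in the bullets.

```r
# Each literal creates a length-one vector of a particular type
typeof(TRUE)      # "logical"
typeof(0.1234)    # "double"
typeof(1.23e4)    # "double"
typeof(0xcafe)    # "double": hexadecimal literals are doubles unless suffixed with L
typeof(1234L)     # "integer"
typeof(1e4L)      # "integer"
typeof(0xcafeL)   # "integer"
typeof("hi")      # "character"

# Special values that only doubles can take
special <- c(1 / 0, -1 / 0, 0 / 0)   # Inf -Inf NaN

# Inside a string, a backslash escapes the next character
nchar("a\\b")     # 3: the escaped backslash counts as one character
```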

Making longer vectors with `c()`

To create longer vectors from shorter ones, use c():

lgl_var <- c(TRUE, FALSE)
int_var <- c(1L, 6L, 10L)
dbl_var <- c(1, 2.5, 4.5)
chr_var <- c("these are", "some strings")

(Figure: vectors depicted as connected rectangles.)

With atomic vectors, c() returns atomic vectors (i.e., flattens):

c(c(1, 2), c(3, 4))

## [1] 1 2 3 4

Determine the type and length of a vector with typeof() and length():

lgl_var <- c(TRUE, FALSE)
typeof(lgl_var)

## [1] "logical"

int_var <- c(1L, 6L, 10L)
typeof(int_var)

## [1] "integer"

dbl_var <- c(1, 2.5, 4.5)
typeof(dbl_var)

## [1] "double"

chr_var <- c("these are", "some strings")
typeof(chr_var)

## [1] "character"

Missing or unknown values

Represented with NA (short for “not available”).
Missing values tend to be infectious:

NA > 5

## [1] NA

10 * NA

## [1] NA

!NA

## [1] NA

Exception: when some identity holds for all possible inputs…

NA ^ 0

## [1] 1

NA | TRUE

## [1] TRUE

NA & FALSE

## [1] FALSE

Propagation of missingness leads to a common mistake when looking for missing values:

x <- c(NA, 5, NA, 10)
x == NA

## [1] NA NA NA NA

Instead, use is.na():

x <- c(NA, 5, NA, 10)
is.na(x)

## [1]  TRUE FALSE  TRUE FALSE
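
Building on is.na(), a common follow-up is to count or drop the missing values. The sketch below uses only base R; the variable names are illustrative.

```r
x <- c(NA, 5, NA, 10)

n_missing  <- sum(is.na(x))          # 2: TRUEs count as 1 when summed
x_observed <- x[!is.na(x)]           # 5 10: keep only the observed values
mean_all   <- mean(x)                # NA: missingness propagates
mean_obs   <- mean(x, na.rm = TRUE)  # 7.5: computed on the observed values only
```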

Testing and coercion

Test if a vector is of a given type with is.*(), but be careful:
- is.logical(), is.integer(), is.double(), and is.character() do what you might expect.
- Avoid is.vector(), is.atomic(), and is.numeric() or carefully read the documentation.
For atomic vectors:
- Type is a property of the entire vector (all elements of the same type).
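- When combining different types, elements are coerced to the most flexible type present, in the fixed order character $\to$ double $\to$ integer $\to$ logical (for example, combining a character with a double yields a character).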

str(c("a", 1))

##  chr [1:2] "a" "1"
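
The is.*() caveats and the coercion order can be checked directly. A small sketch follows; the attribute name "note" is made up for the example.

```r
# is.numeric() is TRUE for both integer and double vectors
is.numeric(1L)     # TRUE
is.numeric(1.5)    # TRUE
is.double(1L)      # FALSE: 1L is an integer

# is.vector() tests for "no attributes other than names", not "is a vector"
v <- 1:3
is.vector(v)                # TRUE
attr(v, "note") <- "hello"  # attach an arbitrary attribute
is.vector(v)                # FALSE

# When types are mixed, the most flexible type present wins
typeof(c(TRUE, 1L))   # "integer"
typeof(c(1L, 1.5))    # "double"
typeof(c(1.5, "a"))   # "character"
```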

Often happens automatically:
- Most mathematical functions (+, log, etc.) coerce to numeric.
- Useful for logical vectors because TRUE/FALSE become 1/0.

x <- c(FALSE, FALSE, TRUE)
as.numeric(x)

## [1] 0 0 1

c(sum(x), mean(x)) # Total number of TRUEs and proportion that are TRUE

## [1] 1.0000000 0.3333333

Additionally:
- Deliberately coerce by using as.*() (as.logical(), as.integer(), as.double(), or as.character()).
- Failed coercion of strings $\to$ warning and missing value.

as.integer(c("1", "1.5", "a"))

## Warning: NAs introduced by coercion

## [1]  1  1 NA

Attributes

How about matrices, arrays, factors, or date-times?

Built on top of atomic vectors by adding attributes.
In the next few topics:
- The dim attribute to make matrices and arrays.
- The class attribute to create “S3” vectors, including factors, dates, and date-times.

Getting and setting

Attributes are name-value pairs that attach metadata to an object.
Attributes can be retrieved/modified
- individually with attr(),
- or en masse with attributes()/structure().

a <- 1:3
attr(a, "x") <- "abcdef"
attr(a, "x")

## [1] "abcdef"

attr(a, "y") <- 4:6
str(attributes(a))

## List of 2
##  $ x: chr "abcdef"
##  $ y: int [1:3] 4 5 6

# Or equivalently
a <- structure(
  1:3,
  x = "abcdef",
  y = 4:6
)

Attributes should generally be thought of as ephemeral.
For example, most attributes are lost by most operations:

a <- structure(
  1:3,
  x = "abcdef",
  y = 4:6
)
attributes(a[1])

## NULL

attributes(sum(a))

## NULL

There are only two attributes that are routinely preserved:
- names, a character vector giving each element a name.
- dim, short for dimensions, an integer vector, used to turn vectors into matrices or arrays.
To preserve other attributes, you need to create your own S3 class!
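
To see the difference between routinely preserved attributes and the rest, here is a small sketch using subsetting (the attribute name "note" is made up):

```r
a <- structure(c(first = 1, second = 2, third = 3), note = "abcdef")

sub <- a[1:2]
names(sub)          # "first" "second": names survive subsetting
attr(sub, "note")   # NULL: the custom attribute is dropped

m <- matrix(1:6, nrow = 2)
dim(m + 1L)         # 2 3: dim survives arithmetic
```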

Names

You can name a vector in three ways:

# When creating it
x <- c(a = 1, b = 2, c = 3)
# By assigning a character vector to names()
x <- 1:3
names(x) <- c("a", "b", "c")
# Inline, with setNames()
x <- setNames(1:3, c("a", "b", "c"))

Avoid attr(x, "names") (more typing and less readable).
Remove names with unname(x) or names(x) <- NULL.
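
A short sketch of the two ways to remove names: unname() returns a copy without names, whereas assigning NULL modifies the object itself.

```r
x <- c(a = 1, b = 2, c = 3)
x[["b"]]           # 2: names allow lookup by name

y <- unname(x)     # a copy without names; x is unchanged
names(y)           # NULL
names(x)           # "a" "b" "c"

names(x) <- NULL   # remove the names from x itself
identical(x, y)    # TRUE
```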

Dimensions

Adding a dim attribute to a vector allows it to behave like a 2-dimensional matrix or a multi-dimensional array.
Most important feature: multidimensional subsetting, which we’ll see later.
Create matrices with matrix():

# Two scalar arguments specify row and column sizes
a <- matrix(1:6, nrow = 2, ncol = 3)
a

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

Or arrays with array():

b <- array(1:12, c(2, 3, 2))
b

## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12

Alternatively, use the assignment form of dim():

# You can also modify an object in place by setting dim()
c <- 1:6
dim(c) <- c(3, 2)
c

##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

Functions for working with vectors, matrices and arrays:

Vector	Matrix	Array
`names()`	`rownames()`, `colnames()`	`dimnames()`
`length()`	`nrow()`, `ncol()`	`dim()`
`c()`	`rbind()`, `cbind()`	`abind::abind()`
—	`t()`	`aperm()`
`is.null(dim(x))`	`is.matrix()`	`is.array()`
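
A few of the base R functions from the table, sketched on a small matrix and array (abind::abind() needs the abind package and is left out here):

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
rownames(m) <- c("r1", "r2")
colnames(m) <- c("a", "b", "c")

dim(m)                     # 2 3
dim(t(m))                  # 3 2: t() transposes a matrix
m2 <- rbind(m, r3 = 7:9)   # add a row    -> 3 x 3
m3 <- cbind(m, d = 7:8)    # add a column -> 2 x 4
dimnames(m)                # list of row names and column names

arr <- array(1:24, dim = c(2, 3, 4))
dim(aperm(arr))            # 4 3 2: by default aperm() reverses the dimensions
```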

A vector without a dim attribute set is often thought of as 1-dimensional, but actually has NULL dimensions.
You also can have matrices with a single row or single column, or arrays with a single dimension:
- They may print similarly, but will behave differently.
- The differences aren’t too important, but it’s useful to know they exist in case you get strange output from a function.
- As always, use str() to reveal the differences.

str(1:3) # 1d vector

##  int [1:3] 1 2 3

str(matrix(1:3, ncol = 1)) # column vector

##  int [1:3, 1] 1 2 3

str(matrix(1:3, nrow = 1)) # row vector

##  int [1, 1:3] 1 2 3

str(array(1:3, 3))

##  int [1:3(1d)] 1 2 3

S3 atomic vectors

One of the most important vector attributes is class, which underlies the S3 object system.
- Having a class attribute turns an object into an S3 object (i.e., it behaves differently when passed to a generic function).
- Every S3 object is built on top of a base type, and stores additional information in other attributes.
- More about the S3 object system later.
In the next few slides, four important S3 vectors in R:
- Categorical data (values come from a fixed set of levels): factor vectors.
- Dates (day resolution): Date vectors.
- Date-times (second or sub-second resolution): POSIXct vectors.
- Durations (between pairs of dates or date-times): difftime vectors.

Factors

A vector that can contain only predefined values.
Used to store categorical data.
Built on top of an integer vector with two attributes:
- a class (defines a behavior different from integer vectors),
- and levels (defines the set of allowed values).

x <- factor(c("a", "b", "b", "a"))
x

## [1] a b b a
## Levels: a b

typeof(x)

## [1] "integer"

attributes(x)

## $levels
## [1] "a" "b"
## 
## $class
## [1] "factor"

Useful when you know the set of possible values but they’re not all present in a given dataset.
When tabulating a factor you’ll get counts of all categories, even unobserved ones:

sex_char <- c("m", "m", "m")
table(sex_char)

## sex_char
## m 
## 3

sex_factor <- factor(sex_char, levels = c("m", "f"))
table(sex_factor)

## sex_factor
## m f 
## 3 0

Ordered factors:
- Behave like regular factors, but the order of the levels is meaningful (e.g., low, medium, high).
- This property is automatically leveraged by some modelling/visualisation functions.

grade <- ordered(c("b", "b", "a", "c"), levels = c("c", "b", "a"))
grade

## [1] b b a c
## Levels: c < b < a
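
Because the levels are ordered, comparisons and sorting follow the level order rather than alphabetical order. A minimal sketch with the same grade vector:

```r
grade <- ordered(c("b", "b", "a", "c"), levels = c("c", "b", "a"))

below_a <- grade < "a"          # TRUE TRUE FALSE TRUE
best    <- max(grade)           # a: the highest level present
sorted  <- sort(grade)          # c b b a
codes   <- as.integer(grade)    # 2 2 3 1: positions in the level order
```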

While factors look like character vectors, be careful:
- Some string methods (like gsub() and grepl()) will automatically coerce factors to strings.
- Others (like nchar()) will throw an error.
- Still others (like c() before R 4.1.0) use the underlying integer values; since R 4.1.0, c() combines factors into a factor.
- Best to explicitly convert factors to character vectors if you need string-like behavior.
In base R:
- Factors used to be frequent because, before R 4.0.0, many functions (e.g. read.csv()/data.frame()) automatically converted character vectors to factors.
- This was suboptimal because those functions have no way to know the set of all possible levels or their correct order: the levels are a property of theory or experimental design, not of the data.
- In older versions of R, use the argument stringsAsFactors = FALSE to suppress this behaviour (it is the default since R 4.0.0), and then manually convert character vectors to factors using your knowledge of the “theoretical” data.
The tidyverse:
- Never automatically coerces characters to factors.
- Provides the forcats package specifically for working with factors.
- More on that later.
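
The sketch below illustrates why explicit conversion is safer when a factor is treated like a string or a number, and how to supply levels from knowledge of the design. The vectors f and size are made up for the example.

```r
f <- factor(c("10", "20", "10"))

codes  <- as.integer(f)                 # 1 2 1: the underlying codes, not the labels
values <- as.integer(as.character(f))   # 10 20 10: convert to character first

# nchar() refuses factors; tryCatch() captures the error instead of stopping
nchar_result <- tryCatch(nchar(f), error = function(e) "nchar() failed")

# Levels supplied from knowledge of the design, not inferred from the data
size <- factor(c("small", "large", "small"),
               levels = c("small", "medium", "large"))
size_counts <- table(size)              # small 2, medium 0, large 1
```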

Time

Dates

Built on top of double vectors.
Have class Date and no other attributes.

today <- Sys.Date()
typeof(today)

## [1] "double"

attributes(today)

## $class
## [1] "Date"

Value of the double = the number of days since 1970-01-01:

date <- as.Date("1970-02-01")
unclass(date)

## [1] 31

Date-times

Two ways of storing this information: POSIXct, and POSIXlt.
Odd names:
- “POSIX” is short for “Portable Operating System Interface”,
- “ct” stands for calendar time (the time_t type in C),
- and “lt” for local time (the struct tm type in C).
Focus on POSIXct (the simplest):
- Built on top of a double vector.
- Value = number of seconds since 1970-01-01.

now_ct <- as.POSIXct("2022-03-31 12:00", tz = "UTC")
now_ct

## [1] "2022-03-31 12:00:00 UTC"

typeof(now_ct)

## [1] "double"

attributes(now_ct)

## $class
## [1] "POSIXct" "POSIXt" 
## 
## $tzone
## [1] "UTC"

The tzone attribute:
- Controls only how the date-time is formatted; not the represented instant of time.
- Note that the time is not printed if it is midnight.

now_ct <- as.POSIXct("2022-03-31 12:00", tz = "UTC")
structure(now_ct, tzone = "Asia/Tokyo")

## [1] "2022-03-31 21:00:00 JST"

structure(now_ct, tzone = "America/New_York")

## [1] "2022-03-31 08:00:00 EDT"

structure(now_ct, tzone = "Australia/Lord_Howe")

## [1] "2022-03-31 23:00:00 +11"

structure(now_ct, tzone = "Europe/Paris")

## [1] "2022-03-31 14:00:00 CEST"
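
To confirm that tzone changes only the display, one can compare the underlying doubles. This sketch assumes the system time zone database knows "Asia/Tokyo", as in the examples above.

```r
now_ct <- as.POSIXct("2022-03-31 12:00", tz = "UTC")
tokyo  <- structure(now_ct, tzone = "Asia/Tokyo")

as.numeric(now_ct)        # 1648728000: seconds since 1970-01-01 00:00:00 UTC
as.numeric(tokyo)         # the same number: the instant is unchanged
now_ct == tokyo           # TRUE

format(now_ct, "%H:%M")   # "12:00"
format(tokyo, "%H:%M")    # "21:00": only the formatting differs

# A date-time at midnight: the time part is omitted when printing
midnight <- as.POSIXct("2022-03-31 00:00", tz = "UTC")
midnight
```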

Durations

Represent the amount of time between pairs of dates or date-times.
Stored in difftimes:
- Built on top of doubles.
- units attribute determines how to interpret the number.

one_week_1 <- as.difftime(1, units = "weeks")
one_week_1

## Time difference of 1 weeks

typeof(one_week_1)

## [1] "double"

attributes(one_week_1)

## $class
## [1] "difftime"
## 
## $units
## [1] "weeks"

one_week_2 <- as.difftime(7, units = "days")
one_week_2

## Time difference of 7 days

typeof(one_week_2)

## [1] "double"

attributes(one_week_2)

## $class
## [1] "difftime"
## 
## $units
## [1] "days"
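
In practice, difftimes usually arise from subtracting dates or date-times. A minimal sketch, including two ways to change the unit (the dates are arbitrary):

```r
# Subtracting two Dates gives a difftime
d <- as.Date("2022-03-31") - as.Date("2022-03-24")
d                 # Time difference of 7 days
units(d)          # "days"

# Extract the number in a chosen unit
one_week <- as.difftime(1, units = "weeks")
in_days  <- as.numeric(one_week, units = "days")   # 7

# Or change the unit of the object itself
units(one_week) <- "days"
one_week          # Time difference of 7 days
```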

Lists

A step up in complexity from atomic vectors.
Each element can be any type.
Construct lists with list().

l1 <- list(
  1:3,
  "a",
  c(TRUE, FALSE, TRUE),
  c(2.3, 5.9)
)
typeof(l1)

## [1] "list"

str(l1)

## List of 4
##  $ : int [1:3] 1 2 3
##  $ : chr "a"
##  $ : logi [1:3] TRUE FALSE TRUE
##  $ : num [1:2] 2.3 5.9

Sometimes called recursive vectors because a list can contain other lists:

l3 <- list(list(list(1)))
str(l3)

## List of 1
##  $ :List of 1
##   ..$ :List of 1
##   .. ..$ : num 1

c() will combine several lists into one. Given a mix of atomic vectors and lists, c() coerces the vectors to lists before combining them:

l4 <- list(list(1, 2), c(3, 4))
l5 <- c(list(1, 2), c(3, 4))
str(l4)

## List of 2
##  $ :List of 2
##   ..$ : num 1
##   ..$ : num 2
##  $ : num [1:2] 3 4

str(l5)

## List of 4
##  $ : num 1
##  $ : num 2
##  $ : num 3
##  $ : num 4

Testing and coercion

The typeof() of a list is "list".
Test for a list with is.list(), and coerce to a list with as.list().

list(1:3)

## [[1]]
## [1] 1 2 3

as.list(1:3)

## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3

Turn a list into an atomic vector with unlist(), but be careful:
- Rules for the resulting type are complex, not well documented, and not always equivalent to what you’d get with c().
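
A few simple cases give a feel for unlist(). This is only a sketch of well-behaved inputs, not a description of the full rules.

```r
u1 <- unlist(list(1:2, 3))          # 1 2 3, stored as double
u2 <- unlist(list(1, "a", TRUE))    # "1" "a" "TRUE": everything becomes character
u3 <- unlist(list(a = 1, b = list(c = 2, d = 3)))
names(u3)                           # "a" "b.c" "b.d": nested names are pasted together
```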

Data frames and tibbles

The two most important S3 vectors built on top of lists.
If you do data analysis in R, you’ll use them.
A data frame is a named list of vectors with attributes for (column) names, row.names, and its class, data.frame.

df1 <- data.frame(x = 1:3, y = letters[1:3])
typeof(df1)

## [1] "list"

attributes(df1)

## $names
## [1] "x" "y"
## 
## $class
## [1] "data.frame"
## 
## $row.names
## [1] 1 2 3

Similar to a list, but with an additional constraint:
- The length of each of its vectors must be the same.
- “Rectangular structure” explaining why they share properties of both matrices and lists:
  - It has rownames() and colnames(), but its names() are the column names.
  - It has nrow() rows and ncol() columns, but its length() is the number of columns.
One of the biggest and most important ideas in R!
One of the things that makes R different from many other programming languages.
… but
- 20 years have passed since their creation, and some of the design decisions that made sense at the time can now cause frustration.
- … which led to the creation of the tibble, a modern reimagining of the data frame.
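
The matrix-like and list-like properties mentioned above can be checked on a small data frame:

```r
df1 <- data.frame(x = 1:3, y = letters[1:3])

# List-like behaviour: a data frame is a list of columns
names(df1)      # "x" "y"
length(df1)     # 2: the number of columns

# Matrix-like behaviour: a rectangle of rows and columns
colnames(df1)   # "x" "y"
rownames(df1)   # "1" "2" "3"
nrow(df1)       # 3
ncol(df1)       # 2
dim(df1)        # 3 2
```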

Tibbles

Provided by the tibble package.
Main difference: lazy (do less) & surly (complain more).
Technically:
- Share the same structure as data.frame.
- Only difference is that the class vector also includes tbl_df and tbl.
- Allows tibbles to behave differently.

library(tibble)
df2 <- tibble(x = 1:3, y = letters[1:3])
typeof(df2)

## [1] "list"

attributes(df2)

## $names
## [1] "x" "y"
## 
## $row.names
## [1] 1 2 3
## 
## $class
## [1] "tbl_df"     "tbl"        "data.frame"

Creating a `data.frame`

Supply name-vector pairs to data.frame().

df <- data.frame(
  x = 1:3,
  y = c("a", "b", "c")
)
str(df)

## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: chr  "a" "b" "c"

Before R 4.0.0, data.frame() converted strings to factors by default; set stringsAsFactors = FALSE to be explicit:

df1 <- data.frame(
  x = 1:3,
  y = c("a", "b", "c"),
  stringsAsFactors = FALSE
)
str(df1)

## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: chr  "a" "b" "c"

Creating a `tibble`

Similar to creating a data frame, but tibbles never coerce their input (i.e., lazy):

df2 <- tibble(
  x = 1:3,
  y = c("a", "b", "c")
)
str(df2)

## tbl_df [3 x 2] (S3: tbl_df/tbl/data.frame)
##  $ x: int [1:3] 1 2 3
##  $ y: chr [1:3] "a" "b" "c"

Next few topics: some of the differences between data.frame() and tibble().
- Non-syntactic names.
- Recycling shorter inputs.
- Variables created during construction.
- Printing.

Non-syntactic names

Strict rules about what constitutes a valid name.
- Syntactic names consist of letters, digits, . and _ but can’t begin with _ or a digit.
- Additionally, can’t use any of the reserved words like TRUE, NULL, if, and function (see the complete list in ?Reserved).
A name that doesn’t follow these rules is non-syntactic.

_abc <- 1
#> Error: unexpected input in "_"

if <- 10
#> Error: unexpected assignment in "if <-"

To override these rules and use any name:

`_abc` <- 1
`_abc`

## [1] 1

`if` <- 10
`if`

## [1] 10

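Don’t deliberately create such names, but do understand them:
- You’ll come across them, e.g., within data created outside of R.
In data frames and tibbles, data.frame() converts non-syntactic names to syntactic ones unless check.names = FALSE, whereas tibble() leaves them unchanged: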

names(data.frame(`1` = 1))

## [1] "X1"

names(data.frame(`1` = 1, check.names = FALSE))

## [1] "1"

names(tibble(`1` = 1))

## [1] "1"
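
When check.names = TRUE, data.frame() adjusts names with make.names(). Calling it directly shows how non-syntactic names are repaired (the input names are made up):

```r
fixed <- make.names(c("1", "_abc", "if", "my var", "ok_name"))
fixed
# "X1" "X_abc" "if." "my.var" "ok_name"
# - an X is prepended when the name starts with an invalid character
# - a dot is appended to reserved words
# - invalid characters such as spaces become dots
```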

Recycling shorter inputs

Both data.frame() and tibble() recycle shorter inputs, but
- data frames automatically recycle columns whose length evenly divides the length of the longest column,
- tibbles will only recycle vectors of length one.

data.frame(x = 1:4, y = 1:2)

##   x y
## 1 1 1
## 2 2 2
## 3 3 1
## 4 4 2

data.frame(x = 1:4, y = 1:3)
#> Error in data.frame(x = 1:4, y = 1:3): arguments imply differing number of rows: 4, 3

tibble(x = 1:4, y = 1)

## # A tibble: 4 x 2
##       x     y
##   <int> <dbl>
## 1     1     1
## 2     2     1
## 3     3     1
## 4     4     1

tibble(x = 1:4, y = 1:2)
#> Error: Tibble columns must have compatible sizes.
#> * Size 4: Existing data.
#> * Size 2: Column `y`.
#> i Only values of size one are recycled.

Variables created during construction

tibble() allows you to refer to variables created during construction:

tibble(
  x = 1:3,
  y = x * 2
)

## # A tibble: 3 x 2
##       x     y
##   <int> <dbl>
## 1     1     2
## 2     2     4
## 3     3     6

(Inputs are evaluated left-to-right.)
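
For contrast, data.frame() does not look up columns created in the same call. The sketch below assumes no object named u exists in the calling environment; tryCatch() is used only to capture the error message.

```r
# data.frame() evaluates u * 2 before the column u exists, so this fails
res <- tryCatch(
  data.frame(u = 1:3, v = u * 2),
  error = function(e) conditionMessage(e)
)
res   # an error message saying that object 'u' was not found

# Base R alternative: create the data frame first, then add the derived column
df <- data.frame(u = 1:3)
df$v <- df$u * 2
df
```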

Printing

print(iris)

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 4            4.6         3.1          1.5         0.2     setosa
## 5            5.0         3.6          1.4         0.2     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 8            5.0         3.4          1.5         0.2     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 10           4.9         3.1          1.5         0.1     setosa
## 11           5.4         3.7          1.5         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 14           4.3         3.0          1.1         0.1     setosa
## 15           5.8         4.0          1.2         0.2     setosa
## 16           5.7         4.4          1.5         0.4     setosa
## 17           5.4         3.9          1.3         0.4     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 19           5.7         3.8          1.7         0.3     setosa
## 20           5.1         3.8          1.5         0.3     setosa
## 21           5.4         3.4          1.7         0.2     setosa
## 22           5.1         3.7          1.5         0.4     setosa
## 23           4.6         3.6          1.0         0.2     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 25           4.8         3.4          1.9         0.2     setosa
## 26           5.0         3.0          1.6         0.2     setosa
## 27           5.0         3.4          1.6         0.4     setosa
## 28           5.2         3.5          1.5         0.2     setosa
## 29           5.2         3.4          1.4         0.2     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 31           4.8         3.1          1.6         0.2     setosa
## 32           5.4         3.4          1.5         0.4     setosa
## 33           5.2         4.1          1.5         0.1     setosa
## 34           5.5         4.2          1.4         0.2     setosa
## 35           4.9         3.1          1.5         0.2     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 37           5.5         3.5          1.3         0.2     setosa
## 38           4.9         3.6          1.4         0.1     setosa
## 39           4.4         3.0          1.3         0.2     setosa
## 40           5.1         3.4          1.5         0.2     setosa
## 41           5.0         3.5          1.3         0.3     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 43           4.4         3.2          1.3         0.2     setosa
## 44           5.0         3.5          1.6         0.6     setosa
## 45           5.1         3.8          1.9         0.4     setosa
## 46           4.8         3.0          1.4         0.3     setosa
## 47           5.1         3.8          1.6         0.2     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 49           5.3         3.7          1.5         0.2     setosa
## 50           5.0         3.3          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 52           6.4         3.2          4.5         1.5 versicolor
## 53           6.9         3.1          4.9         1.5 versicolor
## 54           5.5         2.3          4.0         1.3 versicolor
## 55           6.5         2.8          4.6         1.5 versicolor
## 56           5.7         2.8          4.5         1.3 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 58           4.9         2.4          3.3         1.0 versicolor
## 59           6.6         2.9          4.6         1.3 versicolor
## 60           5.2         2.7          3.9         1.4 versicolor
## 61           5.0         2.0          3.5         1.0 versicolor
## 62           5.9         3.0          4.2         1.5 versicolor
## 63           6.0         2.2          4.0         1.0 versicolor
## 64           6.1         2.9          4.7         1.4 versicolor
## 65           5.6         2.9          3.6         1.3 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 67           5.6         3.0          4.5         1.5 versicolor
## 68           5.8         2.7          4.1         1.0 versicolor
## 69           6.2         2.2          4.5         1.5 versicolor
## 70           5.6         2.5          3.9         1.1 versicolor
## 71           5.9         3.2          4.8         1.8 versicolor
## 72           6.1         2.8          4.0         1.3 versicolor
## 73           6.3         2.5          4.9         1.5 versicolor
## 74           6.1         2.8          4.7         1.2 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 76           6.6         3.0          4.4         1.4 versicolor
## 77           6.8         2.8          4.8         1.4 versicolor
## 78           6.7         3.0          5.0         1.7 versicolor
## 79           6.0         2.9          4.5         1.5 versicolor
## 80           5.7         2.6          3.5         1.0 versicolor
## 81           5.5         2.4          3.8         1.1 versicolor
## 82           5.5         2.4          3.7         1.0 versicolor
## 83           5.8         2.7          3.9         1.2 versicolor
## 84           6.0         2.7          5.1         1.6 versicolor
## 85           5.4         3.0          4.5         1.5 versicolor
## 86           6.0         3.4          4.5         1.6 versicolor
## 87           6.7         3.1          4.7         1.5 versicolor
## 88           6.3         2.3          4.4         1.3 versicolor
## 89           5.6         3.0          4.1         1.3 versicolor
## 90           5.5         2.5          4.0         1.3 versicolor
## 91           5.5         2.6          4.4         1.2 versicolor
## 92           6.1         3.0          4.6         1.4 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 94           5.0         2.3          3.3         1.0 versicolor
## 95           5.6         2.7          4.2         1.3 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 97           5.7         2.9          4.2         1.3 versicolor
## 98           6.2         2.9          4.3         1.3 versicolor
## 99           5.1         2.5          3.0         1.1 versicolor
## 100          5.7         2.8          4.1         1.3 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 102          5.8         2.7          5.1         1.9  virginica
## 103          7.1         3.0          5.9         2.1  virginica
## 104          6.3         2.9          5.6         1.8  virginica
## 105          6.5         3.0          5.8         2.2  virginica
## 106          7.6         3.0          6.6         2.1  virginica
## 107          4.9         2.5          4.5         1.7  virginica
## 108          7.3         2.9          6.3         1.8  virginica
## 109          6.7         2.5          5.8         1.8  virginica
## 110          7.2         3.6          6.1         2.5  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 112          6.4         2.7          5.3         1.9  virginica
## 113          6.8         3.0          5.5         2.1  virginica
## 114          5.7         2.5          5.0         2.0  virginica
## 115          5.8         2.8          5.1         2.4  virginica
## 116          6.4         3.2          5.3         2.3  virginica
## 117          6.5         3.0          5.5         1.8  virginica
## 118          7.7         3.8          6.7         2.2  virginica
## 119          7.7         2.6          6.9         2.3  virginica
## 120          6.0         2.2          5.0         1.5  virginica
## 121          6.9         3.2          5.7         2.3  virginica
## 122          5.6         2.8          4.9         2.0  virginica
## 123          7.7         2.8          6.7         2.0  virginica
## 124          6.3         2.7          4.9         1.8  virginica
## 125          6.7         3.3          5.7         2.1  virginica
## 126          7.2         3.2          6.0         1.8  virginica
## 127          6.2         2.8          4.8         1.8  virginica
## 128          6.1         3.0          4.9         1.8  virginica
## 129          6.4         2.8          5.6         2.1  virginica
## 130          7.2         3.0          5.8         1.6  virginica
## 131          7.4         2.8          6.1         1.9  virginica
## 132          7.9         3.8          6.4         2.0  virginica
## 133          6.4         2.8          5.6         2.2  virginica
## 134          6.3         2.8          5.1         1.5  virginica
## 135          6.1         2.6          5.6         1.4  virginica
## 136          7.7         3.0          6.1         2.3  virginica
## 137          6.3         3.4          5.6         2.4  virginica
## 138          6.4         3.1          5.5         1.8  virginica
## 139          6.0         3.0          4.8         1.8  virginica
## 140          6.9         3.1          5.4         2.1  virginica
## 141          6.7         3.1          5.6         2.4  virginica
## 142          6.9         3.1          5.1         2.3  virginica
## 143          5.8         2.7          5.1         1.9  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 145          6.7         3.3          5.7         2.5  virginica
## 146          6.7         3.0          5.2         2.3  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 148          6.5         3.0          5.2         2.0  virginica
## 149          6.2         3.4          5.4         2.3  virginica
## 150          5.9         3.0          5.1         1.8  virginica

print(dplyr::starwars)

## # A tibble: 87 x 14
##    name  height  mass hair_color skin_color eye_color birth_year sex  
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
##  1 Luke~    172    77 blond      fair       blue            19   male 
##  2 C-3PO    167    75 <NA>       gold       yellow         112   none 
##  3 R2-D2     96    32 <NA>       white, bl~ red             33   none 
##  4 Dart~    202   136 none       white      yellow          41.9 male 
##  5 Leia~    150    49 brown      light      brown           19   fema~
##  6 Owen~    178   120 brown, gr~ light      blue            52   male 
##  7 Beru~    165    75 brown      light      blue            47   fema~
##  8 R5-D4     97    32 <NA>       white, red red             NA   none 
##  9 Bigg~    183    84 black      light      brown           24   male 
## 10 Obi-~    182    77 auburn, w~ fair       blue-gray       57   male 
## # ... with 77 more rows, and 6 more variables: gender <chr>,
## #   homeworld <chr>, species <chr>, films <list>, vehicles <list>,
## #   starships <list>

Tibbles show only the first 10 rows and the columns that fit on screen.
Each column is labelled with its abbreviated type.
Wide columns are truncated.
In RStudio, color highlights important information.
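
With a plain data frame, a similarly compact view can be approximated in base R. This is a workaround sketch, not how tibble printing is implemented.

```r
dim(iris)                  # 150 5: number of rows and columns
first_rows <- head(iris, 10)
first_rows                 # prints only the first 10 rows
str(iris)                  # one line per column, with its type
```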

Testing and coercing

To check if an object is a data frame or tibble:

df1 <- data.frame(
  x = 1:3,
  y = c("a", "b", "c"),
  stringsAsFactors = FALSE
)
df2 <- tibble(
  x = 1:3,
  y = c("a", "b", "c")
)
is.data.frame(df1)

## [1] TRUE

is.data.frame(df2)

## [1] TRUE

Typically, it should not matter if you have a tibble or data frame, but if you need to be certain:

df1 <- data.frame(
  x = 1:3,
  y = c("a", "b", "c"),
  stringsAsFactors = FALSE
)
df2 <- tibble(
  x = 1:3,
  y = c("a", "b", "c")
)
is_tibble(df1)

## [1] FALSE

is_tibble(df2)

## [1] TRUE

Coerce an object to a data frame or tibble with as.data.frame() or as_tibble().
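
A short base R sketch of coercion with as.data.frame() (as_tibble() from the tibble package is used in the same way; the objects below are made up):

```r
# From a matrix: each column of the matrix becomes a column of the data frame
m <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("a", "b")))
df_from_matrix <- as.data.frame(m)
str(df_from_matrix)

# From a named list of equal-length vectors
col_list <- list(x = 1:3, y = c("a", "b", "c"))
df_from_list <- as.data.frame(col_list)
is.data.frame(df_from_list)   # TRUE

# And back: a data frame is a list of its columns
back_to_list <- as.list(df_from_list)
names(back_to_list)           # "x" "y"
```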

List columns

Since a data frame is a list of vectors, it is possible for a data frame to have a column that is a list.
- Useful because a list can contain any other object, i.e., you can put any object in a data frame.
- Allows you to keep related objects together in a row, no matter how complex the individual objects are.
- We’ll see applications later in the course.
In data frames, either add the list-column after creation or wrap the list in I().
In tibbles, list-columns are easier to create (pass the list directly to tibble()) and are printed tidily.

df <- data.frame(x = 1:3)
df$y <- list(1:2, 1:3, 1:4)
data.frame(
  x = 1:3,
  y = I(list(1:2, 1:3, 1:4))
)

##   x          y
## 1 1       1, 2
## 2 2    1, 2, 3
## 3 3 1, 2, 3, 4

tibble(
  x = 1:3,
  y = list(1:2, 1:3, 1:4)
)

## # A tibble: 3 x 2
##       x y        
##   <int> <list>   
## 1     1 <int [2]>
## 2     2 <int [3]>
## 3     3 <int [4]>

Matrix and data frame columns

As long as the number of rows matches, a data frame column can also be a matrix or another data frame:

dfm <- data.frame(
  x = 1:3 * 10
)
dfm$y <- matrix(1:9, nrow = 3)
dfm$z <- data.frame(a = 3:1, b = letters[1:3],
                    stringsAsFactors = FALSE)
str(dfm)

## 'data.frame':    3 obs. of  3 variables:
##  $ x: num  10 20 30
##  $ y: int [1:3, 1:3] 1 2 3 4 5 6 7 8 9
##  $ z:'data.frame':   3 obs. of  2 variables:
##   ..$ a: int  3 2 1
##   ..$ b: chr  "a" "b" "c"

Careful:
- Many functions that work with data frames assume that all columns are vectors.
- The printed display can be confusing.

dfm <- data.frame(
  x = 1:3 * 10
)
dfm$y <- matrix(1:9, nrow = 3)
dfm$z <- data.frame(a = 3:1, b = letters[1:3],
                    stringsAsFactors = FALSE)
print(dfm[1, ])

##    x y.1 y.2 y.3 z.a z.b
## 1 10   1   4   7   3   a

`NULL`

Closely related to vectors.
Special because it has a unique type, is always length zero, and can’t have any attributes.

typeof(NULL)

## [1] "NULL"

length(NULL)

## [1] 0

x <- NULL
attr(x, "y") <- 1
#> Error in attr(x, "y") <- 1: attempt to set an attribute on NULL

Can test for NULLs with is.null():

is.null(NULL)

## [1] TRUE

NULL commonly represents
- an absent vector.
  - For example, NULL is often used as a default function argument.
  - Contrast this with NA, which indicates that an element of a vector is absent.
- an empty vector (a vector of length zero) of arbitrary type.

c()

## NULL

c(NULL, NULL)

## NULL

c(NULL, 1:3)

## [1] 1 2 3
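The "absent vector" use can be sketched with a default argument. The function describe() and its label argument are hypothetical, made up for illustration:

```r
# `label` is optional; NULL means "not supplied"
describe <- function(x, label = NULL) {
  if (is.null(label)) {
    label <- "unlabelled"
  }
  paste0(label, ": ", length(x), " element(s)")
}

without_label <- describe(1:3)
with_label <- describe(1:3, label = "ids")
without_label
with_label

# NULL is an absent vector; NA is an absent element inside a vector
length(NULL)
length(NA)
```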

If you’re familiar with SQL, you’ll know about relational NULL, but the database NULL is actually equivalent to R’s NA.

Subsetting

R’s subsetting operators are fast and powerful.
- They let you succinctly perform complex operations in a way that few other languages can match.
- Easy to learn but hard to master because of a number of interrelated concepts:
  - Six ways to subset atomic vectors.
  - Three subsetting operators, [[, [, and $.
  - The operators interact differently with different vector types.
  - Subsetting can be combined with assignment.
Subsetting is a natural complement to str():
- str() shows the pieces of any object (its structure).
- Subsetting pulls out the pieces that you’re interested in.
Outline:
- Selecting multiple elements with [.
- Selecting a single element with [[ and $.
- Subsetting and assignment.

`[` for atomic vectors

We’ll look at the following vector:

x <- c(2.1, 4.2, 3.3, 5.4)

Note that the number after the decimal point represents the original position in the vector.
There are six things that you can use to subset a vector:
- Positive integers.
- Negative integers.
- Logical vectors.
- Nothing.
- Zero.
- Character vectors.
Positive integers return elements at the specified positions:

x <- c(2.1, 4.2, 3.3, 5.4)
x[c(3, 1)]

## [1] 3.3 2.1

x[order(x)]

## [1] 2.1 3.3 4.2 5.4

x[c(1, 1)] # Duplicate indices will duplicate values

## [1] 2.1 2.1

x[c(2.1, 2.9)] # Real numbers are silently truncated to integers

## [1] 4.2 4.2

Negative integers exclude elements at the specified positions:

x <- c(2.1, 4.2, 3.3, 5.4)
x[-c(3, 1)]

## [1] 4.2 5.4

Can’t mix positive and negative integers in a single subset:

x <- c(2.1, 4.2, 3.3, 5.4)
x[c(-1, 2)]
#> Error in x[c(-1, 2)]: only 0's may be mixed with negative subscripts

Logical vectors select elements where the corresponding logical value is TRUE (probably the most useful):

x <- c(2.1, 4.2, 3.3, 5.4)
x[c(TRUE, TRUE, FALSE, FALSE)]

## [1] 2.1 4.2

x[x > 3]

## [1] 4.2 3.3 5.4

In x[y], what happens if x and y are different lengths?
- Behavior controlled by the recycling rules with the shorter recycled to the length of the longer.
- Convenient and easy to understand when x OR y is length one, but avoid for other lengths because of inconsistencies in base R.

x <- c(2.1, 4.2, 3.3, 5.4)
x[c(TRUE, FALSE)]

## [1] 2.1 3.3

# Equivalent to
x[c(TRUE, FALSE, TRUE, FALSE)]

## [1] 2.1 3.3
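A small sketch of why lengths other than one are best avoided: base R recycles or pads without any warning, so the result is easy to misread.

```r
x <- c(2.1, 4.2, 3.3, 5.4)

# An index of length 3 is recycled to TRUE, FALSE, TRUE, TRUE
short_index <- x[c(TRUE, FALSE, TRUE)]
short_index

# An index longer than x selects a position that does not exist,
# which shows up as NA at the end of the result
long_index <- x[c(TRUE, TRUE, TRUE, TRUE, TRUE)]
long_index
```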

Nothing returns the original vector (not useful for 1D vectors, but important for matrices, data frames, and arrays):

x <- c(2.1, 4.2, 3.3, 5.4)
x[]

## [1] 2.1 4.2 3.3 5.4

Zero returns a zero-length vector (not usually done on purpose):

x <- c(2.1, 4.2, 3.3, 5.4)
x[0]

## numeric(0)

If the vector is named, you can also use character vectors to return elements with matching names:

x <- c(2.1, 4.2, 3.3, 5.4)
(y <- setNames(x, letters[1:4]))

##   a   b   c   d 
## 2.1 4.2 3.3 5.4

y[c("d", "c", "a")]

##   d   c   a 
## 5.4 3.3 2.1

# Like integer indices, you can repeat indices
y[c("a", "a", "a")]

##   a   a   a 
## 2.1 2.1 2.1

# When subsetting with [, names are always matched exactly
z <- c(abc = 1, def = 2)
z[c("a", "d")]

## <NA> <NA> 
##   NA   NA

Note that a missing value in the index always yields a missing value in the output:

x <- c(2.1, 4.2, 3.3, 5.4)
x[c(TRUE, TRUE, NA, FALSE)]

## [1] 2.1 4.2  NA
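If the missing values in a logical index should simply be skipped, one common option is to wrap it in which(), which returns only the positions that are TRUE. A sketch (the name keep is illustrative):

```r
x <- c(2.1, 4.2, 3.3, 5.4)
keep <- c(TRUE, TRUE, NA, FALSE)

with_na <- x[keep]            # the NA propagates into the result
without_na <- x[which(keep)]  # which() ignores NA and FALSE
with_na
without_na
```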

Factors are not treated specially when subsetting:
- Subsetting will use the underlying integer vector, not the character levels.
- Typically unexpected, so avoid!

x <- c(2.1, 4.2, 3.3, 5.4)
(y <- setNames(x, letters[1:4]))

##   a   b   c   d 
## 2.1 4.2 3.3 5.4

y[factor("b")]

##   a 
## 2.1
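A sketch of what happens above and one way around it: the factor's integer code is used as a position, so converting to character first restores name-based lookup.

```r
x <- c(2.1, 4.2, 3.3, 5.4)
y <- setNames(x, letters[1:4])
f <- factor("b")

as.integer(f)                    # "b" is the only level, so its code is 1
by_code <- y[f]                  # selects position 1, i.e. element "a"
by_label <- y[as.character(f)]   # selects the element named "b"
by_code
by_label
```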

`[` for lists

Exactly as for atomic vectors.
Using [ always returns a list; [[ and $ (see later) let you pull out elements of a list.
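A brief sketch showing that the index types from the previous section carry over to lists, and that the result is a list each time (lst is an illustrative object):

```r
lst <- list(a = 1:3, b = "text", c = TRUE)

by_position <- lst[c(1, 3)]              # positive integers
by_exclusion <- lst[-2]                  # negative integers
by_logical <- lst[c(TRUE, FALSE, TRUE)]  # logical vector
by_name <- lst[c("a", "c")]              # character vector

# All four are lists holding the elements a and c
str(by_name)
identical(by_position, by_name)
```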

`[` for matrices and arrays

Subset higher-dimensional structures in three ways:
- With multiple vectors.
- With a single vector.
- With a matrix.
The most common way:
- Supply a 1D index for each dimension, separated by a comma.
- Blank subsetting is now useful!

a <- matrix(1:9, nrow = 3)
colnames(a) <- c("A", "B", "C")
a[1:2, ]

##      A B C
## [1,] 1 4 7
## [2,] 2 5 8

a[c(TRUE, FALSE, TRUE), c("B", "A")]

##      B A
## [1,] 4 1
## [2,] 6 3

a[0, -2]

##      A C

By default, [ simplifies the results to the lowest possible dimensionality.
- For example, both of the following expressions return 1D vectors.
- You’ll learn how to avoid “dropping” dimensions later.

a <- matrix(1:9, nrow = 3)
a[1, ]

## [1] 1 4 7

a[1, 1]

## [1] 1

Can subset them with a vector as if they were 1D.
Note that arrays in R are stored in column-major order:

vals <- outer(1:5, 1:5, FUN = "paste", sep = ",")
vals

##      [,1]  [,2]  [,3]  [,4]  [,5] 
## [1,] "1,1" "1,2" "1,3" "1,4" "1,5"
## [2,] "2,1" "2,2" "2,3" "2,4" "2,5"
## [3,] "3,1" "3,2" "3,3" "3,4" "3,5"
## [4,] "4,1" "4,2" "4,3" "4,4" "4,5"
## [5,] "5,1" "5,2" "5,3" "5,4" "5,5"

vals[c(4, 15)]

## [1] "4,1" "5,3"
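The column-major rule can be written as arithmetic: for a matrix, linear index = row + (column - 1) * number of rows. A sketch of the conversion in both directions, using base R's arrayInd() for the reverse step:

```r
vals <- outer(1:5, 1:5, FUN = "paste", sep = ",")

# (row, column) -> linear index
row <- 5
col <- 3
linear <- row + (col - 1) * nrow(vals)
linear
vals[linear]

# linear index -> (row, column), one row of output per index
pos <- arrayInd(c(4, 15), dim(vals))
pos
```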

Can also subset higher-dimensional data structures with an integer matrix (or, if named, a character matrix).
- Each row in the matrix specifies the location of one value.
- Each column corresponds to a dimension in the array.
- E.g., use a 2 column matrix to subset a matrix, a 3 column matrix to subset a 3D array, etc.
- The result is a vector of values.

vals <- outer(1:5, 1:5, FUN = "paste", sep = ",")
select <- matrix(ncol = 2, byrow = TRUE, c(
  1, 1,
  3, 1,
  2, 4
))
vals[select]

## [1] "1,1" "3,1" "2,4"
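As a further sketch of matrix indexing, cbind() is a convenient way to build the index matrix, for example to pull out the diagonal or the anti-diagonal:

```r
vals <- outer(1:5, 1:5, FUN = "paste", sep = ",")

# One (row, column) pair per row of the index matrix
diagonal <- vals[cbind(1:5, 1:5)]
diagonal

# Anti-diagonal: row i paired with column 6 - i
anti_diagonal <- vals[cbind(1:5, 5:1)]
anti_diagonal
```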

`[` for data frames and tibbles

Characteristics of both lists and matrices.
When subsetting with a single index:
- Behave like lists and index the columns.
- E.g. df[1:2] selects the first two columns.
When subsetting with two indices:
- Behave like matrices.
- E.g. df[1:3, ] selects the first three rows (and all columns).

df <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])
df[df$x == 2, ]

##   x y z
## 2 2 2 b

df[c(1, 3), ]

##   x y z
## 1 1 3 a
## 3 3 1 c

Two ways to select columns from a data frame:

df <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])
# Like a list
df[c("x", "z")]

##   x z
## 1 1 a
## 2 2 b
## 3 3 c

# Like a matrix
df[, c("x", "z")]

##   x z
## 1 1 a
## 2 2 b
## 3 3 c

Important difference if you select a single column:
- Matrix subsetting simplifies by default.
- List subsetting does not.

df <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])
str(df[, "x"])

##  int [1:3] 1 2 3

str(df["x"])

## 'data.frame':    3 obs. of  1 variable:
##  $ x: int  1 2 3

Subsetting a tibble with [ always returns a tibble:

df <- tibble::tibble(x = 1:3, y = 3:1, z = letters[1:3])
str(df["x"])

## tbl_df [3 x 1] (S3: tbl_df/tbl/data.frame)
##  $ x: int [1:3] 1 2 3

str(df[, "x"])

## tbl_df [3 x 1] (S3: tbl_df/tbl/data.frame)
##  $ x: int [1:3] 1 2 3

Preserving dimensionality

For matrices and arrays, dimensions with length 1 are dropped:

a <- matrix(1:4, nrow = 2)
str(a[1, ])

##  int [1:2] 1 3

str(a[1, , drop = FALSE])

##  int [1, 1:2] 1 3

Subsetting a data frame down to a single column returns just that column as a vector:

df <- data.frame(a = 1:2, b = 1:2)
str(df[, "a"])

##  int [1:2] 1 2

str(df[, "a", drop = FALSE])

## 'data.frame':    2 obs. of  1 variable:
##  $ a: int  1 2

The default drop = TRUE is a common source of bugs:
- Your code with a dataset with multiple columns works.
- Six months later, you use it with a single column dataset and it fails with a mystifying error.
- Always use ‘drop = FALSE’ when subsetting a 2D object!
- Tibbles default to drop = FALSE and [ always returns a tibble.
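A sketch of how this bug can arise. Both helper functions are hypothetical. In this particular case the fragile version does not even raise an error; it silently returns the wrong thing, which is arguably worse.

```r
# Mean of each selected column, relying on the default drop = TRUE
col_means_fragile <- function(df, cols) {
  sub <- df[, cols]                # becomes a plain vector for one column
  vapply(sub, mean, numeric(1))
}

# Same computation, but the subset is always a data frame
col_means_safe <- function(df, cols) {
  sub <- df[, cols, drop = FALSE]
  vapply(sub, mean, numeric(1))
}

df <- data.frame(a = 1:2, b = 3:4)
col_means_safe(df, c("a", "b"))  # one mean per column
col_means_safe(df, "a")          # still one mean per column
col_means_fragile(df, "a")       # iterates over elements, not columns
```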
Factor subsetting also has a drop argument:
- Controls whether levels (rather than dimensions) are preserved; it defaults to FALSE.
- If you find yourself using drop = TRUE a lot, it is often a sign that you should be using a character vector instead of a factor.

z <- factor(c("a", "b"))
z[1]

## [1] a
## Levels: a b

z[1, drop = TRUE]

## [1] a
## Levels: a

The other two subsetting operators:

[[ is used for extracting single items.
x$y is a useful shorthand for x[["y"]].

`[[`

[[ is most important when working with lists because subsetting a list with [ always returns a smaller list.

If list x is a train carrying objects, then x[[5]] is the object in car 5; x[4:6] is a train of cars 4-6. — @RLangTip, https://twitter.com/RLangTip/status/268375867468681216

Use this metaphor to make a simple list:

x <- list(1:3, "a", 4:6)

When extracting a single element, you have two options:
- Create a smaller train, i.e., fewer carriages, with [.
- Extract the contents of a particular carriage with [[.

When extracting multiple (or even zero!) elements, you have to make a smaller train.
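The train metaphor in code, as a short sketch on the list defined above (the variable names are illustrative):

```r
x <- list(1:3, "a", 4:6)

smaller_train <- x[1]  # a list of length 1: one carriage
contents <- x[[1]]     # the integer vector inside that carriage
two_cars <- x[1:2]     # a list of length 2
empty_train <- x[0]    # a list of length 0

str(smaller_train)
str(contents)
length(empty_train)
```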

`$`

Shorthand operator:
- x$y is roughly equivalent to x[["y"]].
- Often used to access variables in a data frame.
- E.g., mtcars$cyl or diamonds$carat.
One common mistake with $:

var <- "cyl"
# Doesn't work - mtcars$var translated to mtcars[["var"]]
mtcars$var

## NULL

# Instead use [[
mtcars[[var]]

##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4

The one important difference between $ and [[ is (left-to-right) partial matching:

x <- list(abc = 1)
x$a

## [1] 1

x[["a"]]

## NULL

To avoid this, the following is highly recommended:

options(warnPartialMatchDollar = TRUE)
x <- list(abc = 1)
x$a

## Warning in x$a: partial match of 'a' to 'abc'

## [1] 1

(For data frames, you can also avoid this problem by using tibbles, which never do partial matching.)
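For completeness, [[ can be asked to match partially through its exact argument, which makes the difference between the two operators explicit. A sketch:

```r
x <- list(abc = 1)

exact_default <- x[["a"]]                 # exact matching is the default
partial_match <- x[["a", exact = FALSE]]  # partial matching requested explicitly
full_name <- x[["abc"]]

is.null(exact_default)
partial_match
full_name
```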

Data frames and tibbles again

Data frames have two undesirable subsetting behaviors.
- When you subset columns with df[, vars]:
  - Returns a vector if vars selects one variable.
  - Otherwise, returns a data frame.
  - A frequent source of bugs unless you always use drop = FALSE.
- When extracting a single column with df$x:
  - If there is no column x, selects any variable that starts with x.
  - If no variable starts with x, returns NULL.
  - Easy to select the wrong variable/a variable that doesn’t exist.
Tibbles tweak these behaviors:
- [ always returns a tibble.
- $ doesn’t do partial matching and warns if it can’t find a variable (makes tibbles surly).

df1 <- data.frame(xyz = "a")
str(df1$x)

## Warning in df1$x: partial match of 'x' to 'xyz'

##  chr "a"

df2 <- tibble(xyz = "a")
str(df2$x)

## Warning: Unknown or uninitialised column: `x`.

##  NULL

Subsetting and assignment `<-`

Subsetting operators can be combined with assignment.
- Modifies selected values of an input vector.
- Called subassignment.
The basic form is x[i] <- value:

x <- 1:5
x[c(1, 2)] <- c(101, 102)
x

## [1] 101 102   3   4   5

Recommendation:
- Make sure that length(value) is the same as length(x[i]),
- and that i is unique.
- Otherwise, you’ll end up in recycling hell.
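A sketch of the two situations the recommendation guards against, using small illustrative vectors:

```r
# Duplicate indices: the later assignment overwrites the earlier one
x <- 1:5
x[c(1, 1)] <- c(10, 20)
x

# A value shorter than the selection is recycled to fill it
y <- 1:5
y[1:4] <- c(0, 1)
y
```

In both cases R raises no error or warning, so the surprising result is easy to miss.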
Subsetting lists with NULL:
- x[[i]] <- NULL removes a component.
- To add a literal NULL, use x[i] <- list(NULL).

x <- list(a = 1, b = 2)
x[["b"]] <- NULL
str(x)

## List of 1
##  $ a: num 1

y <- list(a = 1, b = 2)
y["b"] <- list(NULL)
str(y)

## List of 2
##  $ a: num 1
##  $ b: NULL

Subsetting with nothing can be useful with assignment:
- Preserves the structure of the original object.
- Compare the following two expressions.

mtcars[] <- lapply(mtcars, as.integer)
is.data.frame(mtcars)

## [1] TRUE

mtcars <- lapply(mtcars, as.integer)
is.data.frame(mtcars)

## [1] FALSE

Data Structures

Jitong

11/29/2021

