STA 326 2.0 Programming and Data Analysis with R

Introduction to the tidyverse

Dr Thiyanga Talagala

Online distance learning/teaching materials during the COVID-19 outbreak.

1 / 47

2 / 47

What is the tidyverse?

A collection of essential R packages for data science.
All packages share a common design philosophy, grammar, and data structures.

Setup

install.packages("tidyverse") # install tidyverse packages
library(tidyverse) # load tidyverse packages

3 / 47

Workflow

Image Credit: Wickham

4 / 47

Workflow: import

5 / 47

Workflow: tidy

6 / 47

Workflow: transform

7 / 47

Workflow: visualise

Illustration

library(ggplot2)
ggplot(iris, 
       aes(Sepal.Width, Sepal.Length, 
           color = Species)) + 
  geom_point() +
  theme(aspect.ratio = 1) +
  scale_color_manual(values = 
    c("#1b9e77", "#d95f02", "#7570b3"))

8 / 47

Workflow: model

Illustration: Apply a linear model to each group

nested_iris <- group_by(iris, Species) %>% nest()
fit_model <- function(df) lm(Sepal.Length ~ Sepal.Width, data = df)
nested_iris <- nested_iris %>%
 mutate(model = map(data, fit_model))
# Model for the first group; the other two are nested_iris$model[[2]] and nested_iris$model[[3]]
nested_iris$model[[1]]


Call:
lm(formula = Sepal.Length ~ Sepal.Width, data = df)
Coefficients:
(Intercept)  Sepal.Width  
     2.6390       0.6905
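Once the models are stored in a list column, a single number can be pulled out of each one in the same way. The following is a minimal sketch rather than part of the original slides: the column name slope is an illustrative choice, and map_dbl() from purrr is used because each model yields exactly one numeric value.

```r
library(tidyverse)

# Fit one model per species, then extract the Sepal.Width coefficient
# from each fitted model into an ordinary numeric column.
nested_iris <- iris %>%
  group_by(Species) %>%
  nest() %>%
  mutate(
    model = map(data, ~ lm(Sepal.Length ~ Sepal.Width, data = .x)),
    slope = map_dbl(model, ~ coef(.x)[["Sepal.Width"]])  # "slope" is a hypothetical name
  )

nested_iris %>% select(Species, slope)
```

Keeping the data, the model, and the extracted value in one row per group makes it harder for results to get out of step with the group they belong to.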

9 / 47

Workflow: communicate

10 / 47

Workflow: R packages

11 / 47

1. Tibble

2. Factor

3. Pipe

12 / 47

Tibble

13 / 47

Tibble

Tibbles are data frames: a modern re-imagining of the base R data.frame.

Create a tibble

library(tidyverse) # library(tibble)
first.tbl <- tibble(height = c(150, 200, 160), weight = c(45, 60, 51))
first.tbl

# A tibble: 3 x 2
  height weight
   <dbl>  <dbl>
1    150     45
2    200     60
3    160     51

class(first.tbl)

[1] "tbl_df"     "tbl"        "data.frame"

14 / 47

Convert an existing dataframe to a tibble

as_tibble(iris)

# A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# … with 140 more rows

15 / 47

Convert a tibble to a dataframe

first.tbl <- tibble(height = c(150, 200, 160), weight = c(45, 60, 51))
class(first.tbl)

[1] "tbl_df"     "tbl"        "data.frame"

first.tbl.df <- as.data.frame(first.tbl)
class(first.tbl.df)

[1] "data.frame"

16 / 47

tibble vs. data.frame

Output

tibble

first.tbl <- tibble(height = c(150, 200, 160), weight = c(45, 60, 51))
first.tbl

# A tibble: 3 x 2
  height weight
   <dbl>  <dbl>
1    150     45
2    200     60
3    160     51

data.frame

dataframe <- data.frame(height = c(150, 200, 160), weight = c(45, 60, 51))
dataframe

  height weight
1    150     45
2    200     60
3    160     51

17 / 47

tibble vs data.frame (cont.)

With tibble(), you can create new variables that are functions of existing variables within the same call.

tibble

first.tbl <- tibble(height = c(150, 200, 160), weight = c(45, 60, 51), 
                    bmi = (weight)/height^2)
first.tbl

# A tibble: 3 x 3
  height weight     bmi
   <dbl>  <dbl>   <dbl>
1    150     45 0.002  
2    200     60 0.0015 
3    160     51 0.00199

data.frame

df <- data.frame(height = c(150, 200, 160), weight = c(45, 60, 51), 
                    bmi = (weight)/height^2) # Not working

You will get an error message

Error in data.frame(height = c(150, 200, 160), weight = c(45, 60, 51), : object 'height' not found.

18 / 47

tibble vs data.frame (cont.)

With data.frame(), create the data frame first and then add the new variable computed from the existing columns.

df <- data.frame(height = c(150, 200, 160), weight = c(45, 60, 51)) 
df$bmi <- (df$weight)/(df$height^2)
df

  height weight         bmi
1    150     45 0.002000000
2    200     60 0.001500000
3    160     51 0.001992188
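As an alternative to assigning with $, base R's transform() evaluates an expression inside the data frame, so columns can be referred to by name. The sketch below is illustrative and not from the original slides. It also assumes that height is recorded in centimetres and weight in kilograms, and therefore converts height to metres to obtain a conventional BMI; the slides' own formula uses the raw height.

```r
# Base R only; no packages needed.
df <- data.frame(height = c(150, 200, 160), weight = c(45, 60, 51))

# transform() looks up `weight` and `height` inside df.
# Assumption: height is in cm and weight in kg, hence the division by 100.
df <- transform(df, bmi = weight / (height / 100)^2)
df
```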

19 / 47

tibble vs data.frame (cont.)

By default, data.frame() converts variable names into syntactically valid ones (check.names = TRUE). tibble() keeps names as given, so they can contain spaces.

Example 1

tbl <- tibble(`patient id` = c(1, 2, 3))
tbl

# A tibble: 3 x 1
  `patient id`
         <dbl>
1            1
2            2
3            3

df <- data.frame(`patient id` = c(1, 2, 3))
df

  patient.id
1          1
2          2
3          3

20 / 47

tibble vs data.frame (cont.)

For the same reason, variable names in tibbles can start with a number, whereas data.frame() alters such names by default.

tbl <- tibble(`1var` = c(1, 2, 3))
tbl

# A tibble: 3 x 1
  `1var`
   <dbl>
1      1
2      2
3      3

df <- data.frame(`1var` = c(1, 2, 3))
df

  X1var
1     1
2     2
3     3

In general, tibbles do not change the names of input variables and do not use row names.

21 / 47

tibble vs data.frame (cont.)

A tibble can have columns that are lists.

tbl <- tibble (x = 1:3, y = list(1:3, 1:4, 1:10))
tbl

# A tibble: 3 x 2
      x y         
  <int> <list>    
1     1 <int [3]> 
2     2 <int [4]> 
3     3 <int [10]>

data.frame() does not accept a list column written this way.

If we try to do this with a traditional data frame we get an error.

df <- data.frame(x = 1:3, y = list(1:3, 1:4, 1:10)) ## Not working, error

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 3, 4, 10
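The error arises because data.frame() tries to spread the list elements across separate columns. If a list column is really needed in a base data frame, one known workaround is to wrap the list in I(), which tells data.frame() to store it as is. This is a small sketch for comparison, not something shown in the original slides.

```r
# Base R only. I() protects the list so it is kept as one column
# with one list element per row.
df <- data.frame(x = 1:3, y = I(list(1:3, 1:4, 1:10)))

nrow(df)        # one row per list element
lengths(df$y)   # length of the vector stored in each row
```

The tibble version remains more convenient, because no wrapper is required and the printed output shows the type and length of each element.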

22 / 47

Subsetting: tibble vs data.frame

Subsetting single columns:

data frame

df <- data.frame(x = 1:3, 
                 yz = c(10, 20, 30))
df

df[, "x"]

[1] 1 2 3

df[, "x", drop=FALSE]

tibble

tbl <- tibble(x = 1:3, 
              yz = c(10, 20, 30))
tbl

# A tibble: 3 x 2
      x    yz
  <int> <dbl>
1     1    10
2     2    20
3     3    30

tbl[, "x"]

# A tibble: 3 x 1
      x
  <int>
1     1
2     2
3     3

23 / 47

Subsetting single columns (cont):

As shown on the previous slide, tbl[, "x"] returns a one-column tibble rather than a vector. To extract the column as a vector, use one of the following methods.

# Method 1
tbl[, "x", drop = TRUE]

[1] 1 2 3

# Method 2
as.data.frame(tbl)[, "x"]

[1] 1 2 3
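Besides the two methods above, a column can be extracted as a vector with [[ or $, or with pull() from dplyr when working inside a pipeline. The sketch below assumes the tibble and dplyr packages are installed; the names v1, v2 and v3 are only for illustration.

```r
library(tibble)
library(dplyr)   # provides pull()

tbl <- tibble(x = 1:3, yz = c(10, 20, 30))

v1 <- tbl[["x"]]     # double brackets return the column itself
v2 <- tbl$x          # $ does the same, but needs the exact column name
v3 <- pull(tbl, x)   # pull() is convenient at the end of a pipeline

identical(v1, v2) && identical(v2, v3)
```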

24 / 47

Subsetting single rows with the drop argument

dataframe

df[1, , drop = TRUE]

$x
[1] 1

$yz
[1] 10

tibble

tbl[1, , drop = TRUE]

# A tibble: 1 x 2
      x    yz
  <int> <dbl>
1     1    10

as.list(tbl[1, ])

$x
[1] 1

$yz
[1] 10

25 / 47

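Accessing non-existent columns

With a data frame, $ and [[ with exact = FALSE match column names partially, so y silently resolves to yz. A tibble never does partial matching: it returns NULL with a warning.

dataframe

df$y

[1] 10 20 30

df[["y", exact = FALSE]]

[1] 10 20 30

tibble

tbl$y

Warning: Unknown or uninitialised column: `y`.
NULL

tbl[["y", exact = FALSE]]

Warning: `exact` ignored.
NULL

26 / 47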

Functions that work with both tibbles and data frames

names(), colnames(), rownames(), ncol(), nrow(),
length() # length of the underlying list

tb <- tibble(a = 1:3)
names(tb)

[1] "a"

colnames(tb)

[1] "a"

rownames(tb)

[1] "1" "2" "3"

nrow(tb); ncol(tb); length(tb)

[1] 3

[1] 1

[1] 1

df <- data.frame(a = 1:3)
names(df)

[1] "a"

colnames(df)

[1] "a"

rownames(df)

[1] "1" "2" "3"

nrow(df); ncol(df); length(df)

[1] 3

[1] 1

[1] 1

27 / 47

However, when using the tibble package, we can use some additional functions:

is.tibble(tb)

Warning: `is.tibble()` is deprecated as of tibble 2.0.0.
Please use `is_tibble()` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.

[1] TRUE

is_tibble(tb) # is.tibble() is deprecated as of tibble 2.0.0; use is_tibble() instead

[1] TRUE

glimpse(tb)

Rows: 3
Columns: 1
$ a <int> 1, 2, 3

28 / 47

Factors

29 / 47

Factors

A vector that is used to store categorical variables.
It can only contain predefined values. Hence, factors are useful when you know the possible values a variable may take.

Creating a factor vector

grades <- factor(c("A", "A", "A", "C", "B"))
grades

[1] A A A C B
Levels: A B C

Now let's check the class type

class(grades) # It's a factor

[1] "factor"

To obtain all levels

levels(grades)

[1] "A" "B" "C"

30 / 47

Creating a factor vector (cont)

With factors all possible values of the variables can be defined under levels.

grade_factor_vctr <- 
  factor(c("A", "D", "A", "C", "B"), 
         levels = c("A", "B", "C", "D", "E"))
grade_factor_vctr

[1] A D A C B
Levels: A B C D E

levels(grade_factor_vctr)

[1] "A" "B" "C" "D" "E"

class(levels(grade_factor_vctr))

[1] "character"

31 / 47

Character vector vs Factor

Observe the differences in outputs. Factor prints all possible levels of the variable.

Character vector

grade_character_vctr <- 
  c("A", "D", "A", "C", "B")
grade_character_vctr

[1] "A" "D" "A" "C" "B"

Factor vector

grade_factor_vctr <- 
  factor(c("A", "D", "A", "C", "B"), 
         levels = c("A", "B", "C", "D", "E"))
grade_factor_vctr

[1] A D A C B
Levels: A B C D E

32 / 47

Character vector vs Factor (cont.)

Factors behave like character vectors but they are actually integers.

Character vector

typeof(grade_character_vctr)

[1] "character"

Factor vector

typeof(grade_factor_vctr)

[1] "integer"
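The integer storage can be made visible with as.integer(): each code is the position of the value in levels(). The short base R sketch below is an added illustration; the names codes and labels are chosen here and are not from the slides.

```r
grade_factor_vctr <-
  factor(c("A", "D", "A", "C", "B"),
         levels = c("A", "B", "C", "D", "E"))

# Integer codes: the position of each value within levels()
codes <- as.integer(grade_factor_vctr)
codes

# Indexing the levels by the codes recovers the original labels
labels <- levels(grade_factor_vctr)[codes]
labels
```

This is also why as.integer() should not be applied directly to a factor whose labels look like numbers: it returns the codes, not the labels.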

33 / 47

Character vector vs Factor (cont.)

Let's create a contingency table with the table() function.

Character vector output with table function

grade_character_vctr <- c("A", "D", "A", "C", "B")
table(grade_character_vctr)

grade_character_vctr
A B C D 
2 1 1 1

Factor vector (with levels) output with table function

grade_factor_vctr <- 
  factor(c("A", "D", "A", "C", "B"), 
         levels = c("A", "B", "C", "D", "E"))
table(grade_factor_vctr)

grade_factor_vctr
A B C D E 
2 1 1 1 0

The output for the factor shows counts for all possible levels of the variable. Hence, with factors it is obvious when some levels contain no observations.
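If the empty levels are not wanted in a summary, base R's droplevels() removes levels that have no observations. The sketch below is an added illustration; f, counts_all and counts_used are hypothetical names.

```r
f <- factor(c("A", "D", "A", "C", "B"),
            levels = c("A", "B", "C", "D", "E"))

counts_all  <- table(f)              # keeps the empty level E
counts_used <- table(droplevels(f))  # drops levels with no observations

counts_all
counts_used
```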

34 / 47

Character vector vs Factor (cont.)

With factors you can't use values that are not listed in the levels, but with character vectors there is no such restriction.

Character vector

grade_character_vctr[2] <- "A+"
grade_character_vctr

[1] "A"  "A+" "A"  "C"  "B"

Factor vector

grade_factor_vctr[2] <- "A+"

Warning in `[<-.factor`(`*tmp*`, 2, value = "A+"): invalid factor level, NA
generated

grade_factor_vctr

[1] A    <NA> A    C    B   
Levels: A B C D E
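To store a value that is not yet among the levels, the level has to be added first. One way to do this in base R is to extend levels() before assigning, as in this small sketch (the variable f is illustrative and starts from a fresh copy of the factor).

```r
f <- factor(c("A", "D", "A", "C", "B"),
            levels = c("A", "B", "C", "D", "E"))

levels(f) <- c(levels(f), "A+")  # register the new level first
f[2] <- "A+"                     # the assignment now succeeds without a warning
f
```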

35 / 47

Modify factor levels

This is our factor

grade_factor_vctr

[1] A    <NA> A    C    B   
Levels: A B C D E

Change labels

levels(grade_factor_vctr) <- 
  c("Excellent", "Good", "Average", "Poor", "Fail")
grade_factor_vctr

[1] Excellent <NA>      Excellent Average   Good     
Levels: Excellent Good Average Poor Fail

Reverse the level arrangement

levels(grade_factor_vctr) <- rev(levels(grade_factor_vctr))
grade_factor_vctr

[1] Fail    <NA>    Fail    Average Poor   
Levels: Fail Poor Average Good Excellent
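Note that assigning to levels() relabels the existing integer codes, which is why the observations above change from Excellent to Fail. If the aim is to reverse only the order of the levels while every observation keeps its label, one option is to rebuild the factor with factor(). The sketch below is an added illustration using a fresh factor named grades, which is not an object from the slides.

```r
grades <- factor(c("Excellent", NA, "Excellent", "Average", "Good"),
                 levels = c("Excellent", "Good", "Average", "Poor", "Fail"))

# factor() matches values by label, so only the level order changes
reversed <- factor(grades, levels = rev(levels(grades)))
reversed

levels(reversed)
```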

36 / 47

Order of factor levels

Default order of levels

fv1 <- factor(c("D","E","E","A", "B", "C"))
fv1

[1] D E E A B C
Levels: A B C D E

fv2 <- factor(c("1T","2T","3A","4A", "5A", "6B", "3A"))
fv2

[1] 1T 2T 3A 4A 5A 6B 3A
Levels: 1T 2T 3A 4A 5A 6B

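# qplot() is deprecated as of ggplot2 3.4.0; the equivalent ggplot() call is used here
ggplot(data.frame(fv2), aes(x = fv2)) + geom_bar()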

37 / 47

Order of factor levels (cont.)

You can change the order of levels

fv2 <- factor(c("1T","2T","3A","4A", "5A", "6B", "3A"), 
              levels = c("3A", "4A", "5A", "6B", "1T", "2T"))
fv2

[1] 1T 2T 3A 4A 5A 6B 3A
Levels: 3A 4A 5A 6B 1T 2T

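# qplot() is deprecated as of ggplot2 3.4.0; the equivalent ggplot() call is used here
ggplot(data.frame(fv2), aes(x = fv2)) + geom_bar()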

38 / 47

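Note that tibbles do not change the types of input variables (e.g., strings are not converted to factors). Since R 4.0.0, data.frame() also keeps strings as characters by default (stringsAsFactors = FALSE), which is why both outputs below show a character column.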

tbl <- tibble(x1 = c("setosa", "versicolor", "virginica", "setosa"))
tbl

# A tibble: 4 x 1
  x1        
  <chr>     
1 setosa    
2 versicolor
3 virginica 
4 setosa

df <- data.frame(x1 = c("setosa", "versicolor", "virginica", "setosa"))
df

          x1
1     setosa
2 versicolor
3  virginica
4     setosa

class(df$x1)

[1] "character"
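When a factor column is wanted in a base data frame, the conversion can be requested explicitly with the stringsAsFactors argument. This is a brief added sketch; df_fct is a hypothetical name.

```r
# Ask data.frame() to convert character columns to factors
df_fct <- data.frame(
  x1 = c("setosa", "versicolor", "virginica", "setosa"),
  stringsAsFactors = TRUE
)

class(df_fct$x1)
levels(df_fct$x1)
```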

39 / 47

Pipe operator: %>%

40 / 47

Pipe operator: %>%

Required package: `magrittr`

install.packages("magrittr")
library(magrittr)

What does it do?

It takes whatever is on the left-hand-side of the pipe and makes it the first argument of whatever function is on the right-hand-side of the pipe.

For instance,

mean(1:10)

[1] 5.5

can be written as

1:10 %>% mean()

[1] 5.5

41 / 47

Pipe operator: %>%

Illustrations

x %>% f(y) turns into f(x, y)
x %>% f(y) %>% g(z) turns into g(f(x, y), z)
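The two rewriting patterns above can be checked with concrete functions. In this added sketch, round() plays the role of f and paste() the role of g; the variable names are illustrative.

```r
library(magrittr)

x <- c(1.234, 5.678)

# x %>% f(y) is the same as f(x, y)
a <- x %>% round(1)
b <- round(x, 1)

# x %>% f(y) %>% g(z) is the same as g(f(x, y), z)
c1 <- x %>% round(1) %>% paste("cm")
c2 <- paste(round(x, 1), "cm")

identical(a, b)
identical(c1, c2)
```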

42 / 47

Why %>%

This helps to make your code more readable.

Method 1: Without using pipe (hard to read)

colSums(matrix(c(1, 2, 3, 4, 8, 9, 10, 12), nrow=2))

[1]  3  7 17 22

Method 2: Using pipe (easy to read)

c(1, 2, 3, 4, 8, 9, 10, 12) %>%
  matrix( , nrow = 2) %>%
  colSums()

[1]  3  7 17 22

c(1, 2, 3, 4, 8, 9, 10, 12) %>%
  matrix(nrow = 2) %>% # remove comma
  colSums()

[1]  3  7 17 22

43 / 47

Rules

library(tidyverse) # to use as_tibble
library(magrittr) # to use %>%
df <- data.frame(x1 = 1:3, x2 = 4:6)

Rule 1: the left-hand side becomes the first argument of the function

head(df) 
df %>% head()

Rule 2: any other arguments are written as usual

head(df, n = 2)  
df %>% head(n = 2)

  x1 x2
1  1  4
2  2  5

Rule 3: use . as a placeholder to send the left-hand side to a different argument

head(df, n = 2)
2 %>% head(df, n = .)

  x1 x2
1  1  4
2  2  5
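The placeholder is most useful for functions whose data argument does not come first, such as lm(). The following added sketch compares a piped and a direct call; fit_piped and fit_direct are illustrative names.

```r
library(magrittr)

# lm() takes the formula first, so the data frame is routed to `data` with `.`
fit_piped  <- iris %>% lm(Sepal.Length ~ Sepal.Width, data = .)
fit_direct <- lm(Sepal.Length ~ Sepal.Width, data = iris)

all.equal(coef(fit_piped), coef(fit_direct))
```

When . appears as a direct argument of the call, magrittr does not also insert the left-hand side as the first argument.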

Rule 4: nested calls become a chain that reads from left to right

head(as_tibble(df), n = 2)
df %>% as_tibble() %>%
head(n = 2)

# A tibble: 2 x 2
     x1    x2
  <int> <int>
1     1     4
2     2     5

44 / 47

Rules (cont.)

Rule 5: subsetting

df$x1
df %>% .$x1

[1] 1 2 3

df[["x1"]]
df %>% .[["x1"]]

[1] 1 2 3

df[[1]]
df %>% .[[1]]

[1] 1 2 3

45 / 47

Offline reading materials

Run the following code to see more examples:

vignette("magrittr")
vignette("tibble")

46 / 47

Slides available at: hellor.netlify.com

47 / 47
