+ - 0:00:00
Notes for current slide
Notes for next slide

STA 517 3.0 Programming and Statistical Computing with R

Reshaping Data

Dr Thiyanga Talagala

1 / 33

Data Wrangling/ Data Munging

2 / 33

Data Wrangling/ Data Munging

Reshaping Data (tidying your data)

How to reshape your data in order to make the analysis easier.

2 / 33

Tidy Data

Figure 1: Components of a dataframe.

  • Each variable is saved in its column.

  • Each observation is saved in its own row.

Image Credit: Hadley Wickham and Garrett Grolemund

3 / 33

packages

library(tidyverse) #or library(tidyr)
library(magrittr)

4 / 33

tidyr package

Hadley Wickham, Chief Scientist at RStudio explaining tidyr at WOMBAT organized by Monash University, Australia.

knitrhex

Image taken by Thiyanga S Talagala at WOMBAT Melbourne, Australia, December-2019

5 / 33

6 / 33

7 / 33

8 / 33

9 / 33

10 / 33

tidyr verbs

Main verbs

  • pivot_longer

    In tidyr (2014) gather

  • pivot_wider

    In tidyr (2014) spread

Other

  • separate

  • unite

Input and Output

Main input: data frame or tibble.

Output: tibble

11 / 33

pivot_longer

12 / 33

pivot_longer()

  • Turns columns into rows.

  • From wide format to long format.

13 / 33

pivot_longer()

dengue <- tibble( dist = c("Colombo", "Gampaha", "Kalutara"),
'2017' = c(20718, 10258, 34274),
'2018' = c(16573, 5857, 31647),
'2019' = c(8395, 3155, 10961)); dengue
# A tibble: 3 × 4
dist `2017` `2018` `2019`
<chr> <dbl> <dbl> <dbl>
1 Colombo 20718 16573 8395
2 Gampaha 10258 5857 3155
3 Kalutara 34274 31647 10961
dengue %>%
pivot_longer(2:4, names_to="Year", values_to = "Dengue counts")
# A tibble: 9 × 3
dist Year `Dengue counts`
<chr> <chr> <dbl>
1 Colombo 2017 20718
2 Colombo 2018 16573
3 Colombo 2019 8395
4 Gampaha 2017 10258
5 Gampaha 2018 5857
6 Gampaha 2019 3155
7 Kalutara 2017 34274
8 Kalutara 2018 31647
9 Kalutara 2019 10961
14 / 33

pivot_wider

15 / 33

pivot_wider()

  • From long to wide format.

16 / 33

pivot_wider()

Corona <- tibble(
country = rep(c("USA", "Brazil", "Russia"), each=2),
status = rep(c("Death", "Recovered"), 3),
count = c(99381, 451745, 22746, 149911, 3633, 118798))
Corona
# A tibble: 6 × 3
country status count
<chr> <chr> <dbl>
1 USA Death 99381
2 USA Recovered 451745
3 Brazil Death 22746
4 Brazil Recovered 149911
5 Russia Death 3633
6 Russia Recovered 118798
17 / 33

pivot_wider()

Corona
# A tibble: 6 × 3
country status count
<chr> <chr> <dbl>
1 USA Death 99381
2 USA Recovered 451745
3 Brazil Death 22746
4 Brazil Recovered 149911
5 Russia Death 3633
6 Russia Recovered 118798
Corona %>%
pivot_wider(names_from=status,
values_from=count)
# A tibble: 3 × 3
country Death Recovered
<chr> <dbl> <dbl>
1 USA 99381 451745
2 Brazil 22746 149911
3 Russia 3633 118798
18 / 33

Assign a name:

corona_wide_format <- Corona %>%
pivot_wider(names_from=status,
values_from=count)
corona_wide_format
# A tibble: 3 × 3
country Death Recovered
<chr> <dbl> <dbl>
1 USA 99381 451745
2 Brazil 22746 149911
3 Russia 3633 118798
19 / 33

pivot_longer vs pivot_wider

20 / 33

pivot_longer and pivot_wider

profit <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
quarter = c( 1, 2, 3, 4, 2, 3, 4),
income = c(2, NA, 3, NA, 4, 5, 6)
)
profit
# A tibble: 7 × 3
year quarter income
<dbl> <dbl> <dbl>
1 2015 1 2
2 2015 2 NA
3 2015 3 3
4 2015 4 NA
5 2016 2 4
6 2016 3 5
7 2016 4 6
21 / 33

pivot_longer and pivot_wider

# A tibble: 7 × 3
year quarter income
<dbl> <dbl> <dbl>
1 2015 1 2
2 2015 2 NA
3 2015 3 3
4 2015 4 NA
5 2016 2 4
6 2016 3 5
7 2016 4 6
profit %>%
pivot_wider(names_from = year, values_from = income)
# A tibble: 4 × 3
quarter `2015` `2016`
<dbl> <dbl> <dbl>
1 1 2 NA
2 2 NA 4
3 3 3 5
4 4 NA 6
22 / 33

Missing values

# A tibble: 4 × 3
quarter `2015` `2016`
<dbl> <dbl> <dbl>
1 1 2 NA
2 2 NA 4
3 3 3 5
4 4 NA 6
profit %>%
pivot_wider(names_from = year, values_from = income) %>%
pivot_longer(
cols = c(`2015`, `2016`),
names_to = "year",
values_to = "income",
values_drop_na = TRUE
)
# A tibble: 5 × 3
quarter year income
<dbl> <chr> <dbl>
1 1 2015 2
2 2 2016 4
3 3 2015 3
4 3 2016 5
5 4 2016 6
23 / 33

separate

24 / 33

separate()

  • Separate one column into several columns.
Melbourne <-
tibble(Date = c("10-5-2020", "11-5-2020", "12-5-2020","13-5-2020"),
Tmin = c(5, 9, 9, 7), Tmax = c(18, 16, 16, 17),
Rainfall= c(30, 40, 10, 5)); Melbourne
# A tibble: 4 × 4
Date Tmin Tmax Rainfall
<chr> <dbl> <dbl> <dbl>
1 10-5-2020 5 18 30
2 11-5-2020 9 16 40
3 12-5-2020 9 16 10
4 13-5-2020 7 17 5
Melbourne %>%
separate(Date, into=c("day", "month", "year"), sep="-")
# A tibble: 4 × 6
day month year Tmin Tmax Rainfall
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 10 5 2020 5 18 30
2 11 5 2020 9 16 40
3 12 5 2020 9 16 10
4 13 5 2020 7 17 5
25 / 33

separate()

df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))
df
x
1 <NA>
2 a.b
3 a.d
4 b.c
df %>% separate(x, c("Text1", "Text2"))
Text1 Text2
1 <NA> <NA>
2 a b
3 a d
4 b c
26 / 33

separate()

tbl <- tibble(input = c("a", "a b", "a-b c", NA)); tbl
# A tibble: 4 × 1
input
<chr>
1 a
2 a b
3 a-b c
4 <NA>
27 / 33

separate()

tbl <- tibble(input = c("a", "a b", "a-b c", NA)); tbl
# A tibble: 4 × 1
input
<chr>
1 a
2 a b
3 a-b c
4 <NA>
tbl %>% separate(input, c("Input1", "Input2"))
## Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [3].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [1].
## # A tibble: 4 × 2
## Input1 Input2
## <chr> <chr>
## 1 a <NA>
## 2 a b
## 3 a b
## 4 <NA> <NA>
27 / 33

separate()

tbl <- tibble(input = c("a", "a b", "a-b c", NA)); tbl
# A tibble: 4 × 1
input
<chr>
1 a
2 a b
3 a-b c
4 <NA>
28 / 33

separate()

tbl <- tibble(input = c("a", "a b", "a-b c", NA)); tbl
# A tibble: 4 × 1
input
<chr>
1 a
2 a b
3 a-b c
4 <NA>
tbl %>% separate(input,
c("Input1", "Input2", "Input3"))
tbl %>% separate(input, c("Input1", "Input2", "Input3"))
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2 rows [1, 2].
## # A tibble: 4 × 3
## Input1 Input2 Input3
## <chr> <chr> <chr>
## 1 a <NA> <NA>
## 2 a b <NA>
## 3 a b c
## 4 <NA> <NA> <NA>
28 / 33

unite

29 / 33

unite()

  • Unite several columns into one.
projects <- tibble(
Country = c("USA", "USA", "AUS", "AUS"),
State = c("LA", "CO", "VIC", "NSW"),
Cost = c(1000, 11000, 20000,30000)
)
projects
# A tibble: 4 × 3
Country State Cost
<chr> <chr> <dbl>
1 USA LA 1000
2 USA CO 11000
3 AUS VIC 20000
4 AUS NSW 30000
projects %>% unite("Location", c("State", "Country"))
# A tibble: 4 × 2
Location Cost
<chr> <dbl>
1 LA_USA 1000
2 CO_USA 11000
3 VIC_AUS 20000
4 NSW_AUS 30000
30 / 33

unite()

projects %>% unite("Location", c("State", "Country"))
# A tibble: 4 × 2
Location Cost
<chr> <dbl>
1 LA_USA 1000
2 CO_USA 11000
3 VIC_AUS 20000
4 NSW_AUS 30000
projects %>% unite("Location", c("State", "Country"),
sep="-")
# A tibble: 4 × 2
Location Cost
<chr> <dbl>
1 LA-USA 1000
2 CO-USA 11000
3 VIC-AUS 20000
4 NSW-AUS 30000
31 / 33

separate vs unite

32 / 33

Slides available at: hellor.netlify.app

All rights reserved by Thiyanga S. Talagala

33 / 33

Data Wrangling/ Data Munging

2 / 33
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
s Start & Stop the presentation timer
t Reset the presentation timer
?, h Toggle this help
Esc Back to slideshow