STA 517 3.0 Programming and Statistical Computing with RReshaping DataDr Thiyanga Talagala1 / 33

Data Wrangling/ Data Munging

2 / 33

Data Wrangling/ Data Munging

Reshaping Data (tidying your data)

How to reshape your data in order to make the analysis easier.

2 / 33

Tidy Data

Figure 1: Components of a dataframe.

Each variable is saved in its column.
Each observation is saved in its own row.

Image Credit: Hadley Wickham and Garrett Grolemund

3 / 33

packages

library(tidyverse) #or library(tidyr)
library(magrittr)

4 / 33

`tidyr` package

Hadley Wickham, Chief Scientist at RStudio explaining tidyr at WOMBAT organized by Monash University, Australia.

knitrhex

Image taken by Thiyanga S Talagala at WOMBAT Melbourne, Australia, December-2019

5 / 33

6 / 33

7 / 33

8 / 33

9 / 33

10 / 33

tidyr verbs

Main verbs

pivot_longer

In tidyr (2014) gather
pivot_wider

In tidyr (2014) spread

Other

separate
unite

Input and Output

Main input: data frame or tibble.

Output: tibble

11 / 33

pivot_longer12 / 33

`pivot_longer()`

Turns columns into rows.
From wide format to long format.

13 / 33

`pivot_longer()`

dengue <- tibble( dist = c("Colombo", "Gampaha", "Kalutara"), 
                  '2017' = c(20718, 10258, 34274), 
                  '2018' = c(16573, 5857, 31647), 
                  '2019' = c(8395, 3155, 10961)); dengue

# A tibble: 3 × 4
  dist     `2017` `2018` `2019`
  <chr>     <dbl>  <dbl>  <dbl>
1 Colombo   20718  16573   8395
2 Gampaha   10258   5857   3155
3 Kalutara  34274  31647  10961

dengue %>% 
  pivot_longer(2:4, names_to="Year", values_to = "Dengue counts")

# A tibble: 9 × 3
  dist     Year  `Dengue counts`
  <chr>    <chr>           <dbl>
1 Colombo  2017            20718
2 Colombo  2018            16573
3 Colombo  2019             8395
4 Gampaha  2017            10258
5 Gampaha  2018             5857
6 Gampaha  2019             3155
7 Kalutara 2017            34274
8 Kalutara 2018            31647
9 Kalutara 2019            10961

14 / 33

pivot_wider15 / 33

`pivot_wider()`

From long to wide format.

16 / 33

`pivot_wider()`

Corona <- tibble(
country = rep(c("USA", "Brazil", "Russia"), each=2),
status = rep(c("Death", "Recovered"), 3),
count = c(99381, 451745, 22746, 149911, 3633, 118798))

Corona

# A tibble: 6 × 3
  country status     count
  <chr>   <chr>      <dbl>
1 USA     Death      99381
2 USA     Recovered 451745
3 Brazil  Death      22746
4 Brazil  Recovered 149911
5 Russia  Death       3633
6 Russia  Recovered 118798

17 / 33

pivot_wider()Corona

# A tibble: 6 × 3
  country status     count
  <chr>   <chr>      <dbl>
1 USA     Death      99381
2 USA     Recovered 451745
3 Brazil  Death      22746
4 Brazil  Recovered 149911
5 Russia  Death       3633
6 Russia  Recovered 118798
Corona %>% 
  pivot_wider(names_from=status, 
              values_from=count)

# A tibble: 3 × 3
  country Death Recovered
  <chr>   <dbl>     <dbl>
1 USA     99381    451745
2 Brazil  22746    149911
3 Russia   3633    118798
18 / 33

Assign a name:

corona_wide_format <- Corona %>% 
  pivot_wider(names_from=status, 
  values_from=count)
corona_wide_format

# A tibble: 3 × 3
  country Death Recovered
  <chr>   <dbl>     <dbl>
1 USA     99381    451745
2 Brazil  22746    149911
3 Russia   3633    118798

19 / 33

`pivot_longer` vs `pivot_wider`

20 / 33

`pivot_longer` and `pivot_wider`

profit <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
quarter = c( 1, 2, 3, 4, 2, 3, 4),
income = c(2, NA, 3, NA, 4, 5, 6)
)
profit

# A tibble: 7 × 3
   year quarter income
  <dbl>   <dbl>  <dbl>
1  2015       1      2
2  2015       2     NA
3  2015       3      3
4  2015       4     NA
5  2016       2      4
6  2016       3      5
7  2016       4      6

21 / 33

`pivot_longer` and `pivot_wider`

# A tibble: 7 × 3
   year quarter income
  <dbl>   <dbl>  <dbl>
1  2015       1      2
2  2015       2     NA
3  2015       3      3
4  2015       4     NA
5  2016       2      4
6  2016       3      5
7  2016       4      6

profit %>%
pivot_wider(names_from = year, values_from = income)

# A tibble: 4 × 3
  quarter `2015` `2016`
    <dbl>  <dbl>  <dbl>
1       1      2     NA
2       2     NA      4
3       3      3      5
4       4     NA      6

22 / 33

Missing values

# A tibble: 4 × 3
  quarter `2015` `2016`
    <dbl>  <dbl>  <dbl>
1       1      2     NA
2       2     NA      4
3       3      3      5
4       4     NA      6

profit %>%
pivot_wider(names_from = year, values_from = income) %>%
pivot_longer(
cols = c(`2015`, `2016`),
names_to = "year",
values_to = "income",
values_drop_na = TRUE
)

# A tibble: 5 × 3
  quarter year  income
    <dbl> <chr>  <dbl>
1       1 2015       2
2       2 2016       4
3       3 2015       3
4       3 2016       5
5       4 2016       6

23 / 33

separate24 / 33

`separate()`

Separate one column into several columns.

Melbourne <- 
  tibble(Date = c("10-5-2020", "11-5-2020", "12-5-2020","13-5-2020"),
         Tmin = c(5, 9, 9, 7), Tmax = c(18, 16, 16, 17),
         Rainfall= c(30, 40, 10, 5)); Melbourne

# A tibble: 4 × 4
  Date       Tmin  Tmax Rainfall
  <chr>     <dbl> <dbl>    <dbl>
1 10-5-2020     5    18       30
2 11-5-2020     9    16       40
3 12-5-2020     9    16       10
4 13-5-2020     7    17        5

Melbourne %>%
  separate(Date, into=c("day", "month", "year"), sep="-")

# A tibble: 4 × 6
  day   month year   Tmin  Tmax Rainfall
  <chr> <chr> <chr> <dbl> <dbl>    <dbl>
1 10    5     2020      5    18       30
2 11    5     2020      9    16       40
3 12    5     2020      9    16       10
4 13    5     2020      7    17        5

25 / 33

`separate()`

df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))
df

     x
1 <NA>
2  a.b
3  a.d
4  b.c

df %>% separate(x, c("Text1", "Text2"))

  Text1 Text2
1  <NA>  <NA>
2     a     b
3     a     d
4     b     c

26 / 33

`separate()`

tbl <- tibble(input = c("a", "a b", "a-b c", NA)); tbl

# A tibble: 4 × 1
  input
  <chr>
1 a    
2 a b  
3 a-b c
4 <NA>

27 / 33

`separate()`

tbl <- tibble(input = c("a", "a b", "a-b c", NA)); tbl

# A tibble: 4 × 1
  input
  <chr>
1 a    
2 a b  
3 a-b c
4 <NA>

tbl %>% separate(input, c("Input1", "Input2"))

## Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [3].

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [1].

## # A tibble: 4 × 2
##   Input1 Input2
##   <chr>  <chr> 
## 1 a      <NA>  
## 2 a      b     
## 3 a      b     
## 4 <NA>   <NA>

27 / 33

`separate()`

tbl <- tibble(input = c("a", "a b", "a-b c", NA)); tbl

# A tibble: 4 × 1
  input
  <chr>
1 a    
2 a b  
3 a-b c
4 <NA>

28 / 33

`separate()`

tbl <- tibble(input = c("a", "a b", "a-b c", NA)); tbl

# A tibble: 4 × 1
  input
  <chr>
1 a    
2 a b  
3 a-b c
4 <NA>

tbl %>% separate(input, 
c("Input1", "Input2", "Input3"))

tbl %>% separate(input, c("Input1", "Input2", "Input3"))

## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2 rows [1, 2].

## # A tibble: 4 × 3
##   Input1 Input2 Input3
##   <chr>  <chr>  <chr> 
## 1 a      <NA>   <NA>  
## 2 a      b      <NA>  
## 3 a      b      c     
## 4 <NA>   <NA>   <NA>

28 / 33

unite29 / 33

`unite()`

Unite several columns into one.

projects <- tibble(
  Country = c("USA", "USA", "AUS", "AUS"),
  State = c("LA", "CO", "VIC", "NSW"),
  Cost = c(1000, 11000, 20000,30000)
)
projects

# A tibble: 4 × 3
  Country State  Cost
  <chr>   <chr> <dbl>
1 USA     LA     1000
2 USA     CO    11000
3 AUS     VIC   20000
4 AUS     NSW   30000

projects %>% unite("Location", c("State", "Country"))

# A tibble: 4 × 2
  Location  Cost
  <chr>    <dbl>
1 LA_USA    1000
2 CO_USA   11000
3 VIC_AUS  20000
4 NSW_AUS  30000

30 / 33

`unite()`

projects %>% unite("Location", c("State", "Country"))

# A tibble: 4 × 2
  Location  Cost
  <chr>    <dbl>
1 LA_USA    1000
2 CO_USA   11000
3 VIC_AUS  20000
4 NSW_AUS  30000

projects %>% unite("Location", c("State", "Country"),
                    sep="-")

# A tibble: 4 × 2
  Location  Cost
  <chr>    <dbl>
1 LA-USA    1000
2 CO-USA   11000
3 VIC-AUS  20000
4 NSW-AUS  30000

31 / 33

`separate` vs `unite`

32 / 33

Slides available at: hellor.netlify.app

33 / 33

Help

Keyboard shortcuts

↑, ←, Pg Up, k

Go to previous slide

↓, →, Pg Dn, Space, j

Go to next slide

Home

Go to first slide

End

Go to last slide

Number + Return

Go to specific slide

b / m / f

Toggle blackout / mirrored / fullscreen mode

Clone slideshow

Toggle presenter mode

Start & Stop the presentation timer

Reset the presentation timer

?, h

Toggle this help

STA 517 3.0 Programming and Statistical Computing with R

Reshaping Data

Dr Thiyanga Talagala

Data Wrangling/ Data Munging

Data Wrangling/ Data Munging

Reshaping Data (tidying your data)

Tidy Data

packages

tidyr package

tidyr verbs

Main verbs

Other

Input and Output

pivot_longer

pivot_longer()

pivot_longer()

pivot_wider

pivot_wider()

pivot_wider()

pivot_wider()

Assign a name:

pivot_longer vs pivot_wider

pivot_longer and pivot_wider

pivot_longer and pivot_wider

Missing values

separate

separate()

separate()

separate()

separate()

separate()

separate()

unite

unite()

unite()

separate vs unite

Data Wrangling/ Data Munging

Help

`tidyr` package

`pivot_longer`

`pivot_longer()`

`pivot_longer()`

`pivot_wider`

`pivot_wider()`

`pivot_wider()`

`pivot_wider()`

`pivot_longer` vs `pivot_wider`

`pivot_longer` and `pivot_wider`

`pivot_longer` and `pivot_wider`

`separate`

`separate()`

`separate()`

`separate()`

`separate()`

`separate()`

`separate()`

`unite`

`unite()`

`unite()`

`separate` vs `unite`