Manipulating data with tidyverse

What's the tidyverse?

The tidyverse is a set of R packages for data manipulation and visualisation. You can learn more on its website. It includes the following R packages:

dplyr: A Grammar of Data Manipulation
tidyr: Tidy Messy Data
stringr: Simple, Consistent Wrappers for Common String Operations
tibble: Simple Data Frames
ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics (coming next)
readr: Read Rectangular Text Data (seen previously here)
forcats: Tools for Working with Categorical Variables (Factors)
purrr: Functional Programming Tools

You can install and load all these packages with the following commands:

install.packages("tidyverse")
library(tidyverse)

This collection is widely used in data science. We are going to rely on the book R for Data Science (2nd edition), an O'Reilly book on the tidyverse ("R4DS" for short).

Take a break & Read

For an introduction to the tidyverse, please read the introduction of R4DS.

Tibbles, the "new" data.frame

Tidyverse packages are built around a new type of object, the "tibble". It is a variant of the data.frame. Don't worry, you can manipulate a tibble in much the same way as the data.frames you learned about previously (brackets [ ], $, and basic functions such as colnames, rownames, str, etc.).

What does it look like?

# An example of a tibble from tidyverse
starwars
## # A tibble: 87 × 14
##    name   height  mass hair_color skin_color eye_color birth_year sex   gender homeworld species films
##    <chr>   <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>  <chr>     <chr>   <lis>
##  1 Luke …    172    77 blond      fair       blue            19   male  mascu… Tatooine  Human   <chr>
##  2 C-3PO     167    75 NA         gold       yellow         112   none  mascu… Tatooine  Droid   <chr>
##  3 R2-D2      96    32 NA         white, bl… red             33   none  mascu… Naboo     Droid   <chr>
##  4 Darth…    202   136 none       white      yellow          41.9 male  mascu… Tatooine  Human   <chr>
##  5 Leia …    150    49 brown      light      brown           19   fema… femin… Alderaan  Human   <chr>
##  6 Owen …    178   120 brown, gr… light      blue            52   male  mascu… Tatooine  Human   <chr>
##  7 Beru …    165    75 brown      light      blue            47   fema… femin… Tatooine  Human   <chr>
##  8 R5-D4      97    32 NA         white, red red             NA   none  mascu… Tatooine  Droid   <chr>
##  9 Biggs…    183    84 black      light      brown           24   male  mascu… Tatooine  Human   <chr>
## 10 Obi-W…    182    77 auburn, w… fair       blue-gray       57   male  mascu… Stewjon   Human   <chr>
## # ℹ 77 more rows
## # ℹ 2 more variables: vehicles <list>, starships <list>
## # ℹ Use `print(n = ...)` to see more rows

Compare this with a data.frame version of starwars:

# First rows of the tibble converted to a data.frame
head(as.data.frame(starwars))
##             name height mass  hair_color  skin_color eye_color birth_year    sex    gender homeworld
## 1 Luke Skywalker    172   77       blond        fair      blue       19.0   male masculine  Tatooine
## 2          C-3PO    167   75        <NA>        gold    yellow      112.0   none masculine  Tatooine
## 3          R2-D2     96   32        <NA> white, blue       red       33.0   none masculine     Naboo
## 4    Darth Vader    202  136        none       white    yellow       41.9   male masculine  Tatooine
## 5    Leia Organa    150   49       brown       light     brown       19.0 female  feminine  Alderaan
## 6      Owen Lars    178  120 brown, grey       light      blue       52.0   male masculine  Tatooine
##   species
## 1   Human
## 2   Droid
## 3   Droid
## 4   Human
## 5   Human
## 6   Human
##                                                                                                                                       films
## 1                                           The Empire Strikes Back, Revenge of the Sith, Return of the Jedi, A New Hope, The Force Awakens
## 2                    The Empire Strikes Back, Attack of the Clones, The Phantom Menace, Revenge of the Sith, Return of the Jedi, A New Hope
## 3 The Empire Strikes Back, Attack of the Clones, The Phantom Menace, Revenge of the Sith, Return of the Jedi, A New Hope, The Force Awakens
## 4                                                              The Empire Strikes Back, Revenge of the Sith, Return of the Jedi, A New Hope
## 5                                           The Empire Strikes Back, Revenge of the Sith, Return of the Jedi, A New Hope, The Force Awakens
## 6                                                                                     Attack of the Clones, Revenge of the Sith, A New Hope
##                             vehicles                starships
## 1 Snowspeeder, Imperial Speeder Bike X-wing, Imperial shuttle
## 2                                                            
## 3                                                            
## 4                                             TIE Advanced x1
## 5              Imperial Speeder Bike                         
## 6
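
Beyond printing, the base R access methods mentioned above also work on a tibble. The following is a minimal sketch, assuming dplyr is installed (it provides the starwars tibble); the variable names heights and first_rows are only illustrative.

```r
library(dplyr)   # provides the starwars tibble

# `$` extracts one column as a plain vector, as with a data.frame
heights <- starwars$height

# Brackets select rows and columns; on a tibble the result is still a tibble
first_rows <- starwars[1:3, c("name", "height")]

# Base helpers work as usual
colnames(first_rows)
str(first_rows)

# One difference worth knowing: selecting a single column with brackets
# keeps a tibble, whereas a data.frame drops to a plain vector
class(starwars[, "name"])
class(as.data.frame(starwars)[, "name"])
```

The last two lines show why code written for data.frames may occasionally behave differently on a tibble: bracket subsetting of a tibble does not silently change the type of the result.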

Take a break & Read

For a detailed description of tibbles, please read section 10 of "R4DS".

Manipulating data

Some tidyverse packages, such as dplyr, tidyr and stringr, are used for manipulating your data. You can add or remove columns and rows, filter observations, and manipulate strings and tables.

Take a break & Read

Read section 3, "Data transformation", of "R4DS", which presents the basic dplyr commands for manipulating data.frames and tibbles.
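
As a quick illustration of these basic commands, here is a small sketch using four dplyr verbs on the starwars tibble. It assumes dplyr is installed; the intermediate names (humans, humans_small, humans_bmi) and the bmi column are made up for this example.

```r
library(dplyr)

# filter(): keep only the rows matching a condition
humans <- filter(starwars, species == "Human")

# select(): keep only a few columns
humans_small <- select(humans, name, height, mass)

# mutate(): add a column (body mass index, from height in cm and mass in kg)
humans_bmi <- mutate(humans_small, bmi = mass / (height / 100)^2)

# arrange(): sort the rows by the new column, largest first
arrange(humans_bmi, desc(bmi))
```

Each verb takes a table as its first argument and returns a new table, which is what makes them easy to chain.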

Pipe operator

In the previous section, you must have noticed some unusual syntax: |>. It is called the pipe operator. The tidyverse uses it to improve code readability when you combine multiple functions to accomplish a task.

The following code lines are equivalent:

# with the pipe operator
starwars |> 
  filter(species == "Droid") |>
  head()

# without the pipe operator
head(filter(starwars, species == "Droid"))

When several, possibly more complex, functions are combined, the nested form becomes confusing and difficult to read. Think of nested calls as Russian dolls stacked inside one another: using pipes lays the dolls out side by side, which is the best way to see how many dolls there are and what each one looks like.
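
To make the difference concrete, here is a sketch of a longer chain written both ways. It assumes dplyr is installed and R 4.1 or later (required for |>); the names nested and piped are only illustrative.

```r
library(dplyr)

# Nested form: must be read from the inside out
nested <- head(arrange(select(filter(starwars, species == "Droid"), name, height), desc(height)), 3)

# Piped form: reads from top to bottom, one step per line
piped <- starwars |>
  filter(species == "Droid") |>
  select(name, height) |>
  arrange(desc(height)) |>
  head(3)

# Both forms build the same table
all.equal(nested, piped)
```

The piped version performs exactly the same calls; only the order in which you read them changes.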

Take a break & Read

To become an expert with the pipe operator, please reread section 3.4 and also read section 4.3 of "R4DS", which present the best practices.

Tidy data

Using tibbles is not enough to make full use of the tidyverse's functionality: you also need to tidy your data. Tidying a data.frame or a tibble means rearranging it so that one column = one variable and one row = one observation.

For instance, here is the difference between tidy and untidy data:

# Untidy data
untidy_df <- data.frame(Sample = paste0("Sample", 1:7),
                        T_cells = c(72, 0, 12, 11, 4, 10, 164), 
                        NK_cells = c(118, 24, 2, 0, 30, 4, 0),
                        Endothelial_cells = c(212, 49, 0, 29, 23, 4, 125)
                        )
untidy_df
##    Sample T_cells    NK_cells Endothelial_cells
## 1 Sample1      72         118               212
## 2 Sample2       0          24                49
## 3 Sample3      12           2                 0
## 4 Sample4      11           0                29
## 5 Sample5       4          30                23
## 6 Sample6      10           4                 4
## 7 Sample7     164           0               125

This format is common in Excel tables, but it has some drawbacks. What do the numbers stand for? Potatoes? Okay, I may be exaggerating, but with complicated tables this can be an issue, and untidy data is harder to manipulate. For example, if you need to visualise the number of cells for each sample and also for each cell type, it is not easy to do so in R. Instead, we are going to favour this structure:

# Tidy data
tidy_df <- untidy_df |> 
  pivot_longer(cols = contains("cells"),       # Pivot all columns whose name contains "cells"
               names_to = "Cell_types",        # Store the former column names in a new column called "Cell_types"
               values_to = "Nbr_of_cells")     # Store the numeric values in a new column called "Nbr_of_cells"
tidy_df
## # A tibble: 21 × 3
##    Sample  Cell_types        Nbr_of_cells
##    <chr>   <chr>                    <dbl>
##  1 Sample1 T_cells                     72
##  2 Sample1 NK_cells                   118
##  3 Sample1 Endothelial_cells          212
##  4 Sample2 T_cells                      0
##  5 Sample2 NK_cells                    24
##  6 Sample2 Endothelial_cells           49
##  7 Sample3 T_cells                     12
##  8 Sample3 NK_cells                     2
##  9 Sample3 Endothelial_cells            0
## 10 Sample4 T_cells                     11
## # ℹ 11 more rows
## # ℹ Use `print(n = ...)` to see more rows

The R function pivot_longer was used to tidy the data.frame. Because it is a tidyverse function, the value stored in tidy_df is now a tibble. As you can see, we have fewer columns and more rows, but each row now describes one observation.
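
Once the data is tidy, grouped summaries become short to write. The sketch below rebuilds the same table and then counts cells per cell type and per sample. It assumes dplyr and tidyr are installed and R 4.1 or later; the names per_type, per_sample, wide_again and the Total column are made up for this example.

```r
library(dplyr)
library(tidyr)

# Same untidy table as above
untidy_df <- data.frame(Sample = paste0("Sample", 1:7),
                        T_cells = c(72, 0, 12, 11, 4, 10, 164),
                        NK_cells = c(118, 24, 2, 0, 30, 4, 0),
                        Endothelial_cells = c(212, 49, 0, 29, 23, 4, 125))

tidy_df <- untidy_df |>
  pivot_longer(cols = contains("cells"),
               names_to = "Cell_types",
               values_to = "Nbr_of_cells")

# Total number of cells for each cell type
per_type <- tidy_df |>
  group_by(Cell_types) |>
  summarise(Total = sum(Nbr_of_cells))

# Total number of cells for each sample
per_sample <- tidy_df |>
  group_by(Sample) |>
  summarise(Total = sum(Nbr_of_cells))

# pivot_wider() is the reverse operation, back to one column per cell type
wide_again <- tidy_df |>
  pivot_wider(names_from = Cell_types, values_from = Nbr_of_cells)
```

The same group_by() and summarise() pattern works for either grouping variable, because the cell type is now an ordinary column rather than being spread across column names.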

Take a break & Read

To understand more about the power of tidy data, read section 5 of "R4DS".

Transform data

Now that your data is ready and you can use the pipe operator with your eyes closed, it's time to take a closer look at dplyr and stringr. Thanks to these packages, you will be able to manipulate and transform tables as you wish. As a bonus, this will be useful when you want to visualise your data!
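
As a small taste of how dplyr and stringr combine, the sketch below cleans up cell type labels inside a mutate() call. It assumes dplyr and stringr are installed and R 4.1 or later; the table cells and the columns Short_name, Label and Is_lymphocyte are invented for this illustration.

```r
library(dplyr)
library(stringr)

# A small illustrative table (values taken from Sample1 above)
cells <- tibble(Cell_types = c("T_cells", "NK_cells", "Endothelial_cells"),
                Nbr_of_cells = c(72, 118, 212))

cells_clean <- cells |>
  mutate(
    Short_name = str_remove(Cell_types, "_cells"),         # drop the "_cells" suffix
    Label = str_replace(Cell_types, "_", " "),             # replace the first "_" with a space
    Is_lymphocyte = str_detect(Cell_types, "^(T|NK)_")     # TRUE when the name starts with "T_" or "NK_"
  )

cells_clean
```

stringr functions are vectorised over the column, so they fit directly inside mutate() without any loop.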

Take a break & Read

Please read sections 12 to 19 of "R4DS" carefully.

Cheatsheets

You can find overviews of all tidyverse packages, and you can also download their cheatsheets: short documents that summarise the main functions of each package and how to use them.