4 Wrangling data with tidyverse: Reshaping and combining tables
4.1 Pivot tables - wider and longer data
Users of Excel may be familiar with the idea of pivot tables. These are functions that let us make our data tidier. To quote Wickham and Grolemund:
There are three interrelated rules which make a dataset tidy:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
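To make these rules concrete, here is a tiny sketch with made-up harvest numbers (the tables and column names are invented for illustration, and it uses pivot_longer(), which we'll meet in a moment). A table with one column per year violates the first rule, because "year" is a variable hiding in the column names:

```r
library(tidyverse)

# Untidy: the year variable is hiding in the column names
harvest_untidy <- tribble(
  ~berry,       ~yield_2021, ~yield_2022,
  "strawberry",          10,          12,
  "blueberry",            7,           9
)

# Tidy: each variable (berry, year, yield) has its own column and
# each observation (one berry in one year) has its own row
harvest_tidy <- harvest_untidy %>%
  pivot_longer(starts_with("yield_"),
               names_prefix = "yield_",
               names_to = "year",
               values_to = "yield")

harvest_tidy
```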
While these authors present “tidiness” of data as an objective property, I’d argue that data is always tidy for a specific purpose. For example, our data is relatively tidy with one row per tasting event (one person tasting one berry), but it still has an unruly number of variables (92 columns!!). You’ve already learned some tricks for dealing with large numbers of columns at once, like across() and other functions using select helpers, but we have to do this every time we use mutate(), summarize(), or a similar function.
We could also treat the attribute or question as an independent variable affecting the response. If we take this view, then the tidiest dataset actually has one row for each person’s response to a single question. If we want to make plots or do other modelling, this longer form is often more tractable and lets us do operations on the whole dataset with less code.
We can use the pivot_longer() function to make the implicit variable explicit and to make our data tidier.
berry_data %>%
  select(`Subject Code`, `Sample Name`, berry, starts_with("cata_"), starts_with("9pt")) %>% # for clarity
  pivot_longer(cols = starts_with("cata_"),
               names_prefix = "cata_",
               names_to = "attribute",
               values_to = "presence") ->
  berry_data_cata_long
#The names_prefix will be *removed* from the start of every column name
#before putting the rest of the name in the `names_to` column

berry_data_cata_long
## # A tibble: 270,252 × 10
## `Subject Code` `Sample Name` berry `9pt_appearance` `9pt_overall` `9pt_taste`
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1001 raspberry 6 rasp… 4 4 4
## 2 1001 raspberry 6 rasp… 4 4 4
## 3 1001 raspberry 6 rasp… 4 4 4
## 4 1001 raspberry 6 rasp… 4 4 4
## 5 1001 raspberry 6 rasp… 4 4 4
## 6 1001 raspberry 6 rasp… 4 4 4
## 7 1001 raspberry 6 rasp… 4 4 4
## 8 1001 raspberry 6 rasp… 4 4 4
## 9 1001 raspberry 6 rasp… 4 4 4
## 10 1001 raspberry 6 rasp… 4 4 4
## # ℹ 270,242 more rows
## # ℹ 4 more variables: `9pt_texture` <dbl>, `9pt_aroma` <dbl>, attribute <chr>,
## # presence <dbl>
Remember that tibbles and data.frames can only have one data type per column (logical > integer > numeric > character), however! If we have one row for each CATA, JAR, hedonic scale, AND free response question, the value column would have a mixture of different data types. This is why we have to tell pivot_longer() which cols to pull the names and values from.
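Here is a small sketch of that type clash, using a made-up one-row survey (the column names are invented). tidyr refuses to combine a numeric and a character column unless we explicitly coerce them:

```r
library(tidyverse)

mixed <- tibble(id = 1,
                liking_overall = 7,
                comment = "very sweet")

# This errors, because the `value` column can't be both <dbl> and <chr>:
# mixed %>% pivot_longer(c(liking_overall, comment))

# Coercing everything to character (the "biggest" type) works;
# older tidyr versions need values_transform = list(value = as.character)
mixed %>%
  pivot_longer(c(liking_overall, comment),
               values_transform = as.character)
```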
Now for each unique combination of Sample Name and Subject Code, we have 36 rows, one for each CATA question that was asked. The variables that weren’t listed in the cols argument are just replicated on each of these rows. Each of the 36 rows that represent Subject Code 1001’s CATA responses for raspberry 6 has the same Subject Code, Sample Name, berry, and various 9pt_ ratings as the other 35.
Sometimes we want to have “wider” or “untidy” data. We can use pivot_wider() to reverse the effects of pivot_longer().
berry_data_cata_long %>%
  pivot_wider(names_from = "attribute",
              values_from = "presence",
              names_prefix = "cata_") #pivot_wider *adds* the names_prefix
## # A tibble: 7,507 × 44
## `Subject Code` `Sample Name` berry `9pt_appearance` `9pt_overall` `9pt_taste`
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1001 raspberry 6 rasp… 4 4 4
## 2 1001 raspberry 5 rasp… 8 9 9
## 3 1001 raspberry 2 rasp… 4 3 3
## 4 1001 raspberry 3 rasp… 7 7 6
## 5 1001 raspberry 4 rasp… 7 4 3
## 6 1001 raspberry 1 rasp… 7 4 3
## 7 1002 raspberry 6 rasp… 6 4 4
## 8 1002 raspberry 5 rasp… 8 7 4
## 9 1002 raspberry 2 rasp… 8 7 6
## 10 1002 raspberry 3 rasp… 7 9 9
## # ℹ 7,497 more rows
## # ℹ 38 more variables: `9pt_texture` <dbl>, `9pt_aroma` <dbl>,
## # cata_appearance_unevencolor <dbl>, cata_appearance_misshapen <dbl>,
## # cata_appearance_creased <dbl>, cata_appearance_seedy <dbl>,
## # cata_appearance_bruised <dbl>, cata_appearance_notfresh <dbl>,
## # cata_appearance_fresh <dbl>, cata_appearance_goodshape <dbl>,
## # cata_appearance_goodquality <dbl>, cata_appearance_none <dbl>, …
Pivoting is an incredibly powerful and incredibly common data manipulation technique that will become even more powerful when we need to make complex graphs later. Different functions and analyses may require the data in different longer or wider formats, and you will often find yourself starting with even less tidy data than what we’ve provided.
For an example of this power, let’s imagine that we want to compare the 3 different liking scales by normalizing each by the mean() and sd() of that particular scale, then comparing average liking for each attribute of each berry across the three scales.
berry_data %>%
  pivot_longer(cols = starts_with(c("9pt_", "lms_", "us_")),
               names_to = c("scale", "attribute"),
               names_sep = "_",
               values_to = "rating",
               values_drop_na = TRUE) %>%
  group_by(scale) %>%
  mutate(normalized_rating = (rating - mean(rating)) / sd(rating)) %>%
  group_by(scale, attribute, berry) %>%
  summarize(avg_liking = mean(normalized_rating)) %>%
  pivot_wider(names_from = scale,
              values_from = avg_liking)
## `summarise()` has grouped output by 'scale', 'attribute'. You can override
## using the `.groups` argument.
## # A tibble: 17 × 5
## # Groups: attribute [5]
## attribute berry `9pt` lms us
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 appearance blackberry 0.284 0.364 0.327
## 2 appearance blueberry 0.381 0.424 0.462
## 3 appearance raspberry 0.246 0.236 0.223
## 4 appearance strawberry -0.164 -0.226 -0.220
## 5 aroma strawberry 0.0351 -0.0676 -0.0951
## 6 overall blackberry -0.177 -0.209 -0.177
## 7 overall blueberry 0.0234 0.0576 0.166
## 8 overall raspberry 0.0261 0.0300 0.0707
## 9 overall strawberry -0.150 -0.179 -0.200
## 10 taste blackberry -0.289 -0.301 -0.336
## 11 taste blueberry -0.0611 0.0202 0.0366
## 12 taste raspberry -0.0291 -0.0336 -0.0359
## 13 taste strawberry -0.306 -0.292 -0.339
## 14 texture blackberry -0.0467 0.00645 -0.0118
## 15 texture blueberry 0.159 0.202 0.284
## 16 texture raspberry -0.00677 -0.00607 0.0289
## 17 texture strawberry -0.0602 -0.0531 -0.0768
While pivoting may seem simple at first, it can also get pretty confusing! That example required two different pivots! We’ll be using these tools throughout the rest of the tutorial, so I wanted to give you some exposure, but mastering them takes trial and error. I recommend taking a look at the relevant chapter in Wickham and Grolemund for details.
4.2 Combining data
While we’ve been using a single dataset read in from one .csv file, this will not always be the way your data is stored when you start working with it. There may be several surveys you’re combining, or a separate file with survey responses, another with your blinding code key, and still another with data from a collaborator.
The tidyverse verb that you use for this combination depends on whether your datasets have matching columns/variables (say, data from two different years or locations of a project) or matching rows/observations (say, sensory and chemical data).
The “matching variables” case can also happen if we want to stack the outputs of multiple independent summarize() calls, such as making a stacked demographic table. Since the demographic data has been removed from our dataset, let’s instead try to make a table that reports our sample size for each berry type and each of the three kinds of liking scales.
If we wanted the combinations of these levels (e.g., the sample size of 9-pt scale ratings of strawberries), we could use one summarize() or count() call, but we need to do a little extra work if we want to summarize() the whole dataset in more than one way.
berry_type_counts <-
  berry_data %>%
  group_by(berry) %>%
  summarize(n = n_distinct(`Subject Code`, test_day)) %>%
  rename(Level = berry)

berry_scale_counts <-
  berry_data %>%
  pivot_longer(ends_with("_overall"),
               names_sep = "_", names_to = c("Scale", NA),
               values_to = "Used", values_drop_na = TRUE) %>%
  group_by(Scale) %>%
  summarize(n = n_distinct(`Subject Code`, test_day)) %>%
  rename(Level = Scale)
#These are both summarizing the same set of observations into different groups:
sum(berry_scale_counts$n)
## [1] 1301
sum(berry_type_counts$n)
## [1] 1301
#Hence needing two summarize() calls
bind_rows(Berry = berry_type_counts,
          Scale = berry_scale_counts,
          .id = "Variable") #This makes a new column called "Variable" which will
## # A tibble: 7 × 3
## Variable Level n
## <chr> <chr> <int>
## 1 Berry blackberry 299
## 2 Berry blueberry 313
## 3 Berry raspberry 358
## 4 Berry strawberry 331
## 5 Scale 9pt 423
## 6 Scale lms 435
## 7 Scale us 443
#specify which of the two tables each row came from
For multiple tables that share the same observations, we might want to add follow-up survey data from the same participants, or genetic information about the berries. In the latter case, our table of berry genetics would have fewer rows, but the tidyverse actually handles both with the same verbs.
There is a bind_cols() function, but it’s easy to accidentally have the raspberries on top in one table and the blueberries on top in another, or to have one table sorted alphabetically and another by blinding code or participant ID, so it’s safer to use the *_join() functions if you’re adding columns instead of rows. left_join() is the most common.
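Here is a quick sketch of that failure mode with two made-up tables (the cultivar names are invented). The rows line up by key, not by position:

```r
library(tidyverse)

ratings  <- tibble(berry  = c("raspberry", "blueberry"),
                   liking = c(7, 5))
# Same berries, opposite row order:
genetics <- tibble(berry    = c("blueberry", "raspberry"),
                   cultivar = c("Draper", "Heritage"))

# bind_cols(ratings, genetics) would pair row 1 with row 1, putting
# raspberry's liking next to blueberry's cultivar.
# left_join() matches on the key instead, so row order doesn't matter:
left_join(ratings, genetics, by = "berry")
```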
We’ll make up some demographic data to join to our existing table.
demographics <-
  berry_data %>%
  distinct(`Subject Code`) %>%
  mutate(Age = round(rnorm(n(), 45, 6)),
         Gender = ifelse(rbinom(n(), 1, 0.6), "F", "M"),
         Location = sample(state.name, n(), replace = TRUE)) %>%
  rename(ID = `Subject Code`) #To demonstrate how you can manually configure
                              #the columns used to align the datasets

#And now we can join it:
berry_data %>%
  select(-Age, -Gender) %>% #Getting rid of the empty demographic columns first
  left_join(demographics, by = c("Subject Code" = "ID"))
## # A tibble: 7,507 × 93
## `Subject Code` `Participant Name` `Start Time (UTC)` `End Time (UTC)`
## <dbl> <dbl> <chr> <chr>
## 1 1001 1001 6/13/2019 21:05 6/13/2019 21:09
## 2 1001 1001 6/13/2019 20:55 6/13/2019 20:59
## 3 1001 1001 6/13/2019 20:49 6/13/2019 20:53
## 4 1001 1001 6/13/2019 20:45 6/13/2019 20:48
## 5 1001 1001 6/13/2019 21:00 6/13/2019 21:03
## 6 1001 1001 6/13/2019 21:10 6/13/2019 21:13
## 7 1002 1002 6/13/2019 20:08 6/13/2019 20:11
## 8 1002 1002 6/13/2019 19:57 6/13/2019 20:01
## 9 1002 1002 6/13/2019 20:13 6/13/2019 20:17
## 10 1002 1002 6/13/2019 20:03 6/13/2019 20:07
## # ℹ 7,497 more rows
## # ℹ 89 more variables: `Serving Position` <dbl>, `Sample Identifier` <dbl>,
## # `Sample Name` <chr>, `9pt_appearance` <dbl>, pre_expectation <dbl>,
## # jar_color <dbl>, jar_gloss <dbl>, jar_size <dbl>,
## # cata_appearance_unevencolor <dbl>, cata_appearance_misshapen <dbl>,
## # cata_appearance_creased <dbl>, cata_appearance_seedy <dbl>,
## # cata_appearance_bruised <dbl>, cata_appearance_notfresh <dbl>, …
anti_join() can be used to remove data. If you have a list of participants whose responses had data quality issues, you can put it in the second argument of anti_join() to return the lefthand table with those entries removed.
We do have some repeat participants in the berry tests, because the actual study involved repeated measures. But in an online survey, repeat responses could be a problem requiring the removal of all of that participant’s data. Let’s demonstrate how we’d do that with anti_join():
problem_participants <-
  berry_data %>%
  group_by(`Participant Name`) %>%
  summarize(sessions = n_distinct(test_day)) %>%
  filter(sessions > 1)

#We don't need to specify by if they share the column name they're joining on
#and NO OTHERS
berry_data %>%
  anti_join(problem_participants)
## Joining with `by = join_by(`Participant Name`)`
## # A tibble: 3,755 × 92
## `Subject Code` `Participant Name` Gender Age `Start Time (UTC)`
## <dbl> <dbl> <lgl> <lgl> <chr>
## 1 1001 1001 NA NA 6/13/2019 21:05
## 2 1001 1001 NA NA 6/13/2019 20:55
## 3 1001 1001 NA NA 6/13/2019 20:49
## 4 1001 1001 NA NA 6/13/2019 20:45
## 5 1001 1001 NA NA 6/13/2019 21:00
## 6 1001 1001 NA NA 6/13/2019 21:10
## 7 1002 1002 NA NA 6/13/2019 20:08
## 8 1002 1002 NA NA 6/13/2019 19:57
## 9 1002 1002 NA NA 6/13/2019 20:13
## 10 1002 1002 NA NA 6/13/2019 20:03
## # ℹ 3,745 more rows
## # ℹ 87 more variables: `End Time (UTC)` <chr>, `Serving Position` <dbl>,
## # `Sample Identifier` <dbl>, `Sample Name` <chr>, `9pt_appearance` <dbl>,
## # pre_expectation <dbl>, jar_color <dbl>, jar_gloss <dbl>, jar_size <dbl>,
## # cata_appearance_unevencolor <dbl>, cata_appearance_misshapen <dbl>,
## # cata_appearance_creased <dbl>, cata_appearance_seedy <dbl>,
## # cata_appearance_bruised <dbl>, cata_appearance_notfresh <dbl>, …
anti_join() also gives priority to the first/lefthand argument, usually the one you’re piping in with %>%. It returns the rows in your left tibble that don’t have corresponding entries in the righthand one, and it does not add the columns that are unique to your right table: there is no sessions column in the output.
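Both behaviors are easier to see on a tiny made-up example (the IDs and columns are invented):

```r
library(tidyverse)

responses <- tibble(id = c(1, 2, 3), liking = c(7, 4, 9))
flagged   <- tibble(id = 2, sessions = 5)

# Keeps only the rows of `responses` with no match in `flagged`;
# the columns unique to `flagged` (like sessions) don't come along
anti_join(responses, flagged, by = "id")
```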
4.3 Utilities for data management
Honestly, the amount of power in tidyverse is way more than we can cover today, and is covered more comprehensively (obviously) by Wickham and Grolemund. However, I want to name a few more utilities we will make a lot of use of today (and you will want to know about for your own work).
4.3.1 Rename your columns
Often you will import data with bad column names, or you’ll realize you need to rename variables during your workflow. For this, you can use the rename() function, which is also one way to get around having to type a bunch of backticks forever:
names(berry_data)
## [1] "Subject Code" "Participant Name"
## [3] "Gender" "Age"
## [5] "Start Time (UTC)" "End Time (UTC)"
## [7] "Serving Position" "Sample Identifier"
## [9] "Sample Name" "9pt_appearance"
## [11] "pre_expectation" "jar_color"
## [13] "jar_gloss" "jar_size"
## [15] "cata_appearance_unevencolor" "cata_appearance_misshapen"
## [17] "cata_appearance_creased" "cata_appearance_seedy"
## [19] "cata_appearance_bruised" "cata_appearance_notfresh"
## [21] "cata_appearance_fresh" "cata_appearance_goodshape"
## [23] "cata_appearance_goodquality" "cata_appearance_none"
## [25] "9pt_overall" "verbal_likes"
## [27] "verbal_dislikes" "9pt_taste"
## [29] "grid_sweetness" "grid_tartness"
## [31] "grid_raspberryflavor" "jar_sweetness"
## [33] "jar_tartness" "cata_taste_floral"
## [35] "cata_taste_berry" "cata_taste_green"
## [37] "cata_taste_grassy" "cata_taste_fermented"
## [39] "cata_taste_tropical" "cata_taste_fruity"
## [41] "cata_taste_citrus" "cata_taste_earthy"
## [43] "cata_taste_candy" "cata_taste_none"
## [45] "9pt_texture" "grid_seediness"
## [47] "grid_firmness" "grid_juiciness"
## [49] "jar_firmness" "jar_juciness"
## [51] "post_expectation" "price"
## [53] "product_tier" "purchase_intent"
## [55] "subject" "test_day"
## [57] "us_appearance" "us_overall"
## [59] "us_taste" "us_texture"
## [61] "lms_appearance" "lms_overall"
## [63] "lms_taste" "lms_texture"
## [65] "cata_appearane_bruised" "cata_appearance_goodshapre"
## [67] "cata_appearance_goodcolor" "grid_blackberryflavor"
## [69] "cata_taste_cinnamon" "cata_taste_lemon"
## [71] "cata_taste_clove" "cata_taste_minty"
## [73] "cata_taste_grape" "grid_crispness"
## [75] "jar_crispness" "jar_juiciness"
## [77] "cata_appearane_creased" "grid_blueberryflavor"
## [79] "cata_taste_piney" "cata_taste_peachy"
## [81] "9pt_aroma" "grid_strawberryflavor"
## [83] "cata_taste_caramel" "cata_taste_grapey"
## [85] "cata_taste_melon" "cata_taste_cherry"
## [87] "grid_crunchiness" "jar_crunch"
## [89] "us_aroma" "lms_aroma"
## [91] "berry" "sample"
berry_data %>%
  rename(Sample = `Sample Name`,
         Subject = `Participant Name`) %>%
  select(Subject, Sample, everything()) #no more backticks!
## # A tibble: 7,507 × 92
## Subject Sample `Subject Code` Gender Age `Start Time (UTC)`
## <dbl> <chr> <dbl> <lgl> <lgl> <chr>
## 1 1001 raspberry 6 1001 NA NA 6/13/2019 21:05
## 2 1001 raspberry 5 1001 NA NA 6/13/2019 20:55
## 3 1001 raspberry 2 1001 NA NA 6/13/2019 20:49
## 4 1001 raspberry 3 1001 NA NA 6/13/2019 20:45
## 5 1001 raspberry 4 1001 NA NA 6/13/2019 21:00
## 6 1001 raspberry 1 1001 NA NA 6/13/2019 21:10
## 7 1002 raspberry 6 1002 NA NA 6/13/2019 20:08
## 8 1002 raspberry 5 1002 NA NA 6/13/2019 19:57
## 9 1002 raspberry 2 1002 NA NA 6/13/2019 20:13
## 10 1002 raspberry 3 1002 NA NA 6/13/2019 20:03
## # ℹ 7,497 more rows
## # ℹ 86 more variables: `End Time (UTC)` <chr>, `Serving Position` <dbl>,
## # `Sample Identifier` <dbl>, `9pt_appearance` <dbl>, pre_expectation <dbl>,
## # jar_color <dbl>, jar_gloss <dbl>, jar_size <dbl>,
## # cata_appearance_unevencolor <dbl>, cata_appearance_misshapen <dbl>,
## # cata_appearance_creased <dbl>, cata_appearance_seedy <dbl>,
## # cata_appearance_bruised <dbl>, cata_appearance_notfresh <dbl>, …
You can also rename by position, but be sure you have the right order and don’t change the input data later:
berry_data %>%
  rename(Subject = 1)
## # A tibble: 7,507 × 92
## Subject `Participant Name` Gender Age `Start Time (UTC)` `End Time (UTC)`
## <dbl> <dbl> <lgl> <lgl> <chr> <chr>
## 1 1001 1001 NA NA 6/13/2019 21:05 6/13/2019 21:09
## 2 1001 1001 NA NA 6/13/2019 20:55 6/13/2019 20:59
## 3 1001 1001 NA NA 6/13/2019 20:49 6/13/2019 20:53
## 4 1001 1001 NA NA 6/13/2019 20:45 6/13/2019 20:48
## 5 1001 1001 NA NA 6/13/2019 21:00 6/13/2019 21:03
## 6 1001 1001 NA NA 6/13/2019 21:10 6/13/2019 21:13
## 7 1002 1002 NA NA 6/13/2019 20:08 6/13/2019 20:11
## 8 1002 1002 NA NA 6/13/2019 19:57 6/13/2019 20:01
## 9 1002 1002 NA NA 6/13/2019 20:13 6/13/2019 20:17
## 10 1002 1002 NA NA 6/13/2019 20:03 6/13/2019 20:07
## # ℹ 7,497 more rows
## # ℹ 86 more variables: `Serving Position` <dbl>, `Sample Identifier` <dbl>,
## # `Sample Name` <chr>, `9pt_appearance` <dbl>, pre_expectation <dbl>,
## # jar_color <dbl>, jar_gloss <dbl>, jar_size <dbl>,
## # cata_appearance_unevencolor <dbl>, cata_appearance_misshapen <dbl>,
## # cata_appearance_creased <dbl>, cata_appearance_seedy <dbl>,
## # cata_appearance_bruised <dbl>, cata_appearance_notfresh <dbl>, …
4.3.2 Relocate your columns
If you mutate() columns or just have a big data set with a lot of variables, often you want to move columns around. This is a pain to do with [], but again tidyverse has a utility to move things around easily: relocate().
berry_data %>%
  relocate(`Sample Name`) # giving no other arguments will move to front
## # A tibble: 7,507 × 92
## `Sample Name` `Subject Code` `Participant Name` Gender Age
## <chr> <dbl> <dbl> <lgl> <lgl>
## 1 raspberry 6 1001 1001 NA NA
## 2 raspberry 5 1001 1001 NA NA
## 3 raspberry 2 1001 1001 NA NA
## 4 raspberry 3 1001 1001 NA NA
## 5 raspberry 4 1001 1001 NA NA
## 6 raspberry 1 1001 1001 NA NA
## 7 raspberry 6 1002 1002 NA NA
## 8 raspberry 5 1002 1002 NA NA
## 9 raspberry 2 1002 1002 NA NA
## 10 raspberry 3 1002 1002 NA NA
## # ℹ 7,497 more rows
## # ℹ 87 more variables: `Start Time (UTC)` <chr>, `End Time (UTC)` <chr>,
## # `Serving Position` <dbl>, `Sample Identifier` <dbl>,
## # `9pt_appearance` <dbl>, pre_expectation <dbl>, jar_color <dbl>,
## # jar_gloss <dbl>, jar_size <dbl>, cata_appearance_unevencolor <dbl>,
## # cata_appearance_misshapen <dbl>, cata_appearance_creased <dbl>,
## # cata_appearance_seedy <dbl>, cata_appearance_bruised <dbl>, …
You can also use relocate() to specify positions:
berry_data %>%
  relocate(Gender, Age, `Subject Code`, `Start Time (UTC)`,
           `End Time (UTC)`, `Sample Identifier`,
           # move repetitive and empty columns to the end
           .after = berry)
## # A tibble: 7,507 × 92
## `Participant Name` `Serving Position` `Sample Name` `9pt_appearance`
## <dbl> <dbl> <chr> <dbl>
## 1 1001 5 raspberry 6 4
## 2 1001 3 raspberry 5 8
## 3 1001 2 raspberry 2 4
## 4 1001 1 raspberry 3 7
## 5 1001 4 raspberry 4 7
## 6 1001 6 raspberry 1 7
## 7 1002 3 raspberry 6 6
## 8 1002 1 raspberry 5 8
## 9 1002 4 raspberry 2 8
## 10 1002 2 raspberry 3 7
## # ℹ 7,497 more rows
## # ℹ 88 more variables: pre_expectation <dbl>, jar_color <dbl>, jar_gloss <dbl>,
## # jar_size <dbl>, cata_appearance_unevencolor <dbl>,
## # cata_appearance_misshapen <dbl>, cata_appearance_creased <dbl>,
## # cata_appearance_seedy <dbl>, cata_appearance_bruised <dbl>,
## # cata_appearance_notfresh <dbl>, cata_appearance_fresh <dbl>,
## # cata_appearance_goodshape <dbl>, cata_appearance_goodquality <dbl>, …
4.3.3 Remove missing values
Missing values (the NAs you’ve been seeing so much of) can be a huge pain, because they make more of themselves.
mean(berry_data$price) #This column had no NAs, so we can take the average
## [1] 2.962896
mean(berry_data$`9pt_overall`) #This column has some NAs, so we get NA
## [1] NA
Many base R functions that take a vector and return some mathematical summary (e.g., mean(), sum(), sd()) have an argument called na.rm that can be set to TRUE to act as if the missing values aren’t there at all.
mean(berry_data$`9pt_overall`, na.rm = TRUE) #We get the average of only the valid numbers
## [1] 5.679346
sum(berry_data$`9pt_overall`, na.rm = TRUE) /
length(berry_data$`9pt_overall`) #The denominator is NOT the same as the total number of values anymore
## [1] 1.84974
sum(berry_data$`9pt_overall`, na.rm = TRUE) /
sum(!is.na(berry_data$`9pt_overall`)) #The denominator is the number of non-NA values
## [1] 5.679346
However, this isn’t always convenient. Sometimes it may be easier to simply get rid of all observations with any missing values, which tidyverse has a handy drop_na() function for:
berry_data %>%
  drop_na() #All of our rows have *some* NA values, so this returns nothing
## # A tibble: 0 × 92
## # ℹ 92 variables: Subject Code <dbl>, Participant Name <dbl>, Gender <lgl>,
## # Age <lgl>, Start Time (UTC) <chr>, End Time (UTC) <chr>,
## # Serving Position <dbl>, Sample Identifier <dbl>, Sample Name <chr>,
## # 9pt_appearance <dbl>, pre_expectation <dbl>, jar_color <dbl>,
## # jar_gloss <dbl>, jar_size <dbl>, cata_appearance_unevencolor <dbl>,
## # cata_appearance_misshapen <dbl>, cata_appearance_creased <dbl>,
## # cata_appearance_seedy <dbl>, cata_appearance_bruised <dbl>, …
berry_data %>%
  select(`Participant Name`, `Sample Name`, contains("9pt_")) %>%
  drop_na() #Now we get only respondents who answered all 9-point liking questions.
## # A tibble: 600 × 7
## `Participant Name` `Sample Name` `9pt_appearance` `9pt_overall` `9pt_taste`
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 2001 Strawberry4 5 5 4
## 2 2001 Strawberry1 6 2 2
## 3 2001 Strawberry2 1 6 6
## 4 2001 Strawberry6 3 3 2
## 5 2001 Strawberry3 8 7 8
## 6 2001 Strawberry5 4 6 6
## 7 2002 Strawberry4 2 4 3
## 8 2002 Strawberry1 4 6 7
## 9 2002 Strawberry2 3 6 6
## 10 2002 Strawberry6 7 7 8
## # ℹ 590 more rows
## # ℹ 2 more variables: `9pt_texture` <dbl>, `9pt_aroma` <dbl>
You can also use drop_na() with specific columns, which is useful to avoid losing all of your data!
berry_data %>%
  drop_na(`9pt_overall`)
## # A tibble: 2,445 × 92
## `Subject Code` `Participant Name` Gender Age `Start Time (UTC)`
## <dbl> <dbl> <lgl> <lgl> <chr>
## 1 1001 1001 NA NA 6/13/2019 21:05
## 2 1001 1001 NA NA 6/13/2019 20:55
## 3 1001 1001 NA NA 6/13/2019 20:49
## 4 1001 1001 NA NA 6/13/2019 20:45
## 5 1001 1001 NA NA 6/13/2019 21:00
## 6 1001 1001 NA NA 6/13/2019 21:10
## 7 1002 1002 NA NA 6/13/2019 20:08
## 8 1002 1002 NA NA 6/13/2019 19:57
## 9 1002 1002 NA NA 6/13/2019 20:13
## 10 1002 1002 NA NA 6/13/2019 20:03
## # ℹ 2,435 more rows
## # ℹ 87 more variables: `End Time (UTC)` <chr>, `Serving Position` <dbl>,
## # `Sample Identifier` <dbl>, `Sample Name` <chr>, `9pt_appearance` <dbl>,
## # pre_expectation <dbl>, jar_color <dbl>, jar_gloss <dbl>, jar_size <dbl>,
## # cata_appearance_unevencolor <dbl>, cata_appearance_misshapen <dbl>,
## # cata_appearance_creased <dbl>, cata_appearance_seedy <dbl>,
## # cata_appearance_bruised <dbl>, cata_appearance_notfresh <dbl>, …
Or you may want to remove any columns/variables that have some missing data, which is one of the most common uses of where():
#Only 38 columns with absolutely no missing values.
#This loses all of the liking data.
berry_data %>%
  select(where(~none(.x, is.na)))
## # A tibble: 7,507 × 38
## `Subject Code` `Participant Name` `Start Time (UTC)` `End Time (UTC)`
## <dbl> <dbl> <chr> <chr>
## 1 1001 1001 6/13/2019 21:05 6/13/2019 21:09
## 2 1001 1001 6/13/2019 20:55 6/13/2019 20:59
## 3 1001 1001 6/13/2019 20:49 6/13/2019 20:53
## 4 1001 1001 6/13/2019 20:45 6/13/2019 20:48
## 5 1001 1001 6/13/2019 21:00 6/13/2019 21:03
## 6 1001 1001 6/13/2019 21:10 6/13/2019 21:13
## 7 1002 1002 6/13/2019 20:08 6/13/2019 20:11
## 8 1002 1002 6/13/2019 19:57 6/13/2019 20:01
## 9 1002 1002 6/13/2019 20:13 6/13/2019 20:17
## 10 1002 1002 6/13/2019 20:03 6/13/2019 20:07
## # ℹ 7,497 more rows
## # ℹ 34 more variables: `Serving Position` <dbl>, `Sample Identifier` <dbl>,
## # `Sample Name` <chr>, pre_expectation <dbl>, jar_color <dbl>,
## # jar_size <dbl>, cata_appearance_unevencolor <dbl>,
## # cata_appearance_misshapen <dbl>, cata_appearance_notfresh <dbl>,
## # cata_appearance_fresh <dbl>, cata_appearance_goodquality <dbl>,
## # cata_appearance_none <dbl>, grid_sweetness <dbl>, grid_tartness <dbl>, …
Both of the above methods guarantee that you will have an output with absolutely no missing data, but may be over-zealous if, say, everyone answered overall liking on one of the three scales and we want to do some work to combine those later. filter() and select() can be combined to do infinitely complex missing value removal.
#You'll notice that only strawberries have any non-NA liking values, actually
berry_data %>%
  select(where(~!every(.x, is.na))) %>% #remove columns with no data
  filter(!(is.na(`9pt_aroma`) & is.na(lms_aroma) & is.na(us_aroma)))
## # A tibble: 1,986 × 90
## `Subject Code` `Participant Name` `Start Time (UTC)` `End Time (UTC)`
## <dbl> <dbl> <chr> <chr>
## 1 2001 2001 6/24/2019 20:18 6/24/2019 20:22
## 2 2001 2001 6/24/2019 20:30 6/24/2019 20:35
## 3 2001 2001 6/24/2019 20:23 6/24/2019 20:28
## 4 2001 2001 6/24/2019 20:14 6/24/2019 20:17
## 5 2001 2001 6/24/2019 20:35 6/24/2019 20:39
## 6 2001 2001 6/24/2019 20:08 6/24/2019 20:13
## 7 2002 2002 6/24/2019 20:21 6/24/2019 20:25
## 8 2002 2002 6/24/2019 20:14 6/24/2019 20:17
## 9 2002 2002 6/24/2019 19:59 6/24/2019 20:04
## 10 2002 2002 6/24/2019 20:09 6/24/2019 20:13
## # ℹ 1,976 more rows
## # ℹ 86 more variables: `Serving Position` <dbl>, `Sample Identifier` <dbl>,
## # `Sample Name` <chr>, `9pt_appearance` <dbl>, pre_expectation <dbl>,
## # jar_color <dbl>, jar_gloss <dbl>, jar_size <dbl>,
## # cata_appearance_unevencolor <dbl>, cata_appearance_misshapen <dbl>,
## # cata_appearance_creased <dbl>, cata_appearance_seedy <dbl>,
## # cata_appearance_bruised <dbl>, cata_appearance_notfresh <dbl>, …
4.3.4 Counting categorical variables
Often, we’ll want to count how many observations are in a group without having to actually count ourselves. Do we have enough observations for each sample? How many people in each demographic category do we have? Is it balanced?
You’ve already written code to do this, if you’ve been following along! summarize() is incredibly powerful, and it will happily use any function that takes a vector or vectors and returns a single value. This includes categorical or chr data!
berry_data %>%
  group_by(`Sample Name`) %>%
  summarize(n_responses = n())
## # A tibble: 23 × 2
## `Sample Name` n_responses
## <chr> <int>
## 1 Blackberry 1 299
## 2 Blackberry 2 299
## 3 Blackberry 3 299
## 4 Blackberry 4 299
## 5 Blackberry 5 299
## 6 Blueberry 1 313
## 7 Blueberry 2 313
## 8 Blueberry 3 313
## 9 Blueberry 4 313
## 10 Blueberry 5 313
## # ℹ 13 more rows
We can also do this with a little less typing using count(), which is handy if we’re repeatedly counting observations in various categories (like for CATA tests and Correspondence Analyses):
#Counts the number of observations (rows) of each berry
berry_data %>%
  count(`Sample Name`)
## # A tibble: 23 × 2
## `Sample Name` n
## <chr> <int>
## 1 Blackberry 1 299
## 2 Blackberry 2 299
## 3 Blackberry 3 299
## 4 Blackberry 4 299
## 5 Blackberry 5 299
## 6 Blueberry 1 313
## 7 Blueberry 2 313
## 8 Blueberry 3 313
## 9 Blueberry 4 313
## 10 Blueberry 5 313
## # ℹ 13 more rows
#Number of observations, *not necessarily* the number of participants!
berry_data %>%
  count(berry)
## # A tibble: 4 × 2
## berry n
## <chr> <int>
## 1 blackberry 1495
## 2 blueberry 1878
## 3 raspberry 2148
## 4 strawberry 1986
Depending on the shape of your data, the number of rows may or may not be the count you actually want. Maybe we want to know how many people participated in each day of testing, but we have one row per tasting event.
We could use pivot_wider() to reshape our data first, so we have one row per completed tasting session, but since count() drops most columns anyways, we only really need one row for each thing we care about. distinct() can be handy here. It keeps one row for each distinct combination of the columns you give it, getting rid of all other columns, so it doesn’t have to worry about the fact that one person gave multiple different 9pt_overall ratings per test_day.
#Two columns, with one row for each completed tasting session
#(each reflects 5-6 rows in the initial data)
berry_data %>%
  distinct(test_day, `Subject Code`)
## # A tibble: 1,301 × 2
## test_day `Subject Code`
## <chr> <dbl>
## 1 Raspberry Day 1 1001
## 2 Raspberry Day 1 1002
## 3 Raspberry Day 1 1004
## 4 Raspberry Day 1 1005
## 5 Raspberry Day 1 1006
## 6 Raspberry Day 1 1007
## 7 Raspberry Day 1 1008
## 8 Raspberry Day 1 1009
## 9 Raspberry Day 1 1010
## 10 Raspberry Day 1 1011
## # ℹ 1,291 more rows
#Counts the number of participants per testing day
berry_data %>%
  distinct(test_day, `Subject Code`) %>%
  count(test_day)
## # A tibble: 12 × 2
## test_day n
## <chr> <int>
## 1 Blackberry Day 1 108
## 2 Blackberry Day 2 88
## 3 Blackberry Day 3 103
## 4 Blueberry Day 1 102
## 5 Blueberry Day 2 114
## 6 Blueberry Day 3 97
## 7 Raspberry Day 1 131
## 8 Raspberry Day 2 120
## 9 Raspberry Day 3 107
## 10 Strawberry Day 1 108
## 11 Strawberry Day 2 106
## 12 Strawberry Day 3 117
4.3.5 Sort your data
More frequently, we will want to rearrange our rows, which can be done with arrange(). All you have to do is give arrange() one or more columns to sort the data by. You can use either desc() or the - shortcut to sort in reverse order. Whether ascending or descending, arrange() places missing values at the bottom.
berry_data %>%
  arrange(desc(lms_overall)) %>%
  # which berries had the highest liking on the lms?
  select(`Sample Name`, `Participant Name`, lms_overall)
## # A tibble: 7,507 × 3
## `Sample Name` `Participant Name` lms_overall
## <chr> <dbl> <dbl>
## 1 raspberry 2 5135 100
## 2 raspberry 6 7033 100
## 3 Blackberry 4 5273 100
## 4 Blackberry 4 7135 100
## 5 Blackberry 3 7135 100
## 6 Blackberry 5 7135 100
## 7 Blueberry 5 5113 100
## 8 Blueberry 6 5127 100
## 9 Blueberry 3 7040 100
## 10 Strawberry1 1273 100
## # ℹ 7,497 more rows
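To see the - shortcut and the handling of missing values on something small, here is a made-up three-sample table (the values are invented):

```r
library(tidyverse)

scores <- tibble(sample = c("A", "B", "C"),
                 liking = c(3, NA, 8))

# -liking sorts descending, just like desc(liking), for numeric columns;
# the row with the missing value lands at the bottom either way
scores %>% arrange(-liking)
```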
You can sort alphabetically as well:
# using a dataset of US States for demonstration
tibble(state_name = state.name, area = state.area) %>%
# sort states reverse-alphabetically
arrange(desc(state_name))
## # A tibble: 50 × 2
## state_name area
## <chr> <dbl>
## 1 Wyoming 97914
## 2 Wisconsin 56154
## 3 West Virginia 24181
## 4 Washington 68192
## 5 Virginia 40815
## 6 Vermont 9609
## 7 Utah 84916
## 8 Texas 267339
## 9 Tennessee 42244
## 10 South Dakota 77047
## # ℹ 40 more rows
It’s not a bad idea to restart your R session here. Make sure to save your work, but a clean Environment is great when we’re shifting topics.
You can accomplish this by going to Session > Restart R in the menu.
Then, we want to make sure to re-load our packages and import our data.
# The packages we're using
library(tidyverse)
library(ca)
## Warning: package 'ca' was built under R version 4.3.1
# The dataset
berry_data <- read_csv("data/clt-berry-data.csv")
## Rows: 7507 Columns: 92
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Start Time (UTC), End Time (UTC), Sample Name, verbal_likes, verba...
## dbl (83): Subject Code, Participant Name, Serving Position, Sample Identifie...
## lgl (2): Gender, Age
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.