5 Basic pivoting
library(tidyverse)
library(dcldata)
Most of the data you’ll encounter won’t be tidy, and it will be your job to figure out how to make it tidy. In this chapter, you’ll learn about two of the most important tidying tools: pivot_longer()
and pivot_wider()
.
First, recall the characteristics of tidy data:
- Each value has its own cell.
- Each variable has its own column.
- Each observation has its own row.
Non-tidy data will not fulfill one or more of these characteristics.
5.1 Longer
example_eagle_nests
contains data on the number of bald eagle nesting sites across multiple regions and years.
# Source: US Fish and Wildlife Service
example_eagle_nests#> # A tibble: 3 × 3
#> region `2007` `2009`
#> <chr> <dbl> <dbl>
#> 1 Pacific 1039 2587
#> 2 Southwest 51 176
#> 3 Rocky Mountains and Plains 200 338
The data isn’t tidy. In the following steps, we’ll show you how to tidy example_eagle_nests
using pivot_longer()
.
Step 1: Identify the variables.
There are three variables in this dataset:
region
: The US region where the nests are located.year
: The year the nests were found.num_nests
: The number of nests found.
Only one of these variables (region
) is currently a column. Values of year
are currently stored horizontally as column names, and values of num_nests
are stored as values of 2007
and 2009
. In order for this data to be tidy, we’ll need to pivot 2007
and 2009
into a year
column, and the values of 2007
and 2009
into a num_nests
column.
Now that we’ve identified the variables, we can start filling in our call to pivot_longer()
. We’ll need three arguments, which we’ll identify over the next three steps.
%>%
example_eagle_nests pivot_longer(
# Step 2
# Step 3
# Step 4
)
Step 2: Identify the columns to pivot.
To decide which columns to pivot, identify which columns are keeping the data from being tidy. In our example, those columns are 2007
and 2009
.
2007
and 2009
are actually values of year
, not variables themselves, and their values are actually values of num_nests
.
The cols
argument controls which columns pivot_longer()
pivots.
%>%
example_eagle_nests pivot_longer(
cols = c(`2007`, `2009`),
# Step 3
# Step 4
)
cols
is similar to select()
. You can specify columns by name, with contains()
, starts_with()
, etc. Here, we have to wrap 2007
and 2009
in backticks (` `
) because they start with numbers.
Step 3: Name the column that will store the values from the column names.
Now, we’re just going to focus on the columns we identified in cols
. Ultimately, pivot_longer()
is going to move both the names of these columns and their values into new, separate columns.
First, we’ll focus on the column names: 2007
and 2009
. pivot_longer()
’s names_to
argument controls the name of the column that will store the old column names. We want to name this new column "year"
. The argument is called names_to
because you’re specifying which column to move the column names to.
%>%
example_eagle_nests pivot_longer(
cols = c(`2007`, `2009`),
names_to = "year",
# Step 4
)
Note that the argument to names_to
has to be in quotes, while the arguments to cols
do not. It’s easy to get confused about which pivot arguments need to be quoted. Here’s the general rule: if you’re identifying an existing column (e.g., 2007
), do not quote. If you’re talking about a column that does not currently exist (e.g., year
), quote it.
Now, pivot_longer()
will create a new column called year
and fill it with the column names 2007
and 2009
. Because we specified two columns in cols
, we will get two values of year
for each region.
Step 4: Name the column that will store the column values.
Now, we need to name the column that will store the values from 2007
and 2009
.
Just as names_to
controls the name of the column for the names, values_to
controls the name of column for the values. In example_eagle_nests
, the column values represent the number of nests, so we’ll name the new column "num_nests"
.
%>%
example_eagle_nests pivot_longer(
cols = c(`2007`, `2009`),
names_to = "year",
values_to = "num_nests"
)
Again, notice that you have to quote any argument to values_to
because it references a column that does not exist.
pivot_longer()
will now move the values from 2007
and 2009
to a column called num_nests
.
Here’s the function call again with the results.
%>%
example_eagle_nests pivot_longer(
cols = c(`2007`, `2009`),
names_to = "year",
values_to = "num_nests"
)#> # A tibble: 6 × 3
#> region year num_nests
#> <chr> <chr> <dbl>
#> 1 Pacific 2007 1039
#> 2 Pacific 2009 2587
#> 3 Southwest 2007 51
#> 4 Southwest 2009 176
#> 5 Rocky Mountains and Plains 2007 200
#> 6 Rocky Mountains and Plains 2009 338
The data is now tidy! pivot_longer()
has many optional arguments, but cols
, names_to
, and values_to
will cover most of your use-cases. The Missing values section below and the Advanced pivoting chapter cover some more specialized uses of pivot_longer()
.
Here’s another eagle-related example. example_eagle_pairs
contains data on the number of observed bald eagle breeding pairs across years and states.
example_eagle_pairs#> # A tibble: 48 × 12
#> state state_abbr `1997` `1998` `1999` `2000` `2001` `2002` `2003` `2004`
#> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 Alabama AL 22 23 26 27 NA NA 47 NA
#> 2 Arizona AZ 34 36 38 37 37 43 43 NA
#> 3 Arkansas AR 24 29 34 36 NA NA 36 42
#> 4 California CA 142 148 151 NA NA NA 160 NA
#> 5 Colorado CO 29 27 29 42 45 NA NA NA
#> 6 Connecticut CT 2 2 2 4 6 8 8 NA
#> # … with 42 more rows, and 2 more variables: 2005 <int>, 2006 <int>
Again, the data isn’t tidy because values are spread across column names. We need to pivot all the year columns (1997
through 2006
), moving their names into a column named "year"
and their values into a column named "num_nests"
.
Here’s the full call to pivot_longer()
:
%>%
example_eagle_pairs pivot_longer(
cols = !starts_with("state"),
names_to = "year",
values_to = "num_pairs"
)#> # A tibble: 480 × 4
#> state state_abbr year num_pairs
#> <chr> <chr> <chr> <int>
#> 1 Alabama AL 1997 22
#> 2 Alabama AL 1998 23
#> 3 Alabama AL 1999 26
#> 4 Alabama AL 2000 27
#> 5 Alabama AL 2001 NA
#> 6 Alabama AL 2002 NA
#> # … with 474 more rows
5.2 Wider
pivot_wider()
is the inverse of pivot_longer()
. pivot_longer()
moves data from column names to cell values, while pivot_wider()
pulls data from cell values into column names, creating a wider tibble.
You’ll likely use pivot_longer()
more often than pivot_wider()
when tidying. Often, you’ll actually use pivot_wider()
to un-tidy data. The non-tidy format may be more convenient for some tasks (e.g., creating a specific visualization).
To explain pivot_wider()
, we’ll turn the tidied example_eagle_nests
back into its original form. Here’s the tidied version:
example_eagle_nests_tidy#> # A tibble: 6 × 3
#> region year num_nests
#> <chr> <chr> <dbl>
#> 1 Pacific 2007 1039
#> 2 Pacific 2009 2587
#> 3 Southwest 2007 51
#> 4 Southwest 2009 176
#> 5 Rocky Mountains and Plains 2007 200
#> 6 Rocky Mountains and Plains 2009 338
Step 1 Identify the column whose values will supply the column names.
pivot_wider()
turns the values from one column and turns them into column names. In our example, we want the unique values from year
to become column names.
pivot_wider()
’s names_from
argument controls which column is pivoted into column names.
%>%
example_eagle_nests_tidy pivot_wider(
names_from = year,
# Step 2
)
Notice that year
is unquoted because, following the rule, year
does exist in example_eagle_nests_tidy
.
Step 2 Identify the column whose values will supply the column values.
Now, we need to identify the column that will supply the values of 2007
and 2009
. In example_eagle_nests_tidy
, that’s num_nests
.
We specify num_nests
as the values_from
argument.
%>%
example_eagle_nests_tidy pivot_wider(
names_from = year,
values_from = num_nests
)#> # A tibble: 3 × 3
#> region `2007` `2009`
#> <chr> <dbl> <dbl>
#> 1 Pacific 1039 2587
#> 2 Southwest 51 176
#> 3 Rocky Mountains and Plains 200 338
Again, supply the name of the column unquoted.
We’re done! The tibble is now transformed back into its original form.
Let’s see an example of a tibble that actually does need pivot_wider()
to be tidy. example_acs_1
contains data from the 2013-2017 American Community Survey, obtained through the tidycensus package.
example_acs_1#> # A tibble: 156 × 4
#> geoid name variable estimate
#> <chr> <chr> <chr> <dbl>
#> 1 01 Alabama pop_housed 4731852
#> 2 01 Alabama pop_renter 1434765
#> 3 01 Alabama median_rent 747
#> 4 02 Alaska pop_housed 710743
#> 5 02 Alaska pop_renter 241484
#> 6 02 Alaska median_rent 1200
#> # … with 150 more rows
variable
and estimate
are not really variables (if you see a variable named variable
it’s a good sign you need pivot_wider()
). There are three distinct values in variable
:
%>%
example_acs_1 distinct(variable)
#> # A tibble: 3 × 1
#> variable
#> <chr>
#> 1 pop_housed
#> 2 pop_renter
#> 3 median_rent
Each of these values is actually a variable whose values are currently stored in estimate
. To pivot, we’ll set names_from
to variable
and values_from
to estimate
.
%>%
example_acs_1 pivot_wider(names_from = variable, values_from = estimate)
#> # A tibble: 52 × 5
#> geoid name pop_housed pop_renter median_rent
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 01 Alabama 4731852 1434765 747
#> 2 02 Alaska 710743 241484 1200
#> 3 04 Arizona 6656124 2460534 972
#> 4 05 Arkansas 2894098 965690 709
#> 5 06 California 38168482 17066023 1358
#> 6 08 Colorado 5318396 1782975 1125
#> # … with 46 more rows
5.3 Missing values
The United Nations compiles data on the origin and destination countries of international migrants. example_migration
contains a subset of this data from 2017. The countries in the column names represent countries of origin, and the countries in dest
represent destination countries.
example_migration#> # A tibble: 3 × 6
#> dest Afghanistan Canada India Japan `South Africa`
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Albania <NA> 913 <NA> <NA> <NA>
#> 2 Bulgaria 483 713 281 213 260
#> 3 Romania <NA> <NA> 102 <NA> <NA>
Again, the data isn’t tidy. Afghanistan
, Canada
, etc. are values of a variable, not variables themselves. We can use pivot_longer()
to tidy the data.
%>%
example_migration pivot_longer(cols = !dest, names_to = "origin", values_to = "migrants")
#> # A tibble: 15 × 3
#> dest origin migrants
#> <chr> <chr> <chr>
#> 1 Albania Afghanistan <NA>
#> 2 Albania Canada 913
#> 3 Albania India <NA>
#> 4 Albania Japan <NA>
#> 5 Albania South Africa <NA>
#> 6 Bulgaria Afghanistan 483
#> # … with 9 more rows
There are a lot of NA
s in the data. However, they don’t actually represent missing values. Someone didn’t forget to measure the number of migrants Afghanistan to Albania—there just weren’t any. It doesn’t really make sense to include these rows in our new, tidied dataset. We can use values_drop_na
to exclude these rows.
%>%
example_migration pivot_longer(
cols = !dest,
names_to = "origin",
values_to = "migrants",
values_drop_na = TRUE
) #> # A tibble: 7 × 3
#> dest origin migrants
#> <chr> <chr> <chr>
#> 1 Albania Canada 913
#> 2 Bulgaria Afghanistan 483
#> 3 Bulgaria Canada 713
#> 4 Bulgaria India 281
#> 5 Bulgaria Japan 213
#> 6 Bulgaria South Africa 260
#> # … with 1 more row
When you use values_drop_na = TRUE
in pivot_longer()
, you’re turning explicit missing values into implicit missing values. This is only a good idea if the NA
s were in the non-tidy data for a purely structural reason, like in example_migration
.
In contrast, example_eagle_pairs
’s NA
s aren’t structural and represent actual missing data. Paired Alabamian eagles probably existed in 2001, but the data isn’t there.
example_eagle_pairs#> # A tibble: 48 × 12
#> state state_abbr `1997` `1998` `1999` `2000` `2001` `2002` `2003` `2004`
#> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 Alabama AL 22 23 26 27 NA NA 47 NA
#> 2 Arizona AZ 34 36 38 37 37 43 43 NA
#> 3 Arkansas AR 24 29 34 36 NA NA 36 42
#> 4 California CA 142 148 151 NA NA NA 160 NA
#> 5 Colorado CO 29 27 29 42 45 NA NA NA
#> 6 Connecticut CT 2 2 2 4 6 8 8 NA
#> # … with 42 more rows, and 2 more variables: 2005 <int>, 2006 <int>
If we used values_drop_na = TRUE
when we pivoted example_eagle_pairs
, we would turn all these explicit missing values implicit, which isn’t a good idea.