6 Advanced pivoting

pivot_longer() and pivot_wider() are very flexible, and can easily tidy a wide variety of non-tidy datasets. The previous chapter only covered the basics. In this chapter, we’ll explore this flexibility by introducing some of the pivot functions’ advanced functionality.

6.1 Longer

6.1.1 Types

By default, pivot_longer() creates the names_to column as a character variable. For example, when we pivot_longer() example_eagle_nests, year becomes a character column.

#> # A tibble: 6 x 3
#>   region                     year  num_nests
#>   <chr>                      <chr>     <dbl>
#> 1 Pacific                    2007       1039
#> 2 Pacific                    2009       2587
#> 3 Southwest                  2007         51
#> 4 Southwest                  2009        176
#> 5 Rocky Mountains and Plains 2007        200
#> 6 Rocky Mountains and Plains 2009        338

It’s probably more useful to store year as an integer. We can tell pivot_longer() our desired type for the names_to column by using the optional names_ptypes argument.

names_types takes a named list of column name and prototype pairs. A prototype is a vector with a type, but no elements. The integer prototype is integer(). Here’s how we would create year as an integer column:

#> # A tibble: 6 x 3
#>   region                      year num_nests
#>   <chr>                      <int>     <dbl>
#> 1 Pacific                     2007      1039
#> 2 Pacific                     2009      2587
#> 3 Southwest                   2007        51
#> 4 Southwest                   2009       176
#> 5 Rocky Mountains and Plains  2007       200
#> 6 Rocky Mountains and Plains  2009       338

If we wanted year to be a double, we would use the double() prototype.

#> # A tibble: 6 x 3
#>   region                      year num_nests
#>   <chr>                      <dbl>     <dbl>
#> 1 Pacific                     2007      1039
#> 2 Pacific                     2009      2587
#> 3 Southwest                   2007        51
#> 4 Southwest                   2009       176
#> 5 Rocky Mountains and Plains  2007       200
#> 6 Rocky Mountains and Plains  2009       338

pivot_longer() also has a values_ptypes argument that controls the type of the values_to column. You specify values_ptypes in the same way as names_ptypes. For example, say we wanted to change num_nests from its default type (double) to an integer.

#> # A tibble: 6 x 3
#>   region                      year num_nests
#>   <chr>                      <int>     <int>
#> 1 Pacific                     2007      1039
#> 2 Pacific                     2009      2587
#> 3 Southwest                   2007        51
#> 4 Southwest                   2009       176
#> 5 Rocky Mountains and Plains  2007       200
#> 6 Rocky Mountains and Plains  2009       338

6.1.2 Prefixes

Sometimes, the columns you want to pivot will contain extra information, either in their names or their values. For example, example_gymnastics_1 contains data on the scores of three countries’ women’s Olympic gymnastic teams in 2016.

#> # A tibble: 3 x 3
#>   country       score_vault score_floor
#>   <chr>               <dbl>       <dbl>
#> 1 United States        46.9        46.0
#> 2 Russia               45.7        42.0
#> 3 China                44.3        42.1

To tidy example_gymnastics_1, we need to pivot score_vault and score_floor.

#> # A tibble: 6 x 3
#>   country       event       score
#>   <chr>         <chr>       <dbl>
#> 1 United States score_vault  46.9
#> 2 United States score_floor  46.0
#> 3 Russia        score_vault  45.7
#> 4 Russia        score_floor  42.0
#> 5 China         score_vault  44.3
#> 6 China         score_floor  42.1

However, score_vault isn’t really a value of event. It would be better for the values to be "vault" and "floor". We can remove the "score_" prefix with pivot_longer()'s names_prefix argument.

#> # A tibble: 6 x 3
#>   country       event score
#>   <chr>         <chr> <dbl>
#> 1 United States vault  46.9
#> 2 United States floor  46.0
#> 3 Russia        vault  45.7
#> 4 Russia        floor  42.0
#> 5 China         vault  44.3
#> 6 China         floor  42.1

6.1.3 Multiple values

example_gymnastics_2 includes data on both the 2012 and 2016 Olympics.

#> # A tibble: 3 x 5
#>   country       vault_2012 floor_2012 vault_2016 floor_2016
#>   <chr>              <dbl>      <dbl>      <dbl>      <dbl>
#> 1 United States       48.1       45.4       46.9       46.0
#> 2 Russia              46.4       41.6       45.7       42.0
#> 3 China               44.3       40.8       44.3       42.1

There are four variables in this non-tidy dataset: country, event, year, and score. To tidy the data, we’ll need to pivot vault_2012 through floor_2016. In contrast to other examples we’ve seen so far, though, each of these variable names contains two values: a event value and a year value. If we pivot_longer() as usual, both of these values will get placed in the same cell:

#> # A tibble: 12 x 3
#>   country       event_year score
#>   <chr>         <chr>      <dbl>
#> 1 United States vault_2012  48.1
#> 2 United States floor_2012  45.4
#> 3 United States vault_2016  46.9
#> 4 United States floor_2016  46.0
#> 5 Russia        vault_2012  46.4
#> 6 Russia        floor_2012  41.6
#> # … with 6 more rows

event_year contains two variables, and so the result isn’t tidy. Instead, we want pivot_longer() to split each pivoted variable name into two, placing "vault" or "floor" into an event variable and 2012 or 2016 into a year variable.

We’ll need to make the following two changes to our pivot_longer() call:

  • Change names_to to be a vector of two names: c("event", "year). This will tell pivot_longer() to create two variables from the column names instead of one.
  • Use the names_sep argument to tell pivot_longer() what separates an event value from a year in each of the column names. Here, that’s an underscore (_).
#> # A tibble: 12 x 4
#>   country       event year  score
#>   <chr>         <chr> <chr> <dbl>
#> 1 United States vault 2012   48.1
#> 2 United States floor 2012   45.4
#> 3 United States vault 2016   46.9
#> 4 United States floor 2016   46.0
#> 5 Russia        vault 2012   46.4
#> 6 Russia        floor 2012   41.6
#> # … with 6 more rows

pivot_longer() created three variables, instead of the default two: event, year, and score. Now, the data is tidy.

We can extend this idea to work with any number of columns. example_gymnastics_3 has three values stored in each column name: event, year, and gender.

#> # A tibble: 3 x 9
#>   country vault_2012_f vault_2012_m vault_2016_f vault_2016_m floor_2012_f
#>   <chr>          <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
#> 1 United…         48.1         46.6         46.9         45.9         45.4
#> 2 Russia          46.4         46.9         45.7         46.0         41.6
#> 3 China           44.3         48.3         44.3         45           40.8
#> # … with 3 more variables: floor_2012_m <dbl>, floor_2016_f <dbl>,
#> #   floor_2016_m <dbl>

names_sep is still _, but we’ll need to change names_to to include "gender".

#> # A tibble: 24 x 5
#>   country       event year  gender score
#>   <chr>         <chr> <chr> <chr>  <dbl>
#> 1 United States vault 2012  f       48.1
#> 2 United States vault 2012  m       46.6
#> 3 United States vault 2016  f       46.9
#> 4 United States vault 2016  m       45.9
#> 5 United States floor 2012  f       45.4
#> 6 United States floor 2012  m       45.3
#> # … with 18 more rows

Finally, what if there isn’t a separator like _ in the column names?

#> # A tibble: 3 x 5
#>   country       vault2012 floor2012 vault2016 floor2016
#>   <chr>             <dbl>     <dbl>     <dbl>     <dbl>
#> 1 United States      48.1      45.4      46.9      46.0
#> 2 Russia             46.4      41.6      45.7      42.0
#> 3 China              44.3      40.8      44.3      42.1

Instead of using names_sep, we can a related pivot_longer() argument: names_pattern. names_pattern is more flexible than names_sep because it allows regular expression groups, matching each group with a variable from names_to.

For example_gymnastics_4, we want to create two variables from the column names: event and year. This means names_pattern needs two groups. Here’s a regular expression we could use:

"([A-Za-z]+)" matches only letters, so will pull out just "vault" or "floor". "(\\d+)" matches digits and will pull out 2012 or 2016.

Here’s the call to pivot_longer():

#> # A tibble: 12 x 4
#>   country       event year  score
#>   <chr>         <chr> <chr> <dbl>
#> 1 United States vault 2012   48.1
#> 2 United States floor 2012   45.4
#> 3 United States vault 2016   46.9
#> 4 United States floor 2016   46.0
#> 5 Russia        vault 2012   46.4
#> 6 Russia        floor 2012   41.6
#> # … with 6 more rows

pivot_longer() successfully separates the events from the years.

6.2 Wider

6.2.1 Prefixes

Like pivot_longer(), pivot_wider() also has a names_prefix argument. However, it adds a prefix instead of removing one.

example_twins contains some data on two sets of (real) twins. n defines the birth order.

#> # A tibble: 4 x 3
#>   family name      n
#>   <chr>  <chr> <dbl>
#> 1 Kelly  Mark      1
#> 2 Kelly  Scott     2
#> 3 Quin   Tegan     1
#> 4 Quin   Sara      2

If we pivot_wider() without a prefix, we’ll get numbers as column names, which isn’t very informative.

#> # A tibble: 2 x 3
#>   family `1`   `2`  
#>   <chr>  <chr> <chr>
#> 1 Kelly  Mark  Scott
#> 2 Quin   Tegan Sara

We can use names_prefix to add an informative prefix.

#> # A tibble: 2 x 3
#>   family twin_1 twin_2
#>   <chr>  <chr>  <chr> 
#> 1 Kelly  Mark   Scott 
#> 2 Quin   Tegan  Sara

6.2.2 Multiple values

Earlier, we showed how you can create multiple columns from data stored in column names using pivot_longer(). Analogously, you can use pivot_wider() to create column names that combine values from multiple columns.

For example, take the following modified subset of the American Community Survey data from last chapter:

#> # A tibble: 8 x 5
#>   geoid name    variable    measure    value
#>   <chr> <chr>   <chr>       <chr>      <dbl>
#> 1 01    Alabama pop_renter  estimate 1434765
#> 2 01    Alabama pop_renter  error      16736
#> 3 01    Alabama median_rent estimate     747
#> 4 01    Alabama median_rent error          3
#> 5 13    Georgia pop_renter  estimate 3592422
#> 6 13    Georgia pop_renter  error      33385
#> # … with 2 more rows

The measure column indicates if a given row represents the estimate of the variable or the margin of error. We might want each combination of variable and measure to become the name of a new variable (i.e., pop_renter_estimate, pop_renter_error, median_rent_estimate, median_rent_error). Recall that pivot_wider()’s names_from argument controls which column’s values are used for the new column names. If we supply multiple columns to names_from, pivot_wider() will create one new column for each unique combination.

#> # A tibble: 2 x 6
#>   geoid name  pop_renter_esti… pop_renter_error median_rent_est…
#>   <chr> <chr>            <dbl>            <dbl>            <dbl>
#> 1 01    Alab…          1434765            16736              747
#> 2 13    Geor…          3592422            33385              927
#> # … with 1 more variable: median_rent_error <dbl>

By default, pivot_wider() will combine the two values with an underscore, but you can control the separator with names_sep.

In example_acs_2, the names of our desired columns were stored across two variables: variable and measure. You might also encounter values stored across multiple variables.

#> # A tibble: 4 x 5
#>   geoid name    variable    estimate error
#>   <chr> <chr>   <chr>          <dbl> <dbl>
#> 1 01    Alabama pop_renter   1434765 16736
#> 2 01    Alabama median_rent      747     3
#> 3 13    Georgia pop_renter   3592422 33385
#> 4 13    Georgia median_rent      927     3

Now, values of our desired variables (pop_renter_esimate, pop_renter_error, median_rent_estimate, and median_rent_error) are in two separate columns: estimate and error. We can specify both columns to pivot_wider()’s values_from argument.

#> # A tibble: 2 x 6
#>   geoid name  estimate_pop_re… estimate_median… error_pop_renter
#>   <chr> <chr>            <dbl>            <dbl>            <dbl>
#> 1 01    Alab…          1434765              747            16736
#> 2 13    Geor…          3592422              927            33385
#> # … with 1 more variable: error_median_rent <dbl>

pivot_wider() pivots variable, then creates columns by combining the value of measure with each column name specified in values_from.

6.2.3 id_cols

In the previous chapter, we mentioned that you’ll need to name names_from and values_from in pivot_wider() because pivot_wider()’s first argument is actually id_cols. But what does id_cols do? To explain, we’ll go over a brief example.

Here’s another ACS dataset:

#> # A tibble: 6 x 5
#>   geoid name    variable    estimate error
#>   <chr> <chr>   <chr>          <dbl> <dbl>
#> 1 01    Alabama pop_housed   4731852    NA
#> 2 01    Alabama pop_renter   1434765 16736
#> 3 01    Alabama median_rent      747     3
#> 4 13    Georgia pop_housed   9943137    NA
#> 5 13    Georgia pop_renter   3592422 33385
#> 6 13    Georgia median_rent      927     3

Now, variable includes three variables: the total housed population ("pop_housed"), total renter population ("pop_renter"), and median rent ("median_rent"). However, error is only available for "pop_renter" and "median_rent". Say that we don’t want to pivot error because we don’t need the margin of error data. Here’s an attempt:

#> # A tibble: 6 x 6
#>   geoid name    error pop_housed pop_renter median_rent
#>   <chr> <chr>   <dbl>      <dbl>      <dbl>       <dbl>
#> 1 01    Alabama    NA    4731852         NA          NA
#> 2 01    Alabama 16736         NA    1434765          NA
#> 3 01    Alabama     3         NA         NA         747
#> 4 13    Georgia    NA    9943137         NA          NA
#> 5 13    Georgia 33385         NA    3592422          NA
#> 6 13    Georgia     3         NA         NA         927

This probably isn’t the result we want. pivot_wider() created one row for each geoid-name-error combination, thinking that geoid, name, and error all identify an observation. We can fix this by setting id_cols.

id_cols controls which columns define an observation. By default, id_cols includes all columns not specified in names_from or values_from. Here, we want pivot_wider() to understand that each observation should be a state, defined by geoid and name. To fix the problem, we’ll tell pivot_longer() that error should not be included in id_cols.

#> # A tibble: 2 x 5
#>   geoid name    pop_housed pop_renter median_rent
#>   <chr> <chr>        <dbl>      <dbl>       <dbl>
#> 1 01    Alabama    4731852    1434765         747
#> 2 13    Georgia    9943137    3592422         927

That looks a lot better! pivot_wider() threw out the error column and created one row for each state.