2 Other single-table verbs

You’ve learned the most important verbs for data analysis: filter(), mutate(), group_by() and summarize(). There are a number of other verbs that are not quite as important but still come in handy from time-to-time. The goal of this document is to familiarise you with their purpose and basic operation

2.1 Select

Most of the datasets you’ll work with in this class only have a relatively small number of variables, and generally you don’t need to reduce further. In real life, you’ll sometimes encounter datasets with hundreds or even thousands of variables, and the first challenge is just to narrow down to a useful subset. Solving that problem is the job of select().

select() allows you to work with column names using a handful of helper functions:

  • starts_with("x") and ends_with("x") select variables that start with a common prefix or end with a common suffix.

  • contains("x") selects variables that contain a phrase. matches("x.y") select all variables that match a given regular expression (which you’ll learn about later in the course).

  • a:e selects all variables from variable a to variable e inclsive.

You can also select a single varible just by using its name directly.

#> # A tibble: 336,776 x 5
#>    year month   day dep_delay arr_delay
#>   <int> <int> <int>     <dbl>     <dbl>
#> 1  2013     1     1         2        11
#> 2  2013     1     1         4        20
#> 3  2013     1     1         2        33
#> 4  2013     1     1        -1       -18
#> 5  2013     1     1        -6       -25
#> 6  2013     1     1        -4        12
#> # … with 3.368e+05 more rows

To remove variables from selection, put a - in front of the expression.

#> # A tibble: 336,776 x 17
#>    year month   day sched_dep_time arr_time sched_arr_time arr_delay carrier
#>   <int> <int> <int>          <int>    <int>          <int>     <dbl> <chr>  
#> 1  2013     1     1            515      830            819        11 UA     
#> 2  2013     1     1            529      850            830        20 UA     
#> 3  2013     1     1            540      923            850        33 AA     
#> 4  2013     1     1            545     1004           1022       -18 B6     
#> 5  2013     1     1            600      812            837       -25 DL     
#> 6  2013     1     1            558      740            728        12 UA     
#> # … with 3.368e+05 more rows, and 9 more variables: flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>

There’s one last helper that’s useful if you just want to move a few variables to the start: everything().

#> # A tibble: 336,776 x 19
#>   dep_delay arr_delay  year month   day dep_time sched_dep_time arr_time
#>       <dbl>     <dbl> <int> <int> <int>    <int>          <int>    <int>
#> 1         2        11  2013     1     1      517            515      830
#> 2         4        20  2013     1     1      533            529      850
#> 3         2        33  2013     1     1      542            540      923
#> 4        -1       -18  2013     1     1      544            545     1004
#> 5        -6       -25  2013     1     1      554            600      812
#> 6        -4        12  2013     1     1      554            558      740
#> # … with 3.368e+05 more rows, and 11 more variables: sched_arr_time <int>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

2.2 Rename

To change the name of a variable use df %>% rename(new_name = old_name). If you have trouble remembering which sides old and new go on, remember it’s the same order as mutate().

#> # A tibble: 336,776 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1      533            529         4      850            830
#> 3  2013     1     1      542            540         2      923            850
#> 4  2013     1     1      544            545        -1     1004           1022
#> 5  2013     1     1      554            600        -6      812            837
#> 6  2013     1     1      554            558        -4      740            728
#> # … with 3.368e+05 more rows, and 11 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tail_num <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

If you’re selecting and renaming, note that you can also use select() to rename. This sometimes allows you to save a step.

#> # A tibble: 336,776 x 4
#>    year month   day tail_num
#>   <int> <int> <int> <chr>   
#> 1  2013     1     1 N14228  
#> 2  2013     1     1 N24211  
#> 3  2013     1     1 N619AA  
#> 4  2013     1     1 N804JB  
#> 5  2013     1     1 N668DN  
#> 6  2013     1     1 N39463  
#> # … with 3.368e+05 more rows

2.3 Transmute

Transmute is a minor variation of mutate(). The main difference is that it drops any variables that you didn’t explicitly mention. It’s a useful shortcut for mutate() + select().

#> # A tibble: 3 x 3
#>       x     y     z
#>   <int> <int> <int>
#> 1     1     3     4
#> 2     2     2     4
#> 3     3     1     4
#> # A tibble: 3 x 1
#>       z
#>   <int>
#> 1     4
#> 2     4
#> 3     4

2.4 Arrange

arrange() lets you change the order of the rows. To put a column in descending order, use desc().

#> # A tibble: 336,776 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     9      641            900      1301     1242           1530
#> 2  2013     6    15     1432           1935      1137     1607           2120
#> 3  2013     1    10     1121           1635      1126     1239           1810
#> 4  2013     9    20     1139           1845      1014     1457           2210
#> 5  2013     7    22      845           1600      1005     1044           1815
#> 6  2013     4    10     1100           1900       960     1342           2211
#> # … with 3.368e+05 more rows, and 11 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
#> # A tibble: 336,776 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1      533            529         4      850            830
#> 3  2013     1     1      542            540         2      923            850
#> 4  2013     1     1      544            545        -1     1004           1022
#> 5  2013     1     1      554            600        -6      812            837
#> 6  2013     1     1      554            558        -4      740            728
#> # … with 3.368e+05 more rows, and 11 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

2.5 Distinct

distinct() removes duplicates from a dataset. The result is ordered by first occurence in original dataset.

#> # A tibble: 5,725 x 2
#>   carrier flight
#>   <chr>    <int>
#> 1 UA        1545
#> 2 UA        1714
#> 3 AA        1141
#> 4 B6         725
#> 5 DL         461
#> 6 UA        1696
#> # … with 5,719 more rows

2.6 Sample

When working with very large datasets, sometimes it’s convenient to reduce to a smaller dataset, just by taking a random sample. That’s the job of sample_n() and sample_frac(). sample_n() selects the same number of observations from each group, sample_frac() selects the same proportion.

#> # A tibble: 5,800 x 19
#> # Groups:   dest [58]
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     8    16     1543           1545        -2     1807           1806
#> 2  2013     4    22     1537           1520        17     1835           1745
#> 3  2013    10    29     1917           1930       -13     2152           2159
#> 4  2013    11    14     1747           1729        18     2007           1955
#> 5  2013     4    26      703            700         3      940            931
#> 6  2013    12    17       NA           1240        NA       NA           1513
#> # … with 5,794 more rows, and 11 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
#> # A tibble: 3,205 x 19
#> # Groups:   dest [58]
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     9     3     1955           1959        -4     2151           2230
#> 2  2013     8     6     1924           1925        -1     2153           2156
#> 3  2013     7    29     1256           1300        -4     1521           1519
#> 4  2013     9     6      559            600        -1      811            815
#> 5  2013     9    12      615            600        15      839            829
#> 6  2013     6     3     1543           1545        -2     1844           1823
#> # … with 3,199 more rows, and 11 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

2.7 Slice rows

slice() allows to pick rows by position, by group. head() and tail() just show the first (or last) few rows of the entire data frame.

#> # A tibble: 517 x 19
#> # Groups:   dest [105]
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013    10     1     1955           2001        -6     2213           2248
#> 2  2013    10     2     2010           2001         9     2230           2248
#> 3  2013    10     3     1955           2001        -6     2232           2248
#> 4  2013    10     4     2017           2001        16     2304           2248
#> 5  2013    10     5     1959           1959         0     2226           2246
#> 6  2013    10     1     1149           1159       -10     1245           1259
#> # … with 511 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
#> #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
#> # A tibble: 6 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1      533            529         4      850            830
#> 3  2013     1     1      542            540         2      923            850
#> 4  2013     1     1      544            545        -1     1004           1022
#> 5  2013     1     1      554            600        -6      812            837
#> 6  2013     1     1      554            558        -4      740            728
#> # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>
#> # A tibble: 6 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     9    30       NA           1842        NA       NA           2019
#> 2  2013     9    30       NA           1455        NA       NA           1634
#> 3  2013     9    30       NA           2200        NA       NA           2312
#> 4  2013     9    30       NA           1210        NA       NA           1330
#> 5  2013     9    30       NA           1159        NA       NA           1344
#> 6  2013     9    30       NA            840        NA       NA           1020
#> # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>