Choosing columns & rows

On this page, we’re going to explain how to select columns (to print or manipulate), how to filter rows by values in specific columns, and how to select rows containing distinct/unique values.

This is the data that we’re going to be working with on this page:

names(survey)
 [1] "Year"       "ID"         "NPS"        "Field"      "ClassLevel"
 [6] "Status"     "Gender"     "BirthYear"  "FinPL"      "FinSch"    
[11] "FinGov"     "FinSelf"    "FinPar"     "FinOther"   "Question"  
[16] "Response"   "NumResp"   
glimpse(survey)
Rows: 268,481
Columns: 17
$ Year       <int> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012,…
$ ID         <chr> "aacuutwvloft", "aacuutwvloft", "aacuutwvloft", "aacuutwvlo…
$ NPS        <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 4, 4, 4, 4, 4, 4,…
$ Field      <chr> "Undecl", "Undecl", "Undecl", "Undecl", "Undecl", "Undecl",…
$ ClassLevel <ord> Jr, Jr, Jr, Jr, Jr, Jr, Jr, Jr, Jr, Jr, Jr, Jr, Jr, Jr, Fre…
$ Status     <fct> Full-time, Full-time, Full-time, Full-time, Full-time, Full…
$ Gender     <fct> Female, Female, Female, Female, Female, Female, Female, Mal…
$ BirthYear  <int> 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1992, 1992, 1992,…
$ FinPL      <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No",…
$ FinSch     <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Ye…
$ FinGov     <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No",…
$ FinSelf    <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"…
$ FinPar     <chr> NA, NA, NA, NA, NA, NA, NA, "No", "No", "No", "No", "No", "…
$ FinOther   <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No",…
$ Question   <ord> ClassTooBig, FinAid, OverallValue, PoorTeaching, Sched, Too…
$ Response   <ord> Neutral, Agree, Strongly Agree, Strongly Agree, Disagree, D…
$ NumResp    <dbl> 3, 4, 5, 5, 2, 2, 2, 1, 3, 4, 1, 5, 3, 5, 1, 2, 4, 1, 1, 5,…

This is sample data from a data table about a student retention survey. It has over 250,000 rows and 17 columns.

1 Selecting columns with select()

The select() statement tells R to select the columns from the specified data table. Here, R is told to select Year and then Gender from the survey table.

survey |> 
  select(Year, Gender)
# A tibble: 268,481 × 2
    Year Gender
   <int> <fct> 
 1  2012 Female
 2  2012 Female
 3  2012 Female
 4  2012 Female
 5  2012 Female
 6  2012 Female
 7  2012 Female
 8  2012 Male  
 9  2012 Male  
10  2012 Male  
# ℹ 268,471 more rows

You can see that the returned data table has the same number of rows as the original survey table but only two columns.

2 Filtering rows with filter()

The select() statement is used to choose specific columns. In this section, we introduce the filter() statement which is used to choose specific rows based on some of the values they contain.

Remember that in piped commands, the result of every statement is a tibble that has been modified in some way from the result of the tibble it received from the previous statement.

2.1 Standard filter statement

The following command tells R to choose rows for which the Gender column contains “Female”, choose the ID and Year columns from those rows, and pass along the resulting tibble (in this case, to the implied print statement).

survey |> 
  filter(Gender == "Female") |> 
  select(ID, Year)
# A tibble: 127,848 × 2
   ID            Year
   <chr>        <int>
 1 aacuutwvloft  2012
 2 aacuutwvloft  2012
 3 aacuutwvloft  2012
 4 aacuutwvloft  2012
 5 aacuutwvloft  2012
 6 aacuutwvloft  2012
 7 aacuutwvloft  2012
 8 aafejliqlzng  2012
 9 aafejliqlzng  2012
10 aafejliqlzng  2012
# ℹ 127,838 more rows

We can see with the result of the previous command how many surveys were filled out by females. Note that the Gender column values are not printed out because we have chosen not to do so (with the form of the select() statement).

2.2 A filter with no select

The following command differs from the previous one in that the select() statement has been removed. Thus, it can be read as follows: Here is the survey tibble with all of its rows and columns. Choose those rows for which the Gender column contains the value “Female”. This, again, is passed along to the implied print statement.

survey |> 
  filter(Gender == "Female")
# A tibble: 127,848 × 17
    Year ID            NPS Field ClassLevel Status Gender BirthYear FinPL FinSch
   <int> <chr>       <int> <chr> <ord>      <fct>  <fct>      <int> <chr> <chr> 
 1  2012 aacuutwvlo…     7 Unde… Jr         Full-… Female      1995 No    Yes   
 2  2012 aacuutwvlo…     7 Unde… Jr         Full-… Female      1995 No    Yes   
 3  2012 aacuutwvlo…     7 Unde… Jr         Full-… Female      1995 No    Yes   
 4  2012 aacuutwvlo…     7 Unde… Jr         Full-… Female      1995 No    Yes   
 5  2012 aacuutwvlo…     7 Unde… Jr         Full-… Female      1995 No    Yes   
 6  2012 aacuutwvlo…     7 Unde… Jr         Full-… Female      1995 No    Yes   
 7  2012 aacuutwvlo…     7 Unde… Jr         Full-… Female      1995 No    Yes   
 8  2012 aafejliqlz…     4 SocS… Fresh      Full-… Female      1992 No    Yes   
 9  2012 aafejliqlz…     4 SocS… Fresh      Full-… Female      1992 No    Yes   
10  2012 aafejliqlz…     4 SocS… Fresh      Full-… Female      1992 No    Yes   
# ℹ 127,838 more rows
# ℹ 7 more variables: FinGov <chr>, FinSelf <chr>, FinPar <chr>,
#   FinOther <chr>, Question <ord>, Response <ord>, NumResp <dbl>

You’ll note that all 17 columns are printed since we did not use a select() statement to reduce the number of columns in any way. R prints the same number of rows as the previous statement.

2.3 The ordering of filter statements

Now the following statement does not work — R returns an error message. Can you guess why?

survey |> 
  select(ID, Year) |> 
  filter(Gender == "Female")

Let’s try to figure it out. R starts with the whole survey tibble. R is then instructed to select the ID and Year columns from the survey tibble. This is passed along to the filter() statement. The filter statement attempts to filter on the Gender column…but it doesn’t exist! This is where the error occurs.

Here we can fix the problem by adding the Gender column to the select() statement. Now when the tibble is passed along to the filter() statement, the Gender column exists.

survey |> 
  select(ID, Year, Gender) |> 
  filter(Gender == "Female")
# A tibble: 127,848 × 3
   ID            Year Gender
   <chr>        <int> <fct> 
 1 aacuutwvloft  2012 Female
 2 aacuutwvloft  2012 Female
 3 aacuutwvloft  2012 Female
 4 aacuutwvloft  2012 Female
 5 aacuutwvloft  2012 Female
 6 aacuutwvloft  2012 Female
 7 aacuutwvloft  2012 Female
 8 aafejliqlzng  2012 Female
 9 aafejliqlzng  2012 Female
10 aafejliqlzng  2012 Female
# ℹ 127,838 more rows

Note that, again, the statement returns the same number of rows as the previous statement but, this time, it prints three columns (as instructed).

2.4 Composability of statements

Suppose that we wanted to select() and then filter() but that this caused us, as just above, to print out a column that we did not want to print.

We can address this by simply adding another select() statement. There is no limit to the number of times that you can use a specific type of statement. Here we simply tell R to only choose ID and Year but not Gender (as was selected earlier in the statement):

survey |> 
  select(ID, Year, Gender) |> 
  filter(Gender == "Female") |> 
  select(ID, Year)
# A tibble: 127,848 × 2
   ID            Year
   <chr>        <int>
 1 aacuutwvloft  2012
 2 aacuutwvloft  2012
 3 aacuutwvloft  2012
 4 aacuutwvloft  2012
 5 aacuutwvloft  2012
 6 aacuutwvloft  2012
 7 aacuutwvloft  2012
 8 aafejliqlzng  2012
 9 aafejliqlzng  2012
10 aafejliqlzng  2012
# ℹ 127,838 more rows

3 Distinct/unique rows

Sometimes you don’t want all of the rows to print but, instead, you want to see the unique values (or combination of values) that are in the data table. Above, in the first section, we printed the Year and Gender for all of the rows — that ended up being more than 250,000 rows.

But what if we wanted to ask a different question: “What combination of Year and Gender values are in the data table?” Or, “Which genders appear in which year’s surveys?” In this case, you would want the distinct() command:

survey |> 
  distinct(Year, Gender)
# A tibble: 24 × 2
    Year Gender
   <int> <fct> 
 1  2012 Female
 2  2012 Male  
 3  2012 Other 
 4  2012 <NA>  
 5  2013 Female
 6  2013 Male  
 7  2013 Other 
 8  2013 <NA>  
 9  2014 Male  
10  2014 <NA>  
# ℹ 14 more rows

In this case we don’t get 250k+ rows—we get 24 rows! It lists all of the unique/distinct value combinations in the whole data table.

4 Subsets of rows (with slice() etc.)

R provides a useful set of functions

  • to extract certain rows (top, bottom, or some subset),
  • find the rows in which the values in a certain column are the biggest or smallest, and
  • find the rows in which the values in a certain column are the biggest or smallest within groups of rows

If this isn’t quite clear, you can see what we mean in the following examples.

First, here’s the data that we’re going to work with on these examples. This is information about a set of students admitted to a university who have already graduated.

admitsub <-
  admitdata |> 
    filter(GraduationYear > 0 & Gender %in% c("Male", "Female")) |> 
    select(FirstName, LastName, GradYr = GraduationYear, 
           Gender, HSGPA, UnivGPA)

Let’s take a glimpse() at this data frame:

glimpse(admitsub)
Rows: 9,753
Columns: 6
$ FirstName <chr> "Terry", "Lennox", "Dominic", "Dilan", "Isabella", "Adalynn"…
$ LastName  <chr> "Stone", "Harris", "Travis", "King", "Garcia", "Miranda", "T…
$ GradYr    <chr> "2012-13", "2012-13", "2012-13", "2012-13", "2012-13", "2012…
$ Gender    <chr> "Male", "Male", "Male", "Male", "Female", "Female", "Female"…
$ HSGPA     <dbl> 2.89, 3.05, 3.36, 3.25, 3.07, 3.07, 3.57, 3.55, 3.31, 3.09, …
$ UnivGPA   <dbl> 2.468, 2.731, 2.779, 3.185, 2.521, 3.126, 3.041, 3.440, 3.00…

4.1 Take basic row slices by row numbers

The first set of examples have to do with selecting specified rows from a data frame as it is stored; that is, the selection does not have to do with the values in any columns.

The most common use case for these functions are similar to those for head() and tail() — you just want to see what’s in the file. Don’t rely on the order of the rows staying the same over its life!

4.1.1 First row

admitsub |> slice(1)
# A tibble: 1 × 6
  FirstName LastName GradYr  Gender HSGPA UnivGPA
  <chr>     <chr>    <chr>   <chr>  <dbl>   <dbl>
1 Terry     Stone    2012-13 Male    2.89    2.47

4.1.2 Third row

admitsub |> slice(3)
# A tibble: 1 × 6
  FirstName LastName GradYr  Gender HSGPA UnivGPA
  <chr>     <chr>    <chr>   <chr>  <dbl>   <dbl>
1 Dominic   Travis   2012-13 Male    3.36    2.78

4.1.3 First three rows

admitsub |> slice(1:3)
# A tibble: 3 × 6
  FirstName LastName GradYr  Gender HSGPA UnivGPA
  <chr>     <chr>    <chr>   <chr>  <dbl>   <dbl>
1 Terry     Stone    2012-13 Male    2.89    2.47
2 Lennox    Harris   2012-13 Male    3.05    2.73
3 Dominic   Travis   2012-13 Male    3.36    2.78
admitsub |> slice_head(n = 3)
# A tibble: 3 × 6
  FirstName LastName GradYr  Gender HSGPA UnivGPA
  <chr>     <chr>    <chr>   <chr>  <dbl>   <dbl>
1 Terry     Stone    2012-13 Male    2.89    2.47
2 Lennox    Harris   2012-13 Male    3.05    2.73
3 Dominic   Travis   2012-13 Male    3.36    2.78

4.1.4 Last row

admitsub |> slice(n())
# A tibble: 1 × 6
  FirstName LastName GradYr  Gender HSGPA UnivGPA
  <chr>     <chr>    <chr>   <chr>  <dbl>   <dbl>
1 Camila    Hail     2022-23 Female  3.52    2.92
admitsub |> slice_tail(n = 1)
# A tibble: 1 × 6
  FirstName LastName GradYr  Gender HSGPA UnivGPA
  <chr>     <chr>    <chr>   <chr>  <dbl>   <dbl>
1 Camila    Hail     2022-23 Female  3.52    2.92

4.1.5 Last three rows

admitsub |> slice_tail(n = 3)
# A tibble: 3 × 6
  FirstName LastName GradYr  Gender HSGPA UnivGPA
  <chr>     <chr>    <chr>   <chr>  <dbl>   <dbl>
1 Esther    Brown    2022-23 Female  3.14    2.73
2 Olivia    Gonzalez 2022-23 Female  2.78    2.74
3 Camila    Hail     2022-23 Female  3.52    2.92

4.2 Take row slices by value in a specific column

Most of the time you want to select rows based on the value in a specific column. That is, you might want to answer questions such as

  • Which students had the three highest HSGPAs
  • Which students had UnivGPAs in the top 10%?

4.2.1 Students with top 3 HSGPA values

This query asks, “Find me the rows that have the three greatest values of HSGPA.”

You may notice that this prints out many more than 3 rows. The reason for that is because all of these rows have a value equal to 4 in the HSGPA column.

See the next example if you want to keep this from happening.

admitsub |> slice_max(HSGPA, n = 3)
# A tibble: 18 × 6
   FirstName LastName GradYr  Gender HSGPA UnivGPA
   <chr>     <chr>    <chr>   <chr>  <dbl>   <dbl>
 1 Benjamin  Purcell  2012-13 Male       4    3.37
 2 Alvaro    Decker   2013-14 Male       4    3.29
 3 Atalia    Coleman  2013-14 Female     4    3.78
 4 Parth     Gonzalez 2014-15 Male       4    3.19
 5 Amon      Kern     2015-16 Male       4    3.04
 6 Owen      Johnson  2014-15 Male       4    3.36
 7 Kaelyn    Burnett  2016-17 Female     4    3.62
 8 Keanu     Messer   2016-17 Male       4    3.4 
 9 Theodore  Spencer  2017-18 Male       4    3.43
10 Landon    Knight   2019-20 Male       4    3.45
11 Kemper    Rios     2019-20 Female     4    3.41
12 Jasper    Shah     2020-21 Male       4    3.65
13 Piper     Hollis   2020-21 Female     4    3.78
14 Ana       Mcmanus  2021-22 Female     4    3.48
15 Waylon    Mathews  2021-22 Male       4    3.44
16 Catherine Burch    2022-23 Female     4    3.32
17 Alisson   Baxter   2022-23 Female     4    3.73
18 Noah      Lopez    2017-18 Male       4    3.22

4.2.2 Students with top 3 HSGPA values (but no ties)

This query asks, “Find me the rows that have the three greatest values of HSGPA. I do not care about ties — just show me 3 rows.”

admitsub |> slice_max(HSGPA, n = 3, 
                      with_ties = FALSE)
# A tibble: 3 × 6
  FirstName LastName GradYr  Gender HSGPA UnivGPA
  <chr>     <chr>    <chr>   <chr>  <dbl>   <dbl>
1 Benjamin  Purcell  2012-13 Male       4    3.37
2 Alvaro    Decker   2013-14 Male       4    3.29
3 Atalia    Coleman  2013-14 Female     4    3.78

4.2.3 Students with top 8 HSGPA values (and use UnivGPA as a tie-breaker)

This query asks, “Find me the rows that have the eight greatest values of HSGPA and use UnivGPA as a tie-breaker, if necessary.”

Notice that we use tibble() to group together the two columns instead of c() as we have done in other situations along our learning journey.

admitsub |> slice_max(tibble(HSGPA, UnivGPA), n = 8)
# A tibble: 8 × 6
  FirstName LastName GradYr  Gender HSGPA UnivGPA
  <chr>     <chr>    <chr>   <chr>  <dbl>   <dbl>
1 Atalia    Coleman  2013-14 Female     4    3.78
2 Piper     Hollis   2020-21 Female     4    3.78
3 Alisson   Baxter   2022-23 Female     4    3.73
4 Jasper    Shah     2020-21 Male       4    3.65
5 Kaelyn    Burnett  2016-17 Female     4    3.62
6 Ana       Mcmanus  2021-22 Female     4    3.48
7 Landon    Knight   2019-20 Male       4    3.45
8 Waylon    Mathews  2021-22 Male       4    3.44

4.2.4 The top 0.1% of students ranked by UnivGPA (with no ties)

This query asks, “Find me the rows that have the 0.1% of the greatest university GPAs, with no ties.”

admitsub |> slice_max(UnivGPA, prop = 0.001,
                      with_ties = FALSE)
# A tibble: 9 × 6
  FirstName LastName GradYr  Gender HSGPA UnivGPA
  <chr>     <chr>    <chr>   <chr>  <dbl>   <dbl>
1 Mira      Minor    2020-21 Female  3.74    3.79
2 Atalia    Coleman  2013-14 Female  4       3.78
3 Piper     Hollis   2020-21 Female  4       3.78
4 Kaeden    Tate     2018-19 Male    3.8     3.77
5 Charlotte Ramirez  2020-21 Female  3.77    3.77
6 Gael      Collins  2019-20 Male    3.45    3.74
7 Lumi      Martin   2017-18 Female  3.63    3.73
8 Madeline  Burton   2016-17 Female  3.28    3.73
9 Mercy     Martinez 2020-21 Female  3.72    3.73

4.2.5 Students with the top 2 UnivGPA values

This query asks, “Find me the rows that have the top 2 university GPAs. Show ties if they exist.”

admitsub |> slice_max(UnivGPA, n = 2, with_ties = TRUE)
# A tibble: 2 × 6
  FirstName LastName GradYr  Gender HSGPA UnivGPA
  <chr>     <chr>    <chr>   <chr>  <dbl>   <dbl>
1 Mira      Minor    2020-21 Female  3.74    3.79
2 Atalia    Coleman  2013-14 Female  4       3.78

Let’s look at a possible issue. Notice that row 3 in the previous example shows the same value for the university GPA (3.78) as shown in row 2. Why isn’t it included here since ties are supposed to be shown? It is entirely possible that some rounding is going on. Let’s try something out — change it so that numeric data are printed with 3 digits in the admitsub data frame.

admitsub <-
  admitsub |> 
    mutate(across(where(is.numeric), ~ num(., digits = 3)))

Now, let’s look at the top three values again:

admitsub |> slice_max(UnivGPA, n = 3)
# A tibble: 3 × 6
  FirstName LastName GradYr  Gender     HSGPA   UnivGPA
  <chr>     <chr>    <chr>   <chr>  <num:.3!> <num:.3!>
1 Mira      Minor    2020-21 Female     3.740     3.791
2 Atalia    Coleman  2013-14 Female     4.000     3.780
3 Piper     Hollis   2020-21 Female     4.000     3.778

Rows 2 and 3 really do not have the same value — “Atalia” has a GPA of 3.780 while “Piper” has a GPA of 3.778. This is why R did not, appropriately so, consider them the same value.

4.3 Take row slices by group

Here we also want to select rows based on the value in a specific column, but we want to get these values within groups. For example, we might modify the questions in the previous section in this way:

  • Which students had the three highest HSGPAs by graduation year?
  • Which students had UnivGPAs in the top 10% by graduation year?

4.3.1 Maximum UnivGPA by graduation year

This query asks, “For every distinct value of GradYr, find the row with the greatest UnivGPA, not including ties.”

admitsub |> 
  group_by(GradYr) |> 
  slice_max(UnivGPA, 
            n = 1, with_ties = FALSE)
# A tibble: 12 × 6
# Groups:   GradYr [12]
   FirstName LastName   GradYr  Gender     HSGPA   UnivGPA
   <chr>     <chr>      <chr>   <chr>  <num:.3!> <num:.3!>
 1 Aubrey    Ventura    2011-12 Female     3.860     3.626
 2 Celia     Jones      2012-13 Female     3.600     3.598
 3 Atalia    Coleman    2013-14 Female     4.000     3.780
 4 Yusra     Young      2014-15 Female     3.680     3.697
 5 Sofia     Stahl      2015-16 Female     3.970     3.679
 6 Madeline  Burton     2016-17 Female     3.280     3.729
 7 Lumi      Martin     2017-18 Female     3.630     3.732
 8 Kaeden    Tate       2018-19 Male       3.800     3.774
 9 Gael      Collins    2019-20 Male       3.450     3.736
10 Mira      Minor      2020-21 Female     3.740     3.791
11 Gianna    Strickland 2021-22 Female     3.540     3.684
12 Alisson   Baxter     2022-23 Female     4.000     3.727

As a heads-up, we have had trouble getting consistent results using either the by or order_by arguments (as shown in the following). We are staying away from them until we figure out the issue.

admitsub |> 
  slice_max(UnivGPA, by = GradYr,
            n = 1, with_ties = FALSE)
admitsub |> 
  slice_max(by = GradYr,
            order_by = tibble(GradYr, UnivGPA),
            n = 1, with_ties = FALSE)

4.3.2 Top 2 GPAs by both graduation year and gender

This query asks, “For every distinct value of the combination of GradYr and Gender, find the top 2 UnivGPAs.

admitsub |> 
  group_by(GradYr, Gender) |> 
  slice_max(UnivGPA, 
            n = 2, with_ties = FALSE)
# A tibble: 48 × 6
# Groups:   GradYr, Gender [24]
   FirstName LastName GradYr  Gender     HSGPA   UnivGPA
   <chr>     <chr>    <chr>   <chr>  <num:.3!> <num:.3!>
 1 Aubrey    Ventura  2011-12 Female     3.860     3.626
 2 Emerson   Garrett  2011-12 Female     3.150     3.474
 3 Landyn    Singh    2011-12 Male       3.550     3.291
 4 Lake      Davis    2011-12 Male       3.520     3.084
 5 Celia     Jones    2012-13 Female     3.600     3.598
 6 Amy       Caldwell 2012-13 Female     3.770     3.555
 7 Luka      Pierre   2012-13 Male       3.640     3.443
 8 Cameron   Rossi    2012-13 Male       3.600     3.430
 9 Atalia    Coleman  2013-14 Female     4.000     3.780
10 Jamaria   Edwards  2013-14 Female     3.600     3.703
# ℹ 38 more rows

This result contains 48 rows — 12 years times 2 genders times top 2 results equals 48.

Compare the result of the above to the following result. This query asks for the top 4 UnivGPAs for each GradYr. Notice that, for the first two years at least, no male students are listed. The previous query shows the greatest GPAs per gender, irrespective of whether or not they were the greatest GPAs overall each year.

admitsub |> 
  group_by(GradYr) |> 
  slice_max(UnivGPA, 
            n = 4, with_ties = FALSE)
# A tibble: 48 × 6
# Groups:   GradYr [12]
   FirstName  LastName GradYr  Gender     HSGPA   UnivGPA
   <chr>      <chr>    <chr>   <chr>  <num:.3!> <num:.3!>
 1 Aubrey     Ventura  2011-12 Female     3.860     3.626
 2 Emerson    Garrett  2011-12 Female     3.150     3.474
 3 Michelle   Mccauley 2011-12 Female     3.270     3.294
 4 Vivian     Bland    2011-12 Female     3.020     3.293
 5 Celia      Jones    2012-13 Female     3.600     3.598
 6 Amy        Caldwell 2012-13 Female     3.770     3.555
 7 Alexandria Gibbs    2012-13 Female     3.280     3.553
 8 Josefina   Smith    2012-13 Female     3.500     3.551
 9 Atalia     Coleman  2013-14 Female     4.000     3.780
10 Jamaria    Edwards  2013-14 Female     3.600     3.703
# ℹ 38 more rows

Both are valid queries. The appropriate one to choose depends on what question you want to ask of the data.