Tidying commands

On this page we are going to continue using the st_info data set that was introduced on the import page.

1 Introduction

Tidy data looks like a simple spreadsheet, with columns and rows. Each column is a variable (type of information), and each row is an observation. In a tidy CSV file, the first row is the header row, which contains the names of these variables. Importing into Excel puts those at the top, where you can rename them, move them around, delete them, and so on. This arrangement of rows of data under headers (column names) is the same in R, so the basic data object, called a “data frame,” looks and acts a lot like a spreadsheet.

A major difference between spreadsheets and our work in R using data frames is that spreadsheets are designed around cell operations, which permits infinite flexibility. We can pick any cell in a spreadsheet and write a formula for a calculation using any other data in the attached sheets. This flexibility comes at the cost of complexity. While cell-wise operations are possible in R, generally we will want to work with rows and columns as a whole. The reduction in complexity allows for straightforward description and reproduction of the work we do.

So think of “tidy data” as a simplified spreadsheet, where rows correspond to observations or cases — for example, student registrations for classes. Each column tells us something different about the case. In the case of registrations, it might give us the student ID and the course ID, and maybe the instructor ID. There may be blanks where information is missing, e.g. an instructor hasn’t been assigned yet. See Chapter 12 of R for Data Science for more on tidy data.

Here’s a tidy data set created by the data.frame function, which lets us specify the column names and contents. Don’t worry about the details for now.

data.frame(ID    = c(123,456,789),
           First = c("Saran","Dan","Frylander"),
           Last  = c("Wrap","Zig","French"),
           GPA   = c(3.10, 2.98, 3.50))

   ID     First   Last  GPA
1 123     Saran   Wrap 3.10
2 456       Dan    Zig 2.98
3 789 Frylander French 3.50

Notice that it’s a rectangle of data — that is, 3 rows of 4 columns — with cases in row and attributes in columns.

2 Renaming columns/variables

The following renames the field Application ID to become AppID. Notice that without the assignment operator (<-), the following would not have any lasting effect.

st_info <- st_info |> 
    rename(AppID = "Application ID")

The general form of the statement when you want to rename columns is as follows:

data-frame-name <- data-frame-name |> 
    rename(new-name1 = "old-name-1",
           new-name2 = "old-name-2",
           ...)

3 Removing columns/variables

In the following, we provide several different ways that you can execute the listed actions. We will go through some more details related to different forms that the commands might take at the beginning than we do farther down in this document.

3.1 Using `select`

As a reminder, here’s the current state of the st_info data frame:

glimpse(st_info)

Rows: 2,000
Columns: 10
$ AppID     <chr> "4563269562-RODR-2021", "9221751846-ROEH-2021", "4290276249-…
$ Given     <chr> "Tenoch", "Travis", "Axel", "Justice", "Mehran", "Carter", "…
$ Family    <chr> "Rodriguez", "Roe", "Allen", "Hilton", "Smith", "Shipley", "…
$ Birthdate <chr> "05/26/2003", "03/31/2003", "06/23/2003", "06/29/2003", "11/…
$ Email     <chr> "[email protected]", "[email protected]", "axelall…
$ St        <chr> "GA", "SC", "GA", "GA", "GA", "GA", "SC", "SC", "SC", "SC", …
$ County    <chr> "COWETAGA", "GREENVSC", "BULLOCGA", "DEKALBGA", "WHITFIGA", …
$ Sex       <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", …
$ Race      <chr> "W", "A", "W", "W", "O", "H", "W", "W", "B", "H", "W", "W", …
$ SAT       <dbl> 1436, 1398, 1090, 1516, 1440, 1438, 1452, 1536, 1487, 1373, …

Suppose that you want to print just 7 of the variables from this data frame:

st_info |>
  select(AppID, Birthdate, St, County, Sex, Race, SAT)

# A tibble: 2,000 × 7
   AppID                Birthdate  St    County   Sex   Race    SAT
   <chr>                <chr>      <chr> <chr>    <chr> <chr> <dbl>
 1 4563269562-RODR-2021 05/26/2003 GA    COWETAGA M     W      1436
 2 9221751846-ROEH-2021 03/31/2003 SC    GREENVSC M     A      1398
 3 4290276249-ALLE-2021 06/23/2003 GA    BULLOCGA M     W      1090
 4 3398780452-HILT-2021 06/29/2003 GA    DEKALBGA M     W      1516
 5 7691897840-SMIT-2021 11/03/2002 GA    WHITFIGA M     O      1440
 6 1862245592-SHIP-2021 10/31/2002 GA    DEKALBGA M     H      1438
 7 2085584835-CHAM-2021 02/08/2003 SC    DARLINSC M     W      1452
 8 3918571924-STAH-2021 06/16/2003 SC    CHARLESC M     W      1536
 9 3023361776-ROGE-2021 09/11/2002 SC    LANCASSC M     B      1487
10 7386692838-SCOT-2021 05/25/2003 SC    PICKENSC M     H      1373
# ℹ 1,990 more rows

As promised above, we used the pipe-based method of constructing the command. We tell R to work with the st_info data frame, and then to select() seven specific variables (columns) from that data frame.

Certainly, we could have used this form of the command:

select(st_info, AppID, Birthdate, St, County, Sex, Race, SAT)

However, we like the clarity of the pipe-based method because it makes very clear—right on the first line—what data frame that you are working on…every single time.

Suppose that you want to save the above data frame into a new data frame. You would type something like this:

st_min <- st_info |>
    select(AppID, Birthdate, St, County, Sex, Race, SAT)

As you can see, you simply put the assignment operation before the whole pipe. Whatever would have been the result printed at the end of the pipe gets assigned to the variable on the first line. Let’s glimpse some information about st_min:

glimpse(st_min)

Rows: 2,000
Columns: 7
$ AppID     <chr> "4563269562-RODR-2021", "9221751846-ROEH-2021", "4290276249-…
$ Birthdate <chr> "05/26/2003", "03/31/2003", "06/23/2003", "06/29/2003", "11/…
$ St        <chr> "GA", "SC", "GA", "GA", "GA", "GA", "SC", "SC", "SC", "SC", …
$ County    <chr> "COWETAGA", "GREENVSC", "BULLOCGA", "DEKALBGA", "WHITFIGA", …
$ Sex       <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", …
$ Race      <chr> "W", "A", "W", "W", "O", "H", "W", "W", "B", "H", "W", "W", …
$ SAT       <dbl> 1436, 1398, 1090, 1516, 1440, 1438, 1452, 1536, 1487, 1373, …

Sure enough, the st_min data frame holds 2000 rows for those seven columns that we previously specified.

3.2 Using `select` with minus

Another way to remove variables (columns) from a data frame is to use the “minus” operator with select. To be clear, you aren’t actually removing the variable from the original data frame unless you save the data frame—you are simply removing it from the view that is presented.

If you want to look at the st_info data frame but without the Race variable, you could type in the following:

st_info |> 
  select(-Race)

# A tibble: 2,000 × 9
   AppID                Given    Family Birthdate Email St    County Sex     SAT
   <chr>                <chr>    <chr>  <chr>     <chr> <chr> <chr>  <chr> <dbl>
 1 4563269562-RODR-2021 Tenoch   Rodri… 05/26/20… teno… GA    COWET… M      1436
 2 9221751846-ROEH-2021 Travis   Roe    03/31/20… trav… SC    GREEN… M      1398
 3 4290276249-ALLE-2021 Axel     Allen  06/23/20… axel… GA    BULLO… M      1090
 4 3398780452-HILT-2021 Justice  Hilton 06/29/20… just… GA    DEKAL… M      1516
 5 7691897840-SMIT-2021 Mehran   Smith  11/03/20… mehr… GA    WHITF… M      1440
 6 1862245592-SHIP-2021 Carter   Shipl… 10/31/20… cart… GA    DEKAL… M      1438
 7 2085584835-CHAM-2021 Benjamin Chamb… 02/08/20… benj… SC    DARLI… M      1452
 8 3918571924-STAH-2021 Robert   Stahl  06/16/20… robe… SC    CHARL… M      1536
 9 3023361776-ROGE-2021 Zaiden   Rogers 09/11/20… zaid… SC    LANCA… M      1487
10 7386692838-SCOT-2021 Jakob    Scott  05/25/20… jako… SC    PICKE… M      1373
# ℹ 1,990 more rows

You can see that it now displays nine columns—all of the original columns except for the Race column.

If you want to create a new data frame that is the same as the st_info data frame but without the Race variable, then do the following:

st_min_again <- 
  st_info |> 
    select(-Race)

This tells R to select all of the variables from st_info (note: the original data frame) except the Race variable. Let’s verify that it worked:

glimpse(st_min_again)

Rows: 2,000
Columns: 9
$ AppID     <chr> "4563269562-RODR-2021", "9221751846-ROEH-2021", "4290276249-…
$ Given     <chr> "Tenoch", "Travis", "Axel", "Justice", "Mehran", "Carter", "…
$ Family    <chr> "Rodriguez", "Roe", "Allen", "Hilton", "Smith", "Shipley", "…
$ Birthdate <chr> "05/26/2003", "03/31/2003", "06/23/2003", "06/29/2003", "11/…
$ Email     <chr> "[email protected]", "[email protected]", "axelall…
$ St        <chr> "GA", "SC", "GA", "GA", "GA", "GA", "SC", "SC", "SC", "SC", …
$ County    <chr> "COWETAGA", "GREENVSC", "BULLOCGA", "DEKALBGA", "WHITFIGA", …
$ Sex       <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", …
$ SAT       <dbl> 1436, 1398, 1090, 1516, 1440, 1438, 1452, 1536, 1487, 1373, …

Yep! It’s missing just one of the columns from the original data frame—the Race column.

But what if we wanted to remove multiple columns from the data frame with one command? Here we need to use the c() function—its purpose is to assemble a set of items into a vector that R can work with. This command tells R to select all of the variables except the three listed in the select statement:

st_min_again <- st_min_again |> 
    select(-c(Given, Family, Email))

Again, let’s verify that it worked:

dim(st_min_again)

[1] 2000    6

Yep! We have removed three variables from the st_min_again data frame, just as we wanted.

4 Simplifying tibble print-outs

Just a quick aside. Consider this command:

st_info |>
  select(AppID, Birthdate, St, County, Sex, Race, SAT) |> 
  head(n=3)

# A tibble: 3 × 7
  AppID                Birthdate  St    County   Sex   Race    SAT
  <chr>                <chr>      <chr> <chr>    <chr> <chr> <dbl>
1 4563269562-RODR-2021 05/26/2003 GA    COWETAGA M     W      1436
2 9221751846-ROEH-2021 03/31/2003 SC    GREENVSC M     A      1398
3 4290276249-ALLE-2021 06/23/2003 GA    BULLOCGA M     W      1090

Look at the second row of the print-out—<chr>, <dbl>, and more. It’s informative, but maybe you don’t want to print it.

If that’s the case, then you can add one additional command to to the end and that row will go away. Take a look:

st_info |>
  select(AppID, Birthdate, St, County, Sex, Race, SAT) |> 
  head(n=3) |> 
  as.data.frame()

                 AppID  Birthdate St   County Sex Race  SAT
1 4563269562-RODR-2021 05/26/2003 GA COWETAGA   M    W 1436
2 9221751846-ROEH-2021 03/31/2003 SC GREENVSC   M    A 1398
3 4290276249-ALLE-2021 06/23/2003 GA BULLOCGA   M    W 1090

5 Filtering some rows/observations

R’s command for choosing which rows to display is filter(). In this section we provide several examples for how you can filter data in R.

5.1 Filtering on one equality condition

With the select command, we can specify which particular columns/variables that we want to work with or observe. In this section, we are going to learn to specify which rows/observations that we want to include in our analysis—this is called filtering.

The first two rows of the following start out in the way that we observed above, but we continue on the third line by filtering those rows that we want to work with:

st_info |> 
  select(Given, Family, St, County, Sex, Race, SAT) |> 
  filter(Sex == "F")

# A tibble: 1,080 × 7
   Given    Family     St    County   Sex   Race    SAT
   <chr>    <chr>      <chr> <chr>    <chr> <chr> <dbl>
 1 Mira     Payne      GA    FULTONGA F     W      1456
 2 Penelope Williamson SC    LANCASSC F     AI     1456
 3 Emma     Stephens   SC    LEXINGSC F     W      1491
 4 Ryan     Jackson    GA    DEKALBGA F     W      1233
 5 Jasmine  Truong     SC    BEAUFOSC F     A      1480
 6 Eira     Scott      SC    GREENVSC F     W      1390
 7 Nalani   Siegel     SC    CHARLESC F     W      1261
 8 Esther   Morse      SC    SPARTASC F     A      1338
 9 Claire   Wilson     SC    SALUDASC F     H      1191
10 Delaney  Bishop     SC    GREENVSC F     W      1593
# ℹ 1,070 more rows

A few things to observe:

Notice that we used the select() operator here. We didn’t have to in order to filter data! But we did just to limit the information displayed.
All of the observations passing the filter have F in the Sex column, just as we specified in the command.

5.2 Filtering on one comparison condition

As we’ve seen elsewhere in this site, R accepts a set of comparison conditions:

==: equal
!=: not equal
>=: greater than or equal
>: greater than
=<: less than or equal
<: less than

In the most recent select() and filter() above, we had one condition—Sex == "F. You can replace that comparison with any other single comparison condition. Consider this:

st_info |> 
  select(Given, Family, St, County, Sex, Race, SAT) |> 
  filter(SAT > 1200)

# A tibble: 1,527 × 7
   Given    Family      St    County   Sex   Race    SAT
   <chr>    <chr>       <chr> <chr>    <chr> <chr> <dbl>
 1 Tenoch   Rodriguez   GA    COWETAGA M     W      1436
 2 Travis   Roe         SC    GREENVSC M     A      1398
 3 Justice  Hilton      GA    DEKALBGA M     W      1516
 4 Mehran   Smith       GA    WHITFIGA M     O      1440
 5 Carter   Shipley     GA    DEKALBGA M     H      1438
 6 Benjamin Chamberlain SC    DARLINSC M     W      1452
 7 Robert   Stahl       SC    CHARLESC M     W      1536
 8 Zaiden   Rogers      SC    LANCASSC M     B      1487
 9 Jakob    Scott       SC    PICKENSC M     H      1373
10 Jasper   Davis       GA    HENRYGA  M     W      1297
# ℹ 1,517 more rows

(Note that this displays only M applicants because they are at the top of the data frame and the F applicants are at the bottom. We will work on the arrange operation soon which will address this.)

5.3 Filtering on range conditions

R provides several useful operators for filtering on range conditions. We have seen the first form elsewhere, but the others are new:

V %in% c(list of values)
- where V is any type of variable, and the list of values is a comma-separated list of values that the variable might take
- read as “variable V has the value of one of the list of values shown”
- Example: St %in% c("SC", "GA")
- Example: Race %in% c("A", "H", "B")
V %in% L:U
- where L and U are integers, with L being the lower bound and U being the upper bound
- read as “variable V is in the range of integers from L to U (inclusive)”
- Example: SAT %in% 1200:1225
between(V, L, U)
- where V, L, and U are numbers, with L being the lower bound and U being the upper bound
- read as “variable V (as an integer or decimal) takes a value such that V >= L and V <= U
- Example: between(SAT, 1200, 1225)

Again, as you might hope, you can put any of the above filters within the filter() operation, for example:

st_info |> 
  select(Given, Family, St, County, Sex, Race, SAT) |> 
  filter(Race %in% c("A", "H", "B"))

# A tibble: 782 × 7
   Given    Family   St    County   Sex   Race    SAT
   <chr>    <chr>    <chr> <chr>    <chr> <chr> <dbl>
 1 Travis   Roe      SC    GREENVSC M     A      1398
 2 Carter   Shipley  GA    DEKALBGA M     H      1438
 3 Zaiden   Rogers   SC    LANCASSC M     B      1487
 4 Jakob    Scott    SC    PICKENSC M     H      1373
 5 Hudson   Espinoza GA    GWINNEGA M     A      1456
 6 Jack     Chandler SC    HORRYSC  M     H      1199
 7 Emiliano Rosen    SC    BERKELSC M     H      1246
 8 Kingsley Cronin   SC    PICKENSC M     H      1177
 9 Kenzo    Johnson  GA    COFFEEGA M     A      1189
10 Isaac    Liu      GA    FULTONGA M     H      1462
# ℹ 772 more rows

5.4 Filtering on multiple conditions

Filtering on a single condition is certainly useful, but it is much more common to want to filter on several conditions at once. Here, we have to use logical operators in the filter() to combine different filter conditions.

Here is a filter() that ensures that the applicant is a female from Georgia:

st_info |> 
  select(Given, Family, St, County, Sex, Race, SAT) |> 
  filter(Sex == "F" & St == "GA")

# A tibble: 269 × 7
   Given     Family  St    County   Sex   Race    SAT
   <chr>     <chr>   <chr> <chr>    <chr> <chr> <dbl>
 1 Mira      Payne   GA    FULTONGA F     W      1456
 2 Ryan      Jackson GA    DEKALBGA F     W      1233
 3 Hazel     Larson  GA    FULTONGA F     W      1356
 4 Remi      Brown   GA    WILKESGA F     W      1321
 5 Jeanette  Hood    GA    RICHMOGA F     A      1445
 6 Choyce    Lutz    GA    LOWNDEGA F     H      1347
 7 Marie     Tracy   GA    GWINNEGA F     W      1267
 8 Antonella Johnson GA    HENRYGA  F     W      1459
 9 Harley    Hall    GA    BIBBCGA  F     W      1227
10 Madison   Walls   GA    GLYNNGA  F     W      1184
# ℹ 259 more rows

Or, another one that filters on Sex, Race, and County:

st_info |> 
  select(Given, Family, St, County, Sex, Race, SAT) |> 
  filter(Sex == "F" &
           Race %in% c("A", "H", "B") & 
           County == "SPARTASC")

# A tibble: 22 × 7
   Given   Family   St    County   Sex   Race    SAT
   <chr>   <chr>    <chr> <chr>    <chr> <chr> <dbl>
 1 Esther  Morse    SC    SPARTASC F     A      1338
 2 Henley  Fisher   SC    SPARTASC F     H      1451
 3 Abigail Saavedra SC    SPARTASC F     A      1150
 4 Mila    Price    SC    SPARTASC F     H      1227
 5 Aaliyah Smith    SC    SPARTASC F     H      1329
 6 Finlee  Brown    SC    SPARTASC F     H      1180
 7 Diana   Jackson  SC    SPARTASC F     B      1148
 8 Nova    Lynch    SC    SPARTASC F     B      1346
 9 Adeline Fowler   SC    SPARTASC F     H      1238
10 Isla    Gomez    SC    SPARTASC F     B      1161
# ℹ 12 more rows

In the above, be sure to use the equality comparison == and not the single equal =.

6 Convert string variable/column to date

Let’s take a look at three of the columns:

st_info |> 
  select(AppID, Family, Birthdate) |> 
  glimpse()

Rows: 2,000
Columns: 3
$ AppID     <chr> "4563269562-RODR-2021", "9221751846-ROEH-2021", "4290276249-…
$ Family    <chr> "Rodriguez", "Roe", "Allen", "Hilton", "Smith", "Shipley", "…
$ Birthdate <chr> "05/26/2003", "03/31/2003", "06/23/2003", "06/29/2003", "11/…

(As an aside, note that we have used glimpse() within a pipe command, differently than we introduced it above.) You can see that R has interpreted the data in all three columns to be character strings.

Think about (and look at) the data that are in these columns. The Birthdate column has dates in it, and you will want to do calculations on them. But R cannot do date calculations on the column until it knows that it contains date values.

In this section we will learn how to tell R that it is dealing with a date.

What you have to do is tell R the format of the column—where is the year, the month, etc.? Is it a 2-digit year or a 4-digit year? Is the month represented by letters or by numbers?

R has a set of format indicators, a few of which are shown here:

Date formats
Format String	Description	Examples
`%b`	Abbreviated month	Jan, Feb, Mar
`%m`	Numeric month	01, 02, 03
`%d`	Day of month	01, 02, 03,…, 31
`%y`	Year without century	87, 99, 03
`%Y`	Year with century	1987, 1999, 2003

There are others that you can see here.

Go back up to the output of the glimpse() command above for Birthdate:

$ Birthdate        <chr> "05/26/2003", "03/31/2003", "06/23/2003", "06/29/2003…

Besides the month, date, and year, each string has a divider in it. Here it is a / but it also typically can be a dash - or just a space.

The format string for the Birthdate field is %m/%d/%Y since it has a numeric month (%m), a /, a day of month (%d), a /, and a year with century (%Y).

Now you are ready to convert the date field:

st_info <-
  st_info |> 
    mutate(BD = as.Date(Birthdate, format="%m/%d/%Y"))

Note that you need to have both the assignment operator <- and the pipe operator |> in this statement for it to work.

The statement above is assigning a value to a new column called BD in the st_info data frame. It is getting the new values by applying the as.Date() function to the Birthdate variable. You can see that this function contains the format string that it needs to interpret the dates.

Let’s take a look at the new column next to the old one:

st_info |> 
  select(Given, Family, Birthdate, BD)

# A tibble: 2,000 × 4
   Given    Family      Birthdate  BD        
   <chr>    <chr>       <chr>      <date>    
 1 Tenoch   Rodriguez   05/26/2003 2003-05-26
 2 Travis   Roe         03/31/2003 2003-03-31
 3 Axel     Allen       06/23/2003 2003-06-23
 4 Justice  Hilton      06/29/2003 2003-06-29
 5 Mehran   Smith       11/03/2002 2002-11-03
 6 Carter   Shipley     10/31/2002 2002-10-31
 7 Benjamin Chamberlain 02/08/2003 2003-02-08
 8 Robert   Stahl       06/16/2003 2003-06-16
 9 Zaiden   Rogers      09/11/2002 2002-09-11
10 Jakob    Scott       05/25/2003 2003-05-25
# ℹ 1,990 more rows

You can see that Birthdate is still a character string but the new BD column contains seemingly equivalent date values. The difference is that you can execute calculations on the BD column.

For example, suppose that you want to calculate the applicant’s age as of today. Here is the statement that would do just that:

st_info <-
  st_info |> 
    mutate(Age = time_length(difftime(Sys.Date(), 
                                      BD), 
                             "years"))

Several things to note here:

The statement is assigning a value to a new column Age in the st_info data frame.
Let’s consider the interior function difftime(Later, Earlier). This calculates the number of seconds between the two times Later - Earlier.
The function Sys.Date() returns the current date and time.
Since `difftime() is operating on the current date and the applicant’s birthdate, this is returning the number of seconds the applicant has been alive.
The function time_length(S, "years") provided by the tidyverse converts that calculated number of seconds into a years value (in terms of a decimal number).

Thus, this calculation puts the applicant’s age in years in the Age column.

Let’s take a look at what this produces:

st_info |> 
  select(Birthdate, BD, Age) |> 
  summary()

  Birthdate               BD                  Age       
 Length:2000        Min.   :2002-08-02   Min.   :21.18  
 Class :character   1st Qu.:2002-11-10   1st Qu.:21.45  
 Mode  :character   Median :2003-02-25   Median :21.73  
                    Mean   :2003-02-23   Mean   :21.74  
                    3rd Qu.:2003-06-08   3rd Qu.:22.03  
                    Max.   :2003-09-15   Max.   :22.30

This report highlights all that we have gained through converting the Birthdate string to the BD date:

We can now see the range of birthdates as well as other computed statistics on the BD column For Birthdate we just know that it is a character string column.
We can see with the Age column a whole set of statistics calculated on that value.

You absolutely do not have to do this converstion from string to date for a column that contains date-related information. You only need to do it if you have to perform date calculations.

7 Combining multiple character (string) columns into one

Suppose you have a data frame that contains two fields, Given and Family, that you would like to combine into one field, FullName, while also putting a space between the contents of the two field. Then use the following command:

st_info <-
  st_info |>
    unite("FullName", 
          c("Given", "Family"), 
          sep = " ", 
          remove = FALSE)

The remove field is set to FALSE so that the original two fields are retained; as you might guess, setting it to TRUE would result in them both being deleted.

Let’s check the results of the above command:

st_info |> 
  select(Given, Family, FullName)

# A tibble: 2,000 × 3
   Given    Family      FullName            
   <chr>    <chr>       <chr>               
 1 Tenoch   Rodriguez   Tenoch Rodriguez    
 2 Travis   Roe         Travis Roe          
 3 Axel     Allen       Axel Allen          
 4 Justice  Hilton      Justice Hilton      
 5 Mehran   Smith       Mehran Smith        
 6 Carter   Shipley     Carter Shipley      
 7 Benjamin Chamberlain Benjamin Chamberlain
 8 Robert   Stahl       Robert Stahl        
 9 Zaiden   Rogers      Zaiden Rogers       
10 Jakob    Scott       Jakob Scott         
# ℹ 1,990 more rows

8 Convert string variable/column to categorical

A categorical variable usually has a limited number of predictable and pre-defined values that it can take. If you have ever filled out a form with a drop-down menu or a selection of radio buttons, then those would be good candidates to be categorical variables.

R calls these variables factors. In this st_info data frame, three columns are good candidates for becoming factors: Sex, Race, and St.

8.1 The conversion process

The following command—definitely not tidyverse-specific—is used to convert a column into a factor:

dataframe$column-name <- factor(dataframe$column-name)

Thus, the following convert the three columns to factors:

st_info$Sex <- factor(st_info$Sex)
st_info$Race <- factor(st_info$Race)
st_info$St <- factor(st_info$St)

8.2 Learning about a factor

Before—or even after—you turn a variable into a factor, you can find all of the unique values in that variable (column) with the unique() function:

category_values <- unique(st_info$Race)

As always, you can see the contents of a variable by simply typing it at the prompt:

category_values

[1] W  A  O  H  B  AI PI
Levels: A AI B H O PI W

Notice there are two lines here. The first simply shows the values in the variable. The second line contains the same values, but they are now in alphabetical order. The order doesn’t mean anything in particular to R, but this is the order in which R will print out the results of particular commands. We’ll get back to this soon.

This would then show how many values the factor has:

length(category_values)

[1] 7

Of course, you could also have used the following to get how many values the factor has:

length(unique(st_info$Race))

[1] 7

This would be the preferred way if you had no particular need to access the values in the variable category_values that was created above.

You can use the table() function to show both the values in the factor and the number of times the factor has that value:

table(st_info$Race)


   A   AI    B    H    O   PI    W 
 136   28  243  403   10    7 1173

Or, if you want to list those factor values by the decreasing frequency of their occurrence:

sort(table(st_info$Race), decreasing = TRUE)


   W    H    B    A   AI    O   PI 
1173  403  243  136   28   10    7

Finally, you can add the prop.table() function and see the proportion of the whole for each factor (instead of the raw value):

sort(prop.table(table(st_info$Race)), 
     decreasing = TRUE)


     W      H      B      A     AI      O     PI 
0.5865 0.2015 0.1215 0.0680 0.0140 0.0050 0.0035

The levels() function will list the variable’s levels in order by value. Notice that this results in the same line as the second output line when the factor value is printed out.

levels(st_info$Race)

[1] "A"  "AI" "B"  "H"  "O"  "PI" "W"

8.3 Changing the order of a factor’s levels

Now, suppose that you want to change the ordering of the factor values. The following reorders the values of st_info$Race.

st_info$Race <- 
  factor(st_info$Race,
         levels = c("W", "H", "B", "A", "AI", 
                    "PI", "O"))

Let’s check that it worked. The columns should be in the same order as the levels list in the above command.

table(st_info$Race)


   W    H    B    A   AI   PI    O 
1173  403  243  136   28    7   10

And so it is.

9 Joining tables together

Excel holds tables of data in a sheet. It is common for one master table to need to get data from some secondary table; this is usually done by means of Excel’s lookup function.

R’s approach is, we believe, easier and more efficient. You will join two data frames together to form one data frame.

9.1 Reading in the second table

We are now going to add some new data to our student information example. We are going to add information about counties and their economic situation.

county_info <- read_csv("../data/class202405/counties.csv", 
                        show_col_types = FALSE)

Let’s take a look at this new data frame:

head(county_info)

# A tibble: 6 × 5
  code     name             state_name     population per_capita_income_2020
  <chr>    <chr>            <chr>               <dbl>                  <dbl>
1 ABBEVISC Abbeville County South Carolina      24356                  38094
2 AIKENSC  Aiken County     South Carolina     174150                  48208
3 ALLENDSC Allendale County South Carolina       7579                  39852
4 ANDERSSC Anderson County  South Carolina     209581                  43933
5 APPLINGA Appling County   Georgia              1156                  39199
6 ATKINSGA Atkinson County  Georgia               585                  31938

We can see that we have the following variables (columns):

code: a unique identifier for the county
name: the name of the county
state_name: the name of the state in which the county is located
population: the pertinent population of the county
per_capita_income_2020: the average per capita income in 2020 for residents in the county

Let’s remind ourselves of the columns in the st_info data frame:

names(st_info)

 [1] "AppID"     "FullName"  "Given"     "Family"    "Birthdate" "Email"    
 [7] "St"        "County"    "Sex"       "Race"      "SAT"       "BD"       
[13] "Age"

We want to make the following observations:

The st_info table has a column named County that contains the county code in which the applicant resides. It would be useful if the county code in the county_info data frame had the same name (in order to minimize confusion).
While it is obvious just when looking at the county_info data frame that name refers to the name of the county, we have a more complicated world now. It would be less confusing if it was more clear that this variable refers to the name of a county.
The variable name per_capita_income_2020 is quite long. It’s always a balancing act between length of variable names and clarity. Since PCI is frequently used to refer to per capita income, we should feel comfortable shortening the name.

9.2 Modifying the table

Let’s make those changes using the rename() function. Here’s the basic form:

df_name <-
  df_name |> 
    rename(new-col-name = old-col-name,
           new-col-name1 = old-col-name2,
           ...)

Let’s apply that to this situation:

county_info <-
  county_info |> rename(County = code, 
                        County_Name = name,
                        PCI20 = per_capita_income_2020)

Take another look at the county_info data frame:

head(county_info)

# A tibble: 6 × 5
  County   County_Name      state_name     population PCI20
  <chr>    <chr>            <chr>               <dbl> <dbl>
1 ABBEVISC Abbeville County South Carolina      24356 38094
2 AIKENSC  Aiken County     South Carolina     174150 48208
3 ALLENDSC Allendale County South Carolina       7579 39852
4 ANDERSSC Anderson County  South Carolina     209581 43933
5 APPLINGA Appling County   Georgia              1156 39199
6 ATKINSGA Atkinson County  Georgia               585 31938

The variables now have new names; the data has not changed.

We aren’t going to use the population or state_name variables so let’s remove them from the data frame:

county_info <-
  county_info |>
    select(County, County_Name, PCI20)

We could have also used the following form of this command:

county_info <-
  county_info |> 
    select(-c(population, state_name))

9.3 Making the combined table

In order to do analysis, R wants all of the information in a single data frame. Thus, given that we have two data frames now that are related—st_info and county_info—we need to get all of the relevant data into one data frame.

In this situation, the information about the student is in the st_info data frame (including the county in which he/she resides) and information about the county is in the county_info data frame. The link between the two tables is the County column in st_info and the County column in county_info.

We can combine these two tables into one with the left_join() command.

Consider the following command:

student_econ <-
  st_info |> 
    select(AppID, County, St, Sex, Race, SAT, Age) |> 
    left_join(county_info, by = join_by(County))

Let’s take this one line at a time:

The results are going to be assigned to the student_econ data frame.
The st_info data frame will be considered the one on the left. Think of reading. The English language is read left-to-right, so the first words encountered are on the left of words that are encountered later.
We only want to include seven variables (columns) from the st_info data frame in the new student_econ table. You need to make sure that one of these is the variable that will be used to join the two tables together. Above, we established that this variable is called County in both data frames.
Now for the left_join() function itself: (a)county_info is the right table; i.e., the table being joined to the left table. (b) The by = join_by(County) phrase tells the function that the two tables are to be joined by the County variable.

Let’s see what the result is:

head(student_econ)

# A tibble: 6 × 9
  AppID                County   St    Sex   Race    SAT   Age County_Name  PCI20
  <chr>                <chr>    <fct> <fct> <fct> <dbl> <dbl> <chr>        <dbl>
1 4563269562-RODR-2021 COWETAGA GA    M     W      1436  21.5 Coweta Coun… 52106
2 9221751846-ROEH-2021 GREENVSC SC    M     A      1398  21.6 Greenville … 52376
3 4290276249-ALLE-2021 BULLOCGA GA    M     W      1090  21.4 Bulloch Cou… 35376
4 3398780452-HILT-2021 DEKALBGA GA    M     W      1516  21.4 Dekalb Coun… 56428
5 7691897840-SMIT-2021 WHITFIGA GA    M     O      1440  22.0 Whitfield C… 43456
6 1862245592-SHIP-2021 DEKALBGA GA    M     H      1438  22.1 Dekalb Coun… 56428

Notice that the seven variables from st_info are in this new data frame as are the three variables from county_info. The variable County is only included once.

The result is that now we know the PCI (in 2020) for the county in which the applicant resides. Even if we don’t know the applicant’s economic situation, we know the average for the county in which he/she lives.

10 Sorting data frames by variable/column

10.1 Single columns

Welcome to the realm of sorting in R! First up, let’s master the art of ordering single columns.

st_min |> 
  arrange(Birthdate) |> 
  head()

# A tibble: 6 × 7
  AppID                Birthdate  St    County   Sex   Race    SAT
  <chr>                <chr>      <chr> <chr>    <chr> <chr> <dbl>
1 1412835066-BRYA-2021 01/01/2003 SC    YORKCSC  M     W      1352
2 7043527378-BERR-2021 01/02/2003 SC    AIKENSC  M     A      1120
3 6119019828-BURN-2021 01/02/2003 SC    DARLINSC F     W      1104
4 4650793794-HERR-2021 01/02/2003 SC    LEXINGSC F     W      1469
5 4054336768-BLAC-2021 01/03/2003 GA    CARROLGA M     W      1146
6 0117952760-FREE-2021 01/03/2003 SC    LANCASSC M     W      1438

In this example, the dataset st_min is being arranged based on the Birthdate column. Using the arrange(Birthdate) function, the rows will be sorted in ascending order based on the dates in the Birthdate column, displaying the earliest dates at the top. The head() function then shows us the first few rows of this beautifully ordered dataset.

10.2 Multiple columns

Next, let’s navigate the landscape of multi-column sorting.

st_min |> 
  arrange(St, County, Birthdate) |> 
  head()

# A tibble: 6 × 7
  AppID                Birthdate  St    County   Sex   Race    SAT
  <chr>                <chr>      <chr> <chr>    <chr> <chr> <dbl>
1 7297156714-TAYL-2021 02/11/2003 GA    BALDWIGA F     W      1296
2 5503874158-ONEI-2021 08/28/2003 GA    BALDWIGA F     W      1120
3 4217895057-PLUM-2021 08/30/2002 GA    BALDWIGA M     W      1336
4 2934873168-SANC-2021 09/07/2002 GA    BALDWIGA M     W      1311
5 1634395813-GORE-2021 09/22/2002 GA    BALDWIGA M     W      1171
6 3395189387-JOHN-2021 06/03/2003 GA    BARROWGA F     B      1354

Here, we’re leveling up by arranging the data by multiple columns: St, County, and Birthdate. The arrange() function works diligently, first sorting by St, then within St, it sorts by County, and finally, within County, it sorts by Birthdate. It’s like a well-organized filing system, making sure everything is in its right place.

10.3 Descending order

Now, let’s unveil the magic of descending order.

st_min |> 
  arrange(St, County, desc(Birthdate)) |> 
  head()

# A tibble: 6 × 7
  AppID                Birthdate  St    County   Sex   Race    SAT
  <chr>                <chr>      <chr> <chr>    <chr> <chr> <dbl>
1 1634395813-GORE-2021 09/22/2002 GA    BALDWIGA M     W      1171
2 2934873168-SANC-2021 09/07/2002 GA    BALDWIGA M     W      1311
3 4217895057-PLUM-2021 08/30/2002 GA    BALDWIGA M     W      1336
4 5503874158-ONEI-2021 08/28/2003 GA    BALDWIGA F     W      1120
5 7297156714-TAYL-2021 02/11/2003 GA    BALDWIGA F     W      1296
6 2419671834-JONE-2021 12/03/2002 GA    BARROWGA M     B      1117

In this powerful move, we’re arranging the data in descending order based on the Birthdate column, while St and County remain in ascending order. By invoking the desc(Birthdate) spell, the Birthdate column is transformed, and the latest dates gallantly appear at the top. A blend of old and new, this approach provides a nuanced view of our dataset.

11 Creating a new column

11.1 Text calculations

Let’s dive into the wonderful world of text manipulation! Strings, characters, texts, or words—whatever you call them, they’re central in data handling. Here we’ll unveil some practical magic tricks to make your text columns perform delightful ballets of coherence and utility!

st_info <-
  st_info |> 
    mutate(Full = paste(Given, Family, sep = " "))

Our first act features the mutate() and paste() functions, a dynamic duo in text calculations. Together, they craft a new column, Full, by pasting the Given and Family names together with a space in between. This newly conjured Full column now graciously hosts the complete names, ensuring each person is addressed with due respect.

# not evaluated; equivalent to the previous command.
st_info |> 
  unite("Full", Given:Family, remove = FALSE, sep = " ") |> 
  head()

Meet the unite() function—a mystical ally in merging columns. Here, it acts like a sophisticated tailor, stitching the Given and Family columns together with a space separator, crafting a chic “Full” name column. The remove = FALSE argument ensures that the original columns retain their places, standing proudly beside their new companion.

st_info |> 
  select(Full, Given, Family) |> 
  arrange(Family, Given) |> 
  head(n=10)

# A tibble: 10 × 3
   Full           Given   Family 
   <chr>          <chr>   <chr>  
 1 Angie Abel     Angie   Abel   
 2 Ollie Acosta   Ollie   Acosta 
 3 Amber Adams    Amber   Adams  
 4 Conner Adams   Conner  Adams  
 5 Gemma Adams    Gemma   Adams  
 6 Jones Adams    Jones   Adams  
 7 Walter Adams   Walter  Adams  
 8 Caspian Adkins Caspian Adkins 
 9 Dalet Aguilar  Dalet   Aguilar
10 Tim Aguilar    Tim     Aguilar

Finally, we adorn our dataset with the elegance of selection and ordering. The select() function thoughtfully curates our columns, ensuring that only Full, Given, and Family grace the final presentation. Guided by the arrange() function, the rows sashay into a splendid procession ordered by Family and Given, showcasing the harmonious blend of our text manipulation craftsmanship.

11.2 Numeric calculations

11.2.1 Calculating summary statistics

Suppose that we want to calculate some summary statistics over the whole data frame. You might type the following:

print(c(mean(st_info$SAT), median(st_info$SAT), 
        sd(st_info$SAT), length(st_info$SAT)))

[1] 1314.7660 1308.5000  136.5139 2000.0000

So, it should be clear that this line of code calculates and prints a collection of statistics for theSATcolumn in the st_info dataset. It specifically calculates the:

Mean: The average of theSATscores.
Median: The middle value in theSATscores.
Standard Deviation (sd): A measure of the amount of variation or dispersion of theSATscores.
Length: The total number of values (count) in theSATcolumn.

Note that there might be a problem with the above. Do you want to include NA values in the calculations?

Let’s take a quick look:

v <- c(1,2,3)
print(c(mean(v), median(v), length(v)))

[1] 2 2 3

This first line of results are just what we expected. Let’s simply add an NA to the vector:

v <- c(1,2,3,NA)
print(c(mean(v), median(v), length(v)))

[1] NA NA  4

Take a look at the second line—both mean() and median() return NA. If this is okay, then you don’t have a problem. However, if you want to calculate the values without including the NAs, then you have two options. Here’s the first:

v <- c(1, 2, 3, NA)
print(c(mean(v, na.rm = TRUE),
        median(v, na.rm = TRUE),
        sum(!is.na(v))))

[1] 2 2 3

The na.rm means “do you want to remove the NA values?” Note that length() does not take a na.rm argument so we count up the values in v that are not NA values.

There is a second way that uses the na.omit() function which strips out NA values from a vector:

v <- c(1, 2, 3, NA)
v_with_no_na <- na.omit(v)
print(c(mean(v_with_no_na),
        median(v_with_no_na),
        length(v_with_no_na)))

[1] 2 2 3

So, going back to our original problem of calculating statistics for the SAT column of the st_info data frame, we could use a more tidyverse-friendly approach that utilizes summarize():

st_info |>
  summarise(
    Mean = mean(SAT, na.rm = TRUE),
    Median = median(SAT, na.rm = TRUE),
    SD = sd(SAT, na.rm = TRUE),
    Count = n()
  )

# A tibble: 1 × 4
   Mean Median    SD Count
  <dbl>  <dbl> <dbl> <int>
1 1315.  1308.  137.  2000

11.2.2 Calculating values per row

Suppose that we want to calculate by what percentage each applicant exceeds (or falls short of) the overall SAT score for all applicants. We could run the following command:

st_info |> 
  mutate(rel_SAT = round(((SAT / mean(st_info$SAT)) - 1.0) * 100.0, 2)) |>
  arrange(Family, Given) |> 
  select(Full, County, St, Sex, Race, SAT, rel_SAT) |> 
  head(n=10)

# A tibble: 10 × 7
   Full           County   St    Sex   Race    SAT rel_SAT
   <chr>          <chr>    <fct> <fct> <fct> <dbl>   <dbl>
 1 Angie Abel     GEORGESC SC    F     W      1472   12.0 
 2 Ollie Acosta   RICHLASC SC    F     W      1280   -2.64
 3 Amber Adams    BEAUFOSC SC    F     W      1585   20.6 
 4 Conner Adams   HORRYSC  SC    M     W      1296   -1.43
 5 Gemma Adams    WILLIASC SC    F     B      1197   -8.96
 6 Jones Adams    HOUSTOGA GA    M     W      1315    0.02
 7 Walter Adams   GREENVSC SC    M     A      1356    3.14
 8 Caspian Adkins LEXINGSC SC    M     A      1181  -10.2 
 9 Dalet Aguilar  HORRYSC  SC    F     W      1151  -12.5 
10 Tim Aguilar    JEFFERGA GA    M     W      1323    0.63

This snippet performs several operations on the st_info dataset:

Mutation: It creates a new column rel_SAT, which calculates the relative SAT score as a percentage difference from the mean SAT score. Each value in rel_SAT indicates how much above or below the mean each individual SAT score is.
Arrangement: The data is then sorted by Family and Given names.
Selection and Display: It selects specific columns to be displayed and shows the first ten rows of this modified dataset.

Again, while this basically works, it also includes NA values in the calculation of the overall mean. While fixing that problem, we can also make the code a bit more readable by making a temporary column mean_SAT to hold that value (though without printing it). Let’s see how this works:

st_info |> 
  mutate(mean_SAT = mean(SAT, na.rm = TRUE)) |> 
  mutate(rel_SAT = round(((SAT / mean_SAT) - 1) * 100, 2)) |> 
  arrange(Family, Given) |> 
  select(Full, County, St, Sex, Race, SAT, rel_SAT) |> 
  head(n = 10)

# A tibble: 10 × 7
   Full           County   St    Sex   Race    SAT rel_SAT
   <chr>          <chr>    <fct> <fct> <fct> <dbl>   <dbl>
 1 Angie Abel     GEORGESC SC    F     W      1472   12.0 
 2 Ollie Acosta   RICHLASC SC    F     W      1280   -2.64
 3 Amber Adams    BEAUFOSC SC    F     W      1585   20.6 
 4 Conner Adams   HORRYSC  SC    M     W      1296   -1.43
 5 Gemma Adams    WILLIASC SC    F     B      1197   -8.96
 6 Jones Adams    HOUSTOGA GA    M     W      1315    0.02
 7 Walter Adams   GREENVSC SC    M     A      1356    3.14
 8 Caspian Adkins LEXINGSC SC    M     A      1181  -10.2 
 9 Dalet Aguilar  HORRYSC  SC    F     W      1151  -12.5 
10 Tim Aguilar    JEFFERGA GA    M     W      1323    0.63

Note that, although we calculated the mean_SAT column, we didn’t include it in the select() function so it didn’t print out.

11.3 Summary columns

11.3.1 Components of summary commands

Let’s take a moment and go through each of the components and the various calculations used within the summarize() function.

group_by(col1, col2, ...): The group_by() function is used to split the data into groups based on one or more variables. Subsequent operations are then performed on these groups rather than the whole dataset. This is useful when you want to perform summary statistics or transformations within individual groups. The groups created by group_by() are not explicitly visible in the output but are retained in the background to affect subsequent operations.
summarize(): This function reduces each group of values down to a single summary statistic. Multiple summaries are allowed.

n(): Counts the number of observations in each group. Useful to quickly understand the size of each group or as a denominator in average calculations.

n_distinct(): Counts the number of distinct values in each group. Helpful for understanding the variety or uniqueness within categorical variables in each group.
mean(col): Calculates the mean (average) of a column in each group. Provides a measure of central tendency; common in various statistical analyses to understand the data’s general behavior.
median(col): Calculates the median (middle value) of a column in each group. Another measure of central tendency that is less sensitive to outliers compared to the mean.
sd(col): Calculates the standard deviation of a column in each group, a measure of dispersion or variability. Useful to understand the spread or variability of the data around the mean.
min(col): Identifies the minimum value in each group. Helpful to understand the range and outliers in the data.
max(col): Identifies the maximum value in each group. Like the minimum, it’s crucial for understanding the data’s range and outliers.

11.3.2 Discussion of the whole pipeline

When you combine group_by() with summarize(), you create a powerful tool to understand the nuanced behaviors within different subsections or categories of your data. Together, they allow a detailed and specific summary analysis, facilitating a deeper understanding of data patterns and variations across groups.

Keep in mind that when you use summarize(), it will remove the grouping structure unless you explicitly choose to keep it. In most cases, this is beneficial as it simplifies the resulting tibble to contain just your summary statistics.

11.3.3 Example in one row

st_info |> 
  summarise(across(where(is.numeric), 
                   list(mean = ~mean(.x, na.rm = TRUE),
                        median = ~median(.x, na.rm = TRUE),
                        sd = ~sd(.x, na.rm = TRUE),
                        count = ~sum(!is.na(.x))),
                   .names = "{col}_{fn}"))

# A tibble: 1 × 8
  SAT_mean SAT_median SAT_sd SAT_count Age_mean Age_median Age_sd Age_count
     <dbl>      <dbl>  <dbl>     <int>    <dbl>      <dbl>  <dbl>     <int>
1    1315.      1308.   137.      2000     21.7       21.7  0.327      2000

11.3.4 Example in multiple rows

Consider the following. It calculates summary statistics per numeric column, with the results for each column shown on a separate row of the result.

st_info |> 
  select(where(is.numeric)) |>    # Select only numeric columns
  gather(column, value) |>        # Convert to long format
  group_by(column) |>             # Group by each original column
  summarise(
    mean = mean(value, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    sd = sd(value, na.rm = TRUE),
    count = sum(!is.na(value))
  )

# A tibble: 2 × 5
  column   mean median      sd count
  <chr>   <dbl>  <dbl>   <dbl> <int>
1 Age      21.7   21.7   0.327  2000
2 SAT    1315.  1308.  137.     2000

There is a lot going on here, so we want to take a few moments to explain each piece, going line-by-line.

select(where(is.numeric)): This selects only the columns in the data frame st_info where the columns are numeric. This part of the code sifts through the columns, keeping only those with numeric data and excluding others like characters or factors.
gather(column, value): This converts the data frame from a wide format to a long format.

The gather() function takes multiple columns and collapses them into key-value pairs, increasing the number of rows.
column will contain the names of the original columns, and value will contain the corresponding values. Note that neither of these names—column or value—mean anything! They are just to be used further down in the command.
For instance, if you have two columns (“A” and “B”), gather() puts all values in a single column and another column specifying where each value came from (“A” or “B”). Thus, it would go from this representation:

A   B
1   5
2   6

To this representation:

row column value
1   A      1
1   B      5
2   A      2
2   B      6

This is what it means to convert from a wide format to a long format.

group_by(column): This groups the data by the original column names.

This function segments the data into groups based on unique column names.
Operations following this function (like summarise()) will operate independently within each group.

summarise(): This calculates summary statistics for each group.

mean = mean(value, na.rm = TRUE) calculates the mean of the numeric values, ignoring NA values.
median = median(value, na.rm = TRUE) calculates the median, ignoring NA values.
sd = sd(value, na.rm = TRUE) calculates the standard deviation, ignoring NA values.
count = sum(!is.na(value)) counts the non-NA values in each group.

The overall result is a summary table with calculated mean, median, standard deviation, and count for each numeric column of the original data frame, st_info.

In summary:

The pipeline starts by selecting numeric columns.
It then reshapes the data to a long format.
Following this, it groups the data by original columns.
Lastly, it calculates summary statistics for each of these groups, resulting in a concise, informative table.

11.3.5 `across()`

11.3.5.1 Selects

selects the columns you want to operate on

where(is.numeric)
c(list of column names)

11.3.5.2 Function or list of functions

colname = ~funcname(.x, na.rm = TRUE):

12 Differences between `arrange()` and `group_by()`

The functions arrange() and group_by()—though they seem similar—serve different purposes when manipulating and analyzing data in R. Here’s a breakdown of their differences.

12.1 `arrange()`

arrange() is used to sort or order rows in a data frame or tibble based on the values of one or more columns. It rearranges the rows in ascending order by default, but you can also sort them in descending order. Consider these two examples:

df <- tibble(x = c(3, 2, 1), y = c("A", "B", "C"))
df |> arrange(x) # sort in ascending order

# A tibble: 3 × 2
      x y    
  <dbl> <chr>
1     1 C    
2     2 B    
3     3 A

Note that the x column is arranged in ascending order as we specified.

df |> arrange(desc(x)) # sort in descending order

# A tibble: 3 × 2
      x y    
  <dbl> <chr>
1     3 A    
2     2 B    
3     1 C

And now the x column is arranged in descending order.

12.2 `group_by()`

group_by() is used to split data into groups based on one or more variables. This is particularly useful when you want to perform some operations separately on each group, such as summarizing or mutating.

Let’s first define a simple tibble with three rows and two columns:

df <- tibble(x = c("A", "B", "A"), y = c(1, 2, 3))
df |> select(x, y)

# A tibble: 3 × 2
  x         y
  <chr> <dbl>
1 A         1
2 B         2
3 A         3

Suppose that we want to calculate the average y value for every value of x. This is what group_by() is for:

df |> 
  group_by(x) |>  # Grouping by a single variable
  summarise(mean_y = mean(y))

# A tibble: 2 × 2
  x     mean_y
  <chr>  <dbl>
1 A          2
2 B          2

Notice that we did not have to arrange() the tibble in any way first; the group_by() command handles the ordering and the grouping itself.

Finally, we can also calculate summary statistics when grouping by multiple columns:

df |> 
  group_by(x, y) |>  # Grouping by multiple variables
  summarise(count = n())

# A tibble: 3 × 3
# Groups:   x [2]
  x         y count
  <chr> <dbl> <int>
1 A         1     1
2 A         3     1
3 B         2     1

This counts, correctly, that there is one each of A,1, A,3, and B,2.

12.3 Key Differences

arrange() modifies the order of rows, changing the way data is displayed but not how it’s fundamentally structured.
group_by(), on the other hand, changes the way data is treated in subsequent operations, particularly when paired with verbs like summarise() or mutate(), to apply functions within each group.

In essence, arrange() is about changing the row order for presentation, while group_by() is about defining subsets of the data to operate on in subsequent steps of your data analysis pipeline. They can also be used in conjunction to first group data and then arrange the resulting summarized data.

Read this documentation for more information.