The Pipe

Before continuing, we need to spend a bit of time on R’s pipe operator, written either |> or %>%. (Let’s get this out of the way: there are historical reasons why two pipe operators exist. The %>% version is the original and comes from the magrittr package; the |> version is newer, has been built into R itself since version 4.1.0, and is the one we prefer. You will see both in your explorations.)

The pipe is one of the distinguishing features of R scripts…and it is awesome!

Think of the pipe as enabling you to easily define a sequence of actions—kind of like steps in the creation of a recipe: “First you do this, then you do this, and lastly, you do this.”
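As a minimal sketch of that “first this, then this” idea, here is a tiny pipeline that uses only base R. The numbers are made up for illustration, and the |> operator assumes R 4.1.0 or later:

```r
# Start with three made-up numbers, take their square roots, then add them up.
total <- c(1, 4, 9) |>
  sqrt() |>
  sum()

# The same calculation without the pipe, read from the inside out.
total_nested <- sum(sqrt(c(1, 4, 9)))

print(total)         # 6
print(total_nested)  # 6
```

Each step receives the result of the step before it as its first argument, which is all the pipe does.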

1 Creating a recipe using a pipe

Let’s take a few moments to consider the following. Don’t worry that you don’t understand all of the details for now—we expend a lot of effort on this site to ensure that it doesn’t seem like a foreign language. We think you’ll see that the general structure is easy enough to understand:

st_info |> 
 select("Application ID", St, Sex, SAT) |> 
 filter(Sex == "M") |> 
 group_by(St) |> 
 summarize(MinSAT = min(SAT), 
           MeanSAT = mean(SAT), 
           MaxSAT = max(SAT), 
           Count = n())

# A tibble: 2 × 5
  St    MinSAT MeanSAT MaxSAT Count
  <chr>  <dbl>   <dbl>  <dbl> <int>
1 GA       978   1312.   1600   225
2 SC      1008   1313.   1600   695

The command can be read, line-by-line, as follows:

Line 1: On the data in the st_info data frame,
Line 2: select four specific variables (columns),
Line 3: then filter (choose) the rows (observations) for which the Sex variable has the "M" value.
Line 4: Next, group the rows together by the value in the St (“state”) variable.
Lines 5–7: Finally, summarize the SAT column by state by calculating the minimum, mean, and maximum values.
Line 8: Also, calculate the number of rows within each grouping.

The output of the command, as you would hope given the above description, is a tibble with one row for each state, showing the minimum, average, and maximum SAT score calculated for that state. It also shows the number of rows within each grouping.
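If you would like to run the command yourself without the course data, the sketch below builds a tiny stand-in for st_info. The six applicants, their scores, and the extra GPA column are made up for illustration, so the numbers will not match the output above. It assumes the dplyr package is installed:

```r
library(dplyr)

# Hypothetical stand-in for st_info: six made-up applicants, not the course data.
st_info <- tibble(
  `Application ID` = 1:6,
  St  = c("GA", "GA", "SC", "SC", "SC", "GA"),
  Sex = c("M", "F", "M", "M", "F", "M"),
  SAT = c(1200, 1400, 1000, 1300, 1500, 1100),
  GPA = c(3.1, 3.8, 2.9, 3.5, 3.9, 3.0)   # extra column that select() will drop
)

# The same recipe as above: select, filter, group, summarize.
result <- st_info |>
  select("Application ID", St, Sex, SAT) |>
  filter(Sex == "M") |>
  group_by(St) |>
  summarize(MinSAT = min(SAT),
            MeanSAT = mean(SAT),
            MaxSAT = max(SAT),
            Count = n())

print(result)
```

With this made-up data, each state has two male applicants, so the result has two rows with a Count of 2 in each.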

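FYI, within RStudio, you can type the pipe operator with the keys Ctrl-Shift-M (or Cmd-Shift-M on the Mac). This gets rid of some of the awkwardness of typing this operator. Depending on your RStudio settings, this shortcut inserts either %>% or |>; the choice is an option under Tools > Global Options > Code.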

Here is a more detailed reading of the above command from top to bottom.

st_info: We will be working with the st_info dataframe.
select("Application ID", St, Sex, SAT): First, we are selecting just four variables from the st_info dataframe to work with.
filter(Sex == "M"): Second, we want to keep only those rows/observations for which the Sex variable equals "M". Notice the double equals sign; this is the operator for testing equality. A single equals sign assigns a value to a name (which you will see in a moment).
group_by(St): Third, we want the remaining rows of the dataframe to be grouped by the values in the St variable (column).
summarize(MinSAT = min(SAT), MeanSAT = mean(SAT), MaxSAT = max(SAT), Count = n()): Finally, we want to calculate four values for each state: the minimum SAT, the mean SAT, the maximum SAT, and the number of rows (which is what n() counts). The values are calculated per state because the dataframe has been grouped by that column. The results are assigned to four new columns (with the names shown on the left of the equals signs).
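The difference between the double and single equals signs in the steps above can be seen in a few lines of base R. This is only a sketch; the values are made up for illustration:

```r
sex <- c("M", "F", "M")   # made-up values for illustration

# Double equals asks a question of every element: "is this value M?"
is_male <- sex == "M"
print(is_male)            # TRUE FALSE TRUE

# filter() keeps the rows where the answer is TRUE; the base R idea is the same:
print(sex[is_male])       # "M" "M"

# Single equals gives a name to a value (here, naming a list element),
# just as MinSAT = min(SAT) names a new column inside summarize().
stats <- list(MinSAT = min(c(1200, 1100)))
print(stats$MinSAT)       # 1100
```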

That’s the recipe: Specify the ingredients (in this case, just the st_info dataframe), and then specify the steps to take in order to create the final output (in this case, select, filter, group_by, and summarize). Other scripts can and will use different steps in different orders and will work on different dataframes.

We said above, when introducing the pipe operator, that it is awesome. We definitely believe it, but we want you to see a couple of examples that show why we think so.

2 Alternative approaches that do not use a pipe

Below, we present two different scripts that don’t use the pipe operator but that deliver the same output as the script above. Both are quite traditional ways of programming in languages that don’t have a pipe operator. And if R didn’t have the pipe operator, then this is how we would have to work. Note that we are not trying to make these examples complicated and clunky; they just come out this way.

2.1 Nesting commands

This first alternative works by nesting commands within parentheses. You should read this command from the inside out—you select, then filter, then group_by, and finally summarize. This is the same as before but it is, at least to our eyes, a lot more difficult to read:

summarize(group_by(filter(select(st_info, 
                                 "Application ID", 
                                 St, Sex, SAT),
                           Sex == "M"), 
                    St), 
          MinSAT = min(SAT),
          MeanSAT = mean(SAT),
          MaxSAT = max(SAT),
          Count = n())

That’s kind of a mess, isn’t it? It is hard to read amid a nest of parentheses. As much as possible, we will refrain from using this approach. And, we’re guessing, you will follow our lead on this without too much encouragement.
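One way to convince yourself that the piped and nested forms are the same command: R rewrites the native pipe into a nested call when it reads your code. The sketch below only inspects the code without running it, so it needs no data or packages. It uses a shortened version of the command and assumes R 4.1.0 or later:

```r
# quote() captures code without running it, so st_info does not need to exist.
piped <- quote(
  st_info |>
    select(St, Sex, SAT) |>
    filter(Sex == "M")
)

nested <- quote(filter(select(st_info, St, Sex, SAT), Sex == "M"))

# Printing the piped version shows the nested call that R actually stores.
print(piped)

# Compare the two captured expressions.
same_call <- identical(piped, nested)
print(same_call)
```

So the pipe changes only how the command reads, not what R does with it.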

2.2 Creating temporary tables

Here is the second alternative. (Again, it is equivalent to both the original pipe-based version and the nested-parentheses version above.) Here what we do is perform one action at a time and then store the temporary result in a temporary dataframe. We perform the steps in the same order (select on st_info, filter, group_by, and then summarize):

tb1 <- select(st_info, "Application ID", St, Sex, SAT)
tb2 <- filter(tb1, Sex == "M")
tb3 <- group_by(tb2, St)
summarize(tb3, 
          MinSAT = min(SAT),
          MeanSAT = mean(SAT),
          MaxSAT = max(SAT), 
          Count = n())
rm(tb1, tb2, tb3)

Now this looks quite a bit like the original pipe-based version, but it awkwardly creates those temporary tables (tb1, tb2, and tb3) that serve no other purpose than to be used by the next command. To make it nice and neat, it is best to use the rm() command to remove the dataframes that you no longer need.
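To check the claim that all three styles give the same answer, here is a small sketch that runs each one and compares the results. Because st_info is not available outside the course, it uses R’s built-in mtcars data set instead, with a shorter recipe (filter, group_by, summarize). It assumes the dplyr package is installed:

```r
library(dplyr)

# Style 1: the pipe.
piped <- mtcars |>
  filter(am == 1) |>
  group_by(cyl) |>
  summarize(MeanMPG = mean(mpg), Count = n())

# Style 2: nested commands, read from the inside out.
nested <- summarize(group_by(filter(mtcars, am == 1), cyl),
                    MeanMPG = mean(mpg), Count = n())

# Style 3: temporary tables, cleaned up afterwards with rm().
tb1 <- filter(mtcars, am == 1)
tb2 <- group_by(tb1, cyl)
stepwise <- summarize(tb2, MeanMPG = mean(mpg), Count = n())
rm(tb1, tb2)

# Each comparison reports whether two results match exactly.
print(identical(piped, nested))
print(identical(piped, stepwise))
```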

Go back one more time and look at the pipe-based version. Now, that doesn’t look so bad, does it? Again, we will be using this approach for the rest of the class, so if you don’t quite get it now, it’ll be easier for you over time.