For loops

1 Iteration and R

A strength of R’s design is that its functions work on entire vectors at once. Recall that a “vector” is a sequence of numbers, characters, dates, or other data of a single type, stored in a single object.

nums = 1:10 # create a vector of the numbers 1 to 10 

If you want to add 1 to each number in the vector, you don’t need a loop.

nums_plus_one = nums + 1 # performs the operation for all the numbers in the vector
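For comparison, here is a minimal sketch of the explicit loop that the vectorized expression replaces. The name nums_loop is ours, introduced only for this illustration.

```r
nums = 1:10

# vectorized: one expression handles every element
nums_plus_one = nums + 1

# loop version: same result, more bookkeeping
nums_loop = numeric(length(nums))   # pre-allocate the result vector
for (i in seq_along(nums)) {
  nums_loop[i] = nums[i] + 1
}

identical(nums_plus_one, nums_loop)   # TRUE
```

Both versions produce the numbers 2 through 11; the vectorized form simply leaves the looping to R.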

The same is true for data frames, since a data frame is just a list of columns, and each column is a vector of the same length. That’s why the tidyverse functions are so intuitive to read: R and the tidyverse hide all that iteration over individual elements and let you focus on the big picture.
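A small base-R sketch can make the "list of columns" idea concrete. The data frame df and its columns x, y, and total are made up for this illustration.

```r
df = data.frame(x = 1:3, y = c(10, 20, 30))

is.list(df)    # TRUE: a data frame is a list ...
length(df)     # 2: ... with one element per column

# column arithmetic is vectorized, so no loop over rows is needed
df$total = df$x + df$y
df
```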

The tidyverse purrr package provides more powerful functions for iteration, but that topic is beyond the scope of this course. We can get by just fine with simpler methods.

Sometimes we want to divide a data set into groups and perform a complex operation on each subset, for example creating separate data sets or reports by academic department, by student class, or by year. Here’s a recipe for doing that.

For the purposes of our example, we’re going to create this tiny data table.

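library(tidyverse)  # provides mutate(), distinct(), filter(), count(), pivot_wider(), write_csv(), and str_c()

grades = data.frame(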
  student_id  = c(1, 2, 3, 4, 5, 6, 7, 8, 9) * 2,
  st_first    = c("Alice", "Bob", "Charlie", "David",
                  "Eve", "Stanislav", "Yolanda",
                  "Zoe", "Xavier"),
  st_last     = c("Smith", "Jones", "Kline", "White",
                  "Zettle", "Bernard-Zza", "Zhang",
                  "Xu", "Zimmerman"),
  subject     = c("Math", "Math", "English", "English",
                  "English", "English", "History",
                  "History", "History"),
  prof_id     = c(1, 2, 3, 3, 4, 4, 5, 5, 6),
  grade       = c("A", "B", "A", "C", "B", "A", "A",
                  "B", "A")
)
grades <- 
  grades |> 
    mutate(grade = factor(grade,
                          levels = c("A", "B", "C", "D", "F"),
                          ordered = TRUE))
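The grade column is stored as an ordered factor with all five letter grades as levels, even though no one in this tiny table received a D or an F. The sketch below, using a made-up vector g, shows what that buys us: unused levels are kept, and the grades can be compared.

```r
g = factor(c("A", "B", "A", "C"),
           levels = c("A", "B", "C", "D", "F"),
           ordered = TRUE)

levels(g)   # all five levels are kept, including the unused "D" and "F"
table(g)    # unused levels appear with a count of 0
g <= "B"    # ordered factors support comparisons; "A" is the lowest level
```

Declaring the levels explicitly also fixes the order in which the grades sort, which is useful whenever grade columns need to appear from A to F.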

2 Recipe for iterating over groups

To make this concrete, suppose we want to create a grade report for each academic subject. The report shows each professor’s grade distribution for the last academic year. We’ll output that in a CSV file or a PDF report.

  1. Create the combined data set. The grades data frame (above) is a tiny example.
  2. Identify the groups in a new data frame. In this case, the groups are the unique values in the subject column. distinct() returns them in the order they first appear in the data; to sort them alphabetically instead, add arrange(subject) (calling arrange() with no column names leaves the order unchanged).
subjects = grades |> distinct(subject)
subjects
  subject
1    Math
2 English
3 History
  3. Create a for loop that looks like this (it is basically a test):
for(i in 1:nrow(subjects)){
  print(i)  # test it out
  #we'll add more here in the next step
}
[1] 1
[1] 2
[1] 3

Here’s what it’s doing.

  • The first time R encounters the for loop, it works out that i will take each value from 1 to nrow(subjects) in turn (which it calculates is 3).
  • So now it sets i to 1.
  • It now knows that it is supposed to execute everything in the curly brackets ({ and }) from top to bottom.
  • It prints out the value of i, which is the number 1.
  • It ignores everything to the right of the hash # on lines 2 and 3 (because those are comments).
  • It encounters the ending curly bracket, and proceeds to the top of the for loop.
  • It now sets i to 2.
  • It prints out the value of i, which is the number 2.
  • It encounters the ending curly bracket, and proceeds to the top of the for loop.
  • It now sets i to 3.
  • It prints out the value of i, which is the number 3.
  • It encounters the ending curly bracket, and proceeds to the top of the for loop.
  • It sees that there are no more values for i, jumps to the ending curly bracket, and proceeds with the rest of the program (of which there is none in this example).
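One caveat about the 1:nrow(subjects) pattern, shown as a sketch with a hypothetical empty data frame named empty: when there are zero rows, 1:0 counts down from 1 to 0, so the loop body would run twice with invalid indexes. seq_len() is a common safeguard because it returns an empty sequence in that case.

```r
empty = data.frame(subject = character(0))   # a data frame with no rows

1:nrow(empty)          # 1 0  (counts down, probably not what we want)
seq_len(nrow(empty))   # integer(0)

passes = 0
for (i in seq_len(nrow(empty))) {
  passes = passes + 1  # never reached, because there is nothing to iterate over
}
passes                 # 0
```

For the three-row subjects table in this example, both forms behave identically.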
  4. Let’s expand on that test and get R to print out each subject individually.

We know multiple ways to access data items in a column. In this case, we know the following:

  • We know how to count. (Don’t worry, this becomes important in a minute.)
  • We know how to determine the number of rows in a data frame: nrow(df).
  • We know that a for loop can iterate from 1 to a specified number.
  • We know how to access a value in a specific row in a specific column: df$col[num].

Combine all of that, and we can write a for loop that accesses each subject name. Notice that we’re iterating through the subjects data frame, rather than grades, because we want to work on each subject name only once.

for(i in 1:nrow(subjects)){
  
    my_subject = subjects$subject[i]
    print(my_subject)
  
    #we'll add more here in the next step
}
[1] "Math"
[1] "English"
[1] "History"
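As an aside, a for loop can also step through the values of a column directly instead of through row numbers. This sketch rebuilds a stand-in subjects table so that it runs on its own; it is an alternative style, not a replacement for the recipe in this section.

```r
subjects = data.frame(subject = c("Math", "English", "History"),
                      stringsAsFactors = FALSE)

# my_subject takes each value of the column in turn
for (my_subject in subjects$subject) {
  print(my_subject)
}
```

The index-based form is still handy when the row number itself is needed, for example to look up a value in another column of the same row.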
  5. Inside the loop, filter the data frame to the current group. On each of the three passes through the for loop, R will create a new version of subject_grades that contains only the rows for that particular subject.

You can see from the output of nrow() that filter() has worked as intended, creating the subject_grades data frame on each pass.

for(i in 1:nrow(subjects)){
  
  my_subject = subjects$subject[i]
  print(my_subject) 
  
  # filter the grades data frame to the current subject
  subject_grades = grades |> 
                     filter(subject == my_subject)
  print(nrow(subject_grades))
  
  #we'll add more here in the next step
}
[1] "Math"
[1] 2
[1] "English"
[1] 4
[1] "History"
[1] 3
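One way to cross-check the row counts printed by the loop is to count the rows per subject in a single call. This sketch uses only the subject column of the example data, and the name subject_counts is ours.

```r
library(dplyr)

grades = data.frame(
  subject = c("Math", "Math", "English", "English",
              "English", "English", "History",
              "History", "History")
)

# one row per subject, with the number of matching rows in column n
subject_counts = grades |> count(subject)
subject_counts
```

The counts (2 for Math, 4 for English, 3 for History) should match what the loop printed, although count() lists the subjects in sorted order.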
  6. Inside the loop, perform the calculations on the filtered data that are to go in the report. Then write the result to a CSV file or another output format.
for(i in 1:nrow(subjects)){
  
  my_subject = subjects$subject[i] 

  # filter the grades data frame to the current subject
  subject_grades = grades |> 
    filter(subject == my_subject)
  
  # create the grade distribution
  prof_grades = subject_grades |> 
    count(prof_id, grade) |> 
    pivot_wider(names_from = grade, 
                values_from = n,
                names_sort = TRUE,
                values_fill = 0)
  
  # for purposes of this demo, we'll just print the data 
  print(my_subject)
  print(subject_grades)
  print(prof_grades)
  
  # Write the data to a csv file (usually put this 
  # in an output folder)
  # To create the csv files, un-comment the line below
  #write_csv(prof_grades, str_c(my_subject, "_grades.csv"))
}
[1] "Math"
  student_id st_first st_last subject prof_id grade
1          2    Alice   Smith    Math       1     A
2          4      Bob   Jones    Math       2     B
# A tibble: 2 × 3
  prof_id     A     B
    <dbl> <int> <int>
1       1     1     0
2       2     0     1
[1] "English"
  student_id  st_first     st_last subject prof_id grade
1          6   Charlie       Kline English       3     A
2          8     David       White English       3     C
3         10       Eve      Zettle English       4     B
4         12 Stanislav Bernard-Zza English       4     A
# A tibble: 2 × 4
  prof_id     A     B     C
    <dbl> <int> <int> <int>
1       3     1     0     1
2       4     1     1     0
[1] "History"
  student_id st_first   st_last subject prof_id grade
1         14  Yolanda     Zhang History       5     A
2         16      Zoe        Xu History       5     B
3         18   Xavier Zimmerman History       6     A
# A tibble: 2 × 3
  prof_id     A     B
    <dbl> <int> <int>
1       5     1     1
2       6     1     0
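The comment in the loop suggests writing the files to an output folder. Here is a minimal sketch of that step, under a few assumptions: the folder name "output" is hypothetical, it is created inside R's temporary directory so the example leaves no files behind, the report data frame is a placeholder standing in for prof_grades, and base R's write.csv() and paste0() are used in place of write_csv() and str_c().

```r
# hypothetical output folder, created if it does not exist yet
out_dir = file.path(tempdir(), "output")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

for (my_subject in c("Math", "English", "History")) {
  # placeholder for the real per-subject report (prof_grades in the loop above)
  report = data.frame(subject = my_subject)

  # build a path such as <out_dir>/Math_grades.csv
  out_file = file.path(out_dir, paste0(my_subject, "_grades.csv"))
  write.csv(report, out_file, row.names = FALSE)
}

list.files(out_dir)   # one csv file per subject
```

Building the path with file.path() avoids hard-coding the path separator, so the same code works on Windows, macOS, and Linux.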