For loops
1 Iteration and R
The beauty of R’s design is that its functions are designed to work on entire vectors at once. Recall that a “vector” is a sequence of numbers, characters, dates, or other data in a single object.
If you want to add 1 to each number in the vector, you don’t need a loop.
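A minimal sketch of that idea, using a made-up vector named x (the name and values are only for illustration):

```r
# a small numeric vector (hypothetical values)
x <- c(10, 20, 30)

# adding 1 applies to every element at once; no loop is needed
y <- x + 1
print(y)
# [1] 11 21 31
```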
The same is true for data frames, since a data frame is just a list of columns, and each column is a vector of the same length. That’s why the tidyverse functions are so intuitive to read. R and the tidyverse hide all that iteration over individual elements to let you focus on the big picture.
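The same whole-column behavior can be sketched with a small, made-up data frame (scores and its column names are hypothetical):

```r
# a tiny data frame; each column is a vector of length 3
scores <- data.frame(
  name   = c("Ann", "Ben", "Cam"),
  points = c(10, 20, 30)
)

# arithmetic on a column operates on the whole column at once
scores$points_plus_bonus <- scores$points + 5
print(scores)
```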
The tidyverse purrr package offers more advanced functions for iteration, but that topic is beyond the scope of this course. We can get by just fine with simpler methods.
Sometimes we want to divide a data set into groups and perform a complex operation on each subset. For example, we might create a separate data set or report for each academic department, student class, or year. Here’s a recipe for doing that.
For the purposes of our example, we’re going to create this tiny data table.
# mutate(), filter(), count(), pivot_wider(), write_csv(), and str_c()
# all come from the tidyverse
library(tidyverse)

grades = data.frame(
  student_id = c(1, 2, 3, 4, 5, 6, 7, 8, 9) * 2,
  st_first = c("Alice", "Bob", "Charlie", "David",
               "Eve", "Stanislav", "Yolanda",
               "Zoe", "Xavier"),
  st_last = c("Smith", "Jones", "Kline", "White",
              "Zettle", "Bernard-Zza", "Zhang",
              "Xu", "Zimmerman"),
  subject = c("Math", "Math", "English", "English",
              "English", "English", "History",
              "History", "History"),
  prof_id = c(1, 2, 3, 3, 4, 4, 5, 5, 6),
  grade = c("A", "B", "A", "C", "B", "A", "A",
            "B", "A")
)

# store grade as an ordered factor so the levels always sort from A to F
grades <-
  grades |>
  mutate(grade = factor(grade,
                        levels = c("A", "B", "C", "D", "F"),
                        ordered = TRUE))
2 Recipe for iterating over groups
To make this concrete, suppose we want to create a grade report for each academic subject. The report shows each professor’s grade distribution for the last academic year. We’ll output that in a CSV file or a PDF report.
- Create the combined data set. The grades data frame (above) is a tiny example.
- Identify the groups in a new data frame. In this case, the groups are the unique values in the subject column. We’ll go ahead and put them in alphabetical order while we’re at it.
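One way to build that data frame of groups is sketched below. It assumes dplyr is installed, and it rebuilds only the subject column of grades so the block runs on its own. distinct() keeps the subjects in order of first appearance (Math, English, History), which is the order the output later in this section shows; adding arrange(subject) would sort them alphabetically instead (English, History, Math).

```r
library(dplyr)

# stand-in for the grades data frame above (subject column only)
grades <- data.frame(
  subject = c("Math", "Math", "English", "English",
              "English", "English", "History",
              "History", "History")
)

# one row per unique subject, in order of first appearance
subjects <- grades |>
  distinct(subject)

# optional: the same groups sorted alphabetically
subjects_sorted <- subjects |>
  arrange(subject)

print(subjects)
print(subjects_sorted)
```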
- Create a for loop that looks like this (which is basically a test):
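A minimal sketch of such a test loop, reconstructed from the output and the walk-through that follow. The wording of the comments is an assumption, and subjects is rebuilt here as a stand-in so the block runs on its own. The second and third lines of the loop itself carry the comments that the walk-through mentions.

```r
# stand-in for the subjects data frame (one row per subject)
subjects <- data.frame(subject = c("Math", "English", "History"))

for (i in 1:nrow(subjects)) {
  print(i)  # show which pass of the loop we are on
  # we'll add more here in the next step
}
```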
[1] 1
[1] 2
[1] 3
Here’s what it’s doing.
- The first time R encounters the for loop, it figures out that i will eventually take the values from 1 to nrow(subjects) (which it calculates is 3).
- So now it sets i to 1.
- It now knows that it is supposed to execute everything in the curly brackets ({ and }) from top to bottom.
- It prints out the value of i, which is the number 1.
- It ignores everything to the right of the hash # on lines 2 and 3 (because those are comments).
- It encounters the ending curly bracket, and proceeds to the top of the for loop.
- It now sets i to 2.
- It prints out the value of i, which is the number 2.
- It encounters the ending curly bracket, and proceeds to the top of the for loop.
- It now sets i to 3.
- It prints out the value of i, which is the number 3.
- It encounters the ending curly bracket, and proceeds to the top of the for loop.
- It sees that there are no more values for i, jumps to the ending curly bracket, and proceeds with the rest of the program (of which there is none in this example).
- Let’s expand on that test and get R to print out each subject, individually.
We know multiple ways to access data items in a column. In this case, we know the following:
- We know how to count. (Don’t worry, this becomes important in a minute.)
- We know how to determine the number of rows in a data frame: nrow(df).
- We know that a for loop can iterate from 1 to a specified number.
- We know how to access a value in a specific row in a specific column: df$col[num].
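A quick check of those building blocks, using a hypothetical two-column data frame named df. Note that nrow() is meant for data frames and matrices; on a plain vector it returns NULL, and length() is the function to use there.

```r
# hypothetical data frame for illustration
df <- data.frame(col = c("x", "y", "z"), n = c(1, 2, 3))

n_rows <- nrow(df)        # number of rows in the data frame
second <- df$col[2]       # value in row 2 of column col
v_len  <- length(df$col)  # length of a single column (a vector)
v_rows <- nrow(df$col)    # NULL: a plain vector has no dim attribute

print(n_rows)
print(second)
print(v_len)
print(is.null(v_rows))
```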
Put all of that together, and we can write a for loop that accesses each subject name. Notice that we’re iterating through the subjects data frame because we want to work on each subject name only once.
for(i in 1:nrow(subjects)){
  my_subject = subjects$subject[i]
  print(my_subject)
  # we'll add more here in the next step
}
[1] "Math"
[1] "English"
[1] "History"
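As a side note, two common variations on this loop are sketched below; neither comes from the recipe above. seq_len(nrow(subjects)) behaves like 1:nrow(subjects) but yields no iterations when the data frame has zero rows (1:0 would count down through 1 and 0), and a for loop can also walk over the values themselves. The subjects data frame is rebuilt here as a stand-in so the block runs on its own.

```r
# stand-in for the subjects data frame
subjects <- data.frame(subject = c("Math", "English", "History"))

# variation 1: index-based, but safe when there are zero rows
for (i in seq_len(nrow(subjects))) {
  print(subjects$subject[i])
}

# variation 2: iterate over the subject names directly
for (my_subject in subjects$subject) {
  print(my_subject)
}

# with an empty data frame, seq_len() gives nothing to loop over
empty <- subjects[0, , drop = FALSE]
passes <- 0
for (i in seq_len(nrow(empty))) {
  passes <- passes + 1
}
print(passes)
```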
- Inside the loop, filter the data frame to the current group. On each of the three passes through the for loop, R will create a new version of subject_grades which only contains the rows for that particular subject.
You can see from the output of nrow() that the filter() has worked appropriately to create the subject_grades data frame on each pass.
for(i in 1:nrow(subjects)){
  my_subject = subjects$subject[i]
  print(my_subject)

  # filter the grades data frame to the current subject
  subject_grades = grades |>
    filter(subject == my_subject)
  print(nrow(subject_grades))

  # we'll add more here in the next step
}
[1] "Math"
[1] 2
[1] "English"
[1] 4
[1] "History"
[1] 3
- Inside the loop, perform the calculations on the filtered data that are to go in the report. Then write the results to a CSV file or another output format.
for(i in 1:nrow(subjects)){
  my_subject = subjects$subject[i]

  # filter the grades data frame to the current subject
  subject_grades = grades |>
    filter(subject == my_subject)

  # create the grade distribution
  prof_grades = subject_grades |>
    count(prof_id, grade) |>
    pivot_wider(names_from = grade,
                values_from = n,
                names_sort = TRUE,
                values_fill = 0)

  # for purposes of this demo, we'll just print the data
  print(my_subject)
  print(subject_grades)
  print(prof_grades)

  # Write the data to a csv file (usually put this
  # in an output folder)
  # To create the csv files, un-comment the line below
  #write_csv(prof_grades, str_c(my_subject, "_grades.csv"))
}
[1] "Math"
  student_id st_first st_last subject prof_id grade
1          2    Alice   Smith    Math       1     A
2          4      Bob   Jones    Math       2     B
# A tibble: 2 × 3
  prof_id     A     B
    <dbl> <int> <int>
1       1     1     0
2       2     0     1
[1] "English"
  student_id  st_first     st_last subject prof_id grade
1          6   Charlie       Kline English       3     A
2          8     David       White English       3     C
3         10       Eve      Zettle English       4     B
4         12 Stanislav Bernard-Zza English       4     A
# A tibble: 2 × 4
  prof_id     A     B     C
    <dbl> <int> <int> <int>
1       3     1     0     1
2       4     1     1     0
[1] "History"
  student_id st_first   st_last subject prof_id grade
1         14  Yolanda     Zhang History       5     A
2         16      Zoe        Xu History       5     B
3         18   Xavier Zimmerman History       6     A
# A tibble: 2 × 3
  prof_id     A     B
    <dbl> <int> <int>
1       5     1     1
2       6     1     0
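The comment in the loop suggests writing the files to an output folder. A minimal sketch of that step follows, under several assumptions: the folder name output is hypothetical, a temporary directory is used so nothing is written into the project, base R's write.csv() stands in for write_csv(), and a small made-up table stands in for one subject's prof_grades.

```r
# hypothetical output folder inside a temporary directory
out_dir <- file.path(tempdir(), "output")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# stand-in for one subject's prof_grades table
my_subject  <- "Math"
prof_grades <- data.frame(prof_id = c(1, 2), A = c(1L, 0L), B = c(0L, 1L))

# build a path such as <out_dir>/Math_grades.csv and write the file
out_file <- file.path(out_dir, paste0(my_subject, "_grades.csv"))
write.csv(prof_grades, out_file, row.names = FALSE)

print(file.exists(out_file))
```

Inside the real loop, the same file.path() call would replace the bare file name so that each subject's file lands in the chosen folder.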