Summary calculations

On this page, we’re going to explain how to calculate summary calculations on the whole table and on subsets of a table (based on values in specific grouping columns). We will focus on the usage of the group_by() and summarize() commands.

This is the data that we’re going to be working with on this page:

names(survey)

 [1] "Year"       "ID"         "NPS"        "Field"      "ClassLevel"
 [6] "Status"     "Gender"     "BirthYear"  "FinPL"      "FinSch"    
[11] "FinGov"     "FinSelf"    "FinPar"     "FinOther"   "Question"  
[16] "Response"   "NumResp"

glimpse(survey)

Rows: 268,481
Columns: 17
$ Year       <int> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012,…
$ ID         <chr> "aacuutwvloft", "aacuutwvloft", "aacuutwvloft", "aacuutwvlo…
$ NPS        <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 4, 4, 4, 4, 4, 4,…
$ Field      <chr> "Undecl", "Undecl", "Undecl", "Undecl", "Undecl", "Undecl",…
$ ClassLevel <ord> Jr, Jr, Jr, Jr, Jr, Jr, Jr, Jr, Jr, Jr, Jr, Jr, Jr, Jr, Fre…
$ Status     <fct> Full-time, Full-time, Full-time, Full-time, Full-time, Full…
$ Gender     <fct> Female, Female, Female, Female, Female, Female, Female, Mal…
$ BirthYear  <int> 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1992, 1992, 1992,…
$ FinPL      <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No",…
$ FinSch     <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Ye…
$ FinGov     <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No",…
$ FinSelf    <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"…
$ FinPar     <chr> NA, NA, NA, NA, NA, NA, NA, "No", "No", "No", "No", "No", "…
$ FinOther   <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No",…
$ Question   <ord> ClassTooBig, FinAid, OverallValue, PoorTeaching, Sched, Too…
$ Response   <ord> Neutral, Agree, Strongly Agree, Strongly Agree, Disagree, D…
$ NumResp    <dbl> 3, 4, 5, 5, 2, 2, 2, 1, 3, 4, 1, 5, 3, 5, 1, 2, 4, 1, 1, 5,…

This is sample data from a data table about a student retention survey. It has over 250,000 rows and 17 columns.

1 Calculating on whole table

R has built-in functions for calculating the minimum, average/mean, median, maximum, standard deviation, the raw count of the number of items, and the number of distinct items:

survey |> 
  summarize(MinResp = min(NumResp, na.rm = TRUE),
            AvgResp = mean(NumResp, na.rm = TRUE),
            MedResp = median(NumResp, na.rm = TRUE),
            MaxResp = max(NumResp, na.rm = TRUE),
            SDResp = sd(NumResp, na.rm = TRUE),
            Count = n(),
            Distinct = n_distinct(NumResp))

# A tibble: 1 × 7
  MinResp AvgResp MedResp MaxResp SDResp  Count Distinct
    <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <int>    <int>
1       1    3.22       3       5   1.37 268481        5

We use the additional argument na.rm to tell R to remove NA values before calculating the value. You will notice that the left side of the equal sign is the name of the column.

You do not have to rename the columns, but it generally makes it easier to read. For example, simply removing the column names from the above makes the full statement nearly unusable:

survey |> 
  summarize(min(NumResp, na.rm = TRUE),
            mean(NumResp, na.rm = TRUE),
            median(NumResp, na.rm = TRUE),
            max(NumResp, na.rm = TRUE),
            sd(NumResp, na.rm = TRUE),
            n(),
            n_distinct(NumResp))

# A tibble: 1 × 7
  `min(NumResp, na.rm = TRUE)` mean(NumResp, na.rm = TR…¹ median(NumResp, na.r…²
                         <dbl>                      <dbl>                  <dbl>
1                            1                       3.22                      3
# ℹ abbreviated names: ¹`mean(NumResp, na.rm = TRUE)`,
#   ²`median(NumResp, na.rm = TRUE)`
# ℹ 4 more variables: `max(NumResp, na.rm = TRUE)` <dbl>,
#   `sd(NumResp, na.rm = TRUE)` <dbl>, `n()` <int>, `n_distinct(NumResp)` <int>

It’s easy to remove the na.rm value as we do here. Depending on the underlying data, it can change the values a lot or not at all. You should figure out what you want to do and then set the attribute as you need it.

survey |> 
  summarize(MinResp = min(NumResp),
            AvgResp = mean(NumResp),
            MedResp = median(NumResp),
            MaxResp = max(NumResp),
            SDResp = sd(NumResp),
            Count = n(),
            Distinct = n_distinct(NumResp))

# A tibble: 1 × 7
  MinResp AvgResp MedResp MaxResp SDResp  Count Distinct
    <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <int>    <int>
1       1    3.22       3       5   1.37 268481        5

Given that the Distinct column equals 5, we should be able to validate it by listing the distinct values in that column:

survey |> 
  distinct(NumResp)

# A tibble: 5 × 1
  NumResp
    <dbl>
1       3
2       4
3       5
4       2
5       1

And so it is.

2 Calculating on groups

2.1 Counting unique values

The most common need for calculations on groups is to count the number of rows in a group. R provides the count() function for this.

2.1.1 Count the rows in a group

This query says, “Count the number of rows for each combination of ClassLevel and Gender and display the results in a column called Count.”

survey |> 
  count(ClassLevel, Gender, name = "Count")

# A tibble: 20 × 3
   ClassLevel Gender Count
   <ord>      <fct>  <int>
 1 Fresh      Female 41546
 2 Fresh      Male   36140
 3 Fresh      Other   7619
 4 Fresh      <NA>    2355
 5 Soph       Female 31296
 6 Soph       Male   27641
 7 Soph       Other   5927
 8 Soph       <NA>    1707
 9 Jr         Female 27302
10 Jr         Male   23084
11 Jr         Other   4978
12 Jr         <NA>    1567
13 Sr         Female 25587
14 Sr         Male   21253
15 Sr         Other   4726
16 Sr         <NA>    1343
17 <NA>       Female  2117
18 <NA>       Male    1759
19 <NA>       Other    454
20 <NA>       <NA>      80

The above is essentially equivalent, though shorter, way of expressing the command here:

survey |> 
  group_by(ClassLevel, Gender) |> 
  summarize(Count = n())

# A tibble: 20 × 3
# Groups:   ClassLevel [5]
   ClassLevel Gender Count
   <ord>      <fct>  <int>
 1 Fresh      Female 41546
 2 Fresh      Male   36140
 3 Fresh      Other   7619
 4 Fresh      <NA>    2355
 5 Soph       Female 31296
 6 Soph       Male   27641
 7 Soph       Other   5927
 8 Soph       <NA>    1707
 9 Jr         Female 27302
10 Jr         Male   23084
11 Jr         Other   4978
12 Jr         <NA>    1567
13 Sr         Female 25587
14 Sr         Male   21253
15 Sr         Other   4726
16 Sr         <NA>    1343
17 <NA>       Female  2117
18 <NA>       Male    1759
19 <NA>       Other    454
20 <NA>       <NA>      80

2.1.2 Count the rows in a group and sort by size

This query says, “Count the number of rows for each combination of ClassLevel and Gender, display the results in a column called Count, and sort the rows in descending order by the Count column.”

survey |> 
  count(ClassLevel, Gender, name = "Count", sort = TRUE)

# A tibble: 20 × 3
   ClassLevel Gender Count
   <ord>      <fct>  <int>
 1 Fresh      Female 41546
 2 Fresh      Male   36140
 3 Soph       Female 31296
 4 Soph       Male   27641
 5 Jr         Female 27302
 6 Sr         Female 25587
 7 Jr         Male   23084
 8 Sr         Male   21253
 9 Fresh      Other   7619
10 Soph       Other   5927
11 Jr         Other   4978
12 Sr         Other   4726
13 Fresh      <NA>    2355
14 <NA>       Female  2117
15 <NA>       Male    1759
16 Soph       <NA>    1707
17 Jr         <NA>    1567
18 Sr         <NA>    1343
19 <NA>       Other    454
20 <NA>       <NA>      80

The following is a more verbose way of expressing the above.

survey |> 
  group_by(ClassLevel, Gender) |> 
  summarize(Count = n()) |> 
  arrange(desc(Count))

# A tibble: 20 × 3
# Groups:   ClassLevel [5]
   ClassLevel Gender Count
   <ord>      <fct>  <int>
 1 Fresh      Female 41546
 2 Fresh      Male   36140
 3 Soph       Female 31296
 4 Soph       Male   27641
 5 Jr         Female 27302
 6 Sr         Female 25587
 7 Jr         Male   23084
 8 Sr         Male   21253
 9 Fresh      Other   7619
10 Soph       Other   5927
11 Jr         Other   4978
12 Sr         Other   4726
13 Fresh      <NA>    2355
14 <NA>       Female  2117
15 <NA>       Male    1759
16 Soph       <NA>    1707
17 Jr         <NA>    1567
18 Sr         <NA>    1343
19 <NA>       Other    454
20 <NA>       <NA>      80

2.1.3 Count the elements in a group but show elements that have zero elements

In some data sets, when running a count() statement, some combinations of factors (such as ClassLevel and Gender) might have 0 rows; count()’s default response is to not show those rows. In order to ensure that the combinations with a zero count show in the results, add .drop = FALSE to the count() statement, as below:

survey |> 
  count(ClassLevel, Gender, .drop = FALSE)

# A tibble: 20 × 3
   ClassLevel Gender     n
   <ord>      <fct>  <int>
 1 Fresh      Female 41546
 2 Fresh      Male   36140
 3 Fresh      Other   7619
 4 Fresh      <NA>    2355
 5 Soph       Female 31296
 6 Soph       Male   27641
 7 Soph       Other   5927
 8 Soph       <NA>    1707
 9 Jr         Female 27302
10 Jr         Male   23084
11 Jr         Other   4978
12 Jr         <NA>    1567
13 Sr         Female 25587
14 Sr         Male   21253
15 Sr         Other   4726
16 Sr         <NA>    1343
17 <NA>       Female  2117
18 <NA>       Male    1759
19 <NA>       Other    454
20 <NA>       <NA>      80

Note in this case, there were no combinations that had zero rows so this change did not affect the results.

2.2 General calculations

The usual way to use summarize() is in combination with the group_by() statement. Below, we calculate the same statistics as we did above but we calculate them separately for each combination of ClassLevel and Gender:

survey |> 
  group_by(ClassLevel, Gender) |> 
  summarize(MinResp = min(NumResp, na.rm = TRUE),
            AvgResp = mean(NumResp, na.rm = TRUE),
            MedResp = median(NumResp, na.rm = TRUE),
            MaxResp = max(NumResp, na.rm = TRUE),
            StdDevResp = sd(NumResp, na.rm = TRUE),
            Count = n(),
            Distinct = n_distinct(NumResp))

# A tibble: 20 × 9
# Groups:   ClassLevel [5]
   ClassLevel Gender MinResp AvgResp MedResp MaxResp StdDevResp Count Distinct
   <ord>      <fct>    <dbl>   <dbl>   <dbl>   <dbl>      <dbl> <int>    <int>
 1 Fresh      Female       1    3.21       3       5       1.37 41546        5
 2 Fresh      Male         1    3.23       3       5       1.37 36140        5
 3 Fresh      Other        1    3.21       3       5       1.38  7619        5
 4 Fresh      <NA>         1    3.23       3       5       1.38  2355        5
 5 Soph       Female       1    3.21       3       5       1.38 31296        5
 6 Soph       Male         1    3.21       3       5       1.37 27641        5
 7 Soph       Other        1    3.21       3       5       1.36  5927        5
 8 Soph       <NA>         1    3.24       3       5       1.38  1707        5
 9 Jr         Female       1    3.22       3       5       1.37 27302        5
10 Jr         Male         1    3.21       3       5       1.38 23084        5
11 Jr         Other        1    3.26       3       5       1.38  4978        5
12 Jr         <NA>         1    3.24       3       5       1.37  1567        5
13 Sr         Female       1    3.23       3       5       1.37 25587        5
14 Sr         Male         1    3.24       3       5       1.37 21253        5
15 Sr         Other        1    3.26       3       5       1.36  4726        5
16 Sr         <NA>         1    3.26       3       5       1.38  1343        5
17 <NA>       Female       1    3.24       3       5       1.36  2117        5
18 <NA>       Male         1    3.20       3       5       1.35  1759        5
19 <NA>       Other        1    3.23       3       5       1.39   454        5
20 <NA>       <NA>         1    3.15       3       5       1.52    80        5

If we wanted to, we could easily calculate the statistics based on just the ClassLevel by removing the Gender from the group_by() statement:

survey |> 
  group_by(ClassLevel) |> 
  summarize(MinResp = min(NumResp, na.rm = TRUE),
            AvgResp = mean(NumResp, na.rm = TRUE),
            MedResp = median(NumResp, na.rm = TRUE),
            MaxResp = max(NumResp, na.rm = TRUE),
            StdDevResp = sd(NumResp, na.rm = TRUE),
            Count = n(),
            Distinct = n_distinct(NumResp))

# A tibble: 5 × 8
  ClassLevel MinResp AvgResp MedResp MaxResp StdDevResp Count Distinct
  <ord>        <dbl>   <dbl>   <dbl>   <dbl>      <dbl> <int>    <int>
1 Fresh            1    3.22       3       5       1.37 87660        5
2 Soph             1    3.21       3       5       1.37 66571        5
3 Jr               1    3.22       3       5       1.38 56931        5
4 Sr               1    3.24       3       5       1.37 52909        5
5 <NA>             1    3.22       3       5       1.37  4410        5

3 The `summary()` function

R has a built-in function called summary() which makes it easy to calculate the quartiles of multiple numeric fields at the same time:

st_info |> 
  select(AdmitCalendarYear, HSGPA, UnivGPA) |> 
  summary()

 AdmitCalendarYear     HSGPA          UnivGPA     
 Min.   :2009      Min.   :2.000   Min.   :1.054  
 1st Qu.:2013      1st Qu.:2.890   1st Qu.:2.538  
 Median :2016      Median :3.100   Median :2.791  
 Mean   :2016      Mean   :3.099   Mean   :2.777  
 3rd Qu.:2019      3rd Qu.:3.300   3rd Qu.:3.030  
 Max.   :2023      Max.   :4.000   Max.   :3.925

It can also be used on just one field. The form of the function call is as follows:

summary(table-name$column-name)

This shows an example of how it can be used:

summary(st_info$UnivGPA)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.054   2.538   2.791   2.777   3.030   3.925

4 The `quantile()` function

The R function quantile() returns an estimate of the underlying distribution of the specified column.

The structure of the quantile() function is as follows:

quantile(column-name,
         probs = )

Note that it is named quantile(), not quartile(). It can be used to calculate quartiles, quintiles, deciles, or whatever samples you want to take of the column’s distribution.

4.1 On a whole column

Here is a typical usage of quantile() in combination with the summarize() function:

st_info |> 
  summarize(Min = min(UnivGPA, na.rm = TRUE),
            quant25 = quantile(UnivGPA, 
                               probs = 0.25, 
                               na.rm = TRUE),
            quant50 = quantile(UnivGPA, 
                               probs = 0.5, 
                               na.rm = TRUE),
            Avg = mean(UnivGPA, na.rm = TRUE),
            quant75 = quantile(UnivGPA, 
                               probs = 0.75, 
                               na.rm = TRUE),
            Max = max(UnivGPA, na.rm = TRUE))

# A tibble: 1 × 6
    Min quant25 quant50   Avg quant75   Max
  <dbl>   <dbl>   <dbl> <dbl>   <dbl> <dbl>
1  1.05    2.54    2.79  2.78    3.03  3.92

4.2 Using `group_by()`

The quantile() function can also be used in combination with the group_by() function to calculate statistics for subsets of the column.

4.2.1 Calculating quartiles

As you might guess from the structure of the previous example, it is a simple thing to insert group_by() into the calculation:

st_info |> 
  group_by(Gender) |> 
  summarize(Min = min(UnivGPA, na.rm = TRUE),
            quant25 = quantile(UnivGPA, 
                               probs = 0.25, 
                               na.rm = TRUE),
            quant50 = quantile(UnivGPA, 
                               probs = 0.5, 
                               na.rm = TRUE),
            Mean = mean(UnivGPA, na.rm = TRUE),
            quant75 = quantile(UnivGPA, 
                               probs = 0.75, 
                               na.rm = TRUE),
            Max = max(UnivGPA, na.rm = TRUE))

# A tibble: 4 × 7
  Gender    Min quant25 quant50  Mean quant75   Max
  <chr>   <dbl>   <dbl>   <dbl> <dbl>   <dbl> <dbl>
1 Another  1.65    2.53    2.84  2.81    3.10  3.59
2 Female   1.33    2.63    2.88  2.86    3.10  3.92
3 Male     1.05    2.45    2.69  2.68    2.93  3.77
4 Unknown  1.27    2.54    2.81  2.78    3.01  3.72

4.2.2 Calculating quintiles

Just as we calculated quartiles above, it is also a simple thing to change this to calculating quintiles:

q = c(0.2, 0.4, 0.6, 0.8)
st_info |> 
  group_by(Gender) |> 
  summarize(Min = min(UnivGPA, na.rm = TRUE),
            quant20 = quantile(UnivGPA, 
                               probs = 0.2, 
                               na.rm = TRUE),
            quant40 = quantile(UnivGPA, 
                               probs = 0.4, 
                               na.rm = TRUE),
            Med = median(UnivGPA, na.rm = TRUE),
            Mean = mean(UnivGPA, na.rm = TRUE),
            quant60 = quantile(UnivGPA, 
                               probs = 0.6, 
                               na.rm = TRUE),
            quant80 = quantile(UnivGPA, 
                               probs = 0.8, 
                               na.rm = TRUE),
            Max = max(UnivGPA, na.rm = TRUE))

# A tibble: 4 × 9
  Gender    Min quant20 quant40   Med  Mean quant60 quant80   Max
  <chr>   <dbl>   <dbl>   <dbl> <dbl> <dbl>   <dbl>   <dbl> <dbl>
1 Another  1.65    2.44    2.72  2.84  2.81    2.96    3.13  3.59
2 Female   1.33    2.57    2.78  2.88  2.86    2.96    3.16  3.92
3 Male     1.05    2.39    2.6   2.69  2.68    2.78    2.99  3.77
4 Unknown  1.27    2.5     2.71  2.81  2.78    2.89    3.06  3.72

4.3 As stand-alone function

The quantile() function can also be used in a stand-alone manner, apart from summarize().

Here is how you might use it to print out five samples of the distribution:

quantile(st_info$UnivGPA, 
         probs = c(0.05, 0.25, 0.5, 0.75, 0.95),
         names = TRUE)

   5%   25%   50%   75%   95% 
2.160 2.538 2.791 3.030 3.350

Instead of printing the results, quantile() can be used to calculate a list of results:

quantile(st_info$UnivGPA, 
         probs = c(0.05, 0.25, 0.5, 0.75, 0.95),
         names = FALSE)

[1] 2.160 2.538 2.791 3.030 3.350

This list can then be assigned to a variable and manipulated in any way that you might see fit (way beyond what is done in this degenerate case):

gpaQuartiles <- quantile(st_info$UnivGPA, 
                         probs = c(0.05, 0.25, 
                                   0.5, 0.75, 0.95),
                         names = FALSE)
gpaQuartiles[1]

[1] 2.16