Case study: Retention survey

We are now going to return to our retention survey case study. We have imported the dataset and examined it. At this stage we are going to standardize the representation of NA values throughout the data frame.

1 Get a few details on the survey data frame

While working on all of the following, you can use the following information to remind yourself what you’re working on.

glimpse(survey)

Rows: 33,524
Columns: 24
$ Year         <int> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 201…
$ ID           <chr> "xuojqdfdozvu", "vvwkinqvnibo", "ibyjcmiopiqk", "lsqamawy…
$ NPS          <int> 8, 8, 8, 8, 8, 8, 8, 4, 4, 7, 6, 6, 6, 4, 7, 4, 6, 7, 4, …
$ Field        <chr> "Undecl", "SocSci", "CompSci", "Other", "SocSci", "Undecl…
$ ClassLevel   <chr> "Sr", "Sr", "Sr", "Sr", "Sr", "Sr", "Sr", "Fresh", "Fresh…
$ Status       <chr> "Full-time", "Full-time", "Full-time", "Full-time", "Full…
$ Gender       <chr> "Male", "Female", "Male", "Female", "Other", "Female", "M…
$ BirthYear    <int> 1990, 1999, 1999, 1989, 1993, 1988, 1996, 1991, 1990, 198…
$ FinPL        <chr> "No", "No", "No", "No", "Yes", "Yes", "No", "No", "No", "…
$ FinSch       <chr> "No", "No", "No", "Yes", "Yes", "Yes", "Yes", "No", "No",…
$ FinGov       <chr> "No", "Yes", "Yes", "", "No", "No", "No", "No", "No", "No…
$ FinSelf      <chr> "No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No",…
$ FinPar       <chr> "Yes", "No", "", "Yes", "No", "No", "No", "No", "No", "Ye…
$ FinOther     <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No…
$ TooDifficult <chr> "Disagree", "Strongly Disagree", "Disagree", "Agree", NA,…
$ NotRelevant  <chr> "Strongly Disagree", "Disagree", NA, "Strongly Disagree",…
$ PoorTeaching <chr> "Agree", "Agree", "Agree", "Disagree", "Agree", "Strongly…
$ UnsuppFac    <chr> NA, "Neutral", "Neutral", "Neutral", NA, "Neutral", "Stro…
$ Grades       <chr> NA, "Disagree", "Strongly Disagree", NA, NA, "Agree", NA,…
$ Sched        <chr> "Strongly Agree", "Strongly Disagree", "Strongly Disagree…
$ ClassTooBig  <chr> "Neutral", NA, "Strongly Disagree", "Strongly Disagree", …
$ BadAdvising  <chr> "Disagree", "Disagree", NA, "Disagree", "Strongly Disagre…
$ FinAid       <chr> "Strongly Agree", "Strongly Agree", "Strongly Agree", "Ag…
$ OverallValue <chr> "Strongly Agree", "Strongly Agree", "Neutral", "Strongly …

skim(survey)

Data summary
Name	survey
Number of rows	33524
Number of columns	24
_______________________
Column type frequency:
character	21
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique
ID	0	1.00	12	12	0	33524
Field	0	1.00	0	9	12	16
ClassLevel	0	1.00	2	5	0	5
Status	0	1.00	0	9	627	4
Gender	0	1.00	0	6	875	4
FinPL	0	1.00	0	3	1674	3
FinSch	0	1.00	0	3	1615	3
FinGov	0	1.00	0	3	1586	3
FinSelf	0	1.00	0	3	1622	3
FinPar	0	1.00	0	3	1699	3
FinOther	0	1.00	0	3	1699	3
TooDifficult	4999	0.85	0	17	1646	6
NotRelevant	4957	0.85	0	17	1665	6
PoorTeaching	5141	0.85	0	17	1678	6
UnsuppFac	4994	0.85	0	17	1682	6
Grades	4996	0.85	0	17	1641	6
Sched	5080	0.85	0	17	1604	6
ClassTooBig	4979	0.85	0	17	1655	6
BadAdvising	5085	0.85	0	17	1650	6
FinAid	4986	0.85	0	17	1690	6
OverallValue	4961	0.85	0	17	1670	6

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Year	1	2014.50	1.71	2012	2013	2014	2016	2017	▇▅▅▃▅
NPS	1	5.95	1.52	4	4	6	7	8	▇▁▆▅▅
BirthYear	1	1994.01	3.76	1988	1991	1994	1997	2000	▇▅▇▅▇

2 Work on `TooDifficult` column

2.1 Current status

Let’s start with the TooDifficult variable in the survey dataset. What values are in the column?

survey |> 
  distinct(TooDifficult)

       TooDifficult
1          Disagree
2 Strongly Disagree
3             Agree
4              <NA>
5                  
6           Neutral
7    Strongly Agree

So it has 7 different values…but only five of them are what we want. Two of them, <NA> (which is a printed representation of NA) and the blank in row 5, are blank non-responses.

Let’s remind ourselves of the actual values at the top of this column:

survey |> 
  select(TooDifficult) |> 
  head(8)

       TooDifficult
1          Disagree
2 Strongly Disagree
3          Disagree
4             Agree
5              <NA>
6                  
7           Neutral
8 Strongly Disagree

And then what does the is.na() have to say about these values?

survey |> 
  select(TooDifficult) |> 
  is.na() |> 
  head(8)

     TooDifficult
[1,]        FALSE
[2,]        FALSE
[3,]        FALSE
[4,]        FALSE
[5,]         TRUE
[6,]        FALSE
[7,]        FALSE
[8,]        FALSE

Notice that row 5 (the <NA>) is considered to not have a value but row 6 (the blank, empty space) is considered to have a value. Essentially, the efforts on the rest of this page is directed at removing this duality — we want all blank values to be represented by NA and nothing else.

2.2 Some calculations on the field

Now, we’re going to do something a bit more complicated. Let’s look at the whole TooDifficult column and compute the following:

number of distinct values
number of NA values
number of not-NA values
median response

survey |> 
  summarize(n = n_distinct(TooDifficult),
            na = sum(is.na(TooDifficult)),
            notNA = sum(!is.na(TooDifficult)),
            med = median(TooDifficult, 
                         na.rm = TRUE))

  n   na notNA     med
1 7 4999 28525 Neutral

From the output, we can see that there are (still) 7 distinct values, almost 5000 NA values, over 28,000 non-NA values, and a median response of Neutral. The trouble is that some of those 28,000 non-NA values are actually blanks (and should be NA).

The following is one of our favorite sets of reusable code. Other than the first line (which specifies the data frame it works on), everything else is simply a copy-and-paste. It prints out, for every column, the count of how many entries have NA-like values:

survey |> 
  summarize(across(everything(),
                   ~sum(.x %in% c("NA", "--", ""))))

  Year ID NPS Field ClassLevel Status Gender BirthYear FinPL FinSch FinGov
1    0  0   0    12        551    627    875         0  1674   1615   1586
  FinSelf FinPar FinOther TooDifficult NotRelevant PoorTeaching UnsuppFac
1    1622   1699     1699         1646        1665         1678      1682
  Grades Sched ClassTooBig BadAdvising FinAid OverallValue
1   1641  1604        1655        1650   1690         1670

All but four columns have these NA-like values.

The following shows what this looks like. You can see an actual NA value (<NA>) in row 5 plus a blank value in row 6:

survey |> 
  select(TooDifficult) |> 
  head(8)

       TooDifficult
1          Disagree
2 Strongly Disagree
3          Disagree
4             Agree
5              <NA>
6                  
7           Neutral
8 Strongly Disagree

3 Creating a standardized representation of `NA` in one column

We are now going to try to get a standardized representation of the NA values in the column. Here is the command:

survey <-
  survey |> 
  mutate(TooDifficult = replace(TooDifficult,
                                TooDifficult == "",
                                NA)) |> 
  mutate(TooDifficult = replace(TooDifficult,
                                TooDifficult == "NA",
                                NA))

Let’s take a moment to interpret the above. Here is the form of these two mutate() statements:

mutate(new-col-name = replace(old-col-name,
                              condition,
                              new-value-if-true))

In the first case, we’re making the following substitutions:

new-col-name: we’re keeping the column name as TooDifficult
old-col-name: we’re starting with the values that are currently in TooDifficult
condition: we’re testing if TooDifficult has the value "" (empty string)
new-value-if-true: we’re replacing the original value with NA (the not available indicator)

Thus, we’re replacing all "" and "NA" values in TooDifficult with NA.

4 Learning about all the columns

Let’s build a statement that can tell us about all of the columns.

The first statement uses the map_chr(x) function — it applies the x function to all the columns, one after another. Consider the following that applies the class function to each column in survey. What we should get is the class of each column.

survey |> 
  map_chr(class)

        Year           ID          NPS        Field   ClassLevel       Status 
   "integer"  "character"    "integer"  "character"  "character"  "character" 
      Gender    BirthYear        FinPL       FinSch       FinGov      FinSelf 
 "character"    "integer"  "character"  "character"  "character"  "character" 
      FinPar     FinOther TooDifficult  NotRelevant PoorTeaching    UnsuppFac 
 "character"  "character"  "character"  "character"  "character"  "character" 
      Grades        Sched  ClassTooBig  BadAdvising       FinAid OverallValue 
 "character"  "character"  "character"  "character"  "character"  "character"

The following statement has the same form as the above, but it applies the n_distinct function which returns the number of distinct values in that column.

survey |> 
  map_dbl(n_distinct)

        Year           ID          NPS        Field   ClassLevel       Status 
           6        33524            4           16            5            4 
      Gender    BirthYear        FinPL       FinSch       FinGov      FinSelf 
           4           13            3            3            3            3 
      FinPar     FinOther TooDifficult  NotRelevant PoorTeaching    UnsuppFac 
           3            3            6            7            7            7 
      Grades        Sched  ClassTooBig  BadAdvising       FinAid OverallValue 
           7            7            7            7            7            7

Now we want to create a statement that returns both of these values. In the following in line 1 we the name of the first column in survey (Year) to the variable .x. (You will see why we used the .x variable name in a moment.) Lines 2 and 3 create a data frame with 2 columns (named n_distinct and class) with 1 row (associated with the first column):

.x <- survey |> pluck(1)
data.frame(n_distinct = n_distinct(.x),
           class = class(.x))

  n_distinct   class
1          6 integer

Now we are putting some pieces together. In this case, we use the R function map_df to apply a function to every column in survey.

survey |> 
  map_df(~(data.frame(n_distinct = n_distinct(.x),
                      class = class(.x))))

   n_distinct     class
1           6   integer
2       33524 character
3           4   integer
4          16 character
5           5 character
6           4 character
7           4 character
8          13   integer
9           3 character
10          3 character
11          3 character
12          3 character
13          3 character
14          3 character
15          6 character
16          7 character
17          7 character
18          7 character
19          7 character
20          7 character
21          7 character
22          7 character
23          7 character
24          7 character

But what is the function? Above, we had simple built-in R functions like class and n_distinct. Here it is more complicated. It is whole first argument of map_df:

~(data.frame(n_distinct = n_distinct(.x),
             class = class(.x)))

This is a strange-looking statement. It helps to know that, instead of a function (as we saw above), R accepts a formula with the form ~(stuff), where stuff is a valid R formula that uses .x to refer to each successive column name. We know what this data.frame command does because we applied it to the Year column above.

The results look great, but it would be better if it identified the name of the column. We can do this by adding the .id argument to the map_df() function:

survey |> 
  map_df(~(data.frame(n_distinct = n_distinct(.x),
                      class = class(.x))),
         .id = "variable")

       variable n_distinct     class
1          Year          6   integer
2            ID      33524 character
3           NPS          4   integer
4         Field         16 character
5    ClassLevel          5 character
6        Status          4 character
7        Gender          4 character
8     BirthYear         13   integer
9         FinPL          3 character
10       FinSch          3 character
11       FinGov          3 character
12      FinSelf          3 character
13       FinPar          3 character
14     FinOther          3 character
15 TooDifficult          6 character
16  NotRelevant          7 character
17 PoorTeaching          7 character
18    UnsuppFac          7 character
19       Grades          7 character
20        Sched          7 character
21  ClassTooBig          7 character
22  BadAdvising          7 character
23       FinAid          7 character
24 OverallValue          7 character

We can see (in row 15) that TooDifficult has 6 distinct values, one less than the columns named in rows 16–17 because of the mutate() commands we executed above.

5 Work on `FinPL` column

Let’s just take a moment to verify the information about one of the columns. We’ll focus on FinPL, the 9th column in the table.

5.1 Distinct values

survey |> 
  select(FinPL) |> 
  distinct()

  FinPL
1    No
2   Yes
3

So, yes, it has 3 distinct values in the column (as shown in the statement above on row 9). One of those distinct values is a blank (but not a NA).

5.2 Number of `NA` values

sum(is.na(survey$FinPL))

[1] 0

The column has no NA values in it though we know from above that it does have NA-like values.

5.3 Number of non-`NA` values

sum(!is.na(survey$FinPL))

[1] 33524

And, for completeness, we can see that it has lots of non-NA values.

6 Selecting `distinct` values for two columns

This command returns the distinct values based on a combination of the columns! We know (from above) that each of these two columns has 3 distinct variables; however, when asking for the distinct combination of the two, it will display 9 (3x3) and not 6 (3+3) rows in the result:

survey |> 
  select(FinPL, FinSch) |> 
  distinct()

  FinPL FinSch
1    No     No
2    No    Yes
3   Yes    Yes
4   Yes     No
5           No
6    No       
7          Yes
8   Yes       
9

7 Changing values across many columns

We saw above how much work that it takes to make it so one column (in that case, TooDifficult) has a standardized representation of the NA value. We don’t want to have to do that for all of the columns separately — we want to take care of all of them at once.

survey <-
  survey |> 
    mutate(across(c(Field:Gender,
                    FinPL:OverallValue),
                  ~ case_when(is.na(.x) ~ NA,
                              .x == "" ~ NA,
                              .x == "NA" ~ NA,
                              .x == "--" ~ NA,
                              .default = .x)))

In addition to the already-understood mutate() command, we have several pieces to unpack here: across(), c(), and ~case_when(). Let’s take a look at each one of them in some detail:

across(Columns, Function): When used within a verb such as select() or mutate(), the across() function applies the verb across the columns specified by Columns and executes the Function. We discuss both the specification of the columns and the definition of the function below. Read this page for more information about across().
c(RangeOfCols1, RangeOfCols2): In this case, the whole set of columns is the joining together of two lists: 1) the columns from Field to Gender, and 2) the columns from FinPL to OverallValue. We could have simply listed all of the fields within a c(), but the syntax A:B specifies a list of the fields (as they appear in the data frame).

We are leaving out ID because it has one unique non-NA-like per row. We are leaving out Year, NPS, and BirthYear because they are integer columns with no missing values.

As you might guess, there is no limit to the number of fields or range of fields that you can include in the c() function.

~case_when(Condition1 ~ Val1, Condition2 ~ Val2, ... Default): First, realize that this is the function that is applied to each of the columns specified above in the c(). Second, let’s define how this is read. If you have programmed before, this is the tidyverse’s version of an if/else or switch statement:

if Condition1 is true:
  return Val1
else if Condition2 is true:
  return Val2
else if ...
  ...
else:
  return Default

The conditions are evaluated in order. The value is returned for the first condition that is true. This is what we have in this case:

if is.na(x):
  return NA
else if (x == ""):
  return NA
else if (x == "NA"):
  return NA
else if (x == "--"):
  return NA
else
  return x

You can see that this case_when() function is cleaning up the column it is working on so that it has a standard representation of an NA-like value.

8 Checking for `NA`-like values across all columns

We’re now going to define a statement that will help us see if there are any NA-like values remaining in any column.

The statement that we are going to build uses the everything() function within the across(). You might recall that we used the c() function to build a set of columns for mutate(). In this case, we don’t want to use a strict subset of columns—we want to work on all of the columns. That is what everything() is for.

In this case we are also going to use summarize() separately from any group_by() statement. Let’s take a look at an example that uses a function that we’re familiar with. In this case, the statement is telling R to calculate (summarize()) the maximum value (max()) for all columns (across(everything(), ...)) while ignoring all NA values (na.rm = TRUE):

survey |> 
  summarize(across(everything(),
                   ~max(.x, na.rm = TRUE)))

  Year           ID NPS  Field ClassLevel    Status Gender BirthYear FinPL
1 2017 zzzyiijteasl   8 Undecl         Sr Part-time  Other      2000   Yes
  FinSch FinGov FinSelf FinPar FinOther      TooDifficult       NotRelevant
1    Yes    Yes     Yes    Yes      Yes Strongly Disagree Strongly Disagree
       PoorTeaching         UnsuppFac            Grades             Sched
1 Strongly Disagree Strongly Disagree Strongly Disagree Strongly Disagree
        ClassTooBig       BadAdvising            FinAid      OverallValue
1 Strongly Disagree Strongly Disagree Strongly Disagree Strongly Disagree

Here is another example. The pluck(Col, Num) function can be used to access the Num item in column Col. You would expect the following to print the first row of the survey data frame:

survey |> 
  summarize(across(everything(),
                   ~pluck(.x, 1)))

  Year           ID NPS  Field ClassLevel    Status Gender BirthYear FinPL
1 2012 xuojqdfdozvu   8 Undecl         Sr Full-time   Male      1990    No
  FinSch FinGov FinSelf FinPar FinOther TooDifficult       NotRelevant
1     No     No      No    Yes       No     Disagree Strongly Disagree
  PoorTeaching UnsuppFac Grades          Sched ClassTooBig BadAdvising
1        Agree      <NA>   <NA> Strongly Agree     Neutral    Disagree
          FinAid   OverallValue
1 Strongly Agree Strongly Agree

That is, the above should be the same as the following:

survey |> head(1)

  Year           ID NPS  Field ClassLevel    Status Gender BirthYear FinPL
1 2012 xuojqdfdozvu   8 Undecl         Sr Full-time   Male      1990    No
  FinSch FinGov FinSelf FinPar FinOther TooDifficult       NotRelevant
1     No     No      No    Yes       No     Disagree Strongly Disagree
  PoorTeaching UnsuppFac Grades          Sched ClassTooBig BadAdvising
1        Agree      <NA>   <NA> Strongly Agree     Neutral    Disagree
          FinAid   OverallValue
1 Strongly Agree Strongly Agree

And it is.

Now, having a better idea about summarize(), across(), and everything(), we can now discuss the statement that we are going to use:

survey |> 
  summarize(across(everything(),
                   ~sum(.x %in% c("NA", "--", ""))))

  Year ID NPS Field ClassLevel Status Gender BirthYear FinPL FinSch FinGov
1    0  0   0     0          0      0      0         0     0      0      0
  FinSelf FinPar FinOther TooDifficult NotRelevant PoorTeaching UnsuppFac
1       0      0        0            0           0            0         0
  Grades Sched ClassTooBig BadAdvising FinAid OverallValue
1      0     0           0           0      0            0

All that remains now to understand this statement is to understand what the sum() function is doing. For each column, it is determining if a value is in the vector c("NA", "--", ""). Suppose that 500 of the 30,000 values in a certain column had one of those three values. Well, A %in% B returns TRUE or FALSE. The value of TRUE is 1 and the value of FALSE is 0. (If you didn’t know this, then now you do.) Thus, the sum() function would be looking at a set of 29,500 FALSE values and 500 TRUE values. In this case, the sum() function would return 500.

Now we can look at the result of the above statement and see that there are zero NA-like values in any column.

9 Validating the effects on some columns

Let’s now take a look at the actual values in some of the columns in survey. We’ll start with ClassLevel.

survey |> 
  select(ClassLevel) |> 
  distinct()

  ClassLevel
1         Sr
2      Fresh
3         Jr
4       Soph
5       <NA>

And now FinPL. Before starting all of the processing, this column had three distinct values plus hundreds of empty values.

survey |> 
  select(FinPL) |> 
  distinct()

  FinPL
1    No
2   Yes
3  <NA>

But now all of the blanks ("") have been converted to NA values.

The following two commands calculate how many NA and non-NA values are in FinPL:

sum(is.na(survey$FinPL))

[1] 1674

sum(!is.na(survey$FinPL))

[1] 31850

Instead of listing the distinct values in a column, we also have the option of simply displaying a few rows of that column:

survey |> 
  select(TooDifficult) |> 
  head(8)

       TooDifficult
1          Disagree
2 Strongly Disagree
3          Disagree
4             Agree
5              <NA>
6              <NA>
7           Neutral
8 Strongly Disagree

10 Print first 6 distinct values for every column

The previous section demonstrated how to learn about the distinct values in a column; however, this would be a lot of work to go through for each of the columns in the survey table. Fortunately, R and the tidyverse gives us several ways to accomplish this with much less work.

In the following, we provide several different approaches and explain each one of them. Note that you can use any one of them (as shown in Listing 1, Listing 2, and Listing 3) even if you only have a minimal understanding of the code. We try to explain each of these to help you build your understanding of R and the tidyverse.

10.1 Applying the `map()` function

We start with the simplest to run and the simplest to understand. First, though, we want to remind you how to use distinct(). Here is a command that prints no more than 6 distinct values in the Year column:

survey |>
  distinct(Year) |>
  head(6)

We are using the head() command because, as we know, some of these columns have many thousands of distinct values. We don’t need to see all of them.

We are going to use the map() command to create a single command that prints the distinct values in each one of the columns. Let’s first remind ourselves how to read the map() command:

map(some-list,
    ~some-statement-on-.x)

The list in this case will be names(survey), a function that will return a list of the column names in survey.

The more complicated portion is the ~some-statement-on-.x. The .x will successively be the names of the columns. This raises the question: In the tidyverse, how do we refer to a specific column of data only given the column name? This is accomplished using .data[[.x]]. (Not an obvious answer…but the answer, nevertheless.) Thus, we put all of this together and get Listing 1:

Listing 1: Using map() to print the distinct values in all columns

map(names(survey), 
    ~survey |>
      distinct(.data[[.x]]) |>
      head(6))

[[1]]
  Year
1 2012
2 2013
3 2014
4 2015
5 2016
6 2017

[[2]]
            ID
1 xuojqdfdozvu
2 vvwkinqvnibo
3 ibyjcmiopiqk
4 lsqamawyancj
5 zrpkydjpltkc
6 mdoqvaalcscx

[[3]]
  NPS
1   8
2   4
3   7
4   6

[[4]]
      Field
1    Undecl
2    SocSci
3   CompSci
4     Other
5       Eng
6 OthHealth

[[5]]
  ClassLevel
1         Sr
2      Fresh
3         Jr
4       Soph
5       <NA>

[[6]]
     Status
1 Full-time
2 Part-time
3     Other
4      <NA>

[[7]]
  Gender
1   Male
2 Female
3  Other
4   <NA>

[[8]]
  BirthYear
1      1990
2      1999
3      1989
4      1993
5      1988
6      1996

[[9]]
  FinPL
1    No
2   Yes
3  <NA>

[[10]]
  FinSch
1     No
2    Yes
3   <NA>

[[11]]
  FinGov
1     No
2    Yes
3   <NA>

[[12]]
  FinSelf
1      No
2     Yes
3    <NA>

[[13]]
  FinPar
1    Yes
2     No
3   <NA>

[[14]]
  FinOther
1       No
2      Yes
3     <NA>

[[15]]
       TooDifficult
1          Disagree
2 Strongly Disagree
3             Agree
4              <NA>
5           Neutral
6    Strongly Agree

[[16]]
        NotRelevant
1 Strongly Disagree
2          Disagree
3              <NA>
4           Neutral
5    Strongly Agree
6             Agree

[[17]]
       PoorTeaching
1             Agree
2          Disagree
3 Strongly Disagree
4    Strongly Agree
5              <NA>
6           Neutral

[[18]]
          UnsuppFac
1              <NA>
2           Neutral
3 Strongly Disagree
4    Strongly Agree
5             Agree
6          Disagree

[[19]]
             Grades
1              <NA>
2          Disagree
3 Strongly Disagree
4             Agree
5           Neutral
6    Strongly Agree

[[20]]
              Sched
1    Strongly Agree
2 Strongly Disagree
3             Agree
4              <NA>
5          Disagree
6           Neutral

[[21]]
        ClassTooBig
1           Neutral
2              <NA>
3 Strongly Disagree
4    Strongly Agree
5             Agree
6          Disagree

[[22]]
        BadAdvising
1          Disagree
2              <NA>
3 Strongly Disagree
4           Neutral
5    Strongly Agree
6             Agree

[[23]]
             FinAid
1    Strongly Agree
2             Agree
3              <NA>
4 Strongly Disagree
5           Neutral
6          Disagree

[[24]]
       OverallValue
1    Strongly Agree
2           Neutral
3          Disagree
4             Agree
5              <NA>
6 Strongly Disagree

So, the above prints out up to 6 distinct values for each column. Which is what we wanted.

10.2 Applying the `map()` function to return a data frame plus column names

The above works. Definitely. But there are times when you might not want to have 24 print-outs one after the other. What if we wanted one column per original column and one row per distinct value — in this case a 24 by 6 data frame?

That’s what we’re going to define here (as shown in Listing 2). But it takes a lot of patience and an understanding of functions in order to understand what is going to be done. Note that you can use the following in your work without understanding any of it. All you have to do change the word survey to the name of your own data frame.

Listing 2: Applying the map() function to return a data frame plus column names

get_distinct_and_pad <- function(column, numvals = 6) {
  distinct_values <- tibble(value = column) |>
                     distinct(value) |>
                     arrange(value) |> 
                     slice_head(n = numvals) |>
                     pull(value)

  # Pad with NA if there are less than n values
  length(distinct_values) <- numvals
  return(distinct_values)
}

survey |> 
  map(get_distinct_and_pad) |> 
  as.data.frame()

  Year           ID NPS   Field ClassLevel    Status Gender BirthYear FinPL
1 2012 aaapjamvmsgd   4  ArchUP      Fresh Full-time Female      1988    No
2 2013 aacgcpuqbhue   6 ArtsHum         Jr     Other   Male      1989   Yes
3 2014 aacgctmukgti   7     Bus       Soph Part-time  Other      1990  <NA>
4 2015 aacukkmcqpdr   8 CompSci         Sr      <NA>   <NA>      1991  <NA>
5 2016 aacuutwvloft  NA      Ed       <NA>      <NA>   <NA>      1992  <NA>
6 2017 aadtesmehzsd  NA     Eng       <NA>      <NA>   <NA>      1993  <NA>
  FinSch FinGov FinSelf FinPar FinOther      TooDifficult       NotRelevant
1     No     No      No     No       No             Agree             Agree
2    Yes    Yes     Yes    Yes      Yes          Disagree          Disagree
3   <NA>   <NA>    <NA>   <NA>     <NA>           Neutral           Neutral
4   <NA>   <NA>    <NA>   <NA>     <NA>    Strongly Agree    Strongly Agree
5   <NA>   <NA>    <NA>   <NA>     <NA> Strongly Disagree Strongly Disagree
6   <NA>   <NA>    <NA>   <NA>     <NA>              <NA>              <NA>
       PoorTeaching         UnsuppFac            Grades             Sched
1             Agree             Agree             Agree             Agree
2          Disagree          Disagree          Disagree          Disagree
3           Neutral           Neutral           Neutral           Neutral
4    Strongly Agree    Strongly Agree    Strongly Agree    Strongly Agree
5 Strongly Disagree Strongly Disagree Strongly Disagree Strongly Disagree
6              <NA>              <NA>              <NA>              <NA>
        ClassTooBig       BadAdvising            FinAid      OverallValue
1             Agree             Agree             Agree             Agree
2          Disagree          Disagree          Disagree          Disagree
3           Neutral           Neutral           Neutral           Neutral
4    Strongly Agree    Strongly Agree    Strongly Agree    Strongly Agree
5 Strongly Disagree Strongly Disagree Strongly Disagree Strongly Disagree
6              <NA>              <NA>              <NA>              <NA>

Let’s build up our understanding of the above in pieces.

Lines 1–2: A comment. Anything that comes after a hashtag (#) is not evaluated. Thus, all of lines 1–2 are simply explanatory text that R ignores.
Lines 15–17: Start with the survey data frame. Use map() to apply get_distinct_and_pad() function to each column separately. Use the as.data.frame() function to join all the separate columns into one unified data table.

Now, what is this get_distinct_and_pad() function?

Lines 3–13: This is a full-fledged R function that we have built for just this scenario. Let’s go through the process from beginning to end.

This is the format of an R function:

name-of-fct <- function(possibly-some-args) {
  one-or-more-r-statements
}

You can see that the function get_distinct_and_pad() fits this template. First, name-of-fct is equal to get_distinct_and_pad. Let’s look at the values for possibly-some-args:

column: This is a vector of values (which we know is what every column in a data frame actually is).
numvals = 6: This second argument is an optional argument. If you leave it off when calling the function, then numvals is assumed to have the value 6.

Thus, this function can be called with, e.g., get_distinct_and_pad(some-df-col). However, if you look at line 16 above, you will see that it is actually called with map(get_distinct_and_pad). This is the magic of tidyverse combined with R’s map() function:

Since map(get_distinct_and_pad) is to the right of survey, this is actually map(survey, get_distinct_and_pad). And the way that map() works is that each column of survey in succession will be passed to get_distinct_and_pad() — i.e., multiple distinct calls of get_distinct_and_pad(col-of-survey).

Now let’s build up our understanding of one-or-more-r-statements.

Lines 4–8 (first statement in the body of the function): Let’s take this line-by-line:

tibble(value=column): The column here is the argument passed into the function — i.e., a column from the survey table. This defines a tibble (a data-frame like object that the tidyverse works well with) with a single column with the name value and the values specified in column.
distinct(value): This returns a vector of values that contains the distinct (or unique) values contained in the value column.
arrange(value): This returns a sorted vector of the distinct values (which were passed to it by the previous statement) from thevalue` column.
slice_head(n = numvals): This returns the first n = numvals rows from the sorted vector of distinct values from the value column. If there are fewer than numvals elements, then the vector is filled with NA values.
pull(value): This returns a vector the same size as the input — i.e., 6 elements.

At this point in the function, distinct_values holds a vector of values of size numvals (i.e., 6 in this case) that contains the distinct values from column (i.e., a specific column in the survey data frame).

Line 11–12: Now we get to a little magic and then some R plumbing:

length(distinct_values) <- numvals: This is a bit of magic that sets the length of the distinct_values vector to be numvals long.
return(distinct_values): This is the statement that tells R to return the value of distinct_values whenever get_distinct_and_pad() is called.

10.3 Using the `for` command to print the distinct values

Above, we defined two separate approaches to printing distinct values that use the map() function. Here we define a third and final approach that uses the for statement.

Listing 3: Using the for command to print the distinct values

for (var in names(survey)) {
  survey |> 
    distinct(.data[[var]]) |> 
    head(6) |> 
    print()
}

We see some old friends here:

names(survey): generates a list of the names of the columns of survey
for (var in names(survey)): (this is new) iterates through the names of the columns and assigns var to each one in succession
.data[[var]]: returns the column named by var from the survey data frame (since survey is on the left-hand-side of the pipe)

So, given all of this, the for statement steps through each column and prints the top 6 distinct values from each column.

  Year
1 2012
2 2013
3 2014
4 2015
5 2016
6 2017
            ID
1 xuojqdfdozvu
2 vvwkinqvnibo
3 ibyjcmiopiqk
4 lsqamawyancj
5 zrpkydjpltkc
6 mdoqvaalcscx
  NPS
1   8
2   4
3   7
4   6
      Field
1    Undecl
2    SocSci
3   CompSci
4     Other
5       Eng
6 OthHealth
  ClassLevel
1         Sr
2      Fresh
3         Jr
4       Soph
5       <NA>
     Status
1 Full-time
2 Part-time
3     Other
4      <NA>
  Gender
1   Male
2 Female
3  Other
4   <NA>
  BirthYear
1      1990
2      1999
3      1989
4      1993
5      1988
6      1996
  FinPL
1    No
2   Yes
3  <NA>
  FinSch
1     No
2    Yes
3   <NA>
  FinGov
1     No
2    Yes
3   <NA>
  FinSelf
1      No
2     Yes
3    <NA>
  FinPar
1    Yes
2     No
3   <NA>
  FinOther
1       No
2      Yes
3     <NA>
       TooDifficult
1          Disagree
2 Strongly Disagree
3             Agree
4              <NA>
5           Neutral
6    Strongly Agree
        NotRelevant
1 Strongly Disagree
2          Disagree
3              <NA>
4           Neutral
5    Strongly Agree
6             Agree
       PoorTeaching
1             Agree
2          Disagree
3 Strongly Disagree
4    Strongly Agree
5              <NA>
6           Neutral
          UnsuppFac
1              <NA>
2           Neutral
3 Strongly Disagree
4    Strongly Agree
5             Agree
6          Disagree
             Grades
1              <NA>
2          Disagree
3 Strongly Disagree
4             Agree
5           Neutral
6    Strongly Agree
              Sched
1    Strongly Agree
2 Strongly Disagree
3             Agree
4              <NA>
5          Disagree
6           Neutral
        ClassTooBig
1           Neutral
2              <NA>
3 Strongly Disagree
4    Strongly Agree
5             Agree
6          Disagree
        BadAdvising
1          Disagree
2              <NA>
3 Strongly Disagree
4           Neutral
5    Strongly Agree
6             Agree
             FinAid
1    Strongly Agree
2             Agree
3              <NA>
4 Strongly Disagree
5           Neutral
6          Disagree
       OverallValue
1    Strongly Agree
2           Neutral
3          Disagree
4             Agree
5              <NA>
6 Strongly Disagree

11 Factors

We are now going to use a lot of information from the listings generated in the previous section to define the factors for the survey data frame. Go back and look at the ordering of some of the fields — for example, look at ClassLevel, FinPar, FinOther, TooDifficult, and NotRelevant. Each of these has a natural ordering, and we don’t see that here. Further, why aren’t FinPar and FinOther in the same order? And why aren’t TooDifficult and NotRelevant in the same order? When we look at these columns during analysis, it would make it easier for us if the values were presented in a consistent ordering.

That’s what we can make happen by defining factors. We are first going to define the factors (in the next section) and then validate the work that we did (in the section following that one).

11.1 Defining the factors

Consider the following code for defining the factors. Below, we dissect what is going on.

fin_cols <- c("FinPL", "FinSch", "FinGov", 
              "FinSelf", "FinPar", "FinOther")
exp_cols <- c("TooDifficult", "NotRelevant", "PoorTeaching",
              "UnsuppFac", "Grades", "Sched", "ClassTooBig",
              "BadAdvising", "FinAid", "OverallValue")
fin_levels <- c("Yes", "No")
exp_levels <- c("Strongly Disagree", "Disagree", "Neutral", "Agree",
                "Strongly Agree")
class_levels <- c("Fresh", "Soph", "Jr", "Sr")

survey <-
  survey |> 
    mutate(across(all_of(exp_cols), 
                  ~factor(.x, 
                          levels = exp_levels, 
                          ordered = TRUE)))
survey <-
  survey |> 
    mutate(across(all_of(fin_cols),
                  ~factor(.x, 
                          levels = fin_levels, 
                          ordered = TRUE)))
survey <-
  survey |> 
    mutate(ClassLevel = factor(ClassLevel,
                               levels = class_levels,
                               ordered = TRUE))
survey <-
  survey |> 
    mutate(Status = factor(Status,
                           levels = c("Full-time", "Part-time",
                           "Other")))
survey <-
  survey |> 
    mutate(Gender = factor(Gender,
                           levels = c("Female", "Male", "Other")))

Lines 1–2 (define fin_cols): These six columns are grouped together because each column can take on the values of either "Yes" or "No". We define this vector because we will treat them all similarly.
Lines 3–5 (define exp_cols): These ten columns are grouped together for the same reason — they each can have the same values (as shown on the next two lines).
Line 6–9 (define levels for the factors below): These lines define the possible levels (or values) for the three types of factors that we define below.
Lines 11–16 (define factors for exp_cols): This statement tells R to change all of the columns listed in exp_cols so that they are a factor with possible values (in order) as shown by exp_levels.
Lines 17–22 (define factors for fin_cols): This statement tells R to change all of the columns listed in fin_cols so that they are a factor with possible values (in order) as shown by fin_levels.
Lines 23–27 (define ClassLevel as a factor): This changes ClassLevel so that it is a factor with possible values (in order) as shown by class_levels.
Lines 28–32 (define Status as a factor): This changes Status so that it is a factor with possible values (in order) indicated by the levels argument.

Note that we know what all the possible levels are for each of the factors because of all the work that we have done standardizing our representation of NA values and then listing all of the distinct values in the columns.

11.2 Validating the factor creation process

We would like to think that the work we do is correct the first time. (Oh, if it were only true…) Given that it isn’t, you should absolutely run the following (which we defined above) after you have defined all of your factors. Then, after running it, you should examine the results to ensure that all of the columns look right (correct distinct values listed in the right order).

get_distinct_and_pad <- function(column, numvals = 6) {
  distinct_values <- tibble(value = column) |>
                     distinct(value) |>
                     arrange(value) |> 
                     slice_head(n = numvals) |>
                     pull(value)

  # Pad with NA if there are less than n values
  length(distinct_values) <- numvals
  return(distinct_values)
}

survey |> 
  map(get_distinct_and_pad) |> 
  as.data.frame()

  Year           ID NPS   Field ClassLevel    Status Gender BirthYear FinPL
1 2012 aaapjamvmsgd   4  ArchUP      Fresh Full-time Female      1988   Yes
2 2013 aacgcpuqbhue   6 ArtsHum       Soph Part-time   Male      1989    No
3 2014 aacgctmukgti   7     Bus         Jr     Other  Other      1990  <NA>
4 2015 aacukkmcqpdr   8 CompSci         Sr      <NA>   <NA>      1991  <NA>
5 2016 aacuutwvloft  NA      Ed       <NA>      <NA>   <NA>      1992  <NA>
6 2017 aadtesmehzsd  NA     Eng       <NA>      <NA>   <NA>      1993  <NA>
  FinSch FinGov FinSelf FinPar FinOther      TooDifficult       NotRelevant
1    Yes    Yes     Yes    Yes      Yes Strongly Disagree Strongly Disagree
2     No     No      No     No       No          Disagree          Disagree
3   <NA>   <NA>    <NA>   <NA>     <NA>           Neutral           Neutral
4   <NA>   <NA>    <NA>   <NA>     <NA>             Agree             Agree
5   <NA>   <NA>    <NA>   <NA>     <NA>    Strongly Agree    Strongly Agree
6   <NA>   <NA>    <NA>   <NA>     <NA>              <NA>              <NA>
       PoorTeaching         UnsuppFac            Grades             Sched
1 Strongly Disagree Strongly Disagree Strongly Disagree Strongly Disagree
2          Disagree          Disagree          Disagree          Disagree
3           Neutral           Neutral           Neutral           Neutral
4             Agree             Agree             Agree             Agree
5    Strongly Agree    Strongly Agree    Strongly Agree    Strongly Agree
6              <NA>              <NA>              <NA>              <NA>
        ClassTooBig       BadAdvising            FinAid      OverallValue
1 Strongly Disagree Strongly Disagree Strongly Disagree Strongly Disagree
2          Disagree          Disagree          Disagree          Disagree
3           Neutral           Neutral           Neutral           Neutral
4             Agree             Agree             Agree             Agree
5    Strongly Agree    Strongly Agree    Strongly Agree    Strongly Agree
6              <NA>              <NA>              <NA>              <NA>

Let’s take a look:

Year: in numeric order
ID: in alphabetic order (these are random values in any case, so the ordering really doesn’t matter)
NPS: in numeric order
Field: in alphabetic order (and we haven’t defined an ordering for these, so this is fine)
ClassLevel–Gender: in the order that we defined with the factor
BirthYear: in numeric order
FinPL–FinOther: in the order that we defined with the factor; all have the same levels in the same order
TooDifficult–OverallValue: in the order that we defined with the factor; all have the same levels in the same order

Given all of the above, it appears that we have defined the factors correctly and standardized the representation of NA-like values.

1 Get a few details on the survey data frame

2 Work on TooDifficult column

2.1 Current status

2.2 Some calculations on the field

3 Creating a standardized representation of NA in one column

4 Learning about all the columns

5 Work on FinPL column