Case study: Retention survey

We are now going to return to our retention survey case study. We have imported the dataset and examined it. At this stage we are going to standardize the representation of NA values throughout the data frame.

1 Get a few details on the survey data frame

While working on all of the following, you can use the following information to remind yourself what you’re working on.

glimpse(survey)
Rows: 33,524
Columns: 24
$ Year         <int> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 201…
$ ID           <chr> "xuojqdfdozvu", "vvwkinqvnibo", "ibyjcmiopiqk", "lsqamawy…
$ NPS          <int> 8, 8, 8, 8, 8, 8, 8, 4, 4, 7, 6, 6, 6, 4, 7, 4, 6, 7, 4, …
$ Field        <chr> "Undecl", "SocSci", "CompSci", "Other", "SocSci", "Undecl…
$ ClassLevel   <chr> "Sr", "Sr", "Sr", "Sr", "Sr", "Sr", "Sr", "Fresh", "Fresh…
$ Status       <chr> "Full-time", "Full-time", "Full-time", "Full-time", "Full…
$ Gender       <chr> "Male", "Female", "Male", "Female", "Other", "Female", "M…
$ BirthYear    <int> 1990, 1999, 1999, 1989, 1993, 1988, 1996, 1991, 1990, 198…
$ FinPL        <chr> "No", "No", "No", "No", "Yes", "Yes", "No", "No", "No", "…
$ FinSch       <chr> "No", "No", "No", "Yes", "Yes", "Yes", "Yes", "No", "No",…
$ FinGov       <chr> "No", "Yes", "Yes", "", "No", "No", "No", "No", "No", "No…
$ FinSelf      <chr> "No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No",…
$ FinPar       <chr> "Yes", "No", "", "Yes", "No", "No", "No", "No", "No", "Ye…
$ FinOther     <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No…
$ TooDifficult <chr> "Disagree", "Strongly Disagree", "Disagree", "Agree", NA,…
$ NotRelevant  <chr> "Strongly Disagree", "Disagree", NA, "Strongly Disagree",…
$ PoorTeaching <chr> "Agree", "Agree", "Agree", "Disagree", "Agree", "Strongly…
$ UnsuppFac    <chr> NA, "Neutral", "Neutral", "Neutral", NA, "Neutral", "Stro…
$ Grades       <chr> NA, "Disagree", "Strongly Disagree", NA, NA, "Agree", NA,…
$ Sched        <chr> "Strongly Agree", "Strongly Disagree", "Strongly Disagree…
$ ClassTooBig  <chr> "Neutral", NA, "Strongly Disagree", "Strongly Disagree", …
$ BadAdvising  <chr> "Disagree", "Disagree", NA, "Disagree", "Strongly Disagre…
$ FinAid       <chr> "Strongly Agree", "Strongly Agree", "Strongly Agree", "Ag…
$ OverallValue <chr> "Strongly Agree", "Strongly Agree", "Neutral", "Strongly …
skim(survey)
Data summary
Name survey
Number of rows 33524
Number of columns 24
_______________________
Column type frequency:
character 21
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ID 0 1.00 12 12 0 33524 0
Field 0 1.00 0 9 12 16 0
ClassLevel 0 1.00 2 5 0 5 0
Status 0 1.00 0 9 627 4 0
Gender 0 1.00 0 6 875 4 0
FinPL 0 1.00 0 3 1674 3 0
FinSch 0 1.00 0 3 1615 3 0
FinGov 0 1.00 0 3 1586 3 0
FinSelf 0 1.00 0 3 1622 3 0
FinPar 0 1.00 0 3 1699 3 0
FinOther 0 1.00 0 3 1699 3 0
TooDifficult 4999 0.85 0 17 1646 6 0
NotRelevant 4957 0.85 0 17 1665 6 0
PoorTeaching 5141 0.85 0 17 1678 6 0
UnsuppFac 4994 0.85 0 17 1682 6 0
Grades 4996 0.85 0 17 1641 6 0
Sched 5080 0.85 0 17 1604 6 0
ClassTooBig 4979 0.85 0 17 1655 6 0
BadAdvising 5085 0.85 0 17 1650 6 0
FinAid 4986 0.85 0 17 1690 6 0
OverallValue 4961 0.85 0 17 1670 6 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year 0 1 2014.50 1.71 2012 2013 2014 2016 2017 ▇▅▅▃▅
NPS 0 1 5.95 1.52 4 4 6 7 8 ▇▁▆▅▅
BirthYear 0 1 1994.01 3.76 1988 1991 1994 1997 2000 ▇▅▇▅▇

2 Work on TooDifficult column

2.1 Current status

Let’s start with the TooDifficult variable in the survey dataset. What values are in the column?

survey |> 
  distinct(TooDifficult)
       TooDifficult
1          Disagree
2 Strongly Disagree
3             Agree
4              <NA>
5                  
6           Neutral
7    Strongly Agree

So it has 7 different values…but only five of them are what we want. Two of them, <NA> (which is a printed representation of NA) and the blank in row 5, are blank non-responses.

Let’s remind ourselves of the actual values at the top of this column:

survey |> 
  select(TooDifficult) |> 
  head(8)
       TooDifficult
1          Disagree
2 Strongly Disagree
3          Disagree
4             Agree
5              <NA>
6                  
7           Neutral
8 Strongly Disagree

And then what does the is.na() have to say about these values?

survey |> 
  select(TooDifficult) |> 
  is.na() |> 
  head(8)
     TooDifficult
[1,]        FALSE
[2,]        FALSE
[3,]        FALSE
[4,]        FALSE
[5,]         TRUE
[6,]        FALSE
[7,]        FALSE
[8,]        FALSE

Notice that row 5 (the <NA>) is considered to not have a value but row 6 (the blank, empty space) is considered to have a value. Essentially, the efforts on the rest of this page is directed at removing this duality — we want all blank values to be represented by NA and nothing else.

2.2 Some calculations on the field

Now, we’re going to do something a bit more complicated. Let’s look at the whole TooDifficult column and compute the following:

  • number of distinct values
  • number of NA values
  • number of not-NA values
  • median response
survey |> 
  summarize(n = n_distinct(TooDifficult),
            na = sum(is.na(TooDifficult)),
            notNA = sum(!is.na(TooDifficult)),
            med = median(TooDifficult, 
                         na.rm = TRUE))
  n   na notNA     med
1 7 4999 28525 Neutral

From the output, we can see that there are (still) 7 distinct values, almost 5000 NA values, over 28,000 non-NA values, and a median response of Neutral. The trouble is that some of those 28,000 non-NA values are actually blanks (and should be NA).

The following is one of our favorite sets of reusable code. Other than the first line (which specifies the data frame it works on), everything else is simply a copy-and-paste. It prints out, for every column, the count of how many entries have NA-like values:

survey |> 
  summarize(across(everything(),
                   ~sum(.x %in% c("NA", "--", ""))))
  Year ID NPS Field ClassLevel Status Gender BirthYear FinPL FinSch FinGov
1    0  0   0    12        551    627    875         0  1674   1615   1586
  FinSelf FinPar FinOther TooDifficult NotRelevant PoorTeaching UnsuppFac
1    1622   1699     1699         1646        1665         1678      1682
  Grades Sched ClassTooBig BadAdvising FinAid OverallValue
1   1641  1604        1655        1650   1690         1670

All but four columns have these NA-like values.

The following shows what this looks like. You can see an actual NA value (<NA>) in row 5 plus a blank value in row 6:

survey |> 
  select(TooDifficult) |> 
  head(8)
       TooDifficult
1          Disagree
2 Strongly Disagree
3          Disagree
4             Agree
5              <NA>
6                  
7           Neutral
8 Strongly Disagree

3 Creating a standardized representation of NA in one column

We are now going to try to get a standardized representation of the NA values in the column. Here is the command:

survey <-
  survey |> 
  mutate(TooDifficult = replace(TooDifficult,
                                TooDifficult == "",
                                NA)) |> 
  mutate(TooDifficult = replace(TooDifficult,
                                TooDifficult == "NA",
                                NA))

Let’s take a moment to interpret the above. Here is the form of these two mutate() statements:

mutate(new-col-name = replace(old-col-name,
                              condition,
                              new-value-if-true))

In the first case, we’re making the following substitutions:

  • new-col-name: we’re keeping the column name as TooDifficult
  • old-col-name: we’re starting with the values that are currently in TooDifficult
  • condition: we’re testing if TooDifficult has the value "" (empty string)
  • new-value-if-true: we’re replacing the original value with NA (the not available indicator)

Thus, we’re replacing all "" and "NA" values in TooDifficult with NA.

4 Learning about all the columns

Let’s build a statement that can tell us about all of the columns.

The first statement uses the map_chr(x) function — it applies the x function to all the columns, one after another. Consider the following that applies the class function to each column in survey. What we should get is the class of each column.

survey |> 
  map_chr(class)
        Year           ID          NPS        Field   ClassLevel       Status 
   "integer"  "character"    "integer"  "character"  "character"  "character" 
      Gender    BirthYear        FinPL       FinSch       FinGov      FinSelf 
 "character"    "integer"  "character"  "character"  "character"  "character" 
      FinPar     FinOther TooDifficult  NotRelevant PoorTeaching    UnsuppFac 
 "character"  "character"  "character"  "character"  "character"  "character" 
      Grades        Sched  ClassTooBig  BadAdvising       FinAid OverallValue 
 "character"  "character"  "character"  "character"  "character"  "character" 

The following statement has the same form as the above, but it applies the n_distinct function which returns the number of distinct values in that column.

survey |> 
  map_dbl(n_distinct)
        Year           ID          NPS        Field   ClassLevel       Status 
           6        33524            4           16            5            4 
      Gender    BirthYear        FinPL       FinSch       FinGov      FinSelf 
           4           13            3            3            3            3 
      FinPar     FinOther TooDifficult  NotRelevant PoorTeaching    UnsuppFac 
           3            3            6            7            7            7 
      Grades        Sched  ClassTooBig  BadAdvising       FinAid OverallValue 
           7            7            7            7            7            7 

Now we want to create a statement that returns both of these values. In the following in line 1 we the name of the first column in survey (Year) to the variable .x. (You will see why we used the .x variable name in a moment.) Lines 2 and 3 create a data frame with 2 columns (named n_distinct and class) with 1 row (associated with the first column):

.x <- survey |> pluck(1)
data.frame(n_distinct = n_distinct(.x),
           class = class(.x))
  n_distinct   class
1          6 integer

Now we are putting some pieces together. In this case, we use the R function map_df to apply a function to every column in survey.

survey |> 
  map_df(~(data.frame(n_distinct = n_distinct(.x),
                      class = class(.x))))
   n_distinct     class
1           6   integer
2       33524 character
3           4   integer
4          16 character
5           5 character
6           4 character
7           4 character
8          13   integer
9           3 character
10          3 character
11          3 character
12          3 character
13          3 character
14          3 character
15          6 character
16          7 character
17          7 character
18          7 character
19          7 character
20          7 character
21          7 character
22          7 character
23          7 character
24          7 character

But what is the function? Above, we had simple built-in R functions like class and n_distinct. Here it is more complicated. It is whole first argument of map_df:

~(data.frame(n_distinct = n_distinct(.x),
             class = class(.x)))

This is a strange-looking statement. It helps to know that, instead of a function (as we saw above), R accepts a formula with the form ~(stuff), where stuff is a valid R formula that uses .x to refer to each successive column name. We know what this data.frame command does because we applied it to the Year column above.

The results look great, but it would be better if it identified the name of the column. We can do this by adding the .id argument to the map_df() function:

survey |> 
  map_df(~(data.frame(n_distinct = n_distinct(.x),
                      class = class(.x))),
         .id = "variable")
       variable n_distinct     class
1          Year          6   integer
2            ID      33524 character
3           NPS          4   integer
4         Field         16 character
5    ClassLevel          5 character
6        Status          4 character
7        Gender          4 character
8     BirthYear         13   integer
9         FinPL          3 character
10       FinSch          3 character
11       FinGov          3 character
12      FinSelf          3 character
13       FinPar          3 character
14     FinOther          3 character
15 TooDifficult          6 character
16  NotRelevant          7 character
17 PoorTeaching          7 character
18    UnsuppFac          7 character
19       Grades          7 character
20        Sched          7 character
21  ClassTooBig          7 character
22  BadAdvising          7 character
23       FinAid          7 character
24 OverallValue          7 character

We can see (in row 15) that TooDifficult has 6 distinct values, one less than the columns named in rows 16–17 because of the mutate() commands we executed above.

5 Work on FinPL column

Let’s just take a moment to verify the information about one of the columns. We’ll focus on FinPL, the 9th column in the table.

5.1 Distinct values

survey |> 
  select(FinPL) |> 
  distinct()
  FinPL
1    No
2   Yes
3      

So, yes, it has 3 distinct values in the column (as shown in the statement above on row 9). One of those distinct values is a blank (but not a NA).

5.2 Number of NA values

sum(is.na(survey$FinPL))
[1] 0

The column has no NA values in it though we know from above that it does have NA-like values.

5.3 Number of non-NA values

sum(!is.na(survey$FinPL))
[1] 33524

And, for completeness, we can see that it has lots of non-NA values.

6 Selecting distinct values for two columns

This command returns the distinct values based on a combination of the columns! We know (from above) that each of these two columns has 3 distinct variables; however, when asking for the distinct combination of the two, it will display 9 (3x3) and not 6 (3+3) rows in the result:

survey |> 
  select(FinPL, FinSch) |> 
  distinct()
  FinPL FinSch
1    No     No
2    No    Yes
3   Yes    Yes
4   Yes     No
5           No
6    No       
7          Yes
8   Yes       
9             

7 Changing values across many columns

We saw above how much work that it takes to make it so one column (in that case, TooDifficult) has a standardized representation of the NA value. We don’t want to have to do that for all of the columns separately — we want to take care of all of them at once.

survey <-
  survey |> 
    mutate(across(c(Field:Gender,
                    FinPL:OverallValue),
                  ~ case_when(is.na(.x) ~ NA,
                              .x == "" ~ NA,
                              .x == "NA" ~ NA,
                              .x == "--" ~ NA,
                              .default = .x)))

In addition to the already-understood mutate() command, we have several pieces to unpack here: across(), c(), and ~case_when(). Let’s take a look at each one of them in some detail:

across(Columns, Function)

When used within a verb such as select() or mutate(), the across() function applies the verb across the columns specified by Columns and executes the Function. We discuss both the specification of the columns and the definition of the function below. Read this page for more information about across().

c(RangeOfCols1, RangeOfCols2)

In this case, the whole set of columns is the joining together of two lists: 1) the columns from Field to Gender, and 2) the columns from FinPL to OverallValue. We could have simply listed all of the fields within a c(), but the syntax A:B specifies a list of the fields (as they appear in the data frame).

We are leaving out ID because it has one unique non-NA-like per row. We are leaving out Year, NPS, and BirthYear because they are integer columns with no missing values.

As you might guess, there is no limit to the number of fields or range of fields that you can include in the c() function.

~case_when(Condition1 ~ Val1, Condition2 ~ Val2, ... Default)

First, realize that this is the function that is applied to each of the columns specified above in the c(). Second, let’s define how this is read. If you have programmed before, this is the tidyverse’s version of an if/else or switch statement:

if Condition1 is true:
  return Val1
else if Condition2 is true:
  return Val2
else if ...
  ...
else:
  return Default

The conditions are evaluated in order. The value is returned for the first condition that is true. This is what we have in this case:

if is.na(x):
  return NA
else if (x == ""):
  return NA
else if (x == "NA"):
  return NA
else if (x == "--"):
  return NA
else
  return x

You can see that this case_when() function is cleaning up the column it is working on so that it has a standard representation of an NA-like value.

8 Checking for NA-like values across all columns

We’re now going to define a statement that will help us see if there are any NA-like values remaining in any column.

The statement that we are going to build uses the everything() function within the across(). You might recall that we used the c() function to build a set of columns for mutate(). In this case, we don’t want to use a strict subset of columns—we want to work on all of the columns. That is what everything() is for.

In this case we are also going to use summarize() separately from any group_by() statement. Let’s take a look at an example that uses a function that we’re familiar with. In this case, the statement is telling R to calculate (summarize()) the maximum value (max()) for all columns (across(everything(), ...)) while ignoring all NA values (na.rm = TRUE):

survey |> 
  summarize(across(everything(),
                   ~max(.x, na.rm = TRUE)))
  Year           ID NPS  Field ClassLevel    Status Gender BirthYear FinPL
1 2017 zzzyiijteasl   8 Undecl         Sr Part-time  Other      2000   Yes
  FinSch FinGov FinSelf FinPar FinOther      TooDifficult       NotRelevant
1    Yes    Yes     Yes    Yes      Yes Strongly Disagree Strongly Disagree
       PoorTeaching         UnsuppFac            Grades             Sched
1 Strongly Disagree Strongly Disagree Strongly Disagree Strongly Disagree
        ClassTooBig       BadAdvising            FinAid      OverallValue
1 Strongly Disagree Strongly Disagree Strongly Disagree Strongly Disagree

Here is another example. The pluck(Col, Num) function can be used to access the Num item in column Col. You would expect the following to print the first row of the survey data frame:

survey |> 
  summarize(across(everything(),
                   ~pluck(.x, 1)))
  Year           ID NPS  Field ClassLevel    Status Gender BirthYear FinPL
1 2012 xuojqdfdozvu   8 Undecl         Sr Full-time   Male      1990    No
  FinSch FinGov FinSelf FinPar FinOther TooDifficult       NotRelevant
1     No     No      No    Yes       No     Disagree Strongly Disagree
  PoorTeaching UnsuppFac Grades          Sched ClassTooBig BadAdvising
1        Agree      <NA>   <NA> Strongly Agree     Neutral    Disagree
          FinAid   OverallValue
1 Strongly Agree Strongly Agree

That is, the above should be the same as the following:

survey |> head(1)
  Year           ID NPS  Field ClassLevel    Status Gender BirthYear FinPL
1 2012 xuojqdfdozvu   8 Undecl         Sr Full-time   Male      1990    No
  FinSch FinGov FinSelf FinPar FinOther TooDifficult       NotRelevant
1     No     No      No    Yes       No     Disagree Strongly Disagree
  PoorTeaching UnsuppFac Grades          Sched ClassTooBig BadAdvising
1        Agree      <NA>   <NA> Strongly Agree     Neutral    Disagree
          FinAid   OverallValue
1 Strongly Agree Strongly Agree

And it is.

Now, having a better idea about summarize(), across(), and everything(), we can now discuss the statement that we are going to use:

survey |> 
  summarize(across(everything(),
                   ~sum(.x %in% c("NA", "--", ""))))
  Year ID NPS Field ClassLevel Status Gender BirthYear FinPL FinSch FinGov
1    0  0   0     0          0      0      0         0     0      0      0
  FinSelf FinPar FinOther TooDifficult NotRelevant PoorTeaching UnsuppFac
1       0      0        0            0           0            0         0
  Grades Sched ClassTooBig BadAdvising FinAid OverallValue
1      0     0           0           0      0            0

All that remains now to understand this statement is to understand what the sum() function is doing. For each column, it is determining if a value is in the vector c("NA", "--", ""). Suppose that 500 of the 30,000 values in a certain column had one of those three values. Well, A %in% B returns TRUE or FALSE. The value of TRUE is 1 and the value of FALSE is 0. (If you didn’t know this, then now you do.) Thus, the sum() function would be looking at a set of 29,500 FALSE values and 500 TRUE values. In this case, the sum() function would return 500.

Now we can look at the result of the above statement and see that there are zero NA-like values in any column.

9 Validating the effects on some columns

Let’s now take a look at the actual values in some of the columns in survey. We’ll start with ClassLevel.

survey |> 
  select(ClassLevel) |> 
  distinct()
  ClassLevel
1         Sr
2      Fresh
3         Jr
4       Soph
5       <NA>

And now FinPL. Before starting all of the processing, this column had three distinct values plus hundreds of empty values.

survey |> 
  select(FinPL) |> 
  distinct()
  FinPL
1    No
2   Yes
3  <NA>

But now all of the blanks ("") have been converted to NA values.

The following two commands calculate how many NA and non-NA values are in FinPL:

sum(is.na(survey$FinPL))
[1] 1674
sum(!is.na(survey$FinPL))
[1] 31850

Instead of listing the distinct values in a column, we also have the option of simply displaying a few rows of that column:

survey |> 
  select(TooDifficult) |> 
  head(8)
       TooDifficult
1          Disagree
2 Strongly Disagree
3          Disagree
4             Agree
5              <NA>
6              <NA>
7           Neutral
8 Strongly Disagree

11 Factors

We are now going to use a lot of information from the listings generated in the previous section to define the factors for the survey data frame. Go back and look at the ordering of some of the fields — for example, look at ClassLevel, FinPar, FinOther, TooDifficult, and NotRelevant. Each of these has a natural ordering, and we don’t see that here. Further, why aren’t FinPar and FinOther in the same order? And why aren’t TooDifficult and NotRelevant in the same order? When we look at these columns during analysis, it would make it easier for us if the values were presented in a consistent ordering.

That’s what we can make happen by defining factors. We are first going to define the factors (in the next section) and then validate the work that we did (in the section following that one).

11.1 Defining the factors

Consider the following code for defining the factors. Below, we dissect what is going on.

fin_cols <- c("FinPL", "FinSch", "FinGov", 
              "FinSelf", "FinPar", "FinOther")
exp_cols <- c("TooDifficult", "NotRelevant", "PoorTeaching",
              "UnsuppFac", "Grades", "Sched", "ClassTooBig",
              "BadAdvising", "FinAid", "OverallValue")
fin_levels <- c("Yes", "No")
exp_levels <- c("Strongly Disagree", "Disagree", "Neutral", "Agree",
                "Strongly Agree")
class_levels <- c("Fresh", "Soph", "Jr", "Sr")

survey <-
  survey |> 
    mutate(across(all_of(exp_cols), 
                  ~factor(.x, 
                          levels = exp_levels, 
                          ordered = TRUE)))
survey <-
  survey |> 
    mutate(across(all_of(fin_cols),
                  ~factor(.x, 
                          levels = fin_levels, 
                          ordered = TRUE)))
survey <-
  survey |> 
    mutate(ClassLevel = factor(ClassLevel,
                               levels = class_levels,
                               ordered = TRUE))
survey <-
  survey |> 
    mutate(Status = factor(Status,
                           levels = c("Full-time", "Part-time",
                           "Other")))
survey <-
  survey |> 
    mutate(Gender = factor(Gender,
                           levels = c("Female", "Male", "Other")))
Lines 1–2 (define fin_cols)

These six columns are grouped together because each column can take on the values of either "Yes" or "No". We define this vector because we will treat them all similarly.

Lines 3–5 (define exp_cols)

These ten columns are grouped together for the same reason — they each can have the same values (as shown on the next two lines).

Line 6–9 (define levels for the factors below)

These lines define the possible levels (or values) for the three types of factors that we define below.

Lines 11–16 (define factors for exp_cols)

This statement tells R to change all of the columns listed in exp_cols so that they are a factor with possible values (in order) as shown by exp_levels.

Lines 17–22 (define factors for fin_cols)

This statement tells R to change all of the columns listed in fin_cols so that they are a factor with possible values (in order) as shown by fin_levels.

Lines 23–27 (define ClassLevel as a factor)

This changes ClassLevel so that it is a factor with possible values (in order) as shown by class_levels.

Lines 28–32 (define Status as a factor)

This changes Status so that it is a factor with possible values (in order) indicated by the levels argument.

Note that we know what all the possible levels are for each of the factors because of all the work that we have done standardizing our representation of NA values and then listing all of the distinct values in the columns.

11.2 Validating the factor creation process

We would like to think that the work we do is correct the first time. (Oh, if it were only true…) Given that it isn’t, you should absolutely run the following (which we defined above) after you have defined all of your factors. Then, after running it, you should examine the results to ensure that all of the columns look right (correct distinct values listed in the right order).

get_distinct_and_pad <- function(column, numvals = 6) {
  distinct_values <- tibble(value = column) |>
                     distinct(value) |>
                     arrange(value) |> 
                     slice_head(n = numvals) |>
                     pull(value)

  # Pad with NA if there are less than n values
  length(distinct_values) <- numvals
  return(distinct_values)
}

survey |> 
  map(get_distinct_and_pad) |> 
  as.data.frame()
  Year           ID NPS   Field ClassLevel    Status Gender BirthYear FinPL
1 2012 aaapjamvmsgd   4  ArchUP      Fresh Full-time Female      1988   Yes
2 2013 aacgcpuqbhue   6 ArtsHum       Soph Part-time   Male      1989    No
3 2014 aacgctmukgti   7     Bus         Jr     Other  Other      1990  <NA>
4 2015 aacukkmcqpdr   8 CompSci         Sr      <NA>   <NA>      1991  <NA>
5 2016 aacuutwvloft  NA      Ed       <NA>      <NA>   <NA>      1992  <NA>
6 2017 aadtesmehzsd  NA     Eng       <NA>      <NA>   <NA>      1993  <NA>
  FinSch FinGov FinSelf FinPar FinOther      TooDifficult       NotRelevant
1    Yes    Yes     Yes    Yes      Yes Strongly Disagree Strongly Disagree
2     No     No      No     No       No          Disagree          Disagree
3   <NA>   <NA>    <NA>   <NA>     <NA>           Neutral           Neutral
4   <NA>   <NA>    <NA>   <NA>     <NA>             Agree             Agree
5   <NA>   <NA>    <NA>   <NA>     <NA>    Strongly Agree    Strongly Agree
6   <NA>   <NA>    <NA>   <NA>     <NA>              <NA>              <NA>
       PoorTeaching         UnsuppFac            Grades             Sched
1 Strongly Disagree Strongly Disagree Strongly Disagree Strongly Disagree
2          Disagree          Disagree          Disagree          Disagree
3           Neutral           Neutral           Neutral           Neutral
4             Agree             Agree             Agree             Agree
5    Strongly Agree    Strongly Agree    Strongly Agree    Strongly Agree
6              <NA>              <NA>              <NA>              <NA>
        ClassTooBig       BadAdvising            FinAid      OverallValue
1 Strongly Disagree Strongly Disagree Strongly Disagree Strongly Disagree
2          Disagree          Disagree          Disagree          Disagree
3           Neutral           Neutral           Neutral           Neutral
4             Agree             Agree             Agree             Agree
5    Strongly Agree    Strongly Agree    Strongly Agree    Strongly Agree
6              <NA>              <NA>              <NA>              <NA>

Let’s take a look:

  • Year: in numeric order
  • ID: in alphabetic order (these are random values in any case, so the ordering really doesn’t matter)
  • NPS: in numeric order
  • Field: in alphabetic order (and we haven’t defined an ordering for these, so this is fine)
  • ClassLevelGender: in the order that we defined with the factor
  • BirthYear: in numeric order
  • FinPLFinOther: in the order that we defined with the factor; all have the same levels in the same order
  • TooDifficultOverallValue: in the order that we defined with the factor; all have the same levels in the same order

Given all of the above, it appears that we have defined the factors correctly and standardized the representation of NA-like values.