Case study: Retention survey

On this page, we’re going to take you through the process of getting a data frame into a state of readiness for analysis by R and the tidyverse.

This consists of, first, acquainting ourselves with the data frame, identifying any issues with its structure, converting it from wide to long where appropriate, and then finishing up with other changes (creating factors, creating a new column, and doing a quick validity check).

1 The survey data frame

We have worked with the survey table throughout this site.

summary(survey)
      Year           ID                 NPS           Field          
 Min.   :2012   Length:33524       Min.   :4.000   Length:33524      
 1st Qu.:2013   Class :character   1st Qu.:4.000   Class :character  
 Median :2014   Mode  :character   Median :6.000   Mode  :character  
 Mean   :2014                      Mean   :5.955                     
 3rd Qu.:2016                      3rd Qu.:7.000                     
 Max.   :2017                      Max.   :8.000                     
  ClassLevel           Status             Gender            BirthYear   
 Length:33524       Length:33524       Length:33524       Min.   :1988  
 Class :character   Class :character   Class :character   1st Qu.:1991  
 Mode  :character   Mode  :character   Mode  :character   Median :1994  
                                                          Mean   :1994  
                                                          3rd Qu.:1997  
                                                          Max.   :2000  
    FinPL              FinSch             FinGov            FinSelf         
 Length:33524       Length:33524       Length:33524       Length:33524      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
    FinPar            FinOther         TooDifficult       NotRelevant       
 Length:33524       Length:33524       Length:33524       Length:33524      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
 PoorTeaching        UnsuppFac            Grades             Sched          
 Length:33524       Length:33524       Length:33524       Length:33524      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
 ClassTooBig        BadAdvising           FinAid          OverallValue      
 Length:33524       Length:33524       Length:33524       Length:33524      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            

We know right off that 3 columns right near the beginning are factors. Let’s handle those first.

class_levels <- c("Fresh", "Soph", "Jr", "Sr")
survey <-
  survey |> 
    mutate(ClassLevel = factor(ClassLevel,
                               levels = class_levels,
                               ordered = TRUE))
survey <-
  survey |> 
    mutate(Status = factor(Status,
                           levels = c("Full-time", 
                                      "Part-time",
                                      "Other")))
survey <-
  survey |> 
    mutate(Gender = factor(Gender,
                           levels = c("Female", "Male", 
                                      "Other")))

2 Handling NA values

From our previous work on this case study, we know that this data frame has a lot of missing values:

survey |> 
  summarize(across(everything(), ~sum(is.na(.x))))
  Year ID NPS Field ClassLevel Status Gender BirthYear FinPL FinSch FinGov
1    0  0   0     0        551    627    875         0     0      0      0
  FinSelf FinPar FinOther TooDifficult NotRelevant PoorTeaching UnsuppFac
1       0      0        0         4999        4957         5141      4994
  Grades Sched ClassTooBig BadAdvising FinAid OverallValue
1   4996  5080        4979        5085   4986         4961

Yep, that’s a lot. For this analysis, we want to ensure that all of the missing data is encoded as NA values. We use the command below to make these changes.

It operates on the 4 columns from Field to Gender and the 16 columns from FinPL to OverallValue. For every value in those columns, it changes values of "", "NA", and "--" to NA.

This will make analysis and computations easier.

survey <-
  survey |> 
    mutate(across(c(Field:Gender,
                    FinPL:OverallValue),
                  ~ case_when(is.na(.x) ~ NA,
                              .x == "" ~ NA,
                              .x == "NA" ~ NA,
                              .x == "--" ~ NA,
                              .default = .x)))

Let’s run that command from above again and see what’s changed.

survey |> 
  summarize(across(everything(), ~sum(is.na(.x))))
  Year ID NPS Field ClassLevel Status Gender BirthYear FinPL FinSch FinGov
1    0  0   0    12        551    627    875         0  1674   1615   1586
  FinSelf FinPar FinOther TooDifficult NotRelevant PoorTeaching UnsuppFac
1    1622   1699     1699         6645        6622         6819      6676
  Grades Sched ClassTooBig BadAdvising FinAid OverallValue
1   6637  6684        6634        6735   6676         6631

The output now shows many hundreds more NA values in most of the columns. Again, this will make computations easier as we move forward.

3 Converting wide to long

Now it’s time for the main event — converting a wide table to a long table. In this case, we want to put all of the responses to survey questions in one column.

Using the format given in this page, we know we need the following information:

  • data frame: survey
  • names_to: we will name the column header Question
  • cols: we will combine all of the survey response fields from TooDifficult to OverallValue into one column
  • values_to: we will name the column with the responses Response

Combining all of this information, we get the following command:

survey <-
  survey |> 
    pivot_longer(
      names_to = "Question",
      cols = c(TooDifficult:OverallValue), 
      values_to = "Response"
    ) |> 
    arrange(Year, ID, Question)

Let’s take a look at our new survey table:

summary(survey)
      Year           ID                 NPS           Field          
 Min.   :2012   Length:335240      Min.   :4.000   Length:335240     
 1st Qu.:2013   Class :character   1st Qu.:4.000   Class :character  
 Median :2014   Mode  :character   Median :6.000   Mode  :character  
 Mean   :2014                      Mean   :5.955                     
 3rd Qu.:2016                      3rd Qu.:7.000                     
 Max.   :2017                      Max.   :8.000                     
 ClassLevel           Status          Gender         BirthYear   
 Fresh:109270   Full-time:251240   Female:159880   Min.   :1988  
 Soph : 83110   Part-time: 61730   Male  :137130   1st Qu.:1991  
 Jr   : 71340   Other    : 16000   Other : 29480   Median :1994  
 Sr   : 66010   NA's     :  6270   NA's  :  8750   Mean   :1994  
 NA's :  5510                                      3rd Qu.:1997  
                                                   Max.   :2000  
    FinPL              FinSch             FinGov            FinSelf         
 Length:335240      Length:335240      Length:335240      Length:335240     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
    FinPar            FinOther           Question           Response        
 Length:335240      Length:335240      Length:335240      Length:335240     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            

4 Creating appropriate factors

And now let’s take a look at the contents of the last eight columns as preparation for defining some more factors:

vals <- function(x) {sort(unique(x))}
apply(survey[,9:16], 2, vals)
$FinPL
[1] "No"  "Yes"

$FinSch
[1] "No"  "Yes"

$FinGov
[1] "No"  "Yes"

$FinSelf
[1] "No"  "Yes"

$FinPar
[1] "No"  "Yes"

$FinOther
[1] "No"  "Yes"

$Question
 [1] "BadAdvising"  "ClassTooBig"  "FinAid"       "Grades"       "NotRelevant" 
 [6] "OverallValue" "PoorTeaching" "Sched"        "TooDifficult" "UnsuppFac"   

$Response
[1] "Agree"             "Disagree"          "Neutral"          
[4] "Strongly Agree"    "Strongly Disagree"

We can see that the first six have just Yes and No as values, so let’s define those as factors:

survey <-
  survey |> 
    mutate(FinPL = factor(FinPL, levels = c("Yes", "No"),
                          ordered = TRUE),
           FinSch = factor(FinSch, levels = c("Yes", "No"),
                          ordered = TRUE),
           FinGov = factor(FinGov, levels = c("Yes", "No"),
                          ordered = TRUE),
           FinSelf = factor(FinSelf, levels = c("Yes", "No"),
                          ordered = TRUE),
           FinPar = factor(FinPar, levels = c("Yes", "No"),
                          ordered = TRUE),
           FinOther = factor(FinOther, levels = c("Yes", "No"),
                          ordered = TRUE))

We want Question to always work with its values in the order in which the questions appear, so let’s define that as a factor as well:

ordered_q <- c("TooDifficult", "NotRelevant",
               "PoorTeaching", "UnsuppFac",
               "Grades", "Sched", "ClassTooBig",
               "BadAdvising", "FinAid",
               "OverallValue")
survey <-
  survey |> 
    mutate(Question = factor(Question,
                             levels = ordered_q,
                             ordered = TRUE))

And now to define Response as a factor. We want to do that so that the factors are in this particular order from low to high.

survey <-
  survey |> 
    mutate(Response = factor(Response,
                             levels = c("Strongly Disagree",
                                        "Disagree", "Neutral",
                                        "Agree",
                                        "Strongly Agree"),
                             ordered = TRUE))

5 Examining the data

Let’s take a look at some of the data. The following shows the responses of two respondents to the 10 questions:

survey |> 
  select(Year, ID, NPS, Field, Question, Response) |> 
  arrange(Year, ID, Question) |> 
  head(20)
# A tibble: 20 × 6
    Year ID             NPS Field  Question     Response         
   <int> <chr>        <int> <chr>  <ord>        <ord>            
 1  2012 aacuutwvloft     7 Undecl TooDifficult Disagree         
 2  2012 aacuutwvloft     7 Undecl NotRelevant  <NA>             
 3  2012 aacuutwvloft     7 Undecl PoorTeaching Strongly Agree   
 4  2012 aacuutwvloft     7 Undecl UnsuppFac    Disagree         
 5  2012 aacuutwvloft     7 Undecl Grades       <NA>             
 6  2012 aacuutwvloft     7 Undecl Sched        Disagree         
 7  2012 aacuutwvloft     7 Undecl ClassTooBig  Neutral          
 8  2012 aacuutwvloft     7 Undecl BadAdvising  <NA>             
 9  2012 aacuutwvloft     7 Undecl FinAid       Agree            
10  2012 aacuutwvloft     7 Undecl OverallValue Strongly Agree   
11  2012 aadyitppmqwo     7 Undecl TooDifficult <NA>             
12  2012 aadyitppmqwo     7 Undecl NotRelevant  <NA>             
13  2012 aadyitppmqwo     7 Undecl PoorTeaching <NA>             
14  2012 aadyitppmqwo     7 Undecl UnsuppFac    Strongly Agree   
15  2012 aadyitppmqwo     7 Undecl Grades       Strongly Disagree
16  2012 aadyitppmqwo     7 Undecl Sched        Neutral          
17  2012 aadyitppmqwo     7 Undecl ClassTooBig  Neutral          
18  2012 aadyitppmqwo     7 Undecl BadAdvising  Strongly Disagree
19  2012 aadyitppmqwo     7 Undecl FinAid       Agree            
20  2012 aadyitppmqwo     7 Undecl OverallValue Strongly Agree   

If we wanted to just show the non-NA responses, we could just add the filter() to the command:

survey |> 
  select(Year, ID, NPS, Field, Question, Response) |> 
  filter(!is.na(Response)) |> 
  arrange(Year, ID, Question) |> 
  head(20)
# A tibble: 20 × 6
    Year ID             NPS Field  Question     Response         
   <int> <chr>        <int> <chr>  <ord>        <ord>            
 1  2012 aacuutwvloft     7 Undecl TooDifficult Disagree         
 2  2012 aacuutwvloft     7 Undecl PoorTeaching Strongly Agree   
 3  2012 aacuutwvloft     7 Undecl UnsuppFac    Disagree         
 4  2012 aacuutwvloft     7 Undecl Sched        Disagree         
 5  2012 aacuutwvloft     7 Undecl ClassTooBig  Neutral          
 6  2012 aacuutwvloft     7 Undecl FinAid       Agree            
 7  2012 aacuutwvloft     7 Undecl OverallValue Strongly Agree   
 8  2012 aadyitppmqwo     7 Undecl UnsuppFac    Strongly Agree   
 9  2012 aadyitppmqwo     7 Undecl Grades       Strongly Disagree
10  2012 aadyitppmqwo     7 Undecl Sched        Neutral          
11  2012 aadyitppmqwo     7 Undecl ClassTooBig  Neutral          
12  2012 aadyitppmqwo     7 Undecl BadAdvising  Strongly Disagree
13  2012 aadyitppmqwo     7 Undecl FinAid       Agree            
14  2012 aadyitppmqwo     7 Undecl OverallValue Strongly Agree   
15  2012 aafejliqlzng     4 SocSci TooDifficult Neutral          
16  2012 aafejliqlzng     4 SocSci NotRelevant  Strongly Disagree
17  2012 aafejliqlzng     4 SocSci PoorTeaching Strongly Disagree
18  2012 aafejliqlzng     4 SocSci UnsuppFac    Neutral          
19  2012 aafejliqlzng     4 SocSci Grades       Strongly Disagree
20  2012 aafejliqlzng     4 SocSci Sched        Strongly Disagree

Since we aren’t going to do any analysis on the rows containing an NA as a response, we can just remove those rows permanently from the survey table with this:

survey <-
  survey |> 
    filter(!is.na(Response)) |> 
    arrange(Year, ID, Question)

This is just about all set to do our analysis on this table. But we have one problem. See this:

survey |> 
  group_by(Question) |> 
  summarize(Median = median(Response, na.rm = TRUE),
            Avg = mean(Response, na.rm = TRUE))

Well, this is supposed to calculate the median and mean responses to each question. It doesn’t work since the values in Response are character strings. Let’s fix that.

6 Readying for analysis

We can fix the above problem by creating a new numeric column called NumResp with this command:

survey <-
  survey |> 
    mutate(NumResp = case_when(
      Response == "Strongly Disagree" ~ 1,
      Response == "Disagree" ~ 2,
      Response == "Neutral" ~ 3,
      Response == "Agree" ~ 4,
      Response == "Strongly Agree" ~ 5,
      .default = NA))

We can now run that command above that wouldn’t work:

survey |> 
  group_by(Question) |> 
  summarize(Median = median(NumResp, na.rm = TRUE),
            Avg = mean(NumResp, na.rm = TRUE))
# A tibble: 10 × 3
   Question     Median   Avg
   <ord>         <dbl> <dbl>
 1 TooDifficult      3  2.99
 2 NotRelevant       3  2.73
 3 PoorTeaching      4  3.42
 4 UnsuppFac         3  3.28
 5 Grades            3  3.02
 6 Sched             4  4.00
 7 ClassTooBig       2  2.52
 8 BadAdvising       2  2.33
 9 FinAid            4  3.83
10 OverallValue      4  4.11