Case Study: Retention survey

The preceding sections are a somewhat dry listing of R’s capabilities for importing and examining data. This case study gives a more complete demonstration of what an analyst might go through when first working with, and becoming familiar with, a new dataset.

1 Description

The structure of this dataset is based on the survey template from QuestionPro related to student retention data. You might take a look at this survey to get a flavor of the questions.

2 Import the data

The following is the command for reading the dataset into an R data frame:

survey <- read.csv("../data/class202405/retention_survey_history.csv")

Note that this dataset is purely fictional, generated programmatically for this web site. This, however, does not change the steps that we need to take in order to import and examine the data.
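As a hedged aside, read.csv() takes arguments that can matter for survey data. The sketch below uses a small made-up file written to a temporary path (not the real dataset) to show how the na.strings argument controls which text is read as NA. Whether the actual file needs this treatment is an assumption.

```r
# Toy file standing in for the survey CSV (hypothetical contents).
tmp <- tempfile(fileext = ".csv")
writeLines(c("ID,FinGov,Grades",
             "a1,Yes,Agree",
             "a2,,NA",
             "a3,No,"), tmp)

# Default: in character columns only the literal text NA becomes NA;
# empty fields are kept as empty strings.
default_read <- read.csv(tmp)

# Ask read.csv() to treat empty fields as missing as well.
strict_read <- read.csv(tmp, na.strings = c("NA", ""))

sum(is.na(default_read))  # number of NA cells with the default settings
sum(is.na(strict_read))   # number of NA cells when "" also counts as NA
```

The two counts differ because, by default, a blank answer in a text column is a valid (empty) string rather than a missing value.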

3 Describing the dataset

Here are the column names:

names(survey)

 [1] "Year"         "ID"           "NPS"          "Field"        "ClassLevel"  
 [6] "Status"       "Gender"       "BirthYear"    "FinPL"        "FinSch"      
[11] "FinGov"       "FinSelf"      "FinPar"       "FinOther"     "TooDifficult"
[16] "NotRelevant"  "PoorTeaching" "UnsuppFac"    "Grades"       "Sched"       
[21] "ClassTooBig"  "BadAdvising"  "FinAid"       "OverallValue"

And here are a few rows of data:

head(survey)

  Year           ID NPS   Field ClassLevel    Status Gender BirthYear FinPL
1 2012 xuojqdfdozvu   8  Undecl         Sr Full-time   Male      1990    No
2 2012 vvwkinqvnibo   8  SocSci         Sr Full-time Female      1999    No
3 2012 ibyjcmiopiqk   8 CompSci         Sr Full-time   Male      1999    No
4 2012 lsqamawyancj   8   Other         Sr Full-time Female      1989    No
5 2012 zrpkydjpltkc   8  SocSci         Sr Full-time  Other      1993   Yes
6 2012 mdoqvaalcscx   8  Undecl         Sr Part-time Female      1988   Yes
  FinSch FinGov FinSelf FinPar FinOther      TooDifficult       NotRelevant
1     No     No      No    Yes       No          Disagree Strongly Disagree
2     No    Yes     Yes     No       No Strongly Disagree          Disagree
3     No    Yes      No              No          Disagree              <NA>
4    Yes             No    Yes       No             Agree Strongly Disagree
5    Yes     No     Yes     No       No              <NA>           Neutral
6    Yes     No     Yes     No       No                                    
       PoorTeaching UnsuppFac            Grades             Sched
1             Agree      <NA>              <NA>    Strongly Agree
2             Agree   Neutral          Disagree Strongly Disagree
3             Agree   Neutral Strongly Disagree Strongly Disagree
4          Disagree   Neutral              <NA>             Agree
5             Agree      <NA>              <NA> Strongly Disagree
6 Strongly Disagree   Neutral             Agree    Strongly Agree
        ClassTooBig       BadAdvising         FinAid   OverallValue
1           Neutral          Disagree Strongly Agree Strongly Agree
2              <NA>          Disagree Strongly Agree Strongly Agree
3 Strongly Disagree              <NA> Strongly Agree        Neutral
4 Strongly Disagree          Disagree          Agree Strongly Agree
5    Strongly Agree Strongly Disagree           <NA>       Disagree
6              <NA>          Disagree           <NA> Strongly Agree

Printing rows of data clearly becomes harder to read and interpret as the number of columns grows.
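One possible way around wide printouts is base R’s str(), which lists one column per line instead of one row per line. The sketch below runs it on a toy data frame that borrows a few of the survey’s column names; the values are made up for illustration.

```r
# Toy data frame with a few of the survey's column names (values are made up).
toy <- data.frame(
  Year = c(2012L, 2012L, 2013L),
  Field = c("Undecl", "SocSci", "CompSci"),
  NPS = c(8L, 8L, 6L),
  TooDifficult = c("Disagree", NA, "Agree")
)

# str() prints one line per column: its name, its type, and the first few values.
str(toy)
```

The same call on the full data frame, str(survey), would give 24 short lines rather than several wrapped blocks.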

Here’s a basic description of the fields:

Survey response identifiers: These fields identify the specific survey response recorded in each observation.

Year: the year in which the survey was completed
ID: the student identifier

Descriptors: These fields provide data about the student responding to the survey.

Field: The student’s field of study
ClassLevel: The student’s class level (freshman, sophomore, etc.)
Status: The student’s status as either part-time or full-time
Gender: The student’s gender
BirthYear: The student’s year of birth

Financing: These fields contain the responses that the student gave on the survey concerning the student’s financing for their education.

FinPL: Did the student use a personal loan?
FinSch: Did the student have a scholarship?
FinGov: Did the student receive government-sponsored financial aid (for example, a Pell Grant)?
FinSelf: Did the student pay for some or all of the education themselves?
FinPar: Did the student’s parents or relatives pay for some or all of the education?
FinOther: Did the student receive other financing?

Experience responses: These fields contain the student’s responses related to their experience at the university.

TooDifficult: The courses were too difficult for me
NotRelevant: The courses weren’t relevant to my career plans
PoorTeaching: Teaching methods were poor
UnsuppFac: The faculty members weren’t supportive
Grades: My grades dissatisfied me
Sched: The course schedule didn’t fit my program
ClassTooBig: Class sizes were too large
BadAdvising: Academic advising was dissatisfying
FinAid: The financial aid received was inadequate
OverallValue: Overall value of the education for the price was dissatisfying

Summary response: This question relates to the student’s overall experience at the university.

NPS: This is the Net Promoter Score and reflects the student’s response to this question: Considering your complete experience studying at this college, how likely would you be to recommend us to a friend or colleague?
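For context, a Net Promoter Score is conventionally computed from individual 0 to 10 ratings as the percentage of promoters (9 or 10) minus the percentage of detractors (0 through 6). The sketch below applies that convention to made-up ratings. It assumes the NPS column holds such individual ratings, which the head() output suggests but the description does not state outright.

```r
# Made-up ratings on the usual 0-10 scale (not taken from the survey data).
ratings <- c(10, 9, 8, 8, 7, 6, 3, 9, 10, 5)

promoters  <- mean(ratings >= 9)  # share of respondents answering 9 or 10
detractors <- mean(ratings <= 6)  # share of respondents answering 0 through 6

# Net Promoter Score: percentage of promoters minus percentage of detractors.
nps <- 100 * (promoters - detractors)
nps
```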

4 Exploring the dataset

How big is this table?

dim(survey)

[1] 33524    24

So this is a big dataset: over 800,000 pieces of data (33,524 rows × 24 columns), not something that you’d want to handle in an Excel spreadsheet. Also, this is clearly a survey that has to be processed every year, so setting up a process that handles all of the complexities in the data, and then being able to drop in a new dataset and get the results immediately, would have a huge payoff.
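The size estimate is simply rows times columns. A minimal check, using the two numbers that dim(survey) reported above (hard-coded here so the snippet runs without the data file):

```r
# Dimensions as reported by dim(survey) above.
survey_dim <- c(rows = 33524, cols = 24)

# Total number of cells in the table.
total_cells <- prod(survey_dim)
total_cells  # 804576
```

With the data frame loaded, prod(dim(survey)) would give the same figure directly.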

How many NA values are in each column? (You’re right…you have never seen any programming like this! Don’t worry about it for now. You will begin to learn about this on the Pipe page. Just know that it counts up all of the NA values in every column in the data frame.)

library(dplyr)  # provides summarize(), across(), and everything()

survey |>
  summarize(across(everything(), ~sum(is.na(.x))))

  Year ID NPS Field ClassLevel Status Gender BirthYear FinPL FinSch FinGov
1    0  0   0     0          0      0      0         0     0      0      0
  FinSelf FinPar FinOther TooDifficult NotRelevant PoorTeaching UnsuppFac
1       0      0        0         4999        4957         5141      4994
  Grades Sched ClassTooBig BadAdvising FinAid OverallValue
1   4996  5080        4979        5085   4986         4961

Well, that’s a lot of NA values! Ten of the columns have about 5000 NA values in them. Fortunately, we’ll learn in a future section what to do with these and how R handles them.
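For comparison, the same per-column NA count can be written in base R with colSums(is.na(...)). The sketch below uses a toy data frame with made-up values. It also counts empty strings separately, on the assumption that the blank cells visible in the head() output are empty strings rather than NA values (the zero NA counts for the Fin columns are consistent with this, but it is not confirmed here).

```r
# Toy data frame with made-up values: one blank and two NA answers.
toy <- data.frame(
  FinGov = c("Yes", "", "No", "Yes"),
  Grades = c("Agree", NA, "", NA),
  stringsAsFactors = FALSE
)

# NA count per column, base R equivalent of the summarize() call.
na_counts <- colSums(is.na(toy))

# Empty strings are not NA, so they have to be counted separately.
blank_counts <- colSums(toy == "", na.rm = TRUE)

na_counts
blank_counts
```

On the real data, colSums(is.na(survey)) should reproduce the counts shown above as a named vector.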