Case Study: Retention survey
The above is a somewhat dry listing of R’s capabilities for importing and examining data. The following case study gives a more complete demonstration of what an analyst might go through when first getting familiar with a new dataset.
1 Description
The structure of this dataset is based on the survey template from QuestionPro related to student retention data. You might take a look at this survey to get a flavor of the questions.
2 Import the data
The following is the command for reading the dataset into an R data frame:
survey <- read.csv("../data/class202405/retention_survey_history.csv")
Note that this dataset is purely fictional, generated programmatically for this web site. This, however, does not change the steps that we need to take in order to import and examine the data.
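As a minimal sketch of what the import step does, the example below reads a small made-up CSV from an inline string instead of a file. The three rows and their values are hypothetical; only the column names are borrowed from the survey. The `na.strings` argument is an optional extra, shown on the assumption that blank cells should be treated as missing.

```r
# Hypothetical stand-in for the CSV file: three made-up rows
csv_text <- "Year,ID,NPS,Status,TooDifficult
2012,aaa,8,Full-time,Disagree
2012,bbb,9,Part-time,
2013,ccc,6,Full-time,NA"

# read.csv() accepts inline text via the `text` argument;
# na.strings lists the cell contents that should be read as NA
toy <- read.csv(text = csv_text, na.strings = c("NA", ""))

str(toy)                       # column types chosen by read.csv()
sum(is.na(toy$TooDifficult))   # counts the blank cell and the literal NA
```

Without `na.strings`, a blank cell in a text column would be read as an empty string rather than as `NA`.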
3 Describing the dataset
Here are the column names:
names(survey)
[1] "Year" "ID" "NPS" "Field" "ClassLevel"
[6] "Status" "Gender" "BirthYear" "FinPL" "FinSch"
[11] "FinGov" "FinSelf" "FinPar" "FinOther" "TooDifficult"
[16] "NotRelevant" "PoorTeaching" "UnsuppFac" "Grades" "Sched"
[21] "ClassTooBig" "BadAdvising" "FinAid" "OverallValue"
And here are a few rows of data:
head(survey)
Year ID NPS Field ClassLevel Status Gender BirthYear FinPL
1 2012 xuojqdfdozvu 8 Undecl Sr Full-time Male 1990 No
2 2012 vvwkinqvnibo 8 SocSci Sr Full-time Female 1999 No
3 2012 ibyjcmiopiqk 8 CompSci Sr Full-time Male 1999 No
4 2012 lsqamawyancj 8 Other Sr Full-time Female 1989 No
5 2012 zrpkydjpltkc 8 SocSci Sr Full-time Other 1993 Yes
6 2012 mdoqvaalcscx 8 Undecl Sr Part-time Female 1988 Yes
FinSch FinGov FinSelf FinPar FinOther TooDifficult NotRelevant
1 No No No Yes No Disagree Strongly Disagree
2 No Yes Yes No No Strongly Disagree Disagree
3 No Yes No No Disagree <NA>
4 Yes No Yes No Agree Strongly Disagree
5 Yes No Yes No No <NA> Neutral
6 Yes No Yes No No
PoorTeaching UnsuppFac Grades Sched
1 Agree <NA> <NA> Strongly Agree
2 Agree Neutral Disagree Strongly Disagree
3 Agree Neutral Strongly Disagree Strongly Disagree
4 Disagree Neutral <NA> Agree
5 Agree <NA> <NA> Strongly Disagree
6 Strongly Disagree Neutral Agree Strongly Agree
ClassTooBig BadAdvising FinAid OverallValue
1 Neutral Disagree Strongly Agree Strongly Agree
2 <NA> Disagree Strongly Agree Strongly Agree
3 Strongly Disagree <NA> Strongly Agree Neutral
4 Strongly Disagree Disagree Agree Strongly Agree
5 Strongly Agree Strongly Disagree <NA> Disagree
6 <NA> Disagree <NA> Strongly Agree
Printed rows become much harder to read once a dataset has this many columns: the output wraps into several blocks, and each row has to be pieced together across them.
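When a data frame is too wide for `head()` to print comfortably, a column-by-column summary is often easier to scan. The sketch below uses base R's `str()` on a hypothetical three-row data frame; the values are made up and only the column names come from the survey.

```r
# Hypothetical three-row data frame using a few of the survey's column names
toy <- data.frame(
  Year = c(2012L, 2012L, 2013L),
  NPS = c(8L, 9L, 6L),
  Status = c("Full-time", "Part-time", "Full-time"),
  TooDifficult = c("Disagree", NA, "Agree"),
  stringsAsFactors = FALSE
)

# str() prints one line per column: name, type, and the first few values
str(toy)

# sapply() with class() gives just the column types as a named vector
col_types <- sapply(toy, class)
col_types
```

Because each column gets its own line, the output stays readable no matter how many columns the data frame has.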
Here’s a basic description of the fields:
- Survey response identifiers: These fields identify the specific survey response recorded in this observation.
  - Year: The year in which the survey was completed
  - ID: The student identifier
- Descriptors: These fields provide data about the student responding to the survey.
  - Field: The student’s field of study
  - ClassLevel: The student’s class level (freshman, sophomore, etc.)
  - Status: The student’s status as either part-time or full-time
  - Gender: The student’s gender
  - BirthYear: The student’s year of birth
- Financing: These fields contain the student’s survey responses about how their education was financed.
  - FinPL: Did the student use a personal loan?
  - FinSch: Did the student have a scholarship?
  - FinGov: Did the student receive government-sponsored financial aid (for example, a Pell Grant)?
  - FinSelf: Did the student pay for some or all of the education themselves?
  - FinPar: Did the student’s parents or relatives pay for some or all of the education?
  - FinOther: Did the student receive other financing?
- Experience responses: These fields contain the student’s responses about their experience at the university.
  - TooDifficult: The courses were too difficult for me
  - NotRelevant: The courses weren’t relevant to my career plans
  - PoorTeaching: Teaching methods were poor
  - UnsuppFac: The faculty members weren’t supportive
  - Grades: My grades dissatisfied me
  - Sched: The course schedule didn’t fit my program
  - ClassTooBig: Class sizes were too large
  - BadAdvising: Academic advising was dissatisfying
  - FinAid: The financial aid received was inadequate
  - OverallValue: Overall value of the education for the price was dissatisfying
- Summary response: This question relates to the student’s overall experience at the university.
  - NPS: The Net Promoter Score rating, which reflects the student’s response to this question: Considering your complete experience studying at this college, how likely would you be to recommend us to a friend or colleague?
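To make the NPS field more concrete, here is a minimal sketch of how a Net Promoter Score is conventionally computed from individual ratings. It assumes the usual 0 to 10 scale, with 9 and 10 counted as promoters and 0 through 6 as detractors; the survey description above does not spell out the scale, and the ratings below are made up.

```r
# Hypothetical ratings on the conventional 0-10 "likely to recommend" scale
ratings <- c(10, 9, 8, 8, 7, 6, 3, 9, 10, 5)

promoters  <- mean(ratings >= 9)   # share answering 9 or 10
detractors <- mean(ratings <= 6)   # share answering 0 through 6

# Net Promoter Score: percentage of promoters minus percentage of detractors
nps <- 100 * (promoters - detractors)
nps
```

Ratings of 7 and 8 count toward the total number of responses but fall in neither group.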
4 Exploring the dataset
How big is this table?
dim(survey)
[1] 33524 24
So this is a big dataset: 33,524 rows by 24 columns, or just over 800,000 individual values. That is not something you’d want to handle in an Excel spreadsheet. This is also clearly a survey that has to be processed every year, so there is a huge payoff in setting up a process that handles all of the complexities in the data once and then lets you drop in a new dataset and get the results immediately.
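The size figure is simply rows times columns. The sketch below shows the arithmetic on a hypothetical small data frame, then repeats it with the dimensions reported by `dim(survey)`.

```r
# Hypothetical small data frame standing in for the survey table
toy <- data.frame(a = 1:4, b = letters[1:4], c = c(TRUE, FALSE, NA, TRUE))

dim(toy)                # rows, then columns
nrow(toy) * ncol(toy)   # total number of cells
prod(dim(toy))          # same total in one call

# The same arithmetic with the dimensions reported above
survey_cells <- 33524 * 24
survey_cells
```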
How many NA values are in each column? (You’re right... you have never seen any programming like this! Don’t worry about it for now; you will begin to learn about it on the Pipe page. Just know that it counts up all of the NA values in every column in the data frame.)
library(dplyr)  # provides summarize() and across()
survey |>
  summarize(across(everything(), ~sum(is.na(.x))))
Year ID NPS Field ClassLevel Status Gender BirthYear FinPL FinSch FinGov
1 0 0 0 0 0 0 0 0 0 0 0
FinSelf FinPar FinOther TooDifficult NotRelevant PoorTeaching UnsuppFac
1 0 0 0 4999 4957 5141 4994
Grades Sched ClassTooBig BadAdvising FinAid OverallValue
1 4996 5080 4979 5085 4986 4961
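For a base R point of comparison, the same per-column count can be sketched without the pipe or dplyr. The data frame below is hypothetical, with a few missing values planted in it.

```r
# Hypothetical data frame with some missing responses
toy <- data.frame(
  NPS = c(8, 9, 6, 7),
  TooDifficult = c("Disagree", NA, "Agree", NA),
  Grades = c(NA, "Neutral", "Agree", "Agree"),
  stringsAsFactors = FALSE
)

# is.na() returns a TRUE/FALSE matrix the same shape as the data frame;
# colSums() then counts the TRUEs in each column
na_counts <- colSums(is.na(toy))
na_counts
```

The result is a named vector rather than a one-row data frame, but the counts are the same idea as in the output above.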
Well, that’s a lot of NA values! Ten of the columns, all of them experience responses, have about 5,000 NA values each, which is roughly 15% of the 33,524 rows. Fortunately, we’ll learn in a future section what to do with these and how R handles them.
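As a small preview, and only as a sketch, the example below tabulates one experience-style column with its missing values included. The responses are hypothetical; the five labels are the ones that appear in the `head(survey)` output, and treating them as an ordered scale is an assumption made for this illustration.

```r
# Hypothetical responses using the five labels seen in the head() output
responses <- c("Agree", NA, "Neutral", "Strongly Agree", NA, "Disagree", "Agree")

likert_levels <- c("Strongly Disagree", "Disagree", "Neutral",
                   "Agree", "Strongly Agree")

# An ordered factor keeps the labels in scale order rather than alphabetical order
resp <- factor(responses, levels = likert_levels, ordered = TRUE)

# useNA = "ifany" adds a count of NA values to the table
counts <- table(resp, useNA = "ifany")
counts

# Share of responses that are missing
na_share <- mean(is.na(resp))
na_share
```

By default `table()` silently leaves `NA` values out, so asking for them explicitly is a useful habit when a column has this many.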