Year ID NPS Field
Min. :2012 Length:33524 Min. :4.000 Length:33524
1st Qu.:2013 Class :character 1st Qu.:4.000 Class :character
Median :2014 Mode :character Median :6.000 Mode :character
Mean :2014 Mean :5.955
3rd Qu.:2016 3rd Qu.:7.000
Max. :2017 Max. :8.000
ClassLevel Status Gender BirthYear
Length:33524 Length:33524 Length:33524 Min. :1988
Class :character Class :character Class :character 1st Qu.:1991
Mode :character Mode :character Mode :character Median :1994
Mean :1994
3rd Qu.:1997
Max. :2000
FinPL FinSch FinGov FinSelf
Length:33524 Length:33524 Length:33524 Length:33524
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
FinPar FinOther TooDifficult NotRelevant
Length:33524 Length:33524 Length:33524 Length:33524
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
PoorTeaching UnsuppFac Grades Sched
Length:33524 Length:33524 Length:33524 Length:33524
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
ClassTooBig BadAdvising FinAid OverallValue
Length:33524 Length:33524 Length:33524 Length:33524
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Case study: Retention survey
On this page, we’re going to take you through the process of getting a data frame into a state of readiness for analysis by R and the tidyverse.
This consists of, first, acquainting ourselves with the data frame, identifying any issues with its structure, converting it from wide to long where appropriate, and then finishing up with other changes (creating factors, creating a new column, and doing a quick validity check).
1 The survey data frame
We have worked with the survey
table throughout this site.
We know right off that 3 columns right near the beginning are factors. Let’s handle those first.
class_levels <- c("Fresh", "Soph", "Jr", "Sr")
survey <-
survey |>
mutate(ClassLevel = factor(ClassLevel,
levels = class_levels,
ordered = TRUE))
survey <-
survey |>
mutate(Status = factor(Status,
levels = c("Full-time",
"Part-time",
"Other")))
survey <-
survey |>
mutate(Gender = factor(Gender,
levels = c("Female", "Male",
"Other")))
2 Handling NA
values
From our previous work on this case study, we know that this data frame has a lot of missing values:
Year ID NPS Field ClassLevel Status Gender BirthYear FinPL FinSch FinGov
1 0 0 0 0 551 627 875 0 0 0 0
FinSelf FinPar FinOther TooDifficult NotRelevant PoorTeaching UnsuppFac
1 0 0 0 4999 4957 5141 4994
Grades Sched ClassTooBig BadAdvising FinAid OverallValue
1 4996 5080 4979 5085 4986 4961
Yep, that’s a lot. For this analysis, we want to ensure that all of the missing data is encoded as NA
values. We use the command below to make these changes.
It operates on the 4 columns from
Field
toGender
and the 16 columns fromFinPL
toOverallValue
. For every value in those columns, it changes values of""
,"NA"
, and"--"
toNA
.
This will make analysis and computations easier.
Let’s run that command from above again and see what’s changed.
Year ID NPS Field ClassLevel Status Gender BirthYear FinPL FinSch FinGov
1 0 0 0 12 551 627 875 0 1674 1615 1586
FinSelf FinPar FinOther TooDifficult NotRelevant PoorTeaching UnsuppFac
1 1622 1699 1699 6645 6622 6819 6676
Grades Sched ClassTooBig BadAdvising FinAid OverallValue
1 6637 6684 6634 6735 6676 6631
The output now shows many hundreds more NA
values in most of the columns. Again, this will make computations easier as we move forward.
3 Converting wide to long
Now it’s time for the main event — converting a wide table to a long table. In this case, we want to put all of the responses to survey questions in one column.
Using the format given in this page, we know we need the following information:
data frame
:survey
names_to
: we will name the column headerQuestion
cols
: we will combine all of the survey response fields fromTooDifficult
toOverallValue
into one columnvalues_to
: we will name the column with the responsesResponse
Combining all of this information, we get the following command:
Let’s take a look at our new survey
table:
Year ID NPS Field
Min. :2012 Length:335240 Min. :4.000 Length:335240
1st Qu.:2013 Class :character 1st Qu.:4.000 Class :character
Median :2014 Mode :character Median :6.000 Mode :character
Mean :2014 Mean :5.955
3rd Qu.:2016 3rd Qu.:7.000
Max. :2017 Max. :8.000
ClassLevel Status Gender BirthYear
Fresh:109270 Full-time:251240 Female:159880 Min. :1988
Soph : 83110 Part-time: 61730 Male :137130 1st Qu.:1991
Jr : 71340 Other : 16000 Other : 29480 Median :1994
Sr : 66010 NA's : 6270 NA's : 8750 Mean :1994
NA's : 5510 3rd Qu.:1997
Max. :2000
FinPL FinSch FinGov FinSelf
Length:335240 Length:335240 Length:335240 Length:335240
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
FinPar FinOther Question Response
Length:335240 Length:335240 Length:335240 Length:335240
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
4 Creating appropriate factors
And now let’s take a look at the contents of the last eight columns as preparation for defining some more factors:
$FinPL
[1] "No" "Yes"
$FinSch
[1] "No" "Yes"
$FinGov
[1] "No" "Yes"
$FinSelf
[1] "No" "Yes"
$FinPar
[1] "No" "Yes"
$FinOther
[1] "No" "Yes"
$Question
[1] "BadAdvising" "ClassTooBig" "FinAid" "Grades" "NotRelevant"
[6] "OverallValue" "PoorTeaching" "Sched" "TooDifficult" "UnsuppFac"
$Response
[1] "Agree" "Disagree" "Neutral"
[4] "Strongly Agree" "Strongly Disagree"
We can see that the first six have just Yes
and No
as values, so let’s define those as factors:
survey <-
survey |>
mutate(FinPL = factor(FinPL, levels = c("Yes", "No"),
ordered = TRUE),
FinSch = factor(FinSch, levels = c("Yes", "No"),
ordered = TRUE),
FinGov = factor(FinGov, levels = c("Yes", "No"),
ordered = TRUE),
FinSelf = factor(FinSelf, levels = c("Yes", "No"),
ordered = TRUE),
FinPar = factor(FinPar, levels = c("Yes", "No"),
ordered = TRUE),
FinOther = factor(FinOther, levels = c("Yes", "No"),
ordered = TRUE))
We want Question
to always work with its values in the order in which the questions appear, so let’s define that as a factor as well:
And now to define Response
as a factor. We want to do that so that the factors are in this particular order from low to high.
5 Examining the data
Let’s take a look at some of the data. The following shows the responses of two respondents to the 10 questions:
survey |>
select(Year, ID, NPS, Field, Question, Response) |>
arrange(Year, ID, Question) |>
head(20)
# A tibble: 20 × 6
Year ID NPS Field Question Response
<int> <chr> <int> <chr> <ord> <ord>
1 2012 aacuutwvloft 7 Undecl TooDifficult Disagree
2 2012 aacuutwvloft 7 Undecl NotRelevant <NA>
3 2012 aacuutwvloft 7 Undecl PoorTeaching Strongly Agree
4 2012 aacuutwvloft 7 Undecl UnsuppFac Disagree
5 2012 aacuutwvloft 7 Undecl Grades <NA>
6 2012 aacuutwvloft 7 Undecl Sched Disagree
7 2012 aacuutwvloft 7 Undecl ClassTooBig Neutral
8 2012 aacuutwvloft 7 Undecl BadAdvising <NA>
9 2012 aacuutwvloft 7 Undecl FinAid Agree
10 2012 aacuutwvloft 7 Undecl OverallValue Strongly Agree
11 2012 aadyitppmqwo 7 Undecl TooDifficult <NA>
12 2012 aadyitppmqwo 7 Undecl NotRelevant <NA>
13 2012 aadyitppmqwo 7 Undecl PoorTeaching <NA>
14 2012 aadyitppmqwo 7 Undecl UnsuppFac Strongly Agree
15 2012 aadyitppmqwo 7 Undecl Grades Strongly Disagree
16 2012 aadyitppmqwo 7 Undecl Sched Neutral
17 2012 aadyitppmqwo 7 Undecl ClassTooBig Neutral
18 2012 aadyitppmqwo 7 Undecl BadAdvising Strongly Disagree
19 2012 aadyitppmqwo 7 Undecl FinAid Agree
20 2012 aadyitppmqwo 7 Undecl OverallValue Strongly Agree
If we wanted to just show the non-NA
responses, we could just add the filter()
to the command:
survey |>
select(Year, ID, NPS, Field, Question, Response) |>
filter(!is.na(Response)) |>
arrange(Year, ID, Question) |>
head(20)
# A tibble: 20 × 6
Year ID NPS Field Question Response
<int> <chr> <int> <chr> <ord> <ord>
1 2012 aacuutwvloft 7 Undecl TooDifficult Disagree
2 2012 aacuutwvloft 7 Undecl PoorTeaching Strongly Agree
3 2012 aacuutwvloft 7 Undecl UnsuppFac Disagree
4 2012 aacuutwvloft 7 Undecl Sched Disagree
5 2012 aacuutwvloft 7 Undecl ClassTooBig Neutral
6 2012 aacuutwvloft 7 Undecl FinAid Agree
7 2012 aacuutwvloft 7 Undecl OverallValue Strongly Agree
8 2012 aadyitppmqwo 7 Undecl UnsuppFac Strongly Agree
9 2012 aadyitppmqwo 7 Undecl Grades Strongly Disagree
10 2012 aadyitppmqwo 7 Undecl Sched Neutral
11 2012 aadyitppmqwo 7 Undecl ClassTooBig Neutral
12 2012 aadyitppmqwo 7 Undecl BadAdvising Strongly Disagree
13 2012 aadyitppmqwo 7 Undecl FinAid Agree
14 2012 aadyitppmqwo 7 Undecl OverallValue Strongly Agree
15 2012 aafejliqlzng 4 SocSci TooDifficult Neutral
16 2012 aafejliqlzng 4 SocSci NotRelevant Strongly Disagree
17 2012 aafejliqlzng 4 SocSci PoorTeaching Strongly Disagree
18 2012 aafejliqlzng 4 SocSci UnsuppFac Neutral
19 2012 aafejliqlzng 4 SocSci Grades Strongly Disagree
20 2012 aafejliqlzng 4 SocSci Sched Strongly Disagree
Since we aren’t going to do any analysis on the rows containing an NA
as a response, we can just remove those rows permanently from the survey
table with this:
This is just about all set to do our analysis on this table. But we have one problem. See this:
Well, this is supposed to calculate the median
and mean
responses to each question. It doesn’t work since the values in Response
are character strings. Let’s fix that.
6 Readying for analysis
We can fix the above problem by creating a new numeric column called NumResp
with this command:
We can now run that command above that wouldn’t work:
survey |>
group_by(Question) |>
summarize(Median = median(NumResp, na.rm = TRUE),
Avg = mean(NumResp, na.rm = TRUE))
# A tibble: 10 × 3
Question Median Avg
<ord> <dbl> <dbl>
1 TooDifficult 3 2.99
2 NotRelevant 3 2.73
3 PoorTeaching 4 3.42
4 UnsuppFac 3 3.28
5 Grades 3 3.02
6 Sched 4 4.00
7 ClassTooBig 2 2.52
8 BadAdvising 2 2.33
9 FinAid 4 3.83
10 OverallValue 4 4.11