Rows: 33,524
Columns: 24
$ Year <int> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 201…
$ ID <chr> "xuojqdfdozvu", "vvwkinqvnibo", "ibyjcmiopiqk", "lsqamawy…
$ NPS <int> 8, 8, 8, 8, 8, 8, 8, 4, 4, 7, 6, 6, 6, 4, 7, 4, 6, 7, 4, …
$ Field <chr> "Undecl", "SocSci", "CompSci", "Other", "SocSci", "Undecl…
$ ClassLevel <chr> "Sr", "Sr", "Sr", "Sr", "Sr", "Sr", "Sr", "Fresh", "Fresh…
$ Status <chr> "Full-time", "Full-time", "Full-time", "Full-time", "Full…
$ Gender <chr> "Male", "Female", "Male", "Female", "Other", "Female", "M…
$ BirthYear <int> 1990, 1999, 1999, 1989, 1993, 1988, 1996, 1991, 1990, 198…
$ FinPL <chr> "No", "No", "No", "No", "Yes", "Yes", "No", "No", "No", "…
$ FinSch <chr> "No", "No", "No", "Yes", "Yes", "Yes", "Yes", "No", "No",…
$ FinGov <chr> "No", "Yes", "Yes", "", "No", "No", "No", "No", "No", "No…
$ FinSelf <chr> "No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No",…
$ FinPar <chr> "Yes", "No", "", "Yes", "No", "No", "No", "No", "No", "Ye…
$ FinOther <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No…
$ TooDifficult <chr> "Disagree", "Strongly Disagree", "Disagree", "Agree", NA,…
$ NotRelevant <chr> "Strongly Disagree", "Disagree", NA, "Strongly Disagree",…
$ PoorTeaching <chr> "Agree", "Agree", "Agree", "Disagree", "Agree", "Strongly…
$ UnsuppFac <chr> NA, "Neutral", "Neutral", "Neutral", NA, "Neutral", "Stro…
$ Grades <chr> NA, "Disagree", "Strongly Disagree", NA, NA, "Agree", NA,…
$ Sched <chr> "Strongly Agree", "Strongly Disagree", "Strongly Disagree…
$ ClassTooBig <chr> "Neutral", NA, "Strongly Disagree", "Strongly Disagree", …
$ BadAdvising <chr> "Disagree", "Disagree", NA, "Disagree", "Strongly Disagre…
$ FinAid <chr> "Strongly Agree", "Strongly Agree", "Strongly Agree", "Ag…
$ OverallValue <chr> "Strongly Agree", "Strongly Agree", "Neutral", "Strongly …
Case study: Retention survey
We are now going to return to our retention survey case study. We have imported the dataset and examined it. At this stage we are going to standardize the representation of NA
values throughout the data frame.
1 Get a few details on the survey data frame
While working on all of the following, you can use the following information to remind yourself what you’re working on.
Name | survey |
Number of rows | 33524 |
Number of columns | 24 |
_______________________ | |
Column type frequency: | |
character | 21 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
ID | 0 | 1.00 | 12 | 12 | 0 | 33524 | 0 |
Field | 0 | 1.00 | 0 | 9 | 12 | 16 | 0 |
ClassLevel | 0 | 1.00 | 2 | 5 | 0 | 5 | 0 |
Status | 0 | 1.00 | 0 | 9 | 627 | 4 | 0 |
Gender | 0 | 1.00 | 0 | 6 | 875 | 4 | 0 |
FinPL | 0 | 1.00 | 0 | 3 | 1674 | 3 | 0 |
FinSch | 0 | 1.00 | 0 | 3 | 1615 | 3 | 0 |
FinGov | 0 | 1.00 | 0 | 3 | 1586 | 3 | 0 |
FinSelf | 0 | 1.00 | 0 | 3 | 1622 | 3 | 0 |
FinPar | 0 | 1.00 | 0 | 3 | 1699 | 3 | 0 |
FinOther | 0 | 1.00 | 0 | 3 | 1699 | 3 | 0 |
TooDifficult | 4999 | 0.85 | 0 | 17 | 1646 | 6 | 0 |
NotRelevant | 4957 | 0.85 | 0 | 17 | 1665 | 6 | 0 |
PoorTeaching | 5141 | 0.85 | 0 | 17 | 1678 | 6 | 0 |
UnsuppFac | 4994 | 0.85 | 0 | 17 | 1682 | 6 | 0 |
Grades | 4996 | 0.85 | 0 | 17 | 1641 | 6 | 0 |
Sched | 5080 | 0.85 | 0 | 17 | 1604 | 6 | 0 |
ClassTooBig | 4979 | 0.85 | 0 | 17 | 1655 | 6 | 0 |
BadAdvising | 5085 | 0.85 | 0 | 17 | 1650 | 6 | 0 |
FinAid | 4986 | 0.85 | 0 | 17 | 1690 | 6 | 0 |
OverallValue | 4961 | 0.85 | 0 | 17 | 1670 | 6 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Year | 0 | 1 | 2014.50 | 1.71 | 2012 | 2013 | 2014 | 2016 | 2017 | ▇▅▅▃▅ |
NPS | 0 | 1 | 5.95 | 1.52 | 4 | 4 | 6 | 7 | 8 | ▇▁▆▅▅ |
BirthYear | 0 | 1 | 1994.01 | 3.76 | 1988 | 1991 | 1994 | 1997 | 2000 | ▇▅▇▅▇ |
2 Work on TooDifficult
column
2.1 Current status
Let’s start with the TooDifficult
variable in the survey
dataset. What values are in the column?
TooDifficult
1 Disagree
2 Strongly Disagree
3 Agree
4 <NA>
5
6 Neutral
7 Strongly Agree
So it has 7 different values…but only five of them are what we want. Two of them, <NA>
(which is a printed representation of NA
) and the blank in row 5, are blank non-responses.
Let’s remind ourselves of the actual values at the top of this column:
TooDifficult
1 Disagree
2 Strongly Disagree
3 Disagree
4 Agree
5 <NA>
6
7 Neutral
8 Strongly Disagree
And then what does the is.na()
have to say about these values?
TooDifficult
[1,] FALSE
[2,] FALSE
[3,] FALSE
[4,] FALSE
[5,] TRUE
[6,] FALSE
[7,] FALSE
[8,] FALSE
Notice that row 5 (the <NA>
) is considered to not have a value but row 6 (the blank, empty space) is considered to have a value. Essentially, the efforts on the rest of this page is directed at removing this duality — we want all blank values to be represented by NA
and nothing else.
2.2 Some calculations on the field
Now, we’re going to do something a bit more complicated. Let’s look at the whole TooDifficult
column and compute the following:
- number of distinct values
- number of NA values
- number of not-NA values
- median response
survey |>
summarize(n = n_distinct(TooDifficult),
na = sum(is.na(TooDifficult)),
notNA = sum(!is.na(TooDifficult)),
med = median(TooDifficult,
na.rm = TRUE))
n na notNA med
1 7 4999 28525 Neutral
From the output, we can see that there are (still) 7 distinct values, almost 5000 NA
values, over 28,000 non-NA
values, and a median response of Neutral
. The trouble is that some of those 28,000 non-NA
values are actually blanks (and should be NA
).
The following is one of our favorite sets of reusable code. Other than the first line (which specifies the data frame it works on), everything else is simply a copy-and-paste. It prints out, for every column, the count of how many entries have NA-like values:
Year ID NPS Field ClassLevel Status Gender BirthYear FinPL FinSch FinGov
1 0 0 0 12 551 627 875 0 1674 1615 1586
FinSelf FinPar FinOther TooDifficult NotRelevant PoorTeaching UnsuppFac
1 1622 1699 1699 1646 1665 1678 1682
Grades Sched ClassTooBig BadAdvising FinAid OverallValue
1 1641 1604 1655 1650 1690 1670
All but four columns have these NA
-like values.
The following shows what this looks like. You can see an actual NA
value (<NA>
) in row 5 plus a blank value in row 6:
3 Creating a standardized representation of NA
in one column
We are now going to try to get a standardized representation of the NA
values in the column. Here is the command:
Let’s take a moment to interpret the above. Here is the form of these two mutate()
statements:
In the first case, we’re making the following substitutions:
new-col-name
: we’re keeping the column name asTooDifficult
old-col-name
: we’re starting with the values that are currently inTooDifficult
condition
: we’re testing ifTooDifficult
has the value""
(empty string)new-value-if-true
: we’re replacing the original value withNA
(the not available indicator)
Thus, we’re replacing all ""
and "NA"
values in TooDifficult
with NA
.
4 Learning about all the columns
Let’s build a statement that can tell us about all of the columns.
The first statement uses the map_chr(x)
function — it applies the x
function to all the columns, one after another. Consider the following that applies the class
function to each column in survey
. What we should get is the class of each column.
Year ID NPS Field ClassLevel Status
"integer" "character" "integer" "character" "character" "character"
Gender BirthYear FinPL FinSch FinGov FinSelf
"character" "integer" "character" "character" "character" "character"
FinPar FinOther TooDifficult NotRelevant PoorTeaching UnsuppFac
"character" "character" "character" "character" "character" "character"
Grades Sched ClassTooBig BadAdvising FinAid OverallValue
"character" "character" "character" "character" "character" "character"
The following statement has the same form as the above, but it applies the n_distinct
function which returns the number of distinct values in that column.
Year ID NPS Field ClassLevel Status
6 33524 4 16 5 4
Gender BirthYear FinPL FinSch FinGov FinSelf
4 13 3 3 3 3
FinPar FinOther TooDifficult NotRelevant PoorTeaching UnsuppFac
3 3 6 7 7 7
Grades Sched ClassTooBig BadAdvising FinAid OverallValue
7 7 7 7 7 7
Now we want to create a statement that returns both of these values. In the following in line 1 we the name of the first column in survey
(Year
) to the variable .x
. (You will see why we used the .x
variable name in a moment.) Lines 2 and 3 create a data frame with 2 columns (named n_distinct
and class
) with 1 row (associated with the first column):
n_distinct class
1 6 integer
Now we are putting some pieces together. In this case, we use the R function map_df
to apply a function to every column in survey
.
n_distinct class
1 6 integer
2 33524 character
3 4 integer
4 16 character
5 5 character
6 4 character
7 4 character
8 13 integer
9 3 character
10 3 character
11 3 character
12 3 character
13 3 character
14 3 character
15 6 character
16 7 character
17 7 character
18 7 character
19 7 character
20 7 character
21 7 character
22 7 character
23 7 character
24 7 character
But what is the function? Above, we had simple built-in R functions like class
and n_distinct
. Here it is more complicated. It is whole first argument of map_df
:
This is a strange-looking statement. It helps to know that, instead of a function (as we saw above), R accepts a formula with the form ~(stuff)
, where stuff
is a valid R formula that uses .x
to refer to each successive column name. We know what this data.frame
command does because we applied it to the Year
column above.
The results look great, but it would be better if it identified the name of the column. We can do this by adding the .id
argument to the map_df()
function:
variable n_distinct class
1 Year 6 integer
2 ID 33524 character
3 NPS 4 integer
4 Field 16 character
5 ClassLevel 5 character
6 Status 4 character
7 Gender 4 character
8 BirthYear 13 integer
9 FinPL 3 character
10 FinSch 3 character
11 FinGov 3 character
12 FinSelf 3 character
13 FinPar 3 character
14 FinOther 3 character
15 TooDifficult 6 character
16 NotRelevant 7 character
17 PoorTeaching 7 character
18 UnsuppFac 7 character
19 Grades 7 character
20 Sched 7 character
21 ClassTooBig 7 character
22 BadAdvising 7 character
23 FinAid 7 character
24 OverallValue 7 character
We can see (in row 15) that TooDifficult
has 6 distinct values, one less than the columns named in rows 16–17 because of the mutate()
commands we executed above.
5 Work on FinPL
column
Let’s just take a moment to verify the information about one of the columns. We’ll focus on FinPL
, the 9th column in the table.
5.1 Distinct values
So, yes, it has 3 distinct values in the column (as shown in the statement above on row 9). One of those distinct values is a blank (but not a NA
).
5.2 Number of NA
values
The column has no NA
values in it though we know from above that it does have NA
-like values.
5.3 Number of non-NA
values
And, for completeness, we can see that it has lots of non-NA
values.
6 Selecting distinct
values for two columns
This command returns the distinct
values based on a combination of the columns! We know (from above) that each of these two columns has 3 distinct variables; however, when asking for the distinct combination of the two, it will display 9 (3x3) and not 6 (3+3) rows in the result:
7 Changing values across many columns
We saw above how much work that it takes to make it so one column (in that case, TooDifficult
) has a standardized representation of the NA
value. We don’t want to have to do that for all of the columns separately — we want to take care of all of them at once.
In addition to the already-understood mutate()
command, we have several pieces to unpack here: across()
, c()
, and ~case_when()
. Let’s take a look at each one of them in some detail:
across(Columns, Function)
-
When used within a verb such as
select()
ormutate()
, theacross()
function applies the verb across the columns specified byColumns
and executes theFunction
. We discuss both the specification of the columns and the definition of the function below. Read this page for more information aboutacross()
. c(RangeOfCols1, RangeOfCols2)
-
In this case, the whole set of columns is the joining together of two lists: 1) the columns from
Field
toGender
, and 2) the columns fromFinPL
toOverallValue
. We could have simply listed all of the fields within ac()
, but the syntaxA:B
specifies a list of the fields (as they appear in the data frame).
We are leaving out ID
because it has one unique non-NA
-like per row. We are leaving out Year
, NPS
, and BirthYear
because they are integer columns with no missing values.
As you might guess, there is no limit to the number of fields or range of fields that you can include in the c()
function.
~case_when(Condition1 ~ Val1, Condition2 ~ Val2, ... Default)
-
First, realize that this is the function that is applied to each of the columns specified above in the
c()
. Second, let’s define how this is read. If you have programmed before, this is the tidyverse’s version of anif/else
orswitch
statement:
if Condition1 is true:
return Val1
else if Condition2 is true:
return Val2
else if ...
...
else:
return Default
The conditions are evaluated in order. The value is returned for the first condition that is true. This is what we have in this case:
You can see that this case_when()
function is cleaning up the column it is working on so that it has a standard representation of an NA
-like value.
8 Checking for NA
-like values across all columns
We’re now going to define a statement that will help us see if there are any NA
-like values remaining in any column.
The statement that we are going to build uses the everything()
function within the across()
. You might recall that we used the c()
function to build a set of columns for mutate()
. In this case, we don’t want to use a strict subset of columns—we want to work on all of the columns. That is what everything()
is for.
In this case we are also going to use summarize()
separately from any group_by()
statement. Let’s take a look at an example that uses a function that we’re familiar with. In this case, the statement is telling R to calculate (summarize()
) the maximum value (max()
) for all columns (across(everything(), ...)
) while ignoring all NA
values (na.rm = TRUE
):
Year ID NPS Field ClassLevel Status Gender BirthYear FinPL
1 2017 zzzyiijteasl 8 Undecl Sr Part-time Other 2000 Yes
FinSch FinGov FinSelf FinPar FinOther TooDifficult NotRelevant
1 Yes Yes Yes Yes Yes Strongly Disagree Strongly Disagree
PoorTeaching UnsuppFac Grades Sched
1 Strongly Disagree Strongly Disagree Strongly Disagree Strongly Disagree
ClassTooBig BadAdvising FinAid OverallValue
1 Strongly Disagree Strongly Disagree Strongly Disagree Strongly Disagree
Here is another example. The pluck(Col, Num)
function can be used to access the Num
item in column Col
. You would expect the following to print the first row of the survey
data frame:
Year ID NPS Field ClassLevel Status Gender BirthYear FinPL
1 2012 xuojqdfdozvu 8 Undecl Sr Full-time Male 1990 No
FinSch FinGov FinSelf FinPar FinOther TooDifficult NotRelevant
1 No No No Yes No Disagree Strongly Disagree
PoorTeaching UnsuppFac Grades Sched ClassTooBig BadAdvising
1 Agree <NA> <NA> Strongly Agree Neutral Disagree
FinAid OverallValue
1 Strongly Agree Strongly Agree
That is, the above should be the same as the following:
Year ID NPS Field ClassLevel Status Gender BirthYear FinPL
1 2012 xuojqdfdozvu 8 Undecl Sr Full-time Male 1990 No
FinSch FinGov FinSelf FinPar FinOther TooDifficult NotRelevant
1 No No No Yes No Disagree Strongly Disagree
PoorTeaching UnsuppFac Grades Sched ClassTooBig BadAdvising
1 Agree <NA> <NA> Strongly Agree Neutral Disagree
FinAid OverallValue
1 Strongly Agree Strongly Agree
And it is.
Now, having a better idea about summarize()
, across()
, and everything()
, we can now discuss the statement that we are going to use:
Year ID NPS Field ClassLevel Status Gender BirthYear FinPL FinSch FinGov
1 0 0 0 0 0 0 0 0 0 0 0
FinSelf FinPar FinOther TooDifficult NotRelevant PoorTeaching UnsuppFac
1 0 0 0 0 0 0 0
Grades Sched ClassTooBig BadAdvising FinAid OverallValue
1 0 0 0 0 0 0
All that remains now to understand this statement is to understand what the sum()
function is doing. For each column, it is determining if a value is in the vector c("NA", "--", "")
. Suppose that 500 of the 30,000 values in a certain column had one of those three values. Well, A %in% B
returns TRUE
or FALSE
. The value of TRUE
is 1
and the value of FALSE
is 0
. (If you didn’t know this, then now you do.) Thus, the sum()
function would be looking at a set of 29,500 FALSE
values and 500 TRUE
values. In this case, the sum()
function would return 500
.
Now we can look at the result of the above statement and see that there are zero NA
-like values in any column.
9 Validating the effects on some columns
Let’s now take a look at the actual values in some of the columns in survey
. We’ll start with ClassLevel
.
And now FinPL
. Before starting all of the processing, this column had three distinct values plus hundreds of empty values.
But now all of the blanks (""
) have been converted to NA
values.
The following two commands calculate how many NA
and non-NA
values are in FinPL
:
Instead of listing the distinct values in a column, we also have the option of simply displaying a few rows of that column:
10 Print first 6 distinct values for every column
The previous section demonstrated how to learn about the distinct values in a column; however, this would be a lot of work to go through for each of the columns in the survey
table. Fortunately, R and the tidyverse gives us several ways to accomplish this with much less work.
In the following, we provide several different approaches and explain each one of them. Note that you can use any one of them (as shown in Listing 1, Listing 2, and Listing 3) even if you only have a minimal understanding of the code. We try to explain each of these to help you build your understanding of R and the tidyverse.
10.1 Applying the map()
function
We start with the simplest to run and the simplest to understand. First, though, we want to remind you how to use distinct()
. Here is a command that prints no more than 6 distinct values in the Year
column:
We are using the head()
command because, as we know, some of these columns have many thousands of distinct values. We don’t need to see all of them.
We are going to use the map()
command to create a single command that prints the distinct values in each one of the columns. Let’s first remind ourselves how to read the map()
command:
The list in this case will be names(survey)
, a function that will return a list of the column names in survey
.
The more complicated portion is the ~some-statement-on-.x
. The .x
will successively be the names of the columns. This raises the question: In the tidyverse, how do we refer to a specific column of data only given the column name? This is accomplished using .data[[.x]]
. (Not an obvious answer…but the answer, nevertheless.) Thus, we put all of this together and get Listing 1:
[[1]]
Year
1 2012
2 2013
3 2014
4 2015
5 2016
6 2017
[[2]]
ID
1 xuojqdfdozvu
2 vvwkinqvnibo
3 ibyjcmiopiqk
4 lsqamawyancj
5 zrpkydjpltkc
6 mdoqvaalcscx
[[3]]
NPS
1 8
2 4
3 7
4 6
[[4]]
Field
1 Undecl
2 SocSci
3 CompSci
4 Other
5 Eng
6 OthHealth
[[5]]
ClassLevel
1 Sr
2 Fresh
3 Jr
4 Soph
5 <NA>
[[6]]
Status
1 Full-time
2 Part-time
3 Other
4 <NA>
[[7]]
Gender
1 Male
2 Female
3 Other
4 <NA>
[[8]]
BirthYear
1 1990
2 1999
3 1989
4 1993
5 1988
6 1996
[[9]]
FinPL
1 No
2 Yes
3 <NA>
[[10]]
FinSch
1 No
2 Yes
3 <NA>
[[11]]
FinGov
1 No
2 Yes
3 <NA>
[[12]]
FinSelf
1 No
2 Yes
3 <NA>
[[13]]
FinPar
1 Yes
2 No
3 <NA>
[[14]]
FinOther
1 No
2 Yes
3 <NA>
[[15]]
TooDifficult
1 Disagree
2 Strongly Disagree
3 Agree
4 <NA>
5 Neutral
6 Strongly Agree
[[16]]
NotRelevant
1 Strongly Disagree
2 Disagree
3 <NA>
4 Neutral
5 Strongly Agree
6 Agree
[[17]]
PoorTeaching
1 Agree
2 Disagree
3 Strongly Disagree
4 Strongly Agree
5 <NA>
6 Neutral
[[18]]
UnsuppFac
1 <NA>
2 Neutral
3 Strongly Disagree
4 Strongly Agree
5 Agree
6 Disagree
[[19]]
Grades
1 <NA>
2 Disagree
3 Strongly Disagree
4 Agree
5 Neutral
6 Strongly Agree
[[20]]
Sched
1 Strongly Agree
2 Strongly Disagree
3 Agree
4 <NA>
5 Disagree
6 Neutral
[[21]]
ClassTooBig
1 Neutral
2 <NA>
3 Strongly Disagree
4 Strongly Agree
5 Agree
6 Disagree
[[22]]
BadAdvising
1 Disagree
2 <NA>
3 Strongly Disagree
4 Neutral
5 Strongly Agree
6 Agree
[[23]]
FinAid
1 Strongly Agree
2 Agree
3 <NA>
4 Strongly Disagree
5 Neutral
6 Disagree
[[24]]
OverallValue
1 Strongly Agree
2 Neutral
3 Disagree
4 Agree
5 <NA>
6 Strongly Disagree
So, the above prints out up to 6 distinct values for each column. Which is what we wanted.
10.2 Applying the map()
function to return a data frame plus column names
The above works. Definitely. But there are times when you might not want to have 24 print-outs one after the other. What if we wanted one column per original column and one row per distinct value — in this case a 24 by 6 data frame?
That’s what we’re going to define here (as shown in Listing 2). But it takes a lot of patience and an understanding of functions in order to understand what is going to be done. Note that you can use the following in your work without understanding any of it. All you have to do change the word survey
to the name of your own data frame.
Year ID NPS Field ClassLevel Status Gender BirthYear FinPL
1 2012 aaapjamvmsgd 4 ArchUP Fresh Full-time Female 1988 No
2 2013 aacgcpuqbhue 6 ArtsHum Jr Other Male 1989 Yes
3 2014 aacgctmukgti 7 Bus Soph Part-time Other 1990 <NA>
4 2015 aacukkmcqpdr 8 CompSci Sr <NA> <NA> 1991 <NA>
5 2016 aacuutwvloft NA Ed <NA> <NA> <NA> 1992 <NA>
6 2017 aadtesmehzsd NA Eng <NA> <NA> <NA> 1993 <NA>
FinSch FinGov FinSelf FinPar FinOther TooDifficult NotRelevant
1 No No No No No Agree Agree
2 Yes Yes Yes Yes Yes Disagree Disagree
3 <NA> <NA> <NA> <NA> <NA> Neutral Neutral
4 <NA> <NA> <NA> <NA> <NA> Strongly Agree Strongly Agree
5 <NA> <NA> <NA> <NA> <NA> Strongly Disagree Strongly Disagree
6 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
PoorTeaching UnsuppFac Grades Sched
1 Agree Agree Agree Agree
2 Disagree Disagree Disagree Disagree
3 Neutral Neutral Neutral Neutral
4 Strongly Agree Strongly Agree Strongly Agree Strongly Agree
5 Strongly Disagree Strongly Disagree Strongly Disagree Strongly Disagree
6 <NA> <NA> <NA> <NA>
ClassTooBig BadAdvising FinAid OverallValue
1 Agree Agree Agree Agree
2 Disagree Disagree Disagree Disagree
3 Neutral Neutral Neutral Neutral
4 Strongly Agree Strongly Agree Strongly Agree Strongly Agree
5 Strongly Disagree Strongly Disagree Strongly Disagree Strongly Disagree
6 <NA> <NA> <NA> <NA>
Let’s build up our understanding of the above in pieces.
- Lines 1–2
-
A comment. Anything that comes after a hashtag (#) is not evaluated. Thus, all of lines 1–2 are simply explanatory text that R ignores.
- Lines 15–17
-
Start with the
survey
data frame. Usemap()
to applyget_distinct_and_pad()
function to each column separately. Use theas.data.frame()
function to join all the separate columns into one unified data table.
Now, what is this get_distinct_and_pad()
function?
- Lines 3–13
-
This is a full-fledged R function that we have built for just this scenario. Let’s go through the process from beginning to end.
This is the format of an R function:
You can see that the function get_distinct_and_pad()
fits this template. First, name-of-fct
is equal to get_distinct_and_pad
. Let’s look at the values for possibly-some-args
:
column
: This is a vector of values (which we know is what every column in a data frame actually is).numvals = 6
: This second argument is an optional argument. If you leave it off when calling the function, thennumvals
is assumed to have the value6
.
Thus, this function can be called with, e.g., get_distinct_and_pad(some-df-col)
. However, if you look at line 16 above, you will see that it is actually called with map(get_distinct_and_pad)
. This is the magic of tidyverse combined with R’s map()
function:
Since
map(get_distinct_and_pad)
is to the right ofsurvey
, this is actuallymap(survey, get_distinct_and_pad)
. And the way thatmap()
works is that each column ofsurvey
in succession will be passed toget_distinct_and_pad()
— i.e., multiple distinct calls ofget_distinct_and_pad(col-of-survey)
.
Now let’s build up our understanding of one-or-more-r-statements
.
- Lines 4–8 (first statement in the body of the function)
-
Let’s take this line-by-line:
tibble(value=column)
: Thecolumn
here is the argument passed into the function — i.e., a column from thesurvey
table. This defines a tibble (a data-frame like object that the tidyverse works well with) with a single column with the namevalue
and the values specified incolumn
.distinct(value)
: This returns a vector of values that contains the distinct (or unique) values contained in thevalue
column.arrange(value): This returns a sorted vector of the distinct values (which were passed to it by the previous statement) from the
value` column.slice_head(n = numvals)
: This returns the firstn = numvals
rows from the sorted vector of distinct values from thevalue
column. If there are fewer thannumvals
elements, then the vector is filled withNA
values.pull(value)
: This returns a vector the same size as the input — i.e., 6 elements.
At this point in the function, distinct_values
holds a vector of values of size numvals
(i.e., 6 in this case) that contains the distinct values from column
(i.e., a specific column in the survey
data frame).
- Line 11–12
-
Now we get to a little magic and then some R plumbing:
length(distinct_values) <- numvals
: This is a bit of magic that sets the length of thedistinct_values
vector to benumvals
long.return(distinct_values)
: This is the statement that tells R to return the value ofdistinct_values
wheneverget_distinct_and_pad()
is called.
10.3 Using the for
command to print the distinct values
Above, we defined two separate approaches to printing distinct values that use the map()
function. Here we define a third and final approach that uses the for
statement.
We see some old friends here:
names(survey)
: generates a list of the names of the columns ofsurvey
for (var in names(survey))
: (this is new) iterates through the names of the columns and assignsvar
to each one in succession.data[[var]]
: returns the column named byvar
from thesurvey
data frame (sincesurvey
is on the left-hand-side of the pipe)
So, given all of this, the for
statement steps through each column and prints the top 6 distinct values from each column.
Year
1 2012
2 2013
3 2014
4 2015
5 2016
6 2017
ID
1 xuojqdfdozvu
2 vvwkinqvnibo
3 ibyjcmiopiqk
4 lsqamawyancj
5 zrpkydjpltkc
6 mdoqvaalcscx
NPS
1 8
2 4
3 7
4 6
Field
1 Undecl
2 SocSci
3 CompSci
4 Other
5 Eng
6 OthHealth
ClassLevel
1 Sr
2 Fresh
3 Jr
4 Soph
5 <NA>
Status
1 Full-time
2 Part-time
3 Other
4 <NA>
Gender
1 Male
2 Female
3 Other
4 <NA>
BirthYear
1 1990
2 1999
3 1989
4 1993
5 1988
6 1996
FinPL
1 No
2 Yes
3 <NA>
FinSch
1 No
2 Yes
3 <NA>
FinGov
1 No
2 Yes
3 <NA>
FinSelf
1 No
2 Yes
3 <NA>
FinPar
1 Yes
2 No
3 <NA>
FinOther
1 No
2 Yes
3 <NA>
TooDifficult
1 Disagree
2 Strongly Disagree
3 Agree
4 <NA>
5 Neutral
6 Strongly Agree
NotRelevant
1 Strongly Disagree
2 Disagree
3 <NA>
4 Neutral
5 Strongly Agree
6 Agree
PoorTeaching
1 Agree
2 Disagree
3 Strongly Disagree
4 Strongly Agree
5 <NA>
6 Neutral
UnsuppFac
1 <NA>
2 Neutral
3 Strongly Disagree
4 Strongly Agree
5 Agree
6 Disagree
Grades
1 <NA>
2 Disagree
3 Strongly Disagree
4 Agree
5 Neutral
6 Strongly Agree
Sched
1 Strongly Agree
2 Strongly Disagree
3 Agree
4 <NA>
5 Disagree
6 Neutral
ClassTooBig
1 Neutral
2 <NA>
3 Strongly Disagree
4 Strongly Agree
5 Agree
6 Disagree
BadAdvising
1 Disagree
2 <NA>
3 Strongly Disagree
4 Neutral
5 Strongly Agree
6 Agree
FinAid
1 Strongly Agree
2 Agree
3 <NA>
4 Strongly Disagree
5 Neutral
6 Disagree
OverallValue
1 Strongly Agree
2 Neutral
3 Disagree
4 Agree
5 <NA>
6 Strongly Disagree
11 Factors
We are now going to use a lot of information from the listings generated in the previous section to define the factors for the survey
data frame. Go back and look at the ordering of some of the fields — for example, look at ClassLevel
, FinPar
, FinOther
, TooDifficult
, and NotRelevant
. Each of these has a natural ordering, and we don’t see that here. Further, why aren’t FinPar
and FinOther
in the same order? And why aren’t TooDifficult
and NotRelevant
in the same order? When we look at these columns during analysis, it would make it easier for us if the values were presented in a consistent ordering.
That’s what we can make happen by defining factors. We are first going to define the factors (in the next section) and then validate the work that we did (in the section following that one).
11.1 Defining the factors
Consider the following code for defining the factors. Below, we dissect what is going on.
fin_cols <- c("FinPL", "FinSch", "FinGov",
"FinSelf", "FinPar", "FinOther")
exp_cols <- c("TooDifficult", "NotRelevant", "PoorTeaching",
"UnsuppFac", "Grades", "Sched", "ClassTooBig",
"BadAdvising", "FinAid", "OverallValue")
fin_levels <- c("Yes", "No")
exp_levels <- c("Strongly Disagree", "Disagree", "Neutral", "Agree",
"Strongly Agree")
class_levels <- c("Fresh", "Soph", "Jr", "Sr")
survey <-
survey |>
mutate(across(all_of(exp_cols),
~factor(.x,
levels = exp_levels,
ordered = TRUE)))
survey <-
survey |>
mutate(across(all_of(fin_cols),
~factor(.x,
levels = fin_levels,
ordered = TRUE)))
survey <-
survey |>
mutate(ClassLevel = factor(ClassLevel,
levels = class_levels,
ordered = TRUE))
survey <-
survey |>
mutate(Status = factor(Status,
levels = c("Full-time", "Part-time",
"Other")))
survey <-
survey |>
mutate(Gender = factor(Gender,
levels = c("Female", "Male", "Other")))
- Lines 1–2 (define
fin_cols
) -
These six columns are grouped together because each column can take on the values of either
"Yes"
or"No"
. We define this vector because we will treat them all similarly. - Lines 3–5 (define
exp_cols
) -
These ten columns are grouped together for the same reason — they each can have the same values (as shown on the next two lines).
- Line 6–9 (define levels for the factors below)
-
These lines define the possible levels (or values) for the three types of factors that we define below.
- Lines 11–16 (define factors for
exp_cols
) -
This statement tells R to change all of the columns listed in
exp_cols
so that they are a factor with possible values (in order) as shown byexp_levels
. - Lines 17–22 (define factors for
fin_cols
) -
This statement tells R to change all of the columns listed in
fin_cols
so that they are a factor with possible values (in order) as shown byfin_levels
. - Lines 23–27 (define
ClassLevel
as a factor) -
This changes
ClassLevel
so that it is a factor with possible values (in order) as shown byclass_levels
. - Lines 28–32 (define
Status
as a factor) -
This changes
Status
so that it is a factor with possible values (in order) indicated by thelevels
argument.
Note that we know what all the possible levels are for each of the factors because of all the work that we have done standardizing our representation of NA
values and then listing all of the distinct values in the columns.
11.2 Validating the factor creation process
We would like to think that the work we do is correct the first time. (Oh, if it were only true…) Given that it isn’t, you should absolutely run the following (which we defined above) after you have defined all of your factors. Then, after running it, you should examine the results to ensure that all of the columns look right (correct distinct values listed in the right order).
get_distinct_and_pad <- function(column, numvals = 6) {
distinct_values <- tibble(value = column) |>
distinct(value) |>
arrange(value) |>
slice_head(n = numvals) |>
pull(value)
# Pad with NA if there are less than n values
length(distinct_values) <- numvals
return(distinct_values)
}
survey |>
map(get_distinct_and_pad) |>
as.data.frame()
Year ID NPS Field ClassLevel Status Gender BirthYear FinPL
1 2012 aaapjamvmsgd 4 ArchUP Fresh Full-time Female 1988 Yes
2 2013 aacgcpuqbhue 6 ArtsHum Soph Part-time Male 1989 No
3 2014 aacgctmukgti 7 Bus Jr Other Other 1990 <NA>
4 2015 aacukkmcqpdr 8 CompSci Sr <NA> <NA> 1991 <NA>
5 2016 aacuutwvloft NA Ed <NA> <NA> <NA> 1992 <NA>
6 2017 aadtesmehzsd NA Eng <NA> <NA> <NA> 1993 <NA>
FinSch FinGov FinSelf FinPar FinOther TooDifficult NotRelevant
1 Yes Yes Yes Yes Yes Strongly Disagree Strongly Disagree
2 No No No No No Disagree Disagree
3 <NA> <NA> <NA> <NA> <NA> Neutral Neutral
4 <NA> <NA> <NA> <NA> <NA> Agree Agree
5 <NA> <NA> <NA> <NA> <NA> Strongly Agree Strongly Agree
6 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
PoorTeaching UnsuppFac Grades Sched
1 Strongly Disagree Strongly Disagree Strongly Disagree Strongly Disagree
2 Disagree Disagree Disagree Disagree
3 Neutral Neutral Neutral Neutral
4 Agree Agree Agree Agree
5 Strongly Agree Strongly Agree Strongly Agree Strongly Agree
6 <NA> <NA> <NA> <NA>
ClassTooBig BadAdvising FinAid OverallValue
1 Strongly Disagree Strongly Disagree Strongly Disagree Strongly Disagree
2 Disagree Disagree Disagree Disagree
3 Neutral Neutral Neutral Neutral
4 Agree Agree Agree Agree
5 Strongly Agree Strongly Agree Strongly Agree Strongly Agree
6 <NA> <NA> <NA> <NA>
Let’s take a look:
Year
: in numeric orderID
: in alphabetic order (these are random values in any case, so the ordering really doesn’t matter)NPS
: in numeric orderField
: in alphabetic order (and we haven’t defined an ordering for these, so this is fine)ClassLevel
–Gender
: in the order that we defined with the factorBirthYear
: in numeric orderFinPL
–FinOther
: in the order that we defined with the factor; all have the same levels in the same orderTooDifficult
–OverallValue
: in the order that we defined with the factor; all have the same levels in the same order
Given all of the above, it appears that we have defined the factors correctly and standardized the representation of NA
-like values.