Missing data functions

1 The problem

Data is rarely clean, perfect, and ready to use. You almost always have to clean it before you can use it, and both the problem and the solution depend on the specific data and context.

2 Varying solutions

  • Need to understand the scope of the problem, both on a row-basis and on a column-basis.
  • Need to ensure R can easily and consistently detect the missing data. An empty string, for example, is not NA. Address this first.
  • Need to come up with a separate policy for handling missing data for each column and for each row. Do this before you continue.
  • Address each column according to its policy.
  • After addressing each column, then—and only then—convert the survey data from wide to long (if you have to do so).
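
The first few steps above can be sketched in base R as follows. This is a minimal illustration, not the survey's actual cleaning code: the data frame toy, its values, the helper blank_to_na(), and the stand-ins for missing data (empty strings, "N/A", and -99) are all assumptions made up for the example.

```r
# Hypothetical survey extract; names, values, and sentinel codes are assumptions.
toy <- data.frame(
  ID     = c("a1", "b2", "c3", "d4"),
  Grades = c("Agree", "", "N/A", "Neutral"),
  FinAid = c(NA, "Disagree", "Agree", ""),
  NPS    = c(8, -99, 10, NA),
  stringsAsFactors = FALSE
)

# Step 1: recode every stand-in for "missing" as a real NA so is.na() can see it.
blank_to_na <- function(col) {
  if (is.character(col)) col[col %in% c("", "N/A")] <- NA
  col
}
toy[] <- lapply(toy, blank_to_na)   # toy[] keeps the data frame structure
toy$NPS[toy$NPS %in% -99] <- NA     # assumed numeric sentinel

# Step 2: measure the scope of the problem by column and by row.
na_per_col <- colSums(is.na(toy))
na_per_row <- rowSums(is.na(toy))
na_per_col
na_per_row
```

Recoding first matters: before step 1, is.na() would count only the two values that were already NA and would miss the blanks and sentinel codes.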

3 Functions to help deal with missing values

When doing your analyses, you will want to be clear about the following:

  • How prevalent NA and NaN values are, and
  • How you want to handle NA and NaN values in your analysis—do you want to include them or exclude them from your calculations?
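
Both questions can be answered with a few base R calls. The sketch below uses a made-up numeric vector x (an assumption for illustration) to show how to measure prevalence and how the na.rm argument switches between including and excluding missing values.

```r
x <- c(4, NA, 7, NaN, 10)   # hypothetical measurements

# Prevalence: is.na() is TRUE for both NA and NaN; is.nan() is TRUE for NaN only.
n_missing    <- sum(is.na(x))
prop_missing <- mean(is.na(x))
n_nan        <- sum(is.nan(x))

# Including missing values: the result is itself missing.
mean_with_missing <- mean(x)

# Excluding missing values: na.rm = TRUE drops NA and NaN before averaging.
mean_without_missing <- mean(x, na.rm = TRUE)
```

Here n_missing is 2, n_nan is 1, and mean_without_missing is the average of 4, 7, and 10.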

Sometimes you want to examine a list or a data frame to see whether it contains missing values. Let’s quickly define one of each and test them with is.na(), is.nan(), and skim():

lst <- list("a", 3, TRUE, FALSE, NA, NaN, 0/0)
is.na(lst)
[1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
# is.nan() flags only NaN (not NA); it has no method for lists, so apply it element by element
sapply(lst, function(x) ifelse(is.nan(x), NaN, x))
[1] "a"     "3"     "TRUE"  "FALSE" NA      "NaN"   "NaN"  
df <- data.frame(a=c(1, 2, 3), b=c(5, NA, NaN))
is.na(df)
         a     b
[1,] FALSE FALSE
[2,] FALSE  TRUE
[3,] FALSE  TRUE
sapply(df, function(x) ifelse(is.nan(x), NaN, x))
     a   b
[1,] 1   5
[2,] 2  NA
[3,] 3 NaN
library(skimr)  # skim() comes from the skimr package
skim(df)
Data summary
Name df
Number of rows 3
Number of columns 2
_______________________
Column type frequency:
numeric 2
________________________
Group variables None

Variable type: numeric

skim_variable  n_missing  complete_rate  mean  sd  p0  p25  p50  p75  p100  hist
a                      0           1.00     2   1   1  1.5    2  2.5     3  ▇▁▇▁▇
b                      2           0.33     5  NA   5  5.0    5  5.0     5  ▁▁▇▁▁
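
If skimr is not available, or if you need NA and NaN counted separately, the same per-column figures can be computed in base R. The following is a rough sketch; the variable names n_missing, n_nan, and complete_rate are chosen here to mirror the skim() column names and are not part of any package.

```r
df <- data.frame(a = c(1, 2, 3), b = c(5, NA, NaN))

# Missing values per column (is.na() counts both NA and NaN).
n_missing <- colSums(is.na(df))

# NaN only: is.nan() works on atomic vectors, so go column by column.
n_nan <- sapply(df, function(col) sum(is.nan(col)))

# Share of non-missing values per column.
complete_rate <- colMeans(!is.na(df))

data.frame(n_missing, n_nan, complete_rate)
```

For column b this reports two missing values, one of which is NaN, and a complete rate of one third, consistent with the skim() summary.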

This is how you retrieve a specific observation (row) from a data frame:

survey[6,]
  Year           ID NPS  Field ClassLevel    Status Gender BirthYear FinPL
6 2012 mdoqvaalcscx   8 Undecl         Sr Part-time Female      1988   Yes
  FinSch FinGov FinSelf FinPar FinOther TooDifficult NotRelevant
6    Yes     No     Yes     No       No                         
       PoorTeaching UnsuppFac Grades          Sched ClassTooBig BadAdvising
6 Strongly Disagree   Neutral  Agree Strongly Agree        <NA>    Disagree
  FinAid   OverallValue
6   <NA> Strongly Agree

The anyNA(x) function reports whether x (a vector, a list, or a data frame) contains any NA values:

anyNA(lst)
[1] TRUE

Here we use it on the 6th row of survey:

anyNA(survey[6,])
[1] TRUE

The following call returns the positions of the items whose value is NA; for a data frame row, these are column numbers:

which(is.na(survey[6,]))
[1] 21 23
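
Column numbers are easier to act on once they are translated into column names. The sketch below shows one way to do that; row6 is a hypothetical four-column stand-in for survey[6,], built only so the example runs on its own.

```r
# Hypothetical one-row extract standing in for survey[6, ]
row6 <- data.frame(Sched = "Strongly Agree", ClassTooBig = NA,
                   BadAdvising = "Disagree", FinAid = NA)

missing_pos   <- which(is.na(row6))        # positions of the NA columns
missing_names <- names(row6)[missing_pos]  # look the names up by position
missing_names
```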

The following counts how many items in the row have the value NA:

sum(is.na(survey[6,]))
[1] 2
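
The same check can be run on every row at once instead of one row at a time. This is a minimal sketch using a small made-up data frame, resp (an assumption for illustration, not the survey data), together with the base R functions rowSums() and complete.cases().

```r
# Hypothetical responses; column names borrowed from the survey for readability.
resp <- data.frame(
  NPS    = c(8, NA, 10),
  Grades = c("Agree", NA, "Neutral"),
  FinAid = c(NA, "Disagree", "Agree")
)

na_per_row <- rowSums(is.na(resp))   # number of NA values in each row
complete   <- complete.cases(resp)   # TRUE where a row has no NA at all

resp[complete, ]                     # keep only the fully observed rows
```

Whether dropping incomplete rows is appropriate depends on the policy you set for the data; the counts in na_per_row let you apply a softer rule, such as keeping rows with at most one missing value.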