Intro to Functions

1 Functions in R

A “function” is a procedure that takes any number of inputs and produces a single output. You’re probably familiar with functions in Excel, like sum(...), average(...), etc. In R, functions are used in a similar way.

A function is called by its name, followed by parentheses. Inside the parentheses, you put the “arguments” (inputs) to the function.
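As a small illustration of calling a function by name with its arguments in parentheses, here is a sketch using R's built-in sum() and mean(). The vector name values is made up for this example.

```r
# A made-up vector of numbers to use as input
values <- c(4, 8, 15)

# Call each function by its name, with the argument inside the parentheses
total <- sum(values)     # adds up the elements
average <- mean(values)  # arithmetic mean of the elements

total
average
```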

2 Functions on data frames

You have already come across some R functions. As a quick reminder, here are some functions that are handy for taking a quick look at a data frame, often from the console. For more on similar functions, see this page.

These functions provide slightly different summary information:

  • summary(df): gives a summary of each column in the data frame
  • glimpse(df): a more succinct overview than summary(), showing each column’s type and first few values
  • str(df): gives the structure of the data frame

More specific functions for data frames include:

  • nrow(df): gives the number of rows in the data frame
  • ncol(df): gives the number of columns in the data frame
  • dim(df): gives the dimensions of the data frame (rows, columns)
  • head(df): shows the first few rows of the data frame
  • tail(df): shows the last few rows of the data frame
  • names(df): gives the names of the columns in the data frame
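To see what these functions return, here is a minimal sketch on a tiny data frame. The data frame toy and its columns are invented for illustration, and glimpse() is left out because it needs the dplyr package to be loaded.

```r
# A made-up data frame with 4 rows and 2 columns
toy <- data.frame(
  name = c("Ann", "Bo", "Cy", "Di"),
  gpa  = c(3.1, 2.8, 3.6, 3.3)
)

n_rows    <- nrow(toy)     # number of rows
n_cols    <- ncol(toy)     # number of columns
shape     <- dim(toy)      # rows first, then columns
col_names <- names(toy)    # column names as a character vector
first_two <- head(toy, 2)  # the second argument limits how many rows are shown

str(toy)      # structure: one line per column
summary(toy)  # summary of each column
```

Typing any of the stored names (for example shape) in the console prints its value.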

3 Named arguments

Consider some of the data in the UnivGPA column of the admitdata data table:

head(admitdata$UnivGPA)
[1] 2.468 2.731 2.670 2.779 3.250 3.185

Let’s now apply the round() function to it:

head(round(admitdata$UnivGPA))
[1] 2 3 3 3 3 3

By default, this function rounds a number (or a vector of numbers, in this case) to the nearest integer.

What if we wanted to round to the first digit after the decimal point? Or to the nearest hundred? Wouldn’t it be strange to have to create separate functions like roundtotenth() and roundtohundreds()? The creators of R thought so, too.

They addressed this by letting us add more information to the function call when it’s needed, like so:

head(round(admitdata$UnivGPA, digits = 1))
[1] 2.5 2.7 2.7 2.8 3.2 3.2

The inputs that go between the parentheses for a function are called “arguments” for historical reasons. In R, every argument has a name, as in the example with the round() function above. The second argument, digits, specifies how many decimal places we want in the result. You can see the names of a function’s arguments in its help information, which you can open by doing one of the following:

  • Typing ?function_name,
  • Putting the cursor on the function and hitting F1,
  • Using the Help tab in the lower right panel to search, or
  • Searching the internet (usually with something like “r function THEFUNCTION”)

The help information for round() shows that the names of the arguments are x and digits. The x argument is the vector of numbers to round, and digits is the number of decimal places to round to.
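Besides the help pages, base R's args() function prints a function's argument names directly in the console. A quick sketch follows; the exact printout can vary slightly between R versions, and the name arg_names is made up here.

```r
# Show the argument names (and any default values) of round()
args(round)

# Collect just the argument names as a character vector
arg_names <- names(formals(args(round)))
arg_names
```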

We can omit the names of the arguments if they appear in the right order. It would be tedious to type this all the time:

round(x = df$gpa, digits = 1)

Instead we can leave out the names, as long as the arguments are in the correct order:

round(df$gpa, 1) # this works just fine, and it's easy to understand

The following will not do what we expect because the arguments are in the wrong order. R matches unnamed arguments by position, so 1 is treated as x and df$gpa is treated as digits:

round(1, df$gpa)

If for some reason you want to give the arguments in a non-traditional order (we’re not sure why!), you can do so as long as you name them:

round(digits = 1, x = df$gpa)
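These matching rules can be checked on a concrete vector. The sketch below uses a made-up vector gpa instead of a data frame column. It shows that the named, positional, and reordered calls agree, and what the swapped call actually computes.

```r
# A made-up vector standing in for a data frame column
gpa <- c(2.468, 2.731, 3.185)

by_name     <- round(x = gpa, digits = 1)  # both arguments named
by_position <- round(gpa, 1)               # matched by position
reordered   <- round(digits = 1, x = gpa)  # named, so order does not matter

identical(by_name, by_position)  # TRUE
identical(by_name, reordered)    # TRUE

# Unnamed and swapped: x is 1 and digits is gpa, so the number 1
# gets rounded (once per element of gpa) instead of the GPAs
swapped <- round(1, gpa)
swapped
```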

4 Function composition

We can use functions within functions. This is called function composition.

4.1 The basic problem setup

Suppose you want to give a researcher student data, but want some protections on the privacy of individuals.

Let’s make up some data for this situation:

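library(tidyverse)  # provides tibble(), mutate(), select(), and summarize(); skip if already loaded
studentdata <-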
  tibble(id = sample(40000:80000, 5, replace = FALSE),
         name = c("Marcia", "Jan", "Cindy", "Alice", "Greg"),
         age = sample(7:15, 5, replace = FALSE),
         county = sample(10000:30000, 5, replace = FALSE),
         gpa = runif(5, min = 2.5, max = 3.8))
studentdata
# A tibble: 5 × 5
     id name     age county   gpa
  <int> <chr>  <int>  <int> <dbl>
1 74700 Marcia    13  12899  2.56
2 59270 Jan        7  27166  3.16
3 74001 Cindy     11  29660  2.76
4 55254 Alice     14  20077  3.59
5 75183 Greg      10  20892  3.33

We could slightly perturb GPAs by adding a small random number to each one. Here’s a way to do that.

noise = runif(5, min = -0.1, max = 0.1) 
noise
[1] -0.021255755 -0.067369856  0.029681873  0.009153463 -0.072927387

This generated 5 random numbers between -0.1 and 0.1.

We are now ready to create the data table, studentdatamask, that we will distribute to the researcher. We first add the noise vector to gpa, storing the result in gpa_noisy, and then keep only age, county, and gpa_noisy. This drops the identifying id and name columns, along with the original gpa:

studentdatamask <-
  studentdata |> 
    mutate(gpa_noisy = gpa + noise) |> 
    select(age, county, gpa_noisy)
studentdatamask
# A tibble: 5 × 3
    age county gpa_noisy
  <int>  <int>     <dbl>
1    13  12899      2.54
2     7  27166      3.10
3    11  29660      2.79
4    14  20077      3.60
5    10  20892      3.26

4.2 The need for function composition

In the previous example, we hard-coded the number 5 in four function calls when creating the studentdata tibble, and again when creating the noise vector. That code only works, of course, if the table has 5 rows. (Duh.) What about the next time we run it, when the table has 12? It would be better to determine the correct number of rows automatically, like this:

noise_flexible = runif(nrow(studentdata), 
                       min = -0.1, max = 0.1) 

The first argument to the random number generator runif is n, which is the number of random numbers to generate. Here we use the nrow() function to get the number of rows in the data frame. This way, the code will work no matter how many rows are in the data frame.
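To check that this approach adapts to the table size, here is a sketch with a made-up 12-row data frame. The names bigger and noise_bigger are hypothetical and not part of the example above.

```r
# A made-up data frame with 12 rows
bigger <- data.frame(id = 1:12,
                     gpa = seq(2.5, 3.6, length.out = 12))

# nrow() is evaluated first; its result becomes the n argument of runif()
noise_bigger <- runif(nrow(bigger), min = -0.1, max = 0.1)

# One noise value per row, whatever the number of rows
length(noise_bigger) == nrow(bigger)
```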

Most commonly, we will use functions to transform data within a call to mutate() or summarize(), as in the following:

studentdata |> 
  summarize(N = n(),
            mean_gpa = round(mean(gpa), 2),
            sd_gpa = round(sd(gpa), 2),
            SE_gpa = round(sd_gpa / sqrt(N), 2),
            min_gpa = min(gpa),
            max_gpa = max(gpa))
# A tibble: 1 × 6
      N mean_gpa sd_gpa SE_gpa min_gpa max_gpa
  <int>    <dbl>  <dbl>  <dbl>   <dbl>   <dbl>
1     5     3.08   0.42   0.19    2.56    3.59

Look closely at the code to see where functions are nested within other functions. One annoyance is making sure the parentheses are balanced. The RStudio IDE helps with this by highlighting the matching parenthesis when you put the cursor on one of them. (You can also turn on “rainbow” mode, which colors parentheses differently depending on their nesting level. To do that, go to Tools -> Global Options -> Code -> Display -> Show rainbow parentheses.)
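If nested parentheses get hard to read, one option is to unpack the nesting into intermediate steps. Here is a sketch in base R, using a made-up vector that copies the rounded GPA values printed earlier. The names se_nested, spread, and se_stepwise are invented for this example.

```r
# Made-up vector copying the rounded GPAs shown in the tibble above
gpa <- c(2.56, 3.16, 2.76, 3.59, 3.33)

# Nested version: the innermost calls are evaluated first
se_nested <- round(sd(gpa) / sqrt(length(gpa)), 2)

# The same calculation, one step at a time
n <- length(gpa)                           # number of values
spread <- sd(gpa)                          # standard deviation
se_stepwise <- round(spread / sqrt(n), 2)  # standard error, rounded

identical(se_nested, se_stepwise)  # TRUE
```

Both versions compute the same thing. The nested form is shorter, while the stepwise form makes each intermediate value easy to inspect.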

5 Creating your own functions

You can create your own functions in R. This is a powerful way to make your code more readable and reusable. Here’s an example of a simple function that takes a number and returns the square of that number.

sqrd = function(x) {
  return(x^2)
}

When you execute the code block, it stores the function definition for later use. After that you can use it like any other function:

sqrd(1:5)  
[1]  1  4  9 16 25

You can see that this function call returned the squares of 1, 2, …, 5.

Here is another new function, cubd, that takes a number and returns the cube of that number (that is, the number raised to the third power). Powers are computed with the ^ operator, as in the sqrd() function we created.

cubd = function(x) {
  return(x^3)
}
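The two functions differ only in the exponent, so they could also be folded into one function with a second, named argument. The sketch below is not part of the original examples: the name power_of and its exponent argument are made up here, and giving exponent a default value follows the same idea as round(), whose digits argument defaults to 0.

```r
# Hypothetical function: raise x to a power, squaring by default
power_of <- function(x, exponent = 2) {
  return(x^exponent)
}

squares <- power_of(1:5)                # exponent omitted, so the default 2 is used
cubes   <- power_of(1:5, exponent = 3)  # override the default by name

squares
cubes
```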

Now check out this math identity that’s kind of amazing. If we sum up the numbers 1, 2, …, N and then square the result, we get the same value as summing the cubes of the numbers 1, 2, …, N. These two calculations should give you the same result:

N = 7 # change this to whatever you like
squareofsum = sqrd(sum(1:N))           # sum up then square result
sumofcubes = sum(cubd(1:N))           # cube first then sum
paste(squareofsum, sumofcubes, sep = " = ")
[1] "784 = 784"

And so should these:

N = 25 # change this to whatever you like
squareofsum = sqrd(sum(1:N))           # sum up then square result
sumofcubes = sum(cubd(1:N))           # cube first then sum
paste(squareofsum, sumofcubes, sep = " = ")
[1] "105625 = 105625"

Note that we could write these using pipes, and it might be easier to follow:

squareofsum = 1:N |> sum() |> sqrd()       # sum up then square result
sumofcubes = 1:N |> cubd() |> sum()      # cube first then sum
paste(squareofsum, sumofcubes, sep = " = ")
[1] "105625 = 105625"
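Rather than changing N by hand, the identity can be checked for many values at once. This is a sketch that assumes the sqrd() and cubd() definitions from above (repeated here so the block runs on its own). The upper limit of 50 and the name identity_holds are arbitrary choices for this example.

```r
# Same definitions as earlier in this section
sqrd <- function(x) {
  return(x^2)
}
cubd <- function(x) {
  return(x^3)
}

# For each n from 1 to 50, compare the square of the sum with the sum of cubes
identity_holds <- sapply(1:50, function(n) {
  sqrd(sum(1:n)) == sum(cubd(1:n))
})

all(identity_holds)  # TRUE if the identity held for every n tried
```

This is only a numerical check over a finite range, not a proof of the identity.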