On this page we are going to continue using the st_info data set that was introduced on the import page.
1 Introduction
Tidy data looks like a simple spreadsheet, with columns and rows. Each column is a variable (type of information), and each row is an observation. In a tidy CSV file, the first row is the header row, which contains the names of these variables. Importing into Excel puts those at the top, where you can rename them, move them around, delete them, and so on. This arrangement of rows of data under headers (column names) is the same in R, so the basic data object, called a “data frame,” looks and acts a lot like a spreadsheet.
A major difference between spreadsheets and our work in R using data frames is that spreadsheets are designed around cell operations, which permits infinite flexibility. We can pick any cell in a spreadsheet and write a formula for a calculation using any other data in the attached sheets. This flexibility comes at the cost of complexity. While cell-wise operations are possible in R, generally we will want to work with rows and columns as a whole. The reduction in complexity allows for straightforward description and reproduction of the work we do.
So think of “tidy data” as a simplified spreadsheet, where rows correspond to observations or cases — for example, student registrations for classes. Each column tells us something different about the case. In the case of registrations, it might give us the student ID and the course ID, and maybe the instructor ID. There may be blanks where information is missing, e.g. an instructor hasn’t been assigned yet. See Chapter 12 of R for Data Science for more on tidy data.
Here’s a tidy data set created by the data.frame function, which lets us specify the column names and contents. Don’t worry about the details for now.
ID First Last GPA
1 123 Saran Wrap 3.10
2 456 Dan Zig 2.98
3 789 Frylander French 3.50
Notice that it’s a rectangle of data — that is, 3 rows of 4 columns — with cases in row and attributes in columns.
2 Renaming columns/variables
The following renames the field Application ID to become AppID. Notice that without the assignment operator (<-), the following would not have any lasting effect.
In the following, we provide several different ways that you can execute the listed actions. We will go through some more details related to different forms that the commands might take at the beginning than we do farther down in this document.
3.1 Using select
As a reminder, here’s the current state of the st_info data frame:
Suppose that you want to print just 7 of the variables from this data frame:
st_info |>select(AppID, Birthdate, St, County, Sex, Race, SAT)
# A tibble: 2,000 × 7
AppID Birthdate St County Sex Race SAT
<chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 4563269562-RODR-2021 05/26/2003 GA COWETAGA M W 1436
2 9221751846-ROEH-2021 03/31/2003 SC GREENVSC M A 1398
3 4290276249-ALLE-2021 06/23/2003 GA BULLOCGA M W 1090
4 3398780452-HILT-2021 06/29/2003 GA DEKALBGA M W 1516
5 7691897840-SMIT-2021 11/03/2002 GA WHITFIGA M O 1440
6 1862245592-SHIP-2021 10/31/2002 GA DEKALBGA M H 1438
7 2085584835-CHAM-2021 02/08/2003 SC DARLINSC M W 1452
8 3918571924-STAH-2021 06/16/2003 SC CHARLESC M W 1536
9 3023361776-ROGE-2021 09/11/2002 SC LANCASSC M B 1487
10 7386692838-SCOT-2021 05/25/2003 SC PICKENSC M H 1373
# ℹ 1,990 more rows
As promised above, we used the pipe-based method of constructing the command. We tell R to work with the st_info data frame, and then to select() seven specific variables (columns) from that data frame.
Certainly, we could have used this form of the command:
select(st_info, AppID, Birthdate, St, County, Sex, Race, SAT)
However, we like the clarity of the pipe-based method because it makes very clear—right on the first line—what data frame that you are working on…every single time.
Suppose that you want to save the above data frame into a new data frame. You would type something like this:
st_min <- st_info |>select(AppID, Birthdate, St, County, Sex, Race, SAT)
As you can see, you simply put the assignment operation before the whole pipe. Whatever would have been the result printed at the end of the pipe gets assigned to the variable on the first line. Let’s glimpse some information about st_min:
Sure enough, the st_min data frame holds 2000 rows for those seven columns that we previously specified.
3.2 Using select with minus
Another way to remove variables (columns) from a data frame is to use the “minus” operator with select. To be clear, you aren’t actually removing the variable from the original data frame unless you save the data frame—you are simply removing it from the view that is presented.
If you want to look at the st_info data frame but without the Race variable, you could type in the following:
st_info |>select(-Race)
# A tibble: 2,000 × 9
AppID Given Family Birthdate Email St County Sex SAT
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 4563269562-RODR-2021 Tenoch Rodri… 05/26/20… teno… GA COWET… M 1436
2 9221751846-ROEH-2021 Travis Roe 03/31/20… trav… SC GREEN… M 1398
3 4290276249-ALLE-2021 Axel Allen 06/23/20… axel… GA BULLO… M 1090
4 3398780452-HILT-2021 Justice Hilton 06/29/20… just… GA DEKAL… M 1516
5 7691897840-SMIT-2021 Mehran Smith 11/03/20… mehr… GA WHITF… M 1440
6 1862245592-SHIP-2021 Carter Shipl… 10/31/20… cart… GA DEKAL… M 1438
7 2085584835-CHAM-2021 Benjamin Chamb… 02/08/20… benj… SC DARLI… M 1452
8 3918571924-STAH-2021 Robert Stahl 06/16/20… robe… SC CHARL… M 1536
9 3023361776-ROGE-2021 Zaiden Rogers 09/11/20… zaid… SC LANCA… M 1487
10 7386692838-SCOT-2021 Jakob Scott 05/25/20… jako… SC PICKE… M 1373
# ℹ 1,990 more rows
You can see that it now displays nine columns—all of the original columns except for the Race column.
If you want to create a new data frame that is the same as the st_info data frame but without the Race variable, then do the following:
st_min_again <- st_info |>select(-Race)
This tells R to select all of the variables from st_info (note: the original data frame) except the Race variable. Let’s verify that it worked:
Yep! It’s missing just one of the columns from the original data frame—the Race column.
But what if we wanted to remove multiple columns from the data frame with one command? Here we need to use the c() function—its purpose is to assemble a set of items into a vector that R can work with. This command tells R to select all of the variables except the three listed in the select statement:
Yep! We have removed three variables from the st_min_again data frame, just as we wanted.
4 Simplifying tibble print-outs
Just a quick aside. Consider this command:
st_info |>select(AppID, Birthdate, St, County, Sex, Race, SAT) |>head(n=3)
# A tibble: 3 × 7
AppID Birthdate St County Sex Race SAT
<chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 4563269562-RODR-2021 05/26/2003 GA COWETAGA M W 1436
2 9221751846-ROEH-2021 03/31/2003 SC GREENVSC M A 1398
3 4290276249-ALLE-2021 06/23/2003 GA BULLOCGA M W 1090
Look at the second row of the print-out—<chr>, <dbl>, and more. It’s informative, but maybe you don’t want to print it.
If that’s the case, then you can add one additional command to to the end and that row will go away. Take a look:
st_info |>select(AppID, Birthdate, St, County, Sex, Race, SAT) |>head(n=3) |>as.data.frame()
AppID Birthdate St County Sex Race SAT
1 4563269562-RODR-2021 05/26/2003 GA COWETAGA M W 1436
2 9221751846-ROEH-2021 03/31/2003 SC GREENVSC M A 1398
3 4290276249-ALLE-2021 06/23/2003 GA BULLOCGA M W 1090
5 Filtering some rows/observations
R’s command for choosing which rows to display is filter(). In this section we provide several examples for how you can filter data in R.
5.1 Filtering on one equality condition
With the select command, we can specify which particular columns/variables that we want to work with or observe. In this section, we are going to learn to specify which rows/observations that we want to include in our analysis—this is called filtering.
The first two rows of the following start out in the way that we observed above, but we continue on the third line by filtering those rows that we want to work with:
st_info |>select(Given, Family, St, County, Sex, Race, SAT) |>filter(Sex =="F")
# A tibble: 1,080 × 7
Given Family St County Sex Race SAT
<chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 Mira Payne GA FULTONGA F W 1456
2 Penelope Williamson SC LANCASSC F AI 1456
3 Emma Stephens SC LEXINGSC F W 1491
4 Ryan Jackson GA DEKALBGA F W 1233
5 Jasmine Truong SC BEAUFOSC F A 1480
6 Eira Scott SC GREENVSC F W 1390
7 Nalani Siegel SC CHARLESC F W 1261
8 Esther Morse SC SPARTASC F A 1338
9 Claire Wilson SC SALUDASC F H 1191
10 Delaney Bishop SC GREENVSC F W 1593
# ℹ 1,070 more rows
A few things to observe:
Notice that we used the select() operator here. We didn’t have to in order to filter data! But we did just to limit the information displayed.
All of the observations passing the filter have F in the Sex column, just as we specified in the command.
In the most recent select() and filter() above, we had one condition—Sex == "F. You can replace that comparison with any other single comparison condition. Consider this:
st_info |>select(Given, Family, St, County, Sex, Race, SAT) |>filter(SAT >1200)
# A tibble: 1,527 × 7
Given Family St County Sex Race SAT
<chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 Tenoch Rodriguez GA COWETAGA M W 1436
2 Travis Roe SC GREENVSC M A 1398
3 Justice Hilton GA DEKALBGA M W 1516
4 Mehran Smith GA WHITFIGA M O 1440
5 Carter Shipley GA DEKALBGA M H 1438
6 Benjamin Chamberlain SC DARLINSC M W 1452
7 Robert Stahl SC CHARLESC M W 1536
8 Zaiden Rogers SC LANCASSC M B 1487
9 Jakob Scott SC PICKENSC M H 1373
10 Jasper Davis GA HENRYGA M W 1297
# ℹ 1,517 more rows
(Note that this displays only M applicants because they are at the top of the data frame and the F applicants are at the bottom. We will work on the arrange operation soon which will address this.)
5.3 Filtering on range conditions
R provides several useful operators for filtering on range conditions. We have seen the first form elsewhere, but the others are new:
V %in% c(list of values)
where V is any type of variable, and the list of values is a comma-separated list of values that the variable might take
read as “variable V has the value of one of the list of values shown”
Example: St %in% c("SC", "GA")
Example: Race %in% c("A", "H", "B")
V %in% L:U
where L and U are integers, with L being the lower bound and U being the upper bound
read as “variable V is in the range of integers from L to U (inclusive)”
Example: SAT %in% 1200:1225
between(V, L, U)
where V, L, and U are numbers, with L being the lower bound and U being the upper bound
read as “variable V (as an integer or decimal) takes a value such that V >= L and V <= U
Example: between(SAT, 1200, 1225)
Again, as you might hope, you can put any of the above filters within the filter() operation, for example:
# A tibble: 782 × 7
Given Family St County Sex Race SAT
<chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 Travis Roe SC GREENVSC M A 1398
2 Carter Shipley GA DEKALBGA M H 1438
3 Zaiden Rogers SC LANCASSC M B 1487
4 Jakob Scott SC PICKENSC M H 1373
5 Hudson Espinoza GA GWINNEGA M A 1456
6 Jack Chandler SC HORRYSC M H 1199
7 Emiliano Rosen SC BERKELSC M H 1246
8 Kingsley Cronin SC PICKENSC M H 1177
9 Kenzo Johnson GA COFFEEGA M A 1189
10 Isaac Liu GA FULTONGA M H 1462
# ℹ 772 more rows
5.4 Filtering on multiple conditions
Filtering on a single condition is certainly useful, but it is much more common to want to filter on several conditions at once. Here, we have to use logical operators in the filter() to combine different filter conditions.
Here is a filter() that ensures that the applicant is a female from Georgia:
st_info |>select(Given, Family, St, County, Sex, Race, SAT) |>filter(Sex =="F"& St =="GA")
# A tibble: 269 × 7
Given Family St County Sex Race SAT
<chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 Mira Payne GA FULTONGA F W 1456
2 Ryan Jackson GA DEKALBGA F W 1233
3 Hazel Larson GA FULTONGA F W 1356
4 Remi Brown GA WILKESGA F W 1321
5 Jeanette Hood GA RICHMOGA F A 1445
6 Choyce Lutz GA LOWNDEGA F H 1347
7 Marie Tracy GA GWINNEGA F W 1267
8 Antonella Johnson GA HENRYGA F W 1459
9 Harley Hall GA BIBBCGA F W 1227
10 Madison Walls GA GLYNNGA F W 1184
# ℹ 259 more rows
Or, another one that filters on Sex, Race, and County:
st_info |>select(Given, Family, St, County, Sex, Race, SAT) |>filter(Sex =="F"& Race %in%c("A", "H", "B") & County =="SPARTASC")
# A tibble: 22 × 7
Given Family St County Sex Race SAT
<chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 Esther Morse SC SPARTASC F A 1338
2 Henley Fisher SC SPARTASC F H 1451
3 Abigail Saavedra SC SPARTASC F A 1150
4 Mila Price SC SPARTASC F H 1227
5 Aaliyah Smith SC SPARTASC F H 1329
6 Finlee Brown SC SPARTASC F H 1180
7 Diana Jackson SC SPARTASC F B 1148
8 Nova Lynch SC SPARTASC F B 1346
9 Adeline Fowler SC SPARTASC F H 1238
10 Isla Gomez SC SPARTASC F B 1161
# ℹ 12 more rows
In the above, be sure to use the equality comparison == and not the single equal =.
(As an aside, note that we have used glimpse() within a pipe command, differently than we introduced it above.) You can see that R has interpreted the data in all three columns to be character strings.
Think about (and look at) the data that are in these columns. The Birthdate column has dates in it, and you will want to do calculations on them. But R cannot do date calculations on the column until it knows that it contains date values.
In this section we will learn how to tell R that it is dealing with a date.
What you have to do is tell R the format of the column—where is the year, the month, etc.? Is it a 2-digit year or a 4-digit year? Is the month represented by letters or by numbers?
R has a set of format indicators, a few of which are shown here:
Note that you need to have both the assignment operator <- and the pipe operator |> in this statement for it to work.
The statement above is assigning a value to a new column called BD in the st_info data frame. It is getting the new values by applying the as.Date() function to the Birthdate variable. You can see that this function contains the format string that it needs to interpret the dates.
Let’s take a look at the new column next to the old one:
st_info |>select(Given, Family, Birthdate, BD)
# A tibble: 2,000 × 4
Given Family Birthdate BD
<chr> <chr> <chr> <date>
1 Tenoch Rodriguez 05/26/2003 2003-05-26
2 Travis Roe 03/31/2003 2003-03-31
3 Axel Allen 06/23/2003 2003-06-23
4 Justice Hilton 06/29/2003 2003-06-29
5 Mehran Smith 11/03/2002 2002-11-03
6 Carter Shipley 10/31/2002 2002-10-31
7 Benjamin Chamberlain 02/08/2003 2003-02-08
8 Robert Stahl 06/16/2003 2003-06-16
9 Zaiden Rogers 09/11/2002 2002-09-11
10 Jakob Scott 05/25/2003 2003-05-25
# ℹ 1,990 more rows
You can see that Birthdate is still a character string but the new BD column contains seemingly equivalent date values. The difference is that you can execute calculations on the BD column.
For example, suppose that you want to calculate the applicant’s age as of today. Here is the statement that would do just that:
The statement is assigning a value to a new column Age in the st_info data frame.
Let’s consider the interior function difftime(Later, Earlier). This calculates the number of seconds between the two times Later - Earlier.
The function Sys.Date() returns the current date and time.
Since `difftime() is operating on the current date and the applicant’s birthdate, this is returning the number of seconds the applicant has been alive.
The function time_length(S, "years") provided by the tidyverse converts that calculated number of seconds into a years value (in terms of a decimal number).
Thus, this calculation puts the applicant’s age in years in the Age column.
Let’s take a look at what this produces:
st_info |>select(Birthdate, BD, Age) |>summary()
Birthdate BD Age
Length:2000 Min. :2002-08-02 Min. :21.18
Class :character 1st Qu.:2002-11-10 1st Qu.:21.45
Mode :character Median :2003-02-25 Median :21.73
Mean :2003-02-23 Mean :21.74
3rd Qu.:2003-06-08 3rd Qu.:22.03
Max. :2003-09-15 Max. :22.30
This report highlights all that we have gained through converting the Birthdate string to the BD date:
We can now see the range of birthdates as well as other computed statistics on the BD column For Birthdate we just know that it is a character string column.
We can see with the Age column a whole set of statistics calculated on that value.
You absolutely do not have to do this converstion from string to date for a column that contains date-related information. You only need to do it if you have to perform date calculations.
7 Combining multiple character (string) columns into one
Suppose you have a data frame that contains two fields, Given and Family, that you would like to combine into one field, FullName, while also putting a space between the contents of the two field. Then use the following command:
The remove field is set to FALSE so that the original two fields are retained; as you might guess, setting it to TRUE would result in them both being deleted.
Let’s check the results of the above command:
st_info |>select(Given, Family, FullName)
# A tibble: 2,000 × 3
Given Family FullName
<chr> <chr> <chr>
1 Tenoch Rodriguez Tenoch Rodriguez
2 Travis Roe Travis Roe
3 Axel Allen Axel Allen
4 Justice Hilton Justice Hilton
5 Mehran Smith Mehran Smith
6 Carter Shipley Carter Shipley
7 Benjamin Chamberlain Benjamin Chamberlain
8 Robert Stahl Robert Stahl
9 Zaiden Rogers Zaiden Rogers
10 Jakob Scott Jakob Scott
# ℹ 1,990 more rows
8 Convert string variable/column to categorical
A categorical variable usually has a limited number of predictable and pre-defined values that it can take. If you have ever filled out a form with a drop-down menu or a selection of radio buttons, then those would be good candidates to be categorical variables.
R calls these variables factors. In this st_info data frame, three columns are good candidates for becoming factors: Sex, Race, and St.
8.1 The conversion process
The following command—definitely not tidyverse-specific—is used to convert a column into a factor:
Before—or even after—you turn a variable into a factor, you can find all of the unique values in that variable (column) with the unique() function:
category_values <-unique(st_info$Race)
As always, you can see the contents of a variable by simply typing it at the prompt:
category_values
[1] W A O H B AI PI
Levels: A AI B H O PI W
Notice there are two lines here. The first simply shows the values in the variable. The second line contains the same values, but they are now in alphabetical order. The order doesn’t mean anything in particular to R, but this is the order in which R will print out the results of particular commands. We’ll get back to this soon.
This would then show how many values the factor has:
length(category_values)
[1] 7
Of course, you could also have used the following to get how many values the factor has:
length(unique(st_info$Race))
[1] 7
This would be the preferred way if you had no particular need to access the values in the variable category_values that was created above.
You can use the table() function to show both the values in the factor and the number of times the factor has that value:
table(st_info$Race)
A AI B H O PI W
136 28 243 403 10 7 1173
Or, if you want to list those factor values by the decreasing frequency of their occurrence:
sort(table(st_info$Race), decreasing =TRUE)
W H B A AI O PI
1173 403 243 136 28 10 7
Finally, you can add the prop.table() function and see the proportion of the whole for each factor (instead of the raw value):
W H B A AI O PI
0.5865 0.2015 0.1215 0.0680 0.0140 0.0050 0.0035
The levels() function will list the variable’s levels in order by value. Notice that this results in the same line as the second output line when the factor value is printed out.
levels(st_info$Race)
[1] "A" "AI" "B" "H" "O" "PI" "W"
8.3 Changing the order of a factor’s levels
Now, suppose that you want to change the ordering of the factor values. The following reorders the values of st_info$Race.
Let’s check that it worked. The columns should be in the same order as the levels list in the above command.
table(st_info$Race)
W H B A AI PI O
1173 403 243 136 28 7 10
And so it is.
9 Joining tables together
Excel holds tables of data in a sheet. It is common for one master table to need to get data from some secondary table; this is usually done by means of Excel’s lookup function.
R’s approach is, we believe, easier and more efficient. You will join two data frames together to form one data frame.
9.1 Reading in the second table
We are now going to add some new data to our student information example. We are going to add information about counties and their economic situation.
# A tibble: 6 × 5
code name state_name population per_capita_income_2020
<chr> <chr> <chr> <dbl> <dbl>
1 ABBEVISC Abbeville County South Carolina 24356 38094
2 AIKENSC Aiken County South Carolina 174150 48208
3 ALLENDSC Allendale County South Carolina 7579 39852
4 ANDERSSC Anderson County South Carolina 209581 43933
5 APPLINGA Appling County Georgia 1156 39199
6 ATKINSGA Atkinson County Georgia 585 31938
We can see that we have the following variables (columns):
code: a unique identifier for the county
name: the name of the county
state_name: the name of the state in which the county is located
population: the pertinent population of the county
per_capita_income_2020: the average per capita income in 2020 for residents in the county
Let’s remind ourselves of the columns in the st_info data frame:
The st_info table has a column named County that contains the county code in which the applicant resides. It would be useful if the county code in the county_info data frame had the same name (in order to minimize confusion).
While it is obvious just when looking at the county_info data frame that name refers to the name of the county, we have a more complicated world now. It would be less confusing if it was more clear that this variable refers to the name of a county.
The variable name per_capita_income_2020 is quite long. It’s always a balancing act between length of variable names and clarity. Since PCI is frequently used to refer to per capita income, we should feel comfortable shortening the name.
9.2 Modifying the table
Let’s make those changes using the rename() function. Here’s the basic form:
# A tibble: 6 × 5
County County_Name state_name population PCI20
<chr> <chr> <chr> <dbl> <dbl>
1 ABBEVISC Abbeville County South Carolina 24356 38094
2 AIKENSC Aiken County South Carolina 174150 48208
3 ALLENDSC Allendale County South Carolina 7579 39852
4 ANDERSSC Anderson County South Carolina 209581 43933
5 APPLINGA Appling County Georgia 1156 39199
6 ATKINSGA Atkinson County Georgia 585 31938
The variables now have new names; the data has not changed.
We aren’t going to use the population or state_name variables so let’s remove them from the data frame:
In order to do analysis, R wants all of the information in a single data frame. Thus, given that we have two data frames now that are related—st_info and county_info—we need to get all of the relevant data into one data frame.
In this situation, the information about the student is in the st_info data frame (including the county in which he/she resides) and information about the county is in the county_info data frame. The link between the two tables is the County column in st_info and the County column in county_info.
We can combine these two tables into one with the left_join() command.
Consider the following command:
student_econ <- st_info |>select(AppID, County, St, Sex, Race, SAT, Age) |>left_join(county_info, by =join_by(County))
Let’s take this one line at a time:
The results are going to be assigned to the student_econ data frame.
The st_info data frame will be considered the one on the left. Think of reading. The English language is read left-to-right, so the first words encountered are on the left of words that are encountered later.
We only want to include seven variables (columns) from the st_info data frame in the new student_econ table. You need to make sure that one of these is the variable that will be used to join the two tables together. Above, we established that this variable is called County in both data frames.
Now for the left_join() function itself: (a)county_info is the right table; i.e., the table being joined to the left table. (b) The by = join_by(County) phrase tells the function that the two tables are to be joined by the County variable.
Let’s see what the result is:
head(student_econ)
# A tibble: 6 × 9
AppID County St Sex Race SAT Age County_Name PCI20
<chr> <chr> <fct> <fct> <fct> <dbl> <dbl> <chr> <dbl>
1 4563269562-RODR-2021 COWETAGA GA M W 1436 21.5 Coweta Coun… 52106
2 9221751846-ROEH-2021 GREENVSC SC M A 1398 21.6 Greenville … 52376
3 4290276249-ALLE-2021 BULLOCGA GA M W 1090 21.4 Bulloch Cou… 35376
4 3398780452-HILT-2021 DEKALBGA GA M W 1516 21.4 Dekalb Coun… 56428
5 7691897840-SMIT-2021 WHITFIGA GA M O 1440 22.0 Whitfield C… 43456
6 1862245592-SHIP-2021 DEKALBGA GA M H 1438 22.1 Dekalb Coun… 56428
Notice that the seven variables from st_info are in this new data frame as are the three variables from county_info. The variable County is only included once.
The result is that now we know the PCI (in 2020) for the county in which the applicant resides. Even if we don’t know the applicant’s economic situation, we know the average for the county in which he/she lives.
10 Sorting data frames by variable/column
10.1 Single columns
Welcome to the realm of sorting in R! First up, let’s master the art of ordering single columns.
st_min |>arrange(Birthdate) |>head()
# A tibble: 6 × 7
AppID Birthdate St County Sex Race SAT
<chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 1412835066-BRYA-2021 01/01/2003 SC YORKCSC M W 1352
2 7043527378-BERR-2021 01/02/2003 SC AIKENSC M A 1120
3 6119019828-BURN-2021 01/02/2003 SC DARLINSC F W 1104
4 4650793794-HERR-2021 01/02/2003 SC LEXINGSC F W 1469
5 4054336768-BLAC-2021 01/03/2003 GA CARROLGA M W 1146
6 0117952760-FREE-2021 01/03/2003 SC LANCASSC M W 1438
In this example, the dataset st_min is being arranged based on the Birthdate column. Using the arrange(Birthdate) function, the rows will be sorted in ascending order based on the dates in the Birthdate column, displaying the earliest dates at the top. The head() function then shows us the first few rows of this beautifully ordered dataset.
10.2 Multiple columns
Next, let’s navigate the landscape of multi-column sorting.
st_min |>arrange(St, County, Birthdate) |>head()
# A tibble: 6 × 7
AppID Birthdate St County Sex Race SAT
<chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 7297156714-TAYL-2021 02/11/2003 GA BALDWIGA F W 1296
2 5503874158-ONEI-2021 08/28/2003 GA BALDWIGA F W 1120
3 4217895057-PLUM-2021 08/30/2002 GA BALDWIGA M W 1336
4 2934873168-SANC-2021 09/07/2002 GA BALDWIGA M W 1311
5 1634395813-GORE-2021 09/22/2002 GA BALDWIGA M W 1171
6 3395189387-JOHN-2021 06/03/2003 GA BARROWGA F B 1354
Here, we’re leveling up by arranging the data by multiple columns: St, County, and Birthdate. The arrange() function works diligently, first sorting by St, then within St, it sorts by County, and finally, within County, it sorts by Birthdate. It’s like a well-organized filing system, making sure everything is in its right place.
10.3 Descending order
Now, let’s unveil the magic of descending order.
st_min |>arrange(St, County, desc(Birthdate)) |>head()
# A tibble: 6 × 7
AppID Birthdate St County Sex Race SAT
<chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 1634395813-GORE-2021 09/22/2002 GA BALDWIGA M W 1171
2 2934873168-SANC-2021 09/07/2002 GA BALDWIGA M W 1311
3 4217895057-PLUM-2021 08/30/2002 GA BALDWIGA M W 1336
4 5503874158-ONEI-2021 08/28/2003 GA BALDWIGA F W 1120
5 7297156714-TAYL-2021 02/11/2003 GA BALDWIGA F W 1296
6 2419671834-JONE-2021 12/03/2002 GA BARROWGA M B 1117
In this powerful move, we’re arranging the data in descending order based on the Birthdate column, while St and County remain in ascending order. By invoking the desc(Birthdate) spell, the Birthdate column is transformed, and the latest dates gallantly appear at the top. A blend of old and new, this approach provides a nuanced view of our dataset.
11 Creating a new column
11.1 Text calculations
Let’s dive into the wonderful world of text manipulation! Strings, characters, texts, or words—whatever you call them, they’re central in data handling. Here we’ll unveil some practical magic tricks to make your text columns perform delightful ballets of coherence and utility!
Our first act features the mutate() and paste() functions, a dynamic duo in text calculations. Together, they craft a new column, Full, by pasting the Given and Family names together with a space in between. This newly conjured Full column now graciously hosts the complete names, ensuring each person is addressed with due respect.
# not evaluated; equivalent to the previous command.st_info |>unite("Full", Given:Family, remove =FALSE, sep =" ") |>head()
Meet the unite() function—a mystical ally in merging columns. Here, it acts like a sophisticated tailor, stitching the Given and Family columns together with a space separator, crafting a chic “Full” name column. The remove = FALSE argument ensures that the original columns retain their places, standing proudly beside their new companion.
# A tibble: 10 × 3
Full Given Family
<chr> <chr> <chr>
1 Angie Abel Angie Abel
2 Ollie Acosta Ollie Acosta
3 Amber Adams Amber Adams
4 Conner Adams Conner Adams
5 Gemma Adams Gemma Adams
6 Jones Adams Jones Adams
7 Walter Adams Walter Adams
8 Caspian Adkins Caspian Adkins
9 Dalet Aguilar Dalet Aguilar
10 Tim Aguilar Tim Aguilar
Finally, we adorn our dataset with the elegance of selection and ordering. The select() function thoughtfully curates our columns, ensuring that only Full, Given, and Family grace the final presentation. Guided by the arrange() function, the rows sashay into a splendid procession ordered by Family and Given, showcasing the harmonious blend of our text manipulation craftsmanship.
11.2 Numeric calculations
11.2.1 Calculating summary statistics
Suppose that we want to calculate some summary statistics over the whole data frame. You might type the following:
So, it should be clear that this line of code calculates and prints a collection of statistics for theSATcolumn in the st_info dataset. It specifically calculates the:
Mean: The average of theSATscores.
Median: The middle value in theSATscores.
Standard Deviation (sd): A measure of the amount of variation or dispersion of theSATscores.
Length: The total number of values (count) in theSATcolumn.
Note that there might be a problem with the above. Do you want to include NA values in the calculations?
Let’s take a quick look:
v <-c(1,2,3)print(c(mean(v), median(v), length(v)))
[1] 2 2 3
This first line of results are just what we expected. Let’s simply add an NA to the vector:
v <-c(1,2,3,NA)print(c(mean(v), median(v), length(v)))
[1] NA NA 4
Take a look at the second line—both mean() and median() return NA. If this is okay, then you don’t have a problem. However, if you want to calculate the values without including the NAs, then you have two options. Here’s the first:
v <-c(1, 2, 3, NA)print(c(mean(v, na.rm =TRUE),median(v, na.rm =TRUE),sum(!is.na(v))))
[1] 2 2 3
The na.rm means “do you want to remove the NA values?” Note that length() does not take a na.rm argument so we count up the values in v that are not NA values.
There is a second way that uses the na.omit() function which strips out NA values from a vector:
v <-c(1, 2, 3, NA)v_with_no_na <-na.omit(v)print(c(mean(v_with_no_na),median(v_with_no_na),length(v_with_no_na)))
[1] 2 2 3
So, going back to our original problem of calculating statistics for the SAT column of the st_info data frame, we could use a more tidyverse-friendly approach that utilizes summarize():
# A tibble: 1 × 4
Mean Median SD Count
<dbl> <dbl> <dbl> <int>
1 1315. 1308. 137. 2000
11.2.2 Calculating values per row
Suppose that we want to calculate by what percentage each applicant exceeds (or falls short of) the overall SAT score for all applicants. We could run the following command:
# A tibble: 10 × 7
Full County St Sex Race SAT rel_SAT
<chr> <chr> <fct> <fct> <fct> <dbl> <dbl>
1 Angie Abel GEORGESC SC F W 1472 12.0
2 Ollie Acosta RICHLASC SC F W 1280 -2.64
3 Amber Adams BEAUFOSC SC F W 1585 20.6
4 Conner Adams HORRYSC SC M W 1296 -1.43
5 Gemma Adams WILLIASC SC F B 1197 -8.96
6 Jones Adams HOUSTOGA GA M W 1315 0.02
7 Walter Adams GREENVSC SC M A 1356 3.14
8 Caspian Adkins LEXINGSC SC M A 1181 -10.2
9 Dalet Aguilar HORRYSC SC F W 1151 -12.5
10 Tim Aguilar JEFFERGA GA M W 1323 0.63
This snippet performs several operations on the st_info dataset:
Mutation: It creates a new column rel_SAT, which calculates the relative SAT score as a percentage difference from the mean SAT score. Each value in rel_SAT indicates how much above or below the mean each individual SAT score is.
Arrangement: The data is then sorted by Family and Given names.
Selection and Display: It selects specific columns to be displayed and shows the first ten rows of this modified dataset.
Again, while this basically works, it also includes NA values in the calculation of the overall mean. While fixing that problem, we can also make the code a bit more readable by making a temporary column mean_SAT to hold that value (though without printing it). Let’s see how this works:
# A tibble: 10 × 7
Full County St Sex Race SAT rel_SAT
<chr> <chr> <fct> <fct> <fct> <dbl> <dbl>
1 Angie Abel GEORGESC SC F W 1472 12.0
2 Ollie Acosta RICHLASC SC F W 1280 -2.64
3 Amber Adams BEAUFOSC SC F W 1585 20.6
4 Conner Adams HORRYSC SC M W 1296 -1.43
5 Gemma Adams WILLIASC SC F B 1197 -8.96
6 Jones Adams HOUSTOGA GA M W 1315 0.02
7 Walter Adams GREENVSC SC M A 1356 3.14
8 Caspian Adkins LEXINGSC SC M A 1181 -10.2
9 Dalet Aguilar HORRYSC SC F W 1151 -12.5
10 Tim Aguilar JEFFERGA GA M W 1323 0.63
Note that, although we calculated the mean_SAT column, we didn’t include it in the select() function so it didn’t print out.
11.3 Summary columns
11.3.1 Components of summary commands
Let’s take a moment and go through each of the components and the various calculations used within the summarize() function.
group_by(col1, col2, ...)
The group_by() function is used to split the data into groups based on one or more variables. Subsequent operations are then performed on these groups rather than the whole dataset. This is useful when you want to perform summary statistics or transformations within individual groups. The groups created by group_by() are not explicitly visible in the output but are retained in the background to affect subsequent operations.
summarize()
This function reduces each group of values down to a single summary statistic. Multiple summaries are allowed.
n(): Counts the number of observations in each group. Useful to quickly understand the size of each group or as a denominator in average calculations.
n_distinct()
Counts the number of distinct values in each group. Helpful for understanding the variety or uniqueness within categorical variables in each group.
mean(col)
Calculates the mean (average) of a column in each group. Provides a measure of central tendency; common in various statistical analyses to understand the data’s general behavior.
median(col)
Calculates the median (middle value) of a column in each group. Another measure of central tendency that is less sensitive to outliers compared to the mean.
sd(col)
Calculates the standard deviation of a column in each group, a measure of dispersion or variability. Useful to understand the spread or variability of the data around the mean.
min(col)
Identifies the minimum value in each group. Helpful to understand the range and outliers in the data.
max(col)
Identifies the maximum value in each group. Like the minimum, it’s crucial for understanding the data’s range and outliers.
11.3.2 Discussion of the whole pipeline
When you combine group_by() with summarize(), you create a powerful tool to understand the nuanced behaviors within different subsections or categories of your data. Together, they allow a detailed and specific summary analysis, facilitating a deeper understanding of data patterns and variations across groups.
Keep in mind that when you use summarize(), it will remove the grouping structure unless you explicitly choose to keep it. In most cases, this is beneficial as it simplifies the resulting tibble to contain just your summary statistics.
Consider the following. It calculates summary statistics per numeric column, with the results for each column shown on a separate row of the result.
st_info |>select(where(is.numeric)) |># Select only numeric columnsgather(column, value) |># Convert to long formatgroup_by(column) |># Group by each original columnsummarise(mean =mean(value, na.rm =TRUE),median =median(value, na.rm =TRUE),sd =sd(value, na.rm =TRUE),count =sum(!is.na(value)) )
# A tibble: 2 × 5
column mean median sd count
<chr> <dbl> <dbl> <dbl> <int>
1 Age 21.7 21.7 0.327 2000
2 SAT 1315. 1308. 137. 2000
There is a lot going on here, so we want to take a few moments to explain each piece, going line-by-line.
select(where(is.numeric))
This selects only the columns in the data frame st_info where the columns are numeric. This part of the code sifts through the columns, keeping only those with numeric data and excluding others like characters or factors.
gather(column, value)
This converts the data frame from a wide format to a long format.
The gather() function takes multiple columns and collapses them into key-value pairs, increasing the number of rows.
column will contain the names of the original columns, and value will contain the corresponding values. Note that neither of these names—column or value—mean anything! They are just to be used further down in the command.
For instance, if you have two columns (“A” and “B”), gather() puts all values in a single column and another column specifying where each value came from (“A” or “B”). Thus, it would go from this representation:
A B
1 5
2 6
To this representation:
row column value
1 A 1
1 B 5
2 A 2
2 B 6
This is what it means to convert from a wide format to a long format.
group_by(column)
This groups the data by the original column names.
This function segments the data into groups based on unique column names.
Operations following this function (like summarise()) will operate independently within each group.
summarise()
This calculates summary statistics for each group.
mean = mean(value, na.rm = TRUE) calculates the mean of the numeric values, ignoring NA values.
median = median(value, na.rm = TRUE) calculates the median, ignoring NA values.
sd = sd(value, na.rm = TRUE) calculates the standard deviation, ignoring NA values.
count = sum(!is.na(value)) counts the non-NA values in each group.
The overall result is a summary table with calculated mean, median, standard deviation, and count for each numeric column of the original data frame, st_info.
In summary:
The pipeline starts by selecting numeric columns.
It then reshapes the data to a long format.
Following this, it groups the data by original columns.
Lastly, it calculates summary statistics for each of these groups, resulting in a concise, informative table.
11.3.5across()
11.3.5.1 Selects
selects the columns you want to operate on
where(is.numeric)
c(list of column names)
11.3.5.2 Function or list of functions
colname = ~funcname(.x, na.rm = TRUE):
12 Differences between arrange() and group_by()
The functions arrange() and group_by()—though they seem similar—serve different purposes when manipulating and analyzing data in R. Here’s a breakdown of their differences.
12.1arrange()
arrange() is used to sort or order rows in a data frame or tibble based on the values of one or more columns. It rearranges the rows in ascending order by default, but you can also sort them in descending order. Consider these two examples:
df <-tibble(x =c(3, 2, 1), y =c("A", "B", "C"))df |>arrange(x) # sort in ascending order
# A tibble: 3 × 2
x y
<dbl> <chr>
1 1 C
2 2 B
3 3 A
Note that the x column is arranged in ascending order as we specified.
df |>arrange(desc(x)) # sort in descending order
# A tibble: 3 × 2
x y
<dbl> <chr>
1 3 A
2 2 B
3 1 C
And now the x column is arranged in descending order.
12.2group_by()
group_by() is used to split data into groups based on one or more variables. This is particularly useful when you want to perform some operations separately on each group, such as summarizing or mutating.
Let’s first define a simple tibble with three rows and two columns:
# A tibble: 3 × 2
x y
<chr> <dbl>
1 A 1
2 B 2
3 A 3
Suppose that we want to calculate the average y value for every value of x. This is what group_by() is for:
df |>group_by(x) |># Grouping by a single variablesummarise(mean_y =mean(y))
# A tibble: 2 × 2
x mean_y
<chr> <dbl>
1 A 2
2 B 2
Notice that we did not have to arrange() the tibble in any way first; the group_by() command handles the ordering and the grouping itself.
Finally, we can also calculate summary statistics when grouping by multiple columns:
df |>group_by(x, y) |># Grouping by multiple variablessummarise(count =n())
# A tibble: 3 × 3
# Groups: x [2]
x y count
<chr> <dbl> <int>
1 A 1 1
2 A 3 1
3 B 2 1
This counts, correctly, that there is one each of A,1, A,3, and B,2.
12.3 Key Differences
arrange() modifies the order of rows, changing the way data is displayed but not how it’s fundamentally structured.
group_by(), on the other hand, changes the way data is treated in subsequent operations, particularly when paired with verbs like summarise() or mutate(), to apply functions within each group.
In essence, arrange() is about changing the row order for presentation, while group_by() is about defining subsets of the data to operate on in subsequent steps of your data analysis pipeline. They can also be used in conjunction to first group data and then arrange the resulting summarized data.