String functions
On this page, we use data from the New York Times Best Sellers list from 2010 to 2019 in order to demonstrate the range of capabilities of R’s string manipulation functions.
The R Studio string function cheatsheet can be found here..
1 Case study: Books
Before getting into the string functions themselves, we go through the process of importing the data, cleaning up the columns, and finally cleaning up the contents of the columns.
1.1 Import
The first step, as always, is to import the CSV file into a data frame.
You will find that this data is sufficiently complex that it lends itself to challenges that will push us to understand R’s capabilities. We attempt to demonstrate and explain the functions while using them in a realistically complex use case.
1.2 Column clean-up
Here is some information about this large dataset:
Rows: 61,430
Columns: 12
$ published_date <date> 2010-01-03, 2010-01-03, 2010-01-03, 2010-01-03, 20…
$ list_name <chr> "Chapter Books", "Chapter Books", "Chapter Books", …
$ list_name_encoded <chr> "chapter-books", "chapter-books", "chapter-books", …
$ rank <dbl> 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, …
$ isbn13 <chr> "9780316036245", "9780439023481", "9780439023498", …
$ isbn10 <chr> "0316036242", "0439023483", "0439023491", "07636441…
$ title <chr> "WITCH AND WIZARD", "THE HUNGER GAMES", "CATCHING F…
$ author <chr> "James Patterson and Gabrielle Charbonnet", "Suzann…
$ description <chr> "One of each, brother and sister, flex their newfou…
$ amazon_product_url <chr> "https://www.amazon.com/Witch-Wizard-James-Patterso…
$ price <dbl> 17.99, 17.99, 17.99, 16.99, 17.99, 28.95, 40.00, 27…
$ weeks_on_list <dbl> 1, 67, 16, 15, 2, 14, 13, 8, 56, 7, 14, 5, 6, 38, 3…
Fortunately, the published_date
column is already recognized as a date
, so we don’t have to coerce that column. Same thing for rank
, price
, and weeks_on_list
— each one is already recognized as a numeric value.
For a bit more information, let’s skim()
the data:
Name | books |
Number of rows | 61430 |
Number of columns | 12 |
_______________________ | |
Column type frequency: | |
character | 8 |
Date | 1 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
list_name | 0 | 1.00 | 5 | 36 | 0 | 59 | 0 |
list_name_encoded | 0 | 1.00 | 5 | 36 | 0 | 59 | 0 |
isbn13 | 4 | 1.00 | 13 | 13 | 0 | 8928 | 0 |
isbn10 | 1284 | 0.98 | 4 | 10 | 0 | 8124 | 0 |
title | 0 | 1.00 | 1 | 80 | 0 | 6577 | 0 |
author | 69 | 1.00 | 1 | 84 | 0 | 3609 | 0 |
description | 8119 | 0.87 | 9 | 315 | 0 | 8205 | 0 |
amazon_product_url | 0 | 1.00 | 38 | 156 | 0 | 6719 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
published_date | 0 | 1 | 2010-01-03 | 2019-12-29 | 2015-05-03 | 522 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
rank | 0 | 1 | 3.00 | 1.41 | 1 | 2 | 3 | 4 | 5 | ▇▇▇▇▇ |
price | 0 | 1 | 3.86 | 8.79 | 0 | 0 | 0 | 0 | 150 | ▇▁▁▁▁ |
weeks_on_list | 0 | 1 | 24.71 | 63.53 | 0 | 0 | 2 | 16 | 607 | ▇▁▁▁▁ |
The first thing that to do is to simplify the data frame. We remove a few columns from the analysis (to simplify the presentation). We also shorten the names of four columns.
We assign the data to a new data frame
bookdata
but the data comes from thebooks
data frame (that we just created with theread.csv
statement above). We then delete four columns by usingselect()
with a minus in front of a list. Finally, withrename()
we rename four of the columns.
Notice that we still have the original books
data frame so that we can easily start the analysis over if we need to.
1.3 Column contents clean-up
Let’s take a look at a bit of the data, selecting every column except description
(because it is so long):
bookdata |>
select(pubdate, title, author, listname,
rank, weeks, isbn) |> # not printing description
arrange(pubdate, listname, rank) |>
head(n=6)
# A tibble: 6 × 7
pubdate title author listname rank weeks isbn
<date> <chr> <chr> <chr> <dbl> <dbl> <chr>
1 2010-01-03 WITCH AND WIZARD James Patte… chapter… 1 1 9780…
2 2010-01-03 THE HUNGER GAMES Suzanne Col… chapter… 2 67 9780…
3 2010-01-03 CATCHING FIRE Suzanne Col… chapter… 3 16 9780…
4 2010-01-03 THE MAGICIAN’S ELEPHANT Kate DiCami… chapter… 4 15 9780…
5 2010-01-03 FALLEN Lauren Kate chapter… 5 2 9780…
6 2010-01-03 GUINNESS WORLD RECORDS 2010 edited Cra… hardcov… 1 14 9781…
We are going to spend the rest of this section making factors out of two columns and then managing empty values in the data frame.
1.3.1 Make factors
Let’s now take a few moments to investigate the data in more depth. First, let’s see how many distinct values are in each column. (Note that you do not have to understand this command at all in order to use it. After the first line (with bookdata
), you can just copy-and-paste in order to use it on any table with any type of data.)
First, convert all columns in
bookdata
to character columns; if there are multiple data types, then this command would give an error without this line. The second line converts the output from a wide table to a long one. The third line groups the results by variable names and the fourth line counts the number of distinct values per variable.
bookdata |>
mutate(across(everything(), as.character)) |>
pivot_longer(everything(), names_to = "variable", values_to = "value") |>
group_by(variable) |>
summarise(n_distinct = n_distinct(value)) |>
print()
# A tibble: 8 × 2
variable n_distinct
<chr> <int>
1 author 3610
2 description 8206
3 isbn 8929
4 listname 59
5 pubdate 522
6 rank 5
7 title 6577
8 weeks 598
Given how many unique values are in rank
and listname
, let’s go ahead and turn those columns into factors (as discussed in this section:
The ordering of the levels for both factors can stay as the default, so we don’t have to do anything with that.
1.3.2 Managing empty values
Next, let’s see how many string values signifying “empty” are in the character columns. Much of this process is discussed in greater depth on this page. This section doesn’t have anything in particular to do with string functions, so you can go on to the next section without missing anything on these functions. (But come back to it later when you’re managing empty values in your own data.)
Here’s an interpretation of the following command:
R will look at every
character
column. It will compute the sum, per column, of every observation that has one of the strings"NA"
,"--"
, or""
. It is an R standard for use.x
to refer back to a column—in this case, R is iterating through every character column (since this is what is prescribed in the first argument of theacross()
function), so.x
represents the name of each character column in turn.
Rows: 1
Columns: 4
$ isbn <int> 0
$ title <int> 0
$ author <int> 0
$ description <int> 0
Great! There don’t appear to be any of these values in the dataset. We might come across them later, but none of these strings are used in this dataset at this point.
Finally, let’s see how many actual NA
values are in any of the columns. (Again, go to this page if you want to learn about handing NA
values in more depth.)
R looks at every
character
column. For each one of those columns, it computes the sum of how many rows have anNA
column.
Rows: 1
Columns: 8
$ pubdate <int> 0
$ listname <int> 0
$ rank <int> 0
$ isbn <int> 4
$ title <int> 0
$ author <int> 69
$ description <int> 8119
$ weeks <int> 0
It’s the isbn
, author
, and description
columns that contain NA
values. Let’s take a moment and investigate which books have an NA
value in the isbn
column.
R filters those rows for which the
isbn
column has theNA
value, and then it selects theisbn
,title
, andauthor
columns.
# A tibble: 4 × 3
isbn title author
<chr> <chr> <chr>
1 <NA> SUMMER SECRETS Barbara Freethy
2 <NA> SUMMER SECRETS Barbara Freethy
3 <NA> SUMMER SECRETS Barbara Freethy
4 <NA> SUMMER SECRETS Barbara Freethy
Apparently just one book doesn’t have a value in the isbn
column.
For purposes of our analysis, we want to be able to assume that every book has an isbn
value, so we need to come up with a default value that works for this book (and for any future book that might appear on future lists).
We are going to define a default isbn
value for all books that do not have one. We define this as the first 13 lowercase letters (with spaces removed); if the title is shorter than that, then we add spaces onto the end in order to make it 13 characters.
R defines four new columns:
titlelen
is the length of the title,lowertitle
is the lowercase letters from the title with the spaces removed,shorttitle
is the first 13 letters of thelowertitle
(though it may be less than that if the title does not have 13 letters in it), andpadtitle
is theshorttitle
padded out to 13 characters (with_
) if it is not at least that long already. The command then applies anifelse()
in order to apply the correct value toisbn
depending on if it is blank. Finally, the command finishes by deleting those four temporary fields.
bookdata <-
bookdata |>
mutate(
titlelen = length(title),
lowertitle = tolower(str_remove(title, " ")),
shorttitle = str_sub(lowertitle, 1, 13),
padtitle = str_pad(shorttitle,
13,
side = "right",
pad = "_"),
isbn = ifelse(is.na(isbn),
ifelse(titlelen > 13,
shorttitle,
padtitle),
isbn)) |>
select(-c(titlelen, lowertitle, shorttitle, padtitle))
Let’s take a look back at bookdata
to ensure that the book now has a value in the isbn
column:
R looks for an
isbn
value with either an underscore in it or the wordsummer
in it. It then selects the values from theisbn
,title
, andauthor
columns.
bookdata |>
filter(str_detect(isbn, "_") | str_detect(isbn, "summer")) |>
select(isbn, title, author)
# A tibble: 4 × 3
isbn title author
<chr> <chr> <chr>
1 summersecrets SUMMER SECRETS Barbara Freethy
2 summersecrets SUMMER SECRETS Barbara Freethy
3 summersecrets SUMMER SECRETS Barbara Freethy
4 summersecrets SUMMER SECRETS Barbara Freethy
There’s just the one book and it now has an appropriate (temporary) isbn
number.
Let’s ensure that we now have just two columns with NA
values in them. This command is copied directly from above.
Rows: 1
Columns: 8
$ pubdate <int> 0
$ listname <int> 0
$ rank <int> 0
$ isbn <int> 0
$ title <int> 0
$ author <int> 69
$ description <int> 8119
$ weeks <int> 0
And so it is—it’s only the author
and description
columns that have NA
values.
1.4 Ready for analysis
Here is the bookdata
data frame, all ready for analysis:
pubdate listname rank
Min. :2010-01-03 hardcover-fiction : 2610 1:12286
1st Qu.:2013-02-03 hardcover-nonfiction : 2610 2:12286
Median :2015-05-03 paperback-nonfiction : 2610 3:12286
Mean :2015-01-28 picture-books : 2610 4:12286
3rd Qu.:2016-10-02 series-books : 2610 5:12286
Max. :2019-12-29 trade-fiction-paperback: 2610
(Other) :45770
isbn title author description
Length:61430 Length:61430 Length:61430 Length:61430
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
weeks
Min. : 0.00
1st Qu.: 0.00
Median : 2.00
Mean : 24.71
3rd Qu.: 16.00
Max. :607.00
2 Functions
Text manipulation and calculation are essential parts of data wrangling, and there are many functions in R that one can use to perform a variety of text-related tasks.
In the following we apply many of these tidyverse functions to the bookdata
dataframe that we explain above. While it is the case that each section (and function description) stands alone, the entirety of this page goes through the process of cleaning up the bookdata
data frame so that it is ready for processing.
We highly recommend reading this page from beginning to end at some point.
2.1 Trimming White Spaces: str_trim()
and str_squish()
The function str_trim()
removes leading and trailing white spaces from strings, essential for cleaning up text data. The function str_squish()
also removes extra space (not all the spaces; just extra spaces) from within a string in addition to those leading and trailing the string.
Let’s first get a baseline that tells us how many characters are in the title
column:
In this command, we first calculate the length of the
title
column and store it in the (temporary) columnstrlen
. We then calculate the total number of characters and store it in another value namedtitle_total
. Finally, we print that value out.
bookdata |>
mutate(strlen = str_length(title)) |>
summarize(title_total = sum(strlen, na.rm = TRUE)) |>
select(title_total) |>
as.data.frame()
title_total
1 1093865
Take note of that, and now let’s go through and remove extra whitespace with the following:
R modifies both the
title
column twice. The first time it trims all extra spaces from both the beginning and end of each column. The second time, it trims extra spaces from the beginning, the end, and in the middle of each field.
It is absolutely true that we didn’t need to do the str_trim()
commands since we are following them with the str_squish
commands. However, we wanted to demonstrate both of them for you.
Now let’s see how many characters remain in that column:
bookdata |>
mutate(strlen = str_length(title)) |>
summarize(title_total = sum(strlen, na.rm = TRUE)) |>
select(title_total) |>
as.data.frame()
title_total
1 1093828
We don’t end up getting rid of much whitespace; however, getting rid of any can help us make some assumptions about the text in that column as we continue to process it.
Note (again) that you will, generally, never need to do both str_trim()
and str_squish()
. We did this just for demonstration purposes. If it was for a real analysis, we would have just used the str_squish()
function.
Let’s go ahead and use str_squish()
on both the author
and description
columns:
These should also help us out as we continue.
2.2 Text Transformation: toupper()
, tolower()
, and str_to_title()
These functions transform text to lower case, upper case, or title case. These are frequently used to ensure consistency in your text data.
Let’s take a quick look at the first few values in the title
column.
Select the first four values from the
title
column frombookdata
. Then (via theas.data.frame()
function), remove the data types from the top row of the printout.
title
1 WITCH AND WIZARD
2 THE HUNGER GAMES
3 CATCHING FIRE
4 THE MAGICIAN’S ELEPHANT
Now that you know the original values for title
, let’s take a look at these three functions.
Define three new (temporary) columns:
lowert
which is all lowercase;titlet
which is “Title Case”; anduppert
which is all UPPERCASE. Then select the top 4 values from those three columns.
bookdata |>
mutate(lowert = tolower(title),
titlet = str_to_title(title),
uppert = toupper(title)) |>
select(lowert, titlet, uppert) |>
head(n=4) |>
as.data.frame()
lowert titlet uppert
1 witch and wizard Witch And Wizard WITCH AND WIZARD
2 the hunger games The Hunger Games THE HUNGER GAMES
3 catching fire Catching Fire CATCHING FIRE
4 the magician’s elephant The Magician’s Elephant THE MAGICIAN’S ELEPHANT
You can see the results of these functions by comparing these three columns to the previous table. No real surprises, but it’s nice to know that these are available.
2.3 Extracting and Detecting: str_extract()
, str_detect()
These functions allow for the extraction of or detection of patterns or words within strings. In this section, we focus on the use of str_detect()
. We use it to explore the author
field, a process which will lead us to do some further processing, as you will see.
First, let’s see if the word “editor” or “Edited” appear in the author
column of bookdata
.
R filters the rows for which either “editor” or “Edited” is in the
author
column by using thestr_detect()
function on theauthor
column. It then groups the results byauthor
, calculating the total number of rows that has that value in the author column. It then sorts in descending order by that total, printing out the author column and total.
bookdata |>
filter(str_detect(author, "editor") |
str_detect(author, "Edited")) |>
group_by(author) |>
summarize(n = n()) |>
arrange(desc(n)) |>
select(author, n) |>
as.data.frame()
author n
1 the editors of Garden and Gun Magazine 22
2 the editors of Blue Star Books 17
3 the editors of Rockridge Press 12
4 David DiBenedetto and the editors of Garden and Gun 5
5 the editors of Goop 5
6 the editors of Sports 5
7 the editors of domino 4
8 Edited R L Stine 3
9 the editors at Cook’s 2
10 the editors of Good Housekeeping 2
11 the editors of America's Test Kitchen and Guy Cros 1
12 the editors of Martha Stewart Living 1
We’re now going to look for a second set of words—those which might substitute for "and"
within the author
column. When you are looking for a word, you have to take some precaution so that you find the whole word and not text within a word. (Note that we didn’t simply make these up. We looked through several hundred rows within the column to see what kind of data that it contains. This is the result of that visual scan.)
This command has the same structure as the previous one, so we will not go through it in detail.
bookdata |>
filter(str_detect(author, " and ") |
str_detect(author, " & ") |
str_detect(author, " with ") |
str_detect(author, ", ") |
str_detect(author, "; ")) |>
group_by(author) |>
summarize(n = n()) |>
arrange(desc(n)) |>
select(author, n) |>
head(n=10) |>
as.data.frame()
author n
1 Bill O'Reilly and Martin Dugard 741
2 Todd Burpo with Lynn Vincent 368
3 Heidi Murkoff and Sharon Mazel 293
4 Robert Kirkman and Charlie Adlard 279
5 Malala Yousafzai with Christina Lamb 220
6 Chris Kyle with Scott McEwen and Jim DeFelice 208
7 James Patterson and Maxine Paetro 205
8 Sherri Duskey Rinker and Tom Lichtenheld 198
9 Sheryl Sandberg with Nell Scovell 142
10 Brian Kilmeade and Don Yaeger 138
It’s clear that we have to standardize the handlings of "and"
equivalents given the huge number of rows that have one or more of those strings in the author
column.
We will not cover str_extract()
here as it depends on a knowledge of regular expressions—a topic of nuance and substance. However, if your job depends on doing lots of text wrangling, then you owe it to yourself to explore regexs.
2.4 Pattern Matching and Replacement: str_replace()
The function str_replace()
allows the user to search for a specific pattern or word in a string and replace it with another word or pattern. Very useful for data cleaning and standardization.
We are going to go through several different types of fixes that a data analyst has to take care of. These tasks are especially important when you are working with data coming from someone else. You have to look through the data for all types of errors, and then fix them as consistently and carefully as possible.
In this first command, we fix typographical errors in the author
column.
We must remember to assign the results back to the
bookdata
data frame, so this goes on the first row. The second row says that the resulting data comes from thebookdata
data frame. We then use fourmutate()
andstr_replace()
functions to fix fourauthor
fields that had typographical errors in them.
bookdata <-
bookdata |>
mutate(author = str_replace(author, "Alan Moore Dave Gibbons",
"Alan Moore and Dave Gibbons")) |>
mutate(author = str_replace(author, "Dan Kainen,", "Dan Kainen")) |>
mutate(author = str_replace(author, "Slahi,", "Slahi")) |>
mutate(author = str_replace(author, "Archie Manning Jim Madsen",
"Archie Manning and Jim Madsen")) |>
mutate(author = str_replace(author, "Wesley King Created Kobe Bryant",
"Wesley King and Kobe Bryant")) |>
mutate(author = str_replace(author, "Svetlana Alexievich Keith Gessen",
"Svetlana Alexievich and Keith Gessen")) |>
mutate(author = str_replace(author, "Proust Adapted and", "Proust"))
We are sure that there are more in the column similar to these; as we find them, we will add them to the above command.
The following is not fixing a typo, but it is acknowledging a reality of how R handles character strings. The dash "-"
character is treated as a separation between two separate words, while an underscore "_"
character is treated as part of a word. Below, we are going to be dividing up names into first names and last names. We want a name such as Carl-Johan
to counted as the first name and not a first and middle name; as such, we want to join Carl
and Johan
with an underscore instead of a dash. This same analysis goes for Lopez-Alt
.
bookdata <-
bookdata |>
mutate(author = str_replace(author, "Carl-Johan", "Carl_Johan")) |>
mutate(author = str_replace(author, "Mary-Jane", "Mary_Jane")) |>
mutate(author = str_replace(author, "Lopez-Alt", "Lopez_Alt")) |>
mutate(author = str_replace(author, "Jones–Quartey",
"Jones_Quartey")) |>
mutate(author = str_replace(author, "Kendrick–Horton",
"Kendrick_Horton")) |>
mutate(author = str_replace(author, "-", "_")) |>
mutate(author = str_replace(author, "-", "_")) |>
mutate(author = str_replace(author, "–", "_"))
Again, we will add names to the above as we fine them.
Now, in our examination of the data, we saw that some of the author names contain double-quotes. Since strings are usually delimited by double-quotes, they can be tricky to handle when in the string. Let’s see how prevalent they are:
bookdata |>
filter(str_detect(author, "\"")) |>
group_by(author) |>
summarize(n = n()) |>
select(author, n) |>
arrange(desc(n), author)
# A tibble: 6 × 2
author n
<chr> <int>
1 "the staff of \"The Chew\"" 9
2 "Jimmy Fallon with the writers of \"Late Night\"" 6
3 "\"Lebanon\" Levi Stoltzfus" 5
4 "Loren Bouchard and the writers of \"Bob's Burgers\"" 5
5 "Rob \"Gronk\" Gronkowski with Jason Rosenhaus" 5
6 "Larry Bird, Earvin \"Magic\" Johnson and Jackie MacMullan" 4
Not a lot, but enough to worry about. Let’s go ahead and replace the double quote with a single quote. Note in the following that the representation of the double quote requires a backslash \
.
Now that the replacement has been made, let’s look at the prevalence of the single-quote in the data frame.
From the
booklong
data frame, filter those rows for whichauthor
contains a single-quote. Group those rows byauthor
. For each name, count the number of rows. Select bothauthor
and ’nand sort them in descending order by
nand alphabetically by
author`. Print the first 15 rows.
bookdata |>
filter(str_detect(author, "'")) |>
group_by(author) |>
summarize(n = n()) |>
select(author, n) |>
arrange(desc(n), author) |>
as.data.frame() |>
head(n=15)
author n
1 Bill O'Reilly and Martin Dugard 741
2 Dinesh D'Souza 70
3 Dick Lehr and Gerald O'Neill 52
4 Bill O'Reilly 45
5 Bryan Lee O'Malley 37
6 Jane O'Connor 32
7 America's Test Kitchen Kids 25
8 Bill O'Reilly and Dwight Jon Zimmerman 17
9 Keith O'Brien 13
10 Cynthia D'Aprix Sweeney 10
11 Bill O'Reilly and Bruce Feirstein 9
12 Cathy O'Neil 9
13 the staff of 'The Chew' 9
14 Bill O'Reilly and James Patterson 7
15 Jimmy Fallon with the writers of 'Late Night' 6
These results look good, with no surprises.
Now, for the following we are making another policy decision. We could make a separate field for these suffixes (Jr
, Sr
, etc.) but we are simply going to add them to the last name, again using the underscore character, so that R does not end up thinking that Jr
or IV
are the last names of some authors. We looked through the author
field using a set of str_detect()
functions, and we took now of how the suffixes are represented in the author names. We also include St
for all of the names such as St John
.
bookdata <-
bookdata |>
mutate(author = str_replace(author, ", Jr", "_Jr"),
author = str_replace(author, ", Sr", "_Sr"),
author = str_replace(author, " Jr", "_Jr"),
author = str_replace(author, " Sr", "_Sr"),
author = str_replace(author, " II", "_II"),
author = str_replace(author, " III", "_III"),
author = str_replace(author, " IV", "_IV"),
author = str_replace(author, " St ", " St_"),
author = str_replace(author, " St. ", " St_"))
The following command replaces several characters and/or phrases that essentially mean “and”:
bookdata <-
bookdata |>
mutate(author = str_replace(author, " & ", " and ")) |>
mutate(author = str_replace(author, "; ", " and ")) |>
mutate(author = str_replace(author, " as told to ", " and ")) |>
mutate(author = str_replace(author, " writing as ", " and ")) |>
mutate(author = str_replace(author, " lyrics ", " and ")) |>
mutate(author = str_replace(author, " Adapted ", " and ")) |>
mutate(author = str_replace(author, " created ", " and ")) |>
mutate(author = str_replace(author, " compiled ", " and ")) |>
mutate(author = str_replace(author, "lyrics ", "")) |>
mutate(author = str_replace(author, "Adapted ", "")) |>
mutate(author = str_replace(author, "created ", "")) |>
mutate(author = str_replace(author, "compiled ", "")) |>
mutate(author = str_replace(author, ", with ", " and ")) |>
mutate(author = str_replace(author, " and with ", " and ")) |>
mutate(author = str_replace(author, " with ", " and ")) |>
mutate(author = str_replace(author, " With ", " and "))
With the following command, we delete all of the text that we could find that basically means “and others that we are not going to name.” We delete this because we are using the author
column to collect the names of all the people who are specifically named in that column.
bookdata <-
bookdata |>
mutate(author = str_replace(author, " and others", "")) |>
mutate(author = str_replace(author, ", et al.", "")) |>
mutate(author = str_replace(author, ", et al", "")) |>
mutate(author = str_replace(author, " et al.", "")) |>
mutate(author = str_replace(author, " et al", "")) |>
mutate(author = str_replace(author, " and 639 Others", ""))
With the next command, we make the most basic replacement of the comma “, ” with the word “ and ”.
In this case, we use str_replace_all() since the comma can appear multiple times within one
author`. If we didn’t use it, then it would only replace the first comma—we want to replace all of them.
The following replaces the names of institutions, titles, and publications with the same name but with underscores instead of spaces. (Don’t worry; you will see why we care about this farther down this page.)
bookdata <-
bookdata |>
mutate(author = str_replace(author, "Garden and Gun", "Garden_and_Gun")) |>
mutate(author = str_replace(author,"Annex Security Team",
"Annex_Security_Team")) |>
mutate(author = str_replace(author,"the Late Show","The_Late_Show")) |>
mutate(author = str_replace(author,"Thug Kitchen","Thug_Kitchen")) |>
mutate(author = str_replace(author,
"America's Test Kitchen Kids",
"America's_Test_Kitchen_Kids")) |>
mutate(author = str_replace(author,
"America's Test Kitchen",
"America's_Test_Kitchen")) |>
mutate(author = str_replace(author,
"Queen Rania of Jordan",
"Queen_Rania_of_Jordan")) |>
mutate(author = str_replace(author,
"The Washington Post",
"The_Washington_Post")) |>
mutate(author = str_replace(author,
"Women's March Organizers",
"Women's_March_Organizers")) |>
mutate(author = str_replace(
author,
"the National Museum of African American History",
"the_National_Museum_of_African_American_History")) |>
mutate(author = str_replace(author, "One Direction", "One_Direction")) |>
mutate(author = str_replace(author,
"The Countess of Carnarvon",
"The_Countess_of_Carnarvon")) |>
mutate(author = str_replace(author,
"the Countess of Carnarvon",
"The_Countess_of_Carnarvon")) |>
mutate(author = str_replace(author, "George RR", "George R R")) |>
mutate(author = str_replace(author, "the Dalai Lama",
"the_Dalai_Lama")) |>
mutate(author = str_replace(author,"The Try Guys","The_Try_Guys")) |>
mutate(author = str_replace(author, "the Late Show", "the_Late_Show"),
author = str_replace(author, "The Daily Show", "The_Daily_Show"),
author = str_replace(author, "Late Nite", "Late_Nite"),
author = str_replace(author, "Wizards RPG Team",
"Wizards_RPG_Team"),
author = str_replace(author, "The Chew", "The_Chew"),
author = str_replace(author, "Althea Press", "Althea_Press"),
author = str_replace(author, "Good Housekeeping",
"Good_Housekeeping"),
author = str_replace(author, "Martha Stewart Living",
"Martha_Stewart_Living"),
author = str_replace(author, "National Geographic Kids",
"National_Geographic_Kids"),
author = str_replace(author, "Magica Quartet Staff",
"Magica_Quartet_Staff"),
author = str_replace(author, "Oh! Great Staff", "Oh!_Great_Staff"),
author = str_replace(author, "Coloring Books for Adults",
"Coloring_Books_for_Adults"),
author = str_replace(author,"the Adult Coloring Book Designs staff",
"the_Adult_Coloring_Book_Designs_Staff"),
author = str_replace(author, "Potter Style", "Potter_Style"),
author = str_replace(author, "Grumpy Cat staff", "Grumpy_Cat"),
author = str_replace(author, "the DK staff", "the_DK_staff"),
author = str_replace(author, "the Lucas Film Book Group",
"the_Lucas_Film_Book_Group"),
author = str_replace(author,
"the Lucasfilm Press",
"the_Lucasfilm_Press"),
author = str_replace(author, "Rockridge Press", "Rockridge_Press"),
author = str_replace(author, "the Mayo Clinic", "the_Mayo_Clinic"),
author = str_replace(author,
"the Weight Watchers",
"the_Weight_Watchers"),
author = str_replace(author, "Blue Star Books", "Blue_Star_Books"))
The following command deletes phrases that don’t help us name people in the author
column:
bookdata <-
bookdata |>
mutate(author = str_replace(author, "the members of the ", "")) |>
mutate(author = str_replace(author, "the members of ", "")) |>
mutate(author = str_replace(author, "members of the ", "")) |>
mutate(author = str_replace(author, "the special counsel", "")) |>
mutate(author = str_replace(author, "the writers of", "")) |>
mutate(author = str_replace(author, "the office of the ", "")) |>
mutate(author = str_replace(author, "the office of ", "")) |>
mutate(author = str_replace(author, "The Staff of ", "")) |>
mutate(author = str_replace(author, "the staff of ", "")) |>
mutate(author = str_replace(author, "the writers of ", "")) |>
mutate(author = str_replace(author, "the editors of", "")) |>
mutate(author = str_replace(author, "edited", "")) |>
mutate(author = str_replace(author, "the editors at ", "")) |>
mutate(author = str_replace(author, " and others", "")) |>
mutate(author = str_replace(author, "Various illustrators", ""))
And, here, in one final set of str_replace()
commands, we delete all of the role-related information. For this exercise, we do not care about the roles of the people in the author
column—we are simply looking to gather the names of the people named in it.
bookdata <-
bookdata |>
mutate(author = str_replace(author, "Edited ", "")) |>
mutate(author = str_replace(author, "from texts ", "")) |>
mutate(author = str_replace(author, "with an introduction ", " ")) |>
mutate(author = str_replace(author, "with related materials ", "")) |>
mutate(author = str_replace(author, "adapted", "")) |>
mutate(author = str_replace(author, "written", "")) |>
mutate(author = str_replace(author, " illustrated ", " and ")) |>
mutate(author = str_replace(author, " Illustrations ", " and ")) |>
mutate(author = str_replace(author, " Designed ", " and ")) |>
mutate(author = str_replace(author, " Translated ", " and ")) |>
mutate(author = str_replace(author, " Photographs ", " and ")) |>
mutate(author = str_replace(author, "illlustrated", "")) |>
mutate(author = str_replace(author, "Translated ", ""))
One last clean up from this part of the process—let’s ensure that extra spaces haven’t gotten introduced in the author
column. We will use the str_squish()
function to get rid of all of the extra whitespace.
2.5 Count number of matches: str_count()
The function str_count(), not surprisingly, counts the number of times one string appears in another one.
Suppose that we want to know the number of authors who appear in the 61k+ rows of the bookdata
data frame. We are not controlling for repeats.
From the vector
bookdata$author
(that is, the columnauthor
in thebookdata
data frame), go through it item-by-item and construct a new vector of the same size, where corresponding items in the new vector have the number of times" and "
appears plus one. Thus, if it findsc("x and y", "abc")
, then it would result inc(2, 1)
. Put this vector of numbers intonumOfAuthors
.
We can now do some calculations on this vector.
Print a vector of five values: the smallest number that appears in the vector, the average of all the numbers that appear in the vector, the median of all the numbers in that vector, the largest number in that vector, and the total of all of the numbers in the whole column. When calculating each of those statistics, remove all of the
NA
values first.
print(c(min(numOfAuthors, na.rm = TRUE),
mean(numOfAuthors, na.rm = TRUE),
median(numOfAuthors, na.rm = TRUE),
max(numOfAuthors, na.rm = TRUE),
sum(numOfAuthors, na.rm = TRUE)))
[1] 1.00000 1.21072 1.00000 5.00000 74291.00000
If you want to count but within the context of a tidyverse command, then you might use something like the following command.
First, create a new column
numOfAuthors
usingstr_count() to count the number of
” and “on each row and then adds 1. Then, since there isn't a
group_by(), the
summarize()command will calculate the
min,
mean,
median,
max, and
sum` for all of the rows.
bookdata |>
mutate(numOfAuthors = str_count(author, " and ") + 1) |>
summarize(
min = min(numOfAuthors, na.rm = TRUE),
mean = mean(numOfAuthors, na.rm = TRUE),
med = median(numOfAuthors, na.rm = TRUE),
max = max(numOfAuthors, na.rm = TRUE),
total = sum(numOfAuthors, na.rm = TRUE))
# A tibble: 1 × 5
min mean med max total
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1.21 1 5 74291
As we knew from previous print-outs of the data, the smallest value is 1
and the largest value is 5
. But we also found out the average and median. Clearly, most books in this data set are written by one author.
2.6 Substring Extraction: str_sub()
Extracting parts of a string can be very useful. Using str_sub()
, you can obtain specific parts of a string based on character positions.
It would be useful to have a shorter title for print purposes. Note the maximum title length:
So let’s define a new column shorttitle
using the str_sub()
command:
And check out the results; this lists the longest titles in the data set:
bookdata |>
mutate(titlelen = str_length(title)) |>
group_by(titlelen, title) |>
arrange(desc(titlelen)) |>
select(shorttitle, title) |>
head(5) |>
as.data.frame()
titlelen shorttitle
1 80 Walt Disney’s Uncle Scroo
2 77 Alice In The Country Of H
3 73 No Matter How I Look At I
4 73 No Matter How I Look At I
5 73 No Matter How I Look At I
title
1 WALT DISNEY’S UNCLE SCROOGE AND DONALD DUCK: THE DON ROSA LIBRARY, VOLS. 1 AND 2
2 ALICE IN THE COUNTRY OF HEARTS: THE MAD HATTER'S LATE NIGHT TEA PARTY, VOL. 1
3 NO MATTER HOW I LOOK AT IT, IT'S YOU GUYS' FAULT I'M NOT POPULAR!, VOL. 1
4 NO MATTER HOW I LOOK AT IT, IT'S YOU GUYS' FAULT I'M NOT POPULAR!, VOL. 2
5 NO MATTER HOW I LOOK AT IT, IT'S YOU GUYS' FAULT I'M NOT POPULAR!, VOL. 3
You can see that having a long title makes for a more difficult to read table of data. Having the shorttitle
available will allow us to print out something that fits onto screens and pages more easily.
2.7 Splitting Text: separate_wider_delim()
The function separate_wider_delim()
splits a string column into multiple new columns by a specified delimiter.
First, let’s remind ourselves of what the author
column contains. Specifically, let’s look at the longest strings in the author
column.
First, calculate a column
authlen
that contains the length of the string in theauthor
column. Next, group the rows together byauthlen
andauthor
. Summarize by counting the total rows for eachauthor
. Sort these rows in descending order by the length of the string in theauthor
column. Finally, select theauthlen
,author
, andcount
in the first 15 rows.
bookdata |>
mutate(authlen = str_length(author)) |>
group_by(authlen, author) |>
summarize(count = n()) |>
arrange(desc(authlen)) |>
select(authlen, author, count) |>
head(15)
# A tibble: 15 × 3
# Groups: authlen [12]
authlen author count
<int> <chr> <int>
1 93 Cassandra Clare and Sarah Rees Brennan and Maureen Johnson and… 9
2 87 Apostolos Doxiadis and Christos H Papadimitriou and Alecos Pap… 3
3 77 Jeremy Sorese and Ian Jones_Quartey and Asia Kendrick_Horton a… 1
4 75 Stanley McChrystal and Tantum Collins and David Silverman and … 1
5 68 Lily Blake and Evan Daugherty and John Lee Hancock and Hossein… 1
6 68 Neil Gaiman and Sam Kieth and Mike Dringenberg and Malcolm Jon… 1
7 68 Philip Houston and Michael Floyd and Susan Carnicero and Don T… 2
8 67 Anne Berest and Audrey Diwan and Caroline de Maigret and Sophi… 39
9 66 Bob Greene and Howard Lancer and Ronald L Kotler and Diane L M… 5
10 64 Amanda Berry and Gina DeJesus and Mary Jordan and Kevin Sulliv… 22
11 64 Peyton Manning and Eli Manning and Archie Manning and Jim Mads… 1
12 63 Grant Morrison and J G Jones and Carlos Pacheco and Doug Mahnke 5
13 61 Val Emmich and Steven Levenson and Benj Pasek and Justin Paul 16
14 60 Ben Nemtin and Dave Lingwood and Duncan Penn and Jonnie Penn 1
15 59 Kourtney Kardashian and Kim Kardashian and Khloé Kardashian 3
This shows us that the author
column now has a series of author names separated by " and "
—which is not an accident, as we have been working to do just that with the commands above here on the page.
What we want to do now is to put a single author in a single column. This is what the separate_wider_delim()
is for. So let’s see an example of this in application.
The results will be assigned to the
bookdata
data frame and comes from the same. Let’s step through the five arguments tostep_wider_delim()
: 1)author
: the column that is going to be separated; 2)delim
: the string that divides the values in theauthor
column; 3)names
: the names of the columns that will be created to hold the separated values from theauthor
column; 4)too_few
: this option tells R to put the values starting inauthor1
and fill the rest withNA
values once theauthor
field runs out of names; and 5) ’cols_remove: when sets to
FALSE, it will not delete the
author` column when it is done processing; we are saving it simply for data checking later in the process.
Let’s take a look at some of the results when there are at least two authors. What we should see is the first full name in the author
column put in the author1
column, the second full name in the author
column put in the author2
column, and so on.
# A tibble: 11,608 × 4
author author1 author2 author3
<chr> <chr> <chr> <chr>
1 James Patterson and Gabrielle Charbonnet James Pa… Gabrie… <NA>
2 Julia Child and Simone Beck and Louisette Bertholle Julia Ch… Simone… Louise…
3 Eric Shanower and Skottie Young Eric Sha… Skotti… <NA>
4 Nancy Butler and Hugo Petrus Nancy Bu… Hugo P… <NA>
5 Mark Millar and Steve McNiven Mark Mil… Steve … <NA>
6 Glenn Beck and Kevin Balfe Glenn Be… Kevin … <NA>
7 James Patterson and NaRae Lee James Pa… NaRae … <NA>
8 James Patterson and NaRae Lee James Pa… NaRae … <NA>
9 Clive Cussler and Dirk Cussler Clive Cu… Dirk C… <NA>
10 Guy Fieri and Ann Volkwein Guy Fieri Ann Vo… <NA>
# ℹ 11,598 more rows
And so we do.
Now let’s count up the number of NA
values in each of the author fields. We should see an increasing number of NA
values as we go from 1 to 5 (since we asked separate_wider_delim
to fill them in that order):
bookdata |>
summarize(a_na = sum(is.na(author)),
a1_na = sum(is.na(author1)),
a2_na = sum(is.na(author2)),
a3_na = sum(is.na(author3)),
a4_na = sum(is.na(author4)),
a5_na = sum(is.na(author5))) |>
as.data.frame()
a_na a1_na a2_na a3_na a4_na a5_na
1 69 69 49822 60230 61320 61418
And so we do. Similarly, we should see a decreasing number of non-’NA` values as we go from 1 to 5:
bookdata |>
summarize(a = sum(!is.na(author)),
a1 = sum(!is.na(author1)),
a2 = sum(!is.na(author2)),
a3 = sum(!is.na(author3)),
a4 = sum(!is.na(author4)),
a5 = sum(!is.na(author5))) |>
as.data.frame()
a a1 a2 a3 a4 a5
1 61361 61361 11608 1200 110 12
And, again, we do.
Let’s now see what some of this data looks like for the longer author
data.
For the
bookdata
data frame, filter the rows for which theauthor4
column is not empty. Calculate two new columns—endauthor
which is the last 35 characters of theauthor
column, andnumAuth
which is the number of full names in theauthor
column. Group those filtered rows byauthor
,numAuth
, andauthor4
, calculating the number of rows in each group. Select the columnsendauthor, author4
,author
, and that calculated count. Order the rows in descending order by the number of authors in theauthor
column.
bookdata |>
filter(!is.na(author4)) |>
mutate(endauthor = str_sub(author, -35, -1)) |>
mutate(numAuth = str_count(author, " and ") + 1) |>
group_by(endauthor, numAuth, author4) |>
summarize(n = n()) |>
select(endauthor, author4, n, numAuth) |>
arrange(desc(numAuth)) |>
head(15)
# A tibble: 15 × 4
# Groups: endauthor, numAuth [15]
endauthor author4 n numAuth
<chr> <chr> <int> <dbl>
1 " and Kelly Link and Robin Wasserman" Kelly Link 9 5
2 "Janelle and Christine and Ron Brown" Christine 3 5
3 " Caroline de Maigret and Sophie Mas" Sophie Mas 39 4
4 " John Lee Hancock and Hossein Amini" Hossein Amini 1 4
5 " and Carlos Pacheco and Doug Mahnke" Doug Mahnke 5 4
6 " and Mary Jordan and Kevin Sullivan" Kevin Sullivan 22 4
7 "Alecos Papadatos and Annie Di Donna" Annie Di Donna 3 4
8 "and Susan Carnicero and Don Tennant" Don Tennant 2 4
9 "d David Silverman and Chris Fussell" Chris Fussell 1 4
10 "d Ronald L Kotler and Diane L McKay" Diane L McKay 5 4
11 "e Dringenberg and Malcolm Jones_III" Malcolm Jones_III 1 4
12 "g and Archie Manning and Jim Madsen" Jim Madsen 1 4
13 "ia Kendrick_Horton and Rachel Dukes" Rachel Dukes 1 4
14 "nson and Benj Pasek and Justin Paul" Justin Paul 16 4
15 "ood and Duncan Penn and Jonnie Penn" Jonnie Penn 1 4
In this table above, look at the author4
column and compare the value to the appropriate name in the author
column. They’re the same and each one looks like a full name. That’s a win!
2.7.1 Example of making long data
This section uses a function that is described in Wide to long data. Useful original documentation can be found on this page.
Reading through the below might help the reader gain some insight into how he/she might combine functions in a way that might not have been clear otherwise. It is not directly related to the topic of string functions, but it does provide the motivation for why we did much of the above.
First, we’re done with the author
column, so let’s delete it from the bookdata
data frame:
What we’re going to do now is make a new data frame booklong
that combines the data in the authorX
columns into one new column in a way that makes it easier to manipulate and analyze author information.
An informal, unstructured description of the following command is as follows: Create a new column called Auth_Name
and put all the author names from the five authorX
columns in it. In the new Auth_Rank
column, put a 1
in the column when you’ve put an author name from author1
column; put a 2
in the column when you’ve put an author name from author2
column; etc. Don’t create a new row of data when there’s an NA
value in an authorX
column.
The data from
bookdata
are manipulated and then are assigned to the new data framebooklong
. Consider the six arguments to thepivot_longer()
function: 1)cols
: the “wide columns” that are going to be converted into long columns; 2)values_to
: the name of the new column that will contain the data stored in the current wide columns; 3)names_to
: the name of the new column that usually contains the names of the old wide columns (i.e.,"author1"..."author5"
); 4)names_prefix
: this tells R to remove the string"author"
from the old wide columns (named with thecols
argument) so that only"1"..."5"
are left; 5)names_transform
: this tells R to convert the"1"..."5"
to integers1...5
when putting the value in the column named bynames_to
; 6)values_drop_na
: this tells R to not create a new row when the value in thevalues_to
column is going to be anNA
.
We now have a much different data frame compared to what we started with. How big is it?
It has many more rows, as expected. In fact, it has the same number of rows as the number of authors that was calculated above.
Now we can ask a question that we couldn’t easily ask before: How many distinct authors are represented in the NYTimes Best Seller list from 2010–2019?
Only just over 4000 authors are responsible for all of the books.
Let’s see how the new data frame looks and how it has captured the same information.
booklong |>
arrange(pubdate, listname, rank, Auth_Rank) |>
select(rank, shorttitle, Auth_Rank, Auth_Name) |>
head(15) |>
as.data.frame()
rank shorttitle Auth_Rank Auth_Name
1 1 Witch And Wizard 1 James Patterson
2 1 Witch And Wizard 2 Gabrielle Charbonnet
3 2 The Hunger Games 1 Suzanne Collins
4 3 Catching Fire 1 Suzanne Collins
5 4 The Magician’s Elephant 1 Kate DiCamillo
6 5 Fallen 1 Lauren Kate
7 1 Guinness World Records 20 1 Craig Glenday
8 2 Mastering The Art Of Fren 1 Julia Child
9 2 Mastering The Art Of Fren 2 Simone Beck
10 2 Mastering The Art Of Fren 3 Louisette Bertholle
11 3 The Pioneer Woman Cooks 1 Ree Drummond
12 4 4-Hour Workweek 1 Timothy Ferriss
13 5 It's Your Time 1 Joel Osteen
14 1 The Lost Symbol 1 Dan Brown
15 2 Cross Fire 1 James Patterson
Consider the first two rows of the above. It tells us that both James Patterson
(row 1) and Gabrielle Charbonnet
(row 2) are the authors of Witch and Wizard
(which is what we saw in the original table back in this section)
Now that we have all of the author names in one column, it’s much more convenient to do some data clean-up. First, let’s get rid of extra whitespace:
And now, let’s get rid of those authorX
columns since all of the information is now in the Auth_Name
column.
2.8 Combining Text: str_c()
The str_c()
function from the stringr
package helps concatenate strings with much flexibility (though we only discuss a limited set of features here).
Let’s assume that we want to use a character string instead of the integer in Auth_List
. We can use str_c
to create the string that we need.
Add a new column
AuthOrd
to thebooklong
data frame (but don’t save it for now) such that it is the combination of the stringAuth
and the character representation ofAuth_Rank
(e.g.,"1"
,"2"
) where those are separated by a dash. Sort the results and then select that new column, the original column, andAuth_Name
.
booklong |>
mutate(AuthOrd = str_c("Auth",
as.character(Auth_Rank),
sep = "-")) |>
arrange(pubdate, rank, isbn, Auth_Rank) |>
select(AuthOrd, Auth_Rank, Auth_Name)
# A tibble: 74,291 × 3
AuthOrd Auth_Rank Auth_Name
<chr> <int> <chr>
1 Auth-1 1 Jane O’Connor
2 Auth-1 1 Guy Fieri
3 Auth-2 2 Ann Volkwein
4 Auth-1 1 Sarah Palin
5 Auth-1 1 James Patterson
6 Auth-2 2 Gabrielle Charbonnet
7 Auth-1 1 Stephenie Meyer
8 Auth-1 1 Markus Zusak
9 Auth-1 1 Dan Brown
10 Auth-1 1 R Crumb
# ℹ 74,281 more rows
You can see that the number at the end of the AuthOrd
column is the number in its corresponding Auth_Rank
.
2.9 Splitting a string: str_split()
The ‘str_split()’ splits a string at a specific character into a vector of strings. You tell R how you want it to split the string by specifying the separator character.
Now that we have all of the author names in the Auth_Name
column, let’s divide it into FirstName
and LastName
. This will not always work as many names are multiple words or initials; however, to make it be more correct we would have to implement a function and this is too much of advanced topic for us right now.
But that doesn’t mean that we can’t try!
We will first implement a simpler version.
The data is coming from
booklong
and will be stored inbooklong
. First, we create thelen_in_words
column as the number of words in theAuth_Name
column. Next, we tell R to create theFirstName
column as the first word ofAuth_Name
. Then, if there are 2 words inAuth_Name
, thenLastName
is that second word; however, if there are 3 words inAuth_Name
, thenLastName
is that third word; finally, if there are not 2 or 3 words (i.e., 1 word or more than 3 words), then use the last word inAuth_Name
.
Let’s use the following to print author information for a sampling of rows from booklong
.
R uses data from
booklong
. It groups by four fields—but notice that once you’ve specifiedAuth_Name
, the other three columns will have the same values. Count the number of rows per group. Then filter eitherAuth_Name
containing one of six names orAuth_Name
containing the string “_Show”. Select the columns that we group_by plus the count we calculated. Finally, sort byLastName
and thenFirstName
. Print all the rows.
booklong |>
group_by(Auth_Name, FirstName, LastName, len_in_words) |>
summarize(n = n()) |>
filter(Auth_Name %in% c("Suzanne Collins", "Ronald L Kotler",
"George R R Martin", "G B Trudeau",
"Chris Van Dusen", "Bessel van der Kolk") |
str_detect(Auth_Name, "_Show")) |>
select(Auth_Name, FirstName, LastName, len_in_words, n) |>
arrange(LastName, FirstName) |>
print(n=Inf)
# A tibble: 8 × 5
# Groups: Auth_Name, FirstName, LastName [8]
Auth_Name FirstName LastName len_in_words n
<chr> <chr> <chr> <int> <int>
1 Suzanne Collins Suzanne Collins 2 276
2 Chris Van Dusen Chris Dusen 3 1
3 Bessel van der Kolk Bessel Kolk 4 72
4 Ronald L Kotler Ronald Kotler 3 5
5 George R R Martin George Martin 4 225
6 The_Daily_Show The_Daily_Show The_Daily_Show 1 1
7 The_Late_Show The_Late_Show The_Late_Show 1 7
8 G B Trudeau G Trudeau 3 17
A few observations:
- For the name “Chris Van Dusen”, this function determined that “Dusen” is the last name. Similar to “Bessel van der Kolk” for which it specified “Kolk” as the last name.
- For both “R L Stine” and “G B Trudeau”, the function chose just the first initial for the first name.
- For both of the shows, the full show name is given as both the first and last names.
These aren’t necessarily right or wrong, but we wanted to show another, somewhat more complicated way, of specifying the first and last name. To that we turn now.
R assigns the results to
booklong
and the original data comes frombooklong
. First, calculate three new columns:len_in_words
,FirstWord
, andSecWord
. (These last two are simply the first and second word ofAuth_Name
; we calculate them to make the code that follows easier to read.) Next, create theFirstName
column in the following way: ifFirstWord
is just one letter, then defineFirstName
as the first two words; ifFirstWord
is more than one letter, then defineFirstName
asFirstWord
. Now, the calculation ofLastName
is much more complicated than before. IfAuth_Name
is just two words long, then defineLastName
asSecWord
. IfAuth_Name
is three words long, then do the following: IfSecWord
is a connector (i.e., “le”, “von”, etc.), then setLastName
to the second and third words; otherwise, set it to the third word. Now, ifAuth_Name
is four words long, then do the following: If the second and third words are either “de la” or “van der”, then defineLastName
asAuth_Name
; otherwise, set it to the fourth word. IfAuth_Name
is not 2, 3, or 4 words long, then defineLastName
as the last word ofAuth_Name
. Finally, delete the two columnsFirstWord
andSecWord
.
booklong <-
booklong |>
mutate(len_in_words = str_count(Auth_Name, boundary("word")),
FirstWord = word(Auth_Name, 1),
SecWord = word(Auth_Name, 2)) |>
mutate(FirstName =
ifelse(str_length(FirstWord) == 1,
str_c(FirstWord, " ", SecWord),
FirstWord),
LastName =
ifelse(len_in_words == 2,
SecWord,
ifelse(len_in_words == 3,
ifelse(SecWord %in% c("le", "von", "van",
"du", "Du", "Van",
"Von", "Le"),
word(Auth_Name, 2, 3),
word(Auth_Name, 3)),
ifelse(len_in_words == 4,
ifelse(word(Auth_Name, 2, 3) %in%
c("de la", "van der"),
word(Auth_Name, 2, 4),
word(Auth_Name, 4)),
word(Auth_Name, -1))))) |>
select(-c(FirstWord, SecWord))
That was a lot. To be clear, we did not come up with this on the first try. We wrote this iteratively, fixing problems as we went along.
Now let’s use the following (the same command as above) to see the results of this second approach.
booklong |>
group_by(Auth_Name, FirstName, LastName, len_in_words) |>
summarize(n = n()) |>
filter(Auth_Name %in% c("Suzanne Collins", "Ronald L Kotler",
"George R R Martin", "G B Trudeau",
"Chris Van Dusen", "Bessel van der Kolk") |
str_detect(Auth_Name, "_Show")) |>
select(Auth_Name, FirstName, LastName, len_in_words, n) |>
arrange(LastName, FirstName) |>
print(n=Inf)
# A tibble: 8 × 5
# Groups: Auth_Name, FirstName, LastName [8]
Auth_Name FirstName LastName len_in_words n
<chr> <chr> <chr> <int> <int>
1 Suzanne Collins Suzanne Collins 2 276
2 Ronald L Kotler Ronald Kotler 3 5
3 George R R Martin George Martin 4 225
4 The_Daily_Show The_Daily_Show The_Daily_Show 1 1
5 The_Late_Show The_Late_Show The_Late_Show 1 7
6 G B Trudeau G B Trudeau 3 17
7 Chris Van Dusen Chris Van Dusen 3 1
8 Bessel van der Kolk Bessel van der Kolk 4 72
A few observations:
- For the name “Chris Van Dusen”, this function determined that “Van Dusen” is the last name. Similar to “Bessel van der Kolk” for which it specified “van der Kolk” as the last name.
- For both “R L Stine” and “G B Trudeau”, the function define the first name to consist of both initials.
- The same as before: for both of the shows, the full show name is given as both the first and last names.
2.10 String Padding: str_pad()
The ‘str_pad()’ function adds padding characters to strings, which can be useful for formatting text or creating fixed-width file outputs.
Consider the following demonstration of this function.
Select data from
booklong
. Define three new columns: 1)strL
will be padded with “-” on the left; 2)strR
is padded with “+” on the right; and 3)strB
is padded on both sides with “*” so that it is centered in the space. SelectFirstName
plus those three columns. Print the first 6 rows.
booklong |>
mutate(strL = str_pad(FirstName, 15, side = "left", pad = "-"),
strR = str_pad(FirstName, 15, side = "right", pad = "+"),
strB = str_pad(FirstName, 15, side = "both", pad = "*")) |>
select(FirstName, strL, strR, strB) |>
head(n=6) |>
as.data.frame()
FirstName strL strR strB
1 James ----------James James++++++++++ *****James*****
2 Gabrielle ------Gabrielle Gabrielle++++++ ***Gabrielle***
3 Suzanne --------Suzanne Suzanne++++++++ ****Suzanne****
4 Suzanne --------Suzanne Suzanne++++++++ ****Suzanne****
5 Kate -----------Kate Kate+++++++++++ *****Kate******
6 Lauren ---------Lauren Lauren+++++++++ ****Lauren*****
Of course, you could choose to pad the strings with whatever character that you would like (e.g., a space “ ”). We chose the above so that it would be easier to see.