String functions

On this page, we use data from the New York Times Best Sellers list from 2010 to 2019 in order to demonstrate the range of capabilities of R’s string manipulation functions.

The R Studio string function cheatsheet can be found here..

1 Case study: Books

Before getting into the string functions themselves, we go through the process of importing the data, cleaning up the columns, and finally cleaning up the contents of the columns.

1.1 Import

The first step, as always, is to import the CSV file into a data frame.

books <- read_csv("../data/class202405/bestsellers.csv")

You will find that this data is sufficiently complex that it lends itself to challenges that will push us to understand R’s capabilities. We attempt to demonstrate and explain the functions while using them in a realistically complex use case.

1.2 Column clean-up

Here is some information about this large dataset:

books |> glimpse()

Rows: 61,430
Columns: 12
$ published_date     <date> 2010-01-03, 2010-01-03, 2010-01-03, 2010-01-03, 20…
$ list_name          <chr> "Chapter Books", "Chapter Books", "Chapter Books", …
$ list_name_encoded  <chr> "chapter-books", "chapter-books", "chapter-books", …
$ rank               <dbl> 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, …
$ isbn13             <chr> "9780316036245", "9780439023481", "9780439023498", …
$ isbn10             <chr> "0316036242", "0439023483", "0439023491", "07636441…
$ title              <chr> "WITCH AND WIZARD", "THE HUNGER GAMES", "CATCHING F…
$ author             <chr> "James Patterson and Gabrielle Charbonnet", "Suzann…
$ description        <chr> "One of each, brother and sister, flex their newfou…
$ amazon_product_url <chr> "https://www.amazon.com/Witch-Wizard-James-Patterso…
$ price              <dbl> 17.99, 17.99, 17.99, 16.99, 17.99, 28.95, 40.00, 27…
$ weeks_on_list      <dbl> 1, 67, 16, 15, 2, 14, 13, 8, 56, 7, 14, 5, 6, 38, 3…

Fortunately, the published_date column is already recognized as a date, so we don’t have to coerce that column. Same thing for rank, price, and weeks_on_list— each one is already recognized as a numeric value.

For a bit more information, let’s skim() the data:

skim(books)

Data summary
Name	books
Number of rows	61430
Number of columns	12
_______________________
Column type frequency:
character	8
Date	1
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
list_name	0	1.00	5	36	59
list_name_encoded	0	1.00	5	36	59
isbn13	4	1.00	13	13	8928
isbn10	1284	0.98	4	10	8124
title	0	1.00	1	80	6577
author	69	1.00	1	84	3609
description	8119	0.87	9	315	8205
amazon_product_url	0	1.00	38	156	6719

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
published_date	0	1	2010-01-03	2019-12-29	2015-05-03	522

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
rank	1	3.00	1.41	1	2	3	4	5	▇▇▇▇▇
price	1	3.86	8.79	0	0	0	0	150	▇▁▁▁▁
weeks_on_list	1	24.71	63.53	0	0	2	16	607	▇▁▁▁▁

The first thing that to do is to simplify the data frame. We remove a few columns from the analysis (to simplify the presentation). We also shorten the names of four columns.

We assign the data to a new data frame bookdata but the data comes from the books data frame (that we just created with the read.csv statement above). We then delete four columns by using select() with a minus in front of a list. Finally, with rename() we rename four of the columns.

bookdata <-
  books |> 
    select(-c(list_name, isbn10, amazon_product_url, price)) |>
    rename(listname = list_name_encoded,
           weeks = weeks_on_list,
           pubdate = published_date,
           isbn = isbn13)

Notice that we still have the original books data frame so that we can easily start the analysis over if we need to.

1.3 Column contents clean-up

Let’s take a look at a bit of the data, selecting every column except description (because it is so long):

bookdata |> 
  select(pubdate, title, author, listname, 
         rank, weeks, isbn) |>  # not printing description
  arrange(pubdate, listname, rank) |> 
  head(n=6)

# A tibble: 6 × 7
  pubdate    title                       author       listname  rank weeks isbn 
  <date>     <chr>                       <chr>        <chr>    <dbl> <dbl> <chr>
1 2010-01-03 WITCH AND WIZARD            James Patte… chapter…     1     1 9780…
2 2010-01-03 THE HUNGER GAMES            Suzanne Col… chapter…     2    67 9780…
3 2010-01-03 CATCHING FIRE               Suzanne Col… chapter…     3    16 9780…
4 2010-01-03 THE MAGICIAN’S ELEPHANT     Kate DiCami… chapter…     4    15 9780…
5 2010-01-03 FALLEN                      Lauren Kate  chapter…     5     2 9780…
6 2010-01-03 GUINNESS WORLD RECORDS 2010 edited  Cra… hardcov…     1    14 9781…

We are going to spend the rest of this section making factors out of two columns and then managing empty values in the data frame.

1.3.1 Make factors

Let’s now take a few moments to investigate the data in more depth. First, let’s see how many distinct values are in each column. (Note that you do not have to understand this command at all in order to use it. After the first line (with bookdata), you can just copy-and-paste in order to use it on any table with any type of data.)

First, convert all columns in bookdata to character columns; if there are multiple data types, then this command would give an error without this line. The second line converts the output from a wide table to a long one. The third line groups the results by variable names and the fourth line counts the number of distinct values per variable.

bookdata |>
  mutate(across(everything(), as.character)) |> 
  pivot_longer(everything(), names_to = "variable", values_to = "value") |>
  group_by(variable) |>
  summarise(n_distinct = n_distinct(value)) |> 
  print()

# A tibble: 8 × 2
  variable    n_distinct
  <chr>            <int>
1 author            3610
2 description       8206
3 isbn              8929
4 listname            59
5 pubdate            522
6 rank                 5
7 title             6577
8 weeks              598

Given how many unique values are in rank and listname, let’s go ahead and turn those columns into factors (as discussed in this section:

bookdata$rank <- factor(bookdata$rank)
bookdata$listname <- factor(bookdata$listname)

The ordering of the levels for both factors can stay as the default, so we don’t have to do anything with that.

1.3.2 Managing empty values

Next, let’s see how many string values signifying “empty” are in the character columns. Much of this process is discussed in greater depth on this page. This section doesn’t have anything in particular to do with string functions, so you can go on to the next section without missing anything on these functions. (But come back to it later when you’re managing empty values in your own data.)

Here’s an interpretation of the following command:

R will look at every character column. It will compute the sum, per column, of every observation that has one of the strings "NA", "--", or "". It is an R standard for use .x to refer back to a column—in this case, R is iterating through every character column (since this is what is prescribed in the first argument of the across() function), so .x represents the name of each character column in turn.

bookdata |> 
  summarize(across(where(is.character),
                   ~sum(.x %in% c("NA", "--", "")))) |> 
  glimpse()

Rows: 1
Columns: 4
$ isbn        <int> 0
$ title       <int> 0
$ author      <int> 0
$ description <int> 0

Great! There don’t appear to be any of these values in the dataset. We might come across them later, but none of these strings are used in this dataset at this point.

Finally, let’s see how many actual NA values are in any of the columns. (Again, go to this page if you want to learn about handing NA values in more depth.)

R looks at every character column. For each one of those columns, it computes the sum of how many rows have an NA column.

bookdata |> 
  summarize(across(everything(),
                   ~sum(is.na(.x)))) |> 
  glimpse()

Rows: 1
Columns: 8
$ pubdate     <int> 0
$ listname    <int> 0
$ rank        <int> 0
$ isbn        <int> 4
$ title       <int> 0
$ author      <int> 69
$ description <int> 8119
$ weeks       <int> 0

It’s the isbn, author, and description columns that contain NA values. Let’s take a moment and investigate which books have an NA value in the isbn column.

R filters those rows for which the isbn column has the NA value, and then it selects the isbn, title, and author columns.

bookdata |> 
  filter(is.na(isbn)) |> 
  select(isbn, title, author)

# A tibble: 4 × 3
  isbn  title          author         
  <chr> <chr>          <chr>          
1 <NA>  SUMMER SECRETS Barbara Freethy
2 <NA>  SUMMER SECRETS Barbara Freethy
3 <NA>  SUMMER SECRETS Barbara Freethy
4 <NA>  SUMMER SECRETS Barbara Freethy

Apparently just one book doesn’t have a value in the isbn column.

For purposes of our analysis, we want to be able to assume that every book has an isbn value, so we need to come up with a default value that works for this book (and for any future book that might appear on future lists).

We are going to define a default isbn value for all books that do not have one. We define this as the first 13 lowercase letters (with spaces removed); if the title is shorter than that, then we add spaces onto the end in order to make it 13 characters.

R defines four new columns: titlelen is the length of the title, lowertitle is the lowercase letters from the title with the spaces removed, shorttitle is the first 13 letters of the lowertitle (though it may be less than that if the title does not have 13 letters in it), and padtitle is the shorttitle padded out to 13 characters (with _) if it is not at least that long already. The command then applies an ifelse() in order to apply the correct value to isbn depending on if it is blank. Finally, the command finishes by deleting those four temporary fields.

bookdata <-
  bookdata |> 
    mutate(
      titlelen = length(title),
      lowertitle = tolower(str_remove(title, " ")),
      shorttitle = str_sub(lowertitle, 1, 13),
      padtitle = str_pad(shorttitle, 
                         13,
                         side = "right",
                         pad = "_"),
      isbn = ifelse(is.na(isbn), 
                    ifelse(titlelen > 13,
                           shorttitle,
                           padtitle),
                    isbn)) |> 
  select(-c(titlelen, lowertitle, shorttitle, padtitle))

Let’s take a look back at bookdata to ensure that the book now has a value in the isbn column:

R looks for an isbn value with either an underscore in it or the word summer in it. It then selects the values from the isbn, title, and author columns.

bookdata |> 
  filter(str_detect(isbn, "_") | str_detect(isbn, "summer")) |> 
  select(isbn, title, author)

# A tibble: 4 × 3
  isbn          title          author         
  <chr>         <chr>          <chr>          
1 summersecrets SUMMER SECRETS Barbara Freethy
2 summersecrets SUMMER SECRETS Barbara Freethy
3 summersecrets SUMMER SECRETS Barbara Freethy
4 summersecrets SUMMER SECRETS Barbara Freethy

There’s just the one book and it now has an appropriate (temporary) isbn number.

Let’s ensure that we now have just two columns with NA values in them. This command is copied directly from above.

bookdata |> 
  summarize(across(everything(),
                   ~sum(is.na(.x)))) |> 
  glimpse()

Rows: 1
Columns: 8
$ pubdate     <int> 0
$ listname    <int> 0
$ rank        <int> 0
$ isbn        <int> 0
$ title       <int> 0
$ author      <int> 69
$ description <int> 8119
$ weeks       <int> 0

And so it is—it’s only the author and description columns that have NA values.

1.4 Ready for analysis

Here is the bookdata data frame, all ready for analysis:

summary(bookdata)

    pubdate                              listname     rank     
 Min.   :2010-01-03   hardcover-fiction      : 2610   1:12286  
 1st Qu.:2013-02-03   hardcover-nonfiction   : 2610   2:12286  
 Median :2015-05-03   paperback-nonfiction   : 2610   3:12286  
 Mean   :2015-01-28   picture-books          : 2610   4:12286  
 3rd Qu.:2016-10-02   series-books           : 2610   5:12286  
 Max.   :2019-12-29   trade-fiction-paperback: 2610            
                      (Other)                :45770            
     isbn              title              author          description       
 Length:61430       Length:61430       Length:61430       Length:61430      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
     weeks       
 Min.   :  0.00  
 1st Qu.:  0.00  
 Median :  2.00  
 Mean   : 24.71  
 3rd Qu.: 16.00  
 Max.   :607.00

2 Functions

Text manipulation and calculation are essential parts of data wrangling, and there are many functions in R that one can use to perform a variety of text-related tasks.

In the following we apply many of these tidyverse functions to the bookdata dataframe that we explain above. While it is the case that each section (and function description) stands alone, the entirety of this page goes through the process of cleaning up the bookdata data frame so that it is ready for processing.

We highly recommend reading this page from beginning to end at some point.

2.1 Trimming White Spaces: `str_trim()` and `str_squish()`

The function str_trim() removes leading and trailing white spaces from strings, essential for cleaning up text data. The function str_squish() also removes extra space (not all the spaces; just extra spaces) from within a string in addition to those leading and trailing the string.

Let’s first get a baseline that tells us how many characters are in the title column:

In this command, we first calculate the length of the title column and store it in the (temporary) column strlen. We then calculate the total number of characters and store it in another value named title_total. Finally, we print that value out.

bookdata |> 
  mutate(strlen = str_length(title)) |> 
  summarize(title_total = sum(strlen, na.rm = TRUE)) |> 
  select(title_total) |> 
  as.data.frame()

  title_total
1     1093865

Take note of that, and now let’s go through and remove extra whitespace with the following:

R modifies both the title column twice. The first time it trims all extra spaces from both the beginning and end of each column. The second time, it trims extra spaces from the beginning, the end, and in the middle of each field.

bookdata <-
  bookdata |> 
    mutate(title = str_trim(title, side = "both")) |> 
    mutate(title = str_squish(title))

It is absolutely true that we didn’t need to do the str_trim() commands since we are following them with the str_squish commands. However, we wanted to demonstrate both of them for you.

Now let’s see how many characters remain in that column:

bookdata |> 
  mutate(strlen = str_length(title)) |> 
  summarize(title_total = sum(strlen, na.rm = TRUE)) |> 
  select(title_total) |> 
  as.data.frame()

  title_total
1     1093828

We don’t end up getting rid of much whitespace; however, getting rid of any can help us make some assumptions about the text in that column as we continue to process it.

Note (again) that you will, generally, never need to do both str_trim() and str_squish(). We did this just for demonstration purposes. If it was for a real analysis, we would have just used the str_squish() function.

Let’s go ahead and use str_squish() on both the author and description columns:

bookdata <-
  bookdata |> 
    mutate(author = str_squish(author),
           description = str_squish(description))

These should also help us out as we continue.

2.2 Text Transformation: `toupper()`, `tolower()`, and `str_to_title()`

These functions transform text to lower case, upper case, or title case. These are frequently used to ensure consistency in your text data.

Let’s take a quick look at the first few values in the title column.

Select the first four values from the title column from bookdata. Then (via the as.data.frame() function), remove the data types from the top row of the printout.

bookdata |> select(title) |> head(n=4) |> as.data.frame()

                    title
1        WITCH AND WIZARD
2        THE HUNGER GAMES
3           CATCHING FIRE
4 THE MAGICIAN’S ELEPHANT

Now that you know the original values for title, let’s take a look at these three functions.

Define three new (temporary) columns: lowert which is all lowercase; titlet which is “Title Case”; and uppert which is all UPPERCASE. Then select the top 4 values from those three columns.

bookdata |> 
  mutate(lowert = tolower(title),
         titlet = str_to_title(title),
         uppert = toupper(title)) |> 
  select(lowert, titlet, uppert) |> 
  head(n=4) |> 
  as.data.frame()

                   lowert                  titlet                  uppert
1        witch and wizard        Witch And Wizard        WITCH AND WIZARD
2        the hunger games        The Hunger Games        THE HUNGER GAMES
3           catching fire           Catching Fire           CATCHING FIRE
4 the magician’s elephant The Magician’s Elephant THE MAGICIAN’S ELEPHANT

You can see the results of these functions by comparing these three columns to the previous table. No real surprises, but it’s nice to know that these are available.

2.3 Extracting and Detecting: `str_extract()`, `str_detect()`

These functions allow for the extraction of or detection of patterns or words within strings. In this section, we focus on the use of str_detect(). We use it to explore the author field, a process which will lead us to do some further processing, as you will see.

First, let’s see if the word “editor” or “Edited” appear in the author column of bookdata.

R filters the rows for which either “editor” or “Edited” is in the author column by using the str_detect() function on the author column. It then groups the results by author, calculating the total number of rows that has that value in the author column. It then sorts in descending order by that total, printing out the author column and total.

bookdata |> 
  filter(str_detect(author, "editor") | 
           str_detect(author, "Edited")) |> 
  group_by(author) |> 
  summarize(n = n()) |> 
  arrange(desc(n)) |> 
  select(author, n) |> 
  as.data.frame()

                                                author  n
1               the editors of Garden and Gun Magazine 22
2                       the editors of Blue Star Books 17
3                       the editors of Rockridge Press 12
4  David DiBenedetto and the editors of Garden and Gun  5
5                                  the editors of Goop  5
6                                the editors of Sports  5
7                                the editors of domino  4
8                                     Edited R L Stine  3
9                                the editors at Cook’s  2
10                    the editors of Good Housekeeping  2
11  the editors of America's Test Kitchen and Guy Cros  1
12                the editors of Martha Stewart Living  1

We’re now going to look for a second set of words—those which might substitute for "and" within the author column. When you are looking for a word, you have to take some precaution so that you find the whole word and not text within a word. (Note that we didn’t simply make these up. We looked through several hundred rows within the column to see what kind of data that it contains. This is the result of that visual scan.)

This command has the same structure as the previous one, so we will not go through it in detail.

bookdata |> 
  filter(str_detect(author, " and ") |
           str_detect(author, " & ") |
           str_detect(author, " with ") |
           str_detect(author, ", ") |
           str_detect(author, "; ")) |> 
  group_by(author) |> 
  summarize(n = n()) |> 
  arrange(desc(n)) |> 
  select(author, n) |> 
  head(n=10) |> 
  as.data.frame()

                                          author   n
1                Bill O'Reilly and Martin Dugard 741
2                   Todd Burpo with Lynn Vincent 368
3                 Heidi Murkoff and Sharon Mazel 293
4              Robert Kirkman and Charlie Adlard 279
5           Malala Yousafzai with Christina Lamb 220
6  Chris Kyle with Scott McEwen and Jim DeFelice 208
7              James Patterson and Maxine Paetro 205
8       Sherri Duskey Rinker and Tom Lichtenheld 198
9              Sheryl Sandberg with Nell Scovell 142
10                 Brian Kilmeade and Don Yaeger 138

It’s clear that we have to standardize the handlings of "and" equivalents given the huge number of rows that have one or more of those strings in the author column.

We will not cover str_extract() here as it depends on a knowledge of regular expressions—a topic of nuance and substance. However, if your job depends on doing lots of text wrangling, then you owe it to yourself to explore regexs.

2.4 Pattern Matching and Replacement: `str_replace()`

The function str_replace() allows the user to search for a specific pattern or word in a string and replace it with another word or pattern. Very useful for data cleaning and standardization.

We are going to go through several different types of fixes that a data analyst has to take care of. These tasks are especially important when you are working with data coming from someone else. You have to look through the data for all types of errors, and then fix them as consistently and carefully as possible.

In this first command, we fix typographical errors in the author column.

We must remember to assign the results back to the bookdata data frame, so this goes on the first row. The second row says that the resulting data comes from the bookdata data frame. We then use four mutate() and str_replace() functions to fix four author fields that had typographical errors in them.

bookdata <-
  bookdata |> 
    mutate(author = str_replace(author, "Alan Moore Dave Gibbons",
                                "Alan Moore and Dave Gibbons")) |> 
    mutate(author = str_replace(author, "Dan Kainen,", "Dan Kainen")) |> 
    mutate(author = str_replace(author, "Slahi,", "Slahi")) |> 
    mutate(author = str_replace(author, "Archie Manning Jim Madsen",
                                "Archie Manning and Jim Madsen")) |> 
    mutate(author = str_replace(author, "Wesley King Created Kobe Bryant",
                                "Wesley King and Kobe Bryant")) |> 
    mutate(author = str_replace(author, "Svetlana Alexievich Keith Gessen",
                                "Svetlana Alexievich and Keith Gessen")) |> 
    mutate(author = str_replace(author, "Proust Adapted and", "Proust"))

We are sure that there are more in the column similar to these; as we find them, we will add them to the above command.

The following is not fixing a typo, but it is acknowledging a reality of how R handles character strings. The dash "-" character is treated as a separation between two separate words, while an underscore "_" character is treated as part of a word. Below, we are going to be dividing up names into first names and last names. We want a name such as Carl-Johan to counted as the first name and not a first and middle name; as such, we want to join Carl and Johan with an underscore instead of a dash. This same analysis goes for Lopez-Alt.

bookdata <-
  bookdata |> 
    mutate(author = str_replace(author, "Carl-Johan", "Carl_Johan")) |> 
    mutate(author = str_replace(author, "Mary-Jane", "Mary_Jane")) |> 
    mutate(author = str_replace(author, "Lopez-Alt", "Lopez_Alt")) |> 
    mutate(author = str_replace(author, "Jones–Quartey", 
                                "Jones_Quartey")) |> 
    mutate(author = str_replace(author, "Kendrick–Horton", 
                                "Kendrick_Horton")) |> 
    mutate(author = str_replace(author, "-", "_")) |> 
    mutate(author = str_replace(author, "-", "_")) |> 
    mutate(author = str_replace(author, "–", "_"))

Again, we will add names to the above as we fine them.

Now, in our examination of the data, we saw that some of the author names contain double-quotes. Since strings are usually delimited by double-quotes, they can be tricky to handle when in the string. Let’s see how prevalent they are:

bookdata |> 
  filter(str_detect(author, "\"")) |> 
  group_by(author) |> 
  summarize(n = n()) |> 
  select(author, n) |> 
  arrange(desc(n), author)

# A tibble: 6 × 2
  author                                                          n
  <chr>                                                       <int>
1 "the staff of \"The Chew\""                                     9
2 "Jimmy Fallon with the writers of \"Late Night\""               6
3 "\"Lebanon\" Levi Stoltzfus"                                    5
4 "Loren Bouchard and the writers of \"Bob's Burgers\""           5
5 "Rob \"Gronk\" Gronkowski with Jason Rosenhaus"                 5
6 "Larry Bird, Earvin \"Magic\" Johnson and Jackie MacMullan"     4

Not a lot, but enough to worry about. Let’s go ahead and replace the double quote with a single quote. Note in the following that the representation of the double quote requires a backslash \.

bookdata <-
  bookdata |> 
    mutate(author = str_replace_all(author, "\"", "'"))

Now that the replacement has been made, let’s look at the prevalence of the single-quote in the data frame.

From the booklong data frame, filter those rows for which author contains a single-quote. Group those rows by author. For each name, count the number of rows. Select both author and ’nand sort them in descending order bynand alphabetically byauthor`. Print the first 15 rows.

bookdata |> 
  filter(str_detect(author, "'")) |> 
  group_by(author) |> 
  summarize(n = n()) |> 
  select(author, n) |> 
  arrange(desc(n), author) |> 
  as.data.frame() |> 
  head(n=15)

                                          author   n
1                Bill O'Reilly and Martin Dugard 741
2                                 Dinesh D'Souza  70
3                   Dick Lehr and Gerald O'Neill  52
4                                  Bill O'Reilly  45
5                             Bryan Lee O'Malley  37
6                                  Jane O'Connor  32
7                    America's Test Kitchen Kids  25
8         Bill O'Reilly and Dwight Jon Zimmerman  17
9                                  Keith O'Brien  13
10                       Cynthia D'Aprix Sweeney  10
11             Bill O'Reilly and Bruce Feirstein   9
12                                  Cathy O'Neil   9
13                       the staff of 'The Chew'   9
14             Bill O'Reilly and James Patterson   7
15 Jimmy Fallon with the writers of 'Late Night'   6

These results look good, with no surprises.

Now, for the following we are making another policy decision. We could make a separate field for these suffixes (Jr, Sr, etc.) but we are simply going to add them to the last name, again using the underscore character, so that R does not end up thinking that Jr or IV are the last names of some authors. We looked through the author field using a set of str_detect() functions, and we took now of how the suffixes are represented in the author names. We also include St for all of the names such as St John.

bookdata <-
  bookdata |> 
    mutate(author = str_replace(author, ", Jr", "_Jr"),
           author = str_replace(author, ", Sr", "_Sr"),
           author = str_replace(author, " Jr", "_Jr"),
           author = str_replace(author, " Sr", "_Sr"),
           author = str_replace(author, " II", "_II"),
           author = str_replace(author, " III", "_III"),
           author = str_replace(author, " IV", "_IV"),
           author = str_replace(author, " St ", " St_"),
           author = str_replace(author, " St. ", " St_"))

The following command replaces several characters and/or phrases that essentially mean “and”:

bookdata <-
  bookdata |> 
    mutate(author = str_replace(author, " & ", " and ")) |> 
    mutate(author = str_replace(author, "; ", " and ")) |> 
    mutate(author = str_replace(author, " as told to ", " and ")) |> 
    mutate(author = str_replace(author, " writing as ", " and ")) |> 
    mutate(author = str_replace(author, " lyrics ", " and ")) |> 
    mutate(author = str_replace(author, " Adapted ", " and ")) |> 
    mutate(author = str_replace(author, " created ", " and ")) |> 
    mutate(author = str_replace(author, " compiled ", " and ")) |> 
    mutate(author = str_replace(author, "lyrics ", "")) |> 
    mutate(author = str_replace(author, "Adapted ", "")) |> 
    mutate(author = str_replace(author, "created ", "")) |> 
    mutate(author = str_replace(author, "compiled ", "")) |> 
    mutate(author = str_replace(author, ", with ", " and ")) |> 
    mutate(author = str_replace(author, " and with ", " and ")) |> 
    mutate(author = str_replace(author, " with ", " and ")) |> 
    mutate(author = str_replace(author, " With ", " and "))

With the following command, we delete all of the text that we could find that basically means “and others that we are not going to name.” We delete this because we are using the author column to collect the names of all the people who are specifically named in that column.

bookdata <-
  bookdata |> 
    mutate(author = str_replace(author, " and others", "")) |> 
    mutate(author = str_replace(author, ", et al.", "")) |> 
    mutate(author = str_replace(author, ", et al", "")) |> 
    mutate(author = str_replace(author, " et al.", "")) |> 
    mutate(author = str_replace(author, " et al", "")) |> 
    mutate(author = str_replace(author, " and 639 Others", ""))

With the next command, we make the most basic replacement of the comma “, ” with the word “ and ”.

bookdata <-
  bookdata |> 
    mutate(author = str_replace_all(author, ", ", " and "))

In this case, we use str_replace_all() since the comma can appear multiple times within oneauthor`. If we didn’t use it, then it would only replace the first comma—we want to replace all of them.

The following replaces the names of institutions, titles, and publications with the same name but with underscores instead of spaces. (Don’t worry; you will see why we care about this farther down this page.)

bookdata <-
  bookdata |> 
    mutate(author = str_replace(author, "Garden and Gun", "Garden_and_Gun")) |> 
    mutate(author = str_replace(author,"Annex Security Team",
                                "Annex_Security_Team")) |> 
    mutate(author = str_replace(author,"the Late Show","The_Late_Show")) |> 
    mutate(author = str_replace(author,"Thug Kitchen","Thug_Kitchen")) |> 
    mutate(author = str_replace(author, 
                                "America's Test Kitchen Kids",
                                "America's_Test_Kitchen_Kids")) |> 
    mutate(author = str_replace(author,
                                "America's Test Kitchen",
                                "America's_Test_Kitchen")) |> 
    mutate(author = str_replace(author,
                                "Queen Rania of Jordan",
                                "Queen_Rania_of_Jordan")) |> 
    mutate(author = str_replace(author,
                                "The Washington Post",
                                "The_Washington_Post")) |> 
    mutate(author = str_replace(author,
                                "Women's March Organizers",
                                "Women's_March_Organizers")) |> 
    mutate(author = str_replace(
                        author,
                        "the National Museum of African American History",
                        "the_National_Museum_of_African_American_History")) |> 
    mutate(author = str_replace(author, "One Direction", "One_Direction")) |> 
    mutate(author = str_replace(author,
                                "The Countess of Carnarvon",
                                "The_Countess_of_Carnarvon")) |> 
    mutate(author = str_replace(author,
                                "the Countess of Carnarvon",
                                "The_Countess_of_Carnarvon")) |> 
    mutate(author = str_replace(author, "George RR", "George R R")) |> 
    mutate(author = str_replace(author, "the Dalai Lama", 
                                "the_Dalai_Lama")) |> 
    mutate(author = str_replace(author,"The Try Guys","The_Try_Guys")) |> 
    mutate(author = str_replace(author, "the Late Show", "the_Late_Show"),
           author = str_replace(author, "The Daily Show", "The_Daily_Show"),
           author = str_replace(author, "Late Nite", "Late_Nite"),
           author = str_replace(author, "Wizards RPG Team",
                                "Wizards_RPG_Team"),
           author = str_replace(author, "The Chew", "The_Chew"),
           author = str_replace(author, "Althea Press", "Althea_Press"),
           author = str_replace(author, "Good Housekeeping",
                                "Good_Housekeeping"),
           author = str_replace(author, "Martha Stewart Living",
                                "Martha_Stewart_Living"),
           author = str_replace(author, "National Geographic Kids", 
                                "National_Geographic_Kids"),
           author = str_replace(author, "Magica Quartet Staff",
                                "Magica_Quartet_Staff"),
           author = str_replace(author, "Oh! Great Staff", "Oh!_Great_Staff"),
           author = str_replace(author, "Coloring Books for Adults",
                                "Coloring_Books_for_Adults"),
           author = str_replace(author,"the Adult Coloring Book Designs staff",
                                "the_Adult_Coloring_Book_Designs_Staff"),
           author = str_replace(author, "Potter Style", "Potter_Style"),
           author = str_replace(author, "Grumpy Cat staff", "Grumpy_Cat"),
           author = str_replace(author, "the DK staff", "the_DK_staff"),
           author = str_replace(author, "the Lucas Film Book Group",
                                   "the_Lucas_Film_Book_Group"),
           author = str_replace(author, 
                                   "the Lucasfilm Press", 
                                   "the_Lucasfilm_Press"),
           author = str_replace(author, "Rockridge Press", "Rockridge_Press"),
           author = str_replace(author, "the Mayo Clinic", "the_Mayo_Clinic"),
           author = str_replace(author, 
                                   "the Weight Watchers", 
                                   "the_Weight_Watchers"),
           author = str_replace(author, "Blue Star Books", "Blue_Star_Books"))

The following command deletes phrases that don’t help us name people in the author column:

bookdata <-
  bookdata |> 
    mutate(author = str_replace(author, "the members of the ", "")) |> 
    mutate(author = str_replace(author, "the members of ", "")) |> 
    mutate(author = str_replace(author, "members of the ", "")) |> 
    mutate(author = str_replace(author, "the special counsel", "")) |> 
    mutate(author = str_replace(author, "the writers of", "")) |> 
    mutate(author = str_replace(author, "the office of the ", "")) |> 
    mutate(author = str_replace(author, "the office of ", "")) |> 
    mutate(author = str_replace(author, "The Staff of ", "")) |> 
    mutate(author = str_replace(author, "the staff of ", "")) |> 
    mutate(author = str_replace(author, "the writers of ", "")) |> 
    mutate(author = str_replace(author, "the editors of", "")) |> 
    mutate(author = str_replace(author, "edited", "")) |> 
    mutate(author = str_replace(author, "the editors at ", "")) |> 
    mutate(author = str_replace(author, " and others", "")) |> 
    mutate(author = str_replace(author,  "Various illustrators", ""))

And, here, in one final set of str_replace() commands, we delete all of the role-related information. For this exercise, we do not care about the roles of the people in the author column—we are simply looking to gather the names of the people named in it.

bookdata <-
  bookdata |> 
    mutate(author = str_replace(author, "Edited ", "")) |> 
    mutate(author = str_replace(author, "from texts ", "")) |> 
    mutate(author = str_replace(author, "with an introduction ", " ")) |> 
    mutate(author = str_replace(author, "with related materials ", "")) |> 
    mutate(author = str_replace(author, "adapted", "")) |> 
    mutate(author = str_replace(author, "written", "")) |> 
    mutate(author = str_replace(author, " illustrated ", " and ")) |> 
    mutate(author = str_replace(author, " Illustrations ", " and ")) |> 
    mutate(author = str_replace(author, " Designed ", " and ")) |> 
    mutate(author = str_replace(author, " Translated ", " and ")) |> 
    mutate(author = str_replace(author, " Photographs ", " and ")) |> 
    mutate(author = str_replace(author, "illlustrated", "")) |> 
    mutate(author = str_replace(author, "Translated ", ""))

One last clean up from this part of the process—let’s ensure that extra spaces haven’t gotten introduced in the author column. We will use the str_squish() function to get rid of all of the extra whitespace.

bookdata <-
  bookdata |> 
    mutate(author = str_squish(author))

2.5 Count number of matches: `str_count()`

The function str_count(), not surprisingly, counts the number of times one string appears in another one.

Suppose that we want to know the number of authors who appear in the 61k+ rows of the bookdata data frame. We are not controlling for repeats.

From the vector bookdata$author (that is, the column author in the bookdata data frame), go through it item-by-item and construct a new vector of the same size, where corresponding items in the new vector have the number of times " and " appears plus one. Thus, if it finds c("x and y", "abc"), then it would result in c(2, 1). Put this vector of numbers into numOfAuthors.

numOfAuthors = str_count(bookdata$author, " and ") + 1

We can now do some calculations on this vector.

Print a vector of five values: the smallest number that appears in the vector, the average of all the numbers that appear in the vector, the median of all the numbers in that vector, the largest number in that vector, and the total of all of the numbers in the whole column. When calculating each of those statistics, remove all of the NA values first.

print(c(min(numOfAuthors, na.rm = TRUE),
        mean(numOfAuthors, na.rm = TRUE),
        median(numOfAuthors, na.rm = TRUE),
        max(numOfAuthors, na.rm = TRUE),
        sum(numOfAuthors, na.rm = TRUE)))

[1]     1.00000     1.21072     1.00000     5.00000 74291.00000

If you want to count but within the context of a tidyverse command, then you might use something like the following command.

First, create a new column numOfAuthors using str_count() to count the number of” and “on each row and then adds 1. Then, since there isn't agroup_by(), thesummarize()command will calculate themin,mean,median,max, andsum` for all of the rows.

bookdata |> 
  mutate(numOfAuthors = str_count(author, " and ") + 1) |> 
  summarize(
    min = min(numOfAuthors, na.rm = TRUE),
    mean = mean(numOfAuthors, na.rm = TRUE),
    med = median(numOfAuthors, na.rm = TRUE),
    max = max(numOfAuthors, na.rm = TRUE),
    total = sum(numOfAuthors, na.rm = TRUE))

# A tibble: 1 × 5
    min  mean   med   max total
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1  1.21     1     5 74291

As we knew from previous print-outs of the data, the smallest value is 1 and the largest value is 5. But we also found out the average and median. Clearly, most books in this data set are written by one author.

2.6 Substring Extraction: `str_sub()`

Extracting parts of a string can be very useful. Using str_sub(), you can obtain specific parts of a string based on character positions.

It would be useful to have a shorter title for print purposes. Note the maximum title length:

bookdata |> 
  summarize(max_title_len = max(str_length(title))) |> 
  as.data.frame()

  max_title_len
1            80

So let’s define a new column shorttitle using the str_sub() command:

bookdata <-
  bookdata |> 
    mutate(shorttitle = str_to_title(str_sub(title, 1, 25)))

And check out the results; this lists the longest titles in the data set:

bookdata |> 
  mutate(titlelen = str_length(title)) |> 
  group_by(titlelen, title) |> 
  arrange(desc(titlelen)) |> 
  select(shorttitle, title) |> 
  head(5) |> 
  as.data.frame()

  titlelen                shorttitle
1       80 Walt Disney’s Uncle Scroo
2       77 Alice In The Country Of H
3       73 No Matter How I Look At I
4       73 No Matter How I Look At I
5       73 No Matter How I Look At I
                                                                             title
1 WALT DISNEY’S UNCLE SCROOGE AND DONALD DUCK: THE DON ROSA LIBRARY, VOLS. 1 AND 2
2    ALICE IN THE COUNTRY OF HEARTS: THE MAD HATTER'S LATE NIGHT TEA PARTY, VOL. 1
3        NO MATTER HOW I LOOK AT IT, IT'S YOU GUYS' FAULT I'M NOT POPULAR!, VOL. 1
4        NO MATTER HOW I LOOK AT IT, IT'S YOU GUYS' FAULT I'M NOT POPULAR!, VOL. 2
5        NO MATTER HOW I LOOK AT IT, IT'S YOU GUYS' FAULT I'M NOT POPULAR!, VOL. 3

You can see that having a long title makes for a more difficult to read table of data. Having the shorttitle available will allow us to print out something that fits onto screens and pages more easily.

2.7 Splitting Text: `separate_wider_delim()`

The function separate_wider_delim() splits a string column into multiple new columns by a specified delimiter.

First, let’s remind ourselves of what the author column contains. Specifically, let’s look at the longest strings in the author column.

First, calculate a column authlen that contains the length of the string in the author column. Next, group the rows together by authlen and author. Summarize by counting the total rows for each author. Sort these rows in descending order by the length of the string in the author column. Finally, select the authlen, author, and count in the first 15 rows.

bookdata |> 
  mutate(authlen = str_length(author)) |> 
  group_by(authlen, author) |> 
  summarize(count = n()) |> 
  arrange(desc(authlen)) |> 
  select(authlen, author, count) |> 
  head(15)

# A tibble: 15 × 3
# Groups:   authlen [12]
   authlen author                                                          count
     <int> <chr>                                                           <int>
 1      93 Cassandra Clare and Sarah Rees Brennan and Maureen Johnson and…     9
 2      87 Apostolos Doxiadis and Christos H Papadimitriou and Alecos Pap…     3
 3      77 Jeremy Sorese and Ian Jones_Quartey and Asia Kendrick_Horton a…     1
 4      75 Stanley McChrystal and Tantum Collins and David Silverman and …     1
 5      68 Lily Blake and Evan Daugherty and John Lee Hancock and Hossein…     1
 6      68 Neil Gaiman and Sam Kieth and Mike Dringenberg and Malcolm Jon…     1
 7      68 Philip Houston and Michael Floyd and Susan Carnicero and Don T…     2
 8      67 Anne Berest and Audrey Diwan and Caroline de Maigret and Sophi…    39
 9      66 Bob Greene and Howard Lancer and Ronald L Kotler and Diane L M…     5
10      64 Amanda Berry and Gina DeJesus and Mary Jordan and Kevin Sulliv…    22
11      64 Peyton Manning and Eli Manning and Archie Manning and Jim Mads…     1
12      63 Grant Morrison and J G Jones and Carlos Pacheco and Doug Mahnke     5
13      61 Val Emmich and Steven Levenson and Benj Pasek and Justin Paul      16
14      60 Ben Nemtin and Dave Lingwood and Duncan Penn and Jonnie Penn        1
15      59 Kourtney Kardashian and Kim Kardashian and Khloé Kardashian         3

This shows us that the author column now has a series of author names separated by " and "—which is not an accident, as we have been working to do just that with the commands above here on the page.

What we want to do now is to put a single author in a single column. This is what the separate_wider_delim() is for. So let’s see an example of this in application.

The results will be assigned to the bookdata data frame and comes from the same. Let’s step through the five arguments to step_wider_delim(): 1) author: the column that is going to be separated; 2) delim: the string that divides the values in the author column; 3) names: the names of the columns that will be created to hold the separated values from the author column; 4) too_few: this option tells R to put the values starting in author1 and fill the rest with NA values once the author field runs out of names; and 5) ’cols_remove: when sets toFALSE, it will not delete theauthor` column when it is done processing; we are saving it simply for data checking later in the process.

bookdata <-
  bookdata |> 
    separate_wider_delim(author, 
                         delim = " and ",
                         names = c("author1", "author2", "author3",
                                   "author4", "author5"),
                         too_few = "align_start",
                         cols_remove = FALSE)

Let’s take a look at some of the results when there are at least two authors. What we should see is the first full name in the author column put in the author1 column, the second full name in the author column put in the author2 column, and so on.

bookdata |> 
  filter(!is.na(author2)) |> 
  select(author, author1, author2, author3)

# A tibble: 11,608 × 4
   author                                              author1   author2 author3
   <chr>                                               <chr>     <chr>   <chr>  
 1 James Patterson and Gabrielle Charbonnet            James Pa… Gabrie… <NA>   
 2 Julia Child and Simone Beck and Louisette Bertholle Julia Ch… Simone… Louise…
 3 Eric Shanower and Skottie Young                     Eric Sha… Skotti… <NA>   
 4 Nancy Butler and Hugo Petrus                        Nancy Bu… Hugo P… <NA>   
 5 Mark Millar and Steve McNiven                       Mark Mil… Steve … <NA>   
 6 Glenn Beck and Kevin Balfe                          Glenn Be… Kevin … <NA>   
 7 James Patterson and NaRae Lee                       James Pa… NaRae … <NA>   
 8 James Patterson and NaRae Lee                       James Pa… NaRae … <NA>   
 9 Clive Cussler and Dirk Cussler                      Clive Cu… Dirk C… <NA>   
10 Guy Fieri and Ann Volkwein                          Guy Fieri Ann Vo… <NA>   
# ℹ 11,598 more rows

And so we do.

Now let’s count up the number of NA values in each of the author fields. We should see an increasing number of NA values as we go from 1 to 5 (since we asked separate_wider_delim to fill them in that order):

bookdata |> 
  summarize(a_na = sum(is.na(author)),
            a1_na = sum(is.na(author1)),
            a2_na = sum(is.na(author2)),
            a3_na = sum(is.na(author3)),
            a4_na = sum(is.na(author4)),
            a5_na = sum(is.na(author5))) |> 
  as.data.frame()

  a_na a1_na a2_na a3_na a4_na a5_na
1   69    69 49822 60230 61320 61418

And so we do. Similarly, we should see a decreasing number of non-’NA` values as we go from 1 to 5:

bookdata |> 
  summarize(a = sum(!is.na(author)),
            a1 = sum(!is.na(author1)),
            a2 = sum(!is.na(author2)),
            a3 = sum(!is.na(author3)),
            a4 = sum(!is.na(author4)),
            a5 = sum(!is.na(author5))) |> 
  as.data.frame()

      a    a1    a2   a3  a4 a5
1 61361 61361 11608 1200 110 12

And, again, we do.

Let’s now see what some of this data looks like for the longer author data.

For the bookdata data frame, filter the rows for which the author4 column is not empty. Calculate two new columns—endauthor which is the last 35 characters of the author column, and numAuth which is the number of full names in the author column. Group those filtered rows by author, numAuth, and author4, calculating the number of rows in each group. Select the columns endauthor, author4, author, and that calculated count. Order the rows in descending order by the number of authors in the author column.

bookdata |> 
  filter(!is.na(author4)) |> 
  mutate(endauthor = str_sub(author, -35, -1)) |> 
  mutate(numAuth = str_count(author, " and ") + 1) |> 
  group_by(endauthor, numAuth, author4) |> 
  summarize(n = n()) |> 
  select(endauthor, author4, n, numAuth) |> 
  arrange(desc(numAuth)) |> 
  head(15)

# A tibble: 15 × 4
# Groups:   endauthor, numAuth [15]
   endauthor                             author4               n numAuth
   <chr>                                 <chr>             <int>   <dbl>
 1 " and Kelly Link and Robin Wasserman" Kelly Link            9       5
 2 "Janelle and Christine and Ron Brown" Christine             3       5
 3 " Caroline de Maigret and Sophie Mas" Sophie Mas           39       4
 4 " John Lee Hancock and Hossein Amini" Hossein Amini         1       4
 5 " and Carlos Pacheco and Doug Mahnke" Doug Mahnke           5       4
 6 " and Mary Jordan and Kevin Sullivan" Kevin Sullivan       22       4
 7 "Alecos Papadatos and Annie Di Donna" Annie Di Donna        3       4
 8 "and Susan Carnicero and Don Tennant" Don Tennant           2       4
 9 "d David Silverman and Chris Fussell" Chris Fussell         1       4
10 "d Ronald L Kotler and Diane L McKay" Diane L McKay         5       4
11 "e Dringenberg and Malcolm Jones_III" Malcolm Jones_III     1       4
12 "g and Archie Manning and Jim Madsen" Jim Madsen            1       4
13 "ia Kendrick_Horton and Rachel Dukes" Rachel Dukes          1       4
14 "nson and Benj Pasek and Justin Paul" Justin Paul          16       4
15 "ood and Duncan Penn and Jonnie Penn" Jonnie Penn           1       4

In this table above, look at the author4 column and compare the value to the appropriate name in the author column. They’re the same and each one looks like a full name. That’s a win!

2.7.1 Example of making long data

This section uses a function that is described in Wide to long data. Useful original documentation can be found on this page.

Reading through the below might help the reader gain some insight into how he/she might combine functions in a way that might not have been clear otherwise. It is not directly related to the topic of string functions, but it does provide the motivation for why we did much of the above.

First, we’re done with the author column, so let’s delete it from the bookdata data frame:

bookdata <-
  bookdata |> 
    select(-author)

What we’re going to do now is make a new data frame booklong that combines the data in the authorX columns into one new column in a way that makes it easier to manipulate and analyze author information.

An informal, unstructured description of the following command is as follows: Create a new column called Auth_Name and put all the author names from the five authorX columns in it. In the new Auth_Rank column, put a 1 in the column when you’ve put an author name from author1 column; put a 2 in the column when you’ve put an author name from author2 column; etc. Don’t create a new row of data when there’s an NA value in an authorX column.

The data from bookdata are manipulated and then are assigned to the new data frame booklong. Consider the six arguments to the pivot_longer() function: 1) cols: the “wide columns” that are going to be converted into long columns; 2) values_to: the name of the new column that will contain the data stored in the current wide columns; 3) names_to: the name of the new column that usually contains the names of the old wide columns (i.e., "author1"..."author5"); 4) names_prefix: this tells R to remove the string "author" from the old wide columns (named with the cols argument) so that only "1"..."5" are left; 5) names_transform: this tells R to convert the "1"..."5" to integers 1...5 when putting the value in the column named by names_to; 6) values_drop_na: this tells R to not create a new row when the value in the values_to column is going to be an NA.

booklong <-
  bookdata |> 
    pivot_longer(
      cols = starts_with("author"),
      values_to = "Auth_Name",
      names_to = "Auth_Rank",
      names_prefix = "author",
      names_transform = list(Auth_Rank = as.integer),
      values_drop_na = TRUE
    )

We now have a much different data frame compared to what we started with. How big is it?

dim(booklong)

[1] 74291    10

It has many more rows, as expected. In fact, it has the same number of rows as the number of authors that was calculated above.

Now we can ask a question that we couldn’t easily ask before: How many distinct authors are represented in the NYTimes Best Seller list from 2010–2019?

booklong |> 
  summarize(n_distinct = n_distinct(Auth_Name)) |> 
  as.data.frame()

  n_distinct
1       4080

Only just over 4000 authors are responsible for all of the books.

Let’s see how the new data frame looks and how it has captured the same information.

booklong |> 
  arrange(pubdate, listname, rank, Auth_Rank) |> 
  select(rank, shorttitle, Auth_Rank, Auth_Name) |> 
  head(15) |> 
  as.data.frame()

   rank                shorttitle Auth_Rank            Auth_Name
1     1          Witch And Wizard         1      James Patterson
2     1          Witch And Wizard         2 Gabrielle Charbonnet
3     2          The Hunger Games         1      Suzanne Collins
4     3             Catching Fire         1      Suzanne Collins
5     4   The Magician’s Elephant         1       Kate DiCamillo
6     5                    Fallen         1          Lauren Kate
7     1 Guinness World Records 20         1        Craig Glenday
8     2 Mastering The Art Of Fren         1          Julia Child
9     2 Mastering The Art Of Fren         2          Simone Beck
10    2 Mastering The Art Of Fren         3  Louisette Bertholle
11    3   The Pioneer Woman Cooks         1         Ree Drummond
12    4           4-Hour Workweek         1      Timothy Ferriss
13    5            It's Your Time         1          Joel Osteen
14    1           The Lost Symbol         1            Dan Brown
15    2                Cross Fire         1      James Patterson

Consider the first two rows of the above. It tells us that both James Patterson (row 1) and Gabrielle Charbonnet (row 2) are the authors of Witch and Wizard (which is what we saw in the original table back in this section)

Now that we have all of the author names in one column, it’s much more convenient to do some data clean-up. First, let’s get rid of extra whitespace:

booklong <-
  booklong |> 
    mutate(Auth_Name = str_squish(Auth_Name))

And now, let’s get rid of those authorX columns since all of the information is now in the Auth_Name column.

bookdata <-
  bookdata |> 
  select(-c(author1, author2, author3, author4, author5))

2.8 Combining Text: `str_c()`

The str_c() function from the stringr package helps concatenate strings with much flexibility (though we only discuss a limited set of features here).

Let’s assume that we want to use a character string instead of the integer in Auth_List. We can use str_c to create the string that we need.

Add a new column AuthOrd to the booklong data frame (but don’t save it for now) such that it is the combination of the string Auth and the character representation of Auth_Rank (e.g., "1", "2") where those are separated by a dash. Sort the results and then select that new column, the original column, and Auth_Name.

booklong |> 
  mutate(AuthOrd = str_c("Auth", 
                         as.character(Auth_Rank), 
                         sep = "-")) |> 
  arrange(pubdate, rank, isbn, Auth_Rank) |> 
  select(AuthOrd, Auth_Rank, Auth_Name)

# A tibble: 74,291 × 3
   AuthOrd Auth_Rank Auth_Name           
   <chr>       <int> <chr>               
 1 Auth-1          1 Jane O’Connor       
 2 Auth-1          1 Guy Fieri           
 3 Auth-2          2 Ann Volkwein        
 4 Auth-1          1 Sarah Palin         
 5 Auth-1          1 James Patterson     
 6 Auth-2          2 Gabrielle Charbonnet
 7 Auth-1          1 Stephenie Meyer     
 8 Auth-1          1 Markus Zusak        
 9 Auth-1          1 Dan Brown           
10 Auth-1          1 R Crumb             
# ℹ 74,281 more rows

You can see that the number at the end of the AuthOrd column is the number in its corresponding Auth_Rank.

2.9 Splitting a string: `str_split()`

The ‘str_split()’ splits a string at a specific character into a vector of strings. You tell R how you want it to split the string by specifying the separator character.

Now that we have all of the author names in the Auth_Name column, let’s divide it into FirstName and LastName. This will not always work as many names are multiple words or initials; however, to make it be more correct we would have to implement a function and this is too much of advanced topic for us right now.

But that doesn’t mean that we can’t try!

We will first implement a simpler version.

The data is coming from booklong and will be stored in booklong. First, we create the len_in_words column as the number of words in the Auth_Name column. Next, we tell R to create the FirstName column as the first word of Auth_Name. Then, if there are 2 words in Auth_Name, then LastName is that second word; however, if there are 3 words in Auth_Name, then LastName is that third word; finally, if there are not 2 or 3 words (i.e., 1 word or more than 3 words), then use the last word in Auth_Name.

booklong <-
  booklong |> 
    mutate(len_in_words = str_count(Auth_Name, boundary("word"))) |> 
    mutate(FirstName = word(Auth_Name, 1, sep = " "), 
           LastName = ifelse(len_in_words == 2,
                             word(Auth_Name, 2, sep = " "),
                             ifelse(len_in_words == 3,
                                    word(Auth_Name, 3, sep = " "),
                                    word(Auth_Name, -1, sep = " "))))

Let’s use the following to print author information for a sampling of rows from booklong.

R uses data from booklong. It groups by four fields—but notice that once you’ve specified Auth_Name, the other three columns will have the same values. Count the number of rows per group. Then filter either Auth_Name containing one of six names or Auth_Name containing the string “_Show”. Select the columns that we group_by plus the count we calculated. Finally, sort by LastName and then FirstName. Print all the rows.

booklong |> 
  group_by(Auth_Name, FirstName, LastName, len_in_words) |> 
  summarize(n = n()) |> 
  filter(Auth_Name %in% c("Suzanne Collins", "Ronald L Kotler",
                          "George R R Martin", "G B Trudeau", 
                          "Chris Van Dusen", "Bessel van der Kolk") |
           str_detect(Auth_Name, "_Show")) |> 
  select(Auth_Name, FirstName, LastName, len_in_words, n) |>
  arrange(LastName, FirstName) |> 
  print(n=Inf)

# A tibble: 8 × 5
# Groups:   Auth_Name, FirstName, LastName [8]
  Auth_Name           FirstName      LastName       len_in_words     n
  <chr>               <chr>          <chr>                 <int> <int>
1 Suzanne Collins     Suzanne        Collins                   2   276
2 Chris Van Dusen     Chris          Dusen                     3     1
3 Bessel van der Kolk Bessel         Kolk                      4    72
4 Ronald L Kotler     Ronald         Kotler                    3     5
5 George R R Martin   George         Martin                    4   225
6 The_Daily_Show      The_Daily_Show The_Daily_Show            1     1
7 The_Late_Show       The_Late_Show  The_Late_Show             1     7
8 G B Trudeau         G              Trudeau                   3    17

A few observations:

For the name “Chris Van Dusen”, this function determined that “Dusen” is the last name. Similar to “Bessel van der Kolk” for which it specified “Kolk” as the last name.
For both “R L Stine” and “G B Trudeau”, the function chose just the first initial for the first name.
For both of the shows, the full show name is given as both the first and last names.

These aren’t necessarily right or wrong, but we wanted to show another, somewhat more complicated way, of specifying the first and last name. To that we turn now.

R assigns the results to booklong and the original data comes from booklong. First, calculate three new columns: len_in_words, FirstWord, and SecWord. (These last two are simply the first and second word of Auth_Name; we calculate them to make the code that follows easier to read.) Next, create the FirstName column in the following way: if FirstWord is just one letter, then define FirstName as the first two words; if FirstWord is more than one letter, then define FirstName as FirstWord. Now, the calculation of LastName is much more complicated than before. If Auth_Name is just two words long, then define LastName as SecWord. If Auth_Name is three words long, then do the following: If SecWord is a connector (i.e., “le”, “von”, etc.), then set LastName to the second and third words; otherwise, set it to the third word. Now, if Auth_Name is four words long, then do the following: If the second and third words are either “de la” or “van der”, then define LastName as Auth_Name; otherwise, set it to the fourth word. If Auth_Name is not 2, 3, or 4 words long, then define LastName as the last word of Auth_Name. Finally, delete the two columns FirstWord and SecWord.

booklong <-
  booklong |> 
    mutate(len_in_words = str_count(Auth_Name, boundary("word")),
           FirstWord = word(Auth_Name, 1),
           SecWord = word(Auth_Name, 2)) |> 
    mutate(FirstName = 
             ifelse(str_length(FirstWord) == 1,
                    str_c(FirstWord, " ", SecWord),
                    FirstWord),
           LastName = 
             ifelse(len_in_words == 2, 
                    SecWord,
                    ifelse(len_in_words == 3,
                           ifelse(SecWord %in% c("le", "von", "van",
                                                 "du", "Du", "Van",
                                                 "Von", "Le"),
                                  word(Auth_Name, 2, 3),
                                  word(Auth_Name, 3)),
                           ifelse(len_in_words == 4,
                                  ifelse(word(Auth_Name, 2, 3) %in% 
                                           c("de la", "van der"),
                                         word(Auth_Name, 2, 4),
                                         word(Auth_Name, 4)),
                                  word(Auth_Name, -1))))) |> 
  select(-c(FirstWord, SecWord))

That was a lot. To be clear, we did not come up with this on the first try. We wrote this iteratively, fixing problems as we went along.

Now let’s use the following (the same command as above) to see the results of this second approach.

booklong |> 
  group_by(Auth_Name, FirstName, LastName, len_in_words) |> 
  summarize(n = n()) |> 
  filter(Auth_Name %in% c("Suzanne Collins", "Ronald L Kotler",
                          "George R R Martin", "G B Trudeau", 
                          "Chris Van Dusen", "Bessel van der Kolk") |
           str_detect(Auth_Name, "_Show")) |> 
  select(Auth_Name, FirstName, LastName, len_in_words, n) |>
  arrange(LastName, FirstName) |> 
  print(n=Inf)

# A tibble: 8 × 5
# Groups:   Auth_Name, FirstName, LastName [8]
  Auth_Name           FirstName      LastName       len_in_words     n
  <chr>               <chr>          <chr>                 <int> <int>
1 Suzanne Collins     Suzanne        Collins                   2   276
2 Ronald L Kotler     Ronald         Kotler                    3     5
3 George R R Martin   George         Martin                    4   225
4 The_Daily_Show      The_Daily_Show The_Daily_Show            1     1
5 The_Late_Show       The_Late_Show  The_Late_Show             1     7
6 G B Trudeau         G B            Trudeau                   3    17
7 Chris Van Dusen     Chris          Van Dusen                 3     1
8 Bessel van der Kolk Bessel         van der Kolk              4    72

A few observations:

For the name “Chris Van Dusen”, this function determined that “Van Dusen” is the last name. Similar to “Bessel van der Kolk” for which it specified “van der Kolk” as the last name.
For both “R L Stine” and “G B Trudeau”, the function define the first name to consist of both initials.
The same as before: for both of the shows, the full show name is given as both the first and last names.

2.10 String Padding: `str_pad()`

The ‘str_pad()’ function adds padding characters to strings, which can be useful for formatting text or creating fixed-width file outputs.

Consider the following demonstration of this function.

Select data from booklong. Define three new columns: 1) strL will be padded with “-” on the left; 2) strR is padded with “+” on the right; and 3) strB is padded on both sides with “*” so that it is centered in the space. Select FirstName plus those three columns. Print the first 6 rows.

booklong |> 
  mutate(strL = str_pad(FirstName, 15, side = "left", pad = "-"),
         strR = str_pad(FirstName, 15, side = "right", pad = "+"),
         strB = str_pad(FirstName, 15, side = "both", pad = "*")) |> 
  select(FirstName, strL, strR, strB) |> 
  head(n=6) |> 
  as.data.frame()

  FirstName            strL            strR            strB
1     James ----------James James++++++++++ *****James*****
2 Gabrielle ------Gabrielle Gabrielle++++++ ***Gabrielle***
3   Suzanne --------Suzanne Suzanne++++++++ ****Suzanne****
4   Suzanne --------Suzanne Suzanne++++++++ ****Suzanne****
5      Kate -----------Kate Kate+++++++++++ *****Kate******
6    Lauren ---------Lauren Lauren+++++++++ ****Lauren*****

Of course, you could choose to pad the strings with whatever character that you would like (e.g., a space “ ”). We chose the above so that it would be easier to see.

1 Case study: Books

1.1 Import

1.2 Column clean-up

1.3 Column contents clean-up

1.3.1 Make factors

1.3.2 Managing empty values

1.4 Ready for analysis

2 Functions

2.1 Trimming White Spaces: str_trim() and str_squish()

2.2 Text Transformation: toupper(), tolower(), and str_to_title()

2.3 Extracting and Detecting: str_extract(), str_detect()

2.4 Pattern Matching and Replacement: str_replace()

2.5 Count number of matches: str_count()

2.6 Substring Extraction: str_sub()

2.7 Splitting Text: separate_wider_delim()