Case study: Babynames

This dataset is available from Kaggle. We are going to use this dataset to demonstrate some of the steps that you might have to take in order to transform a raw dataset into what you need to do in order to do the analysis that you are planning.

1 Importing data

Here we read the CSV file into the babynames data frame.

babynames <- read.csv("../data/class202405/babyNamesUSYOB-full.csv")

2 Examining data

Now let’s take a look at what we have:

summary(babynames)
  YearOfBirth       Name               Sex                Number       
 Min.   :1880   Length:1858689     Length:1858689     Min.   :    5.0  
 1st Qu.:1950   Class :character   Class :character   1st Qu.:    7.0  
 Median :1983   Mode  :character   Mode  :character   Median :   12.0  
 Mean   :1973                                         Mean   :  183.4  
 3rd Qu.:2002                                         3rd Qu.:   32.0  
 Max.   :2015                                         Max.   :99680.0  

As you can see above, there are more than 1.8 million observations (rows) in this dataset. How many babies are represented within this dataset?

sum(babynames$Number)
[1] 340851912

Over 340 million!

Let’s look at the top 15 rows of babynames, ordered alphabetically within YearOfBirth and Sex. We would expect to see names for females beginning with A from 1880:

babynames |> 
  arrange(YearOfBirth, Sex, Name) |> 
  head(n=15)
   YearOfBirth     Name Sex Number
1         1880    Abbie   F     71
2         1880     Abby   F      6
3         1880  Abigail   F     12
4         1880      Ada   F    652
5         1880     Adah   F     24
6         1880  Adaline   F     23
7         1880     Adda   F     14
8         1880    Addie   F    274
9         1880    Adela   F      9
10        1880 Adelaide   F     65
11        1880    Adele   F     41
12        1880   Adelia   F     18
13        1880  Adeline   F     54
14        1880   Adella   F     26
15        1880   Adelle   F      5

Now we can reorder these rows descending by Number within YearOfBirth and Sex—that is, for each year and gender, it will list the names from most popular to least popular.

babynames |> 
  arrange(YearOfBirth, Sex, desc(Number), Name) |> 
  head()
  YearOfBirth      Name Sex Number
1        1880      Mary   F   7065
2        1880      Anna   F   2604
3        1880      Emma   F   2003
4        1880 Elizabeth   F   1939
5        1880    Minnie   F   1746
6        1880  Margaret   F   1578

3 Modifying and calculating

We are going to do some calculations on the datatable. For each YearOfBirth and Sex, we want to calculate how many names are recorded that year and how many babies are in the dataset. In the first observation, you can see that 90,992 female babies are given 942 names in 1880.

BabyYearSex <-
  babynames |> 
    group_by(YearOfBirth, Sex) |> 
    summarize(
      NumOfNames = n(),
      NumOfBabies = sum(Number),
    )
head(BabyYearSex, n=8)
# A tibble: 8 × 4
# Groups:   YearOfBirth [4]
  YearOfBirth Sex   NumOfNames NumOfBabies
        <int> <chr>      <int>       <int>
1        1880 F            942       90992
2        1880 M           1058      110490
3        1881 F            938       91953
4        1881 M            997      100743
5        1882 F           1028      107848
6        1882 M           1099      113686
7        1883 F           1054      112318
8        1883 M           1030      104627

Let’s use the slice() function to get the top names for each YearOfBirth/Sex combination.

babynames |> 
  arrange(YearOfBirth, Sex, desc(Number)) |>
  group_by(YearOfBirth, Sex) |> 
  slice(1:10) |> 
  head(n=15)
# A tibble: 15 × 4
# Groups:   YearOfBirth, Sex [2]
   YearOfBirth Name      Sex   Number
         <int> <chr>     <chr>  <int>
 1        1880 Mary      F       7065
 2        1880 Anna      F       2604
 3        1880 Emma      F       2003
 4        1880 Elizabeth F       1939
 5        1880 Minnie    F       1746
 6        1880 Margaret  F       1578
 7        1880 Ida       F       1472
 8        1880 Alice     F       1414
 9        1880 Bertha    F       1320
10        1880 Sarah     F       1288
11        1880 John      M       9655
12        1880 William   M       9531
13        1880 James     M       5927
14        1880 Charles   M       5348
15        1880 George    M       5126

Now that we know how to do it, let’s store that information in a data frame:

BabyYearTopNames <-
  babynames |> 
    arrange(YearOfBirth, Sex, desc(Number)) |>
    group_by(YearOfBirth, Sex) |> 
    slice(1:10)
head(BabyYearTopNames, n=15)
# A tibble: 15 × 4
# Groups:   YearOfBirth, Sex [2]
   YearOfBirth Name      Sex   Number
         <int> <chr>     <chr>  <int>
 1        1880 Mary      F       7065
 2        1880 Anna      F       2604
 3        1880 Emma      F       2003
 4        1880 Elizabeth F       1939
 5        1880 Minnie    F       1746
 6        1880 Margaret  F       1578
 7        1880 Ida       F       1472
 8        1880 Alice     F       1414
 9        1880 Bertha    F       1320
10        1880 Sarah     F       1288
11        1880 John      M       9655
12        1880 William   M       9531
13        1880 James     M       5927
14        1880 Charles   M       5348
15        1880 George    M       5126

Let’s calculate how many babies are given one of the top names (for each sex) for each year:

BabyYearTopNames |> 
  group_by(YearOfBirth, Sex) |> 
  summarize(
    CountTopBabiesThisYear = sum(Number)
  )
# A tibble: 272 × 3
# Groups:   YearOfBirth [136]
   YearOfBirth Sex   CountTopBabiesThisYear
         <int> <chr>                  <int>
 1        1880 F                      22429
 2        1880 M                      48854
 3        1881 F                      22211
 4        1881 M                      44122
 5        1882 F                      25820
 6        1882 M                      48566
 7        1883 F                      26308
 8        1883 M                      44600
 9        1884 F                      29860
10        1884 M                      47180
# ℹ 262 more rows

Again, let’s now store this information in a data frame:

BabyYearTopNamesSums <-
  BabyYearTopNames |> 
    group_by(YearOfBirth, Sex) |> 
    summarize(
      CountTopBabiesThisYear = sum(Number))
head(BabyYearTopNamesSums)
# A tibble: 6 × 3
# Groups:   YearOfBirth [3]
  YearOfBirth Sex   CountTopBabiesThisYear
        <int> <chr>                  <int>
1        1880 F                      22429
2        1880 M                      48854
3        1881 F                      22211
4        1881 M                      44122
5        1882 F                      25820
6        1882 M                      48566

Use left_join() to get all of the yearly statistics in one table:

BabyYearSex <-
  BabyYearSex |> 
    select(YearOfBirth, Sex, NumOfNames, NumOfBabies) |> 
    left_join(BabyYearTopNamesSums, 
              by = join_by(YearOfBirth, Sex))
head(BabyYearSex, n=8)
# A tibble: 8 × 5
# Groups:   YearOfBirth [4]
  YearOfBirth Sex   NumOfNames NumOfBabies CountTopBabiesThisYear
        <int> <chr>      <int>       <int>                  <int>
1        1880 F            942       90992                  22429
2        1880 M           1058      110490                  48854
3        1881 F            938       91953                  22211
4        1881 M            997      100743                  44122
5        1882 F           1028      107848                  25820
6        1882 M           1099      113686                  48566
7        1883 F           1054      112318                  26308
8        1883 M           1030      104627                  44600

One interesting calculation (among many) that we can perform on the data that we’ve gathered so far is to answer the following question: “How popular are the popular names for each sex in each year?” One way to answer this is to determine what percentage of all babies born each year have popular names. Let’s assume that “popular” means one of the top 10 names.

BabyYearSex <-
  BabyYearSex |> 
    mutate(Concentration = CountTopBabiesThisYear / NumOfBabies)
head(BabyYearSex)
# A tibble: 6 × 6
# Groups:   YearOfBirth [3]
  YearOfBirth Sex   NumOfNames NumOfBabies CountTopBabiesThisYear Concentration
        <int> <chr>      <int>       <int>                  <int>         <dbl>
1        1880 F            942       90992                  22429         0.246
2        1880 M           1058      110490                  48854         0.442
3        1881 F            938       91953                  22211         0.242
4        1881 M            997      100743                  44122         0.438
5        1882 F           1028      107848                  25820         0.239
6        1882 M           1099      113686                  48566         0.427

4 Graphing

It’s pretty straight-forward to graph this in R (once you’ve gone through the graphing information for this class). Let’s do a basic version first:

BabyYearSex |> 
  ggplot(aes(YearOfBirth, Concentration)) +
  geom_line(aes(color = Sex))

We can see that since the mid 1940s, popular names have significantly dropped in popularity. Further, female names were much more concentrated in 1880 but are now on par with male names.

Now let’s do a little bit of design work:

BabyYearSex |> 
  ggplot(aes(YearOfBirth, Concentration)) +
  geom_line(aes(color = Sex)) +
  scale_x_continuous(limits = c(1875, 2020),
                     breaks = c(1880, 1900, 1920, 1940, 1960, 1980, 2000, 2020)) + 
  scale_y_continuous(limits = c(0, 0.5),
                     breaks = c(0, 0.1, 0.2, 0.3, 0.4, 0.5),
                     labels = c("0", "10%", "20%", "30%", "40%", "50%")) +
  scale_color_manual(values = c("F" = "firebrick", "M" = "dodgerblue"),
                     labels = c("Female", "Male")) +
  labs(title = "Concentration of Baby Names, 1880-2015",
       subtitle = "How popular have popular names been over time?",
       x = "Year of Birth",
       y = "Top 10 Names as Percentage of Total",
       color = "Gender") +
  theme_bw() +
  theme(legend.position = c(0.8, 0.8),
        title = element_text(colour = "#582C83",
                                  face = "bold"))