Refining: Themes & colors

Before you read this page, we recommend that you understand both the introductory graphics page and the illustrative example.

1 Introduction

This page continues building a graph by following the steps shown below:

Process for defining a `ggplot`

The “gather tools & build easel” page goes through the first two steps in the process. The “paint” page goes through the process of choosing and setting up geometries (the data on the graph). The “framing” page goes through the process of constructing the frame, while this page goes through the process of refining the graph with colors and themes.

The analogy to the painting process breaks down for this step. You are retroactively choosing the look of the easel and frame in this step (with the theme) and setting the colors of the graphed data (with the colors). You will define the theme for every graph but the colors for only a subset of the graphs.

2 Structure

Here’s the structure of the basic statement to build a ggplot, highlighting the commands that we focus on in this page:

dataframename |> 
  ggplot(aes(X)) +
    geom_Y(optional-stuff) +
    facet_Z(column-info) +
    labs(...) +
    scale_x_continuous/discrete(...) +
    scale_y_continuous/discrete(...) +
    theme_A() +
    scale_fill/color_B(specification)

As you can see above, the effects that we achieve on this page require very few statements.

3 Focus of this page

Within this page, we provide examples of both themes and colors.

3.1 Themes

The themes specify the overall look of your graphs. You should, in almost every case, choose one theme and use it for all of your graphs across all of your documents (or, at least, all of the documents within a particular use case or target audience).

  • Themes built in to ggplot, such as theme_bw(), theme_minimal(), and others. You don’t have to do anything special to use these themes.
  • Other themes from the ggthemes package (such as theme_tufte() and theme_fivethirtyeight()) can be found at this page.
    • Install the package with install.packages("ggthemes").
    • Load the package with library(ggthemes).
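
One way to follow the one-theme advice above is to set the theme once per session with ggplot's theme_set(), so that each graph picks it up without its own theme_*() line. The sketch below assumes ggplot2 is installed and uses the built-in mtcars data as a stand-in for the data used on this page.

```r
library(ggplot2)

# Set the session-wide default theme. theme_set() returns the previous
# default invisibly, so it can be restored later if needed.
old_theme <- theme_set(theme_bw())

# No theme_*() call is needed here: graphs drawn from now on use theme_bw().
# mtcars is a stand-in dataset, not the data used on this page.
p <- mtcars |>
  ggplot(aes(factor(cyl))) +
    geom_bar() +
    labs(x = "Cylinders", y = "Number of cars")

# To go back to the earlier default when finished:
# theme_set(old_theme)
```

The examples on this page still add theme_bw() to each graph explicitly, which keeps every code block self-contained.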

3.2 Colors

Color palettes specify the colors of the data plots in your graphs. You have an almost unlimited number of choices for specifying colors, with the following basic decisions:

  • Pre-defined: Many color palettes exist already, and it’s a good idea to start your learning process by simply using them. These palettes are either built into R or can be installed as packages. We provide examples of each.
    • Install ColorBrewer with install.packages("RColorBrewer").
      • Use the package with library(RColorBrewer).
      • Information about ColorBrewer on this page.
    • Information about colorblind-friendly palettes can be found on this page.
  • Manual: In certain use cases, it makes sense to define your own color palettes. Lots of digital resources exist that can help you do this, and we’ll show you how to apply them to your graphs.

3.3 Introductory examples

All of this discussion might seem a bit abstract, so in this section we’ll present a set of examples that illuminate the effects of themes versus the effects of color palettes. In this section, we present three examples of the theme theme_bw() with three different color palettes. In the following section, we present three examples of a different theme, theme_fivethirtyeight(), with those same three color palettes.

These examples are not meant to highlight nuances in the use of themes and colors but are simply meant to demonstrate their general effects. The details are covered in the examples later on this page.

3.3.1 theme_bw()

You’ll see in this and the next two graphs the use of the theme_bw() theme, one that is built into ggplot. It sets the font faces, sizes, and formatting for the titles, axes, and legends. Look at this graph and the following two, and notice the similar look of the frame among the first three (especially compared to the following three).

3.3.1.1 A palette from brewer

ColorBrewer has many nice color palettes. In this case, we use the YlGnBu (yellow, green, blue) color palette. You can see the results in the colors of the bars. Note that this is a fill palette. We knew to use a fill scale because the aes() function contains fill=Response, meaning a different fill color is used for each different value of Response.

survey |> 
  ggplot(aes(Question, fill=Response)) +
    geom_bar(color="black") +
    labs(title = "Number of responses per question",
         subtitle = "For all years",
         x = "Question",
         y = "Number of each response") +
    scale_y_continuous(limits = c(0, 30000),
                       breaks = c(0, 5000, 10000,
                                  15000, 20000,
                                  25000, 30000),
                       labels = c("0", "5k", "10k",
                                  "15k", "20k",
                                  "25k", "30k")) +
    scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
    theme_bw() +
    scale_fill_brewer(palette = "YlGnBu")
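
To see which colors the YlGnBu palette contains, it can be inspected directly. This sketch assumes the RColorBrewer package is installed; its brewer.pal() function returns the hex codes for a palette of a given size (five here, one per value of Response).

```r
library(RColorBrewer)

# Five hex codes from the sequential YlGnBu palette,
# one for each of the five Response values.
ylgnbu5 <- brewer.pal(n = 5, name = "YlGnBu")
print(ylgnbu5)

# Optional: draw swatches of this palette in the plot window.
# display.brewer.pal(n = 5, name = "YlGnBu")
```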

3.3.1.2 A manual palette

This graph uses the same theme as the previous graph. The only difference between the two is in the color palette specified in scale_fill_manual(). Since there are five different values of Response, we need to specify five different hex codes. Go to this section for information about hex codes.

survey |> 
  ggplot(aes(Question, fill=Response)) +
    geom_bar(color="black") +
    labs(title = "Number of responses per question",
         subtitle = "For all years",
         x = "Question",
         y = "Number of each response") +
    scale_y_continuous(limits = c(0, 30000),
                       breaks = c(0, 5000, 10000,
                                  15000, 20000,
                                  25000, 30000),
                       labels = c("0", "5k", "10k",
                                  "15k", "20k",
                                  "25k", "30k")) +
    scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
    theme_bw() +
    scale_fill_manual(values = c("#2a4d69", "#4b86b4", 
                                 "#adcbe3", "#e7eff6", 
                                 "#63ace5"))
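
scale_fill_manual() assigns an unnamed vector of colors to the values of Response in order. A named vector makes the pairing explicit. The sketch below is one way to do this; the level names are the Response values that appear in the printed data on this page, and the pairing of colors to levels is our assumption for illustration.

```r
# Hex codes from the manual palette above, paired with Response levels.
# The pairing is illustrative; rearrange it to taste.
response_colors <- c(
  "Strongly\nDisagree" = "#2a4d69",
  "Disagree"           = "#4b86b4",
  "Neutral"            = "#adcbe3",
  "Agree"              = "#e7eff6",
  "Strongly\nAgree"    = "#63ace5"
)

# Base R's col2rgb() stops with an error on an invalid color, so this
# doubles as a check that every hex code is well formed.
rgb_values <- col2rgb(response_colors)
print(rgb_values)

# In the graph, the named vector replaces the unnamed one:
#   scale_fill_manual(values = response_colors)
```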

3.3.1.3 A palette for colorblind users

Again, in this graph we use the theme_bw() theme so the different look can be attributed to the color palette. The viridis color palette (with more information on this page) is designed to enable colorblind users to interpret graphs.

You can see that the full command is scale_fill_viridis_d() — the fill is specified as before, but also note the d at the end. This stands for discrete and is chosen since Response is a discrete variable; the alternative is c for continuous.

survey |> 
  ggplot(aes(Question, fill=Response)) +
    geom_bar(color="black") +
    labs(title = "Number of responses per question",
         subtitle = "For all years",
         x = "Question",
         y = "Number of each response") +
    scale_y_continuous(limits = c(0, 30000),
                       breaks = c(0, 5000, 10000,
                                  15000, 20000,
                                  25000, 30000),
                       labels = c("0", "5k", "10k",
                                  "15k", "20k",
                                  "25k", "30k")) +
    scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
    theme_bw() +
    scale_fill_viridis_d()
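
The discrete viridis colors can also be looked at outside of a graph. As a sketch, assuming the scales package is available (it is installed alongside ggplot2), viridis_pal() returns a function that picks a requested number of colors from the viridis scale.

```r
library(scales)

# Five evenly spaced colors from the viridis scale,
# one for each of the five Response values.
viridis5 <- viridis_pal()(5)
print(viridis5)

# A continuous column would use scale_fill_viridis_c() instead, which
# interpolates along the same color scale rather than picking n colors.
```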

3.3.2 theme_fivethirtyeight()

In this section, we use a different theme, theme_fivethirtyeight(), in combination with the same palettes that we used above. Because this theme comes from the ggthemes package, library(ggthemes) must be loaded first. We omit the per-graph discussion here since the only difference is the theme that is used. Be sure to compare each of these graphs with the corresponding one in the previous section.

3.3.2.1 A palette from brewer

survey |> 
  ggplot(aes(Question, fill=Response)) +
    geom_bar(color="black") +
    labs(title = "Number of responses per question",
         subtitle = "For all years",
         x = "Question",
         y = "Number of each response") +
    scale_y_continuous(limits = c(0, 30000),
                       breaks = c(0, 5000, 10000,
                                  15000, 20000,
                                  25000, 30000),
                       labels = c("0", "5k", "10k",
                                  "15k", "20k",
                                  "25k", "30k")) +
    scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
    theme_fivethirtyeight() +
    scale_fill_brewer(palette = "YlGnBu")

3.3.2.2 A manual palette

survey |> 
  ggplot(aes(Question, fill=Response)) +
    geom_bar(color="black") +
    labs(title = "Number of responses per question",
         subtitle = "For all years",
         x = "Question",
         y = "Number of each response") +
    scale_y_continuous(limits = c(0, 30000),
                       breaks = c(0, 5000, 10000,
                                  15000, 20000,
                                  25000, 30000),
                       labels = c("0", "5k", "10k",
                                  "15k", "20k",
                                  "25k", "30k")) +
    scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
    theme_fivethirtyeight() +
    scale_fill_manual(values = c("#2a4d69", "#4b86b4", 
                                 "#adcbe3", "#e7eff6", 
                                 "#63ace5"))

3.3.2.3 A palette for colorblind users

survey |> 
  ggplot(aes(Question, fill=Response)) +
    geom_bar(color="black") +
    labs(title = "Number of responses per question",
         subtitle = "For all years",
         x = "Question",
         y = "Number of each response") +
    scale_y_continuous(limits = c(0, 30000),
                       breaks = c(0, 5000, 10000,
                                  15000, 20000,
                                  25000, 30000),
                       labels = c("0", "5k", "10k",
                                  "15k", "20k",
                                  "25k", "30k")) +
    scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
    theme_fivethirtyeight() +
    scale_fill_viridis_d()

4 Examples

For these 50+ examples, we are following the graph creation process started on the aesthetics page. At the bottom of each graph, you can find a link to the first step so you can follow the graph creation process from the beginning for each graph.

For most of the examples, we use the same theme so as to demonstrate the feel of consistency of theme usage within a document. We recommend that you do the same…no matter your choice of theme.

4.1 1 discrete (with implicit count)

4.1.1 Single stacked bar: x (constant), y (implicit count), fill + bar()

We use the color-blind fill/discrete palette. We have also added color="grey" to geom_bar() to outline the fill areas on the bar. This is especially needed since the color for NA is so light.

This is not the most attractive graph, for sure. Many times when there are large blocks of color in a graph, lighter pastel colors look better. However, the viridis color palette is not designed to be attractive; it is designed to be useful for color-blind users.

survey |> 
  ggplot(aes(x = "All responses", fill=Status)) +
    geom_bar(color="grey") +
    labs(title = "Number of question responses by Student Status",
         subtitle = "For all years",
         x = element_blank(),
         y = "Number of responses",
         fill = "Student status") +
    scale_y_continuous(limits = c(0, 300000),
                       breaks = c(0, 50000, 100000, 
                                  150000, 200000, 
                                  250000, 300000),
                       labels = c("0", "50k", "100k", 
                                  "150k", "200k", 
                                  "250k", "300k")) +
    theme_bw() +
    scale_fill_viridis_d()

4.1.2 Bar graph showing distribution: x, y (implicit count) + bar()

Since this graph only has one color and fill, no color palette needs to be chosen. However, in order to lessen the intensity of the large block of color, we have set alpha in the geom_bar().

student_econ |> 
  ggplot(aes(Sex)) +
    geom_bar(alpha = 0.75) +
    labs(title = "Number of applicants by Gender",
         subtitle = "For most recent year",
         x = "Gender",
         y = "Number of applicants") +
    scale_y_continuous(limits = c(0, 1250),
                       breaks = c(0, 250, 500, 750, 
                                  1000, 1250)) + 
    theme_bw()

4.2 1 continuous

4.2.1 Histogram for continuous column: x + histogram()

Again, since this graph has just one color, we had previously set the fill and color outside of the aes() in geom_histogram() since they are unchanging. Further, a color palette does not need to be specified.

student_econ |> 
  ggplot(aes(x = Age)) +
    geom_histogram(bins=30, 
                   fill="grey", 
                   color="black",
                   na.rm = TRUE) +
    labs(title = "Number of applicants by Age",
         subtitle = "For most recent year",
         x = "Age",
         y = "Number of applicants") +
    theme_bw()

4.3 2 discrete

4.3.1 Stacked bar: x, y (implicit count), fill + bar()

In the previous step, we showed two different ways to handle the x-axis labels. We’ll drop one here since that choice does not affect choices of themes or colors.

4.3.1.1 Just the bar chart

Sometimes you are designing a graph to be printed on a black-and-white printer, so you want to use greyscale rather than colors. A very good choice for this situation is the "Greys" palette from brewer for fill.

survey |> 
  ggplot(aes(Question, fill=Response)) +
    geom_bar(color="black") +
    labs(title = "Number of responses per question",
         subtitle = "For all years",
         x = "Question",
         y = "Number of each response") +
    scale_y_continuous(limits = c(0, 30000),
                       breaks = c(0, 5000, 10000,
                                  15000, 20000,
                                  25000, 30000),
                       labels = c("0", "5k", "10k",
                                  "15k", "20k",
                                  "25k", "30k")) +
    scale_x_discrete(guide = guide_axis(angle = 45)) +
    theme_bw() +
    scale_fill_brewer(palette = "Greys")

4.3.1.2 Adding a text label to the bar chart

Now, further suppose that we want to show the underlying values on the graph itself. In the next two charts, we demonstrate how to show the value itself or a calculated value (percent, in this case).

Our first step is to create a data frame containing the values that we need — in this case, n which is the count of the number of a particular response to a particular question.

surveycalc <- 
  survey |> 
    group_by(Question, Response) |> 
    summarize(n = n())
surveycalc |> 
  select(Question, Response, n) |> 
  arrange(Question, Response)
# A tibble: 50 × 3
# Groups:   Question [10]
   Question     Response                 n
   <ord>        <fct>                <int>
 1 TooDifficult "Strongly\nDisagree"  3040
 2 TooDifficult "Disagree"            5917
 3 TooDifficult "Neutral"             9052
 4 TooDifficult "Agree"               5956
 5 TooDifficult "Strongly\nAgree"     2914
 6 NotRelevant  "Strongly\nDisagree"  4933
 7 NotRelevant  "Disagree"            7335
 8 NotRelevant  "Neutral"             7271
 9 NotRelevant  "Agree"               4898
10 NotRelevant  "Strongly\nAgree"     2465
# ℹ 40 more rows

As you can see from the data above, we now have the n column available containing the number of times a response to a question was chosen.
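
For reference, dplyr's count() produces the same n column in one step. The sketch below uses a small made-up data frame (toy_survey is a hypothetical name, not the survey data) and assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical stand-in for the survey data: one row per response.
toy_survey <- tibble(
  Question = c("Q1", "Q1", "Q1", "Q2", "Q2"),
  Response = c("Agree", "Agree", "Neutral", "Agree", "Disagree")
)

# Long form, as used on this page. .groups = "drop" returns an
# ungrouped result (the version above stays grouped by Question).
by_summarize <- toy_survey |>
  group_by(Question, Response) |>
  summarize(n = n(), .groups = "drop")

# Shorthand: count() groups, counts, and ungroups in one call.
by_count <- toy_survey |>
  count(Question, Response)

print(by_count)
```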

Now that we have the data, it is only a matter of adding the geom_label() function to the graph:

  • aes(label): specifies that the value to be printed is the n column (that we just calculated).
  • position: since this is a stacked graph, we need to specify position_stack() for the placement of the label. Further, since we want it to be placed in the middle (vertically) of each bar section, we set vjust = 0.5. If we had wanted it near the top of each bar section, we might have specified 0.9 instead.
  • fill: specifies the background fill color of each label.
  • color: specifies the color of the font.
  • fontface: specifies the font face to be used.
  • size: specifies the size of the text label.

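One other change is on line 3 of the code below: the fill variable is wrapped in fct_rev(), which reverses the order of the Response levels. (To be clear, this feels more like a bug than a feature, but without it the labels end up on the wrong bar sections.) We have not confirmed the cause, but a plausible explanation is that setting fill = "black" as a constant in geom_label() removes the fill mapping from that layer, so the labels are no longer grouped by Response and are stacked in data order, while the bar sections are stacked in the reverse of the level order. Reversing the levels makes the two orders agree.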

surveycalc |> 
  ggplot(aes(Question, y = n, 
             fill=fct_rev(Response))) +
    geom_bar(stat = "identity", color="black") +
    geom_label(aes(label = n),
               position = position_stack(vjust = 0.5),
               fill = "black",
               color = "white", fontface = "bold",
               size = 3) +
    labs(title = "Number of responses per question",
         subtitle = "For all years",
         x = "Question",
         y = "Number of each response",
         fill = "Response") +
    scale_y_continuous(limits = c(0, 30000),
                       breaks = c(0, 5000, 10000,
                                  15000, 20000,
                                  25000, 30000),
                       labels = c("0", "5k", "10k",
                                  "15k", "20k",
                                  "25k", "30k")) +
    scale_x_discrete(guide = guide_axis(angle = 45)) +
    theme_bw() +
    scale_fill_brewer(palette = "Greys")

4.3.1.3 Adding a calculated percent label

We’re going to build on the previous example. This time we want to show the percentage of times (not the raw count) that a response to a question is chosen. This requires a two-step calculation, first involving summarize() and then a mutate(). The summarize() portion removes all columns other than Question, Response, and n. We want to retain these columns after the next group_by(), so we use mutate() to calculate two more columns:

  • totquestresponses: the sum of all the n values for a particular Question.
  • percent: the percent of times a particular response is chosen for a particular question.

You can see from the printed data that the total of the n column for a particular value in the Question column equals the value in any one of its totquestresponses column values (which are all the same). The percent column is, just as is specified, the n column divided by the totquestresponses column.

surveycalc <- 
  survey |> 
    group_by(Question, Response) |> 
    summarize(n = n()) |>
    group_by(Question) |> 
    mutate(totquestresponses = sum(n),
           percent = n/totquestresponses)
surveycalc |> 
  select(Question, Response, n, totquestresponses, percent) |> 
  arrange(Question, Response)
# A tibble: 50 × 5
# Groups:   Question [10]
   Question     Response                 n totquestresponses percent
   <ord>        <fct>                <int>             <int>   <dbl>
 1 TooDifficult "Strongly\nDisagree"  3040             26879  0.113 
 2 TooDifficult "Disagree"            5917             26879  0.220 
 3 TooDifficult "Neutral"             9052             26879  0.337 
 4 TooDifficult "Agree"               5956             26879  0.222 
 5 TooDifficult "Strongly\nAgree"     2914             26879  0.108 
 6 NotRelevant  "Strongly\nDisagree"  4933             26902  0.183 
 7 NotRelevant  "Disagree"            7335             26902  0.273 
 8 NotRelevant  "Neutral"             7271             26902  0.270 
 9 NotRelevant  "Agree"               4898             26902  0.182 
10 NotRelevant  "Strongly\nAgree"     2465             26902  0.0916
# ℹ 40 more rows
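
A quick sanity check on this kind of calculation is that the percent values for each question add up to 1. Here is a minimal sketch using the five TooDifficult rows printed above, typed in by hand as toy_calc (a hypothetical name), and assuming dplyr is installed.

```r
library(dplyr)

# The TooDifficult rows from the output above, entered by hand.
toy_calc <- tibble(
  Question = "TooDifficult",
  Response = c("Strongly\nDisagree", "Disagree", "Neutral",
               "Agree", "Strongly\nAgree"),
  n = c(3040L, 5917L, 9052L, 5956L, 2914L)
) |>
  group_by(Question) |>
  mutate(totquestresponses = sum(n),
         percent = n / totquestresponses) |>
  ungroup()

# Each question's percents should add up to 1.
check <- toy_calc |>
  group_by(Question) |>
  summarize(total = sum(percent))
print(check)
```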

Having done the calculations, the only change that needs to be made here is in line 5 below. Let’s explain it piece-by-piece:

  • The value that we’re printing is percent * 100. Thus 0.113247 would be 11.3247.
  • However, we want to show 11.3 in the graph. The tool for this is the sprintf() function.
  • Now, at the end of 11.3 we want to concatenate (join together) a percent sign. The tool for this is the str_c() function (described on this page).

FYI, a great resource to learn about sprintf() can be found on this page.
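
The three steps above can be tried on a single value before putting them in a graph. This sketch uses only base R: paste0() stands in for str_c(..., sep = ""), and the second version shows that sprintf() can add the percent sign itself.

```r
# Format a proportion as a percentage with one decimal place.
percent <- 0.113247

# Step by step, as described above.
one_decimal <- sprintf("%1.1f", percent * 100)
label_a <- paste0(one_decimal, "%")

# Same result in a single call: "%%" prints a literal percent sign.
label_b <- sprintf("%.1f%%", percent * 100)

print(c(label_a, label_b))
```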

surveycalc |> 
  ggplot(aes(Question, y = n, 
             fill=forcats::fct_rev(Response))) +
    geom_bar(stat = "identity", color="black") +
    geom_label(aes(label = str_c(sprintf("%1.1f", 
                                         percent * 100), 
                                 "%",
                                 sep = "")),
               position = position_stack(vjust = 0.5),
               fill = "black",
               color = "white", fontface = "bold",
               size = 3) +
    labs(title = "Number of responses per question",
         subtitle = "For all years",
         x = "Question",
         y = "Number of each response",
         fill = "Response") +
    scale_y_continuous(limits = c(0, 30000),
                       breaks = c(0, 5000, 10000,
                                  15000, 20000,
                                  25000, 30000),
                       labels = c("0", "5k", "10k",
                                  "15k", "20k",
                                  "25k", "30k")) +
    scale_x_discrete(guide = guide_axis(angle = 45)) +
    theme_bw() +
    scale_fill_brewer(palette = "Greys")

4.3.2 Grouped bar: x, y (implicit count), fill + bar()

This is the first graph of many that contrasts results for males and females. As such, we are going to choose one color for each (in a manual color palette) and then use those colors every time we are representing males and females.

student_econ |> 
  ggplot(aes(Race, fill = Sex)) +
    geom_bar(position = "dodge") +
    labs(title = "Number of applicants for each race",
         subtitle = "For most recent year",
         x = "Race",
         y = "Number of applicants") +
    theme_bw() +
    scale_fill_manual(values = c("#00aedb", "#ffc425"))
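
To reuse the same two colors everywhere, one option is to store them once in a named vector and pass that vector to every scale_fill_manual() call. In this sketch the names "Female" and "Male" are assumed values of the Sex column (the actual labels in student_econ may differ, as may the color assigned to each), and toy_applicants is a made-up data frame.

```r
library(ggplot2)

# Define the palette once, near the top of the document.
# The level names are assumptions about how Sex is coded.
sex_colors <- c("Female" = "#00aedb", "Male" = "#ffc425")

# Made-up data standing in for student_econ.
toy_applicants <- data.frame(
  Race = c("A", "A", "B", "B", "B"),
  Sex  = c("Female", "Male", "Female", "Male", "Male")
)

# Every graph that fills by Sex passes the same named vector.
p <- toy_applicants |>
  ggplot(aes(Race, fill = Sex)) +
    geom_bar(position = "dodge") +
    theme_bw() +
    scale_fill_manual(values = sex_colors)
```

Because the vector is named, each color stays attached to the same value even if a graph shows the values in a different order.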

4.3.3 Facet wrap around grouped bar: x, y (implicit count), fill (redundant), facet + bar()

This graph presents the same information as the previous graph. Notice again that we’re using the same colors for male and female bars.

student_econ_ABHW |> 
  ggplot(aes(Sex, fill = Sex)) +
    facet_wrap(~Race, 
               ncol = 4,
               labeller = label_both) +
    geom_bar() +
    labs(title = "Number of applicants per Gender per Race",
         subtitle = "For most recent year",
         x = "Gender",
         y = "Number of applicants") +
    theme_bw() +
    scale_fill_manual(values = c("#00aedb", "#ffc425"))

4.4 1 discrete, 1 continuous

4.4.1 Bar chart of averages: x, y + col()

This chart only uses one color so the only thing that needs to be done is to lighten the color a bit using alpha.

surveyQAvg |> 
  ggplot(aes(Question, Avg)) +
    geom_col(alpha = 0.8) +
    labs(title = "Average response per Survey Question",
         subtitle = "For all years",
         x = "Question",
         y = "Average score") +
    scale_y_continuous(limits = c(0, 5), 
                       breaks = c(1, 2, 3, 4, 5),
                       labels = c("1.0", "2.0",
                                  "3.0", "4.0", "5.0")) +
    scale_x_discrete(guide = guide_axis(angle = 45))  +
    theme_bw()

4.4.2 Bar chart with sorted averages: x (reordered), y + col()

This chart only uses one color so the only things that need to be done are to lighten the color a bit using alpha and, in this case, to change the color of the outlines to black.

After looking at the value, we have decided at this late stage to show the value on the graph as well. To do this, we use geom_text(). Its attributes here require a bit of explaining:

  • aes(label): the text displayed on the graph. We use the sprintf() function to format the number so that it has 2 digits after the decimal point.
  • aes(y): the vertical (y) position of the text. Here we want it above the top of the bar so we add a value to the y position of the top line of the bar (Avg in this graph). If we had wanted it below, we would have used subtraction. The needed value per graph depends on the values in the y-axis, so you will probably need to experiment a few times to get the position that you want.
  • size: the size of the text. Again, you’ll need to experiment with this value. You might also put it inside the aes() and have it vary depending on a column.
  • color: the color of the text.

We didn’t use geom_label() here because we did not need the text background box.

surveyQAvg |> 
  ggplot(aes(x = fct_reorder(Question, Avg, .desc = TRUE),
             Avg)) +
    geom_col(alpha = 0.8, color = "black") +
    geom_text(aes(label = sprintf("%1.2f", Avg), 
                  y = Avg + 0.15),
              size = 3, color = "black") + 
    labs(title = "Average response per Survey Question (in descending order)",
         subtitle = "For all years",
         x = "Question",
         y = "Average score") +
    scale_y_continuous(limits = c(0, 5),
                       breaks = c(1, 2, 3, 4, 5),
                       labels = c("1.0", "2.0",
                                  "3.0", "4.0", "5.0")) +
    scale_x_discrete(guide = guide_axis(angle = 45)) +
    theme_bw()

4.4.3 Boxplot reliant on other variable: x, y + boxplot()

Since neither the fill nor the color varies by a variable with these boxplots, we normally would see black outlines and white fills. In this case, we think it’s easier to interpret the graphs by changing the fill color. We do this for both of the graphs in this section.

student_econ |> 
  ggplot(aes(Race, SAT)) +
    geom_boxplot(alpha = 0.5, fill = "grey") +
    labs(title = "Range of SAT scores per applicant race",
         subtitle = "For all years",
         x = "Race",
         y = "SAT scores") +
    scale_y_continuous(limits = c(800, 1600),
                       breaks = c(800, 1000, 1200, 
                                  1400, 1600)) +
    theme_bw()

In this example, we show the distribution of UnivGPA values by the Gender of the applicant:

admitdatagenderMFgpa |> 
  ggplot(aes(x = Gender, y = UnivGPA)) +
    geom_boxplot(alpha = 0.5, fill = "lightblue") +
    labs(title = "Range of University GPA per student gender",
         subtitle = "For all years",
         x = "Gender",
         y = "University GPA") +
    scale_y_continuous(limits = c(0, 4),
                       breaks = c(0, 1, 2, 3, 4),
                       labels = c("0.0", "1.0",
                                  "2.0", "3.0", "4.0")) +
    theme_bw()

4.4.4 Violin chart reliant on other variable: x, y + horizontal violin()

In this graph we use a manual fill color palette to set the colors since we want to be consistent with the gender colors that we have previously used.

admitdatagendergpa |> 
  ggplot(aes(x = Gender, 
             y = UnivGPA, 
             fill=Gender)) +
    geom_violin(scale="count")+
    labs(title = "Distribution of University GPA per student gender",
         subtitle = "For all years",
         x = "Gender",
         y = "University GPA",
         fill = "Student Gender") +
    scale_x_discrete(limits = rev) +
    scale_y_continuous(limits = c(0, 4),
                       breaks = c(0, 1, 2, 3, 4),
                       labels = c("0.0", "1.0",
                                  "2.0", "3.0", "4.0")) +
    coord_flip()+
    theme_bw() +
    scale_fill_manual(values = c("#00aedb", "#ffc425",
                                 "darkgrey", "lightgrey"))

4.5 2 discrete, 1 continuous

4.5.1 Grouped bar (x, y, fill + col())

This one is similar to this graph but uses geom_col(). This uses the familiar viridis color palette, adding black outlines and muting the colors somewhat.

surveyQRN |> 
  ggplot(aes(x = Question, 
             y = n, 
             fill = Response)) +
    geom_col(color = "black", alpha = 0.75) +
    labs(title = "Number of responses per question",
         subtitle = "For all years",
         x = "Question",
         y = "Number of each response") +
    scale_y_continuous(limits = c(0, 30000),
                       breaks = c(0, 5000, 10000,
                                  15000, 20000,
                                  25000, 30000),
                       labels = c("0", "5k", "10k",
                                  "15k", "20k",
                                  "25k", "30k")) +
    scale_x_discrete(guide = guide_axis(angle = 45))+
    theme_bw() +
    scale_fill_viridis_d()

Sometimes, we want to show the bars next to each other rather than stacked. Here we use a manual color palette while, again, adding a black outline to the bars. We think the black outlines are a near requirement when one or more of the fill colors are very light.

surveyQRN |> 
  ggplot(aes(x = Question, y = n, 
             fill = Response)) +
    geom_col(position="dodge",
             color = "black") +
    labs(title = "Number of responses per question",
         subtitle = "For all years",
         x = "Question",
         y = "Number of each response") +
    scale_y_continuous(limits = c(0, 15000),
                       breaks = c(0, 3000, 6000,
                                  9000, 12000, 15000),
                       labels = c("0", "3k", "6k",
                                  "9k", "12k", "15k")) +
    scale_x_discrete(guide = guide_axis(angle = 45))+
    theme_bw() +
    scale_fill_manual(values = c("#2a4d69", "#4b86b4", 
                                 "#adcbe3", "#e7eff6", 
                                 "#63ace5"))

Here is another example, which displays the same graph as shown in this section. It uses our typical manual color palette for male and female.

student_econ |> 
  group_by(Race, Sex) |> 
  summarize(Count = n()) |> 
  ggplot(aes(Race, y = Count, fill = Sex)) +
    geom_col(position="dodge") +
    labs(title = "Number of applicants for each race",
         subtitle = "For most recent year",
         x = "Race",
         y = "Number of applicants")+
    theme_bw() +
    scale_fill_manual(values = c("#00aedb", "#ffc425"))

4.5.2 Facets around bar (x, y, facet + col())

This graph, again, displays the data shown in the graphs in this section. No color palette is used here though we do set an alpha in order to tone down the fill color.

surveyQRN |> 
  ggplot(aes(x = Response, y = n)) +
    facet_wrap(~Question, ncol = 2) +
    geom_col(position="dodge",
             alpha = 0.8) +
    labs(title = "Distributions of responses for each question",
         subtitle = "For all years",
         x = "Response",
         y = "Number of responses") +
    scale_y_continuous(limits = c(0, 14000),
                       breaks = c(0, 3500, 7000, 
                                  10500, 14000),
                       labels = c("0", "3.5k", "7k",
                                  "10.5k", "14k"))+
    theme_bw()

4.5.3 Facets around horizontal bar (x, y, facet + col())

In this graph, we display the same data as in the previous graph. The only difference, obviously, is that the bars are displayed horizontally. The theme and fill stay the same.

surveyQRN |> 
  ggplot(aes(x = n, y = Response)) +
    facet_wrap(~Question, ncol = 5) +
    geom_col(position="dodge", 
             alpha = 0.8) +
    labs(title = "Distributions of responses for each question",
         subtitle = "For all years",
         y = "Response",
         x = "Number of responses") +
    scale_x_continuous(limits = c(0, 14000),
                       breaks = c(0, 3500, 7000, 
                                  10500, 14000),
                       labels = c("0", "3.5k", "7k",
                                  "10.5k", "14k"),
                       guide = guide_axis(angle = 45))+
    theme_bw()

4.5.4 Grouped bar, narrowed width of bars (x, y, fill + col())

In this and the following two sections, we are going to modify the graph shown in this section. We had already set the color outlines to black in a previous step; here, we just use the usual theme and set a manual color palette.

surveyQRN |> 
  ggplot(aes(x = Question, 
             y = n, 
             fill = Response)) +
    geom_col(position="dodge", 
             width=0.8,
             color = "black") +
    labs(title = "Number of responses per question",
         subtitle = "For all years",
         x = "Question",
         y = "Number of each response") +
    scale_y_continuous(limits = c(0, 15000),
                       breaks = c(0, 3000, 6000,
                                  9000, 12000, 15000),
                       labels = c("0", "3k", "6k",
                                  "9k", "12k", "15k")) +
    scale_x_discrete(guide = guide_axis(angle = 45))+
    theme_bw() +
    scale_fill_manual(values = c("#2a4d69", "#4b86b4", 
                                 "#adcbe3", "#e7eff6", 
                                 "#63ace5"))

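One caveat with an unnamed values vector is that colors are matched to the levels by position. The following is a minimal sketch, assuming a small made-up data frame (toy_survey, with hypothetical Response levels, not the real surveyQRN data), of how a named vector pins each color to a level regardless of the order in which the colors are listed:

```r
library(ggplot2)

# Hypothetical stand-in for the survey data; not the real surveyQRN
toy_survey <- data.frame(
  Response = factor(c("Agree", "Disagree"),
                    levels = c("Agree", "Disagree")),
  n = c(120, 80)
)

# Names tie each color to a level, so the order of the vector is irrelevant
response_colors <- c("Disagree" = "#ffc425", "Agree" = "#00aedb")

p_named <- toy_survey |>
  ggplot(aes(x = Response, y = n, fill = Response)) +
    geom_col(color = "black") +
    theme_bw() +
    scale_fill_manual(values = response_colors)
```

Naming the values is optional, but it can guard against colors silently shifting if a level is added or dropped later.
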
4.5.5 Grouped bar, narrowed & overlapping bars (x, y, fill + col())

Again, we are displaying the same data as the previous graph. We also use the same theme and color palette.

surveyQRN |> 
  ggplot(aes(x = Question, 
             y = n, 
             fill = Response)) +
    geom_col(position=position_dodge(width=0.5), 
             width=0.9,
             color = "black") +
    labs(title = "Number of responses per question",
         subtitle = "For all years",
         x = "Question",
         y = "Number of each response") +
    scale_y_continuous(limits = c(0, 15000),
                       breaks = c(0, 3000, 6000,
                                  9000, 12000, 15000),
                       labels = c("0", "3k", "6k",
                                  "9k", "12k", "15k")) +
    scale_x_discrete(guide = guide_axis(angle = 45))+
    theme_bw() +
    scale_fill_manual(values = c("#2a4d69", "#4b86b4", 
                                 "#adcbe3", "#e7eff6", 
                                 "#63ace5"))

4.5.6 Grouped bar, narrowed & spaced bars (x, y, fill + col())

In this graph, we take the opposite approach to the graph in the previous section. We narrow the groups by a small amount but narrow each individual bar even more. We still use the same theme and color palette.

surveyQRN |> 
  ggplot(aes(x = Question, 
             y = n, 
             fill = Response)) +
    geom_col(position=position_dodge(width=0.9), 
             width=0.5,
             color = "black") +
    labs(title = "Number of responses per question",
         subtitle = "For all years",
         x = "Question",
         y = "Number of each response") +
    scale_y_continuous(limits = c(0, 15000),
                       breaks = c(0, 3000, 6000,
                                  9000, 12000, 15000),
                       labels = c("0", "3k", "6k",
                                  "9k", "12k", "15k")) +
    scale_x_discrete(guide = guide_axis(angle = 45))+
    theme_bw() +
    scale_fill_manual(values = c("#2a4d69", "#4b86b4", 
                                 "#adcbe3", "#e7eff6", 
                                 "#63ace5"))

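The difference between the last two graphs comes down to two numbers: the width given to position_dodge() controls how far apart the bar centers are spread within a group, while the width of geom_col() controls how wide the bars are drawn. Here is a rough sketch of that interplay, assuming made-up data (toy_counts) and a hypothetical helper function (dodge_plot) that are not part of the original page:

```r
library(ggplot2)

# Hypothetical two-question, two-response data
toy_counts <- data.frame(
  Question = rep(c("Q1", "Q2"), each = 2),
  Response = rep(c("No", "Yes"), times = 2),
  n = c(30, 70, 55, 45)
)

# Hypothetical helper that builds the same grouped bar chart
# with different dodge and bar widths
dodge_plot <- function(dodge_width, bar_width) {
  toy_counts |>
    ggplot(aes(x = Question, y = n, fill = Response)) +
      geom_col(position = position_dodge(width = dodge_width),
               width = bar_width,
               color = "black") +
      theme_bw()
}

# Dodge width smaller than bar width: bars overlap
p_overlap <- dodge_plot(dodge_width = 0.5, bar_width = 0.9)

# Dodge width larger than bar width: bars are separated by gaps
p_spaced <- dodge_plot(dodge_width = 0.9, bar_width = 0.5)
```

When the two widths are equal, the bars within a group should sit side by side with no gap and no overlap.
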
4.5.7 Percent Stacked bar (x, y, fill + col())

Back in this section, we created a graph that showed the raw counts of the number of each response for each question. This time we want to show the percent of answers for each response to each question. Again, we use the same theme and color palette as before.

surveyQRN |> 
  ggplot(aes(x = Question, 
             y = n, 
             fill = Response)) +
    geom_col(position = "fill", color = "black") +
    labs(title = "Percentage of each response to each question",
         subtitle = "For all years",
         x = "Question",
         y = "Percent of each response") +
    scale_y_continuous(breaks = c(0.0, 0.25, 0.5, 
                                  0.75, 1.0),
                       labels = c("0%", "25%", "50%",
                                  "75%", "100%")) +
    scale_x_discrete(guide = guide_axis(angle = 45))+
    theme_bw() +
    scale_fill_manual(values = c("#2a4d69", "#4b86b4", 
                                 "#adcbe3", "#e7eff6", 
                                 "#63ace5"))

4.5.8 Bar chart wrapped by a facet: x, y (implicit count), fill (redundant), facet + bar()

This graph presents the same information as in this section but without specifying the ncol argument. Notice how we don’t have to change the theme and color palette when working with facets; they automatically get applied to each facet.

student_econ_ABHW |> 
  ggplot(aes(Sex, fill = Sex)) +
    geom_bar(alpha = 0.8) +
    facet_wrap(~Race) +
    labs(title = "Number of applicants per Gender per Race",
         subtitle = "For most recent year",
         x = "Gender",
         y = "Number of applicants")+
    theme_bw() +
    scale_fill_manual(values = c("#00aedb", "#ffc425"))

4.5.9 Column chart wrapped by facets: x, y (explicit count), facet + col()

This graph displays the same information as the previous graph but goes about it entirely differently. Neither the theme nor the color palette changes.

student_RaceSexCount |> 
  ggplot(aes(Sex, Count)) +
    geom_col(aes(fill = Sex), 
             alpha = 0.8) +
    facet_wrap(~Race) +
    labs(title = "Number of applicants per Gender per Race",
         subtitle = "For most recent year",
         x = "Gender",
         y = "Number of applicants")+
    theme_bw() +
    scale_fill_manual(values = c("#00aedb", "#ffc425"))

If you decide that you want to show the value on the graph as well, you should use the geom_text() geom. Its attributes here require a bit of explaining:

  • label: the text displayed on the graph.
  • y: the vertical (y) position of the text. Here we want it below the top of the bar so we subtract a value from the y position of the top line of the bar (Count in this graph). If we had wanted it above, we would have used addition. The needed value per graph depends on the values in the y-axis, so you will probably need to experiment a few times to get the position that you want.
  • size: the size of the text. Again, you’ll need to experiment with this value. You might also put it inside the aes() and have it vary depending on a column.
  • color: the color of the text.

student_RaceSexCount |> 
  ggplot(aes(Sex, Count)) +
    geom_col(aes(fill = Sex), 
             alpha = 0.8) +
    geom_text(aes(label = Count, y = Count - 20),
              size = 3, color = "black") + 
    facet_wrap(~Race) +
    labs(title = "Number of applicants per Gender per Race",
         subtitle = "For most recent year",
         x = "Gender",
         y = "Number of applicants")+
    theme_bw() +
    scale_fill_manual(values = c("#00aedb", "#ffc425"))
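
If experimenting with the subtracted value becomes tedious, one possible alternative is to leave y at the top of the bar and shift the text with vjust instead, since vjust is expressed in units of text height rather than in units of the y-axis. This is only a sketch with made-up counts (toy_counts is hypothetical, not the real student_RaceSexCount data):

```r
library(ggplot2)

# Hypothetical counts; not the real student_RaceSexCount data
toy_counts <- data.frame(
  Sex = c("F", "M"),
  Count = c(240, 180)
)

p_vjust <- toy_counts |>
  ggplot(aes(Sex, Count)) +
    geom_col(aes(fill = Sex), alpha = 0.8) +
    # vjust > 1 pushes the text below its anchor (the top of the bar);
    # a negative value would place it above the bar instead
    geom_text(aes(label = Count),
              vjust = 1.5,
              size = 3, color = "black") +
    theme_bw() +
    scale_fill_manual(values = c("#00aedb", "#ffc425"))
```

Because the offset no longer depends on the y-axis values, the same vjust tends to work across facets and data sets with very different counts.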

4.5.10 Colored column chart wrapped by facets: x, y (explicit count), fill (redundant), facet + col()

Clearly, this is the same graph as the previous section but with bar colors specified by fill=Sex. Everything else is the same. Again, we’re using the same theme and manual color palette.

student_RaceSexCount |> 
  ggplot(aes(Sex, Count, fill=Sex)) +
    geom_col(alpha = 0.8) +
    facet_wrap(~Race) +
    labs(title = "Number of applicants per Gender per Race",
         subtitle = "For most recent year",
         x = "Gender",
         y = "Number of applicants")+
    theme_bw() +
    scale_fill_manual(values = c("#00aedb", "#ffc425"))

4.5.11 Boxplot differentiated by 2 other columns: x, y, color + boxplot()

In this graph, we set the fill to lightgrey so that it is easier to read the graph. We used a manual color palette for the outline color. Contrast this with the next graph to see what you prefer.

admitdataRaceGPAPell |> 
  ggplot(aes(IPEDSRaceEthnicity, UnivGPA,
             color = PellStatus)) +
    geom_boxplot(varwidth=TRUE, 
                 fill = "lightgrey") +
    labs(title = paste("University GPA range per",
                       "race and Pell Grant Status",
                       sep=" "),
         subtitle = "For all years",
         x = "Race/ethnicity",
         y = "University GPA",
         color = "Pell Grant\nStatus")+
    theme_bw() +
    scale_color_manual(values = c("#2a4d69", "#4b86b4"))

4.5.12 Colored boxplot differentiated by 2 other columns: x, y, fill + boxplot()

This graph is exactly the same as the graph in the previous section except that it uses fill instead of color to highlight Pell Status. We use a fill color palette that changes based on the PellStatus.

admitdataRaceGPAPell |> 
  ggplot(aes(IPEDSRaceEthnicity, UnivGPA,
             fill = PellStatus)) +
    geom_boxplot(varwidth=TRUE) +
    labs(title = paste("University GPA range per",
                       "race and Pell Grant Status",
                       sep=" "),
         subtitle = "For all years",
         x = "Race/ethnicity",
         y = "University GPA",
         fill = "Pell Grant Status")+
    theme_bw() +
    scale_fill_manual(values = c("#2a4d69", "#4b86b4"))

4.5.13 Boxplot differentiated by one column and wrapped by another: x, y, facet + boxplot()

This graph displays the same data as in the previous graph except that it uses facet_wrap() instead of fill or color to differentiate Pell Status. We did set the fill for the boxplot to make it easier to compare the plots.

admitdataRaceGPAPell |> 
  ggplot(aes(IPEDSRaceEthnicity, UnivGPA)) +
    facet_wrap(~PellStatus,
               labeller = label_both) +
    geom_boxplot(varwidth=TRUE, 
                 fill = "lightblue") +
    labs(title = paste("University GPA range per",
                       "race and Pell Grant Status",
                       sep=" "),
         subtitle = "For all years",
         x = "Race/ethnicity",
         y = "University GPA")+
    theme_bw()

4.5.14 Horizontal boxplot differentiated by one column and wrapped by another: x, y, facet + horizontal boxplot()

This graph displays the same data as in the previous graph except that it displays the boxplots horizontally. The themes and fills are the same for both graphs.

admitdataRaceGPAPell |> 
  ggplot(aes(IPEDSRaceEthnicity, UnivGPA)) +
    facet_wrap(~PellStatus,
               labeller = label_both) +
    geom_boxplot(varwidth=TRUE, fill = "lightblue") +
    coord_flip() +
    labs(title = paste("University GPA range per",
                       "race and Pell Grant Status",
                       sep=" "),
         subtitle = "For all years",
         x = "Race/ethnicity",
         y = "University GPA")+
    theme_bw() 
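
As an alternative to coord_flip(), recent versions of ggplot2 (3.3.0 and later, as far as we know) can infer the orientation of a boxplot when the discrete column is mapped to y, which mirrors what we did for the horizontal bar chart earlier. A minimal sketch, assuming made-up data (toy_gpa and its column values are hypothetical):

```r
library(ggplot2)

# Hypothetical GPA data; not the real admitdataRaceGPAPell data
set.seed(1)
toy_gpa <- data.frame(
  Race = rep(c("Group A", "Group B"), each = 20),
  UnivGPA = round(runif(40, min = 2.0, max = 4.0), 2)
)

p_swapped <- toy_gpa |>
  ggplot(aes(x = UnivGPA, y = Race)) +   # discrete column on y
    geom_boxplot(varwidth = TRUE, fill = "lightblue") +
    labs(x = "University GPA",
         y = "Race/ethnicity") +
    theme_bw()
```

With this approach the x and y labels refer to the axes as they appear on the page, which some readers find easier to follow than the flipped coordinates.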

4.6 Ordered, numeric, discrete

If you have one column that is ordered (e.g., a series of dates), another that is discrete (e.g., sex, Pell Grant Status, race/ethnicity, etc.), and another column that is numeric, then the situation is calling out for a line graph that further differentiates (based on the discrete column) via either colors or facets. Let’s look at a few examples.

4.6.1 Line chart: x, y, color + line()

As usual, we set the theme and a color palette for the lines. In a previous step, we set the linewidth to 1 to make the lines easier to see.

admitdataYearGenderCount |> 
  ggplot(aes(x = AdmitCalendarYear, 
             y = Count, 
             color = Gender,
             shape = Gender)) +
    geom_line(linewidth = 1) +
    labs(title = paste("Number of Admits by",
                       "Gender Per Calendar Year",
                       sep = " "),
         x = "Calendar Year",
         y = "Number of Admits") +
    scale_y_continuous(limits = c(0, 900),
                       breaks = c(0, 150, 300, 450,
                                  600, 750, 900)) +
    scale_x_continuous(limits = c(2011, 2022),
                       breaks = c(2012, 2014, 2016,
                                  2018, 2020, 2022)) +
    theme_bw() +
    scale_color_manual(values = c("#00aedb", "#ffc425",
                                 "darkgrey", "lightgrey"))

If you are worried that the reader will not be able to see the line colors well enough, you might consider adding a point plot to the graph. We plot the Count against AdmitCalendarYear (as is done for the line) and we set size=3 so that the shapes are easier to see. The legend changes to show both the colors and the shapes for all the lines.

admitdataYearGenderCount |> 
  ggplot(aes(x = AdmitCalendarYear, 
             y = Count, 
             color = Gender,
             shape = Gender)) +
    geom_line(linewidth = 1) +
    geom_point(aes(AdmitCalendarYear, Count),
               size = 3) +
    labs(title = paste("Number of Admits by",
                       "Gender Per Calendar Year",
                       sep = " "),
         x = "Calendar Year",
         y = "Number of Admits") +
    scale_y_continuous(limits = c(0, 900),
                       breaks = c(0, 150, 300, 450,
                                  600, 750, 900)) +
    scale_x_continuous(limits = c(2011, 2022),
                       breaks = c(2012, 2014, 2016,
                                  2018, 2020, 2022)) +
    theme_bw() +
    scale_color_manual(values = c("#00aedb", "#ffc425",
                                 "darkgrey", "lightgrey"))

If knowing the underlying values is important, then you might want to plot the values on the graph. Both geom_text() and geom_label() can accomplish this. You can learn about the subtleties of both on this page. The geom_label() option plots text within a background rectangle. Let’s examine the arguments for a moment:

  • label: the text that is displayed on the graph.
  • fill: sets the color of the label background.
  • color: sets the color of the label text to a constant value (since it’s outside of the aes()).
  • fontface: makes the text a little easier to read against the fill.
  • hjust: puts the text directly over the plotted point.

Notice that we also removed the shape argument from the ggplot(aes()) because the label prints over it.

admitdataYearGenderCount |> 
  ggplot(aes(x = AdmitCalendarYear, 
             y = Count, 
             color = Gender)) +
    geom_line(linewidth = 1) +
    geom_point(aes(AdmitCalendarYear, Count),
               size = 3) +
    geom_label(aes(label = Count, fill = Gender), 
               color = "white", fontface = "bold",
               hjust = "center") + 
    labs(title = paste("Number of Admits by",
                       "Gender Per Calendar Year",
                       sep = " "),
         x = "Calendar Year",
         y = "Number of Admits") +
    scale_y_continuous(limits = c(0, 900),
                       breaks = c(0, 150, 300, 450,
                                  600, 750, 900)) +
    scale_x_continuous(limits = c(2011, 2022),
                       breaks = c(2012, 2014, 2016,
                                  2018, 2020, 2022)) +
    theme_bw() +
    scale_color_manual(values = c("#00aedb", "#ffc425",
                                 "darkgrey", "lightgrey"))

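In the graph above, the label backgrounds are mapped to fill, while scale_color_manual() by default covers only the color aesthetic, so the labels are filled from ggplot's default palette. If you would prefer the labels to match the lines, one option is the aesthetics argument, which lets a single manual scale apply to both. A minimal sketch with made-up data (toy_admits is hypothetical):

```r
library(ggplot2)

# Hypothetical admit counts; not the real admitdataYearGenderCount data
toy_admits <- data.frame(
  Year = rep(2020:2022, times = 2),
  Count = c(410, 430, 455, 380, 395, 400),
  Gender = rep(c("F", "M"), each = 3)
)

p_shared <- toy_admits |>
  ggplot(aes(x = Year, y = Count, color = Gender)) +
    geom_line(linewidth = 1) +
    geom_label(aes(label = Count, fill = Gender),
               color = "white", fontface = "bold") +
    theme_bw() +
    # One manual scale applied to both the color and fill aesthetics
    scale_color_manual(values = c("#00aedb", "#ffc425"),
                       aesthetics = c("colour", "fill"))
```

Adding a separate scale_fill_manual() with the same values would be an equally reasonable way to get the same result.
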
In the following, we have a similar setup to the first graph in this section. We are not going to use a point plot here because there are far too many points on each line.

babynames |> 
  filter(Name %in% c("Jennifer", "Teresa", "Karen", 
                     "Linda", "Nancy") &
           Sex == "F") |> 
  ggplot(aes(x = YearOfBirth, 
             y = Number, 
             color=Name)) +
  geom_line(linewidth = 1) +
    labs(title = paste("Number of Births with",
                       "a Given Name Per Year",
                       sep = " "),
         x = "Calendar Year",
         y = "Number of babies born with a given name") +
    scale_x_continuous(limits = c(1880, 2020),
                       breaks = c(1880, 1900, 1920,
                                  1940, 1960, 1980,
                                  2000, 2020)) +
    scale_y_continuous(breaks = c(0, 20000, 40000,
                                  60000, 80000, 
                                  100000),
                       labels = c("0", "20k", "40k",
                                  "60k", "80k", "100k")) +
    theme_bw() +
    scale_color_manual(values = c("#d11141", "#00b159", 
                                  "#00aedb", "#f37735", 
                                  "#ffc425"))

4.6.2 Line chart wrapped by facets: x, y, facet + line()

This graph displays the same data as in the previous section; however, we are now differentiating on Gender via a facet_wrap() instead of color. We have no need to use a color or fill palette here.

admitdataYearGenderCount |> 
  ggplot(aes(x = AdmitCalendarYear, 
             y = Count)) +
    facet_wrap(~Gender,
               labeller = label_both) + 
    geom_line(linewidth = 1) +
    labs(title = paste("Number of Admits by",
                       "Gender Per Calendar Year",
                       sep = " "),
         x = "Calendar Year",
         y = "Number of Admits") +
    scale_y_continuous(limits = c(0, 900),
                       breaks = c(0, 150, 300, 450,
                                  600, 750, 900)) +
    scale_x_continuous(limits = c(2011, 2022),
                       breaks = c(2012, 2014, 2016,
                                  2018, 2020, 2022)) +
    theme_bw()

4.7 2 continuous

4.7.1 Point plot with fitted line: x, y + point() + smooth()

The default colors for both the line and point plots work just fine here. Let’s also talk for a moment about the alpha values set.

  • We can see the value of alpha=0.2 for point by noticing the heavy black in the middle. This stands out much more prominently because of the transparency of the points. The single points at the edge of the graph are much lighter since they are only 1- or 2-deep whereas the points in the middle of the cluster are deeply layered, and thus much darker.
  • The alpha=1.0 in geom_smooth() applies to the confidence band around the fitted line (the line itself is always drawn opaque), so the band is fully opaque. That, combined with the smooth being plotted after the points, means that it is shown on top of the points.

admitdataIncGPAGender |> 
  ggplot(aes(x = FamilyIncome, y = UnivGPA)) +
    geom_point(alpha = 0.2) +
    geom_smooth(alpha = 1.0) +
    labs(title = "University GPA by Family Income",
         subtitle = "For all years",
         x = "Family Income",
         y = "University GPA") +
    scale_y_continuous(limits = c(1.0, 4.0),
                       breaks = c(1.0, 2.0, 3.0, 4.0),
                       labels = c("1.0", "2.0",
                                  "3.0", "4.0")) +
    scale_x_continuous(breaks = c(0, 50000, 100000,
                                  150000, 200000,
                                  250000),
                       labels = c("$0", "$50k",
                                  "$100k", "$150k",
                                  "$200k", "$250k")) +
    theme_bw()

4.7.2 Hexplot with fitted line: x, y + hex() + smooth()

In this section, we present three different ways to apply colors to the geom_hex() plot. The theme stays the same for all three.

This first version simply shows the default colors.

admitdataIncGPAGender |> 
  ggplot(aes(FamilyIncome, UnivGPA)) + 
    geom_hex(bins = 50) + 
    geom_smooth() +
    labs(title = "University GPA range by Family Income",
         subtitle = "For all years",
         x = "Family Income",
         y = "University GPA") +
    scale_y_continuous(limits = c(1.0, 4.0),
                       breaks = c(1.0, 2.0, 3.0, 4.0),
                       labels = c("1.0", "2.0",
                                  "3.0", "4.0")) +
    scale_x_continuous(breaks = c(0, 50000, 100000,
                                  150000, 200000,
                                  250000),
                       labels = c("$0", "$50k",
                                  "$100k", "$150k",
                                  "$200k", "$250k")) +
    theme_bw()

This second version demonstrates the use of scale_fill_distiller() to set the colors. This function draws on the ColorBrewer palettes. The direction argument (set to 1 here) tells the function to run the colors from light to dark as the values increase. The name argument sets the title of the fill legend to the right of the graph. Finally, the palette argument sets the color palette that the function will use.

"Blues" is what’s known as a sequential palette. This is in contrast to a diverging palette. From this page: “If your story emphasizes the highest (=darkest) values, go for a sequential color scale. If your story is about the lowest and highest values, go for a diverging scale.” A diverging scale is especially useful if there is a meaningful middle point.

The following are some of Brewer’s palettes:

  • Diverging: BrBG, PiYG, PRGn, PuOr, RdBu, RdGy, RdYlBu, RdYlGn, Spectral
  • Sequential: Blues, BuGn, BuPu, GnBu, Greens, Greys, Oranges, OrRd, PuBu, PuBuGn, PuRd, Purples, RdPu, Reds, YlGn, YlGnBu, YlOrBr, YlOrRd

In this case, we’re choosing a sequential palette because we want to emphasize the most dense areas and there is not a meaningful middle point.

admitdataIncGPAGender |> 
  ggplot(aes(FamilyIncome, UnivGPA)) + 
    geom_hex(bins = 50) + 
    geom_smooth() +
    labs(title = "University GPA range by Family Income",
         subtitle = "For all years",
         x = "Family Income",
         y = "University GPA") +
    scale_y_continuous(limits = c(1.0, 4.0),
                       breaks = c(1.0, 2.0, 3.0, 4.0),
                       labels = c("1.0", "2.0",
                                  "3.0", "4.0")) +
    scale_x_continuous(breaks = c(0, 50000, 100000,
                                  150000, 200000,
                                  250000),
                       labels = c("$0", "$50k",
                                  "$100k", "$150k",
                                  "$200k", "$250k")) +
    scale_fill_distiller(palette = "Blues",
                         direction = 1,
                         name = "Count") +
    theme_bw()
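
The palette families listed above can also be inspected from R itself. The following sketch assumes that the RColorBrewer package is installed (it normally arrives as a dependency of ggplot2's scales package); the variable names are our own:

```r
library(RColorBrewer)

# brewer.pal.info has one row per palette; row names are the palette names
palette_info <- brewer.pal.info

# Sequential palettes and how many colors each one offers
sequential <- palette_info[palette_info$category == "seq", ]
sequential[, c("maxcolors", "colorblind")]

# Names of the diverging palettes
diverging <- rownames(palette_info[palette_info$category == "div", ])
diverging
```

The maxcolors column is worth checking before using a Brewer palette on a discrete scale, because a palette cannot supply more distinct colors than that number.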

Finally, in this third case, we create our own sequential palette using the scale_fill_gradient() function. This has several arguments:

  • name: the legend title for fill (passed here as the first, unnamed argument).
  • limits: the limits of the scale; values outside these limits are treated as missing and drawn in the na.value color. We created the graph without this function first (see above), noted the limits of the data in the legend, and then used those limits.
  • breaks: these are the breaks that we want to appear in the legend.
  • labels: these are the labels for each break.
  • na.value: the color for the missing values. We use "white" here because we want it to look like the white background of the theme.
  • low: the color of the smallest values in the graph.
  • high: the color of the largest values in the graph.

admitdataIncGPAGender |> 
  ggplot(aes(FamilyIncome, UnivGPA)) + 
    geom_hex(bins = 50) + 
    geom_smooth() +
    labs(title = "University GPA range by Family Income",
         subtitle = "For all years",
         x = "Family Income",
         y = "University GPA") +
    scale_y_continuous(limits = c(1.0, 4.0),
                       breaks = c(1.0, 2.0, 3.0, 4.0),
                       labels = c("1.0", "2.0",
                                  "3.0", "4.0")) +
    scale_x_continuous(breaks = c(0, 50000, 100000,
                                  150000, 200000,
                                  250000),
                       labels = c("$0", "$50k",
                                  "$100k", "$150k",
                                  "$200k", "$250k")) +
    scale_fill_gradient("Count",
                        limits = c(0, 80),
                        breaks = c(0, 20, 40, 60, 80),
                        labels = c("0", "20", "40", 
                                   "60", "80"),
                        na.value = "white",
                        low = "lightyellow",
                        high = "darkblue") + 
    theme_bw()
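
To see what limits and na.value do to out-of-range values, here is a small sketch on made-up data (toy_tiles is hypothetical, and we use geom_tile() only to keep the example tiny). It also shows one possible alternative, the oob argument with scales::squish, which clamps out-of-range values to the nearest limit instead of treating them as missing:

```r
library(ggplot2)

# Hypothetical tile values; 120 lies outside the limits used below
toy_tiles <- data.frame(x = 1:3, y = 1, value = c(10, 50, 120))

base_plot <- toy_tiles |>
  ggplot(aes(x, y, fill = value)) +
    geom_tile() +
    theme_bw()

# Default behavior: out-of-limits values become missing
# and are drawn in the na.value color
p_censor <- base_plot +
  scale_fill_gradient("Count",
                      limits = c(0, 80),
                      na.value = "white",
                      low = "lightyellow",
                      high = "darkblue")

# Alternative: squish out-of-limits values onto the nearest limit,
# so 120 is drawn with the same color as 80
p_squish <- base_plot +
  scale_fill_gradient("Count",
                      limits = c(0, 80),
                      oob = scales::squish,
                      low = "lightyellow",
                      high = "darkblue")
```

Which behavior is appropriate depends on whether you want readers to notice that some values fall outside the legend's range.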

4.7.3 Density/2D plot with fitted line: x, y + density_2d() + smooth()

The geom_density_2d geoms are yet another way to show density. We use the filled version to show the fill colors in the graph. We use the geom_density_2d() version to show the lines on the graph.

In this first version, we use scale_fill_brewer() to set the colors using a sequential palette. Here, we discover a problem with this palette for our present purposes — the graph has 11 separate levels, but the palette only has 9 colors. Thus, the last two (highest/greatest) levels go back and reuse the light colors from the first two (lowest/smallest) levels. Not good.

admitdataIncGPAGender |> 
  ggplot(aes(FamilyIncome, UnivGPA)) + 
    geom_density_2d_filled(alpha = 0.5) +
    geom_density_2d(linewidth = 0.25, color = "black") + 
    geom_smooth() +
    labs(title = "University GPA range by Family Income",
         subtitle = "For all years",
         x = "Family Income",
         y = "University GPA") +
    scale_y_continuous(limits = c(1.0, 4.0),
                       breaks = c(1.0, 2.0, 3.0, 4.0),
                       labels = c("1.0", "2.0",
                                  "3.0", "4.0")) +
    scale_x_continuous(breaks = c(0, 50000, 100000,
                                  150000, 200000,
                                  250000),
                       labels = c("$0", "$50k",
                                  "$100k", "$150k",
                                  "$200k", "$250k")) +
    scale_fill_brewer(palette = "Blues",
                      direction = 1) +
    theme_bw()

In order to fix this, we create our own sequential color palette using the colorRampPalette() function. We mimic the "Blues" palette but create a palette with 11 color levels. We also get rid of the alpha setting for the filled() geom in order to make the color intensity more obvious.

wbcolors <- colorRampPalette(c("white", "darkblue"))(11)
admitdataIncGPAGender |> 
  ggplot(aes(FamilyIncome, UnivGPA)) + 
    geom_density_2d_filled() +
    geom_density_2d(linewidth = 0.25, color = "black") + 
    geom_smooth() +
    labs(title = "University GPA range by Family Income",
         subtitle = "For all years",
         x = "Family Income",
         y = "University GPA") +
    scale_y_continuous(limits = c(1.0, 4.0),
                       breaks = c(1.0, 2.0, 3.0, 4.0),
                       labels = c("1.0", "2.0",
                                  "3.0", "4.0")) +
    scale_x_continuous(breaks = c(0, 50000, 100000,
                                  150000, 200000,
                                  250000),
                       labels = c("$0", "$50k",
                                  "$100k", "$150k",
                                  "$200k", "$250k")) +
    scale_fill_manual("Levels",
                      values = wbcolors) +
    theme_bw()
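
It may help to look at what colorRampPalette() actually returns. The sketch below uses only base R; the names white_to_blue and n_levels are our own, and the value 11 is taken from the number of levels described above:

```r
# colorRampPalette() returns a function; calling that function
# with n gives n colors interpolated between the endpoints
white_to_blue <- colorRampPalette(c("white", "darkblue"))

n_levels <- 11   # number of filled contour levels in the graph above
wbcolors <- white_to_blue(n_levels)

wbcolors[1]          # lightest color, used for the lowest level
wbcolors[n_levels]   # darkest color, used for the highest level
```

If a different data set produces a different number of levels, only n_levels needs to change.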

4.7.4 Boxplot based on continuous column: x, y + boxplot()

There is little to do here other than set the theme. We also set the fill for the boxplot because we like how this makes it stand out from the graph.

admitdataIncGPAGender |> 
  ggplot(aes(x = FamilyIncome, y = UnivGPA)) +
    geom_boxplot(aes(group=cut_width(FamilyIncome, 
                                     width=25000,
                                     boundary=0)),
                 varwidth=TRUE, 
                 fill = "lightgrey") +
    labs(title = "University GPA range by Family Income",
         subtitle = "For all years",
         x = "Family Income",
         y = "University GPA") +
    scale_y_continuous(limits = c(1.0, 4.0),
                       breaks = c(1.0, 2.0, 3.0, 4.0),
                       labels = c("1.0", "2.0",
                                  "3.0", "4.0")) +
    scale_x_continuous(breaks = c(0, 50000, 100000,
                                  150000, 200000,
                                  250000),
                       labels = c("$0", "$50k",
                                  "$100k", "$150k",
                                  "$200k", "$250k")) +
    theme_bw()
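
The grouping in this graph is done by cut_width(), which turns the continuous column into bins of a fixed width. A quick sketch on a few made-up incomes (the incomes vector is hypothetical) shows what the function hands to the group aesthetic:

```r
library(ggplot2)

# Hypothetical family incomes
incomes <- c(0, 10000, 30000, 60000)

# Bins are 25,000 wide and aligned so that one bin edge sits at 0
bins <- cut_width(incomes, width = 25000, boundary = 0)

levels(bins)       # one level per 25,000-wide bin
as.integer(bins)   # which bin each income falls into
```

Each distinct bin becomes one boxplot, so a smaller width gives more, narrower boxes.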

4.8 2 continuous, 1 discrete

4.8.1 Point plot with fitted line for subsets: x, y, color + point() + smooth()

Other than setting the theme, we apply our usual color palette for males and females. In a previous step, we had already set the alpha for the point plots.

admitdataIncGPAMF |> 
  ggplot(aes(x = HSGPA, y = UnivGPA, color = Gender)) +
    geom_point(alpha = 0.3) +
    geom_smooth(method = "gam", alpha = 1.0) +
    labs(title = paste("University GPA vs.",
                       "High School GPA by Gender",
                       sep = " "),
         subtitle = "For all years",
         x = "HS GPA",
         y = "University GPA") +
    scale_y_continuous(limits = c(1.0, 4.0),
                       breaks = c(1.0, 2.0, 3.0, 4.0),
                       labels = c("1.0", "2.0",
                                  "3.0", "4.0")) +
    theme_bw() +
    scale_color_manual(values = c("#00aedb", "#ffc425"))

4.8.2 Point plot with fitted line wrapped by facets: x, y, facet + point() + smooth()

This is the same data as shown in this graph; however, here we use facets to differentiate the gender values rather than colors. The only thing that we have to do here is set the theme.

admitdataIncGPAMF |> 
  ggplot(aes(x = HSGPA, y = UnivGPA)) +
    facet_wrap(~Gender) +
    geom_point(alpha = 0.3) +
    geom_smooth(method = "gam", alpha = 1.0) +
    labs(title = paste("University GPA vs.",
                       "High School GPA by Gender",
                       sep = " "),
         subtitle = "For all years",
         x = "HS GPA",
         y = "University GPA") +
    scale_y_continuous(limits = c(1.0, 4.0),
                       breaks = c(1.0, 2.0, 3.0, 4.0),
                       labels = c("1.0", "2.0",
                                  "3.0", "4.0")) +
    theme_bw()

4.8.3 Boxplot wrapped by facets: x, y, facet + boxplot()

Again, this is the same data as in the previous section; however, this time, instead of displaying all of the data with point, we display a summary of the data with a boxplot. We set both the fill of the boxplot and the theme for the whole graph.

admitdataIncGPAMF |> 
  ggplot(aes(x = HSGPA, y = UnivGPA)) +
    facet_wrap(~Gender) +
    geom_boxplot(aes(group=cut_width(HSGPA, 
                                     width=0.25,
                                     boundary=2.0)), 
                 fill = "lightgrey") +
    labs(title = paste("University GPA vs.",
                       "High School GPA by Gender",
                       sep = " "),
         subtitle = "For all years",
         x = "HS GPA",
         y = "University GPA") +
    scale_y_continuous(limits = c(1.0, 4.0),
                       breaks = c(1.0, 2.0, 3.0, 4.0),
                       labels = c("1.0", "2.0",
                                  "3.0", "4.0")) +
    theme_bw()

The following graph simplifies the previous one by ignoring Gender and displaying a boxplot for all data in each bin. Look how clear it is now that the bulk of the data is in the middle of the HS GPA range of values.

admitdataIncGPAMF |> 
  ggplot(aes(x = HSGPA, y = UnivGPA)) +
    geom_boxplot(aes(group=cut_width(HSGPA, 
                                     width=0.25,
                                     boundary=2.0)), 
                 fill = "lightgrey",
                 varwidth=TRUE) +
    labs(title = "University GPA ranges vs. High School GPA",
         subtitle = "For all years",
         x = "HS GPA",
         y = "University GPA") +
    scale_y_continuous(limits = c(1.0, 4.0),
                       breaks = c(1.0, 2.0, 3.0, 4.0),
                       labels = c("1.0", "2.0",
                                  "3.0", "4.0")) +
    scale_x_continuous(limits = c(2.0, 4.0),
                       breaks = c(2, 2.5, 3.0, 3.5, 4.0),
                       labels = c("2.0", "2.5",
                                  "3.0", "3.5",
                                  "4.0")) +
    theme_bw() 

4.8.4 Violin plot wrapped by facets: x, y, facet + violin()

We are, again, displaying the same data as in the previous section. Since there are not multiple colors, we do not have to set a fill or color palette. We set the fill for the violin plot to make it easier to read.

admitdataIncGPAMF |> 
  ggplot(aes(x = HSGPA, y = UnivGPA)) +
    facet_wrap(~Gender) +
    geom_violin(aes(group=cut_width(HSGPA, 
                                    width=0.25,
                                    boundary=2.0)),
                scale="count", 
                fill = "lightgrey") +
    labs(title = paste("University GPA distributions vs.",
                       "High School GPA by Gender",
                       sep = " "),
         subtitle = "For all years",
         x = "HS GPA",
         y = "University GPA") +
    scale_y_continuous(limits = c(1.0, 4.0),
                       breaks = c(1.0, 2.0, 3.0, 4.0),
                       labels = c("1.0", "2.0",
                                  "3.0", "4.0")) +
    theme_bw()

4.8.5 Jitter and box plot by subset: x, y, color + jitter() + boxplot()

This graph again displays the same underlying data as the previous section.

This one is a bit different. Our color palette for the continuous variable HSGPA is set to scale_color_viridis_c (note the c at the end for continuous), so the color of each plotted point varies with the applicant’s HSGPA.

admitdataIncGPAMF |> 
  ggplot(aes(x = Gender, 
             y = UnivGPA, 
             color = HSGPA)) +
    geom_jitter(alpha = 0.3) + 
    geom_boxplot(alpha = 1.0, 
                 linewidth=0.75, 
                 color = "black",
                 fill = NA) +
    labs(title = paste("University GPA",
                       "distributions by",
                       "Gender (and HS GPA)",
                       sep = " "),
         subtitle = "For all years",
         x = "Gender",
         y = "University GPA",
         color = "HS GPA") +
    scale_y_continuous(limits = c(1.0, 4.0),
                       breaks = c(1.0, 2.0, 3.0, 4.0),
                       labels = c("1.0", "2.0",
                                  "3.0", "4.0")) +
    theme_bw() +
    scale_color_viridis_c()
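
As a minimal sketch of the suffix convention, the example below puts the continuous (_c) and discrete (_d) viridis scales side by side. It assumes the mpg data set that ships with ggplot2 as a stand-in for the admissions data, so the column names here are not the ones used elsewhere on this page.

```r
library(ggplot2)

# mpg ships with ggplot2; it is a stand-in for the admissions data
base <- ggplot(mpg, aes(x = displ, y = hwy))

# Continuous column mapped to color: use the _c (continuous) scale
p_cont <- base +
  geom_point(aes(color = cty)) +
  scale_color_viridis_c() +
  theme_bw()

# Discrete column mapped to color: use the _d (discrete) scale
p_disc <- base +
  geom_point(aes(color = drv)) +
  scale_color_viridis_d() +
  theme_bw()

p_cont
p_disc
```

Pairing a discrete column with the _c scale (or the reverse) stops with an error, so the suffix has to match the type of the column in aes().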

4.9 2 continuous, 2 discrete

In this section we will be working with three fairly complex sets of data and showing varying ways of highlighting relationships within that data. We are very much drawing on ggplot’s strengths and flexibility with these graphs.

4.9.1 Point plot with fitted line for subsets wrapped by facet: x, y, color, facet + point() + smooth()

In this step we set our usual color palette for male and female and also set the graph theme.

to_full_name <- as_labeller(c("ARTS" = "Arts",
                              "BUSI" = "Business",
                              "HUMA" = "Humanities",
                              "STEM" = "STEM"))
admitdataHSUnivMFMajor |> 
  ggplot(aes(x = HSGPA, y = UnivGPA, color = Gender)) +
    facet_wrap(~ProbableMajorType,
               labeller = to_full_name) +
    geom_point(alpha = 0.3) +
    geom_smooth(method = "gam", alpha = 1.0) +
    labs(title = paste("University GPA",
                       "distributions by",
                       "Gender (and HS GPA)",
                       sep = " "),
         subtitle = "For all years",
         x = "HS GPA",
         y = "University GPA") +
    scale_y_continuous(limits = c(1.0, 4.0),
                       breaks = c(1.0, 2.0, 3.0, 4.0),
                       labels = c("1.0", "2.0",
                                  "3.0", "4.0")) +
    theme_bw()  +
    scale_color_manual(values = c("#00aedb", "#ffc425"))

4.9.2 Point plot with fitted line wrapped by a facet grid: x, y, facet_grid + point() + smooth()

We are, again, displaying the same data as shown in the previous section; however, in this case, we are using facet_grid() instead of facet_wrap(). We use the default color scheme and our usual theme.

to_title <- function(string_input) str_to_title(string_input)
admitdataHSUnivMFMajor |> 
  ggplot(aes(x = HSGPA, y = UnivGPA)) +
    facet_grid(
      ProbableMajorType~Gender,
      labeller = labeller(Gender = label_both,
                          ProbableMajorType = as_labeller(to_title))) +
    geom_point(alpha = 0.3) +
    geom_smooth(method = "gam", alpha = 1.0) +
    labs(title = paste("University GPA vs.",
                       "HS GPA by Gender",
                       "& Student Type",
                       sep = " "),
         subtitle = "For all years",
         x = "HS GPA",
         y = "University GPA") +
    scale_y_continuous(limits = c(1.0, 4.0),
                       breaks = c(1.0, 2.0, 3.0, 4.0),
                       labels = c("1.0", "2.0",
                                  "3.0", "4.0")) +
    scale_x_continuous(guide = guide_axis(angle = 45)) +
    theme_bw()
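
The labeller argument used in the last two graphs can be sketched on its own. The example below is an illustration under assumptions: it uses the mpg data set that ships with ggplot2, and the name drive_names is hypothetical. A named vector passed to as_labeller() acts as a lookup table from facet value to displayed label, while label_both shows the column name together with the value.

```r
library(ggplot2)

# Named vector: facet value on the left, label to display on the right
drive_names <- as_labeller(c("4" = "Four-wheel",
                             "f" = "Front-wheel",
                             "r" = "Rear-wheel"))

p_labelled <- ggplot(mpg, aes(x = displ, y = hwy)) +
  facet_grid(drv ~ year,
             labeller = labeller(drv = drive_names,     # lookup table
                                 year = label_both)) +  # column name plus value
  geom_point(alpha = 0.3) +
  theme_bw()

p_labelled
```

Inside labeller(), each argument name is a faceting column, so the rows and the columns of the grid can be labelled in different ways.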

4.9.3 Boxplot wrapped by a facet grid: x, y, facet_grid + boxplot()

This is, again, plotting the same values as shown in the previous section. We are building on the approach used in this section.

Again, we use the default color scheme (except for setting the fill to lightgrey) and our usual theme.

admitdataHSUnivMFMajor |> 
  ggplot(aes(x = HSGPA, y = UnivGPA)) +
    facet_grid(ProbableMajorType~Gender) +
    geom_boxplot(aes(group=cut_width(HSGPA, 
                                     width=0.25,
                                     boundary=2.0)), 
                 fill = "lightgrey") +
    labs(title = paste("University GPA vs.",
                       "HS GPA by Gender",
                       "& Student Type",
                       sep = " "),
         subtitle = "For all years",
         x = "HS GPA",
         y = "University GPA") +
    scale_y_continuous(limits = c(1.0, 4.0),
                       breaks = c(1.0, 2.0, 3.0, 4.0),
                       labels = c("1.0", "2.0",
                                  "3.0", "4.0")) +
    theme_bw()

4.9.4 Violin wrapped by a facet grid: x, y, facet_grid + violin()

This graph is again based on the same data as the previous section. As before, we use the standard colors, set the fill to lightgrey, and apply our usual theme.

Notice how this graph makes it very clear that some combinations of student type and gender have many fewer students than the norm.

admitdataHSUnivMFMajor |> 
  ggplot(aes(x = HSGPA, y = UnivGPA)) +
    facet_grid(ProbableMajorType~Gender) +
    geom_violin(aes(group=cut_width(HSGPA, 
                                    width=0.25,
                                    boundary=2.0)), 
                fill = "lightgrey") +
    labs(title = paste("University GPA vs.",
                       "HS GPA by Gender",
                       "& Student Type",
                       sep = " "),
         subtitle = "For all years",
         x = "HS GPA",
         y = "University GPA") +
    scale_y_continuous(limits = c(1.0, 4.0),
                       breaks = c(1.0, 2.0, 3.0, 4.0),
                       labels = c("1.0", "2.0",
                                  "3.0", "4.0")) +
    theme_bw()

4.9.5 Jitter and box plot for subsets wrapped by a facet: x, y, color, facet + jitter() + boxplot()

This graph is based on the data shown in the previous graph and builds on the approach used in this section. This step adds our usual theme plus the continuous color palette designed for color-blind readers.

admitdataHSUnivMFMajor |> 
  ggplot(aes(x = Gender, 
             y = UnivGPA, 
             color = HSGPA)) +
    facet_wrap(~ProbableMajorType,
               ncol=4) +
    geom_jitter(alpha = 0.3) +
    geom_boxplot(alpha = 1.0,
                 linewidth=0.75,
                 color = "black",
                 fill = NA,
                 varwidth = TRUE) +
    labs(title = paste("University GPA vs.",
                       "HS GPA by Gender",
                       "& Student Type",
                       sep = " "),
         subtitle = "For all years",
         x = "Gender",
         y = "University GPA",
         color = "HS GPA") +
    scale_y_continuous(limits = c(1.0, 4.0),
                       breaks = c(1.0, 2.0, 
                                  3.0, 4.0),
                       labels = c("1.0", "2.0",
                                  "3.0", "4.0")) +
    theme_bw() +
    scale_color_viridis_c()

4.9.6 Boxplot differentiated by two columns and wrapped by a facet: x, y, color, facet + boxplot()

In this section we are working with new data, but we are building on the approach used in this section. This step adds our usual theme plus a manual color palette. As always, carefully specify whether it’s scale_color_manual or scale_fill_manual depending on the contents of the aes().

admitdataRaceUnivPellMajor |> 
  ggplot(aes(IPEDSRaceEthnicity, UnivGPA,
             color = PellStatus)) +
    facet_wrap(~ProbableMajorType) +
    geom_boxplot(varwidth=TRUE) +
    labs(title = paste("University GPA ranges",
                       "by Race/ethnicity",
                       "and by Gender &",
                       "Pell Grant status",
                       sep = " "),
         subtitle = "For all years",
         x = "Race/ethnicity",
         y = "University GPA",
         color = "Pell Grant\nStatus") +
    scale_y_continuous(limits = c(1.0, 4.0),
                       breaks = c(1.0, 2.0, 3.0, 4.0),
                       labels = c("1.0", "2.0",
                                  "3.0", "4.0")) +
    theme_bw() +
    scale_color_manual(values = c("#2a4d69", "#4b86b4"))
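
The difference between the two manual scales can be sketched as follows. This is an illustration under assumptions: it uses the mpg data set that ships with ggplot2, with factor(year) standing in for a two-level column such as PellStatus, and the name pell_palette is hypothetical.

```r
library(ggplot2)

# Two-color palette from the graph above
pell_palette <- c("#2a4d69", "#4b86b4")

# mpg ships with ggplot2; year (1999 or 2008) stands in for a
# two-level column such as PellStatus
p <- ggplot(mpg, aes(x = drv, y = hwy, color = factor(year))) +
  geom_boxplot() +
  theme_bw()

# Matches aes(color = ...): the outlines take the palette colors
p_matching <- p + scale_color_manual(values = pell_palette)

# Does not match: nothing is mapped to fill, so this scale is
# never used and the outlines keep the default colors
p_mismatch <- p + scale_fill_manual(values = pell_palette)

p_matching
p_mismatch
```

The mismatched version produces no error or warning, which is why it is worth checking the scale name against the aes() by eye.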

4.9.7 Boxplot differentiated by one column and wrapped by a facet grid: x, y, facet_grid + boxplot()

In this graph, we are again displaying the distribution of UnivGPA against race/ethnicity; however, in this case we are using facet_grid() on both StudentType (first-time freshmen vs. transfer) and PellStatus (yes or no).

Since there’s only one color, no color or fill palette is needed for this graph. We add a fill to the boxplot to make it easier to read. We also use our typical theme.

admitdataRaceUnivPellMajor |> 
  ggplot(aes(IPEDSRaceEthnicity, UnivGPA)) +
    facet_grid(StudentType~PellStatus) +
    geom_boxplot(varwidth=TRUE, 
                 fill = "lightgrey") +
    labs(title = paste("University GPA ranges",
                       "by Race/ethnicity",
                       "and by Gender &",
                       "Pell Grant status",
                       sep = " "),
         subtitle = "For all years",
         x = "Race/ethnicity",
         y = "University GPA") +
    scale_y_continuous(limits = c(1.0, 4.0),
                       breaks = c(1.0, 2.0, 3.0, 4.0),
                       labels = c("1.0", "2.0",
                                  "3.0", "4.0")) +
    theme_bw()

4.9.8 Horizontal boxplot differentiated by one column and wrapped by a facet grid: x, y, facet_grid + horizontal boxplot()

This is exactly the same graph as in the previous section, except that we flipped it to a horizontal layout using coord_flip().

As in the previous graph, we add a fill to the boxplot and use our usual theme.

admitdataRaceUnivPellMajor |> 
  ggplot(aes(IPEDSRaceEthnicity, UnivGPA)) +
    facet_grid(StudentType~PellStatus) +
    geom_boxplot(varwidth=TRUE, 
                 fill = "lightgrey") +
    coord_flip() +
    labs(title = paste("University GPA ranges",
                       "by Race/ethnicity",
                       "and by Gender &",
                       "Pell Grant status",
                       sep = " "),
         subtitle = "For all years",
         x = "Race/ethnicity",
         y = "University GPA") +
    scale_y_continuous(limits = c(1.0, 4.0),
                       breaks = c(1.0, 2.0, 3.0, 4.0),
                       labels = c("1.0", "2.0",
                                  "3.0", "4.0")) +
    theme_bw()
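
One detail of coord_flip() that is easy to miss: x and y in labs() and in the scale_*() calls still refer to the columns as mapped in aes(), not to where they end up on the screen. Here is a minimal sketch, assuming the mpg data set that ships with ggplot2 as a stand-in for the admissions data.

```r
library(ggplot2)

p_flipped <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot(varwidth = TRUE, fill = "lightgrey") +
  coord_flip() +
  # x still means the class column, now drawn on the vertical axis
  # y still means the hwy column, now drawn on the horizontal axis
  labs(x = "Vehicle class", y = "Highway mpg") +
  scale_y_continuous(limits = c(10, 50),
                     breaks = c(10, 20, 30, 40, 50)) +
  theme_bw()

p_flipped
```

This is why the graph above keeps scale_y_continuous() for UnivGPA even though the GPA values are displayed along the horizontal axis.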

4.9.9 Boxplot and jitter differentiated by two discrete and one continuous column: x, y, size, color + boxplot() + jitter()

student_RaceSexPCISAT |> 
  ggplot(aes(Race, SAT)) +
    geom_jitter(aes(size = PCI20,
                    color = Sex),
               alpha = 0.25,
               width = 0.25) +
    geom_boxplot(fill = NA,
                 color = "black",
                 varwidth = TRUE) +
    labs(title = paste("SAT ranges by",
                       "Race/ethnicity and",
                       "by Gender & Per",
                       "Capita Income",
                       sep = " "),
         subtitle = "For all years",
         x = "Race/ethnicity",
         y = "SAT scores",
         color = "Gender",
         size = "Per Capita\nIncome '20") +
    scale_y_continuous(limits = c(800, 1600),
                       breaks = c(800, 1000, 1200, 
                                  1400, 1600)) +
    theme_minimal() +
    scale_color_manual(values = c("#00aedb", "#ffc425"))

4.9.10 Horizontal boxplot and jitter differentiated by two discrete and one continuous column: x, y, size, color + horizontal boxplot() + jitter()

This is the same graph as in the previous section except that it has had the coordinates flipped and differentiates based on Sex for both points and boxplot.

Since the point color for jitter and the fill color for boxplot are based on the same column (Sex), we give the fill and color palettes the same hex codes.

student_RaceSexPCISAT |> 
  ggplot(aes(Race, SAT)) +
    geom_jitter(aes(size = PCI20,
                    color = Sex),
                alpha = 0.25,
                width = 0.25) +
    geom_boxplot(aes(fill = Sex),
                 color = "black",
                 varwidth = TRUE) +
    coord_flip() +
    labs(title = paste("SAT ranges by",
                       "Race/ethnicity and",
                       "by Gender & Per",
                       "Capita Income",
                       sep = " "),
         subtitle = "For all years",
         x = "Race/ethnicity",
         y = "SAT scores",
         color = "Gender",
         fill = "Gender",
         size = "Per Capita\nIncome '20") +
    scale_y_continuous(limits = c(800, 1600),
                       breaks = c(800, 1000, 1200, 
                                  1400, 1600)) +
    scale_x_discrete(limits = rev) +
    theme_bw() +
    scale_fill_manual(values = c("#00aedb", "#ffc425")) +
    scale_color_manual(values = c("#00aedb", "#ffc425"))
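
As an alternative sketch, the two palettes can be kept in step by defining the hex codes once and applying them to both aesthetics in a single call through the aesthetics argument of scale_color_manual(). The example assumes the mpg data set that ships with ggplot2, with factor(year) standing in for Sex; the name two_tone is hypothetical.

```r
library(ggplot2)

# Palette defined once; the hex codes are the ones used above
two_tone <- c("#00aedb", "#ffc425")

# mpg ships with ggplot2; year stands in for the Sex column
p_shared <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_jitter(aes(color = factor(year)), alpha = 0.25, width = 0.25) +
  geom_boxplot(aes(fill = factor(year)), color = "black") +
  theme_bw() +
  # One scale call that applies the same values to both aesthetics
  scale_color_manual(values = two_tone,
                     aesthetics = c("colour", "fill"),
                     name = "Model year")

p_shared
```

Either approach gives the same colors; the single call just removes the chance of editing one palette and forgetting the other.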

4.9.11 Horizontal boxplot and jitter differentiated by one discrete and one continuous column wrapped by a facet: x, y, size, facet + horizontal boxplot() + jitter()

This graph represents the same data as shown in the previous section. The only difference is that we are using a facet_wrap() based on Sex instead of using a color in the plots.

In this step we only add our usual theme. Everything else stays the same. We do want to highlight that we set color="blue" to make the boxplot stand out against the plotted points and fill=NA to ensure that the boxplots are transparent so that the plotted points are still visible.

student_RaceSexPCISAT |> 
  ggplot(aes(Race, SAT)) +
    geom_jitter(aes(size = PCI20),
               alpha = 0.3,
               width = 0.25) +
    geom_boxplot(color = "blue",
                 fill = NA) +
    facet_wrap(~Sex) + 
    coord_flip() +
    labs(title = paste("SAT ranges by",
                       "Race/ethnicity and",
                       "by Gender & Per",
                       "Capita Income",
                       sep = " "),
         subtitle = "For all years",
         x = "Race/ethnicity",
         y = "SAT scores",
         size = "Per Capita\nIncome '20") +
    scale_y_continuous(limits = c(800, 1600),
                       breaks = c(800, 1000, 1200, 
                                  1400, 1600)) +
    scale_x_discrete(limits = rev) +
    theme_bw()
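
The distinction between a value set to a constant and a value mapped to a column can be sketched as below. This is an illustration that assumes the mpg data set that ships with ggplot2, with displ standing in for PCI20.

```r
library(ggplot2)

p_overlay <- ggplot(mpg, aes(x = class, y = hwy)) +
  # size is inside aes(), so it varies with a column and gets a legend
  geom_jitter(aes(size = displ), alpha = 0.3, width = 0.25) +
  # color and fill are outside aes(), so they are fixed values:
  # blue outlines, and no fill so the points behind stay visible
  geom_boxplot(color = "blue", fill = NA) +
  theme_bw()

p_overlay
```

Because color and fill are constants here, they need no palette and produce no legend entry.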

4.10 3 continuous, 2 discrete

In this last section, we display data based on five separate columns, with three of them being continuous. We have to make decisions very carefully here in order to help the reader interpret the data as easily as possible.

4.10.1 Horizontal boxplot and point differentiated by one discrete and two continuous columns wrapped by a facet: x, y, size, color, facet + horizontal boxplot() + point()

This step adds only our usual theme plus the color-blind color palette for continuous values (for the Age values). Previously, we had set the linewidth for the boxplot to make it more visible against the dark colors of the plotted points.

student_WHBASATPCIAge |> 
  ggplot(aes(Race, SAT)) +
    geom_point(aes(size = PCI20,
                   colour = Age),
               alpha = 0.3) + 
    geom_boxplot(linewidth = 1,
                 fill = NA,
                 varwidth = TRUE) +
    facet_wrap(~Sex) + 
    coord_flip() +
    labs(title = paste("SAT ranges by",
                       "Race/ethnicity and",
                       "by Gender, Age, &",
                       "Per Capita Income",
                       sep = " "),
         subtitle = "For all years",
         x = "Race/ethnicity",
         y = "SAT scores",
         color = "Age",
         size = "Per Capita\nIncome '20") +
    scale_y_continuous(limits = c(800, 1600),
                       breaks = c(800, 1000, 1200, 
                                  1400, 1600)) +
    scale_x_discrete(limits = rev) +
    theme_bw() +
    scale_color_viridis_c()

4.10.2 Horizontal boxplot and jitter differentiated by one discrete and two continuous columns wrapped by a facet: x, y, size, color, facet + horizontal boxplot() + jitter()

We are displaying the exact same information that we did in the previous section. The palette and theme are the same as the previous graph.

student_WHBASATPCIAge |> 
  ggplot(aes(Race, SAT)) +
    geom_jitter(aes(size = PCI20,
                   color = Age),
               alpha = 0.3) + 
    geom_boxplot(linewidth = 1,
                 fill = NA,
                 varwidth = TRUE) +
    facet_wrap(~Sex) + 
    coord_flip() +
    labs(title = paste("SAT ranges by",
                       "Race/ethnicity and",
                       "by Gender, Age, &",
                       "Per Capita Income",
                       sep = " "),
         subtitle = "For all years",
         x = "Race/ethnicity",
         y = "SAT scores",
         color = "Age",
         size = "Per Capita\nIncome '20") +
    scale_y_continuous(limits = c(800, 1600),
                       breaks = c(800, 1000, 1200, 
                                  1400, 1600)) +
    scale_x_discrete(limits = rev) +
    theme_bw() +
    scale_color_viridis_c()

4.10.3 Horizontal violin and jitter differentiated by one discrete and two continuous columns wrapped by a facet: x, y, size, color, facet + horizontal violin() + jitter()

Again, we are displaying the same information as shown in the previous section. The palette and theme are the same as the previous graph as well.

student_WHBASATPCIAge |> 
  ggplot(aes(Race, SAT)) +
    geom_jitter(aes(size = PCI20,
                   colour = Age),
               alpha = 0.3) + 
    geom_violin(scale = "count",
                linewidth = 1,
                fill = NA) +
    facet_wrap(~Sex) + 
    coord_flip() +
    labs(title = paste("SAT distributions by",
                       "Race/ethnicity and",
                       "by Gender, Age, &\n",
                       "Per Capita Income",
                       sep = " "),
         subtitle = "For all years",
         x = "Race/ethnicity",
         y = "SAT scores",
         color = "Age",
         size = "Per Capita\nIncome '20") +
    scale_y_continuous(limits = c(800, 1600),
                       breaks = c(800, 1000, 1200, 
                                  1400, 1600)) +
    scale_x_discrete(limits = rev) +
    theme_bw() + 
    scale_color_viridis_c()

5 Other resources