Illustrative graphing example

1 Gathering the data

The first step of any graph creation process is ensuring that you have the data that you need.

We’re going to start with the student_econ data frame; specifically, we are focusing on PCI20 and SAT:

student_econ |> 
  select(PCI20, SAT) |> 
  arrange(PCI20, SAT)
# A tibble: 2,000 × 2
   PCI20   SAT
   <dbl> <dbl>
 1 22644  1191
 2 29738   978
 3 29895  1061
 4 29895  1068
 5 32702  1031
 6 32702  1162
 7 32702  1165
 8 32702  1186
 9 32702  1197
10 32702  1204
# ℹ 1,990 more rows

As we change what we are graphing, we will continue to ensure that we have the data available.

2 Mapping

The very basics for defining a plot in ggplot are the data and the mapping. In Figure 1, we tell ggplot that we will be working with the data in student_econ, and that we want PCI20 on the x axis and SAT on the y axis.

You can clearly see that data is not plotted on the graph yet. That’s because we haven’t told it how we want it to be plotted. That will come next. But you can see that the axes seem to have the correct values on them.

student_econ |> 
  ggplot(mapping = aes(x = PCI20, y = SAT))
Figure 1: The first step: defining the data and the mapping

3 Geometry

Next, in Figure 2, we tell ggplot that we want a scatterplot on the graph. It knows because of our specification of the x axis and y axis in the previous step that the scatterplot will involve these two data sets.

Speaking of the x axis and y axis, did you see that we no longer have either mapping = in the first argument of ggplot or x = and y = in the first two arguments to aes()? We can get away with these shortcuts because ggplot assumes that the first argument of ggplot() will be the mapping term and the first two arguments of aes() will be x and then y.

This is how we will write these terms from here on out, but you can write them however you’d like.

student_econ |> 
  ggplot(aes(PCI20, SAT)) +
  geom_point()
Figure 2: Adding a point plot (new: line 3)

At this point, every point is the same color. We want to see if female students and male students score differently across different income levels.

First, let’s see that we have the data:

student_econ |> 
  select(PCI20, SAT, Sex) |> 
  arrange(PCI20, SAT)
# A tibble: 2,000 × 3
   PCI20   SAT Sex  
   <dbl> <dbl> <fct>
 1 22644  1191 M    
 2 29738   978 M    
 3 29895  1061 F    
 4 29895  1068 F    
 5 32702  1031 F    
 6 32702  1162 F    
 7 32702  1165 F    
 8 32702  1186 F    
 9 32702  1197 F    
10 32702  1204 F    
# ℹ 1,990 more rows

In Figure 3, we specify that the color (colour or color; they are synonyms and can be used interchangeably) of the plotted point should represent the Sex of the applicant. That’s the only change that has to be made; ggplot does the rest.

student_econ |> 
  ggplot(aes(PCI20, SAT, 
             colour = Sex)) +
  geom_point()
Figure 3: Specifying that each sex will be plotted with a separate color (change: line 2)

Suppose that for some reason that you want to change the colors that are used in the scatterplot. If the data have some correspondence to colors, then this would be a great feature to take advantage of. Alternatively, your organization might have a specific color palette that it likes to use; again, this would be a great way to apply that palette.

As you can see in Figure 4, you can change these colors with the scale_color_manual() function.

student_econ |> 
  ggplot(aes(PCI20, SAT, 
             colour = Sex)) +
  geom_point() +
  scale_color_manual(values = c("gold", "darkgreen"))
Figure 4: Changing the colors of the scatterplot (new: line 4)

We’re having a bit of trouble seeing all of the plotted points since so many of them have the same x values. We can make them semi-transparent by using the alpha value in the plot. In Figure 5, this change is made on line 3. You’ll now be able to see where the points overlap.

student_econ |> 
  ggplot(aes(PCI20, SAT, 
             colour = Sex)) +
  geom_point(alpha = 0.5) +
  scale_color_manual(values = c("gold", "darkgreen"))
Figure 5: Lightening the colors in the graph (change: line 3)

All of the above are useful for viewing the documents online, but sometimes we have to print these on black-and-white printers. If that’s the case, then you should make the changes as shown in Figure 6.

student_econ |> 
  ggplot(aes(PCI20, SAT, 
             colour = Sex)) +
  geom_point() +
  scale_color_grey()
Figure 6: Specifying that the scatterplot should be in greyscale (change: line 4)

You might notice in the previous figure that the grey dots are difficult to see against the grey background. We fix this in Figure 7 with the addition of theme_classic() (as well as adding the alpha value again).

student_econ |> 
  ggplot(aes(PCI20, SAT, 
             colour = Sex)) +
  geom_point(alpha = 0.5) +
  scale_color_grey() + 
  theme_classic()
Figure 7: Specifying that the scatterplot should be in greyscale (change: line 4)

We are now going to go back to the graph as it was in Figure 3 except, for now, in Figure 8, we are going to add the alpha value so that we can better see all the plotted values.

student_econ |> 
  ggplot(aes(PCI20, SAT, 
             colour = Sex)) +
  geom_point(alpha = 0.5)
Figure 8: Lightening the colors for the graph (change: line 3)

Certainly this is useful; however, it might be more useful to get an idea of where a fitted curve through the values would go with separate lines for males and females. Figure 9 shows these two lines; you can see how close they are throughout.

student_econ |> 
  ggplot(aes(PCI20, SAT, 
             colour = Sex)) +
  geom_point(alpha = 0.5) +
  geom_smooth()
Figure 9: Adding a smoothed line through the data (new: line 4)

4 Facets

Now that we have finished specifying the geometry of the graph, we shall specify the facets. When you tell ggplot that you want a facet on variable Z, then ggplot will create separate graphs for every different value of Z.

In this case, we want to create a facet on the St variable. Let’s verify that we have the data that we need:

student_econ |> 
  select(PCI20, SAT, Sex, St) |> 
  arrange(PCI20, SAT)
# A tibble: 2,000 × 4
   PCI20   SAT Sex   St   
   <dbl> <dbl> <fct> <fct>
 1 22644  1191 M     GA   
 2 29738   978 M     GA   
 3 29895  1061 F     GA   
 4 29895  1068 F     GA   
 5 32702  1031 F     GA   
 6 32702  1162 F     GA   
 7 32702  1165 F     GA   
 8 32702  1186 F     GA   
 9 32702  1197 F     GA   
10 32702  1204 F     GA   
# ℹ 1,990 more rows

In Figure 10, we tell ggplot that we want a facet on the St variable. Since we are asking for a facet on just one variable, then we use the facet_wrap() function. Since the St variable has just two values in this sample (fictional) data set, this will create two separate graphics with the same structure as the previous graph…but the data will be divided so that all of the applicants from one state are in one graph. The idea is that this should allow the analyst to see if the applicants differ from one state to another.

student_econ |> 
  ggplot(aes(PCI20, SAT, 
             colour = Sex)) +
  geom_point(alpha = 0.5) +
  geom_smooth() +
  facet_wrap(~St)
Figure 10: Specify that each state should be represented on a separate graph (new: line 5)

If we want to create facets for the combination of two separate variables, then we use the facet_grid() function (as shown in Figure 11). This allows us to drill a bit deeper and see if the relationship between PCI20 and SAT scores varies not simply by the state but by some particular combination of state and race.

But (again), let’s ensure that we have the data that we need:

student_econ |> 
  select(PCI20, SAT, Sex, Race, St) |> 
  arrange(PCI20, SAT)
# A tibble: 2,000 × 5
   PCI20   SAT Sex   Race  St   
   <dbl> <dbl> <fct> <fct> <fct>
 1 22644  1191 M     W     GA   
 2 29738   978 M     H     GA   
 3 29895  1061 F     W     GA   
 4 29895  1068 F     W     GA   
 5 32702  1031 F     W     GA   
 6 32702  1162 F     W     GA   
 7 32702  1165 F     B     GA   
 8 32702  1186 F     W     GA   
 9 32702  1197 F     W     GA   
10 32702  1204 F     W     GA   
# ℹ 1,990 more rows

Given that races with codes "W", "H", "B", and "A" dominate the records, we will limit the graphs to just those rows:

student_econ |> 
  select(PCI20, SAT, Sex, Race, St) |> 
  filter(Race %in% c("W", "H", "B", "A")) |> 
  arrange(PCI20, SAT)
# A tibble: 1,955 × 5
   PCI20   SAT Sex   Race  St   
   <dbl> <dbl> <fct> <fct> <fct>
 1 22644  1191 M     W     GA   
 2 29738   978 M     H     GA   
 3 29895  1061 F     W     GA   
 4 29895  1068 F     W     GA   
 5 32702  1031 F     W     GA   
 6 32702  1162 F     W     GA   
 7 32702  1165 F     B     GA   
 8 32702  1186 F     W     GA   
 9 32702  1197 F     W     GA   
10 32702  1204 F     W     GA   
# ℹ 1,945 more rows

Note that it can be a useful bit of documentation to include all the various select, filter, and join commands in the graph definition statement itself.

student_econ |> 
  select(PCI20, SAT, Sex, Race, St) |> 
  filter(Race %in% c("W", "H", "B", "A")) |> 
  ggplot(aes(PCI20, SAT, 
             colour = Sex)) +
  geom_point(alpha = 0.5) +
  geom_smooth() +
  facet_grid(vars(Race), vars(St))
Figure 11: Specify that each state/race combination should be represented on a separate graph (change: line 8). We also added line 3 as a means of limiting the graphs to just four separate races.

In other cases—especially if the data is going to be used in multiple ways—you might want to create a data frame and store it separately and then define the graph. Consider this:

student_econ_grid <-
  student_econ |> 
    select(PCI20, SAT, Sex, Race, St) |> 
    filter(Race %in% c("W", "H", "B", "A"))
student_econ_grid |> 
  ggplot(aes(PCI20, SAT, 
             colour = Sex)) +
  geom_point(alpha = 0.5) +
  geom_smooth() +
  facet_grid(vars(Race), vars(St))

That’s it for facets for this graph. We will continue this discussion with the one-facet version rather than the two-facet version.

5 Coordinate space

Now to specify the coordinate space for this graph. This usually means that you want to change the axes in some way—the grid marks, the labels, the range of values, etc.

In Figure 12, we use the scale_x_continuous() and scale_y_continuous() functions to do the following:

  • scale_x_continuous: specify where the grid marks (breaks) should go, and then specify what the labels should be at those breaks.
  • scale_y_continuous: specify where the grid marks (breaks) should go. We are satisfied with the labels as they currently are but we do want to ensure the range covers the entire set of possible SAT scores.
student_econ |> 
  ggplot(aes(PCI20, SAT, 
             colour = Sex)) +
  geom_point(alpha = 0.5) +
  geom_smooth() +
  facet_wrap(~St) +
  scale_x_continuous(breaks = c(20000, 40000, 
                                60000, 80000, 
                                100000),
                     labels = c("$20k", "$40k", 
                                "$60k", "$80k", 
                                "$100k")) + 
  scale_y_continuous(limits = c(800, 1600),
                     breaks = c(800, 1000, 1200, 
                                1400, 1600))
Figure 12: Specify where the labels should appear on the axes (new: lines 7–15)

In the previous graph, it’s a bit unclear as to what the x-value is for the right-most plotted values. While we specified that the graph should have breaks all the way up to “$100k”, it is not actually displaying that value on the axis. In order to change this, we need to increase the maximum value of the x axis.

In Figure 13, we use the limits value to increase the maximum value on the x axis.

student_econ |> 
  ggplot(aes(PCI20, SAT, 
             colour = Sex)) +
  geom_point(alpha = 0.5) +
  geom_smooth() +
  facet_wrap(~St) +
  scale_x_continuous(limits = c(20000, 105000),
                     breaks = c(20000, 40000, 
                                60000, 80000, 
                                100000),
                     labels = c("$20k", "$40k", 
                                "$60k", "$80k", 
                                "$100k")) + 
  scale_y_continuous(limits = c(800, 1600),
                     breaks = c(800, 1000, 1200, 
                                1400, 1600))
Figure 13: Ensure that the maximum X label appears on the axis itself (new: line 7)

6 Labels

The labels section of the ggplot specification allows you to set all kinds of information related to the title, axes, and legend.

In Figure 14, we use the labs() function to specify the graph title, the graph subtitle, the x axis label, the y axis label, and the title of the color legend.

student_econ |> 
  ggplot(aes(PCI20, SAT, 
             colour = Sex)) +
  geom_point(alpha = 0.5) +
  geom_smooth() +
  facet_wrap(~St) +
  scale_x_continuous(limits = c(15000, 105000),
                     breaks = c(20000, 40000, 
                                60000, 80000, 
                                100000),
                     labels = c("$20k", "$40k", 
                                "$60k", "$80k", 
                                "$100k")) + 
  scale_y_continuous(limits = c(800, 1600),
                     breaks = c(800, 1000, 1200, 
                                1400, 1600)) +
  labs(title = "Examining relationship between PCI and SAT scores", 
       subtitle = "2022 Applicants", 
       x = "Per Capita Income (2020)", 
       y = "SAT scores",
       color = "Gender")
Figure 14: Add titles for the figure, the axes, and the legend (new: lines 17–21)

7 Theme

We are now finalizing the look of our graph. The data representation has been taken care of. Now it is time to tweak the final appearance.

In Figure 15, we first set the graph to use one of many themes that are already defined in ggplot. You’ll see that it mainly adds a border and removes the background grey from the plot body.

student_econ |> 
  ggplot(aes(PCI20, SAT, colour = Sex)) +
  geom_point(alpha = 0.5) +
  geom_smooth() +
  facet_wrap(~St) +
  scale_x_continuous(limits = c(15000, 105000),
                     breaks = c(20000, 40000, 
                                60000, 80000, 
                                100000),
                     labels = c("$20k", "$40k", 
                                "$60k", "$80k", 
                                "$100k")) + 
  scale_y_continuous(limits = c(750, 1600),
                     breaks = c(800, 1000, 1200, 
                                1400, 1600)) +
  labs(title = "Examining relationship between PCI and SAT scores", 
       subtitle = "2022 Applicants (by state)", 
       x = "Per Capita Income (2020)", 
       y = "SAT scores") +
  theme_bw()
Figure 15: Change the presentation theme for the figure (new: line 20)

The ggplot package allows us to build on these predefined themes. In Figure 16, we make the following changes:

  • legend.position: We specify that we want the legend to be in the lower-right corner of the graph. Since there is open space here, it allows ggplot to make the graph body bigger.
  • title: We specify that the title and subtitle of the graph as well as the names of both axes and the title of the legend should all be bold and purple.
student_econ |> 
  ggplot(aes(PCI20, SAT, colour = Sex)) +
  geom_point(alpha = 0.5) +
  geom_smooth() +
  scale_x_continuous(limits = c(15000, 105000),
                     breaks = c(20000, 40000, 
                                60000, 80000, 
                                100000),
                     labels = c("$20k", "$40k", 
                                "$60k", "$80k", 
                                "$100k")) + 
  scale_y_continuous(limits = c(750, 1600),
                     breaks = c(800, 1000, 1200, 
                                1400, 1600)) +
  labs(title = "Examining relationship between PCI and SAT scores", 
       subtitle = "2022 Applicants", 
       x = "Per Capita Income (2020)", 
       y = "SAT scores") +
  theme_bw() +
  theme(legend.position = c(0.8, 0.2),
        title = element_text(colour = "#582C83",
                             face = "bold"))
Figure 16: Change some of the details for the figure (new: lines 20–22)

8 Summary

Figure 16 is obviously quite complex. No one, not even the most experienced R programmer, would build that in one, two, or even three attempts. We wouldn’t. And we wouldn’t expect you to, either.

Actually, it’s kind of the whole point of ggplot: enable you to iteratively explore the data and determine which graphs you want to spend more time on because they deserve that time!

You can see that the graph was built up in distinct steps. You could stop at any moment after the first few and you would still have a useful graph.

Further, having defined the program for this graph, if you were to get a new set of data for your applicants next year, then you could simply plug in the data and it would immediately create these same graphs.

Having presented this example, we absolutely do not expect that you now understand ggplot and all of its nuances. The motivating idea has been to walk you through the whole process and show you some of the possibilities.

Now we will highlight a bit more for you in this page.