Graphing details

This page provides a one-stop overview of ggplot (formally the ggplot2 package), the tidyverse graphing package for R. It demonstrates the pieces used to construct a graph and shows a few of the graph types that analysts build most often. While reviewing all of this, try to appreciate the flexibility of ggplot’s construction process compared with Excel’s graph selection process.

This site provides a variety of ways for you to build your understanding of how to create graphs in R:

1 Mapping (aesthetics)

The basic structure of the most important part of ggplot() is as follows:

dataframename |> 
  ggplot(aes(XVar, YVar, 
             color = ColorVar,
             fill = FillVar,
             size = SizeVar))

You have to define this part of the statement before you do anything else because it defines the most basic information about the graph.

Using the above statement to demonstrate what aes() can do, let’s look at the parts of the statement:

  • XVar: This defines the variable that will be on the x-axis. It can also be written as x = XVar. Nearly every geometry requires it.
  • YVar: This defines the variable that will be on the y-axis. It can also be written as y = YVar. Most geometries require it, but a few, such as geom_histogram() and geom_bar(), compute the y values themselves.
  • color = ColorVar: This defines the color of the lines (or outlines) in the graph. (It is easy to confuse this with fill, below.) As the underlying value of ColorVar changes, the color of the line being plotted changes. To use a constant color instead, set it outside aes(), in the geom call, such as geom_line(color = "black") or geom_line(color = "#21618C"); a constant placed inside aes() is treated as data rather than as a color. More information about colors can be found here.
  • fill = FillVar: This defines the color of the object fills in the graph. As the underlying value of FillVar changes, the color of the fills being plotted changes. This can also be set to a constant, again outside aes(), such as geom_bar(fill = "grey").
  • size = SizeVar: This defines the size of a point being plotted. As the underlying value of SizeVar changes, the size of the object being plotted changes. This can be set to a constant value outside aes() (usually between 1 and 5, with 1.5 being the default for points).
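The difference between mapping an aesthetic to a variable and setting it to a constant can be sketched as follows. This is only a minimal illustration: it uses R's built-in mtcars data set instead of this site's data (an assumption made so that the code runs anywhere), and the names p_mapped and p_constant are purely illustrative.

```r
library(ggplot2)

# Assumption: mtcars (built into R) stands in for the site's data sets.

# Mapped: color is inside aes(), so it varies with the value of cyl.
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3)

# Constant: color is outside aes(), so every point gets the same color.
p_constant <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "#21618C", size = 3)

# Display both graphs.
p_mapped
p_constant
```

The first graph gets a legend because color carries information about the data; the second does not, because a constant color tells the reader nothing about any variable.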

You can see an example of an aes() function in a simple graph here.

2 Geometry

Below are a few of the available geometries (the tidyverse’s term for graph types), with short descriptions and pointers to examples provided on this page. Many other examples are provided in our extensive gallery.

  • geom_line(): line segments plotted between points. Examples: 1, 2.
  • geom_bar(): bars, displayed either side-by-side or stacked, whose height reflects the underlying data. Examples: 1, 2.
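  • geom_boxplot(): a boxplot, sometimes known as a box-and-whisker plot, reflects the underlying distribution of the data, displaying the median and quartiles as a box, whiskers that reach the most extreme values within 1.5 times the interquartile range of the box, and any values beyond the whiskers as individual outlier points. Examples: 1, 2, 3, 4, 5.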
  • geom_point(): plotted points reflecting the values of the underlying data. Examples: 1, 2, 3, 4, 5.
  • geom_histogram(): a common type of bar graph used to display the underlying distribution of values. Examples: 1.
  • geom_smooth(): a tool used to display a trend/regression line along with a confidence interval around the line. Examples: 1, 2.

Note that the information specified in the aes() (above) is inherited by each geom_X() defined in this step (unless that geom overrides it).
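As a rough sketch of this inheritance, the following uses the built-in mtcars data set (an assumption; it is not one of this site's data sets). The first geom inherits everything from ggplot(), while the second replaces the inherited color with its own settings.

```r
library(ggplot2)

# Assumption: mtcars (built into R) stands in for the site's data sets.
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  # Inherits x, y, and color: points are colored by number of cylinders.
  geom_point() +
  # Overrides the inherited color: the layer's own aes(group = 1) and the
  # constant color = "black" produce a single black trend line.
  geom_smooth(aes(group = 1),
              method = "lm",
              formula = y ~ x,
              se = FALSE,
              color = "black")

# Display the graph.
p
```

Moving color = factor(cyl) from ggplot() into geom_point(aes(...)) would give the same picture; declaring it at the top is simply a convenience when most layers share the mapping.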

2.1 Line plot example

Here we are plotting the frequency of five particular female baby names by year of birth. After telling ggplot() that the x-axis is YearOfBirth and the y-axis is Number, we tell geom_line() to draw a separate, differently colored line for each Name value.

Code
library(tidyverse)  # loads ggplot2 and dplyr (needed for filter())

babynames |> 
  filter(Name %in% c("Jennifer", "Teresa", "Karen", 
                     "Linda", "Nancy") &
           Sex == "F") |> 
  ggplot(aes(x = YearOfBirth, y = Number)) +
  geom_line(aes(color = Name))

2.2 Bar graph examples

2.2.1 Stacked bar

Stacked bar graphs are the default in ggplot. The following tells ggplot to create one bar for each value of Race (the x-axis variable) and to divide each bar into fill-colored segments, one per Sex, whose heights reflect the number of students of that Sex. No y variable is needed because geom_bar() counts the rows itself.

Code
student_econ |> 
  ggplot(aes(Race, fill = Sex)) +
    geom_bar()

2.2.2 Grouped bar

To change from the stacked bar above to a grouped bar, simply add position = "dodge" to the geom_bar() call, as follows.

Code
student_econ |> 
  ggplot(aes(Race, fill = Sex)) +
    geom_bar(position = "dodge")

2.3 Boxplot example

A boxplot summarizes the data’s underlying distribution. Here, it displays the distribution of SAT values for applicants by Race.

Code
student_econ |> 
  ggplot(aes(Race, SAT)) +
  geom_boxplot()

2.4 Point examples

Point plots can be used by themselves but they are frequently combined with other graph types.

2.4.1 Point (with boxplot) example

In this example, we add a point plot to the above boxplot. We specify that the size of the point should reflect the applicant’s PCI20 value and the color of the point should reflect the applicant’s Sex. The alpha value makes the plotted points semi-transparent so that overlapping points, and the boxplot beneath them, aren’t obscured.

Code
student_econ |> 
  ggplot(aes(Race, SAT)) +
  geom_boxplot() +
  geom_point(aes(size = PCI20,
                 color = Sex),
             alpha = 0.5)

2.4.2 Point (with line) example

In this example, we use a point plot to emphasize the specific points that are plotted on the line plot. Each line has a different color/shape combination based on Gender (as specified in the aes() function).

Code
admitdataYearGenderCount |> 
  ggplot(aes(x = AdmitCalendarYear, 
             y = Count, 
             color = Gender,
             shape = Gender)) +
    geom_line(linewidth = 1) +
    geom_point(size = 3)

2.5 Histogram example

A histogram is one of the few graphs for which a y variable need not be specified. In this case, we define a histogram with 30 bins/bars to show the distribution of students by Age.

Code
student_econ |> 
  ggplot(aes(x = Age)) +
    geom_histogram(bins = 30, 
                   fill = "grey", 
                   color = "black",
                   na.rm = TRUE)
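
To see what ggplot computes when no y variable is given, the sketch below builds two histograms from the built-in mtcars data set (an assumption; it is not this site's data) and inspects the bars with layer_data(), a ggplot2 helper that returns the values computed for a layer. It also contrasts bins, which fixes the number of bars, with binwidth, which fixes the width of each bar.

```r
library(ggplot2)

# Assumption: mtcars (built into R) stands in for the site's data sets.

# bins sets how many bars to draw.
p_bins <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10, fill = "grey", color = "black")

# binwidth sets how wide each bar is (here, 5 miles per gallon).
p_width <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5, fill = "grey", color = "black")

# Each row of layer_data() is one bar; count is the computed bar height.
bars <- layer_data(p_width)
bars[, c("xmin", "xmax", "count")]

# The bar heights add up to the number of rows in the data.
sum(bars$count) == nrow(mtcars)
```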

3 Facet

R’s faceting capabilities can be hard to wrap one’s head around (since they’re so different from what is available in other applications), yet they are very easy to try out.

Simply put, to add a facet to a graph means that you want ggplot to create a separate graph for every single value of a discrete variable. You would use facet_wrap to define graphs based on one variable and facet_grid to define graphs based on all the value combinations of two variables.
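A quick way to try both functions is sketched below with the built-in mtcars data set (an assumption; the examples in the following subsections use this site's own data). In mtcars, cyl has three values and am has two, so the wrap produces three panels and the grid produces a 3x2 arrangement.

```r
library(ggplot2)

# Assumption: mtcars (built into R) stands in for the site's data sets.
base <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# One variable: one panel for each value of cyl.
p_wrap <- base + facet_wrap(~cyl)

# Two variables: one row per value of cyl, one column per value of am.
p_grid <- base + facet_grid(cyl ~ am)

# Display both graphs.
p_wrap
p_grid
```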

3.1 One-dimensional grid of plots

In this example, we build on the point-with-boxplot example in section 2.4.1. As before, x is set to Race, y is set to SAT, and size is set to PCI20. The main difference is that there the color was based on Sex, while here we define a facet for Sex and instead base the color on Age (which isn’t shown in the previous graph). We also filter the data down to four Race values.

Thus, because of the faceting, we now have two graphs (one for F and one for M) so that we can more easily see how the value of Sex affects the distribution of values as represented by both the boxplot and the point plot.

Code
student_econ |> 
  filter(Race %in% c("W", "H", "B", "A")) |> 
  ggplot(aes(Race, SAT)) + 
    facet_wrap(~Sex) +
    geom_boxplot() +
    geom_point(aes(size = PCI20,
                   color = Age),
               alpha = 0.5)

3.2 Two-dimensional grid of plots

In this graph, we define a 4x2 grid of graphs based on the applicants’ probable major types and gender. In this default setup, every graph shares the same x- and y-axis scales.

Here, after defining the facet grid, we set up both a point plot and a trend line (with its associated confidence interval). We put the trend line after the point plot because we want it to appear on top of the points.

Code
admitdataHSUnivMFMajor |> 
  ggplot(aes(x = HSGPA, y = UnivGPA)) +
    facet_grid(ProbableMajorType~Gender) +
    geom_point(alpha = 0.3) +
    geom_smooth(method = "gam", alpha = 1.0)

4 Coordinate space

The ggplot2 package provides many options for changing how the axes are displayed. The scale_y_continuous() call in this section’s example shows some options for setting up the y-axis when it has continuous values:

  • limits defines the minimum and maximum values on the axis; observations outside this range are dropped before the graph is computed.
  • breaks defines where you want “ticks” to appear; they can be evenly spaced or not.
  • labels defines what you want to show on each one of the ticks.

This type of information, and more, can be set up in the same way for discrete or continuous scales on either axis, using scale_x_continuous(), scale_x_discrete(), scale_y_continuous(), or scale_y_discrete().
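
For a discrete axis, the same arguments can be sketched as follows, again using the built-in mtcars data set as a stand-in (an assumption; it is not this site's data). On a discrete scale, limits selects which values appear and in what order, and labels renames them. The y-axis here also shows breaks that are not evenly spaced.

```r
library(ggplot2)

# Assumption: mtcars (built into R) stands in for the site's data sets.
p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +
  # Discrete x-axis: show 8, 6, 4 (in that order) with friendlier labels.
  scale_x_discrete(limits = c("8", "6", "4"),
                   labels = c("8" = "Eight", "6" = "Six", "4" = "Four")) +
  # Continuous y-axis: the limits cover all mpg values; breaks are uneven.
  scale_y_continuous(limits = c(10, 35),
                     breaks = c(10, 15, 20, 30, 35))

# Display the graph.
p
```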

Code
admitdataHSUnivMFMajor |> 
  ggplot(aes(x = HSGPA, y = UnivGPA)) +
    facet_grid(ProbableMajorType~Gender) +
    geom_boxplot(aes(group = cut_width(HSGPA, 
                                       width = 0.25,
                                       boundary = 2.0))) +
    scale_y_continuous(limits = c(1.0, 4.0),
                       breaks = c(1.0, 2.0, 3.0, 4.0),
                       labels = c("1.0", "2.0",
                                  "3.0", "4.0"))

5 Labels

The labs() function defines the text that appears in many places outside of the graph itself: the title, subtitle, x-axis label, and more.

This graph builds on the boxplot shown in section 2.3. Here we add a title, subtitle, x-axis name, and y-axis name.

Code
student_econ |> 
  ggplot(aes(Race, SAT)) +
  geom_boxplot() +
  labs(title = "Examining relationship between Race and SAT scores", 
       subtitle = "2022 Applicants", 
       x = "Race", 
       y = "SAT scores")

6 Theme & colors

Themes define the overall look of the graph, from the fonts to the location of the legends and much more. Here we use theme_fivethirtyeight() from the ggthemes package. Compare the look of this graph to the one in the previous section: the fonts are different, and many other details differ as well. Note that this theme hides axis titles by default, so the x and y labels set in labs() do not appear unless you restore them with theme(axis.title = element_text()).

We have already discussed colors in section 1, but here we show how to set specific colors within the graph. We use scale_color_manual() to define the color for each value of Gender (since it is the color variable); the equivalent function for fill is scale_fill_manual().

Code
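library(ggthemes)  # provides theme_fivethirtyeight()

# Map the major-type codes to readable facet titles.
to_full_name <- as_labeller(c("ARTS" = "Arts",
                              "BUSI" = "Business",
                              "HUMA" = "Humanities",
                              "STEM" = "STEM"))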
admitdataHSUnivMFMajor |> 
  ggplot(aes(x = HSGPA, y = UnivGPA, color = Gender)) +
    facet_wrap(~ProbableMajorType,
               labeller = to_full_name) +
    geom_point(alpha = 0.3) +
    geom_smooth(method = "gam", alpha = 1.0) +
    labs(title = paste("University GPA",
                       "distributions by",
                       "Gender (and HS GPA)",
                       sep = " "),
         subtitle = "For all years",
         x = "HS GPA",
         y = "University GPA") +
    scale_y_continuous(limits = c(1.0, 4.0),
                       breaks = c(1.0, 2.0, 3.0, 4.0),
                       labels = c("1.0", "2.0",
                                  "3.0", "4.0")) +
    theme_fivethirtyeight()  +
    scale_color_manual(values = c("#00aedb", "#ffc425"))
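
The fill counterpart, scale_fill_manual(), can be sketched in the same way. This example uses the built-in mtcars data set (an assumption; it is not this site's data), and am_colors is an illustrative name. Naming each value ties a color to a specific level, so the assignment does not depend on the order of the levels.

```r
library(ggplot2)

# Assumption: mtcars (built into R) stands in for the site's data sets.
# Named values: level "0" is always blue and level "1" is always yellow.
am_colors <- c("0" = "#00aedb", "1" = "#ffc425")

p <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(am))) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = am_colors, name = "am")

# Display the graph.
p
```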
