Graphing overview

1 Introduction

The ggplot2 package (referred to here simply as ggplot), part of the tidyverse, takes a distinctive approach to creating graphics. It is right there in its name: the gg stands for the grammar of graphics. What does this mean? Let’s figure it out through a comparison and an analogy.

For the comparison

Think about how you create graphs in Excel. It has a huge set of pre-defined graphics, and you can set many options for each one. However, if Excel doesn’t define the graphic that you want to make, then you can’t make it.

For the analogy

Language doesn’t have “pre-defined sentences” from which you choose. Language has rules for structuring sentences and parts of speech (subjects, verbs, adverbs, etc.) that can be assembled in a seemingly unlimited number of ways.

The ggplot package defines a grammar for putting together a wide variety of pieces that can define a huge number of graphs. It’s more like a set of Legos than a set of pre-fabricated houses that you might have to choose from.

In this “graphing” section of the site, we do the following:

  • Provide a brief overview of the different pieces that can be used to describe and create a graph.
  • Go through an example, creating a graph in an iterative fashion as a means of highlighting the role that each piece of a graph creation command can fulfill.
  • Demonstrate and briefly discuss how changing specific pieces can radically change the graph that gets created.
  • Go through each of the pieces in more, but not full, detail so that you can begin your exploration of ggplot in more depth.

The ggplot package has hundreds of features to explore. This is meant to provide an initial overview and a roadmap for your own explorations.

2 The graph creation process

Creating a graph with ggplot is not a matter of picking from a set of existing choices. It’s a process of answering a series of questions, and then possibly revisiting and re-answering earlier questions as you learn more during your exploration.

Let’s look at those questions before we continue.

  1. What specific pieces of data do you need?

    1. Is all of the data in one table (data frame)?
    2. If not, can you construct a query that would bring all of the data together?
    3. In either case, write a query and then inspect the results to ensure that everything looks right.
  2. What are the basic parameters of how you want the graphs to look?

    1. What are the x-axis and y-axis?
    2. If multiple sets of data are to be plotted, how will you distinguish those graphs?
      1. Color?
      2. Shape?
      3. Size?
    3. If further dimensions are to be plotted, do you want to have separate graphs for each?
  3. Given how large the data set is, and given the strengths of different types of graphs, how do you want to display the graphs?

    1. Point plots
    2. Line plots
    3. Regression lines
    4. Bar graphs
    5. Histograms
    6. Boxplots
    7. etc.
  4. How should the axes look? How should the scales be marked? What range of values should they cover?

  5. What should all of the labels around the graph be? The title and subtitle? The x-axis? The y-axis? The legend (for color or size or shape)?

  6. What should the color and font scheme of the graph be?

For all the graphs you create, you will have to go through steps 1–3 above. The importance of the graph (external scrutiny, expected lifespan, etc.) will determine how much you focus on steps 4–6.
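To make the six questions a bit more concrete, here is a minimal sketch of how they might line up with the parts of a single command. It assumes the mpg data set that ships with ggplot2 as a stand-in for your own data; the variables, limits, and label text are illustrative choices, not recommendations.

```r
library(ggplot2)

# Step 1: decide on the data and inspect it.
# Assumption: a subset of the built-in mpg data stands in for your own query result.
cars_2008 <- subset(mpg, year == 2008)
head(cars_2008)

# Steps 2 and 3: choose the axes, how to distinguish groups (color here),
# and the type of graph (a point plot here).
p <- ggplot(data = cars_2008,
            mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point()

# Steps 4 to 6: adjust the axis range, the labels, and the overall look.
p <- p +
  scale_y_continuous(limits = c(0, 50)) +
  labs(title = "Engine size and highway fuel economy",
       x = "Engine displacement (liters)",
       y = "Highway miles per gallon",
       color = "Vehicle class") +
  theme_minimal()

p
```

For a quick exploratory graph you might stop after steps 2 and 3 and skip the last block entirely.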

3 Structure of data visualization commands

The following are the pieces of a ggplot command for creating a graphic. Don’t worry: other pages provide more details. Specifically, the illustrative example shows all of these parts in action, and the graphing details page goes into more detail about each of the pieces.

data

The data you’re going to use for the analysis. It’s generally included once, on the first line of the command, and then it is assumed for the remainder, without the need to type it again.
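As a small sketch of this idea (assuming the built-in mpg data set as a stand-in for your own data), the data frame is named once in ggplot() and the layer that follows simply inherits it:

```r
library(ggplot2)

# The data is supplied once, in ggplot().
# geom_point() does not mention mpg again; it inherits the data.
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

p
```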

mapping (aesthetics)

This name is not particularly helpful, but it’s what the R tidyverse has landed on, so we’ll have to go with that. This part and the following one are the most vital for defining the purpose and meaning of the graph. You might think of the mapping as the universal architecture of the graph. It tells R which data is going to be represented in what way (axis, color, shape, etc.). The information you enter here will apply throughout the graph; sometimes this is what you want and sometimes it isn’t.
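The following sketch illustrates the "sometimes it isn’t" case, again assuming the built-in mpg data set. In the first graph the color mapping is set in ggplot() and so applies to every layer; in the second it is set only inside geom_point(), so the regression layer ignores it.

```r
library(ggplot2)

# Color mapped in ggplot(): it applies throughout the graph, so the points
# are colored by drv and a separate regression line is drawn per drv value.
p_global <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Color mapped only in geom_point(): the points are still colored by drv,
# but geom_smooth() draws a single regression line through all the data.
p_layer <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_global
p_layer
```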

geometric representation

Here you are telling R which data will be represented by a line, which will be represented by a bar, etc.
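As a rough illustration (with the built-in mpg data set as an assumed stand-in), the same data and mapping can be handed to different geoms, and only the representation changes:

```r
library(ggplot2)

# Shared data and mapping; no geometric representation chosen yet.
base <- ggplot(mpg, aes(x = class, y = hwy))

# One point per car.
p_points <- base + geom_point()

# One box per vehicle class, summarizing the same values.
p_box <- base + geom_boxplot()

p_points
p_box
```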

facets

Facets are ggplot’s method for using multiple graphs with the same form to compare different values of one variable. For example, you could use facets to compare income distribution by state: with y as the number of people and x as income level, the facet variable would be state. There would then be one bar graph per state showing the income distribution for that state.
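We don’t have an income data set at hand, so the sketch below substitutes the built-in mpg data: highway mileage plays the role of income, and drive type (drv) plays the role of state. The structure is the same, with one histogram per value of the facet variable.

```r
library(ggplot2)

# x is the measured value, y is the count computed by the histogram,
# and facet_wrap() produces one panel per value of drv.
p <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~ drv)

p
```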

coordinate space

These functions typically relate to the axes and the display of the grid on which the data is plotted.
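Two commonly used coordinate functions are sketched below, assuming the built-in mpg data set; the y-axis limits are arbitrary values picked for illustration.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

# Zoom in on part of the y-axis without discarding any data.
p_zoom <- p + coord_cartesian(ylim = c(10, 40))

# Swap the axes so that the boxes run horizontally.
p_flip <- p + coord_flip()

p_zoom
p_flip
```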

labels

These functions manipulate the text around the graph—the titles, the names of the axes, and the legend.
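A minimal sketch using labs(), with the built-in mpg data set as an assumed example and label text made up for illustration:

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  labs(
    title = "Engine size and highway fuel economy",
    subtitle = "Built-in mpg data set",
    x = "Engine displacement (liters)",
    y = "Highway miles per gallon",
    color = "Vehicle class"  # legend title for the color mapping
  )

p
```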

theme

This is the final tweaking of the overall output: the coloring of text and the appearance of the graphs themselves.
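As one possible sketch (again assuming the built-in mpg data set, with styling choices that are purely illustrative), a complete theme such as theme_minimal() sets the overall look and theme() then adjusts individual elements:

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  labs(title = "Engine size and highway fuel economy")

# Start from a built-in theme, then tweak specific elements.
p_themed <- p +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", color = "darkblue"),
    legend.position = "bottom"
  )

p_themed
```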