Vocabulary
This page covers the fundamental R vocabulary that you need in order to read and use the language effectively. You should look through all of it now.
1 “Observations” (rows) and “Variables” (columns)
When you look at R datasets, you will immediately notice that they look like spreadsheets, with rows and columns. R users refer to the rows as observations and to the columns as variables. We will use the two sets of terms interchangeably, and you should be comfortable with both ways of describing datasets.
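As a small illustration, the sketch below builds a tiny dataset and counts its observations and variables. The name pets and its contents are made up for this example.

```r
# A made-up dataset: 3 observations (rows) of 2 variables (columns)
pets <- data.frame(
  name = c("Rex", "Tom", "Polly"),
  legs = c(4, 4, 2)
)

nrow(pets)   # number of observations (rows): 3
ncol(pets)   # number of variables (columns): 2
names(pets)  # the variable names: "name" "legs"
```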
2 Data types
You can think of “data types” as the smallest kinds of information that R works with. Specifically, you need to be familiar with a few of them:
- chr
-
A character-based sequence. This is also known as a “string”. Generally, these are the data that you will not do numeric calculations on — words in the dictionary, sentences in a book, names, passwords, zip codes, phone numbers, etc.
- int
-
An integer value, that is, a number without a decimal portion. Technically, these are 32-bit integers and thus range in value from approximately \(-2 \times 10^9\) to \(2 \times 10^9\). If you want to enter an integer, you need to type, e.g., 3L, -6L, etc. Otherwise, R will consider it to be a double (see below). You will rarely need to use this data type.
- dbl
-
A numeric value. These can be either integers (numbers without a decimal portion; e.g., -5, 0, 12) or real numbers (decimal numbers; e.g., 4.5, 0.72, -156.29). These are 64-bit floating-point numbers with 53 bits of precision (roughly 15 to 16 significant decimal digits), and their magnitudes range from about \(2 \times 10^{-308}\) to \(2 \times 10^{308}\).
- logical
-
Logical values. These form the very basics of digital computing. (You might see these called Boolean values, referring to George Boole, the inventor of Boolean algebra and the digital forefather of modern computing. Scott truly got goosebumps when he came upon his bust at UCC.) R refers to these values as TRUE (or T) and FALSE (or F), but not true or false.
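If you want to check these types for yourself, a quick sketch with base R's typeof() function might look like the following. The variable names are made up for this example.

```r
a_string  <- "hello"
an_int    <- 3L      # the L suffix asks for an integer
a_double  <- 3       # without the L, R stores a double
a_logical <- TRUE

typeof(a_string)   # "character" (abbreviated as chr in many displays)
typeof(an_int)     # "integer"
typeof(a_double)   # "double"
typeof(a_logical)  # "logical"

# The largest value an integer can hold
.Machine$integer.max   # 2147483647
```

Typing true or false (in lowercase) instead produces an error, because R looks for an object with that name.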
There are other data types in R that we will not discuss here. You can read about them here if you want to find out more.
You will notice that dates are not a special data type in R. This is a specialized topic that we will get to later. R does have special functions and tools for representing and manipulating dates but it does not have a special data type for them.
3 The NA value
We don’t want to go into too much detail here, but we have to at least mention NA, R’s indicator of “there’s nothing here!” It stands for “not available,” and most computing languages have some comparable marker for the lack of a value.
Basically, anywhere that a value can go, an NA indicator might appear. It says “no value has been assigned here.”
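To see how NA behaves in practice, here is a minimal sketch using a made-up vector named ages. The functions is.na() and mean() are part of base R.

```r
ages <- c(31, NA, 45)   # hypothetical data; the second value is missing

is.na(ages)                # FALSE TRUE FALSE
mean(ages)                 # NA, because one value is unknown
mean(ages, na.rm = TRUE)   # 38, after dropping the missing value
```

Note that a calculation involving NA generally returns NA unless you explicitly tell R to ignore the missing values.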
4 Data structures
A “data structure” is a compound block of data that has multiple pieces. R has a few data structures that you need to know about:
- list
-
Just as you might guess, this is a collection of data. It is ordered, and you can change it. The list() command returns a list.
- vector
-
You can think of a vector as a special list in which all of the terms have the same type. Just like a list, it is ordered and changeable. The c() command returns a vector. Note that if the terms passed to the c() function are of different types, R will change them (without telling you) into a single common type (a string if any of the terms is a string)!
- data frames
-
A data frame is a set of lists with the further requirement that the data type of each list position is the same across lists. For example, if the first position of the first row is a character, then the first position of every row must be a character; the same holds for other positions and other data types. Another way to think about this is that every column of the data frame must be composed of a vector, though the data type can differ from one vector to the next.
Thus, there is a close relationship among data frames, vectors, and lists: The columns/variables of a data frame are best represented by a vector, and the rows/observations of a data frame are best represented by a list. This is something that you would do well to remember (and understand)!
- factors
-
Factors are used to categorize data. Examples might be race, gender, music genre, geographical area, etc. Using factors can make it easier for you to do some types of analysis within R. We discuss these elsewhere in this site in much more depth.
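The four structures above can be sketched in a few lines of base R. All of the names and values here (person, mixed, nums, people, genre) are made up for illustration.

```r
# A list can hold items of different types
person <- list(name = "Ada", age = 36, member = TRUE)

# A vector holds a single type, so mixing types triggers silent conversion
mixed <- c(1, "two", TRUE)
typeof(mixed)   # "character"
mixed           # "1" "two" "TRUE"

nums <- c(1L, 2.5, TRUE)
typeof(nums)    # "double": no strings involved, so the common type is numeric

# A data frame: each column is a vector, each row behaves like a list
people <- data.frame(
  name = c("Ada", "Grace"),
  age  = c(36, 45)
)
people$age             # one column (variable), returned as a vector
as.list(people[1, ])   # one row (observation), converted to a list

# A factor stores categories
genre <- factor(c("jazz", "rock", "jazz"))
levels(genre)   # "jazz" "rock"
```

The second vector shows that the common type is not always a string: R picks the simplest type that can represent every term.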
As with data types, there are other data structures in R that we will not discuss here. You can read about them here if you want to find out more.
Elsewhere in this site, we discuss some functions for creating and manipulating the above data structures.