Examine data

1 Dimensions of the dataframe

The following command prints out the number of rows (observations) and columns (variables).

dim(st_info)
[1] 2000   10

This shows us that there are 2000 rows and 10 columns.
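If you need only one of the two counts, base R also provides nrow() and ncol(). The sketch below uses a small made-up dataframe, toy, as a stand-in for st_info so that it can be run on its own; the column values are invented for illustration.

```r
# toy is a hypothetical stand-in for st_info (values are made up)
toy <- data.frame(
  Given = c("Ana", "Ben", "Cy"),
  SAT   = c(1200, 1350, 1410)
)

dims   <- dim(toy)   # integer vector: c(rows, columns)
n_rows <- nrow(toy)  # same value as dims[1]
n_cols <- ncol(toy)  # same value as dims[2]

print(dims)
print(n_rows)
print(n_cols)
```

Against st_info, nrow(st_info) and ncol(st_info) should report the same 2000 and 10 that dim() printed above.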

2 Display the table

2.1 In the console

To see the dataframe in the console, simply type its name:

st_info
# A tibble: 2,000 × 10
   `Application ID`  Given Family Birthdate Email St    County Sex   Race    SAT
   <chr>             <chr> <chr>  <chr>     <chr> <chr> <chr>  <chr> <chr> <dbl>
 1 4563269562-RODR-… Teno… Rodri… 05/26/20… teno… GA    COWET… M     W      1436
 2 9221751846-ROEH-… Trav… Roe    03/31/20… trav… SC    GREEN… M     A      1398
 3 4290276249-ALLE-… Axel  Allen  06/23/20… axel… GA    BULLO… M     W      1090
 4 3398780452-HILT-… Just… Hilton 06/29/20… just… GA    DEKAL… M     W      1516
 5 7691897840-SMIT-… Mehr… Smith  11/03/20… mehr… GA    WHITF… M     O      1440
 6 1862245592-SHIP-… Cart… Shipl… 10/31/20… cart… GA    DEKAL… M     H      1438
 7 2085584835-CHAM-… Benj… Chamb… 02/08/20… benj… SC    DARLI… M     W      1452
 8 3918571924-STAH-… Robe… Stahl  06/16/20… robe… SC    CHARL… M     W      1536
 9 3023361776-ROGE-… Zaid… Rogers 09/11/20… zaid… SC    LANCA… M     B      1487
10 7386692838-SCOT-… Jakob Scott  05/25/20… jako… SC    PICKE… M     H      1373
# ℹ 1,990 more rows

As you can see, this shows the size of the dataframe, the variable/column names, the data types, and the first 10 rows of data.

2.2 In spreadsheet form

The View() command (note the capital V) shows the dataframe in spreadsheet form in a window in the top-left of RStudio. Pass it the name of the dataframe:

View(st_info)

3 Look at the top rows

If the dataset has many rows, you probably won’t want to print it out in its entirety. That is what head() and tail() are for. These print out a few of the rows just so that you can get an idea of what the dataset contains.

The following command shows the top rows of st_info. It tries to format the data so that each row fits on one printed line:

head(st_info)
# A tibble: 6 × 10
  `Application ID`   Given Family Birthdate Email St    County Sex   Race    SAT
  <chr>              <chr> <chr>  <chr>     <chr> <chr> <chr>  <chr> <chr> <dbl>
1 4563269562-RODR-2… Teno… Rodri… 05/26/20… teno… GA    COWET… M     W      1436
2 9221751846-ROEH-2… Trav… Roe    03/31/20… trav… SC    GREEN… M     A      1398
3 4290276249-ALLE-2… Axel  Allen  06/23/20… axel… GA    BULLO… M     W      1090
4 3398780452-HILT-2… Just… Hilton 06/29/20… just… GA    DEKAL… M     W      1516
5 7691897840-SMIT-2… Mehr… Smith  11/03/20… mehr… GA    WHITF… M     O      1440
6 1862245592-SHIP-2… Cart… Shipl… 10/31/20… cart… GA    DEKAL… M     H      1438

You can see that the printed output cut off long values in several variables (e.g., Application ID, and some values of Given and Family); the trailing … marks each cut. Also, every variable contains character (chr) data except for SAT, which contains dbl (that is, numeric) data.

Notice the line at the top of the above text:

# A tibble: 6 x 10

This is a comment, as indicated by the leading # character. This means that the text that follows is not executed, but is merely explanatory text.

Now, what is a tibble? The name is a bit of a play on words used by the tidyverse to refer to a table of data (i.e., a dataframe) that has some special features. We aren’t going to go into those features here; just know that they make it easier to work with the data, which is the purpose of the tidyverse.

This text says that this tibble is 6 x 10; that is, there are 6 observations (rows) and 10 variables (columns). Understand that this does not mean that st_info has 6 rows. It means that the dataframe returned by head() has 6 rows; st_info remains unchanged with 2,000 rows.
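To make the point concrete, here is a minimal sketch showing that head() returns a new, smaller dataframe and leaves the original alone. It also uses the n argument of head() to ask for a number of rows other than the default of 6. The dataframe toy and its columns are hypothetical stand-ins for st_info.

```r
# toy is a hypothetical stand-in for st_info (values are made up)
toy <- data.frame(id = 1:20, score = seq(900, 1470, by = 30))

top_default <- head(toy)         # first 6 rows by default
top_three   <- head(toy, n = 3)  # first 3 rows

print(dim(top_default))
print(dim(top_three))
print(dim(toy))  # the original still has all 20 rows
```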

4 Look at the bottom rows

The following command shows the bottom rows of st_info. Just as with head(), it tries to format the data so that it fits on one printed line:

tail(st_info)
# A tibble: 6 × 10
  `Application ID`   Given Family Birthdate Email St    County Sex   Race    SAT
  <chr>              <chr> <chr>  <chr>     <chr> <chr> <chr>  <chr> <chr> <dbl>
1 3611829360-ROTH-2… Penny Roth   09/02/20… penn… SC    BERKE… F     W      1105
2 8814528045-MORI-2… Eris  Morin  04/05/20… eris… GA    CHERO… F     H      1266
3 3855136351-RODR-2… Ellis Rodri… 09/05/20… elli… SC    BEAUF… F     W      1519
4 8669152198-EWIN-2… Emma  Ewing  08/02/20… emma… SC    LEXIN… F     W      1507
5 5442220352-NAVA-2… Iris  Navar… 10/09/20… iris… SC    LANCA… F     B      1335
6 8438756330-FELI-2… Lill… Felix  11/09/20… lill… GA    DOUGH… F     W      1392

This also results in a 6 x 10 tibble.
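Like head(), tail() accepts an n argument when you want a number of rows other than 6. A small sketch, again using a made-up dataframe named toy in place of st_info:

```r
# toy is a hypothetical stand-in for st_info (values are made up)
toy <- data.frame(id = 1:20, score = seq(900, 1470, by = 30))

bottom_three <- tail(toy, n = 3)  # last 3 rows only

print(bottom_three$id)
```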

5 Retrieve the column/variable types of the dataframe

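If you want to know the column/variable types for a dataframe, use the spec() command. It comes from the readr package and reports the column specification that readr recorded when it read the data in (for example, with read_csv()):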

spec(st_info)
cols(
  `Application ID` = col_character(),
  Given = col_character(),
  Family = col_character(),
  Birthdate = col_character(),
  Email = col_character(),
  St = col_character(),
  County = col_character(),
  Sex = col_character(),
  Race = col_character(),
  SAT = col_double()
)
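For a dataframe that was not read in by readr, there is no stored column specification to report. One base R alternative, offered here as a sketch rather than as part of the original workflow, is to ask each column for its class with sapply(). The dataframe toy below is a made-up stand-in for st_info.

```r
# toy is a hypothetical stand-in for st_info (values are made up)
toy <- data.frame(
  Given = c("Ana", "Ben", "Cy"),
  St    = c("GA", "SC", "GA"),
  SAT   = c(1200, 1350, 1410),
  stringsAsFactors = FALSE
)

# Apply class() to every column; the result is a named character vector
col_types <- sapply(toy, class)

print(col_types)
```

Note that class() labels a double column as "numeric", where the tibble printout labels it dbl.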

6 Display column/variable details with samples

The glimpse() command offers another view of the dataframe. It lists each column along with its type and its first few values:

glimpse(st_info)
Rows: 2,000
Columns: 10
$ `Application ID` <chr> "4563269562-RODR-2021", "9221751846-ROEH-2021", "4290…
$ Given            <chr> "Tenoch", "Travis", "Axel", "Justice", "Mehran", "Car…
$ Family           <chr> "Rodriguez", "Roe", "Allen", "Hilton", "Smith", "Ship…
$ Birthdate        <chr> "05/26/2003", "03/31/2003", "06/23/2003", "06/29/2003…
$ Email            <chr> "[email protected]", "[email protected]", "…
$ St               <chr> "GA", "SC", "GA", "GA", "GA", "GA", "SC", "SC", "SC",…
$ County           <chr> "COWETAGA", "GREENVSC", "BULLOCGA", "DEKALBGA", "WHIT…
$ Sex              <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M"…
$ Race             <chr> "W", "A", "W", "W", "O", "H", "W", "W", "B", "H", "W"…
$ SAT              <dbl> 1436, 1398, 1090, 1516, 1440, 1438, 1452, 1536, 1487,…

7 Get a list of column names

If you want to know the names of the columns (variables) in a dataframe, use the following:

names(st_info)
 [1] "Application ID" "Given"          "Family"         "Birthdate"     
 [5] "Email"          "St"             "County"         "Sex"           
 [9] "Race"           "SAT"           
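Notice that one of these names, Application ID, contains a space. To refer to such a column in code, wrap the name in backticks (as the tibble printout does) or pass it as a quoted string inside [[ ]]. The sketch below shows both forms on a made-up dataframe named toy; check.names = FALSE is used only so that data.frame() keeps the space in the name.

```r
# toy is a hypothetical stand-in for st_info (values are made up)
toy <- data.frame(
  "Application ID" = c("A-1", "B-2"),
  SAT = c(1200, 1350),
  check.names = FALSE,      # keep the space in the column name
  stringsAsFactors = FALSE
)

print(names(toy))

ids_backtick <- toy$`Application ID`     # backticks around the name
ids_bracket  <- toy[["Application ID"]]  # name as a quoted string

print(ids_backtick)
```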

8 Structure of the dataframe

The str() command combines much of what the spec() and glimpse() commands report:

str(st_info)
spc_tbl_ [2,000 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Application ID: chr [1:2000] "4563269562-RODR-2021" "9221751846-ROEH-2021" "4290276249-ALLE-2021" "3398780452-HILT-2021" ...
 $ Given         : chr [1:2000] "Tenoch" "Travis" "Axel" "Justice" ...
 $ Family        : chr [1:2000] "Rodriguez" "Roe" "Allen" "Hilton" ...
 $ Birthdate     : chr [1:2000] "05/26/2003" "03/31/2003" "06/23/2003" "06/29/2003" ...
 $ Email         : chr [1:2000] "[email protected]" "[email protected]" "[email protected]" "[email protected]" ...
 $ St            : chr [1:2000] "GA" "SC" "GA" "GA" ...
 $ County        : chr [1:2000] "COWETAGA" "GREENVSC" "BULLOCGA" "DEKALBGA" ...
 $ Sex           : chr [1:2000] "M" "M" "M" "M" ...
 $ Race          : chr [1:2000] "W" "A" "W" "W" ...
 $ SAT           : num [1:2000] 1436 1398 1090 1516 1440 ...
 - attr(*, "spec")=
  .. cols(
  ..   `Application ID` = col_character(),
  ..   Given = col_character(),
  ..   Family = col_character(),
  ..   Birthdate = col_character(),
  ..   Email = col_character(),
  ..   St = col_character(),
  ..   County = col_character(),
  ..   Sex = col_character(),
  ..   Race = col_character(),
  ..   SAT = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

9 Summary of the dataframe

The summary() command gives a general overview of a dataframe in a more display-focused format, along with some simple statistics:

summary(st_info)
 Application ID        Given              Family           Birthdate        
 Length:2000        Length:2000        Length:2000        Length:2000       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
    Email                St               County              Sex           
 Length:2000        Length:2000        Length:2000        Length:2000       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
     Race                SAT      
 Length:2000        Min.   : 978  
 Class :character   1st Qu.:1206  
 Mode  :character   Median :1308  
                    Mean   :1315  
                    3rd Qu.:1425  
                    Max.   :1600  

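You can see above that, for the numeric column (SAT), it calculates the minimum, the quartiles, the median, the mean, and the maximum. For the character columns, it reports only the length, class, and mode.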
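If you would like counts for a character column instead of just its length, one option is to convert it to a factor before summarizing it. This is a sketch of that idea on a made-up dataframe named toy, not a step taken in the original analysis:

```r
# toy is a hypothetical stand-in for st_info (values are made up)
toy <- data.frame(
  St  = c("GA", "SC", "GA", "GA"),
  SAT = c(1200, 1350, 1410, 1280),
  stringsAsFactors = FALSE
)

# summary() on a single numeric column gives the same six statistics
sat_summary <- summary(toy$SAT)
print(sat_summary)

# summary() on a factor counts how often each value occurs
st_counts <- summary(factor(toy$St))
print(st_counts)
```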