The following command prints out the number of rows (observations) and columns (variables).
dim(st_info)
[1] 2000 10
This shows us that there are 2000 rows and 10 columns.
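If you need the two numbers separately, base R also provides nrow() and ncol(). The following is a minimal sketch that uses a small, made-up dataframe named demo (a hypothetical stand-in, not st_info) so that it runs on its own:

```r
# Hypothetical stand-in for st_info: 4 rows, 3 columns
demo <- data.frame(
  Given = c("Ana", "Ben", "Cy", "Di"),
  St    = c("GA", "SC", "GA", "SC"),
  SAT   = c(1436, 1398, 1090, 1516),
  stringsAsFactors = FALSE
)

dims   <- dim(demo)   # integer vector: rows first, then columns
n_rows <- nrow(demo)  # same value as dims[1]
n_cols <- ncol(demo)  # same value as dims[2]

print(dims)
```

The same calls should work unchanged on st_info, since a tibble is also a dataframe.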
If you want to see the dataframe, use this very straightforward command (i.e., type the name of the dataframe):
st_info
# A tibble: 2,000 × 10
`Application ID` Given Family Birthdate Email St County Sex Race SAT
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 4563269562-RODR-… Teno… Rodri… 05/26/20… teno… GA COWET… M W 1436
2 9221751846-ROEH-… Trav… Roe 03/31/20… trav… SC GREEN… M A 1398
3 4290276249-ALLE-… Axel Allen 06/23/20… axel… GA BULLO… M W 1090
4 3398780452-HILT-… Just… Hilton 06/29/20… just… GA DEKAL… M W 1516
5 7691897840-SMIT-… Mehr… Smith 11/03/20… mehr… GA WHITF… M O 1440
6 1862245592-SHIP-… Cart… Shipl… 10/31/20… cart… GA DEKAL… M H 1438
7 2085584835-CHAM-… Benj… Chamb… 02/08/20… benj… SC DARLI… M W 1452
8 3918571924-STAH-… Robe… Stahl 06/16/20… robe… SC CHARL… M W 1536
9 3023361776-ROGE-… Zaid… Rogers 09/11/20… zaid… SC LANCA… M B 1487
10 7386692838-SCOT-… Jakob Scott 05/25/20… jako… SC PICKE… M H 1373
# ℹ 1,990 more rows
As you can see, this shows the size of the dataframe, the variable/column names, the data types, and the first 10 rows of data.
The View(dt) command (note the capital V), where dt stands for the name of the dataframe, shows the data table in spreadsheet form in a window in the top-left of RStudio.
View(st_info)
If the dataset has many rows, you probably won’t want to print it out in its entirety. That is what head() and tail() are for. These print out a few of the rows just so that you can get an idea of what the dataset contains.
The following command shows the top rows of st_info (six by default). It tries to format the data so that it fits on one printed line:
head(st_info)
# A tibble: 6 × 10
`Application ID` Given Family Birthdate Email St County Sex Race SAT
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 4563269562-RODR-2… Teno… Rodri… 05/26/20… teno… GA COWET… M W 1436
2 9221751846-ROEH-2… Trav… Roe 03/31/20… trav… SC GREEN… M A 1398
3 4290276249-ALLE-2… Axel Allen 06/23/20… axel… GA BULLO… M W 1090
4 3398780452-HILT-2… Just… Hilton 06/29/20… just… GA DEKAL… M W 1516
5 7691897840-SMIT-2… Mehr… Smith 11/03/20… mehr… GA WHITF… M O 1440
6 1862245592-SHIP-2… Cart… Shipl… 10/31/20… cart… GA DEKAL… M H 1438
You can see that the printed output left off some data in a few of the variables (e.g., Application ID and, in some cases, Given and Family). Also, every variable contains character data except for SAT, which contains dbl (that is, numeric) data.
Notice the line at the top of the above text:
# A tibble: 6 x 10
This is a comment, as indicated by the leading # character. This means that the text that follows is not executed, but is merely explanatory text.
Now, what is this about a tibble? Well, this is a bit of a play on words used by the tidyverse to refer to a table of data (i.e., a dataframe) that has some special features. We aren’t going to go into those features here, but just know that these features make it easier to work with the data, which is the purpose of the tidyverse.
This text says that this tibble is 6 x 10; that is, there are 6 observations (rows) and 10 variables (columns). Now, understand that this does not mean that st_info has 6 rows. It means that the dataframe returned by head() has 6 rows; st_info remains unchanged with 2,000 rows.
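To see that head() returns a new, smaller dataframe and leaves the original alone, here is a short sketch. It uses a made-up 20-row dataframe named demo (hypothetical, not st_info) and the n argument of head(), which sets how many rows to return:

```r
# Hypothetical 20-row dataframe standing in for st_info
demo <- data.frame(id = 1:20, score = seq(1000, 1190, by = 10))

# Ask for the first 3 rows instead of the default 6
top3 <- head(demo, n = 3)

dim(top3)  # the returned dataframe has 3 rows
dim(demo)  # the original still has all 20 rows
```

Assigning the result to a new name, as with top3 here, is what keeps the shortened copy; the call by itself only prints it.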
The following command shows the bottom rows of st_info. Just as with head(), it tries to format the data so that it fits on one printed line:
tail(st_info)
# A tibble: 6 × 10
`Application ID` Given Family Birthdate Email St County Sex Race SAT
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 3611829360-ROTH-2… Penny Roth 09/02/20… penn… SC BERKE… F W 1105
2 8814528045-MORI-2… Eris Morin 04/05/20… eris… GA CHERO… F H 1266
3 3855136351-RODR-2… Ellis Rodri… 09/05/20… elli… SC BEAUF… F W 1519
4 8669152198-EWIN-2… Emma Ewing 08/02/20… emma… SC LEXIN… F W 1507
5 5442220352-NAVA-2… Iris Navar… 10/09/20… iris… SC LANCA… F B 1335
6 8438756330-FELI-2… Lill… Felix 11/09/20… lill… GA DOUGH… F W 1392
This also results in a 6 x 10 tibble.
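Note that the row labels 1 through 6 printed for the tail() result are positions within the returned tibble, not positions within st_info. The sketch below, again using a made-up dataframe named demo (hypothetical), shows one way to work out where the returned rows sit in the original:

```r
# Hypothetical 20-row dataframe standing in for st_info
demo <- data.frame(id = 1:20, score = seq(1000, 1190, by = 10))

# Ask for the last 3 rows instead of the default 6
last3 <- tail(demo, n = 3)

# Position in demo of the first row that tail() returned
first_pos <- nrow(demo) - nrow(last3) + 1

last3$id    # ids of the last three rows
first_pos   # where those rows start in demo
```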
If you want to know the column/variable types for a dataframe that was read in by the tidyverse's readr package (as st_info was), use this command:
spec(st_info)
cols(
`Application ID` = col_character(),
Given = col_character(),
Family = col_character(),
Birthdate = col_character(),
Email = col_character(),
St = col_character(),
County = col_character(),
Sex = col_character(),
Race = col_character(),
SAT = col_double()
)
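Because spec() reports the column specification that readr recorded when the file was read, it may not be available for a dataframe created some other way. As a rough base R alternative, the following sketch asks each column for its class. The dataframe demo is made up for illustration:

```r
# Hypothetical two-column dataframe (not created by readr)
demo <- data.frame(
  Given = c("Ana", "Ben"),
  SAT   = c(1436, 1398),
  stringsAsFactors = FALSE
)

# Apply class() to every column; the result is a named character vector
col_types <- sapply(demo, class)
print(col_types)
```

Base R reports the SAT column as numeric, which corresponds to the dbl label that the tidyverse uses.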
The glimpse() command is another way to get information about the dataframe. It displays column details along with some sample data:
glimpse(st_info)
Rows: 2,000
Columns: 10
$ `Application ID` <chr> "4563269562-RODR-2021", "9221751846-ROEH-2021", "4290…
$ Given <chr> "Tenoch", "Travis", "Axel", "Justice", "Mehran", "Car…
$ Family <chr> "Rodriguez", "Roe", "Allen", "Hilton", "Smith", "Ship…
$ Birthdate <chr> "05/26/2003", "03/31/2003", "06/23/2003", "06/29/2003…
$ Email <chr> "[email protected]", "[email protected]", "…
$ St <chr> "GA", "SC", "GA", "GA", "GA", "GA", "SC", "SC", "SC",…
$ County <chr> "COWETAGA", "GREENVSC", "BULLOCGA", "DEKALBGA", "WHIT…
$ Sex <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M"…
$ Race <chr> "W", "A", "W", "W", "O", "H", "W", "W", "B", "H", "W"…
$ SAT <dbl> 1436, 1398, 1090, 1516, 1440, 1438, 1452, 1536, 1487,…
If you want to know the names of the columns (variables) in a dataframe, use the following:
names(st_info)
[1] "Application ID" "Given" "Family" "Birthdate"
[5] "Email" "St" "County" "Sex"
[9] "Race" "SAT"
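One of these names, Application ID, contains a space, which is why it appears wrapped in backticks in the earlier output. A name like that needs backticks (or a quoted string inside double brackets) when you refer to the column. Here is a minimal sketch with a made-up dataframe named demo; check.names = FALSE is used only so that base R keeps the space in the name:

```r
# Hypothetical dataframe with a column name that contains a space
demo <- data.frame(
  `Application ID` = c("A-1", "B-2"),
  SAT = c(1436, 1398),
  check.names = FALSE,       # keep "Application ID" as typed
  stringsAsFactors = FALSE
)

names(demo)

# Two equivalent ways to pull out the column
ids_a <- demo$`Application ID`
ids_b <- demo[["Application ID"]]
```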
A command that gives something of a combination of the spec() command and the glimpse() command is the following:
str(st_info)
spc_tbl_ [2,000 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Application ID: chr [1:2000] "4563269562-RODR-2021" "9221751846-ROEH-2021" "4290276249-ALLE-2021" "3398780452-HILT-2021" ...
$ Given : chr [1:2000] "Tenoch" "Travis" "Axel" "Justice" ...
$ Family : chr [1:2000] "Rodriguez" "Roe" "Allen" "Hilton" ...
$ Birthdate : chr [1:2000] "05/26/2003" "03/31/2003" "06/23/2003" "06/29/2003" ...
$ Email : chr [1:2000] "[email protected]" "[email protected]" "[email protected]" "[email protected]" ...
$ St : chr [1:2000] "GA" "SC" "GA" "GA" ...
$ County : chr [1:2000] "COWETAGA" "GREENVSC" "BULLOCGA" "DEKALBGA" ...
$ Sex : chr [1:2000] "M" "M" "M" "M" ...
$ Race : chr [1:2000] "W" "A" "W" "W" ...
$ SAT : num [1:2000] 1436 1398 1090 1516 1440 ...
- attr(*, "spec")=
.. cols(
.. `Application ID` = col_character(),
.. Given = col_character(),
.. Family = col_character(),
.. Birthdate = col_character(),
.. Email = col_character(),
.. St = col_character(),
.. County = col_character(),
.. Sex = col_character(),
.. Race = col_character(),
.. SAT = col_double()
.. )
- attr(*, "problems")=<externalptr>
The summary() command provides a more display-focused alternative for users who want a general overview of a dataframe, and it also provides some simple statistics:
summary(st_info)
Application ID Given Family Birthdate
Length:2000 Length:2000 Length:2000 Length:2000
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Email St County Sex
Length:2000 Length:2000 Length:2000 Length:2000
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Race SAT
Length:2000 Min. : 978
Class :character 1st Qu.:1206
Mode :character Median :1308
Mean :1315
3rd Qu.:1425
Max. :1600
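You can see above that it calculates summary statistics (minimum, quartiles, median, mean, and maximum) for the numeric column, SAT. For the character columns, it reports only the length, class, and mode.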
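Since summary() says little about character columns, you may want to summarize a single column at a time. The sketch below uses a made-up dataframe named demo (hypothetical) to show summary() on one numeric column and base R's table() to count the values in a character column:

```r
# Hypothetical dataframe with one character and one numeric column
demo <- data.frame(
  St  = c("GA", "SC", "GA", "GA"),
  SAT = c(1000, 1200, 1400, 1600),
  stringsAsFactors = FALSE
)

# Summary statistics for just the numeric column
sat_summary <- summary(demo$SAT)
print(sat_summary)

# Counts of each distinct value in the character column
st_counts <- table(demo$St)
print(st_counts)
```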