Chapter 2 Introduction to R

This unit spans Mon Feb 03 through Sat Feb 08.
At 11:59 PM on Sat Feb 08 the following items are due:

  • DC 4 - Introduction to R: Factors
  • DC 5 - Introduction to R: Data frames
  • DC 6 - Introduction to R: Lists

Unit 2 builds further on unit 1 and introduces you to the remaining building blocks in R. It is a chance for you to breathe, assimilate what you have already learned, and review whatever you think needs additional focus. We will start building up the pace again through unit 5. At that point, things should start to smooth out a bit. We’ll have covered enough R to take in a data set, clean it, and present it visually.

2.2 Lists

A list is a generic vector whose elements don’t need to share the same data type. This makes referencing members a little more complex. We use the list function to create lists. As with matrices, we won’t be using lists much in this class, so I’ll provide just a brief introduction.

## [[1]]
## [1] "Fred"  "Ethel"
## [1] "Ethel"

Lists can get pretty tricky, but notice that a single bracket returns another list, so l[2] returns a list of one that contains a character vector with two elements. The double bracket returns the actual element, so l[[2]] returns a character vector of two elements. This is why typing l[[2]][2] means "return the second element of the character vector" (i.e., Ethel). If we typed l[2][2] instead, we would not get Ethel: the list returned by l[2] contains only one element, so asking for its second element returns a list holding NULL (and l[2][[2]] fails with a "subscript out of bounds" error). l[2][[1]][2] would return a list of one, then take the first item in that list (a character vector) and return the second item in that vector, which is identical to typing l[[2]][2]. While we won’t be working much with lists directly this semester, it is helpful to go back and review them at a later date because we will be using a specialized form of a list throughout the semester, namely the data frame.
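The code that produced the output above is not reproduced in this text, so here is a minimal sketch of the bracket rules. The first list element (42) is an assumption made for illustration; only the second element, the Fred/Ethel vector, comes from the output shown.

```r
# Hypothetical list: the first element is made up, the second matches the output above
l <- list(42, c("Fred", "Ethel"))

l[2]           # single bracket: a list of one holding the character vector
l[[2]]         # double bracket: the character vector itself
l[[2]][2]      # second element of that vector: "Ethel"
l[2][[1]][2]   # same value, reached by unwrapping the one-element list first

class(l[2])    # "list"
class(l[[2]])  # "character"
```

A quick way to remember the difference: single brackets keep the container, double brackets open it.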

2.2.1 Data Frames

Data frames are lists that we can metaphorically think of as spreadsheets with some restrictions:

  • variable names (i.e., column names) must be unique within the data frame
  • all elements (columns) in the data frame are vectors.
  • all elements (columns) in the data frame have equal length.

So when your data is in a format where columns represent variables, and rows typically represent an observation, data frames are the object that you most likely want to use to represent this data. Most forms of structured data fit nicely into a data frame.

We use the data.frame function to create a data frame.

We created a data frame from the vectors we worked with earlier. They are all of equal length, and each vector holds a single data type. We can also view the data by clicking on the table icon next to the house variable in the environment pane and see the table as shown below.

[Figure: dataframe]
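Since the original code chunk is not shown, the following is a sketch of how such a data frame might be built. The column names and values are taken from the rows printed later in this section; the assumption is that the original vectors may have held more elements than the two used here.

```r
# Hypothetical vectors: values copied from the printed rows later in this section
rooms    <- c("Living Room", "dining room")
colors   <- c("Navaho White", "Stonington Gray")
comments <- c("Patch ceiling hole - Bob", "Use a tinted primer - Joe")
price    <- c(245.3, 300.0)

# Each vector becomes a column; all four must have the same length
house <- data.frame(rooms, colors, comments, price, stringsAsFactors = FALSE)

str(house)   # compact view of column names and types
```

Passing stringsAsFactors = FALSE keeps the text columns as character vectors regardless of the R version in use.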

Each column can be referenced by the convention variable_name$column_name, for example, house$price will return the price column as a numeric vector. I can also interact with it as we did with mtcars in the last lesson. I can reference rows and columns in a few different ways.

  • by index - e.g., house[1,2] will return the first row, second column of the data frame
  • by name - e.g., house[1, "colors"] will return the first row of the colors column
  • by name - e.g., house$colors will return the colors column as a character vector as will house[,"colors"]
  • by list index - e.g., house[2] will return the second column as a data.frame object as will house["colors"]
  • by list index - e.g., house[[2]] will return the second column as a character vector as will house[,2]

You can also select multiple rows and columns by using the combine function c() and/or ranges (e.g., house[1:2, c("rooms", "colors")]).
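The indexing styles listed above can be tried side by side. This sketch rebuilds a two-row house data frame from the values printed in this section (an assumption; the original may contain more rows) so that it runs on its own.

```r
# Hypothetical two-row version of the house data frame
house <- data.frame(
  rooms    = c("Living Room", "dining room"),
  colors   = c("Navaho White", "Stonington Gray"),
  comments = c("Patch ceiling hole - Bob", "Use a tinted primer - Joe"),
  price    = c(245.3, 300.0),
  stringsAsFactors = FALSE
)

house[1, 2]          # by index: first row, second column
house[1, "colors"]   # by name: first row of the colors column
house$colors         # whole column as a character vector
house[, "colors"]    # same column, matrix-style
house[2]             # list-style single bracket: a one-column data frame
house[[2]]           # list-style double bracket: the character vector

# Several rows and columns at once, using a range and c()
house[1:2, c("rooms", "colors")]
```

The single versus double bracket behavior mirrors what happens with plain lists, because a data frame is a list of columns.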

Finally, you can search for specific values in a data frame as shown below.

##         rooms       colors                 comments price
## 1 Living Room Navaho White Patch ceiling hole - Bob 245.3
##         rooms          colors                  comments price
## 1 Living Room    Navaho White  Patch ceiling hole - Bob 245.3
## 2 dining room Stonington Gray Use a tinted primer - Joe 300.0
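The filtering code behind the output above is not shown in this text. One common base R approach is logical indexing, sketched below; the conditions used here are assumptions chosen for illustration, and the data frame is a two-row stand-in built from the printed values.

```r
# Hypothetical two-row version of the house data frame
house <- data.frame(
  rooms    = c("Living Room", "dining room"),
  colors   = c("Navaho White", "Stonington Gray"),
  comments = c("Patch ceiling hole - Bob", "Use a tinted primer - Joe"),
  price    = c(245.3, 300.0),
  stringsAsFactors = FALSE
)

# A logical vector in the row position keeps only the rows where it is TRUE
cheap <- house[house$price < 300, ]
cheap

# A looser condition keeps more rows
upto_300 <- house[house$price <= 300, ]
upto_300

# subset() is a base R alternative that reads a little more naturally
living <- subset(house, rooms == "Living Room")
living
```

Note the trailing comma in house[condition, ]: the condition selects rows, and the empty column position means "all columns".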

Once again, you can see that there are many different ways to accomplish the same task in R. Everything that we’ve done so far has been using the base packages in R (i.e., the stuff installed by default). In the next unit, we’ll show how to search within a data frame using additional packages that aren’t installed by default. We’ll also explain packages, the concept of tidy data, and go into more depth with data frames.

2.3 Factors

When you first took statistics, you learned about different types or classifications of variables. One reason variable types are essential is that they determine the kind of analysis that can be performed. Generally speaking, we can classify data as being numeric (e.g., height, weight, salary), categorical (e.g., gender, color, hometown), or ordinal. Right now, we are not concerned with the numeric data.

Let’s start off by creating some data about houses we might have looked at when thinking about purchasing a property.
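The original code chunk is not reproduced here, but the values can be recovered from the printed output in this chapter. A sketch of the data frame, assuming it is named houses:

```r
# Values reconstructed from the str() and tibble output shown in this chapter
houses <- data.frame(
  description = c("blue cape near university",
                  "small bungalo near ocean",
                  "weird oval shaped home",
                  "shag carpet place that smells like beer",
                  "block shaped home on busy street"),
  price = c(250000, 400000, 185000, 172000, 180000),
  color = c("blue", "blue", "yellow", "yellow", "green"),
  initial_impressions = c("love", "love", "hate", "neutral", "hate"),
  stringsAsFactors = FALSE   # keep text columns as character vectors
)

summary(houses)
```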

##  description            price           color           initial_impressions
##  Length:5           Min.   :172000   Length:5           Length:5           
##  Class :character   1st Qu.:180000   Class :character   Class :character   
##  Mode  :character   Median :185000   Mode  :character   Mode  :character   
##                     Mean   :237400                                         
##                     3rd Qu.:250000                                         
##                     Max.   :400000

Looking at a summary of the houses data, I see some detailed statistics regarding price but the other summary information is relatively meaningless. Let’s make a minor modification to our data frame.

##  description            price           color   initial_impressions
##  Length:5           Min.   :172000   blue  :2   Length:5           
##  Class :character   1st Qu.:180000   green :1   Class :character   
##  Mode  :character   Median :185000   yellow:2   Mode  :character   
##                     Mean   :237400                                 
##                     3rd Qu.:250000                                 
##                     Max.   :400000

Notice that the summary for color now includes a count of the different colors. This is because we instructed R to make color a factor. Also notice that when I created the data frame, I used the parameter stringsAsFactors = FALSE to tell R not to create factor variables from character data. Converting strings to factors was the default behavior in the R version used here (3.6.2); since R 4.0.0, the default is stringsAsFactors = FALSE.
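The "minor modification" is most likely a call to factor(). The sketch below assumes the data frame is named houses and rebuilds it from the values printed in this chapter.

```r
houses <- data.frame(
  description = c("blue cape near university",
                  "small bungalo near ocean",
                  "weird oval shaped home",
                  "shag carpet place that smells like beer",
                  "block shaped home on busy street"),
  price = c(250000, 400000, 185000, 172000, 180000),
  color = c("blue", "blue", "yellow", "yellow", "green"),
  initial_impressions = c("love", "love", "hate", "neutral", "hate"),
  stringsAsFactors = FALSE
)

# Replace the character column with a factor built from it
houses$color <- factor(houses$color)

summary(houses$color)   # counts per level instead of "Length / Class / Mode"
str(houses)             # color is now listed as a Factor
```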

## 'data.frame':    5 obs. of  4 variables:
##  $ description        : chr  "blue cape near university" "small bungalo near ocean" "weird oval shaped home" "shag carpet place that smells like beer" ...
##  $ price              : num  250000 400000 185000 172000 180000
##  $ color              : Factor w/ 3 levels "blue","green",..: 1 1 3 3 2
##  $ initial_impressions: chr  "love" "love" "hate" "neutral" ...

Looking at the structure of houses, we can see that color is now defined as a factor with three levels (blue, green, and yellow). Internally, factors are stored as integers, so we need to be careful when we treat them as strings. By default, the levels are sorted alphabetically, and the integer values follow that order. In this case:

  • 1 = blue
  • 2 = green
  • 3 = yellow
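To see the mapping between levels and integers directly, a short sketch using only the color values printed above:

```r
color <- factor(c("blue", "blue", "yellow", "yellow", "green"))

levels(color)         # "blue" "green" "yellow" (alphabetical by default)
as.integer(color)     # 1 1 3 3 2: the codes stored internally
as.character(color)   # the original labels back as strings

# When string behavior is needed, convert explicitly first
toupper(as.character(color))
```

Converting with as.character() before string operations avoids accidentally working with the integer codes.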

Now we can go back and follow the same process to make initial_impressions a factor.

##  description            price           color   initial_impressions
##  Length:5           Min.   :172000   blue  :2   hate   :2          
##  Class :character   1st Qu.:180000   green :1   love   :2          
##  Mode  :character   Median :185000   yellow:2   neutral:1          
##                     Mean   :237400                                 
##                     3rd Qu.:250000                                 
##                     Max.   :400000

You’ll notice that the levels of initial_impressions, like those of color, are in alphabetical order. In reality, hate < neutral < love, which suggests that this is an ordinal variable rather than a purely categorical one. If we want a worst-to-first order, we can specify it when creating the factor by setting the parameters levels and ordered.

##  description            price           color   initial_impressions
##  Length:5           Min.   :172000   blue  :2   hate   :2          
##  Class :character   1st Qu.:180000   green :1   neutral:1          
##  Mode  :character   Median :185000   yellow:2   love   :2          
##                     Mean   :237400                                 
##                     3rd Qu.:250000                                 
##                     Max.   :400000

We have told R to make initial_impressions an ordered factor.

## 'data.frame':    5 obs. of  4 variables:
##  $ description        : chr  "blue cape near university" "small bungalo near ocean" "weird oval shaped home" "shag carpet place that smells like beer" ...
##  $ price              : num  250000 400000 185000 172000 180000
##  $ color              : Factor w/ 3 levels "blue","green",..: 1 1 3 3 2
##  $ initial_impressions: Ord.factor w/ 3 levels "hate"<"neutral"<..: 3 3 1 2 1
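A sketch of how the levels and ordered parameters might be used, working on a standalone vector of the impressions printed above rather than on the full data frame:

```r
impressions <- c("love", "love", "hate", "neutral", "hate")

# levels fixes the order; ordered = TRUE tells R the order is meaningful
impressions <- factor(impressions,
                      levels  = c("hate", "neutral", "love"),
                      ordered = TRUE)

levels(impressions)       # "hate" "neutral" "love"
as.integer(impressions)   # 3 3 1 2 1, matching the str() output above

# Ordered factors support comparisons
impressions < "love"      # TRUE for every entry ranked below "love"
table(impressions)        # counts reported in worst-to-first order
```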

If you have to clean and transform your data, it is often advisable not to use factors until right before you analyze data. This will become more apparent later in the semester. We’ll also show you a different way to order factors using the forcats package in a few units.

2.4 Tibbles

Finally, I want to briefly introduce you to an updated version of the data frame: the tibble. It isn’t going to make much sense yet, but tibbles make data frames slightly less frustrating to work with. For a full explanation, see the tibble package documentation; the benefits will become more evident after you have had more exposure to R. One thing you might notice is that they print a little more nicely. We won’t be interacting with tibbles too much, but they will pop up again during the semester.

##                               description  price  color initial_impressions
## 1               blue cape near university 250000   blue                love
## 2                small bungalo near ocean 400000   blue                love
## 3                  weird oval shaped home 185000 yellow                hate
## 4 shag carpet place that smells like beer 172000 yellow             neutral
## 5        block shaped home on busy street 180000  green                hate
## # A tibble: 5 x 4
##   description                              price color  initial_impressions
##   <chr>                                    <dbl> <fct>  <ord>              
## 1 blue cape near university               250000 blue   love               
## 2 small bungalo near ocean                400000 blue   love               
## 3 weird oval shaped home                  185000 yellow hate               
## 4 shag carpet place that smells like beer 172000 yellow neutral            
## 5 block shaped home on busy street        180000 green  hate
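The two printouts above compare a regular data frame with its tibble version. A sketch of the conversion, assuming the tibble package is installed and the data frame is named houses:

```r
library(tibble)   # assumes the tibble package is installed

houses <- data.frame(
  description = c("blue cape near university",
                  "small bungalo near ocean",
                  "weird oval shaped home",
                  "shag carpet place that smells like beer",
                  "block shaped home on busy street"),
  price = c(250000, 400000, 185000, 172000, 180000),
  color = factor(c("blue", "blue", "yellow", "yellow", "green")),
  initial_impressions = factor(c("love", "love", "hate", "neutral", "hate"),
                               levels = c("hate", "neutral", "love"),
                               ordered = TRUE),
  stringsAsFactors = FALSE
)

houses_tbl <- as_tibble(houses)

print(houses)       # classic data frame printing
print(houses_tbl)   # tibble printing adds dimensions and column types

is.data.frame(houses_tbl)   # a tibble is still a data frame
```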

2.5 Packages

A package is a collection of functions and data sets that is typically not a part of base R (i.e., it is developed by the community). While functions like mean, median, max, etc. are part of base R, the as_tibble function we just used comes from the tibble package; dplyr re-exports it, so it is available once dplyr is loaded.

To install a package, we can use the install.packages() command, or we can select Tools –> Install Packages in RStudio. By default, the available packages listed are from the CRAN repository, but we can also install packages from Bioconductor or GitHub.

After a package is installed, if we want to use the functions or data sets that come with the package, we must load it using the library function as we have already done with dplyr and knitr.
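The install-then-load sequence looks roughly like this. The sketch assumes dplyr is already installed, so the install line is left commented out.

```r
# Install once per machine (commented out here; it needs an internet connection)
# install.packages("dplyr")

# Load in every new R session before using the package's functions
library(dplyr)

# Confirm the package is attached and check which version was loaded
"package:dplyr" %in% search()
packageVersion("dplyr")
```

Installing is a one-time step, while library() has to be repeated in each new session or script.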

Some R packages include long-form overviews called “vignettes”. You can see which vignettes are available for your installed packages using the browseVignettes function (pass a package name to narrow the list). This will launch a browser tab with links to the vignettes. You can also find out what packages and versions you have loaded, along with other useful information, by using the sessionInfo function as shown below. There will probably be more loaded than you expect, as many packages have dependencies (i.e., they load other packages). We’ll probably be using sessionInfo to diagnose some problems students might have during the semester.

## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.4
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_0.8.4 knitr_1.26 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.4       rstudioapi_0.11  magrittr_1.5     tidyselect_0.2.5
##  [5] R6_2.4.1         rlang_0.4.5      fansi_0.4.1      stringr_1.4.0   
##  [9] highr_0.8        tools_3.6.2      xfun_0.11        png_0.1-7       
## [13] utf8_1.1.4       cli_2.0.2        htmltools_0.4.0  yaml_2.2.0      
## [17] digest_0.6.25    assertthat_0.2.1 tibble_2.1.3     crayon_1.3.4    
## [21] bookdown_0.16.5  purrr_0.3.3      vctrs_0.2.4      glue_1.3.2      
## [25] evaluate_0.14    rmarkdown_2.0    stringi_1.4.6    compiler_3.6.2  
## [29] pillar_1.4.3     pkgconfig_2.0.3

Most of the packages we will use in this course are part of the tidyverse, which we will introduce in the next unit.

2.6 DataCamp Exercises

This unit’s DataCamp exercises have you finishing their Introduction to R course, which gives you the prerequisite knowledge required for many of their other courses.