Chapter 3 Exploring Data - I

This unit spans Mon Feb 10 through Sat Feb 15.
At 11:59 PM on Sat Feb 15 the following items are due:

  • DC 7 - Introduction to the Tidyverse: Data wrangling
  • DC 8 - Introduction to the Tidyverse: Data Visualization

Now that we’ve covered the basics of R, it is time to introduce some packages that make life a little easier when manipulating and exploring data. We’ll also use an external package to visually explore our data (i.e., create simple visualizations).

3.1 Media

3.2 The tidyverse

The tidyverse is a set of external R packages that work together and support the analytics workflow we introduced in the first unit. It was originally referred to casually as the “hadleyverse” because Hadley Wickham was the lead developer on all the initial packages, but he preferred that these packages be called the tidyverse. There are many packages in the tidyverse, but in this unit we will cover two of the packages that are automatically loaded when you issue the command library(tidyverse): dplyr for data manipulation and ggplot2 for visualization. If you would like to follow along in RStudio while you read these notes, make sure you install the tidyverse package (RStudio -> Tools -> Install Packages... -> tidyverse, or install.packages("tidyverse") in the console) and load the library.

Before we get started with dplyr, it is important to mention that all of the tidyverse packages support the pipe operator %>%, which chains statements together and is technically part of the tidyverse package magrittr. We’ll start out this unit by loading the core packages in the tidyverse.
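
As a minimal sketch of what loading the library and using the pipe looks like (the numbers are made up purely for illustration, and this assumes the tidyverse package is already installed):

```r
library(tidyverse)  # attaches the core packages, including dplyr and ggplot2

# The pipe passes the value on its left as the first argument of the call
# on its right, so the nested and piped versions compute the same thing.
nested <- sum(sqrt(c(1, 4, 9)))
piped  <- c(1, 4, 9) %>% sqrt() %>% sum()

piped  # 6
```

Reading the piped version left to right ("take these numbers, then take square roots, then sum") is the habit the rest of this unit relies on.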

3.3 dplyr

dplyr is a grammar for data manipulation that uses specific verbs to manipulate data.

It is often difficult to get used to asking questions in dplyr instead of plain English. One way to improve your thought process is to understand the verbs of dplyr and their purpose.

  • select chooses specific columns.
  • rename renames specific columns and keeps all the others.
  • filter chooses specific rows.
  • arrange sorts rows.
  • mutate creates new columns.
  • transmute is like mutate but doesn’t keep your old columns.
  • distinct returns unique rows.
  • summarize aggregates rows into summary values (usually one row per group).
  • slice selects rows by position.
  • sample (sample_n and sample_frac) takes random samples of rows (seldom used).

We won’t be discussing sample, as it is more commonly used in the sciences, but the other verbs are all commonly used. The other two key non-verb actions in dplyr are group_by, which is typically applied when using summarize, and the pipe operator %>%, which is used to combine verbs. I give a better visual representation of the queries below in the video, but let’s start by reading in a comma-separated values (CSV) file from a URL and having a quick look at it. Note: we will learn all about importing different file types later in the semester.
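
The URL of got.csv is not given in these notes, so the sketch below is only a stand-in: it rebuilds the same ten rows by hand with data.frame() rather than calling read.csv on the real file. With the actual file, the import would be a single read.csv call.

```r
# Hypothetical stand-in for the CSV import: the ten rows are copied from the
# table printed in these notes, not read from the original file.
got <- data.frame(
  lastname  = c("Snow", "Lannister", "Targaryen", "Bolton", "Stark",
                "Clegane", "Baelish", "Baratheon", "Drogo", "Tarly"),
  firstname = c("John", "Tyrion", "Daenerys", "Ramsay", "Eddard",
                "Gregor", "Peter", "Joffrey", "Khal", "Samwise"),
  major     = c("Nordic Studies", "Communications", "Zoology", "Phys Ed", "History",
                "Phys Ed", "Communications", "History", "Zoology", "Nordic Studies"),
  year      = c("Junior", "Sophomore", "Freshman", "Freshman", "Senior",
                "Sophomore", "Freshman", "Freshman", "Senior", "Freshman"),
  gpa       = c(3.23, 3.83, 3.36, 2.24, 2.78, 3.23, 2.84, 1.87, 3.38, 2.39),
  stringsAsFactors = FALSE
)

got  # print the whole data frame
```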

##     lastname firstname          major      year  gpa
## 1       Snow      John Nordic Studies    Junior 3.23
## 2  Lannister    Tyrion Communications Sophomore 3.83
## 3  Targaryen  Daenerys        Zoology  Freshman 3.36
## 4     Bolton    Ramsay        Phys Ed  Freshman 2.24
## 5      Stark    Eddard        History    Senior 2.78
## 6    Clegane    Gregor        Phys Ed Sophomore 3.23
## 7    Baelish     Peter Communications  Freshman 2.84
## 8  Baratheon   Joffrey        History  Freshman 1.87
## 9      Drogo      Khal        Zoology    Senior 3.38
## 10     Tarly   Samwise Nordic Studies  Freshman 2.39

got.csv is read into the data frame got using read.csv. We’ll use the pipe operator %>% to pipe the data frame to the select verb to choose specific columns (e.g., lastname, firstname, gpa). Within select, I can also change column names. Please note: I am not storing the results of these queries in any variables; I am sending them directly to output (i.e., printing them). Below, we are explicitly saying “take the data frame got, select the columns lastname, firstname, and gpa, and while you are at it, rename the lastname column to surname.”
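
A query along these lines would produce the output below. This is a sketch, assuming got is rebuilt by hand from the printed table (the original CSV location is not reproduced here):

```r
library(dplyr)

# Hypothetical stand-in for the CSV import (rows copied from the printed table).
got <- data.frame(
  lastname  = c("Snow", "Lannister", "Targaryen", "Bolton", "Stark",
                "Clegane", "Baelish", "Baratheon", "Drogo", "Tarly"),
  firstname = c("John", "Tyrion", "Daenerys", "Ramsay", "Eddard",
                "Gregor", "Peter", "Joffrey", "Khal", "Samwise"),
  major     = c("Nordic Studies", "Communications", "Zoology", "Phys Ed", "History",
                "Phys Ed", "Communications", "History", "Zoology", "Nordic Studies"),
  year      = c("Junior", "Sophomore", "Freshman", "Freshman", "Senior",
                "Sophomore", "Freshman", "Freshman", "Senior", "Freshman"),
  gpa       = c(3.23, 3.83, 3.36, 2.24, 2.78, 3.23, 2.84, 1.87, 3.38, 2.39),
  stringsAsFactors = FALSE
)

# Keep three columns; new_name = old_name renames inside select.
got %>%
  select(surname = lastname, firstname, gpa)
```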

##      surname firstname  gpa
## 1       Snow      John 3.23
## 2  Lannister    Tyrion 3.83
## 3  Targaryen  Daenerys 3.36
## 4     Bolton    Ramsay 2.24
## 5      Stark    Eddard 2.78
## 6    Clegane    Gregor 3.23
## 7    Baelish     Peter 2.84
## 8  Baratheon   Joffrey 1.87
## 9      Drogo      Khal 3.38
## 10     Tarly   Samwise 2.39

We can also use rename to change column names. Unlike select, it keeps all the columns in the data frame. So if I wanted to show the entire data frame using the more formal surname instead of lastname, I could do the following without having to list every column name in select.
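
One way to write this, again assuming a hand-built copy of got rather than the original CSV import:

```r
library(dplyr)

# Hypothetical stand-in for the CSV import (rows copied from the printed table).
got <- data.frame(
  lastname  = c("Snow", "Lannister", "Targaryen", "Bolton", "Stark",
                "Clegane", "Baelish", "Baratheon", "Drogo", "Tarly"),
  firstname = c("John", "Tyrion", "Daenerys", "Ramsay", "Eddard",
                "Gregor", "Peter", "Joffrey", "Khal", "Samwise"),
  major     = c("Nordic Studies", "Communications", "Zoology", "Phys Ed", "History",
                "Phys Ed", "Communications", "History", "Zoology", "Nordic Studies"),
  year      = c("Junior", "Sophomore", "Freshman", "Freshman", "Senior",
                "Sophomore", "Freshman", "Freshman", "Senior", "Freshman"),
  gpa       = c(3.23, 3.83, 3.36, 2.24, 2.78, 3.23, 2.84, 1.87, 3.38, 2.39),
  stringsAsFactors = FALSE
)

# rename changes one column name and leaves every other column in place.
got %>%
  rename(surname = lastname)
```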

##      surname firstname          major      year  gpa
## 1       Snow      John Nordic Studies    Junior 3.23
## 2  Lannister    Tyrion Communications Sophomore 3.83
## 3  Targaryen  Daenerys        Zoology  Freshman 3.36
## 4     Bolton    Ramsay        Phys Ed  Freshman 2.24
## 5      Stark    Eddard        History    Senior 2.78
## 6    Clegane    Gregor        Phys Ed Sophomore 3.23
## 7    Baelish     Peter Communications  Freshman 2.84
## 8  Baratheon   Joffrey        History  Freshman 1.87
## 9      Drogo      Khal        Zoology    Senior 3.38
## 10     Tarly   Samwise Nordic Studies  Freshman 2.39

If I wanted to filter the results above to show only the rows with a gpa greater than or equal to 3.5, I would pipe the results to filter to choose those specific rows.
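
A sketch of that pipeline, with got rebuilt by hand as an assumption in place of the CSV import:

```r
library(dplyr)

# Hypothetical stand-in for the CSV import (rows copied from the printed table).
got <- data.frame(
  lastname  = c("Snow", "Lannister", "Targaryen", "Bolton", "Stark",
                "Clegane", "Baelish", "Baratheon", "Drogo", "Tarly"),
  firstname = c("John", "Tyrion", "Daenerys", "Ramsay", "Eddard",
                "Gregor", "Peter", "Joffrey", "Khal", "Samwise"),
  major     = c("Nordic Studies", "Communications", "Zoology", "Phys Ed", "History",
                "Phys Ed", "Communications", "History", "Zoology", "Nordic Studies"),
  year      = c("Junior", "Sophomore", "Freshman", "Freshman", "Senior",
                "Sophomore", "Freshman", "Freshman", "Senior", "Freshman"),
  gpa       = c(3.23, 3.83, 3.36, 2.24, 2.78, 3.23, 2.84, 1.87, 3.38, 2.39),
  stringsAsFactors = FALSE
)

# Rename first, then keep only the rows that satisfy the condition.
got %>%
  rename(surname = lastname) %>%
  filter(gpa >= 3.5)
```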

##     surname firstname          major      year  gpa
## 1 Lannister    Tyrion Communications Sophomore 3.83

If instead I just wanted to sort the rows from highest to lowest gpa, I would use arrange. I use desc because the default sort order is lowest to highest.
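
Sketched with the same hand-built assumption for got:

```r
library(dplyr)

# Hypothetical stand-in for the CSV import (rows copied from the printed table).
got <- data.frame(
  lastname  = c("Snow", "Lannister", "Targaryen", "Bolton", "Stark",
                "Clegane", "Baelish", "Baratheon", "Drogo", "Tarly"),
  firstname = c("John", "Tyrion", "Daenerys", "Ramsay", "Eddard",
                "Gregor", "Peter", "Joffrey", "Khal", "Samwise"),
  major     = c("Nordic Studies", "Communications", "Zoology", "Phys Ed", "History",
                "Phys Ed", "Communications", "History", "Zoology", "Nordic Studies"),
  year      = c("Junior", "Sophomore", "Freshman", "Freshman", "Senior",
                "Sophomore", "Freshman", "Freshman", "Senior", "Freshman"),
  gpa       = c(3.23, 3.83, 3.36, 2.24, 2.78, 3.23, 2.84, 1.87, 3.38, 2.39),
  stringsAsFactors = FALSE
)

# desc() flips the default ascending order so the highest gpa comes first.
got %>%
  rename(surname = lastname) %>%
  arrange(desc(gpa))
```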

##      surname firstname          major      year  gpa
## 1  Lannister    Tyrion Communications Sophomore 3.83
## 2      Drogo      Khal        Zoology    Senior 3.38
## 3  Targaryen  Daenerys        Zoology  Freshman 3.36
## 4       Snow      John Nordic Studies    Junior 3.23
## 5    Clegane    Gregor        Phys Ed Sophomore 3.23
## 6    Baelish     Peter Communications  Freshman 2.84
## 7      Stark    Eddard        History    Senior 2.78
## 8      Tarly   Samwise Nordic Studies  Freshman 2.39
## 9     Bolton    Ramsay        Phys Ed  Freshman 2.24
## 10 Baratheon   Joffrey        History  Freshman 1.87

Suppose I wanted to create a dean’s list column called dlist and set it to TRUE if the gpa >= 3.5 and FALSE otherwise. I would use mutate for that. Note: in this example, the column is only created in the output, and the data frame is unaltered.
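
A minimal sketch of the mutate step (got is rebuilt by hand here as an assumption; the comparison gpa >= 3.5 already evaluates to TRUE or FALSE, so no if/else is needed):

```r
library(dplyr)

# Hypothetical stand-in for the CSV import (rows copied from the printed table).
got <- data.frame(
  lastname  = c("Snow", "Lannister", "Targaryen", "Bolton", "Stark",
                "Clegane", "Baelish", "Baratheon", "Drogo", "Tarly"),
  firstname = c("John", "Tyrion", "Daenerys", "Ramsay", "Eddard",
                "Gregor", "Peter", "Joffrey", "Khal", "Samwise"),
  major     = c("Nordic Studies", "Communications", "Zoology", "Phys Ed", "History",
                "Phys Ed", "Communications", "History", "Zoology", "Nordic Studies"),
  year      = c("Junior", "Sophomore", "Freshman", "Freshman", "Senior",
                "Sophomore", "Freshman", "Freshman", "Senior", "Freshman"),
  gpa       = c(3.23, 3.83, 3.36, 2.24, 2.78, 3.23, 2.84, 1.87, 3.38, 2.39),
  stringsAsFactors = FALSE
)

# dlist exists only in the printed result; got itself is not modified
# because the result is never assigned back to it.
got %>%
  rename(surname = lastname) %>%
  mutate(dlist = gpa >= 3.5)
```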

##      surname firstname          major      year  gpa dlist
## 1       Snow      John Nordic Studies    Junior 3.23 FALSE
## 2  Lannister    Tyrion Communications Sophomore 3.83  TRUE
## 3  Targaryen  Daenerys        Zoology  Freshman 3.36 FALSE
## 4     Bolton    Ramsay        Phys Ed  Freshman 2.24 FALSE
## 5      Stark    Eddard        History    Senior 2.78 FALSE
## 6    Clegane    Gregor        Phys Ed Sophomore 3.23 FALSE
## 7    Baelish     Peter Communications  Freshman 2.84 FALSE
## 8  Baratheon   Joffrey        History  Freshman 1.87 FALSE
## 9      Drogo      Khal        Zoology    Senior 3.38 FALSE
## 10     Tarly   Samwise Nordic Studies  Freshman 2.39 FALSE

If I just wanted to show my transformed variables and no other variables, I could use transmute. Here, the two transformed variables are name (firstname and lastname pasted together) and dlist.
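
A sketch that matches the output below, assuming the name column is built with paste and that got is rebuilt by hand rather than imported:

```r
library(dplyr)

# Hypothetical stand-in for the CSV import (rows copied from the printed table).
got <- data.frame(
  lastname  = c("Snow", "Lannister", "Targaryen", "Bolton", "Stark",
                "Clegane", "Baelish", "Baratheon", "Drogo", "Tarly"),
  firstname = c("John", "Tyrion", "Daenerys", "Ramsay", "Eddard",
                "Gregor", "Peter", "Joffrey", "Khal", "Samwise"),
  major     = c("Nordic Studies", "Communications", "Zoology", "Phys Ed", "History",
                "Phys Ed", "Communications", "History", "Zoology", "Nordic Studies"),
  year      = c("Junior", "Sophomore", "Freshman", "Freshman", "Senior",
                "Sophomore", "Freshman", "Freshman", "Senior", "Freshman"),
  gpa       = c(3.23, 3.83, 3.36, 2.24, 2.78, 3.23, 2.84, 1.87, 3.38, 2.39),
  stringsAsFactors = FALSE
)

# transmute keeps only the columns it creates.
got %>%
  transmute(name  = paste(firstname, lastname),
            dlist = gpa >= 3.5)
```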

##                  name dlist
## 1           John Snow FALSE
## 2    Tyrion Lannister  TRUE
## 3  Daenerys Targaryen FALSE
## 4       Ramsay Bolton FALSE
## 5        Eddard Stark FALSE
## 6      Gregor Clegane FALSE
## 7       Peter Baelish FALSE
## 8   Joffrey Baratheon FALSE
## 9          Khal Drogo FALSE
## 10      Samwise Tarly FALSE

If we wanted to list the majors represented in the got data frame, we would use distinct, which restricts the output to unique (distinct) rows.
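
For example (a sketch with got rebuilt by hand as an assumption):

```r
library(dplyr)

# Hypothetical stand-in for the CSV import (rows copied from the printed table).
got <- data.frame(
  lastname  = c("Snow", "Lannister", "Targaryen", "Bolton", "Stark",
                "Clegane", "Baelish", "Baratheon", "Drogo", "Tarly"),
  firstname = c("John", "Tyrion", "Daenerys", "Ramsay", "Eddard",
                "Gregor", "Peter", "Joffrey", "Khal", "Samwise"),
  major     = c("Nordic Studies", "Communications", "Zoology", "Phys Ed", "History",
                "Phys Ed", "Communications", "History", "Zoology", "Nordic Studies"),
  year      = c("Junior", "Sophomore", "Freshman", "Freshman", "Senior",
                "Sophomore", "Freshman", "Freshman", "Senior", "Freshman"),
  gpa       = c(3.23, 3.83, 3.36, 2.24, 2.78, 3.23, 2.84, 1.87, 3.38, 2.39),
  stringsAsFactors = FALSE
)

# Naming a column inside distinct returns the unique values of that column,
# in order of first appearance.
got %>%
  distinct(major)
```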

##            major
## 1 Nordic Studies
## 2 Communications
## 3        Zoology
## 4        Phys Ed
## 5        History

Aggregation often adds the most complexity to a query, and it is quite common to see summarize combined with group_by. For example, if we wanted to show the average gpa for each major, we would use group_by to declare that we are doing a calculation for each major and use summarize to define the mean calculation. You’ll notice that instead of a data frame, we are outputting a tibble, which is essentially an enhanced data frame that can store more complex data.
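
A sketch of the grouped aggregation, assuming got is rebuilt by hand and that the summary column is named average_gpa as in the output below:

```r
library(dplyr)

# Hypothetical stand-in for the CSV import (rows copied from the printed table).
got <- data.frame(
  lastname  = c("Snow", "Lannister", "Targaryen", "Bolton", "Stark",
                "Clegane", "Baelish", "Baratheon", "Drogo", "Tarly"),
  firstname = c("John", "Tyrion", "Daenerys", "Ramsay", "Eddard",
                "Gregor", "Peter", "Joffrey", "Khal", "Samwise"),
  major     = c("Nordic Studies", "Communications", "Zoology", "Phys Ed", "History",
                "Phys Ed", "Communications", "History", "Zoology", "Nordic Studies"),
  year      = c("Junior", "Sophomore", "Freshman", "Freshman", "Senior",
                "Sophomore", "Freshman", "Freshman", "Senior", "Freshman"),
  gpa       = c(3.23, 3.83, 3.36, 2.24, 2.78, 3.23, 2.84, 1.87, 3.38, 2.39),
  stringsAsFactors = FALSE
)

# group_by declares "for each major"; summarize collapses each group to one row.
got %>%
  group_by(major) %>%
  summarize(average_gpa = mean(gpa))
```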

## # A tibble: 5 x 2
##   major          average_gpa
##   <chr>                <dbl>
## 1 Communications        3.34
## 2 History               2.33
## 3 Nordic Studies        2.81
## 4 Phys Ed               2.74
## 5 Zoology               3.37

Suppose we wanted to show the name of the student with the highest gpa for each major. We could do this in a few different ways. In all cases, since we are doing it for each major, we will be using group_by(major). In the first case, after grouping, we sort in descending gpa order and slice out the first (1) row of each group.
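
The first approach might look like this (a sketch; got is rebuilt by hand as an assumption):

```r
library(dplyr)

# Hypothetical stand-in for the CSV import (rows copied from the printed table).
got <- data.frame(
  lastname  = c("Snow", "Lannister", "Targaryen", "Bolton", "Stark",
                "Clegane", "Baelish", "Baratheon", "Drogo", "Tarly"),
  firstname = c("John", "Tyrion", "Daenerys", "Ramsay", "Eddard",
                "Gregor", "Peter", "Joffrey", "Khal", "Samwise"),
  major     = c("Nordic Studies", "Communications", "Zoology", "Phys Ed", "History",
                "Phys Ed", "Communications", "History", "Zoology", "Nordic Studies"),
  year      = c("Junior", "Sophomore", "Freshman", "Freshman", "Senior",
                "Sophomore", "Freshman", "Freshman", "Senior", "Freshman"),
  gpa       = c(3.23, 3.83, 3.36, 2.24, 2.78, 3.23, 2.84, 1.87, 3.38, 2.39),
  stringsAsFactors = FALSE
)

# After sorting by gpa, slice(1) takes the first row within each major.
got %>%
  group_by(major) %>%
  arrange(desc(gpa)) %>%
  slice(1)
```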

## # A tibble: 5 x 5
## # Groups:   major [5]
##   lastname  firstname major          year        gpa
##   <chr>     <chr>     <chr>          <chr>     <dbl>
## 1 Lannister Tyrion    Communications Sophomore  3.83
## 2 Stark     Eddard    History        Senior     2.78
## 3 Snow      John      Nordic Studies Junior     3.23
## 4 Clegane   Gregor    Phys Ed        Sophomore  3.23
## 5 Drogo     Khal      Zoology        Senior     3.38

In the second case, we decide we want to use the top_n function.
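
A sketch of the top_n version follows. The “Selecting by gpa” message in the output suggests top_n was called without naming a column, in which case it falls back to the last column; the final arrange is my assumption, added only because the output below is sorted by gpa. got is again rebuilt by hand.

```r
library(dplyr)

# Hypothetical stand-in for the CSV import (rows copied from the printed table).
got <- data.frame(
  lastname  = c("Snow", "Lannister", "Targaryen", "Bolton", "Stark",
                "Clegane", "Baelish", "Baratheon", "Drogo", "Tarly"),
  firstname = c("John", "Tyrion", "Daenerys", "Ramsay", "Eddard",
                "Gregor", "Peter", "Joffrey", "Khal", "Samwise"),
  major     = c("Nordic Studies", "Communications", "Zoology", "Phys Ed", "History",
                "Phys Ed", "Communications", "History", "Zoology", "Nordic Studies"),
  year      = c("Junior", "Sophomore", "Freshman", "Freshman", "Senior",
                "Sophomore", "Freshman", "Freshman", "Senior", "Freshman"),
  gpa       = c(3.23, 3.83, 3.36, 2.24, 2.78, 3.23, 2.84, 1.87, 3.38, 2.39),
  stringsAsFactors = FALSE
)

# top_n(1) keeps the top row per major, ranking by the last column (gpa).
got %>%
  group_by(major) %>%
  top_n(1) %>%
  arrange(desc(gpa))
```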

## Selecting by gpa
## # A tibble: 5 x 5
## # Groups:   major [5]
##   lastname  firstname major          year        gpa
##   <chr>     <chr>     <chr>          <chr>     <dbl>
## 1 Lannister Tyrion    Communications Sophomore  3.83
## 2 Drogo     Khal      Zoology        Senior     3.38
## 3 Snow      John      Nordic Studies Junior     3.23
## 4 Clegane   Gregor    Phys Ed        Sophomore  3.23
## 5 Stark     Eddard    History        Senior     2.78

In the third case, we use the min_rank function within filter.
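
Sketched with the same hand-built assumption for got:

```r
library(dplyr)

# Hypothetical stand-in for the CSV import (rows copied from the printed table).
got <- data.frame(
  lastname  = c("Snow", "Lannister", "Targaryen", "Bolton", "Stark",
                "Clegane", "Baelish", "Baratheon", "Drogo", "Tarly"),
  firstname = c("John", "Tyrion", "Daenerys", "Ramsay", "Eddard",
                "Gregor", "Peter", "Joffrey", "Khal", "Samwise"),
  major     = c("Nordic Studies", "Communications", "Zoology", "Phys Ed", "History",
                "Phys Ed", "Communications", "History", "Zoology", "Nordic Studies"),
  year      = c("Junior", "Sophomore", "Freshman", "Freshman", "Senior",
                "Sophomore", "Freshman", "Freshman", "Senior", "Freshman"),
  gpa       = c(3.23, 3.83, 3.36, 2.24, 2.78, 3.23, 2.84, 1.87, 3.38, 2.39),
  stringsAsFactors = FALSE
)

# min_rank(desc(gpa)) gives rank 1 to the highest gpa within each major;
# filter keeps those rows and leaves them in their original order.
got %>%
  group_by(major) %>%
  filter(min_rank(desc(gpa)) == 1)
```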

## # A tibble: 5 x 5
## # Groups:   major [5]
##   lastname  firstname major          year        gpa
##   <chr>     <chr>     <chr>          <chr>     <dbl>
## 1 Snow      John      Nordic Studies Junior     3.23
## 2 Lannister Tyrion    Communications Sophomore  3.83
## 3 Stark     Eddard    History        Senior     2.78
## 4 Clegane   Gregor    Phys Ed        Sophomore  3.23
## 5 Drogo     Khal      Zoology        Senior     3.38

This probably seems somewhat confusing, so it is best to describe what is going on here. top_n is an easier-to-use “wrapper” function that combines filter and min_rank. slice was added later to dplyr to make it simpler to select rows other than just the top ones. For example, if I wanted to select positions 2 through 4, I would use slice(2:4). There is no equivalent top_n call for this, and I would end up resorting to the harder-to-follow filter(min_rank(...) %in% c(2:4)). To simplify, you should try to get comfortable with slice, but feel free to use top_n as well.
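
To make the comparison concrete, here is a sketch that picks positions 2 through 4 by gpa both ways. It is ungrouped for simplicity, got is rebuilt by hand as an assumption, and the names by_slice and by_rank are mine. One thing worth noticing: the two forms only agree when there are no ties, because slice works on row positions while min_rank gives tied values the same rank.

```r
library(dplyr)

# Hypothetical stand-in for the CSV import (rows copied from the printed table).
got <- data.frame(
  lastname  = c("Snow", "Lannister", "Targaryen", "Bolton", "Stark",
                "Clegane", "Baelish", "Baratheon", "Drogo", "Tarly"),
  firstname = c("John", "Tyrion", "Daenerys", "Ramsay", "Eddard",
                "Gregor", "Peter", "Joffrey", "Khal", "Samwise"),
  major     = c("Nordic Studies", "Communications", "Zoology", "Phys Ed", "History",
                "Phys Ed", "Communications", "History", "Zoology", "Nordic Studies"),
  year      = c("Junior", "Sophomore", "Freshman", "Freshman", "Senior",
                "Sophomore", "Freshman", "Freshman", "Senior", "Freshman"),
  gpa       = c(3.23, 3.83, 3.36, 2.24, 2.78, 3.23, 2.84, 1.87, 3.38, 2.39),
  stringsAsFactors = FALSE
)

# Positions 2 to 4 after sorting: exactly three rows.
by_slice <- got %>%
  arrange(desc(gpa)) %>%
  slice(2:4)

# Ranks 2 to 4: Snow and Clegane tie at 3.23 and share rank 4, so four rows.
by_rank <- got %>%
  filter(min_rank(desc(gpa)) %in% c(2:4)) %>%
  arrange(desc(gpa))

nrow(by_slice)  # 3
nrow(by_rank)   # 4
```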

3.4 ggplot2

We aren’t wired to look at tons of numbers. In analytics, we tend to use visualizations to understand our data and observe patterns quickly. One of R’s primary strengths is its visualization libraries. For static visualizations, ggplot2 is possibly the most commonly used library.

We often think of visualizations as a way to tell stories to others involving data. In this case, we are merely using visualization to explore our data. With exploratory visualizations, we aren’t that focused on formatting and ease of interpretation by others because they are for our private consumption. As we’ll learn this semester, ggplot2 is commonly used for both exploratory visualizations and to communicate results.

3.5 Grammar of graphics

Base R graphics are conceptually like working from a blank canvas. If you’ve used Microsoft Excel to create a visualization, you typically select a chart from a library. Leland Wilkinson published The Grammar of Graphics in 1999 and described a framework for constructing visualizations. This structured framework falls nicely in between the unstructured blank canvas and the rigid “select a chart” model. The “gg” in ggplot2 actually stands for grammar of graphics. Another commonly used visualization software application, Tableau, also uses the grammar of graphics as a framework (Leland Wilkinson was the VP of Statistics for Tableau). It is important to note that this grammar doesn’t help you select what visualizations to use; it merely helps you construct them.

There are three critical components for every ggplot2 plot.

  1. data
  2. a set of aesthetic mappings between variables in the data and visual properties, and
  3. at least one layer which describes how to render each observation. Layers are usually created with a geom function.

First, we’ll take a look at a base graphics scatterplot of miles per gallon (mpg) and displacement (disp) using the built-in mtcars data frame:
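
A minimal sketch using base R’s plot function, with mpg on the x-axis and disp on the y-axis to match the mapping described later in this section:

```r
# mtcars ships with R, so nothing needs to be loaded or imported.
plot(mtcars$mpg, mtcars$disp)
```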

If I want to create a similar visualization in ggplot2, I would start with the data and the aesthetics (aes) using the ggplot function.
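
Something like the following, as a sketch:

```r
library(ggplot2)

# Data plus aesthetic mappings only; no layer has been added yet.
ggplot(mtcars, aes(x = mpg, y = disp))
```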

You’ll notice that there aren’t any points in our graph. That is because we have yet to create a layer to render the observations. Recall that we typically use a geom function for this. Scatterplots are rendered using geom_point.
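
Adding the layer with + gives a sketch like this:

```r
library(ggplot2)

# Same data and mappings as before, now with a point layer to draw them.
ggplot(mtcars, aes(x = mpg, y = disp)) +
  geom_point()
```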

Looking back at the layered grammar we created:

  1. data - mtcars
  2. aesthetics - we map mpg to the x-axis and disp to the y-axis
  3. layer - we use points as the geometric object to render the values

You’ll probably notice that the visualization is a little more refined than the one we created with plot. One of the benefits of using ggplot2 is that the defaults are really good.

We can also apply other aesthetic mappings to our visualization, like mapping cylinder to color:
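
A sketch of that mapping, assuming the cylinder count comes from the cyl column of mtcars:

```r
library(ggplot2)

# color is just another aesthetic; factor(cyl) makes it categorical.
ggplot(mtcars, aes(x = mpg, y = disp, color = factor(cyl))) +
  geom_point()
```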

I used factor to treat the number of cylinders (cyl) as a factor (i.e., enumerated or categorical) variable. This produces a discrete color scheme with clearly distinguishable colors and prevents a legend that might include values that don’t exist in the data (shown below without the use of factor).
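
For comparison, a sketch of the same plot with cyl left as a number, which gives a continuous color gradient:

```r
library(ggplot2)

# Without factor(), cyl is treated as continuous, so the legend is a gradient
# that can show in-between values such as 5 or 7 that no car actually has.
ggplot(mtcars, aes(x = mpg, y = disp, color = cyl)) +
  geom_point()
```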

Some other aesthetic mappings include size:
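
The notes do not say which variable was mapped to size; the sketch below assumes horsepower (hp) purely for illustration:

```r
library(ggplot2)

# size = hp is an assumed choice; any numeric column would work the same way.
ggplot(mtcars, aes(x = mpg, y = disp, color = factor(cyl), size = hp)) +
  geom_point()
```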

…and shape:
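
Again the variable is not named in the notes, so this sketch assumes the number of gears (gear) for shape and horsepower (hp) for size. Shape needs a categorical variable, hence factor:

```r
library(ggplot2)

# Assumed mappings: size = hp and shape = factor(gear) are illustrative choices.
ggplot(mtcars, aes(x = mpg, y = disp, color = factor(cyl),
                   size = hp, shape = factor(gear))) +
  geom_point()
```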

The aesthetics we use are somewhat dependent on how we choose to encode our data. Some aesthetics not used or not applicable here are fill, linetype, weight, alpha, and text. The visualization above is somewhat difficult to comprehend, and we might be better off rethinking what data we want to show and how we want to communicate it.

There are also a variety of geoms for bars, boxplots, smoothing lines, and others that you can use, some of which we will cover throughout the semester.

The remaining grammatical elements that we have yet to cover are:

  • Scales map values in the data space to values in an aesthetic space, whether it be color, or size, or shape. Scales draw a legend or axes, which provide an inverse mapping to make it possible to read the original data values from the plot.
  • A coordinate system, coord for short, describes how data coordinates are mapped to the plane of the graphic. It also provides axes and gridlines to make it possible to read the graph. We normally use a Cartesian coordinate system, but some others are available, including polar coordinates and map projections.
  • A faceting specification describes how to break up the data into subsets and how to display those subsets as small multiples. This is also known as conditioning or latticing/trellising.
  • A theme controls the finer points of display, like the font size and background colour. While the defaults in ggplot2 have been chosen with care, you may need to consult other references to create an attractive plot. A good starting place is Edward Tufte’s early works (Tufte, 1990, 1997, 2001).
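
To give a feel for where these four elements sit in the code, here is a rough sketch that adds one of each to the mtcars scatterplot. The particular palette, axis limits, faceting variable, and theme are arbitrary choices of mine, not recommendations:

```r
library(ggplot2)

ggplot(mtcars, aes(x = mpg, y = disp, color = factor(cyl))) +
  geom_point() +
  scale_color_brewer(palette = "Dark2", name = "Cylinders") +  # scale
  coord_cartesian(xlim = c(10, 35)) +                          # coordinate system
  facet_wrap(~ gear) +                                         # faceting
  theme_minimal(base_size = 12)                                # theme
```

Each element is added with + just like a geom, which is what makes the grammar feel consistent once the data, aesthetics, and layers are in place.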

3.6 DataCamp Exercises

The DataCamp exercises you are assigned cover both dplyr and ggplot2 in different ways than the lecture notes do. This breadth of coverage should help solidify your knowledge of these core packages which we’ll use throughout the semester. Both of these packages will also help clarify your thinking when manipulating data and visualizing it. This will make it easier to work with any data manipulation and visualization tools, including Excel.