Chapter 5 Communicating Visually

This unit spans Mon Feb 24 through Sat Feb 29.
At 11:59 PM on Sat Feb 29 the following items are due:

  • DC 11 - Communicating with Data in the Tidyverse: Custom ggplot2 themes
  • DC 12 - Communicating with Data in the Tidyverse: Creating a custom and unique visualization

We’ll continue our exploration of the tidyverse, with a specific focus on using dplyr and ggplot2 to prepare reports containing data and visualizations for others. In the notes, I’ll show you how to use fct_reorder from the forcats package to change the order of levels in graphs. In the videos, I’ll do the same using dplyr’s mutate function instead. There are often many ways to do the same thing in R.

5.1 Media

5.2 joining data

Before we jump into communicating data, I want to cover a technique that is often used in analytics – joining data. We’ll use the babynames dataset to illustrate. I am a male born in 1967. You’ve probably heard people from my generation start stories with “back in my day” and then expound on how much more difficult some aspect of life was. I’ll continue the tradition…

Back in my day, people in the US weren’t as creative in naming their children. Apparently, they picked mostly New Testament, Christian names because that is what most other people in this country were named, and there was potentially more pressure to conform. My name is the third most popular male baby name from 1967, and the top five names made up nearly 20% of all the male names that year.
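Here is a minimal sketch of the dplyr code that produces the output below; I’m assuming the babynames and dplyr packages are loaded.

```r
library(babynames)
library(dplyr)

# Top five male names of 1967; top_n() defaults to the last column (prop)
babynames %>%
  filter(year == 1967, sex == "M") %>%
  top_n(5)
```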

## Selecting by prop
## # A tibble: 5 x 5
##    year sex   name        n   prop
##   <dbl> <chr> <chr>   <int>  <dbl>
## 1  1967 M     Michael 82445 0.0463
## 2  1967 M     David   66808 0.0375
## 3  1967 M     James   61699 0.0347
## 4  1967 M     John    61623 0.0346
## 5  1967 M     Robert  56380 0.0317

top_n also prints a message stating which variable it is using to determine the top n observations.

Compare that to 2013, the most recent year in the babynames package. We see an influx of Old Testament names like Noah and Jacob, non-religious names like “Mason,” and shortened names like “Liam.” Even more important, the top five names combined don’t even account for five percent of all males born in 2013, which is about one-quarter of the cumulative proportion of the top five names from 1967. Just from that statistic, we can tell there was likely far less conformity in naming male children in 2013 than in 1967.

## Selecting by prop
## # A tibble: 5 x 5
##    year sex   name        n    prop
##   <dbl> <chr> <chr>   <int>   <dbl>
## 1  2013 M     Noah    18241 0.00904
## 2  2013 M     Jacob   18148 0.00900
## 3  2013 M     Liam    18131 0.00899
## 4  2013 M     Mason   17688 0.00877
## 5  2013 M     William 16633 0.00825

Except for William, the top five names in 2013 were not at all popular in 1967.
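A sketch of the filter that produces the table below, using %in% to keep the top five names from each period (the vectors top_1967 and top_2013 are my own names for those two lists):

```r
top_1967 <- c("Michael", "David", "James", "John", "Robert")
top_2013 <- c("Noah", "Jacob", "Liam", "Mason", "William")

babynames %>%
  filter(sex == "M",
         year %in% c(1967, 2013),
         name %in% c(top_1967, top_2013)) %>%
  arrange(year, desc(n)) %>%
  print(n = 20)   # show all 20 rows instead of the default 10
```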

## # A tibble: 20 x 5
##     year sex   name        n      prop
##    <dbl> <chr> <chr>   <int>     <dbl>
##  1  1967 M     Michael 82445 0.0463   
##  2  1967 M     David   66808 0.0375   
##  3  1967 M     James   61699 0.0347   
##  4  1967 M     John    61623 0.0346   
##  5  1967 M     Robert  56380 0.0317   
##  6  1967 M     William 37621 0.0211   
##  7  1967 M     Jacob     451 0.000253 
##  8  1967 M     Noah      156 0.0000876
##  9  1967 M     Mason      87 0.0000489
## 10  1967 M     Liam       60 0.0000337
## 11  2013 M     Noah    18241 0.00904  
## 12  2013 M     Jacob   18148 0.00900  
## 13  2013 M     Liam    18131 0.00899  
## 14  2013 M     Mason   17688 0.00877  
## 15  2013 M     William 16633 0.00825  
## 16  2013 M     Michael 15491 0.00768  
## 17  2013 M     James   13552 0.00672  
## 18  2013 M     David   12348 0.00612  
## 19  2013 M     John    10704 0.00531  
## 20  2013 M     Robert   6708 0.00333

You’ll notice I used the %in% operator to show the popularity of each period’s top five names in both years. Let’s assume I want to answer the following questions:

  • What are the top five male names that appear in both 1967 and 2013? (To keep it simple, I’ll use totals rather than proportions.)
  • What are the top five male names from 1967 that don’t appear in 2013?
  • What are the top five male names from 2013 that don’t appear in 1967?

5.2.1 inner joins

To answer the first question (what are the top five male names that appear in both 1967 and 2013?), I’m first going to create a data frame that joins the two periods. I’ll take multiple steps to illustrate, though this could be written more compactly. inner_join creates a new data frame that “joins” the two objects by a field that exists in both data frames, preferably a unique one; in this case, name.
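A sketch of the join, assuming boys_1967 and boys_2013 are the single-year data frames created with filter:

```r
boys_1967 <- babynames %>% filter(year == 1967, sex == "M")
boys_2013 <- babynames %>% filter(year == 2013, sex == "M")

# Keep only names that appear in both years; shared column names
# get the suffixes .x (from boys_1967) and .y (from boys_2013)
joined <- inner_join(boys_1967, boys_2013, by = "name")
head(joined)
```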

## # A tibble: 6 x 9
##   year.x sex.x name      n.x prop.x year.y sex.y   n.y  prop.y
##    <dbl> <chr> <chr>   <int>  <dbl>  <dbl> <chr> <int>   <dbl>
## 1   1967 M     Michael 82445 0.0463   2013 M     15491 0.00768
## 2   1967 M     David   66808 0.0375   2013 M     12348 0.00612
## 3   1967 M     James   61699 0.0347   2013 M     13552 0.00672
## 4   1967 M     John    61623 0.0346   2013 M     10704 0.00531
## 5   1967 M     Robert  56380 0.0317   2013 M      6708 0.00333
## 6   1967 M     William 37621 0.0211   2013 M     16633 0.00825

You’ll notice that the “.x” columns hold the 1967 data and the “.y” columns hold the 2013 data. I can add n.x and n.y to get a combined total, but I’m assuming the order won’t change much from 1967 because of how concentrated the 1967 names were.
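A sketch of that calculation, again using the joined data frame from above:

```r
joined %>%
  mutate(n = n.x + n.y) %>%   # combined count across 1967 and 2013
  select(name, n) %>%
  top_n(5)                    # top_n() again defaults to the last column, n
```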

## Selecting by n
## # A tibble: 5 x 2
##   name        n
##   <chr>   <int>
## 1 Michael 97936
## 2 David   79156
## 3 James   75251
## 4 John    72327
## 5 Robert  63088

Yep…the addition of the 2013 names didn’t even budge the order. The next question is more interesting…

What are the top five male names from 1967 that don’t appear in 2013?

We can’t use our joined data to answer this because that data frame explicitly contains names that only occur in both periods. To accomplish this, we need to do an outer join.

5.2.2 outer joins

In dplyr, a left_join joins two tables using all of the rows from the “left” table and only the matching rows from the right table. So, if I want to keep all of the 1967 names, I make sure that table is the first (left-hand) argument to left_join.
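A sketch, keeping all 1967 names and attaching the 2013 values where a match exists:

```r
all_1967 <- left_join(boys_1967, boys_2013, by = "name")
tail(all_1967)   # the low-count names at the end show what unmatched rows look like
```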

## # A tibble: 6 x 9
##   year.x sex.x name       n.x     prop.x year.y sex.y   n.y      prop.y
##    <dbl> <chr> <chr>    <int>      <dbl>  <dbl> <chr> <int>       <dbl>
## 1   1967 M     Young        5 0.00000281   2013 M         8  0.00000397
## 2   1967 M     Zbigniew     5 0.00000281     NA <NA>     NA NA         
## 3   1967 M     Zebedee      5 0.00000281   2013 M        10  0.00000496
## 4   1967 M     Zeno         5 0.00000281   2013 M        14  0.00000694
## 5   1967 M     Zenon        5 0.00000281   2013 M        12  0.00000595
## 6   1967 M     Zev          5 0.00000281   2013 M       160  0.0000793

Looking at the last few names alphabetically, we can see that the name Zbigniew was used in 1967 but not in 2013 (evident from the NA values in the “*.y” columns). So, rewording the question in r-speak, what we are asking is:

Show me the five highest n.x values for names where n.y is NA.
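In code, that is a filter on is.na(n.y) followed by top_n (a sketch):

```r
all_1967 %>%
  filter(is.na(n.y)) %>%   # names with no match in 2013
  select(name, n.x) %>%
  top_n(5)
```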

## Selecting by n.x
## # A tibble: 5 x 2
##   name    n.x
##   <chr> <int>
## 1 Bart    534
## 2 Tod     303
## 3 Kraig   172
## 4 Lon     162
## 5 Kirt    155

We have uncovered a reverse-Simpsons-effect. Nobody in 2013 named their son Bart!

To do the same for the 2013 data, we can either put the 2013 table to the left of a left_join or keep it on the right of a right_join.
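Here is the right_join version, keeping boys_1967 as the left table (a sketch):

```r
all_2013 <- right_join(boys_1967, boys_2013, by = "name")

all_2013 %>%
  filter(is.na(n.x)) %>%   # names that don't appear in 1967
  select(name, n.y) %>%
  top_n(5)
```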

## Selecting by n.y
## # A tibble: 5 x 2
##   name      n.y
##   <chr>   <int>
## 1 Jayden  14756
## 2 Aiden   13615
## 3 Jaxon    7549
## 4 Brayden  7438
## 5 Ayden    6069

It looks like we can refer to 2013 as "the rise of the *dens." It also appears that my rush to label 2013 as “less conformist” may have been premature.

We could have answered both of these questions with a single full outer join (full_join) as well.
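A sketch of the full_join approach, which keeps every name from both years and answers both questions from one joined data frame:

```r
all_names <- full_join(boys_1967, boys_2013, by = "name")

# 1967 names missing from 2013
all_names %>% filter(is.na(n.y)) %>% select(name, n.x) %>% top_n(5)

# 2013 names missing from 1967
all_names %>% filter(is.na(n.x)) %>% select(name, n.y) %>% top_n(5)
```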

## Selecting by n.x
## # A tibble: 5 x 2
##   name    n.x
##   <chr> <int>
## 1 Bart    534
## 2 Tod     303
## 3 Kraig   172
## 4 Lon     162
## 5 Kirt    155
## Selecting by n.y
## # A tibble: 5 x 2
##   name      n.y
##   <chr>   <int>
## 1 Jayden  14756
## 2 Aiden   13615
## 3 Jaxon    7549
## 4 Brayden  7438
## 5 Ayden    6069

5.3 Design Guidelines

Let’s look at one of the visual encodings described in Iliinsky’s table – size, area. He has it listed as “Good” for quantitative values. If we compare this to Few’s use of “points of varying size,” we can see that Few only recommends this encoding for geospatial data, specifically for pinpointing locations rather than representing entire regions. Part of this difference is due to the two authors using different classification systems for visual encodings. “Size, area” is an expansive concept, and you might think there is some overlap between that category and Few’s categories of “Horizontal and Vertical Bars/Boxes.” For horizontal and vertical bars, however, it is the length that allows us to make the comparison, so the “size, area” category doesn’t really overlap here. Horizontal and vertical box plots do have a slight area component, but, once again, most of the preattentive processing is accomplished by the length and the line markers on the box plots.

A common example of using size to encode quantitative information is the bubble chart (shown below).

The chart shows which agencies the top 100 public servants in British Columbia worked for in 2012. Size represents the count of those public servants at each agency. I don’t think Iliinsky would be particularly fond of this chart. Stephen Few expresses his displeasure with it in a blog post. There is probably a nuanced difference between Few and Iliinsky on the applicability of bubble charts to static graphs, which further illustrates the point that the visual encoding guidelines provided by both authors are suggestions, not law.

For bar and column charts, the quantitative scale should begin at zero; you should switch to points only when the scale cannot begin at zero. In general, you should have a good reason for a non-zero start point, as it can lead to a misleading graph. If we want to compare home sales for select Puget Sound counties, county is categorical (technically it is also geographic, but we aren’t mapping right now). We use the length of the bars to make comparisons. Kitsap county looks like it had about 2.4 times as many listings as Island county because the bar is roughly 2.4 times longer.

Let’s assume that, for some reason, my quantitative scale doesn’t begin at zero (note: you should be extremely suspicious when you see this). The chart below becomes exceptionally misleading because now it appears as though Kitsap county has over ten times the listings of Island county.

You should never let this happen. If for some reason you are forced into using a non-zero start point (this sometimes happens in journalism), then you should use something that doesn’t force our brain into making comparisons via length or area. A dot plot is shown below, but the first bar chart is still the best option in this case.
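To make the contrast concrete, here is a sketch of both versions in ggplot2. The listing counts are made up for illustration (chosen only to mirror the roughly 2.4-to-1 Kitsap/Island ratio), not the actual data behind the charts.

```r
library(ggplot2)

# Hypothetical listing counts
sales <- data.frame(
  county   = c("King", "Snohomish", "Pierce", "Kitsap", "Island"),
  listings = c(4100, 2200, 1900, 600, 250)
)

# Bar chart: the scale starts at zero, so length comparisons are honest
ggplot(sales, aes(x = county, y = listings)) +
  geom_col()

# Dot plot: points don't ask the reader to compare lengths or areas,
# so a non-zero start point is less misleading (but label it clearly)
ggplot(sales, aes(x = county, y = listings)) +
  geom_point(size = 3) +
  coord_cartesian(ylim = c(200, 4200))
```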

5.4 geom selection

We are going to look at Rolling Stone’s 500 greatest albums of all time (1955-2011) from cooldatasets.com. I’ve made a copy locally.
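Reading the local copy might look like this; the file name here is a placeholder, not the actual path I used.

```r
# read.csv() returns a plain data frame, which is what prints below
albums <- read.csv("albumlist.csv", stringsAsFactors = FALSE)
head(albums)
```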

##   Number Year                                 Album         Artist       Genre
## 1      1 1967 Sgt. Pepper's Lonely Hearts Club Band    The Beatles        Rock
## 2      2 1966                            Pet Sounds The Beach Boys        Rock
## 3      3 1966                              Revolver    The Beatles        Rock
## 4      4 1965                  Highway 61 Revisited      Bob Dylan        Rock
## 5      5 1965                           Rubber Soul    The Beatles   Rock, Pop
## 6      6 1971                       What's Going On    Marvin Gaye Funk / Soul
##                        Subgenre
## 1 Rock & Roll, Psychedelic Rock
## 2    Pop Rock, Psychedelic Rock
## 3    Psychedelic Rock, Pop Rock
## 4         Folk Rock, Blues Rock
## 5                      Pop Rock
## 6                          Soul

It looks like this dataset is mostly categorical data, with Year and Number being the exceptions. Let’s first see how the genres are distributed. If I don’t map a variable to the y axis, geom_bar takes a discrete categorical variable, creates a group for each level, and plots a count of the observations in each. This can be specified explicitly by setting stat = "count", but it is the default behavior for geom_bar.
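A sketch of that chart, assuming the data frame is named albums as above:

```r
ggplot(albums, aes(x = Genre)) +
  geom_bar() +    # default stat = "count": one bar per genre, height = count
  coord_flip()    # my own addition so the many genre labels stay readable
```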

That is far more genres than I thought there would be. Let’s take a look at the top five genres. geom_bar’s default behavior is to place a count of observations (i.e., stat = "count") on the y axis. In this case, we’ll use dplyr to pre-aggregate our data, and we don’t want ggplot to count it again. We switch to stat = "identity" so the bars represent the values already in the data frame rather than a count of occurrences.
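A sketch of the pre-aggregation and the stat = "identity" bar chart; top_5_genres matches the object inspected below.

```r
top_5_genres <- albums %>%
  group_by(Genre) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%
  top_n(5, count)

ggplot(top_5_genres, aes(x = Genre, y = count)) +
  geom_bar(stat = "identity")   # use the count column as-is, don't re-count
```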

You might have noticed that even though I created top_5_genres with a descending sort, ggplot2 doesn’t arrange the categories in this manner. We’ll take a look at top_5_genres first.

## Classes 'tbl_df', 'tbl' and 'data.frame':    5 obs. of  2 variables:
##  $ Genre: chr  "Rock" "Funk / Soul" "Hip Hop" "Electronic, Rock" ...
##  $ count: int  249 38 29 19 18

Genre has not been defined as a factor, and ggplot has no way to order it unless it is. We’ll use the fct_reorder function within mutate in the following manner:
fct_reorder(categorical_variable, quantitative_variable_to_order_by), or, in other words,
fct_reorder(Genre, count)
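Putting that inside mutate before plotting (a sketch):

```r
library(forcats)

top_5_genres %>%
  mutate(Genre = fct_reorder(Genre, count)) %>%   # order the factor levels by count
  ggplot(aes(x = Genre, y = count)) +
  geom_bar(stat = "identity")
```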

You might also be inclined to encode Genre with color. I’m not a big fan of this, because the color would serve no useful purpose. You would be better off setting color as an attribute here rather than mapping it as an aesthetic.

In the last couple of charts, we used geom_bar with stat = "identity" to help reinforce the difference between that and the default stat = "count". ggplot2 also has the geom geom_col, which was explicitly designed for bar charts built from values already in the data (i.e., its default stat is "identity" instead of "count"). We could recreate the last chart in a slightly more straightforward manner as shown below.
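The geom_col version (a sketch):

```r
top_5_genres %>%
  mutate(Genre = fct_reorder(Genre, count)) %>%
  ggplot(aes(x = Genre, y = count)) +
  geom_col()   # geom_col's default stat is "identity"
```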

Suppose we wanted to examine the count of top 500 records by year. Typically we place time on the x-axis, so we might do something like:
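In sketch form, using the albums data frame from above:

```r
# No y aesthetic, so stat = "bin" counts albums per Year bin
ggplot(albums, aes(x = Year)) +
  geom_line(stat = "bin")
```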

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Video did indeed kill the radio star. I had to specify stat = "bin" in this case because geom_line uses stat = "identity" by default, and since I didn’t create an aesthetic for y, there is no identity value to plot. You’ll also notice I get a message informing me that it is creating 30 bins, which misrepresents the data because the data runs from 1955 to 2011 and therefore spans more than 30 discrete years. The suggestion to “Pick better value with `binwidth`” is a great one. Setting binwidth = 1 gives me a bin, and the corresponding count, for each year. Having fewer bins would tend to smooth the data.
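With the suggested fix, the sketch becomes:

```r
ggplot(albums, aes(x = Year)) +
  geom_line(stat = "bin", binwidth = 1)   # one bin (and one count) per year
```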

For our purposes, it looks like the genres don’t add much to the story, and the legend takes up quite a bit of space. I also want to add a title.
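A sketch of those final touches; the title text here is my own placeholder.

```r
ggplot(albums, aes(x = Year)) +
  geom_line(stat = "bin", binwidth = 1) +
  labs(title = "Rolling Stone's 500 Greatest Albums by Year",
       x = "Year", y = "Count") +
  theme(legend.position = "none")   # only needed if a genre color aesthetic remains
```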

Feel free to review all of the geoms available in ggplot2.

5.5 DataCamp Exercises

The DataCamp exercises are focused on creating visualizations for others. When we develop visualizations for ourselves, we are often creating them to explore data. When we build them for others, we are creating them to communicate.