Chapter 9 Tidy data

This unit spans Mon Mar 30 through Sat Apr 04.
At 11:59 PM on Sat Apr 04 the following items are due:

  • DC 18 - Cleaning Data in R: Introduction
  • DC 19 - Cleaning Data in R: Tidying data
  • DC 20 - Cleaning Data in R: Preparing data for analysis
  • DC 21 - Cleaning Data in R: Putting it all together

This unit introduces cleaning and tidying data. If you end up working in analytics, this is likely how you will spend most of your time because data is messy.

9.1 Media

  • Reading: Tidy Data, Wickham, Journal of Statistical Software

9.2 Tidy data

If you have had some exposure to relational database normalization, the concept of tidy data is relatively simple to understand – it is essentially E.F. Codd’s third normal form. I’m going to go out on a limb and assume that most students aren’t familiar with normalization or what the third normal form is. Through the lens of an analyst, tidy data has the following attributes:

  • each variable forms a column
  • each observation forms a row
  • each type of observational unit forms a table

The assigned Wickham article is very well written and goes into great detail with multiple examples of tidy and messy data. We are going to use a slightly messy data set to help reinforce the understanding you’ll gain from reading the article.

9.3 Reading in a text file

We will use the file Maine population 1960 - 2010 by counties in this section. So far, we haven’t spent any time discussing file management in R. For this unit, we’ll keep it simple and keep our .R script file and our .csv data file in the same working directory. In the next unit, we’ll introduce the project management features of RStudio.

The working directory is merely the directory/folder that R assumes you are working in. One of the advantages of this is that I can refer to a file in my working directory without giving R the entire file path. You can always check what your working directory is set to by typing getwd(). You can also set your working directory with the setwd() command, but since we are using RStudio, we’ll do that using the user interface.

After you launch RStudio, use the Files pane to navigate to a directory/folder where you want to store your work. If you want to create a new folder, navigate to its parent folder first and then click “New Folder” in the Files pane. Next, navigate to the new folder via the Files pane and, under the More drop-down menu, select “Set As Working Directory.”

Now move the CSV file you downloaded to your working directory. If you forgot the path to your working directory, remember you can see it by typing getwd().

Now go ahead and create a new R Script file in your working directory (File –> New File –> R Script). Give it any name you like, but I recommend you use the script file for this unit instead of the console so you can save and review your work. Reminder: you can click Run to run any line from your script file.

We are going to use read.csv to read the Maine population CSV file into a data frame. You’ll need to remember the file name and whether or not the file has a header line.

We didn’t name the CSV file, and it has a rather long name, but it is an acceptable file name. Generally speaking, you want meaningful file names and should avoid spaces and special characters (other than underscore _ and hyphen -).
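
Something like the following chunk reads the file and produces the output below (the file name is a placeholder for whatever you downloaded, and the object name county_pop is simply my choice):

library(tidyverse)   # loads tidyr, dplyr, and stringr, used throughout this unit

# Placeholder file name -- substitute the name of the file you downloaded
county_pop <- read.csv("Maine_Population_1960-2010_by_counties.csv",
                       header = TRUE, stringsAsFactors = FALSE)
dim(county_pop)      # dimensions: rows, columns
names(county_pop)    # column (variable) names
head(county_pop)     # first six rows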

## [1] 16  8
## [1] "County"                 "Total.Population..1960" "Total.Population..1970"
## [4] "Total.Population..1980" "Total.Population..1990" "Total.Population..2000"
## [7] "Total.Population..2010" "Location"
##                       County Total.Population..1960 Total.Population..1970
## 1 Androscoggin County, Maine                  86312                  91279
## 2    Aroostook County, Maine                 106064                  94078
## 3   Cumberland County, Maine                 182751                 192528
## 4     Franklin County, Maine                  20069                  22444
## 5      Hancock County, Maine                  32293                  34590
## 6     Kennebec County, Maine                  89150                  95306
##   Total.Population..1980 Total.Population..1990 Total.Population..2000
## 1                  99509                 105259                 103793
## 2                  91344                  86936                  73938
## 3                 215789                 243135                 265612
## 4                  27447                  29008                  29467
## 5                  41781                  46948                  51791
## 6                 109889                 115904                 117114
##   Total.Population..2010                  Location
## 1                 107702 (44.1970089, -70.2376963)
## 2                  71870 (46.7270816, -68.6494264)
## 3                 281674 (43.8491436, -70.3617521)
## 4                  30768 (45.0633843, -70.3617521)
## 5                  54418 (44.6682975, -68.4046119)
## 6                 122151 (44.4140565, -69.7526525)

We’ve introduced a couple of new functions that overlap with more comprehensive functions like str. dim shows us the dimensions of the data frame as rows and columns. names shows us the variable names in the data frame (it returns the column names as a character vector and can also be used to set/rename columns). Looking at the data frame, we can tell that it isn’t tidy. If I look at the six variables dealing with population, I can see several of the issues described in section 3 of Wickham’s Tidy Data article:

  • The Total.Population... columns contain both a variable name population and a value (e.g., 1970).
  • The population columns store multiple variables (population and year).
  • The County column stores multiple variables (county and state).
  • The Location column stores multiple variables (latitude and longitude)…also, we won’t be using this data.

When analysts are confronted with a data set like this, they often call it a “wide” data set. While Wickham considers wide datasets un-tidy, there are some occasions where wide data is desirable, including:

  • some tabular reports
  • some specific types of statistical analysis (e.g., repeated measures, MANCOVA)

When we transform wide data to long (or tall) data, we are making it tidy (note: Wickham states he explicitly avoids using the terms wide and tall because they are imprecise). We are going to tidy our county population data using the tidyr package, which loads with the tidyverse.

Before we do any of that, I’m going to rename my columns using all lowercase and change the population variable names to just their year.
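
One way to do that (a sketch; fixed() makes str_replace treat the dots literally rather than as regex metacharacters):

names(county_pop) <- tolower(names(county_pop))
names(county_pop) <- str_replace(names(county_pop), fixed("total.population.."), "")
names(county_pop)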

## [1] "county"   "1960"     "1970"     "1980"     "1990"     "2000"     "2010"    
## [8] "location"

9.3.1 Wide to long and back

We are going to transform county_pop from wide to long in steps and then show you how to do it in one fell swoop. First, I’m going to address the population/year combined variable.
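
A sketch of that first step (county_pop2 is my name for the result):

county_pop2 <- gather(county_pop, year, population, -county, -location)
head(county_pop2)
str(county_pop2)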

##                       county                  location year population
## 1 Androscoggin County, Maine (44.1970089, -70.2376963) 1960      86312
## 2    Aroostook County, Maine (46.7270816, -68.6494264) 1960     106064
## 3   Cumberland County, Maine (43.8491436, -70.3617521) 1960     182751
## 4     Franklin County, Maine (45.0633843, -70.3617521) 1960      20069
## 5      Hancock County, Maine (44.6682975, -68.4046119) 1960      32293
## 6     Kennebec County, Maine (44.4140565, -69.7526525) 1960      89150
## 'data.frame':    96 obs. of  4 variables:
##  $ county    : chr  "Androscoggin County, Maine" "Aroostook County, Maine" "Cumberland County, Maine" "Franklin County, Maine" ...
##  $ location  : chr  "(44.1970089, -70.2376963)" "(46.7270816, -68.6494264)" "(43.8491436, -70.3617521)" "(45.0633843, -70.3617521)" ...
##  $ year      : chr  "1960" "1960" "1960" "1960" ...
##  $ population: int  86312 106064 182751 20069 32293 89150 28575 18497 44345 126346 ...

The gather function in tidyr collapses your data into key-value pairs. This is going to be difficult to grasp until you’ve done it a few times. In this case, we were dealing with columns that each contained two variables. For example, the column we renamed to 1960 contains the year (1960, the key) and the population for that county (the value). The key:value pair is year:population, so if you look at the arguments to gather, they are:

  1. the data frame
  2. the key year
  3. the value population
  4. the columns that aren’t part of the key:value pair - county and location

There is a lot of complexity to gather that we aren’t covering, but the primary challenge in using gather is identifying the key:value pair.

One other thing to note is that gather stores the key (i.e., year) as a character vector by default (the factor_key argument can make it a factor instead). We haven’t discussed factors until now, but factors are stored in two components. The distinct values, called levels, are stored as an internal vector of character strings (e.g., “1960”, “1970”, …), and the observations are stored as integers (e.g., 1, 2, …) that map to those levels (e.g., 1 = “1960”, 2 = “1970”, etc.). Either way, we can’t perform arithmetic calculations on a year stored as text, and we want to in this situation, so we use the convert = TRUE argument, which we normally want to do when the key is really numeric, integer, or logical.
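
The same gather call with convert = TRUE:

county_pop2 <- gather(county_pop, year, population, -county, -location, convert = TRUE)
str(county_pop2)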

## 'data.frame':    96 obs. of  4 variables:
##  $ county    : chr  "Androscoggin County, Maine" "Aroostook County, Maine" "Cumberland County, Maine" "Franklin County, Maine" ...
##  $ location  : chr  "(44.1970089, -70.2376963)" "(46.7270816, -68.6494264)" "(43.8491436, -70.3617521)" "(45.0633843, -70.3617521)" ...
##  $ year      : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
##  $ population: int  86312 106064 182751 20069 32293 89150 28575 18497 44345 126346 ...

The next item we will clean in our data frame is the way county contains extraneous information (e.g., county should simply contain “Androscoggin” and not “Androscoggin County, Maine”).
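
One way to do this (both functions are explained below):

county_pop2 <- mutate(county_pop2,
                      county = str_replace(county, " County, Maine", ""))
head(county_pop2)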

##         county                  location year population
## 1 Androscoggin (44.1970089, -70.2376963) 1960      86312
## 2    Aroostook (46.7270816, -68.6494264) 1960     106064
## 3   Cumberland (43.8491436, -70.3617521) 1960     182751
## 4     Franklin (45.0633843, -70.3617521) 1960      20069
## 5      Hancock (44.6682975, -68.4046119) 1960      32293
## 6     Kennebec (44.4140565, -69.7526525) 1960      89150

The mutate function in dplyr can be used to change the values of a variable. The str_replace function is part of the pattern matching and replacement functions in tidyverse’s stringr package that use regular expressions. Here the arguments define where the string resides (county), the pattern (" County, Maine"), and the replacement (""). If we still wanted to keep the state name and add a variable state, we could use mutate to add it to our data frame with county_pop2 <- mutate(county_pop2, state = "Maine"), but I’ll assume we don’t need to add that data.

Finally, we want to drop the location data from our data frame.
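
Using dplyr’s select:

county_pop2 <- select(county_pop2, -location)
head(county_pop2)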

##         county year population
## 1 Androscoggin 1960      86312
## 2    Aroostook 1960     106064
## 3   Cumberland 1960     182751
## 4     Franklin 1960      20069
## 5      Hancock 1960      32293
## 6     Kennebec 1960      89150

By using -location, I’m telling select to select all the columns except the location column. I could have also used select(county_pop2, county, year, population) for the same result.

If you look at what we did to our data frame after changing the variable names, we transformed it from wide to long using a key:value pair, then removed the extraneous " County, Maine" from the county names, and finally removed the location column. Below, I’ll perform the same operations in a different sequence using the pipe operator %>%:

  1. Drop location
  2. Remove " County, Maine"
  3. Transform from wide to long (i.e., tidy the data frame).
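
A sketch of that pipeline (county_pop_tidy is my name for the result):

county_pop_tidy <- county_pop %>%
  select(-location) %>%
  mutate(county = str_replace(county, " County, Maine", "")) %>%
  gather(year, population, -county, convert = TRUE)
head(county_pop_tidy)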
##         county year population
## 1 Androscoggin 1960      86312
## 2    Aroostook 1960     106064
## 3   Cumberland 1960     182751
## 4     Franklin 1960      20069
## 5      Hancock 1960      32293
## 6     Kennebec 1960      89150

If I want to convert the data frame back to a wide format, I can use the spread function in tidyr fairly easily…all I have to do is identify the key and value (which we did earlier). I’ll use the pipe notation from here on out when applicable.
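
For example:

county_pop_tidy %>%
  spread(year, population) %>%
  head()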

##         county   1960   1970   1980   1990   2000   2010
## 1 Androscoggin  86312  91279  99509 105259 103793 107702
## 2    Aroostook 106064  94078  91344  86936  73938  71870
## 3   Cumberland 182751 192528 215789 243135 265612 281674
## 4     Franklin  20069  22444  27447  29008  29467  30768
## 5      Hancock  32293  34590  41781  46948  51791  54418
## 6     Kennebec  89150  95306 109889 115904 117114 122151

I encourage you to review the vignettes (which are long-form documentation with examples) for dplyr and tidyr. You can access them with vignette("introduction", package="dplyr") and vignette("tidy-data", package="tidyr"). They are usually a good overview of the package with lots of examples.

9.4 String manipulation

We already used str_replace in the example above. Some other common stringr functions include:

  • str_trim - trims trailing and leading whitespace, often an issue with fixed-width files.
  • str_detect - used for detecting patterns (commonly used in data validation and cleaning, also in text analytics)

str_trim is fairly straightforward:
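
For example, with some padding added to the string:

str_trim("   Hello World!   ")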

## [1] "Hello World!"

For str_detect we’ll introduce a pattern matching language called regular expressions.

9.5 Regular expressions

The ability to manipulate strings is an important skill to have when cleaning data and also when doing lexical analysis. One of the standard languages for finding patterns in strings is regular expressions. There are a variety of ways to implement a regular expression (also known as regex or regexp) in R. The stringr functions we just introduced support regex patterns. In the examples below, we use str_detect in combination with regular expressions. In the first example, we see an issue: the input vector, block_of_text, has only one element, so as long as the pattern matches anywhere in the text, str_detect returns a single TRUE for that one element [1].
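
The text assigned below is a short stand-in for the original block_of_text, which isn’t reproduced here (the outputs in this section come from the original, longer paragraph):

# A hypothetical stand-in for the original paragraph about aluminum
block_of_text <- paste(
  "Aluminum, or Aluminium as it is spelled outside North America,",
  "is a light metal. Aluminium alloys are common,",
  "but some composite materials are nonaluminum."
)
str_detect(block_of_text, "[Aa]lumini?um")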

## [1] TRUE

We can fix this problem by performing tokenization, which breaks the string down into individual parts, in this case words. We’ll use the strsplit() function in R to accomplish this. Since strsplit() returns a list, we use unlist to store the result as a vector. Note that we are using a single space as the separator, so punctuation gets attached to the words. Because str_detect takes the character vector as its first argument, we can use the pipe operator.
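
A sketch (word_vector is my name for the result):

word_vector <- unlist(strsplit(block_of_text, " "))  # split on single spaces
word_vector %>% str_detect("[Aa]lumini?um")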

##  [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE  TRUE

If we want to see the actual words that match our pattern, we can use the logical vector as an index into word_vector.
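
For example:

word_vector[str_detect(word_vector, "[Aa]lumini?um")]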

## [1] "Aluminum,"    "Aluminium"    "Aluminium"    "nonaluminum."

You’ll notice that our regular expression matches the word nonaluminum as well. If this isn’t intended and we want only the word itself, we should delimit the boundaries of the word. In regex, word boundaries are marked with the metacharacter \b, so our regular expression becomes \b[Aa]lumini?um\b, where \b matches the position between a word character and a non-word character (e.g., a space or punctuation). One thing to keep in mind is that R uses the backslash character to start escape sequences, so inside a string enclosed in quotes we write the expression as \\b[Aa]lumini?um\\b, where the double backslash tells R we mean a literal backslash.
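
Applying the bounded pattern:

word_vector[str_detect(word_vector, "\\b[Aa]lumini?um\\b")]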

## [1] "Aluminum," "Aluminium" "Aluminium"

Regular expressions are commonly used in analytics to turn text into data for analysis. To do this, we find the particular characters, words, or patterns that we want to include in, or exclude from, the analysis. In these tutorials, we are only going to scratch the surface of what can be accomplished with regular expressions.

Below is a list of potential email addresses:

  1. student@maine.edu
  2. student@maine.com
  3. student@gmail.com
  4. student.maine.edu
  5. student@..edu
  6. student2@usm.edu
  7. student@maine.edu.com

If I asked you which emails are from educational domains, the correct answer would be 1 and 6. We will build our regular expression to accomplish this in a stepwise manner. First, I want to capture all emails that end in “.edu”. Using syntax we have already learned, you might first attempt .edu\b, but you would immediately run into a problem because . is a metacharacter that means “any character except line break”. Therefore, we use \. to escape the metacharacter and tell regex we mean the literal character .. This gives us \.edu\b, which excludes addresses 2 and 3, but you might not expect it to match address 7. It does, because the \b word boundary accepts space or punctuation as the end of the word. We use the end-of-string anchor $ instead, giving us \.edu$, which excludes address 7 as well.

Well-formed email addresses have the format <local part>@<lower level domain>.<top level domain>. So far, we have addressed the top-level domain (.edu). At this point, we’ll over-simplify and present a solution that will catch a majority of the malformed emails, but probably not all of them. The lower domain may contain numbers, letters, and certain characters (., -, _) but must end with a letter or number, and the local part allows the same set of characters plus % and +. There are some other constraints on email addresses, as well as character-set issues, that we aren’t going to discuss. To summarize, the remaining sections we have to account for are:

  • the character before the .edu: a letter or number
  • the rest of the lower domain: letters, numbers, ., -, _
  • the @ character
  • the local part: letters, numbers, ., -, _, %, +

To address the character before the .edu (a letter or number), we use the character class brackets [] with the syntax A-Z to represent letters and 0-9 to represent numbers (when we implement this in R, we will tell str_detect to accept upper- or lowercase letters). Our regular expression now becomes [A-Z0-9]\.edu$, which rules out address 5, where a dot sits immediately before the .edu.

For the rest of the lower domain, we need to add the characters ., -, and _ to our regular expression and apply it to multiple characters. Inside character class brackets, most metacharacters lose their special meaning; the notable exceptions are the backslash and the hyphen (which designates a character range). If we put the hyphen as the last character in the class brackets, we don’t need to escape it, and the character set becomes [A-Z0-9._-]. To match multiple characters, we use the + quantifier, which means “one or more of the pattern to the left”. Our updated expression becomes [A-Z0-9._-]+[A-Z0-9]\.edu$.

Adding the @ character is very simple: @[A-Z0-9._-]+[A-Z0-9]\.edu$. This rules out address 4, which contains no @ at all.

Finally, the local part just adds the characters % and + to our character class, giving [A-Z0-9._%+-]. Using the + quantifier again, along with the start-of-string anchor ^, we get ^[A-Z0-9._%+-]+@[A-Z0-9._-]+[A-Z0-9]\.edu$.

Now when we apply str_detect in R, we just need to escape the backslash to indicate that we mean a literal backslash and not an escape sequence (i.e., \\), and we use the ignore_case = TRUE option nested in the regex() function to tell str_detect to conduct a case-insensitive search based on our regular expression. We could also use case-conversion functions to accomplish the same thing.
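
A sketch, assuming the candidate addresses above are stored in a character vector named emails (my choice):

# Hypothetical vector mirroring the numbered list above
emails <- c("student@maine.edu", "student@maine.com", "student@gmail.com",
            "student.maine.edu", "student@..edu", "student2@usm.edu",
            "student@maine.edu.com")
emails[str_detect(emails, regex("^[A-Z0-9._%+-]+@[A-Z0-9._-]+[A-Z0-9]\\.edu$",
                                ignore_case = TRUE))]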

## [1] "student@maine.edu" "student2@usm.edu"

Finally, we’ll take a quick look back at our county_pop example.
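
Recall the wide data frame with its renamed columns:

head(county_pop)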

##                       county   1960   1970   1980   1990   2000   2010
## 1 Androscoggin County, Maine  86312  91279  99509 105259 103793 107702
## 2    Aroostook County, Maine 106064  94078  91344  86936  73938  71870
## 3   Cumberland County, Maine 182751 192528 215789 243135 265612 281674
## 4     Franklin County, Maine  20069  22444  27447  29008  29467  30768
## 5      Hancock County, Maine  32293  34590  41781  46948  51791  54418
## 6     Kennebec County, Maine  89150  95306 109889 115904 117114 122151
##                    location
## 1 (44.1970089, -70.2376963)
## 2 (46.7270816, -68.6494264)
## 3 (43.8491436, -70.3617521)
## 4 (45.0633843, -70.3617521)
## 5 (44.6682975, -68.4046119)
## 6 (44.4140565, -69.7526525)

Notice the pattern is <county name> County, <state name>. In this case, there is only one state, but if we wanted a more generic pattern that would work with all states, we would be looking at removing the word " County" and everything to the right of it. The pattern for this is straightforward: "\sCounty.*", where \s matches a whitespace character.
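
Applying it (note the doubled backslash inside the R string):

county_pop %>%
  mutate(county = str_replace(county, "\\sCounty.*", "")) %>%
  head()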

##         county   1960   1970   1980   1990   2000   2010
## 1 Androscoggin  86312  91279  99509 105259 103793 107702
## 2    Aroostook 106064  94078  91344  86936  73938  71870
## 3   Cumberland 182751 192528 215789 243135 265612 281674
## 4     Franklin  20069  22444  27447  29008  29467  30768
## 5      Hancock  32293  34590  41781  46948  51791  54418
## 6     Kennebec  89150  95306 109889 115904 117114 122151
##                    location
## 1 (44.1970089, -70.2376963)
## 2 (46.7270816, -68.6494264)
## 3 (43.8491436, -70.3617521)
## 4 (45.0633843, -70.3617521)
## 5 (44.6682975, -68.4046119)
## 6 (44.4140565, -69.7526525)

9.6 DataCamp Exercises

The DataCamp exercises add some more depth to tidyr and give more straightforward examples of string manipulation. There are also some helpful tips on working with dates.