Chapter 8 Importing Data

This unit spans Mon Mar 23 through Sat Mar 28.
At 11:59 PM on Sat Mar 28 the following items are due:

DC 15 - Importing Data in R: Importing from flat files with utils
DC 16 - Importing Data in R: readr & data.table
DC 17 - Importing Data in R: Importing Excel data

For the majority of this class, we’ll be working with CSV files. Your DataCamp exercises introduce a few other file import methods that are always good to know. There are lots of different packages that deal with importing data. For text files, we will primarily be following the same process over and over with slightly different variants (i.e., this might get a little boring). We’ll be using the readr and readxl packages in this unit.

library(readr)
library(readxl)

8.1 Media

Podcast: Spreadsheets in Data Science (with Jenny Bryan), DataFramed(59:21)

8.2 Delimited text files

Delimited text files are a standard way to transfer data from one system to another. They share a few common characteristics.

They are text files.
They may or may not have a “header” representing column names.
Each line typically represents a record.
They have delimiters.

Delimiters are characters that are used to separate or delimit each field in a line. Tabs and semicolons are common delimiters, but the most common is the comma. Comma-delimited text files are also called CSV files (comma-separated values).

When reading in any files, you should know the answers to the following questions:

what is the file format? (e.g., delimited text, Excel)
does the first line represent a header?
do I want character data to be converted to factors on import? (usually, the answer is no)

8.2.1 read.csv and read_csv

There isn’t much to CSV files. Below is the first few lines of the got.csv data we used in a prior unit.

lastname,firstname,major,year,gpa
Snow,John,Nordic Studies,Junior,3.23
Lannister,Tyrion,Communications,Sophomore,3.83
Targaryen,Daenerys,Zoology,Freshman,3.36

We have already been using read.csv. It is part of the utils package that comes with R and is always loaded when you launch R (the other default packages are base, graphics and stats). We never specify any of these packages using library because they are always in our R session. So far, we have been using read.csv to read files that are stored on a website and wrapping the file names with url(). We can also copy text files to our working directory or a folder in our working directory. In this example, I downloaded the got.csv file to a folder called data in my working directory. Notice I’m no longer using url() because I’m reading in the file from my hard drive.

got <- read.csv("./data/got.csv", stringsAsFactors = FALSE)
head(got)

##    lastname firstname          major      year  gpa
## 1      Snow      John Nordic Studies    Junior 3.23
## 2 Lannister    Tyrion Communications Sophomore 3.83
## 3 Targaryen  Daenerys        Zoology  Freshman 3.36
## 4    Bolton    Ramsay        Phys Ed  Freshman 2.24
## 5     Stark    Eddard        History    Senior 2.78
## 6   Clegane    Gregor        Phys Ed Sophomore 3.23

I read in the file without converting the character data to factors. From the tidyverse package readr, I can also use read_csv to read in the CSV file as well. Not converting to factors is the default behavior.

got <- read_csv("./data/got.csv")
head(got)

## # A tibble: 6 x 5
##   lastname  firstname major          year        gpa
##   <chr>     <chr>     <chr>          <chr>     <dbl>
## 1 Snow      John      Nordic Studies Junior     3.23
## 2 Lannister Tyrion    Communications Sophomore  3.83
## 3 Targaryen Daenerys  Zoology        Freshman   3.36
## 4 Bolton    Ramsay    Phys Ed        Freshman   2.24
## 5 Stark     Eddard    History        Senior     2.78
## 6 Clegane   Gregor    Phys Ed        Sophomore  3.23

European delimited files sometimes use the semicolon ; as a delimiter. In that case, we can use read.csv2 or read_csv2 in the same manner.

8.2.2 other delimiters

Tabs are also common delimiters. Looking at the first few lines of got_tab.txt

lastname    firstname    major    year    gpa
Snow    John    Nordic Studies    Junior    3.23
Lannister    Tyrion    Communications    Sophomore    3.83
Targaryen    Daenerys    Zoology    Freshman    3.36

We can use utils::read.delim or readr::read_tsv to read in a tab-delimited file in pretty much the same manner as read.csv and read_csv.

got <- read.delim("./data/got_tab.txt", stringsAsFactors = FALSE)
head(got)

##    lastname firstname          major      year  gpa
## 1      Snow      John Nordic Studies    Junior 3.23
## 2 Lannister    Tyrion Communications Sophomore 3.83
## 3 Targaryen  Daenerys        Zoology  Freshman 3.36
## 4    Bolton    Ramsay        Phys Ed  Freshman 2.24
## 5     Stark    Eddard        History    Senior 2.78
## 6   Clegane    Gregor        Phys Ed Sophomore 3.23

got <- read_tsv("./data/got_tab.txt")
head(got)

## # A tibble: 6 x 5
##   lastname  firstname major          year        gpa
##   <chr>     <chr>     <chr>          <chr>     <dbl>
## 1 Snow      John      Nordic Studies Junior     3.23
## 2 Lannister Tyrion    Communications Sophomore  3.83
## 3 Targaryen Daenerys  Zoology        Freshman   3.36
## 4 Bolton    Ramsay    Phys Ed        Freshman   2.24
## 5 Stark     Eddard    History        Senior     2.78
## 6 Clegane   Gregor    Phys Ed        Sophomore  3.23

For other delimiters, we would use read.table or read_delim. What is the delimiter in the following file?

lastname|firstname|major|year|gpa
Snow|John|Nordic Studies|Junior|3.23
Lannister|Tyrion|Communications|Sophomore|3.83
Targaryen|Daenerys|Zoology|Freshman|3.36

The | is called a pipe and is being used as a delimiter in this case. The process is pretty straightforward for both versions using the file got_pipe.txt.

got <- read.table("./data/got_pipe.txt", header = TRUE,
                  sep = "|", stringsAsFactors = FALSE)
head(got)

##    lastname firstname          major      year  gpa
## 1      Snow      John Nordic Studies    Junior 3.23
## 2 Lannister    Tyrion Communications Sophomore 3.83
## 3 Targaryen  Daenerys        Zoology  Freshman 3.36
## 4    Bolton    Ramsay        Phys Ed  Freshman 2.24
## 5     Stark    Eddard        History    Senior 2.78
## 6   Clegane    Gregor        Phys Ed Sophomore 3.23

got <- read_delim("./data/got_pipe.txt", delim = "|")
head(got)

## # A tibble: 6 x 5
##   lastname  firstname major          year        gpa
##   <chr>     <chr>     <chr>          <chr>     <dbl>
## 1 Snow      John      Nordic Studies Junior     3.23
## 2 Lannister Tyrion    Communications Sophomore  3.83
## 3 Targaryen Daenerys  Zoology        Freshman   3.36
## 4 Bolton    Ramsay    Phys Ed        Freshman   2.24
## 5 Stark     Eddard    History        Senior     2.78
## 6 Clegane   Gregor    Phys Ed        Sophomore  3.23

Older, mainframe data, sometimes comes in fixed-width files. For example, below is the first few lines of got.prn which is a fixed-width file. Note: fixed width files usually don’t have headers so I’ll need to create column names. If you are lucky, they come with a definition file that tells you the column widths. Often you won’t have this file and will have to count spaces manually.

Snow      John      Nordic StudiesJunior          3.23
Lannister Tyrion    CommunicationsSophomore       3.83
Targaryen Daenerys  Zoology       Freshman        3.36

For this file, we can use read.fwf or read_fwf

got <- read.fwf("./data/got.prn", widths = c(10,10,14,16,4), 
                stringsAsFactors = FALSE)
colnames(got) <- c("lastname", "firstname", "major", "year", "gpa")
head(got)

##     lastname  firstname          major             year  gpa
## 1 Snow       John       Nordic Studies Junior           3.23
## 2 Lannister  Tyrion     Communications Sophomore        3.83
## 3 Targaryen  Daenerys   Zoology        Freshman         3.36
## 4 Bolton     Ramsay     Phys Ed        Freshman         2.24
## 5 Stark      Eddard     History        Senior           2.78
## 6 Clegane    Gregor     Phys Ed        Sophomore        3.23

got <- read_fwf("./data/got.prn", fwf_widths(c(10,10,14,16,4),
                col_names = c("lastname", "firstname", "major", "year", "gpa")))
head(got)

## # A tibble: 6 x 5
##   lastname  firstname major          year        gpa
##   <chr>     <chr>     <chr>          <chr>     <dbl>
## 1 Snow      John      Nordic Studies Junior     3.23
## 2 Lannister Tyrion    Communications Sophomore  3.83
## 3 Targaryen Daenerys  Zoology        Freshman   3.36
## 4 Bolton    Ramsay    Phys Ed        Freshman   2.24
## 5 Stark     Eddard    History        Senior     2.78
## 6 Clegane   Gregor    Phys Ed        Sophomore  3.23

Notice that in the readr version, the parameter syntax is slightly different and I can set column names inline in the read_fwf statement. I still often use the utils versions out of habit.

8.3 Excel data

Generally speaking, Excel data can be a pain to work with in R. If you haven’t already done so, please listen to the podcast assigned for this unit as it does a great job discussing spreadsheets in the analytics process.

Reading Excel data in R, provided it is formatted similarly to the CSV files we worked with is relatively straightforward. I have an Excel workbook got.xlsx that has two worksheets. “got,” which represents the data we’ve been working with and “iron_throne,” which contains the characters from the “got” data set that have sat on the Iron Throne. By default, readxl’s read_excel function will read in the first worksheet in a workbook.

got <- read_excel("./data/got.xlsx")
head(got)

## # A tibble: 6 x 5
##   lastname  firstname major          year        gpa
##   <chr>     <chr>     <chr>          <chr>     <dbl>
## 1 Snow      John      Nordic Studies Junior     3.23
## 2 Lannister Tyrion    Communications Sophomore  3.83
## 3 Targaryen Daenerys  Zoology        Freshman   3.36
## 4 Bolton    Ramsay    Phys Ed        Freshman   2.24
## 5 Stark     Eddard    History        Senior     2.78
## 6 Clegane   Gregor    Phys Ed        Sophomore  3.23

If we want to read in the second worksheet, we could either specify sheet = 2 or use the worksheet name as shown below.

iron_throne <- read_excel("./data/got.xlsx", sheet = "iron_throne")
head(iron_throne)

## # A tibble: 2 x 5
##   lastname  firstname major   year       gpa
##   <chr>     <chr>     <chr>   <chr>    <dbl>
## 1 Stark     Eddard    History Senior    2.78
## 2 Baratheon Joffrey   History Freshman  1.87

8.4 DataCamp Exercises

We aren’t going to be working with too many messy Excel files in this course, but the DataCamp exercises go into some more specifics with the parameters in all of the read files functions. Skipping lines and identifying spreadsheet ranges to read is very helpful when you are dealing with larger, messy files.