Chapter 12 Web scraping
This unit spans Mon Apr 20 through Sat Apr 25.
At 11:59 PM on Sat Apr 25 the following items are due:
- Assignment 3
Assignment 3 has you working with Twitter data, which we covered in the last unit. The following section on web scraping is not needed to complete Assignment 3; it merely rounds out your “working with web data” knowledge.
12.1 Media
Reading: Web scraping, legal issues (US), Wikipedia
rvest, easy web scraping with R, Wickham
Video: SelectorGadget
12.2 Assignment 3
Assignment 3 is exceptionally unstructured, so feel free to get creative. The instructions are simple: do something cool with Twitter data. Your report should include some visual elements, and please remember to use the code chunk option echo = FALSE to mask (i.e., not show) your authentication keys for Twitter. To keep your creative juices flowing, I’m not going to break down this assignment with a point allocation. Remember to post your RPubs link in Piazza like you normally do.
12.3 Introduction
If you find yourself repeatedly pulling information from websites for an analysis project, web scraping can be invaluable. APIs like the one we used with Twitter are always preferred, but not all sites offer one. Please keep in mind the following caveats.
- while the act of scraping is relatively harmless, reposting others’ intellectual property as your own can present some ethical and legal issues.
- you should avoid placing too much of a load on the servers (e.g., continual scraping). Many sites will ban you for potentially degrading their site performance.
- scraping programs are brittle. If a website design changes, it is highly likely that your scraping program will no longer work.
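One way to act on the server-load and brittleness caveats is to pause between requests and let a single failed page pass without stopping the run. The helper below is a minimal sketch, not part of rvest: the name fetch_politely and its arguments are hypothetical, and the two-second default delay is an arbitrary assumption.

```r
# Hypothetical helper: fetch each page in turn, pausing between requests.
# `fetch` defaults to xml2::read_html but can be swapped for any function
# that takes a URL, which also makes the helper easy to try offline.
fetch_politely <- function(urls, fetch = xml2::read_html, delay = 2) {
  pages <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    # A failed request yields NULL instead of aborting the whole loop.
    result <- tryCatch(fetch(urls[[i]]), error = function(e) NULL)
    # Assign via list() so that a NULL result keeps its slot in `pages`.
    pages[i] <- list(result)
    # Wait before the next request, but not after the last one.
    if (i < length(urls)) Sys.sleep(delay)
  }
  pages
}
```

Calling fetch_politely(urls, delay = 5) on a character vector of page addresses would return a list with one parsed page (or NULL) per address.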
We’ll be scraping the USM athletics site for sporting event information. First, we’ll load the libraries we’ll be working with in this unit. rvest will not function properly without the XML package, so install it if you don’t already have it. Note that I rescrape the website every time I update these notes, so the video, the notes, and what you see when you run the code will not show the same data.
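The library-loading step might look like the sketch below. The package list is an assumption inferred from the functions used later in this unit (html_nodes(), str_replace_all(), colsplit()), not a copy of the original chunk.

```r
library(rvest)     # html_nodes(), html_text(); also makes read_html() and %>% available
library(stringr)   # str_replace_all()
library(reshape2)  # colsplit()
```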
Next we’ll use the read_html() function from the xml2 package, which rvest depends on, to read the entire HTML file into a variable called husky_events.
No events are scheduled in April, so the default page shows nothing. I am using February 2020 instead. You can change the month in the URL to scrape data from any month.
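To see what read_html() returns without depending on the live site, the sketch below parses a small stand-in page. The HTML string is invented for illustration and only borrows one class name from the selectors used in this unit; in the real workflow you would pass the events page URL instead.

```r
library(rvest)

# A stand-in page, made up for this example (not the real USM markup).
sample_html <- '<html><body>
  <div class="description-container"><a href="#">(Wrestling) WPI</a></div>
</body></html>'

# read_html() accepts a URL, a file path, or a literal HTML string.
page <- read_html(sample_html)

# The result is a parsed document object that html_nodes() can search.
class(page)
```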
Before we proceed, you need to install the Chrome extension SelectorGadget. If you don’t have Chrome, install it first. Finally, if you haven’t already done so, watch the SelectorGadget video. The video is for the bookmarklet, but the extension functions in the same manner.
After we use SelectorGadget on the USM athletic events page, we get the following mappings:
- event name: .description-container a
- month name (short): .month-display .date-display-single
- day of month: .date-display
- time: .time-location-container .date-display-single
SelectorGadget returns the CSS selector for the information we want, that is, the HTML/CSS tag, class, and id information that identifies it. If you look at the source, the shorthand for referencing the selector is:
- class is referenced with a .
- id is referenced with a #
- tags without classes or ids are referenced by name (i.e., no special character).
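The three kinds of selector can be tried on a tiny page. This is a sketch on made-up HTML; the id title and class sport are assumptions chosen for the example. In rvest 1.0.0 and later, html_elements() is the preferred name for html_nodes(), but html_nodes() still works.

```r
library(rvest)

# A made-up page with one id, one repeated class, and a plain tag.
page <- read_html('<html><body>
  <h1 id="title">Events</h1>
  <p class="sport">Wrestling</p>
  <p class="sport">Ice Hockey</p>
  <p>Schedule subject to change</p>
</body></html>')

by_class <- page %>% html_nodes(".sport") %>% html_text()  # class: leading .
by_id    <- page %>% html_nodes("#title") %>% html_text()  # id: leading #
by_tag   <- page %>% html_nodes("p") %>% html_text()       # tag: bare name

by_class
by_id
by_tag
```

Separating two selectors with a space, as in .month-display .date-display-single, matches elements of the second kind nested anywhere inside elements of the first.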
We use rvest’s html_nodes() and html_text() functions to retrieve the data we want. We’ll describe these functions in a little more detail later in this unit.
event_sport_name <- husky_events %>% html_nodes(".description-container a") %>% html_text()
month <- husky_events %>% html_nodes(".month-display .date-display-single") %>% html_text()
day <- husky_events %>% html_nodes(".date-display") %>% html_text()
time <- husky_events %>% html_nodes(".time-location-container .date-display-single") %>% html_text()
If we look at event_sport_name we can see that we’ll need to do a little cleaning.
## [1] "(Wrestling) Roger Williams"
## [2] "(Wrestling) SUNY Brockport"
## [3] "(Wrestling) WPI"
## [4] "National Girls & Women in Sports Day Event"
## [5] "(Women's Basketball) Keene St. vs. Southern Me."
## [6] "(Women's Ice Hockey) Suffolk vs. Southern Me."
Refer to the last unit if you are not familiar with what we are doing here. colsplit() is a function from reshape2 that splits strings into multiple columns in a data frame.
event_sport_name <- event_sport_name %>% str_replace_all("\\(", "")
events <- colsplit(event_sport_name, "\\) ", names = c("sport", "event"))
head(events)
## sport event
## 1 Wrestling Roger Williams
## 2 Wrestling SUNY Brockport
## 3 Wrestling WPI
## 4 National Girls & Women in Sports Day Event
## 5 Women's Basketball Keene St. vs. Southern Me.
## 6 Women's Ice Hockey Suffolk vs. Southern Me.
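Row 4 above has no parenthesized sport, so the whole string lands in the sport column. As an alternative sketch (not the method used in these notes), a regular expression with an optional "(sport) " prefix keeps such entries in the event column and leaves sport as NA. The sample strings are copied from the event_sport_name output; the name events_alt is hypothetical.

```r
library(stringr)

# Sample values copied from the event_sport_name output shown earlier.
raw <- c("(Wrestling) WPI",
         "National Girls & Women in Sports Day Event",
         "(Women's Basketball) Keene St. vs. Southern Me.")

# Group 1: the sport inside an optional "(...) " prefix.
# Group 2: everything that follows (or the whole string if no prefix).
parts <- str_match(raw, "^(?:\\((.*?)\\) )?(.*)$")

# Column 1 of `parts` is the full match; columns 2 and 3 are the groups.
events_alt <- data.frame(sport = parts[, 2], event = parts[, 3],
                         stringsAsFactors = FALSE)
events_alt
```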
Combining the time, month, and day values gives strings like these:
## [1] "(All day) Feb 1" "(All day) Feb 1" "(All day) Feb 1" "7:45 AM Feb 1"
## [5] "1:00 PM Feb 1" "1:00 PM Feb 1"
It might be helpful to combine our time, month, and day variables into a single POSIXct column. The hour is parsed with %I (12-hour clock), which is the code that pairs with %p (AM/PM). All-day events have no clock time, so they become NA.
event_time <- as.POSIXct(paste(time, month, day, "2020"), format = "%I:%M %p %b %d %Y")
head(event_time)
## [1] NA NA
## [3] NA "2020-02-01 07:45:00 AST"
## [5] "2020-02-01 13:00:00 AST" "2020-02-01 13:00:00 AST"
Now we have a data frame containing usable information. Depending on what we plan to do with the data, we could parse it further by breaking out the team that is coming to visit (this gets more difficult for tournaments), or whether the sport is Women’s or Men’s.
## event_time sport
## 1 <NA> Wrestling
## 2 <NA> Wrestling
## 3 <NA> Wrestling
## 4 2020-02-01 07:45:00 National Girls & Women in Sports Day Event
## 5 2020-02-01 13:00:00 Women's Basketball
## 6 2020-02-01 13:00:00 Women's Ice Hockey
## event
## 1 Roger Williams
## 2 SUNY Brockport
## 3 WPI
## 4
## 5 Keene St. vs. Southern Me.
## 6 Suffolk vs. Southern Me.
We’ll assume that this is our desired format and discuss what we just did in a little more detail.
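The time-parsing step can be tried on its own with a few of the combined strings shown earlier. This is a sketch under stated assumptions: the year 2020 matches the February 2020 page, tz = "UTC" is set only so the result is reproducible, and the all_day flag is a hypothetical addition for keeping track of rows that have no clock time. It uses %I (12-hour clock) for the hour because %p (AM/PM) only affects the result when paired with %I.

```r
# Sample values copied from the combined time/month/day output shown earlier.
stamps <- c("(All day) Feb 1", "7:45 AM Feb 1", "1:00 PM Feb 1")

# %I = hour on a 12-hour clock, %p = AM/PM, %b = abbreviated month name.
# Strings that do not fit the format, such as "(All day) ...", parse to NA.
parsed <- as.POSIXct(paste(stamps, "2020"),
                     format = "%I:%M %p %b %d %Y", tz = "UTC")

# Flag all-day events so the NA times are not mistaken for bad data.
all_day <- grepl("All day", stamps, fixed = TRUE)

data.frame(stamp = stamps, parsed = parsed, all_day = all_day)
```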