Chapter 12 Web scraping

This unit spans Mon Apr 20 through Sat Apr 25.
At 11:59 PM on Sat Apr 25 the following items are due:

Assignment 3

Assignment 3, has you working with Twitter data, which we covered in the last unit. The following section on web scraping is not needed to complete assignment 3. It merely rounds out your “working with web data” knowledge.

12.1 Media

Reading: Web scraping, legal issues (US), Wikipedia
rvest, easy web scraping with R, Wickham
Video: SelectorGadget

12.2 Assignment 3

Assignment 3 is exceptionally unstructured and feel free to get creative. The instructions are simple – do something cool with Twitter data. There should be some visual elements to your report and please remember to use codechunk option echo = FALSE to mask (i.e., not show) your authentication keys for Twitter. To keep your creative juices flowing, I’m not going to break down this assignment with a point allocation. Remember to post your RPubs link in Piazza like you normally do.

12.3 Introduction

If you find yourself trying to consistently pull information from websites for an analysis project, web-scraping can be invaluable. APIs like the one we used with Twitter are always preferred but not all sites have APIs designed to interact with their site. Please keep in mind the following caveats.

while the act of scraping is relatively harmless, reposting others intellectual property as your own can present some ethical and legal issues.
you should avoid placing too much of a load on the servers (e.g., continual scraping). Many sites will ban you for potentially degrading their site performance.
scraping programs are brittle. If a website design changes, it is highly likely that your scraping program will no longer work.

We’ll be scraping the USM athletics site for sporting event information. First, we’ll load the libraries we’ll be working with in this unit. If you don’t already have the XML package installed, rvest will not function properly unless you install it. Note that every time I update these notes, I am rescraping the website so the video, the notes, and what you might experience will not have the same data.

library(rvest)
library(stringr)
library(reshape2)

Next we’ll use the read_html() function from the xml2 package that rvest depends on to take the entire html file into a list variable called husky_events Since we do not have any event scheduled in April we cannot see any event on default page. So I am using February 2020 instead. You can change the month in the URL to scrap data from any month.

library(xml2)
husky_events <- read_html("https://usm.maine.edu/athletics/events?month=2020-02")

Before we proceed, you need to install the chrome extension SelectorGadget. If you don’t have chrome, install it first. Finally, if you haven’t already done so, watch the SelectorGadget video. The video is for the bookmarklet, but the extension functions in the same manner.

After we use SelectorGadget on the USM athletic events page, we get the following mappings:

event name - .description-container a
month name (short) - .month-display .date-display-single
day of month - .date-display
time - .time-location-container .date-display-single

SelectorGadget is returning the html/css tag specific information, also known as the CSS selector, for the information we want. If you look at the source, the shorthand for referencing the selector is:

class is referenced with a .
id is referenced with a #
tags without classes, or ids are referenced by name (i.e., no special character).

We use rvests html_nodes() and html_text() functions to retrieve the data we want. We’ll describe these functions in a little more detail later in this unit.

event_sport_name <- husky_events %>% html_nodes(".description-container a") %>% html_text()
month <- husky_events %>% html_nodes(".month-display .date-display-single") %>% html_text()
day <- husky_events %>% html_nodes(".date-display") %>% html_text()
time <- husky_events %>% html_nodes(".time-location-container .date-display-single") %>% html_text()

If we look at event_sport_name we can see that we’ll need to do a little cleaning.

head(event_sport_name)

## [1] "(Wrestling) Roger Williams"                     
## [2] "(Wrestling) SUNY Brockport"                     
## [3] "(Wrestling) WPI"                                
## [4] "National Girls & Women in Sports Day Event"     
## [5] "(Women's Basketball) Keene St. vs. Southern Me."
## [6] "(Women's Ice Hockey) Suffolk vs. Southern Me."

Reference the last unit if you are not familiar with what we are doing here. colsplit() is a function from reshape2 that splits strings into multiple columns in a data frame.

event_sport_name <- event_sport_name %>% str_replace_all("\\(", "")
events <- colsplit(event_sport_name, "\\) ", names = c("sport", "event"))
head(events)

##                                        sport                      event
## 1                                  Wrestling             Roger Williams
## 2                                  Wrestling             SUNY Brockport
## 3                                  Wrestling                        WPI
## 4 National Girls & Women in Sports Day Event                           
## 5                         Women's Basketball Keene St. vs. Southern Me.
## 6                         Women's Ice Hockey   Suffolk vs. Southern Me.

head(paste(time, month, day))

## [1] "(All day) Feb 1" "(All day) Feb 1" "(All day) Feb 1" "7:45 AM Feb 1"  
## [5] "1:00 PM Feb 1"   "1:00 PM Feb 1"

It might be helpful to combine our time, month, and day variables into a single POSIXct column.

event_time <- as.POSIXct(paste(time, month, day, "2016"), format = "%H:%M %p %b %d %Y")
head(event_time)

## [1] NA                        NA                       
## [3] NA                        "2016-02-01 07:45:00 AST"
## [5] "2016-02-01 01:00:00 AST" "2016-02-01 01:00:00 AST"

Now we have a data frame containing usable information. Depending on what we were going to do with the data, we could parse it further by breaking out the team that is coming to visit (this gets more difficult for tournaments), or whether the sport is Women’s or Men’s.

events <- cbind(event_time, events)
head(events)

##            event_time                                      sport
## 1                <NA>                                  Wrestling
## 2                <NA>                                  Wrestling
## 3                <NA>                                  Wrestling
## 4 2016-02-01 07:45:00 National Girls & Women in Sports Day Event
## 5 2016-02-01 01:00:00                         Women's Basketball
## 6 2016-02-01 01:00:00                         Women's Ice Hockey
##                        event
## 1             Roger Williams
## 2             SUNY Brockport
## 3                        WPI
## 4                           
## 5 Keene St. vs. Southern Me.
## 6   Suffolk vs. Southern Me.

We’ll assume that this is our desired format and discuss what we just did in a little more detail.

12.4 Navigating html files

Let’s look at the relevant html from the USM events url. To keep it short, we have the first two events listed below. You’ll notice that in some instances, there is a class name that maps to a specific piece of information. For example:

month-display always contains the short month name in an inner tag.
date-display and day-display also always map to the day of month and day of week.

The description-container div contains two information items, both of which also apply to other elements in the web page.

the <a> anchor tag that contains the event name can apply to any linked item in the entire web page.
the date-display-single class containing the time can also map to the month, day of month, and day of week.

In these cases, we need to be more specific about What we are attempting to map.

the <a> anchor tag that is nested inside <h2> tag nested in the description-container div contains the event name.
the date-display-single class nested inside of the time-location-container contains the time of the event.

We are using what are known as the CSS selectors for navigating the html file. There is an alternative reference known as XPath that accomplishes the same thing but css selectors are a little easier to use. If you work with XML file, XPath is valuable to learn. SelectorGadget can also return the XPath.

          <div class="view-content">
        <div class="item-group">
        <div class="item">
        <div class="date-container">
    <div class="date-container-inner">
      <div class="month-display"><span class="date-display-single">Feb</span></div>
      <div class="date-display"><span class="date-display-single">1</span></div>
      <div class="day-display"><span class="date-display-single">Sat</span></div>
    </div>
  </div>
    <div class="description-container">
            <h2><a href="/athletics/wrestling-roger-williams">(Wrestling) Roger Williams</a></h2>
                    <div class="time-location-container">
            <span class="date-display-single">(All day)</span>          </div>
            </div>

    </div>
      <div class="item">
        <div class="date-container">
    <div class="date-container-inner">
      <div class="month-display"><span class="date-display-single">Feb</span></div>
      <div class="date-display"><span class="date-display-single">1</span></div>
      <div class="day-display"><span class="date-display-single">Sat</span></div>
    </div>
  </div>
    <div class="description-container">
            <h2><a href="/athletics/wrestling-suny-brockport">(Wrestling) SUNY Brockport</a></h2>
                    <div class="time-location-container">
            <span class="date-display-single">(All day)</span>          </div>
            </div>

    </div>
      <div class="item">
        <div class="date-container">
    <div class="date-container-inner">
      <div class="month-display"><span class="date-display-single">Feb</span></div>
      <div class="date-display"><span class="date-display-single">1</span></div>
      <div class="day-display"><span class="date-display-single">Sat</span></div>
    </div>
  </div>
    <div class="description-container">
            <h2><a href="/athletics/wrestling-wpi">(Wrestling) WPI</a></h2>
                    <div class="time-location-container">
            <span class="date-display-single">(All day)</span>          </div>
            </div>

    </div>
      <div class="item">
        <div class="date-container">
    <div class="date-container-inner">
      <div class="month-display"><span class="date-display-single">Feb</span></div>
      <div class="date-display"><span class="date-display-single">1</span></div>
      <div class="day-display"><span class="date-display-single">Sat</span></div>
    </div>
  </div>
    <div class="description-container">
            <h2><a href="/athletics/national-girls-women-sports-day-event">National Girls &amp; Women in Sports Day Event</a></h2>
                    <div class="time-location-container">
            <span class="date-display-single">7:45 AM</span>, Gorham, Maine | Costello Sports Complex       </div>
            </div>

    </div>

The html_nodes function returns the entire node (i.e., tag and anything nested in the tag including text) for the css selector you are referencing. So if we want to reference the event, we can use .description-container h2 a as shown below.

head(husky_events %>% html_nodes(".description-container h2 a"))

## {xml_nodeset (6)}
## [1] <a href="/athletics/wrestling-roger-williams">(Wrestling) Roger Williams</a>
## [2] <a href="/athletics/wrestling-suny-brockport">(Wrestling) SUNY Brockport</a>
## [3] <a href="/athletics/wrestling-wpi">(Wrestling) WPI</a>
## [4] <a href="/athletics/national-girls-women-sports-day-event">National Girls ...
## [5] <a href="/athletics/womens-basketball-keene-st-vs-southern-me">(Women's B ...
## [6] <a href="/athletics/womens-ice-hockey-suffolk-vs-southern-me">(Women's Ic ...

We can also use a shorter syntax of .description-container a which will search for the <a> tag nested inside the description-container even if it is nested inside of something else (in this case <h2>). SelectorGadget attempts to always return the shortest syntax.

head(husky_events %>% html_nodes(".description-container a"))

## {xml_nodeset (6)}
## [1] <a href="/athletics/wrestling-roger-williams">(Wrestling) Roger Williams</a>
## [2] <a href="/athletics/wrestling-suny-brockport">(Wrestling) SUNY Brockport</a>
## [3] <a href="/athletics/wrestling-wpi">(Wrestling) WPI</a>
## [4] <a href="/athletics/national-girls-women-sports-day-event">National Girls ...
## [5] <a href="/athletics/womens-basketball-keene-st-vs-southern-me">(Women's B ...
## [6] <a href="/athletics/womens-ice-hockey-suffolk-vs-southern-me">(Women's Ic ...

The html_text() function simply extracts the text from a node.

head(husky_events %>% html_nodes(".description-container a") %>% html_text())

## [1] "(Wrestling) Roger Williams"                     
## [2] "(Wrestling) SUNY Brockport"                     
## [3] "(Wrestling) WPI"                                
## [4] "National Girls & Women in Sports Day Event"     
## [5] "(Women's Basketball) Keene St. vs. Southern Me."
## [6] "(Women's Ice Hockey) Suffolk vs. Southern Me."

The more dynamic and complex a site is, the more difficult it is to scrape consistently. This unit should have given you enough to get started with any web scraping tasks you might want to pursue in the future.