Chapter 10 Open data

This unit spans Mon Apr 06 through Sat Apr 11.
At 11:59 PM on Sat Apr 11 the following items are due:

  • Assignment 2

This unit is mostly about Assignment 2 and working with open data, both of which are described below.

10.1 Media

10.2 Assignment 2

In assignment 1, you worked with credit appraisal data. For this assignment, you must find an open dataset from a country, city, or region outside of the United States. Browse some of the sites below or find something on your own. Pick any data that interests you and write a summary report of what you find. Your report must contain the following elements:

  • a description of how you verified that the data is open and a link back to the website where you found it (10 points)
  • code folding for all code, initially set to hide (10 points)
  • at least two ggplot2 charts (20 points)
  • at least one nicely formatted table, produced with kable, that is short enough not to overwhelm the report (10 points)
  • a narrative discussing what you find interesting along with any issues you might have had preparing the data (10 points)
  • published on RPubs (30 points)
  • a clickable link posted in Piazza as a note titled “yourname’s Assignment 2,” where yourname is your actual name (10 points)
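As a rough illustration of the chart and table requirements above, here is a minimal sketch in R. The data frame `toy` and its columns `region` and `value` are hypothetical placeholders, not real data; substitute the open dataset you chose. Code folding is typically turned on in the R Markdown YAML header with the `code_folding: hide` option of `html_document`, so it does not appear in the code itself.

```r
library(ggplot2)
library(knitr)

# Hypothetical placeholder data; replace with your own open dataset.
toy <- data.frame(
  region = c("A", "B", "C"),
  value  = c(10, 25, 17)
)

# One ggplot2 chart: a bar chart with one bar per region.
p <- ggplot(toy, aes(x = region, y = value)) +
  geom_col() +
  labs(title = "Value by region (placeholder data)",
       x = "Region", y = "Value")
p

# One short table: keep only the top rows so it does not overwhelm the report.
top <- head(toy[order(-toy$value), ], 2)
kable(top, row.names = FALSE,
      caption = "Top two regions by value (placeholder data)")
```

Trimming the data with `head()` before passing it to `kable()` is one simple way to keep a table short; summarizing or filtering the data first would work just as well.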

10.3 Public vs. open

Most of the web is publicly available. You can get data by scraping sites, and some companies, like Twitter, provide an interface for accessing their data. But just because data is publicly available doesn’t mean it is open. Datacrunch held a visualization contest for which they archived all of President Trump’s tweets. They were forced to take down the dataset because of Twitter’s terms of service. This raises an important point: while we tend to think of all the data on the web as “open,” it very clearly isn’t. When in doubt, check the terms of service before using data from a website or service, and when scraping data from the web, always check robots.txt.
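To make the robots.txt advice concrete, here is a simplified sketch in base R of how such a check could work. The function `path_disallowed` and the example rules are hypothetical, written only for illustration. It looks only at `Disallow` rules in the `User-agent: *` group and ignores `Allow` rules, wildcards, and bot-specific groups, so treat it as a first pass and not a full parser. It also says nothing about the site's terms of service, which still need to be read separately.

```r
# Simplified robots.txt check (illustrative only).
# robots_lines: character vector, one element per line of robots.txt
# path:         the URL path you want to fetch, e.g. "/data/file.csv"
path_disallowed <- function(robots_lines, path) {
  in_star_group  <- FALSE   # are we inside a "User-agent: *" group?
  prev_was_agent <- FALSE   # consecutive User-agent lines share one group
  rules <- character(0)

  for (line in trimws(robots_lines)) {
    line <- sub("\\s*#.*$", "", line)  # strip trailing comments
    if (grepl("^user-agent:", line, ignore.case = TRUE)) {
      agent   <- trimws(sub("^[^:]*:", "", line))
      is_star <- identical(agent, "*")
      in_star_group  <- if (prev_was_agent) in_star_group || is_star else is_star
      prev_was_agent <- TRUE
    } else if (nzchar(line)) {
      prev_was_agent <- FALSE
      if (in_star_group && grepl("^disallow:", line, ignore.case = TRUE)) {
        rule <- trimws(sub("^[^:]*:", "", line))
        if (nzchar(rule)) rules <- c(rules, rule)  # empty Disallow allows all
      }
    }
  }
  # A path is disallowed if it starts with any collected rule.
  any(startsWith(path, rules))
}

# Hypothetical robots.txt content, written inline for illustration.
example <- c(
  "User-agent: *",
  "Disallow: /private/",
  "",
  "User-agent: otherbot",
  "Disallow: /"
)

path_disallowed(example, "/private/data.csv")  # TRUE
path_disallowed(example, "/data/open.csv")     # FALSE
```

In practice you would read the real file first, for example with `readLines()` on the site's robots.txt URL, and pass the result as `robots_lines`.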

Open data is free to access, reuse, and redistribute. The only potential restrictions on open data are attribution and share-alike requirements. Governments and non-profit organizations provide much of the open data on the web, but not all government data is open. For example, the US government holds data containing the social security numbers of its citizens. That data is not open, and it would be irresponsible for the government to release it. There are several benefits to open data. From an analytics perspective, two of the larger ones are:

  • transparency - sunlight is the best disinfectant
  • crowdsourcing - large numbers of people can help identify and solve problems

Chicago has one of the better-developed open data portals, which has led to several successful civic projects - you can browse the project site. Notice that several of the models were built in R.

10.4 Open examples

The following is a sampling of some of the better English-language open data sources.