Chapter 7 Assignment: CredX

This unit spans Mon Mar 09 through Sat Mar 14.
At 11:59 PM on Sun Mar 22 the following items are due:

Assignment 1 (A01): CredX

We finally get to start doing things in R! Assignment 1 is due in this unit. There are no readings or DataCamp exercises. You get to apply what you have been learning over the last six units. Get started as early as possible. You don’t want to get stuck on a technical issue at the eleventh hour.

7.1 Assignment 1 (A01)

Video: Assignment Clarification
Video: How to identify and list duplicates
Video: Assignment Office Hour

7.1.1 Business Context

CredX is a leading credit card provider that gets thousands of credit card applications every year. But in the past few years, it has experienced an increase in credit loss. The CEO believes that the best strategy to mitigate credit risk is to ‘acquire the right customers’.

In this assignment, your task is limited to processing the data files for analysis, performing exploratory data analysis, and publishing your preliminary findings.

7.1.2 Dataset

There are two data sets in this project: credit bureau data and demographics data.

Credit Bureau data: This is taken from the credit bureau and contains variables such as ‘number of times 30 DPD or worse in last 3/6/12 months’, ‘outstanding balance’, ‘number of trades’, etc.

Demographics data: This is obtained from the information provided by the applicants at the time of credit card application. It contains customer-level information on age, gender, income, marital status, etc.

You can import them with the following commands:

credit <- read.csv("./data/Credit_Bureau.csv")
demo <- read.csv("./data/Demogs.csv")
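Once both files are loaded, a quick inspection can help you choose a join variable and spot columns with missing values. The sketch below is only an illustration: it builds two small toy data frames in place of the real files, and every column name in it (such as `Application.ID`) is a hypothetical stand-in, not necessarily what you will find in the CSVs.

```r
# Toy stand-ins for the two imported files.
# All column names and values here are hypothetical.
credit <- data.frame(
  Application.ID      = c(101, 102, 103, 103),
  Outstanding.Balance = c(2500000, NA, 1200000, 1200000),
  Performance.Tag     = c(0, 1, 0, 0)
)
demo <- data.frame(
  Application.ID = c(101, 102, 103),
  Age            = c(34, NA, NA),
  Gender         = c("F", "M", "M")
)

# Dimensions and column types of each data set
dim(credit)
str(demo)

# Columns present in both data sets are candidate join variables
shared_cols <- intersect(names(credit), names(demo))

# Number of missing values (NA) per column
na_credit <- colSums(is.na(credit))
na_demo   <- colSums(is.na(demo))
```

Note that `read.csv()` turns blank numeric fields into `NA` but leaves blank text fields as empty strings, so a count based on `is.na()` may understate missing values in character columns.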

Some abbreviations and terms that will help you understand the variables are:

DPD - Days Past Due
CC - Credit Card
PL - Personal Loan
Performance Tag - Whether the customer defaulted or not. 0 means not defaulted and 1 means defaulted.
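Because the performance tag is coded as 0 and 1, it may be convenient to recode it as a labelled factor before grouping or plotting. A minimal sketch, using a made-up vector in place of the real column (the vector name and its values are assumptions for illustration only):

```r
# Hypothetical vector standing in for the performance tag column
performance_tag <- c(0, 1, 0, NA, 1)

# Recode 0/1 as a labelled factor; missing tags remain NA
performance <- factor(performance_tag,
                      levels = c(0, 1),
                      labels = c("Not defaulted", "Defaulted"))

# Counts per level, with missing tags shown as their own category
tag_counts <- table(performance, useNA = "ifany")
print(tag_counts)
```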

Prepare a report with an interesting narrative that focuses on a subset (some or all) of the columns you find interesting, drawing on both the credit bureau and the demographics data. Upload your report to RPubs and post a link to it in Piazza. You are required to join the data, choosing an appropriate join method and join variable. It is up to you to determine how to handle missing values, but you must explicitly state your assumptions about how you treated them. Your document title should be exactly A01_<firstname>_<lastname>, where <firstname> and <lastname> are your actual first and last names (5 points). Also, the HTML document you publish to RPubs must have the following elements:

at least two level-two headers (##) and at least one bulleted list (*) with at least two items (5 points)
you must create a data frame or tibble that joins both credit bureau data and demographics data. (10 points)
Identify and display (with suitable formatting) the duplicate records in a table. The code that creates the portion of the table must be displayed. (10 points)
Identify a variable with about 50 or more missing values and treat it suitably. State what you did and why. The code that creates the portion of the table must be displayed. (10 points)
Display a comparative box plot of outstanding balance (in millions) based on any factor variable that you deem revealing some insight. The code that created it must not be displayed. (10 points)
Display a frequency histogram of any variable (find a revealing variable) grouped by performance tag. The code that created it must not be displayed. (10 points)
a narrative discussing what you find interesting along with any issues you might have had preparing the data (10 points)
published on RPubs (20 points)
clickable link posted in Piazza as a note titled “yourname’s Assignment 1,” where yourname is your actual name (10 points)
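The data-handling items in the checklist can be sketched in base R as follows. This is only one possible approach under stated assumptions, not the expected solution: the toy data frames, the column names (`Application.ID`, `Outstanding.Balance`, `Performance.Tag`, `Age`, `Gender`), the inner join, the choice to drop duplicated keys, and the median imputation are all illustrative choices that you should replace with your own, justified decisions.

```r
# Toy stand-ins for the two data sets; all names and values are hypothetical.
credit <- data.frame(
  Application.ID      = c(101, 102, 103, 103, 104, 105),
  Outstanding.Balance = c(2500000, 800000, 1200000, 1200000, NA, 3100000),
  Performance.Tag     = c(0, 1, 0, 0, 1, 0)
)
demo <- data.frame(
  Application.ID = c(101, 102, 103, 104, 105),
  Age            = c(34, 51, 29, 45, 38),
  Gender         = c("F", "M", "M", "F", "F")
)

# Duplicates: list every row whose key appears more than once
dup_ids  <- credit$Application.ID[duplicated(credit$Application.ID)]
dup_rows <- credit[credit$Application.ID %in% dup_ids, ]
print(dup_rows)

# Assumption: keep only the first record for each duplicated key
credit_unique <- credit[!duplicated(credit$Application.ID), ]

# Join: merge() performs an inner join on the key by default
joined <- merge(credit_unique, demo, by = "Application.ID")

# Missing values: assumed treatment is to impute with the column median
bal_median <- median(joined$Outstanding.Balance, na.rm = TRUE)
joined$Outstanding.Balance[is.na(joined$Outstanding.Balance)] <- bal_median

# Comparative box plot of outstanding balance (in millions) by a factor
joined$Balance.Millions <- joined$Outstanding.Balance / 1e6
joined$Gender <- factor(joined$Gender)
boxplot(Balance.Millions ~ Gender, data = joined,
        ylab = "Outstanding balance (millions)")

# Frequency histograms of one variable, one panel per performance tag
old_par <- par(mfrow = c(1, 2))
for (tag in c(0, 1)) {
  hist(joined$Age[joined$Performance.Tag == tag],
       main = paste("Performance tag", tag), xlab = "Age")
}
par(old_par)
```

In an R Markdown report, the chunk option `echo = FALSE` hides a chunk's code while still showing its output, which is one way to meet the "code must not be displayed" requirement for the two plots. Leave `echo` at its default for the chunks whose code must be shown.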

The checklist above is implemented as a grading rubric in Blackboard. If you want to know where you received deductions, if any, click on your grade in the grade book, then select “View Rubric.”