Lab - Project Proposal

Lab
Important

This lab is due October 20th at 11:59pm.

Learning Goals

  • Collaborate with your team to outline your project
  • Think critically and craft questions that can be answered from your data

Introduction

Pick a data set and do something with it. That is your final project.

The final project for this class will consist of analysis on a data set of your own choosing. The data set may already exist, or you may collect your own data by scraping the web.

Choose the data based on your group’s interests. The goal of this project is for you to demonstrate proficiency in the techniques we have covered, and will be covering in this class (and beyond, if you like!), and apply them to a data set to analyze it in a meaningful way

Data Sources

In order for you to have the greatest chance of success with this project it is important that you choose a manageable data set. This means that the data should be readily accessible and large enough that multiple relationships can be explored. Your data set must have at least 500 observations and at least eight variables (or has been approved by your instructor). The data set should include a rich mix of categorical, discrete numeric, and continuous numeric data.

If you are using a data set that comes in a format that we haven’t encountered in class (for instance, a .DAT file), make sure that you are able to load it into RStudio as this can be tricky depending on the source. If you are having trouble, ask for help before it is too late.

Data sets that can not be used

– Data sets that have been used for class examples or assignments.

– Data sets from Kaggle.

– Data sets analyzed in another course.

There will be limits on the number of groups that can use a given data set, so I encourage you to be creative!

Some resources that may be helpful:

R Data Sources for Regression Analysis

FiveThirtyEight

TidyTuesday

Additions:

World Health Organization

The National Bureau of Economic Research

International Monetary Fund

General Social Survey

United Nations Data

United Nations Statistics Division

U.K. Data

U.S. Data

U.S. Census Data

European Statistics

Statistics Canada

Pew Research

UNICEF

CDC

World Bank

Election Studies

All analyses must be done in RStudio, and your final written report and analysis must be reproducible. This means that you must create a Quarto document attached to a GitHub repository that will create your written report exactly upon rendering.

Project proposal

There are two main purposes of the project proposal:

– To help you think about the project early, so you can get a head start on finding data, reading relevant literature, thinking about the questions you wish to answer, etc.

– To ensure that the data you wish to analyze, methods you plan to use, and the scope of your analysis are feasible and will help you be successful for this project.

Include the following in the proposal:

Introduction and Data

(a) Choose three substantially different data sets you are interested in analyzing. You must scrape at least 1 data set for your project proposal. For each, identify the components below.

For each data set, include the following:

– Identify the source of the data,

– When and how it was originally collected (by the original data curator, not necessarily how you found the data)

– A brief description of the observations

Research question

(b) Your research question should contain at least three variables, and should be a mix of categorical and quantitative variables. When writing a research question, please think about the following:

  • Target population.

  • Is the question original?

  • Can the question be answered?

For each data set, include the following

– A well thought out formulated research question. You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.

– Describe the research topic along with a concise statement of your hypotheses.

– Identify the types of variables in your research question. Categorical? Quantitative?

Data set

(c) For each data set, include the following

– Use the glimpse function to provide a glimpse of the data set.

– Place the file containing your data in the data folder of the project repo

Notes

– It is critical to check feedback on your project proposal. Even if you earn a 10, it may not mean that your proposal is perfect.

– You must use one of the data sets in the proposal for the final project, unless instructed otherwise when given feedback.

Submission

Once you are finished with the lab, you will your final PDF document to Gradescope.

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to render and push changes.

You must turn in a PDF file to the Gradescope page by the submission deadline to be considered “on time”.

Make sure your data are tidy! That is, your code should not be running off the pages and spaced properly. See: https://style.tidyverse.org/ggplot2.html

To submit your assignment:

  • Go to http://www.gradescope.com and click Log in in the top right corner.
  • Click School Credentials \(\rightarrow\) Duke NetID and log in using your NetID credentials.
  • Click on your STA 199 course.
  • Click on the assignment, and you’ll be prompted to submit it.
  • Mark all the pages associated with exercise. All the pages of your lab should be associated with at least one question (i.e., should be “checked”). If you do not do this, you will be subject to lose points on the assignment.

Grading

Component Points
Introduction/data 3
Research question 3
Data sets 3
Workflow and Formatting 1
Total 10