Project description
Timeline
Proposal due Thur, Oct 20
Draft 1 due M, Nov 7
Draft 2 due Tue, Nov 29 (optional)
Peer review due Wed, Nov 30
Presentation + slides and final GitHub repo due Thu, Dec 8
Final report due Thu, Dec 8
Introduction
TL;DR: Ask a question you’re curious about and answer it with a dataset of your choice. This is your project in a nutshell.
May be too long, but please do read
The final project for this class will consist of analysis on a dataset of your own choosing. The dataset may already exist, or you may collect your own data using a survey or by conducting an experiment. You can choose the data based on your teams’ interests or based on work in other courses or research projects. The goal of this project is for you to demonstrate proficiency in the techniques we have covered in this class (and beyond, if you like) and apply them to a novel dataset in a meaningful way.
The goal is not to do an exhaustive data analysis i.e., do not calculate every statistic and procedure you have learned for every variable, but rather let me know that you are proficient at asking meaningful questions and answering them with results of data analysis, that you are proficient in using R, and that you are proficient at interpreting and presenting the results. Focus on methods that help you begin to answer your research questions. You do not have to apply every statistical procedure we learned. Also, critique your own methods and provide suggestions for improving your analysis. Issues pertaining to the reliability and validity of your data, andappropriateness of the statistical analysis should be discussed here.
The project is very open ended. You should create some kind of compelling visualization(s) of this data in R. There is no limit on what tools or packages you may use but sticking to packages we learned in class is required. You do not need to visualize all of the data at once. A single high-quality visualization will receive a much higher grade than a large number of poor-quality visualizations. Also pay attention to your presentation. Neatness, coherency, and clarity will count. All analyses must be done in RStudio, using R, and all components of the project must be reproducible (with the exception of the presentation).
You will work on the project with your lab teams.
The four primary deliverables for the final project are
- A project proposal with three dataset ideas.
- A reproducible project writeup of your analysis, with one required and one optional draft along the way.
- Formal peer review on another team’s project.
- A presentation with slides.
You will not be submitting anything on Gradescope for the project. Submission of these deliverables will happen on GitHub and feedback will be provided as GitHub issues that you need to engage with and close. The collection of the documents in your GitHub repo will create a webpage for your project. To create the webpage go to the Build tab in RStudio, and click on Render Website.
Proposal
There are two main purposes of the project proposal:
- To help you think about the project early, so you can get a head start on finding data, reading relevant literature, thinking about the questions you wish to answer, etc.
- To ensure that the data you wish to analyze, methods you plan to use, and the scope of your analysis are feasible and will allow you to be successful for this project.
Identify 3 data sets you’re interested in potentially using for the final project. If you’re unsure where to find data, you can use the list of potential data sources in the Tips + Resources section as a starting point. It may also help to think of topics you’re interested in investigating and find data sets on those topics.
Criteria for datasets
The data sets should meet the following criteria:
At least 500 observations (or approved by me)
At least 8 columns (or approved by me)
At least 6 of the columns must be useful and unique explanatory variables (or approved by me)
- Identifier variables such as “name”, “social security number”, etc. are not useful explanatory variables.
- If you have multiple columns with the same information (e.g. “state abbreviation” and “state name”), then they are not unique explanatory variables.
You may not use data that has previously been used in any course materials, or any derivation of data that has been used in course materials.
We strongly recommend curating at least one of your datasets via web scraping.
Please ask a member of the teaching team if you’re unsure whether your data set meets the criteria.
If you set your hearts on a dataset that has fewer observations or variables than what’s suggested here, that might still be ok; use these numbers as guidance for a successful proposal, not as minimum requirements.
Resources for datasets
You can find data wherever you like, but here are some recommendations to get you started. You shouldn’t feel constrained to datasets that are already in a tidy format, you can start with data that needs cleaning and tidying, scrape data off the web, or collect your own data.
- Awesome public datasets
- Bikeshare data portal
- CDC
- Data.gov
- Data is Plural
- Durham Open Data Portal
- Edinburgh Open Data
- Election Studies
- European Statistics
- CORGIS: The Collection of Really Great, Interesting, Situated Datasets
- General Social Survey
- Google Dataset Search
- Harvard Dataverse
- International Monetary Fund
- IPUMS survey data from around the world
- Los Angeles Open Data
- NHS Scotland Open Data
- NYC OpenData
- Open access to Scotland’s official statistics
- Pew Research
- PRISM Data Archive Project
- Statistics Canada
- The National Bureau of Economic Research
- UCI Machine Learning Repository
- UK Government Data
- UNICEF Data
- United Nations Data
- United Nations Statistics Division
- US Census Data
- US Government Data
- World Bank Data
- Youth Risk Behavior Surveillance System (YRBSS)
Proposal components
For each data set, include the following:
Introduction and data
For each data set:
Identify the source of the data.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
Write a brief description of the observations.
Address ethical concerns about the data, if any.
Research question
Your research question should contain at least three variables, and should be a mix of categorical and quantitative variables. When writing a research question, please think about the following:
What is your target population?
Is the question original?
Can the question be answered?
For each data set, include the following:
- A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- Statement on why this question is important.
- A description of the research topic along with a concise statement of your hypotheses on this topic.
- Identify the types of variables in your research question. Categorical? Quantitative?
Glimpse of data
For each data set:
- Place the file containing your data in the
data
folder of the project repo. - Use the
glimpse()
function to provide a glimpse of the data set.
Proposal grading
Total | 10 pts |
---|---|
Introduction and data | 3 |
Research question | 3 |
Glimpse of data | 3 |
Workflow and formatting | 1 |
Each component will be graded as follows:
Meets expectations (full credit): All required elements are completed and are accurate. The narrative is written clearly, all tables and visualizations are nicely formatted, and the work would be presentable in a professional setting.
Close to expectations (half credit): There are some elements missing and/or inaccurate. There are some issues with formatting.
Does not meet expectations (no credit): Major elements missing. Work is not neatly formatted and would not be presentable in a professional setting.
It is critical to check feedback on your project proposal. Even if you earn full credit, it may not mean that your proposal is perfect.
Draft
The purpose of the draft and peer review is to give you an opportunity to get early feedback on your analysis. Therefore, the draft and peer review will focus primarily on the exploratory data analysis and initial interpretations.
Write the draft in the report.qmd
file in your project repo.
As you work on the draft, the focus should be on the analysis and less on crafting the final report. Your draft must include a reasonable attempt at each analysis component - exploratory data analysis, inference or modeling, and deriving initial results and conclusions.
Below is a brief description of the sections to focus on in the draft.
Draft components
Introduction and data
The introduction provides motivation and context for your research. Describe your topic (citing sources) and provide a concise, clear statement of your research question and hypotheses.
Then identify the source of the data, when and how it was collected, the cases, a general description of relevant variables.
Methodology
The methodology section should include visualizations and summary statistics relevant to your research question. You should also justify the choice of statistical method(s) used to answer your research question.
Results
Showcase how you arrived at answers to your research question using the techniques we have learned in class (and beyond, if you’re feeling adventurous).
Provide only the main results from your analysis. The goal is not to do an exhaustive data analysis (calculate every possible statistic and perform every possible procedure for all variables). Rather, you should demonstrate that you are proficient at asking meaningful questions and answering them using data, that you are skilled in interpreting and presenting results, and that you can accomplish these tasks using R. More is not better.
Draft grading
Your first draft will be reviewed and graded by your TAs. We recommend you incorporate their suggestions into your second (optional) draft before the second round of feedback by your peers.
Peer review
Critically reviewing others’ work is a crucial part of the scientific process, and STA 199 is no exception. You will be assigned two teams to review. This feedback is intended to help you create a high quality final project, as well as give you experience reading and constructively critiquing the work of others.
During the peer feedback process, you will be provided read-only access to your partner team’s GitHub repo. You will provide your feedback in the form of GitHub issues to your partner team’s GitHub repo.
Process and questions
Spend ~30 mins to review each team’s project.
- Find your team name on the Reviewer 1 and Reviewer 2 columns.
- For each of the columns, find the name of the team to review in the Team being reviewed column. You should already have access to this team’s repo.
- Open the repo of the team you’re reviewing, read their project draft, and browser around the rest of their repo.
- Then, go to the Issues tab in that repo, click on New issue, and click on Get started for the Peer review issue. Fill out this issue, answering the following questions:
Peer review by: [NAME OF TEAM DOING THE REVIEW]
Names of team members that participated in this review: [FULL NAMES OF TEAM MEMBERS DOING THE REVIEW]
Describe the goal of the project.
Describe the data used or collected, if any. If the proposal does not include the use of a specific dataset, comment on whether the project would be strengthened by the inclusion of a dataset.
Describe the approaches, tools, and methods that will be used.
Is there anything that is unclear from the proposal?
Provide constructive feedback on how the team might be able to improve their project. Make sure your feedback includes at least one comment on the statistical modeling aspect of the project, but do feel free to comment on aspects beyond the modeling.
What aspect of this project are you most interested in and would like to see highlighted in the presentation.
Provide constructive feedback on any issues with file and/or code organization.
(Optional) Any further comments or feedback?
Peer review grading
Peer reviews will be graded on the extent to which it comprehensively and constructively addresses the components of the partner team’s report: the research context and motivation, exploratory data analysis, and any inference, modeling, or conclusions.
Report
Your written report must be completed in the report.qmd
file and must be reproducible. All team members should contribute to the GitHub repository, with regular meaningful commits.
Before you finalize your write up, make sure the printing of code chunks is off with the option echo: false
in the YAML.
The mandatory components of the report are below. You are free to add additional sections as necessary. The report, including visualizations, should be no more than 10 pages long. There is no minimum page requirement; however, you should comprehensively address all of the analysis in your report.
Be selective in what you include in your final write-up. The goal is to write a cohesive narrative that demonstrates a thorough and comprehensive analysis rather than explain every step of the analysis.
You are welcome to include an appendix with additional work at the end of the written report document; however, grading will largely be based on the content in the main body of the report. You should assume the reader will not see the material in the appendix unless prompted to view it in the main body of the report. The appendix should be neatly formatted and easy for the reader to navigate. It is not included in the 10-page limit.
Report components
Introduction and data
This section includes an introduction to the project motivation, data, and research question. Describe the data and definitions of key variables. It should also include some exploratory data analysis. All of the EDA won’t fit in the paper, so focus on the EDA for the response variable and a few other interesting variables and relationships.
Grading criteria
The research question and motivation are clearly stated in the introduction, including citations for the data source and any external research. The data are clearly described, including a description about how the data were originally collected and a concise definition of the variables relevant to understanding the report. The data cleaning process is clearly described, including any decisions made in the process (e.g., creating new variables, removing observations, etc.) The explanatory data analysis helps the reader better understand the observations in the data along with interesting and relevant relationships between the variables. It incorporates appropriate visualizations and summary statistics.
Methodology
This section includes a brief description of your modeling process. Explain the reasoning for the type of model you’re fitting, predictor variables considered for the model including any interactions. Additionally, show how you arrived at the final model by describing the model selection process, interactions considered, variable transformations (if needed), assessment of conditions and diagnostics, and any other relevant considerations that were part of the model fitting process.
Grading criteria
The analysis steps are appropriate for the data and research question. The group used a thorough and careful approach to determine analyses types and addressed any concerns over appropriateness of analyses chosen.
Results
This is where you will discuss your overall finding and describe the key results from your analysis. The goal is not to interpret every single element of an output shown, but instead to address the research questions, using the interpretations to support your conclusions. Focus on the variables that help you answer the research question and that provide relevant context for the reader.
Grading criteria
The analysis results are clearly assessed and interesting findings from the analysis are described. Interpretations are used to to support the key findings and conclusions, rather than merely listing, e.g., the interpretation of every model coefficient.
Discussion
In this section you’ll include a summary of what you have learned about your research question along with statistical arguments supporting your conclusions. In addition, discuss the limitations of your analysis and provide suggestions on ways the analysis could be improved. Any potential issues pertaining to the reliability and validity of your data and appropriateness of the statistical analysis should also be discussed here. Lastly, this section will include ideas for future work.
Grading criteria
Overall conclusions from analysis are clearly described, and the analysis results are put into the larger context of the subject matter and original research question. There is thoughtful consideration of potential limitations of the data and/or analysis, and ideas for future work are clearly described.
Organization + formatting
This is an assessment of the overall presentation and formatting of the written report.
Grading criteria
The report neatly written and organized with clear section headers and appropriately sized figures with informative labels. Numerical results are displayed with a reasonable number of digits, and all visualizations are neatly formatted. All citations and links are properly formatted. If there is an appendix, it is reasonably organized and easy for the reader to find relevant information. All code, warnings, and messages are suppressed. The main body of the written report (not including the appendix) is no longer than 10 pages.
Report grading
The written report is worth 40 points, broken down as follows
Total | 40 pts |
---|---|
Introduction/data | 6 pts |
Methodology | 10 pts |
Results | 14 pts |
Discussion | 6 pts |
Organization + formatting | 4 pts |
Presentation + slides
Slides
In addition to the written report, your team will also create presentation slides and record a video presentation that summarize and showcase your project. Introduce your research question and data set, showcase visualizations, and discuss the primary conclusions. These slides should serve as a brief visual addition to your written report and will be graded for content and quality.
You can create your slides with any software you like (Keynote, PowerPoint, Google Slides, etc.). We recommend choosing an option that’s easy to collaborate with, e.g., Google Slides.
The slide deck should have no more than 6 content slides + 1 title slide. Here is a suggested outline as you think through the slides; you do not have to use this exact format for the 6 slides.
- Title Slide
- Slide 1: Introduce the topic and motivation
- Slide 2: Introduce the data
- Slide 3: Highlights from EDA
- Slide 4-5: Inference/modeling/other analysis
- Slide 6: Conclusions + future work
Presentation
Presentations will take place in class during the last lab of the semester. The presentation must be no longer than 5 minutes. You can choose to present live in class (recommended) or pre-record a video to be shown in class. Either way you must attend the lab session for the Q&A following your presentation.
If you choose to pre-record your presentation, you may use can use any platform that works best for your group to record your presentation. Below are a few resources on recording videos:
- Recording presentations in Zoom
- Apple Quicktime for screen recording
- Windows 10 built-in screen recording functionality
- Kap for screen recording
Once your video is ready, upload the video to Warpwire or another video platform (e.g., YouTube), then add a link to your video in your repo README.
To upload your video to Warpwire:
- Click the Warpwire tab in the course Sakai site.
- Click the “+” and select “Upload files”.
- Locate the video on your computer and click to upload.
- Once you’ve uploaded the video to Warpwire, click to share the video and copy the video’s URL. You will need this when you post the video in the discussion forum.
Reproducibility + organization
All written work (with exception of presentation slides) should be reproducible, and the GitHub repo should be neatly organized.
Points for reproducibility + organization will be based on the reproducibility of the written report and the organization of the project GitHub repo. The repo should be neatly organized as described above, there should be no extraneous files, all text in the README should be easily readable.
Teamwork
You will be asked to fill out a survey where you rate the contribution and teamwork of each team member by assigning a contribution percentage for each team member. Filling out the survey is a prerequisite for getting credit on the team member evaluation. If you are suggesting that an individual did less than half the expected contribution given your team size (e.g., for a team of four students, if a student contributed less than 12.5% of the total effort), please provide some explanation. If any individual gets an average peer score indicating that this was the case, their grade will be assessed accordingly and penalties may apply beyond the teamwork component of the grade.
If you have concerns with the teamwork and/or contribution from any team members, please email me by the project presentation deadline. You only need to email me if you have concerns. Otherwise, I will assume everyone on the team equally contributed and will receive full credit for the teamwork portion of the grade.
Overall grading
The grade breakdown is as follows:
Total | 100 pts |
---|---|
Project proposal | 10 pts |
Draft report | 10 pts |
Peer review | 5 pts |
Final report | 40 pts |
Slides + presentation | 20 pts |
Reproducibility + organization | 5 pts |
Teamwork | 10 pts |
Grading summary
Grading of the project will take into account the following:
- Content - What is the quality of research and/or policy question and relevancy of data to those questions?
- Correctness - Are statistical procedures carried out and explained correctly?
- Writing and Presentation - What is the quality of the statistical presentation, writing, and explanations?
- Creativity and Critical Thought - Is the project carefully thought out? Are the limitations carefully considered? Does it appear that time and effort went into the planning and implementation of the project?
A general breakdown of scoring is as follows:
- 90%-100%: Outstanding effort. Student understands how to apply all statistical concepts, can put the results into a cogent argument, can identify weaknesses in the argument, and can clearly communicate the results to others.
- 80%-89%: Good effort. Student understands most of the concepts, puts together an adequate argument, identifies some weaknesses of their argument, and communicates most results clearly to others.
- 70%-79%: Passing effort. Student has misunderstanding of concepts in several areas, has some trouble putting results together in a cogent argument, and communication of results is sometimes unclear.
- 60%-69%: Struggling effort. Student is making some effort, but has misunderstanding of many concepts and is unable to put together a cogent argument. Communication of results is unclear.
- Below 60%: Student is not making a sufficient effort.
Late work policy
There is no late work accepted on this project. Be sure to turn in your work early to avoid any technological mishaps.