Lab 6 - Single Predictor
Learning Goals
– Use tidymodels framework to build a linear model and estimate regression parameters
– Visualize your linear model
Intro
Parasites can cause infectious disease – but not all animals are affected by the same parasites. Some parasites are present in a multitude of species and others are confined to a single host. It is hypothesized that closely related hosts are more likely to share the same parasites. More specifically, it is thought that closely related hosts will live in similar environments and have similar genetic makeup that coincides with optimal conditions for the same parasite to flourish.
In this lab we will see how much evolutionary history predicts parasite similarity.
The Data
Today’s dataset comes from an Ecology Letters paper by Cooper et al. (2012) entitled “Phylogenetic host specificity and understanding parasite sharing in primates”. The goal of the paper was to assess how well evolutionary history and ecological traits characterize parasite host specificity.
Each row of the data contains two species, species1 and species2.
Subsequent columns describe metrics that compare the species:
– divergence_time: how many millions of years ago the two species diverged, i.e. how many million years ago they were the same species
– distance: geodesic distance between the two species' geographic range centroids (in kilometers)
– BMdiff: difference in body mass between the two species (in grams)
– precdiff: difference in mean annual precipitation across the two species' geographic ranges (in millimeters)
– parsim: a measure of parasite similarity (proportion of parasites shared between the two species, ranging from 0 to 1)
The data are available in parasites.csv in the data folder.
Packages
We’ll use the tidyverse package for much of the data wrangling and visualization.
Exercises
To get started, load the data and save the data frame as parasites.
Let’s start by examining the relationship between divergence_time and parsim.
1.(a) Based on the goals of the analysis, what is the response variable?
(b) Visualize the relationship between the two variables. Be sure to include informative axis labels and a title.
(c) Use the visualization to describe the relationship between the two variables.
2.(a) Write the regression equation using proper notation.
(b) Interpret the slope and the intercept in the context of the data.
(c) Recreate the visualization from Exercise 1, this time adding a regression line to the visualization.
(d) What do you notice about the prediction (regression) line that may be strange, particularly for very large divergence times?
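As a hedged sketch of how a regression line can be layered onto a scatterplot, the block below uses a small made-up data frame (toy, with hypothetical columns x and y) rather than the parasites data. It assumes the ggplot2 package (part of the tidyverse) is installed; geom_smooth(method = "lm") draws the least-squares line, and se = FALSE drops the confidence band.

```r
library(ggplot2)

# Hypothetical toy data, for illustration only (not the parasites data)
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(0.9, 0.7, 0.65, 0.4, 0.35, 0.1)
)

# Scatterplot with a fitted least-squares line drawn on top of the points
p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(
    x = "Predictor (hypothetical units)",
    y = "Response (hypothetical units)",
    title = "Toy scatterplot with a fitted regression line"
  )

p
```

The same two layers can be applied to your own plot by swapping in the parasites data frame and the variables from Exercise 1.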
3.(a) Using mutate, create a new variable that transforms parsim from its range of 0 to 1 so that it can take any value in (−∞,+∞). That is, create a new variable transformed_parsim calculated as log(parsim/(1-parsim)) and add it to your data frame. This variable is better suited for fitting a regression model (and interpreting predicted values!).
Note: log in R computes the natural log.
(b) Then, visualize the relationship between divergence_time and transformed_parsim. Be sure to include informative axis labels and a title. Add a regression line to your visualization.
(c) Write a 1-2 sentence description of what you observe in the visualization.
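To make the transformation in Exercise 3 concrete, here is a minimal sketch on made-up proportions. The data frame toy and its columns prop and logit_prop are hypothetical names chosen for illustration, and the dplyr package is assumed to be installed.

```r
library(dplyr)

# Hypothetical toy proportions, for illustration only (not the parasites data)
toy <- tibble(prop = c(0.1, 0.25, 0.5, 0.75, 0.9))

# Log-odds (logit) transform: maps values in (0, 1) onto the whole real line
toy <- toy %>%
  mutate(logit_prop = log(prop / (1 - prop)))

toy
```

A proportion of 0.5 maps to 0, proportions below 0.5 map to negative values, and proportions above 0.5 map to positive values. Proportions of exactly 0 or 1 map to -Inf or Inf, so it is worth checking whether your data contain such values. Base R's qlogis() computes the same quantity.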
4. Which variable is the strongest individual predictor of parasite similarity between species? To answer this question, begin by fitting a linear regression model to each pair of variables. Do not report the model outputs in a tidy format but save each one as dt_model, dist_model, BM_model, and prec_model, respectively.
– divergence_time and transformed_parsim
– distance and transformed_parsim
– BMdiff and transformed_parsim
– precdiff and transformed_parsim
Report the slopes for each of these models. Use proper notation.
To answer our question of interest, would it be useful to compare the slopes in each model to choose the variable that is the strongest predictor of parasite similarity? Why or why not?
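For reference, one way to fit and save a single-predictor model with the tidymodels framework is sketched below. The data frame toy, its columns x and y, and the object name toy_model are hypothetical, and the tidymodels package is assumed to be installed. The call to tidy() is included only to show how the estimated intercept and slope can be inspected.

```r
library(tidymodels)

# Hypothetical toy data, for illustration only (not the parasites data)
toy <- tibble(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2.2, 3.8, 6.1, 8.3, 9.7, 12.4)
)

# Specify a linear regression, use the "lm" engine, and fit y on x
toy_model <- linear_reg() %>%
  set_engine("lm") %>%
  fit(y ~ x, data = toy)

# One row per coefficient: the intercept and the slope for x
tidy(toy_model)
```

The saved object (here toy_model) can be reused later, for example to extract summary statistics, without refitting the model.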
5. Now, what if we calculated \(R^2\) to help answer our question? To compare the explanatory power of each individual predictor, we will compare \(R^2\) across the models. \(R^2\) is a measure of how much of the variability in the response variable is explained by the model.
As you may have guessed from the name, \(R^2\) can be calculated by squaring the correlation when we have a simple linear regression model. The correlation r takes values from -1 to 1; therefore, \(R^2\) takes values from 0 to 1. Intuitively, if r=1 or −1, then \(R^2\)=1, indicating the model is a perfect fit for the data. If r≈0 then \(R^2\)≈0, indicating the model explains almost none of the variability in the response.
You can calculate \(R^2\) using the glance function. For example, you can calculate \(R^2\) for dt_model using the code glance(dt_model)$r.squared.
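The link between the correlation and \(R^2\) described above can be checked numerically. The sketch below uses made-up data (toy, with hypothetical columns x and y) and base R's lm() and summary() instead of the tidymodels workflow, purely to keep the example free of package dependencies. The r.squared value reported by glance is the same quantity.

```r
# Hypothetical toy data, for illustration only (not the parasites data)
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8),
  y = c(1.8, 2.9, 3.1, 4.6, 4.4, 6.2, 6.9, 7.5)
)

# Fit a simple linear regression of y on x
toy_fit <- lm(y ~ x, data = toy)

# R^2 computed two ways: squared correlation vs. the model summary
r <- cor(toy$x, toy$y)
r_squared_from_cor <- r^2
r_squared_from_fit <- summary(toy_fit)$r.squared

c(from_cor = r_squared_from_cor, from_fit = r_squared_from_fit)
```

The two values agree up to floating-point error. This equality holds for simple linear regression with a single predictor.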
Calculate and report \(R^2\) for each model fit in the previous exercise.
To answer our question of interest, would it be useful to compare the \(R^2\) in each model to choose the variable that is the strongest predictor of parasite similarity? Why or why not?
Submission
To submit your assignment:
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Click School Credentials \(\rightarrow\) Duke NetID and log in using your NetID credentials.
- Click on your STA 199 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark all the pages associated with each exercise. All the pages of your lab should be associated with at least one question (i.e., should be “checked”). If you do not do this, you may lose points on the assignment.
- Select the first page of your .pdf submission to be associated with the “Workflow & formatting” question.
Grading
Component | Points |
---|---|
Ex 1 | 8 |
Ex 2 | 14 |
Ex 3 | 8 |
Ex 4 | 8 |
Ex 5 | 7 |
Workflow & formatting | 5 |
Total | 50 |