Lab 7

Analyzing Bivariate Data with R (III)

We conclude our discussion of bivariate data analysis. In particular, we focus on linear regression analysis of the continuous variables in the \(\texttt{\color{brown}{survey dataset}}\) that we discussed earlier.

Beginning Semester Survey Data

The original data contain 136 observations and 15 variables. After omitting every observation with a missing value, the final dataset has dimension \(108\times 15\).

01. Sports:    What is your favorite professional sport? 
02. States:    How many states in the US have you traveled to? 
03. Cat_Dog:   Are you more of a cat person or a dog person? 
04. Pets:      What is the number of pets you and your family currently have? 
05. Gender:    What is your gender? 
06. Browser:   What is your preferred internet browser? 
07. Shoes:     How many pairs of shoes do you have? 
08. Height:    How tall are you (inch)? 
09. MO_Height: What is your mother’s height (inch)? 
10. FA_Height: What is your father’s height (inch)? 
11. Arrival:   How many minutes does it take for you to reach the recitation classroom from your residence? 
12. Sleep:     How many HOURS did you sleep last night, to the nearest half-hour? 
13. Number:    What is your favorite whole number between 1 and 10? 
14. Hand:      Are you right- or left-handed? 
15. Credit:    How many credits are you taking this semester? 

Note that the survey dataset is stored in the R object \(\texttt{\color{brown}{SURVEY}}\).

Height relationship between parents and children

Do you believe your height is inherited from your father or mother? If your answer is YES, it might make sense to predict a child’s height based on one parent’s height using linear regression. Let’s explore this idea through the survey dataset.

Our first step is to examine the correlations and scatter plots. We compute the correlation of the student’s height (\(\texttt{\color{brown}{Height}}\)) with the mother’s height (\(\texttt{\color{brown}{MO\_Height}}\)) and with the father’s height (\(\texttt{\color{brown}{FA\_Height}}\)), and draw a scatter plot for each pair.
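As a minimal sketch of this first step, the block below computes both correlations and draws both scatter plots. The data frame \(\texttt{toy}\) is a hypothetical stand-in with invented values and only borrows the column names of \(\texttt{SURVEY}\), so its numbers do not reproduce the survey results discussed next.

```r
# Hypothetical stand-in for SURVEY: all values are invented for illustration.
toy <- data.frame(
  Height    = c(70, 64, 72, 66, 68, 63, 74, 65),
  MO_Height = c(64, 63, 66, 65, 62, 61, 67, 64),
  FA_Height = c(70, 68, 73, 71, 69, 67, 72, 70)
)

# Pearson correlation of the student's height with each parent's height.
r_mother <- cor(toy$Height, toy$MO_Height)
r_father <- cor(toy$Height, toy$FA_Height)
print(c(mother = r_mother, father = r_father))

# One scatter plot per parent, with the response (Height) on the vertical axis.
plot(Height ~ MO_Height, data = toy,
     xlab = "Mother's height (inch)", ylab = "Student's height (inch)")
plot(Height ~ FA_Height, data = toy,
     xlab = "Father's height (inch)", ylab = "Student's height (inch)")
```

On the real data, the same calls would use \(\texttt{SURVEY}\) in place of \(\texttt{toy}\).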

Do you believe the variables demonstrate a sufficient level of linear association to justify proceeding with linear regression? Based on the results, the answer appears to be “NO.” As a common rule of thumb, a moderately strong correlation (around \(\pm0.7\), or at least \(\pm0.6\)) is recommended before fitting a regression line. (Note that the threshold depends heavily on the nature of the data, e.g., observational vs. experimental, and the field of study, e.g., natural science vs. social science.)

So, should we stop our analysis here?

Searching for the relationship within subsets

This type of issue is commonly encountered in data analysis. However, we can still move forward by identifying meaningful relationships within specific subsets. In this lab, we will use the \(\texttt{\color{brown}{subset()}}\) function to accomplish this.

Example: Subsetting with \(\texttt{\color{brown}{subset()}}\)

The \(\texttt{\color{brown}{subset()}}\) function takes three arguments: the first is the dataset to subset, the second is the condition for selecting observations (rows of the data table), and the third, named \(\texttt{\color{brown}{select}}\), specifies the variables (columns of the data table) to retain.
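The three arguments can be illustrated with a small sketch. The data frame \(\texttt{toy}\) and the cutoff of 68 inches are hypothetical choices made only for this example; they are not taken from the survey.

```r
# Hypothetical data: values invented for illustration, not the SURVEY data.
toy <- data.frame(
  Gender    = c("Male", "Female", "Male", "Female",
                "Male", "Female", "Male", "Female"),
  Height    = c(70, 64, 72, 66, 68, 63, 74, 65),
  MO_Height = c(64, 63, 66, 65, 62, 61, 67, 64)
)

# 1st argument: the dataset; 2nd: the row condition; 3rd: the columns to keep.
tall <- subset(toy, Height > 68, select = c(Height, MO_Height))
print(tall)
```

Column names are written without quotes in both the condition and \(\texttt{select}\), because \(\texttt{subset()}\) evaluates them inside the data frame.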

Now, let us subset our data for further regression analysis. A common way of subsetting is by the gender variable, which gives us two datasets, one per group.

Then, let us check the correlations within each of the two datasets.
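The two steps above (splitting by gender, then computing the correlation in each group) can be sketched as follows. The data frame \(\texttt{toy}\), the object names \(\texttt{male}\) and \(\texttt{female}\), and the coding of \(\texttt{Gender}\) as "Male"/"Female" are all assumptions for illustration; check the actual labels in \(\texttt{SURVEY}\) with \(\texttt{table(SURVEY\$Gender)}\) before reusing the conditions.

```r
# Hypothetical data: values and Gender labels are invented for illustration.
toy <- data.frame(
  Gender    = c("Male", "Female", "Male", "Female",
                "Male", "Female", "Male", "Female"),
  Height    = c(70, 64, 72, 66, 68, 63, 74, 65),
  MO_Height = c(64, 63, 66, 65, 62, 61, 67, 64),
  FA_Height = c(70, 68, 73, 71, 69, 67, 72, 70)
)

# One dataset per group; all columns are kept because select is omitted.
male   <- subset(toy, Gender == "Male")
female <- subset(toy, Gender == "Female")

# Correlation of Height with each parent's height, within each group.
cor_by_group <- rbind(
  male   = c(mother = cor(male$Height, male$MO_Height),
             father = cor(male$Height, male$FA_Height)),
  female = c(mother = cor(female$Height, female$MO_Height),
             father = cor(female$Height, female$FA_Height))
)
print(cor_by_group)
```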

Based on the result, we can confirm a reasonable level of correlation between \(\texttt{Height}\) and \(\texttt{MO\_Height}\) within the male group. So, let us use the male group for the linear regression analysis.

Regression analysis with the \(\texttt{lm()}\) built-in function

In the previous lab, we manually calculated the regression coefficients using the formulas. While that approach worked as intended, it’s not practical to repeat it every time. Here, we’ll use the built-in \(\texttt{\color{brown}{lm()}}\) function for convenience. With \(\texttt{\color{brown}{lm()}}\), we use a “tilde” (\(\texttt{\color{brown}{\sim}}\)) to link the response variable (\(\texttt{\color{brown}{Height}}\)) on the left with the explanatory variable (\(\texttt{\color{brown}{MO\_Height}}\)) on the right, and we fit the model on the male group.
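A minimal sketch of the call, using a hypothetical data frame \(\texttt{male}\) with invented values in place of the male subset of \(\texttt{SURVEY}\). For comparison, it also recomputes the coefficients with the usual least-squares formulas, slope \(= r\, s_y / s_x\) and intercept \(= \bar{y} - \text{slope}\cdot\bar{x}\).

```r
# Hypothetical male subset: values invented for illustration.
male <- data.frame(
  Height    = c(70, 72, 68, 74, 69, 71),
  MO_Height = c(64, 66, 62, 67, 65, 63)
)

# Response on the left of the tilde, explanatory variable on the right.
fit <- lm(Height ~ MO_Height, data = male)
print(coef(fit))

# The same coefficients from the least-squares formulas.
slope_manual     <- cor(male$MO_Height, male$Height) *
                    sd(male$Height) / sd(male$MO_Height)
intercept_manual <- mean(male$Height) - slope_manual * mean(male$MO_Height)
print(c(intercept = intercept_manual, slope = slope_manual))
```

The two printed pairs should agree up to rounding, which is a quick way to confirm that the formula has the response and explanatory variable on the intended sides.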

The \(\texttt{\color{brown}{lm()}}\) function provides much more information when its result is passed to the \(\texttt{\color{brown}{summary()}}\) function. While most of these details are beyond the scope of this class, we can read the coefficient of determination (\(\mathbf{R}^2\)), which is 0.3926 in this case, from the “Multiple R-squared” value in the summary.
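The sketch below shows one way to pull \(\mathbf{R}^2\) out of the summary object instead of reading it off the printed output. The data frame \(\texttt{male}\) is hypothetical (invented values), so the number it produces is not the 0.3926 reported for the survey.

```r
# Hypothetical male subset: values invented for illustration.
male <- data.frame(
  Height    = c(70, 72, 68, 74, 69, 71),
  MO_Height = c(64, 66, 62, 67, 65, 63)
)

fit <- lm(Height ~ MO_Height, data = male)
s   <- summary(fit)

# "Multiple R-squared" in the printed summary is stored as r.squared.
r2 <- s$r.squared
print(r2)

# With a single explanatory variable, R-squared is the squared correlation.
r <- cor(male$Height, male$MO_Height)
print(all.equal(r2, r^2))
```

If the 0.3926 above comes from the one-predictor model, the corresponding correlation has magnitude \(\sqrt{0.3926}\approx 0.63\), in line with the rule of thumb mentioned earlier.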

Recall that \(\mathbf{R}^2\) represents the proportion of the variation in students’ heights that is explained by the mother’s height through the regression model. Here, about 39% of the variation is explained, so the R-squared value is not particularly high and the model’s predictive power is fairly limited. (This is still a common situation with observational data that were not collected in a controlled lab study.)

What if we use both the mother’s height (\(\texttt{\color{brown}{MO\_Height}}\)) and the father’s height (\(\texttt{\color{brown}{FA\_Height}}\)) for prediction? This approach is known as multiple regression analysis. Even though we didn’t learn this procedure in class, it is easy to implement with \(\texttt{\color{brown}{lm()}}\): we add the second explanatory variable to the right-hand side of the formula with a plus sign.
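A minimal sketch of the two-predictor model, fitted next to the one-predictor model so that their \(\mathbf{R}^2\) values can be compared. The data frame \(\texttt{male}\) is hypothetical (invented values), so the size of the change it shows says nothing about the survey.

```r
# Hypothetical male subset: values invented for illustration.
male <- data.frame(
  Height    = c(70, 72, 68, 74, 69, 71),
  MO_Height = c(64, 66, 62, 67, 65, 63),
  FA_Height = c(70, 73, 69, 72, 71, 68)
)

# One explanatory variable versus two; "+" adds a term to the model.
fit_one <- lm(Height ~ MO_Height, data = male)
fit_two <- lm(Height ~ MO_Height + FA_Height, data = male)

r2_one <- summary(fit_one)$r.squared
r2_two <- summary(fit_two)$r.squared
print(c(one_predictor = r2_one, two_predictors = r2_two))
```

Adding an explanatory variable to a least-squares model with an intercept can never lower \(\mathbf{R}^2\), so the interesting question is how large the increase is, not whether there is one.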

Note that \(\mathbf{R}^2\) increases, but only slightly. Lastly, let’s make predictions. Since we are using data from male students, all interpretations should be limited to this subset.
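Predictions from a fitted model can be obtained with \(\texttt{predict()}\), as in the sketch below. The data frame \(\texttt{male}\) and the new mother’s heights (63 and 65 inches) are hypothetical values chosen for illustration.

```r
# Hypothetical male subset: values invented for illustration.
male <- data.frame(
  Height    = c(70, 72, 68, 74, 69, 71),
  MO_Height = c(64, 66, 62, 67, 65, 63)
)

fit <- lm(Height ~ MO_Height, data = male)

# New observations must use the same column name as in the formula.
new_students <- data.frame(MO_Height = c(63, 65))
pred <- predict(fit, newdata = new_students)
print(pred)

# The same values computed by hand: intercept + slope * x.
by_hand <- coef(fit)[["(Intercept)"]] +
           coef(fit)[["MO_Height"]] * new_students$MO_Height
print(by_hand)
```

The new values here lie inside the range of mother’s heights used to fit the model; predicting far outside that range would be extrapolation.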

Lab Questions

Larger animals tend to have larger brains, but is the increase in brain size proportional to the increase in body size? Allison and Cicchetti (1976) compiled data on the body and brain sizes of 62 mammal species, available in the dataset \(\texttt{\color{brown}{Mammal}}\). This dataset includes columns for the species name, average body mass (in kg), and average brain size (in g). Since both variables are highly right-skewed, we apply a log transformation for analysis.

  1. Calculate the correlation between the log body mass and log brain size.
  2. Perform the linear regression analysis for log brain size based on the log average body mass, using the \(\texttt{lm()}\) function.
  3. Predict the log of brain size when the natural log of body mass is 3.12 (a body mass of about 22.6 kg, or approximately 50 lb).

Click HERE to submit your answers.