Lab 6

Analyzing Bivariate Data with R (II)

We continue our discussion of bivariate data analysis. In particular, we focus on linear regression analysis using a dataset with continuous variables in this lab.

MLB Data: 2023 Regular Season Team Statistics

We again use the dataset of the MLB 2023 regular season, as follows:

01. Team:  Name of the team.
02. Win_P: Winning Percentage, ("# of wins" / "total # of games") * 100.
03. RPG:   Runs per Game, the average number of runs a team scores per game.
04. BA:    Batting Average, "the # of times a batter hits the ball and reaches base" / "the # of at bats by the batter".
05. OPS:   On-base Plus Slugging, the sum of on-base percentage and slugging percentage.
06. ORPG:  Opponent Runs per Game, the average number of runs the team's opponents score per game.
07. ERA:   Earned Run Average, the number of earned runs a pitcher allows per nine innings.
08. WHIP:  Walks and Hits per Inning Pitched, a measure of how well a pitcher keeps runners off the basepaths.

In the data set, the variable \(\texttt{\color{brown}{Win\_P}}\) measures an MLB team’s performance throughout the 2023 season. For offensive capability, we can use the variables \(\texttt{\color{brown}{RPG}}\), \(\texttt{\color{brown}{BA}}\), and \(\texttt{\color{brown}{OPS}}\); for defensive capability, we can use \(\texttt{\color{brown}{ORPG}}\), \(\texttt{\color{brown}{ERA}}\), and \(\texttt{\color{brown}{WHIP}}\).

From correlation analysis to regression analysis

One of the key differences between correlation and regression analysis lies in how we handle two continuous variables. In correlation analysis, both variables are treated equally, whereas in regression analysis, we distinctly classify one variable as explanatory (independent) and the other as response (dependent). In other words, correlation treats the two variables symmetrically, while regression treats them asymmetrically. For example, in the context of regression analysis with the MLB dataset, our goal is to predict or explain the response variable, \(\texttt{\color{brown}{Winning Percentage}}\), based on an explanatory variable chosen from measures of defensive, offensive, or combined capability.
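The symmetric/asymmetric distinction can be checked numerically. The sketch below is only an illustration: the data frame `mlb_toy` and all of its values are made up for this example and are not the real 2023 MLB statistics; only the column names `BA` and `Win_P` follow the variable list above.

```r
# Toy data: made-up values for illustration only, NOT the real 2023 MLB statistics.
mlb_toy <- data.frame(
  BA    = c(0.230, 0.240, 0.245, 0.250, 0.260, 0.270),
  Win_P = c(42.0, 47.5, 46.0, 52.0, 55.5, 58.0)
)

# Correlation is symmetric: swapping the two variables gives the same value.
r_xy <- cor(mlb_toy$BA, mlb_toy$Win_P)
r_yx <- cor(mlb_toy$Win_P, mlb_toy$BA)

# Regression is asymmetric: the slope depends on which variable is the response.
slope_win_on_ba <- coef(lm(Win_P ~ BA, data = mlb_toy))[["BA"]]
slope_ba_on_win <- coef(lm(BA ~ Win_P, data = mlb_toy))[["Win_P"]]

print(c(r_xy = r_xy, r_yx = r_yx))
print(c(slope_win_on_ba = slope_win_on_ba, slope_ba_on_win = slope_ba_on_win))
```

The two correlations agree, while the two slopes differ because each regression minimizes errors in a different variable.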

Regression equation

We have learned the formulas for the regression analysis: \[\begin{align*} \hat{y}\;=&\; a+bx;\\ \text{such that}&\begin{cases} b\;=& \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2};\\ a\;=& \bar{y}-b\bar{x},\end{cases} \end{align*}\] where \(y\) is the response variable and \(x\) is the explanatory variable.
Now, consider \(\texttt{\color{brown}{Batting Average}}\) as our explanatory variable, with \(\texttt{\color{brown}{Winning Percentage}}\) as the corresponding response variable. We can then use \(\textbf{R}\) to calculate the coefficients \(a\) and \(b\).
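One possible way to carry out this calculation is sketched below. It applies the formulas for \(b\) and \(a\) directly and then compares them with the built-in `lm()` fit. The data frame `mlb_toy` holds made-up placeholder values (not the real dataset), so the printed numbers will not match the coefficients obtained from the actual lab data.

```r
# Toy data: made-up values for illustration only, NOT the real 2023 MLB statistics.
mlb_toy <- data.frame(
  BA    = c(0.230, 0.240, 0.245, 0.250, 0.260, 0.270),
  Win_P = c(42.0, 47.5, 46.0, 52.0, 55.5, 58.0)
)

x <- mlb_toy$BA     # explanatory variable
y <- mlb_toy$Win_P  # response variable

# Slope b and intercept a, computed directly from the formulas above.
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)

# The same coefficients from R's built-in least-squares fit.
fit <- lm(Win_P ~ BA, data = mlb_toy)

print(c(a = a, b = b))
print(coef(fit))
```

With the real data, replacing `mlb_toy` by the lab's data frame should be the only change needed, assuming its columns are named `BA` and `Win_P`.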

Interpretation of coefficients

As emphasized in class, it is crucial to understand how to interpret the slope and intercept coefficients correctly. Below are the general interpretations:

Slope coefficient (b): As the explanatory (\(x\)) variable increases by one unit, the response variable (\(y\)) increases/decreases by \(b\) units on average.
Intercept (a): When the explanatory variable (\(x\)) is zero, the response variable (\(y\)) is \(a\) on average.

Please note that regression analysis describes average behavior, and the interpretations reflect this. In our example, the intercept is not of interest because we do not observe a batting average of 0 for any team. For the slope, we interpret it as follows:
“As Batting Average increases by one unit for an MLB team, the Winning Percentage increases by 4.27 percentage points on average.”

Visualization of regression line

Let’s visualize the regression line by overlaying it on the corresponding scatter plot.
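A minimal base-graphics sketch of such a plot follows. The data frame `mlb_toy` contains made-up placeholder values rather than the real statistics, and the axis labels, title, and colors are arbitrary choices for this illustration.

```r
# Toy data: made-up values for illustration only, NOT the real 2023 MLB statistics.
mlb_toy <- data.frame(
  BA    = c(0.230, 0.240, 0.245, 0.250, 0.260, 0.270),
  Win_P = c(42.0, 47.5, 46.0, 52.0, 55.5, 58.0)
)

# Fit the regression of winning percentage on batting average.
fit <- lm(Win_P ~ BA, data = mlb_toy)

# Scatter plot of the observed points.
plot(mlb_toy$BA, mlb_toy$Win_P,
     xlab = "Batting Average (BA)",
     ylab = "Winning Percentage (Win_P)",
     main = "Winning Percentage vs. Batting Average (toy data)",
     pch  = 19)

# Overlay the fitted regression line; abline() reads the
# intercept and slope from the lm object.
abline(fit, col = "red", lwd = 2)
```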

Prediction

Now it’s time for prediction! We can simply substitute the value of the explanatory variable into the regression equation. Suppose the Lehigh Valley IronPigs played in the MLB regular season and had a team batting average of 0.18. What would their expected winning percentage be?

(In principle, this prediction is not recommended: a batting average of 0.18 lies outside the range observed in the data, so it is an extrapolation.)
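The substitution can be sketched in two equivalent ways: by plugging the value into \(\hat{y}=a+bx\) by hand, or with `predict()`. The data frame `mlb_toy` holds made-up placeholder values, so the resulting prediction is not the answer for the real dataset. The last line flags whether the new value falls outside the observed range of the explanatory variable.

```r
# Toy data: made-up values for illustration only, NOT the real 2023 MLB statistics.
mlb_toy <- data.frame(
  BA    = c(0.230, 0.240, 0.245, 0.250, 0.260, 0.270),
  Win_P = c(42.0, 47.5, 46.0, 52.0, 55.5, 58.0)
)

fit <- lm(Win_P ~ BA, data = mlb_toy)

# Hypothetical team batting average from the text.
ba_new <- 0.18

# Option 1: substitute into y-hat = a + b * x by hand.
pred_formula <- coef(fit)[["(Intercept)"]] + coef(fit)[["BA"]] * ba_new

# Option 2: let predict() do the substitution; the column name
# in newdata must match the explanatory variable in the model.
pred_builtin <- predict(fit, newdata = data.frame(BA = ba_new))

# Extrapolation check: is the new value outside the observed range of BA?
is_extrapolation <- ba_new < min(mlb_toy$BA) || ba_new > max(mlb_toy$BA)

print(pred_formula)
print(pred_builtin)
print(is_extrapolation)
```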

Evaluation of Linear Regression

We used the linear regression line to make the prediction, so the quality of the prediction depends on how well the linear regression model fits the data. But how can we evaluate its performance? One way is through the coefficient of determination, \(\textbf{R}^2\), which measures the proportion of variability in the response variable explained by the linear regression model. In simple linear regression with a single explanatory variable, as in our case, this is equal to the squared correlation between the response variable and the explanatory variable.
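The sketch below computes the coefficient of determination in three ways that should agree for a simple linear regression: from the sums of squares, from the squared correlation, and from `summary()`. As before, `mlb_toy` is a made-up placeholder data frame, not the real dataset.

```r
# Toy data: made-up values for illustration only, NOT the real 2023 MLB statistics.
mlb_toy <- data.frame(
  BA    = c(0.230, 0.240, 0.245, 0.250, 0.260, 0.270),
  Win_P = c(42.0, 47.5, 46.0, 52.0, 55.5, 58.0)
)

fit <- lm(Win_P ~ BA, data = mlb_toy)
y   <- mlb_toy$Win_P

# 1) Proportion of variability explained: 1 - SS_residual / SS_total.
ss_res    <- sum(residuals(fit)^2)
ss_tot    <- sum((y - mean(y))^2)
r2_manual <- 1 - ss_res / ss_tot

# 2) Squared correlation between response and explanatory variable.
r2_cor <- cor(mlb_toy$BA, mlb_toy$Win_P)^2

# 3) Value reported by summary() for the fitted model.
r2_lm <- summary(fit)$r.squared

print(c(manual = r2_manual, squared_cor = r2_cor, from_lm = r2_lm))
```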

Example

Let us instead use \(\texttt{\color{brown}{ERA}}\) (Earned Run Average) as the explanatory variable. Derive the regression equation and visualize the regression line with the corresponding scatter plot.

Suppose the IronPigs played in the MLB and had a team ERA of 7. What would their expected winning percentage be? Also, evaluate the performance of the model by computing \(\textbf{R}^2\).
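Because this example repeats the same steps with a different explanatory variable, one option is to bundle them into a small helper. The function `simple_regression_report()` below is a hypothetical helper written for this illustration (it is not part of R or of the lab materials), and `mlb_toy` again contains made-up values, so the output is not the answer for the real data.

```r
# Toy data: made-up values for illustration only, NOT the real 2023 MLB statistics.
mlb_toy <- data.frame(
  ERA   = c(3.4, 3.8, 4.0, 4.3, 4.7, 5.2),
  Win_P = c(60.0, 55.0, 52.0, 49.0, 45.0, 40.0)
)

# Hypothetical helper: fits y ~ x, predicts at x_new, and reports R^2.
simple_regression_report <- function(data, x_name, y_name, x_new) {
  # Build the formula y_name ~ x_name from the column names.
  fit <- lm(reformulate(x_name, response = y_name), data = data)
  # One-row data frame whose column name matches the explanatory variable.
  new_data <- setNames(data.frame(x_new), x_name)
  list(
    intercept  = coef(fit)[["(Intercept)"]],
    slope      = coef(fit)[[x_name]],
    prediction = unname(predict(fit, newdata = new_data)),
    r_squared  = summary(fit)$r.squared
  )
}

report <- simple_regression_report(mlb_toy, "ERA", "Win_P", x_new = 7)
print(report)

# Scatter plot with the fitted line, as in the batting average example.
plot(mlb_toy$ERA, mlb_toy$Win_P, xlab = "ERA", ylab = "Win_P", pch = 19)
abline(a = report$intercept, b = report$slope, col = "red", lwd = 2)
```

In the toy data a higher ERA goes with a lower winning percentage, so the slope is negative. An ERA of 7 also lies outside the toy range, so the same extrapolation caveat applies here.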

Lab Questions

In the previous lab, we considered the “net runs per game” variable as a measure of an MLB team’s overall capability:
\[\texttt{\color{brown}{NRPG}}=\texttt{\color{brown}{RPG}}-\texttt{\color{brown}{ORPG}}.\]

Let us perform a linear regression analysis to predict the winning percentage based on the NRPG. Derive the slope and intercept coefficients.

Predict the winning percentage of the IronPigs when their hypothetical NRPG is -3.

Compute the coefficient of determination \(\textbf{R}^2\).

Provide the relevant interpretations for the slope and intercept coefficients.

Click HERE to submit your answers.