01. Team: Name of the team
02. Win_P: Winning Percentage, ("# of wins" / "total # of games") * 100.
03. RPG: Runs per Game, the average number of runs a team or player scores in each game.
04. BA: Batting Average, "the # of times a batter hits a ball and reaches the base"/"the # of at bats by a batter".
05. OPS: On-base plus slugging, the sum of on-base percentage and slugging percentage.
06. ORPG: Opponent Runs per Game, the average number of runs a team’s opponents score in each game.
07. ERA: Earned Run Average, the number of earned runs a pitcher allows per nine innings.
08. WHIP: Walks and Hits per Inning Pitched, the average number of walks and hits a pitcher allows per inning; it shows how well a pitcher keeps runners off the basepaths, one of a pitcher’s main goals.
Lab 5
Analyzing Bivariate Data with R (I)
There are many situations where we are interested in the relationship between two variables. For the discrete bivariate case, a conditional frequency table can be used. For continuous variables, correlation and linear regression analyses are appropriate. In this lab, we focus on the correlation analysis based on a dataset with continuous variables.
MLB Data: 2023 Regular Season Team Statistics
In this lab, let’s use the complete dataset from the 2023 regular season. As depicted in the movie Moneyball (2011), the use of baseball statistics and analytics to measure player and team performance has become widespread. While numerous baseball statistics are used in the field, we will focus on a few key ones that measure the offensive and defensive capabilities of MLB teams; they are defined in the numbered variable list above.
In the dataset, the variable \(\texttt{\color{brown}{Win\_P}}\) measures an MLB team’s performance throughout the 2023 season. For offensive capability, we can use the variables \(\texttt{\color{brown}{RPG}}\), \(\texttt{\color{brown}{BA}}\), and \(\texttt{\color{brown}{OPS}}\); for defensive capability, we can use \(\texttt{\color{brown}{ORPG}}\), \(\texttt{\color{brown}{ERA}}\), and \(\texttt{\color{brown}{WHIP}}\).
Offensive Capability
In baseball, offensive capability is closely tied to the performance of batters. We can expect a positive association between offensive measures and winning percentage. This relationship can be quantified using correlation, which provides a unit-less value between -1 and 1. A value close to 1 or -1 indicates a strong positive or negative linear association, respectively, while a value near 0 indicates little or no linear association between the variables. In R, the function \(\texttt{\color{brown}{cor()}}\) can be used to calculate correlation:
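As a minimal sketch of the call, assume the data sit in a data frame named \(\texttt{\color{brown}{mlb}}\). That name and the six rows below are hypothetical, made-up values for fictional teams, not the actual 2023 statistics used in this lab.

```r
# Hypothetical toy data: made-up values for six fictional teams,
# NOT the actual 2023 MLB statistics used in this lab.
mlb <- data.frame(
  Team  = c("A", "B", "C", "D", "E", "F"),
  Win_P = c(61.7, 55.6, 51.9, 48.1, 43.2, 37.0),
  RPG   = c(5.3, 4.6, 4.9, 4.4, 4.5, 4.0)
)

# Pearson correlation (the default method) between runs per game
# and winning percentage
r_rpg <- cor(mlb$RPG, mlb$Win_P)
print(r_rpg)
```

The order of the two arguments does not matter, since correlation is symmetric in the two variables.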
Based on the correlation value, we can describe the level of linear association (e.g., a moderately high positive relationship). However, it’s also important to visually confirm the actual association, as this helps us better understand the data. For this purpose, we can use the \(\texttt{\color{brown}{plot()}}\) function in R:
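A basic scatter plot might look like the sketch below. The data frame \(\texttt{\color{brown}{mlb}}\) and its values are again hypothetical placeholders, not the real 2023 data.

```r
# Hypothetical toy data: made-up values for six fictional teams,
# NOT the actual 2023 MLB statistics used in this lab.
mlb <- data.frame(
  Team  = c("A", "B", "C", "D", "E", "F"),
  Win_P = c(61.7, 55.6, 51.9, 48.1, 43.2, 37.0),
  RPG   = c(5.3, 4.6, 4.9, 4.4, 4.5, 4.0)
)

# Scatter plot: first argument goes on the x-axis, second on the y-axis
plot(mlb$RPG, mlb$Win_P)
```

Here the offensive measure is placed on the x-axis and winning percentage on the y-axis, so an upward-sloping cloud of points corresponds to a positive correlation.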
We can rewrite labels by using the \(\texttt{\color{brown}{xlab}}\) and \(\texttt{\color{brown}{ylab}}\) arguments.
Also, change the plotting character and color with the \(\texttt{\color{brown}{pch}}\) and \(\texttt{\color{brown}{col}}\) arguments.
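The sketch below combines these options in one call. The data frame \(\texttt{\color{brown}{mlb}}\), its values, and the particular label text, plotting character, and color are all illustrative assumptions.

```r
# Hypothetical toy data: made-up values for six fictional teams,
# NOT the actual 2023 MLB statistics used in this lab.
mlb <- data.frame(
  Team  = c("A", "B", "C", "D", "E", "F"),
  Win_P = c(61.7, 55.6, 51.9, 48.1, 43.2, 37.0),
  RPG   = c(5.3, 4.6, 4.9, 4.4, 4.5, 4.0)
)

# Scatter plot with custom axis labels, plotting character, and color
plot(mlb$RPG, mlb$Win_P,
     xlab = "Runs per Game",        # x-axis label
     ylab = "Winning Percentage",   # y-axis label
     pch  = 19,                     # 19 = solid circle
     col  = "blue")                 # point color
```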

Defensive Capability
Unlike offensive capability, defensive capability in baseball is closely tied to the performance of pitchers. For defensive measures, smaller values indicate better performance, leading us to expect a negative association with winning percentage. For example, \(\texttt{\color{brown}{ERA}}\) (Earned Run Average) is a classic measure of a pitcher’s ability. Let’s examine the correlation and create a scatter plot to confirm this relationship.
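One way to carry this out is sketched below. The data frame \(\texttt{\color{brown}{mlb}}\) and its values are hypothetical placeholders chosen only so the code runs; the correlation from the real data will differ.

```r
# Hypothetical toy data: made-up values for six fictional teams,
# NOT the actual 2023 MLB statistics used in this lab.
mlb <- data.frame(
  Team  = c("A", "B", "C", "D", "E", "F"),
  Win_P = c(61.7, 55.6, 51.9, 48.1, 43.2, 37.0),
  ERA   = c(3.6, 4.2, 3.9, 4.3, 4.6, 5.1)
)

# Correlation between ERA and winning percentage
r_era <- cor(mlb$ERA, mlb$Win_P)
print(r_era)

# Scatter plot to check the direction and shape of the association
plot(mlb$ERA, mlb$Win_P,
     xlab = "Earned Run Average",
     ylab = "Winning Percentage",
     pch  = 19,
     col  = "red")
```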
Overall Capability
The variables ‘runs per game’ and ‘opponent runs per game’ are valid offensive and defensive metrics in baseball. Let’s examine the correlation and create scatter plots to confirm their relationship.
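The sketch below assumes that each variable is compared with winning percentage, and it places the two scatter plots side by side. The data frame \(\texttt{\color{brown}{mlb}}\) and its values are hypothetical placeholders, not the real 2023 data.

```r
# Hypothetical toy data: made-up values for six fictional teams,
# NOT the actual 2023 MLB statistics used in this lab.
mlb <- data.frame(
  Team  = c("A", "B", "C", "D", "E", "F"),
  Win_P = c(61.7, 55.6, 51.9, 48.1, 43.2, 37.0),
  RPG   = c(5.3, 4.6, 4.9, 4.4, 4.5, 4.0),
  ORPG  = c(4.0, 4.4, 4.3, 4.7, 4.9, 5.4)
)

# Correlation of each measure with winning percentage
r_rpg  <- cor(mlb$RPG,  mlb$Win_P)
r_orpg <- cor(mlb$ORPG, mlb$Win_P)
print(c(RPG = r_rpg, ORPG = r_orpg))

# Two scatter plots in one row; restore the old layout afterwards
op <- par(mfrow = c(1, 2))
plot(mlb$RPG, mlb$Win_P,
     xlab = "Runs per Game", ylab = "Winning Percentage", pch = 19)
plot(mlb$ORPG, mlb$Win_P,
     xlab = "Opponent Runs per Game", ylab = "Winning Percentage", pch = 19)
par(op)
```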
Now, let’s say we want to combine these two statistics to measure the overall capability of an MLB team. The idea is simple: by subtracting \(\texttt{\color{brown}{ORPG}}\) from \(\texttt{\color{brown}{RPG}}\), we obtain the “net runs per game” relative to the opposing teams: \[\texttt{\color{brown}{NRPG}}=\texttt{\color{brown}{RPG}}-\texttt{\color{brown}{ORPG}}.\] Create the new variable in R, then check its correlation with winning percentage and display the scatter plot.
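A minimal sketch of these steps follows. The data frame \(\texttt{\color{brown}{mlb}}\) and its values are hypothetical placeholders, so the printed correlation says nothing about the real 2023 season.

```r
# Hypothetical toy data: made-up values for six fictional teams,
# NOT the actual 2023 MLB statistics used in this lab.
mlb <- data.frame(
  Team  = c("A", "B", "C", "D", "E", "F"),
  Win_P = c(61.7, 55.6, 51.9, 48.1, 43.2, 37.0),
  RPG   = c(5.3, 4.6, 4.9, 4.4, 4.5, 4.0),
  ORPG  = c(4.0, 4.4, 4.3, 4.7, 4.9, 5.4)
)

# New column: net runs per game (element-wise subtraction)
mlb$NRPG <- mlb$RPG - mlb$ORPG

# Correlation with winning percentage
r_nrpg <- cor(mlb$NRPG, mlb$Win_P)
print(r_nrpg)

# Scatter plot of the combined measure against winning percentage
plot(mlb$NRPG, mlb$Win_P,
     xlab = "Net Runs per Game",
     ylab = "Winning Percentage",
     pch  = 19,
     col  = "darkgreen")
```

Assigning to \(\texttt{\color{brown}{mlb\$NRPG}}\) adds the variable as a new column, so it can be used like any other column afterwards.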
Do you think the created \(\texttt{\color{brown}{NRPG}}\) would be a good overall performance measure?
Lab Questions
- Among the six variables \(\texttt{\color{brown}{RPG}}\), \(\texttt{\color{brown}{BA}}\), \(\texttt{\color{brown}{OPS}}\), \(\texttt{\color{brown}{ORPG}}\), \(\texttt{\color{brown}{ERA}}\), and \(\texttt{\color{brown}{WHIP}}\), we wish to choose the best offensive measure and the best defensive measure. Please perform correlation analysis and provide your conclusion.
- Note that the \(\texttt{\color{brown}{NRPG}}\) we have defined is just one way to combine \(\texttt{\color{brown}{RPG}}\) and \(\texttt{\color{brown}{ORPG}}\). We can also consider the “ratio of runs per game”, obtained by dividing \(\texttt{\color{brown}{RPG}}\) by \(\texttt{\color{brown}{ORPG}}\). Create a new variable \(\texttt{\color{brown}{RRPG}}\) for the “ratio of runs per game” and compare the result with \(\texttt{\color{brown}{NRPG}}\). Which one would be more effective in measuring the overall capability of an MLB team?
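The mechanics for both questions can be sketched as follows. The data frame \(\texttt{\color{brown}{mlb}}\) and every value in it are hypothetical placeholders, so the numbers printed here carry no information about the real data; the conclusions must come from the actual 2023 dataset.

```r
# Hypothetical toy data: made-up values for six fictional teams,
# NOT the actual 2023 MLB statistics used in this lab.
mlb <- data.frame(
  Team  = c("A", "B", "C", "D", "E", "F"),
  Win_P = c(61.7, 55.6, 51.9, 48.1, 43.2, 37.0),
  RPG   = c(5.3, 4.6, 4.9, 4.4, 4.5, 4.0),
  BA    = c(0.268, 0.255, 0.259, 0.248, 0.250, 0.236),
  OPS   = c(0.790, 0.745, 0.760, 0.725, 0.730, 0.690),
  ORPG  = c(4.0, 4.4, 4.3, 4.7, 4.9, 5.4),
  ERA   = c(3.6, 4.2, 3.9, 4.3, 4.6, 5.1),
  WHIP  = c(1.18, 1.27, 1.24, 1.31, 1.36, 1.45)
)

# Correlation of each of the six measures with winning percentage
measures <- c("RPG", "BA", "OPS", "ORPG", "ERA", "WHIP")
cors <- sapply(mlb[measures], function(x) cor(x, mlb$Win_P))
print(round(cors, 3))

# Two ways to combine RPG and ORPG: difference and ratio
mlb$NRPG <- mlb$RPG - mlb$ORPG
mlb$RRPG <- mlb$RPG / mlb$ORPG

# Compare how strongly each combined measure tracks winning percentage
combined <- c(NRPG = cor(mlb$NRPG, mlb$Win_P),
              RRPG = cor(mlb$RRPG, mlb$Win_P))
print(round(combined, 3))
```

When comparing measures, the absolute value of the correlation is what indicates strength, since defensive measures are expected to correlate negatively with winning percentage.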