Lab 3
Basic R Object: Data Frame
A data frame in R stores a data set where columns represent variables and rows correspond to individuals. Each element in the data frame is an observation or measurement of a variable for a specific individual. (It’s similar to an Excel sheet.)
A data frame is typically given a name, which tells R’s functions which data set to work with. For example, the built-in data set \(\texttt{\color{brown}{mtcars}}\) contains specifications of cars from the 1970s. This data frame has 11 variables and 32 observations (individual cars). The \(\texttt{\color{brown}{head()}}\) function shows the variable names and the first six observations, and the \(\texttt{\color{brown}{help()}}\) function displays the variable descriptions.
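A minimal sketch of a first look at the data frame. It assumes only base R, where \(\texttt{\color{brown}{mtcars}}\) is available without loading anything; \(\texttt{\color{brown}{dim()}}\) and \(\texttt{\color{brown}{names()}}\) are not mentioned above and are included only as a convenient check.

```r
# mtcars ships with base R, so nothing needs to be loaded first.
dim(mtcars)      # number of rows (cars) and columns (variables)
names(mtcars)    # variable names
head(mtcars)     # variable names plus the first six observations

# Uncomment in an interactive session to open the variable descriptions:
# help(mtcars)
```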
Subsetting Data Frame
Sometimes, we are interested in extracting one of the variables (columns) from a data frame for further analysis.
When your data frame has variable names, this can be done easily with the \(\texttt{\color{brown}{\$}}\) operator. For example, use \(\texttt{\color{brown}{mtcars\$hp}}\) to extract the horsepower variable, \(\texttt{\color{brown}{hp}}\).
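As a small illustration, the extracted column is an ordinary numeric vector with one value per car. The name \(\texttt{\color{brown}{hp\_values}}\) is hypothetical and chosen only for this sketch.

```r
hp_values <- mtcars$hp   # hp_values is a hypothetical name for this sketch

class(hp_values)         # type of the extracted vector
length(hp_values)        # one value per car
head(hp_values)          # first six horsepower values
```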
Another way to extract a variable is to use an index with the \(\texttt{\color{brown}{[]}}\) operator. Recall that we can select a specific element (or elements) of a vector using its index; e.g., \(\texttt{\color{brown}{HP[2]}}\) returns the second element of \(\texttt{\color{brown}{HP}}\), and \(\texttt{\color{brown}{HP[4:8]}}\) returns the five elements from the fourth to the eighth. Similarly, we can use \(\texttt{\color{brown}{mtcars[,4]}}\) to extract the variable in the fourth column, \(\texttt{\color{brown}{hp}}\).
Note that we intentionally leave the first index empty, as it selects specific rows (individual cars). By leaving it empty, we select all rows (cars) of the \(\texttt{\color{brown}{hp}}\) variable.
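The vector and data frame indexing described above can be sketched side by side. Here \(\texttt{\color{brown}{HP}}\) is assumed to hold the horsepower values, and \(\texttt{\color{brown}{col4}}\) is a hypothetical name.

```r
HP <- mtcars$hp        # assumed definition of the HP vector used above

HP[2]                  # second element
HP[4:8]                # fourth through eighth elements (five values)

col4 <- mtcars[, 4]    # empty row index: all rows of the fourth column
identical(col4, HP)    # checks that both approaches give the same vector
```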
We can also extract the value of a particular variable for a specific car; e.g., \(\texttt{\color{brown}{mtcars[3,4]}}\) retrieves the horsepower of “Datsun 710”. Similarly, \(\texttt{\color{brown}{mtcars[4:8,4]}}\) extracts the five elements from the fourth to the eighth of the \(\texttt{\color{brown}{hp}}\) variable.
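A short sketch of row-and-column indexing on \(\texttt{\color{brown}{mtcars}}\). Selecting the column by its name, as in the last line, is an alternative not used in the text above.

```r
rownames(mtcars)[3]    # name of the car stored in row 3
mtcars[3, 4]           # horsepower of that car

mtcars[4:8, 4]         # horsepower of the cars in rows 4 to 8
mtcars[4:8, "hp"]      # same selection, with the column given by name
```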
Lastly, we can simultaneously select or deselect multiple variables and/or observations by combining \(\texttt{\color{brown}{[]}}\), \(\texttt{\color{brown}{c()}}\) and \(\texttt{\color{brown}{-}}\).
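The combinations mentioned above might look like the following. The particular rows and columns are arbitrary choices for illustration.

```r
mtcars[1:3, c(1, 4)]               # rows 1 to 3, columns 1 and 4 (mpg and hp)
mtcars[c(1, 5), c("mpg", "hp")]    # rows 1 and 5, columns given by name

mtcars[-(1:3), c(1, 4)]            # a negative index drops rows 1 to 3
mtcars[1:3, -c(1, 4)]              # drops columns 1 and 4, keeps the rest
```

A negative index removes the listed positions instead of selecting them, so \(\texttt{\color{brown}{-(1:3)}}\) needs the parentheses: \(\texttt{\color{brown}{-1:3}}\) would mean the sequence from -1 to 3.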
I believe you can now recognize the patterns. Mastering data subsetting is a crucial skill in data analysis, and we have just learned the fundamental procedures using R.
Example
Consider the \(\texttt{\color{brown}{mtcars}}\) data again.
Extract the weight variable using the \(\texttt{\color{brown}{\$}}\) operator.
Extract the miles per gallon of the Honda Civic using the \(\texttt{\color{brown}{[]}}\) operator.
You can check the car names with the \(\texttt{\color{brown}{rownames(mtcars)}}\) command.
Extract the \(\texttt{\color{brown}{mpg}}\) and \(\texttt{\color{brown}{hp}}\) variables using the \(\texttt{\color{brown}{[]}}\) operator.
Extract the \(\texttt{\color{brown}{mpg}}\) and \(\texttt{\color{brown}{hp}}\) variables for all cars except the \(\texttt{\color{brown}{Merc}}\) cars.
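One possible way to work through this example is sketched below; it is not the only valid approach. The use of \(\texttt{\color{brown}{grep()}}\) to find the Merc rows is an assumption beyond the tools introduced above (reading the row numbers off \(\texttt{\color{brown}{rownames(mtcars)}}\) works equally well), and \(\texttt{\color{brown}{merc\_rows}}\) is a hypothetical name.

```r
# Weight variable (named wt in mtcars) with $
mtcars$wt

# Miles per gallon of the Honda Civic with []
rownames(mtcars)                   # look up how the car is named
mtcars["Honda Civic", "mpg"]       # row by name, column by name

# mpg and hp for all cars
mtcars[, c("mpg", "hp")]

# mpg and hp without the Merc cars
merc_rows <- grep("^Merc", rownames(mtcars))   # row numbers whose name starts with "Merc"
mtcars[-merc_rows, c("mpg", "hp")]
```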
Measure of Variation
During the lectures, we’ve learned several measures of variation, including “variance”, “standard deviation”, “range”, and “interquartile range (IQR)”. While calculating these by hand can be tedious, R provides quick and easy results using the built-in functions \(\texttt{\color{brown}{var()}}\), \(\texttt{\color{brown}{sd()}}\), \(\texttt{\color{brown}{range()}}\), and \(\texttt{\color{brown}{IQR()}}\). Note that \(\texttt{\color{brown}{range()}}\) returns the minimum and the maximum rather than their difference.
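A minimal sketch of these functions, applied to the horsepower variable of \(\texttt{\color{brown}{mtcars}}\) as an arbitrary example. The name \(\texttt{\color{brown}{hp}}\) for the vector and the use of \(\texttt{\color{brown}{diff()}}\) to turn the two values from \(\texttt{\color{brown}{range()}}\) into a single number are choices made for this illustration.

```r
hp <- mtcars$hp      # example variable; any numeric vector works

var(hp)              # sample variance (divides by n - 1)
sd(hp)               # sample standard deviation, the square root of var(hp)
range(hp)            # two values: minimum and maximum
diff(range(hp))      # the range as a single number: maximum - minimum
IQR(hp)              # interquartile range: Q3 - Q1
```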
Lab Questions
1. The famous “iris” data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris.
   i. Extract the \(\texttt{\color{brown}{Sepal.Length}}\) and \(\texttt{\color{brown}{Petal.Length}}\) variables using the \(\texttt{\color{brown}{\$}}\) operator, and save them in \(\texttt{\color{brown}{SL}}\) and \(\texttt{\color{brown}{PL}}\), respectively.
   ii. Compare the variances of \(\texttt{\color{brown}{SL}}\) and \(\texttt{\color{brown}{PL}}\). Which variable exhibits greater variation in terms of variance?
   iii. Extract the \(\texttt{\color{brown}{Sepal.Width}}\) and \(\texttt{\color{brown}{Petal.Width}}\) variables using the \(\texttt{\color{brown}{[]}}\) operator, and save them in \(\texttt{\color{brown}{SW}}\) and \(\texttt{\color{brown}{PW}}\), respectively.
   iv. Compare the IQRs of \(\texttt{\color{brown}{SW}}\) and \(\texttt{\color{brown}{PW}}\). Which variable exhibits greater variation in terms of IQR?
2. Let us recall the formulas for the sample variance.
   i. The sample variance can be computed from the following formula: \[s^2=\frac{1}{n-1}\sum_{i=1}^n\left(x_i-\bar{x}\right)^2,\] where \(x\) is a numerical variable of length \(n\). Calculate the variance of \(\texttt{\color{brown}{SL}}\) without using the \(\texttt{\color{brown}{var()}}\) function. (You could use the \(\texttt{\color{brown}{mean()}}\), \(\texttt{\color{brown}{sum()}}\), and \(\texttt{\color{brown}{length()}}\) functions.) Verify that your result matches the value obtained in 1-ii.
   ii. Additionally, we have learned an alternative formula for variance calculation (also known as the ‘computing formula’): \[s^2=\frac{1}{n-1}\left\{\sum_{i=1}^nx_i^2-n\bar{x}^2\right\}.\] As in the previous case, calculate the variance of \(\texttt{\color{brown}{PL}}\) using this formula, based on the \(\texttt{\color{brown}{mean()}}\), \(\texttt{\color{brown}{sum()}}\), and \(\texttt{\color{brown}{length()}}\) functions. Verify that your result matches the value obtained in 1-ii.
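As a reminder of why the two formulas agree, here is a short derivation. It expands the square and uses only the definition of the sample mean, \(\sum_{i=1}^n x_i = n\bar{x}\):

\[\begin{aligned}
\sum_{i=1}^n\left(x_i-\bar{x}\right)^2
&= \sum_{i=1}^n x_i^2 - 2\bar{x}\sum_{i=1}^n x_i + n\bar{x}^2 \\
&= \sum_{i=1}^n x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 \\
&= \sum_{i=1}^n x_i^2 - n\bar{x}^2.
\end{aligned}\]

Dividing both sides by \(n-1\) gives the computing formula, so both calculations should return the same value up to rounding error.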
Click HERE to submit your answers.