Lab 3

Basic R Object: Data Frame

A data frame in R stores a data set where columns represent variables and rows correspond to individuals. Each element in the data frame is an observation or measurement of a variable for a specific individual. (It’s similar to an Excel sheet.)

A data frame is typically given a name, which tells R’s functions which dataset to work with. For example, a built-in data set \(\texttt{\color{brown}{mtcars}}\) contains car specification information in 1970’s. This data frame has 11 variables and 32 observations (individual cars). The following \(\texttt{\color{brown}{head()}}\) function shows the variables names and first six observations. Also, we can use the \(\texttt{\color{brown}{help()}}\) function for the variable descriptions.

Subsetting Data Frame

Sometimes, we are interested in extracting one of the variables (columns) from a data frame for further analysis.

When your data frame has variable names, then this can be easily done by \(\texttt{\color{brown}{"\$"}}\) command. For example, use \(\texttt{\color{brown}{mtcars\$hp}}\) to extract \(\texttt{\color{brown}{horsepower}}\) variable.

Another way to extract a variable is to use an index with the \(\texttt{\color{brown}{[]}}\) operator. Recall that we are able to select a specific element (or elements) in a vector using its index, e.g., \(\texttt{\color{brown}{HP[2]}}\) returns the second element of the \(\texttt{\color{brown}{HP}}\), and \(\texttt{\color{brown}{HP[4:8]}}\) returns the five elements from the fourth to the eighth. Similarly, we can use \(\texttt{\color{brown}{mtcars[,4]}}\) to extract the variable in the fourth column, \(\texttt{\color{brown}{horsepower}}\).

Note that we intentionally leave the first argument empty, as it is used to select a specific row (an individual car). By leaving it empty, we are selecting all rows (cars) of the \(\texttt{\color{brown}{horsepower}}\) variable.

We can certainly extract the value of a particular variable for a specific car, e.g. \(\texttt{\color{brown}{mtcars[3,4]}}\) retrieves the horsepower of “Datsun 710”. Similarly, we can extract the five elements from the fourth to the eighth of \(\texttt{\color{brown}{horsepower}}\) variable as follows:

Lastly, we can simultaneously select or deselect multiple variables and/or observations by combining \(\texttt{\color{brown}{[]}}\), \(\texttt{\color{brown}{c()}}\) and \(\texttt{\color{brown}{-}}\).

I believe you can now recognize the patterns. Mastering data subsetting is a crucial skill in data analysis, and we have just learned the fundamental procedures using R.

Example

Consider the \(\texttt{\color{brown}{mtcars}}\) data again.

  1. Extract the weight variable using \(\texttt{\color{brown}{\$}}\) command.

  2. Extract the Miles/gallon of Honda Civic using \(\texttt{\color{brown}{[]}}\) command.
    You can check the car names with the \(\texttt{\color{brown}{rownames(mtcars)}}\) command.

  3. Extract the \(\texttt{\color{brown}{mpg}}\) and \(\texttt{\color{brown}{hp}}\) variables using \(\texttt{\color{brown}{[]}}\) command.

  4. Extract the \(\texttt{\color{brown}{mpg}}\) and \(\texttt{\color{brown}{hp}}\) variables except for the \(\texttt{\color{brown}{Merc}}\) cars.

Measure of Variation

During the lectures, we’ve learned several measures of variation, including “variance”, “standard deviation”, “range”, and “interquartile Range (IQR)”. While calculating these by hand can be tedious, R provides quick and easy results using the built-in functions: \(\texttt{\color{brown}{var()}}\), \(\texttt{\color{brown}{sd()}}\), \(\texttt{\color{brown}{range()}}\), and \(\texttt{\color{brown}{IQR()}}\).

Lab Questions

  1. The famous “iris” data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris.
  1. Extract the \(\texttt{\color{brown}{Sepal.Length}}\) and \(\texttt{\color{brown}{Petal.Length}}\) variables using the \(\texttt{\color{brown}{\$}}\) command, and save those in \(\texttt{\color{brown}{SL}}\) and \(\texttt{\color{brown}{PL}}\), respectively.

  2. Compare the variances of \(\texttt{\color{brown}{SL}}\) and \(\texttt{\color{brown}{PL}}\). Which variable exhibits greater variation in terms of variance?

  3. Extract the \(\texttt{\color{brown}{Sepal.Width}}\) and \(\texttt{\color{brown}{Petal.Width}}\) variables using the \(\texttt{\color{brown}{[]}}\) command, and save those in \(\texttt{\color{brown}{SW}}\) and \(\texttt{\color{brown}{PW}}\), respectively.

  4. Compare the IQRs of \(\texttt{\color{brown}{SW}}\) and \(\texttt{\color{brown}{PW}}\). Which variable exhibits greater variation in terms of IQR?

  1. Let us recall the fomulas for the sample variance.
  1. The sample variance can be computed from the following formula: \[s^2=\frac{1}{n-1}\sum_{i=1}^n\left(x_i-\bar{x}\right)^2,\] where \(x\) is a numerical variable of length \(n\). Calculate the variance of \(\texttt{\color{brown}{SL}}\) without using \(\texttt{\color{brown}{var()}}\) function. (You could use \(\texttt{\color{brown}{mean()}}\), \(\texttt{\color{brown}{sum()}}\), and \(\texttt{\color{brown}{length()}}\) functions.) Verify that your result matches the value provided in 1-ii.
  1. Additionally, we have learned an alternative formula for variance calculation (also known as the ‘computing formula’): \[s^2=\frac{1}{n-1}\left\{\sum_{i=1}^nx_i^2-n\bar{x}^2\right\}.\] Similar to the previous case, calculate the variance of \(\texttt{\color{brown}{PL}}\) using this formula, based on the \(\texttt{\color{brown}{mean()}}\), \(\texttt{\color{brown}{sum()}}\), and \(\texttt{\color{brown}{length()}}\) functions. Verify that your result matches the value provided in 1-ii.

Click HERE to submit your answers.