Lab 10

Inferential Statistics (I)

In inferential statistics, we assume there is an unknown target parameter that characterizes the population. In this class, our target parameter is usually a population mean or proportion; however, it could be any characteristic of the population that interests us. Our goal is to estimate or make a claim about this parameter using various statistical procedures, such as interval estimation or hypothesis testing. These procedures are widely used across disciplines, and understanding these concepts will be greatly beneficial for your future studies in your major.

Random Sample

Let’s start with the concept of a random sample. Our population represents a LARGE set of data, and we typically assume we have limited or no information about it. For example, suppose we are interested in the average height of male students at Lehigh. In this case, our population consists of the height measurements of all male students at Lehigh. In this lab, we will use a hypothetical population with a mean \(\mu=70.77\) and a standard deviation \(\sigma=3.5\) for this scenario. Note that, for illustration purposes, we have access to all measurement data in this example.

From the population, we can randomly draw a sample of size \(n=50\) using the \(\texttt{\color{brown}{sample()}}\) function. Because the population is much larger than the sample, we can treat the observations as independent and identically distributed.
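A minimal sketch of this step is shown below. It assumes the hypothetical population can be stood in for by 100,000 values simulated with `rnorm()`; the names `population` and `heights`, the population size, and the seed are illustrative choices, not the lab's actual data. Because the population is simulated, its mean and standard deviation are only approximately 70.77 and 3.5.

```r
# Assumed stand-in for the lab's hypothetical population of heights.
set.seed(10)                                   # arbitrary seed, for reproducibility only
population <- rnorm(100000, mean = 70.77, sd = 3.5)

# Draw a simple random sample of n = 50 (without replacement by default).
n <- 50
heights <- sample(population, size = n)

# Look at the sample: its histogram and its mean.
hist(heights, main = "Random sample of 50 heights", xlab = "Height")
mean(heights)
```

Removing the `set.seed()` call and rerunning the last few lines gives a different sample each time.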

Each time you repeat the sampling, the 50 measurements will differ due to sampling variability, causing the histogram’s shape to change.

Confidence Interval (CI)

Form of confidence interval

The confidence interval is the most widely used interval estimation procedure for a target parameter. When a population mean \(\mu\) is the target parameter and the population standard deviation \(\sigma\) is known, we use the following formula for constructing the interval: \[ [\text{Lower bound},\;\text{Upper bound}]=\left[\bar{x}-|z_{\alpha/2}|\times \tfrac{\sigma}{\sqrt{n}},\;\bar{x}+|z_{\alpha/2}|\times \tfrac{\sigma}{\sqrt{n}}\right] \]

  • The sample mean \(\bar{x}\) serves as the point estimate, representing the best estimate of \(\mu\) based on the observed sample.
  • \(|z_{\alpha/2}|\) is the cut-off value that reflects the confidence level.
  • \(\sigma/\sqrt{n}\) is the standard error, which reflects the variability of the sample mean.

Let us construct the 90% confidence interval for the population mean \(\mu\) (i.e., the population average height).
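One way to carry out this computation is sketched below. It assumes a simulated stand-in population as the source of the sample; the names `population`, `heights`, `x_bar`, `z_crit`, `se`, and `ci_90` are illustrative and not taken from the lab's own code.

```r
# Assumed setup: simulated stand-in for the hypothetical population.
set.seed(10)                                   # arbitrary seed
sigma <- 3.5                                   # known population standard deviation
n <- 50
population <- rnorm(100000, mean = 70.77, sd = sigma)
heights <- sample(population, size = n)

# The three components of the interval.
x_bar  <- mean(heights)                        # point estimate
z_crit <- qnorm(1 - 0.10 / 2)                  # cut-off |z_{alpha/2}| for alpha = 0.10
se     <- sigma / sqrt(n)                      # standard error

# 90% confidence interval: point estimate -/+ cut-off * standard error.
ci_90 <- c(lower = x_bar - z_crit * se, upper = x_bar + z_crit * se)
ci_90
```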

Since each of us has a different sample, the corresponding confidence interval (CI) will also vary.

In real-world settings, it’s impossible to know whether a computed confidence interval truly contains the target parameter \(\mu\). However, using our hypothetical population, we can directly assess this!

Does your confidence interval contain the target parameter? Or does your fellow student’s interval contain it?

Confidence level and coverage

We have constructed the 90% confidence interval, but what does “90% confidence” mean? It means that, if we repeatedly drew new samples and constructed an interval from each one in the same way, about 90% of those intervals would contain the true target parameter \(\mu\).

The cut-off value of 1.645 was chosen to ensure 90% coverage. Similarly, we use the larger cut-off values 1.96 and 2.58 to achieve 95% and 99% coverage, respectively.
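These cut-off values can be recovered from the standard normal quantile function `qnorm()`. The short sketch below uses the illustrative names `conf_levels`, `alpha`, and `z_crit`.

```r
# Confidence levels and the corresponding alpha = 1 - level.
conf_levels <- c(0.90, 0.95, 0.99)
alpha <- 1 - conf_levels

# |z_{alpha/2}| is the (1 - alpha/2) quantile of the standard normal distribution.
z_crit <- qnorm(1 - alpha / 2)
round(z_crit, 3)
# [1] 1.645 1.960 2.576
```

The last value, 2.576, is what the text rounds to 2.58.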

Let us actually confirm this result by creating 50 CIs in the following simple simulation.
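A minimal sketch of such a simulation is given below. As a simplifying assumption, each sample is drawn directly from a normal distribution with `rnorm()` instead of from a stored population with `sample()`; the names `n_ci`, `x_bars`, and `covered` are illustrative.

```r
# Known population quantities from the lab's hypothetical example.
mu <- 70.77
sigma <- 3.5
n <- 50                                        # size of each sample
n_ci <- 50                                     # number of confidence intervals

set.seed(10)                                   # arbitrary seed; remove to see new results

# Sample mean of each of the 50 independent samples.
x_bars <- replicate(n_ci, mean(rnorm(n, mean = mu, sd = sigma)))
se <- sigma / sqrt(n)

# For each confidence level, count how many of the 50 intervals contain mu.
covered <- sapply(c("90%" = 0.90, "95%" = 0.95, "99%" = 0.99), function(level) {
  z_crit <- qnorm(1 - (1 - level) / 2)
  sum(x_bars - z_crit * se <= mu & mu <= x_bars + z_crit * se)
})
covered                                        # intervals (out of 50) that cover mu
n_ci - covered                                 # intervals that miss mu
```

Because the same 50 samples are reused at every level and a higher confidence level only widens each interval, the coverage counts in this sketch can never decrease from 90% to 95% to 99%.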

You can run the code a few times, since the observed coverage will rarely be exactly 90%, 95%, or 99% due to randomness. On average, however, about 5 of the 50 CIs at the 90% level, 2 or 3 at the 95% level, and 0 or 1 at the 99% level should fail to contain the true parameter.

Interpretation of confidence interval

Returning to the real-world scenario, note that we only collect a SINGLE sample of size \(n\). Since the target parameter is UNKNOWN, we can never know for sure whether the computed confidence interval (CI) contains the true parameter. Once a CI is computed, it consists of a pair of constants (lower and upper bounds), and since we cannot assign a probability to these constants, we use the term “confidence” instead. This indirect expression reflects our earlier observation that, if the CI were constructed repeatedly from new samples, the proportion of intervals containing the true parameter would be approximately equal to the confidence level.

Finally, we use one of the following interpretations:

  • “We are \(100(1-\alpha)\)% confident that the true parameter falls within the resulting confidence interval, between [computed lower bound] and [computed upper bound].”
  • “We have \(100(1-\alpha)\)% confidence that the resulting confidence interval, from [computed lower bound] to [computed upper bound], covers the true parameter.”

Lab Questions

Suppose we are interested in the average mileage (in miles per gallon, or mpg) of passenger cars in the 1970s. Let us assume that we know the population standard deviation, \(\sigma=6\). Our goal is to construct a 95% confidence interval for the population mean mileage. To achieve this goal, we use the variable \(\texttt{\color{brown}{mpg}}\) from the built-in data set \(\texttt{\color{brown}{mtcars}}\) as our sample.

  1. What is the sample size?
  2. Compute the three components: point estimate, cut-off, and standard error.
  3. Construct the 95% CI for the population mean mileage.
  4. Provide a careful interpretation within the context of the problem.

Click HERE to submit your answers.