Chapter 4 Bootstrap Interval Estimation

Learning Outcomes:

  1. Explain the impact of sampling variability on interval estimation.
  2. Define sampling distribution and standard error.
  3. Interpret confidence intervals, or explain why it is inappropriate to do so.
  4. Explain the purpose of bootstrapping.
  5. Explain how to obtain a bootstrap distribution for a statistic (such as a sample mean, median, standard deviation, proportion, difference in means or proportions, or regression coefficient).
  6. Determine when it is appropriate to use standard error and percentile confidence intervals.
  7. Explain whether or not the results of a confidence interval are consistent with the conclusion of a hypothesis test.
  8. Define standard error of a statistic and interpret it in context.
  9. Explain how sample size and level of confidence impact the width of a confidence interval.
  10. Explain how sample size impacts variability in individual observations, and the sampling distribution for a test statistic.
  11. Explain when it is appropriate to use “theory-based” standard error formulas.

4.1 Sampling Distributions

4.1.1 Sampling From a Population

In statistics, we often do not have the time, money, or means to collect data on all individuals or units on which we want to draw conclusions. Instead, we might collect data on only a subset of the individuals, and then make inferences about all individuals we are interested in, using the information we collected.

Vocabulary:

  1. A population is the entire set of individuals that we want to draw conclusions about.
  2. A sample is a subset of a population.
  3. A parameter is a numerical quantity pertaining to an entire population.
  4. A statistic is a numerical quantity calculated from a sample.

We’ll work with a dataset containing information on all 20,591 flights from New York to Chicago in 2013. Our population of interest is all 20,591 flights.

In this situation, we have information on the entire population, but suppose temporarily that we didn’t. Instead, suppose we had only information on a random sample of 75 flights.

The parameter of interest is the proportion of on-time arrivals out of all flights in the population of 20,591. When the parameter is a proportion, we’ll denote it with the letter \(p\).

We take a sample of 75 flights. The first 6 flights in the sample are shown below. The ontime variable tells whether or not the flight arrived on time.

set.seed(08082023)
S1 <- sample_n(Flights_NY_CHI, 75)
head(S1)
## # A tibble: 6 × 9
##    year month   day carrier origin dest  sched_dep_time arr_delay ontime
##   <int> <int> <int> <chr>   <chr>  <chr>          <int>     <dbl> <chr> 
## 1  2013     3     8 AA      LGA    ORD             1720       106 N     
## 2  2013    12    15 AA      JFK    ORD             1715       -24 Y     
## 3  2013    10    22 UA      EWR    ORD             1300       -12 Y     
## 4  2013     8    26 UA      EWR    ORD             2110        15 N     
## 5  2013     7    23 AA      LGA    ORD             1359        66 N     
## 6  2013     5    13 UA      EWR    ORD              900       -21 Y

We’ll calculate the number and proportion of flights that arrived on time.

num_ontime <- sum(S1$ontime == "Y") # count number of on-time arrivals

Number of on-time arrivals in the sample.

num_ontime
## [1] 39

Proportion of on-time arrivals in the sample.

p_hat <- num_ontime/75
p_hat
## [1] 0.52

In our sample, 52 percent of flights arrived on time. We’ll denote this quantity \(\hat{p}\), keeping with our convention of using the \(\hat{}\) symbol to represent a quantity that is calculated from data (like the predictions and estimates we saw in the previous chapter). The sample statistic \(\hat{p}\) is an estimate of the population proportion \(p\).

Of course, this was just one sample of 75 flights. We should not expect the proportion of on-time flights in our sample (\(\hat{p}\)) to exactly match the proportion of on-time flights in the entire population (\(p\)). Nevertheless, we can use the sample to estimate the proportion of all flights in the population that arrive on time. Perhaps we could say that we would expect between 42 and 62 percent of all 2013 flights from New York to arrive on time. Or perhaps we could be more precise and estimate the percentage of on-time flights to be between 47 and 57. We’ll need to figure out how precise we can make our range, while still being confident that it does, in fact, contain the true population parameter.

To summarize:

  • The population is all 20,591 flights from New York to Chicago in 2013.
  • The sample is the flights that we randomly selected.
  • The parameter \(p\) is the proportion of on-time arrivals among all 2013 flights from New York to Chicago, which we do not know (though in this particular example, we could find it, since we have data on all flights in the population.)
  • The sample statistic is the proportion of flights in our sample that arrived on time, which we know to be \(\hat{p}\) = 0.52.

Now, let’s take a different sample of 75 flights and see how the proportion of on-time arrivals compares.

S2 <- sample_n(Flights_NY_CHI, 75)
num_ontime2 <- sum(S2$ontime == "Y") # count number of on-time arrivals
p_hat2 <- num_ontime2/75
p_hat2
## [1] 0.5066667

By studying the behavior of the proportion of on-time arrivals in different samples, we can gauge how close the proportion in a given sample is likely to be to the unknown population parameter. If all of our samples produce very similar estimates, then it is likely that the population parameter is close to these estimates. If the sample proportion varies considerably from sample to sample, then it is possible that the proportion in any given sample might be very different from the population parameter.

Let’s take 10,000 more random samples of 75 flights and record the proportion of on-time arrivals in each sample.

nreps <- 10000  # number of repetitions
p_hat_val <- rep(NA, nreps) # create vector to hold proportion of on-time arrivals
Sample <- 1:nreps

for(i in 1:nreps){
  S <- sample_n(Flights_NY_CHI, 75) # take sample of 75
  N_ontime <- sum(S$ontime == "Y") # count number of on-time arrivals
  p_hat_val[i] <- N_ontime/75 # record proportion on-time
}

Samples_df <- data.frame(Sample, p_hat_val) # store results in a data frame

The table shows the proportion of on-time arrivals in the first 20 samples of 75 flights.

kable(head(Samples_df, 20))
Sample p_hat_val
1 0.6533333
2 0.5333333
3 0.5866667
4 0.6933333
5 0.5866667
6 0.6666667
7 0.6133333
8 0.5866667
9 0.6533333
10 0.6533333
11 0.6133333
12 0.5866667
13 0.6133333
14 0.4800000
15 0.6000000
16 0.4400000
17 0.4933333
18 0.5733333
19 0.6400000
20 0.5600000

The histogram below shows the distribution of the proportion of on-time arrivals in the 10,000 different samples.

Prop_Samp_Dist<- ggplot(data=Samples_df, aes(x=p_hat_val)) +
  geom_histogram(color="white", fill="blue") + 
  ggtitle("Sampling Distribution for Proportion On Time") + 
  xlab("Prop. on time in sample")
Prop_Samp_Dist

We notice that most of our 10,000 samples yielded proportions of on-time arrivals between 0.5 and 0.7. The distribution of the proportion of on-time arrivals is roughly symmetric and bell-shaped.

The distribution shown in this histogram is called the sampling distribution for \(\hat{p}\). We can gauge how much the proportion of on-time arrivals varies between samples by calculating the standard deviation of this sampling distribution. The standard deviation of a sampling distribution for a statistic is also called the standard error of the statistic. In this case it represents the standard error of \(\hat{p}\) (the proportion of on-time arrivals), and is denoted \(\text{SE}(\hat{p})\). This standard error is calculated below.

SE_p_hat <- sd(Samples_df$p_hat_val)
SE_p_hat
## [1] 0.05659102

Vocabulary:

  • The sampling distribution of a statistic is the distribution of values the statistic takes on across many different samples of a given size.
  • The standard error of a statistic is the standard deviation of that statistic’s sampling distribution. It measures how much the statistic varies between different samples of a given size.

In this rare situation, we actually have data on all 20,591 flights from New York to Chicago in 2013 (our entire population). Let’s calculate the true value of the population parameter \(p\), the proportion of flights that arrived on time in our actual population.

p <- sum(Flights_NY_CHI$ontime == "Y")/20591
p
## [1] 0.6079841

In fact, just over 60% of all flights in the population arrived on time.

The sampling distribution for the proportion of on-time flights is shown again below. The true proportion of on-time flights is marked by the green dotted line. The gold bar at the bottom of the histogram represents the range of sample proportions that lie within \(\pm 2\) standard errors of the true population proportion of flights that arrived on time.

\[ 0.6079841 - 2(0.056591) \quad \text{to} \quad 0.6079841 + 2(0.056591) \]

Prop_Samp_Dist + 
  geom_vline(xintercept=p, color="green", linetype="dotted", linewidth=2) + 
  geom_segment(aes(x=p - 2*SE_p_hat, xend=p + 2*SE_p_hat, y=50, yend=50), 
               color="gold", size=10, alpha=0.01) 

We calculate the proportion of samples whose proportion of on-time arrivals lies within \(\pm 2\) standard errors of the true proportion.

Lower <- p - 2*SE_p_hat
Upper <- p + 2*SE_p_hat
sum((Samples_df$p_hat_val >=Lower) & (Samples_df$p_hat_val <= Upper))
## [1] 9539

Approximately 95% of the 10,000 samples produced proportions within \(\pm 2\) standard errors of the true population proportion of on-time flights.

4.1.2 Confidence Intervals

In a real situation, we won’t have access to the entire population of flights, only the flights in a single sample. For example, recall our original sample of 75 flights, in which we observed a proportion of on-time arrivals of \(\hat{p}=\) 0.52.

Since we now know that 95% of all samples produce proportions that lie within two standard errors of the population proportion, we can obtain an interval estimate of the population proportion \(p\) by adding and subtracting \(2\times \text{SE}(\hat{p})\) from our observed sample proportion \(\hat{p}\). Such an interval is called an approximate 95% confidence interval for the true population proportion \(p\).

Approximate 95% Confidence Interval for \(p\)

\[ \hat{p} \pm 2\times \text{SE}(\hat{p}) \]

The confidence interval, based on our original sample, is calculated below.

c(p_hat - 2*SE_p_hat, p_hat + 2*SE_p_hat) 
## [1] 0.406818 0.633182

Based on our sample of 75 flights, we can be 95% confident that the true proportion of on-time arrivals among all 2013 flights from New York to Chicago is between 0.406818 and 0.633182.

In fact, knowing what we do about the true value of the population parameter \(p\), we can see that our confidence interval does indeed contain this value. Of course, in a real situation, we won’t know the true value of the population parameter, so we won’t know for sure whether or not our confidence interval contains this true parameter value.

A pertinent question at this stage would be “What does 95% confidence mean?”. To answer that, let’s explore what happens when we calculate confidence intervals based on estimates from many different samples. For each of our 10,000 different samples taken from our population, we’ll add and subtract two standard errors from the sample proportion \(\hat{p}\) corresponding to that sample.

The table below displays the value of \(\hat{p}\), for the first 20 samples we took, along with the lower and upper bounds of the confidence interval, and whether or not the confidence interval contains the true parameter value \(p\) (either TRUE or FALSE).

Samples_df <- Samples_df %>% mutate( Lower = p_hat_val - 2*SE_p_hat, 
                                     Upper = p_hat_val + 2*SE_p_hat,
                                     Contains = p >= Lower & p <= Upper)
kable(head(Samples_df, 20))
Sample p_hat_val Lower Upper Contains
1 0.6533333 0.5401513 0.7665154 TRUE
2 0.5333333 0.4201513 0.6465154 TRUE
3 0.5866667 0.4734846 0.6998487 TRUE
4 0.6933333 0.5801513 0.8065154 TRUE
5 0.5866667 0.4734846 0.6998487 TRUE
6 0.6666667 0.5534846 0.7798487 TRUE
7 0.6133333 0.5001513 0.7265154 TRUE
8 0.5866667 0.4734846 0.6998487 TRUE
9 0.6533333 0.5401513 0.7665154 TRUE
10 0.6533333 0.5401513 0.7665154 TRUE
11 0.6133333 0.5001513 0.7265154 TRUE
12 0.5866667 0.4734846 0.6998487 TRUE
13 0.6133333 0.5001513 0.7265154 TRUE
14 0.4800000 0.3668180 0.5931820 FALSE
15 0.6000000 0.4868180 0.7131820 TRUE
16 0.4400000 0.3268180 0.5531820 FALSE
17 0.4933333 0.3801513 0.6065154 FALSE
18 0.5733333 0.4601513 0.6865154 TRUE
19 0.6400000 0.5268180 0.7531820 TRUE
20 0.5600000 0.4468180 0.6731820 TRUE

The graphic below visualizes the confidence intervals produced using the estimates from the first 100 samples. The green dotted line indicates the true value of \(p\). The black dots indicate the value of \(\hat{p}\) for each sample. Intervals that do in fact contain the true value of \(p\) are shown in blue, and intervals that do not contain the true value of \(p\) are shown in red.

ggplot(data=Samples_df[1:100,], aes(y=Sample, x=p_hat_val)) +    
  geom_point() +
  geom_errorbar(aes(xmin = Lower, xmax = Upper, color=Contains))  + 
  xlab("Confidence Interval") + 
  ylab("Sample") + 
  geom_vline(xintercept = p, color="green", linetype="dotted", size=2) + theme_bw()

Out of these 100 samples, 93 contain the true value of the population parameter \(p\). This is close to the 95% confidence level.

The picture shows confidence intervals produced by the first 100 samples, but we actually took 10,000 different samples of 75 flights. Let’s calculate how many of these samples produced confidence intervals that contain the true value of \(p\).

sum(Samples_df$Contains == TRUE)
## [1] 9539

Again, notice that close to 95% of the samples produced confidence intervals that contain the true population parameter \(p\). Note that for the red intervals that do not contain \(p\), nothing was done incorrectly. The sample was taken at random, and the confidence interval was calculated using the correct formula. It just happened that by chance, we obtained a sample proportion \(\hat{p}\) that was unusually high or low, leading to an interval that did not capture the true population parameter. This, of course, happens rarely, and approximately 95% of the samples do, in fact, result in intervals that contain the true value of \(p\).

This brings us back to the question “what does 95% confidence mean?”. An approximate 95% confidence interval means that if we take a large number of samples and calculate confidence intervals from each of them, then approximately 95% of the samples will produce intervals containing the true population parameter. In reality, we’ll only have one sample, and we won’t know whether or not our interval contains the true parameter value. Assuming we have taken the sample and calculated the interval correctly, we can rest assured in the knowledge that 95% of all such intervals would contain the true parameter value, and hope that ours is among that 95%.

We calculated the confidence interval by taking our sample statistic \(\hat{p}\) plus or minus two standard errors. Confidence intervals that are calculated by adding and subtracting a certain number of standard errors from the sample statistic are called standard error confidence intervals. This approach works as long as the sampling distribution is symmetric and bell-shaped. Probability theory tells us that in a symmetric and bell-shaped distribution, approximately 95% of the area lies within two standard errors of the center of the distribution, which for a sampling distribution is the true parameter value. We will, however, see that this approach will not work in all cases. Not all statistics produce sampling distributions that are symmetric and bell-shaped, and we will need an alternative way to calculate confidence intervals in these situations.

knitr::include_graphics("Emp_Rule.png")
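
We can verify the 95% figure directly, by computing the area within two standard deviations of the center of a normal distribution:

pnorm(2) - pnorm(-2) # area within 2 SDs of the center of a normal curve
## [1] 0.9544997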

If we want to use a level of confidence that is different than 95%, we can adjust the value we multiply the standard error by. In general, a standard error confidence interval has the form

\[ \text{Statistic } \pm m\times \text{Standard Error}, \]

where the value of \(m\) depends on the desired level of confidence.
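
Under this normal approximation, the multiplier \(m\) for any desired level of confidence can be found with the qnorm() function. For example:

qnorm(0.975)  # m for 95% confidence (the "2" used above is this value, rounded)
## [1] 1.959964
qnorm(0.95)   # m for 90% confidence
## [1] 1.644854
qnorm(0.995)  # m for 99% confidence
## [1] 2.575829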

Of course, you might ask why we needed to calculate a confidence interval for the proportion of on-time flights in the first place, since we actually have data on all 20,591 flights in the population and already know the true proportion of on-time arrivals to be 0.608. The answer is that we didn’t need to. But in most real situations, we will only have data from a single sample, not the entire population, and we won’t know the true population parameter. We’ll be able to build on the ideas of sampling distributions and standard error that we learned about in this section to calculate confidence intervals in those scenarios.

4.2 Bootstrapping

4.2.1 Mercury Levels in Florida Lakes

A 2004 study by Lange, T., Royals, H. and Connor, L. examined mercury accumulation in largemouth bass taken from a sample of 53 Florida lakes. If mercury accumulation exceeds 0.5 ppm, then there are environmental concerns. In fact, the legal safety limit in Canada is 0.5 ppm, although it is 1 ppm in the United States.

Recall that our data come from a sample of 53 lakes, out of more than 30,000 in the entire state of Florida.

The mercury levels of the first 10 lakes in the sample are shown in the table below.

data("FloridaLakes")
FloridaLakes <- FloridaLakes %>% rename(Mercury = AvgMercury)
kable(head(FloridaLakes %>% select(ID, Lake, Mercury), 10))
ID Lake Mercury
1 Alligator 1.23
2 Annie 1.33
3 Apopka 0.04
4 Blue Cypress 0.44
5 Brick 1.20
6 Bryant 0.27
7 Cherry 0.48
8 Crescent 0.19
9 Deer Point 0.83
10 Dias 0.81

The histogram shows the distribution of mercury levels in the 53 lakes in the sample. Lakes exceeding the US standard of 1 ppm are shown in red.

Lakes_Hist <- ggplot(data=FloridaLakes, aes(x=Mercury)) + 
  geom_histogram(aes(fill=Mercury<=1), color="white", binwidth = 0.1) + 
  ggtitle("Mercury Levels in Sample of 53 Florida Lakes") + 
  xlab("Mercury Level") + ylab("Frequency") + theme_bw()
Lakes_Hist

The proportion of lakes with mercury levels exceeding 1 ppm is calculated below.

p_hat <- sum(FloridaLakes$Mercury > 1)/53
p_hat
## [1] 0.1132075

We see that in our sample of 53 lakes, approximately 11% have mercury levels exceeding the US standard of 1 ppm. Suppose we want to estimate the proportion of all Florida Lakes whose mercury level exceeds this standard. As we saw in the previous section, we would not expect the population proportion to exactly match the sample, due to random variability between samples. We can use the sample proportion as an estimate (\(\hat{p} = 0.1132\)), and construct a confidence interval for the unknown population proportion \(p\).

In order to construct the confidence interval, we need to know how much the sample proportion of lakes exceeding 1 ppm, \(\hat{p}\), could vary between different samples of size 53. That is, we need to know the standard error of \(\hat{p}\). In the previous section, we calculated the standard error by taking 10,000 different samples of the same size as ours from the population, calculating the proportion for each sample, and then calculating the standard deviation of the proportions obtained from these 10,000 different samples. This procedure will not work here, however. Unlike the previous example, where we really did have data on the entire population of all flights from New York to Chicago, we do not have data on all 30,000+ lakes in Florida. We cannot take a lot of different samples of size 53 from the population of all lakes, and thus cannot obtain the sampling distribution for the proportion of lakes exceeding 1 ppm, or estimate the standard error of \(\hat{p}\).

4.2.2 Bootstrap Sampling

All we have is a single sample of 53 lakes. We need to figure out how much the proportion of lakes with mercury levels exceeding 1 ppm would vary between different samples of size 53, using only the information contained in our one sample.

To do this, we’ll implement a popular simulation-based strategy, known as bootstrapping.

Let’s assume our sample is representative of all Florida lakes. Then, we’ll duplicate the sample many times to create a large set that will look like the population of all Florida lakes. We can then draw samples of 53 from that large population, and record the proportion of lakes with mercury levels exceeding 1 ppm in each sample of 53.

An illustration of the bootstrapping procedure is shown below, using a sample of 12 colored dots, instead of the 53 lakes.

In fact, duplicating the sample many times and selecting new samples of size \(n\) has the same effect as drawing samples of size \(n\) from the original sample, putting each item back after it is drawn, a procedure called sampling with replacement. Thus, we can skip the step of copying/pasting the sample many times, and instead draw our samples with replacement.

This means that in each new sample, some lakes will be drawn multiple times and others not at all. It also ensures that each sample is different, allowing us to estimate variability in the sample proportion between the different samples of size 53.
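
To see what sampling with replacement looks like, here is a small demonstration using the numbers 1 through 6 in place of lakes (the exact result varies from run to run):

sample(1:6, 6, replace=TRUE) # e.g. 3 1 5 5 2 2: some values repeat, others never appear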

An illustration of the concept of bootstrapping, using sampling with replacement, is shown below.

The variability in sample proportions in our newly drawn samples is used to approximate the variability in the proportion \(\hat{p}\) that would occur between different samples of 53 lakes, drawn from the population of all Florida lakes.

The point of bootstrapping is to observe how much a statistic (in this case the proportion of lakes with Mercury levels exceeding 1 ppm) varies between bootstrap samples. This can act as an estimate of how much that statistic would vary between different samples of size \(n\), drawn from the population.

The steps of bootstrap sampling can be summarized in the following algorithm.

Bootstrap Algorithm

For an original sample of size \(n\):

  1. Take a sample of size \(n\) by randomly sampling from the original, with replacement. Thus, some observations will show up multiple times, and others not at all. This sample is called a bootstrap sample.

  2. Calculate the statistic of interest in the bootstrap sample (in this case \(\hat{p}\), the proportion of lakes whose mercury levels exceed 1 ppm).

  3. Repeat steps 1 and 2 many (say 10,000) times, keeping track of the statistic of interest that is calculated in each bootstrap sample.

  4. Look at the distribution of the statistic across bootstrap samples. The variability in this bootstrap distribution can be used to approximate the variability in the sampling distribution for the statistic of interest.

4.2.3 Bootstrap Samples of Lakes

The sample_n() function samples the specified number of rows from a data frame, with or without replacement.

The lakes in the first sample are shown below. Notice that some lakes occur multiple times, and others not at all.

Bootstrap Sample 1

BootstrapSample1 <- sample_n(FloridaLakes, 53, replace=TRUE) %>% arrange(Lake)
BootstrapSample1 %>% select(ID, Lake, Mercury)
## # A tibble: 53 × 3
##       ID Lake       Mercury
##    <int> <chr>        <dbl>
##  1     1 Alligator     1.23
##  2     2 Annie         1.33
##  3     3 Apopka        0.04
##  4     5 Brick         1.2 
##  5     5 Brick         1.2 
##  6     6 Bryant        0.27
##  7     7 Cherry        0.48
##  8     9 Deer Point    0.83
##  9     9 Deer Point    0.83
## 10    12 Down          0.5 
## # ℹ 43 more rows

We calculate the proportion of lakes with mercury levels exceeding 1 ppm in this bootstrap sample. Note that if a lake shows up more than once in the bootstrap sample, then it is counted however many times it shows up.

sum(BootstrapSample1$Mercury > 1) / 53
## [1] 0.0754717

Bootstrap Sample 2

We take a second bootstrap sample. Notice that the lakes chosen and omitted differ from the first sample.

BootstrapSample2 <- sample_n(FloridaLakes, 53, replace=TRUE) %>% arrange(Lake)
BootstrapSample2 %>% select(ID, Lake, Mercury)
## # A tibble: 53 × 3
##       ID Lake      Mercury
##    <int> <chr>       <dbl>
##  1     1 Alligator    1.23
##  2     2 Annie        1.33
##  3     2 Annie        1.33
##  4     2 Annie        1.33
##  5     2 Annie        1.33
##  6     3 Apopka       0.04
##  7     6 Bryant       0.27
##  8    11 Dorr         0.71
##  9    11 Dorr         0.71
## 10    11 Dorr         0.71
## # ℹ 43 more rows

Proportion exceeding 1 ppm:

sum(BootstrapSample2$Mercury > 1) / 53
## [1] 0.1698113

Bootstrap Sample 3

We’ll take one more bootstrap sample and calculate the proportion of lakes with mercury levels exceeding 1 ppm.

BootstrapSample3 <- sample_n(FloridaLakes, 53, replace=TRUE) %>% arrange(Lake)
BootstrapSample3 %>% select(ID, Lake, Mercury)
## # A tibble: 53 × 3
##       ID Lake       Mercury
##    <int> <chr>        <dbl>
##  1     1 Alligator     1.23
##  2     1 Alligator     1.23
##  3     2 Annie         1.33
##  4     3 Apopka        0.04
##  5     3 Apopka        0.04
##  6     5 Brick         1.2 
##  7     6 Bryant        0.27
##  8     7 Cherry        0.48
##  9     9 Deer Point    0.83
## 10     9 Deer Point    0.83
## # ℹ 43 more rows

Proportion exceeding 1 ppm:

sum(BootstrapSample3$Mercury > 1) / 53
## [1] 0.09433962

4.2.4 Bootstrap Distribution

Now that we have seen how bootstrap sampling works, we’ll take a large number (10,000) different bootstrap samples and examine how the proportion of lakes with mercury levels exceeding 1 ppm varies between samples.

We’ll use a for-loop to take many different bootstrap samples and record the observed proportion in a vector called Bootstrap_prop.

p_hat <- sum(FloridaLakes$Mercury > 1)/53 # calculate sample statistic
Bootstrap_prop <- rep(NA, 10000)   # set up vector to hold bootstrap statistics

for (i in 1:10000){
  BootstrapSample <- sample_n(FloridaLakes, 53, replace=TRUE) # take bootstrap sample
  Bootstrap_prop[i] <- sum(BootstrapSample$Mercury > 1)/53 # calc. prop exceeding 1
}
Lakes_Bootstrap_Prop <- data.frame(Bootstrap_prop)  # store values in a data frame

The distribution of proportions observed in the 10,000 different bootstrap samples is shown below. This distribution is called the bootstrap distribution.

Lakes_Bootstrap_Prop_plot <- ggplot(data=Lakes_Bootstrap_Prop, aes(x=Bootstrap_prop)) +  
  geom_histogram(color="white", fill="lightblue") +
  xlab("Prop > 1 in Bootstrap Sample ") + ylab("Frequency") +
  ggtitle("Bootstrap Distribution for Prop. of Lakes Exeeding 1 ppm Hg") + 
  theme(legend.position = "none")
Lakes_Bootstrap_Prop_plot

4.2.5 Bootstrap SE Confidence Interval

We calculate the standard deviation of this bootstrap distribution, which is an estimate of the standard error of \(\hat{p}\).

SE_p_hat <- sd(Lakes_Bootstrap_Prop$Bootstrap_prop)

Since the bootstrap distribution is roughly symmetric and bell-shaped, we can calculate a 95% confidence interval for the proportion of all Florida lakes with mercury levels exceeding 1 ppm, using the bootstrap standard error method:

\[ \hat{p} \pm 2\times\text{SE}(\hat{p}) \]

c(p_hat - 2*SE_p_hat, p_hat + 2*SE_p_hat)
## [1] 0.02684504 0.19957005

We are 95% confident that the proportion of all Florida lakes with mercury levels exceeding 1 ppm is between 0.0268450 and 0.1995701.

The gold bar at the bottom of the bootstrap distribution represents this 95% confidence interval.

Lakes_Bootstrap_Prop_plot + 
  geom_segment(aes(x=p_hat - 2*SE_p_hat,xend=p_hat + 2*SE_p_hat, y=50, yend=50),
               color="gold", size=10, alpha=0.01) 

4.2.6 Bootstrap Distribution vs Sampling Distribution

We stated that the standard error of the bootstrap distribution is meant to give us an estimate of the standard error of the statistic of interest (\(\hat{p}\) in this case). We do note, however, that the bootstrap distribution is not the same as the sampling distribution for a statistic, illustrated in the previous section. The sampling distribution shows the distribution of values the statistic would take on across many different samples drawn from the population. As such, it will be centered at the true population parameter (\(p\) in this case). The bootstrap distribution, on the other hand, shows us the distribution of values the statistic takes on across different bootstrap samples drawn from the original sample, with replacement. Since it comes entirely from the original sample, rather than the population, it will be centered at the sample statistic (in this case \(\hat{p}\)), rather than the unknown \(p\). Nevertheless, the amount of variability in a statistic, calculated between different bootstrap samples, is often a reasonable approximation of the amount of variability we would observe in that statistic if we could take many samples of the same size from the entire population.
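
As a quick check, we can compare the center of the bootstrap distribution generated earlier to the sample statistic. Its mean should lie close to \(\hat{p} = 0.113\) (the exact value varies from run to run):

mean(Lakes_Bootstrap_Prop$Bootstrap_prop) # close to p_hat = 0.113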

4.3 Bootstrap Confidence Interval Example

4.3.1 Bootstrapping Other Statistics

We’ve seen how to use bootstrapping to calculate confidence intervals for an unknown population parameter \(p\), using an estimate \(\hat{p}\), calculated from a sample of size \(n\). This procedure can be applied to calculate confidence intervals for a wide range of population parameters, using statistics calculated from a sample.

For example, we could calculate confidence intervals for any of the following parameters, using the corresponding sample statistic.

Context Parameter Statistic
Proportion \(p\) \(\hat{p}\)
Mean \(\mu\) \(\bar{x}\)
Standard Deviation \(\sigma\) \(s\)
Median (no common abbreviations)
Difference in Means \(\mu_2-\mu_1\) \(\bar{x}_2 - \bar{x}_1\)
Regression Coefficient \(\beta_j\) \(b_j\)
Estimated Regression Response \(\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}\) \(b_0 + b_1x_{i1} + \ldots + b_px_{ip}\)

We follow the same algorithm as we did when working with a proportion, simply calculating the statistic of interest in step 2, in place of \(\hat{p}\).

The bootstrap algorithm is given again, below.

Bootstrap Algorithm

For an original sample of size \(n\):

  1. Take a sample of size \(n\) by randomly sampling from the original sample with replacement. (Thus some observations will show up multiple times and others not at all.)

  2. Calculate the statistic of interest in the bootstrap sample.

  3. Repeat steps 1 and 2 many (say 10,000) times, keeping track of the statistic of interest that is calculated in each bootstrap sample.

  4. Look at the distribution of the statistic across bootstrap samples. The variability in this bootstrap distribution can be used to approximate the variability in the sampling distribution for the statistic of interest.
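
Since only step 2 changes from one statistic to the next, the algorithm can also be wrapped in a small helper function. A minimal sketch (the function bootstrap_stat and its arguments are ours, not from the text):

# Sketch: apply the bootstrap algorithm to any statistic computed from a data frame
bootstrap_stat <- function(data, stat_fn, reps = 10000){
  n <- nrow(data)
  replicate(reps, stat_fn(sample_n(data, n, replace = TRUE)))
}

# Example usage: bootstrap the mean mercury level
# boot_means <- bootstrap_stat(FloridaLakes, function(d) mean(d$Mercury))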

We’ll now go through examples, calculating bootstrap confidence intervals for each of the parameters listed above.

4.3.2 CI for Mean

The histogram shows the distribution of mercury levels of the 53 lakes in our sample. The mean and standard deviation of the mercury levels for these 53 lakes are shown below.

Lakes_Hist <- ggplot(data=FloridaLakes, aes(x=Mercury)) + 
  geom_histogram(color="white", fill="lightblue", binwidth = 0.2) + 
  ggtitle("Mercury Levels in Sample of Florida Lakes") + 
  xlab("Mercury Level") + ylab("Frequency") 
Lakes_Hist

We’ll calculate the mean and standard deviation of the mercury levels for the 53 lakes in the sample.

Lakes_Stats <- FloridaLakes %>% summarize(MeanHg = mean(Mercury), 
                           StDevHG = sd(Mercury),
                           N=n())
kable(Lakes_Stats)
MeanHg StDevHG N
0.5271698 0.3410356 53

We want to calculate a 95% confidence interval for the mean mercury level among all Florida lakes. We’ll use bootstrapping again, this time using the sample mean, rather than the proportion exceeding 1 ppm, as our statistic of interest.

Bootstrap Steps

  1. Take a sample of 53 lakes by randomly sampling from the original sample of 53 lakes, with replacement.

  2. Calculate the mean mercury level in the bootstrap sample.

  3. Repeat steps 1 and 2 many (say 10,000) times, keeping track of the mean mercury level in each bootstrap sample.

  4. Look at the distribution of the mean across bootstrap samples. The variability in this bootstrap distribution can be used to approximate the variability in the sampling distribution for the mean mercury level.

We’ll illustrate the procedure on 3 bootstrap samples.

Bootstrap Sample 1

BootstrapSample1 <- sample_n(FloridaLakes, 53, replace=TRUE) %>% arrange(Lake)
BootstrapSample1 %>% select(ID, Lake, Mercury)
## # A tibble: 53 × 3
##       ID Lake         Mercury
##    <int> <chr>          <dbl>
##  1     1 Alligator       1.23
##  2     1 Alligator       1.23
##  3     3 Apopka          0.04
##  4     4 Blue Cypress    0.44
##  5     4 Blue Cypress    0.44
##  6     6 Bryant          0.27
##  7     6 Bryant          0.27
##  8     6 Bryant          0.27
##  9     7 Cherry          0.48
## 10     8 Crescent        0.19
## # ℹ 43 more rows

We calculate the mean mercury level among the lakes in the bootstrap sample.

mean(BootstrapSample1$Mercury)
## [1] 0.4986792

Bootstrap Sample 2

BootstrapSample2 <- sample_n(FloridaLakes, 53, replace=TRUE) %>% arrange(Lake)
BootstrapSample2 %>% select(ID, Lake, Mercury)
## # A tibble: 53 × 3
##       ID Lake         Mercury
##    <int> <chr>          <dbl>
##  1     1 Alligator       1.23
##  2     3 Apopka          0.04
##  3     4 Blue Cypress    0.44
##  4     6 Bryant          0.27
##  5     6 Bryant          0.27
##  6     7 Cherry          0.48
##  7     7 Cherry          0.48
##  8     8 Crescent        0.19
##  9     8 Crescent        0.19
## 10     8 Crescent        0.19
## # ℹ 43 more rows

Mean Mercury Level:

mean(BootstrapSample2$Mercury)
## [1] 0.5192453

Bootstrap Sample 3

BootstrapSample3 <- sample_n(FloridaLakes, 53, replace=TRUE) %>% arrange(Lake)
BootstrapSample3 %>% select(ID, Lake, Mercury)
## # A tibble: 53 × 3
##       ID Lake         Mercury
##    <int> <chr>          <dbl>
##  1     1 Alligator       1.23
##  2     2 Annie           1.33
##  3     4 Blue Cypress    0.44
##  4     6 Bryant          0.27
##  5     6 Bryant          0.27
##  6     8 Crescent        0.19
##  7    12 Down            0.5 
##  8    12 Down            0.5 
##  9    12 Down            0.5 
## 10    13 Eaton           0.49
## # ℹ 43 more rows

Mean Mercury Level:

mean(BootstrapSample3$Mercury)
## [1] 0.445283

Now, we’ll take 10,000 bootstrap samples, and record the mean mercury concentration in each sample.

Sample_Mean <- mean(FloridaLakes$Mercury)  # calculate sample statistic
Bootstrap_Mean <- rep(NA, 10000) # set up vector to hold bootstrap statistics

for (i in 1:10000){
  BootstrapSample <- sample_n(FloridaLakes, 53, replace=TRUE) # take bootstrap sample
  Bootstrap_Mean[i] <- mean(BootstrapSample$Mercury) # calculate mean in bootstrap sample
}
Lakes_Bootstrap_Results_Mean <- data.frame(Bootstrap_Mean)  # store results in data frame

The bootstrap distribution for the mean mercury level is shown below, along with its standard error.

Lakes_Bootstrap_Mean_Plot <- ggplot(data=Lakes_Bootstrap_Results_Mean, 
                                    aes(x=Bootstrap_Mean)) +  
  geom_histogram(color="white", fill="lightblue") +
  xlab("Mean Mercury in Bootstrap Sample ") + ylab("Frequency") +
  ggtitle("Bootstrap Distribution for Sample Mean in Florida Lakes") + 
  theme(legend.position = "none") 
Lakes_Bootstrap_Mean_Plot 

We’ll calculate the standard error of the mean. This is a measure of how much the mean varies between samples.

SE_mean <- sd(Lakes_Bootstrap_Results_Mean$Bootstrap_Mean)
SE_mean
## [1] 0.04655744

Notice that the standard error of the mean is much less than the sample standard deviation of 0.341.

Interpretations of sample standard deviation and standard error of the mean

  • The sample standard deviation measures the amount of variability in mercury levels between the 53 individual lakes in our sample.
  • The standard error of the mean measures the amount of variability in sample mean mercury levels between different samples of size 53.

There is more variability between mercury levels in individual lakes than there is between average mercury levels in different samples of size 53.
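
For a sample mean, probability theory also provides a “theory-based” standard error formula, \(s/\sqrt{n}\). As a check, it agrees closely with our bootstrap estimate (which varies slightly from run to run):

sd(FloridaLakes$Mercury)/sqrt(53) # theory-based SE of the mean, about 0.047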

Since the bootstrap distribution is roughly symmetric and bell-shaped, we can use the bootstrap standard error method to calculate an approximate 95% confidence interval for the mean mercury level among all Florida lakes.

\[ \text{Statistic} \pm 2\times\text{Standard Error} \]

In this case, the statistic of interest is the sample mean \(\bar{x}=0.527\). The confidence interval is

\[ \begin{aligned} & \bar{x} \pm 2\times\text{SE}(\bar{x}) \\ & = 0.527 \pm 2\times\text{0.0465574} \end{aligned} \]

95% Confidence Interval:

c(Sample_Mean - 2*SE_mean, Sample_Mean + 2*SE_mean) 
## [1] 0.4340549 0.6202847

The 95% confidence interval is shown by the gold bar on the graph of the bootstrap distribution below.

Lakes_Bootstrap_Mean_Plot + 
  geom_segment(aes(x=Sample_Mean - 2*SE_mean, xend=Sample_Mean + 2*SE_mean, y=50, yend=50), 
               color="gold", size=10, alpha=0.01) 

We are 95% confident that the average mercury level among all Florida lakes is between 0.4340549 and 0.6202847 parts per million.

It is important to note that we are not saying that we are 95% confident that an individual lake lies in this range, or that 95% of all individual lakes lie in this range. We are only saying that we are confident that the average mercury level among all lakes lies in this range. A confidence interval is a statement about a population parameter (in this case the average mercury level), rather than about individual lakes in the population. Since there is more variability in individual lakes than in sample averages, we would need a wider interval when talking about the mercury level of an individual lake.

4.3.3 CI for Standard Deviation

Now, we’ll calculate a confidence interval for the standard deviation in mercury levels among all Florida lakes. Recall that the sample standard deviation (\(s\)) was:

Sample_SD <- sd(FloridaLakes$Mercury)
Sample_SD
## [1] 0.3410356

We’ll use this estimate to calculate a confidence interval for the population standard deviation \(\sigma\).

This time, our statistic of interest is the sample standard deviation \(s\).

Bootstrap Steps

  1. Take a sample of 53 lakes by randomly sampling from the original sample of 53 lakes, with replacement.

  2. Calculate the standard deviation in mercury level in the bootstrap sample.

  3. Repeat steps 1 and 2 many (say 10,000) times, keeping track of the standard deviation in mercury level in each bootstrap sample.

  4. Look at the distribution of the standard deviations across bootstrap samples. The variability in this bootstrap distribution can be used to approximate the variability in the sampling distribution for the standard deviation in mercury level.

We’ll illustrate the procedure on 3 bootstrap samples.

Bootstrap Sample 1

BootstrapSample1 <- sample_n(FloridaLakes, 53, replace=TRUE) %>% arrange(Lake)
BootstrapSample1 %>% select(ID, Lake, Mercury)
## # A tibble: 53 × 3
##       ID Lake       Mercury
##    <int> <chr>        <dbl>
##  1     2 Annie         1.33
##  2     3 Apopka        0.04
##  3     3 Apopka        0.04
##  4     5 Brick         1.2 
##  5     6 Bryant        0.27
##  6     7 Cherry        0.48
##  7     7 Cherry        0.48
##  8     8 Crescent      0.19
##  9     8 Crescent      0.19
## 10     9 Deer Point    0.83
## # ℹ 43 more rows

We calculate the standard deviation in mercury levels among the lakes in the bootstrap sample.

sd(BootstrapSample1$Mercury)
## [1] 0.3169752

Bootstrap Sample 2

BootstrapSample2 <- sample_n(FloridaLakes, 53, replace=TRUE) %>% arrange(Lake)
BootstrapSample2 %>% select(ID, Lake, Mercury)
## # A tibble: 53 × 3
##       ID Lake       Mercury
##    <int> <chr>        <dbl>
##  1     2 Annie         1.33
##  2     2 Annie         1.33
##  3     3 Apopka        0.04
##  4     5 Brick         1.2 
##  5     5 Brick         1.2 
##  6     8 Crescent      0.19
##  7     8 Crescent      0.19
##  8     9 Deer Point    0.83
##  9     9 Deer Point    0.83
## 10    12 Down          0.5 
## # ℹ 43 more rows

Standard Deviation in Mercury Level:

sd(BootstrapSample2$Mercury)
## [1] 0.3927499

Bootstrap Sample 3

BootstrapSample3 <- sample_n(FloridaLakes, 53, replace=TRUE) %>% arrange(Lake)
BootstrapSample3 %>% select(ID, Lake, Mercury)
## # A tibble: 53 × 3
##       ID Lake       Mercury
##    <int> <chr>        <dbl>
##  1     2 Annie         1.33
##  2     2 Annie         1.33
##  3     5 Brick         1.2 
##  4     5 Brick         1.2 
##  5     6 Bryant        0.27
##  6     6 Bryant        0.27
##  7     9 Deer Point    0.83
##  8     9 Deer Point    0.83
##  9     9 Deer Point    0.83
## 10     9 Deer Point    0.83
## # ℹ 43 more rows

Standard Deviation in Mercury Level:

sd(BootstrapSample3$Mercury)
## [1] 0.3589536

Now, we’ll take 10,000 bootstrap samples, and record the standard deviation in mercury concentration in each sample.

Sample_SD <- sd(FloridaLakes$Mercury)  # calculate sample statistic
Bootstrap_SD <- rep(NA, 10000) # set up vector to hold bootstrap statistics

for (i in 1:10000){
  BootstrapSample <- sample_n(FloridaLakes, 53, replace=TRUE) # take bootstrap sample
  Bootstrap_SD[i] <- sd(BootstrapSample$Mercury) # calculate standard deviation in bootstrap sample
}
Lakes_Bootstrap_Results_SD <- data.frame(Bootstrap_SD)  # store results in data frame

The bootstrap distribution for the standard deviation in mercury levels is shown below, along with its standard error.

Lakes_Bootstrap_SD_Plot <- ggplot(data=Lakes_Bootstrap_Results_SD, 
                                    aes(x=Bootstrap_SD)) +  
  geom_histogram(color="white", fill="lightblue") +
  xlab("SD in Mercury in Bootstrap Sample ") + ylab("Frequency") +
  ggtitle("Bootstrap Distribution for Sample SD in Florida Lakes") + 
  theme(legend.position = "none") 
Lakes_Bootstrap_SD_Plot 

We’ll calculate the standard error of the standard deviation. This is a measure of how much the standard deviation varies between samples.

SE_SD <- sd(Lakes_Bootstrap_Results_SD$Bootstrap_SD)
SE_SD
## [1] 0.02869572

Since the bootstrap distribution is roughly symmetric and bell-shaped, we can use the bootstrap standard error method to calculate an approximate 95% confidence interval for the standard deviation in mercury levels among all Florida lakes.

\[ \text{Statistic} \pm 2\times\text{Standard Error} \]

In this case, the statistic of interest is the sample standard deviation \(s=0.341\). The confidence interval is

\[ \begin{aligned} & s \pm 2\times\text{SE}(s) \\ & = 0.341 \pm 2\times{0.029} \end{aligned} \] 95% Confidence Interval:

c(Sample_SD - 2*SE_SD, Sample_SD + 2*SE_SD) 
## [1] 0.2836442 0.3984271

The 95% confidence interval is shown by the gold bar on the graph of the bootstrap distribution below.

Lakes_Bootstrap_SD_Plot + 
  geom_segment(aes(x=Sample_SD - 2*SE_SD,xend=Sample_SD + 2*SE_SD, y=50, yend=50), 
               color="gold", size=10, alpha=0.01) 

We are 95% confident that the standard deviation in mercury levels among all Florida lakes is between 0.2836442 and 0.3984271 parts per million.

4.3.4 CI for Median

We already calculated a confidence interval for the mean mercury level among all Florida lakes. We could calculate a bootstrap confidence interval for the median mercury level as well, but since the distribution of mercury levels in the lakes is roughly symmetric, the mean is a reasonable measure of center, and there is not a clear reason for using the median instead.

When a distribution is skewed or contains large outliers, however, the median is a more robust measure of center than the mean.
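
A toy example illustrates this robustness (the vector x is ours, for illustration only):

x <- c(1, 2, 3, 4, 100)  # one extreme outlier
mean(x)    # 22: pulled far upward by the outlier
median(x)  # 3: unaffected by the outlier

Recall the distribution of 100 Seattle house prices seen in Chapters 1 and 2.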

ggplot(data=Houses, aes(x=price)) + 
  geom_histogram(fill="lightblue", color="white") + 
  ggtitle("Distribution of House Prices") +
  xlab("Price") + 
  ylab("Frequency")

These 100 houses are a sample of all houses sold in Seattle in 2014 and 2015, so we can use statistics from our sample to draw conclusions about all houses sold in Seattle in this time period.

In this subsection, we’ll use bootstrapping to calculate a 95% confidence interval for the median price among all houses sold in Seattle in this time period.

We calculate the sample median price.

Sample_Median <- median(Houses$price)
Sample_Median
## [1] 507.5

Bootstrap Steps

  1. Take a sample of 100 houses by randomly sampling from the original sample of 100 houses, with replacement.

  2. Calculate the median price in the bootstrap sample.

  3. Repeat steps 1 and 2 many (say 10,000) times, keeping track of the median price in each bootstrap sample.

  4. Look at the distribution of the median price across bootstrap samples. The variability in this bootstrap distribution can be used to approximate the variability in the sampling distribution for the median price.

We’ll illustrate the procedure on 3 bootstrap samples.

Bootstrap Sample 1

BootstrapSample1 <- sample_n(Houses, 100, replace=TRUE) %>% arrange(Id)
BootstrapSample1 %>% select(Id, price)
## # A tibble: 100 × 2
##       Id price
##    <int> <dbl>
##  1     1 1225 
##  2     1 1225 
##  3     1 1225 
##  4     4  253.
##  5     4  253.
##  6     6  310.
##  7     7  550.
##  8     7  550.
##  9     9  315.
## 10    11  410 
## # ℹ 90 more rows

We calculate the median price among the houses in the bootstrap sample.

median(BootstrapSample1$price)
## [1] 520

Bootstrap Sample 2

BootstrapSample2 <- sample_n(Houses, 100, replace=TRUE) %>% arrange(Id)
BootstrapSample2 %>% select(Id, price)
## # A tibble: 100 × 2
##       Id price
##    <int> <dbl>
##  1     2  885.
##  2     2  885.
##  3     4  253.
##  4     4  253.
##  5     5  468.
##  6     6  310.
##  7     7  550.
##  8     7  550.
##  9     7  550.
## 10     8  485.
## # ℹ 90 more rows

Median Price:

median(BootstrapSample2$price)
## [1] 510

Bootstrap Sample 3

BootstrapSample3 <- sample_n(Houses, 100, replace=TRUE) %>% arrange(Id)
BootstrapSample3 %>% select(Id, price)
## # A tibble: 100 × 2
##       Id price
##    <int> <dbl>
##  1     1 1225 
##  2     1 1225 
##  3     1 1225 
##  4     2  885.
##  5     4  253.
##  6     5  468.
##  7     6  310.
##  8     8  485.
##  9     8  485.
## 10     8  485.
## # ℹ 90 more rows

Median Price:

median(BootstrapSample3$price)
## [1] 538

Now, we’ll take 10,000 bootstrap samples, and record the median price in each sample.

Sample_Med <- median(Houses$price)  # calculate sample median
Bootstrap_Med <- rep(NA, 10000) # set up vector to hold bootstrap statistics

for (i in 1:10000){
  BootstrapSample <- sample_n(Houses, 100, replace=TRUE) # take bootstrap sample
  Bootstrap_Med[i] <- median(BootstrapSample$price) # calculate median in bootstrap sample
}
Houses_Bootstrap_Results_Med <- data.frame(Bootstrap_Med)  # store results in data frame

The bootstrap distribution for the median price is shown below, along with its standard error.

Houses_Bootstrap_Med_Plot <- ggplot(data=Houses_Bootstrap_Results_Med, 
                                    aes(x=Bootstrap_Med)) +  
  geom_histogram(color="white", fill="lightblue") +
  xlab("Median Price in Bootstrap Sample ") + ylab("Frequency") +
  ggtitle("Bootstrap Distribution for Median Price in Seattle Houses") + 
  theme(legend.position = "none") 
Houses_Bootstrap_Med_Plot 

We’ll calculate the standard error of the median. This is a measure of how much the median varies between samples.

SE_Med <- sd(Houses_Bootstrap_Results_Med$Bootstrap_Med)
SE_Med
## [1] 47.88898

The standard error measures the amount of variability in median house price between different samples of size 100.

Note that this is different from the sample standard deviation, which measures the variability in prices between the 100 individual houses in the sample.

Notice that the bootstrap distribution for the median is not symmetric and bell-shaped. Thus, we cannot be assured that 95% of samples will produce a statistic within two standard errors of the true parameter, so the standard error confidence interval method is not appropriate here. Instead, we’ll calculate a confidence interval by taking the middle 95% of the values in the bootstrap distribution. A confidence interval calculated this way is called a percentile bootstrap interval.

We’ll calculate the 0.025 quantile and the 0.975 quantile of this bootstrap distribution. These are the points below which lie 2.5% and 97.5% of the medians in the bootstrap distribution. Thus, the middle 95% of the medians lie between these values.

q.025 <- quantile(Houses_Bootstrap_Results_Med$Bootstrap_Med, 0.025)
q.025
## 2.5% 
##  410
q.975 <- quantile(Houses_Bootstrap_Results_Med$Bootstrap_Med, 0.975)
q.975 
## 97.5% 
##   600

The 95% confidence interval is shown by the gold bar on the graph of the bootstrap distribution below.

Houses_Bootstrap_Med_Plot + 
  geom_segment(aes(x=q.025,xend=q.975, y=50, yend=50), 
               color="gold", size=10, alpha=0.01) 

We are 95% confident that the median price among all houses that sold in Seattle between 2014 and 2015 is between 410 and 600 thousand dollars.

4.3.5 CI for Difference in Means

We previously calculated a confidence interval for the average mercury level among all lakes in Florida.

Now, we’ll calculate an interval for the difference in average mercury levels between lakes in Northern Florida, compared to Southern Florida.

The boxplot and table below describe the distribution of mercury levels for lakes in Northern Florida, compared to Southern Florida.

LakesBP <- ggplot(data=FloridaLakes, aes(x=Location, y=Mercury, fill=Location)) + 
  geom_boxplot() + geom_jitter() + 
  ggtitle("Mercury Levels in Florida Lakes") + 
  xlab("Location") + ylab("Mercury Level") + 
  theme(axis.text.x = element_text(angle = 90)) + coord_flip()
LakesBP

LakesTable <- FloridaLakes %>% group_by(Location) %>% summarize(MeanHg=mean(Mercury), 
                                                  StDevHg=sd(Mercury), 
                                                  N=n())
LakesTable
## # A tibble: 2 × 4
##   Location MeanHg StDevHg     N
##   <fct>     <dbl>   <dbl> <int>
## 1 N         0.425   0.270    33
## 2 S         0.696   0.384    20

In our sample of 33 Northern Lakes and 20 Southern Lakes, we saw a difference of 0.27 ppm. We’ll calculate a confidence interval to estimate how big or small this difference could be among all Florida lakes.

We’ll use a statistical model to calculate the average mercury levels in Northern and Southern Florida.

\(\widehat{\text{Mercury}} = b_0 + b_1\times\text{South}\)

  • \(b_0\) represents the mean mercury level for lakes in North Florida, and
  • \(b_1\) represents the mean difference in mercury level for lakes in South Florida, compared to North Florida.

The estimates corresponding to the original sample are shown below.

M <- lm(data=FloridaLakes, Mercury~Location)
M
## 
## Call:
## lm(formula = Mercury ~ Location, data = FloridaLakes)
## 
## Coefficients:
## (Intercept)    LocationS  
##      0.4245       0.2720

Thus, we can obtain a confidence interval for the difference in average mercury levels by fitting a regression model to each of our bootstrap samples and recording the value of the sample statistic \(b_1\), which represents this difference. Alternatively, we could calculate the mean from each group separately and subtract.
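
For instance, the alternative calculation could look like this (computing each group mean, then subtracting North from South):

GroupMeans <- FloridaLakes %>% group_by(Location) %>% 
  summarize(MeanHg = mean(Mercury))  # means for N and S
diff(GroupMeans$MeanHg)  # S minus N; matches the LocationS coefficient, 0.272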

When comparing groups, we make one modification in Step #1 of the bootstrap process. Rather than drawing a sample of size \(n\) at random, with replacement, we’ll draw the same number of observations from each group as were in the original sample. In this case, we had 33 northern lakes, and 20 southern lakes.

Bootstrap Steps

  1. Take a sample of 33 northern lakes and 20 southern lakes by randomly sampling from the original sample, with replacement.

  2. Fit a regression model with location as the explanatory variable and record the value of \(b_1\), representing the difference between the means for each group (South-North).

  3. Repeat steps 1 and 2 many (say 10,000) times, keeping track of the difference in means in each bootstrap sample.

  4. Look at the distribution of the differences in means across bootstrap samples. The variability in this bootstrap distribution can be used to approximate the variability in the sampling distribution for the difference in means between mercury levels in Northern and Southern Florida.

We’ll illustrate the procedure on 3 bootstrap samples.

Bootstrap Sample 1

NLakes <- sample_n(FloridaLakes %>% filter(Location=="N"), 33, replace=TRUE)   ## sample 33 northern lakes
SLakes <- sample_n(FloridaLakes %>% filter(Location=="S"), 20, replace=TRUE)   ## sample 20 southern lakes
BootstrapSample1 <- rbind(NLakes, SLakes) %>% arrange(ID) %>% 
  select(ID, Lake, Location, Mercury)   ## combine Northern and Southern Lakes
BootstrapSample1
## # A tibble: 53 × 4
##       ID Lake      Location Mercury
##    <int> <chr>     <fct>      <dbl>
##  1     1 Alligator S           1.23
##  2     2 Annie     S           1.33
##  3     3 Apopka    N           0.04
##  4     6 Bryant    N           0.27
##  5     7 Cherry    N           0.48
##  6     8 Crescent  N           0.19
##  7     8 Crescent  N           0.19
##  8    11 Dorr      N           0.71
##  9    12 Down      S           0.5 
## 10    12 Down      S           0.5 
## # ℹ 43 more rows

We fit a regression model to the bootstrap sample and calculate the regression coefficients. We’re interested in the second coefficient, \(b_1\), which represents the mean difference between lakes in Southern and Northern Florida.

Mb1 <- lm(data=BootstrapSample1, Mercury ~ Location) ## fit linear model
Mb1
## 
## Call:
## lm(formula = Mercury ~ Location, data = BootstrapSample1)
## 
## Coefficients:
## (Intercept)    LocationS  
##      0.3864       0.2391

Bootstrap Sample 2

NLakes <- sample_n(FloridaLakes %>% filter(Location=="N"), 33, replace=TRUE)   ## sample 33 northern lakes
SLakes <- sample_n(FloridaLakes %>% filter(Location=="S"), 20, replace=TRUE)   ## sample 20 southern lakes
BootstrapSample2 <- rbind(NLakes, SLakes) %>% arrange(ID) %>% 
  select(ID, Lake, Location, Mercury)   ## combine Northern and Southern Lakes
BootstrapSample2
## # A tibble: 53 × 4
##       ID Lake         Location Mercury
##    <int> <chr>        <fct>      <dbl>
##  1     1 Alligator    S           1.23
##  2     1 Alligator    S           1.23
##  3     2 Annie        S           1.33
##  4     3 Apopka       N           0.04
##  5     3 Apopka       N           0.04
##  6     4 Blue Cypress S           0.44
##  7     8 Crescent     N           0.19
##  8     9 Deer Point   N           0.83
##  9    10 Dias         N           0.81
## 10    10 Dias         N           0.81
## # ℹ 43 more rows

We fit the model to the second bootstrap sample and record the coefficients.

Mb2 <- lm(data=BootstrapSample2, Mercury ~ Location) ## fit linear model
Mb2
## 
## Call:
## lm(formula = Mercury ~ Location, data = BootstrapSample2)
## 
## Coefficients:
## (Intercept)    LocationS  
##      0.4876       0.3934

Bootstrap Sample 3

NLakes <- sample_n(FloridaLakes %>% filter(Location=="N"), 33, replace=TRUE)   ## sample 33 northern lakes
SLakes <- sample_n(FloridaLakes %>% filter(Location=="S"), 20, replace=TRUE)   ## sample 20 southern lakes
BootstrapSample3 <- rbind(NLakes, SLakes) %>% arrange(ID) %>% 
  select(ID, Lake, Location, Mercury)   ## combine Northern and Southern Lakes
BootstrapSample3
## # A tibble: 53 × 4
##       ID Lake       Location Mercury
##    <int> <chr>      <fct>      <dbl>
##  1     2 Annie      S           1.33
##  2     2 Annie      S           1.33
##  3     2 Annie      S           1.33
##  4     3 Apopka     N           0.04
##  5     5 Brick      S           1.2 
##  6     6 Bryant     N           0.27
##  7     7 Cherry     N           0.48
##  8     9 Deer Point N           0.83
##  9    10 Dias       N           0.81
## 10    12 Down       S           0.5 
## # ℹ 43 more rows

Mb3 <- lm(data=BootstrapSample3, Mercury ~ Location) ## fit linear model
Mb3
## 
## Call:
## lm(formula = Mercury ~ Location, data = BootstrapSample3)
## 
## Coefficients:
## (Intercept)    LocationS  
##      0.4436       0.2259

We’ll now take 10,000 different bootstrap samples and look at the bootstrap distribution for \(b_1\), the difference in mean mercury levels between lakes in Southern and Northern Florida.

M <- lm(data=FloridaLakes, Mercury~Location) # fit model to original sample
Sample_b1 <- M$coefficients[2] # record b1 value (second coefficient)
Bootstrap_b1 <- rep(NA, 10000)  # vector to store b1 values

for (i in 1:10000){
  NLakes <- sample_n(FloridaLakes %>% filter(Location=="N"), 33, replace=TRUE) ## sample 33 northern lakes
  SLakes <- sample_n(FloridaLakes %>% filter(Location=="S"), 20, replace=TRUE) ## sample 20 southern lakes
  BootstrapSample <- rbind(NLakes, SLakes)  ## combine Northern and Southern Lakes
  M <- lm(data=BootstrapSample, Mercury ~ Location) ## fit linear model
  Bootstrap_b1[i] <- M$coefficients[2] ## record b1
}
NS_Lakes_Bootstrap_Results <- data.frame(Bootstrap_b1)  # save results as data frame

The bootstrap distribution for the difference in means, \(b_1\), is shown below, along with the standard error for the difference.

NS_Lakes_Bootstrap_Plot_b1 <- ggplot(data=NS_Lakes_Bootstrap_Results, aes(x=Bootstrap_b1)) +  
  geom_histogram(color="white", fill="lightblue") + 
  xlab("Mean Difference (b1) in Bootstrap Sample") + ylab("Frequency") +
  ggtitle("Northern vs Southern Lakes: Bootstrap Distribution for b1") 
NS_Lakes_Bootstrap_Plot_b1

Standard Error of the Difference in Means \(b_1\)

SE_b1 <- sd(NS_Lakes_Bootstrap_Results$Bootstrap_b1)
SE_b1
## [1] 0.09620419

The bootstrap distribution is symmetric and bell-shaped, so we can use the standard error method to calculate a 95% confidence interval.

\[ \begin{aligned} & b_1 \pm 2\times\text{SE}(b_1) \\ & = 0.272 \pm 2\times{0.096} \end{aligned} \]

95% Confidence Interval:

c(Sample_b1 - 2*SE_b1, Sample_b1 + 2*SE_b1) 
##  LocationS  LocationS 
## 0.07954616 0.46436293

The 95% confidence interval is shown by the gold bar on the graph of the bootstrap distribution below.

NS_Lakes_Bootstrap_Plot_b1 + 
  geom_segment(aes(x=Sample_b1 - 2*SE_b1,xend=Sample_b1 + 2*SE_b1, y=50, yend=50), 
               color="gold", size=10, alpha=0.01) 

We are 95% confident that the mean mercury level among all lakes in Southern Florida is between 0.0795 and 0.4644 ppm higher than the mean mercury level among all lakes in Northern Florida.

Question: We previously performed a hypothesis test and concluded that there was evidence that mean mercury level was higher for lakes in Southern Florida than in Northern Florida. Is this confidence interval consistent with the result of the hypothesis test? Why or why not?

4.3.6 CI for Regression Slope

Recall that we examined the relationship between mercury levels and pH in Florida lakes.

ggplot(data=FloridaLakes, aes(y=Mercury, x=pH)) + 
  geom_point() + stat_smooth(method="lm", se=FALSE)

The regression equation is

\[ \widehat{\text{Mercury}} = b_0 + b_1\times\text{pH} \]

Regression estimates \(b_0\) and \(b_1\) are shown below.

M <- lm(data=FloridaLakes, Mercury~pH)
M
## 
## Call:
## lm(formula = Mercury ~ pH, data = FloridaLakes)
## 
## Coefficients:
## (Intercept)           pH  
##      1.5309      -0.1523
  • On average, lakes with pH level 0 are expected to have a mercury level of 1.53 ppm.
  • For each one-unit increase in pH, mercury level is expected to decrease by 0.15 ppm.

These estimates are sample statistics, calculated from our sample of 53 lakes. We can think of our regression equation estimates \(b_0\) and \(b_1\) as estimates of parameters \(\beta_0\) and \(\beta_1\), the intercept and slope of the regression line pertaining to the entire population of all lakes in Florida. We’ll use \(b_0\) and \(b_1\) to estimate \(\beta_0\) and \(\beta_1\) in the same way that we used sample proportion \(\hat{p}\) to estimate population proportion \(p\) and sample mean \(\bar{x}\) to estimate population mean \(\mu\).

The intercept, \(\beta_0\), has little meaning here, but the slope, \(\beta_1\), represents the average change in mercury level for each one-unit increase in pH, among all Florida lakes. We’ll use bootstrapping to find a confidence interval for this quantity.

Bootstrap Steps

  1. Take a sample of 53 lakes by randomly sampling from the original sample, with replacement.

  2. Fit a regression model with pH as the explanatory variable and record the value of slope \(b_1\).

  3. Repeat steps 1 and 2 many (say 10,000) times, keeping track of slope of the regression line for each bootstrap sample.

  4. Look at the distribution of the slopes across bootstrap samples. The variability in this bootstrap distribution can be used to approximate the variability in the sampling distribution for the slope relating mercury and pH levels.

We’ll illustrate the procedure on 3 bootstrap samples.

Bootstrap Sample 1

BootstrapSample1 <- sample_n(FloridaLakes , 53, replace=TRUE)  %>% arrange(ID) %>% 
  select(ID, Lake, pH, Mercury)   # take bootstrap sample
BootstrapSample1
## # A tibble: 53 × 4
##       ID Lake            pH Mercury
##    <int> <chr>        <dbl>   <dbl>
##  1     1 Alligator      6.1    1.23
##  2     1 Alligator      6.1    1.23
##  3     1 Alligator      6.1    1.23
##  4     2 Annie          5.1    1.33
##  5     2 Annie          5.1    1.33
##  6     2 Annie          5.1    1.33
##  7     3 Apopka         9.1    0.04
##  8     3 Apopka         9.1    0.04
##  9     3 Apopka         9.1    0.04
## 10     4 Blue Cypress   6.9    0.44
## # ℹ 43 more rows

We fit a regression model to the bootstrap sample and calculate the regression coefficients. We’re again interested in the second coefficient, \(b_1\), which now represents the slope of the regression line.

Mb1 <- lm(data=BootstrapSample1, Mercury ~ pH) # fit linear model
Mb1
## 
## Call:
## lm(formula = Mercury ~ pH, data = BootstrapSample1)
## 
## Coefficients:
## (Intercept)           pH  
##      1.7903      -0.1918

Bootstrap Sample 2

BootstrapSample2 <- sample_n(FloridaLakes , 53, replace=TRUE)  %>% arrange(ID) %>% 
  select(ID, Lake, pH, Mercury)
BootstrapSample2
## # A tibble: 53 × 4
##       ID Lake            pH Mercury
##    <int> <chr>        <dbl>   <dbl>
##  1     1 Alligator      6.1    1.23
##  2     4 Blue Cypress   6.9    0.44
##  3     4 Blue Cypress   6.9    0.44
##  4     4 Blue Cypress   6.9    0.44
##  5     5 Brick          4.6    1.2 
##  6     5 Brick          4.6    1.2 
##  7     6 Bryant         7.3    0.27
##  8     7 Cherry         5.4    0.48
##  9     8 Crescent       8.1    0.19
## 10     8 Crescent       8.1    0.19
## # ℹ 43 more rows
Mb2 <- lm(data=BootstrapSample2, Mercury ~ pH) # fit linear model
Mb2
## 
## Call:
## lm(formula = Mercury ~ pH, data = BootstrapSample2)
## 
## Coefficients:
## (Intercept)           pH  
##      1.5724      -0.1661

Bootstrap Sample 3

BootstrapSample3 <- sample_n(FloridaLakes , 53, replace=TRUE)  %>% arrange(ID) %>% 
  select(ID, Lake, pH, Mercury)
BootstrapSample3
## # A tibble: 53 × 4
##       ID Lake          pH Mercury
##    <int> <chr>      <dbl>   <dbl>
##  1     5 Brick        4.6    1.2 
##  2     6 Bryant       7.3    0.27
##  3     6 Bryant       7.3    0.27
##  4     7 Cherry       5.4    0.48
##  5     8 Crescent     8.1    0.19
##  6     9 Deer Point   5.8    0.83
##  7     9 Deer Point   5.8    0.83
##  8    10 Dias         6.4    0.81
##  9    12 Down         7.2    0.5 
## 10    12 Down         7.2    0.5 
## # ℹ 43 more rows
Mb3 <- lm(data=BootstrapSample3, Mercury ~ pH) # fit linear model
Mb3
## 
## Call:
## lm(formula = Mercury ~ pH, data = BootstrapSample3)
## 
## Coefficients:
## (Intercept)           pH  
##      1.6215      -0.1612

We’ll now take 10,000 different bootstrap samples and look at the bootstrap distribution for \(b_1\), the slope of the regression line relating mercury level and pH.

M <- lm(data=FloridaLakes, Mercury~pH) #fit model to original sample
Sample_b1 <- M$coefficients[2] # record b1 value (second coefficient)
Bootstrap_b1 <- rep(NA, 10000)  #vector to store b1 values

for (i in 1:10000){
BootstrapSample <- sample_n(FloridaLakes , 53, replace=TRUE)   #take bootstrap sample
M <- lm(data=BootstrapSample, Mercury ~ pH) # fit linear model
Bootstrap_b1[i] <- M$coefficients[2] # record b1 
}
Lakes_Bootstrap_Slope_Results <- data.frame(Bootstrap_b1)  #save results as dataframe

The bootstrap distribution for the slopes, \(b_1\), is shown below, along with the standard error for the slope.

Lakes_Bootstrap_Plot_Slope <- ggplot(data=Lakes_Bootstrap_Slope_Results, aes(x=Bootstrap_b1)) +
  geom_histogram(color="white", fill="lightblue") + 
  xlab("Slope in Bootstrap Sample") + ylab("Frequency") +
  ggtitle("Bootstrap Distribution for Slope") 
Lakes_Bootstrap_Plot_Slope

Standard Error of the slope \(b_1\)

SE_b1 <- sd(Lakes_Bootstrap_Slope_Results$Bootstrap_b1)
SE_b1
## [1] 0.02676588

The bootstrap distribution is symmetric and bell-shaped, so we can use the standard error method to calculate a 95% confidence interval.

\[ \begin{aligned} & b_1 \pm 2\times\text{SE}(b_1) \\ & = -0.1523 \pm 2\times{0.027} \end{aligned} \]

95% Confidence Interval:

c(Sample_b1 - 2*SE_b1, Sample_b1 + 2*SE_b1) 
##         pH         pH 
## -0.2058326 -0.0987691

The 95% confidence interval is shown by the gold bar on the graph of the bootstrap distribution below.

Lakes_Bootstrap_Plot_Slope + 
  geom_segment(aes(x=Sample_b1 - 2*SE_b1,xend=Sample_b1 + 2*SE_b1, y=50, yend=50), 
               color="gold", size=10, alpha=0.01) 

We are 95% confident that among all Florida lakes, for each one-unit increase in pH, mercury level decreases by between 0.0988 and 0.2058 ppm, on average.

4.3.7 CI for Regression Response

In addition to calculating a confidence interval for the slope of the regression line relating mercury and pH levels in a lake, we can also calculate a confidence interval for the average mercury level among all lakes with a given pH.

We’ll calculate a confidence interval for the average mercury level among all lakes with a neutral pH level of 7.

The regression equation is

\[ \begin{aligned} \widehat{\text{Mercury}} & = b_0 + b_1\times\text{pH} \\ & = 1.5309 - 0.1523\times\text{pH} \end{aligned} \]

so the expected mercury level among all lakes with \(\text{pH} = 7\) is \(b_0+7b_1 = 1.5309-0.1523(7)=0.4648\) ppm.
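
We can check this arithmetic in R. A minimal sketch, refitting the model to the original sample and combining the coefficients:

M <- lm(data=FloridaLakes, Mercury~pH) # fit model to original sample
b0 <- M$coefficients[1] # record b0 (first coefficient)
b1 <- M$coefficients[2] # record b1 (second coefficient)
unname(b0 + 7*b1) # expected mercury level when pH = 7; approximately 0.4648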

This quantity is a statistic calculated from a sample of 53 lakes, so we would not expect the average mercury level among all lakes in the population to be exactly equal to 0.4648. Again, we’ll use this sample statistic as an estimate of the population parameter, and use bootstrapping to estimate the variability associated with this statistic, in order to make a confidence interval.

Bootstrap Steps

  1. Take a sample of 53 lakes by randomly sampling from the original sample, with replacement.

  2. Fit a regression model with pH as the explanatory variable and record the values of \(b_0\) and \(b_1\). Use these to calculate \(b_0+7b_1\).

  3. Repeat steps 1 and 2 many (say 10,000) times, keeping track of \(b_0\) and \(b_1\), and calculating \(b_0+7b_1\) in each bootstrap sample.

  4. Look at the distribution of the expected response, \(b_0 + 7b_1\), across bootstrap samples. The variability in this bootstrap distribution can be used to approximate the variability in the sampling distribution for the expected mercury level among all lakes with pH level of 7.

We’ll illustrate the procedure on 3 bootstrap samples.

Bootstrap Sample 1

BootstrapSample1 <- sample_n(FloridaLakes , 53, replace=TRUE)  %>% arrange(ID) %>% 
  select(ID, Lake, pH, Mercury)   # take bootstrap sample
BootstrapSample1
## # A tibble: 53 × 4
##       ID Lake          pH Mercury
##    <int> <chr>      <dbl>   <dbl>
##  1     2 Annie        5.1    1.33
##  2     3 Apopka       9.1    0.04
##  3     5 Brick        4.6    1.2 
##  4     9 Deer Point   5.8    0.83
##  5    10 Dias         6.4    0.81
##  6    10 Dias         6.4    0.81
##  7    11 Dorr         5.4    0.71
##  8    11 Dorr         5.4    0.71
##  9    12 Down         7.2    0.5 
## 10    13 Eaton        7.2    0.49
## # ℹ 43 more rows

We fit a regression model to the bootstrap sample and record the regression coefficients \(b_0\) and \(b_1\), which we use to calculate the expected mercury level for lakes with pH level 7, \(b_0+7b_1\).

Mb1 <- lm(data=BootstrapSample1, Mercury ~ pH) ## fit linear model
b0 <- Mb1$coefficients[1] # record value of b0 (first coefficient)
b1 <- Mb1$coefficients[2] # record value of b1 (second coefficient)
b0+7*b1 #calculate b0+7*b1
## (Intercept) 
##   0.4726683

Bootstrap Sample 2

BootstrapSample2 <- sample_n(FloridaLakes , 53, replace=TRUE)  %>% arrange(ID) %>% 
  select(ID, Lake, pH, Mercury)
BootstrapSample2
## # A tibble: 53 × 4
##       ID Lake            pH Mercury
##    <int> <chr>        <dbl>   <dbl>
##  1     4 Blue Cypress   6.9    0.44
##  2     6 Bryant         7.3    0.27
##  3     6 Bryant         7.3    0.27
##  4     6 Bryant         7.3    0.27
##  5     7 Cherry         5.4    0.48
##  6     8 Crescent       8.1    0.19
##  7     8 Crescent       8.1    0.19
##  8    10 Dias           6.4    0.81
##  9    10 Dias           6.4    0.81
## 10    11 Dorr           5.4    0.71
## # ℹ 43 more rows
Mb2 <- lm(data=BootstrapSample2, Mercury ~ pH) # fit linear model
b0 <- Mb2$coefficients[1] # record value of b0 (first coefficient)
b1 <- Mb2$coefficients[2] # record value of b1 (second coefficient)
b0+7*b1 #calculate b0+7*b1
## (Intercept) 
##   0.4574841

Bootstrap Sample 3

BootstrapSample3 <- sample_n(FloridaLakes , 53, replace=TRUE)  %>% arrange(ID) %>% 
  select(ID, Lake, pH, Mercury)
BootstrapSample3
## # A tibble: 53 × 4
##       ID Lake            pH Mercury
##    <int> <chr>        <dbl>   <dbl>
##  1     1 Alligator      6.1    1.23
##  2     2 Annie          5.1    1.33
##  3     3 Apopka         9.1    0.04
##  4     3 Apopka         9.1    0.04
##  5     4 Blue Cypress   6.9    0.44
##  6     4 Blue Cypress   6.9    0.44
##  7     4 Blue Cypress   6.9    0.44
##  8     5 Brick          4.6    1.2 
##  9     6 Bryant         7.3    0.27
## 10     6 Bryant         7.3    0.27
## # ℹ 43 more rows
Mb3 <- lm(data=BootstrapSample3, Mercury ~ pH) # fit linear model
b0 <- Mb3$coefficients[1] # record value of b0 (first coefficient)
b1 <- Mb3$coefficients[2] # record value of b1 (second coefficient)
b0+7*b1 #calculate b0+7*b1
## (Intercept) 
##   0.4077507

We’ll now take 10,000 different bootstrap samples and record the values of \(b_0\) and \(b_1\), which we’ll then use to calculate \(b_0+7b_1\).

M <- lm(data=FloridaLakes, Mercury~pH) #fit model to original sample
Sample_b0 <- M$coefficients[1] # record b0 value (first coefficient)
Sample_b1 <- M$coefficients[2] # record b1 value (second coefficient)
Sample_Exp7 <- Sample_b0 + 7*Sample_b1 # calculate sample expected mercury when pH=7
Bootstrap_b0 <- rep(NA, 10000)  #vector to store b0 values
Bootstrap_b1 <- rep(NA, 10000)  #vector to store b1 values

for (i in 1:10000){
BootstrapSample <- sample_n(FloridaLakes , 53, replace=TRUE)   #take bootstrap sample
M <- lm(data=BootstrapSample, Mercury ~ pH) # fit linear model
Bootstrap_b0[i] <- M$coefficients[1] # record b0 
Bootstrap_b1[i] <- M$coefficients[2] # record b1 
}

Bootstrap_Exp7 <-  Bootstrap_b0 + 7*Bootstrap_b1 # calculate expected response for each bootstrap sample

Lakes_Bootstrap_Exp7_Results <- data.frame(Bootstrap_b0, Bootstrap_b1, Bootstrap_Exp7)  #save results as dataframe

The bootstrap distribution for the expected mercury level among all lakes with pH level 7, \(b_0+7b_1\), is shown below, along with the standard error for this quantity.

Lakes_Bootstrap_Plot_Exp7 <- ggplot(data=Lakes_Bootstrap_Exp7_Results, aes(x=Bootstrap_Exp7)) +  
  geom_histogram(color="white", fill="lightblue") + 
  xlab("Expected Mercury Level in Bootstrap Sample") + ylab("Frequency") +
  ggtitle( "Bootstrap Distribution for Exp. Mercury when pH=7") 
Lakes_Bootstrap_Plot_Exp7

Standard Error of the expected response \(b_0 + 7b_1\)

SE_Exp7 <- sd(Lakes_Bootstrap_Exp7_Results$Bootstrap_Exp7)
SE_Exp7
## [1] 0.03676639

Again, the bootstrap distribution is symmetric and bell-shaped, so we can use the standard error method to calculate a 95% confidence interval.

\[ \begin{aligned} & (b_0+7b_1) \pm 2\times\text{SE}(b_0+7b_1) \\ & = 0.4648 \pm 2\times{0.037} \end{aligned} \]

95% Confidence Interval:

c(Sample_Exp7 - 2*SE_Exp7, Sample_Exp7 + 2*SE_Exp7) 
## (Intercept) (Intercept) 
##   0.3912799   0.5383454

The 95% confidence interval is shown by the gold bar on the graph of the bootstrap distribution below.

Lakes_Bootstrap_Plot_Exp7 + 
  geom_segment(aes(x=Sample_Exp7 - 2*SE_Exp7,xend=Sample_Exp7 + 2*SE_Exp7, y=50, yend=50), 
               color="gold", size=10, alpha=0.01) 

We are 95% confident that the average mercury level among all Florida lakes with pH level 7 is between 0.3913 and 0.5383 ppm.

Again, we are not saying that we think the mercury level of an individual lake with a pH level of 7 will lie in this range, only that the average mercury level among all such lakes lies in this range.

4.3.8 More CI’s in Regression

We saw in the previous two examples how to calculate a confidence interval for the slope of a regression line, and for an expected response in regression. In fact, we can calculate confidence intervals for any function involving regression coefficients \(\beta_0, \beta_1, \ldots, \beta_p\), in a similar manner.

For example, let’s consider the model for Seattle house prices that involved square feet, whether or not the house was on the waterfront, and an interaction term between these variables.

The model is

\[ \widehat{Price} = b_0 + b_1\times\text{Sq. Ft.} + b_2\times\text{Waterfront} + b_3\times\text{Sq. Ft.}\times\text{Waterfront} \]

We fit the model and obtain the parameter estimates shown below.

M <- lm(data=Houses, price~sqft_living + waterfront +
          sqft_living:waterfront) #fit model to original sample
Sample_b0 <- M$coefficients[1] # record b0 value (first coefficient)
Sample_b1 <- M$coefficients[2] # record b1 value (second coefficient)
Sample_b2 <- M$coefficients[3] # record b2 value (third coefficient)
Sample_b3 <- M$coefficients[4] # record b3 value (fourth coefficient)
M
## 
## Call:
## lm(formula = price ~ sqft_living + waterfront + sqft_living:waterfront, 
##     data = Houses)
## 
## Coefficients:
##               (Intercept)                sqft_living  
##                   67.3959                     0.2184  
##             waterfrontYes  sqft_living:waterfrontYes  
##                 -364.5950                     0.4327

Consider the following quantities that we might be interested in estimating:

  1. The expected price of a 2,000 square foot waterfront house.
  2. The expected price of a 1,500 square foot non-waterfront house.
  3. The difference between the expected price of an 1,800 square foot house on the waterfront, compared to a house of the same size that is not on the waterfront.
  4. The difference in the rate of change in house prices for each additional 100 square feet for houses on the waterfront, compared to houses not on the waterfront.

Each of these quantities can be expressed as a linear function of our regression coefficients \(b_0, b_1, b_2, b_3\). We just need to find the appropriate function of the \(b_j\)’s, and then calculate a bootstrap confidence interval for that quantity, using the same steps we’ve seen in the previous examples.

Substituting into the regression equation, we see that:

  1. The expected price of a 2,000 square foot waterfront house is given by \[b_0 + 2000b_1 + b_2 + 2000b_3\]

We calculate this estimate from the model, based on our sample of 100 houses:

Sample_b0 + 2000*Sample_b1 + Sample_b2 + 2000*Sample_b3 # calculate b0+2000b1+b2+2000b3
## (Intercept) 
##    1004.874

We estimate that the average price of all 2,000 square foot waterfront houses in Seattle is 1,005 thousand dollars.

  2. The expected price of a 1,500 square foot non-waterfront house is given by \[b_0 + 1500b_1\]
Sample_b0 + 1500*Sample_b1 # calculate b0+1500b1
## (Intercept) 
##    394.9499

We estimate that the average price of all 1,500 square foot non-waterfront houses in Seattle is 395 thousand dollars.

  3. The difference between the expected price of an 1,800 square foot house on the waterfront, compared to a house of the same size that is not on the waterfront, is given by:

\[ \begin{aligned} & (b_0 + 1800b_1 + b_2 + 1800b_3) - (b_0 + 1800b_1) \\ & = b_2 +1800b_3 \end{aligned} \]

Sample_b2+1800*Sample_b3 # calculate b2+1800b3
## waterfrontYes 
##      414.2058

We estimate that on average a 1,800 square foot house on the waterfront will cost 414 thousand dollars more than a 1,800 square foot house not on the waterfront.

  4. The difference in the rate of change in house prices for each additional 100 square feet for houses on the waterfront, compared to houses not on the waterfront.

This question is asking about the difference in slopes of the regression lines relating price and square feet for houses on the waterfront, compared to those not on the waterfront.

For houses on the waterfront, the regression equation is

\[ \widehat{Price} = (b_0 + b_2) + (b_1 +b_3)\times\text{Sq. Ft.}, \]

so the slope is \(b_1 + b_3\).

For houses not on the waterfront, the regression equation is

\[ \widehat{Price} = b_0 + b_1 \times\text{Sq. Ft.}, \]

so the slope is \(b_1\).

These slopes pertain to the expected change in price for each additional 1 square foot. So, for a 100-square foot increase, the price of a waterfront house is expected to increase by \(100(b_1+b_3)\), compared to an increase of \(100b_1\) for a non-waterfront house. Thus, the difference in the rates of change is \(100b_3\).

100*Sample_b3 # calculate 100b3
## sqft_living:waterfrontYes 
##                  43.26671

We estimate that the price of waterfront houses increases by 43 thousand dollars more for each additional 100 square feet than the price of non-waterfront houses.

These estimates calculated from the sample are statistics, which, like all the other statistics we’ve seen, are likely to vary from the true values of the corresponding population parameters, due to variability between samples. We can use bootstrapping to calculate confidence intervals for the relevant population parameters, using these sample statistics (the functions of the \(b_j\)’s), just as we’ve done for other statistics.

Bootstrap Steps

  1. Take a sample of 100 houses by randomly sampling from the original sample, with replacement.

  2. Fit a regression model with square feet, waterfront status, and their interaction as explanatory variables, and record the values of the regression coefficients \(b_0, b_1, b_2, b_3\). Use these to calculate each of the four desired quantities (e.g. \(b_0 + 2000b_1 + b_2 +2000b_3\)).

  3. Repeat steps 1 and 2 many (say 10,000) times, keeping track of the regression coefficients and calculating the desired quantities in each bootstrap sample.

  4. Look at the distribution of the quantities of interest, across bootstrap samples. The variability in this bootstrap distribution can be used to approximate the variability in the sampling distribution for each of these quantities.

We’ll illustrate the procedure on 3 bootstrap samples.

Bootstrap Sample 1

We take the first bootstrap sample and fit a model with interaction. For brevity, we won’t list out the houses in each of the bootstrap samples, as the idea should be clear by now. Model coefficients are shown below.

BootstrapSample1 <- sample_n(Houses , 100, replace=TRUE)  %>% arrange(Id) %>% 
  select(Id, price, sqft_living, waterfront)   

Mb1 <- lm(data=BootstrapSample1, price ~ sqft_living + waterfront + sqft_living:waterfront) # fit linear model with interaction
b0 <- Mb1$coefficients[1] # record value of b0 (first coefficient)
b1 <- Mb1$coefficients[2] # record value of b1 (second coefficient)
b2 <- Mb1$coefficients[3] # record value of b2 (third coefficient)
b3 <- Mb1$coefficients[4] # record value of b3 (fourth coefficient)
Mb1
## 
## Call:
## lm(formula = price ~ sqft_living + waterfront + sqft_living:waterfront, 
##     data = BootstrapSample1)
## 
## Coefficients:
##               (Intercept)                sqft_living  
##                   23.1250                     0.2416  
##             waterfrontYes  sqft_living:waterfrontYes  
##                 -552.9883                     0.5005

We calculate each of the four desired quantities.

b0+2000*b1 + b2 + 2000*b3
## (Intercept) 
##    954.3864
b0+1500*b1
## (Intercept) 
##    385.5317
b2+1800*b3
## waterfrontYes 
##      347.9484
100*b3
## sqft_living:waterfrontYes 
##                  50.05204

Bootstrap Sample 2

BootstrapSample2 <- sample_n(Houses , 100, replace=TRUE)  %>% arrange(Id) %>% 
  select(Id, price, sqft_living, waterfront)   

Mb2 <- lm(data=BootstrapSample2, price ~ sqft_living + waterfront + sqft_living:waterfront) # fit linear model with interaction
b0 <- Mb2$coefficients[1] # record value of b0 (first coefficient)
b1 <- Mb2$coefficients[2] # record value of b1 (second coefficient)
b2 <- Mb2$coefficients[3] # record value of b2 (third coefficient)
b3 <- Mb2$coefficients[4] # record value of b3 (fourth coefficient)
Mb2
## 
## Call:
## lm(formula = price ~ sqft_living + waterfront + sqft_living:waterfront, 
##     data = BootstrapSample2)
## 
## Coefficients:
##               (Intercept)                sqft_living  
##                   97.8551                     0.2133  
##             waterfrontYes  sqft_living:waterfrontYes  
##                 -622.7018                     0.5163

We calculate each of the four desired quantities.

b0+2000*b1 + b2 + 2000*b3
## (Intercept) 
##    934.5378
b0+1500*b1
## (Intercept) 
##     417.871
b2+1800*b3
## waterfrontYes 
##      306.7251
100*b3
## sqft_living:waterfrontYes 
##                  51.63483

Bootstrap Sample 3

BootstrapSample3 <- sample_n(Houses , 100, replace=TRUE)  %>% arrange(Id) %>% 
  select(Id, price, sqft_living, waterfront)   

Mb3 <- lm(data=BootstrapSample3, price ~ sqft_living + waterfront + sqft_living:waterfront) # fit linear model with interaction
b0 <- Mb3$coefficients[1] # record value of b0 (first coefficient)
b1 <- Mb3$coefficients[2] # record value of b1 (second coefficient)
b2 <- Mb3$coefficients[3] # record value of b2 (third coefficient)
b3 <- Mb3$coefficients[4] # record value of b3 (fourth coefficient)
Mb3
## 
## Call:
## lm(formula = price ~ sqft_living + waterfront + sqft_living:waterfront, 
##     data = BootstrapSample3)
## 
## Coefficients:
##               (Intercept)                sqft_living  
##                   15.6004                     0.2456  
##             waterfrontYes  sqft_living:waterfrontYes  
##                  -85.2989                     0.3362

We calculate each of the four desired quantities.

b0+2000*b1 + b2 + 2000*b3
## (Intercept) 
##    1093.984
b0+1500*b1
## (Intercept) 
##    384.0153
b2+1800*b3
## waterfrontYes 
##      519.9176
100*b3
## sqft_living:waterfrontYes 
##                  33.62314

We’ll now take 10,000 different bootstrap samples and record the values of \(b_0\), \(b_1\), \(b_2\), and \(b_3\), which we’ll then use to calculate each of our four desired quantities.

M <- lm(data=Houses, price~sqft_living + waterfront +
          sqft_living:waterfront) #fit model to original sample
Sample_b0 <- M$coefficients[1] # record b0 value (first coefficient)
Sample_b1 <- M$coefficients[2] # record b1 value (second coefficient)
Sample_b2 <- M$coefficients[3] # record b2 value (third coefficient)
Sample_b3 <- M$coefficients[4] # record b3 value (fourth coefficient)
Sample_Q1 <- Sample_b0 + 2000*Sample_b1 +Sample_b2+2000*Sample_b3 # calculate b0+2000b1+b2+2000b3
Sample_Q2 <- Sample_b0 + 1500*Sample_b1 # calculate b0+1500b1
Sample_Q3 <- Sample_b2+1800*Sample_b3 # calculate b2+1800b3
Sample_Q4 <- 100*Sample_b3 # calculate 100b3

Bootstrap_b0 <- rep(NA, 10000)  #vector to store b0 values
Bootstrap_b1 <- rep(NA, 10000)  #vector to store b1 values
Bootstrap_b2 <- rep(NA, 10000)  #vector to store b2 values
Bootstrap_b3 <- rep(NA, 10000)  #vector to store b3 values


for (i in 1:10000){
BootstrapSample <- sample_n(Houses, 100, replace=TRUE)   #take bootstrap sample of same size as original sample
Mb <- lm(data=BootstrapSample, price ~ sqft_living + 
           waterfront + sqft_living:waterfront) # fit linear model with interaction
Bootstrap_b0[i] <- Mb$coefficients[1] # record value of b0 (first coefficient)
Bootstrap_b1[i] <- Mb$coefficients[2] # record value of b1 (second coefficient)
Bootstrap_b2[i] <- Mb$coefficients[3] # record value of b2 (third coefficient)
Bootstrap_b3[i] <- Mb$coefficients[4] # record value of b3 (fourth coefficient)
}

Bootstrap_Q1 <-  Bootstrap_b0 + 2000*Bootstrap_b1 + Bootstrap_b2 + 2000*Bootstrap_b3
Bootstrap_Q2 <-  Bootstrap_b0 + 1500*Bootstrap_b1 
Bootstrap_Q3 <-  Bootstrap_b2 + 1800*Bootstrap_b3
Bootstrap_Q4 <-  100*Bootstrap_b3

Houses_Bootstrap_Results <- data.frame(Bootstrap_b0, Bootstrap_b1, Bootstrap_b2, Bootstrap_b3, Bootstrap_Q1, Bootstrap_Q2 , Bootstrap_Q3 , Bootstrap_Q4)  #save results as dataframe

Bootstrap Distribution for \(b_0 + 2000b_1 + b_2 + 2000b_3\)

Houses_Bootstrap_Plot_Q1 <- ggplot(data=Houses_Bootstrap_Results, 
                                   aes(x=Bootstrap_Q1)) +  
  geom_histogram(color="white", fill="lightblue") + 
  xlab("Expected Price of 2000 Sq. Ft. Waterfront House") + ylab("Frequency") +
  ggtitle( "Bootstrap Distribution b0+2000b1+b2+2000b3") 
Houses_Bootstrap_Plot_Q1

Standard Error:

SE_Q1 <- sd(Houses_Bootstrap_Results$Bootstrap_Q1)
SE_Q1
## [1] 40.29367

The bootstrap distribution is symmetric and bell-shaped, so we can use the standard error method to calculate a 95% confidence interval.

95% Confidence Interval:

c(Sample_Q1 - 2*SE_Q1, Sample_Q1 + 2*SE_Q1) 
## (Intercept) (Intercept) 
##    924.2863   1085.4610

The 95% confidence interval is shown by the gold bar on the graph of the bootstrap distribution below.

Houses_Bootstrap_Plot_Q1 + 
  geom_segment(aes(x=Sample_Q1 - 2*SE_Q1,xend=Sample_Q1 + 2*SE_Q1, y=50, yend=50), 
               color="gold", size=10, alpha=0.01) 

We are 95% confident that the average price among all 2,000 square foot Seattle waterfront houses is between 924.29 and 1085.46 thousand dollars.

Bootstrap Distribution for \(b_0 + 1500b_1\)

Houses_Bootstrap_Plot_Q2 <- ggplot(data=Houses_Bootstrap_Results, 
                                   aes(x=Bootstrap_Q2)) +  
  geom_histogram(color="white", fill="lightblue") + 
  xlab("Expected Price of 1500 Sq. Ft. Non-Waterfront House") + ylab("Frequency") +
  ggtitle( "Bootstrap Distribution b0+1500b1") 
Houses_Bootstrap_Plot_Q2

Standard Error:

SE_Q2 <- sd(Houses_Bootstrap_Results$Bootstrap_Q2)
SE_Q2
## [1] 5.859015

The bootstrap distribution is symmetric and bell-shaped, so we can use the standard error method to calculate a 95% confidence interval.

95% Confidence Interval:

c(Sample_Q2 - 2*SE_Q2, Sample_Q2 + 2*SE_Q2) 
## (Intercept) (Intercept) 
##    383.2318    406.6679

The 95% confidence interval is shown by the gold bar on the graph of the bootstrap distribution below.

Houses_Bootstrap_Plot_Q2 + 
  geom_segment(aes(x=Sample_Q2 - 2*SE_Q2,xend=Sample_Q2 + 2*SE_Q2, y=50, yend=50), 
               color="gold", size=10, alpha=0.01) 

We are 95% confident that the average price among all 1,500 square foot Seattle non-waterfront houses is between 383.23 and 406.67 thousand dollars.

Bootstrap Distribution for \(b_2 + 1800b_3\)

Houses_Bootstrap_Plot_Q3 <- ggplot(data=Houses_Bootstrap_Results, 
                                   aes(x=Bootstrap_Q3)) +  
  geom_histogram(color="white", fill="lightblue") + 
  xlab("Expected Price Difference WF vs NWF for 1800 sq. Ft. House") + ylab("Frequency") +
  ggtitle( "Bootstrap Distribution b2+1800b3") 
Houses_Bootstrap_Plot_Q3

Standard Error:

SE_Q3 <- sd(Houses_Bootstrap_Results$Bootstrap_Q3)
SE_Q3
## [1] 40.66194

The bootstrap distribution is symmetric and bell-shaped, so we can use the standard error method to calculate a 95% confidence interval.

95% Confidence Interval:

c(Sample_Q3 - 2*SE_Q3, Sample_Q3 + 2*SE_Q3) 
## waterfrontYes waterfrontYes 
##      332.8819      495.5297

The 95% confidence interval is shown by the gold bar on the graph of the bootstrap distribution below.

Houses_Bootstrap_Plot_Q3 + 
  geom_segment(aes(x=Sample_Q3 - 2*SE_Q3,xend=Sample_Q3 + 2*SE_Q3, y=50, yend=50), 
               color="gold", size=10, alpha=0.01) 

We are 95% confident that the average price among all 1,800 square foot waterfront houses in Seattle is between 332.88 and 495.53 thousand dollars more than the average price among all non-waterfront houses of the same size.

Bootstrap Distribution for \(100b_3\)

Houses_Bootstrap_Plot_Q4 <- ggplot(data=Houses_Bootstrap_Results, 
                                   aes(x=Bootstrap_Q4)) +  
  geom_histogram(color="white", fill="lightblue") + 
  xlab("Expected Difference per 100 square feet") + ylab("Frequency") +
  ggtitle( "Bootstrap Distribution 100b3") 
Houses_Bootstrap_Plot_Q4

Standard Error:

SE_Q4 <- sd(Houses_Bootstrap_Results$Bootstrap_Q4)
SE_Q4
## [1] 2.224772

The bootstrap distribution is symmetric and bell-shaped, so we can use the standard error method to calculate a 95% confidence interval.

95% Confidence Interval:

c(Sample_Q4 - 2*SE_Q4, Sample_Q4 + 2*SE_Q4) 
## sqft_living:waterfrontYes sqft_living:waterfrontYes 
##                  38.81716                  47.71625

The 95% confidence interval is shown by the gold bar on the graph of the bootstrap distribution below.

Houses_Bootstrap_Plot_Q4 + 
  geom_segment(aes(x=Sample_Q4 - 2*SE_Q4,xend=Sample_Q4 + 2*SE_Q4, y=50, yend=50), 
               color="gold", size=10, alpha=0.01) 

We are 95% confident that for each 100 square foot increase in size, the average price among all waterfront houses increases by between 38.82 and 47.72 thousand dollars more than the increase in average price among all non-waterfront houses.

4.3.9 Bootstrapping Cautions

While bootstrapping is a popular and robust procedure for calculating confidence intervals, it does have limitations, and we should be sure to use the bootstrap procedure appropriate for our context. A standard-error bootstrap interval is appropriate when the sampling distribution for our statistic is roughly symmetric and bell-shaped. When this is not true, a percentile bootstrap interval can be used, as long as there are no gaps or breaks in the bootstrap distribution. When there are gaps or breaks in the bootstrap distribution, it may not be a reasonable approximation of the sampling distribution we are interested in.
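
A percentile interval is computed by taking the middle 95% of the bootstrap statistics, rather than adding and subtracting a multiple of the standard error. A minimal sketch, applied to the bootstrap differences in means between Southern and Northern lakes that we saved earlier:

quantile(NS_Lakes_Bootstrap_Results$Bootstrap_b1, c(0.025, 0.975)) # middle 95% of bootstrap b1 values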

4.4 Estimating Standard Error

4.4.1 Standard Error vs Standard Deviation

Recall that standard error is the standard deviation of the distribution of a statistic (sample mean, proportion, regression coefficient, etc.). It describes the amount of variability in this statistic between samples of a given size.

This is different from the sample standard deviation, which pertains to the amount of variability between individual observations in the sample.

For example, the histogram displays the distribution of mercury levels in our sample of 53 lakes in Florida.

Lakes_Hist

Standard Deviation of Mercury Levels Between Lakes:

sd(FloridaLakes$Mercury)
## [1] 0.3410356

The standard deviation in mercury levels between individual lakes is 0.341 ppm. This describes the amount of variability in mercury levels between individual lakes.

Bootstrap Distribution for Mean Mercury Level (\(n=53\))

Lakes_Bootstrap_Mean_Plot + xlim(c(0,1.5))

Standard Error for Mean:

SE <- sd(Lakes_Bootstrap_Results_Mean$Bootstrap_Mean); SE
## [1] 0.04655744

The standard deviation of the distribution of mean mercury levels is approximately 0.047 ppm. This standard error describes the amount of variability in mean mercury levels between different samples of 53 lakes.

There is less variability in the average among samples of size 53 than there is in mercury levels between individual lakes. Although mercury levels might vary quite a bit between individual lakes, in a sample of size 53, the higher and lower levels tend to average out to something in the middle, resulting in less variability associated with the average than with individual lakes.

4.4.2 Sample Size and Standard Error

Question:

Suppose the sample consisted of 10 lakes, or 30 lakes, or 100 lakes, instead of 53, and that the distribution of the lakes in the sample was otherwise similar to that of the original 53. Would you expect the standard deviation in mercury levels between individual lakes to increase, decrease, or stay about the same? What about the standard error of the mean mercury level?

The table shows the standard deviation in each of the samples.

Sample_Size   SD
10            0.3831797
30            0.3198103
53            0.3410356
100           0.3352758

Sample size does not impact the amount of variability between individual lakes. Standard deviation in mercury levels between individual lakes does not systematically increase or decrease based on sample size (of course it varies a little based on the lakes randomly chosen in the sample).

Now, we’ll examine what happens to the standard error of the mean as the sample size changes.

Distributions of the Sample Mean for Different Sample Sizes

The table shows the standard error of the mean for samples of different size:

Sample_Size   SE
10            0.1166601
30            0.0579587
53            0.0465574
100           0.0331806

As sample size increases, variability between means of different samples decreases; that is, the standard error of the mean decreases. The same is true of standard errors for other statistics (e.g. differences in means, regression slopes, etc.).
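
The standard errors in this table were approximated by simulation. A minimal sketch of one way to approximate them, resampling \(n\) lakes with replacement from the original sample (results will vary slightly from run to run):

for (n in c(10, 30, 53, 100)){
Bootstrap_Mean <- rep(NA, 10000)  # vector to store means
for (i in 1:10000){
Bootstrap_Mean[i] <- mean(sample(FloridaLakes$Mercury, n, replace=TRUE)) # mean of a resample of n lakes
}
print(paste("n =", n, ": SE approximately", round(sd(Bootstrap_Mean), 4)))
}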

4.4.3 Standard Error Formulas

So far, we’ve used simulation (permutation tests and bootstrap intervals) to determine the amount of variability associated with a test statistic or estimate, in order to perform hypothesis tests and create confidence intervals. In special situations, there are mathematical formulas, based on probability theory, that can be used to approximate the amount of variability without having to perform the simulations.

These approximations only exist for certain kinds of statistics, and they are only valid when the sampling distribution for the statistic is symmetric and bell-shaped. Thus, they cannot be used in all situations, and we should be careful to check whether they are appropriate before applying them. Nevertheless, when appropriate, they can allow us to bypass the computation required in a simulation.

Theory-Based Standard Error Formulas

Scenario                                Standard Error
Single Mean                             \(SE(\bar{x})=\frac{s}{\sqrt{n}}\)
Difference in Means                     \(SE(\bar{x}_1-\bar{x}_2)=s\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}\)
Single Proportion                       \(SE(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\)
Difference in Proportions               \(SE(\hat{p}_1-\hat{p}_2) = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}\)
Intercept in Simple Linear Regression   \(SE(b_0)=s\sqrt{\frac{1}{n}+\frac{\bar{x}^2}{\sum(x_i-\bar{x})^2}}\)
Slope in Simple Linear Regression       \(SE(b_1)=\sqrt{\frac{s^2}{\sum(x_i-\bar{x})^2}}=\sqrt{\frac{1}{n-2}\frac{{\sum(\hat{y}_i-y_i)^2}}{\sum(x_i-\bar{x})^2}}\)
  • \(s=\sqrt{\frac{\displaystyle\sum_{i=1}^n(y_i-\hat{y}_i)^2}{n-(p+1)}}\), where \(p\) is the number of regression coefficients not including \(b_0\), is the residual standard error. Note that in the one-sample case, this simplifies to the sample standard deviation formula we’ve seen previously.

  • In the second formula, the standard error estimate \(s\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}\) is called a “pooled” estimate, since it combines information from all groups. When there is reason to believe the standard deviation differs between groups, we often use an “unpooled” standard error estimate, \(\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}\), where \(s_1\) and \(s_2\) represent the standard deviations for groups 1 and 2.

There is no theory-based formula for the standard error associated with the median or the standard deviation. For these, and many other statistics, we rely on simulation to estimate variability between samples, as in the sketch below.
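
For example, a bootstrap standard error for the median mercury level among the 53 Florida lakes can be obtained using the same steps we used for the mean (a minimal sketch):

Bootstrap_Med <- rep(NA, 10000)  # vector to store medians
for (i in 1:10000){
BootstrapSample <- sample_n(FloridaLakes, 53, replace=TRUE)   # take bootstrap sample
Bootstrap_Med[i] <- median(BootstrapSample$Mercury) # record median
}
sd(Bootstrap_Med) # bootstrap standard error of the median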

There are formulas for standard errors associated with coefficients in multiple regression, but these require mathematics beyond what is assumed in this class, involving linear algebra and matrix inversion, which you can read about if you are interested.
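
Fortunately, R carries out these calculations for us: the theory-based standard errors are the square roots of the diagonal entries of the fitted model’s variance-covariance matrix, for simple and multiple regression alike. A brief sketch, using the simple regression of mercury on pH:

M_pH <- lm(data=FloridaLakes, Mercury~pH) # fit model
sqrt(diag(vcov(M_pH))) # theory-based standard errors for b0 and b1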

4.4.4 One-Sample Mean Example

We use the first formula to calculate the standard error of the mean mercury concentration associated with samples of 53 lakes.

\(SE(\bar{x})=\frac{s}{\sqrt{n}} = \frac{0.341}{\sqrt{53}}=0.04684\)

The calculation in R is shown below.

sd(FloridaLakes$Mercury)/sqrt(53)
## [1] 0.04684485

The standard error column in the R summary output is calculated using the theory-based formulas.

M <- lm(data=FloridaLakes, Mercury~1)
summary(M)
## 
## Call:
## lm(formula = Mercury ~ 1, data = FloridaLakes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48717 -0.25717 -0.04717  0.24283  0.80283 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  0.52717    0.04684   11.25 0.00000000000000151 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.341 on 52 degrees of freedom

Let’s compare this to the standard error we obtained using 10,000 bootstrap simulations.

Lakes_Bootstrap_Mean_Plot

Bootstrap standard error:

SE <- sd(Lakes_Bootstrap_Results_Mean$Bootstrap_Mean); SE
## [1] 0.04655744

The theory-based formula gives a standard error estimate very close to the one we obtained via bootstrap simulation.

4.4.5 Difference in Means Example

We use the theory-based formula to calculate the standard error for the difference in mean mercury levels between the 33 lakes in North Florida and the 20 lakes in South Florida.

\[ SE(\bar{x}_1-\bar{x}_2)=s\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}, \]

Lakes_M <- lm(data=FloridaLakes, Mercury~Location) # fit model to original sample
s <- sqrt(sum(Lakes_M$residuals^2)/(53-2)) # calculate s, the pooled standard deviation
SE <- s*sqrt(1/20+1/33); SE
## [1] 0.08984774

R model summary output:

Lakes_M <- lm(data=FloridaLakes, Mercury~Location)
summary(Lakes_M)
## 
## Call:
## lm(formula = Mercury ~ Location, data = FloridaLakes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65650 -0.23455 -0.08455  0.24350  0.67545 
## 
## Coefficients:
##             Estimate Std. Error t value       Pr(>|t|)    
## (Intercept)  0.42455    0.05519   7.692 0.000000000441 ***
## LocationS    0.27195    0.08985   3.027        0.00387 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3171 on 51 degrees of freedom
## Multiple R-squared:  0.1523, Adjusted R-squared:  0.1357 
## F-statistic: 9.162 on 1 and 51 DF,  p-value: 0.003868

Comparison to Bootstrap standard error:

NS_Lakes_Bootstrap_Plot_b1 <- ggplot(data=NS_Lakes_Bootstrap_Results, aes(x=Bootstrap_b1)) +  
  geom_histogram(color="white", fill="lightblue") + 
  xlab("b1 in Bootstrap Sample") + ylab("Frequency") +
  ggtitle("Northern vs Southern Lakes: Bootstrap Distribution for b1") 
NS_Lakes_Bootstrap_Plot_b1 

sd(NS_Lakes_Bootstrap_Results$Bootstrap_b1)
## [1] 0.09620419

The mathematical standard error approximation is close to the one observed in our simulation, though not exact.

4.4.6 Regression Example

We use the theory-based approach to calculate the standard error associated with the slope of the regression line relating mercury level and pH in Florida lakes.

\(SE(b_1)=\sqrt{\frac{s^2}{\sum(x_i-\bar{x})^2}}=\sqrt{\frac{1}{n-2}\frac{{\sum(\hat{y}_i-y_i)^2}}{\sum(x_i-\bar{x})^2}}\)

M <- lm(data=FloridaLakes, Mercury~pH) # fit model
s2 <- sum(M$residuals^2)/(53-2) #calculate s^2
x <- FloridaLakes$pH
SSx <- sum((x-mean(x))^2) #calculate denominator
SE <- sqrt(s2/SSx) # calculate SE
SE
## [1] 0.03031326

R model summary output:

Lakes_M_pH <- lm(data=FloridaLakes, Mercury~pH)
summary(Lakes_M_pH)
## 
## Call:
## lm(formula = Mercury ~ pH, data = FloridaLakes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48895 -0.19188 -0.05774  0.09456  0.71134 
## 
## Coefficients:
##             Estimate Std. Error t value       Pr(>|t|)    
## (Intercept)  1.53092    0.20349   7.523 0.000000000814 ***
## pH          -0.15230    0.03031  -5.024 0.000006572811 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2816 on 51 degrees of freedom
## Multiple R-squared:  0.3311, Adjusted R-squared:  0.318 
## F-statistic: 25.24 on 1 and 51 DF,  p-value: 0.000006573

Comparison to bootstrap standard error:

Lakes_Bootstrap_Plot_Slope

SE_b1 <- sd(Lakes_Bootstrap_Slope_Results$Bootstrap_b1)
SE_b1
## [1] 0.02676588

The mathematical standard error approximation is again close to the one observed in our simulation.

4.4.7 Theory-Based Confidence Intervals

If the sampling distribution for a statistic is symmetric and bell-shaped, we can obtain an approximate 95% confidence interval using the formula:

\[ \text{Statistic} \pm 2\times{\text{Standard Error}}, \]

where the standard error is calculated by formula, rather than via bootstrap simulations.
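
This formula can be applied by hand to the estimate and standard error in a fitted model’s coefficient table. A minimal sketch, using the LocationS coefficient of the N vs S lakes model fit earlier:

coefs <- summary(Lakes_M)$coefficients # table of estimates and standard errors
coefs["LocationS", "Estimate"] + c(-2, 2)*coefs["LocationS", "Std. Error"] # statistic +/- 2*SE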

The confint command in R provides confidence intervals associated with each of the model parameters, calculated using the theory-based standard errors.

Confidence Intervals for N vs S Florida Lakes

confint(Lakes_M)
##                  2.5 %    97.5 %
## (Intercept) 0.31374083 0.5353501
## LocationS   0.09157768 0.4523314

Intercept Interval:

\(\approx 0.42455 \pm 2\times0.05519\)

We are 95% confident that the average mercury level among all lakes in North Florida is between 0.313 and 0.535 ppm.

LocationS Interval:

\(\approx 0.27195 \pm 2\times 0.08985\)

We are 95% confident that the average mercury level among all lakes in South Florida is between 0.09 and 0.45 ppm higher than the average mercury concentration among all lakes in North Florida.

Note that the use of 2 in the confidence interval formula is an approximation, though one that is reasonable and widely used. Rather than using 2, R uses a value very close to 2, which varies depending on the sample size. Thus R’s calculations will not exactly match the formulas shown, but are similar enough for practical purposes. Remember that in statistics, everything is an approximation.
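
Specifically, the multiplier R uses is a quantile of a t-distribution with the model’s residual degrees of freedom (51 for the lakes models above), which we can compute directly:

qt(0.975, df=51) # multiplier R uses in place of 2; approximately 2.008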

Confidence Intervals for pH and Mercury in Florida Lakes

confint(Lakes_M_pH)
##                  2.5 %      97.5 %
## (Intercept)  1.1223897  1.93944769
## pH          -0.2131573 -0.09144445

Intercept Interval:

\(\approx 1.53092 \pm 2\times 0.20349\)

We are 95% confident that the average mercury concentration among all lakes with pH level 0 is between 1.12 and 1.94 ppm. This is not a meaningful interpretation, since none of the lakes in our sample have a pH level anywhere close to 0.

pH Interval:

\(\approx -0.15230 \pm 2\times 0.03031\)

We are 95% confident that for each one-unit increase in pH, mercury level is expected to decrease by between 0.09 and 0.21 ppm, on average.

4.4.8 CI Method Comparison

We’ve now seen 3 different ways to obtain confidence intervals based on statistics calculated from data.

The table below tells us what must be true of the sampling distribution for a statistic in order to use each technique.

Technique                  No Gaps   Bell-Shaped   Known Formula for SE
Bootstrap Percentile       x
Bootstrap Standard Error   x         x
Theory-Based               x         x             x