Chapter 4 Bootstrap Interval Estimation
Learning Outcomes:
- Explain the impact of sampling variability on interval estimation.
- Define sampling distribution and standard error.
- Interpret confidence intervals, or explain why it is inappropriate to do so.
- Explain the purpose of bootstrapping.
- Explain how to obtain a bootstrap distribution for a statistic (such as a sample mean, median, standard deviation, proportion, difference in means or proportions, or regression coefficient).
- Determine when it is appropriate to use standard error and percentile confidence intervals.
- Explain whether or not the results of a confidence interval are consistent with the conclusion of a hypothesis test.
- Interpret confidence intervals, or explain why it is inappropriate to do so.
- Define standard error of a statistic and interpret it in context.
- Explain how sample size and level of confidence impact the width of a confidence interval.
- Explain how sample size impacts variability in individual observations, and the sampling distribution for a test statistic.
- Define standard error of a statistic and interpret it in context.
- Explain when it is appropriate to use “theory-based” standard error formulas.
4.1 Sampling Distributions
4.1.1 Sampling From a Population
In statistics, we often do not have the time, money, or means to collect data on all individuals or units on which we want to draw conclusions. Instead, we might collect data on only a subset of the individuals, and then make inferences about all individuals we are interested in, using the information we collected.
Vocabulary:
- A population is the entire set of individuals that we want to draw conclusions about.
- A sample is a subset of a population
- A parameter is a numerical quantity pertaining to an entire population
- A statistic is a numerical quantity calculated from a sample
We’ll work with a dataset containing information on all 20,591 flights from New York to Chicago in 2013. Our population of interest is all 20,591 flights.
In this situation, we have information on the entire population, but suppose temporarily that we didn’t. Instead, suppose we had only information on a random sample of 75 flights.
The parameter of interest is the proportion of on-time arrivals out of all flights in the population of 20,591. When the parameter is proportion, we’ll denote it with the letter \(p\).
We take a sample of 75 flights. The first 6 flights in the sample are shown below. The ontime
variable tells whether or not the flight arrived on time.
set.seed(08082023)
<- sample_n(Flights_NY_CHI, 75)
S1 head(S1)
## # A tibble: 6 × 9
## year month day carrier origin dest sched_dep_time arr_delay ontime
## <int> <int> <int> <chr> <chr> <chr> <int> <dbl> <chr>
## 1 2013 3 8 AA LGA ORD 1720 106 N
## 2 2013 12 15 AA JFK ORD 1715 -24 Y
## 3 2013 10 22 UA EWR ORD 1300 -12 Y
## 4 2013 8 26 UA EWR ORD 2110 15 N
## 5 2013 7 23 AA LGA ORD 1359 66 N
## 6 2013 5 13 UA EWR ORD 900 -21 Y
We’ll calculate the number, and proportion of flights that arrived on time.
<- sum(S1$ontime == "Y") # count number of on-time arrivals num_ontime
Number of on-time arrivals in the sample.
num_ontime
## [1] 39
Proportion of on-time arrivals in the sample.
<- num_ontime/75
p_hat p_hat
## [1] 0.52
In out sample 52 percent of flights arrived on-time. We’ll denote this quantity \(\hat{p}\), keeping with our convention of using the \(\hat{}\) symbol to represent a quantity that is calculated from data (like the predictions and estimates we saw in the previous chapter). The sample statistic \(\hat{p}\) is an estimate of the population proportion \(p\).
Of course, this was just one sample of 75 flights. We should not expect the proportion of on-time flights in our sample (\(\hat{p}\)) to exactly match the proportion of on-time flights in the entire population (\(p\)). Nevertheless we can use the sample to estimate the proportion of all flights in the population that arrive on time. Perhaps, we could say that we would expect between 42 and 62 percent of all 2013 flights from New York to arrive on time. Or, perhaps we could be more precise and estimate that the percentage of on-time flights to be between 47 and 57. We’ll need to figure out how precise we can make our range, while still being confident that it does, in fact, contain the true population parameter.
To summarize:
- The population is all 20,591 flights from New York to Chicago in 2013.
- The sample is the flights that we randomly selected.
- The parameter \(p\) is the proportion of on-time arrivals among all 2013 flights from New York to Chicago, which we do not know (though in this particular example, we could find it, since we have data on all flights in the population.)
- The sample statistic is the proportion of flights in our sample that arrived on time, which we know to be \(\hat{p}\) = 0.52.
Now, let’s take a different sample of 75 flights and see how the proportion of on-time arrivals compares.
<- sample_n(Flights_NY_CHI, 75)
S2 <- sum(S2$ontime == "Y") # count number of on-time arrivals
num_ontime2 <- num_ontime2/75
p_hat2 p_hat2
## [1] 0.5066667
By studying the behavior of the proportion of on-time arrivals in different samples we can gauge how close the proportion in a given sample is likely be to the unknown population parameter. If all of our samples produce very similar estimates, then it is likely that the population parameter is close to these estimates. If the sample proportion varies considerably from sample to sample, then it is possible that the proportion in any given sample might be very different than the population parameter.
Let’s take 10,000 more random samples of 75 flights and record the proportion of on-time arrivals in each sample.
<- 10000 # number of repetitions
nreps <- rep(NA, nreps) # create vector to hold proportion of on-time arrivals
p_hat_val <- 1:nreps
Sample
for(i in 1:nreps){
<- sample_n(Flights_NY_CHI, 75) # take sample of 75
S <- sum(S$ontime == "Y") # count number of on-time arrivals
N_ontime <- N_ontime/75 # record proportion on-time
p_hat_val[i]
}
<- data.frame(Sample, p_hat_val) # store results in a data frame Samples_df
The table shows the proportion of on-time arrivals in the first 20 samples of 75 flights.
kable(head(Samples_df, 20))
Sample | p_hat_val |
---|---|
1 | 0.6533333 |
2 | 0.5333333 |
3 | 0.5866667 |
4 | 0.6933333 |
5 | 0.5866667 |
6 | 0.6666667 |
7 | 0.6133333 |
8 | 0.5866667 |
9 | 0.6533333 |
10 | 0.6533333 |
11 | 0.6133333 |
12 | 0.5866667 |
13 | 0.6133333 |
14 | 0.4800000 |
15 | 0.6000000 |
16 | 0.4400000 |
17 | 0.4933333 |
18 | 0.5733333 |
19 | 0.6400000 |
20 | 0.5600000 |
The histogram below shows the distribution of the proportion of on-time arrivals in the 10,000 different samples.
<- ggplot(data=Samples_df, aes(x=p_hat_val)) +
Prop_Samp_Distgeom_histogram(color="white", fill="blue") +
ggtitle("Sampling Distribution for Proportion On Time") +
xlab("Prop. on time in sample")
Prop_Samp_Dist
We notice that most of our 10,000 samples yielded proportions of on-time arrivals between 0.5 and 0.7, The distribution of proportion of on-time arrivals is roughly symmetric and bell-shaped.
The distribution shown in this histogram is called the sampling distribution for \(\hat{p}\). We can gauge how much the proportion of on-time arrivals varies between samples by calculating the standard deviation of this sampling distribution. The standard deviation of a sampling distribution for a statistic is also called the standard error of the statistic. In this case it represents the standard error \(\hat{p}\) (the proportion of on-time arrivals), and is denoted \(\text{SE}(\hat{p})\). This standard error is shown below.
<- sd(Samples_df$p_hat_val)
SE_p_hat SE_p_hat
## [1] 0.05659102
Vocabulary:
- The sampling distribution of a statistic is the distribution of values the statistic takes on across many different samples of a given size.
- The standard error of a statistic is the standard deviation of that statistic’s sampling distribution. It measures how much the statistic varies between different samples of a given size.
In this rare situation, we actually have data on all 20,591 flights from New York to Chicago in 2013 (our entire population), Let’s calculate the true value of the population parameter \(p\), the proportion of flights that arrived on-time in our actual population.
<- sum(Flights_NY_CHI$ontime == "Y")/20591
p p
## [1] 0.6079841
In fact, just over 60% of all flights in the population arrived on time.
The sampling distribution for the proportion of on-time flights is shown again below. The true proportion of on-time flights is marked by the green dotted line. The gold bar at the bottom of the histogram represents the range of sample proportions that lie within \(\pm 2\) standard errors of the true population proportion of flights that arrived on time.
0.6079841 - 2(0.056591) to 0.6079841 + 2(0.056591)
+ geom_vline(xintercept=p, color="green", linetype="dotted", linewidth=2) + geom_segment(aes(x=p - 2*SE_p_hat,xend=p + 2*SE_p_hat, y=50, yend=50), color="gold", size=10, alpha=0.01) Prop_Samp_Dist
We calculate the proportion of samples whose proportion of on-time arrivals lies within \(\pm 2\) standard errors of the true proportion.
<- p - 2*SE_p_hat
Lower <- p + 2*SE_p_hat
Upper sum((Samples_df$p_hat_val >=Lower) & (Samples_df$p_hat_val <= Upper))
## [1] 9539
Approximately 95% 10,000 samples produced proportions within \(\pm 2\) standard errors of the true population proportion of on-time flights.
4.1.2 Confidence Intervals
In a real situation, we won’t have access to the entire population of flights, only the flights in a single sample. For example, recall our original sample of 75 flights, in which we observed a proportion of on-time arrivals of \(\hat{p}=\) 0.52.
Since we now know that 95% of all samples produce proportions that lie within two standard errors of the population proportion, we can obtain an estimate of the population proportion \(p\) by adding and subtracting \(2\times \text{SE}(\hat{p})\) from our observed sample proportion \(\hat{p}\). Such an interval is called an approximate 95% confidence interval for the true population proportion \(p\).
Approximate 95% Confidence Interval for \(\hat{p}\)
\[ \hat{p} \pm 2\times \text{SE}(\hat{p}) \]
The confidence interval, based on our original sample, is calculated below.
c(p_hat - 2*SE_p_hat, p_hat + 2*SE_p_hat)
## [1] 0.406818 0.633182
Based on our sample of 75 flights, we can be 95% confident that the true proportion of on-time arrivals among all 2013 flights from New York to Chicago is between 0.406818 and p_hat + 2*SE_p_hat
.
In fact, knowing what we do about the true value of the population parameter \(p\), we can see that our confidence interval does indeed contain this value. Of course, in a real situation, we won’t know the true value of the population parameter, so we won’t know for sure whether or not our confidence interval contains this true parameter value.
A pertinent question at this stage would be ``What does 95% confidence mean?“. To answer that, let’s explore what happens when we calculate confidence intervals based on estimates many different samples. For each of our 10,000 different samples taken from our population, we’ll add and subtract two standard errors from the sample proportion \(\hat{p}\) corresponding to that sample.
The table below displays the value of \(\hat{p}\), for the first 20 samples we took, along with the lower and upper bounds of the confidence interval, and whether or not the confidence interval contains the true parameter value \(p\) (either TRUE
or FALSE
).
<- Samples_df %>% mutate( Lower = p_hat_val - 2*SE_p_hat,
Samples_df Upper = p_hat_val + 2*SE_p_hat,
Contains = p >= Lower & p <= Upper)
kable(head(Samples_df, 20))
Sample | p_hat_val | Lower | Upper | Contains |
---|---|---|---|---|
1 | 0.6533333 | 0.5401513 | 0.7665154 | TRUE |
2 | 0.5333333 | 0.4201513 | 0.6465154 | TRUE |
3 | 0.5866667 | 0.4734846 | 0.6998487 | TRUE |
4 | 0.6933333 | 0.5801513 | 0.8065154 | TRUE |
5 | 0.5866667 | 0.4734846 | 0.6998487 | TRUE |
6 | 0.6666667 | 0.5534846 | 0.7798487 | TRUE |
7 | 0.6133333 | 0.5001513 | 0.7265154 | TRUE |
8 | 0.5866667 | 0.4734846 | 0.6998487 | TRUE |
9 | 0.6533333 | 0.5401513 | 0.7665154 | TRUE |
10 | 0.6533333 | 0.5401513 | 0.7665154 | TRUE |
11 | 0.6133333 | 0.5001513 | 0.7265154 | TRUE |
12 | 0.5866667 | 0.4734846 | 0.6998487 | TRUE |
13 | 0.6133333 | 0.5001513 | 0.7265154 | TRUE |
14 | 0.4800000 | 0.3668180 | 0.5931820 | FALSE |
15 | 0.6000000 | 0.4868180 | 0.7131820 | TRUE |
16 | 0.4400000 | 0.3268180 | 0.5531820 | FALSE |
17 | 0.4933333 | 0.3801513 | 0.6065154 | FALSE |
18 | 0.5733333 | 0.4601513 | 0.6865154 | TRUE |
19 | 0.6400000 | 0.5268180 | 0.7531820 | TRUE |
20 | 0.5600000 | 0.4468180 | 0.6731820 | TRUE |
The graphic below visualizes the confidence intervals produced using the estimates from the first 100 samples. The green dotted line indicates the true value of \(p\). The black dots indicate the value of \(\hat{p}\) for each sample. Intervals that do in fact contain the true value of \(p\) are shown in blue, and intervals that do not contain the true value of \(p\) are shown in green.
ggplot(data=Samples_df[1:100,], aes(y=Sample, x=p_hat_val)) +
geom_point() +
geom_errorbar(aes(xmin = Lower, xmax = Upper, color=Contains)) +
xlab("Confidence Interval") +
ylab("Sample") +
geom_vline(xintercept = p, color="green", linetype="dotted", size=2) + theme_bw()
Out of these 100 samples, 93 contain the true value of the population parameter \(p\). This is close to the 95% confidence level.
The picture shows confidence intervals produced by the first 100 samples, but we actually took 10,000 different samples of 75 flights. Let’s calculate how many of these samples produced confidence intervals that contain the true value of \(p\).
sum(Samples_df$Contains == TRUE)
## [1] 9539
Again, notice that close to 95% of the samples produced confidence intervals contain the true population parameter \(p\). Note that for the red intervals that do not contain \(p\) nothing was done incorrectly. The sample was taken at random, and the confidence interval was calculated using the correct formula. It just happened that by chance, we obtained a sample proportion \(\hat{p}\) that was unusually high or low, leading to an interval that did not capture the true population parameter. This, of course, happens rarely, and approximately 95% of the samples do, in fact, result in intervals that contain the true value of \(p\).
This brings us back to the question “what does 95% confidence mean?”. An approximate 95% confidence interval means that if we take a large number of samples and calculate confidence intervals from each of them, then approximately 95% of the samples will produce intervals containing the true population parameter. In reality, we’ll only have on sample, and won’t know whether or not our interval contains the true parameter value. Assuming we have taken the sample and calculated the interval correctly, we can rest assured in the knowledge that that 95% of all intervals taken would contain the true parameter value, and hope that ours is among that 95%.
We calculated the confidence interval by taking our sample statistic \(\hat{p}\) plus/minus two standard errors. Confidence intervals that are calculated by adding and subtracting a certain number of standard errors from the sample statistic are called standard error confidence intervals. This approach will work as long as the sampling distribution is symmetric and bell-shaped. Probability theory tells us that in a symmetric and bell-shaped distribution, approximately 95% of the area lies within two standard errors of the center of the distribution, given by the true parameter value. We will, however, see that this approach will not work in all cases. Not all statistics produce sampling distributions that are symmetric and bell-shaped, and we will need an alternative way to calculate confidence intervals in these situations.
::include_graphics("Emp_Rule.png") knitr
If we want to use a level of confidence that is different than 95%, we can adjust the value we multiply the standard error by. In general, a standard error confidence interval has the form
\[ \text{Statistic } \pm m\times \text{Standard Error}, \]
where the value of \(m\) depends on the desired level of confidence.
Of course, you might ask why we needed to calculate a confidence interval for the proportion of on-time flights in the first place, since we actually have data on all 20,591 flights in the population and already know the true proportion of on-time arrivals to be 0.608. The answer is that we don’t. But, in most real situations, we will only have data from a single sample, not the entire population, and we won’t know the true population parameter. We’ll be able to build on the ideas of sampling distributions and standard error that we learned about in this section to calculate confidence intervals in those scenarios.
4.2 Bootstrapping
4.2.1 Mercury Levels in Florida Lakes
A 2004 study by Lange, T., Royals, H. and Connor, L. examined Mercury accumulation in large-mouth bass, taken from a sample of 53 Florida Lakes. If Mercury accumulation exceeds 0.5 ppm, then there are environmental concerns. In fact, the legal safety limit in Canada is 0.5 ppm, although it is 1 ppm in the United States.
Recall that our data come from a sample of 53 lakes, out of more then 30,000 in the entire state of Florida.
The mercury levels of the first 10 lakes in the sample are shown in the table below.
data("FloridaLakes")
<- FloridaLakes %>% rename(Mercury = AvgMercury) FloridaLakes
kable(head(FloridaLakes %>% select(ID, Lake, Mercury), 10))
ID | Lake | Mercury |
---|---|---|
1 | Alligator | 1.23 |
2 | Annie | 1.33 |
3 | Apopka | 0.04 |
4 | Blue Cypress | 0.44 |
5 | Brick | 1.20 |
6 | Bryant | 0.27 |
7 | Cherry | 0.48 |
8 | Crescent | 0.19 |
9 | Deer Point | 0.83 |
10 | Dias | 0.81 |
The histogram shows the distribution of mercury levels in the 53 lakes in the sample. Lakes exceeding the US standard of 1 ppm are shown in red.
<- ggplot(data=FloridaLakes, aes(x=Mercury)) +
Lakes_Hist geom_histogram(aes(fill=Mercury<=1), color="white", binwidth = 0.1) +
ggtitle("Mercury Levels in Sample of 53 Florida Lakes") +
xlab("Mercury Level") + ylab("Frequency") + theme_bw()
Lakes_Hist
The proportion of lakes with mercury levels exceeding 1 ppm is calculated below.
<- sum(FloridaLakes$Mercury > 1)/53
p_hat p_hat
## [1] 0.1132075
We see that in our sample of 53 lakes, approximately 11% have mercury levels exceeding the US standard of 1 ppm. Suppose we want to estimate the proportion of all Florida Lakes whose mercury level exceeds this standard. As we saw in the previous section, we would not expect the population proportion to exactly match the sample, due to random variability between samples. We can use the sample proportion as an estimate (\(\hat{p} = 0.1132\)), and construct a confidence interval for the unknown population proportion \(p\).
In order to construct the confidence interval, we need to know how much the sample proportion of lakes exceeding 1 ppm \(\hat{p}\) could vary between different samples of size 53. That is, we need to know the standard error of \(\hat{p}\). In the previous section, we calculated the standard error by taking 10,000 different samples of the same size as ours from the population, calculating the proportion for each sample, and then calculating the standard deviation of the proportions obtained from these 10,000 different samples. This procedure will not work here, however, because unlike the previous example where we really did have data on the entire population of all flights from New York to Chicago, we do not have data on all 30,000+ lakes in Florida. We cannot take a lot of different samples of size 53 from the population of all lakes, and thus, cannot obtain the sampling distribution for the the proportion of lakes exceeding 1 ppm, or estimate the standard error of \(\hat{p}\).
4.2.2 Bootstrap Sampling
All we have is a single sample of 53 lakes. We need to figure out how much the proportion of lakes with mercury levels exceeding 1 ppm would vary between different samples of size 53, using only the information contained in our one sample.
To do this, we’ll implement a popular simulation-based strategy, known as bootstrapping.
Let’s assume our sample is representative of all Florida lakes. Then, we’ll duplicate the sample many times to create a large set that will look like the population of all Florida Lakes. We can then draw samples of 53 from that large population, and record the mean mercury level for each sample of 53.
An illustration of the bootstrapping procedure is shown below, using a sample of 12 colored dots, instead of the 53 lakes.
In fact, duplicating the sample many times and selecting new samples of size \(n\) has the same effect as drawing samples of size \(n\) from the original sample, by putting the item drawn back in each time, a procedure called sampling with replacement. Thus, we can skip the step of copying/pasting the sample many times, and instead draw our samples with replacement.
This means that in each new sample, some lakes will be drawn multiple times and others not at all. It also ensures that each sample is different, allowing us to estimate variability in the sample mean between the different samples of size 53.
An illustration of the concept of bootstrapping, using sampling with replacement is shown below.
The variability in sample means in our newly drawn samples is used to approximate the variability in proportion \(\hat{p}\) that would occur between different samples of 53 lakes, drawn from the population of all Florida Lakes.
The point of bootstrapping is to observe how much a statistic (in this case the proportion of lakes with Mercury levels exceeding 1 ppm) varies between bootstrap samples. This can act as an estimate of how much that statistic would vary between different samples of size \(n\), drawn from the population.
The steps of bootstrap sampling can be summarized in the following algorithm.
Bootstrap Algorithm
For an original sample of size \(n\):
Take a sample size \(n\) by randomly sampling from the original, with replacement. Thus, some observations will show up multiple times, and others not at all. This sample is called a bootstrap sample.
Calculate the statistic of interest in the bootstrap sample (in this case \(\hat{p}\), the proportion of lakes whose mercury levels exceed 1 ppm).
Repeat steps 1 and 2 many (say 10,000) times, keeping track of the statistic of interest that is calculated in each bootstrap sample.
Look at the distribution of the statistic across bootstrap samples. The variability in this bootstrap distribution can be used to approximate the variability in the sampling distribution for the statistic of interest.
4.2.3 Bootstrap Samples of Lakes
The sample_n()
function samples the specified number rows from a data frame, with or without replacement.
The lakes in the first sample are shown below. Notice that some lakes occur multiple times, and others not at all.
Bootstrap Sample 1
<- sample_n(FloridaLakes, 53, replace=TRUE) %>% arrange(Lake)
BootstrapSample1 %>% select(ID, Lake, Mercury) BootstrapSample1
## # A tibble: 53 × 3
## ID Lake Mercury
## <int> <chr> <dbl>
## 1 1 Alligator 1.23
## 2 2 Annie 1.33
## 3 3 Apopka 0.04
## 4 5 Brick 1.2
## 5 5 Brick 1.2
## 6 6 Bryant 0.27
## 7 7 Cherry 0.48
## 8 9 Deer Point 0.83
## 9 9 Deer Point 0.83
## 10 12 Down 0.5
## # ℹ 43 more rows
We calculate the proportion of lakes with mercury levels exceeding 1 ppm in this bootstrap sample. Note that if a lake shows up more than once in the bootstrap sample, then it is counted however many times it shows up.
sum(BootstrapSample1$Mercury > 1) / 53
## [1] 0.0754717
Bootstrap Sample #2
We take a second bootstrap sample. Notice that the lakes chosen and omitted differ from the first sample.
<- sample_n(FloridaLakes, 53, replace=TRUE) %>% arrange(Lake)
BootstrapSample2 %>% select(ID, Lake, Mercury) BootstrapSample2
## # A tibble: 53 × 3
## ID Lake Mercury
## <int> <chr> <dbl>
## 1 1 Alligator 1.23
## 2 2 Annie 1.33
## 3 2 Annie 1.33
## 4 2 Annie 1.33
## 5 2 Annie 1.33
## 6 3 Apopka 0.04
## 7 6 Bryant 0.27
## 8 11 Dorr 0.71
## 9 11 Dorr 0.71
## 10 11 Dorr 0.71
## # ℹ 43 more rows
Proportion exceeding 1 ppm:
sum(BootstrapSample2$Mercury > 1) / 53
## [1] 0.1698113
Bootstrap Sample #3
We’ll take one more bootstrap sample and calculate the proportion of lakes with mercury levels exceeding 1 ppm.
<- sample_n(FloridaLakes, 53, replace=TRUE) %>% arrange(Lake)
BootstrapSample3 %>% select(ID, Lake, Mercury) BootstrapSample3
## # A tibble: 53 × 3
## ID Lake Mercury
## <int> <chr> <dbl>
## 1 1 Alligator 1.23
## 2 1 Alligator 1.23
## 3 2 Annie 1.33
## 4 3 Apopka 0.04
## 5 3 Apopka 0.04
## 6 5 Brick 1.2
## 7 6 Bryant 0.27
## 8 7 Cherry 0.48
## 9 9 Deer Point 0.83
## 10 9 Deer Point 0.83
## # ℹ 43 more rows
Proportion exceeding 1 ppm:
sum(BootstrapSample3$Mercury > 1) / 53
## [1] 0.09433962
4.2.4 Bootstrap Distribution
Now that we have seen how bootstrap sampling works, we’ll take a large number (10,000) different bootstrap samples and examine how the proportion of lakes with mercury levels exceeding 1 ppm varies between samples.
We’ll use a for-loop
to take many different bootstrap samples and record the observed proportion in a vector called p_hat_b
<- sum(FloridaLakes$Mercury > 1)/53 #calculate sample statistic
p_hat <- rep(NA, 10000) #setup vector to hold bootstrap statistics
Bootstrap_prop
for (i in 1:10000){
<- sample_n(FloridaLakes, 53, replace=TRUE) #take bootstrap sample
BootstrapSample <- sum(BootstrapSample$Mercury > 1)/53 # calc. prop exceeding 1
Bootstrap_prop[i]
}<- data.frame(Bootstrap_prop) #store values in a dataframe Lakes_Bootstrap_Prop
The distribution of proportions observed in the 10,000 different bootstrap samples is shown below. This distribution is called the bootstrap distribution.
<- ggplot(data=Lakes_Bootstrap_Prop, aes(x=Bootstrap_prop)) +
Lakes_Bootstrap_Prop_plot geom_histogram(color="white", fill="lightblue") +
xlab("Prop > 1 in Bootstrap Sample ") + ylab("Frequency") +
ggtitle("Bootstrap Distribution for Prop. of Lakes Exeeding 1 ppm Hg") +
theme(legend.position = "none")
Lakes_Bootstrap_Prop_plot
4.2.5 Bootstrap SE Confidence Interval
We calculate the standard deviation of this bootstrap distribution, which is an estimate of the standard error of \(\hat{p}\).
<- sd(Lakes_Bootstrap_Prop$Bootstrap_prop) SE_p_hat
Since the bootstrap distribution is roughly symmetric and bell-shaped, we can calculate a 95% confidence interval for the proportion of all Florida lakes with mercury levels exceeding 1 ppm, using bootstrap standard error confidence interval method
\[ \hat{p} \pm 2\times\text{SE}(\hat{p}) \]
c(p_hat - 2*SE_p_hat, p_hat + 2*SE_p_hat)
## [1] 0.02684504 0.19957005
We are 95% confident that the proportion of all Florida lakes with mercury levels exceeding 1 ppm is between 0.026845 and 0.1995701.
The gold bar at the bottom of the bootstrap distribution represents this 95% confidence interval.
+
Lakes_Bootstrap_Prop_plot geom_segment(aes(x=p_hat - 2*SE_p_hat,xend=p_hat + 2*SE_p_hat, y=50, yend=50),
color="gold", size=10, alpha=0.01)
4.2.6 Bootstrap Distribution vs Sampling Distribution
We stated that the standard error of the bootstrap distribution is meant to give us and estimate of the standard error of the statistic of interest (\(\hat{p}\) in this case). We do note, however, that the bootstrap distribution is not the same as the sampling distribution for a statistic illustrated in the previous section. The sampling distribution shows the distribution of values the statistic would take on accross many different samples drawn from the population. As such, it will be centered at the true population parameter (\(p\) in this case). The bootstrap distribution, on the other hand, shows us the distribution of values the statistic takes on across different bootstrap samples drawn from the original sample, using replacement. Since it comes entirely from the original sample, rather than the population, it will be centered at the sample statistic ( in this case \(\hat{p}\)), rather than the unknown \(p\). Nevertheless, the amount of variability in a statistic, calculated between different bootstrap samples, is often a reasonable approximation of the amount of variability we would observe in that statistic if we could take many samples of the same size from the entire population.
4.3 Bootstrap Confidence Interval Example
4.3.1 Bootstrapping Other Statistics
We’ve seen how to use bootstrapping to calculate confidence intervals for an unknown population parameter \(p\), using an estimate \(\hat{p}\), calculated from a sample of size \(n\). This procedure can be applied to calculate confidence intervals for a wide range of population parameters, using statistics calculated from a sample.
For example, we could calculate confidence intervals any of the following parameters, using the corresponding sample statistic.
Context | Parameter | Statistic |
---|---|---|
Proportion | \(p\) | \(\hat{p}\) |
Mean | \(\mu\) | \(\bar{x}\) |
Standard Deviation | \(\sigma\) | \(s\) |
Median | no comon abbreviations | |
Difference in Means | \(\mu_2-\mu_1\) | \(\bar{x}_2 - \bar{x}_1\) |
Regression Coefficient | \(\beta_j\) | \(b_j\) |
Estimated Regression Response | \(\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}\) | \(b_0 + b_1x_{i1} + \ldots + b_px_{ip}\) |
We follow the same algorithm as we did when working with a proportion, and simply calculate whatever statistic we are interested in step 2, in place of \(\hat{p}\), as we did previously.
The bootstrap algorithm is given again, below.
Bootstrap Algorithm
For an original sample of size \(n\):
Take a sample of size \(n\) by randomly sampling from the original sample with replacement. (Thus some observations will show up multiple times and others not at all.)
Calculate the statistic of interest in the bootstrap sample.
Repeat steps 1 and 2 many (say 10,000) times, keeping track of the statistic of interest that is calculated in each bootstrap sample.
Look at the distribution of the statistic across bootstrap samples. The variability in this bootstrap distribution can be used to approximate the variability in the sampling distribution for the statistic of interest.
We’ll now go through examples, calculating bootstrap confidence intervals for each of the parameters listed above.
4.3.2 CI for Mean
The histogram shows the distribution of mercury levels of the 53 lakes in our sample. The mean and standard deviation in mercury levels for these 53 lakes is shown.
<- ggplot(data=FloridaLakes, aes(x=Mercury)) +
Lakes_Hist geom_histogram(color="white", fill="lightblue", binwidth = 0.2) +
ggtitle("Mercury Levels in Sample of Florida Lakes") +
xlab("Mercury Level") + ylab("Frequency")
Lakes_Hist
We’ll calculate the mean and median mercury level for the 53 lakes in the sample.
<- FloridaLakes %>% summarize(MeanHg = mean(Mercury),
Lakes_Stats StDevHG = sd(Mercury),
N=n())
kable(Lakes_Stats)
MeanHg | StDevHG | N |
---|---|---|
0.5271698 | 0.3410356 | 53 |
We want to calculate a 95% confidence interval for the mean mercury level among all Florida lakes. We’ll use bootstrapping again, this time using the sample mean, rather than the proportion exceeding 1 ppm, as our statistic of interest.
Bootstrap Steps
Take a sample of 53 lakes by randomly sampling from the original sample of 53 lakes, with replacement.
Calculate the mean mercury level in the bootstrap sample.
Repeat steps 1 and 2 many (say 10,000) times, keeping track of the mean mercury level in each bootstrap sample.
Look at the distribution of the mean across bootstrap samples. The variability in this bootstrap distribution can be used to approximate the variability in the sampling distribution for the mean mercury level.
We’ll illustrate the procedure on 3 bootstrap samples.
Bootstrap Sample 1
<- sample_n(FloridaLakes, 53, replace=TRUE) %>% arrange(Lake)
BootstrapSample1 %>% select(ID, Lake, Mercury) BootstrapSample1
## # A tibble: 53 × 3
## ID Lake Mercury
## <int> <chr> <dbl>
## 1 1 Alligator 1.23
## 2 1 Alligator 1.23
## 3 3 Apopka 0.04
## 4 4 Blue Cypress 0.44
## 5 4 Blue Cypress 0.44
## 6 6 Bryant 0.27
## 7 6 Bryant 0.27
## 8 6 Bryant 0.27
## 9 7 Cherry 0.48
## 10 8 Crescent 0.19
## # ℹ 43 more rows
We calculate the mean mercury level among the lakes in the bootstrap sample.
mean(BootstrapSample1$Mercury)
## [1] 0.4986792
Bootstrap Sample #2
<- sample_n(FloridaLakes, 53, replace=TRUE) %>% arrange(Lake)
BootstrapSample2 %>% select(ID, Lake, Mercury) BootstrapSample2
## # A tibble: 53 × 3
## ID Lake Mercury
## <int> <chr> <dbl>
## 1 1 Alligator 1.23
## 2 3 Apopka 0.04
## 3 4 Blue Cypress 0.44
## 4 6 Bryant 0.27
## 5 6 Bryant 0.27
## 6 7 Cherry 0.48
## 7 7 Cherry 0.48
## 8 8 Crescent 0.19
## 9 8 Crescent 0.19
## 10 8 Crescent 0.19
## # ℹ 43 more rows
Mean Mercury Level:
mean(BootstrapSample2$Mercury)
## [1] 0.5192453
Bootstrap Sample #3
<- sample_n(FloridaLakes, 53, replace=TRUE) %>% arrange(Lake)
BootstrapSample3 %>% select(ID, Lake, Mercury) BootstrapSample3
## # A tibble: 53 × 3
## ID Lake Mercury
## <int> <chr> <dbl>
## 1 1 Alligator 1.23
## 2 2 Annie 1.33
## 3 4 Blue Cypress 0.44
## 4 6 Bryant 0.27
## 5 6 Bryant 0.27
## 6 8 Crescent 0.19
## 7 12 Down 0.5
## 8 12 Down 0.5
## 9 12 Down 0.5
## 10 13 Eaton 0.49
## # ℹ 43 more rows
Mean Mercury Level:
mean(BootstrapSample3$Mercury)
## [1] 0.445283
Now, we’ll take 10,000 bootstrap samples, and record the mean mercury concentration in each sample.
<- mean(FloridaLakes$Mercury) #calculate sample statistic
mean <- rep(NA, 10000) # setup vector to hold bootstrap statistics
Bootstrap_Mean
for (i in 1:10000){
<- sample_n(FloridaLakes, 53, replace=TRUE) # take bootstrap sample
BootstrapSample <- mean(BootstrapSample$Mercury) # calculate mean in bootstrap sample
Bootstrap_Mean[i]
}<- data.frame(Bootstrap_Mean) #store results in data frame Lakes_Bootstrap_Results_Mean
The bootstrap distribution for the mean mercury level is shown below, along with its standard error.
<- ggplot(data=Lakes_Bootstrap_Results_Mean,
Lakes_Bootstrap_Mean_Plot aes(x=Bootstrap_Mean)) +
geom_histogram(color="white", fill="lightblue") +
xlab("Mean Mercury in Bootstrap Sample ") + ylab("Frequency") +
ggtitle("Bootstrap Distribution for Sample Mean in Florida Lakes") +
theme(legend.position = "none")
Lakes_Bootstrap_Mean_Plot
We’ll calculate the standard error of the mean. This is a measure of how much the mean varies between samples.
<- sd(Lakes_Bootstrap_Results_Mean$Bootstrap_Mean)
SE_mean SE_mean
## [1] 0.04655744
Notice that the standard error of the mean is much less than the sample standard deviation of 0.341.
Interpretations of sample standard deviation and standard error of the mean
- The sample standard deviation measures the amount of variability in mercury levels between the 53 individual lakes in our sample.
- The standard error of the mean measures the amount of variability in sample mean mercury levels between different samples of size 53.
There is more variability between mercury levels in individual lakes than there is between average mercury levels in different samples of size 53.
Since the bootstrap distribution is roughly symmetric and bell-shaped, we can use the bootstrap standard error method to calculate an approximate 95% confidence interval for the mean mercury level among all Florida lakes.
\[ \text{Statistic} \pm 2\times\text{Standard Error} \]
In this case, the statistic of interest is the sample mean \(\bar{x}=0.527\). The confidence interval is
\[ \begin{aligned} & \bar{x} \pm 2\times\text{SE}(\bar{x}) \\ & = 0.527 \pm 2\times\text{0.0465574} \end{aligned} \]
95% Confidence Interval:
c(mean - 2*SE_mean, mean + 2*SE_mean)
## [1] 0.4340549 0.6202847
The 95% confidence interval is shown by the gold bar on the graph of the bootstrap distribution below.
+
Lakes_Bootstrap_Mean_Plot geom_segment(aes(x=mean - 2*SE_mean,xend=mean + 2*SE_mean, y=50, yend=50),
color="gold", size=10, alpha=0.01)
We are 95% confident that the average mercury level among all Florida lakes is between 0.4340549 and 0.6202847 parts per million.
It is important to note that we are not saying that we are 95% confident that an individual lake lie in this range, or that 95% of all individual lakes lie in this range. We are only saying that we are confident that the average mercury level among all lakes lies in this range. A confidence interval is a statement about a population parameter (in this case the average mercury level), rather than about individual lakes in the population. Since there is more variability about individual lakes than overall averages, we’ll need to make a wider interval when talking about the mercury level for an individual lake.
4.3.3 CI for Standard Deviation
Now, we’ll calculate a confidence interval for the standard deviation in mercury levels among all Florida lakes. Recall that the sample standard deviation (\(s\)) was:
<- sd(FloridaLakes$Mercury)
Sample_SD Sample_SD
## [1] 0.3410356
We’ll use this estimate to calculate a confidence interval for the population standard deviation \(\sigma\).
This time, our statistic of interest is the sample standard deviation \(s\).
Bootstrap Steps
Take a sample of 53 lakes by randomly sampling from the original sample of 53 lakes, with replacement.
Calculate the standard deviation in mercury level in the bootstrap sample.
Repeat steps 1 and 2 many (say 10,000) times, keeping track of the standard deviation mercury level in each bootstrap sample.
Look at the distribution of the standard deviations across bootstrap samples. The variability in this bootstrap distribution can be used to approximate the variability in the sampling distribution for the standard deviation in mercury level.
We’ll illustrate the procedure on 3 bootstrap samples.
Bootstrap Sample 1
<- sample_n(FloridaLakes, 53, replace=TRUE) %>% arrange(Lake)
BootstrapSample1 %>% select(ID, Lake, Mercury) BootstrapSample1
## # A tibble: 53 × 3
## ID Lake Mercury
## <int> <chr> <dbl>
## 1 2 Annie 1.33
## 2 3 Apopka 0.04
## 3 3 Apopka 0.04
## 4 5 Brick 1.2
## 5 6 Bryant 0.27
## 6 7 Cherry 0.48
## 7 7 Cherry 0.48
## 8 8 Crescent 0.19
## 9 8 Crescent 0.19
## 10 9 Deer Point 0.83
## # ℹ 43 more rows
We calculate the standard deviation in mercury levels among the lakes in the bootstrap sample.
sd(BootstrapSample1$Mercury)
## [1] 0.3169752
Bootstrap Sample #2
<- sample_n(FloridaLakes, 53, replace=TRUE) %>% arrange(Lake)
BootstrapSample2 %>% select(ID, Lake, Mercury) BootstrapSample2
## # A tibble: 53 × 3
## ID Lake Mercury
## <int> <chr> <dbl>
## 1 2 Annie 1.33
## 2 2 Annie 1.33
## 3 3 Apopka 0.04
## 4 5 Brick 1.2
## 5 5 Brick 1.2
## 6 8 Crescent 0.19
## 7 8 Crescent 0.19
## 8 9 Deer Point 0.83
## 9 9 Deer Point 0.83
## 10 12 Down 0.5
## # ℹ 43 more rows
Standard Deviation in Mercury Level:
sd(BootstrapSample2$Mercury)
## [1] 0.3927499
Bootstrap Sample #3
<- sample_n(FloridaLakes, 53, replace=TRUE) %>% arrange(Lake)
BootstrapSample3 %>% select(ID, Lake, Mercury) BootstrapSample3
## # A tibble: 53 × 3
## ID Lake Mercury
## <int> <chr> <dbl>
## 1 2 Annie 1.33
## 2 2 Annie 1.33
## 3 5 Brick 1.2
## 4 5 Brick 1.2
## 5 6 Bryant 0.27
## 6 6 Bryant 0.27
## 7 9 Deer Point 0.83
## 8 9 Deer Point 0.83
## 9 9 Deer Point 0.83
## 10 9 Deer Point 0.83
## # ℹ 43 more rows
Standard Deviation Mercury Level:
sd(BootstrapSample3$Mercury)
## [1] 0.3589536
Now, we’ll take 10,000 bootstrap samples, and record the standard deviation in mercury concentration in each sample.
<- sd(FloridaLakes$Mercury) #calculate sample statistic
Sample_SD <- rep(NA, 10000) # setup vector to hold bootstrap statistics
Bootstrap_SD
for (i in 1:10000){
<- sample_n(FloridaLakes, 53, replace=TRUE) # take bootstrap sample
BootstrapSample <- sd(BootstrapSample$Mercury) # calculate standard deviation in bootstrap sample
Bootstrap_SD[i]
}<- data.frame(Bootstrap_SD) #store results in data frame Lakes_Bootstrap_Results_SD
The bootstrap distribution for the mean mercury level is shown below, along with its standard error.
<- ggplot(data=Lakes_Bootstrap_Results_SD,
Lakes_Bootstrap_SD_Plot aes(x=Bootstrap_SD)) +
geom_histogram(color="white", fill="lightblue") +
xlab("SD in Mercury in Bootstrap Sample ") + ylab("Frequency") +
ggtitle("Bootstrap Distribution for Sample SD in Florida Lakes") +
theme(legend.position = "none")
Lakes_Bootstrap_SD_Plot
We’ll calculate the standard error of the standard deviation. This is a measure of how much the standard deviation varies between samples.
<- sd(Lakes_Bootstrap_Results_SD$Bootstrap_SD)
SE_SD SE_SD
## [1] 0.02869572
Since the bootstrap distribution is roughly symmetric and bell-shaped, we can use the bootstrap standard error method to calculate an approximate 95% confidence interval for the standard deviation in mercury levels among all Florida lakes.
\[ \text{Statistic} \pm 2\times\text{Standard Error} \]
In this case, the statistic of interest is the sample standard deviation \(s=0.341\). The confidence interval is
\[ \begin{aligned} & s \pm 2\times\text{SE}(s) \\ & = 0.341 \pm 2\times{0.029} \end{aligned} \] 95% Confidence Interval:
c(Sample_SD - 2*SE_SD, Sample_SD + 2*SE_SD )
## [1] 0.2836442 0.3984271
The 95% confidence interval is shown by the gold bar on the graph of the bootstrap distribution below.
+
Lakes_Bootstrap_SD_Plot geom_segment(aes(x=Sample_SD - 2*SE_SD,xend=Sample_SD + 2*SE_SD, y=50, yend=50),
color="gold", size=10, alpha=0.01)
We are 95% confident that the standard deviation in mercury levels among all Florida lakes is between 0.2836442 and 0.3984271 parts per million.
4.3.4 CI for Median
We already calculated a confidence interval for the mean mercury level among all Florida lakes. We could calculate a bootstrap confidence interval for the median mercury level as well, but since the distribution of mercury levels in the lakes is roughly symmetric, the mean is a reasonable measure of center, and there is not a clear reason for using the median instead.
When a distribution is skewed or contains large outliers, however, the median is a more robust measure of center than the mean. Recall the distribution of 100 Seattle house prices seen in Chapters 1 and 2.
ggplot(data=Houses, aes(x=price)) +
geom_histogram(fill="lightblue", color="white") +
ggtitle("Distribution of House Prices") +
xlab("Price") +
ylab("Frequency")
These 100 houses are a sample of all houses sold in Seattle in 2014 and 2015, so we can use statistics from our sample to draw conclusions about all houses sold in Seattle in this time period.
In this subsection, we’ll use bootstrapping to calculate a 95% confidence interval for the median price among all houses sold in Seattle in this time period.
We calculate the sample median price.
<- median(Houses$price)
Sample_Median Sample_Median
## [1] 507.5
Bootstrap Steps
Take a sample of 100 houses by randomly sampling from the original sample of 100 houses, with replacement.
Calculate the median price in the bootstrap sample.
Repeat steps 1 and 2 many (say 10,000) times, keeping track of the median price in each bootstrap sample.
Look at the distribution of the median price across bootstrap samples. The variability in this bootstrap distribution can be used to approximate the variability in the sampling distribution for the median price.
We’ll illustrate the procedure on 3 bootstrap samples.
Bootstrap Sample 1
<- sample_n(Houses, 100, replace=TRUE) %>% arrange(Id)
BootstrapSample1 %>% select(Id, price) BootstrapSample1
## # A tibble: 100 × 2
## Id price
## <int> <dbl>
## 1 1 1225
## 2 1 1225
## 3 1 1225
## 4 4 253.
## 5 4 253.
## 6 6 310.
## 7 7 550.
## 8 7 550.
## 9 9 315.
## 10 11 410
## # ℹ 90 more rows
We calculate the median price among the houses in the bootstrap sample.
median(BootstrapSample1$price)
## [1] 520
Bootstrap Sample #2
<- sample_n(Houses, 100, replace=TRUE) %>% arrange(Id)
BootstrapSample2 %>% select(Id, price) BootstrapSample2
## # A tibble: 100 × 2
## Id price
## <int> <dbl>
## 1 2 885.
## 2 2 885.
## 3 4 253.
## 4 4 253.
## 5 5 468.
## 6 6 310.
## 7 7 550.
## 8 7 550.
## 9 7 550.
## 10 8 485.
## # ℹ 90 more rows
Median Price:
median(BootstrapSample2$price)
## [1] 510
Bootstrap Sample #3
<- sample_n(Houses, 100, replace=TRUE) %>% arrange(Id)
BootstrapSample3 %>% select(Id, price) BootstrapSample3
## # A tibble: 100 × 2
## Id price
## <int> <dbl>
## 1 1 1225
## 2 1 1225
## 3 1 1225
## 4 2 885.
## 5 4 253.
## 6 5 468.
## 7 6 310.
## 8 8 485.
## 9 8 485.
## 10 8 485.
## # ℹ 90 more rows
Median Price:
median(BootstrapSample3$price)
## [1] 538
Now, we’ll take 10,000 bootstrap samples, and record the median price in each sample.
<- median(Houses$price) #calculate sample median
Sample_Med <- rep(NA, 10000) # setup vector to hold bootstrap statistics
Bootstrap_Med
for (i in 1:10000){
<- sample_n(Houses, 100, replace=TRUE) # take bootstrap sample
BootstrapSample <- median(BootstrapSample$price) # calculate standard deviation in bootstrap sample
Bootstrap_Med[i]
}<- data.frame(Bootstrap_Med) #store results in data frame Houses_Bootstrap_Results_Med
The bootstrap distribution for the median price is shown below, along with its standard error.
<- ggplot(data=Houses_Bootstrap_Results_Med,
Houses_Bootstrap_Med_Plot aes(x=Bootstrap_Med)) +
geom_histogram(color="white", fill="lightblue") +
xlab("Median Price in Bootstrap Sample ") + ylab("Frequency") +
ggtitle("Bootstrap Distribution for Median Price in Seattle Houses") +
theme(legend.position = "none")
Houses_Bootstrap_Med_Plot
We’ll calculate the standard error of the median. This is a measure of how much the median varies between samples.
<- sd(Houses_Bootstrap_Results_Med$Bootstrap_Med)
SE_Med SE_Med
## [1] 47.88898
The standard error measures the amount of variability in median house price between different samples of size 100.
Note that this is different than the sample standard deviation, which represents the standard deviation in prices between the 100 different houses in the sample.
Notice that the bootstrap distribution for the median is not symmetric and bell-shaped. Thus, we cannot be assured that 95% of samples will produce a statistic within two standard errors of the mean, so the standard error confidence interval method is not appropriate here. Instead, we’ll calculate a confidence interval by taking the middle 95% of the values in the bootstrap distribution. A cionfidence interval calculated this way is called a percentile bootstrap interval.
We’ll calculate the 0.025 quantile and the 0.975 quantile of this bootstrap distribution. These are the points below which lie 2.5% and 97.5% of the medians in the bootstrap distribution. Thus, the middle 95% of the medians lie between these values.
.025 <- quantile(Houses_Bootstrap_Results_Med$Bootstrap_Med, 0.025)
q.025 q
## 2.5%
## 410
.975 <- quantile(Houses_Bootstrap_Results_Med$Bootstrap_Med, 0.975)
q.975 q
## 97.5%
## 600
The 95% confidence interval is shown by the gold bar on the graph of the bootstrap distribution below.
+
Houses_Bootstrap_Med_Plot geom_segment(aes(x=q.025,xend=q.975, y=50, yend=50),
color="gold", size=10, alpha=0.01)
We are 95% confident that the median price among all houses that sold in Seattle between 2014 and 2015 is between 410 and 600 thousand dollars.
4.3.5 CI for Difference in Means
We previously calculated a confidence interval for the average mercury level among all lakes in Florida.
Now, we’ll calculate an interval for the difference in average mercury levels between lakes in Northern Florida, compared to Southern Florida.
The boxplot shows and table below describe the distribution of mercury levels for lakes in Northern Florida, compared to Southern Florida.
<- ggplot(data=FloridaLakes, aes(x=Location, y=Mercury, fill=Location)) +
LakesBP geom_boxplot() + geom_jitter() + ggtitle("Mercury Levels in Florida Lakes") +
xlab("Location") + ylab("Mercury Level") + theme(axis.text.x = element_text(angle = 90)) + coord_flip()
LakesBP
<- FloridaLakes %>% group_by(Location) %>% summarize(MeanHg=mean(Mercury),
LakesTable StDevHg=sd(Mercury),
N=n())
LakesTable
## # A tibble: 2 × 4
## Location MeanHg StDevHg N
## <fct> <dbl> <dbl> <int>
## 1 N 0.425 0.270 33
## 2 S 0.696 0.384 20
In our sample of 33 Northern Lakes and 20 Southern Lakes, we saw a difference of 0.27 ppm. We’ll calculate a confidence interval to estimate how big or small this difference could be among all Florida lakes.
We’ll use a statistical model to calculate the average mercury levels in Northern and Southern Florida.
\(\widehat{\text{Mercury}} = b_0 +b_1\times{\text{South}}\)
- \(b_0\) represents the mean mercury level for lakes in North Florida, and
- \(b_1\) represents the mean difference in mercury level for lakes in South Florida, compared to North Florida
The estimates for corresponding to the original sample are shown below.
<- lm(data=FloridaLakes, Mercury~Location)
M M
##
## Call:
## lm(formula = Mercury ~ Location, data = FloridaLakes)
##
## Coefficients:
## (Intercept) LocationS
## 0.4245 0.2720
Thus, we can obtain a confidence interval for the difference in average mercury levels by fitting a regression model to each of our bootstrap samples and recording the value of the sample statistic \(b_1\), which represents this difference. Alternatively, we could calculate the mean from each group separately and subtract.
When comparing groups, we make one modification in Step #1 of the bootstrap process. Rather than drawing a sample of size \(n\) at random, with replacement, we’ll draw the same number of observations from each group as were in the original sample. In this case, we had 33 northern lakes, and 20 southern lakes.
Bootstrap Steps
Take a sample of 33 northern lakes and 20 southern lakes by randomly sampling from the original sample, with replacement.
Fit a regression model with location as the explanatory variable and record the value of \(b_1\), representing the difference between the means for each group (South-North).
Repeat steps 1 and 2 many (say 10,000) times, keeping track of the difference in means in each bootstrap sample.
Look at the distribution of the differences in means across bootstrap samples. The variability in this bootstrap distribution can be used to approximate the variability in the sampling distribution for the difference in means between mercury levels in Northern and Southern Florida.
We’ll illustrate the procedure on 3 bootstrap samples.
Bootstrap Sample 1
<- sample_n(FloridaLakes %>% filter(Location=="N"), 33, replace=TRUE) ## sample 33 northern lakes
NLakes <- sample_n(FloridaLakes %>% filter(Location=="S"), 20, replace=TRUE) ## sample 20 southern lakes
SLakes <- rbind(NLakes, SLakes) %>% arrange(ID) %>%
BootstrapSample1 select(ID, Lake, Location, Mercury) ## combine Northern and Southern Lakes
BootstrapSample1
## # A tibble: 53 × 4
## ID Lake Location Mercury
## <int> <chr> <fct> <dbl>
## 1 1 Alligator S 1.23
## 2 2 Annie S 1.33
## 3 3 Apopka N 0.04
## 4 6 Bryant N 0.27
## 5 7 Cherry N 0.48
## 6 8 Crescent N 0.19
## 7 8 Crescent N 0.19
## 8 11 Dorr N 0.71
## 9 12 Down S 0.5
## 10 12 Down S 0.5
## # ℹ 43 more rows
We fit a regression model to the bootstrap sample and calculate the regression coefficients. We’re interested in the second coefficient, \(b_1\), which represents the mean difference between lakes in Southern and Northern Florida
<- lm(data=BootstrapSample1, Mercury ~ Location) ## fit linear model
Mb1 Mb1
##
## Call:
## lm(formula = Mercury ~ Location, data = BootstrapSample1)
##
## Coefficients:
## (Intercept) LocationS
## 0.3864 0.2391
<- sample_n(FloridaLakes %>% filter(Location=="N"), 33, replace=TRUE) ## sample 33 northern lakes
NLakes <- sample_n(FloridaLakes %>% filter(Location=="S"), 20, replace=TRUE) ## sample 20 southern lakes
SLakes <- rbind(NLakes, SLakes) %>% arrange(ID) %>%
BootstrapSample2 select(ID, Lake, Location, Mercury) ## combine Northern and Southern Lakes
BootstrapSample2
## # A tibble: 53 × 4
## ID Lake Location Mercury
## <int> <chr> <fct> <dbl>
## 1 1 Alligator S 1.23
## 2 1 Alligator S 1.23
## 3 2 Annie S 1.33
## 4 3 Apopka N 0.04
## 5 3 Apopka N 0.04
## 6 4 Blue Cypress S 0.44
## 7 8 Crescent N 0.19
## 8 9 Deer Point N 0.83
## 9 10 Dias N 0.81
## 10 10 Dias N 0.81
## # ℹ 43 more rows
Bootstrap Sample 2
<- lm(data=BootstrapSample2, Mercury ~ Location) ## fit linear model
Mb2 Mb2
##
## Call:
## lm(formula = Mercury ~ Location, data = BootstrapSample2)
##
## Coefficients:
## (Intercept) LocationS
## 0.4876 0.3934
Bootstrap Sample 3
<- sample_n(FloridaLakes %>% filter(Location=="N"), 33, replace=TRUE) ## sample 33 northern lakes
NLakes <- sample_n(FloridaLakes %>% filter(Location=="S"), 20, replace=TRUE) ## sample 20 southern lakes
SLakes <- rbind(NLakes, SLakes) %>% arrange(ID) %>%
BootstrapSample3 select(ID, Lake, Location, Mercury) ## combine Northern and Southern Lakes
BootstrapSample3
## # A tibble: 53 × 4
## ID Lake Location Mercury
## <int> <chr> <fct> <dbl>
## 1 2 Annie S 1.33
## 2 2 Annie S 1.33
## 3 2 Annie S 1.33
## 4 3 Apopka N 0.04
## 5 5 Brick S 1.2
## 6 6 Bryant N 0.27
## 7 7 Cherry N 0.48
## 8 9 Deer Point N 0.83
## 9 10 Dias N 0.81
## 10 12 Down S 0.5
## # ℹ 43 more rows
<- lm(data=BootstrapSample3, Mercury ~ Location) ## fit linear model
Mb3 Mb3
##
## Call:
## lm(formula = Mercury ~ Location, data = BootstrapSample3)
##
## Coefficients:
## (Intercept) LocationS
## 0.4436 0.2259
We’ll now take 10,000 different bootstrap samples and look at the bootstrap distribution for \(b_1\), the difference in mean mercury levels between lakes in Southern and Northern Florida
<- lm(data=FloridaLakes, Mercury~Location) #fit model to original sample
M <- M$coefficients[2] # record b1 value (second coefficient)
Sample_b1 <- rep(NA, 10000) #vector to store b1 values
Bootstrap_b1
for (i in 1:10000){
<- sample_n(FloridaLakes %>% filter(Location=="N"), 33, replace=TRUE) ## sample 33 northern lakes
NLakes <- sample_n(FloridaLakes %>% filter(Location=="S"), 20, replace=TRUE) ## sample 20 southern lakes
SLakes <- rbind(NLakes, SLakes) ## combine Northern and Southern Lakes
BootstrapSample <- lm(data=BootstrapSample, Mercury ~ Location) ## fit linear model
M <- M$coefficients[2] ## record b1
Bootstrap_b1[i]
}<- data.frame(Bootstrap_b1) #save results as dataframe NS_Lakes_Bootstrap_Results
The bootstrap distribution for the difference in means, \(b_1\), is shown below, along with the standard error for the difference.
<- ggplot(data=NS_Lakes_Bootstrap_Results, aes(x=Bootstrap_b1)) +
NS_Lakes_Bootstrap_Plot_b1 geom_histogram(color="white", fill="lightblue") +
xlab("Mean Difference (b1) in Bootstrap Sample") + ylab("Frequency") +
ggtitle("Northern vs Southern Lakes: Bootstrap Distribution for b1")
NS_Lakes_Bootstrap_Plot_b1
Standard Error of the Difference in Means \(b_1\)
<- sd(NS_Lakes_Bootstrap_Results$Bootstrap_b1)
SE_b1 SE_b1
## [1] 0.09620419
The bootstrap distribution is symmetric and bell-shaped, so we can use the standard error method to calculate a 95% confidence interval.
\[ \begin{aligned} & b_1 \pm 2\times\text{SE}(b_1) \\ & = 0.271 \pm 2\times{0.095} \end{aligned} \] 95% Confidence Interval:
c(Sample_b1 - 2*SE_b1, Sample_b1 + 2*SE_b1)
## LocationS LocationS
## 0.07954616 0.46436293
The 95% confidence interval is shown by the gold bar on the graph of the bootstrap distribution below.
+
NS_Lakes_Bootstrap_Plot_b1 geom_segment(aes(x=Sample_b1 - 2*SE_b1,xend=Sample_b1 + 2*SE_b1, y=50, yend=50),
color="gold", size=10, alpha=0.01)
We are 95% confident that the mean mercury level among all lakes in Southern Florida is between 0.0795462 and 0.4643629 higher than the mean mercury level among all lakes in Northern Florida.
Question: We previously performed a hypothesis test and concluded that there was evidence that mean mercury level was higher for lakes in South Florida than Northern Florida. Is this confidence interval consistent with the result of the hypothesis test? Why or why not?
4.3.6 CI for Regression Slope
Recall that we examined the relationship between mercury levels and pH in Florida lakes.
ggplot(data=FloridaLakes, aes(y=Mercury, x=pH)) +
geom_point() + stat_smooth(method="lm", se=FALSE)
The regression equation is
\[ \widehat{\text{Mercury}} = b_0 + b_1\times\text{pH} \]
Regression estimates \(b_0\) and \(b_1\) are shown below.
<- lm(data=FloridaLakes, Mercury~pH)
M M
##
## Call:
## lm(formula = Mercury ~ pH, data = FloridaLakes)
##
## Coefficients:
## (Intercept) pH
## 1.5309 -0.1523
- On average, lakes with pH level 0 are expected to have a mercury level of 1.53 ppm.
- For each one-unit increase in pH, mercury level is expected to decrease by 0.15 ppm.
These estimates are sample statistics, calculated from our sample of 53 lakes. We can think of our regression equation estimates \(b_0\) and \(b_1\) as estimates of parameters \(\beta_0\) and \(\beta_1\), which pertain to the slope and intercept of the regression line pertaining to the entire population of all lakes in Florida. We’ll use \(b_0\) and \(b_1\) to estimate \(\beta_0\) and \(\beta_1\) in the same way that we used sample proportion \(\hat{p}\) to estimate population proportion \(p\) and sample mean \(\bar{x}\) to estimate population mean \(\mu\).
The intercept, \(\beta_0\) has little meaning here, but the slope \(\beta_1\) represents the average change in mercury level for each one-unit increase in pH, among all Florida lakes. We’ll use bootstrapping to find a confidence interval for this quantity.
Bootstrap Steps
Take a sample of 53 lakes by randomly sampling from the original sample, with replacement.
Fit a regression model with pH as the explanatory variable and record the value of slope \(b_1\).
Repeat steps 1 and 2 many (say 10,000) times, keeping track of slope of the regression line for each bootstrap sample.
Look at the distribution of the slopes across bootstrap samples. The variability in this bootstrap distribution can be used to approximate the variability in the sampling distribution for the slope relating mercury and pH levels.
We’ll illustrate the procedure on 3 bootstrap samples.
Bootstrap Sample 1
<- sample_n(FloridaLakes , 53, replace=TRUE) %>% arrange(ID) %>%
BootstrapSample1 select(ID, Lake, pH, Mercury) # take bootstrap sample
BootstrapSample1
## # A tibble: 53 × 4
## ID Lake pH Mercury
## <int> <chr> <dbl> <dbl>
## 1 1 Alligator 6.1 1.23
## 2 1 Alligator 6.1 1.23
## 3 1 Alligator 6.1 1.23
## 4 2 Annie 5.1 1.33
## 5 2 Annie 5.1 1.33
## 6 2 Annie 5.1 1.33
## 7 3 Apopka 9.1 0.04
## 8 3 Apopka 9.1 0.04
## 9 3 Apopka 9.1 0.04
## 10 4 Blue Cypress 6.9 0.44
## # ℹ 43 more rows
We fit a regression model to the bootstrap sample and calculate the regression coefficients. We’re again interested in the second coefficient, \(b_1\), which now represents the slope of the regression line.
<- lm(data=BootstrapSample1, Mercury ~ pH) # fit linear model
Mb1 Mb1
##
## Call:
## lm(formula = Mercury ~ pH, data = BootstrapSample1)
##
## Coefficients:
## (Intercept) pH
## 1.7903 -0.1918
Bootstrap Sample 2
<- sample_n(FloridaLakes , 53, replace=TRUE) %>% arrange(ID) %>%
BootstrapSample2 select(ID, Lake, pH, Mercury)
BootstrapSample2
## # A tibble: 53 × 4
## ID Lake pH Mercury
## <int> <chr> <dbl> <dbl>
## 1 1 Alligator 6.1 1.23
## 2 4 Blue Cypress 6.9 0.44
## 3 4 Blue Cypress 6.9 0.44
## 4 4 Blue Cypress 6.9 0.44
## 5 5 Brick 4.6 1.2
## 6 5 Brick 4.6 1.2
## 7 6 Bryant 7.3 0.27
## 8 7 Cherry 5.4 0.48
## 9 8 Crescent 8.1 0.19
## 10 8 Crescent 8.1 0.19
## # ℹ 43 more rows
<- lm(data=BootstrapSample2, Mercury ~ pH) # fit linear model
Mb2 Mb2
##
## Call:
## lm(formula = Mercury ~ pH, data = BootstrapSample2)
##
## Coefficients:
## (Intercept) pH
## 1.5724 -0.1661
Bootstrap Sample 3
<- sample_n(FloridaLakes , 53, replace=TRUE) %>% arrange(ID) %>%
BootstrapSample3 select(ID, Lake, pH, Mercury)
BootstrapSample3
## # A tibble: 53 × 4
## ID Lake pH Mercury
## <int> <chr> <dbl> <dbl>
## 1 5 Brick 4.6 1.2
## 2 6 Bryant 7.3 0.27
## 3 6 Bryant 7.3 0.27
## 4 7 Cherry 5.4 0.48
## 5 8 Crescent 8.1 0.19
## 6 9 Deer Point 5.8 0.83
## 7 9 Deer Point 5.8 0.83
## 8 10 Dias 6.4 0.81
## 9 12 Down 7.2 0.5
## 10 12 Down 7.2 0.5
## # ℹ 43 more rows
<- lm(data=BootstrapSample3, Mercury ~ pH) # fit linear model
Mb3 Mb3
##
## Call:
## lm(formula = Mercury ~ pH, data = BootstrapSample3)
##
## Coefficients:
## (Intercept) pH
## 1.6215 -0.1612
We’ll now take 10,000 different bootstrap samples and look at the bootstrap distribution for \(b_1\), the slope of the regression line relating mercury level and pH.
<- lm(data=FloridaLakes, Mercury~pH) #fit model to original sample
M <- M$coefficients[2] # record b1 value (second coefficient)
Sample_b1 <- rep(NA, 10000) #vector to store b1 values
Bootstrap_b1
for (i in 1:10000){
<- sample_n(FloridaLakes , 53, replace=TRUE) #take bootstrap sample
BootstrapSample <- lm(data=BootstrapSample, Mercury ~ pH) # fit linear model
M <- M$coefficients[2] # record b1
Bootstrap_b1[i]
}<- data.frame(Bootstrap_b1) #save results as dataframe Lakes_Bootstrap_Slope_Results
The bootstrap distribution for the slopes, \(b_1\), is shown below, along with the standard error for the difference.
<- ggplot(data=Lakes_Bootstrap_Slope_Results, aes(x=Bootstrap_b1)) +
Lakes_Bootstrap_Plot_Slope geom_histogram(color="white", fill="lightblue") +
xlab("Slope in Bootstrap Sample") + ylab("Frequency") +
ggtitle("Bootstrap Distribution for Slope")
Lakes_Bootstrap_Plot_Slope
Standard Error of the slope \(b_1\)
<- sd(Lakes_Bootstrap_Slope_Results$Bootstrap_b1)
SE_b1 SE_b1
## [1] 0.02676588
The bootstrap distribution is symmetric and bell-shaped, so we can use the standard error method to calculate a 95% confidence interval.
\[ \begin{aligned} & b_1 \pm 2\times\text{SE}(b_1) \\ & = -0.1523 \pm 2\times{0.027} \end{aligned} \]
95% Confidence Interval:
c(Sample_b1 - 2*SE_b1, Sample_b1 + 2*SE_b1)
## pH pH
## -0.2058326 -0.0987691
The 95% confidence interval is shown by the gold bar on the graph of the bootstrap distribution below.
+
Lakes_Bootstrap_Plot_Slope geom_segment(aes(x=Sample_b1 - 2*SE_b1,xend=Sample_b1 + 2*SE_b1, y=50, yend=50),
color="gold", size=10, alpha=0.01)
We are 95% confident that among all Florida lakes, for each 1 unit increase in pH, mercury level decreases between 0.2058326 and 0.0987691, on average.
4.3.7 CI for Regression Response
In addition to calculating a confidence interval for the slope of the regression line relating mercury and pH levels in a lake, we can also calculate a confidence interval for the average mercury level among all lakes with a given pH.
We’ll calculate a confidence interval for the average mercury level among all lakes with a neutral pH level of 7.
The regression equation is
\[ \begin{aligned} \widehat{\text{Mercury}} & = b_0 + b_1\times\text{pH} \\ & = 1.5309 - 0.1523\times\text{pH} \end{aligned} \]
so the expected mercury level among all lakes with \(\text{pH} = 7\) is \(b_0+7b_1 = 1.5309-0.1523(7)=0.4648\) ppm.
This quantity is a statistic calculated from a sample of 53 lakes, so we would not expect the average mercury level among all lakes in the population to be exactly equal to 0.4648. Again, we’ll use this sample statistic as an estimate of the population parameter, and use bootstrapping to estimate the variability associated with this statistic, in order to make a confidence interval.
Bootstrap Steps
Take a sample of 53 lakes by randomly sampling from the original sample, with replacement.
Fit a regression model with location as the explanatory variable and record the values of \(b_0\) and \(b_1\). Use these to calculate \(b_0+7b_1\).
Repeat steps 1 and 2 many (say 10,000) times, keeping track of \(b_0\) and \(b_1\), and calculating \(b_0+7b_1\) in each bootstrap sample.
Look at the distribution of the expected response, \(b_0 + 7b_1\), across bootstrap samples. The variability in this bootstrap distribution can be used to approximate the variability in the sampling distribution for the expected mercury level among all lakes with pH level of 7.
We’ll illustrate the procedure on 3 bootstrap samples.
Bootstrap Sample 1
<- sample_n(FloridaLakes , 53, replace=TRUE) %>% arrange(ID) %>%
BootstrapSample1 select(ID, Lake, pH, Mercury) # take bootstrap sample
BootstrapSample1
## # A tibble: 53 × 4
## ID Lake pH Mercury
## <int> <chr> <dbl> <dbl>
## 1 2 Annie 5.1 1.33
## 2 3 Apopka 9.1 0.04
## 3 5 Brick 4.6 1.2
## 4 9 Deer Point 5.8 0.83
## 5 10 Dias 6.4 0.81
## 6 10 Dias 6.4 0.81
## 7 11 Dorr 5.4 0.71
## 8 11 Dorr 5.4 0.71
## 9 12 Down 7.2 0.5
## 10 13 Eaton 7.2 0.49
## # ℹ 43 more rows
We fit a regression model to the bootstrap sample and calculate the regression coefficients. We’re interested in the second coefficient, \(b_1\), which represents the mean difference between lakes in Southern and Northern Florida
<- lm(data=BootstrapSample1, Mercury ~ pH) ## fit linear model
Mb1 <- Mb1$coefficients[1] # record value of b0 (first coefficient)
b0 <- Mb1$coefficients[2] # record value of b1 (second coefficient)
b1 +7*b1 #calculate b0+7*b1 b0
## (Intercept)
## 0.4726683
Bootstrap Sample 2
<- sample_n(FloridaLakes , 53, replace=TRUE) %>% arrange(ID) %>%
BootstrapSample2 select(ID, Lake, pH, Mercury)
BootstrapSample2
## # A tibble: 53 × 4
## ID Lake pH Mercury
## <int> <chr> <dbl> <dbl>
## 1 4 Blue Cypress 6.9 0.44
## 2 6 Bryant 7.3 0.27
## 3 6 Bryant 7.3 0.27
## 4 6 Bryant 7.3 0.27
## 5 7 Cherry 5.4 0.48
## 6 8 Crescent 8.1 0.19
## 7 8 Crescent 8.1 0.19
## 8 10 Dias 6.4 0.81
## 9 10 Dias 6.4 0.81
## 10 11 Dorr 5.4 0.71
## # ℹ 43 more rows
<- lm(data=BootstrapSample2, Mercury ~ pH) # fit linear model
Mb2 <- Mb2$coefficients[1] # record value of b0 (first coefficient)
b0 <- Mb2$coefficients[2] # record value of b1 (second coefficient)
b1 +7*b1 #calculate b0+7*b1 b0
## (Intercept)
## 0.4574841
Bootstrap Sample 3
<- sample_n(FloridaLakes , 53, replace=TRUE) %>% arrange(ID) %>%
BootstrapSample3 select(ID, Lake, pH, Mercury)
BootstrapSample3
## # A tibble: 53 × 4
## ID Lake pH Mercury
## <int> <chr> <dbl> <dbl>
## 1 1 Alligator 6.1 1.23
## 2 2 Annie 5.1 1.33
## 3 3 Apopka 9.1 0.04
## 4 3 Apopka 9.1 0.04
## 5 4 Blue Cypress 6.9 0.44
## 6 4 Blue Cypress 6.9 0.44
## 7 4 Blue Cypress 6.9 0.44
## 8 5 Brick 4.6 1.2
## 9 6 Bryant 7.3 0.27
## 10 6 Bryant 7.3 0.27
## # ℹ 43 more rows
<- lm(data=BootstrapSample3, Mercury ~ pH) # fit linear model
Mb3 <- Mb3$coefficients[1] # record value of b0 (first coefficient)
b0 <- Mb3$coefficients[2] # record value of b1 (second coefficient)
b1 +7*b1 #calculate b0+7*b1 b0
## (Intercept)
## 0.4077507
We’ll now take 10,000 different bootstrap samples and record the values of \(b_0\), \(b_1\), which we’ll then use to calculate \(b_0+7b_1\).
<- lm(data=FloridaLakes, Mercury~pH) #fit model to original sample
M <- M$coefficients[1] # record b0 value (second coefficient)
Sample_b0 <- M$coefficients[2] # record b1 value (second coefficient)
Sample_b1 <- Sample_b0 + 7*Sample_b1 # calculate sample expected mercury when pH=7
Sample_Exp7 <- rep(NA, 10000) #vector to store b1 values
Bootstrap_b0 <- rep(NA, 10000) #vector to store b1 values
Bootstrap_b1
for (i in 1:10000){
<- sample_n(FloridaLakes , 53, replace=TRUE) #take bootstrap sample
BootstrapSample <- lm(data=BootstrapSample, Mercury ~ pH) # fit linear model
M <- M$coefficients[1] # record b0
Bootstrap_b0[i] <- M$coefficients[2] # record b1
Bootstrap_b1[i]
}
<- Bootstrap_b0 + 7*Bootstrap_b1 # calcualte expected response for each bootstrap sample
Bootstrap_Exp7
<- data.frame(Bootstrap_b0, Bootstrap_b1, Bootstrap_Exp7) #save results as dataframe Lakes_Bootstrap_Exp7_Results
The bootstrap distribution for the expected mercury level among all lakes with pH level 7, \(b_0+7b_1\), is shown below, along with the standard error for this quantity.
<- ggplot(data=Lakes_Bootstrap_Exp7_Results, aes(x=Bootstrap_Exp7)) +
Lakes_Bootstrap_Plot_Exp7 geom_histogram(color="white", fill="lightblue") +
xlab("Expected Mercury Level in Bootstrap Sample") + ylab("Frequency") +
ggtitle( "Bootstrap Distribution for Exp. Mercury when pH=7")
Lakes_Bootstrap_Plot_Exp7
Standard Error of the expected response \(b_0 + 7b_1\)
<- sd(Lakes_Bootstrap_Exp7_Results$Bootstrap_Exp7)
SE_Exp7 SE_Exp7
## [1] 0.03676639
Again, the bootstrap distribution is symmetric and bell-shaped, so we can use the standard error method to calculate a 95% confidence interval.
\[ \begin{aligned} & b_1 \pm 2\times\text{SE}(b_1) \\ & = 0.4648 \pm 2\times{0.037} \end{aligned} \]
95% Confidence Interval:
c(Sample_Exp7 - 2*SE_Exp7, Sample_Exp7 + 2*SE_Exp7)
## (Intercept) (Intercept)
## 0.3912799 0.5383454
The 95% confidence interval is shown by the gold bar on the graph of the bootstrap distribution below.
+
Lakes_Bootstrap_Plot_Exp7 geom_segment(aes(x=Sample_Exp7 - 2*SE_Exp7,xend=Sample_Exp7 + 2*SE_Exp7, y=50, yend=50),
color="gold", size=10, alpha=0.01)
We are 95% confident that average mercury level among all Florida lakes with pH level 7 is between 0.3912799 and 0.5383454 ppm.
Again, we are not saying that we think an individual like with a pH level of 7 will lie in this range, only that the average mercury level among all such lakes lies in this range.
4.3.8 More CI’s in Regression
We saw in the previous two examples how to calculate a confidence interval for the slope of a regression line, and for an expected response in regression. In fact, we can calculate confidence intervals for any function involving regression coefficients \(\beta_0, \beta_1, \ldots, \beta_p\), in a similar manner.
For example, let’s consider the model for Seattle house prices that involved square feet, whether or not the house was on the waterfront, and an interaction term between these variables.
The model is
\[ \widehat{Price} = b_0 + b_1\times\text{Sq. Ft.} + b_2\times\text{waterfront} + b_3\times\text{Sq.Ft}\times\text{Waterfront} \]
We fit the model and obtain the parameter estimates shown below.
<- lm(data=Houses, price~sqft_living + waterfront +
M :waterfront) #fit model to original sample
sqft_living<- M$coefficients[1] # record b0 value (second coefficient)
Sample_b0 <- M$coefficients[2] # record b1 value (second coefficient)
Sample_b1 <- M$coefficients[3] # record b1 value (second coefficient)
Sample_b2 <- M$coefficients[4] # record b1 value (second coefficient)
Sample_b3 M
##
## Call:
## lm(formula = price ~ sqft_living + waterfront + sqft_living:waterfront,
## data = Houses)
##
## Coefficients:
## (Intercept) sqft_living
## 67.3959 0.2184
## waterfrontYes sqft_living:waterfrontYes
## -364.5950 0.4327
Consider the following quantities that we might be interested in estimating:
- The expected price of a 2,000 square foot waterfront house.
- The expected price of a 1,500 square foot non-waterfront house.
- The difference between the expected price of a house 1,800 square foot house on the waterfront, compared to a house the same size that is not on the waterfront.
- The difference in the rate of change in house prices for each additional 100 square feet for houses on the waterfront, compared to houses not on the waterfront.
Each of these quantities can be expressed as a linear function of our regression coefficients \(b_0, b_1, b_2, b_3\). We just need to find the appropriate function of the \(b_j\)’s, and then calculate a bootstrap confidence interval for that quantity, using the same steps we’ve seen in the previous examples.
Substituting into the regression equation, we see that:
- The expected price of a 2,000 square foot waterfront house is given by \[b_0 + 2000b_1 + b_2 + 2000b_3\]
We calculate this estimate from the model, based on our sample of 100 houses:
2000*Sample_b1 +Sample_b2+2000*Sample_b3 # calculate b0+2000b1+b2+2000b3
## sqft_living
## 937.4777
We estimate that the average price of all 2,000 square foot waterfront houses in Seattle is 937 thousand dollars.
- The expected price of a 1,500 square foot non-waterfront house is given by \[b_0 + 1500b_1\]
+ 1500*Sample_b1 # calculate b0+1500b1+ Sample_b0
## (Intercept)
## 394.9499
We estimate that the average price of all 1,500 square foot non-waterfront houses in Seattle is 395 thousand dollars.
- The difference between the expected price of a house 1,800 square foot house on the waterfront, compared to a house the same size that is not on the waterfront is given by:
\[ \begin{aligned} & (b_0 + 1800b_1 + b_2 + 1800b_3) - (b_0 + 1800b_1) \\ & = b_2 +1800b_3 \end{aligned} \]
+1800*Sample_b3 # calculate b2+1800b3 Sample_b2
## waterfrontYes
## 414.2058
We estimate that on average a 1,800 square foot house on the waterfront will cost 414 thousand dollars more than a 1,800 square foot house not on the waterfront.
- The difference in the rate of change in house prices for each additional 100 square feet for houses on the waterfront, compared to houses not on the waterfront.
This question is asking about the difference in slopes of the regression lines relating price and square feet for houses on the waterfront, compared to those not on the waterfront.
For houses on the waterfront, the regression equation is
\[ \widehat{Price} = (b_0 + b_2) + (b_1 +b_3)\times\text{Sq. Ft.}, \]
so the slope is \(b_1 + b_3\).
For houses not on the waterfront, the regression equation is
\[ \widehat{Price} = b_0 + b_1 \times\text{Sq. Ft.}, \]
so the slope is \(b_1\).
These slope pertain to the expected change in price for each additional 1 square foot. So, for a 100-square foot increase, the price of a waterfront house is expected to increase by \(100(b_1+b_3)\), compared to an increase of \(100b_1\) for a non-waterfront house. Thus, the difference in the rates of change is \(100b_3\).
100*Sample_b3 # calculate 100b3
## sqft_living:waterfrontYes
## 43.26671
We estimate that the price of waterfront houses increases by 43 thousand dollars more for each additional 100 square feet than the price of non-waterfront houses.
These estimates calculated from the sample are statistics, which, like all the other statistics we’ve seen are likely to vary from the true values of the corresponding population parameters, due to variability between samples. We can use bootstrapping to calculate confidence intervals for the relevant population parameters, using these sample statistics (the functions of \(b_j\)’s), just as we’ve done for the other statistics we’ve seen.
Bootstrap Steps
Take a sample of 100 houses by randomly sampling from the original sample, with replacement.
Fit a regression model with location as the explanatory variable and record the values of regression coefficients \(b_0, b_1, b_2, b_3\). Use these to calculate each of the four desired quantities (i.e. \(b_0 + 2000b_1 + b_2 +2000b_3\))
Repeat steps 1 and 2 many (say 10,000) times, keeping track of the regression coefficients and calculating the desired quantities in each bootstrap sample.
Look at the distribution of the quantities of interest, across bootstrap samples. The variability in this bootstrap distribution can be used to approximate the variability in the sampling distribution for each of these quantities.
We’ll illustrate the procedure on 3 bootstrap samples.
Bootstrap Sample 1
We take the first bootstrap sample and fit a model with interaction. For brevity, we won’t list out the houses in each of the bootstrap samples, as the idea should be clear by now. Model coefficients are shown below.
<- sample_n(Houses , 100, replace=TRUE) %>% arrange(Id) %>%
BootstrapSample1 select(Id, price, sqft_living, waterfront)
<- lm(data=BootstrapSample1, price ~ sqft_living + waterfront + sqft_living:waterfront) # fit linear model with interaction
Mb1 <- Mb1$coefficients[1] # record value of b0 (first coefficient)
b0 <- Mb1$coefficients[2] # record value of b1 (second coefficient)
b1 <- Mb1$coefficients[3] # record value of b2 (third coefficient)
b2 <- Mb1$coefficients[4] # record value of b3 (fourth coefficient)
b3 Mb1
##
## Call:
## lm(formula = price ~ sqft_living + waterfront + sqft_living:waterfront,
## data = BootstrapSample1)
##
## Coefficients:
## (Intercept) sqft_living
## 23.1250 0.2416
## waterfrontYes sqft_living:waterfrontYes
## -552.9883 0.5005
We calculate each of the four desired quantities.
+2000*b1 + b2 + 2000*b3 b0
## (Intercept)
## 954.3864
+1500*b1 b0
## (Intercept)
## 385.5317
+1800*b3 b2
## waterfrontYes
## 347.9484
100*b3
## sqft_living:waterfrontYes
## 50.05204
Bootstrap Sample 2
<- sample_n(Houses , 100, replace=TRUE) %>% arrange(Id) %>%
BootstrapSample2 select(Id, price, sqft_living, waterfront)
<- lm(data=BootstrapSample2, price ~ sqft_living + waterfront + sqft_living:waterfront) # fit linear model with interaction
Mb2 <- Mb2$coefficients[1] # record value of b0 (first coefficient)
b0 <- Mb2$coefficients[2] # record value of b1 (second coefficient)
b1 <- Mb2$coefficients[3] # record value of b2 (third coefficient)
b2 <- Mb2$coefficients[4] # record value of b3 (fourth coefficient)
b3 Mb2
##
## Call:
## lm(formula = price ~ sqft_living + waterfront + sqft_living:waterfront,
## data = BootstrapSample2)
##
## Coefficients:
## (Intercept) sqft_living
## 97.8551 0.2133
## waterfrontYes sqft_living:waterfrontYes
## -622.7018 0.5163
We calculate each of the four desired quantities.
+2000*b1 + b2 + 2000*b3 b0
## (Intercept)
## 934.5378
+1500*b1 b0
## (Intercept)
## 417.871
+1800*b3 b2
## waterfrontYes
## 306.7251
100*b3
## sqft_living:waterfrontYes
## 51.63483
Bootstrap Sample 3
<- sample_n(Houses , 100, replace=TRUE) %>% arrange(Id) %>%
BootstrapSample3 select(Id, price, sqft_living, waterfront)
<- lm(data=BootstrapSample3, price ~ sqft_living + waterfront + sqft_living:waterfront) # fit linear model with interaction
Mb3 <- Mb3$coefficients[1] # record value of b0 (first coefficient)
b0 <- Mb3$coefficients[2] # record value of b1 (second coefficient)
b1 <- Mb3$coefficients[3] # record value of b2 (third coefficient)
b2 <- Mb3$coefficients[4] # record value of b3 (fourth coefficient)
b3 Mb3
##
## Call:
## lm(formula = price ~ sqft_living + waterfront + sqft_living:waterfront,
## data = BootstrapSample3)
##
## Coefficients:
## (Intercept) sqft_living
## 15.6004 0.2456
## waterfrontYes sqft_living:waterfrontYes
## -85.2989 0.3362
We calculate each of the four desired quantities.
+2000*b1 + b2 + 2000*b3 b0
## (Intercept)
## 1093.984
+1500*b1 b0
## (Intercept)
## 384.0153
+1800*b3 b2
## waterfrontYes
## 519.9176
100*b3
## sqft_living:waterfrontYes
## 33.62314
We’ll now take 10,000 different bootstrap samples and record the values of \(b_0\), \(b_1\), \(b_3\), and \(b_4\), which we’ll then use to calculate each of our four desired quantities.
<- lm(data=Houses, price~sqft_living + waterfront +
M :waterfront) #fit model to original sample
sqft_living<- M$coefficients[1] # record b0 value (second coefficient)
Sample_b0 <- M$coefficients[2] # record b1 value (second coefficient)
Sample_b1 <- M$coefficients[3] # record b1 value (second coefficient)
Sample_b2 <- M$coefficients[4] # record b1 value (second coefficient)
Sample_b3 <- Sample_b0 + 2000*Sample_b1 +Sample_b2+2000*Sample_b3 # calculate b0+2000b1+b2+2000b3
Sample_Q1 <- Sample_b0 + 1500*Sample_b1 # calculate b0+1500b1+
Sample_Q2 <- Sample_b2+1800*Sample_b3 # calculate b2+1800b3
Sample_Q3 <- 100*Sample_b3 # calculate 100b3
Sample_Q4
<- rep(NA, 10000) #vector to store b0 values
Bootstrap_b0 <- rep(NA, 10000) #vector to store b1 values
Bootstrap_b1 <- rep(NA, 10000) #vector to store b2 values
Bootstrap_b2 <- rep(NA, 10000) #vector to store b3 values
Bootstrap_b3
for (i in 1:10000){
<- sample_n(Houses, 1000, replace=TRUE) #take bootstrap sample
BootstrapSample <- lm(data=BootstrapSample, price ~ sqft_living +
Mb + sqft_living:waterfront) # fit linear model with interaction
waterfront <- Mb$coefficients[1] # record value of b0 (first coefficient)
Bootstrap_b0[i] <- Mb$coefficients[2] # record value of b1 (second coefficient)
Bootstrap_b1[i] <- Mb$coefficients[3] # record value of b2 (third coefficient)
Bootstrap_b2[i] <- Mb$coefficients[4] # record value of b3 (fourth coefficient)
Bootstrap_b3[i]
}
<- Bootstrap_b0 + 2000*Bootstrap_b1 + Bootstrap_b2 + 2000*Bootstrap_b3
Bootstrap_Q1 <- Bootstrap_b0 + 1500*Bootstrap_b1
Bootstrap_Q2 <- Bootstrap_b2 + 1800*Bootstrap_b3
Bootstrap_Q3 <- 100*Bootstrap_b3
Bootstrap_Q4
<- data.frame(Bootstrap_b0, Bootstrap_b1, Bootstrap_b2, Bootstrap_b3, Bootstrap_Q1, Bootstrap_Q2 , Bootstrap_Q3 , Bootstrap_Q4) #save results as dataframe Houses_Bootstrap_Results
Bootstrap Distribution for b_0 + 2000b_1 + b_2 + 2000b_3
<- ggplot(data=Houses_Bootstrap_Results,
Houses_Bootstrap_Plot_Q1 aes(x=Bootstrap_Q1)) +
geom_histogram(color="white", fill="lightblue") +
xlab("Expected Price of 2000 Sq. Ft. Waterfront House") + ylab("Frequency") +
ggtitle( "Bootstrap Distribution b0+2000b1+b2+2000b3")
Houses_Bootstrap_Plot_Q1
Standard Error:
<- sd(Houses_Bootstrap_Results$Bootstrap_Q1)
SE_Q1 SE_Q1
## [1] 40.29367
The bootstrap distribution is symmetric and bell-shaped, so we can use the standard error method to calculate a 95% confidence interval.
95% Confidence Interval:
c(Sample_Q1 - 2*SE_Q1, Sample_Q1 + 2*SE_Q1)
## (Intercept) (Intercept)
## 924.2863 1085.4610
The 95% confidence interval is shown by the gold bar on the graph of the bootstrap distribution below.
+
Houses_Bootstrap_Plot_Q1 geom_segment(aes(x=Sample_Q1 - 2*SE_Q1,xend=Sample_Q1 + 2*SE_Q1, y=50, yend=50),
color="gold", size=10, alpha=0.01)
We are 95% confident that average price among all 2,000 square foot Seattle waterfront houses is between 924.2863407 and 1085.4610253 thousand dollars.
Bootstrap Distribution for b_0 + 1500b_1
<- ggplot(data=Houses_Bootstrap_Results,
Houses_Bootstrap_Plot_Q2 aes(x=Bootstrap_Q2)) +
geom_histogram(color="white", fill="lightblue") +
xlab("Expected Price of 1500 Sq. Ft. Non-Waterfront House") + ylab("Frequency") +
ggtitle( "Bootstrap Distribution b0+1500b1")
Houses_Bootstrap_Plot_Q2
Standard Error:
<- sd(Houses_Bootstrap_Results$Bootstrap_Q2)
SE_Q2 SE_Q2
## [1] 5.859015
The bootstrap distribution is symmetric and bell-shaped, so we can use the standard error method to calculate a 95% confidence interval.
95% Confidence Interval:
c(Sample_Q2 - 2*SE_Q2, Sample_Q2 + 2*SE_Q2)
## (Intercept) (Intercept)
## 383.2318 406.6679
The 95% confidence interval is shown by the gold bar on the graph of the bootstrap distribution below.
+
Houses_Bootstrap_Plot_Q2 geom_segment(aes(x=Sample_Q2 - 2*SE_Q2,xend=Sample_Q2 + 2*SE_Q2, y=50, yend=50),
color="gold", size=10, alpha=0.01)
We are 95% confident that average price among all 1,500 square foot Seattle non-waterfront houses is between 383.2318261 and 406.6678858 thousand dollars.
Bootstrap Distribution for b_2 + 1800b_3
<- ggplot(data=Houses_Bootstrap_Results,
Houses_Bootstrap_Plot_Q3 aes(x=Bootstrap_Q3)) +
geom_histogram(color="white", fill="lightblue") +
xlab("Expected Price Difference WF vs NWF for 1800 sq. Ft. House") + ylab("Frequency") +
ggtitle( "Bootstrap Distribution b2+1800b3")
Houses_Bootstrap_Plot_Q3
Standard Error:
<- sd(Houses_Bootstrap_Results$Bootstrap_Q3)
SE_Q3 SE_Q3
## [1] 40.66194
The bootstrap distribution is symmetric and bell-shaped, so we can use the standard error method to calculate a 95% confidence interval.
95% Confidence Interval:
c(Sample_Q3 - 2*SE_Q3, Sample_Q3 + 2*SE_Q3)
## waterfrontYes waterfrontYes
## 332.8819 495.5297
The 95% confidence interval is shown by the gold bar on the graph of the bootstrap distribution below.
+
Houses_Bootstrap_Plot_Q3 geom_segment(aes(x=Sample_Q3 - 2*SE_Q3,xend=Sample_Q3 + 2*SE_Q3, y=50, yend=50),
color="gold", size=10, alpha=0.01)
We are 95% confident that the average price among all 1800 square feet waterfront houses in Seattle is between 332.8818906 and 495.5296513 thousand dollars more than the average price among all non-waterfront houses of the same size.
Bootstrap Distribution for 100b_3
<- ggplot(data=Houses_Bootstrap_Results,
Houses_Bootstrap_Plot_Q4 aes(x=Bootstrap_Q4)) +
geom_histogram(color="white", fill="lightblue") +
xlab("Expected Difference per 100 square feet") + ylab("Frequency") +
ggtitle( "Bootstrap Distribution 100b3")
Houses_Bootstrap_Plot_Q4
Standard Error:
<- sd(Houses_Bootstrap_Results$Bootstrap_Q4)
SE_Q4 SE_Q4
## [1] 2.224772
The bootstrap distribution is symmetric and bell-shaped, so we can use the standard error method to calculate a 95% confidence interval.
95% Confidence Interval:
c(Sample_Q4 - 2*SE_Q4, Sample_Q4 + 2*SE_Q4)
## sqft_living:waterfrontYes sqft_living:waterfrontYes
## 38.81716 47.71625
The 95% confidence interval is shown by the gold bar on the graph of the bootstrap distribution below.
+
Houses_Bootstrap_Plot_Q4 geom_segment(aes(x=Sample_Q4 - 2*SE_Q4,xend=Sample_Q4 + 2*SE_Q4, y=50, yend=50),
color="gold", size=10, alpha=0.01)
We are 95% confident that for each 100 square foot increase, the average price among all waterfront houses increases by between 38.8171639 and 47.7162531 thousand dollars more than the increase in average price among all non-waterfront.
4.3.9 Bootstrapping Cautions
While bootstrapping is a popular and robust procedure for calculating confidence intervals, it does have cautions and limitations. We should be sure to use the bootstrap procedure appropriate for our context. A standard-error bootstrap interval is appropriate when the sampling distribution for our statistic is roughly symmetric and bell-shaped. When this is not true, a percentile bootstrap interval can be used as long as there are no gaps or breaks in the bootstrap distribution. In situations where there are gaps and breaks in the bootstrap distribution, then the bootstrap distribution may not be a reasonable approximation of the sampling distribution we are interested in.
4.4 Estimating Standard Error
4.4.1 Standard Error vs Standard Deviation
Recall that standard error is the standard deviation of the distribution of a statistic (sample mean, proportion, regression coefficient, etc.). It describes the amount of variability in this statistic between samples of a given size.
This is different than the sample standard deviation, which pertains to the amount of variability between individuals in the sample.
For example, the histogram displays the distribution of mercury levels in our sample of 53 lakes in Florida.
Lakes_Hist
Standard Deviation of Mercury Levels Between Lakes:
sd(FloridaLakes$Mercury)
## [1] 0.3410356
The standard deviation in mercury levels between individual lakes is 0.341 ppm. This describes the amount of variability in mercury levels between individual lakes.
Bootstrap Distribution for Mean Mercury Level (\(n=53\))
+ xlim(c(0,1.5)) Lakes_Bootstrap_Mean_Plot
Standard Error for Mean:
<- sd(Lakes_Bootstrap_Results_Mean$Bootstrap_Mean); SE SE
## [1] 0.04655744
The standard deviation in the distribution for mean mercury levels between different samples of 53 lakes is approximately 0.0465574 ppm. This describes the amount of variability in mean mercury levels between different samples of 53 lakes.
There is less variability in average among samples of size 53, than there is between mercury levels of individual lakes. Although mercury levels might vary quite a bit between individual lakes, in a sample of size 53, the higher and lower levels tend to average out to something in the middle, resulting in less variability associated with the average than with individual lakes.
4.4.2 Sample Size and Standard Error
Question:
Suppose the sample consisted of 10 lakes, or 30 lakes, or 100 lakes, instead of 53, and that the distribution of the lakes in the sample was otherwise similar to that of the original 53. Would you expect the mercury level of individual lakes to increase, decrease, or stay about the same? What about the standard error of the mean mercury level?
The table shows the standard deviation in each of the samples.
Sample_Size | SD |
---|---|
10 | 0.3831797 |
30 | 0.3198103 |
53 | 0.3410356 |
100 | 0.3352758 |
Sample size does not impact the amount of variability between individual lakes. Standard deviation in mercury levels between individual lakes does not systematically increase or decrease based on sample size (of course it varies a little based on the lakes randomly chosen in the sample).
Now, we’ll examine what happens to the standard error of the mean as the sample size changes.
Distributions of Mean Between Different Samples
The table shows the standard error of the mean for samples of different size:
Sample_Size | SE |
---|---|
10 | 0.1166601 |
30 | 0.0579587 |
53 | 0.0465574 |
100 | 0.0331806 |
As sample size increases, variability between means of different samples decreases. Standard error of the mean decreases. This is also true of standard errors for other statistics (i.e. difference in means, regression slopes, etc.)
4.4.3 Standard Error Formulas
So far, we’ve used simulation (permutation tests and bootstrap intervals) to determine the amount of variability associated with a test statistic or estimate, in order to perform hypotheses tests and create confidence intervals. In special situations, there are mathematical formulas, based on probability theory, that can be used to approximate the amount of variability without having to perform the simulations.
These approximations only exist for certain kinds of statistics, and they are only valid when the sampling distribution for the statistic is symmetric and bell-shaped. Thus, they cannot be used in all situations, and we should be careful to check whether they are appropriate before applying them. Nevertheless, when appropriate, they can allow us to bypass the computation required in a simulation.
Theory-Based Standard Error Formulas
Scenario | Standard Error |
---|---|
Single Mean | \(SE(b_0)=\frac{s}{\sqrt{n}}\) |
Difference in Means | \(SE(b_j)=s\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}\) |
Single Proportion | \(SE(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\) |
Difference in Proportions | \(SE(\hat{p}) = \sqrt{\left(\frac{\hat{p_1}(1-\hat{p}_1)}{n_1}+\frac{\hat{p_2}(1-\hat{p_2})}{n_2}\right)}\) |
Intercept in Simple Linear Regression | \(SE(b_0)=s\sqrt{\frac{1}{n}+\frac{\bar{x}^2}{\sum(x_i-\bar{x})^2}}\) |
Slope in Simple Linear Regression | \(SE(b_1)=\sqrt{\frac{s^2}{\sum(x_i-\bar{x})^2}}=\sqrt{\frac{1}{n-2}\frac{{\sum(\hat{y}_i-y_i)^2}}{\sum(x_i-\bar{x})^2}}\) |
\(s=\sqrt{\frac{\displaystyle\sum_{i=1}^n(y_i-\hat{y}_i)^2}{(n-(p+1))}}\), (p is number of regression coefficients not including \(b_0\)) is sample standard deviation. Note that in the one-sample case, this simplifies to the standard deviation formula we’ve seen previously.
In the 2nd formula, the standard error estimate \(s\sqrt{\frac{1}{n_1+n_2}}\) is called a “pooled” estimate since it combines information from all groups. When there is reason to believe standard deviation differs between groups, we often use an “unpooled” standard error estimate of \(\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}\), where \(s_1, s_2\) represent the standard deviation for groups 1 and 2.
There is no theory-based formula for standard error associated with the median or standard deviation. For these, and many other statistics, we rely on simulation to estimate variability between samples.
There are formulas for standard errors associated with coefficients in multiple regression, but these require mathematics beyond what is assumed in this class. They involve linear algebra and matrix inversion, which you can read about here if you are interested.
4.4.4 One-Sample Mean Example
We use the first formula to calculate the standard error of the mean mercury concentration associated with samples of 53 lakes.
\(SE(\bar{x})=\frac{s}{\sqrt{n}} = \frac{0.341}{\sqrt{53}}=0.04684\)
The calculation in R is shown below.
sd(FloridaLakes$Mercury)/sqrt(53)
## [1] 0.04684485
The standard error column in the R summary output is calculated using the theory-based formulas.
<- lm(data=FloridaLakes, Mercury~1)
M summary(M)
##
## Call:
## lm(formula = Mercury ~ 1, data = FloridaLakes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.48717 -0.25717 -0.04717 0.24283 0.80283
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.52717 0.04684 11.25 0.00000000000000151 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.341 on 52 degrees of freedom
Let’s compare this to the standard error we obtained using 10,000 bootstrap simulations.
Lakes_Bootstrap_Mean_Plot
Bootstrap standard error:
<- sd(Lakes_Bootstrap_Results_Mean$Bootstrap_Mean); SE SE
## [1] 0.04655744
The theory-based formula gives a standard error estimate very close to the one we obtained via bootstrap simulation.
4.4.5 Difference in Means Example
We use the probability-based formula to calculate the standard error for difference in means between 33 lakes in North Florida, and 20 lakes in South Florida
\[ SE(\bar{x}_1-\bar{x}_2)=s\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}, \]
<- sqrt(sum(Lakes_M$residuals^2)/(53-2))
s <- s*sqrt(1/20+1/33); SE SE
## [1] 0.08984774
R model summary output:
<- lm(data=FloridaLakes, Mercury~Location)
Lakes_M summary(Lakes_M)
##
## Call:
## lm(formula = Mercury ~ Location, data = FloridaLakes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.65650 -0.23455 -0.08455 0.24350 0.67545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.42455 0.05519 7.692 0.000000000441 ***
## LocationS 0.27195 0.08985 3.027 0.00387 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3171 on 51 degrees of freedom
## Multiple R-squared: 0.1523, Adjusted R-squared: 0.1357
## F-statistic: 9.162 on 1 and 51 DF, p-value: 0.003868
Comparison to Bootstrap standard error:
<- ggplot(data=NS_Lakes_Bootstrap_Results, aes(x=Bootstrap_b1)) +
NS_Lakes_Bootstrap_Plot_b1 geom_histogram(color="white", fill="lightblue") +
xlab("b1 in Bootstrap Sample") + ylab("Frequency") +
ggtitle("Northern vs Southern Lakes: Bootstrap Distribution for b1")
NS_Lakes_Bootstrap_Plot_b1
sd(NS_Lakes_Bootstrap_Results$Bootstrap_b1)
## [1] 0.09620419
The mathematical standard error approximation is close to the one observed in our simulation, though not exact.
4.4.6 Regression Example
We use the theory-based approach to calculate standard error associated with the slope of the regression line relating mercury level and pH in Florida lakes.
\(SE(b_1)=\sqrt{\frac{s^2}{\sum(x_i-\bar{x})^2}}=\sqrt{\frac{1}{n-2}\frac{{\sum(\hat{y}_i-y_i)^2}}{\sum(x_i-\bar{x})^2}}\)
<- lm(data=FloridaLakes, Mercury~pH) # fit model
M <- sum(M$residuals^2)/(53-2) #calculate s^2
s2 <- FloridaLakes$pH
x <- sum((x-mean(x))^2) #calculate denominator
SSx <- sqrt(s2/SSx) # calculate SE
SE SE
## [1] 0.03031326
R model summary output:
<- lm(data=FloridaLakes, Mercury~pH)
Lakes_M_pH summary(Lakes_M_pH)
##
## Call:
## lm(formula = Mercury ~ pH, data = FloridaLakes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.48895 -0.19188 -0.05774 0.09456 0.71134
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.53092 0.20349 7.523 0.000000000814 ***
## pH -0.15230 0.03031 -5.024 0.000006572811 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2816 on 51 degrees of freedom
## Multiple R-squared: 0.3311, Adjusted R-squared: 0.318
## F-statistic: 25.24 on 1 and 51 DF, p-value: 0.000006573
Comparison to bootstrap standard error:
Lakes_Bootstrap_Plot_Slope
<- sd(Lakes_Bootstrap_Slope_Results$Bootstrap_b1)
SE_b1 SE_b1
## [1] 0.02676588
The mathematical standard error approximation is again close to the one observed in our simulation.
4.4.7 Theory-Based Confidence Intervals
If the sampling distribution for a statistic is symmetric and bell-shaped, we can obtain an approximate 95% confidence interval using the formula:
\[ \text{Statistic} \pm 2\times{\text{Standard Error}}, \]
where the standard error is calculated by formula, rather than via bootstrap simulations.
The confint
command in R provides confidence intervals associate with each of the model parameters, which are calculated using the theory-based standard errors
Confidence Intervals for N vs S Florida Lakes
confint(Lakes_M)
## 2.5 % 97.5 %
## (Intercept) 0.31374083 0.5353501
## LocationS 0.09157768 0.4523314
Intercept Interval:
\(\approx 0.42455 \pm 2\times0.05519\)
We are 95% confident that the average mercury level among all lakes in North Florida is between 0.313 and 0.535 ppm.
LocationS Interpretation
\(\approx -0.27195 \pm 2\times 0.08985\)
We are 95% confident that the average mercury level among all lakes in South Florida is between 0.09 and 0.45 ppm higher than the average mercury concentration among all lakes in North Florida.
Note that the use of 2 in the confidence interval formula is an approximation, though one that is reasonable and widely used. Rather than using 2, R uses a value very close to 2, which varies depending on the sample size. Thus R’s calculations will not exactly match the formulas shown, but are similar enough for practical purposes. Remember that in statistics, everything is an approximation.
Confidence Intervals for pH and Mercury in Florida Lakes
confint(Lakes_M_pH)
## 2.5 % 97.5 %
## (Intercept) 1.1223897 1.93944769
## pH -0.2131573 -0.09144445
Intercept Interval:
\(\approx 1.53092 \pm 2\times 0.20349\)
We are 95% confident that the average mercury concentration among all lakes with pH level 0 is between 1.12 and 1.94 ppm. This is not a meaningful interpretation, since none of the lakes in our sample have a pH level anywhere close to 0.
pH Interval
\(\approx -0.15230 \pm 2\times 0.03031\)
We are 95% confident that for each one-unit increase in pH, mercury level is expected to decrease between 0.09 and 0.21 ppm, on average.
4.4.8 CI Method Comparison
We’ve now seen 3 different ways to obtain confidence intervals based on statistics calculated from data.
The table below tells us what must be true of the sampling distribution for a statistic in order to use each technique.
Technique | No Gaps | Bell-Shaped | Known Formula for SE |
---|---|---|---|
Bootstrap Percentile | x | ||
Bootstrap Standard Error | x | x | |
Theory-Based | x | x | x |