Define population, sample, sampling distribution, parameter, and statistic.
Interpret standard deviation and standard error in context.
Explain how to use bootstrapping to calculate confidence intervals.
Interpret confidence intervals in context.
Calculate confidence intervals in R.
Write null and alternative hypotheses, identify test statistics, and state conclusions in context.
Interpret p-values in context.
Explain how to use simulation to perform hypothesis tests.
Compare and contrast the conclusions we can draw from confidence intervals and hypothesis tests.
Perform hypothesis tests in R.
3.1 Sampling Distributions
3.1.1 Population and Sample
In statistics, we often do not have the time, money, or means to collect data on all individuals or units on which we want to draw conclusions. Instead, we might collect data on only a subset of the individuals, and then make inferences about all individuals we are interested in, using the information we collected.
Vocabulary:
A population is the entire set of individuals that we want to draw conclusions about.
A sample is a subset of a population.
A parameter is a numerical quantity pertaining to an entire population or process.
A statistic is a numerical quantity calculated from a sample.
We’ll work with a dataset containing information on all 20,591 flights from New York to Chicago in 2013. Our population of interest is all 20,591 flights. We’re interested in the proportion of flights that arrive on time, and the average arrival delay (in minutes). Arrival delay represents how much earlier or later the flight arrived than scheduled. Whether or not the flight arrived on time is a categorical variable, while arrival delay is a quantitative variable.
In this situation, we have information on the entire population. In statistics, this is rare. It is more common to have information on only a subset of flights, contained in a sample. If the sample is collected in a way that is representative of the population, such as by sampling at random, then we can use the sample to draw conclusions about the population.
We’ll begin by studying the behavior of sample statistics when we know the population parameters, and then use what we learn to handle more real situations where we don’t know about the entire population.
The parameter of interest is the proportion of on-time arrivals out of all 20,591 flights in the population. When the parameter is a proportion, we’ll denote it with the letter \(p\).
The first 10 flights, all of which occurred on January 1, are shown below. We see that most of those flights were not on time.
head(Flights_NY_CHI, 10)
# A tibble: 10 × 9
year month day carrier origin dest sched_dep_time arr_delay ontime
<int> <int> <int> <chr> <chr> <chr> <int> <dbl> <chr>
1 2013 1 1 UA EWR ORD 558 12 N
2 2013 1 1 AA LGA ORD 600 8 N
3 2013 1 1 MQ EWR ORD 600 32 N
4 2013 1 1 AA LGA ORD 630 14 N
5 2013 1 1 AA LGA ORD 700 4 N
6 2013 1 1 UA LGA ORD 700 20 N
7 2013 1 1 UA EWR ORD 713 21 N
8 2013 1 1 AA LGA ORD 745 -12 Y
9 2013 1 1 MQ EWR ORD 710 49 N
10 2013 1 1 B6 JFK ORD 830 15 N
Note that a negative arrival delay denotes a flight that arrived earlier than scheduled, and is thus on time.
The bar graph shows the number of flights that arrived on time throughout the year.
We see that the majority of flights did arrive on time.
We’ll calculate the proportion of flights arriving on time, among all 20,591 flights in the population.
# proportion of flights on time
p <- sum(Flights_NY_CHI$ontime == "Y") / 20591
p
[1] 0.6079841
Here \(p = 0.6079841\). Keep in mind that in a real situation, we typically won’t know the value of the population parameter \(p\), and will need to estimate it from a sample.
The histogram shows the distribution of arrival delay times. Negative delays indicate the flight arriving ahead of schedule.
Delay_plot_POP <- ggplot(data = Flights_NY_CHI, aes(x = arr_delay)) +
  geom_histogram(fill = "blue", binwidth = 5) +
  ggtitle("Distribution of Arrival Delays")
Delay_plot_POP
We see that the distribution of arrival delays is heavily right-skewed. While most flights arrive around the scheduled time, a few were late by 3 or more hours.
We’ll calculate the mean arrival delay.
# mean arrival delay in population
mu <- mean(Flights_NY_CHI$arr_delay)
mu
[1] 7.144772
When the population parameter represents a mean, we’ll denote it using \(\mu\). Here \(\mu=\) 7.144772.
We also calculate the standard deviation of arrival delays.
# standard deviation of arrival delays in population
SD_delay <- sd(Flights_NY_CHI$arr_delay)
SD_delay
[1] 60.10419
3.1.2 Sampling Variability
We typically won’t have data on the full population and won’t know the values of parameters like \(p\) and \(\mu\). Instead, we’ll have data on just a sample taken from the population.
To illustrate, we’ll take a sample of 75 flights. The first 6 flights in the sample are shown below. The ontime variable tells whether or not the flight arrived on time.
# take sample of 75 flights
set.seed(08082023)
S1 <- sample_n(Flights_NY_CHI, 75)
head(S1)
# A tibble: 6 × 9
year month day carrier origin dest sched_dep_time arr_delay ontime
<int> <int> <int> <chr> <chr> <chr> <int> <dbl> <chr>
1 2013 3 8 AA LGA ORD 1720 106 N
2 2013 12 15 AA JFK ORD 1715 -24 Y
3 2013 10 22 UA EWR ORD 1300 -12 Y
4 2013 8 26 UA EWR ORD 2110 15 N
5 2013 7 23 AA LGA ORD 1359 66 N
6 2013 5 13 UA EWR ORD 900 -21 Y
We’ll calculate the number and proportion of flights that arrived on time.
num_ontime <- sum(S1$ontime == "Y")  # count number of on-time arrivals
num_ontime
[1] 39
Proportion of on-time arrivals in the sample.
# proportion of on-time flights in sample
p_hat <- num_ontime / 75
p_hat
[1] 0.52
When the sample statistic is a proportion, it is commonly denoted \(\hat{p}\).
In our sample, \(\hat{p} = 0.52\); that is, 52% of flights arrived on time. The sample statistic \(\hat{p}\) is an estimate of the population proportion \(p\), the proportion of all flights arriving on time.
We also calculate the mean arrival delay in minutes.
# mean arrival delay in sample
y_bar <- mean(S1$arr_delay)
y_bar
[1] 19.2
We’ll denote this sample mean \(\bar{y}\). It is an estimate of the population mean, representing the mean arrival delay for all flights, which we’ll denote \(\mu\).
Of course, this was just one sample of 75 flights. If we took different samples of 75 flights, we would expect the statistics \(\hat{p}\) and \(\bar{y}\) to vary from sample to sample.
Here’s a different sample of 75 flights.
S2 <- sample_n(Flights_NY_CHI, 75)
Proportion arriving on time in second sample:
# proportion arriving on time in second sample
num_ontime2 <- sum(S2$ontime == "Y")  # count number of on-time arrivals
p_hat2 <- num_ontime2 / 75
p_hat2
[1] 0.5066667
Mean arrival delay in second sample:
# mean arrival delay in second sample
y_bar2 <- mean(S2$arr_delay)
y_bar2
[1] 15.28
Sample statistics will vary from sample to sample, thus it is not realistic to expect them to exactly match their corresponding population parameters.
Nevertheless we can use the sample to estimate the proportion of all flights in the population that arrive on time.
Let’s take 10,000 more random samples of 75 flights and record the proportion of on-time arrivals in each sample.
nreps <- 10000               # number of repetitions
p_hat_val <- rep(NA, nreps)  # create vector to hold proportion of on-time arrivals
y_bar_val <- rep(NA, nreps)  # create vector to hold mean arrival delay
Sample <- 1:nreps
for (i in 1:nreps) {
  S <- sample_n(Flights_NY_CHI, 75)   # take sample of 75
  N_ontime <- sum(S$ontime == "Y")    # count number of on-time arrivals
  p_hat_val[i] <- N_ontime / 75       # record proportion on-time
  y_bar_val[i] <- mean(S$arr_delay)   # record mean arrival delay
}
Samples_df <- data.frame(Sample, p_hat_val, y_bar_val)  # store results in a data frame
The table shows the proportion of on-time arrivals in the first 20 samples of 75 flights.
kable(head(Samples_df, 20) |> round(2))
Sample p_hat_val y_bar_val
1 0.65 6.16
2 0.53 10.75
3 0.59 12.41
4 0.69 2.24
5 0.59 7.33
6 0.67 1.28
7 0.61 3.25
8 0.59 8.93
9 0.65 -3.16
10 0.65 5.73
11 0.61 12.32
12 0.59 8.79
13 0.61 5.53
14 0.48 10.55
15 0.60 5.40
16 0.44 24.11
17 0.49 15.93
18 0.57 4.45
19 0.64 6.76
20 0.56 3.65
The histogram below shows the distribution of the proportion of on-time arrivals in the 10,000 different samples.
Prop_Samp_Dist <- ggplot(data = Samples_df, aes(x = p_hat_val)) +
  geom_histogram(color = "blue", fill = "blue", binwidth = 0.001) +
  ggtitle("Sampling Distribution for Proportion On Time") +
  xlab("Prop. on time in sample")
Prop_Samp_Dist
We notice that most of our 10,000 samples yielded proportions of on-time arrivals between 0.5 and 0.7. The distribution of the proportion of on-time arrivals is roughly symmetric and bell-shaped.
The distribution shown in this histogram is called the sampling distribution for \(\hat{p}\). The sampling distribution for a statistic shows the distribution of the statistic over many samples.
We’ll calculate the mean of the sampling distribution for \(\hat{p}\). How does it compare to the true population parameter \(p\)?
Mean_p_hat <- mean(Samples_df$p_hat_val)
Mean_p_hat
[1] 0.60848
We can gauge how much the proportion of on-time arrivals varies between samples by calculating the standard deviation of this sampling distribution. The standard deviation of a sampling distribution for a statistic is also called the standard error of the statistic. In this case it represents the standard error of \(\hat{p}\) (the proportion of on-time arrivals), and is denoted \(\text{SE}(\hat{p})\). This standard error is calculated below.
SE_p_hat <- sd(Samples_df$p_hat_val)
SE_p_hat
[1] 0.05659102
Now, we’ll examine the sampling distribution of the mean arrival delay \(\bar{y}\).
Mean_Samp_Dist <- ggplot(data = Samples_df, aes(x = y_bar_val)) +
  geom_histogram(color = "white", fill = "blue", binwidth = 0.5) +
  ggtitle("Sampling Distribution for Mean Arrival Delay") +
  xlab("Mean Arrival Delay")
Mean_Samp_Dist
How does the sampling distribution for mean arrival delays compare to the distribution of arrival delays for individual flights? Think about the shape and the variability of the distributions.
The standard error of the mean, \(SE(\bar{y})\), is calculated below.
SE_y_bar <- sd(Samples_df$y_bar_val)
SE_y_bar
[1] 5.563925
What does this standard error represent? How is it different from the standard deviation of arrival delays, which we previously saw was 60.1 minutes?
Vocabulary:
The sampling distribution of a statistic is the distribution of values the statistic takes on across many different samples of a given size.
The standard error of a statistic is the standard deviation of that statistic’s sampling distribution. It measures how much the statistic varies between different samples of a given size.
3.1.3 Sample Size and Standard Error
Question:
Suppose the sample consisted of 10, 30, or 200 flights, instead of 75. Would you expect the standard deviation of individual flight delays to increase, decrease, or stay about the same? What about the standard error of the mean delay?
The histogram shows the distribution of individual flight delays in random samples of each size.
For each sample, most of the flights have delays slightly below or above 0, though a small percentage of the flights in each sample have much larger delays. The variability in delays is about the same, regardless of sample size.
The table shows the standard deviation in each of the samples.
Sample_Size SD
10 45.38673
30 40.64390
75 43.40787
200 48.01117
Sample size does not impact the amount of variability between individual flights. The standard deviation in delay times does not systematically increase or decrease based on sample size (of course, it varies a little based on the flights randomly chosen in the sample).
Now, we’ll examine what happens to the standard error of the mean as the sample size changes.
Distributions of Mean Between Different Samples
Notice that as the sample size increases, the sampling distribution of the mean becomes more symmetric and bell-shaped, and also more concentrated around the population mean \(\mu\).
The table shows the standard error of the mean for samples of different size:
Sample_Size SE
10 15.027787
30 8.832375
75 5.461011
200 3.371757
As sample size increases, variability between the means of different samples decreases; that is, the standard error of the mean decreases. The same is true of standard errors for other statistics (e.g., differences in means, regression slopes, etc.).
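The pattern in this table can be reproduced with a short simulation. Here is a sketch, assuming the Flights_NY_CHI data frame and the dplyr package are loaded; the number of repetitions (1,000) is illustrative:

```r
# estimate SE of the mean arrival delay for several sample sizes
for (n in c(10, 30, 75, 200)) {
  y_bars <- rep(NA, 1000)                # sample means from 1000 repetitions
  for (i in 1:1000) {
    S <- sample_n(Flights_NY_CHI, n)     # random sample of n flights
    y_bars[i] <- mean(S$arr_delay)       # record the sample mean
  }
  # SE of the mean is the SD of the sample means across repeated samples
  cat("n =", n, " SE(y_bar) =", sd(y_bars), "\n")
}
```

The printed standard errors should shrink as n grows, matching the table above.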
3.2 Confidence Intervals
3.2.1 Constructing Confidence Intervals
We saw that while statistics calculated from individual samples deviate from population parameters, over many samples, they approximately average to the population parameter (assuming the samples are chosen randomly).
Thus, when we have only a single sample, we can use the sample statistic as an estimate of the population parameter, provided we allow for a certain margin of error. The question is how much margin of error do we need?
The sampling distribution for the proportion of on-time flights is shown again below. The true proportion of on-time flights (\(p=0.607984\)) is marked by the green dotted line. The gold bar at the bottom of the histogram represents the range of sample proportions that lie within \(\pm 2\) standard errors of the true population proportion of flights that arrived on time:
We calculate the proportion of samples whose proportion of on-time arrivals lies within \(\pm 2\) standard errors of the true proportion.
Lower <- p - 2 * SE_p_hat
Upper <- p + 2 * SE_p_hat
sum((Samples_df$p_hat_val >= Lower) & (Samples_df$p_hat_val <= Upper))
[1] 9539
Approximately 95% of the 10,000 samples produced proportions within \(\pm 2\) standard errors of the true population proportion of on-time flights.
In a real situation, we won’t have access to the entire population of flights, only the flights in a single sample. For example, recall our original sample of 75 flights, in which we observed a proportion of on-time arrivals of \(\hat{p}=\) 0.52.
Since we now know that 95% of all samples produce proportions that lie within two standard errors of the population proportion, we can obtain an estimate of the population proportion \(p\) by adding and subtracting \(2\times \text{SE}(\hat{p})\) from our observed sample proportion \(\hat{p}\).
Using probability theory, it can be shown generally that if the sampling distribution of a statistic is symmetric and bell shaped, then approximately 95% of all samples will produce sample statistics that lie within two standard errors of the corresponding population parameter. Such an interval is called an approximate 95% confidence interval for the population parameter.
Approximate 95% confidence interval: If the sampling distribution of a statistic is symmetric and bell-shaped, an approximate 95% confidence interval for the population parameter is:

\[\text{Sample Statistic} \pm 2\times \text{SE}(\text{Sample Statistic})\]
More generally, if we want to use a level of confidence that is different than 95%, we can adjust the value we multiply the standard error by. In general, a standard error confidence interval has the form:

\[\text{Sample Statistic} \pm m\times \text{SE}(\text{Sample Statistic})\]

where the value of \(m\) depends on the desired level of confidence.
Confidence intervals that are calculated by adding and subtracting a certain number of standard errors from the sample statistic are called standard error confidence intervals. This approach works as long as the sampling distribution is symmetric and bell-shaped. Probability theory tells us that in a symmetric and bell-shaped distribution, approximately 95% of the area lies within two standard errors of the center of the distribution, given by the true parameter value. We will, however, see that this approach will not work in all cases. Not all statistics produce sampling distributions that are symmetric and bell-shaped, and we will need an alternative way to calculate confidence intervals in these situations.
3.2.1.1 Example: 95% Confidence Interval for \(p\)
We’ll calculate a 95% confidence interval for the proportion of on-time flights, using our original sample where \(\hat{p}\) = 0.52. The 95% confidence interval is:

\[\hat{p} \pm 2\times\text{SE}(\hat{p}) = 0.52 \pm 2(0.0566) = (0.407, 0.633)\]
Based on our sample of 75 flights, we can be 95% confident that the true proportion of on-time arrivals among all 2013 flights from New York to Chicago is between 0.407 and 0.633.
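The interval can also be computed directly in R, using the sample proportion and the standard error obtained earlier (rounded here for readability):

```r
# approximate 95% CI for p: sample statistic plus/minus 2 standard errors
p_hat <- 0.52       # sample proportion on time
SE_p_hat <- 0.0566  # standard error of p_hat (rounded)
c(p_hat - 2 * SE_p_hat, p_hat + 2 * SE_p_hat)  # 0.4068 0.6332
```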
3.2.1.2 Example: 95% Confidence Interval for \(\mu\)
Likewise, we calculate a 95% confidence interval for average arrival delay using the formula:

\[\bar{y} \pm 2\times\text{SE}(\bar{y}) = 19.2 \pm 2(5.56) = (8.1, 30.3)\]
Based on our sample of 75 flights, we can be 95% confident that the mean arrival delay among all 2013 flights from New York to Chicago is between 8.1 and 30.3 minutes.
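The same calculation in R, using the sample mean and standard error obtained earlier (rounded here for readability):

```r
# approximate 95% CI for mu: y_bar plus/minus 2 standard errors
y_bar <- 19.2     # sample mean arrival delay
SE_y_bar <- 5.56  # standard error of y_bar (rounded)
c(y_bar - 2 * SE_y_bar, y_bar + 2 * SE_y_bar)  # 8.08 30.32
```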
Note that this is a statement about what we think is true of the mean arrival delay over all flights, not the delay of an individual flight. It would be incorrect to say that we are 95% confident that an individual flight would have a delay in this interval. You might think about whether an interval for the delay time of an individual flight should be wider or narrower than this. We’ll talk about such an interval later in the term.
3.2.2 What does 95% Confidence Mean?
Knowing what we do about the true value of the population parameters \(p\) and \(\mu\), we can see that our interval for \(p\), which was (0.407, 0.633) does indeed contain the true population value of \(p=\) 0.6079841. However, the interval for \(\mu\), which was (8.1, 30.3) does not contain the true value of \(\mu=\) 7.144772.
Does this mean we did something wrong when we calculated the interval for \(\mu\), the average flight delay among all flights in the population? The answer is “no”. Notice we claimed to be only “95%” confident that our interval contains the true value of the population parameter \(\mu\). This means that we should expect 5% of samples taken randomly to yield a sample mean \(\bar{y}\) so different from the population mean \(\mu\), that the resulting confidence interval would not contain the true value of \(\mu\). This does not mean we did anything wrong, just that we obtained an unusual sample just by chance. Since our procedure, namely adding and subtracting two standard errors, is designed to work 95% of the time, we can expect such samples to be rare.
In a real situation, we won’t know the true value of the population parameter, so we won’t know for sure whether or not our confidence interval contains this true parameter value.
To further understand the meaning of “95% confidence”, let’s explore what happens when we calculate confidence intervals based on estimates \(\bar{y}\) obtained from many different samples. For each of our 10,000 different samples taken from our population, we’ll add and subtract two standard errors from the sample proportion \(\hat{p}\) corresponding to that sample.
The table below displays the value of \(\hat{p}\), for the first 20 samples we took, along with the lower and upper bounds of the confidence interval, and whether or not the confidence interval contains the true parameter value \(p\) (either 1=TRUE or 0=FALSE).
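One way such a table could be built from the Samples_df data frame created earlier is sketched below, assuming dplyr and knitr (for kable()) are loaded; the name Samples_df_p and the columns Lower, Upper, and Contains match those used in the code further on, and SE_p_hat and p are the standard error and population proportion computed above:

```r
# for each sample, compute the CI bounds and check whether the interval contains p
Samples_df_p <- Samples_df |>
  mutate(Lower = p_hat_val - 2 * SE_p_hat,                # lower bound of CI
         Upper = p_hat_val + 2 * SE_p_hat,                # upper bound of CI
         Contains = as.numeric(Lower <= p & p <= Upper))  # 1 = TRUE, 0 = FALSE
kable(head(Samples_df_p, 20) |> round(2))
```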
The graphic below visualizes the confidence intervals produced using the estimates from the first 100 samples. The green dotted line indicates the true value of \(p\). The black dots indicate the value of \(\hat{p}\) for each sample. Intervals that do in fact contain the true value of \(p\) are shown in blue, and intervals that do not contain the true value of \(p\) are shown in red.
Out of these 100 samples, 93 contain the true value of the population parameter \(p\). This is close to the desired 95% confidence level.
The picture shows confidence intervals produced by the first 100 samples, but we actually took 10,000 different samples of 75 flights. Let’s calculate how many of these samples produced confidence intervals that contain the true value of \(p\).
sum(Samples_df_p$Contains == TRUE)
[1] 9539
Again, notice that close to 95% of the samples produced confidence intervals that contain the true population parameter \(p\). Note that for the red intervals that do not contain \(p\), nothing was done incorrectly. The sample was taken at random, and the confidence interval was calculated using the correct formula. It just happened that, by chance, we obtained a sample proportion \(\hat{p}\) that was unusually high or low, leading to an interval that did not capture the true population parameter. This, of course, happens rarely, and approximately 95% of the samples do, in fact, result in intervals that contain the true value of \(p\).
The graphic below visualizes the confidence intervals for the mean \(\mu\) produced using the estimates from the first 100 samples. The green dotted line indicates the true value of \(\mu\). The black dots indicate the value of \(\bar{y}\) for each sample. Intervals that do in fact contain the true value of \(\mu\) are shown in blue, and intervals that do not are shown in red.
Out of these 100 samples, 92 contain the true value of the population parameter \(\mu\).
Out of all 10,000 samples, the proportion containing the true population value of \(\mu\) is:
sum(Samples_df_mu$Containsmu == TRUE)
[1] 9570
This brings us back to the question “what does 95% confidence mean?”. An approximate 95% confidence interval means that if we take a large number of samples and calculate confidence intervals from each of them, then approximately 95% of the samples will produce intervals containing the true population parameter. In reality, we’ll only have one sample, and won’t know whether or not our interval contains the true parameter value. Assuming we have taken the sample and calculated the interval correctly, we can rest assured in the knowledge that 95% of all intervals would contain the true parameter value, and hope that ours is among that 95%.
It might be tempting to say that “there is approximately a 95% chance” that the population parameter lies within the confidence interval, but this is incorrect. In the statistical framework used here (known as classical, or frequentist statistics), the population parameter is assumed to be a fixed, but (typically) unknown number. It either is within the interval, or it isn’t. We just (typically) don’t know which. There’s nothing random about whether or not the parameter value is in our interval, so it doesn’t make sense to speak of it in terms of chance or randomness. Randomness comes into play due to the fact that we selected a random sample, which will produce a statistic likely to differ from the population parameter due to sampling variability. A different statistical framework, known as Bayesian statistics approaches this differently, and would allow us to use randomness and chance to describe our beliefs about any uncertain quantity, including a population proportion. In this class, however, we’ll stick to the classical frequentist interpretation.
Of course, you might ask why we needed to calculate a confidence interval for the proportion of on-time flights in the first place, since we actually had data on all 20,591 flights in the population and already knew the true proportion of on-time arrivals and mean arrival delay. The answer is that we didn’t need to. But in most real situations, we will only have data from a single sample, not the entire population, and we won’t know the true population parameter. We’ll build on the ideas of sampling distributions and standard error that we learned about in this section to calculate confidence intervals in those scenarios.
3.3 Bootstrapping
3.3.1 Mercury Concentration in Florida Lakes
A 2004 study by Lange, T., Royals, H. and Connor, L. examined Mercury accumulation in large-mouth bass, taken from a sample of 53 Florida Lakes. If Mercury accumulation exceeds 0.5 ppm, then there are environmental concerns. In fact, the legal safety limit in Canada is 0.5 ppm, although it is 1 ppm in the United States.
In our sample, we have data on 53 lakes, out of more than 30,000 lakes in the state of Florida. We’ll attempt to draw conclusions about the entire population, consisting of all lakes in Florida, using data from our sample of 53. It is not clear how the lakes in this sample of 53 were selected, or how representative they are of all lakes in the state of Florida. Let’s assume for our purposes that the lakes in the sample can be reasonably thought of as being representative of all lakes in Florida.
The mercury levels (in ppm) of the 53 lakes in the sample are shown in the table below.
Lake Location Mercury
Alligator S 1.23
Annie S 1.33
Apopka N 0.04
Blue Cypress S 0.44
Brick S 1.20
Bryant N 0.27
Cherry N 0.48
Crescent N 0.19
Deer Point N 0.83
Dias N 0.81
Dorr N 0.71
Down S 0.50
Eaton N 0.49
East Tohopekaliga S 1.16
Farm-13 N 0.05
George N 0.15
Griffin N 0.19
Harney N 0.77
Hart S 1.08
Hatchineha S 0.98
Iamonia N 0.63
Istokpoga S 0.56
Jackson N 0.41
Josephine S 0.73
Kingsley N 0.34
Kissimmee S 0.59
Lochloosa N 0.34
Louisa S 0.84
Miccasukee N 0.50
Minneola N 0.34
Monroe N 0.28
Newmans N 0.34
Ocean Pond N 0.87
Ocheese Pond N 0.56
Okeechobee S 0.17
Orange N 0.18
Panasoffkee N 0.19
Parker S 0.04
Placid S 0.49
Puzzle N 1.10
Rodman N 0.16
Rousseau N 0.10
Sampson N 0.48
Shipp S 0.21
Talquin N 0.86
Tarpon S 0.52
Tohopekaliga S 0.65
Trafford S 0.27
Trout S 0.94
Tsala Apopka N 0.40
Weir N 0.43
Wildcat N 0.25
Yale N 0.27
The histogram shows the distribution of mercury levels in the 53 lakes in the sample. Lakes exceeding the US standard of 1 ppm are shown in red.
Lakes_Hist <- ggplot(data = FloridaLakes, aes(x = Mercury)) +
  geom_histogram(aes(fill = Mercury <= 1), color = "white", binwidth = 0.1) +
  ggtitle("Mercury Levels in Sample of 53 Florida Lakes") +
  xlab("Mercury Level") + ylab("Frequency") +
  theme_bw()
Lakes_Hist
The proportion of lakes with mercury levels exceeding 1 ppm is calculated below.
p_hat <- sum(FloridaLakes$Mercury > 1) / 53
p_hat
[1] 0.1132075
We see that in our sample of 53 lakes, approximately 11% have mercury levels exceeding the US standard of 1 ppm. Suppose we want to estimate the proportion of all Florida Lakes whose mercury level exceeds this standard. As we saw in the previous section, we would not expect the population proportion to exactly match the sample, due to random variability between samples. We can use the sample proportion as an estimate (\(\hat{p} = 0.1132\)), and construct a confidence interval for the unknown population proportion \(p\).
In order to construct the confidence interval, we need to know how much the sample proportion of lakes exceeding 1 ppm, \(\hat{p}\), could vary between different samples of size 53. That is, we need to know the standard error of \(\hat{p}\). In the previous section, we calculated the standard error by taking 10,000 different samples of the same size as ours from the population, calculating the proportion for each sample, and then calculating the standard deviation of the proportions obtained from these 10,000 different samples. This procedure will not work here, however, because unlike the previous example, where we really did have data on the entire population of all flights from New York to Chicago, we do not have data on all 30,000+ lakes in Florida. We cannot take many different samples of size 53 from the population of all lakes, and thus cannot obtain the sampling distribution for the proportion of lakes exceeding 1 ppm, or estimate the standard error of \(\hat{p}\).
3.3.2 Bootstrap Sampling
All we have is a single sample of 53 lakes. We need to figure out how much the proportion of lakes with mercury levels exceeding 1 ppm would vary between different samples of size 53, using only the information contained in our one sample.
To do this, we’ll implement a popular simulation-based strategy, known as bootstrapping.
Let’s assume our sample is representative of all Florida lakes. Then, we’ll duplicate the sample many times to create a large set that will look like the population of all Florida lakes. We can then draw samples of 53 from that large population, and record the proportion of lakes with mercury levels exceeding 1 ppm in each sample of 53.
An illustration of the bootstrapping procedure is shown below, using a sample of 12 colored dots, instead of the 53 lakes.
In fact, duplicating the sample many times and selecting new samples of size \(n\) has the same effect as drawing samples of size \(n\) from the original sample, by putting the item drawn back in each time, a procedure called sampling with replacement. Thus, we can skip the step of copying/pasting the sample many times, and instead draw our samples with replacement.
This means that in each new sample, some lakes will be drawn multiple times and others not at all. It also ensures that each sample is different, allowing us to estimate the variability in the sample statistic between different samples of size 53.
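As a tiny illustration of sampling with replacement, using a toy vector of five letters rather than the lakes data (the seed is illustrative):

```r
set.seed(1)  # illustrative seed, for reproducibility
x <- c("A", "B", "C", "D", "E")
sample(x, size = 5, replace = TRUE)  # letters may repeat, or be missing entirely
```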
An illustration of the concept of bootstrapping, using sampling with replacement is shown below.
The variability in the statistic across our newly drawn samples is used to approximate the variability in the proportion \(\hat{p}\) that would occur between different samples of 53 lakes drawn from the population of all Florida lakes.
The point of bootstrapping is to observe how much a statistic (in this case the proportion of lakes with Mercury levels exceeding 1 ppm) varies between bootstrap samples. This can act as an estimate of how much that statistic would vary between different samples of size \(n\), drawn from the population.
The steps of bootstrap sampling can be summarized in the following algorithm.
Bootstrap Algorithm
For an original sample of size \(n\):
Take a sample of size \(n\) by randomly sampling from the original, with replacement. Thus, some observations will show up multiple times, and others not at all. This sample is called a bootstrap sample.
Calculate the statistic of interest in the bootstrap sample (in this case \(\hat{p}\), the proportion of lakes whose mercury levels exceed 1 ppm).
Repeat steps 1 and 2 many (say 10,000) times, keeping track of the statistic of interest that is calculated in each bootstrap sample.
Look at the distribution of the statistic across bootstrap samples. The variability in this bootstrap distribution can be used to approximate the variability in the sampling distribution for the statistic of interest.
3.3.3 Bootstrap Samples of Lakes
The sample_n() function samples the specified number of rows from a data frame, with or without replacement.
The lakes in the first sample are shown below. Notice that some lakes occur multiple times, and others not at all.
Bootstrap Sample 1
BootstrapSample1 <- sample_n(FloridaLakes, 53, replace = TRUE) %>% arrange(Lake)
BootstrapSample1 %>% select(ID, Lake, Mercury) |> kable()
ID Lake Mercury
3 Apopka 0.04
4 Blue Cypress 0.44
4 Blue Cypress 0.44
5 Brick 1.20
5 Brick 1.20
5 Brick 1.20
6 Bryant 0.27
6 Bryant 0.27
7 Cherry 0.48
10 Dias 0.81
10 Dias 0.81
10 Dias 0.81
13 Eaton 0.49
17 Griffin 0.19
18 Harney 0.77
18 Harney 0.77
19 Hart 1.08
20 Hatchineha 0.98
22 Istokpoga 0.56
22 Istokpoga 0.56
23 Jackson 0.41
25 Kingsley 0.34
25 Kingsley 0.34
26 Kissimmee 0.59
27 Lochloosa 0.34
30 Minneola 0.34
32 Newmans 0.34
33 Ocean Pond 0.87
34 Ocheese Pond 0.56
35 Okeechobee 0.17
37 Panasoffkee 0.19
37 Panasoffkee 0.19
38 Parker 0.04
38 Parker 0.04
40 Puzzle 1.10
40 Puzzle 1.10
40 Puzzle 1.10
41 Rodman 0.16
41 Rodman 0.16
42 Rousseau 0.10
43 Sampson 0.48
43 Sampson 0.48
44 Shipp 0.21
44 Shipp 0.21
45 Talquin 0.86
51 Tohopekaliga 0.65
51 Tohopekaliga 0.65
51 Tohopekaliga 0.65
48 Trout 0.94
48 Trout 0.94
52 Wildcat 0.25
52 Wildcat 0.25
53 Yale 0.27
We calculate the proportion of lakes with mercury levels exceeding 1 ppm in this bootstrap sample. Note that if a lake shows up more than once in the bootstrap sample, then it is counted however many times it shows up.
sum(BootstrapSample1$Mercury > 1) / 53
[1] 0.1320755
Bootstrap Sample #2
We take a second bootstrap sample. We display only the first 10 lakes, though the bootstrap sample still has 53 lakes.
Notice that the lakes chosen and omitted differ from the first sample.
BootstrapSample2 <- sample_n(FloridaLakes, 53, replace = TRUE) %>% arrange(Lake)
BootstrapSample2 %>% select(ID, Lake, Mercury)
# A tibble: 53 × 3
ID Lake Mercury
<int> <chr> <dbl>
1 1 Alligator 1.23
2 3 Apopka 0.04
3 4 Blue Cypress 0.44
4 5 Brick 1.2
5 6 Bryant 0.27
6 7 Cherry 0.48
7 9 Deer Point 0.83
8 10 Dias 0.81
9 10 Dias 0.81
10 11 Dorr 0.71
# ℹ 43 more rows
Proportion exceeding 1 ppm:
sum(BootstrapSample2$Mercury > 1) / 53
[1] 0.0754717
Bootstrap Sample #3
We’ll take one more bootstrap sample and calculate the proportion of lakes with mercury levels exceeding 1 ppm. The first 10 lakes in this third sample are shown.
BootstrapSample3 <- sample_n(FloridaLakes, 53, replace=TRUE) %>% arrange(Lake)
BootstrapSample3 %>% select(ID, Lake, Mercury)
# A tibble: 53 × 3
ID Lake Mercury
<int> <chr> <dbl>
1 1 Alligator 1.23
2 3 Apopka 0.04
3 3 Apopka 0.04
4 5 Brick 1.2
5 5 Brick 1.2
6 7 Cherry 0.48
7 7 Cherry 0.48
8 8 Crescent 0.19
9 9 Deer Point 0.83
10 10 Dias 0.81
# ℹ 43 more rows
Proportion exceeding 1 ppm:
sum(BootstrapSample3$Mercury > 1) / 53
[1] 0.1132075
3.3.4 Bootstrap Distribution
Now that we have seen how bootstrap sampling works, we’ll take a large number (10,000) of different bootstrap samples and examine how the proportion of lakes with mercury levels exceeding 1 ppm varies between samples.
We’ll use a for-loop to take many different bootstrap samples and record the observed proportion in a vector called Bootstrap_prop.
p_hat <- sum(FloridaLakes$Mercury > 1)/53  # calculate sample statistic
Bootstrap_prop <- rep(NA, 10000)  # setup vector to hold bootstrap statistics
for (i in 1:10000){
BootstrapSample <- sample_n(FloridaLakes, 53, replace=TRUE)  # take bootstrap sample
Bootstrap_prop[i] <- sum(BootstrapSample$Mercury > 1)/53  # calc. prop exceeding 1
}
Lakes_Bootstrap_Prop <- data.frame(Bootstrap_prop)  # store values in a dataframe
The distribution of proportions observed in the 10,000 different bootstrap samples is shown below. This distribution is called the bootstrap distribution.
Lakes_Bootstrap_Prop_plot <- ggplot(data=Lakes_Bootstrap_Prop, aes(x=Bootstrap_prop)) +
  geom_histogram(color="white", fill="lightblue", binwidth=0.02) +
  xlab("Prop > 1 in Bootstrap Sample") + ylab("Frequency") +
  ggtitle("Bootstrap Distribution for Prop. of Lakes Exceeding 1 ppm Hg") +
  theme(legend.position = "none")
Lakes_Bootstrap_Prop_plot
The bootstrap distribution is meant to approximate the sampling distribution of the statistic of interest (in this case the proportion exceeding 1 ppm). Because it is based on the sample, the bootstrap distribution will be centered at the sample statistic (\(\hat{p}\) in this case), while the sampling distribution would have been centered at the population parameter (\(p\)), which is unknown. The important thing, however, is that the variability in the bootstrap distribution gives a good approximation of the amount of variability in the sampling distribution, so we can use the standard deviation of the bootstrap distribution (called the bootstrap standard error) in our confidence interval calculation.
3.3.5 Bootstrap SE Confidence Interval
We calculate the standard deviation of this bootstrap distribution, which is an estimate of the standard error of \(\hat{p}\). It measures how much the proportion of lakes exceeding 1 ppm varies between samples of size 53.
Since the bootstrap distribution is roughly symmetric and bell-shaped, we can calculate a 95% confidence interval for the proportion of all Florida lakes with mercury levels exceeding 1 ppm, using the bootstrap standard error confidence interval method.
We are 95% confident that the proportion of all Florida lakes with mercury levels exceeding 1 ppm is between 0.0268 and 0.1996.
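The calculation behind a bootstrap standard error interval can be sketched as follows. The mercury values here are simulated stand-ins for the real FloridaLakes data, so the resulting endpoints are illustrative only.

```r
set.seed(2)
Mercury <- round(runif(53, 0, 1.4), 2)  # hypothetical stand-in for FloridaLakes$Mercury

p_hat <- sum(Mercury > 1)/53  # sample proportion exceeding 1 ppm
Bootstrap_prop <- replicate(10000, sum(sample(Mercury, 53, replace = TRUE) > 1)/53)

SE <- sd(Bootstrap_prop)       # bootstrap standard error
c(p_hat - 2*SE, p_hat + 2*SE)  # standard error interval: statistic +/- 2*SE
```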
3.4 More Bootstrap Examples
3.4.1 Bootstrapping Other Statistics
We’ve seen how to use bootstrapping to calculate confidence intervals for an unknown population parameter \(p\), using an estimate \(\hat{p}\), calculated from a sample of size \(n\). This procedure can be applied to calculate confidence intervals for a wide range of population parameters, using statistics calculated from a sample.
For example, we could calculate confidence intervals for any of the following parameters, using the corresponding sample statistic.
We follow the same algorithm as we did when working with a proportion; in step 2, we simply calculate whatever statistic we are interested in, in place of \(\hat{p}\).
The bootstrap algorithm is given again, below.
Bootstrap Algorithm
For an original sample of size \(n\):
Take a sample of size \(n\) by randomly sampling from the original sample with replacement. (Thus some observations will show up multiple times and others not at all.)
Calculate the statistic of interest in the bootstrap sample.
Repeat steps 1 and 2 many (say 10,000) times, keeping track of the statistic of interest that is calculated in each bootstrap sample.
Look at the distribution of the statistic across bootstrap samples. The variability in this bootstrap distribution can be used to approximate the variability in the sampling distribution for the statistic of interest.
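Since the four steps above apply to any statistic, they can be wrapped in a small helper function. This function is an illustrative sketch, not part of the chapter's code; the simulated vector `x` stands in for a real sample.

```r
# Illustrative helper: bootstrap distribution of any statistic of a numeric vector
bootstrap_dist <- function(x, stat, reps = 10000) {
  # Steps 1-3: resample with replacement, compute the statistic, repeat
  replicate(reps, stat(sample(x, length(x), replace = TRUE)))
}

# Step 4: examine the variability of the statistic across bootstrap samples
set.seed(3)
x <- rnorm(53, mean = 0.5, sd = 0.34)  # hypothetical sample of 53 mercury-like values
boot_means <- bootstrap_dist(x, mean)
sd(boot_means)  # bootstrap standard error of the mean
```

The same helper works with `median`, `sd`, or any other function of the sample.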
We’ll now go through examples, calculating bootstrap confidence intervals for each of the parameters listed above.
3.4.2 CI for Mean
The histogram shows the distribution of mercury levels of the 53 lakes in our sample. The mean and standard deviation in mercury levels for these 53 lakes are shown below.
Lakes_Hist <- ggplot(data=FloridaLakes, aes(x=Mercury)) +
  geom_histogram(color="white", fill="lightblue", binwidth = 0.2) +
  ggtitle("Mercury Levels in Sample of Florida Lakes") +
  xlab("Mercury Level") + ylab("Frequency")
Lakes_Hist
We’ll calculate the mean and standard deviation in mercury level for the 53 lakes in the sample.
We want to calculate a 95% confidence interval for the mean mercury level among all Florida lakes. We’ll use bootstrapping again, this time using the sample mean, rather than the proportion exceeding 1 ppm, as our statistic of interest.
Bootstrap Steps
Take a sample of 53 lakes by randomly sampling from the original sample of 53 lakes, with replacement.
Calculate the mean mercury level in the bootstrap sample.
Repeat steps 1 and 2 many (say 10,000) times, keeping track of the mean mercury level in each bootstrap sample.
Look at the distribution of the mean across bootstrap samples. The variability in this bootstrap distribution can be used to approximate the variability in the sampling distribution for the mean mercury level.
We’ll illustrate the procedure on 3 bootstrap samples.
Bootstrap Sample 1
The first 10 lakes in the bootstrap sample are shown below. Notice again that some lakes occur multiple times, and others not at all.
BootstrapSample1 <- sample_n(FloridaLakes, 53, replace=TRUE) %>% arrange(Lake)
BootstrapSample1 %>% select(ID, Lake, Mercury)
# A tibble: 53 × 3
ID Lake Mercury
<int> <chr> <dbl>
1 1 Alligator 1.23
2 2 Annie 1.33
3 2 Annie 1.33
4 3 Apopka 0.04
5 3 Apopka 0.04
6 3 Apopka 0.04
7 3 Apopka 0.04
8 6 Bryant 0.27
9 9 Deer Point 0.83
10 9 Deer Point 0.83
# ℹ 43 more rows
We calculate the mean mercury level among the lakes in the bootstrap sample.
mean(BootstrapSample1$Mercury)
[1] 0.4790566
Bootstrap Sample #2
BootstrapSample2 <- sample_n(FloridaLakes, 53, replace=TRUE) %>% arrange(Lake)
BootstrapSample2 %>% select(ID, Lake, Mercury)
# A tibble: 53 × 3
ID Lake Mercury
<int> <chr> <dbl>
1 1 Alligator 1.23
2 1 Alligator 1.23
3 1 Alligator 1.23
4 2 Annie 1.33
5 5 Brick 1.2
6 6 Bryant 0.27
7 8 Crescent 0.19
8 8 Crescent 0.19
9 9 Deer Point 0.83
10 9 Deer Point 0.83
# ℹ 43 more rows
Mean Mercury Level:
mean(BootstrapSample2$Mercury)
[1] 0.5479245
Bootstrap Sample #3
BootstrapSample3 <- sample_n(FloridaLakes, 53, replace=TRUE) %>% arrange(Lake)
BootstrapSample3 %>% select(ID, Lake, Mercury)
# A tibble: 53 × 3
ID Lake Mercury
<int> <chr> <dbl>
1 1 Alligator 1.23
2 3 Apopka 0.04
3 3 Apopka 0.04
4 4 Blue Cypress 0.44
5 5 Brick 1.2
6 5 Brick 1.2
7 6 Bryant 0.27
8 7 Cherry 0.48
9 8 Crescent 0.19
10 9 Deer Point 0.83
# ℹ 43 more rows
Mean Mercury Level:
mean(BootstrapSample3$Mercury)
[1] 0.5437736
Now, we’ll take 10,000 bootstrap samples, and record the mean mercury concentration in each sample.
Sample_Mean <- mean(FloridaLakes$Mercury)  # calculate sample statistic
Bootstrap_Mean <- rep(NA, 10000)  # setup vector to hold bootstrap statistics
for (i in 1:10000){
BootstrapSample <- sample_n(FloridaLakes, 53, replace=TRUE)  # take bootstrap sample
Bootstrap_Mean[i] <- mean(BootstrapSample$Mercury)  # calculate mean in bootstrap sample
}
Lakes_Bootstrap_Results_Mean <- data.frame(Bootstrap_Mean)  # store results in data frame
The bootstrap distribution for the mean mercury level is shown below, along with its standard error.
Lakes_Bootstrap_Mean_Plot <- ggplot(data=Lakes_Bootstrap_Results_Mean, aes(x=Bootstrap_Mean)) +
  geom_histogram(color="white", fill="lightblue") +
  xlab("Mean Mercury in Bootstrap Sample") + ylab("Frequency") +
  ggtitle("Bootstrap Distribution for Sample Mean in Florida Lakes") +
  theme(legend.position = "none")
Lakes_Bootstrap_Mean_Plot
Bootstrap Standard Error
We’ll calculate the bootstrap standard error of the mean. This is a measure of how much the mean varies between samples of size 53.
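The bootstrap standard error is just the standard deviation of the 10,000 bootstrap means. A self-contained sketch, with simulated mercury values standing in for the real data:

```r
set.seed(4)
Mercury <- rnorm(53, mean = 0.53, sd = 0.34)  # hypothetical stand-in for FloridaLakes$Mercury
Bootstrap_Mean <- replicate(10000, mean(sample(Mercury, 53, replace = TRUE)))
sd(Bootstrap_Mean)  # bootstrap standard error of the mean
sd(Mercury)         # sample standard deviation, much larger than the SE
```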
Notice that the standard error of the mean is much less than the sample standard deviation of 0.341.
Interpretations of sample standard deviation and standard error of the mean
The sample standard deviation measures the amount of variability in mercury levels between the 53 individual lakes in our sample.
The standard error of the mean measures the amount of variability in sample mean mercury levels between different samples of size 53.
There is more variability between mercury levels in individual lakes than there is between average mercury levels in different samples of size 53.
Since the bootstrap distribution is roughly symmetric and bell-shaped, we can use the bootstrap standard error method to calculate an approximate 95% confidence interval for the mean mercury level among all Florida lakes.
We are 95% confident that the average mercury level among all Florida lakes is between 0.44 and 0.62 parts per million.
It is important to note that we are not saying that we are 95% confident that an individual lake lies in this range, or that 95% of all individual lakes lie in this range. We are only saying that we are confident that the average mercury level among all lakes lies in this range. A confidence interval is a statement about a population parameter (in this case the average mercury level), rather than about individual lakes in the population. Since there is more variability among individual lakes than among sample averages, we would need a wider interval to describe the mercury level of an individual lake.
3.4.3 CI for Standard Deviation
Now, we’ll calculate a confidence interval for the standard deviation in mercury levels among all Florida lakes. Recall that the sample standard deviation (\(s\)) was:
Sample_SD <- sd(FloridaLakes$Mercury)
Sample_SD
[1] 0.3410356
We’ll use this estimate to calculate a confidence interval for the population standard deviation \(\sigma\).
This time, our statistic of interest is the sample standard deviation \(s\).
Bootstrap Steps
Take a sample of 53 lakes by randomly sampling from the original sample of 53 lakes, with replacement.
Calculate the standard deviation in mercury level in the bootstrap sample.
Repeat steps 1 and 2 many (say 10,000) times, keeping track of the standard deviation in mercury level in each bootstrap sample.
Look at the distribution of the standard deviations across bootstrap samples. The variability in this bootstrap distribution can be used to approximate the variability in the sampling distribution for the standard deviation in mercury level.
We’ll illustrate the procedure on 3 bootstrap samples.
Bootstrap Sample 1
BootstrapSample1 <- sample_n(FloridaLakes, 53, replace=TRUE) %>% arrange(Lake)
BootstrapSample1 %>% select(ID, Lake, Mercury)
# A tibble: 53 × 3
ID Lake Mercury
<int> <chr> <dbl>
1 2 Annie 1.33
2 3 Apopka 0.04
3 5 Brick 1.2
4 6 Bryant 0.27
5 6 Bryant 0.27
6 6 Bryant 0.27
7 7 Cherry 0.48
8 8 Crescent 0.19
9 9 Deer Point 0.83
10 9 Deer Point 0.83
# ℹ 43 more rows
Standard deviation in mercury level:
sd(BootstrapSample1$Mercury)
[1] 0.3294457
Now, we’ll take 10,000 bootstrap samples, and record the standard deviation in mercury concentration in each sample.
Sample_SD <- sd(FloridaLakes$Mercury)  # calculate sample statistic
Bootstrap_SD <- rep(NA, 10000)  # setup vector to hold bootstrap statistics
for (i in 1:10000){
BootstrapSample <- sample_n(FloridaLakes, 53, replace=TRUE)  # take bootstrap sample
Bootstrap_SD[i] <- sd(BootstrapSample$Mercury)  # calculate standard deviation in bootstrap sample
}
Lakes_Bootstrap_Results_SD <- data.frame(Bootstrap_SD)  # store results in data frame
The bootstrap distribution for the standard deviation in mercury level is shown below, along with its standard error.
Lakes_Bootstrap_SD_Plot <- ggplot(data=Lakes_Bootstrap_Results_SD, aes(x=Bootstrap_SD)) +
  geom_histogram(color="white", fill="lightblue") +
  xlab("SD in Mercury in Bootstrap Sample") + ylab("Frequency") +
  ggtitle("Bootstrap Distribution for Sample SD in Florida Lakes") +
  theme(legend.position = "none")
Lakes_Bootstrap_SD_Plot
Bootstrap Standard Error:
We’ll calculate the bootstrap standard error of the standard deviation. This is a measure of how much the standard deviation varies between samples.
Since the bootstrap distribution is roughly symmetric and bell-shaped, we can use the bootstrap standard error method to calculate an approximate 95% confidence interval for the standard deviation in mercury levels among all Florida lakes.
We are 95% confident that the standard deviation in mercury levels among all Florida lakes is between 0.28 and 0.4 parts per million.
3.4.4 CI for Median
We already calculated a confidence interval for the mean mercury level among all Florida lakes. We could calculate a bootstrap confidence interval for the median mercury level as well, but since the distribution of mercury levels in the lakes is roughly symmetric, the mean is a reasonable measure of center, and there is not a clear reason for using the median instead.
When a distribution is skewed or contains large outliers, however, the median is a more robust measure of center than the mean. Recall the distribution of 100 Seattle house prices seen in Chapters 1 and 2.
ggplot(data=Houses, aes(x=price)) +
  geom_histogram(fill="lightblue", color="white") +
  ggtitle("Distribution of House Prices") + xlab("Price") + ylab("Frequency")
These 100 houses are a sample of all houses sold in Seattle in 2014 and 2015, so we can use statistics from our sample to draw conclusions about all houses sold in Seattle in this time period.
In this subsection, we’ll use bootstrapping to calculate a 95% confidence interval for the median price among all houses sold in Seattle in this time period.
We calculate the sample median price.
Sample_Median <- median(Houses$price)
Sample_Median
[1] 507.5
Bootstrap Steps
Take a sample of 100 houses by randomly sampling from the original sample of 100 houses, with replacement.
Calculate the median price in the bootstrap sample.
Repeat steps 1 and 2 many (say 10,000) times, keeping track of the median price in each bootstrap sample.
Look at the distribution of the median price across bootstrap samples. The variability in this bootstrap distribution can be used to approximate the variability in the sampling distribution for the median price.
Now, we’ll take 10,000 bootstrap samples, and record the median price in each sample.
Sample_Med <- median(Houses$price)  # calculate sample median
Bootstrap_Med <- rep(NA, 10000)  # setup vector to hold bootstrap statistics
for (i in 1:10000){
BootstrapSample <- sample_n(Houses, 100, replace=TRUE)  # take bootstrap sample
Bootstrap_Med[i] <- median(BootstrapSample$price)  # calculate median in bootstrap sample
}
Houses_Bootstrap_Results_Med <- data.frame(Bootstrap_Med)  # store results in data frame
The bootstrap distribution for the median price is shown below, along with its standard error.
Houses_Bootstrap_Med_Plot <- ggplot(data=Houses_Bootstrap_Results_Med, aes(x=Bootstrap_Med)) +
  geom_histogram(color="white", fill="lightblue") +
  xlab("Median Price in Bootstrap Sample") + ylab("Frequency") +
  ggtitle("Bootstrap Distribution for Median Price in Seattle Houses") +
  theme(legend.position = "none")
Houses_Bootstrap_Med_Plot
Bootstrap Standard Error:
We’ll calculate the bootstrap standard error of the median. This is a measure of how much the median varies between samples.
The standard error measures the amount of variability in median house price between different samples of size 100.
Note that this is different from the sample standard deviation, which represents the standard deviation in prices between the 100 different houses in the sample.
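In the same way as before, the bootstrap standard error of the median is the standard deviation of the bootstrap medians. Sketched here with simulated prices standing in for the Houses data:

```r
set.seed(6)
price <- round(rlnorm(100, meanlog = 6.2, sdlog = 0.5))  # hypothetical stand-in for Houses$price
Bootstrap_Med <- replicate(10000, median(sample(price, 100, replace = TRUE)))
sd(Bootstrap_Med)  # bootstrap standard error of the median
```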
Notice that the bootstrap distribution for the median is not symmetric and bell-shaped. Thus, we cannot be assured that 95% of samples will produce a statistic within two standard errors of the mean, so the standard error confidence interval method is not appropriate here. Instead, we’ll calculate a confidence interval by taking the middle 95% of the values in the bootstrap distribution. A confidence interval calculated this way is called a percentile bootstrap interval.
We’ll calculate the 0.025 quantile and the 0.975 quantile of this bootstrap distribution. These are the points below which lie 2.5% and 97.5% of the medians in the bootstrap distribution. Thus, the middle 95% of the medians lie between these values.
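In R, these quantiles come from the `quantile()` function applied to the vector of bootstrap medians. A sketch with a simulated, right-skewed stand-in for `Bootstrap_Med`:

```r
set.seed(8)
Bootstrap_Med <- 450 + rexp(10000, rate = 1/50)  # simulated stand-in for the bootstrap medians

# Middle 95% of the bootstrap distribution: the percentile bootstrap interval
quantile(Bootstrap_Med, c(0.025, 0.975))
```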
3.4.5 CI for Difference in Means
Next, we'll compare mercury levels in Northern and Southern Florida lakes. Summary statistics for each group are shown below.
# A tibble: 2 × 4
Location MeanHg StDevHg N
<fct> <dbl> <dbl> <int>
1 N 0.425 0.270 33
2 S 0.696 0.384 20
In our sample of 33 Northern Lakes and 20 Southern Lakes, we saw a difference of 0.27 ppm. We’ll calculate a confidence interval to estimate how big or small this difference could be among all Florida lakes.
We’ll use a statistical model to calculate the average mercury levels in Northern and Southern Florida.
Thus, we can obtain a confidence interval for the difference in average mercury levels by fitting a regression model to each of our bootstrap samples and recording the value of the sample statistic \(b_1\), which represents this difference. Alternatively, we could calculate the mean from each group separately and subtract.
When comparing groups, we make one modification in Step #1 of the bootstrap process. Rather than drawing a sample of size \(n\) at random, with replacement, we’ll draw the same number of observations from each group as were in the original sample. In this case, we had 33 northern lakes, and 20 southern lakes.
Bootstrap Steps
Take a sample of 33 northern lakes and 20 southern lakes by randomly sampling from the original sample, with replacement.
Fit a regression model with location as the explanatory variable and record the value of \(b_1\), representing the difference between the means for each group (South-North).
Repeat steps 1 and 2 many (say 10,000) times, keeping track of the difference in means in each bootstrap sample.
Look at the distribution of the differences in means across bootstrap samples. The variability in this bootstrap distribution can be used to approximate the variability in the sampling distribution for the difference in means between mercury levels in Northern and Southern Florida.
We’ll illustrate the procedure on 3 bootstrap samples.
# A tibble: 53 × 4
ID Lake Location Mercury
<int> <chr> <fct> <dbl>
1 1 Alligator S 1.23
2 2 Annie S 1.33
3 2 Annie S 1.33
4 3 Apopka N 0.04
5 4 Blue Cypress S 0.44
6 4 Blue Cypress S 0.44
7 4 Blue Cypress S 0.44
8 7 Cherry N 0.48
9 8 Crescent N 0.19
10 10 Dias N 0.81
# ℹ 43 more rows
We fit a regression model to the bootstrap sample and calculate the regression coefficients. We’re interested in the second coefficient, \(b_1\), which represents the mean difference between lakes in Southern and Northern Florida.
Mb1 <- lm(data=BootstrapSample1, Mercury ~ Location)  # fit linear model
Mb1
# A tibble: 53 × 4
ID Lake Location Mercury
<int> <chr> <fct> <dbl>
1 1 Alligator S 1.23
2 1 Alligator S 1.23
3 1 Alligator S 1.23
4 1 Alligator S 1.23
5 3 Apopka N 0.04
6 3 Apopka N 0.04
7 3 Apopka N 0.04
8 4 Blue Cypress S 0.44
9 4 Blue Cypress S 0.44
10 4 Blue Cypress S 0.44
# ℹ 43 more rows
Bootstrap Sample 2
Mb2 <- lm(data=BootstrapSample2, Mercury ~ Location)  # fit linear model
Mb2
# A tibble: 53 × 4
ID Lake Location Mercury
<int> <chr> <fct> <dbl>
1 1 Alligator S 1.23
2 1 Alligator S 1.23
3 2 Annie S 1.33
4 2 Annie S 1.33
5 2 Annie S 1.33
6 2 Annie S 1.33
7 3 Apopka N 0.04
8 3 Apopka N 0.04
9 6 Bryant N 0.27
10 6 Bryant N 0.27
# ℹ 43 more rows
Mb3 <- lm(data=BootstrapSample3, Mercury ~ Location)  # fit linear model
Mb3
We’ll now take 10,000 different bootstrap samples and look at the bootstrap distribution for \(b_1\), the difference in mean mercury levels between lakes in Southern and Northern Florida.
M <- lm(data=FloridaLakes, Mercury ~ Location)  # fit model to original sample
Sample_b1 <- M$coefficients[2]  # record b1 value (second coefficient)
Bootstrap_b1 <- rep(NA, 10000)  # vector to store b1 values
for (i in 1:10000){
NLakes <- sample_n(FloridaLakes %>% filter(Location=="N"), 33, replace=TRUE)  # sample 33 northern lakes
SLakes <- sample_n(FloridaLakes %>% filter(Location=="S"), 20, replace=TRUE)  # sample 20 southern lakes
BootstrapSample <- rbind(NLakes, SLakes)  # combine northern and southern lakes
M <- lm(data=BootstrapSample, Mercury ~ Location)  # fit linear model
Bootstrap_b1[i] <- M$coefficients[2]  # record b1
}
NS_Lakes_Bootstrap_Results <- data.frame(Bootstrap_b1)  # save results as dataframe
The bootstrap distribution for the difference in means, \(b_1\), is shown below, along with the standard error for the difference.
NS_Lakes_Bootstrap_Plot_b1 <- ggplot(data=NS_Lakes_Bootstrap_Results, aes(x=Bootstrap_b1)) +
  geom_histogram(color="white", fill="lightblue") +
  xlab("Mean Difference (b1) in Bootstrap Sample") + ylab("Frequency") +
  ggtitle("Northern vs Southern Lakes: Bootstrap Distribution for b1")
NS_Lakes_Bootstrap_Plot_b1
Bootstrap Standard Error:
We’ll calculate the bootstrap standard error of the difference in means \(b_1\). This is a measure of how much the difference in means varies between samples.
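The standard error is the standard deviation of the `Bootstrap_b1` values from the loop above, and the interval is again statistic ± 2 SE. The sketch below uses simulated values for `Bootstrap_b1` (centered at the observed difference of 0.27 ppm) so that it runs on its own; the SE used to generate them is hypothetical.

```r
set.seed(9)
Sample_b1 <- 0.27                                      # observed difference in means (South - North)
Bootstrap_b1 <- rnorm(10000, mean = 0.27, sd = 0.095)  # simulated stand-in for the bootstrap b1 values
SE_b1 <- sd(Bootstrap_b1)                              # bootstrap standard error
c(Sample_b1 - 2*SE_b1, Sample_b1 + 2*SE_b1)            # approximate 95% confidence interval
```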
We are 95% confident that the mean mercury level among all lakes in Southern Florida is between 0.08 and 0.46 higher than the mean mercury level among all lakes in Northern Florida.
3.4.6 CI for Regression Slope
Now, we’ll examine the relationship between mercury concentration and pH in Florida lakes. The scatterplot displays these variables, along with the least squares regression line.
On average, lakes with pH level 0 are expected to have a mercury level of 1.53 ppm.
For each one-unit increase in pH, mercury level is expected to decrease by 0.15 ppm.
These estimates are sample statistics, calculated from our sample of 53 lakes. We can think of our regression equation estimates \(b_0\) and \(b_1\) as estimates of parameters \(\beta_0\) and \(\beta_1\), which pertain to the slope and intercept of the regression line pertaining to the entire population of all lakes in Florida. We’ll use \(b_0\) and \(b_1\) to estimate \(\beta_0\) and \(\beta_1\) in the same way that we used sample proportion \(\hat{p}\) to estimate population proportion \(p\) and sample mean \(\bar{x}\) to estimate population mean \(\mu\).
The intercept, \(\beta_0\) has little meaning here, but the slope \(\beta_1\) represents the average change in mercury level for each one-unit increase in pH, among all Florida lakes. We’ll use bootstrapping to find a confidence interval for this quantity.
Bootstrap Steps
Take a sample of 53 lakes by randomly sampling from the original sample, with replacement.
Fit a regression model with pH as the explanatory variable and record the value of slope \(b_1\).
Repeat steps 1 and 2 many (say 10,000) times, keeping track of slope of the regression line for each bootstrap sample.
Look at the distribution of the slopes across bootstrap samples. The variability in this bootstrap distribution can be used to approximate the variability in the sampling distribution for the slope relating mercury and pH levels.
We’ll illustrate the procedure on 3 bootstrap samples.
Bootstrap Sample 1
BootstrapSample1 <- sample_n(FloridaLakes, 53, replace=TRUE) %>%
  arrange(ID) %>% select(ID, Lake, pH, Mercury)  # take bootstrap sample
BootstrapSample1
# A tibble: 53 × 4
ID Lake pH Mercury
<int> <chr> <dbl> <dbl>
1 1 Alligator 6.1 1.23
2 1 Alligator 6.1 1.23
3 2 Annie 5.1 1.33
4 2 Annie 5.1 1.33
5 3 Apopka 9.1 0.04
6 4 Blue Cypress 6.9 0.44
7 5 Brick 4.6 1.2
8 9 Deer Point 5.8 0.83
9 9 Deer Point 5.8 0.83
10 10 Dias 6.4 0.81
# ℹ 43 more rows
We fit a regression model to the bootstrap sample and calculate the regression coefficients. We’re again interested in the second coefficient, \(b_1\), which now represents the slope of the regression line.
Mb1 <- lm(data=BootstrapSample1, Mercury ~ pH)  # fit linear model
Mb1
We’ll now take 10,000 different bootstrap samples and look at the bootstrap distribution for \(b_1\), the slope of the regression line relating mercury level and pH.
M <- lm(data=FloridaLakes, Mercury ~ pH)  # fit model to original sample
Sample_b1 <- M$coefficients[2]  # record b1 value (second coefficient)
Bootstrap_b1 <- rep(NA, 10000)  # vector to store b1 values
for (i in 1:10000){
BootstrapSample <- sample_n(FloridaLakes, 53, replace=TRUE)  # take bootstrap sample
M <- lm(data=BootstrapSample, Mercury ~ pH)  # fit linear model
Bootstrap_b1[i] <- M$coefficients[2]  # record b1
}
Lakes_Bootstrap_Slope_Results <- data.frame(Bootstrap_b1)  # save results as dataframe
The bootstrap distribution for the slopes, \(b_1\), is shown below, along with the standard error for the slope.
Lakes_Bootstrap_Plot_Slope <- ggplot(data=Lakes_Bootstrap_Slope_Results, aes(x=Bootstrap_b1)) +
  geom_histogram(color="white", fill="lightblue") +
  xlab("Slope in Bootstrap Sample") + ylab("Frequency") +
  ggtitle("Bootstrap Distribution for Slope")
Lakes_Bootstrap_Plot_Slope
Bootstrap Standard Error:
We’ll calculate the bootstrap standard error of the slope \(b_1\). This is a measure of how much the slope varies between samples.
We are 95% confident that among all Florida lakes, for each one-unit increase in pH, mercury level decreases by between 0.1 and 0.2 ppm, on average.
3.4.7 CI for Regression Response
In addition to calculating a confidence interval for the slope of the regression line relating mercury and pH levels in a lake, we can also calculate a confidence interval for the average mercury level among all lakes with a given pH.
We’ll calculate a confidence interval for the average mercury level among all lakes with a neutral pH level of 7.
Recall that the estimated regression equation is \(\widehat{\text{Mercury}} = 1.5309 - 0.1523\,\text{pH}\), so the expected mercury level among all lakes with \(\text{pH} = 7\) is \(b_0+7b_1 = 1.5309-0.1523(7)=0.4648\) ppm.
This quantity is a statistic calculated from a sample of 53 lakes, so we would not expect the average mercury level among all lakes in the population to be exactly equal to 0.4648. Again, we’ll use this sample statistic as an estimate of the population parameter, and use bootstrapping to estimate the variability associated with this statistic, in order to make a confidence interval.
Bootstrap Steps
Take a sample of 53 lakes by randomly sampling from the original sample, with replacement.
Fit a regression model with pH as the explanatory variable and record the values of \(b_0\) and \(b_1\). Use these to calculate \(b_0+7b_1\).
Repeat steps 1 and 2 many (say 10,000) times, keeping track of \(b_0\) and \(b_1\), and calculating \(b_0+7b_1\) in each bootstrap sample.
Look at the distribution of the expected response, \(b_0 + 7b_1\), across bootstrap samples. The variability in this bootstrap distribution can be used to approximate the variability in the sampling distribution for the expected mercury level among all lakes with pH level of 7.
We’ll illustrate the procedure on 3 bootstrap samples.
Bootstrap Sample 1
BootstrapSample1 <- sample_n(FloridaLakes, 53, replace=TRUE) %>%
  arrange(ID) %>% select(ID, Lake, pH, Mercury)  # take bootstrap sample
BootstrapSample1
We fit a regression model to the bootstrap sample and calculate the regression coefficients. This time we’re interested in both coefficients, \(b_0\) and \(b_1\), which we use to calculate \(b_0+7b_1\).
Mb1 <- lm(data=BootstrapSample1, Mercury ~ pH)  # fit linear model
b0 <- Mb1$coefficients[1]  # record value of b0 (first coefficient)
b1 <- Mb1$coefficients[2]  # record value of b1 (second coefficient)
b0 + 7*b1  # calculate b0 + 7*b1
(Intercept)
0.4353724
Bootstrap Sample 2
BootstrapSample2 <- sample_n(FloridaLakes, 53, replace=TRUE) %>%
  arrange(ID) %>% select(ID, Lake, pH, Mercury)
BootstrapSample2
# A tibble: 53 × 4
ID Lake pH Mercury
<int> <chr> <dbl> <dbl>
1 2 Annie 5.1 1.33
2 2 Annie 5.1 1.33
3 3 Apopka 9.1 0.04
4 4 Blue Cypress 6.9 0.44
5 6 Bryant 7.3 0.27
6 6 Bryant 7.3 0.27
7 10 Dias 6.4 0.81
8 10 Dias 6.4 0.81
9 12 Down 7.2 0.5
10 12 Down 7.2 0.5
# ℹ 43 more rows
Mb2 <- lm(data=BootstrapSample2, Mercury ~ pH)  # fit linear model
b0 <- Mb2$coefficients[1]  # record value of b0 (first coefficient)
b1 <- Mb2$coefficients[2]  # record value of b1 (second coefficient)
b0 + 7*b1  # calculate b0 + 7*b1
(Intercept)
0.5738289
Bootstrap Sample 3
BootstrapSample3 <- sample_n(FloridaLakes, 53, replace=TRUE) %>%
  arrange(ID) %>% select(ID, Lake, pH, Mercury)
BootstrapSample3
# A tibble: 53 × 4
ID Lake pH Mercury
<int> <chr> <dbl> <dbl>
1 2 Annie 5.1 1.33
2 3 Apopka 9.1 0.04
3 6 Bryant 7.3 0.27
4 6 Bryant 7.3 0.27
5 7 Cherry 5.4 0.48
6 7 Cherry 5.4 0.48
7 8 Crescent 8.1 0.19
8 10 Dias 6.4 0.81
9 10 Dias 6.4 0.81
10 10 Dias 6.4 0.81
# ℹ 43 more rows
Mb3 <- lm(data=BootstrapSample3, Mercury ~ pH)  # fit linear model
b0 <- Mb3$coefficients[1]  # record value of b0 (first coefficient)
b1 <- Mb3$coefficients[2]  # record value of b1 (second coefficient)
b0 + 7*b1  # calculate b0 + 7*b1
(Intercept)
0.4223812
We’ll now take 10,000 different bootstrap samples and record the values of \(b_0\), \(b_1\), which we’ll then use to calculate \(b_0+7b_1\).
M <- lm(data=FloridaLakes, Mercury ~ pH)  # fit model to original sample
Sample_b0 <- M$coefficients[1]  # record b0 value (first coefficient)
Sample_b1 <- M$coefficients[2]  # record b1 value (second coefficient)
Sample_Exp7 <- Sample_b0 + 7*Sample_b1  # calculate sample expected mercury when pH=7
Bootstrap_b0 <- rep(NA, 10000)  # vector to store b0 values
Bootstrap_b1 <- rep(NA, 10000)  # vector to store b1 values
for (i in 1:10000){
BootstrapSample <- sample_n(FloridaLakes, 53, replace=TRUE)  # take bootstrap sample
M <- lm(data=BootstrapSample, Mercury ~ pH)  # fit linear model
Bootstrap_b0[i] <- M$coefficients[1]  # record b0
Bootstrap_b1[i] <- M$coefficients[2]  # record b1
}
Bootstrap_Exp7 <- Bootstrap_b0 + 7*Bootstrap_b1  # calculate expected response for each bootstrap sample
Lakes_Bootstrap_Exp7_Results <- data.frame(Bootstrap_b0, Bootstrap_b1, Bootstrap_Exp7)  # save results as dataframe
The bootstrap distribution for the expected mercury level among all lakes with pH level 7, \(b_0+7b_1\), is shown below, along with the standard error for this quantity.
Lakes_Bootstrap_Plot_Exp7 <- ggplot(data=Lakes_Bootstrap_Exp7_Results, aes(x=Bootstrap_Exp7)) +
  geom_histogram(color="white", fill="lightblue") +
  xlab("Expected Mercury Level in Bootstrap Sample") + ylab("Frequency") +
  ggtitle("Bootstrap Distribution for Exp. Mercury when pH=7")
Lakes_Bootstrap_Plot_Exp7
Bootstrap Standard Error:
We’ll calculate the bootstrap standard error of expected mercury concentration \(b_0 + 7b_1\). This is a measure of how much the estimated expected concentration varies between samples.
We are 95% confident that the average mercury level among all Florida lakes with pH level 7 is between 0.4 and 0.5 ppm.
Again, we are not saying that we think an individual lake with a pH level of 7 will lie in this range, only that the average mercury level among all such lakes lies in this range.
3.4.8 More CI’s in Regression
We saw in the previous two examples how to calculate a confidence interval for the slope of a regression line, and for an expected response in regression. In fact, we can calculate confidence intervals for any function involving regression coefficients \(\beta_0, \beta_1, \ldots, \beta_p\), in a similar manner.
For example, let’s consider the model for Seattle house prices that involved square feet, whether or not the house was on the waterfront, and an interaction term between these variables.
We fit the model and obtain the parameter estimates shown below.
M <- lm(data=Houses, price ~ sqft_living + waterfront + sqft_living:waterfront) # fit model to original sample
Sample_b0 <- M$coefficients[1] # record b0 value (first coefficient)
Sample_b1 <- M$coefficients[2] # record b1 value (second coefficient)
Sample_b2 <- M$coefficients[3] # record b2 value (third coefficient)
Sample_b3 <- M$coefficients[4] # record b3 value (fourth coefficient)
M
Consider the following quantities that we might be interested in estimating:
The expected price of a 2,000 square foot waterfront house.
The expected price of a 1,500 square foot non-waterfront house.
The difference between the expected price of an 1,800 square foot house on the waterfront and a house of the same size that is not on the waterfront.
The difference in the rate of change in house prices for each additional 100 square feet for houses on the waterfront, compared to houses not on the waterfront.
Each of these quantities can be expressed as a linear function of our regression coefficients \(b_0, b_1, b_2, b_3\). We just need to find the appropriate function of the \(b_j\)’s, and then calculate a bootstrap confidence interval for that quantity, using the same steps we’ve seen in the previous examples.
Substituting into the regression equation, we see that:
The expected price of a 2,000 square foot waterfront house is given by \[b_0 + 2000b_1 + b_2 + 2000b_3\]
We calculate this estimate from the model, based on our sample of 100 houses:
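A sketch of this calculation, using the coefficients recorded from the fitted model above:

```r
# estimated expected price (in thousands) of a 2000 sq ft waterfront house (sketch)
Sample_b0 + 2000*Sample_b1 + Sample_b2 + 2000*Sample_b3
```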
We estimate that the average price of all 2,000 square foot waterfront houses in Seattle is 937 thousand dollars.
The expected price of a 1,500 square foot non-waterfront house is given by \[b_0 + 1500b_1\]
Sample_b0 + 1500*Sample_b1 # calculate b0+1500b1
(Intercept)
394.9499
We estimate that the average price of all 1,500 square foot non-waterfront houses in Seattle is 395 thousand dollars.
The difference between the expected price of an 1,800 square foot house on the waterfront and a house of the same size that is not on the waterfront is given by: \[b_2 + 1800b_3\]
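A sketch of this calculation, using the coefficients recorded from the fitted model above:

```r
# estimated price difference (waterfront minus non-waterfront) for an 1800 sq ft house (sketch)
Sample_b2 + 1800*Sample_b3
```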
We estimate that on average a 1,800 square foot house on the waterfront will cost 414 thousand dollars more than a 1,800 square foot house not on the waterfront.
The difference in the rate of change in house prices for each additional 100 square feet for houses on the waterfront, compared to houses not on the waterfront.
This question is asking about the difference in slopes of the regression lines relating price and square feet for houses on the waterfront, compared to those not on the waterfront.
For houses on the waterfront, the regression equation is \[\widehat{\text{Price}} = (b_0+b_2) + (b_1+b_3)\times\text{Sq. Ft.},\] while for houses not on the waterfront, it is \[\widehat{\text{Price}} = b_0 + b_1\times\text{Sq. Ft.}\]
These slopes pertain to the expected change in price for each additional 1 square foot. So, for a 100-square foot increase, the price of a waterfront house is expected to increase by \(100(b_1+b_3)\), compared to an increase of \(100b_1\) for a non-waterfront house. Thus, the difference in the rates of change is \(100b_3\).
100*Sample_b3 # calculate 100b3
sqft_living:waterfrontYes
43.26671
We estimate that the price of waterfront houses increases by 43 thousand dollars more for each additional 100 square feet than the price of non-waterfront houses.
These estimates calculated from the sample are statistics, which, like all the other statistics we’ve seen are likely to vary from the true values of the corresponding population parameters, due to variability between samples. We can use bootstrapping to calculate confidence intervals for the relevant population parameters, using these sample statistics (the functions of \(b_j\)’s), just as we’ve done for the other statistics we’ve seen.
Bootstrap Steps
Take a sample of 100 houses by randomly sampling from the original sample, with replacement.
Fit a regression model of price on square feet of living space, waterfront status, and their interaction, and record the values of the regression coefficients \(b_0, b_1, b_2, b_3\). Use these to calculate each of the four desired quantities (e.g. \(b_0 + 2000b_1 + b_2 +2000b_3\)).
Repeat steps 1 and 2 many (say 10,000) times, keeping track of the regression coefficients and calculating the desired quantities in each bootstrap sample.
Look at the distribution of the quantities of interest, across bootstrap samples. The variability in this bootstrap distribution can be used to approximate the variability in the sampling distribution for each of these quantities.
We’ll illustrate the procedure on 3 bootstrap samples.
Bootstrap Sample 1
We take the first bootstrap sample and fit a model with interaction. For brevity, we won’t list out the houses in each of the bootstrap samples, as the idea should be clear by now. Model coefficients are shown below.
BootstrapSample1 <- sample_n(Houses, 100, replace=TRUE) %>%
  arrange(Id) %>%
  select(Id, price, sqft_living, waterfront)
Mb1 <- lm(data=BootstrapSample1, price ~ sqft_living + waterfront + sqft_living:waterfront) # fit linear model with interaction
b0 <- Mb1$coefficients[1] # record value of b0 (first coefficient)
b1 <- Mb1$coefficients[2] # record value of b1 (second coefficient)
b2 <- Mb1$coefficients[3] # record value of b2 (third coefficient)
b3 <- Mb1$coefficients[4] # record value of b3 (fourth coefficient)
Mb1
Bootstrap Sample 2
BootstrapSample2 <- sample_n(Houses, 100, replace=TRUE) %>%
  arrange(Id) %>%
  select(Id, price, sqft_living, waterfront)
Mb2 <- lm(data=BootstrapSample2, price ~ sqft_living + waterfront + sqft_living:waterfront) # fit linear model with interaction
b0 <- Mb2$coefficients[1] # record value of b0 (first coefficient)
b1 <- Mb2$coefficients[2] # record value of b1 (second coefficient)
b2 <- Mb2$coefficients[3] # record value of b2 (third coefficient)
b3 <- Mb2$coefficients[4] # record value of b3 (fourth coefficient)
Mb2
Bootstrap Sample 3
BootstrapSample3 <- sample_n(Houses, 100, replace=TRUE) %>%
  arrange(Id) %>%
  select(Id, price, sqft_living, waterfront)
Mb3 <- lm(data=BootstrapSample3, price ~ sqft_living + waterfront + sqft_living:waterfront) # fit linear model with interaction
b0 <- Mb3$coefficients[1] # record value of b0 (first coefficient)
b1 <- Mb3$coefficients[2] # record value of b1 (second coefficient)
b2 <- Mb3$coefficients[3] # record value of b2 (third coefficient)
b3 <- Mb3$coefficients[4] # record value of b3 (fourth coefficient)
Mb3
We’ll now take 10,000 different bootstrap samples and record the values of \(b_0\), \(b_1\), \(b_2\), and \(b_3\), which we’ll then use to calculate each of our four desired quantities.
M <- lm(data=Houses, price ~ sqft_living + waterfront + sqft_living:waterfront) # fit model to original sample
Sample_b0 <- M$coefficients[1] # record b0 value (first coefficient)
Sample_b1 <- M$coefficients[2] # record b1 value (second coefficient)
Sample_b2 <- M$coefficients[3] # record b2 value (third coefficient)
Sample_b3 <- M$coefficients[4] # record b3 value (fourth coefficient)
Sample_Q1 <- Sample_b0 + 2000*Sample_b1 + Sample_b2 + 2000*Sample_b3 # calculate b0+2000b1+b2+2000b3
Sample_Q2 <- Sample_b0 + 1500*Sample_b1 # calculate b0+1500b1
Sample_Q3 <- Sample_b2 + 1800*Sample_b3 # calculate b2+1800b3
Sample_Q4 <- 100*Sample_b3 # calculate 100b3
Bootstrap_b0 <- rep(NA, 10000) # vector to store b0 values
Bootstrap_b1 <- rep(NA, 10000) # vector to store b1 values
Bootstrap_b2 <- rep(NA, 10000) # vector to store b2 values
Bootstrap_b3 <- rep(NA, 10000) # vector to store b3 values
for (i in 1:10000){
BootstrapSample <- sample_n(Houses, 100, replace=TRUE) # take bootstrap sample (same size as original sample)
Mb <- lm(data=BootstrapSample, price ~ sqft_living + waterfront + sqft_living:waterfront) # fit linear model with interaction
Bootstrap_b0[i] <- Mb$coefficients[1] # record value of b0 (first coefficient)
Bootstrap_b1[i] <- Mb$coefficients[2] # record value of b1 (second coefficient)
Bootstrap_b2[i] <- Mb$coefficients[3] # record value of b2 (third coefficient)
Bootstrap_b3[i] <- Mb$coefficients[4] # record value of b3 (fourth coefficient)
}
Bootstrap_Q1 <- Bootstrap_b0 + 2000*Bootstrap_b1 + Bootstrap_b2 + 2000*Bootstrap_b3
Bootstrap_Q2 <- Bootstrap_b0 + 1500*Bootstrap_b1
Bootstrap_Q3 <- Bootstrap_b2 + 1800*Bootstrap_b3
Bootstrap_Q4 <- 100*Bootstrap_b3
Houses_Bootstrap_Results <- data.frame(Bootstrap_b0, Bootstrap_b1, Bootstrap_b2, Bootstrap_b3, Bootstrap_Q1, Bootstrap_Q2, Bootstrap_Q3, Bootstrap_Q4) # save results as dataframe
Bootstrap Distribution for \(b_0 + 2000b_1 + b_2 + 2000b_3\)
Houses_Bootstrap_Plot_Q1 <- ggplot(data=Houses_Bootstrap_Results, aes(x=Bootstrap_Q1)) +
  geom_histogram(color="white", fill="lightblue") +
  xlab("Expected Price of 2000 Sq. Ft. Waterfront House") +
  ylab("Frequency") +
  ggtitle("Bootstrap Distribution b0+2000b1+b2+2000b3")
Houses_Bootstrap_Plot_Q1
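A 95% percentile bootstrap interval can be read off this distribution using the quantile function (a sketch):

```r
# middle 95% of bootstrap estimates of b0 + 2000b1 + b2 + 2000b3 (sketch)
quantile(Houses_Bootstrap_Results$Bootstrap_Q1, c(0.025, 0.975))
```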
We are 95% confident that the average price among all 2,000 square foot Seattle waterfront houses is between 924.8 and 1084.9 thousand dollars.
Bootstrap Distribution for \(b_0 + 1500b_1\)
Houses_Bootstrap_Plot_Q2 <- ggplot(data=Houses_Bootstrap_Results, aes(x=Bootstrap_Q2)) +
  geom_histogram(color="white", fill="lightblue") +
  xlab("Expected Price of 1500 Sq. Ft. Non-Waterfront House") +
  ylab("Frequency") +
  ggtitle("Bootstrap Distribution b0+1500b1")
Houses_Bootstrap_Plot_Q2
We are 95% confident that the average price among all 1,500 square foot Seattle non-waterfront houses is between 383.3 and 406.6 thousand dollars.
Bootstrap Distribution for \(b_2 + 1800b_3\)
Houses_Bootstrap_Plot_Q3 <- ggplot(data=Houses_Bootstrap_Results, aes(x=Bootstrap_Q3)) +
  geom_histogram(color="white", fill="lightblue") +
  xlab("Expected Price Difference WF vs NWF for 1800 Sq. Ft. House") +
  ylab("Frequency") +
  ggtitle("Bootstrap Distribution b2+1800b3")
Houses_Bootstrap_Plot_Q3
We are 95% confident that the average price among all 1,800 square foot waterfront houses in Seattle is between 333.4 and 495.0 thousand dollars more than the average price among all non-waterfront houses of the same size.
Bootstrap Distribution for \(100b_3\)
Houses_Bootstrap_Plot_Q4 <- ggplot(data=Houses_Bootstrap_Results, aes(x=Bootstrap_Q4)) +
  geom_histogram(color="white", fill="lightblue") +
  xlab("Expected Difference per 100 Square Feet") +
  ylab("Frequency") +
  ggtitle("Bootstrap Distribution 100b3")
Houses_Bootstrap_Plot_Q4
Bootstrap Standard Error: We’ll calculate the bootstrap standard error of \(100b_3\), the difference in the rates of change. This is a measure of how much this difference varies between samples.
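A sketch of this calculation, using the bootstrap results recorded above:

```r
# bootstrap standard error of 100*b3 (sketch)
SE_Q4 <- sd(Houses_Bootstrap_Results$Bootstrap_Q4)
# standard-error 95% confidence interval: estimate +/- 2*SE
c(Sample_Q4 - 2*SE_Q4, Sample_Q4 + 2*SE_Q4)
```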
We are 95% confident that for each 100 square foot increase, the average price among all waterfront houses increases by between 38.9 and 47.6 thousand dollars more than the increase in average price among all non-waterfront houses.
3.4.9 Bootstrapping Cautions
While bootstrapping is a popular and robust procedure for calculating confidence intervals, it does have cautions and limitations. We should be sure to use the bootstrap procedure appropriate for our context. A standard-error bootstrap interval is appropriate when the sampling distribution for our statistic is roughly symmetric and bell-shaped. When this is not true, a percentile bootstrap interval can be used as long as there are no gaps or breaks in the bootstrap distribution. In situations where there are gaps and breaks in the bootstrap distribution, then the bootstrap distribution may not be a reasonable approximation of the sampling distribution we are interested in.
3.5 Hypothesis Testing
3.5.1 Mercury Levels in Florida Lakes
Recall the 2004 study by Lange, T., Royals, H. and Connor, L., which examined Mercury accumulation in large-mouth bass, taken from a sample of 53 Florida Lakes. If Mercury accumulation exceeds 0.5 ppm, then there are environmental concerns. In fact, the legal safety limit in Canada is 0.5 ppm, although it is 1 ppm in the United States.
In our sample, we have data on 53 lakes, out of more than 30,000 lakes in the state of Florida.
We are interested in whether mercury levels are higher or lower, on average, in Northern Florida compared to Southern Florida.
We’ll divide the state along route 50, which runs East-West, passing through Northern Orlando.
We add a variable indicating whether each lake lies in the northern or southern part of the state.
Lake Location Mercury
Alligator S 1.23
Annie S 1.33
Apopka N 0.04
Blue Cypress S 0.44
Brick S 1.20
Bryant N 0.27
Cherry N 0.48
Crescent N 0.19
Deer Point N 0.83
Dias N 0.81
Dorr N 0.71
Down S 0.50
Eaton N 0.49
East Tohopekaliga S 1.16
Farm-13 N 0.05
George N 0.15
Griffin N 0.19
Harney N 0.77
Hart S 1.08
Hatchineha S 0.98
Iamonia N 0.63
Istokpoga S 0.56
Jackson N 0.41
Josephine S 0.73
Kingsley N 0.34
Kissimmee S 0.59
Lochloosa N 0.34
Louisa S 0.84
Miccasukee N 0.50
Minneola N 0.34
Monroe N 0.28
Newmans N 0.34
Ocean Pond N 0.87
Ocheese Pond N 0.56
Okeechobee S 0.17
Orange N 0.18
Panasoffkee N 0.19
Parker S 0.04
Placid S 0.49
Puzzle N 1.10
Rodman N 0.16
Rousseau N 0.10
Sampson N 0.48
Shipp S 0.21
Talquin N 0.86
Tarpon S 0.52
Tohopekaliga S 0.65
Trafford S 0.27
Trout S 0.94
Tsala Apopka N 0.40
Weir N 0.43
Wildcat N 0.25
Yale N 0.27
We are interested in investigating whether average mercury levels differ between Northern and Southern Florida.
The boxplot and table below show the distribution of mercury levels among the 33 northern and 20 southern lakes in the sample.
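These group summaries might be produced with code along these lines (a sketch; the resulting LakesTable is displayed later with kable):

```r
# mean, standard deviation, and count of mercury levels by location (sketch)
LakesTable <- FloridaLakes %>% group_by(Location) %>%
  summarize(MeanHg = mean(Mercury), StDevHg = sd(Mercury), N = n())
```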
\(b_1 = 0.6965 - 0.4245 = 0.272\) is equal to the difference in mean mercury levels between Southern and Northern lakes. (We’ve already seen that for a categorical explanatory variable, the least-squares estimates correspond to the group means, so this makes sense.)
We can use \(b_1\) to assess the size of the difference in mean mercury concentration levels.
3.5.3 Hypotheses and Key Question
Since the lakes we observed are only a sample of 53 lakes out of more than 30,000, we cannot assume the difference in mercury concentration for all Northern vs Southern Florida lakes is exactly 0.272. Instead, we need to determine whether a difference of this size in our sample is large enough to provide evidence of a difference in average mercury level between all Northern and Southern lakes in Florida.
One possible explanation for us getting the results we did in our sample is that there really is no difference in average mercury levels between all lakes in Northern and Southern Florida, and we just happened, by chance, to select more lakes with higher mercury concentrations in Southern Florida than in Northern Florida. A different possible explanation is that there really is a difference in average mercury level between lakes in Northern and Southern Florida.
In a statistical investigation, the null hypothesis is the one that says there is no difference between groups, or no relationship between variables, in the larger population, and that any difference/relationship observed in our sample occurred merely by chance. The alternative hypothesis contradicts the null hypothesis, stating that there is a difference/relationship.
Stated formally, the hypotheses are:
Null Hypothesis: There is no difference in average mercury level between all lakes in Northern Florida and all lakes in Southern Florida.
Alternative Hypothesis: There is a difference in average mercury level between all lakes in Northern Florida and all lakes in Southern Florida.
A statistician’s job is to determine whether the data provide strong enough evidence to rule out the null hypothesis.
The question we need to investigate is:
“How likely is it that we would have observed a difference in means (i.e. a value of \(b_1\)) as extreme as 0.6965 - 0.4245 = 0.272 ppm, merely by chance, if there is really no relationship between location and mercury level?”
3.5.4 Permutation Test for Difference in Means
We can answer the key question using a procedure known as a permutation test. In a permutation test, we randomly permute (or shuffle) the values of our explanatory variable to simulate a situation where there is no relationship between our explanatory and response variable. We observe whether it is plausible to observe values of a statistic (in this case the difference in means) as extreme or more extreme than what we saw in the actual data.
We’ll simulate situations where there is no relationship between location and mercury level, and see how often we observe a difference in means (\(b_1\)) as extreme as 0.272.
Procedure:
Randomly shuffle the locations of the lakes, so that any relationship between location and mercury level is due only to chance.
Calculate the difference in mean mercury levels (i.e. value of \(b_1\)) in “Northern” and “Southern” lakes, using the shuffled data. The statistic used to measure the size of the difference or relationship in the sample is called the test statistic.
Repeat steps 1 and 2 many (say 10,000) times, recording the test statistic (difference in means, \(b_1\)) each time.
Analyze the distribution of the test statistic (difference in means), simulated under the assumption that there is no relationship between location and mercury level. Check whether the value of the test statistic we observed in the sample (0.272) is consistent with values simulated under the assumption that the null hypothesis is true.
Notice that the locations of the lakes have now been mixed up and assigned randomly. So, any relationship between location and mercury level will have occurred merely by chance.
We create a boxplot and calculate the difference in mean mercury levels for the shuffled data.
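One such permutation might be carried out like this (a sketch, mirroring the simulation code used later):

```r
# shuffle the Location labels and compare group means (sketch)
ShuffledLakes <- FloridaLakes # copy the dataset
ShuffledLakes$Location <- ShuffledLakes$Location[sample(1:nrow(ShuffledLakes))] # shuffle locations
ShuffledLakes %>% group_by(Location) %>% summarize(MeanHg = mean(Mercury)) # means for shuffled groups
```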
Notice that the sample means are not identical. We observe a difference of -0.078 just by chance, due to the random assignment of lakes to location groups.
This difference is considerably smaller than the difference of 0.272 that we saw in the actual data, suggesting that perhaps a difference as big as 0.272 would not be likely to occur by chance. Before we can be sure of this, however, we should repeat our simulation many times to get a better sense for how big of a difference we might reasonably expect to occur just by chance.
We’ll write a for loop to perform 10,000 permutations and record the value of \(b_1\) (the difference in sample means) for each simulation.
b1 <- Lakes_M$coef[2] ## record value of b1 from actual data
## perform simulation
b1Sim <- rep(NA, 10000) ## vector to hold results
ShuffledLakes <- FloridaLakes ## create copy of dataset
for (i in 1:10000){
# randomly shuffle locations
ShuffledLakes$Location <- ShuffledLakes$Location[sample(1:nrow(ShuffledLakes))]
ShuffledLakes_M <- lm(data=ShuffledLakes, Mercury ~ Location) # fit model to shuffled data
b1Sim[i] <- ShuffledLakes_M$coef[2] ## record b1 from shuffled model
}
NSLakes_SimulationResults <- data.frame(b1Sim) # save results in dataframe
The histogram shows the distribution of differences in the group means observed in our simulation. The red lines indicate the difference we actually observed in the data (0.272), as well as an equally large difference in the opposite direction (-0.272).
NSLakes_SimulationResultsPlot <- ggplot(data=NSLakes_SimulationResults, aes(x=b1Sim)) +
  geom_histogram(fill="lightblue", color="white") +
  geom_vline(xintercept=c(b1, -1*b1), color="red") +
  xlab("Lakes: Simulated Value of b1") +
  ylab("Frequency") +
  ggtitle("Distribution of b1 under assumption of no relationship")
NSLakes_SimulationResultsPlot
The red lines are quite extreme, relative to the simulated values shown in the histogram. Based on the simulation, it is rare to obtain a difference as extreme as the 0.272 value we saw in the actual data, by chance when there is actually no difference in average mercury levels between Northern and Southern Florida lakes.
We calculate the precise number of simulations (out of 10,000) resulting in a difference in means more extreme than the 0.272 we observed.
sum(abs(b1Sim) > abs(b1))
[1] 39
The proportion of simulations resulting in a difference in means more extreme than 0.272 is:
sum(abs(b1Sim) > abs(b1))/10000
[1] 0.0039
We only observed a difference between the groups as extreme or more extreme than the 0.272 difference we saw in the sample in a proportion of 0.0039 of our simulations (less than 1%).
The probability of getting a difference in means as extreme or more extreme than 0.272 ppm by chance, when there is no relationship between location and mercury level is about 0.0039. In other words, it is very unlikely that we would have observed a result like we did by chance alone. Thus, we have strong evidence that there is a difference in average mercury level between lakes in Northern and Southern Florida. In this case, there is strong evidence that mercury level is higher in Southern Florida lakes than Northern Florida lakes.
Recall that in the previous chapter, we found that we could be 95% confident that the mean mercury level among all lakes in Southern Florida is between 0.08 and 0.46 higher than the mean mercury level among all lakes in Northern Florida.
3.5.7 p-values
The p-value represents the probability of getting a test statistic as extreme or more extreme than we did in our sample when the null hypothesis is true.
In this situation, the p-value represents the probability of observing a difference in sample means as extreme or more extreme than 0.272 if there is actually no difference in average mercury level among all lakes in Northern Florida, compared to Southern Florida.
In our study, the p-value was 0.0039, which is very low. This provides strong evidence against the null hypothesis that there is no difference in average mercury levels between all Northern and Southern Florida lakes.
A low p-value tells us that the difference in average Mercury levels that we saw in our sample is unlikely to have occurred by chance, providing evidence that there is indeed a difference in average Mercury levels between Northern and Southern lakes.
The p-value does not tell us anything about the size of the difference! If the difference is really small (say 0.001 ppm), perhaps there is no need to worry about it. It is possible to get a small p-value even when the true difference is very small (especially when our sample size is large). In addition to a p-value, we should consider whether a difference is big enough to be meaningful in a practical way, before making any policy decisions.
For now, we can use the difference in sample means of 0.272 ppm as an estimate of the size of the difference. Based on our limited knowledge of mercury levels, this does seem big enough to merit further investigation, and possible action.
At this point, a reasonable question is “how small must a p-value be in order to provide evidence against the null hypothesis?” While it is common to establish strict cutoffs for what counts as a small p-value (such as \(<0.05\)), the American Statistical Association does not recommend this. In reality, a p-value of 0.04 is practically no different than a p-value of 0.06. Rather than using strict cutoffs for what counts as small, it is better to interpret p-values on a sliding scale, as illustrated in the diagram below. A p-value of 0.10 or less provides at least some evidence against a null hypothesis, and the smaller the p-value, the stronger the evidence.
knitr::include_graphics("pvals.png")
3.6 More Hypothesis Test Examples
3.6.1 Other Test Statistics
The permutation test procedure can be used to test hypotheses involving many different test statistics, in addition to the difference in means we saw in the previous section. For example, we could test whether there is evidence of:
a difference in the amount of variability in mercury levels between lakes in Northern Florida, compared to Southern Florida
a difference in mean price between King County houses in very good, good, and average or below conditions
a relationship between mercury level and pH level among all Florida lakes
For each of these investigations, the null hypothesis says that there is no difference or relationship among all lakes (that is, whatever difference or relationship occurred in the sample occurred just by random chance). We’ll need a test statistic that measures the quantity we’re interested in (for example, a difference in means). Then, we use the permutation procedure to simulate a scenario where the null hypothesis is true, and see whether the test statistic we observed in our data is consistent with the ones we simulate under the null hypothesis.
3.6.2 General Permutation Test Procedure
Procedure:
Randomly shuffle the values or categories of the explanatory variable, so that any relationship between the explanatory and response variable occurs just by chance.
Calculate the test statistic on the shuffled data.
Repeat steps 1 and 2 many (say 10,000) times, recording the test statistic each time.
Analyze the distribution of the test statistic, simulated under the assumption that the null hypothesis is true. Check whether the value of the test statistic we observed in the sample is consistent with values simulated under the assumption that the null hypothesis is true. (We might calculate a p-value, which represents the proportion of simulations in which we observed a test statistic as extreme or more extreme than the one we saw in our actual sample.)
Next, we’ll apply these steps to the questions listed above.
3.6.3 Difference in Standard Deviation
We’ll test whether there is evidence of a difference in variability between lakes in Northern Florida, compared to Southern Florida. Since standard deviation is a measure of variability, we’ll use the difference in standard deviation in Northern vs Southern lakes as our test statistic.
Recall that the standard deviation among the 33 Northern Florida lakes in our sample was 0.270 ppm, which is lower than the 0.384 ppm among the 20 Southern Florida lakes.
kable(LakesTable)

Location  MeanHg     StDevHg    N
N         0.4245455  0.2696652  33
S         0.6965000  0.3838760  20
The test statistic we observe in our sample is \(0.2696-0.3839 = -0.1142\) ppm.
We need to determine whether a difference this large could have plausibly occurred in our sample, just by chance, if there is really no difference in standard deviation among all lakes in Northern Florida, compared to Southern Florida.
Null Hypothesis: There is no difference in standard deviation of mercury levels between all lakes in Northern Florida and all lakes in Southern Florida.
Alternative Hypothesis: There is a difference in standard deviation of mercury levels between all lakes in Northern Florida and all lakes in Southern Florida.
We’ll apply the general hypothesis testing procedure, using standard deviation as our test statistic.
Procedure:
Randomly shuffle the locations of the lakes, so that any relationship between the location and mercury level occurs just by chance.
Calculate the difference in standard deviation between lakes in the two samples of the shuffled data.
Repeat steps 1 and 2 many (say 10,000) times, recording the difference in standard deviations each time.
Analyze the distribution of differences in standard deviations, simulated under the assumption that there is no difference in standard deviations between North and South. Check whether the value of the test statistic we observed in the sample is consistent with values simulated under this assumption.
R Code for Permutation Test
We’ll write a for loop to perform 10,000 permutations and record the difference in standard deviations for each simulation.
SDTab <- FloridaLakes %>% group_by(Location) %>% summarize(SD=sd(Mercury))
DiffSD <- SDTab$SD[2] - SDTab$SD[1]
## perform simulation
DiffSim <- rep(NA, 10000) ## vector to hold results
ShuffledLakes <- FloridaLakes ## create copy of dataset
for (i in 1:10000){
# randomly shuffle locations
ShuffledLakes$Location <- ShuffledLakes$Location[sample(1:nrow(ShuffledLakes))]
SDTabSim <- ShuffledLakes %>% group_by(Location) %>% summarize(SD=sd(Mercury))
DiffSim[i] <- SDTabSim$SD[2] - SDTabSim$SD[1] # record difference in SD for simulated data
}
NSLakes_SDSimResults <- data.frame(DiffSim) # save results in dataframe
The distribution of the simulated differences in standard deviation is shown below. Recall that these were simulated assuming that the null hypothesis (that there is no difference in standard deviation of mercury levels among all lakes in Northern Florida, compared to Southern Florida) is true.
The red lines represent differences as extreme as the -0.1142 that we saw in our sample.
NSLakes_SDSimResultsPlot <- ggplot(data=NSLakes_SDSimResults, aes(x=DiffSim)) +
  geom_histogram(fill="lightblue", color="white") +
  geom_vline(xintercept=c(DiffSD, -1*DiffSD), color="red") +
  xlab("Simulated Difference in SD's") +
  ylab("Frequency") +
  ggtitle("Distribution of Difference in SD under assumption of no relationship")
NSLakes_SDSimResultsPlot
We calculate the number of simulations (out of 10,000) resulting in a difference in standard deviations more extreme than -0.1142.
sum(abs(DiffSim) > abs(DiffSD))
[1] 614
p-value: Proportion of simulations (out of 10,000) resulting in a difference in standard deviations more extreme than -0.1142.
mean(abs(DiffSim) > abs(DiffSD))
[1] 0.0614
This p-value represents the probability of observing a difference in sample standard deviations as extreme as -0.1142 in samples of sizes 33 and 20 by chance, if, in fact, the standard deviation in mercury concentration is the same for lakes in Northern Florida as in Southern Florida.
Since the p-value is fairly small, it is unlikely that we would observe a difference in standard deviations as extreme as -0.1142 by chance. There is some evidence that lakes in Southern Florida exhibit more variability in mercury levels than lakes in Northern Florida (though the evidence is not as strong as it was when we were testing for a difference in means).
Note that we have avoided the fallacy of using 0.05 as a strict cutoff for rejecting the null hypothesis.
Although the difference in standard deviations is statistically discernible, it is hard to say whether it is practically meaningful. Without knowing a lot about mercury levels, and their impact on the ecosystem, it’s harder to tell whether an estimated difference in standard deviations of 0.11 ppm is meaningful or not. It would be good to consult a biologist before making any decisions based on these results.
3.6.4 Slope of Regression Line
In addition to the mercury levels of the Florida lakes, we have data on the pH level of each lake. pH measures the acidity of a lake, on a scale from 0 to 14, with 7 being neutral and lower values indicating more acidity. We plot mercury level against pH level for our sample of 53 lakes.
We can use the slope of the regression line, \(b_1\), to measure the strength of the relationship between mercury and pH. Based on our sample, for each one-unit increase in pH, mercury level is expected to decrease by 0.15 ppm.
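The fitted line referenced here can be obtained in the usual way (a sketch; the model object Lakes_M_pH is used again in the simulation code below):

```r
Lakes_M_pH <- lm(data=FloridaLakes, Mercury ~ pH) # fit line relating mercury to pH
Lakes_M_pH$coefficients # the second coefficient, b1, is the estimated slope
```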
If there were really no relationship, then the slope among all lakes would be 0. But, of course, we would not expect the slope in our sample to exactly match the slope for all lakes. Our question of interest is whether it is plausible that we could have randomly selected a sample with a slope as extreme as -0.15 by chance, when there is actually no relationship between mercury and pH levels among all lakes. In other words, could we plausibly have drawn the sample of 53 lakes shown in blue from a population like the one in red, shown below?
Key Question:
How likely is it that we would have observed a slope (i.e. a value of \(b_1\)) as extreme as -0.15 by chance, if there is really no relationship between mercury level and pH?
Null Hypothesis: Among all Florida lakes, there is no relationship between mercury level and pH.
Alternative Hypothesis: Among all Florida lakes, there is a relationship between mercury level and pH.
Procedure:
Randomly shuffle the pH values, so that any relationship between mercury and pH is due only to chance.
Fit a regression line to the shuffled data and record the slope of the regression line.
Repeat steps 1 and 2 many (say 10,000) times, recording the slope (i.e. value of \(b_1\)) each time.
Analyze the distribution of slopes, simulated under the assumption that there is no relationship between mercury and pH. Check whether the actual slope we observed is consistent with the simulation results.
We’ll illustrate the first three permutations.
First Permutation
ShuffledLakes <- FloridaLakes ## create copy of dataset
ShuffledLakes$pH <- ShuffledLakes$pH[sample(1:nrow(ShuffledLakes))]
The red line indicates the slope of the regression line fit to the shuffled data. The blue line indicates the regression line for the actual lakes in the sample, which has a slope of -0.15.
None of our three simulations resulted in a slope anywhere near as extreme as the -0.15 that we saw in the actual data. This seems to suggest that it is unlikely that we would have observed a slope as extreme as -0.15 if there is actually no relationship between mercury and pH among all lakes.
That said, we should repeat the simulation many more times to see whether getting a slope as extreme as -0.15 is plausible.
b1 <- Lakes_M_pH$coef[2] ## record value of b1 from actual data
## perform simulation
b1Sim <- rep(NA, 10000) ## vector to hold results
ShuffledLakes <- FloridaLakes ## create copy of dataset
for (i in 1:10000){
  # randomly shuffle pH values
  ShuffledLakes$pH <- ShuffledLakes$pH[sample(1:nrow(ShuffledLakes))]
  ShuffledLakes_M <- lm(data=ShuffledLakes, Mercury ~ pH) # fit model to shuffled data
  b1Sim[i] <- ShuffledLakes_M$coef[2] ## record b1 from shuffled model
}
Lakes_pHSimulationResults <- data.frame(b1Sim) # save results in dataframe
b1 <- Lakes_M_pH$coef[2] ## record value of b1 from actual data
Lakes_pHSimulationResultsPlot <- ggplot(data=Lakes_pHSimulationResults, aes(x=b1Sim)) +
  geom_histogram(fill="lightblue", color="white") +
  geom_vline(xintercept=c(b1, -1*b1), color="red") +
  xlab("Simulated Value of b1") + ylab("Frequency") +
  ggtitle("Distribution of b1 under assumption of no relationship")
Lakes_pHSimulationResultsPlot
p-value: Proportion of simulations resulting in value of \(b_1\) more extreme than -0.15
mean(abs(b1Sim) > abs(b1))
[1] 0
The p-value represents the probability of observing a slope as extreme or more extreme than -0.15 by chance when there is actually no relationship between mercury level and pH.
It is extremely unlikely that we would observe a value of \(b_1\) as extreme as -0.15 by chance, if there is really no relationship between mercury level and pH. In fact, this never happened in any of our 10,000 simulations!
There is very strong evidence of a relationship between mercury level and pH.
A low p-value tells us only that there is evidence of a relationship, not that it is practically meaningful. We have seen that for each one-unit increase in pH, mercury level is expected to decrease by 0.15 ppm on average, which seems like a pretty meaningful decrease, especially considering that mercury levels typically stay between 0 and 1.
We used the slope as our test statistic to measure the evidence of the relationship between the explanatory and response variables. In fact, we could have also used the correlation coefficient \(r\) as our test statistic, and we would have gotten the same p-value. Either slope or correlation may be used for a hypothesis test involving two quantitative variables, but we will use slope in this class.
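The equivalence between slope and correlation as test statistics can be sketched in R. The data below are synthetic stand-ins (the real FloridaLakes data are not needed for the point): because shuffling pH changes neither variable's standard deviation, and \(b_1 = r \cdot s_y / s_x\), the two statistics rank every permutation identically and so yield the same p-value.

```r
set.seed(1)
n <- 53
pH <- runif(n, 3, 9)                            # synthetic pH values
Mercury <- 1.5 - 0.15 * pH + rnorm(n, sd = 0.3) # synthetic mercury levels

b1_obs <- coef(lm(Mercury ~ pH))[2]  # observed slope
r_obs  <- cor(Mercury, pH)           # observed correlation

nsim <- 1000
b1_sim <- r_sim <- rep(NA, nsim)
for (i in 1:nsim) {
  pH_shuf <- sample(pH)              # shuffle pH values
  b1_sim[i] <- coef(lm(Mercury ~ pH_shuf))[2]
  r_sim[i]  <- cor(Mercury, pH_shuf)
}

p_b1 <- mean(abs(b1_sim) >= abs(b1_obs))  # p-value using slope
p_r  <- mean(abs(r_sim)  >= abs(r_obs))   # p-value using correlation
p_b1 == p_r                               # identical by construction
```

The same `sample()`-based shuffling drives both statistics, so whichever we report, the permutation test's conclusion is unchanged.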
3.6.5 F-Statistic
Recall when we examined the prices of houses in King County, WA, whose conditions were rated as either very good, good, or average or below. Suppose we want to test the hypotheses:
Null Hypothesis: There is no difference in average prices between houses of the three different conditions, among all houses in King County, WA.
Alternative Hypothesis: There is a difference in average prices between houses of the three different conditions, among all houses in King County, WA.
We notice differences in price between the groups. Surprisingly, houses in good condition cost, on average, more than 300 thousand dollars more than those in very good condition.
If we were only comparing two groups, we could use the difference in average price between them as a test statistic. But since we’re comparing three, we need a statistic that can measure the size of differences between all three groups. An F-statistic can do this, so we’ll use the F-statistic as our test statistic here.
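As a sketch of how such an F-statistic is computed, the code below uses made-up prices and hypothetical group sizes (the condition labels mirror the Houses data, but all numbers are synthetic). The F-statistic comes from comparing a full model with a separate mean per condition to a reduced model with a single overall mean, the same comparison the text uses.

```r
set.seed(7)
## hypothetical group sizes; prices in thousands of dollars (synthetic)
condition <- factor(rep(c("Average or below", "Good", "Very Good"),
                        times = c(60, 30, 10)))
price <- rnorm(length(condition), mean = 500, sd = 200)

M_full    <- lm(price ~ condition)  # full model: a separate mean per condition
M_reduced <- lm(price ~ 1)          # reduced model: one overall mean
Fstat <- anova(M_full, M_reduced)$F[2]  # F-statistic comparing the two models
Fstat
```

A larger F-statistic indicates that the variability between the group means is large relative to the variability within the groups; when the reduced model is the intercept-only model, this F matches the overall F-statistic reported by `summary(M_full)`.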
Our question of interest is “How likely is it to observe an F-statistic as extreme or more extreme than 0.605 if there is actually no difference in average price between houses of the three conditions?”
We’ll use a permutation-based hypothesis test to investigate this question.
Procedure:
Randomly shuffle the conditions of the houses, so that any relationship between condition and price is due only to chance.
Using the shuffled data, calculate an F-statistic for predicting price, comparing a full model that uses condition as an explanatory variable, to a reduced model with no explanatory variables.
Repeat steps 1 and 2 many (say 10,000) times, recording the F-statistic each time.
Analyze the distribution of F-statistics, simulated under the assumption that there is no relationship between condition and price. Check whether the actual F-statistic we observed is consistent with the simulation results.
We’ll illustrate the first three permutations.
First Permutation
ShuffledHouses <- Houses ## create copy of dataset
ShuffledHouses$condition <- ShuffledHouses$condition[sample(1:nrow(ShuffledHouses))]
We’ll simulate 10,000 permutations and record the F-statistic for each set of permuted data.
Fstat <- anova(M_House_Cond, M0_House)$F[2] ## record value of F-statistic from actual data
## perform simulation
FSim <- rep(NA, 10000) ## vector to hold results
ShuffledHouses <- Houses ## create copy of dataset
for (i in 1:10000){
  # randomly shuffle conditions
  ShuffledHouses$condition <- ShuffledHouses$condition[sample(1:nrow(ShuffledHouses))]
  ShuffledHouses_M1 <- lm(data=ShuffledHouses, price ~ condition) # fit full model to shuffled data
  ShuffledHouses_M0 <- lm(data=ShuffledHouses, price ~ 1) # fit reduced model to shuffled data
  FSim[i] <- anova(ShuffledHouses_M1, ShuffledHouses_M0)$F[2] ## record F from shuffled model
}
House_Cond_SimulationResults <- data.frame(FSim) # save results in dataframe
The distribution of the F-statistics is shown below. Recall that these are simulated under the assumption that there is no difference in average price between houses of the three different conditions, i.e. no relationship between price and condition.
The red line shows the location of the F-statistic we saw in our data (0.60). Since F-statistics cannot be negative, we don’t need to worry about finding an F-statistic as extreme in the opposite direction.
House_Cond_SimulationResults_Plot <- ggplot(data=House_Cond_SimulationResults, aes(x=FSim)) +
  geom_histogram(fill="lightblue", color="white") +
  geom_vline(xintercept=c(Fstat), color="red") +
  xlab("Simulated Value of F") + ylab("Frequency") +
  ggtitle("Distribution of F under assumption of no relationship")
House_Cond_SimulationResults_Plot
The F-statistic we observed in our actual data does not appear to be very extreme.
p-value: Proportion of simulations resulting in value of F more extreme than 0.60.
mean(FSim > Fstat)
[1] 0.5548
The p-value represents the probability of observing an F-statistic as extreme as 0.60 by chance, in samples of size 61, 30, and 9, if in fact there is no relationship between price and condition.
More than half of our simulations resulted in an F-statistic as extreme or more extreme than the one we saw in our actual data, even though the simulation was performed in a situation where there was no relationship between price and condition. Thus, it is very plausible that we would observe an F-statistic as extreme or more extreme than we saw in our data, even if there is no relationship between price and condition (or no difference in average price between the conditions), among all houses.
Since the p-value is large, we cannot reject the null hypothesis. We do not have evidence to say that average price differs between houses of the different condition types.
It is important to note that we are not saying that we believe the average price is the same for each condition. Recall that the average prices among the conditions in our sample differed by more than 300 thousand dollars! It’s just that given the size of our samples, and the amount of variability in our data, we cannot rule out the possibility that this difference occurred purely by chance.
3.7 Responsible Hypothesis Testing
While hypothesis tests are a powerful tool in statistics, they are also one that has been widely misused, to the detriment of scientific research. The harm caused by these misuses led the American Statistical Association to release a 2016 statement, intended to provide guidance and clarification to scientists who use hypothesis testing and p-values in their research.
The statement provides the following six principles for responsible use of hypothesis tests and p-values.
P-values can indicate how incompatible the data are with a specified statistical model.
P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
Proper inference requires full reporting and transparency.
A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
The statement provides important guidance for us to consider as we work with hypothesis testing in this class, as well as in future classes and potentially in our own research.
A hypothesis test can only tell us the strength of evidence against the null hypothesis. The absence of evidence against the null hypothesis should not be interpreted as evidence for the null hypothesis.
We should never say that the data support/prove/confirm the null hypothesis.
We can only say that the data do not provide evidence against the null hypothesis.
What to conclude from p-values and what not to:
A low p-value provides evidence against the null hypothesis. It suggests the test statistic we observed is inconsistent with the null hypothesis.
A low p-value does not tell us that the difference or relationship we observed is meaningful in a practical sense. Researchers should look at the size of the difference or strength of the relationship in the sample before deciding whether it merits being acted upon.
A high p-value means that the data could have plausibly been obtained when the null hypothesis is true. The test statistic we observed is consistent with what we would have expected to see when the null hypothesis is true, and thus we cannot rule out the null hypothesis.
A high p-value does not mean that the null hypothesis is true or probably true. A p-value can only tell us the strength of evidence against the null hypothesis, and should never be interpreted as support for the null hypothesis.
Just because our result is consistent with the null hypothesis does not mean that we should believe that null hypothesis is true. Lack of evidence against a claim does not necessarily mean that the claim is true.
In this scenario, we got a small p-value, but we should also be aware of what we would conclude if the p-value were large. Remember that the p-value only measures the strength of evidence against the null hypothesis. A large p-value means we lack evidence against the null hypothesis. This does not mean, however, that we have evidence supporting the null hypothesis.
A hypothesis test can be thought of as being analogous to a courtroom trial, where the null hypothesis is that the defendant did not commit the crime. Suppose that after each side presents evidence, the jury remains unsure whether the defendant committed the crime. Since the jury does not have enough evidence to be sure, they must, under United States law, find the defendant “not guilty.” This does not mean that the jury thinks the defendant is innocent, only that they do not have enough evidence to be sure the defendant is guilty. Similarly, in a hypothesis test, a large p-value indicates a lack of evidence against the null hypothesis, rather than evidence supporting it. As such, we should avoid statements suggesting we “support”, “accept”, or “believe” the null hypothesis, and simply state that we lack evidence against it.
Things to say when the p-value is large:
The data are consistent with the null hypothesis.
We do not have enough evidence against the null hypothesis.
We cannot reject the null hypothesis.
The null hypothesis is plausible.
Things NOT to say when the p-value is large:
The data support the null hypothesis.
The data provide evidence for the null hypothesis.
We accept the null hypothesis.
We conclude that the null hypothesis is true.
Thus, if we had obtained a large p-value in the comparison in mercury levels between Northern and Southern lakes, the appropriate conclusion would be
“We do not have evidence that the average mercury level differs between lakes in Northern Florida, compared to Southern Florida.”
Even if we got a large p-value it would be incorrect to say “There is no difference in average mercury levels between lakes in Northern Florida and Southern Florida.”
We would just be saying that, given the size of our sample and the amount of variability in the data, we cannot rule out the possibility of observing a difference like the one we did by chance, when there really is no difference.