Stat 255: Statistics for Data Science Notes
Chapter 6
Model Building