STAT 255 Notes
Preface
1
Exploratory Data Analysis
1.1
Getting Started in R
1.1.1
Previewing the Data
1.1.2
Modifying the Data
1.2
Data Visualization
1.2.1
Histogram
1.2.2
Density Plot
1.2.3
Boxplot
1.2.4
Violin Plot
1.2.5
Scatterplot
1.2.6
Bar Graph
1.2.7
Stacked and Side-by-Side Bar Graphs
1.2.8
Correlation Plot
1.2.9
Scatterplot Matrix
1.3
Summary Tables
1.3.1
Calculating Summary Statistics
1.3.2
Grouped Summaries
2
Introduction to Statistical Models
2.1
Fitting Models to Data
2.1.1
Terminology
2.1.2
Model with Quantitative Explanatory Variable
2.1.3
Model with Categorical Variable
2.1.4
Model with Multiple Explanatory Variables
2.1.5
Model with No Explanatory Variable
2.2
Variability Explained by a Model
2.2.1
Quantifying Variability
2.2.2
Total Variability
2.2.3
Residuals
2.2.4
Variability Explained by Sq. Ft. Model
2.2.5
Linear Correlation Coefficient
2.2.6
Variability Explained by Waterfront Model
2.2.7
Variability Explained by Multiple Regression Model
2.2.8
Summary: SST, SSR, SSM, \(R^2\)
2.2.9
\(R^2\) Visually
2.2.10
Model Comparison Summary
2.3
Models with Interaction
2.3.1
Definition of Interaction
2.3.2
Interaction Term
2.3.3
Interaction Models in R
2.3.4
\(R^2\) for Interaction Model
2.3.5
Considerations for Using Interactions
2.3.6
Interaction vs Correlation
2.4
Least Squares Estimation (LSE)
2.4.1
Estimating Regression Coefficients
2.4.2
Mathematics of LSE for SLR
2.4.3
LSE for Categorical Variable
2.4.4
LSE More Generally
2.5
ANalysis Of VAriance
2.5.1
Submodels
2.5.2
F-Statistics
2.5.3
Comparing 3 or More Categories
2.5.4
F-Statistic Illustration
2.5.5
Alternative F-Statistic Formula
3
Hypothesis Testing via Permutation
3.1
Test for Difference in Means
3.1.1
Mercury Levels in Florida Lakes
3.1.2
Model for Mercury Level
3.1.3
Hypotheses and Key Question
3.1.4
Permutation Test for Difference in Means
3.1.5
Five Permutations in R
3.1.6
R Code for Permutation Test
3.1.7
p-values
3.2
General Permutation Tests
3.2.1
Other Test Statistics
3.2.2
General Permutation Test Procedure
3.2.3
Difference in Standard Deviation
3.2.4
Permutation Test for Slope
3.2.5
F-Statistic
3.3
Responsible Hypothesis Testing
4
Bootstrap Interval Estimation
4.1
Sampling Distributions
4.1.1
Sampling From a Population
4.1.2
Confidence Intervals
4.2
Bootstrapping
4.2.1
Mercury Levels in Florida Lakes
4.2.2
Bootstrap Sampling
4.2.3
Bootstrap Samples of Lakes
4.2.4
Bootstrap Distribution
4.2.5
Bootstrap SE Confidence Interval
4.2.6
Bootstrap Distribution vs Sampling Distribution
4.3
Bootstrap Confidence Interval Example
4.3.1
Bootstrapping Other Statistics
4.3.2
CI for Mean
4.3.3
CI for Standard Deviation
4.3.4
CI for Median
4.3.5
CI for Difference in Means
4.3.6
CI for Regression Slope
4.3.7
CI for Regression Response
4.3.8
More CI’s in Regression
4.3.9
Bootstrapping Cautions
4.4
Estimating Standard Error
4.4.1
Standard Error vs Standard Deviation
4.4.2
Sample Size and Standard Error
4.4.3
Standard Error Formulas
4.4.4
One-Sample Mean Example
4.4.5
Difference in Means Example
4.4.6
Regression Example
4.4.7
Theory-Based Confidence Intervals
4.4.8
CI Method Comparison
5
Normal Error Regression Model
5.1
The Normal Error Regression Model
5.1.1
Example: Ice Cream Dispenser
5.1.2
Signal and Noise
5.1.3
Normal Distribution
5.1.4
Signal and Noise in Ice Cream Example
5.1.5
Normal Error Regression Model
5.1.6
Examples of Normal Error Regression Model
5.1.7
Implications of Normal Error Regression Model
5.1.8
Philosophical Question
5.2
Inference in Normal Error Regression Model
5.2.1
lm summary Output
5.2.2
t-distribution
5.2.3
Difference in Means Example
5.2.4
Simple Linear Regression Example
5.2.5
Multiple Regression Example
5.2.6
MR with Interaction Example
5.2.7
Limitations
5.3
F-Distributions
5.3.1
F-Distribution
5.3.2
House Condition Example
5.3.3
Interaction Example
5.4
Regression Model Assumptions
5.4.1
Regression Assumptions
5.4.2
Checking Model Assumptions
5.4.3
Summary of Checks for Model Assumptions
5.4.4
Example: N v S Lakes
5.4.5
Example: pH Model
5.4.6
Example: House Prices
5.5
Intervals for Expected Response
5.5.1
Parameter Values and Expected Responses
5.5.2
Estimation and Prediction
5.5.3
Estimation and Prediction in SLR
5.5.4
Intervals in R
5.5.5
SLR Calculations (Optional)
5.5.6
Car Price and Acceleration Time
5.5.7
Florida Lakes Est. and Pred.
5.6
Transformations
5.6.1
Cars Assumptions Check
5.6.2
Log Transformation
5.6.3
Log Transform for Car Prices
5.6.4
Log Model Predictions
5.6.5
Log Model Interpretations
5.6.6
Log Model CI for \(\beta_0\), \(\beta_1\)
5.6.7
Log Model CI for Expected Response
5.6.8
Log Model Prediction Interval
5.6.9
Confidence Interval Comparison
5.6.10
Prediction Interval Comparison
5.6.11
Log Model Visualization
5.6.12
Comments on Transformations
5.7
Case Studies
5.7.1
Flights from NY to CHI
5.7.2
Smoking During Pregnancy
5.7.3
Smoking During Pregnancy (cont)
5.7.4
Exam Scores
5.7.5
Simulating the Regression Effect
5.7.6
NFL Wins
5.8
Impact of Model Assumption Violations
6
Building Models for Interpretation
6.1
Model Building - SAT Scores
6.1.1
Modeling for Interpretation
6.1.2
SAT Scores Dataset
6.1.3
Research Question
6.1.4
Teacher Salary and SAT score
6.1.5
A Deeper Investigation
6.1.6
Student-to-Teacher Ratio
6.1.7
Multicollinearity
6.1.8
Check Model Assumptions
6.1.9
Quadratic Term
6.1.10
Account for Region?
6.1.11
Predictions and Intervals
6.2
Modeling Car Price
6.2.1
Model for Price of 2015 Cars
6.2.2
Acc. and Qrt. Mile Time
6.2.3
Adding Weight to Model
6.2.4
Adding More Variables
6.2.5
Check of Model Assumptions
6.2.6
Coefficients and Exponentiation
6.2.7
Confidence and Prediction Intervals
6.2.8
Model Building Summary
7
Predictive Modeling
7.1
Modeling for Prediction
7.1.1
Overview
7.1.2
Illustration of Predictive Modeling
7.1.3
Predicting New Data
7.1.4
Evaluating Predictions - RMSPE
7.1.5
Training Data Error
7.1.6
Graph of RMSPE
7.1.7
Best Model
7.1.8
Model Complexity, Training Error, and Test Error
7.2
Variance-Bias Tradeoff
7.2.1
What Contributes to Prediction Error?
7.2.2
Variance and Bias
7.2.3
Variance-Bias Tradeoff
7.2.4
Modeling for Prediction
7.2.5
Cross-Validation
7.2.6
Cross-Validation Illustration
7.2.7
CV in R
7.3
Ridge Regression
7.3.1
Complexity in Model Coefficients
7.3.2
Ridge Regression Penalty
7.3.3
Choosing \(\lambda\)
7.3.4
Ridge Regression on Housing Dataset
7.3.5
Ridge vs OLS
7.3.6
Lasso and Elastic Net
7.4
Decision Trees
7.4.1
Basics of Decision Trees
7.4.2
Partitioning in a Decision Tree
7.4.3
Next Splits
7.4.4
Recursive Partitioning
7.4.5
Model Complexity in Trees
7.4.6
Cross-Validation on Housing Data
7.4.7
Comparing OLS, Lasso, Ridge, and Tree
7.4.8
Random Forest
7.5
Regression Splines
7.5.1
Regression Splines
7.5.2
Two Models with High Bias
7.5.3
Cubic Splines
7.5.4
Predicting Test Data
7.5.5
Implementation of Splines
7.6
Summary and Comparison
7.6.1
Modeling with OLS
7.6.2
Ridge Regression with Housing Data
7.6.3
Decision Tree
7.6.4
Comparing Performance
7.6.5
Predictions on New Data
7.7
Ethical Considerations in Predictive Modeling
7.7.1
Assumptions in Predictive Models
7.7.2
Amazon Hiring Algorithm
7.7.3
Facial Recognition
7.7.4
Comments
7.7.5
Modeling for Prediction
8
Classification and Logistic Regression
Stat 255: Statistics for Data Science Notes
Chapter 8 Classification and Logistic Regression