A Guide on Data Analysis
Preface
How to cite these books
0.0.1
Volume 1: Foundations of Data Analysis
0.0.2
Volume 2: Regression Techniques for Data Analysis
0.0.3
Volume 3: Advanced Modeling and Data Challenges
0.0.4
Volume 4: Experimental Design
1
Introduction
1.1
General Recommendations
2
Prerequisites
2.1
Matrix Theory
2.1.1
Rank of a Matrix
2.1.2
Inverse of a Matrix
2.1.3
Definiteness of a Matrix
2.1.4
Matrix Calculus
2.1.5
Optimization in Scalar and Vector Spaces
2.1.6
Cholesky Decomposition
2.2
Probability Theory
2.2.1
Axioms and Theorems of Probability
2.2.2
Central Limit Theorem
2.2.3
Random Variable
2.2.4
Moment Generating Function
2.2.5
Moments
2.2.6
Skewness
2.2.7
Kurtosis
2.2.8
Distributions
3
Descriptive Statistics
4
Basic Statistical Inference
4.1
Hypothesis Testing Framework
4.1.1
Null and Alternative Hypotheses
4.1.2
Errors in Hypothesis Testing
4.1.3
The Role of Distributions in Hypothesis Testing
4.1.4
The Test Statistic
4.1.5
Critical Values and Rejection Regions
4.1.6
Visualizing Hypothesis Testing
4.2
Key Concepts and Definitions
4.2.1
Random Sample
4.2.2
Sample Statistics
4.2.3
Distribution of the Sample Mean
4.3
One-Sample Inference
4.3.1
For Single Mean
4.3.2
For Difference of Means, Independent Samples
4.3.3
For Difference of Means, Paired Samples
4.3.4
For Difference of Two Proportions
4.3.5
For Single Proportion
4.3.6
For Single Variance
4.3.7
Non-parametric Tests
4.4
Two-Sample Inference
4.4.1
For Means
4.4.2
For Variances
4.4.3
Power
4.4.4
Matched Pair Designs
4.4.5
Nonparametric Tests for Two Samples
4.5
Categorical Data Analysis
4.5.1
Association Tests
4.5.2
Ordinal Association
4.5.3
Ordinal Trend
4.6
Divergence Metrics and Tests for Comparing Distributions
4.6.1
Kolmogorov-Smirnov Test
4.6.2
Anderson-Darling Test
4.6.3
Chi-Square Goodness-of-Fit Test
4.6.4
Cramér-von Mises Test
4.6.5
Kullback-Leibler Divergence
4.6.6
Jensen-Shannon Divergence
4.6.7
Hellinger Distance
4.6.8
Bhattacharyya Distance
4.6.9
Wasserstein Distance
4.6.10
Energy Distance
4.6.11
Total Variation Distance
4.6.12
Summary
II. REGRESSION
5
Linear Regression
5.1
Ordinary Least Squares
5.1.1
Simple Regression (Basic) Model
6
Non-Linear Regression
6.1
Inference
6.1.1
Linear Functions of the Parameters
6.1.2
Nonlinear Functions of Parameters
6.2
Non-linear Least Squares Estimation
6.2.1
Iterative Optimization
7
Generalized Linear Models
8
Linear Mixed Models
9
Nonlinear and Generalized Linear Mixed Models
10
Nonparametric Regression
III. RAMIFICATIONS
11
Data
11.1
Data Types
11.1.1
Qualitative vs. Quantitative Data
11.1.2
Other Ways to Classify Data
11.1.3
Data by Observational Structure Over Time
11.2
Cross-Sectional Data
11.3
Time Series Data
11.3.1
Statistical Properties of Time Series Models
11.3.2
Common Time Series Processes
11.3.3
Deterministic Time Trends
11.3.4
Violations of Exogeneity in Time Series Models
11.3.5
Consequences of Exogeneity Violations
11.3.6
Highly Persistent Data
11.3.7
Unit Root Testing
11.3.8
Newey-West Standard Errors
11.4
Repeated Cross-Sectional Data
11.4.1
Key Characteristics
11.4.2
Statistical Modeling for Repeated Cross-Sections
11.4.3
Advantages of Repeated Cross-Sectional Data
11.4.4
Disadvantages of Repeated Cross-Sectional Data
11.5
Panel Data
11.5.1
Advantages of Panel Data
11.5.2
Disadvantages of Panel Data
11.5.3
Sources of Variation in Panel Data
11.5.4
Pooled OLS Estimator
11.5.5
Individual-Specific Effects Model
11.5.6
Random Effects Estimator
11.5.7
Fixed Effects Estimator
11.5.8
Tests for Assumptions in Panel Data Analysis
11.5.9
Model Selection in Panel Data
11.5.10
Alternative Estimators
11.5.11
Application
11.6
Choosing the Right Type of Data
11.7
Data Quality and Ethical Considerations
12
Variable Transformation
13
Imputation (Missing Data)
13.1
Introduction to Missing Data
13.1.1
Types of Imputation
13.1.2
When and Why to Use Imputation
13.1.3
Importance of Missing Data Treatment in Statistical Modeling
13.1.4
Prevalence of Missing Data Across Domains
13.1.5
Practical Considerations for Imputation
13.2
Theoretical Foundations of Missing Data
13.2.1
Definition and Classification of Missing Data
13.2.2
Missing Data Mechanisms
13.2.3
Relationship Between Mechanisms and Ignorability
13.3
Diagnosing the Missing Data Mechanism
13.3.1
Descriptive Methods
13.3.2
Statistical Tests for Missing Data Mechanisms
13.3.3
Assessing MAR and MNAR
13.4
Methods for Handling Missing Data
13.4.1
Basic Methods
13.4.2
Single Imputation Techniques
13.4.3
Machine Learning and Modern Approaches
13.4.4
Multiple Imputation
13.5
Evaluation of Imputation Methods
13.5.1
Statistical Metrics for Assessing Imputation Quality
13.5.2
Bias-Variance Tradeoff in Imputation
13.5.3
Sensitivity Analysis
13.5.4
Validation Using Simulated Data and Real-World Case Studies
13.6
Criteria for Choosing an Effective Approach
13.7
Challenges and Ethical Considerations
13.7.1
Challenges in High-Dimensional Data
13.7.2
Missing Data in Big Data Contexts
13.7.3
Ethical Concerns
13.8
Emerging Trends in Missing Data Handling
13.8.1
Advances in Neural Network Approaches
13.8.2
Integration with Reinforcement Learning
13.8.3
Synthetic Data Generation for Missing Data
13.8.4
Federated Learning and Privacy-Preserving Imputation
13.8.5
Imputation in Streaming and Online Data Environments
13.9
Application of Imputation
13.9.1
Visualizing Missing Data
13.9.2
How Many Imputations?
13.9.3
Generating Missing Data for Demonstration
13.9.4
Imputation with Mean, Median, and Mode
13.9.5
K-Nearest Neighbors (KNN) Imputation
13.9.6
Imputation with Decision Trees (rpart)
13.9.7
MICE (Multivariate Imputation via Chained Equations)
13.9.8
Amelia
13.9.9
missForest
13.9.10
Hmisc
13.9.11
mi
14
Model Specification Tests
15
Variable Selection
16
Hypothesis Testing
17
Marginal Effects
18
Moderation
19
Mediation
20
Prediction and Estimation
IV. CAUSAL INFERENCE
21
Causal Inference
21.1
The Formal Notation of Causality
21.2
Simpson’s Paradox
21.2.1
Comparison between Simpson’s Paradox and Omitted Variable Bias
21.2.2
Illustrating Simpson’s Paradox: Marketing Campaign Success Rates
21.2.3
Why Does This Happen?
21.2.4
How Does Causal Inference Solve This?
21.2.5
Correcting Simpson’s Paradox with Regression Adjustment
21.2.6
Key Takeaways
21.3
Experimental vs. Quasi-Experimental Designs
21.3.1
Criticisms of Quasi-Experimental Designs
21.4
Hierarchical Ordering of Causal Tools
21.5
Types of Validity in Research
21.5.1
Measurement Validity
21.5.2
Construct Validity
21.5.3
Criterion Validity
21.5.4
Internal Validity
21.5.5
External Validity
21.5.6
Ecological Validity
21.5.7
Statistical Conclusion Validity
21.5.8
Putting It All Together
21.6
Types of Subjects in a Treatment Setting
21.6.1
Non-Switchers
21.6.2
Switchers
21.6.3
Classification of Individuals Based on Treatment Assignment
21.7
Types of Treatment Effects
21.7.1
Average Treatment Effect
21.7.2
Conditional Average Treatment Effect
21.7.3
Intention-to-Treat Effect
21.7.4
Local Average Treatment Effects
21.7.5
Population vs. Sample Average Treatment Effects
21.7.6
Average Treatment Effects on the Treated and Control
21.7.7
Quantile Average Treatment Effects
21.7.8
Log-Odds Treatment Effects for Binary Outcomes
21.7.9
Summary Table: Treatment Effect Estimands
A. EXPERIMENTAL DESIGN
22
Experimental Design
22.1
Principles of Experimental Design
22.2
The Gold Standard: Randomized Controlled Trials
22.3
Selection Problem
22.3.1
The Observed Difference in Outcomes
22.3.2
Eliminating Selection Bias with Random Assignment
22.3.3
Another Representation Under Regression
22.4
Classical Experimental Designs
22.4.1
Completely Randomized Design
22.4.2
Randomized Block Design
22.4.3
Factorial Design
22.4.4
Crossover Design
22.4.5
Split-Plot Design
22.4.6
Latin Square Design
22.5
Advanced Experimental Designs
22.5.1
Semi-Random Experiments
22.5.2
Re-Randomization
22.5.3
Two-Stage Randomized Experiments
22.5.4
Two-Stage Randomized Experiments with Interference and Noncompliance
22.5.5
Switchback Experiments with Surrogate Variables
22.6
Emerging Research
22.6.1
Covariate Balancing in Online A/B Testing: The Pigeonhole Design
22.6.2
Handling Zero-Valued Outcomes
23
Sampling
24
Analysis of Variance
24.1
Completely Randomized Design
24.1.1
Single-Factor Fixed Effects ANOVA
24.1.2
Single Factor Random Effects ANOVA
24.1.3
Two-Factor Fixed Effects ANOVA
24.1.4
Two-Way Random Effects ANOVA
24.1.5
Two-Way Mixed Effects ANOVA
24.2
Nonparametric ANOVA
24.2.1
Kruskal-Wallis Test (One-Way Nonparametric ANOVA)
24.2.2
Friedman Test (Nonparametric Two-Way ANOVA)
24.3
Randomized Block Designs
24.4
Nested Designs
24.4.1
Two-Factor Nested Design
24.4.2
Unbalanced Nested Two-Factor Designs
24.4.3
Random Factor Effects
24.5
Sample Size Planning for ANOVA
24.5.1
Balanced Designs
24.5.2
Single Factor Studies
24.5.3
Multi-Factor Studies
24.5.4
Procedure for Sample Size Selection
24.5.5
Randomized Block Experiments
24.6
Single Factor Covariance Model
24.6.1
Statistical Inference for Treatment Effects
24.6.2
Testing for Parallel Slopes
24.6.3
Adjusted Means
25
Multivariate Methods
25.1
Basic Understanding
25.1.1
Multivariate Random Vectors
25.1.2
Covariance Matrix
25.1.3
Equalities in Expectation and Variance
25.1.4
Multivariate Normal Distribution
25.1.5
Test of Multivariate Normality
25.1.6
Mean Vector Inference
25.1.7
General Hypothesis Testing
25.2
Multivariate Analysis of Variance
25.2.1
One-Way MANOVA
25.2.2
Profile Analysis
25.3
Statistical Test Selection for Comparing Means
B. QUASI-EXPERIMENTAL DESIGN
26
Quasi-Experimental Methods
26.1
Identification Strategy in Quasi-Experiments
26.2
Establishing Mechanisms
26.2.1
Mediation Analysis: Explaining the Causal Pathway
26.2.2
Moderation Analysis: For Whom or Under What Conditions?
26.3
Robustness Checks
26.4
Limitations of Quasi-Experiments
26.4.1
What Are the Identification Assumptions?
26.4.2
What Are the Threats to Validity?
26.4.3
How Do You Address These Threats?
26.4.4
What Are the Implications for External Validity and Future Research?
26.5
Assumptions for Identifying Treatment Effects
26.5.1
Stable Unit Treatment Value Assumption
26.5.2
Conditional Ignorability Assumption
26.5.3
Overlap (Positivity) Assumption
26.6
Natural Experiments
26.6.1
The Problem of Reusing Natural Experiments
26.6.2
Statistical Challenges in Reusing Natural Experiments
26.6.3
Solutions: Multiple Testing Corrections
26.7
Design vs. Model-Based Approaches
26.7.1
Design-Based Perspective
26.7.2
Model-Based Perspective
26.7.3
Placing Methods Along a Spectrum
27
Regression Discontinuity
27.1
Conceptual Framework
27.2
Types of Regression Discontinuity Designs
27.2.1
Assumptions for RD Validity
27.3
Model Estimation Strategies
27.3.1
Parametric Models: Polynomial Regression
27.3.2
Nonparametric Models: Local Regression
27.4
Formal Definition
27.4.1
Identification Assumptions
27.5
Estimation and Inference
27.5.1
Local Randomization-Based Approach
27.5.2
Continuity-Based Approach
27.6
Specification Checks
27.6.1
Balance Checks
27.6.2
Sorting, Bunching, and Manipulation
27.6.3
Placebo Tests
27.6.4
Sensitivity to Bandwidth Choice
27.6.5
Assessing Sensitivity
27.6.6
Manipulation-Robust Regression Discontinuity Bounds
27.7
Fuzzy Regression Discontinuity Design
27.7.1
Compliance Types
27.7.2
Estimating the Local Average Treatment Effect
27.7.3
Equivalent Representation Using Expectations
27.7.4
Estimation Strategies
27.7.5
Practical Considerations
27.7.6
Steps for Fuzzy RD
27.8
Sharp Regression Discontinuity Design
27.8.1
Assumptions for Identification
27.8.2
Estimating the Local Average Treatment Effect
27.8.3
Estimation Methods
27.8.4
Steps for Sharp RD
27.9
Regression Kink Design
27.9.1
Identification in Sharp Regression Kink Design
27.9.2
Identification in Fuzzy Regression Kink Design
27.9.3
Estimation of RKD Effects
27.9.4
Robustness Checks
27.10
Multi-Cutoff Regression Discontinuity Design
27.10.1
Identification
27.10.2
Key Assumptions
27.10.3
Estimation Approaches
27.10.4
Robustness Checks
27.11
Multi-Score Regression Discontinuity Design
27.11.1
General Framework
27.11.2
Identification
27.11.3
Key Assumptions
27.11.4
Estimation Approaches
27.11.5
Robustness Checks
27.12
Evaluation of a Regression Discontinuity Design
27.12.1
Graphical and Formal Evidence
27.12.2
Functional Form of the Running Variable
27.12.3
Bandwidth Selection
27.12.4
Addressing Potential Confounders
27.12.5
External Validity in RD
27.13
Applications of RD Designs
27.13.1
Applications in Marketing
27.13.2
R Packages for RD Estimation
27.13.3
Example of Regression Discontinuity in Education
27.13.4
Example of Occupational Licensing and Market Efficiency
28
Temporal Discontinuity Designs
28.1
Regression Discontinuity in Time
28.1.1
Estimation and Model Selection
28.1.2
Strengths of RDiT
28.1.3
Limitations and Challenges of RDiT
28.1.4
Recommendations for Robustness Checks
28.1.5
Applications of RDiT
28.1.6
Empirical Example
28.2
Interrupted Time Series
28.2.1
Advantages of ITS
28.2.2
Limitations of ITS
28.2.3
Empirical Example
28.3
Combining both RDiT and ITS
28.3.1
Augment an ITS Model with a Local Discontinuity Term
28.3.2
Two-Stage (or Multi-Stage) Modeling
28.3.3
Hierarchical or Multi-Level Modeling
28.3.4
Empirical Example
28.3.5
Practical Guidance
28.4
Case-Crossover Study Design
28.4.1
Mathematical Foundations
28.4.2
Selection of Control Periods
28.4.3
Assumptions
29
Synthetic Difference-in-Differences
29.1
Understanding
29.1.1
Steps in SDID Estimation
29.1.2
Comparison of Methods
29.1.3
Why Use Weights?
29.1.4
Benefits of Localization in SDID
29.1.5
Designing SDID Weights
29.1.6
How SDID Enhances DID’s Plausibility
29.1.7
Choosing SDID Weights
29.1.8
Accounting for Time-Varying Covariates in Weight Estimation
29.2
Application
29.2.1
Block Treatment
29.2.2
Staggered Adoption
30
Difference-in-Differences
30.1
Empirical Studies
30.1.1
Applications of DID in Marketing
30.1.2
Applications of DID in Economics
30.2
Visualization
30.2.1
Data check
30.2.2
Treatment Assignment Heatmap
30.2.3
Raw Outcome Trajectories
30.2.4
Event-time Averages
30.3
Simple Difference-in-Differences
30.3.1
Basic Setup of DID
30.3.2
Extensions of DID
30.3.3
Goals of DID
30.4
Empirical Research Walkthrough
30.4.1
Example: The Unintended Consequences of “Ban the Box” Policies
30.4.2
Example: Minimum Wage and Employment
30.4.3
Example: The Effects of Grade Policies on Major Choice
30.5
One Difference
30.6
Two-Way Fixed Effects
30.6.1
Canonical TWFE Model
30.6.2
Limitations of TWFE
30.6.3
Diagnosing and Addressing Bias in TWFE
30.6.4
Remedies for TWFE’s Shortcomings
30.6.5
Best Practices and Recommendations
30.7
Multiple Periods and Variation in Treatment Timing
30.7.1
Staggered Difference-in-Differences
30.8
Modern Estimators for Staggered Adoption
30.8.1
Group-Time Average Treatment Effects (Callaway and Sant’Anna 2021)
30.8.2
Cohort Average Treatment Effects (L. Sun and Abraham 2021)
30.8.3
Stacked Difference-in-Differences
30.8.4
Panel Match DiD Estimator with In-and-Out Treatment Conditions
30.8.5
Counterfactual Estimators
30.8.6
Matrix Completion Estimator
30.8.7
Two-stage DiD Estimator
30.8.8
Reshaped Inverse Probability Weighting - TWFE Estimator
30.8.9
Dynamic Treatment Effect Estimation with Interactive Fixed Effects and Short Panels
30.8.10
Switching Difference-in-Differences Estimator (Clément De Chaisemartin and d’Haultfoeuille 2020)
30.8.11
Augmented/Forward DID
30.8.12
Doubly Robust Difference-in-Differences Estimators
30.8.13
Nonlinear Difference-in-Differences
30.9
Multiple Treatments
30.9.1
Multiple Treatment Groups: Model Specification
30.9.2
Understanding the Control Group in Multiple Treatment DiD
30.9.3
Alternative Approaches: Separate Regressions vs. One Model
30.9.4
Handling Treatment Intensity
30.9.5
Considerations When Individuals Can Move Between Treatment Groups
30.9.6
Parallel Trends Assumption in Multiple-Treatment DiD
30.10
Mediation Under DiD
30.10.1
Mediation Model in DiD
30.10.2
Interpreting the Results
30.10.3
Challenges in Mediation Analysis for DiD
30.10.4
Alternative Approach: Instrumental Variables for Mediation
30.11
Assumptions
30.11.1
Prior Parallel Trends Test
30.11.2
Placebo Test
30.12
Robustness Checks
30.12.1
Robustness Checks to Strengthen Causal Interpretation
30.12.2
Best Practices for Reliable DiD Implementation
30.13
Concerns in DID
30.13.1
Limitations and Common Issues
30.13.2
Matching Methods in DID
30.13.3
Control Variables in DID
30.13.4
DID for Count Data: Fixed-Effects Poisson Model
30.13.5
Handling Zero-Valued Outcomes in DID
30.13.6
Standard Errors
30.13.7
Partial Identification
31
Changes-in-Changes
31.1
Key Concepts
31.2
Estimating QTT with CiC
31.3
Application
31.3.1
ECIC package
31.3.2
QTE package
32
Synthetic Control
32.1
Marketing Applications
32.2
Key Features of SCM
32.3
Advantages of SCM
32.3.1
Compared to DiD
32.3.2
Compared to Linear Regression
32.3.3
Additional Advantages
32.4
Disadvantages of SCM
32.5
Assumptions
32.6
Estimation
32.6.1
Constructing the Synthetic Control
32.6.2
Penalized Synthetic Control
32.7
Theoretical Considerations
32.8
Inference in SCM
32.8.1
Permutation (Placebo) Inference
32.8.2
One-Sided Inference
32.9
Augmented Synthetic Control Method
32.10
Synthetic Control with Staggered Adoption
32.10.1
Partially Pooled Synthetic Control
32.11
Generalized Synthetic Control
32.11.1
The Problem with Traditional Methods
32.11.2
Generalized Synthetic Control Model
32.11.3
Identification and Estimation
32.11.4
Bootstrap Procedure for Standard Errors
32.12
Bayesian Synthetic Control
32.12.1
Bayesian Causal Inference Framework
32.12.2
Bayesian Dynamic Multilevel Factor Model
32.12.3
Bayesian Sparse Synthetic Control
32.12.4
Bayesian Inference and MCMC Estimation
32.13
Using Multiple Outcomes to Improve the Synthetic Control Method
32.13.1
Standard Synthetic Control Method
32.13.2
Using Multiple Outcomes for Bias Reduction
32.13.3
Estimation Methods
32.13.4
Empirical Application: Flint Water Crisis
32.14
Applications
32.14.1
Synthetic Control Estimation
32.14.2
The Basque Country Policy Change
32.14.3
Micro-Synthetic Control with microsynth
33
Event Studies
33.1
Review of Event Studies Across Disciplines
33.1.1
Finance Applications
33.1.2
Management Applications
33.1.3
Marketing Applications
33.2
Key Assumptions
33.3
Steps for Conducting an Event Study
33.3.1
Step 1: Event Identification
33.3.2
Step 2: Define the Event and Estimation Windows
33.3.3
Step 3: Compute Normal vs. Abnormal Returns
33.3.4
Step 4: Compute Cumulative Abnormal Returns
33.3.5
Step 5: Statistical Tests for Significance
33.4
Event Studies in Marketing
33.4.1
Definition
33.4.2
When Can Marketing Events Affect Non-Operating Assets or Debt?
33.4.3
Calculating the Leverage Effect
33.4.4
Computing Leverage Effect from Compustat Data
33.5
Economic Significance
33.6
Testing in Event Studies
33.6.1
Statistical Power in Event Studies
33.6.2
Parametric Tests
33.6.3
Non-Parametric Tests
33.7
Sample in Event Studies
33.8
Confounders in Event Studies
33.8.1
Types of Confounding Events
33.8.2
Should We Exclude Confounded Observations?
33.8.3
Simulation Study: Should We Exclude Correlated and Uncorrelated Events?
33.9
Biases in Event Studies
33.9.1
Timing Bias: Different Market Closing Times
33.9.2
Upward Bias in Cumulative Abnormal Returns
33.9.3
Cross-Sectional Dependence Bias
33.9.4
Sample Selection Bias
33.9.5
Corrections for Sample Selection Bias
33.10
Long-run Event Studies
33.10.1
Buy-and-Hold Abnormal Returns (BHAR)
33.10.2
Long-term Cumulative Abnormal Returns (LCARs)
33.10.3
Calendar-time Portfolio Abnormal Returns (CTARs)
33.11
Aggregation
33.11.1
Over Time
33.11.2
Across Firms and Over Time
33.11.3
Statistical Tests
33.12
Heterogeneity in the Event Effect
33.12.1
Common Variables Affecting CAR in Marketing and Finance
33.13
Expected Return Calculation
33.13.1
Statistical Models for Expected Returns
33.13.2
Economic Models for Expected Returns
33.14
Application of Event Study
33.14.1
Sorting Portfolios for Expected Returns
33.14.2
erer Package
33.14.3
Eventus
34
Instrumental Variables
34.1
Challenges with Instrumental Variables
34.2
Framework for Instrumental Variables
34.2.1
Constant-Treatment-Effect Model
34.2.2
Instrumental Variable Solution
34.2.3
Heterogeneous Treatment Effects and the LATE Framework
34.2.4
Assumptions for LATE Identification
34.2.5
Local Average Treatment Effect Theorem
34.2.6
IV in Randomized Trials (Noncompliance)
34.3
Estimation
34.3.1
Two-Stage Least Squares Estimation
34.3.2
IV-GMM
34.3.3
Limited Information Maximum Likelihood
34.3.4
Jackknife IV
34.3.5
Control Function Approach
34.3.6
Fuller and Bias-Reduced IV
34.4
Asymptotic Properties of the IV Estimator
34.4.1
Consistency
34.4.2
Asymptotic Normality
34.4.3
Asymptotic Efficiency
34.5
Inference
34.5.1
Weak Instruments Problem
34.5.2
Solutions and Approaches for Valid Inference
34.5.3
Anderson-Rubin Approach
34.5.4
tF Procedure
34.5.5
AK Approach
34.6
Testing Assumptions
34.6.1
Relevance Assumption
34.6.2
Independence (Unconfoundedness)
34.6.3
Monotonicity Assumption
34.6.4
Homogeneous Treatment Effects (Optional)
34.6.5
Linearity and Additivity
34.6.6
Instrument Exogeneity (Exclusion Restriction)
34.6.7
Exogeneity Assumption
34.7
Cautions in IV
34.7.1
Negative \(R^2\) in IV
34.7.2
Many-Instruments Bias
34.7.3
Heterogeneous Effects in IV Estimation
34.7.4
Zero-Valued Outcomes
34.8
Types of IV
34.8.1
Treatment Intensity
34.8.2
Decision-Maker IV
34.8.3
Proxy Variables
35
Matching Methods
35.1
Introduction and Motivation
35.1.1
Why Match?
35.1.2
Matching as “Pruning”
35.1.3
Matching with DiD
35.2
Key Assumptions
35.3
Framework for Generalization
35.4
Steps for Matching
35.4.1
Step 1: Define “Closeness” (Distance Metrics)
35.4.2
Step 2: Matching Algorithms
35.4.3
Step 3: Diagnosing Match Quality
35.4.4
Step 4: Estimating Treatment Effects
35.5
Special Considerations
35.6
Choosing a Matching Strategy
35.6.1
Based on Estimand
35.6.2
Based on Diagnostics
35.6.3
Selection Criteria
35.7
Matching vs. Regression
35.7.1
Matching Estimand
35.7.2
Regression Estimand
35.7.3
Interpretation: Weighting Differences
35.8
Software and Practical Implementation
35.9
Selection on Observables
35.9.1
Matching with MatchIt
35.9.2
Reporting Standards
35.9.3
Optimization-Based Matching via designmatch
35.9.4
MatchingFrontier
35.9.5
Propensity Scores
35.9.6
Mahalanobis Distance Matching
35.9.7
Coarsened Exact Matching (CEM)
35.9.8
Genetic Matching
35.9.9
Entropy Balancing
35.9.10
Matching for High-Dimensional Data
35.9.11
Matching for Multiple Treatments
35.9.12
Matching for Multi-Level Treatments
35.9.13
Matching for Repeated Treatments (Time-Varying Treatments)
35.10
Selection on Unobservables
35.10.1
Rosenbaum Bounds
35.10.2
Relative Correlation Restrictions
35.10.3
Coefficient-Stability Bounds
C. OTHER CONCERNS
36
Endogeneity
36.1
Endogenous Treatment
36.1.1
Measurement Errors
36.1.2
Simultaneity
36.1.3
Reverse Causality
36.1.4
Omitted Variable Bias
36.2
Endogenous Sample Selection
36.2.1
Unifying Model Frameworks
36.2.2
Estimation Methods
36.2.3
Theoretical Connections
36.2.4
Tobit-2: Heckman’s Sample Selection Model
36.2.5
Tobit-5: Switching Regression Model
36.2.6
Pattern-Mixture Models
37
Other Biases
37.1
Aggregation Bias
37.1.1
Simpson’s Paradox
37.2
Contamination Bias
37.3
Survivorship Bias
37.4
Publication Bias
37.5
p-Hacking
37.5.1
Theoretical Signatures of p-Hacking and Selection
37.5.2
Method Families
37.5.3
Mathematical Details and Assumptions
37.5.4
Schools of Thought and Notable Debates
37.5.5
Practical Guidance for Applied Analysts
37.5.6
Limitations and Open Problems
38
Directed Acyclic Graphs
38.1
Basic Notation and Graph Structures
38.2
Rule of Thumb for Causal Inference
38.3
Example DAG
38.4
Causal Discovery
38.5
39
Controls
39.1
Bad Controls
39.1.1
M-bias
39.1.2
Bias Amplification
39.1.3
Overcontrol Bias
39.1.4
Selection Bias
39.1.5
Case-Control Bias
39.1.6
Summary
39.2
Good Controls
39.2.1
Omitted Variable Bias Correction
39.2.2
Omitted Variable Bias in Mediation Correction
39.3
Neutral Controls
39.3.1
Good Predictive Controls
39.3.2
Good Selection Bias
39.3.3
Bad Predictive Controls
39.3.4
Bad Selection Bias
39.3.5
Summary Table: Predictive vs. Causal Utility of Controls
39.4
Choosing Controls
39.4.1
Step 1: Use a Causal Diagram (DAG)
39.4.2
Step 2: Use Algorithmic Tools
39.4.3
Step 3: Theoretical Principles
39.4.4
Step 4: Consider Sensitivity Analysis
39.4.5
Step 5: Know When Not to Control
39.4.6
Summary: Control Selection Pipeline
V. MISCELLANEOUS
40
Reporting Your Analysis
40.1
Recommended Structure
40.1.1
Phase 1: Exploratory Data Analysis (EDA)
40.1.2
Phase 2: Model Selection and Specification
40.1.3
Phase 3: Model Fitting and Diagnostic Assessment
40.1.4
Phase 4: Inference and Prediction
40.1.5
Phase 5: Conclusions and Recommendations
40.2
One Summary Table (Packages)
40.3
Exploratory Analysis
40.4
Model
40.4.1
Assumptions
40.4.2
Why this model?
40.4.3
Considerations
40.4.4
Model Fit
40.4.5
Cluster-Robust Standard Errors
40.4.6
Model to Equation
40.5
Model Comparison
40.6
Changes in an Estimate
40.6.1
Coefficient Uncertainty and Distribution
40.7
Descriptive Tables
40.7.1
Export APA theme (flextable)
40.8
Visualizations & Plots
40.9
One-Table Summary
40.10
Inference / Prediction
40.11
Appendix: Reproducible Snippets
41
Exploratory Data Analysis
41.1
Data Report
41.2
Feature Engineering
41.3
Missing Data
41.4
Error Identification
41.5
Summary statistics
41.6
Not so code-y process
41.6.1
Quick and dirty way to look at your data
41.6.2
Code generation and wrangling (visual)
41.7
Shiny-app based Tableau style
41.8
Customize your daily/automatic report
41.8.1
Appendix: Small “gotchas” to keep in mind
42
Sensitivity Analysis and Robustness Checks
42.1
The Philosophy of Robustness
42.2
Specification Curve Analysis
42.2.1
Conceptual Foundation
42.2.2
The starbility Package
42.2.3
Advanced Specification Curve Techniques
42.2.4
The specr Package
42.2.5
The rdfanalysis Package
42.3
Coefficient Stability
42.3.1
Theoretical Foundation: The Oster (2019) Approach
42.3.2
The robomit Package
42.3.3
The mplot Package for Graphical Model Stability
42.4
Quantifying Omitted Variable Bias
42.4.1
The konfound Package
42.4.2
Visualizing Sensitivity: The Threshold Plot
42.4.3
The Correlation Plot
42.4.4
Konfound for Model Objects
42.5
Advanced Topics in Sensitivity Analysis
42.5.1
Sensitivity to Outliers and Influential Observations
42.5.2
Sensitivity to Measurement Error
42.6
Reporting Sensitivity Analysis Results
42.6.1
Best Practices for Presentation
42.7
Conclusion and Recommendations
43
Replication and Synthetic Data
43.1
The Replication Standard
43.1.1
Solutions for Empirical Replication
43.1.2
Free Data Repositories
43.1.3
Exceptions to Replication
43.1.4
Replication Landscape
43.2
Synthetic Data
43.2.1
Benefits of Synthetic Data
43.2.2
Concerns and Limitations
43.2.3
Further Insights on Synthetic Data
43.2.4
Generating Synthetic Data
43.3
Application
43.3.1
Original Dataset
43.3.2
Restricted Dataset
43.3.3
Synthpop
44
High-Performance Computing
44.1
Best Practices for HPC in Data Analysis
44.2
Example Workflow in R
44.3
Recommendations
44.4
Demonstration
APPENDIX
A
Appendix
A.1
Git
A.1.1
Basic Setup
A.1.2
Creating a Repository
A.1.3
Tracking Changes
A.1.4
Viewing History and Changes
A.1.5
Ignoring Files
A.1.6
Remote Repositories
A.1.7
Collaboration
A.1.8
Branching and Merging
A.1.9
Handling Conflicts
A.1.10
Licensing
A.1.11
Citing Repositories
A.1.12
Hosting and Legal Considerations
A.2
Short-cut
A.3
Function short-cut
A.4
Citation
A.5
Install All Necessary Packages on Your Local Machine
A.5.1
Step 1: Export Installed Packages from Your Current Session
A.5.2
Step 2: Install Packages on a New Machine
B
Bookdown cheat sheet
B.1
Operation
B.2
Math Expression / Syntax
B.2.1
Statistics Notation
B.3
Table
References