Budding biostatisticians who are curious and eager to learn often ask me for advice on a developmental roadmap.
Here is what I think you need to become a great applied biostatistician, which in itself is more about the journey than the destination.
Here is a list of topics including books, loosely ordered by conceptual difficulty.
All book links are Amazon affiliate links and help support biostatistics.ca. Thank you!
Foundations of Statistics
It is important to learn the foundational concepts of statistics and probability theory. I’ve also listed some mathematics, since a lot of advanced concepts rely on a solid understanding of linear algebra and calculus.
- Intro to statistics, probability theory, and coding (preferably with R)
- These books cover fundamental concepts in biostatistics.
- Introduction to Probability & Statistics by Mendenhall, Beaver & Beaver
- Fundamentals of Biostatistics by Bernard Rosner
- Intuitive Biostatistics: A Nonmathematical Guide to Statistical Thinking by Harvey Motulsky
- Statistics: An Introduction Using R by Michael J. Crawley
- Basic & Clinical Biostatistics by Susan White
- Biostatistics: A Foundation for Analysis in the Health Sciences by Wayne W. Daniel and Chad L. Cross
- R for Data Science: Import, Tidy, Transform, Visualize, and Model Data by Hadley Wickham, Mine Cetinkaya-Rundel and Garrett Grolemund
- Hypothesis Testing, Power Analysis and Sample Size Determination
- Jacob Cohen’s book is an authoritative textbook on the notion of statistical power. The book is quite old but it’s still indispensable to have on your bookshelf. Statistical Power Analysis for the Behavioral Sciences by Jacob Cohen
- ANOVA/ANCOVA
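To make the power and sample size topic above a little more concrete, here is a minimal sketch in Python (standard library only; R would work just as well). It uses the usual normal approximation for comparing two group means with a standardized effect size d (Cohen's d). The function names and default values are my own illustrative choices, and an exact calculation based on the t distribution will typically ask for slightly more participants.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample
    comparison of means, using the normal approximation.

    d is the standardized effect size (difference in means / common SD).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value of the two-sided test
    z_beta = z.inv_cdf(power)           # quantile matching the target power
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)


def approx_power(d, n, alpha=0.05):
    """Approximate power with n participants per group (the negligible
    contribution of the opposite tail is ignored)."""
    z = NormalDist()
    return z.cdf(d * math.sqrt(n / 2) - z.inv_cdf(1 - alpha / 2))


if __name__ == "__main__":
    n = n_per_group(d=0.5)  # a "medium" effect in Cohen's terminology
    print(n, round(approx_power(0.5, n), 3))
```

Halving the effect size roughly quadruples the required sample size, which is why the assumed effect size deserves more scrutiny than any other input.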
Mathematical Foundations
These will allow you to develop the mathematical theory needed to understand more advanced statistics.
- Intro to linear algebra
- Intro to calculus
Advanced Statistics
Once you have learned the foundational tools, here are some advanced topics to master to become proficient as a statistician.
- Longitudinal data
- Experimental Design
- Randomized Controlled Trials
- Resampling methods (permutation tests, bootstrap, etc.)
- Missing Data and Data Imputation
- Sampling Theory
- Advanced Experimental Design (complex randomization schemes)
- Survival Analysis
- Bayesian Statistics
- Mathematical Statistics and Statistical Inference
This book covers hundreds of statistical tests in depth. It assumes some statistical maturity and goes into detail about each test covered. A great resource.
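As a small illustration of the resampling methods listed in this section, here is a sketch in Python (standard library only) of a permutation test and a percentile bootstrap confidence interval for a difference in means. The data are made-up numbers, and the function names, seeds, and number of resamples are assumptions chosen for the example, not recommendations.

```python
import random
from statistics import mean


def permutation_test(x, y, n_perm=5000, seed=1):
    """Two-sided permutation p-value for the difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(x) - mean(y))
    pooled = list(x) + list(y)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(mean(pooled[:len(x)]) - mean(pooled[len(x):]))
        if diff >= observed:
            extreme += 1
    # Add 1 to numerator and denominator so the p-value is never exactly 0.
    return (extreme + 1) / (n_perm + 1)


def bootstrap_ci(x, y, n_boot=2000, level=0.95, seed=1):
    """Percentile bootstrap interval for mean(x) - mean(y)."""
    rng = random.Random(seed)
    diffs = sorted(
        mean(rng.choices(x, k=len(x))) - mean(rng.choices(y, k=len(y)))
        for _ in range(n_boot)  # resample each group with replacement
    )
    tail = (1 - level) / 2
    return diffs[int(tail * n_boot)], diffs[int((1 - tail) * n_boot) - 1]


if __name__ == "__main__":
    # Hypothetical measurements for two groups (illustrative only).
    treated = [5.1, 4.9, 6.2, 5.8, 6.0, 5.5, 6.1, 5.7]
    control = [4.2, 4.8, 4.5, 5.0, 4.1, 4.6, 4.9, 4.4]
    print(permutation_test(treated, control))
    print(bootstrap_ci(treated, control))
```

The permutation test answers "is there a difference?" under exchangeability of the group labels, while the bootstrap interval describes the uncertainty around the size of that difference.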
Modelling
One of my favorite statistics books ever! It does an amazing job of helping the reader understand, at its core, the most widely used tool in statistics today: the regression model. Understanding Regression Analysis by Andrea L. Arias and Peter H. Westfall
This book is hands down the best one for learning how to apply regression models in scientific settings, where the statistical properties of the estimators and the inferences matter. Most books on regression modelling focus on building predictive systems, and can sometimes lead researchers astray. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis by Frank E. Harrell, Jr.
- Linear Regression
- Categorical Data (Logistic Regression, Multinomial Regression, Ordinal Regression)
- A masterpiece on logistic regression. Today, there are many misconceptions around logistic regression because it is often used to build classification procedures. It is less well known that logistic regression can be used for statistical inference with binary data, the same way any other Generalized Linear Model (GLM) can be used. Applied Logistic Regression by David W. Hosmer and Stanley Lemeshow
- Generalized Linear Models (GLM, Linear, Binomial, Poisson, Negative Binomial)
- A classic textbook on Generalized Linear Models, an important tool used today in many biostatistics applications. Generalized Linear Models by P. McCullagh and John A. Nelder
- Mixed Models (LMM), Hierarchical Models, Multilevel Models (synonyms)
- A masterpiece for multilevel modelling, a technique widely used across biostatistics and in basically any other field that uses data. A must have! Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman & Jennifer Hill
- Generalized Additive Models (GAM), Smoothing and non-parametric regression
- Advanced modelling (Nonlinear mixed models, Generalized Estimating Equations, etc.)
- Multivariate Data Methods (PCA, LDA, Clustering, etc.)
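To illustrate the point above that logistic regression is an inferential tool and not only a classifier, here is a minimal Python sketch (standard library only). It relies on a known special case: with a single binary covariate, the maximum likelihood estimate of the logistic regression slope equals the log odds ratio of the 2x2 table, and its Wald standard error is the square root of the sum of the reciprocal cell counts. The counts and the function name are hypothetical, chosen only for the example.

```python
import math
from statistics import NormalDist


def odds_ratio_wald(a, b, c, d, level=0.95):
    """Odds ratio, Wald confidence interval, and two-sided Wald p-value.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    All four counts must be positive.
    """
    log_or = math.log((a * d) / (b * c))   # slope of the logistic model
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    lower = math.exp(log_or - z_crit * se)
    upper = math.exp(log_or + z_crit * se)
    p_value = 2 * (1 - NormalDist().cdf(abs(log_or / se)))
    return math.exp(log_or), (lower, upper), p_value


if __name__ == "__main__":
    # Hypothetical counts: 30/100 exposed and 15/100 unexposed had the outcome.
    estimate, interval, p = odds_ratio_wald(30, 70, 15, 85)
    print(round(estimate, 2), interval, p)
```

The output is an effect estimate with its uncertainty, not a predicted class label. With several covariates the same quantities come from fitting the full model in statistical software, but the interpretation is unchanged.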
Causal Inference
Causal inference is rapidly becoming an indispensable part of the statistician’s toolkit. It is a very large and complex field, and multiple specialized subfields are emerging. Here is an overview of important topics to master:
- Potential Outcomes Model
- DAGs
- Propensity Scores
- Instrumental Variables
- Mediation and Interaction
- SEMs and Path Analysis
- Differences-in-differences
- Regression Discontinuity Design
- Time-Varying Confounding
- Emulating a Target Trial
- Targeted Learning and Causal ML
Here is a list of textbooks that cover a wide range of causal inference concepts.
- A popular and accessible text. The Book of Why breaks down complex causal inference concepts in a way that’s easy to grasp for both experts and laypeople alike. The Book of Why by Judea Pearl
- An accessible non-technical introduction to Causal Inference by one of the leading researchers in the field. Causal Inference by Paul Rosenbaum
- This book is a key resource for understanding the potential outcomes framework (Neyman-Rubin Causal Model). It provides in-depth guidance on how to design experiments and analyze data to draw causal conclusions in fields like economics, medicine, and social sciences. Causal Inference for Statistics, Social, and Biomedical Sciences by Guido Imbens and Donald Rubin
- Judea Pearl’s work lays the foundation for the structural causal models and DAGs approach. It integrates various perspectives on causation and offers mathematical tools for empirical researchers. Causality: Models, Reasoning, and Inference by Judea Pearl
- This text provides an accessible introduction to causal inference for researchers dealing with non-time-varying treatments. It extends to more complex scenarios like longitudinal data, making it ideal for those working with repeated measures or time-series data. Causal Inference: What If by Miguel A. Hernán & James M. Robins
- This book combines clarity with practicality. It adopts a practical approach to teaching causal inference through real-world examples and applications, making it suitable for learners at various levels. Causal Inference: The Mixtape by Scott Cunningham
- For readers who prefer a graphical approach, this primer by Pearl is an excellent introduction. It emphasizes the use of graphical models to clarify and simplify complex statistical concepts, making it easier to understand causal relationships. Causal Inference in Statistics: A Primer by Judea Pearl, Madelyn Glymour, and Nicholas P. Jewell
- This book is perfect for data scientists and analysts who want to implement causal inference methods in Python. It walks through the application of causal inference techniques using practical examples in a coding environment. Causal Inference and Discovery in Python by Aleksander Molak
- Econometrics enthusiasts will find this text particularly helpful. It provides a hands-on guide to applying causal inference techniques in economic research, blending theory with practical applications. Mostly Harmless Econometrics: An Empiricist’s Companion by Joshua D. Angrist and Jörn-Steffen Pischke
- This textbook delves into counterfactual reasoning, a critical aspect of causal inference, offering comprehensive explanations of causal mechanisms in both experimental and observational settings. Counterfactuals and Causal Inference: Methods and Principles for Social Research by Stephen L. Morgan and Christopher Winship
- VanderWeele’s book tackles the philosophical and methodological issues surrounding causal inference, making it essential for those interested in the underlying principles of causal explanation. Explanation in Causal Inference by Tyler VanderWeele
Related Disciplines (Epidemiology, Bioinformatics, Psychology, Coding, Data Science)
- Intro to Epidemiology and Study Design in Epidemiology (measures of association, survival curves and contingency tables, study design)
- Intro to Bioinformatics
- Intro to Psychometrics and Experimental Design in Psychology
- Intro to Machine Learning and Artificial Intelligence Methods
- Advanced Coding (Reproducibility, Documentation, Package writing, etc.)
BONUS: For those who wish to become independent consultants
- Intro to statistical consulting (real-world consulting projects, either as consultant or even an intern, preferably with R)
- Basic business principles (contracts, marketing and branding, accounting, digital presence, structuring and closing deals)
Conclusion
How much time and focus to devote to each subject depends on your personal idiosyncrasies, experiences, and other factors.
This path is highly nonlinear by design and should keep you busy for a good 5 years of learning, if not 10.
Don’t hesitate to reach out to the author if you have any questions.
You can also subscribe to his FREE monthly newsletter Causal Inference in (Bio)statistics.