Statistics for Data Science (using R)
Preface
These course notes provide an applied introduction to multivariate data analysis methods and statistical models using the R system for statistical computing. Currently, they are primarily aimed at students of the “Statistics for Data Science” course of the MSc in Data Science at the University of Girona, and they serve as a basis for more specialised courses taught later on. They build on previous materials developed by the author while delivering training courses for scientists at Biomathematics and Statistics Scotland (BioSS) and lecturing the Multivariate Data Analysis course at the University of Edinburgh. Basic statistical knowledge and some experience working with and managing data in the R environment are assumed. The course avoids mathematical and statistical theory as much as possible and concentrates on the underlying concepts, emphasising how to put them into practice using R as the computing tool.
The notes are divided into two blocks:
Chapters 1-6: overview of some multivariate methods aimed at data dimension reduction, classification, and identification of similarities, associations, and patterns in data sets, with a focus on data exploration and graphical representation.
Chapters 7-12: overview of some of the families of linear, non-linear, generalised linear and additive regression models commonly used in statistical modelling, including questions related to model validation, variable selection and dealing with high dimensions.
Some bibliographic references for more details include:
Everitt, B. S. and Hothorn, T. (2011). An Introduction to Applied Multivariate Analysis with R. Springer.
Faraway, J.J. (2005). Linear Models with R. Chapman & Hall/CRC, http://www.maths.bath.ac.uk/~jjf23/LMR.
Faraway, J.J. (2006). Extending the Linear Model with R. Chapman & Hall/CRC, http://www.maths.bath.ac.uk/~jjf23/ELM.
James, G., Witten, D., Hastie, T. and Tibshirani, R. (2021). An Introduction to Statistical Learning with Applications in R. 2nd edition, Springer.
Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer.
Wood, S.N. (2017). Generalized Additive Models: An Introduction with R. 2nd edition, Chapman & Hall/CRC.
Zelterman, D. (2015). Applied Multivariate Statistics with R. Springer.
The written material is accompanied by the R code that generates the outputs. We strongly encourage students not to just copy and paste code chunks into their R session, but to type in and experiment with the code themselves, checking the details of the functions in the R help system and reviewing which arguments are available and how they affect the output of each function. This will provide a much more enriching learning experience.
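As a small illustration of this way of working, the sketch below looks up a function in the help system, lists its arguments, and then changes one argument to see the effect. The functions `lm()` and `mean()` and the toy vector `x` are chosen here purely as examples; they are not taken from the chapters that follow.

```r
# Open the help page of a function (equivalent to help("lm"))
?lm

# List the arguments a function accepts, together with their defaults
args(lm)
arg_names <- names(formals(lm))
print(arg_names)

# Experiment: change one argument and compare the results
x <- c(1, 2, 3, 100)                  # toy data with one extreme value
plain_mean   <- mean(x)               # default: no trimming
trimmed_mean <- mean(x, trim = 0.25)  # drops the lowest and highest 25% first
print(c(plain = plain_mean, trimmed = trimmed_mean))
```

Reading the `trim` entry in `?mean` alongside the two printed values is the kind of check we have in mind: the help page says what the argument does, and a quick experiment confirms it.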