Statistics for Data Science: a Complete Guide for Aspiring ML Practitioners
In this hyper-connected world, data are being generated and consumed at an unprecedented pace.
As much as we enjoy this super-connectivity of data, it invites abuse as well. Data professionals need to be trained to use statistical methods not only to interpret numbers but also to uncover such abuse and protect us from being misled.
Not many data scientists are formally trained in statistics. There are also very few good books and courses that teach these statistical methods from a data science perspective.
Through this post, I intend to shed some light on the following:
- What is Statistics?
- How statistics relates to machine learning
- Why you should master statistics
- What curriculum you should follow to master these topics
- How to study statistics to become a practitioner rather than a test-taker
- Practical tips and learning resources
What is Statistics?
Statistics is a set of mathematical methods and tools that enable us to answer important questions about data. It is divided into two categories:
- Descriptive Statistics – this offers methods to summarise data by transforming raw observations into meaningful information that is easy to interpret and share.
- Inferential Statistics – this offers methods to study experiments done on small samples of data and extend the inferences to the entire population (entire domain).
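As a minimal sketch of the two categories, the snippet below first summarises a small sample (descriptive) and then builds an approximate 95% confidence interval for the population mean (inferential). The session lengths are made-up values, and the normal approximation is an assumption I'm making for simplicity; with a sample this small, a t-based interval would be somewhat wider.

```python
import math
import statistics

# Hypothetical sample: daily session lengths in minutes (made-up values).
sample = [12.0, 15.5, 11.2, 14.8, 13.1, 16.4, 12.7, 15.0, 13.9, 14.4]

# Descriptive statistics: summarise the observations we actually have.
mean = statistics.mean(sample)
median = statistics.median(sample)
stdev = statistics.stdev(sample)  # sample standard deviation (n - 1)

# Inferential statistics: estimate the population mean from the sample.
# Assumption: a normal approximation is used here for simplicity.
z = statistics.NormalDist().inv_cdf(0.975)
standard_error = stdev / math.sqrt(len(sample))
ci_low, ci_high = mean - z * standard_error, mean + z * standard_error

print(f"mean={mean:.2f} median={median:.2f} stdev={stdev:.2f}")
print(f"approx. 95% CI for the population mean: ({ci_low:.2f}, {ci_high:.2f})")
```

The first print line describes only the ten observations; the second makes a claim about the population they were drawn from, which is why it comes with an interval rather than a single number.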
Now, statistics and machine learning are two closely related areas of study. Statistics is an important prerequisite for applied machine learning, as it helps us select, evaluate and interpret predictive models.
Statistics and Machine Learning
The core of machine learning is centered around statistics. You can’t solve real-world problems with machine learning if you don’t have a good grip on statistical fundamentals.
There are certainly some factors that make learning statistics hard. I’m talking about mathematical equations, Greek notation, and meticulously defined concepts that make it difficult to develop an interest in the subject.
We can address these issues with simple and clear explanations, appropriately paced tutorials, and hands-on labs to solve problems with applied statistical methods.
From exploratory data analysis to designing hypothesis testing experiments, statistics plays an integral role in solving problems across all major industries and domains.
Anyone who wishes to develop a deep understanding of machine learning should learn how statistical methods form the foundation for regression and classification algorithms, how statistics allows us to learn from data, and how it helps us extract meaning from unlabeled data.
Why should you master statistics?
Every organisation is striving to become data-driven. This is why we are witnessing such an increase in demand for data scientists and analysts.
Now, to solve problems, answer questions, and map out a strategy, we need to make sense of the data. Luckily, statistics offers a collection of tools to produce those insights.
From Data to Knowledge
In isolation, raw observations are just data. We use descriptive statistics to transform these observations into insights that make sense.
Then we can use inferential statistics to study small samples of data and extrapolate our findings to the entire population.
Statistics helps answer questions like…
- What features are the most important?
- How should we design the experiment to develop our product strategy?
- What performance metrics should we measure?
- What is the most common and expected outcome?
- How do we differentiate between noise and valid data?
All these are common and important questions that data teams have to answer on a daily basis.
The answers help us make decisions effectively. Statistical methods help us not only to set up predictive modeling projects but also to interpret the results.
Statistics and Machine Learning Projects
Almost every machine learning project consists of the following tasks, and statistics plays a central role in all of them in some shape or form. Here’s how:
Defining a Problem Statement
The most crucial part of predictive modeling is the actual definition of the problem that gives us the real objective to pursue.
This helps us decide the type of problem we’re dealing with (that is, regression or classification). It also helps us decide the structure and types of the inputs, outputs and metrics with regard to the objective.
But problem framing is not always straightforward. If you’re new to machine learning, it may require significant exploration of the observations in the domain. Two main concepts to master here are exploratory data analysis (EDA) and data mining.
Initial Data Exploration
Data exploration involves gaining a deep understanding of both the distributions of variables and the relationships between variables in your data.
In part, domain expertise helps you gain this mastery over a specific type of variable. Nevertheless, both experts and newcomers to the field benefit from actually handling real observations from the domain.
Important related concepts in statistics boil down to learning descriptive statistics and data visualization.
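A first pass at exploration often looks something like the sketch below: a quartile summary for the distribution of one variable, and a Pearson correlation for the relationship between two. The `ad_spend` and `sales` values are invented for illustration, and `describe` and `pearson` are hypothetical helper names, not functions from any library.

```python
import math
import statistics

# Hypothetical observations (made-up values): weekly ad spend and sales.
ad_spend = [10, 12, 15, 17, 20, 22, 25, 30]
sales = [25, 29, 33, 40, 42, 50, 53, 65]

def describe(values):
    """Return min, quartiles, and max to describe a variable's distribution."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return {"min": min(values), "q1": q1, "median": q2, "q3": q3, "max": max(values)}

def pearson(xs, ys):
    """Pearson correlation: co-variation scaled by the spread of each variable."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

summary = describe(sales)
r = pearson(ad_spend, sales)
print(summary)
print(f"Pearson r between ad spend and sales: {r:.3f}")
```

In practice you would pair numbers like these with plots (histograms, scatter plots), since a single correlation coefficient can hide non-linear patterns.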
Data Cleaning
Often, the data points you’ve collected from an experiment or a data repository are not pristine. The data may have been subjected to processes or manipulations that damaged its integrity. This, in turn, affects the downstream processes or models that use the data.
Common examples include missing values, data corruption, data errors (from a bad sensor), and unformatted data (observations with different scales).
If you want to master cleaning methods, you need to learn about outlier detection and missing value imputation.
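Here is a small sketch of both ideas, assuming made-up sensor readings where `None` marks a missing value. It flags outliers with Tukey's 1.5 × IQR rule and fills gaps with the median. Replacing the outlier with the median as well is just one possible choice; depending on the problem you might drop the row or investigate the sensor instead.

```python
import statistics

# Hypothetical sensor readings (made-up); None marks a missing value and
# 250.0 is the kind of spike a faulty sensor might produce.
readings = [21.5, 22.0, None, 21.8, 250.0, 22.4, 21.9, None, 22.1, 21.7]

observed = [r for r in readings if r is not None]

# Tukey's rule: flag points more than 1.5 * IQR outside the quartiles.
q1, _, q3 = statistics.quantiles(observed, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [r for r in observed if r < low or r > high]

# Median imputation, computed from the values that passed the outlier check.
fill_value = statistics.median(r for r in observed if low <= r <= high)
cleaned = [
    fill_value if (r is None or r < low or r > high) else r
    for r in readings
]

print("outliers:", outliers)
print("cleaned:", cleaned)
```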
Data Preparation and setting up transformation pipelines
If data contains errors and inconsistencies, you often can’t use it directly for modeling.
First, the data might need to go through a set of transformations to change its shape or structure and make it more suitable for the problem you’ve defined or the learning algorithms you’re using.
Then you can develop a pipeline of such transformations that you apply to the data to produce consistent and compatible input for the model.
You should master concepts like data sampling and feature selection methods, data transforms, scaling, and encoding.
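To make the pipeline idea concrete, here is a minimal sketch that standardises a numeric feature and one-hot encodes a categorical one. The records and the `fit`/`transform` function names are hypothetical; the split into two steps simply mirrors the common practice of learning parameters from training data and then reusing them on new data.

```python
import statistics

# Hypothetical raw records (made-up): a numeric and a categorical feature.
rows = [
    {"age": 25, "plan": "free"},
    {"age": 32, "plan": "pro"},
    {"age": 47, "plan": "free"},
    {"age": 51, "plan": "team"},
]

def fit(rows):
    """Learn the transformation parameters from training data only."""
    ages = [r["age"] for r in rows]
    return {
        "age_mean": statistics.mean(ages),
        "age_std": statistics.pstdev(ages),
        "plans": sorted({r["plan"] for r in rows}),
    }

def transform(rows, params):
    """Apply scaling and one-hot encoding using the fitted parameters."""
    out = []
    for r in rows:
        scaled_age = (r["age"] - params["age_mean"]) / params["age_std"]
        one_hot = [1.0 if r["plan"] == p else 0.0 for p in params["plans"]]
        out.append([scaled_age] + one_hot)
    return out

params = fit(rows)
features = transform(rows, params)
for row in features:
    print([round(v, 3) for v in row])
```

Because `transform` only reads from `params`, the same call produces consistent input for the model whether the rows come from the training set or from new data.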
Model Selection & Evaluation
A key step in solving a predictive problem is selecting and evaluating the learning method. Estimation statistics help you score model predictions on unseen data.
Experimental design is a subfield of statistics that drives the selection and evaluation process of a model. It demands a good understanding of statistical hypothesis tests and estimation statistics.
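One way to put estimation statistics to work is to report an interval around a model's score instead of a single number. The sketch below uses a percentile bootstrap on made-up per-example results; `bootstrap_accuracy_ci` is a hypothetical helper, and the 84-out-of-100 figure is invented for illustration.

```python
import random

def bootstrap_accuracy_ci(correct, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for accuracy on a held-out test set."""
    rng = random.Random(seed)
    n = len(correct)
    scores = []
    for _ in range(n_resamples):
        # Resample the test set with replacement and re-score it.
        resample = [correct[rng.randrange(n)] for _ in range(n)]
        scores.append(sum(resample) / n)
    scores.sort()
    tail = (1.0 - confidence) / 2.0
    lower = scores[round(tail * n_resamples)]
    upper = scores[round((1.0 - tail) * n_resamples) - 1]
    return lower, upper

# Hypothetical per-example results on unseen data: 1 = correct prediction.
correct = [1] * 84 + [0] * 16
accuracy = sum(correct) / len(correct)
low, high = bootstrap_accuracy_ci(correct)
print(f"accuracy={accuracy:.2f}, approx. 95% bootstrap CI=({low:.2f}, {high:.2f})")
```

If the intervals of two candidate models overlap heavily, the difference between their point scores may be noise, which is where a proper hypothesis test comes in.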
Fine-tuning the model
Almost every machine learning algorithm has a suite of hyperparameters that allow you to customise the learning method for your chosen problem framing.
This hyperparameter tuning is often empirical in nature, rather than analytical. It requires large suites of experiments in order to evaluate the effect of different hyperparameter settings on the performance of the model.
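The empirical nature of tuning is easy to see in a toy example. The sketch below runs one experiment per candidate value of `k` for a simple k-nearest-neighbours regressor and keeps the setting with the lowest validation error. The data are synthetic (a noisy line, seeded for repeatability), and the helper names and candidate grid are assumptions made for illustration.

```python
import random

def knn_predict(train, x, k):
    """Predict y at x as the mean of the k nearest training targets."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def validation_mse(train, valid, k):
    """Mean squared error of k-NN predictions on the validation split."""
    errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in valid]
    return sum(errors) / len(errors)

# Hypothetical noisy data: y = 2x plus Gaussian noise (seeded for repeatability).
rng = random.Random(42)
data = [(x, 2.0 * x + rng.gauss(0.0, 1.0)) for x in (i / 10 for i in range(100))]
rng.shuffle(data)
train, valid = data[:70], data[70:]

# One experiment per candidate setting; keep the best validation score.
results = {k: validation_mse(train, valid, k) for k in (1, 3, 5, 9, 15, 25)}
best_k = min(results, key=results.get)

for k, mse in results.items():
    print(f"k={k:>2}  validation MSE={mse:.3f}")
print("best k:", best_k)
```

A single hold-out split keeps the example short; with real data you would usually repeat the experiment across several splits so that the chosen setting does not depend on one lucky partition.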
Statistics Curriculum for Practitioners
A good statistics curriculum for practitioners should not just cover the plethora of methods and tools I just discussed. It should also cover and explore the most commonly faced issues in the industry.
The following is a list of widely used skills you’ll need to know to ace data science and ML interviews and get a job in the field.
General Statistics Skills
- How to define statistically answerable questions for effective decision making.
- How to calculate and interpret common statistics and use standard data visualization techniques to communicate findings.
- How mathematical statistics applies to the field, including concepts such as the central limit theorem and the law of large numbers.
- How to make inferences from estimates of location and variability (ANOVA).
- How to identify the relationship between target variables and independent variables.
- How to design statistical hypothesis testing experiments, A/B testing, and so on.
- How to calculate and interpret hypothesis-testing quantities like p-values, alpha, Type I and Type II errors, and so on.
Important Statistics Concepts
- Getting Started – Understanding types of data (rectangular and non-rectangular), estimates of location, estimates of variability, data distributions, binary and categorical data, correlation, and relationships between different types of variables.
- Distribution of a Statistic – random numbers, the law of large numbers, the central limit theorem, standard error, and so on.
- Data Sampling and Distributions – random sampling, sampling bias, selection bias, sampling distribution, bootstrapping, confidence intervals, normal distribution, t-distribution, binomial distribution, chi-square distribution, F-distribution, Poisson and exponential distributions.
- Statistical Experiments and Significance Testing – A/B testing, conducting hypothesis tests (null/alternative), resampling, statistical significance, confidence intervals, p-values, alpha, t-tests, degrees of freedom, ANOVA, critical values, covariance and correlation, effect size, statistical power.
- Nonparametric Statistical Methods – rank data, normality tests, normalization of data, rank correlation, rank significance tests, independence tests.
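Several of these concepts are easier to believe once you simulate them. As one illustration, the sketch below checks the central limit theorem and the standard error formula by drawing repeated samples from a skewed (exponential) distribution. The seed, the sample sizes, and the `sample_means` helper are arbitrary choices made for this example.

```python
import math
import random
import statistics

rng = random.Random(7)

def sample_means(sample_size, n_samples=2000):
    """Draw repeated samples from a skewed distribution and return their means."""
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]

# An exponential distribution with rate 1 has mean 1 and standard deviation 1.
for n in (5, 30, 200):
    means = sample_means(n)
    observed_se = statistics.stdev(means)
    theoretical_se = 1.0 / math.sqrt(n)  # sigma / sqrt(n)
    print(f"n={n:>3}  mean of means={statistics.fmean(means):.3f}  "
          f"observed SE={observed_se:.3f}  theoretical SE={theoretical_se:.3f}")
```

As the sample size grows, the spread of the sample means should shrink roughly in line with sigma / sqrt(n), even though the underlying data are far from normal.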
Practical Learning Tips
Most universities have designed their statistics curricula to test students’ cramming power. They check whether students can solve equations, define terminology, identify plots, and derive equations, rather than whether they can apply these methods to solve real-world problems.
Aspiring practitioners, however, should follow a step-by-step process of learning and implementing statistical methods on different problems using executable Python code.
Let’s look at the two main approaches to studying statistics a bit more in depth:
Top-down approach
Let’s say you are asked to design an experiment to compare the effectiveness of two versions of a product feature. This feature is supposed to increase user engagement on an online portal.
With a top-down approach, you’ll first learn more about the problem. Then once the objective is clear, you can learn to apply the appropriate statistical methods.
This keeps you engaged and offers a better practical learning experience.
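For the feature experiment described above, the method you end up reaching for might be a two-proportion z-test. The sketch below is one way to write it, assuming engagement is measured as a yes/no outcome per user; the counts are invented, and `two_proportion_z_test` is a hypothetical helper rather than a library function.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Under the null hypothesis both versions share one pooled rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: engaged users out of users shown each version.
z, p_value = two_proportion_z_test(conv_a=120, n_a=1000, conv_b=160, n_b=1000)
alpha = 0.05  # significance level chosen before looking at the data
print(f"z={z:.2f}, p-value={p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

Working backwards from a concrete question like this one gives you a reason to learn what the pooled rate, the standard error, and the p-value each mean.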
Bottom-up approach
This approach is how most universities and online courses teach statistics. It focuses on learning the theoretical concepts with mathematical notation, the history of that concept, and how to implement it.
For people like me who tend to lose interest in theoretical learning, this is not the right way to learn applied statistics. It keeps the material too abstract, which renders the subject dry and depressing without any direct link to problem solving.
As you can probably tell, I recommend a top-down approach to studying statistics.
So now let’s look at some specific resources I recommend to get you started down the right path.
Learning Resources
- Book on Practical Statistics – This will teach you statistics from a data science standpoint. You should read at least the first 3 chapters of this book.
- Statistics and Probability | Khan Academy – This free course offers a good compilation of video lectures and practice problems, and it will prepare you well for the statistics and probability questions that come up in interviews.
- Naked Statistics – For people who dread mathematics and prefer to understand practical examples, this is an amazing book that explains how statistics is applied in real-life scenarios.
- Statistical Methods for Machine Learning – This book serves as a crash course in statistical methods for machine learning practitioners, ideally those with a background as a developer.
Next up…
I will be creating a series of tutorials on each of the above-mentioned topics following a code-first approach so that we can understand and visualize the meaning and application of these concepts.