Data Science Myth Buster: Knowledge of statistics is not mandatory
Data Science is a field that has grown by leaps and bounds in the past decade or so. The rapid growth, much like in any other field, has given birth to a few myths. One of the most striking is the often-repeated idea that an understanding of statistics is not mandatory for learning data science.
I interact with aspiring data scientists regularly and have found that many of them have somehow picked up this idea. I have tried to understand the underlying reasons, and here is what I could come up with:
- The predominance of Python as a tool has, rather unfortunately, led many to believe that minimal knowledge and use of statistics while building models is fine, and that there is very little chance of running into issues.
- Aspirants also tend to follow a fixed set of rules or instructions, understand the code required, and implement it to reach a plausible conclusion. Statistics does not play a visible role in this sequence of operations, so there is a tendency to overlook it.
- Understanding a subject is always harder than getting comfortable with a tool. So, unless it becomes necessary, not many are interested in investing time and energy in learning the subject.
There may be other reasons as well, but these, in my opinion, broadly cover the majority of cases.
Let's look at how knowing the subject can make an aspirant confident and help them become a true 'data scientist'.
When aspirants learn the basics of data science, they often encounter predictive models at an early stage. Sophisticated tools have made life easier for us, and even without knowledge of the underlying assumptions and how to check them, it is usually straightforward to build these models.
And so it seems unnecessary to even know the null hypothesis being tested while building a simple regression model.
As long as things work well.
And no issues crop up.
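To make the point concrete, here is a minimal sketch of the hypothesis test that runs silently behind every simple regression fit. It uses statsmodels with synthetic data, so the numbers and variable names are purely illustrative; for each coefficient, the null hypothesis is that it equals zero.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative synthetic data: y depends linearly on x, plus noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=100)

# Fit a simple linear regression: y = beta_0 + beta_1 * x + error.
X = sm.add_constant(x)   # adds the intercept column
model = sm.OLS(y, X).fit()

# Each reported p-value comes from a t-test of H0: beta_i = 0 against
# H1: beta_i != 0. A small p-value suggests the predictor carries
# real information about y.
print(model.params)      # estimated beta_0 and beta_1
print(model.pvalues)     # p-values of the t-tests above
```

Every time `model.summary()` is read off and a variable is kept or dropped, this is the test being relied upon, whether one knows it or not.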
An understanding of the theory and framework becomes invaluable when things don't go according to plan.
A bad multicollinearity or heteroscedasticity problem can be dealt with most effectively when one knows why it happened. And to know this 'why', it is important to understand the statistics involved.
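As an illustration, here is a short sketch of two standard diagnostics from statsmodels, again on hypothetical synthetic data that deliberately bakes in both problems: variance inflation factors for multicollinearity, and the Breusch-Pagan test for heteroscedasticity.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.diagnostic import het_breuschpagan

# Hypothetical data: x2 is almost a copy of x1 (multicollinearity),
# and the error scale grows with x1 (heteroscedasticity).
rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(1, 10, n)
x2 = x1 + rng.normal(0, 0.1, n)           # nearly collinear with x1
y = 3 + 2 * x1 - x2 + rng.normal(0, x1)   # noise scale depends on x1

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2}))
model = sm.OLS(y, X).fit()

# Multicollinearity check: a VIF well above ~10 flags a predictor
# that is largely explained by the other predictors.
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, variance_inflation_factor(X.values, i))

# Heteroscedasticity check: Breusch-Pagan tests H0 of constant error
# variance; a small p-value indicates heteroscedasticity.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, X)
print("Breusch-Pagan p-value:", lm_pvalue)
```

Running the diagnostics is the easy part; knowing why a VIF is huge, or why the residual variance fans out, is where the statistics comes in, and that is what points to the right remedy.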
The other advantage is that it helps separate the excellent data scientists from the merely good ones. Knowledge of the subject can be used to judge early on whether things are going fine. Without it, one may well spend hours only to realise, in the end, that the effort has been futile.
Machine Learning and Deep Learning are attractive terms, but there is a hierarchy that must be respected. If the basics of statistics and predictive modelling are not learnt properly, it is usually difficult to comprehend the advanced topics.
Building a second and a third floor on a fragile ground floor is never a great idea…