In the past few years, Data Science has become an emerging field in the IT industry. More and more companies are becoming data-driven, and they build their businesses on data and the services around it.
Although it seems very exciting, almost like magic, when an experienced Data Scientist presents the results of a Data Science project, with all those buzzwords, colorful workflows and “out of the box” solutions, the reality is a bit harsher. The first thing any Data Scientist faces when starting a new project is that the data is, in most cases, one big, crazy, messy, unordered pile of disconnected information. If you are working with data streams, things get even more complex. To extract knowledge from the data, you first need to clean up that mess, so data cleansing is your first stop.
Data cleansing involves removing corrupt or inaccurate records, renaming data columns, transforming the data into consistent formats, finding and removing null values, etc. Although it is a tedious task, take your time with it. The effort will pay off later in the project.
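To make the cleansing steps above a bit more concrete, here is a minimal sketch in plain Python. The records, the column names, the `RENAMES` mapping and the two accepted date formats are all made up for illustration; a real project would more likely use a dedicated data library and its own rules for what counts as corrupt.

```python
from datetime import datetime

# Hypothetical raw records; names and values are invented for this example.
raw_records = [
    {"Cust Name": " Alice ", "signup": "2021-03-05", "age": "34"},
    {"Cust Name": "Bob", "signup": "05/04/2021", "age": ""},      # missing age
    {"Cust Name": "Carol", "signup": "2021-13-40", "age": "29"},  # corrupt date
    {"Cust Name": "Dave", "signup": "17/06/2021", "age": "41"},
]

# Assumed mapping from messy column names to consistent ones.
RENAMES = {"Cust Name": "name", "signup": "signup_date"}
# Assumed set of date formats that appear in the raw data.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date(text):
    """Return the date as an ISO string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean(records):
    """Rename columns, normalize formats, and drop rows with null values."""
    cleaned = []
    for record in records:
        # Rename columns and strip stray whitespace from every value.
        row = {RENAMES.get(key, key): value.strip() for key, value in record.items()}
        # Transform dates into one consistent format (None if unparseable).
        row["signup_date"] = parse_date(row["signup_date"])
        # Treat empty strings as nulls and convert the rest to integers.
        row["age"] = int(row["age"]) if row["age"] else None
        # Skip records that still contain a null after the transformations.
        if None in row.values():
            continue
        cleaned.append(row)
    return cleaned


clean_records = clean(raw_records)
for row in clean_records:
    print(row)
```

In this sketch the records with a missing age or an unparseable date are simply dropped. Whether to drop, repair or flag such rows is a decision that depends on the project.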
When your data is ready, you can proceed with the next steps, which include algorithm selection, optimization, testing, etc. That is the fun part, and that is when the mathematics kicks in. The backbone of Data Science is mathematics, particularly Probability and Statistics, and Linear Algebra. Probability is needed to understand the relations between events and to predict the values of specific features. Advanced knowledge of Linear Algebra is needed to work with matrices, linear equations and vector spaces. Many real-world problems are modeled using linear algebra. If the system is nonlinear, linear algebra is often used for first-order approximations, or for evaluating the relative error (which is sometimes the only thing you care about).
Statistics is where most of the confusion occurs. The distinction between Statistics and Data Science can sometimes be blurry. Statistics plays a great role in Data Science, but the two domains cannot be treated as equivalent. I’ve often heard that Data Science is advanced Statistics, but that is not true. The truth is that you need to be handy with advanced Statistics to work in Data Science!
When working on a Data Science project, you will deal with Statistics at many stages. First, when exploring the data, you will be looking for the min, max, mean and standard deviation of the values in your data. Then you will be exploring the data distribution, plotting histograms, etc. When you get to know your data and decide what to do with it, it’s very likely that you will need Statistics for that step as well. Maybe you will decide to perform some filtering, or to interpolate missing values, or maybe your algorithm uses some approximation techniques to optimize; in any case, you will rely on Statistics. When you get the results, you will want to explore the values: maybe you need to know the min and max, or to compare them against the original distributions if you performed some filtering along the way, etc. All in all, more Statistics.
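The exploration steps described above (min, max, mean, standard deviation, a histogram, and filling in missing values) can be sketched with nothing but the Python standard library. The readings below are invented, `interpolate_missing` is a hypothetical helper, and it assumes that every gap is a single value with known neighbours on both sides.

```python
import statistics
from collections import Counter

# Hypothetical sensor readings; None marks a missing value.
readings = [2.0, 4.0, None, 8.0, 6.0, 4.0, None, 4.0]


def interpolate_missing(values):
    """Fill each isolated None with the mean of its two neighbours.

    Assumes gaps are single values and never sit at the ends of the list.
    """
    filled = list(values)
    for i, value in enumerate(filled):
        if value is None:
            filled[i] = (filled[i - 1] + filled[i + 1]) / 2
    return filled


filled = interpolate_missing(readings)

# Basic descriptive statistics used when first exploring the data.
summary = {
    "min": min(filled),
    "max": max(filled),
    "mean": statistics.mean(filled),
    "stdev": statistics.stdev(filled),  # sample standard deviation
}

# A crude text histogram with bins of width 2: [2, 4), [4, 6), ...
histogram = Counter(int(value // 2) * 2 for value in filled)

print(summary)
for bin_start in sorted(histogram):
    print(f"[{bin_start}, {bin_start + 2}): {'#' * histogram[bin_start]}")
```

Linear interpolation between neighbours is only one of many ways to handle missing values. It is used here because it is easy to follow, not because it is the right choice for every data set.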
If you want to make predictions based on your data, you are entering the world of Machine Learning. If that is your task, you should explore neural networks, clustering algorithms, decision trees, regression, support vector machines, frequent pattern mining and more. The math behind those models is again Linear Algebra, together with Graph Theory, Probability, Statistics, Algorithm Theory, Algorithmic Complexity, etc.
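As a small taste of the math behind one of the models mentioned above, here is a sketch of simple linear regression fitted by ordinary least squares in plain Python. The data (hours studied versus test score) is invented, and `fit_line` and `predict` are hypothetical helper names, not part of any library.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations of x and y, and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predict y for a new x using the fitted line."""
    return slope * x + intercept


# Hypothetical training data: hours studied and the resulting test score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

slope, intercept = fit_line(hours, scores)
prediction = predict(6, slope, intercept)
print(f"score = {slope:.2f} * hours + {intercept:.2f}")
print(f"predicted score for 6 hours: {prediction:.1f}")
```

Even this tiny model leans on means and sums of squared deviations, which is Statistics, and the same fit can be written as a small system of linear equations, which is where Linear Algebra comes in.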
When you consider everything, the truth is that you need to know a lot of mathematics to work in the Data Science field. The good news is that many programming languages have extensive math libraries, so you don’t have to write everything from scratch, but a solid understanding of the key principles is still required.