
Can’t solve Data Science Problems? Follow this Structured Approach

Most data scientists adopt a haphazard approach when they work on a data science problem: they take the data and start throwing algorithms at it, hoping to achieve success through brute force – trial and error. In most cases, this does not lead to a favorable outcome. Data Science is both an art and a science, so it is important to pair your intuition with a structured approach to any Data Science problem. Follow the approach below and you can crack any Data Science problem.

1. Understand the problem statement

A Data Scientist should first formulate the primary objective: it should be clear what he/she is trying to solve. This also helps in understanding the data better. A student who has cleared JEE knows better than anyone (even a data scientist) how a student can crack JEE, which is why domain expertise and an understanding of the data are so important. A precise and accurate problem definition is critical for the overall success of a data analysis project, and domain knowledge often helps us reach this precision and accuracy. Domain knowledge alone, however, does not entitle a data scientist to eliminate underlying features from the dataset.

2. Identify the right datasets and define the dependent and independent variables

Most data scientists spend at least 70% of their time working on data. Identifying the right datasets or data sources to work with is equally important. Identifying the "target" variable correctly is essential for any business problem. Once that is done, the remaining columns/features become independent variables that will be preprocessed (some eliminated if found useless) and used to predict the "target" variable.
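
As a minimal sketch of this step, assuming a hypothetical pandas DataFrame loaded from a file named students.csv with a binary outcome column cracked_jee and an identifier column student_id (all of these names are illustrative, not from the original article):

```python
import pandas as pd

# Hypothetical dataset: students.csv with a binary outcome column "cracked_jee"
df = pd.read_csv("students.csv")

# The "target" (dependent) variable is what we want to predict
y = df["cracked_jee"]

# Everything else becomes a candidate independent variable;
# obviously useless columns (e.g. a row identifier) can be dropped here
X = df.drop(columns=["cracked_jee", "student_id"])

print(X.shape, y.shape)
```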

3. Train and Test Data

It is important to identify the data that will be used to train the data science model. As an example, all data up to a specific date (26-11-2021) may be used to train the model, and data from later dates used to test the model's accuracy. Usually, the dataset is partitioned into a training dataset and a testing dataset, and the method used to decide the train/test split should be chosen carefully.
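
A sketch of both approaches, assuming the same hypothetical students.csv with a date column and a binary target (the file and column names are assumptions for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("students.csv", parse_dates=["date"])  # hypothetical file and column

# Option 1: time-based split - everything up to the cutoff date trains the model
cutoff = pd.Timestamp("2021-11-26")
train_df = df[df["date"] <= cutoff]
test_df = df[df["date"] > cutoff]

# Option 2: random split - common when rows are not time-ordered
X = df.drop(columns=["cracked_jee", "date"])
y = df["cracked_jee"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # stratify keeps class balance
)
```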

4. Data Preparation and Cleaning

Data preparation and cleaning is a mandatory step: the data must be prepared so that the model can learn from it. Computers can only understand numerical data, so conversions such as one-hot encoding of categorical data are important (for example, if the Gender column contains Male, Female, and Others, it is one-hot encoded into numerical values). Another example is the handling of missing values. It is important to define a strategy for missing values, since the data scientist is making a best-guess estimate of each missing value; typically the median is used for numerical values and the mode for categorical values.
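
A minimal sketch of both conversions on a toy DataFrame (the column names and values are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["Male", "Female", None, "Others"],
    "score":  [72.0, None, 65.0, 80.0],
})

# Missing values: best-guess estimates - median for numerical, mode for categorical
df["score"] = df["score"].fillna(df["score"].median())
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])

# One-hot encode the categorical Gender column into numerical 0/1 columns
df = pd.get_dummies(df, columns=["gender"])
print(df)
```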

5. Exploratory Data Analysis

We use machine learning algorithms to solve the problem, but nothing beats a good exploratory data analysis. Quality data beats even the most sophisticated algorithms. An Exploratory Data Analysis or EDA involves visual inspection of data through tools such as heatmaps for correlation analysis, bar charts and scatter plots, frequency charts or histograms, box plots (outlier detection), and so on. EDA gives important insights into the relationships that exist within the data and is key to feature engineering.
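
A short sketch of the plots mentioned above, assuming the hypothetical students.csv from the earlier steps with a numerical column named score:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("students.csv")  # hypothetical dataset from the earlier steps

# Correlation heatmap over the numerical features
sns.heatmap(df.select_dtypes("number").corr(), annot=True, cmap="coolwarm")
plt.show()

# Histogram (distribution) and box plot (outlier detection) of one feature
df["score"].hist(bins=30)
plt.show()
sns.boxplot(x=df["score"])
plt.show()
```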

6. Scaling Numerical data

Most data science algorithms are sensitive to the scale of the data. Without scaling, the algorithm may be biased toward features whose values are higher in magnitude. Hence we scale the features so that every feature lies in a comparable range and the model can weigh each feature fairly. A common choice is the Standard Scaler, which transforms each data point x as (x - mean) / sigma.
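
A minimal sketch with scikit-learn's StandardScaler on toy data; note that the scaler is fitted on the training set only and then reused on the test set:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # toy data
X_test = np.array([[2.5, 250.0]])

scaler = StandardScaler()                        # (x - mean) / sigma, per feature
X_train_scaled = scaler.fit_transform(X_train)   # fit on training data only
X_test_scaled = scaler.transform(X_test)         # reuse the training statistics
```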

7. Multi-collinearity and Curse of Dimensionality

Multi-collinearity arises when independent variables/columns/features are not truly independent: there is an inherent relationship between two columns that are assumed to be independent. There can also be cases where the data scientist starts with 10-15 variables and, after all the feature engineering and data pre-processing, ends up with more than 100. Hence it becomes very important to reduce the number of dimensions. Typically this is done through either Principal Component Analysis (PCA, an unsupervised technique) or Linear Discriminant Analysis (LDA). This not only removes multicollinearity but also simplifies the processing of the data by the algorithm.
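
A sketch of dimensionality reduction with PCA on a toy matrix (the data and the 95% variance threshold are illustrative choices, not from the original article):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(200, 100)                   # toy matrix: 200 rows, 100 features

X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

# Keep enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape)                         # fewer, mutually uncorrelated columns
```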

8. Train the model using an algorithm

Train the model on an appropriate train/test split, and hold out a validation split as well if you have a sufficiently large dataset.
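
A minimal sketch using a synthetic dataset as a stand-in for the prepared features, and logistic regression as one possible algorithm (both are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in for the prepared feature matrix and target from the earlier steps
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))
```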

9. Measure the outcome on the Training Set and Testing Set

A data scientist should measure the prediction accuracy on the training set to see how well the model has learned the patterns in the data. The evaluation metric should be chosen based on the problem statement. The testing set is then used to predict outcomes, and those predictions are evaluated with the same metric.
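
A sketch that rebuilds the toy model from the previous step and reports the same metric on both splits (accuracy and F1 are example metrics; the right choice depends on the problem statement):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Accuracy on the training set shows how well the patterns were learned;
# the same metric on the testing set shows how well they generalize
print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("test accuracy :", accuracy_score(y_test, model.predict(X_test)))
print("test F1       :", f1_score(y_test, model.predict(X_test)))
```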

10. Overfitting?

If the accuracy on the training set is high (say 95%) and the accuracy on the testing set is poor (say 60%), the model is likely overfitting: it is not able to generalize and fails when it encounters unseen data from the testing set. In such cases it is important to: a) examine how the training and testing sets were formed, or whether cross-validation is necessary, and b) use L1/L2 regularization to reduce overfitting.
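
A sketch of both remedies with scikit-learn (logistic regression is again an illustrative model; in scikit-learn a smaller C means stronger regularization):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# a) Cross-validation gives a more robust estimate than a single train/test split
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("5-fold CV accuracy:", scores.mean())

# b) L1/L2 regularization to reduce overfitting
l2_model = LogisticRegression(penalty="l2", C=0.1, max_iter=1000)
l1_model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear", max_iter=1000)
```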

11. Explore other algorithms

A data scientist can also explore other algorithms of a similar nature to improve the efficiency and accuracy of the model. When exploring other algorithms, the metric evaluation and the overfitting check need to be repeated for every algorithm considered.
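
A sketch of comparing a few candidate algorithms under the same cross-validated metric (the specific candidates are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "knn": KNeighborsClassifier(),
    "svm": SVC(),
}

# Repeat the same metric evaluation (and overfitting check) for every candidate
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")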

12. Consider Ensemble techniques

Consider ensemble techniques such as Bagging, Boosting, or Stacking to improve the accuracy of the model. Random Forest (Bagging), AdaBoost, Gradient Boosting Machines, and XGBoost (Boosting) can all be considered.
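
A sketch of evaluating a few of these ensembles with scikit-learn (hyperparameters are illustrative defaults; XGBoost comes from a separate package):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

ensembles = {
    "random_forest (bagging)": RandomForestClassifier(n_estimators=200, random_state=42),
    "adaboost (boosting)": AdaBoostClassifier(random_state=42),
    "gbm (boosting)": GradientBoostingClassifier(random_state=42),
    # XGBoost lives in a separate package: from xgboost import XGBClassifier
}

for name, model in ensembles.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```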

13. Explainability and Visual Output

Business users do not usually understand statistical explanations, so it is important that the outcome of the model is explained in an intuitive manner. This involves building dashboards with intuitive visuals to explain the model and its outcomes. Usually these are developed in specialized tools such as Power BI, Tableau, or Qlik; open-source libraries such as Plotly can also be used to develop such dashboards.
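
A minimal Plotly sketch of one such visual; the feature names and importance values are invented for illustration and would normally come from a fitted model:

```python
import pandas as pd
import plotly.express as px

# Hypothetical output: per-feature importances taken from a fitted tree-based model
importances = pd.DataFrame({
    "feature": ["hours_studied", "mock_test_score", "attendance", "sleep_hours"],
    "importance": [0.42, 0.31, 0.18, 0.09],
})

# An interactive bar chart is often easier for business users than raw coefficients
fig = px.bar(importances, x="feature", y="importance",
             title="What drives the model's predictions?")
fig.show()
```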

Building a data science model is a complex exercise, and the steps above are an attempt to bring a structured approach to it.

If you want to learn Data Science in a well-structured way, practicing hands-on coding and following the right roadmap, this is a great opportunity to kick off your career by taking a Data Science course here:

Master the most in-demand skill of the 21st Century.



