
Data Cleaning in ML: Garbage In, Garbage Out

The saying “Garbage In, Garbage Out” should become your mantra if you want to build accurate machine learning models that find real-world applications.

Quality data beats even the most sophisticated algorithms. Without clean data, your models will deliver misleading results and seriously harm your decision-making processes.

Why Learn How to Clean Data with Python?

Data cleaning is a crucial first step in any machine learning project. Data scientists are often said to spend around 80% of their time cleaning data and only 20% doing analysis. This article walks through some of the most common techniques for getting your data ready to analyze.

Data Cleaning

Data cleaning involves different techniques depending on the problem and the data type, and each method comes with its own trade-offs. Overall, incorrect data is either removed, corrected, or imputed.

Irrelevant data

Irrelevant data are observations or features that are not actually needed and don’t fit the context of the problem you’re trying to solve. If you are sure a piece of data is unimportant, you may drop it. Otherwise, ask a domain expert to review the features before eliminating anything: a feature that seems irrelevant at first glance could be very relevant from a domain perspective.
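As a minimal sketch, assuming a pandas DataFrame with hypothetical columns such as `employee_id` and `internal_notes` that carry no predictive signal, dropping them might look like this:

```python
import pandas as pd

# Hypothetical dataset; column names are illustrative only
df = pd.DataFrame({
    "employee_id": [101, 102, 103],
    "age": [34, 29, 41],
    "income": [52000, 48000, 61000],
    "internal_notes": ["ok", "follow up", "ok"],
})

# Drop columns that a domain expert has confirmed are irrelevant
df = df.drop(columns=["employee_id", "internal_notes"])
print(df.head())
```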

Duplicates

Duplicates are data points that are repeated in your dataset. They often appear when, for example, data are combined from different sources or a user hits the submit button twice thinking the form wasn’t actually submitted. In most cases, they should simply be removed.
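With pandas, exact duplicate rows can be detected and dropped in a couple of lines; this sketch assumes the simplest case, where entire rows are repeated:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice"],
    "age": [34, 29, 34],
})

print(df.duplicated().sum())   # number of exact duplicate rows
df = df.drop_duplicates()      # keep the first occurrence of each row
```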

Type conversion

Make sure numbers are stored as numerical data types. A date should be stored as a date object or a Unix timestamp (number of seconds), and so on. Categorical values can be converted to and from numbers if needed. Type problems can be spotted quickly by taking a peek at the data types of each column.
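A quick sketch with pandas, assuming a hypothetical DataFrame where numbers and dates arrived as strings:

```python
import pandas as pd

df = pd.DataFrame({
    "price": ["19.99", "5.49", "12.00"],  # numbers stored as strings
    "signup_date": ["2023-01-15", "2023-02-03", "2023-02-20"],
    "plan": ["basic", "pro", "basic"],
})

print(df.dtypes)  # inspect column types first

df["price"] = pd.to_numeric(df["price"])                # string -> float
df["signup_date"] = pd.to_datetime(df["signup_date"])   # string -> datetime
df["plan"] = df["plan"].astype("category")              # string -> categorical
```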

Imputation

Human errors, data flow interruptions, privacy concerns, and other factors can all lead to missing values. Imputation replaces missing data with a substitute value so that most of the information in the dataset is retained. You can simply drop any row that contains a missing value, but imputation is often preferred over dropping because it preserves the dataset’s size.
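A minimal sketch using scikit-learn’s SimpleImputer; mean imputation is just one of several possible strategies:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [34, np.nan, 41, 29],
    "income": [52000, 48000, np.nan, 61000],
})

# Replace missing values with the column mean;
# "median" or "most_frequent" are other common strategies
imputer = SimpleImputer(strategy="mean")
df[["age", "income"]] = imputer.fit_transform(df[["age", "income"]])
```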

Handling Outliers

An outlier is a data point that differs significantly from the other observations. Handling outliers produces a more accurate representation of the data, which in turn affects the model’s performance. Depending on the model, the effect can be large or minimal; linear regression, for example, is particularly susceptible to outliers.
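One common approach, sketched below, is the interquartile range (IQR) rule: values further than 1.5×IQR beyond the quartiles are flagged as outliers (the 1.5 multiplier is a conventional choice, not the only one):

```python
import pandas as pd

df = pd.DataFrame({"income": [48000, 52000, 61000, 55000, 950000]})

q1 = df["income"].quantile(0.25)
q3 = df["income"].quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the rows that fall inside the IQR fences
df_no_outliers = df[(df["income"] >= lower) & (df["income"] <= upper)]
```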

Scaling

In most cases, the numerical features of a dataset do not share a common range. In real life, we don’t expect the age and income columns to cover the same range of values, but from a machine learning point of view, how can these two columns be compared? Because of its larger range, the income column would bias the model, whereas we want each feature to be weighted equally. Without scaling, the feature with the larger value range starts to dominate when distances are calculated.
There are two common ways of scaling: 1. Normalization (rescaling values to a fixed range, typically [0, 1]) 2. Standardization (rescaling values to zero mean and unit variance)
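A short sketch contrasting the two with scikit-learn, using MinMaxScaler for normalization and StandardScaler for standardization:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({
    "age": [25, 40, 58, 33],
    "income": [32000, 58000, 91000, 47000],
})

# Normalization: rescale each feature to the [0, 1] range
normalized = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

# Standardization: rescale each feature to zero mean and unit variance
standardized = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
```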

If you want to learn more about data cleaning techniques through hands-on coding, taking a Data Science course is a great way to kick off your career in the field.
