
Learn Statistics for Data Science

Good preprocessing depends on a strong knowledge of statistics. In this blog, we explore why we need statistics to become a good data scientist or analyst, and how we can apply it before we get into model preparation.

Types of Statistics

Descriptive: Descriptive statistics collect, analyze, present, and summarize data so that its main features are easy to interpret.

Inferential: Inferential statistics use a random sample of data taken from a population to describe and make inferences about the population.

Sampling Techniques

Probability Sampling Techniques

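  1. Random Sampling: every member of the population has an equal chance of being selected.
  2. Stratified Sampling: the population is split into non-overlapping groups (for example, by gender or age group) and a random sample is drawn from each group.
  3. Systematic Sampling: every N-th element is selected after a starting point. With N = 3, starting from the second element: (2,3,2,12,34,3,5,6,12,9,3) => (3,34,6,3)
  4. Cluster Sampling: the population is split into clusters, and whole clusters are chosen at random.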
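The four techniques can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the `records` list, its gender labels, the sample sizes, the cluster size of 3, and the seed are all made-up assumptions, not part of any real data set.

```python
import random

# Population taken from the systematic sampling example above.
population = [2, 3, 2, 12, 34, 3, 5, 6, 12, 9, 3]
rng = random.Random(42)  # fixed seed so runs are repeatable (arbitrary choice)

# 1. Random sampling: pick 4 elements without replacement.
random_sample = rng.sample(population, 4)

# 2. Stratified sampling: sample separately from each non-overlapping group.
# The (gender, age) records below are hypothetical.
records = [("F", 23), ("F", 31), ("F", 45), ("M", 19), ("M", 38), ("M", 52)]
strata = {}
for gender, age in records:
    strata.setdefault(gender, []).append((gender, age))
stratified_sample = [row for group in strata.values() for row in rng.sample(group, 2)]

# 3. Systematic sampling: every N-th element, here N = 3 from the 2nd element.
step, start = 3, 1  # start is a zero-based index
systematic_sample = population[start::step]

# 4. Cluster sampling: split into clusters, then keep whole clusters at random.
clusters = [population[i:i + 3] for i in range(0, len(population), 3)]
cluster_sample = [value for cluster in rng.sample(clusters, 2) for value in cluster]

print(systematic_sample)  # [3, 34, 6, 3]
```

Only the systematic sample is deterministic; the other three depend on the random seed.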

Types of Variable

Measures of Central Tendency

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. The most common measures are the mean, the median, and the mode.

Mean (Arithmetic Mean): The mean is equal to the sum of all the values in the data set divided by the number of values in the data set.

Median: The median is the middle score for a set of data that has been arranged in order. When the number of scores is even, it is the average of the two middle scores.

Mode: The mode is the most frequent score in our data set.

x=[2,4,5,32,2,4,6,2,2 ]

mode=2
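As a quick check, all three measures can be computed for the same list with Python's built-in `statistics` module. This is just a small sketch; the variable names are my own.

```python
import statistics

x = [2, 4, 5, 32, 2, 4, 6, 2, 2]

mean = statistics.mean(x)      # sum of the values / number of values = 59 / 9
median = statistics.median(x)  # middle value once the data is sorted
mode = statistics.mode(x)      # most frequent value

print(round(mean, 2), median, mode)  # 6.56 4 2
```

Notice that the single large value 32 pulls the mean above the median, while the median and the mode are not affected by it.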

Measures of Dispersion

What is Dispersion in Statistics?

Dispersion is the state of getting dispersed or spread. Statistical dispersion means the extent to which numerical data is likely to vary about an average value. In other words, dispersion helps to understand the distribution of the data.

The types of absolute measures of dispersion are:

  1. Range: It is simply the difference between the maximum value and the minimum value given in a data set. Example: 1,4,5,7,8,10 => Range = 10 - 1 = 9
  2. Variance: The variance is the average of the squared differences between each point and the mean.
  3. Standard Deviation: The standard deviation is the square root of the variance. It measures the spread of a group of numbers around the mean, in the same units as the data.
  4. Quartiles and Quartile Deviation: The quartiles are values that divide a list of numbers into quarters. The quartile deviation is half of the distance between the third and the first quartile.
  5. Mean and Mean Deviation: The average of numbers is known as the mean and the arithmetic mean of the absolute deviations of the observations from a measure of central tendency is known as the mean deviation.
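The measures listed above can be sketched with the `statistics` module, reusing the numbers from the range example. Two assumptions to keep in mind: the population versions of variance and standard deviation are used here (dividing by n), and the quartiles use the module's default method, so other tools may report slightly different quartile values.

```python
import statistics

data = [1, 4, 5, 7, 8, 10]

data_range = max(data) - min(data)            # 10 - 1 = 9
mean = statistics.mean(data)

variance = statistics.pvariance(data)         # average squared deviation from the mean
std_dev = statistics.pstdev(data)             # square root of the variance

q1, q2, q3 = statistics.quantiles(data, n=4)  # three cut points that split the data into quarters
quartile_deviation = (q3 - q1) / 2            # half the distance between Q3 and Q1

# Mean deviation: average absolute deviation from the mean.
mean_deviation = sum(abs(value - mean) for value in data) / len(data)

print(data_range, round(variance, 2), round(std_dev, 2), round(mean_deviation, 2))
```

If the data is a sample rather than the whole population, `statistics.variance` and `statistics.stdev` divide by n - 1 instead.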

Variance and Standard Deviation

Example:

With a mean of 5 and a standard deviation of 2.92, the scale marks on the right side of the bell curve are 5 + 2.92 = 7.92, 7.92 + 2.92 = 10.84, and 10.84 + 2.92 = 13.76. On the left side they are 5 - 2.92 = 2.08, 2.08 - 2.92 = -0.84, and -0.84 - 2.92 = -3.76.
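The same scale marks can be generated in one step, as the mean plus or minus k standard deviations for k = 1, 2, 3. A minimal sketch, with variable names of my own choosing:

```python
mean, sd = 5, 2.92

# Marks at mean + k * sd for k = -3 ... 3, rounded to two decimals.
marks = [round(mean + k * sd, 2) for k in range(-3, 4)]

print(marks)  # [-3.76, -0.84, 2.08, 5.0, 7.92, 10.84, 13.76]
```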

Empirical Rule

The 68–95–99.7 rule, also known as the empirical rule, is a shorthand used to remember the percentage of values that lie within an interval estimate in a normal distribution: 68%, 95%, and 99.7% of the values lie within one, two, and three standard deviations of the mean, respectively.
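The three percentages can be reproduced from the normal distribution itself. The sketch below assumes Python 3.8 or later, where `statistics.NormalDist` is available, and uses the standard normal (mean 0, standard deviation 1); the rule holds for any normal distribution because it is stated in units of standard deviations.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of values within k standard deviations of the mean.
shares = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

This prints roughly 68.3%, 95.4%, and 99.7%, which is where the rounded 68-95-99.7 figures come from.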

