When can you expect a variable to have a normal distribution?


I was wondering when a normal distribution can be expected. I know that things like:

  1. heights of people
  2. size of things produced by machines
  3. errors in measurements
  4. blood pressure
  5. marks on a test

    (source: Math is Fun)

follow a normal distribution. But would, for example, the nicknames people choose also fall into that category? And if so, or if not, why would that be?


A normal distribution is a common probability distribution. It has a shape often referred to as a "bell curve."

Many everyday data sets approximately follow a normal distribution: for example, the heights of adult humans, the scores on a test given to a large class, and errors in measurements.

The normal distribution is always symmetrical about the mean.

The standard deviation measures how spread out a normally distributed data set is: it tells you how closely the values are gathered around the mean. The shape of a normal distribution is determined by the mean and the standard deviation. The steeper the bell curve, the smaller the standard deviation; if the values are spread far apart, the curve is flatter and the standard deviation is larger.

In general, about 68% of the area under a normal distribution curve lies within one standard deviation of the mean.

That is, if x̄ is the mean and σ is the standard deviation of the distribution, then 68% of the values fall in the range between (x̄ − σ) and (x̄ + σ). In the figure below, this corresponds to the region shaded pink.

[Figure: normal distribution curve with regions shaded pink (within 1σ of the mean), blue (between 1σ and 2σ), and green (between 2σ and 3σ)]

About 95% of the values lie within two standard deviations of the mean, that is, between (x̄ − 2σ) and (x̄ + 2σ).

(In the figure, this is the sum of the pink and blue regions: 34% + 34% + 13.5% + 13.5% = 95%.)

About 99.7% of the values lie within three standard deviations of the mean, that is, between (x̄ − 3σ) and (x̄ + 3σ).

(The pink, blue, and green regions in the figure.)

(Note that these values are approximate.)
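For reference, the exact percentages behind this 68-95-99.7 rule come from the normal cumulative distribution function. A minimal sketch using SciPy (assumed available):

```python
from scipy.stats import norm

# Exact probability mass within k standard deviations of the mean
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} standard deviation(s): {p:.4%}")

# within 1 standard deviation(s): 68.2689%
# within 2 standard deviation(s): 95.4500%
# within 3 standard deviation(s): 99.7300%
```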

Example 1:

A set of data is normally distributed with a mean of 5. What percent of the data is less than 5?

A normal distribution is symmetric about the mean. So, half of the data will be less than the mean and half of the data will be greater than the mean.

Therefore, 50% of the data is less than 5.

Example 2:

The life of a fully charged cell phone battery is normally distributed with a mean of 14 hours and a standard deviation of 1 hour. What is the probability that a battery lasts at least 13 hours?

The mean is 14 and the standard deviation is 1.

50% of the normal distribution lies to the right of the mean, so 50% of the time, the battery will last longer than 14 hours.

The interval from 13 to 14 hours represents one standard deviation to the left of the mean. So, about 34% of the time, the battery will last between 13 and 14 hours.

Therefore, the probability that the battery lasts at least 13 hours is about 34% + 50% = 84%, or 0.84.
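As a sanity check, the exact probability can be computed from the normal survival function; a minimal sketch using SciPy (assumed available):

```python
from scipy.stats import norm

# P(battery life >= 13 hours), with mean 14 and standard deviation 1
print(norm.sf(13, loc=14, scale=1))  # ≈ 0.8413
```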

Example 3:

The average weight of a raspberry is 4.4 g with a standard deviation of 1.3 g. What is the probability that a randomly selected raspberry weighs at least 3.1 g but not more than 7.0 g?

The mean is 4.4 and the standard deviation is 1.3.

Note that

4.4 − 1.3 = 3.1

and

4.4 + 2(1.3) = 7.0

So, the interval 3.1 ≤ x ≤ 7.0 runs from one standard deviation below the mean to two standard deviations above the mean.

In normally distributed data, about 34% of the values lie between the mean and one standard deviation below the mean, and 34% between the mean and one standard deviation above the mean.

In addition, about 13.5% of the values lie between one and two standard deviations above the mean.

Adding the areas, we get 34% + 34% + 13.5% = 81.5%.

Therefore, the probability that a randomly selected raspberry will weigh at least 3.1 g but not more than 7.0 g is 81.5%, or 0.815.
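For comparison, the exact probability can be computed directly; it differs slightly from 81.5% because the 34% and 13.5% figures are rounded. A minimal sketch using SciPy (assumed available):

```python
from scipy.stats import norm

# P(3.1 <= weight <= 7.0), with mean 4.4 and standard deviation 1.3
p = norm.cdf(7.0, loc=4.4, scale=1.3) - norm.cdf(3.1, loc=4.4, scale=1.3)
print(p)  # ≈ 0.8186
```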

Example 4:

A town has 330,000 adults. Their heights are normally distributed with a mean of 175 cm and a variance of 100 cm². How many people would you expect to be taller than 205 cm?

The variance of the data set is given as 100 cm², so the standard deviation is √100 = 10 cm.

Now, 175 + 3(10) = 205, so the number of people taller than 205 cm corresponds to the subset of the data that lies more than 3 standard deviations above the mean.

The figure above shows that this represents about 0.15% of the data. However, that percentage is approximate, and in this case we need more precision. The actual percentage, correct to four decimal places, is 0.1350%.

330,000 × 0.001350 ≈ 445

So, we would expect about 445 people in the town to be taller than 205 cm.
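As a check, the exact tail probability can be computed directly; a minimal sketch using SciPy (assumed available):

```python
from scipy.stats import norm

p_above = norm.sf(3)             # P(Z > 3) ≈ 0.0013499
print(round(330_000 * p_above))  # ≈ 445
```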

But how can you check whether a given variable actually follows a normal distribution? The methods below range from quick visual checks to formal hypothesis tests.

1. Histogram

The first method that almost everyone knows is the histogram. The histogram is a data visualization that shows the distribution of a variable: it gives the frequency of occurrence of each value (or range of values) in the dataset, which is exactly what a distribution describes.

The histogram is a great way to quickly visualize the distribution of a single variable.

1.2. Interpretation

In the picture below, two histograms show a normal distribution and a non-normal distribution.

  • On the left, there is very little deviation of the sample distribution (in grey) from the theoretical bell curve distribution (red line).
  • On the right, we see quite a different shape in the histogram, telling us directly that this is not a normal distribution.

Sometimes the deviation from a normal distribution is so obvious that it can be detected visually.

1.3. Implementation

A histogram can be created easily in Python as follows:

Creating a histogram using pandas in Python
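For example, a minimal sketch; the DataFrame df and the column name value are hypothetical stand-ins for your own data:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sample: 1,000 draws from a normal distribution
df = pd.DataFrame({"value": np.random.normal(loc=0, scale=1, size=1000)})

# pandas' built-in histogram; 'bins' controls the granularity
df["value"].hist(bins=30)
plt.show()
```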

1.4. Conclusion

The histogram is a great way to quickly visualize the distribution of a single variable.

2. Box Plot

The Box Plot is another visualization technique that can be used for detecting non-normal samples. The Box Plot plots the 5-number summary of a variable: minimum, first quartile, median, third quartile and maximum.

The boxplot is a great way to visualize distributions of multiple variables at the same time.

2.2. Interpretation

The boxplot is a great visualization technique because it allows for plotting many boxplots next to each other. Having this very fast overview of variables gives us an idea of the distributions and, as a "bonus," we get the complete 5-number summary that will help us in further analysis.

You should look at two things:

  • Is the distribution symmetrical (as is the Normal distribution)?
  • Does the width (the spread, as opposed to pointiness) correspond to the width of a normal distribution? This is hard to see on a box plot.

[Figure: normal (left), uniform (middle), and exponential (right) boxplots vs. the normal bell curve]

2.3. Implementation

A boxplot can be easily implemented in Python as follows:

Creating a boxplot using pandas in Python
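For example, a minimal sketch with hypothetical columns, one per distribution:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical samples from three different distributions
df = pd.DataFrame({
    "normal": np.random.normal(loc=0, scale=1, size=1000),
    "uniform": np.random.uniform(low=-2, high=2, size=1000),
    "exponential": np.random.exponential(scale=1, size=1000),
})

# One box plot per column, side by side
df.boxplot()
plt.show()
```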

2.4. Conclusion

The boxplot is a great way to visualize distributions of multiple variables at the same time, but a deviation in width/pointiness is hard to identify using box plots.

3. QQ Plot

With QQ plots we’re starting to get into the more serious stuff, as this requires a bit more understanding than the previously described methods.

QQ Plot stands for Quantile-Quantile Plot, which is exactly what it does: it plots the theoretical quantiles against the actual quantiles of our variable.

The QQ Plot allows us to see deviations from a normal distribution much more clearly than a histogram or box plot.

3.2. Interpretation

If our variable follows a normal distribution, the quantiles of our variable must be perfectly in line with the “theoretical” normal quantiles: a straight line on the QQ Plot tells us we have a normal distribution.

[Figure: normal (left), uniform (middle), and exponential (right) QQ Plots]

As seen in the picture, the points on a normal QQ Plot follow a straight line, whereas other distributions deviate strongly.

  • The uniform distribution has too many observations in both extremities (very high and very low values).
  • The exponential distribution has too many observations in the lower values, but too few in the higher values.

In practice, we often see something less pronounced but similar in shape. Over- or underrepresentation in the tails should raise doubts about normality, in which case you should use one of the hypothesis tests described below.

3.3. Implementation

Implementing a QQ Plot can be done using the statsmodels API in Python as follows:

Creating a QQ Plot using statsmodels
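For example, a minimal sketch; the array data is a hypothetical sample:

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Hypothetical sample to check for normality
data = np.random.normal(loc=0, scale=1, size=500)

# Sample quantiles vs. theoretical normal quantiles;
# line="s" adds a standardized reference line
sm.qqplot(data, line="s")
plt.show()
```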

3.4. Conclusion

The QQ Plot allows us to see deviations from a normal distribution much more clearly than a histogram or box plot.

4. Kolmogorov-Smirnov test

If the QQ Plot and other visualization techniques are not conclusive, statistical inference (Hypothesis Testing) can give a more objective answer to whether our variable deviates significantly from a normal distribution.


The Kolmogorov-Smirnov test computes the distances between the empirical distribution and the theoretical distribution and defines the test statistic as the supremum of the set of those distances.
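In symbols, if $F_n$ denotes the empirical distribution function of the sample and $F$ the theoretical (here, normal) distribution function, the test statistic is

$$D_n = \sup_x \lvert F_n(x) - F(x) \rvert.$$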

The advantage of this approach is that it can be used for comparing against any distribution, not necessarily the normal distribution alone.

The KS test is well known, but it does not have much power. It can be used for distributions other than the normal.

4.2. Interpretation

The test statistic of the KS test is the Kolmogorov-Smirnov statistic, which follows a Kolmogorov distribution if the null hypothesis is true.

If the observed data perfectly follow a normal distribution, the value of the KS statistic will be 0. The P-Value is used to decide whether the difference is large enough to reject the null hypothesis:

  • If the P-Value of the KS Test is larger than 0.05, we assume a normal distribution
  • If the P-Value of the KS Test is smaller than 0.05, we do not assume a normal distribution

4.3. Implementation

The KS Test in Python using SciPy can be implemented as follows. It returns the KS statistic and its P-Value.

Applying the KS Test in Python using SciPy
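For example, a minimal sketch; data is a hypothetical sample. Note that scipy.stats.kstest compares against a fully specified distribution, so the sample is standardized first:

```python
import numpy as np
from scipy import stats

# Hypothetical sample to test
data = np.random.normal(loc=0, scale=1, size=500)

# kstest assumes fully specified parameters; if the mean and variance
# must be estimated from the data, the Lilliefors test (next section)
# is more appropriate.
standardized = (data - data.mean()) / data.std(ddof=1)
statistic, p_value = stats.kstest(standardized, "norm")
print(statistic, p_value)
```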

4.4. Conclusion

The KS test is well known, but it does not have much power. This means that a large number of observations is necessary to reject the null hypothesis. It is also sensitive to outliers. On the other hand, it can be used for other types of distributions.

5. Lilliefors test

The Lilliefors test is strongly based on the KS test. The difference is that the Lilliefors test accounts for the fact that the mean and variance of the population distribution are estimated from the data rather than pre-specified by the user.

Because of this, the Lilliefors test uses the Lilliefors distribution rather than the Kolmogorov distribution.

Unfortunately for Lilliefors, its power is still lower than that of the Shapiro-Wilk test.

5.2. Interpretation

  • If the P-Value of the Lilliefors Test is larger than 0.05, we assume a normal distribution
  • If the P-Value of the Lilliefors Test is smaller than 0.05, we do not assume a normal distribution

5.3. Implementation

The Lilliefors test implementation in statsmodels will return the value of the Lilliefors test statistic and the P-Value as follows.

Attention: in the statsmodels implementation, P-Values lower than 0.001 are reported as 0.001 and P-Values higher than 0.2 are reported as 0.2.

Applying the Lilliefors test using statsmodels
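For example, a minimal sketch with a hypothetical sample:

```python
import numpy as np
from statsmodels.stats.diagnostic import lilliefors

# Hypothetical sample to test
data = np.random.normal(loc=0, scale=1, size=500)

# Mean and variance are estimated from the data internally
statistic, p_value = lilliefors(data, dist="norm")
print(statistic, p_value)
```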

5.4. Conclusion

Although the Lilliefors test is an improvement over the KS test, its power is still lower than that of the Shapiro-Wilk test.

6. Shapiro-Wilk test

The Shapiro-Wilk test is the most powerful test when testing for a normal distribution. It was developed specifically for the normal distribution, and unlike, for example, the KS test, it cannot be used for testing against other distributions.

The Shapiro-Wilk test is the most powerful test when testing for a normal distribution.

6.2. Interpretation

  • If the P-Value of the Shapiro-Wilk Test is larger than 0.05, we assume a normal distribution
  • If the P-Value of the Shapiro-Wilk Test is smaller than 0.05, we do not assume a normal distribution

6.3. Implementation

The Shapiro-Wilk test can be implemented as follows. It will return the test statistic, called W, and the P-Value.

Attention: for N > 5000 the W test statistic is accurate but the p-value may not be.

Applying the Shapiro-Wilk test using SciPy in Python
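For example, a minimal sketch using scipy.stats.shapiro with a hypothetical sample (the N > 5000 caveat above comes from its documentation):

```python
import numpy as np
from scipy import stats

# Hypothetical sample to test
data = np.random.normal(loc=0, scale=1, size=500)

# Returns the W statistic and its p-value
w_statistic, p_value = stats.shapiro(data)
print(w_statistic, p_value)
```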

6.4. Conclusion

The Shapiro-Wilk test is the most powerful test when testing for a normal distribution. You should definitely use this test.

7. Conclusion: which approach to use?

For quick and visual identification of a normal distribution, use a QQ Plot if you have only one variable to look at and a box plot if you have many. Use a histogram if you need to present your results to a non-statistical audience.

As a statistical test to confirm your hypothesis, use the Shapiro-Wilk test. It is the most powerful test, which should be the decisive argument.

When testing against other distributions, you cannot use Shapiro-Wilk; use, for example, the Anderson-Darling test or the KS test instead.