Learning Goals #
- Evaluate whether data follows a normal distribution.
- Assess whether there is evidence that datapoints are outliers.
- Test whether the variances of two or more samples are homogeneous.
READ SECTION 9.4.1
Normality testing
1. Normality testing #
We start our data evaluation with testing for normality. Most parametric statistical tests require the data to follow a specific distribution; for the tests covered in this course this is usually the normal distribution. In other words, we want to study whether our data follows the normal distribution, a requirement sometimes referred to as the normality criterion.
1.1. Graphical method #
One approach is to leverage our plotting skills. Figure 1A and 1B show the box-and-whisker plot and histogram, respectively, as taught in Lesson 2. The box-and-whisker plot suggests the data is symmetrically distributed, but this is contradicted by the histogram.
Figure 1. Three different methods of plotting the same data that give different perspectives as to the data meeting the normality criterion.
SEE BOOK FOR A DETAILED GUIDE
Section 9.4.1.1 features a detailed step-by-step example of creating a P-P plot, as well as a Q-Q plot.
In essence, the actual values are converted to z-values through the relation z=(x-\mu)/\sigma, and the CDF is computed for each. These values are then plotted against the CDF the points would have if they ideally followed the normal distribution. If the data is normally distributed, the points will fall along the straight dotted line.
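The calculation behind a P-P plot can be sketched in a few lines of Julia. This is a minimal sketch, not the book's exact recipe: the plotting positions (i − 0.5)/n used for the empirical CDF are one common convention, and the Distributions.jl package is assumed for the normal CDF.

```julia
using Statistics, Distributions

x = [3.18, 3.47, 3.34, 3.18, 3.44, 3.06, 2.96, 3.41, 3.02,
     3.13, 3.58, 3.04, 2.96, 3.63, 2.83, 3.01, 3.70, 3.06]

n = length(x)
z = sort((x .- mean(x)) ./ std(x))    # standardized values, sorted
p_theory = cdf.(Normal(), z)          # CDF under the normal assumption
p_empirical = ((1:n) .- 0.5) ./ n     # empirical CDF (plotting positions)

# For normally distributed data, plotting p_empirical against p_theory
# gives points that fall along the straight line y = x.
```

A Q-Q plot instead compares the theoretical quantiles, `quantile(Normal(), p_empirical)`, against the sorted z-values.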
1.2. Statistical tests #
It is also possible to express the normality criterion as a number. A well-known example is the Kolmogorov-Smirnov test. This test compares the actual CDF with the theoretically ideal CDF, with H_0 reflecting that the data is normally distributed and H_1 that it is not. The test statistic is simply the largest difference between the two CDFs within the dataset. This is shown in Figure 9.26 in the book.
The Kolmogorov-Smirnov test is appropriate for large sample sizes (n > 50). For smaller sample sizes we tend to use Lilliefors' correction [1]. This is essentially the same test, but with stricter critical values.
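To make concrete what "largest difference" means, the Kolmogorov-Smirnov statistic D can be computed by hand. A minimal sketch (assuming Distributions.jl for the normal CDF; the empirical CDF must be checked on both sides of each of its jumps):

```julia
using Statistics, Distributions

x = sort([3.18, 3.47, 3.34, 3.18, 3.44, 3.06, 2.96, 3.41, 3.02,
          3.13, 3.58, 3.04, 2.96, 3.63, 2.83, 3.01, 3.70, 3.06])
n = length(x)

# Theoretical CDF of the fitted normal at every observation
# (estimating mu and sigma from the data is exactly the Lilliefors situation)
F = cdf.(Normal(mean(x), std(x)), x)

# Largest vertical distance between the empirical and theoretical CDF,
# evaluated just before and at each step of the empirical CDF
D = maximum(max.(abs.((1:n) ./ n .- F), abs.(((1:n) .- 1) ./ n .- F)))
```

D is then compared to the appropriate critical value (Kolmogorov-Smirnov or, here, Lilliefors) for the chosen significance level and n.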
Both the Kolmogorov-Smirnov and Lilliefors' tests can be executed in MATLAB using the following functions, where h is the test result, p the p-value and x the data vector. The public MATLAB File Exchange also features several user-contributed functions that conduct the Shapiro-Wilk test [2, 3].
% Kolmogorov-Smirnov Test
% Note: kstest compares x to the standard normal distribution,
% so standardize the data first, e.g. kstest((x-mean(x))/std(x))
[h,p] = kstest(x)
% Lilliefors' Test (mu and sigma are estimated from the data)
[h,p] = lillietest(x)
##### Kolmogorov-Smirnov Test
using Statistics, Distributions, HypothesisTests
# Define the data
x = [3.18, 3.47, 3.34, 3.18, 3.44, 3.06, 2.96, 3.41, 3.02,
3.13, 3.58, 3.04, 2.96, 3.63, 2.83, 3.01, 3.70, 3.06]
# Perform the Kolmogorov-Smirnov test against a normal distribution with
# the sample mean and standard deviation. Note: Normal() without arguments
# is the standard normal, which would be wrong for unstandardized data;
# estimating the parameters from the data is the Lilliefors situation.
ks_test = ExactOneSampleKSTest(x, Normal(mean(x), std(x)))
# Extract results
p = pvalue(ks_test)
h = p < 0.05 # Reject null hypothesis (normality) if p < 0.05
# Print results
println("Kolmogorov-Smirnov Test:")
println("Hypothesis rejected (data not normal): ", h)
println("p-value: ", p)
##### Lilliefors' Test
# The Lilliefors' Test is not featured on the website currently
READ SECTION 9.4.2
Outlier testing
2. Outlier testing #
The Kolmogorov-Smirnov test shows us that outliers can be a serious problem. We will now study some methods to determine whether a datapoint is an outlier, but we must be mindful that – like any statistical test – these methods can make mistakes. To make matters worse, we will see that some tests can contradict each other.
WARNING
The removal of outliers is a sensitive topic that must be approached with great care. Removing a datapoint is inherently subjective and essentially amounts to manipulating the dataset. Regulated environments therefore often do not allow outliers to be removed.
2.1. Critical range #
A relatively simple test is the critical range method, which can be used when \sigma is known. This test assesses H_0, that the data contains no outliers, against H_1, that the data contains at least one outlier. H_0 is rejected if
Equation 9.56: (x_{\text{max}}-x_{\text{min}})>CR_{\alpha,n}\sigma
Here, x_{\text{min}} and x_{\text{max}} are the minimum and maximum values of the dataset. CR_{\alpha,n} is a tabulated (see Table 1) critical value that depends on the significance level and the sample size.
Table 1. Critical values CR_{\alpha,n} for the critical range method.

| n | CR_{0.1,n} | CR_{0.05,n} | CR_{0.025,n} |
|---|---|---|---|
| 2 | 2.3 | 2.8 | 3.2 |
| 3 | 2.9 | 3.3 | 3.7 |
| 4 | 3.2 | 3.6 | 4.0 |
| 5 | 3.5 | 3.9 | 4.2 |
| 6 | 3.7 | 4.0 | 4.4 |
| 7 | 3.8 | 4.2 | 4.5 |
| 8 | 3.9 | 4.3 | 4.6 |
| 9 | 4.0 | 4.4 | 4.7 |
| 10 | 4.1 | 4.5 | 4.8 |
| 11 | 4.2 | 4.5 | 4.9 |
| 12 | 4.3 | 4.6 | 4.9 |
| 13 | 4.3 | 4.7 | 5.0 |
| 14 | 4.4 | 4.8 | 5.1 |
The following repeated measurements were obtained: x = [23.7 23.5 22.4 22.9 26.5 23.9 23.0 22.1]. Is there an outlier if the precision of the method is \sigma = 0.5? Run the test at a significance level of 0.05.
Correct! Remember that any statistical test cannot tell you whether the null hypothesis is true or false. It can only tell you whether there is sufficient evidence to reject the null hypothesis!
This is not correct. Try again! Did you take the correct value from the table (4.3)?
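The quiz above can be worked out in a few lines of Julia. A minimal sketch (CR_{0.05,8} = 4.3 is taken from Table 1):

```julia
# Critical range test (sigma known) at alpha = 0.05
x = [23.7, 23.5, 22.4, 22.9, 26.5, 23.9, 23.0, 22.1]
sigma = 0.5                        # known precision of the method
CR = 4.3                           # CR_{0.05,8} from Table 1

range_x = maximum(x) - minimum(x)  # 26.5 - 22.1 = 4.4
h = range_x > CR * sigma           # 4.4 > 2.15, so H0 is rejected
```

Because the range 4.4 exceeds 4.3 × 0.5 = 2.15, there is sufficient evidence to reject H_0 and suspect an outlier (the value 26.5).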
2.2. Dixon Q and Grubbs tests #
If \sigma is not known, an alternative is the Dixon Q test. The hypotheses are the same as for the critical range method, but the Q statistic is calculated differently depending on (i) the sample size, and (ii) whether an outlier is suspected at the lower or upper end of the sorted data.
EQUATIONS TO CALCULATE Q STATISTIC
See the book for the equations to calculate the Q statistics. They can be found from Equation 9.51a – 9.53b.
Similar to the critical range method, the Q statistic is compared to a critical Q value that is tabulated.
The Dixon test is useful, but particularly sensitive to the presence of two outliers, as is also demonstrated by the example in Figure 3 (see Section 9.4.2.1 of the book for an elaborate calculation).
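The exact equations are in the book, but to illustrate the idea: for the smallest sample sizes (3 ≤ n ≤ 7) the Q statistic is simply the gap between the suspect value and its nearest neighbour, divided by the total range. A minimal sketch with hypothetical data:

```julia
# Dixon's Q for small samples (the r10 ratio): gap / range
x = sort([2.83, 2.96, 3.04, 3.06, 3.18, 3.70])     # hypothetical data, n = 6

Q_low  = (x[2] - x[1]) / (x[end] - x[1])           # suspect at the low end
Q_high = (x[end] - x[end-1]) / (x[end] - x[1])     # suspect at the high end
Q = max(Q_low, Q_high)                             # ≈ 0.598 here
```

Q is then compared to the tabulated critical Q value for the chosen α and n; larger samples use the modified ratios from Equations 9.51a – 9.53b.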
The Dixon test can be executed using the following function, where h is the test result, p the p-value, stats additional test statistics, x the data vector, alpha the significance level and tail indicates whether it is a left-, right- or two-sided test. The function can be downloaded here (Dixon.m, .ZIP). Unpack it in the Current Folder window. See the annotated code for clarification on its use.
[h,p,stats] = Dixon(x,alpha,tail);
using Statistics, Distributions
include("Dixon.jl")
# Example Usage
x = [3.18, 3.47, 3.34, 3.18, 3.44, 3.06, 2.96, 3.41, 3.02, 3.13, 3.58, 3.04, 2.96, 3.63, 2.83, 3.01, 3.70, 3.06]
# Run the Dixon test
h, stats = Dixon(x,0.05,"both")
An alternative to the Dixon test is the Grubbs test that can be used to detect two outliers. It is described in further detail in Section 9.4.2.2.
The Grubbs test can be executed using the following function, where h is the test result, p the p-value, stats additional test statistics, x the data vector, alpha the significance level and tail indicates whether it is a left-, right- or two-sided test. The function can be downloaded here (Grubbs.m, .ZIP). Unpack it in the Current Folder window. See the annotated code for clarification on its use.
An alternative is MATLAB's built-in isoutlier function, which returns for each value a boolean indicating whether that value is an outlier (1) or not (0).
[h,p,stats] = Grubbs(x,alpha,tail);
% Alternative Method
TF = isoutlier(x,'grubbs')
using Statistics, Distributions
include("Grubbs.jl")
# Example Usage
x = [3.18, 3.47, 3.34, 3.18, 3.44, 3.06, 2.96, 3.41, 3.02, 3.13, 3.58, 3.04, 2.96, 3.63, 2.83, 3.01, 3.70, 3.06]
# Run the Grubbs test
h,p,stats = Grubbs(x,0.05,"both")
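Under the hood, the single-outlier Grubbs statistic is simply the largest deviation from the mean, expressed in units of the standard deviation. A minimal sketch using the measurement data from Section 2.1 (the critical value must still be looked up in a table, e.g. in the book):

```julia
using Statistics

# Grubbs statistic: G = max|x_i - mean(x)| / std(x)
x = [23.7, 23.5, 22.4, 22.9, 26.5, 23.9, 23.0, 22.1]
G = maximum(abs.(x .- mean(x))) / std(x)   # ≈ 2.20 here
# Compare G to the tabulated critical value for n = 8 and the chosen alpha;
# if G exceeds it, the most extreme value (26.5) is flagged as an outlier.
```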
2.3. Median Absolute Deviation (MAD) #
An incredibly useful outlier-detection tool is the so-called median absolute deviation (MAD). The MAD is the median of the absolute deviations from the median, \text{MAD}=\text{median}(|x_i-\text{median}(x)|), and a datapoint is flagged as an outlier if its absolute deviation from the median exceeds a threshold of several (typically three) MADs. Because it is based on medians rather than means, this method is far less affected by the outliers themselves.
% Median Absolute Deviation Method
% (default: flags values more than three scaled MADs from the median)
TF = isoutlier(x);
using Statistics
# Flag outliers: values whose absolute deviation from the median exceeds
# `threshold` times the MAD. Note: MATLAB's isoutlier uses a scaled MAD
# (MAD * 1.4826) to make the threshold comparable to standard deviations.
function mad_outliers(x; threshold=3)
    median_x = median(x)
    deviations = abs.(x .- median_x)
    mad_value = median(deviations)
    return deviations .> threshold * mad_value
end
# Example data
x = [1, 2, 3, 4, 100, 6, 7]
# Detect outliers using the MAD method
TF = mad_outliers(x)
println(TF)
3. Heteroscedasticity #
3.1. Bartlett Test #
Another well-known homoscedasticity test is Bartlett's test, which – unlike the F-test from Lesson 7 – allows more than two variances to be compared at once. The test statistic, which follows the \chi^2-distribution, is computed as
Equation 9.59: \chi^2_{\text{obs}}=\frac{\nu_{\text{pool}} \ln{s^2_{\text{pool}}}-\sum^k_{i=1} \nu_i \ln{s^2_i}}{C}
Here, k is the number of groups, \nu_{\text{pool}} the total number of degrees of freedom, and s^2_{\text{pool}} the pooled variance given by \sum^k_{i=1} \nu_i s^2_i / \nu_{\text{pool}}. Finally, C is given by
Equation 9.60: C=1+\frac{(\sum^k_{i=1} \frac{1}{\nu_i})-\frac{1}{\nu_{\text{pool}}}}{3(k-1)}
As always, the observed statistic is compared to the critical value, now at k − 1 degrees of freedom. Bartlett's test is, however, extremely sensitive to deviations from normality. In practice we therefore often use the more robust Levene's test, which is treated in the next lesson.
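Equations 9.59 and 9.60 are easy to implement directly. A minimal sketch with three hypothetical groups (the critical value \chi^2_{0.05,2} \approx 5.991 is hard-coded here to keep the sketch dependency-free):

```julia
using Statistics

# Bartlett's test statistic for k groups (Equations 9.59 and 9.60)
function bartlett_chi2(groups)
    k      = length(groups)
    nu     = [length(g) - 1 for g in groups]   # degrees of freedom per group
    s2     = [var(g) for g in groups]          # sample variances
    nupool = sum(nu)                           # pooled degrees of freedom
    s2pool = sum(nu .* s2) / nupool            # pooled variance
    C      = 1 + (sum(1 ./ nu) - 1 / nupool) / (3 * (k - 1))
    return (nupool * log(s2pool) - sum(nu .* log.(s2))) / C
end

# Hypothetical example: three groups of five measurements
groups = ([1.0, 2, 3, 4, 5], [2.0, 4, 6, 8, 10], [1.0, 1, 2, 2, 3])
chi2_obs = bartlett_chi2(groups)   # ≈ 5.70

# Compare to the critical chi-squared value at k - 1 = 2 degrees of freedom
chi2_crit = 5.991                  # chi-squared at alpha = 0.05, df = 2
h = chi2_obs > chi2_crit           # false: no evidence the variances differ
```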
Concluding remarks #
We have now learned methods to test important prerequisites of statistical tests, such as the normality criterion, the presence of outliers and homoscedasticity. The latter was already partly covered by the F-test from Lesson 7.
Now that we can test these prerequisites, the next question is what happens when one of these tests fails. This is covered in the next lesson.
References #
[1] Lilliefors, H. W. On the Kolmogorov-Smirnov Test for Normality with Mean and Variance Unknown. Journal of the American Statistical Association, 62(318), 1967, 399–402, DOI: https://doi.org/10.1080/01621459.1967.10482916
[2] Shapiro, S.S., Wilk, M.B., An analysis of variance test for normality (complete samples), Biometrika, Volume 52(3-4), 1965, 591–611, DOI: https://doi.org/10.1093/biomet/52.3-4.591
[3] Razali, N; Wah, Y.B. Power comparisons of Shapiro–Wilk, Kolmogorov–Smirnov, Lilliefors and Anderson–Darling tests. J. of Stat. Mod. and Anal. 2011, 2(1), 21–33, LINK: ResearchGate