
10. Robust Statistics

Updated on February 9, 2026
In the lessons thus far we have worked with statistics that assumed several criteria to be fulfilled: data had to be normally distributed, be free of outliers, and so on. However, what happens when these criteria are not met? In this lesson we will learn about robust alternative methods. We will see that the simplicity that makes these methods so robust also comes at a cost.

Learning Goals #

  • Recognize when classical parametric tests fail, and explain how violations of normality and homoscedasticity affect statistical inference.
  • Describe the principles of robust and rank-based methods, including why and how they reduce sensitivity to outliers and non-ideal data.
  • Compute and interpret non-parametric tests, in particular the Wilcoxon/Mann-Whitney and Kruskal-Wallis tests, in terms of distributional differences between groups.
  • Make informed choices between parametric, robust, and non-parametric tests, understanding the trade-off between statistical power and robustness.
Analytical Separation Science by B.W.J. Pirok and P.J. Schoenmakers
READ SECTIONS 9.5.1-9.5.2

Introduction to robust methods

1. What is a robust method? #

Robust or non-parametric methods operate by abstracting the data to some degree to distil an answer. This is best understood by revisiting the example from Lesson 2:

				
					x = [1, 2, 3, 4, 5, 6, 42];
				
			

x = [1; 2; 3; 4; 5; 6; 42]

Here, the mean (9) was strongly affected by the presence of an outlier (42), yet the median (4) was not. The median completely ignores outliers and simply selects the middle value of the sorted vector. For the median it does not matter how extreme the outliers are: it is robust against their presence. Key characteristics of such non-parametric methods are shown in Figure 1.

Figure 1. Comparison between parametric and non-parametric (robust) tests.
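The numerical contrast is easy to verify. The short sketch below uses Python purely as a language-neutral illustration alongside the lesson's MATLAB and Julia snippets:

```python
from statistics import mean, median

x = [1, 2, 3, 4, 5, 6, 42]
print(mean(x))    # 9  -- pulled upward by the outlier
print(median(x))  # 4  -- the middle value, unaffected by it
```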

Is there then no disadvantage to the median? Well, the median ignores the actual values of all other observations; it simply selects the middle one. You could therefore say that it is less representative of the complete dataset. We will see in this lesson that, owing to this abstract treatment of the data, robust methods have significantly less statistical power. Table 1 shows an overview of parametric and non-parametric methods.

Table 1. Overview of parametric and non-parametric tests.
Objective                         | Parametric                           | Non-parametric
Describing a sample               | Mean, standard deviation             | Median, IQR
Comparing mean with value         | t-test                               | Sign test, Wilcoxon signed-rank test
Comparing two means (matched)     | t-test                               | Sign test, Wilcoxon signed-rank test
Comparing two means (non-matched) | t-test                               | Mann-Whitney U test
Comparing variances               | F-test (k=2), Bartlett test (k\ge2)  | Levene's test (k\ge2)
Comparing several means           | ANOVA                                | Kruskal-Wallis test
Measuring variables relationship  | Pearson correlation                  | Spearman correlation

2. Comparison of mean with reference #

2.1. Sign test #

In Lesson 4, we learned about the comparison of a sample mean with a reference as our first hypothesis test. The first non-parametric alternative is the sign test.

CASE: RETENTION TIME STABILITY

According to a protocol the median retention time of an analyte should be \tilde{x}_0 = 3.5 min. The repeated measurements shown below are obtained. Is the median of the population of data 3.5? Note that the hypotheses are H_0: \tilde{x}=\tilde{x}_0 and H_1: \tilde{x}\neq\tilde{x}_0.

				
					x = [3.3, 3.2, 3.4, 4.7, 3.7, 3.3, 3.4, 3.2, 3.5];
				
			

The sign test exclusively considers the direction of the deviations, i.e. it literally considers the signs. The procedure is as follows:

  1. Subtract the reference from each value in the dataset.
  2. Count the number of positive and the number of negative differences (differences of exactly zero are discarded).
  3. Take the smaller of the two counts; this is the observed statistic r_{\text{obs}}.
  4. Compare r_{\text{obs}} with a tabulated r_{\text{crit}} value.
EXERCISE 1: RETENTION STABILITY

Considering the case above, what is the outcome of your hypothesis test?

The sign test can also be used for a matched-pairs comparison of two sample means: first subtract the two datasets from each other, then apply the procedure above with zero as the reference.
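The counting procedure can also be written out by hand. The Python sketch below is an illustrative helper (the function name sign_test is our own, not part of the lesson's MATLAB/Julia material); it implements the four steps and computes a two-sided binomial p-value:

```python
from math import comb

def sign_test(x, reference):
    """Two-sided sign test of H0: the population median equals reference.
    Values exactly equal to the reference are discarded."""
    diffs = [xi - reference for xi in x if xi != reference]
    n = len(diffs)
    n_pos = sum(d > 0 for d in diffs)
    r_obs = min(n_pos, n - n_pos)
    # Under H0 the sign counts follow Binomial(n, 0.5); the two-sided
    # p-value doubles the tail probability of the smaller count
    p = min(1.0, 2 * sum(comb(n, k) for k in range(r_obs + 1)) / 2**n)
    return r_obs, p

r, p = sign_test([4.6, 4.7, 4.8, 4.9, 5.1], 5.0)
# r = 1 (one positive sign out of five), p = 0.375
```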

The code shown below conducts the sign test. If desired, variables such as alpha (\alpha) and the tails of the test, tail, can be further specified. The variable tail can be either ‘left’, ‘right’, or ‘both’.

				
					% Sign Test
p = signtest(x,reference);

% Extended Configuration
p = signtest(x,reference,'Alpha',alpha,'Tail',tail);

% Example
x           = [4.6 4.7 4.8 4.9 5.1];
reference   = 5.0;
p           = signtest(x,reference,'Alpha',0.01,'Tail','left');
				
			
				
					using HypothesisTests

x           = [4.6, 4.7, 4.8, 4.9, 5.1]
reference   = 5.0

# This is a two-sided test
test        = SignTest(x, reference)
p           = pvalue(test)

println("p-value = ", p)
				
			

The sign test reduces the content of the data to the mere signs of the deviations from the reference. It is therefore not surprising that its statistical power is lower than that of a parametric t-test.

2.2. Wilcoxon signed-rank test #

The Wilcoxon signed-rank test is a non-parametric alternative to the paired t-test comparison of two means. It is used when two paired measurements are compared but the assumptions of normality are questionable.

Unlike the sign test, it considers both the direction and the magnitude of the differences by ranking their absolute values.

EXERCISE 2: ROBUST RANKING?

Why does ranking the differences make the Wilcoxon signed-rank test more robust than the paired t-test?

The procedure for the Wilcoxon’s signed-rank test is:

  1. Compute the differences d between paired observations.
  2. Assign a sign (+/–) to each difference.
  3. Rank the absolute values of d, ignoring the signs (ties receive average ranks).
  4. Sum the ranks for positive differences (T^{+}) and negative differences (T^{-}).
  5. The test statistic T_{\text{obs}} is the smaller of T^{+} and T^{-}.
  6. Compare this value to the critical value T_{\text{critical}} for the Wilcoxon signed-rank test at the chosen significance level, or compute the corresponding p-value.
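The ranking steps above can be written out explicitly. The following Python sketch is illustrative only (signed_rank_T and the paired data are made up for this example); it computes T^{+}, T^{-}, and T_{\text{obs}} with average ranks for ties:

```python
def signed_rank_T(x1, x2):
    """Observed Wilcoxon signed-rank statistic T_obs for paired samples.
    Zero differences are dropped; tied |d| values receive average ranks."""
    d = [round(a - b, 9) for a, b in zip(x1, x2)]  # round to tame float noise
    d = [v for v in d if v != 0]
    abs_sorted = sorted(abs(v) for v in d)
    def avg_rank(v):
        first = abs_sorted.index(v) + 1            # rank of first occurrence
        return first + (abs_sorted.count(v) - 1) / 2
    t_plus  = sum(avg_rank(abs(v)) for v in d if v > 0)
    t_minus = sum(avg_rank(abs(v)) for v in d if v < 0)
    return min(t_plus, t_minus)

# Hypothetical paired measurements (made up for illustration)
before = [12.1, 11.8, 12.5, 12.0, 11.6]
after  = [11.9, 12.0, 12.1, 11.7, 11.5]
# signed_rank_T(before, after) returns T_obs = 2.5
```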
CASE: PEAK AREA COMPARISON ON TWO SYSTEMS

The next example compares the peak areas of a compound of interest which was measured for 9 different objects in a sample on two different systems. See Sections 9.3.4.2 & 9.5.3 for more details.

The null hypothesis is that the median difference equals zero, meaning that there is no systematic increase or decrease between the paired measurements. Because the test relies on ranks rather than absolute values, it is robust against outliers and insensitive to skewed distributions. If T_{\text{obs}} > T_{\text{crit}}, then H_0 is retained.

The code shown below conducts the Wilcoxon signed-rank test on a vector of differences (x_diff), or on the differences computed from the original paired values in x1 and x2. If desired, variables such as alpha (\alpha) and the tails of the test, tail, can be further specified. The variable tail can be either ‘left’, ‘right’, or ‘both’.

				
					% Wilcoxon Signed-Rank Test: Directly On Differences
p = signrank(x_diff);

% Wilcoxon Signed-Rank Test: On Two Paired Samples
p = signrank(x1,x2);

% Extended Configuration
p = signrank(x1,x2,'alpha',alpha,'tail',tail);

% Example
x1          = [5.1 5.3 5.0 5.2 5.4 5.1 5.3];   % Before
x2          = [5.0 5.4 4.9 5.3 5.3 5.2 5.2];   % After
p           = signrank(x1,x2,'alpha',0.01,'tail','left');
				
			
				
					using HypothesisTests

# Wilcoxon Signed-Rank Test: Directly On Differences
test        = SignedRankTest(x_diff)
p           = pvalue(test)

# Wilcoxon Signed-Rank Test: On Two Paired Samples
test        = SignedRankTest(x1,x2)
p           = pvalue(test)

# Example
x1          = [5.1, 5.3, 5.0, 5.2, 5.4, 5.1, 5.3]
x2          = [5.0, 5.4, 4.9, 5.3, 5.3, 5.2, 5.2]

x_diff      = x1 .- x2
test        = SignedRankTest(x_diff)
p           = pvalue(test)
				
			
Table 2. Overview of peak areas for compound of interest of nine objects measured on two systems (x_1 and x_2).
Object | System 1 | System 2 | Difference | Sign | Rank
1      | 27185    | 25904    | 1281       | +    | 5
2      | 20068    | 21207    | -1139      | -    | 4
3      | 32593    | 35582    | -2989      | -    | 7
4      | 176438   | 172343   | 4095       | +    | 9
5      | 29319    | 28466    | 853        | +    | 3
6      | 21634    | 18502    | 3132       | +    | 8
7      | 3387     | 3273     | 114        | +    | 1.5
8      | 28181    | 29889    | -1708      | -    | 6
9      | 3466     | 3352     | 114        | +    | 1.5
EXERCISE 3: PEAK AREAS

In a system comparison study, peak areas of all objects within the same sample are measured on two different systems. Because the variation effects may be non-Gaussian and uneven across the sample, a Wilcoxon signed-rank test provides a robust alternative to the paired t-test. Use the procedure above to conduct the test using the values of Table 2. See Section 9.5.2 for a detailed analysis of this question. As a result of the test, you find that your observed T-value is (round to one decimal)

As a result of this observed statistic, you conclude that the null hypothesis is

The Wilcoxon signed-rank test is more powerful than the sign test, but it relies on the assumption that the distribution of differences is symmetric. This assumption can be assessed by comparison with a symmetric reference distribution or by other approaches, which are beyond the scope of this course.

Analytical Separation Science by B.W.J. Pirok and P.J. Schoenmakers
READ SECTION 9.5.3

Unpaired Mann-Whitney U test

2.3. Mann-Whitney U test #

When comparing two independent samples and the assumptions of the parametric t-test are not met, the Mann–Whitney U test provides a non-parametric alternative. It serves the same role as the unpaired t-test, but operates on ranks rather than raw values, making it more robust to non-normality and outliers.
Wilcoxon's Rank-Sum

The Mann–Whitney U test is also known as Wilcoxon’s rank-sum test. Despite the similarity in name, it should not be confused with the Wilcoxon signed-rank test, which is used for paired data. Here, the samples are independent and originate from different objects.

The procedure to perform the Mann–Whitney U test is:

  1. All observations from both samples (or groups) are first combined and ranked, assigning average ranks to tied values.
  2. The ranks are then summed separately for each group. This yields R_{\text{A}} and R_{\text{B}} in Equation 9.61.
  3. The test statistics U_{\text{A}} and U_{\text{B}} are calculated from the sample sizes and rank sums.
EQUATION 9.61

U_A = n_A n_B + \frac{n_A (n_A + 1)}{2} - R_A

U_B = n_A n_B + \frac{n_B (n_B + 1)}{2} - R_B

  4. The smaller of these two values is taken as the observed statistic U_{\text{obs}}.
  5. U_{\text{obs}} is compared to the tabulated critical value for the chosen significance level. If U_{\text{obs}} exceeds the critical value, the null hypothesis is retained; otherwise, it is rejected.
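The procedure translates into a few lines of code. The Python sketch below is illustrative (the helper mann_whitney_U is our own); it ranks the combined samples, forms the rank sums, and applies Equation 9.61:

```python
def mann_whitney_U(xa, xb):
    """Observed U statistic from the rank sums of Equation 9.61.
    Tied values receive average ranks."""
    combined = sorted(list(xa) + list(xb))
    def avg_rank(v):
        first = combined.index(v) + 1
        return first + (combined.count(v) - 1) / 2
    Ra = sum(avg_rank(v) for v in xa)
    Rb = sum(avg_rank(v) for v in xb)
    na, nb = len(xa), len(xb)
    Ua = na * nb + na * (na + 1) / 2 - Ra
    Ub = na * nb + nb * (nb + 1) / 2 - Rb
    return min(Ua, Ub)

# Same two samples as the MATLAB example in this section
x1 = [9.8, 10.1, 10.5, 10.7, 11.0, 11.2, 11.4]
x2 = [9.9, 10.2, 10.4, 10.8, 10.9, 11.3, 11.5]
# mann_whitney_U(x1, x2) returns U_obs = 23.0
```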
Table 3. Example of ranked data for the Mann-Whitney U unpaired comparison of two samples.
Sample | Value | Rank
A      | 25.9  | 1
B      | 26.6  | 2
B      | 26.7  | 3
A      | 26.8  | 4
B      | 27.0  | 5
A      | 27.2  | 6
A      | 27.3  | 7
A      | 27.4  | 8
A      | 27.7  | 9
A      | 27.8  | 10.5
B      | 27.8  | 10.5
B      | 27.9  | 12
A      | 28.1  | 13
A      | 28.5  | 14
B      | 29.3  | 15
B      | 29.4  | 16
B      | 30.3  | 17

The test evaluates whether two samples are likely to originate from the same population, without requiring assumptions about the underlying distribution.

In the following example code, x1 and x2 are the vectors containing all the data pertaining to their respective groups. The variable tail can be either ‘left’, ‘right’, or ‘both’.

				
					% Mann-Whitney U Test
[p,h,stats] = ranksum(x1,x2);

% Extended Configuration
[p,h,stats] = ranksum(x1,x2,'alpha',alpha,'tail',tail);

% Example
x1          = [9.8 10.1 10.5 10.7 11.0 11.2 11.4];   % Sample One
x2          = [9.9 10.2 10.4 10.8 10.9 11.3 11.5];   % Sample Two
[p,h,stats] = ranksum(x1,x2,'Tail','right');
				
			
				
using HypothesisTests

# Mann-Whitney U Test
test        = MannWhitneyUTest(x1,x2)

# Example
x1          = [9.8, 10.1, 10.5, 10.7, 11.0, 11.2, 11.4]
x2          = [9.9, 10.2, 10.4, 10.8, 10.9, 11.3, 11.5]
test        = MannWhitneyUTest(x1,x2)
p           = pvalue(test)
				
			

For small sample sizes, the Mann–Whitney U test is limited by the availability of tabulated critical values. Importantly, the test does not strictly assess equality of medians, but rather whether two samples originate from the same underlying probability distribution [1,2]. Differences in data spread and distribution shape therefore play an important role, and ideally the two samples should have similarly shaped distributions [3].

EXERCISE 4: METABOLITES

The metabolite level of two groups of patients was analysed using LC-MS, and the concentrations of a target compound were obtained. The results are reported in Table 3. Group A received a drug which is believed to alter the metabolite level, whereas group B received a placebo. Each group consists of different patients.

Because the distribution of concentrations is skewed and contains tied values, a non-parametric test is used. The measured concentrations from both groups were combined and ranked, as shown in the table.

Using the Mann–Whitney U test, assess whether the concentrations from group A and group B can reasonably be considered to originate from the same underlying population (meaning the drug did not alter the metabolite levels).

Analytical Separation Science by B.W.J. Pirok and P.J. Schoenmakers
READ SECTION 9.5.4

Levene’s and Brown-Forsythe Test

3. Comparison of variances #

When comparing the variability of two or more groups, classical tests such as the F-test or Bartlett’s test rely on the assumption that the data are normally distributed. In practice, however, deviations from normality and the presence of outliers can strongly affect these tests, making robust alternatives preferable.

3.1. Levene's test #

Levene’s test was developed as a robust method for comparing variances that is less sensitive to non-normality and outliers [4]. Instead of working directly with the original data, the test first transforms each observation into its absolute deviation from the group mean. These transformed values are then analyzed using a one-way ANOVA framework. The resulting test statistic approximately follows an F-distribution and is compared to a critical F value.

EQUATION 9.62

W_{\text{obs}} = \frac{n_{\text{tot}} - k}{k - 1} \cdot \frac{\sum_{j=1}^{k} n_j \, (\bar{z}_j - \bar{z})^2} {\sum_{j=1}^{k} \sum_{i=1}^{n_j} (z_{i,j} - \bar{z}_j)^2}

Here, z_{i,j} = \lvert x_{i,j} - \bar{x}_j \rvert, \bar{z}_j denotes the mean of the z values within group j, and \bar{z} is the overall mean across all groups. Furthermore, n_{\text{tot}} represents the total number of observations, and k is the number of groups being compared.
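Equation 9.62 translates directly into code. In the illustrative Python sketch below (the helper levene_W and the example data are our own), the argument center selects the classical mean-based transform; passing a median function instead gives the Brown-Forsythe variant discussed later in this section:

```python
from statistics import mean, median

def levene_W(groups, center=mean):
    """W_obs of Equation 9.62. center=mean gives Levene's test;
    center=median gives the Brown-Forsythe variant."""
    z = [[abs(v - center(g)) for v in g] for g in groups]
    k = len(z)
    n_tot = sum(len(g) for g in z)
    zbar_j = [mean(g) for g in z]                      # group means of z
    zbar = sum(sum(g) for g in z) / n_tot              # grand mean of z
    between = sum(len(g) * (zj - zbar) ** 2 for g, zj in zip(z, zbar_j))
    within = sum(sum((v - zj) ** 2 for v in g) for g, zj in zip(z, zbar_j))
    return (n_tot - k) / (k - 1) * between / within

# Made-up example: the second group is visibly more spread out,
# so W_obs is large (about 11.8, far above typical F critical values)
groups = [[5.1, 5.3, 5.0, 5.2, 5.4],
          [4.2, 5.8, 4.5, 6.0, 5.0]]
```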

Quadratic Levene Alternative

A quadratic variant of Levene’s test replaces the absolute deviations z_{i,j} = \lvert x_{i,j} - \bar{x}_j \rvert with squared deviations, z_{i,j}^2 = (x_{i,j} - \bar{x}_j)^2. This modification can increase statistical power for certain types of data. However, this approach still relies on the group mean \bar{x}_j, which is not a robust statistic, so it remains sensitive to outliers.

3.2. Brown-Forsythe test #

The Brown–Forsythe test improves robustness further by computing deviations from the group median rather than the mean [5]. This makes the test particularly suitable when data contain outliers or are strongly skewed, while still testing whether group variances can be considered equal.

In the following example code, x is a matrix in which each column represents one group of data. Alternatively, x can be a vector with the new variable group representing a vector of group numbers for each value in x.

				
					% Levene Absolute
p         = vartestn(x,'TestType','LeveneAbsolute'); 

% Levene Quadratic
p         = vartestn(x,'TestType','LeveneQuadratic'); 

% Brown-Forsythe
p         = vartestn(x,'TestType','BrownForsythe'); 

% Extended Configuration
[p,stats] = vartestn(x,group,'TestType','LeveneQuadratic'); 
				
			
				
using Statistics, DataFrames, Distributions

# Median-based (Brown-Forsythe) variant of Levene's test
function levene_test(x, group)
    df = DataFrame(x=x, g=group)
    glevels = unique(group)
    # compute absolute deviations from the group median
    df[:, :z] = [abs(val - median(df[df.g .== g, :x])) for (val, g) in zip(df.x, df.g)]
    # perform a one-way ANOVA on the z values
    group_means = [mean(df[df.g .== g, :z]) for g in glevels]
    grand_mean  = mean(df.z)
    n = length(x)
    k = length(glevels)
    
    # Between-group sum of squares (weighted by group size)
    ss_between = sum(length(df[df.g .== g, :z]) * (m - grand_mean)^2 for (g, m) in zip(glevels, group_means))
    
    # Within-group sum of squares
    ss_within = sum(sum((df[df.g .== g, :z] .- mean(df[df.g .== g, :z])).^2) for g in glevels)
    
    df_between = k - 1
    df_within  = n - k
    
    F = (ss_between / df_between) / (ss_within / df_within)
    p = 1 - cdf(FDist(df_between, df_within), F)
    
    return F, p
end

# Example
x           = [5.1, 5.3, 5.0, 5.2, 5.4, 5.1, 5.3, 4.9, 5.0, 5.2]
group       = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]

F, p        = levene_test(x, group)
println("F = ", F, ", p = ", p)
				
			
Analytical Separation Science by B.W.J. Pirok and P.J. Schoenmakers
READ SECTION 9.5.5

Robust Analysis of Variance

4. Analysis of Variance #

4.1. Kruskal-Wallis test #

The non-parametric counterpart to one-way ANOVA (see Lesson 8) is closely related to Levene’s test. The Kruskal–Wallis test, named after William Kruskal and Wilson Allen Wallis, extends Wilcoxon’s rank-sum test to more than two groups. Although it is often introduced as a test of equal group medians (with the alternative that at least one group differs), the test in fact evaluates whether the underlying probability distributions of the groups are comparable, in the same spirit as the Mann–Whitney U test. The steps are:

  1. Rank the data from all combined groups
  2. Compute the test statistic H_{\text{obs}}, according to
EQUATION 9.63

H_{\text{obs}} = \frac{12}{n_{\text{tot}}(n_{\text{tot}}+1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3(n_{\text{tot}}+1)

Here, k denotes the number of groups, n_{\text{tot}} the total number of observations, n_j the number of datapoints in group j, and R_j the sum of the ranks in that group, given by R_j = \sum_{i=1}^{n_j} r_{i,j}, where r_{i,j} is the rank of observation i in group j. Unlike one-way ANOVA, the Kruskal–Wallis test statistic H_{\text{obs}} does not follow an F-distribution; for sufficiently large group sizes (n_j>5), it approximately follows a \chi^2 distribution.

 

  3. Compare H_{\text{obs}} with H_{\text{crit}} at k − 1 degrees of freedom in a one-tailed \chi^2 test.
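The computation of H_{\text{obs}} follows directly from Equation 9.63. The Python sketch below is illustrative (the helper kruskal_H is our own); it applies the equation to the peptide data of Table 4:

```python
def kruskal_H(groups):
    """H_obs of Equation 9.63, computed from group rank sums.
    Tied values receive average ranks (no tie correction applied)."""
    combined = sorted(v for g in groups for v in g)
    def avg_rank(v):
        first = combined.index(v) + 1
        return first + (combined.count(v) - 1) / 2
    n_tot = len(combined)
    R = [sum(avg_rank(v) for v in g) for g in groups]
    return (12 / (n_tot * (n_tot + 1))
            * sum(r ** 2 / len(g) for r, g in zip(R, groups))
            - 3 * (n_tot + 1))

# Peptide data of Table 4, one list per LC-MS method
g1 = [3.18, 3.18, 2.96, 3.13, 2.96, 3.01]
g2 = [3.47, 3.44, 3.41, 3.58, 3.63, 3.70]
g3 = [3.34, 3.06, 3.02, 3.04, 2.83, 3.06]
# kruskal_H([g1, g2, g3]) gives H_obs of about 11.37, above the
# chi-squared critical value 5.99 (alpha = 0.05, k - 1 = 2 df)
```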
Table 4. Example original data in a parametric ANOVA format (left) of peptide concentrations determined using three different LC-MS methods. The data are translated into ranks for a non-parametric Kruskal-Wallis test (right). The data correspond to the first three columns used for Table 4 of Lesson 8 and are identical to Table 9.10 of the book.
            ANOVA              |      Kruskal-Wallis
Method:  1      2      3       |     1      2      3
         3.18   3.47   3.34    |     10.5   15     12
         3.18   3.44   3.06    |     10.5   14     7.5
         2.96   3.41   3.02    |     2.5    13     5
         3.13   3.58   3.04    |     9      16     6
         2.96   3.63   2.83    |     2.5    17     1
         3.01   3.70   3.06    |     4      18     7.5

Table 4 revisits the data from Lesson 8 and Figure 9.23 in the book, and shows how the actual values are translated into ranks. The ANOVA conducted on these ranks according to the Kruskal-Wallis procedure yields the typical ANOVA table, as shown in Figure 2.

For the Kruskal-Wallis test, the function can be used in two ways. The simplest form provides the function with a matrix x, in which each column constitutes a group. Alternatively, x can be an array of datapoints, with the second input argument an array of group designations for each value in x.

				
					% Kruskal-Wallis Test
[p,anovatab,stats] = kruskalwallis(x,group); 

% Test Data
x =[3.18, 3.47, 3.34;
    3.18, 3.44, 3.06;
    2.96, 3.41, 3.02;
    3.13, 3.58, 3.04;
    2.96, 3.63, 2.83;
    3.01, 3.70, 3.06];
    
% Example Calculation
[p,anovatab,stats] = kruskalwallis(x);
				
			
				
using HypothesisTests

x_mat       = [3.18 3.47 3.34;
               3.18 3.44 3.06;
               2.96 3.41 3.02;
               3.13 3.58 3.04;
               2.96 3.63 2.83;
               3.01 3.70 3.06]

# Each column of the matrix is one group
group1      = x_mat[:, 1]
group2      = x_mat[:, 2]
group3      = x_mat[:, 3]

# Kruskal-Wallis Test
test        = KruskalWallisTest(group1, group2, group3)
p           = pvalue(test)
H           = test.H     # Kruskal-Wallis statistic
				
			
Figure 2. Example of a Kruskal-Wallis table. Note that the equation for H is mathematically identical to Equation 9.63. See also Sections 9.3.9 & 9.5.5, as well as Lesson 8.

Note that the Kruskal–Wallis statistic can be written either in terms of group-wise rank sums or, equivalently, as an ANOVA-style ratio of between-group to total variance computed on the ranks. This explains the difference between the equation shown in Figure 2, and Equation 9.63, although they can be rewritten into each other and thus are identical.

4.2. Friedman test #

The Friedman test is the non-parametric counterpart of a two-way ANOVA without interaction, and extends the logic of rank-based inference to blocked or repeated-measures designs. Instead of analysing the raw data, observations are ranked within each block, after which the test evaluates whether the treatment conditions systematically differ in their rank distributions.

As with other rank-based methods, the Friedman test sacrifices statistical power compared to a parametric two-way ANOVA when all assumptions are met, because information about absolute differences is discarded. However, this loss of metric information makes the test highly robust against non-normality, heteroscedasticity, and outliers, which are common in real experimental data.

For the Friedman test, the data are provided as a matrix x in which each column represents a treatment or condition (Factor 1) and each row represents a block or subject (Factor 2). The test evaluates whether the treatment conditions differ systematically after accounting for block-to-block variability by ranking the data within each row.

The variable reps specifies the number of repeated measurements for each treatment–block combination. When there is a single observation per combination, reps = 1.

				
% Friedman Test
[p,anovatab,stats] = friedman(x,reps); 

% Example Calculation: 3 blocks x 3 treatments with 2 replicates each.
% Replicates of a block are stacked in consecutive rows.
x                  = [5.1 5.4 5.0
                      5.2 5.5 5.1
                      4.9 5.2 4.8
                      5.0 5.3 4.9
                      5.0 5.3 4.9
                      5.1 5.4 5.0];
reps               = 2;
[p,anovatab,stats] = friedman(x,reps);
				
			
				
using Distributions

# Friedman test for an n-blocks-by-k-treatments matrix
# (one observation per cell; ties within a block are not handled here)
function friedman_test(data::AbstractMatrix)
    n, k = size(data)  # n = blocks (rows), k = treatments (columns)

    # 1) Rank within each row (ranks 1..k)
    ranks = zeros(n, k)
    for i in 1:n
        ranks[i, :] = invperm(sortperm(data[i, :]))
    end

    # 2) Sum the ranks of each treatment over all blocks
    R = vec(sum(ranks, dims=1))

    # 3) Friedman statistic
    T = 12.0 / (n * k * (k + 1)) * sum(R .^ 2) - 3.0 * n * (k + 1)

    # 4) p-value via the chi-squared distribution with k - 1 df
    p = 1 - cdf(Chisq(k - 1), T)

    return T, p
end

# Example: 3 blocks (rows) by 3 treatments (columns),
# using one observation per cell
x           = [5.1 5.4 5.0;
               4.9 5.2 4.8;
               5.0 5.3 4.9]

T, p        = friedman_test(x)

println("Friedman χ² = ", T, ", p = ", p)

				
			

Concluding remarks #

The Kruskal-Wallis test forms a natural conclusion to this lesson on robust statistics. At first glance, it may seem counter-intuitive that the test relies on ranks rather than the original data values, effectively discarding information about magnitude and scale. This loss of information indeed makes the Kruskal-Wallis test less statistically powerful than a classical one-way ANOVA when all parametric assumptions are satisfied.

However, it is precisely this loss of metric information that makes the test more robust. By focusing only on relative ordering, rank-based methods are far less sensitive to outliers, skewed distributions, and violations of normality. As a result, the Kruskal-Wallis test provides more reliable inference in many realistic, imperfect datasets where classical assumptions do not hold.

This lesson has highlighted the fundamental trade-off between power and robustness. Starting from parametric tests, moving through robust estimators, and ending with rank-based methods such as the Kruskal-Wallis test, we have seen how simplifying assumptions and reduced information can improve stability and reliability. Together, these methods form a complementary toolkit for choosing the most appropriate statistical approach based on the nature and quality of the data.

References #

[1] R. M. Conroy, Stata J. Promot. Commun. Stat. Stata, 2012, 12(2), 182–190, DOI: 10.1177/1536867X1201200202.

[2] G. W. Divine, H. J. Norton, A. E. Barón and E. Juarez-Colunga, The American Statistician, 2018, 72, 278–286, DOI: 10.1080/00031305.2017.1305291.

[3] A. Hart, BMJ, 2001, 323, 391–393, DOI: 10.1136/bmj.323.7309.391.

[4] H. Levene, in Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling, Stanford University Press, Stanford, 1960, pp. 278–292.

[5] M. B. Brown and A. B. Forsythe, J. Am. Stat. Assoc., 1974, 69, 364–367, DOI: 10.1080/01621459.1974.10482955.
