
10. Robust Statistics

Updated on February 9, 2026
In the lessons thus far we have worked with statistics that assumed several criteria to be fulfilled: data had to be normally distributed, be free of outliers, and so on. However, what happens when these criteria are not met? In this lesson we will learn about robust alternative methods. We will see that the simplicity that makes these methods so robust also comes at a cost.

Learning Goals #

  • Recognize when classical parametric tests fail, and explain how violations of normality and homoscedasticity affect statistical inference.
  • Describe the principles of robust and rank-based methods, including why and how they reduce sensitivity to outliers and non-ideal data.
  • Compute and interpret non-parametric tests, in particular the Wilcoxon/Mann-Whitney and Kruskal-Wallis tests, in terms of distributional differences between groups.
  • Make informed choices between parametric, robust, and non-parametric tests, understanding the trade-off between statistical power and robustness.
Analytical Separation Science by B.W.J. Pirok and P.J. Schoenmakers
READ SECTIONS 9.5.1-9.5.2

Introduction to robust methods

1. What is a robust method? #

Robust or non-parametric methods operate by abstracting the data to some degree to distil an answer. This is best understood by revisiting the example from Lesson 2:

				
					x = [1, 2, 3, 4, 5, 6, 42];
				
			

x = [1; 2; 3; 4; 5; 6; 42]

Here, the mean (9) was strongly affected by the presence of an outlier (42), yet the median (4) was not. The median completely ignores outliers and simply selects the middle value of the sorted vector. For the median it does not matter how extreme the outliers are: it is robust against their presence. Key characteristics of such non-parametric methods are shown in Figure 1.

Figure 1. Comparison between parametric and non-parametric (robust) tests.
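The numerical contrast is easy to verify. The short sketch below uses Python purely as a language-neutral illustration alongside the lesson's MATLAB and Julia snippets:

```python
from statistics import mean, median

x = [1, 2, 3, 4, 5, 6, 42]
print(mean(x))    # 9  -- pulled upward by the outlier
print(median(x))  # 4  -- the middle value, unaffected by it
```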

Is there then no disadvantage to the median? Well, the median ignores the actual values of all other observations; it simply selects the middle one. You could therefore say that it is less representative of the complete dataset. We will see in this lesson that, owing to this abstract treatment of the data, robust methods have significantly less statistical power. Table 1 shows an overview of parametric and non-parametric methods.

Table 1. Overview of parametric and non-parametric tests.
Objective                         | Parametric                           | Non-parametric
Describing a sample               | Mean, standard deviation             | Median, IQR
Comparing mean with value         | t-test                               | Sign test, Wilcoxon signed-rank test
Comparing two means (matched)     | t-test                               | Sign test, Wilcoxon signed-rank test
Comparing two means (non-matched) | t-test                               | Mann-Whitney U test
Comparing variances               | F-test (k=2), Bartlett test (k\ge2)  | Levene's test (k\ge2)
Comparing several means           | ANOVA                                | Kruskal-Wallis test
Measuring variables relationship  | Pearson correlation                  | Spearman correlation

2. Comparison of mean with reference #

2.1. Sign test #

In Lesson 4, we learned about the comparison of a sample mean with a reference as our first hypothesis test. The first non-parametric alternative is the sign test.

CASE: RETENTION TIME STABILITY

According to a protocol the median retention time of an analyte should be \tilde{x}_0 = 3.5 min. The repeated measurements shown below are obtained. Is the median of the population of data 3.5? Note that the hypotheses are H_0: \tilde{x}=\tilde{x}_0 and H_1: \tilde{x}\neq\tilde{x}_0.

				
					x = [3.3, 3.2, 3.4, 4.7, 3.7, 3.3, 3.4, 3.2, 3.5];
				
			

The sign test exclusively considers the direction of the deviations, i.e. it literally considers the signs. The procedure is as follows:

  1. Subtract the reference from each value in the dataset.
  2. Count the number of positive and the number of negative differences (differences of exactly zero are discarded).
  3. Take the smaller of the two counts; this is the observed statistic r_{\text{obs}}.
  4. Compare r_{\text{obs}} with a tabulated r_{\text{crit}} value.
EXERCISE 1: RETENTION STABILITY

Considering the case above, what is the outcome of your hypothesis test?

The sign test can also be used for a matched-pairs comparison of two sample means: first subtract the two datasets from each other, then apply the procedure above with zero as the reference.
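The counting procedure can also be written out by hand. The Python sketch below is an illustrative helper (the function name sign_test is our own, not part of the lesson's MATLAB/Julia material); it implements the four steps and computes a two-sided binomial p-value:

```python
from math import comb

def sign_test(x, reference):
    """Two-sided sign test of H0: the population median equals reference.
    Values exactly equal to the reference are discarded."""
    diffs = [xi - reference for xi in x if xi != reference]
    n = len(diffs)
    n_pos = sum(d > 0 for d in diffs)
    r_obs = min(n_pos, n - n_pos)
    # Under H0 the sign counts follow Binomial(n, 0.5); the two-sided
    # p-value doubles the tail probability of the smaller count
    p = min(1.0, 2 * sum(comb(n, k) for k in range(r_obs + 1)) / 2**n)
    return r_obs, p

r, p = sign_test([4.6, 4.7, 4.8, 4.9, 5.1], 5.0)
# r = 1 (one positive sign out of five), p = 0.375
```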

The code shown below conducts the sign test. If desired, variables such as alpha (\alpha) and the tails of the test, tail, can be further specified. The variable tail can be either ‘left’, ‘right’, or ‘both’.

				
					% Sign Test
p = signtest(x,reference);

% Extended Configuration
p = signtest(x,reference,'Alpha',alpha,'Tail',tail);

% Example
x           = [4.6 4.7 4.8 4.9 5.1];
reference   = 5.0;
p           = signtest(x,reference,'Alpha',0.01,'Tail','left');
				
			
				
					using HypothesisTests

x           = [4.6, 4.7, 4.8, 4.9, 5.1]
reference   = 5.0

# This is a two-sided test
test        = SignTest(x, reference)
p           = pvalue(test)

println("p-value = ", p)
				
			

The sign test reduces the content of the data to the mere signs of the deviations from the reference. It is therefore not surprising that its statistical power is lower than that of a parametric t-test.

2.2. Wilcoxon signed-rank test #

The Wilcoxon signed-rank test is a non-parametric alternative to the paired t-test comparison of two means. It is used when two paired measurements are compared but the assumptions of normality are questionable.

Unlike the sign test, it considers both the direction and the magnitude of the differences by ranking their absolute values.

EXERCISE 2: ROBUST RANKING?

Why does ranking the differences make the Wilcoxon signed-rank test more robust than the paired t-test?

The procedure for the Wilcoxon’s signed-rank test is:

  1. Compute the differences d between paired observations.
  2. Assign a sign (+/–) to each difference.
  3. Rank the absolute values of d, ignoring the signs (ties receive average ranks).
  4. Sum the ranks for positive differences (T^{+}) and negative differences (T^{-}).
  5. The test statistic T_{\text{obs}} is the smaller of T^{+} and T^{-}.
  6. Compare this value to the critical value T_{\text{critical}} for the Wilcoxon signed-rank test at the chosen significance level, or compute the corresponding p-value.
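The ranking steps above can be written out explicitly. The following Python sketch is illustrative only (signed_rank_T and the paired data are made up for this example); it computes T^{+}, T^{-}, and T_{\text{obs}} with average ranks for ties:

```python
def signed_rank_T(x1, x2):
    """Observed Wilcoxon signed-rank statistic T_obs for paired samples.
    Zero differences are dropped; tied |d| values receive average ranks."""
    d = [round(a - b, 9) for a, b in zip(x1, x2)]  # round to tame float noise
    d = [v for v in d if v != 0]
    abs_sorted = sorted(abs(v) for v in d)
    def avg_rank(v):
        first = abs_sorted.index(v) + 1            # rank of first occurrence
        return first + (abs_sorted.count(v) - 1) / 2
    t_plus  = sum(avg_rank(abs(v)) for v in d if v > 0)
    t_minus = sum(avg_rank(abs(v)) for v in d if v < 0)
    return min(t_plus, t_minus)

# Hypothetical paired measurements (made up for illustration)
before = [12.1, 11.8, 12.5, 12.0, 11.6]
after  = [11.9, 12.0, 12.1, 11.7, 11.5]
# signed_rank_T(before, after) returns T_obs = 2.5
```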
CASE: PEAK AREA COMPARISON ON TWO SYSTEMS

The next example compares the peak areas of a compound of interest which was measured for 9 different objects in a sample on two different systems. See Sections 9.3.4.2 & 9.5.3 for more details.

The null hypothesis is that the median difference equals zero, meaning that there is no systematic increase or decrease between the paired measurements. Because the test relies on ranks rather than absolute values, it is robust against outliers and insensitive to skewed distributions. If T_{\text{obs}} > T_{\text{crit}}, then H_0 is retained.

The code shown below conducts the Wilcoxon signed-rank test on a vector of differences (x_diff), or on the differences computed from the original paired values in x1 and x2. If desired, variables such as alpha (\alpha) and the tails of the test, tail, can be further specified. The variable tail can be either ‘left’, ‘right’, or ‘both’.

				
					% Wilcoxon Signed-Rank Test: Directly On Differences
p = signrank(x_diff);

% Wilcoxon Signed-Rank Test: On Two Paired Samples
p = signrank(x1,x2);

% Extended Configuration
p = signrank(x1,x2,'alpha',alpha,'tail',tail);

% Example
x1          = [5.1 5.3 5.0 5.2 5.4 5.1 5.3];   % Before
x2          = [5.0 5.4 4.9 5.3 5.3 5.2 5.2];   % After
p           = signrank(x1,x2,'alpha',0.01,'tail','left');
				
			
				
					using HypothesisTests

# Wilcoxon Signed-Rank Test: Directly On Differences
test        = SignedRankTest(x_diff)
p           = pvalue(test)

# Wilcoxon Signed-Rank Test: On Two Paired Samples
test        = SignedRankTest(x1,x2)
p           = pvalue(test)

# Example
x1          = [5.1, 5.3, 5.0, 5.2, 5.4, 5.1, 5.3]
x2          = [5.0, 5.4, 4.9, 5.3, 5.3, 5.2, 5.2]

x_diff      = x1 .- x2
test        = SignedRankTest(x_diff)
p           = pvalue(test)
				
			
Table 2. Overview of peak areas for compound of interest of nine objects measured on two systems (x_1 and x_2).
Object | System 1 | System 2 | Difference | Sign | Rank
1      | 27185    | 25904    | 1281       | +    | 5
2      | 20068    | 21207    | -1139      | -    | 4
3      | 32593    | 35582    | -2989      | -    | 7
4      | 176438   | 172343   | 4095       | +    | 9
5      | 29319    | 28466    | 853        | +    | 3
6      | 21634    | 18502    | 3132       | +    | 8
7      | 3387     | 3273     | 114        | +    | 1.5
8      | 28181    | 29889    | -1708      | -    | 6
9      | 3466     | 3352     | 114        | +    | 1.5
EXERCISE 3: PEAK AREAS

In a system comparison study, peak areas of all objects within the same sample are measured on two different systems. Because the variation effects may be non-Gaussian and uneven across the sample, a Wilcoxon signed-rank test provides a robust alternative to the paired t-test. Use the procedure above to conduct the test using the values of Table 2. See Section 9.5.2 for a detailed analysis of this question. As a result of the test, you find that your observed T-value is (round to one decimal)

As a result of this observed statistic, you conclude that the null hypothesis is

The Wilcoxon signed-rank test is more powerful than the sign test, but it relies on the assumption that the distribution of differences is symmetric. This assumption can be assessed by comparison with a symmetric reference distribution or by other approaches, which are beyond the scope of this course.

Analytical Separation Science by B.W.J. Pirok and P.J. Schoenmakers
READ SECTION 9.5.3

Unpaired Mann-Whitney U test

2.3. Mann-Whitney U test #

When comparing two independent samples and the assumptions of the parametric t-test are not met, the Mann–Whitney U test provides a non-parametric alternative. It serves the same role as the unpaired t-test, but operates on ranks rather than raw values, making it more robust to non-normality and outliers.
Wilcoxon's Rank-Sum

The Mann–Whitney U test is also known as Wilcoxon’s rank-sum test. Despite the similarity in name, it should not be confused with the Wilcoxon signed-rank test, which is used for paired data. Here, the samples are independent and originate from different objects.

The procedure to perform the Mann–Whitney U test is:

  1. All observations from both samples (or groups) are first combined and ranked, assigning average ranks to tied values.
  2. The ranks are then summed separately for each group. This yields R_{\text{A}} and R_{\text{B}} in Equation 9.61.
  3. The test statistics U_{\text{A}} and U_{\text{B}} are calculated from the sample sizes and rank sums.
EQUATION 9.61

U_A = n_A n_B + \frac{n_A (n_A + 1)}{2} - R_A

U_B = n_A n_B + \frac{n_B (n_B + 1)}{2} - R_B

  4. The smaller of these two values is taken as the observed statistic U_{\text{obs}}.
  5. U_{\text{obs}} is compared to the tabulated critical value for the chosen significance level. If U_{\text{obs}} exceeds the critical value, the null hypothesis is retained; otherwise, it is rejected.
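The procedure translates into a few lines of code. The Python sketch below is illustrative (the helper mann_whitney_U is our own); it ranks the combined samples, forms the rank sums, and applies Equation 9.61:

```python
def mann_whitney_U(xa, xb):
    """Observed U statistic from the rank sums of Equation 9.61.
    Tied values receive average ranks."""
    combined = sorted(list(xa) + list(xb))
    def avg_rank(v):
        first = combined.index(v) + 1
        return first + (combined.count(v) - 1) / 2
    Ra = sum(avg_rank(v) for v in xa)
    Rb = sum(avg_rank(v) for v in xb)
    na, nb = len(xa), len(xb)
    Ua = na * nb + na * (na + 1) / 2 - Ra
    Ub = na * nb + nb * (nb + 1) / 2 - Rb
    return min(Ua, Ub)

# Same two samples as the MATLAB example in this section
x1 = [9.8, 10.1, 10.5, 10.7, 11.0, 11.2, 11.4]
x2 = [9.9, 10.2, 10.4, 10.8, 10.9, 11.3, 11.5]
# mann_whitney_U(x1, x2) returns U_obs = 23.0
```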
Table 3. Example of ranked data for the Mann-Whitney U unpaired comparison of two samples.
Sample | Value | Rank
A      | 25.9  | 1
B      | 26.6  | 2
B      | 26.7  | 3
A      | 26.8  | 4
B      | 27.0  | 5
A      | 27.2  | 6
A      | 27.3  | 7
A      | 27.4  | 8
A      | 27.7  | 9
A      | 27.8  | 10.5
B      | 27.8  | 10.5
B      | 27.9  | 12
A      | 28.1  | 13
A      | 28.5  | 14
B      | 29.3  | 15
B      | 29.4  | 16
B      | 30.3  | 17

The test evaluates whether two samples are likely to originate from the same population, without requiring assumptions about the underlying distribution.

In the following example code, x1 and x2 are the vectors containing all the data pertaining to their respective groups. The variable tail can be either ‘left’, ‘right’, or ‘both’.

				
					% Mann-Whitney U Test
[p,h,stats] = ranksum(x1,x2);

% Extended Configuration
[p,h,stats] = ranksum(x1,x2,'alpha',alpha,'tail',tail);

% Example
x1          = [9.8 10.1 10.5 10.7 11.0 11.2 11.4];   % Sample One
x2          = [9.9 10.2 10.4 10.8 10.9 11.3 11.5];   % Sample Two
[p,h,stats] = ranksum(x1,x2,'Tail','right');
				
			
				
using HypothesisTests

# Mann-Whitney U Test
test        = MannWhitneyUTest(x1,x2)

# Example
x1          = [9.8, 10.1, 10.5, 10.7, 11.0, 11.2, 11.4]
x2          = [9.9, 10.2, 10.4, 10.8, 10.9, 11.3, 11.5]
test        = MannWhitneyUTest(x1,x2)
p           = pvalue(test)
				
			

For small sample sizes, the Mann–Whitney U test is limited by the availability of tabulated critical values. Importantly, the test does not strictly assess equality of medians, but rather whether two samples originate from the same underlying probability distribution [1,2]. Differences in data spread and distribution shape therefore play an important role, and ideally the two samples should have similarly shaped distributions [3].

EXERCISE 4: METABOLITES

The metabolite level of two groups of patients was analysed using LC-MS, and the concentrations of a target compound were obtained. The results are reported in Table 3. Group A received a drug which is believed to alter the metabolite level, whereas group B received a placebo. Each group consists of different patients.

Because the distribution of concentrations is skewed and contains tied values, a non-parametric test is used. The measured concentrations from both groups were combined and ranked, as shown in the table.

Using the Mann–Whitney U test, assess whether the concentrations from group A and group B can reasonably be considered to originate from the same underlying population (meaning the drug did not alter the metabolite levels).

Analytical Separation Science by B.W.J. Pirok and P.J. Schoenmakers
READ SECTION 9.5.4

Levene’s and Brown-Forsythe Test

3. Comparison of variances #

When comparing the variability of two or more groups, classical tests such as the F-test or Bartlett’s test rely on the assumption that the data are normally distributed. In practice, however, deviations from normality and the presence of outliers can strongly affect these tests, making robust alternatives preferable.

3.1. Levene's test #

Levene’s test was developed as a robust method for comparing variances that is less sensitive to non-normality and outliers [4]. Instead of working directly with the original data, the test first transforms each observation into its absolute deviation from the group mean. These transformed values are then analyzed using a one-way ANOVA framework. The resulting test statistic approximately follows an F-distribution and is compared to a critical F value.

EQUATION 9.62

W_{\text{obs}} = \frac{n_{\text{tot}} - k}{k - 1} \cdot \frac{\sum_{j=1}^{k} n_j \, (\bar{z}_j - \bar{z})^2} {\sum_{j=1}^{k} \sum_{i=1}^{n_j} (z_{i,j} - \bar{z}_j)^2}

Here, z_{i,j} = \lvert x_{i,j} - \bar{x}_j \rvert, \bar{z}_j denotes the mean of the z values within group j, and \bar{z} is the overall mean across all groups. Furthermore, n_{\text{tot}} represents the total number of observations, and k is the number of groups being compared.
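Equation 9.62 translates directly into code. In the illustrative Python sketch below (the helper levene_W and the example data are our own), the argument center selects the classical mean-based transform; passing a median function instead gives the Brown-Forsythe variant discussed later in this section:

```python
from statistics import mean, median

def levene_W(groups, center=mean):
    """W_obs of Equation 9.62. center=mean gives Levene's test;
    center=median gives the Brown-Forsythe variant."""
    z = [[abs(v - center(g)) for v in g] for g in groups]
    k = len(z)
    n_tot = sum(len(g) for g in z)
    zbar_j = [mean(g) for g in z]                      # group means of z
    zbar = sum(sum(g) for g in z) / n_tot              # grand mean of z
    between = sum(len(g) * (zj - zbar) ** 2 for g, zj in zip(z, zbar_j))
    within = sum(sum((v - zj) ** 2 for v in g) for g, zj in zip(z, zbar_j))
    return (n_tot - k) / (k - 1) * between / within

# Made-up example: the second group is visibly more spread out,
# so W_obs is large (about 11.8, far above typical F critical values)
groups = [[5.1, 5.3, 5.0, 5.2, 5.4],
          [4.2, 5.8, 4.5, 6.0, 5.0]]
```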

Quadratic Levene Alternative

A quadratic variant of Levene’s test replaces the absolute deviations z_{i,j} = \lvert x_{i,j} - \bar{x}_j \rvert with squared deviations, z_{i,j}^2 = (x_{i,j} - \bar{x}_j)^2. This modification can increase statistical power for certain types of data. However, this approach still relies on the group mean \bar{x}_j, which is not a robust statistic, so it remains sensitive to outliers.

3.2. Brown-Forsythe test #

The Brown–Forsythe test improves robustness further by computing deviations from the group median rather than the mean [5]. This makes the test particularly suitable when data contain outliers or are strongly skewed, while still testing whether group variances can be considered equal.

In the following example code, x is a matrix in which each column represents one group of data. Alternatively, x can be a vector with the new variable group representing a vector of group numbers for each value in x.

				
					% Levene Absolute
p         = vartestn(x,'TestType','LeveneAbsolute'); 

% Levene Quadratic
p         = vartestn(x,'TestType','LeveneQuadratic'); 

% Brown-Forsythe
p         = vartestn(x,'TestType','BrownForsythe'); 

% Extended Configuration
[p,stats] = vartestn(x,group,'TestType','LeveneQuadratic'); 
				
			
				
using Statistics, DataFrames, Distributions

# Median-based (Brown-Forsythe) variant of Levene's test
function levene_test(x, group)
    df = DataFrame(x=x, g=group)
    glevels = unique(group)
    # compute absolute deviations from the group median
    df[:, :z] = [abs(val - median(df[df.g .== g, :x])) for (val, g) in zip(df.x, df.g)]
    # perform a one-way ANOVA on the z values
    group_means = [mean(df[df.g .== g, :z]) for g in glevels]
    grand_mean  = mean(df.z)
    n = length(x)
    k = length(glevels)
    
    # Between-group sum of squares (weighted by group size)
    ss_between = sum(length(df[df.g .== g, :z]) * (m - grand_mean)^2 for (g, m) in zip(glevels, group_means))
    
    # Within-group sum of squares
    ss_within = sum(sum((df[df.g .== g, :z] .- mean(df[df.g .== g, :z])).^2) for g in glevels)
    
    df_between = k - 1
    df_within  = n - k
    
    F = (ss_between / df_between) / (ss_within / df_within)
    p = 1 - cdf(FDist(df_between, df_within), F)
    
    return F, p
end

# Example
x           = [5.1, 5.3, 5.0, 5.2, 5.4, 5.1, 5.3, 4.9, 5.0, 5.2]
group       = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]

F, p        = levene_test(x, group)
println("F = ", F, ", p = ", p)
				
			
Analytical Separation Science by B.W.J. Pirok and P.J. Schoenmakers
READ SECTION 9.5.5

Robust Analysis of Variance

4. Analysis of Variance #

4.1. Kruskal-Wallis test #

The non-parametric counterpart to one-way ANOVA (see Lesson 8) is closely related to Levene’s test. The Kruskal–Wallis test, named after William Kruskal and Wilson Allen Wallis, extends Wilcoxon’s rank-sum test to more than two groups. Although it is often introduced as a test of equal group medians (with the alternative that at least one group differs), the test in fact evaluates whether the underlying probability distributions of the groups are comparable, in the same spirit as the Mann–Whitney U test. The steps are:

  1. Rank the data from all combined groups
  2. Compute the test statistic H_{\text{obs}}, according to
EQUATION 9.63

H_{\text{obs}} = \frac{12}{n_{\text{tot}}(n_{\text{tot}}+1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3(n_{\text{tot}}+1)

Here, k denotes the number of groups, n_{\text{tot}} the total number of observations, n_j the number of datapoints in group j, and R_j the sum of the ranks in that group, given by R_j = \sum_{i=1}^{n_j} r_{i,j}, where r_{i,j} is the rank of observation i in group j. Unlike one-way ANOVA, the Kruskal–Wallis test statistic H_{\text{obs}} does not follow an F-distribution; for sufficiently large group sizes (n_j>5), it approximately follows a \chi^2 distribution.

 

  3. Compare H_{\text{obs}} with H_{\text{crit}} at k − 1 degrees of freedom in a one-tailed \chi^2 test.
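The computation of H_{\text{obs}} follows directly from Equation 9.63. The Python sketch below is illustrative (the helper kruskal_H is our own); it applies the equation to the peptide data of Table 4:

```python
def kruskal_H(groups):
    """H_obs of Equation 9.63, computed from group rank sums.
    Tied values receive average ranks (no tie correction applied)."""
    combined = sorted(v for g in groups for v in g)
    def avg_rank(v):
        first = combined.index(v) + 1
        return first + (combined.count(v) - 1) / 2
    n_tot = len(combined)
    R = [sum(avg_rank(v) for v in g) for g in groups]
    return (12 / (n_tot * (n_tot + 1))
            * sum(r ** 2 / len(g) for r, g in zip(R, groups))
            - 3 * (n_tot + 1))

# Peptide data of Table 4, one list per LC-MS method
g1 = [3.18, 3.18, 2.96, 3.13, 2.96, 3.01]
g2 = [3.47, 3.44, 3.41, 3.58, 3.63, 3.70]
g3 = [3.34, 3.06, 3.02, 3.04, 2.83, 3.06]
# kruskal_H([g1, g2, g3]) gives H_obs of about 11.37, above the
# chi-squared critical value 5.99 (alpha = 0.05, k - 1 = 2 df)
```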
Table 4. Example original data in a parametric ANOVA format (left) of peptide concentrations determined using three different LC-MS methods. The data are translated into ranks for a non-parametric Kruskal-Wallis test (right). The data correspond to the first three columns used for Table 4 of Lesson 8 and are identical to Table 9.10 of the book.
            ANOVA              |      Kruskal-Wallis
Method:  1      2      3       |     1      2      3
         3.18   3.47   3.34    |     10.5   15     12
         3.18   3.44   3.06    |     10.5   14     7.5
         2.96   3.41   3.02    |     2.5    13     5
         3.13   3.58   3.04    |     9      16     6
         2.96   3.63   2.83    |     2.5    17     1
         3.01   3.70   3.06    |     4      18     7.5

Table 4 revisits the data from Lesson 8 and Figure 9.23 in the book, and shows how the actual values are translated into ranks. The ANOVA conducted on these ranks according to the Kruskal-Wallis procedure yields the typical ANOVA table, as shown in Figure 2.

For the Kruskal-Wallis test, the function can be used in two ways. The simplest form provides the function with a matrix x, in which each column constitutes a group. Alternatively, x can be an array of datapoints, with the second input argument an array of group designations for each value in x.

				
					% Kruskal-Wallis Test
[p,anovatab,stats] = kruskalwallis(x,group); 

% Test Data
x =[3.18, 3.47, 3.34;
    3.18, 3.44, 3.06;
    2.96, 3.41, 3.02;
    3.13, 3.58, 3.04;
    2.96, 3.63, 2.83;
    3.01, 3.70, 3.06];
    
% Example Calculation
[p,anovatab,stats] = kruskalwallis(x);
				
			
				
using HypothesisTests

x_mat       = [3.18 3.47 3.34;
               3.18 3.44 3.06;
               2.96 3.41 3.02;
               3.13 3.58 3.04;
               2.96 3.63 2.83;
               3.01 3.70 3.06]

# Each column of the matrix is one group
group1      = x_mat[:, 1]
group2      = x_mat[:, 2]
group3      = x_mat[:, 3]

# Kruskal-Wallis Test
test        = KruskalWallisTest(group1, group2, group3)
p           = pvalue(test)
H           = test.H     # Kruskal-Wallis statistic
				
			
Figure 2. Example of a Kruskal-Wallis table. Note that the equation for H is mathematically identical to Equation 9.63. See also Sections 9.3.9 & 9.5.5, as well as Lesson 8.

Note that the Kruskal–Wallis statistic can be written either in terms of group-wise rank sums or, equivalently, as an ANOVA-style ratio of between-group to total variance computed on the ranks. This explains the difference between the equation shown in Figure 2, and Equation 9.63, although they can be rewritten into each other and thus are identical.

4.2. Friedman test #

The Friedman test is the non-parametric counterpart of a two-way ANOVA without interaction, and extends the logic of rank-based inference to blocked or repeated-measures designs. Instead of analysing the raw data, observations are ranked within each block, after which the test evaluates whether the treatment conditions systematically differ in their rank distributions.

As with other rank-based methods, the Friedman test sacrifices statistical power compared to a parametric two-way ANOVA when all assumptions are met, because information about absolute differences is discarded. However, this loss of metric information makes the test highly robust against non-normality, heteroscedasticity, and outliers, which are common in real experimental data.

For the Friedman test, the data are provided as a matrix x in which each column represents a treatment or condition (Factor 1) and each row represents a block or subject (Factor 2). The test evaluates whether the treatment conditions differ systematically after accounting for block-to-block variability by ranking the data within each row.

The variable reps specifies the number of repeated measurements for each treatment–block combination. When there is a single observation per combination, reps = 1.

				
% Friedman Test
[p,anovatab,stats] = friedman(x,reps); 

% Example Calculation: 3 blocks x 3 treatments with 2 replicates each.
% Replicates of a block are stacked in consecutive rows.
x                  = [5.1 5.4 5.0
                      5.2 5.5 5.1
                      4.9 5.2 4.8
                      5.0 5.3 4.9
                      5.0 5.3 4.9
                      5.1 5.4 5.0];
reps               = 2;
[p,anovatab,stats] = friedman(x,reps);
				
			
				
using Distributions

# Friedman test for an n-blocks-by-k-treatments matrix
# (one observation per cell; ties within a block are not handled here)
function friedman_test(data::AbstractMatrix)
    n, k = size(data)  # n = blocks (rows), k = treatments (columns)

    # 1) Rank within each row (ranks 1..k)
    ranks = zeros(n, k)
    for i in 1:n
        ranks[i, :] = invperm(sortperm(data[i, :]))
    end

    # 2) Sum the ranks of each treatment over all blocks
    R = vec(sum(ranks, dims=1))

    # 3) Friedman statistic
    T = 12.0 / (n * k * (k + 1)) * sum(R .^ 2) - 3.0 * n * (k + 1)

    # 4) p-value via the chi-squared distribution with k - 1 df
    p = 1 - cdf(Chisq(k - 1), T)

    return T, p
end

# Example: 3 blocks (rows) by 3 treatments (columns),
# using one observation per cell
x           = [5.1 5.4 5.0;
               4.9 5.2 4.8;
               5.0 5.3 4.9]

T, p        = friedman_test(x)

println("Friedman χ² = ", T, ", p = ", p)

				
			

Concluding remarks #

The Kruskal-Wallis test forms a natural conclusion to this lesson on robust statistics. At first glance, it may seem counter-intuitive that the test relies on ranks rather than the original data values, effectively discarding information about magnitude and scale. This loss of information indeed makes the Kruskal-Wallis test less statistically powerful than a classical one-way ANOVA when all parametric assumptions are satisfied.

However, it is precisely this loss of metric information that makes the test more robust. By focusing only on relative ordering, rank-based methods are far less sensitive to outliers, skewed distributions, and violations of normality. As a result, the Kruskal-Wallis test provides more reliable inference in many realistic, imperfect datasets where classical assumptions do not hold.

This lesson has highlighted the fundamental trade-off between power and robustness. Starting from parametric tests, moving through robust estimators, and ending with rank-based methods such as the Kruskal-Wallis test, we have seen how simplifying assumptions and reduced information can improve stability and reliability. Together, these methods form a complementary toolkit for choosing the most appropriate statistical approach based on the nature and quality of the data.

References #

[1] R. M. Conroy, Stata J. Promot. Commun. Stat. Stata, 2012, 12(2), 182–190, DOI: 10.1177/1536867X1201200202.

[2] G. W. Divine, H. J. Norton, A. E. Barón and E. Juarez-Colunga, The American Statistician, 2018, 72, 278–286, DOI: 10.1080/00031305.2017.1305291.

[3] A. Hart, BMJ, 2001, 323, 391–393, DOI: 10.1136/bmj.323.7309.391.

[4] H. Levene, in Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling, Stanford University Press, Stanford, 1960, pp. 278–292.

[5] M. B. Brown and A. B. Forsythe, J. Am. Stat. Assoc., 1974, 69, 364–367, DOI: 10.1080/01621459.1974.10482955.
