
05. Comparing Two Means

Updated on January 8, 2025
The previous lesson introduced hypothesis testing, using as an example the comparison of a sample mean with a reference value. It is also possible to compare two sample means using a t-test. This is the topic of this lesson. We will learn that the type of experiment not only heavily impacts the type of test, but also the reliability of the outcome.

Learning Goals #

  • Conduct the comparison of two sample means.
  • Understand the relevance of homogeneity of variances.
  • Distinguish between matched and non-matched pair designs, and appreciate their strengths and weaknesses.
Analytical Separation Science by B.W.J. Pirok and P.J. Schoenmakers
READ SECTIONS 9.3.4-9.3.4.1

NON-MATCHED PAIRS COMPARISON

1. Introduction #

We will now learn how to use a t-test as an inferential statistic to determine whether there is a statistically significant difference between the means of two samples. Unfortunately, the type of test used depends heavily on the type of data and experimental design. 

In a non-matched pair comparison of two sample means, the two samples arise from two different, independently obtained sets of objects. In a matched pair comparison, the same objects are measured before and after the studied effect.

In this lesson, we will learn about the differences, encounter examples, and dive into the strengths and weaknesses of each.

2. Non-matched pair comparisons #

2.1. Concept #

In a non-matched pair comparison we compare two sample means that were obtained from different samples of objects. 

Suppose we are investigating a river. A factory is situated next to the river and releases wastewater effluent into the river.  A regulatory organization is tasked to determine whether the factory is responsible for an increased concentration of a contaminant. The situation is depicted in Figure 1. 

Figure 1. In this hypothetical example, a regulatory body is investigating whether a factory is responsible for an increase in concentration of a contaminant. A river flows from upstream to downstream.

A scientist obtains a sample of six objects upstream (i.e. before the water reaches the factory) and a sample of six objects downstream. See also Table 1.

Table 1. Results from the LC measurements of the contaminant in the river near the factory.
Downstream   Upstream
0.09         0.10
0.03         0.08
0.08         0.06
0.06         0.11
0.04         0.08
0.05         0.10

Similar to Lesson 4, a non-matched pair comparison of two means can be either one-sided or two-sided. In this case, however, we are specifically interested in an increase in the concentration of the contaminant (the effect) caused by the factory. As it does not matter to us if the concentration is lower, it makes more sense to do a one-sided test.

Our hypotheses become:

H_0: \mu_{\text{down}}\leq\mu_{\text{up}} and H_1: \mu_{\text{down}}>\mu_{\text{up}} 

Or, in general terms:

H_0: \mu_{1}\leq\mu_{2} and H_1: \mu_{1}>\mu_{2} 

2.2. Test requirements #

The requirements for this t-test are the same as for the comparison of one sample mean with a reference (Lesson 4). There is, however, one additional requirement:

  1. Pertaining to continuous (interval) or ordinal variables.
  2. Normally distributed.
  3. Free of outliers.
  4. Independent: a representative, randomly drawn sample from the population.
  5. Variances of the two samples must be homogeneous.
Figure 2. Box and whisker representations of two samples with different variability. For the variances of two samples to be homogeneous, their variability across a range must be similar.
The last requirement refers to the variance homogeneity of the two measured samples. When this is not the case, the variability of the data differs between the two samples over the measured range, and the variances are said to be heteroscedastic.
Table 2. Overview of requirements and examples of statistical tests that can be used to evaluate whether requirements are met. These will be treated in Lesson 8 as they rely on concepts which haven’t all been treated yet.
Requirement            Test
Normality              Lilliefors' test
Outliers               Grubbs, Dixon, MAD
Variance homogeneity   Bartlett, Levene, Brown-Forsythe
Independence           Cannot be tested; property of the data
Data continuity        Cannot be tested; property of the data

Testing for variance homogeneity is the topic of Lesson 8, as it relies on concepts we have not treated yet. It is out of the scope of the current lesson, and we proceed under the assumption that the variances are homogeneous.

There is logic behind a decision to assume that variances are homogeneous in specific cases. We saw in Lesson 2 that the standard deviation is affected by the number of measurements. If we only have two measurements, we will be much more uncertain about the actual variability in the data than if we had had 30 measurements. It is for this reason that larger n-values allow s to represent \sigma. Inhomogeneity therefore becomes more likely if the sample sizes of the two samples differ (e.g. n_1 = 5, whereas n_2 = 8).

Another point in our favour is that – as separation scientists – we would use the same analytical method for both samples. From a separation science point of view, it is thus unlikely that the method would feature a different precision for the two samples. This would be a completely different story if one of the samples required, say, a different sample preparation procedure.

2.3. Equal Variance #

Assuming that the variances are homogeneous and taking the standard significance level of 5% (i.e. \alpha = 0.05), we can proceed with the computation of the test statistic. For a one-sided test, this is given by:

Equation 9.27: t_{\text{obs}}=\frac{{\bar{x_1}-\bar{x_2}}}{s_{\text{pool}} \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}

For the two-sided test, the numerator utilises the absolute difference |\bar{x_1}-\bar{x_2}|. Because the variances are homogeneous, the standard deviation is pooled using:

Equation 9.20: s_{\text{pool}}=\sqrt{\frac{\nu_1 s^{2}_1+\nu_2 s^{2}_2+\dots}{\nu_1+\nu_2+\dots}}

With t_{\text{obs}}, the remainder of the test is similar to the comparison of a sample mean with a reference (Lesson 4). The only exception is that the number of degrees of freedom now is calculated as \nu=n_1+n_2-2. Indeed, we lose one degree of freedom for each set.
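To make the computation concrete, the pooled test can be worked out numerically on the data of Table 1. The snippet below is a minimal sketch in Python rather than the MATLAB used elsewhere in this lesson, purely for illustration:

```python
import math

# Data from Table 1 (the two samples of six objects each)
x1 = [0.10, 0.08, 0.06, 0.11, 0.08, 0.10]
x2 = [0.09, 0.03, 0.08, 0.06, 0.04, 0.05]

def mean(x):
    return sum(x) / len(x)

def var(x):  # sample variance (n - 1 in the denominator)
    m = mean(x)
    return sum((v - m) ** 2 for v in x) / (len(x) - 1)

n1, n2 = len(x1), len(x2)
dof = n1 + n2 - 2  # nu = n1 + n2 - 2

# Equation 9.20: pooled standard deviation
s_pool = math.sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / dof)

# Equation 9.27: test statistic
t_obs = (mean(x1) - mean(x2)) / (s_pool * math.sqrt(1 / n1 + 1 / n2))

print(round(t_obs, 3), dof)  # t_obs ≈ 2.487 with 10 degrees of freedom
```

Comparing t_obs against the one-sided critical value for \nu = 10 then answers the question below.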

What is the outcome of the hypothesis test? Is the null hypothesis accepted if the significance level is 0.05?

2.4. Non-equal Variance #

Had the variances not been equal, we would have conducted the unequal-variance test, also known as Welch's test. Everything stays the same, except that the test statistic is computed as

Equation 9.28: t_{\text{obs}}=\frac{{\bar{x_1}-\bar{x_2}}}{\sqrt{\frac{s^{2}_1}{n_1}+\frac{s^{2}_2}{n_2}}}

and the number of degrees of freedom follows from the Welch–Satterthwaite equation:

Equation 9.29: \nu=\frac{\Big(\frac{s^{2}_1}{n_1}+\frac{s^{2}_2}{n_2}\Big)^{2}}{\frac{s^{4}_1}{n^{2}_1 (n_{1}-1)}+\frac{s^{4}_2}{n^{2}_2 (n_{2}-1)}}
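As a minimal Python sketch of Welch's test on the same Table 1 data (note that for equal sample sizes the test statistic coincides with the pooled version; only the degrees of freedom change, and they need not be an integer):

```python
import math

# Data from Table 1
x1 = [0.10, 0.08, 0.06, 0.11, 0.08, 0.10]
x2 = [0.09, 0.03, 0.08, 0.06, 0.04, 0.05]

def mean(x):
    return sum(x) / len(x)

def var(x):  # sample variance (n - 1 in the denominator)
    m = mean(x)
    return sum((v - m) ** 2 for v in x) / (len(x) - 1)

n1, n2 = len(x1), len(x2)
v1, v2 = var(x1), var(x2)

# Equation 9.28: Welch's test statistic
t_obs = (mean(x1) - mean(x2)) / math.sqrt(v1 / n1 + v2 / n2)

# Equation 9.29: Welch-Satterthwaite degrees of freedom
nu = (v1 / n1 + v2 / n2) ** 2 / (
    v1 ** 2 / (n1 ** 2 * (n1 - 1)) + v2 ** 2 / (n2 ** 2 * (n2 - 1))
)
```

Here \nu comes out just below the 10 degrees of freedom of the pooled test, slightly widening the critical region.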

Analytical Separation Science by B.W.J. Pirok and P.J. Schoenmakers
READ SECTION 9.3.4.2

MATCHED PAIRS COMPARISON

3. Matched Pairs (Paired Test) #

The non-matched pair tests are easy enough to use, but one problem is that the calculated variability comprises two components:

  1. Intrinsic measurement variance.
  2. Variance due to sampling two independent samples.

It is possible to make the test more powerful by eliminating the latter of the two through specifically designing a paired test.

To understand a paired design, imagine that we are comparing two capillary electrophoresis systems. As part of the method diagnostics, a scientist measures the same eight objects on both systems. The results are shown in Table 3.

Table 3. Peak areas of a compound of interest measured as a sample of eight objects on two capillary electrophoresis systems. The final column represents the row-based difference of the two datasets. Table is identical to Table 9.3.
Object   System 1   System 2   Difference
1        27185      25904      1281
2        20068      21207      -1139
3        32593      35582      -2989
4        176438     172343     4095
5        29319      28466      853
6        21634      18502      3132
7        3387       3273       114
8        28181      29889      -1708
The crucial component here is that the same eight objects are measured on both systems. This creates eight matched pairs of values and removes the variability due to sampling two independent samples, as is the case in a non-matched pair design.
With the same object measured on both systems, the results from the two systems can be subtracted from one another. This yields a column with the difference between both systems for each object. It is this data (i.e. the column with the differences) that is considered in the statistical test. The hypotheses are:

H_0: \mu_{\text{difference}}=0 and H_1: \mu_{\text{difference}}\neq0 

A careful examination of the hypotheses shows that, in essence, this portrays a comparison of a sample mean (\bar{x}_{\text{difference}}) with a reference value (0). Conveniently, the remainder of the test is therefore identical to the comparison of a sample mean with a reference value (Lesson 4).
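To make this concrete, here is a minimal Python sketch (again for illustration only) of the paired test on the difference column from the capillary electrophoresis example:

```python
import math

# Differences (System 1 - System 2) for the eight matched objects
d = [1281, -1139, -2989, 4095, 853, 3132, 114, -1708]

n = len(d)
d_mean = sum(d) / n
d_std = math.sqrt(sum((v - d_mean) ** 2 for v in d) / (n - 1))

# One-sample t-test of the mean difference against mu_0 = 0
t_obs = (d_mean - 0) / (d_std / math.sqrt(n))
dof = n - 1

print(round(t_obs, 3))  # ≈ 0.535, with nu = 7 degrees of freedom
```

This reproduces the numbers quoted in the exercise solution at the end of this lesson.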

4. Paired vs. unpaired designs #

The use of identical objects makes paired designs strongly preferable to independent designs. However, it is not always possible to set up an experiment as paired. For instance, it is difficult to imagine the factory case at the river as a paired experiment. After all, an object taken upstream can never co-exist as an object obtained downstream.

Figure 3. Two additional examples of matched and non-matched pair designs. In matched-pair designs, the same objects are measured before and after the effect.

Nevertheless, a huge component of the variability in the data is removed by opting for a paired design, and such designs are therefore preferable. Let's investigate this a bit further. In Table 4 we see the summary results of concentrations of a compound in water measured by RPLC before and after heating. The question is whether the exposure to heat has any significant effect on the measured concentration.

Table 4. Summary of measured concentrations of a compound in water using RPLC before and after heating of the samples. Table is identical to Table 9.4.
                 n    \bar{x}   s
Before heating   12   21.95     2.699
After heating    12   22.65     2.283
Difference       12   -0.81     0.604
This is further worked out in Section 9.3.4.2 of the book, but in a nutshell: the non-matched pair (two-sided) comparison yields a p-value of 0.5, which means that H_0 is accepted, suggesting that there is no difference. However, a matched-pair treatment of the same data yields a p-value of 0.0007, far below the \alpha of 0.05, and thus H_0 is now rejected!
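The two p-values can be traced back to very different test statistics. A minimal Python sketch from the summary statistics in Table 4 (computing the p-values themselves requires the t-distribution CDF, e.g. from SciPy; the values 0.5 and 0.0007 are quoted from the book):

```python
import math

# Summary statistics from Table 4
n = 12
mean_before, s_before = 21.95, 2.699
mean_after, s_after = 22.65, 2.283
mean_diff, s_diff = -0.81, 0.604

# Non-matched pair treatment: pooled two-sample statistic, nu = 22
s_pool = math.sqrt(((n - 1) * s_before**2 + (n - 1) * s_after**2) / (2 * n - 2))
t_unpaired = (mean_before - mean_after) / (s_pool * math.sqrt(2 / n))

# Matched pair treatment: one-sample statistic on the differences, nu = 11
t_paired = mean_diff / (s_diff / math.sqrt(n))

print(round(t_unpaired, 2), round(t_paired, 2))  # ≈ -0.69 vs ≈ -4.65
```

The paired statistic is almost seven times larger in magnitude, because the denominator no longer carries the sampling variability of two independent samples.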

The difference in outcome is a consequence of the superior statistical power of the matched-pair design. Only if we raised n to 100 would the non-matched pair design come to the same conclusion! This is disturbing news, because if we had blindly trusted the non-matched pair design we would have come to a different conclusion. The outcome is thus strongly affected by our experimental design. How could we have known this?

Concluding Remarks #

We have learned in this lesson that the t-test can also be used to compare two sample means. Two designs were treated: one with matched pairs of objects, and one where the samples were independent of each other. We furthermore noted that there is a good reason to prefer a matched-pair design: variability in the data due to sampling independent objects is removed in this design, which increases the statistical power and efficiency of the test.

However, we have also uncovered some bad news. We used both designs on the data in Table 4. The outcome of the non-matched pair design was that H_0 was accepted, whereas it was rejected in the matched pair design. This raises the question: which of the two is correct? How can we assess this?

It is due time for us to take a break from learning about further hypothesis tests. In the next lesson we will instead focus on assessing the statistical power of a test.

Exercise Solution #

EQUAL VARIANCE COMPARISON OF TWO SAMPLE MEANS #

See below for the solutions. Note that it is always very useful to plot the data before working on it. For example, the two samples can be plotted together in a box and whisker plot, which will help you gauge whether your calculations give sensible answers.

% Data
x1 = [0.10, 0.08, 0.06, 0.11, 0.08, 0.10];
x2 = [0.09, 0.03, 0.08, 0.06, 0.04, 0.05];
a = 0.05;

% Descriptive Statistics
x1_mean = mean(x1);
x1_std = std(x1);
x1_n = length(x1);
x2_mean = mean(x2);
x2_std = std(x2);
x2_n = length(x2);

% Calculations
dof = x1_n + x2_n - 2;
s_pool = sqrt((x1_std^2*(x1_n-1) + x2_std^2*(x2_n-1))/dof);
t_obs = (x1_mean - x2_mean)/(s_pool*sqrt(1/x1_n + 1/x2_n));

% Step V: Critical Value Approach (one-sided, H1: mu_1 > mu_2)
t_crit = icdf('T', 1-a, dof);

if t_obs <= t_crit
    accept = 'H0';
else
    accept = 'H1';
end

% Step V: P-Value Approach (one-sided)
p = 1 - cdf('T', t_obs, dof);

if p > a
    accept = 'H0';
else
    accept = 'H1';
end

An example can be downloaded here (CS_05_EqualVariance, .XLSX).

The same computation in Julia:

using Distributions, Statistics

# Data
x1 = [0.10, 0.08, 0.06, 0.11, 0.08, 0.10];
x2 = [0.09, 0.03, 0.08, 0.06, 0.04, 0.05];
a = 0.05;

# Pooled standard deviation (Equation 9.20) and test statistic (Equation 9.27)
dof = length(x1) + length(x2) - 2
s_pool = sqrt(((length(x1) - 1)*std(x1)^2 + (length(x2) - 1)*std(x2)^2)/dof)
t_obs = (mean(x1) - mean(x2))/(s_pool*sqrt(1/length(x1) + 1/length(x2)))

# One-sided critical value and p-value
t_crit = quantile(TDist(dof), 1 - a)
p = ccdf(TDist(dof), t_obs)

COMPARISON OF TWO MATCHED PAIR SAMPLE MEANS #

The matched pairs test of Section 3 is identical to the comparison of a sample mean with a reference value treated in the previous lesson. The reference value \mu_0 is 0 in this case. You should arrive at a t_{\text{obs}} of 0.535, a t_{\text{crit}} of 2.36, and a p-value of 0.609. The following Excel file can be used to narrow down your calculations if your computations yield different answers (CS_05_MatchedPairs, .XLSX).
