Learning Goals #
- Understand how the sample size affects your ability to draw conclusions from your data.
- Calculate the confidence interval for your data.
- Estimate the optimal number of replicates for an experiment.
- Argue why it is reasonable to assume that experimental data follows the normal distribution.
- Outline strategies to increase the degrees of freedom to improve a confidence interval.
READ SECTIONS 9.2.3 - 9.2.3.2
Confidence Intervals & Central Limit Theorem
1. Sampling distribution of the mean #
We turn back to the table of 100 RPLC measurements of a pesticide from the previous lesson, now shown in Table 1. At this point it is time to acknowledge that 100 repetitions are far, far removed from conventional laboratory practice; usually we would conduct a much smaller number of repeats.
| SET | x_1 | x_2 | x_3 | x_4 | x_5 | x_6 | x_7 | x_8 | x_9 | x_{10} | \bar{x} | \sigma |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| #1 | 9.08 | 9.13 | 9.11 | 9.13 | 9.10 | 9.13 | 9.15 | 9.12 | 9.14 | 9.10 | 9.119 | 0.0213 |
| #2 | 9.12 | 9.15 | 9.13 | 9.14 | 9.11 | 9.13 | 9.11 | 9.13 | 9.13 | 9.12 | 9.127 | 0.0125 |
| #3 | 9.11 | 9.09 | 9.11 | 9.14 | 9.11 | 9.12 | 9.15 | 9.14 | 9.16 | 9.14 | 9.127 | 0.0221 |
| #4 | 9.13 | 9.14 | 9.16 | 9.08 | 9.14 | 9.10 | 9.14 | 9.09 | 9.12 | 9.13 | 9.123 | 0.0254 |
| #5 | 9.16 | 9.12 | 9.12 | 9.11 | 9.15 | 9.13 | 9.17 | 9.12 | 9.15 | 9.11 | 9.134 | 0.0217 |
| #6 | 9.12 | 9.10 | 9.11 | 9.13 | 9.12 | 9.17 | 9.11 | 9.14 | 9.11 | 9.12 | 9.123 | 0.0200 |
| #7 | 9.10 | 9.13 | 9.14 | 9.12 | 9.11 | 9.13 | 9.16 | 9.12 | 9.13 | 9.15 | 9.129 | 0.0179 |
| #8 | 9.12 | 9.14 | 9.13 | 9.12 | 9.14 | 9.13 | 9.12 | 9.13 | 9.13 | 9.12 | 9.128 | 0.0079 |
| #9 | 9.13 | 9.12 | 9.13 | 9.13 | 9.09 | 9.18 | 9.13 | 9.11 | 9.14 | 9.11 | 9.127 | 0.0236 |
| #10 | 9.12 | 9.10 | 9.15 | 9.11 | 9.14 | 9.12 | 9.10 | 9.14 | 9.13 | 9.14 | 9.125 | 0.0178 |
The data in Table 1 can also be regarded as a collection of 10 datasets, one per row. The final columns of the table display the mean, \bar{x}, and standard deviation, \sigma, of the data in each row. We can see that, even though all 100 measurements are in fact repetitions of the same measurement, the mean \bar{x} differs slightly from row to row. This suggests that \bar{x} follows a distribution of its own.
Things get much more interesting in Figure 1, where the distribution of the 10 \bar{x} values is plotted in a box and whisker plot (pink) next to that of the 100 original measurements (blue). Strikingly, the plots clearly show that the means are clustered much closer to one another. Can we somehow exploit this?
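To see this numerically, here is a minimal Julia sketch (the variable names are ours, and only the first two sets of Table 1 are typed out for brevity) that compares the spread of the set means with the spread of the individual measurements:
using Statistics
# Each row is one set of 10 repeated measurements from Table 1
# (only sets #1 and #2 are shown here to keep the example short).
sets = [9.08 9.13 9.11 9.13 9.10 9.13 9.15 9.12 9.14 9.10;
        9.12 9.15 9.13 9.14 9.11 9.13 9.11 9.13 9.13 9.12]
set_means = vec(mean(sets, dims = 2))  # one mean per set (row)
std(vec(sets))       # spread of the individual measurements
std(set_means)       # spread of the set means -- noticeably smaller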
2. Central limit theorem #
It gets interesting once we take several dice and roll them simultaneously. Figure 3 shows the means obtained by throwing 10 dice simultaneously 1000 times. In other words, 10 dice were thrown, the mean of the faces was recorded, and the process was repeated until 1000 means were obtained.
In Figure 3, the distribution no longer looks like a block shape at all, and instead starts to resemble the normal distribution. In fact, if we were to throw an infinite number of dice 1000 times and plot the means of the results, i.e. the sampling distribution of the means, we would obtain a perfect normal distribution as shown in Figure 4.
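The dice experiment is easy to reproduce yourself. A minimal sketch in Julia (variable names are ours), simulating 1000 throws of 10 dice:
using Statistics
n_throws = 1000
n_dice = 10
# For every throw, roll `n_dice` dice and record the mean of the faces.
throw_means = [mean(rand(1:6, n_dice)) for _ in 1:n_throws]
# A histogram of `throw_means` resembles the bell shape of Figure 3;
# increasing `n_dice` brings it ever closer to a normal distribution.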
READ SECTION 9.2.3.3
Confidence Intervals For Large Sample Sizes
3. Confidence intervals #
3.1. Large sample size #
At this stage, you may be overwhelmed by the computation of the critical value. However, determining this range is no different from the ICDF exercise in the previous lesson! We are interested in the central 95% of the distribution, which leaves 2.5% in each tail. We can use the ICDF at p = 0.025 (2.5%) and p = 0.975 (97.5%) to obtain the z-values at these probabilities (-1.96 and 1.96, respectively).
Equation 9.15: \bar{x}\pm z_{\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}}
In reality, \sigma is often unknown. As long as n is above 30 we can reasonably replace it by s, but we will numerically explore this at the end of this lesson.
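If you want to verify the ±1.96 yourself, a two-line sketch of the ICDF step in Julia (using the Distributions package) looks like this:
using Distributions
alpha = 0.05
z_low  = quantile(Normal(0, 1), alpha / 2)      # ≈ -1.96
z_high = quantile(Normal(0, 1), 1 - alpha / 2)  # ≈ +1.96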
Calculate the confidence interval for the 100 RPLC pesticide measurements in Table 1. Fill in the interval (i.e. z_{\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}}) in the field below and round it to 4 decimals. Hint: Try to do the exercise with a combination of the tools you have learned thus far in this course. An incorrect answer will give you more hints. Further example code is down the page, but you learn more if you first try yourself.
Correct! Based on the 100 RPLC measurements we can conclude with a 95% confidence that the true mean is covered by the range 9.1262 ± 0.0038 ppm.
Unfortunately, this is not correct! Check your calculation first. You should have ended up using an n of 100, an s (or \sigma) of 0.0192, and a critical z-value of 1.96. Did you get a critical value other than 1.96? Then you possibly have not used the z-distribution. The standard normal distribution (i.e. z-distribution) has the property that its mean is 0 and its standard deviation is 1. You only use these values to calculate the critical z-value during the ICDF, not during the computation of the confidence interval itself.
3.2. Small sample size #
It is great that we have been able to determine the confidence interval for our 100 RPLC measurements. However, in practice we often find ourselves with a much smaller number of repetitions. What do we do then?
READ SECTIONS 9.2.3.4-9.2.3.5
Confidence Intervals For Small Sample Sizes
Equation 9.17: t=\frac{\bar{x}-\mu}{s/\sqrt{n}}
The effect of the degrees of freedom is shown in Figure 6, where the t-distribution is plotted for different \nu values. At \nu = \infty the t-distribution is identical to the normal distribution. However, once we lose degrees of freedom, the distribution widens. This becomes significant below n = 30, and Figure 6 shows it to be extreme for degrees of freedom as low as 1 or 2.
The effect is that the critical values also move away from 0, and as a consequence the confidence intervals widen. This is in agreement with what we expect, because if we have fewer repetitions (i.e. fewer degrees of freedom), then we are also less sure about what we are measuring. If we only have two measurements (i.e. \nu = 1), then we have no clue what the actual variation is based on just two values, and our confidence will be horrible. For small sample sizes, the confidence interval is given by
Equation 9.19: \mu=\bar{x} \pm t_{(\frac{\alpha}{2},\nu)} \frac{s}{\sqrt{n}}
Here, t_{(\frac{\alpha}{2},\nu)} is the critical t-value at \nu degrees of freedom. It can be computed with the ICDF in most programming languages, as well as in Excel.
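To see how strongly the critical t-value depends on \nu, the ICDF can be evaluated for a few degrees of freedom. A short Julia sketch, assuming a 95% confidence level:
using Distributions
alpha = 0.05
for dof in (1, 2, 5, 10, 30, 100)
    t_crit = quantile(TDist(dof), 1 - alpha / 2)
    println("nu = $dof: t_crit = $(round(t_crit, digits = 3))")
end
# For comparison, the z-value the t-distribution converges to:
println("z_crit = ", round(quantile(Normal(0, 1), 1 - alpha / 2), digits = 3))
The critical value drops from above 12 at \nu = 1 to just below 2 around \nu = 100, approaching the z-value of 1.96.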
Calculate the confidence interval for the 10 RPLC pesticide measurements from Set 1 in Table 1. Fill in the interval (i.e. t_{(\frac{\alpha}{2},\nu)} \frac{s}{\sqrt{n}}) in the field below and round it to 4 decimals. Hint: Try to do the exercise with a combination of the tools you have learned thus far in this course. An incorrect answer will give you more hints. Further example material to help you is located further down the page, but you learn more if you first try yourself.
This is the correct answer! If all went well, you should have ended up using a standard deviation of 0.0213, n = 10, and 9 degrees of freedom. The critical t-value should be (-)2.2622.
Unfortunately, this isn’t correct yet.
% alpha, x_dof, x_std, n, and x_mean are assumed to have been computed
% from the data (see the full small-sample example further down this page)
% Significance Level
p = 1-alpha/2;
% Calculation
t_crit = icdf('T',p,x_dof);
x_range = t_crit*(x_std/sqrt(n));
% Confidence Interval
x_CI = [x_mean - x_range, x_mean + x_range];
The T.INV() function can be used to compute the ICDF for the t-distribution for a given probability and number of degrees of freedom.
using Distributions, Statistics
# alpha, x_dof, x_std, n, and x_mean are assumed to have been computed
# from the data (see the full small-sample example further down this page)
# Significance level
p = 1 - alpha / 2
# Define the T distribution with degrees of freedom
t_dist = TDist(x_dof)
# Calculation
t_crit = quantile(t_dist, p)
x_range = t_crit * (x_std / sqrt(n))
# Confidence interval
x_CI = [x_mean - x_range, x_mean + x_range]
4. Optimal replicate number #
We saw in Figure 6 how at some point the t-distribution with sufficient degrees of freedom strongly resembles the normal distribution. We mentioned that this point would be around n>30 where we can reasonably assume s to sufficiently represent \sigma. Can we use this to determine what number of repeated measurements we should do?
Well, Equation 9.19 shows that the confidence interval depends on several components: (i) n, (ii) the critical t-value, and (iii) the precision. The latter is a given and depends on the instrument that produces the data. However, the first two we can investigate more closely. This is done in Figure 7, where the critical t-value divided by \sqrt{n} is plotted against the sample size.
At a first glance we see from Figure 7 that a small number of measurements yields a very high critical value, which improves as n increases. We can also see that after 20-50 measurements the improvement per additional measurement diminishes. This is the origin of the n > 30 rule of thumb (some say n > 50). More importantly, going from two to five repetitions already yields a roughly 7-fold improvement in the width term!
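The numbers behind Figure 7 are easy to reproduce; a minimal Julia sketch, assuming a 95% confidence level:
using Distributions
alpha = 0.05
# Width term of Equation 9.19 as a function of the number of replicates n.
width_factor(n) = quantile(TDist(n - 1), 1 - alpha / 2) / sqrt(n)
width_factor(2)                    # about 9.0 (nu = 1)
width_factor(5)                    # about 1.2
width_factor(2) / width_factor(5)  # roughly a 7-fold improvement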
So, how many replicates of a measurement should we do? Definitely not just two, at least three, but five clearly yields a lot of confidence!
5. Increasing degrees of freedom #
Obviously, doing more replicates increases the degrees of freedom. But what if we find out later that we want more?
In this case, an additional set of replicates can be measured. From a statistical point of view this is a second, separate dataset, and some statistical testing is required. Combining the two datasets is only allowed if their variances are homogeneous, which requires a test for homogeneity of variance (heteroscedasticity). We will learn how to do this in future lessons.
Equation 9.20: s^{2}_{\text{pooled}}=\frac{\nu_1 s^{2}_1 + \nu_2 s^{2}_2 + …}{\nu_1+\nu_2+…}
Here, \nu_1 and \nu_2 are the degrees of freedom of the respective datasets, and s^{2}_1 and s^{2}_2 are their variances.
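As a small illustration of Equation 9.20 in Julia, using the first two rows of Table 1 purely as example data and assuming their variances have already been shown to be homogeneous:
using Statistics
set1 = [9.08, 9.13, 9.11, 9.13, 9.10, 9.13, 9.15, 9.12, 9.14, 9.10]
set2 = [9.12, 9.15, 9.13, 9.14, 9.11, 9.13, 9.11, 9.13, 9.13, 9.12]
nu1, nu2 = length(set1) - 1, length(set2) - 1          # degrees of freedom
s2_pooled = (nu1 * var(set1) + nu2 * var(set2)) / (nu1 + nu2)
s_pooled = sqrt(s2_pooled)  # pooled standard deviation
nu_pooled = nu1 + nu2       # degrees of freedom of the pooled estimate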
Concluding Remarks #
We can conclude that repeated measurements allow us to formulate a range of values that – in the absence of systematic errors – is very likely to include the true mean. This is the confidence interval. We learned how it can be set at different significance levels. Other conclusions are:
- Averaged measurements follow the normal distribution (central limit theorem); it is therefore reasonable to assume that our data are normally distributed.
- Going from 2 to 5 repetitions in practice for our laboratory experiments really pays off in terms of confidence intervals.
- Above, say, 30 repetitions, we can reasonably assume that our standard deviation of the sample represents the standard deviation of the population.
In the next lesson we will use these concepts to do actual hypothesis tests.
Further help with exercises #
% Data
data = readtable('PesticideConcentrations.csv');
vector_data = reshape(data{:,:},[],1);
a = 0.05;
% Gathering Information
p = 1-a/2; % 1-Alpha/2
x_mean = mean(vector_data);
x_std = std(vector_data);
n = length(vector_data);
% Calculation
z_crit = icdf('Normal',p,0,1);
x_range = z_crit*(x_std/sqrt(n));
% Confidence Interval
x_CI = [x_mean - x_range, x_mean + x_range];
The NORM.S.INV() function can be used to compute the ICDF of the standard normal (z) distribution for a given probability.
An example can be downloaded here (CS_03_CI_Large, .XLSX).
using Distributions, Statistics, CSV, DataFrames
# Data
data = CSV.read("PesticideConcentrationsOutliers.csv", DataFrame)
vector_data = data.Data # Data <-- should be the name of the column in the csv
a = 0.05
# Gathering information
p = 1 - a / 2 # 1-alpha/2
x_mean = mean(vector_data)
x_std = std(vector_data)
n = length(vector_data)
# Calculation
dist = Normal(0, 1)  # standard normal (z) distribution: mean 0, standard deviation 1
z_crit = quantile(dist, p)
x_range = z_crit * (x_std / sqrt(n))
# Confidence interval
x_CI = [x_mean - x_range, x_mean + x_range]
% Data
x = [9.08, 9.13, 9.11, 9.13, 9.10, 9.13, ...
9.15, 9.12, 9.14, 9.10];
a = 0.05;
% Gathering Information
n = length(x);
x_dof = n-1; % Degrees Of Freedom
x_mean = mean(x);
x_std = std(x);
p = 1-a/2; % 1-Alpha/2
% Calculation
t_crit = icdf('T',p,x_dof);
x_range = t_crit*(x_std/sqrt(n));
% Confidence Interval
x_CI = [x_mean - x_range, x_mean + x_range];
The T.INV() function can be used to compute the ICDF for the t-distribution for a given probability and number of degrees of freedom.
An example can be downloaded here (CS_03_CI_Small, .XLSX).
using Distributions, Statistics
x = [9.08, 9.13, 9.11, 9.13, 9.10, 9.13, 9.15, 9.12, 9.14, 9.10]
a = 0.05
# Gathering information
n = length(x)
x_dof = n - 1 # Degrees of freedom
x_mean = mean(x)
x_std = std(x)
p = 1 - a / 2 # 1 - alpha/2
# Calculation
dist = TDist(x_dof)  # t-distribution with n - 1 degrees of freedom
t_crit = quantile(dist, p)
x_range = t_crit * (x_std / sqrt(n))
# Confidence interval
x_CI = [x_mean - x_range, x_mean + x_range]