
02. Statistics of Repeated Measurements

Updated on February 6, 2025
Now that we recognize the need for statistics in analytical chemistry, we can turn our attention to practice in the laboratory. We have learned that repeated measurements are the key to assessing the error in quantitative analysis. How can we use them to process our data? In this lesson, we will learn about different types of data and ways to present them both graphically and numerically, and we will make the first step towards the probability distribution.

Learning Goals #

  • Understand how to distinguish robust from non-robust statistics.
  • Present data as a histogram and as a box-and-whisker plot.
  • Use measures of central tendency and dispersion to numerically summarize data.
Analytical Separation Science by B.W.J. Pirok and P.J. Schoenmakers
READ SECTIONS 9.2.2 - 9.2.2.3

Descriptive Statistics

1. Descriptive Statistics #

Statistics can be useful for different purposes. In general, we can for now distinguish three aims: summarizing a sample (descriptive statistics), describing the population by inferring information from the sample (inferential statistics), and establishing relationships between two or more variables (modelling).

Table 1. Statistics for data analysis can have several aims. Three are listed in this table. In this lesson we focus on descriptive statistics.
Concept Definition
Descriptive statistics
Summarize a sample (describe it).
Inferential statistics
Describe the population from which the sample comes.
Modelling
Establish a relationship between two (or a collection of) variables.

1.1. Notation #

In this lesson we focus on descriptive statistics. To guide us, we’ll use a set of 100 repeated measurements that are shown in Table 2 and can also be downloaded for the exercises. 

Table 2. Representation of a sample of 100 repeated measurements of the concentration of a pesticide by RPLC. Identical to Table 9.2 from the book.
Set | Measurements | \bar{x} | \sigma
#1 | 9.08 9.13 9.11 9.13 9.10 9.13 9.15 9.12 9.14 9.10 | 9.119 | 0.0213
#2 | 9.12 9.15 9.13 9.14 9.11 9.13 9.11 9.13 9.13 9.12 | 9.127 | 0.0125
#3 | 9.11 9.09 9.11 9.14 9.11 9.12 9.15 9.14 9.16 9.14 | 9.127 | 0.0221
#4 | 9.13 9.14 9.16 9.08 9.14 9.10 9.14 9.09 9.12 9.13 | 9.123 | 0.0254
#5 | 9.16 9.12 9.12 9.11 9.15 9.13 9.17 9.12 9.15 9.11 | 9.134 | 0.0217
#6 | 9.12 9.10 9.11 9.13 9.12 9.17 9.11 9.14 9.11 9.12 | 9.123 | 0.0200
#7 | 9.10 9.13 9.14 9.12 9.11 9.13 9.16 9.12 9.13 9.15 | 9.129 | 0.0179
#8 | 9.12 9.14 9.13 9.12 9.14 9.13 9.12 9.13 9.13 9.12 | 9.128 | 0.0079
#9 | 9.13 9.12 9.13 9.13 9.09 9.18 9.13 9.11 9.14 9.11 | 9.127 | 0.0236
#10 | 9.12 9.10 9.15 9.11 9.14 9.12 9.10 9.14 9.13 9.14 | 9.125 | 0.0178

Before we define the different statistics that we can calculate, we first need to cover another convention on their notation. If we describe the entire population, we use the Greek alphabet (e.g. \mu, \sigma), whereas the Latin alphabet is used for the sample (e.g. \bar{x}, s).

1.2. Central tendency measurements #

We can measure different aspects of our data in Table 2. The first we will address is the central tendency: a statistic (or measure) for the central point of the data. The arithmetic mean, \bar{x}, also known as the mean or average, is the sum of all measurements divided by the number of measurements.
Equation 9.4: \bar{x}=\frac{\Sigma^{n}_{i=1}x_i}{n}

Alternatively, we can use the median, which is the middle value that separates the lower half of the sorted data vector from the higher half.

If the number of measurements is an even number, the median is the mean of the middle two of the sorted values.

Equation 9.5: \tilde{x}=\begin{cases}{x_{(n+1)/2}} & \text{if } n \text{ is odd}\\{\frac{x_{n/2}+x_{1+n/2}}{2}} & \text{if } n \text{ is even}\end{cases}

Note that taking the mean of a variable is always denoted mathematically with a bar on the variable (e.g. \bar{x}), whereas the median is denoted with a tilde, \tilde{x}.
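The two cases of Equation 9.5 can be checked in a few lines. A minimal Python sketch is shown here (the lesson's own examples use MATLAB and Julia, but the logic is identical):

```python
from statistics import mean, median

# Odd number of values: the median is the single middle value of the sorted data
x_odd = [9.08, 9.13, 9.11, 9.13, 9.10]   # n = 5
x_bar = mean(x_odd)        # arithmetic mean (Equation 9.4)
x_med = median(x_odd)      # x_(n+1)/2 of the sorted data

# Even number of values: the median is the mean of the two middle values
x_even = [9.08, 9.13, 9.11, 9.13]        # n = 4
x_med_even = median(x_even)
```

The `median` function applies the odd/even rule of Equation 9.5 automatically.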

1.3. Robust and non-robust #

You may wonder by now which of the two should be used: the mean or the median? That depends strongly on the type of data. In fact, the decision to use any statistical method depends on how the data is distributed. To illustrate this, we'll consider two sets of data.

Figure 1. A) Normally distributed data allows non-robust methods to be useful. For example, the mean is a good measure of the central value for this type of data. B) Asymmetrically distributed data requires robust methods, which do not assume that the data is symmetrical. Here, the mean would poorly represent the central point of the distribution.
Figure 1A shows a case where the data is symmetrically distributed, or – in this case – even normally distributed (i.e. the data follows a normal distribution). Parametric statistical methods assume that the data follows a certain distribution, such as the normal distribution. As a consequence, these methods are considered non-robust to relevant anomalies in the data. In contrast, non-parametric methods, or robust methods, do not operate under such assumptions. Figure 1B shows an example of an asymmetric distribution of data for which parametric statistical tests should not be used.
Example (MATLAB)

% Data
x = [1,2,3,4,5,6,42];

% Calculations
x_median = median(x);
x_mean = mean(x);
x_std = std(x);

The following Excel functions can be used to do these calculations:

  • Mean: AVERAGE()
  • Median: MEDIAN()
  • Standard deviation: STDEV()

An example can be downloaded here (CS_02_CentralTendency,  .XLSX).

Note that Excel also features the STDEV.S() and STDEV.P() functions, which utilize the two equations below for the standard deviation. STDEV() = STDEV.S(). The relevance of using the correct one will become apparent in the next lesson.

				
					using Distributions, Statistics

# Data
x = [1 2 3 4 5 6 42];

# Calculations
x_median = median(x)
x_mean = mean(x)
x_std = std(x)
				
			

Does this really matter? Yes, even for something as simple as the mean and the median. Let's consider the following vector of data

x = [1; 2; 3; 4; 5; 6; 42]

The mean of this data is \bar{x}=9, whereas the median is \tilde{x}=4. The data vector x is clearly not symmetrically distributed and even contains an outlier (the value 42). We can see that, as a consequence, the mean is strongly affected by the presence of the value 42. In contrast, the median completely ignores this outlier and selects the middle value of the vector, which is 4. For the median, it does not matter how extreme the outliers are. We can indeed say that the median is a robust statistic. Is there then no disadvantage to the median? Well, the median completely ignores the actual values themselves. It simply selects the middle value, and thus represents the actual data less faithfully.
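The robustness of the median can be demonstrated by making the outlier arbitrarily worse. A short Python sketch, using the same data vector as above:

```python
from statistics import mean, median

x = [1, 2, 3, 4, 5, 6, 42]
x_mean = mean(x)       # 63 / 7 = 9: dragged upward by the outlier
x_median = median(x)   # 4: the middle value, unaffected by the outlier

# Make the outlier arbitrarily extreme: the mean explodes, the median does not
x_extreme = [1, 2, 3, 4, 5, 6, 42000]
mean_extreme = mean(x_extreme)       # (21 + 42000) / 7 = 6003
median_extreme = median(x_extreme)   # still 4
```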

1.4. Dispersion measurements #

The dispersion quantifies the variability of the distribution, i.e. the extent to which the distribution is squeezed or stretched. The non-robust statistic for the dispersion is the well-known standard deviation, which, in the context of analytical methods, we also refer to as precision. For a sample, the standard deviation quantifies the average deviation of x_i from the mean as
Equation 9.6: s=\sqrt{\Sigma_i \frac{(x_i-\bar{x})^2}{n-1}}

where n is the number of datapoints. For the entire population, with a total number of n_{\text{pop}} datapoints, the standard deviation is given by

Equation 9.7: \sigma=\sqrt{\Sigma_i \frac{(x_i-\mu)^2}{n_{\text{pop}}}}

The variance is the standard deviation squared (i.e. s^2 or \sigma^2). Note that the standard deviation and variance are not necessarily dimensionless. To compare standard deviations of data with different units or magnitudes, the relative standard deviation (RSD) can be calculated, which is given by

Equation 9.8: RSD=100\% \cdot \frac{s}{\bar{x}}

or the coefficient of variation (CV)

Equation 9.8bis: CV=\frac{s}{\bar{x}}

Although the standard deviation can be used for any type of data, it can provide somewhat misleading information to an analytical chemist. The robust alternative is the interquartile range (IQR), which we will learn more about in the next section.
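These dispersion measures can be sketched in Python (standard library only), using set #1 from Table 2 as input:

```python
from statistics import mean, stdev, quantiles

x = [9.08, 9.13, 9.11, 9.13, 9.10, 9.13, 9.15, 9.12, 9.14, 9.10]   # set #1 of Table 2

s = stdev(x)        # sample standard deviation, Equation 9.6 (n - 1 in the denominator)
cv = s / mean(x)    # coefficient of variation, Equation 9.8bis
rsd = 100 * cv      # relative standard deviation in %, Equation 9.8

# The robust alternative: the interquartile range
q1, q2, q3 = quantiles(x, n=4)
iqr = q3 - q1
```

For set #1, s evaluates to roughly 0.0213, matching the value listed in Table 2.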

Calculate the mean and standard deviation of the 100 values in Table 2. Round to four decimals. Fill in your answer in the format "mean; standard_deviation".

For example, you fill in "9.1419; 0.0130" if your result is a mean of 9.1419 and a standard deviation of 0.0130.

2. Displaying data #

Next to numerical statistics, there are also several graphical tools to describe our data. In this section we will cover two. 

2.1. Box-and-whisker plot #

One very powerful method to display data is by creating a box and whisker plot, also known as a boxplot. We can regard an example in Figure 2A, where the 100 pesticide measurements are plotted.
Figure 2. Example of box and whisker plots of the 100 pesticide measurements from Table 2. Plot (A) without, and (B) with notches. For (A) two outliers were added to indicate how these are shown. A box and whisker plot is a robust graphical tool that can effectively display asymmetric data. Image is identical to Figure 9.7 from the book. See Table 3 and text for further clarification.

Several limits can be seen on the box and whisker plot. These are clarified in Table 3. Importantly, 50% of the measurements lie within the two boxes between Q1 and Q3. This distance is also referred to as the interquartile range (IQR). Both the mean and the median are individually indicated.

Table 3. Definitions for the box and whisker plot.
Concept Definition
First quartile, Q1
Value of x so that 25% of the observations are smaller.
Third quartile, Q3
Value of x so that 25% of the observations are larger.
Percentile, p\%
Value of x so that p\% of the observations are smaller.
Interquartile range, IQR
IQR=Q_3-Q_1; for a normal distribution, \sigma \approx 0.7413\,IQR.
Next we have the whiskers, which extend from Q1 and Q3 to the smallest and largest datapoints within 1.5 times the IQR. Any datapoint beyond 1.5 times the IQR from the box is considered an outlier. For the data in Figure 2A two additional outliers were added, whereas Figure 2B shows an example of notches, which can be used to indicate the 95% confidence interval of the median.
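The quartiles, fences, and outlier rule can be sketched in Python. The small data vector below is hypothetical (one deliberately high value), not the pesticide data:

```python
from statistics import quantiles

x = [9.08, 9.10, 9.11, 9.12, 9.13, 9.13, 9.14, 9.15, 9.30]   # hypothetical set with one high value

q1, q2, q3 = quantiles(x, n=4)   # quartiles
iqr = q3 - q1                    # interquartile range

# Whisker fences: 1.5 times the IQR beyond Q1 and Q3
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything beyond the fences is flagged as an outlier
outliers = [v for v in x if v < lower_fence or v > upper_fence]
```

Here only the value 9.30 falls outside the fences and would be drawn as an individual point on the boxplot.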
Example (MATLAB)

% Data Import
data = readtable('PesticideConcentrationsOutliers.csv');
values = data{:,:};

% Create Box And Whisker Plot
boxplot(values,'Notch','on')

% Without Notch
boxplot(values)

% Directly From The Table
data = readtable('PesticideConcentrations.csv');
vector_data = reshape(data{:,:},[],1);
boxplot(vector_data,'Notch','on');

What becomes apparent from the box and whisker plots is that the 100 pesticide measurements are not symmetrically distributed. 

Analytical Separation Science by B.W.J. Pirok and P.J. Schoenmakers
READ SECTION 9.2.2.4

Distributions: Probability Density Function (PDF)

To make a Box and Whisker plot in Excel: select your data and navigate to Insert > Chart > Statistical > Box and Whisker. Note that depending on your version of Excel the actual chart type may be located at a different option.

An example can be downloaded here (CS_02_Box_and_Whisker_Plots,  .XLSX).

				
					using Distributions, Statistics, CSV, DataFrames, StatsPlots

# Data Import
data        = CSV.read("PesticideConcentrationsOutliers.csv", DataFrame)
values      = data.Data     # Data <-- should be the name of the column in the csv

# Create Box and whisker Plot
boxplot(values, notch = true)

# Without notch
boxplot(values)
				
			

2.2. Histogram #

While the box and whisker plot is elegant in displaying key characteristics of the data, it does not precisely show the distribution of the data. We will therefore also cover an alternative plot: the histogram. In a histogram, all data is divided across several so-called bins. Each bin captures the frequency with which the variable takes on a value between defined limits. An example is shown in Figure 3A. The number of bins can be defined by the user. A good rule of thumb is to take a number of bins equal to \sqrt{n}.
Figure 3. (A) Histogram plots of the 100 pesticide measurements from Table 2. One of the bins is indicated. (B) Same histogram as panel (A), but now with the y-axis converted to relative frequency.

A very useful operation is to divide each frequency by the total number of observations, n. The result is shown in Figure 3B. By dividing the frequency of each bin by n, the total area of the distribution becomes 1. In other words, there is a 100% chance that the variable will take on one of the possible values. And, in this case, there is a 39% chance that the variable will take on a value of 9.13 or 9.14 (pink-marked area).
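The binning and normalisation described above can be sketched in Python (standard library only), with set #1 of Table 2 as input and the \sqrt{n} rule of thumb choosing the number of bins:

```python
import math
from collections import Counter

x = [9.08, 9.13, 9.11, 9.13, 9.10, 9.13, 9.15, 9.12, 9.14, 9.10]   # set #1 of Table 2

n = len(x)
n_bins = round(math.sqrt(n))      # rule of thumb: about sqrt(n) bins

lo, hi = min(x), max(x)
width = (hi - lo) / n_bins

# Assign each value to a bin index and count the frequency per bin
counts = Counter(min(int((v - lo) / width), n_bins - 1) for v in x)

# Dividing each frequency by n gives relative frequencies whose total is 1
rel_freq = {b: c / n for b, c in counts.items()}
```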

Example (MATLAB)

% Data Import
data = readtable('PesticideConcentrationsVector.csv');
values = data{:,:};

% Prepare Variables
n = length(values);
bins = round(sqrt(n));  % The number of bins must be an integer

% Create Histogram
histogram(values,bins)

To make a Histogram plot in Excel: select your data and navigate to Insert > Chart > Statistical > Histogram. Note that depending on your version of Excel the actual chart type may be located at a different option. You can set the properties of the bins in the Properties window.

An example can be downloaded here (CS_02_Histogram, .XLSX).

				
					using Distributions, Statistics, CSV, DataFrames, StatsPlots

# Data Import
data        = CSV.read("PesticideConcentrationsOutliers.csv", DataFrame)
values      = data.Data     # Data <-- should be the name of the column in the csv

# Create Histogram
histogram(values, bins = :sqrt)
				
			

Florence Nightingale (1820-1910, United Kingdom)

Florence Nightingale is well known as a nurse who cared for soldiers during the Crimean War (1853-1856). Less well known is her role as a pioneer of statistics and data visualization. She used graphics to bring her findings across, which reputedly was revolutionary at the time among mathematicians and statisticians. She compiled data on seasonal variations in patient mortality in pie charts with segments of equal angles, but variable length. She called such plots “cockscombs” (after the crest of a domestic rooster). Among all her other impressive activities, Florence Nightingale was an early proponent of infographics.

Image by Henry Hering, copied by Elliott & Fry as half-plate glass copy negative, late 1856-1857. Image reproduced with permission by National Portrait Gallery, London under the CC BY-NC-ND 3.0 license https://creativecommons.org/licenses/by-nc-nd/3.0/).

3. Probability distributions #

3.1. Probability density function (PDF) #

The distributions thus far were created using discrete values, but a variable such as the concentration could take on any value. To describe the entire population of values we need a continuous curve. We can imagine that, if we were to increase n to \infty, our distribution would become infinitely more refined and start to approximate the normal distribution. We can capture this as the probability density function (PDF), which is non-negative everywhere, and the area under its curve is 1 (i.e. a 100% probability of finding any value). In our case, we can create one by plotting the normal distribution as shown in Figure 4A.

Figure 4. A) Probability density function (PDF), B) cumulative density function (CDF), and C) inverse cumulative density function (ICDF) based on the 100 pesticide measurements from Table 2.

3.2. The normal distribution #

The normal distribution – which is also referred to as the Gaussian distribution – is a continuous probability function for a real-valued random variable. It is given by

Equation 9.9: f(x)=\frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2 \sigma^2}}

We can see that this function requires \mu and \sigma. If we take the mean and standard deviation of our 100 pesticide RPLC measurements from Table 2, we can plot the normal distribution as PDF for our data and obtain Figure 4A.
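Equation 9.9 can also be implemented directly and checked against a library implementation. A Python sketch follows; the mu and sigma below are placeholder values for illustration, not the answer to the Table 2 exercise:

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Equation 9.9: the probability density function of the normal distribution."""
    return 1.0 / (sigma * math.sqrt(2.0 * math.pi)) * math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2))

mu, sigma = 9.12, 0.02   # placeholder values for illustration only

# The hand-written formula agrees with the standard library's implementation
x = 9.13
assert math.isclose(normal_pdf(x, mu, sigma), NormalDist(mu, sigma).pdf(x))
```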

Example (MATLAB)

% Data Import
data = readtable('PesticideConcentrationsVector.csv');
values = data{:,:};

% Prepare Variables
x_mu = mean(values);
x_sigma = std(values);

% Determine Plot Limits
x_min = min(values);    % Lowest Value
x_max = max(values);    % Highest Value
x_diff = x_max-x_min;   % Difference Between The Two
x = x_min-0.1*x_diff:x_diff/100:x_max; % X Values

% Create The PDF
y = pdf('Normal',x,x_mu,x_sigma);

% Plot The PDF
plot(x,y);

The NORM.DIST function can be used to create a probability density function of the Normal distribution if you set the final input parameter as FALSE.

To make a Line Plot in Excel: select your data and navigate to Insert > Chart > Line Plot. Note that depending on your version of Excel the actual chart type may be located at a different option.

An example can be downloaded here (CS_02_PDF, .XLSX).

				
					using Distributions, Statistics, CSV, DataFrames, StatsPlots

# Data Import
data        = CSV.read("PesticideConcentrationsOutliers.csv", DataFrame)
values      = data.Data     # Data <-- should be the name of the column in the csv

# Prepare Variables
x_mu        = mean(values)
x_sigma     = std(values)

# Determine Plot Limits
x_min       = minimum(values)   # Lowest value
x_max       = maximum(values)   # Highest value
x_diff      = x_max - x_min     # Difference between the two
x           = x_min - 0.1 * x_diff:x_diff / 100:x_max   # X values

# Create the PDF
d           = Normal(x_mu, x_sigma)
y           = pdf(d, x)

# Plot The PDF
plot(x,y)
				
			

3.3. Standard Normal Distribution #

The plot shown in Figure 4A provides us yet another perspective on our data, but thus far we have not been able to do much with the result. We learned before that one attractive feature of the PDF is that the area under the curve is 1. In other words, the probability to find any value is 100%. This implies that we can also calculate the probability to find a value between certain limits. This will be highly useful later on.

Analytical Separation Science by B.W.J. Pirok and P.J. Schoenmakers
READ SECTIONS 9.2.2.5 & 9.2.2.6

Standard Normal Distribution, CDF & ICDF

The objective is to calculate specific areas under the curve between limits. Before we continue, it must be noted that this can be simplified by converting our data into a standard normal distribution, which has the properties that its mean is 0 and the standard deviation is 1. We can achieve this by expressing our variable x into a standard normal variable z through

Equation 9.10: z=\frac{x-\mu}{\sigma}

The result is shown with the second x-axis in Figure 4A. The limits can now readily be calculated. For example, as a property of this distribution, 95% of the data will lie between a z of -1.96 and 1.96. We will see later how such values are computed.
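Equation 9.10 and the ±1.96 property can be verified numerically. A Python sketch follows; mu and sigma are placeholder values for illustration only:

```python
from statistics import NormalDist

mu, sigma = 9.12, 0.02    # placeholder values for illustration only

x = 9.15
z = (x - mu) / sigma      # Equation 9.10: x expressed as a standard normal variable

# Property of the standard normal distribution: about 95% of the
# probability lies between z = -1.96 and z = +1.96
std_norm = NormalDist(0, 1)
p = std_norm.cdf(1.96) - std_norm.cdf(-1.96)
```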

3.4. Calculation of probability of finding specific values #

To calculate the probability between specific limits, we can express the PDF as the cumulative distribution function (CDF). The CDF yields the probability that a random variable is less than or equal to x (Figure 4B, white dashed line).
Example (MATLAB)

% Example Values
mu = 0;
sigma = 1;
x_limit = 1.96;

% Calculate Probability Lower Than Limit
p = cdf('Normal', x_limit, mu, sigma);

% Calculate Probability Higher Than Limit
p = 1 - cdf('Normal', x_limit, mu, sigma);

% Calculate Probability Between Two Limits
x_limit_low = -1.96;
x_limit_high = 1.96;
p2 = cdf('Normal', x_limit_high, mu, sigma);
p1 = cdf('Normal', x_limit_low, mu, sigma);
p = p2 - p1;

The NORM.DIST function can be used to compute the cumulative distribution function for the Normal distribution if you set the final input parameter as TRUE.

An example can be downloaded here (CS_02_CDF, .XLSX).

				
					using Distributions, Statistics

# Example Values
mu          = 0
sigma       = 1
x_limit     = 1.96

# Create a Normal Distribution
dist        = Normal(mu, sigma)

# Calculate Probability Lower Than Limit
p_low       = cdf(dist, x_limit)

# Calculate Probability Higher Than Limit
p_high      = 1 - cdf(dist, x_limit)

# Calculate Probability Between Two Limits
x_limit_high    = p_high
x_limit_low     = p_low

p2          = cdf(dist, x_limit_high)
p1          = cdf(dist, x_limit_low)
p           = p2 - p1
				
			

Just to reiterate: the CDF evaluated at x always returns the probability that the variable takes on the value of x or less. If you want to compute the probability that the variable takes on x or higher, you can take 1 (the full area) minus the probability that it will take on x or lower. Similarly, to calculate the probability between two limits, you can subtract the CDF evaluated at the two limits from each other (Figure 5).
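These three cases translate directly into code. A Python sketch using the standard normal distribution (the limits ±1 are arbitrary example values):

```python
from statistics import NormalDist

dist = NormalDist(mu=0, sigma=1)
x_low, x_high = -1.0, 1.0   # arbitrary example limits

p_below = dist.cdf(x_high)                        # P(X <= x_high)
p_above = 1 - dist.cdf(x_high)                    # P(X >= x_high)
p_between = dist.cdf(x_high) - dist.cdf(x_low)    # P(x_low <= X <= x_high)
```

For the standard normal distribution, p_between evaluates to about 0.6827, the familiar fraction of values within one standard deviation of the mean.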

Figure 5. To compute the probability of x taking on a value between two limits, the CDF evaluated at the lower limit can be subtracted from its evaluation at the higher limit.

A pH-meter gives on average true results with a standard deviation of 0.015. When an aqueous solution with a true pH of 7.283 is analysed, what is the chance of obtaining a measurement result above a pH of 7.3? Round your answer to 4 decimals.

What is the probability to obtain a measurement result between a pH of 7.28 and 7.29? Round your answer to 4 decimals.


3.5. Determination of limits based on probabilities #

We can also reverse the process and use the inverse cumulative distribution function (ICDF) to compute the limit for which the probability that the variable takes on that value or lower equals the probability we specify. This is a mouthful, but the process is also shown schematically in Figures 4B and 4C (pink dashed line). A quantile or inverse CDF function can be used to do this.
Example (MATLAB)

% Compute Limit Based On Probability
x_limit = icdf('Normal',p,mu,sigma);

% Example
mu = 7.283;
sigma = 0.015;
p = 0.4;
x_limit = icdf('Normal',p,mu,sigma);

The NORM.INV function can be used to compute the inverse cumulative distribution function of the Normal distribution.

An example can be downloaded here (CS_02_ICDF, .XLSX).

				
					using Statistics, Distributions

# Calculate Probability Lower Than Limit
d = Normal(mu,sigma)        # Normal distribution
x_limit = quantile(d, p)    # Limit

# Example
mu = 7.283
sigma = 0.01
p = 0.4
d = Normal(mu,sigma)        # Normal distribution
x_limit = quantile(d, p)    # Limit
				
			

For example, if we compute the ICDF evaluated for our data at a probability of 0.8, then we obtain the value of x for which the probability of finding it or a lower value is 0.8.
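The round trip between CDF and ICDF makes this concrete. A Python sketch using the standard normal distribution:

```python
from statistics import NormalDist

dist = NormalDist(mu=0, sigma=1)

# The value below which 80% of the distribution lies
x_limit = dist.inv_cdf(0.8)

# Round trip: evaluating the CDF at that limit recovers the probability
p_check = dist.cdf(x_limit)
```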

Suppose the cholesterol level of blood plasma of healthy individuals is normally distributed with an average of 6.30 mmol·L⁻¹ and a standard deviation of 0.61 mmol·L⁻¹.

For an extensive study on the influence of nutrition on health, a large number of volunteers is to be divided into 5 groups according to their plasma cholesterol level: extra low – low – average – high – extra high. Calculate where the limits for these groups should be drawn such that the 5 groups will contain approximately the same number of test persons.

Concluding Remarks #

We have learned in this lesson how to describe our sample using various tools. First we saw numerical statistics that quantify the central tendency and the dispersion of the data. Next we looked at some graphical tools. Finally, we practiced some essential calculations that we can use to calculate probabilities based on our data, or to determine limits. These calculations will become essential for the confidence intervals and hypothesis tests in the next lessons. Did you get stuck with the final question? There is some additional code below to help you out.

				
					% We Need Five Equal Groups
% So Our Probabilities Of Interest Are
p = 0:(1/5):1;

% We Plug The Entire Probability Vector In:
x=icdf('Normal',p,6.3,0.61);
				
			

The solution to the last exercise can be downloaded here (CS_02_ICDF_EE, .XLSX).

				
					using Distributions, Statistics

# We need five equal groups
# So our probabilities of interest are:
p           = 0:(1/5):1

# We plut the entire probability vector in:
x           = quantile.(Normal(6.3, 0.61), p)
				
			