INFORMATION REPOSITORY

01. Introduction to Chemometrics

Updated on December 22, 2024
The study of analytical separation science is not complete without developing an understanding of the data obtained, their interpretation, and ways to draw correct conclusions and extract relevant information. In the upcoming course modules we will learn about different facets of data analysis. We start with small sets of repeated measurements and work our way towards large chromatograms. Before we do so, it is first important to grasp the key concepts of errors.

Learning Goals #

  • Learn essential definitions used in analytical chemistry and statistics.
  • Grasp the concepts of errors in quantitative analysis.
  • Understand that statistics is the only way to use your powerful instrumentation to its full potential.
  • Recognize that statistics can save you a lot of time designing experiments.

1. Importance of Statistics in Analytical Chemistry #

The word chemometrics can reasonably be defined as the use of mathematical and statistical methods to describe the state of a chemical system (e.g. a sample), and design or select optimal measurement procedures and experiments.
In this course, we will see that analytical chemistry through chemometrics relies significantly on statistics.
Analytical Separation Science by B.W.J. Pirok and P.J. Schoenmakers
READ SECTION 9.1.1

Introduction to Quantitative Analysis

Analytical chemistry, and thus analytical separation science, is often used to answer the questions “What?” (qualitative analysis) and “How much?” (quantitative analysis). Unfortunately, it is impossible to analyse everything. Indeed, we usually take a sample and use this to draw a conclusion on the population from which the sample originated. To make things worse, no analytical method is perfect. Errors, however small, are unavoidable throughout the entire analytical workflow. The answer resulting from an analytical method is therefore never an absolute number, but a range of numbers which is likely to contain the true value. Statistics can be used to determine this range, or confidence interval. The confidence interval can then be taken into account when drawing conclusions on the analytical questions.

(Luc) Massart (1941-2005, Belgium)

Luc Massart was professor of Analytical Chemistry at the Pharmaceutical Institute of the Vrije Universiteit Brussels (Belgium) from 1974. By then he had already become one of the instigators of, and driving forces behind, the emerging field of chemometrics. He must be credited for bringing chromatography and chemometrics together, applying chemometric techniques to data analysis and optimization. Apart from his numerous scientific achievements, Luc Massart was a brilliant teacher and lecturer. He taught the fundamentals and applications of chemometrics to numerous analytical chemists and separation scientists.

In the course Chemometrics & Statistics, we will learn about various statistical methods that we can use to convert our data into meaningful information. We will also devote attention to signal processing.

2. Definitions #

We start by establishing a number of definitions. Even though analytical chemistry is rooted in statistics, it can be easy to confuse terms.

Analytical Separation Science by B.W.J. Pirok and P.J. Schoenmakers
READ SECTION 9.1.2

Variables and Data Order

2.1. Populations, samples and objects #

For instance, an analytical question can concern the determination of the concentration of a given pesticide in a lake. In this context, all the water in the lake is the population, and the fraction that we measure is the sample. In practice, taking a sample of water from the lake can mean that several buckets are acquired. The sample then comprises several objects.

2.2. Variables #

Here, the pesticide concentration is an example of a variable. When a method only measures one variable we refer to it as a univariate method, whereas it is a multivariate method if several variables are measured (e.g. the concentrations of different pesticides).
The variables themselves can also be divided into several categories. When a variable can be expressed with words that cannot be ranked, we refer to it as a nominal variable. An example is colour, whose values “blue”, “red” and “yellow” cannot be ranked. Conversely, ordinal variables are expressed with words that can be ranked, such as the command of a language (“Good” or “Poor”). Finally, interval variables are expressed by numbers and have a defined zero point, such as temperature or pressure.
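The three categories can be illustrated with a short Python sketch (all variable names and values are hypothetical):

```python
# Nominal: labels without any meaningful order
colour = ["blue", "red", "yellow"]

# Ordinal: labels with a meaningful rank, here encoded explicitly
language_command = {"Poor": 0, "Fair": 1, "Good": 2}
assert language_command["Good"] > language_command["Poor"]

# Interval: numeric values on a scale with a defined zero point,
# e.g. hypothetical pressure readings in bar
pressures = [1.00, 1.02, 0.98]
mean_pressure = sum(pressures) / len(pressures)  # arithmetic is meaningful
```

Note that averaging only makes sense for the interval variable; for nominal and ordinal variables we can at most count or rank the values.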

Which of the following options is a nominal variable? More options may be correct.

2.3. Order of instruments #

Our analytical instrumentation measures one or several observed variables at the same time. Using statistical models and other methods, observed variables can be converted into the properties of interest that we want to determine. The latter are examples of latent variables, which are variables that can only be determined indirectly through mathematical or statistical models.
| Instrument Order | Tensor Order | Example |
| --- | --- | --- |
| Zero | Zero (number) | pH meter |
| First | First (vector) | UV-Vis spectrometer |
| Second | Second (matrix) | GC-MS |
| Third | Third (block) | LC×LC-MS |

When an instrument produces a single measured value, this yields a 0th-order tensor. An example of this is a pH meter. The tensor order represents the dimensionality of the data, and the order of the analytical instrument is closely related to the tensor order.

For example, a UV-Vis spectrophotometer is an example of a first-order instrument, as it measures a series of variables (wavelengths) to produce a spectrum. The spectrum is a series, or array, of values that we also refer to as a vector. A row vector is a horizontal array of values, whereas a column vector is a vertical array of values. The latter distinction will become important once we start to process our data in a high-level programming language.
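The row/column distinction can be illustrated with NumPy, a common choice for this kind of data processing (the absorbance values below are made up):

```python
import numpy as np

# Hypothetical absorbances at four wavelengths: first-order data
spectrum = np.array([0.12, 0.45, 0.33, 0.08])  # plain 1-D array

row_vector = spectrum.reshape(1, -1)     # shape (1, 4): horizontal
column_vector = spectrum.reshape(-1, 1)  # shape (4, 1): vertical

print(row_vector.shape)     # (1, 4)
print(column_vector.shape)  # (4, 1)
```

Many matrix operations (e.g. multiplying a spectrum by a calibration matrix) only work when the orientation is correct, which is why the distinction matters in practice.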

The dimensionality (tensor order) of the data depends on the order of the instrument. The tensor order strongly affects data processing strategies.

A second-order instrument produces a matrix (i.e. a plane, or surface) of data. An example is GC-MS, where a mass spectrum is measured as function of time. Similarly, a third-order instrument will produce a block of data, such as is the case with LC×LC-MS.
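As a sketch, the four tensor orders from the table above can be mimicked with NumPy arrays (all shapes and numbers are illustrative, not real acquisition sizes):

```python
import numpy as np

ph_value = np.array(7.2)               # zero-order: a single number
uv_spectrum = np.zeros(200)            # first-order: vector (wavelengths)
gcms_run = np.zeros((1500, 400))       # second-order: matrix (time x m/z)
lclcms_run = np.zeros((60, 120, 400))  # third-order: block (1D time x 2D time x m/z)

for data in (ph_value, uv_spectrum, gcms_run, lclcms_run):
    print(data.ndim)  # tensor order: 0, 1, 2, 3
```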

Rank the instruments below by their instrument order from highest to lowest. This means that the instrument that produces third-order data should be on top.

3. Errors in quantitative analysis #

Now that we have established some key definitions, we’ll turn to the data. We will learn about the types of error that can arise and see how repeating our measurements is an effective way to evaluate our error.

Analytical Separation Science by B.W.J. Pirok and P.J. Schoenmakers
READ SECTION 9.2.1

Repeated Measurements: Errors

3.1. Random and systematic errors #

As concluded before, all analytical methods and instruments produce errors. We can define this as

e_i=x_i-\mu_{0}
The equation demonstrates that each measured value x_i comprises the true value \mu_{0} and an error e_i. Or, conversely, that the error is equal to the difference between the measured value and the true value. We can refine the definition of the error further by defining \mu as the mean of the measurements if we were to repeat them an infinite number of times. We then obtain
e_i=(x_i-\mu)+(\mu-\mu_{0})
We have now made a distinction between two types of error. The random error (x_i-\mu) represents the variation between the repeated measurements. For a single value it is defined as the difference between the value and the mean of an infinite number of measurements. It is also known as the precision of the method. It can only be obtained through repeated measurements and is normally measured as a standard deviation. We will learn more about the latter in the next lesson.
The systematic error is the difference between the mean of an infinite number of measurements and the true value (\mu-\mu_{0}). It causes the population of repeated measurements produced by the instrument to be shifted to either side of the true value. It is therefore also referred to as the bias of, in this case, the method.
The total error is the sum of the random and systematic errors and is also referred to as the accuracy.
| Concept | Definition |
| --- | --- |
| Precision | The closeness of agreement between the results obtained by applying the experimental procedure several times under prescribed conditions. Random error; normally measured as a standard deviation. |
| Bias | Difference between the value obtained after infinite measurements and the true value. Also defined as the trueness, or systematic error. |
| Accuracy | Difference between the result of a determination and the true value. The total error. |
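A small simulation can make the decomposition e_i = (x_i - \mu) + (\mu - \mu_0) tangible. The sketch below uses made-up values for the true value, bias and spread, and estimates the bias and the precision (standard deviation) from repeated simulated measurements:

```python
import random

random.seed(1)

mu0 = 10.00   # hypothetical true value
bias = 0.15   # systematic error (mu - mu0), chosen for illustration
sigma = 0.05  # spread of the random error

# Simulate repeated measurements: x_i = mu0 + bias + random error
measurements = [mu0 + bias + random.gauss(0.0, sigma) for _ in range(10_000)]

# Estimate the systematic error from the sample mean
mean = sum(measurements) / len(measurements)
estimated_bias = mean - mu0

# Estimate the precision as the sample standard deviation
precision = (sum((x - mean) ** 2 for x in measurements)
             / (len(measurements) - 1)) ** 0.5

print(f"estimated bias: {estimated_bias:.3f}")  # close to 0.15
print(f"precision:      {precision:.3f}")       # close to 0.05
```

Note that the simulation can recover the bias only because we know mu0 here; for a real method, the true value is unknown and the bias must be assessed with reference materials, as discussed below.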

3.2. Repeatability and reproducibility #

For random errors one additional important distinction must be made. The figure below shows several sets of measurements for different labs. For lab A (depicted in pink), a distinct mean value \mu_{A} is denoted. \mu_{A} differs from \mu in that the latter now represents the mean of the repeated measurements from all labs.

According to the regulatory definitions of the European Medicines Agency (EMEA), we refer to repeatability when all repeated measurements are conducted under the exact same operating conditions over a short period of time. The intermediate precision is the precision within one laboratory but under different conditions (such as a different analyst, day or instrument). Reproducibility refers to the comparison of measurements obtained from different laboratories.
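The difference between repeatability and reproducibility can be sketched with a toy simulation of three hypothetical labs, each with its own laboratory bias (all numbers are invented):

```python
import random
import statistics

random.seed(7)
mu0 = 5.0  # hypothetical true value

# Each lab has its own laboratory bias and performs five repeated
# measurements under repeatability conditions (same small random spread).
lab_biases = {"A": 0.10, "B": -0.05, "C": 0.20}
results = {
    lab: [mu0 + b + random.gauss(0.0, 0.02) for _ in range(5)]
    for lab, b in lab_biases.items()
}

# Repeatability: spread within one lab under identical conditions
repeatability_A = statistics.stdev(results["A"])

# Reproducibility: spread over all labs pooled together
all_values = [x for xs in results.values() for x in xs]
reproducibility = statistics.stdev(all_values)

# The between-lab biases inflate the pooled spread
print(repeatability_A < reproducibility)  # True
```

Because each laboratory bias shifts that lab’s whole set of results, the pooled standard deviation across labs is larger than the within-lab standard deviation: reproducibility is generally poorer than repeatability.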
EXERCISE

The figure below schematically shows the data from different labs. The glowing markers each represent one of the key concepts that is discussed in parts 3.2 and 3.3 of this lesson. Do you know which is which? Choose from systematic error, laboratory bias, method bias, random error, precision, accuracy, repeatability and reproducibility. You can check your answers by hovering over one of the markers. Note that laboratory and method bias are discussed below.

3.3. Sources of bias #

Bias can originate from different sources. When the systematic error arises due to a flaw in the method itself, then it is considered method bias. If the systematic error is caused by the laboratory then we refer to it as laboratory bias.

Specify for which items the bias arises from the laboratory. This means that the items that you leave unselected are cases of method bias.

We have seen the true value \mu_{0} popping up several times now. How do we even know this value if each instrument and method makes mistakes? Unfortunately, we will never know the true value. Instead, we use references to calibrate our system to a relative true value.

If a laboratory has a reference sample or certified reference material available, then the bias can be measured. If this is unavailable, a sample can be distributed to other laboratories or even measured using different techniques, so that a conventional true value can be established. These reproducibility studies tend to be very costly.

Figure: A) Example of a scheme to determine bias. B) Interlaboratory trials allow a conventional true value to be established for a reference sample. This allows each laboratory to determine its bias relative to each other.

It is clear that there is little we can do about systematic errors in our data except to be aware of them. For the majority of the remainder of this course we will therefore focus on the random errors, which we can take into account. From now on we will assume that systematic errors are absent.
