Learning Goals #
- Understand why DoE is superior to traditional approaches for reaching analytical objectives.
- Be familiar with the principles of parameter-screening designs, optimization designs, and empirical modelling.
- Predict optimal parameter settings.
Read Section 9.9.1: Principal-Component Analysis.
1. Introduction #
Most of the statistical treatment in earlier classes of this course (and in the first five modules of Chapter 9) has centred on univariate statistics, i.e. we focused on populations across a single variable. However, in the classes on error propagation (Lesson 14; Module 9.7 in the book) and on Design-of-Experiments strategies (Lesson 15; Module 9.8) we started to expand our view to the simultaneous effects of several variables (x_1, x_2, \ldots) on an outcome y. Multivariate statistics is a discipline of its own, which we will not discuss in this first course.
More interesting at this point are multivariate methods that have proven valuable for analytical chemists in treating and interpreting multidimensional data. Applications include data (pre-)treatment, calibration, modelling, classification, clustering, and more. We will start with a brief treatment of the archetype of such methods: principal-component analysis (PCA).
2. Principal component analysis (PCA) #
Like so many things in life, principal-component analysis (PCA) can be understood by thinking about chocolate. The basic concept is illustrated in Figure 1. Figure 1A shows a chocolate bar floating in three-dimensional space. In Figure 1B mean centring is applied (but no scaling) by moving the centre of gravity of the bar to the origin. The axes of the three spatial dimensions are the variables x_1, x_2, and x_3, which become x'_1, x'_2, and x'_3 after mean centring.
The concept of PCA now relates to finding the direction with the greatest variance in chocolate observations (the greatest spread of chocolate), which is in the direction of the length of the bar. This we call the first principal component or \text{PC}_1. Next we look at the direction that is perpendicular to \text{PC}_1 and features the greatest spread in chocolate, which is the width of the chocolate bar (\text{PC}_2). The final principal component (\text{PC}_3) is the only direction that is perpendicular to both \text{PC}_1 and \text{PC}_2. In our case, this is the thickness of the chocolate bar. In three-dimensional space there are three principal components (or, generally, in n-dimensional space, n principal components can be established). In a new space, spanned by \text{PC}_1, \text{PC}_2, and \text{PC}_3, the chocolate bar is neatly positioned, with length, width, and thickness in the directions of the axes.
In terms of the original variables, the principal components are linear combinations of the form
y_i = a_i x_1' + b_i x_2' + c_i x_3'
where i is the number of the PC. [Note: There is no intercept, because of the mean centring performed.]
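The chocolate-bar picture can be reproduced numerically. The following is a minimal NumPy sketch, not anything prescribed by the text: the point cloud, its proportions, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "chocolate bar": a 3-D point cloud that is long in one
# direction, wide in a second, and thin in the third (illustrative data).
X = rng.normal(size=(500, 3)) * np.array([10.0, 4.0, 1.0])

# Mean centring: move the centre of gravity to the origin.
Xc = X - X.mean(axis=0)

# The principal components are the eigenvectors of the covariance matrix.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# np.linalg.eigh returns eigenvalues in ascending order, so reorder
# from largest to smallest variance: PC1, PC2, PC3.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each PC score is the linear combination y_i = a_i x_1' + b_i x_2' + c_i x_3',
# with (a_i, b_i, c_i) the entries of the i-th eigenvector.
scores = Xc @ eigvecs
print(eigvals)  # variances decrease: PC1 > PC2 > PC3
```

With these made-up proportions, the first eigenvalue should come out near 100 (the squared "length" scale of the bar), and each subsequent PC carries less variance.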
Unfortunately, we are not usually dealing with chocolate. Instead, we have a number of observations y at specific settings of n variables (x_1, x_2, \ldots, x_n), where n may be (much) larger than three. For example, the variables may be a series of wavelengths and the values of the variables may be the absorbance at each wavelength. In that case each point is a spectrum. When dealing with spectra, it is prudent to normalize the intensity (e.g. to a maximal absorbance of 1), because spectra from a single compound do not coincide in a single point, but rather on a straight line through the origin, with the distance to the origin proportional to the concentration.
Note that this automatically scales all variables, which is generally recommended prior to performing PCA. Common ways to achieve this are mean centring and autoscaling. Mean centring subtracts the average of each variable; autoscaling additionally divides the result by the standard deviation of that variable, i.e.
x'_j = \frac{x_j - \bar{x}_j}{s_j}
where \bar{x}_j and s_j are the mean and standard deviation of variable x_j.
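In code, mean centring and autoscaling amount to one subtraction and one division per column. A sketch with a small hypothetical data matrix (the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical data matrix: rows are samples, columns are variables.
# The second variable has a much larger scale than the first.
X = np.array([[1.0, 200.0],
              [2.0, 240.0],
              [3.0, 280.0],
              [4.0, 320.0]])

# Mean centring: subtract the column means.
X_centred = X - X.mean(axis=0)

# Autoscaling: additionally divide by the column standard deviations,
# so every variable ends up with mean 0 and standard deviation 1.
X_auto = X_centred / X.std(axis=0, ddof=1)
```

After autoscaling, both variables contribute on the same footing, so a PCA on X_auto is no longer dominated by the variable with the largest numerical range.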
Sets of spectra form a good example of the complex types of data analytical chemists are dealing with. For example, UV spectra may be recorded with a diode-array spectrometer, in which the spectrum is collected in 256 channels, corresponding to 256 variables (x_1, x_2, \ldots, x_{256}). We cannot picture a 256-dimensional space, but we can deal with it mathematically to reap one of the great benefits of PCA, i.e. reducing the number of dimensions. \text{PC}_1 in the 256-dimensional space is the linear combination of variables that harbours the greatest variance in absorbance, or the greatest amount of spectral information. \text{PC}_2 is the vector that is perpendicular to \text{PC}_1 and represents the greatest amount of remaining variance, etc. Every subsequent PC represents less of the variance, and almost all of the total variance can usually be captured in a limited number of principal components. In other words, almost all information can be grasped in a small number of dimensions.
When we project all spectra on the plane formed by \text{PC}_1 and \text{PC}_2, we obtain the most informative plot of the positioning of the spectra in the 256-dimensional space. Similar spectra will be positioned at similar locations. This hints at another major application of PCA, i.e. classification of objects (such as spectra).
Mathematically, the principal components are eigenvectors of the covariance matrix of the data, so they can be established by performing an eigen-decomposition of that matrix. The underlying linear algebra is beyond the scope of the present course; numerous software packages feature tools to perform PCA.
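The eigen-decomposition route, including the projection onto the \text{PC}_1–\text{PC}_2 plane, takes only a few lines in NumPy. The data below are made up for illustration (30 samples of 5 correlated variables); all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data: 30 samples of 5 variables driven by 2 latent factors.
latent = rng.normal(size=(30, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + rng.normal(scale=0.1, size=(30, 5))

# Principal components are the eigenvectors of the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # descending variance
eigvecs = eigvecs[:, order]

# The eigenvectors are orthonormal, i.e. the PCs are mutually perpendicular.
assert np.allclose(eigvecs.T @ eigvecs, np.eye(5))

# Projecting the samples on PC1 and PC2 gives the 2-D "scores" used for
# the most informative two-dimensional plot of the data.
scores_2d = Xc @ eigvecs[:, :2]
print(scores_2d.shape)  # (30, 2)
```

In practice one would typically call a packaged routine (e.g. a PCA function in a statistics or chemometrics toolbox) rather than code the decomposition by hand, but the result is the same.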