
13. Model Validation

Updated on February 9, 2026

Modelling and calibration are only useful if the model reliably predicts new measurements. In this lesson, we extend the concepts from the previous lesson, least-squares regression and calibration-model variance, to evaluate how well a calibration model performs when applied to new data. We explore the key ideas behind model validation, including residual analysis, lack-of-fit testing, and the assumptions underlying linear regression models. Building on our understanding of predicted and experimental confidence intervals, we learn how to judge whether a model is appropriate for the analytical question at hand. This forms the bridge between constructing a model and trusting it, setting the stage for more advanced topics in method performance and analytical quality.

Learning Goals #

  • Understand the purpose of model validation and why regression models must be validated before being used for quantitative analysis
  • Use residual analysis as a diagnostic tool by interpreting residual plots
  • Assess the goodness-of-fit and overall model performance
  • Connect model validation to confidence intervals
Analytical Separation Science by B.W.J. Pirok and P.J. Schoenmakers
READ SECTIONS 9.6.4-9.6.4.1

VALIDATION & INSPECTION OF RESIDUALS

1. Inspecting the data #

To investigate whether our model is actually a good model, we need to consider the two factors that affect this: the data, and the model itself (i.e. the mathematical equation used for regression). We’ll start with the data.

Case: Retention Model

For this lesson we return to the retention model that we have been fitting and investigating in Lesson 11 and Lesson 12.

1.1. Residuals #

A straightforward way to assess how well a model describes the data is by examining the residuals, defined as e_i=y_i−\hat{y}_i (see also Lesson 11).

This is illustrated in Figure 1, where panel A shows our retention model from the previous lessons and panel B the residuals.

Figure 1. Illustration of a fitted regression model (A) and the corresponding residuals (B). Residual plots provide a simple visual check of model quality: random scatter around zero suggests an adequate fit, while patterns or unusually large points signal possible problems in the model or data.

From the code we developed over the last few lessons, it is easy to plot and investigate the residuals.

				
					plot(x,y-y_hat,'o'); 
				
			

An example file can be downloaded here (CS_08_OneWayANOVA, .XLSX). See below for further instructions.

				
					scatter(x, y .- y_hat)   # residual plot with circular markers
				
			

Residual plots make it easy to detect systematic deviations that may not be immediately obvious from the fitted curve alone. In Figure 2-A1, corresponding to the straight-line fit of Figure 1A, a clear pattern appears in the residuals, suggesting that the linear model fails to capture the true relationship. In contrast, Figure 2-B1 shows a random scatter of residuals without visible structure, which is characteristic of a well-fitting model.

Figure 2. Residual plot of a (A1) straight-line, and (B1) second-order polynomial fitted to the retention data with three repeats for each \varphi instead of one. Panel C1 reflects the same as panel B1, but one datapoint was adjusted to be an outlier (datapoint = 3.3 was changed to 2.2). The dotted line represents a perfect description of the plotted data by the model. Panels A2-C2 reflect the Durbin-Watson test for serial correlation, or autocorrelation.


Analytical Separation Science by B.W.J. Pirok and P.J. Schoenmakers
READ SECTIONS 9.6.4.2-9.6.4.3

AUTOCORRELATION, INFLUENCE & LEVERAGE

1.2. Autocorrelation #

When residuals show a pattern in which neighbouring errors tend to move together, the model may suffer from autocorrelation (also called serial correlation). This means that the error in one datapoint is not independent of the next, which often signals that the model structure is inadequate.

Autocorrelation can be formally assessed using the Durbin–Watson test, which evaluates whether consecutive residuals differ more or less than expected.

Equation 9.96: \text{DW}_{\text{obs}}=\frac{\sum^n_{i=2}(e_i - e_{i-1})^2}{\sum^n_{i=1} e^2_i}

				
					% Outlier Test (Example)
isoutlier(y-y_hat)

% Durbin-Watson Test
dwtest(y-y_hat,X_matrix)
				
			


				
					using Statistics

function isoutlier_iqr(x; k=1.5)
    q1 = quantile(x, 0.25)
    q3 = quantile(x, 0.75)
    iqr = q3 - q1
    lower = q1 - k*iqr
    upper = q3 + k*iqr
    return (x .< lower) .| (x .> upper)
end

function durbin_watson(residuals)
    # Equation 9.96
    dw = sum(diff(residuals).^2) / sum(residuals.^2)

    println("Durbin–Watson test")
    println("-------------------")
    println("H0: no first-order autocorrelation")
    println("DW statistic = ", round(dw, digits=4))

    if isapprox(dw, 2; atol=0.5)
        println("≈ no autocorrelation")
    elseif dw < 2
        println("positive autocorrelation likely")
    else
        println("negative autocorrelation likely")
    end

    return dw
end

res      = y .- y_hat

# Outlier Test (Example)
outliers = isoutlier_iqr(res)

# Durbin-Watson Test
dw_stat  = durbin_watson(res)
				
			

The \text{DW}_{\text{obs}} test statistic ranges from 0 to 4. A value around 2 indicates no autocorrelation, values below 2 suggest positive autocorrelation, and values above 2 indicate negative autocorrelation. As a rule of thumb, values between about 1.5 and 2.5 are acceptable, while values below 1 or above 3 may warrant closer inspection. Most computational implementations also report a corresponding p-value.
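As a quick numerical illustration of these rules of thumb, the statistic of Equation 9.96 can be evaluated on two short synthetic residual series (the numbers below are invented purely for illustration):

```julia
# Sketch: Durbin–Watson statistic (Equation 9.96) on synthetic residuals.
dw(e) = sum(diff(e).^2) / sum(e.^2)

# Slowly varying residuals: neighbours move together (positive autocorrelation).
e_pattern = [1.0, 0.8, 0.5, 0.1, -0.4, -0.8, -0.5, 0.2, 0.7, 1.0]

# Sign-flipping residuals: neighbours oppose each other (negative autocorrelation).
e_flip = [0.3, -0.5, 0.4, -0.2, 0.6, -0.4, 0.1, -0.3, 0.5, -0.6]

dw(e_pattern)  # well below 2
dw(e_flip)     # well above 2
```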

EXERCISE 1: OUTLIERS & AUTOCORRELATION

Conduct outlier tests on the residuals for the straight-line retention model that you created in the previous lessons. Do you find any outliers? Be sure to also plot the outliers yourself and try to set your expectations. Do you actually expect there to be an outlier? Run the test with an \alpha of 0.05.

Now apply the Durbin–Watson test to determine whether the same data and model suffer from autocorrelation. Then specify which of the following statements are correct.

Examples of values are shown in Figure 2-A2 through C2. The Durbin–Watson result in Figure 2-A2 (\text{DW}_{\text{obs}} = 0.66, p < 0.05) confirms the strong autocorrelation already visible in panel A1. For the dataset in panel B, no serial correlation is detected, indicating that the model adequately captures the structure in the data. The test is not reliable in the presence of outliers, so the values shown for panel C2 should be disregarded.

1.3. Influence #

Not all datapoints contribute equally to a regression model. Some points “sit” far from the rest or strongly “pull” the fitted line, meaning that removing them would noticeably change the model.

As unusual points can distort the regression or be mistaken for outliers, it is useful to quantify how much each datapoint affects the model (influence) and how unusual its position is (leverage). These diagnostics help us understand whether an apparent outlier is truly problematic or simply a structurally important point in the dataset.

				
					% Cook's Distance (Example)
mdl = fitlm(x,y);
plotDiagnostics(mdl,'cookd');
				
			


				
					using GLM, StatsModels, DataFrames, Plots

# Fit the straight-line model
df          = DataFrame(x=x, y=y)
mdl         = lm(@formula(y ~ x), df)

# Cook's distance
cd          = cooksdistance(mdl)

# Plot the data
scatter(cd, xlabel = "Observation", ylabel = "Cook's Distance", title = "Cook's Distance", legend = false)

# Add a reference line
n           = length(cd)
hline!([4/n])
				
			

Cook’s distance measures how much a single datapoint alters the regression when removed. It quantifies the influence of datapoint r by comparing the model predictions with and without point r:

Equation 9.97: \text{CD}_r^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \hat{y}_{i,\lnot r})^2}{(m+1)\, s_e^2}

A large value (e.g. \text{CD}^2_r > 1 for small datasets, say n < 10) indicates that point r is influential and may warrant further inspection. Different thresholds exist, such as \text{CD}^2_r > 4/n, but the core idea is simple: a large influence means that the model depends heavily on that datapoint.

Logical NOT Symbol

Equation 9.97 shows \hat{y}_{i,\lnot r}. Although \lnot is normally the logical “NOT” symbol, it is used here simply as a shorthand to mean “without”. We therefore interpret \hat{y}_{i,\lnot r} as “the predicted value \hat{y} at point i using the model that was constructed without point r”. This is then compared to the prediction \hat{y}_{i} from the model that includes point r. Note: in many statistics texts, this is written instead as \hat{y}_{i(-r)}.
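Equation 9.97 can also be evaluated directly by refitting the model n times, leaving one point out each time. The sketch below uses a small set of hypothetical (x, y) values with a deliberately off-trend last point; it is not the retention dataset from the lessons.

```julia
using LinearAlgebra

# Hypothetical example data; the last point is deliberately off-trend.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.1, 1.1, 1.9, 3.2, 8.0]

X = hcat(ones(length(x)), x)         # straight-line design matrix
b = X \ y                            # full-model coefficients
yhat = X * b
n, q = size(X)                       # q = m + 1
s2 = sum((y .- yhat).^2) / (n - q)   # residual variance s_e^2

# Cook's distance (Equation 9.97) by explicit leave-one-out refits
CD = map(1:n) do r
    keep = setdiff(1:n, r)
    b_r  = X[keep, :] \ y[keep]            # model fitted without point r
    sum((yhat .- X * b_r).^2) / (q * s2)   # compare predictions at all n points
end
```

The off-trend fifth point ends up with by far the largest Cook's distance, flagging it for closer inspection.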

1.4. Leverage #

Leverage reflects how far a datapoint lies from the other x-values. It is obtained from the diagonal of the hat matrix

Equation 9.98: \textbf{H}=\textbf{X} \left(\textbf{X}^{\text{T}} \textbf{X}\right)^{-1} \textbf{X}^{\text{T}}

A point meeting the criterion h_{r,r}>2(m+1)/n is considered high-leverage and has more potential to affect the fitted line, even when its residual is small.

				
% Leverage (Example)
mdl = fitlm(x,y);
plotDiagnostics(mdl,'leverage');
				
			


				
					using LinearAlgebra, Plots

# Design matrix for the straight-line model
X = hcat(ones(length(x)), x)

# Hat matrix and its diagonal: the leverages (Equation 9.98)
H = X * inv(X' * X) * X'
h = diag(H)

# Plot the leverages
scatter(h, xlabel = "Observation", ylabel = "Leverage", title = "Leverage", legend = false)

# Add the high-leverage threshold 2(m+1)/n
n = size(X, 1)
q = size(X, 2)          # q = m + 1
hline!([2q / n])
				
			

We also saw this in Figure 3B of Lesson 11, where the datapoint highlighted with the arrow in the upper-right corner showed a leverage of 1 (at n = 57 and m = 1; straight-line model). In contrast, the other datapoints had a leverage of 0.03, well below the threshold of 0.07 for that dataset.

EXERCISE 2: INFLUENCE & LEVERAGE

Which of the following statements is or are correct about leverage and influence?

Analytical Separation Science by B.W.J. Pirok and P.J. Schoenmakers
READ SECTIONS 9.6.4.4-9.6.4.5

Coefficient of Determination & F-Test of Significance

2. Model Evaluation #

In the first part of this lesson we addressed ways to validate the data used to construct the model. However, it is equally useful to evaluate the mathematical equation used for the regression.

2.1. Coefficient of Determination #

A very common and well-known metric for this purpose is the coefficient of determination, or R^2.

Equation 9.100: R^2 = \frac{SS_{\mathrm{reg}}}{SS_{\mathrm{tot}}} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}}

The R^2 quantifies the fraction of the total variation explained by the model by comparing the squared residuals of the model (Figure 3A) versus the total sum of squares of the datapoints compared to the mean (Figure 3B).

Figure 3. Graphical expression of the \text{SS}_{\text{res}} and the \text{SS}_{\text{tot}} components of the coefficient of determination (R^2).
Linearization

At this stage it is relevant to point out that many retention models are non-linear in their natural form (e.g. \hat{k}=\exp(b_0 + b_1 \varphi)). By taking the natural logarithm, we obtain a linearized model for \ln{\hat{k}}, which allows us to use ordinary least-squares regression. However, a key implication is that the fitted model minimizes SS_{\text{res}}=\sum^n_{i=1}(\ln{k_i}-\ln{\hat{k}_i})^2 rather than the residuals in k-space. In other words, the optimization is performed on the transformed variable, not on the original retention factors. For a discussion of why this matters and possible alternatives, see the textbook Section 9.10.2. For now, the concepts treated in this lesson remain valid.
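As a minimal sketch of this linearization (with invented \varphi and k values, not the lesson's dataset), fitting \ln{k} with ordinary least squares looks like:

```julia
# Sketch: linearizing k̂ = exp(b0 + b1·φ) by fitting ln k with OLS.
# φ and k below are hypothetical example data.
φ = [0.2, 0.3, 0.4, 0.5, 0.6]
k = [20.1, 9.8, 5.2, 2.4, 1.3]

X = hcat(ones(length(φ)), φ)
b = X \ log.(k)          # minimizes Σ(ln k − ln k̂)² in ln-space

k_hat = exp.(X * b)      # back-transformed predictions in k-space
```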

2.2. Adjusted coefficient of determination #

A limitation of R^2 is that it always increases when more parameters are added to the model, even if they do not meaningfully improve the fit. The adjusted coefficient of determination, R^2_{\text{a}}, compensates for this by incorporating the degrees of freedom, effectively penalizing unnecessary parameters and providing a more reliable measure of explained variance.

Equation 9.101: R_a^2 = 1 - \frac{SS_{\mathrm{res}} / \left(n - (m+1)\right)}{SS_{\mathrm{tot}} / (n-1)}

Figure 4. Comparison of polynomial models of increasing complexity fitted to the same dataset. Although R^2 increases as additional terms are added (A to D), this does not necessarily mean the model improves. The adjusted R^2 provides a more balanced assessment by penalizing unnecessary parameters. Panels C and D illustrate how extra polynomial terms may yield only minimal improvement, highlighting the risk of overfitting when model complexity grows without meaningful gain in predictive power.

Figure 4 compares how the coefficient of determination keeps on increasing as more parameters (i.e. degrees of freedom) are added. This is also true for the adjusted coefficient of determination, but much less so.

				
					SSE_regression = sum((y_hat-mean(y)).^2);
SSE_residuals  = sum((y_hat-y).^2);
SSE_total      = sum((y-mean(y)).^2);
R2             = SSE_regression/SSE_total;

MSE_residuals = SSE_residuals / (size(X_matrix,1) - size(X_matrix,2));
MSE_total      = SSE_total/(size(x,1)-1);
R2_a           = 1-MSE_residuals/MSE_total;
				
			


				
					using Statistics

SSE_regression  = sum((y_hat .- mean(y)).^2)
SSE_residuals   = sum((y_hat .- y).^2)
SSE_total       = sum((y .- mean(y)).^2)

MSE_residuals   = SSE_residuals / (size(X_matrix,1) - size(X_matrix,2));

R2          = SSE_regression / SSE_total

MSE_total   = SSE_total / (length(x) - 1)

R2_a        = 1 - MSE_residuals / MSE_total
				
			

Model validation is essential because it checks whether a fitted model truly generalizes beyond the data used to build it. A model may appear to fit the training data extremely well, yet fail to predict new data accurately. This problem, known as overfitting, occurs when the model becomes too complex and starts capturing noise rather than the underlying trend.

Without proper validation, an overfitted model can give a false sense of accuracy, perform poorly on future measurements, and lead to incorrect analytical decisions. Validation helps ensure that the model is reliable, robust, and suitable for real-world use.
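The behaviour shown in Figure 4 is easy to reproduce: for nested polynomial models fitted by least squares, R^2 can only increase with the order m, while R^2_{\text{a}} applies the degrees-of-freedom penalty of Equation 9.101. A sketch with hypothetical, essentially linear data:

```julia
using Statistics

# Hypothetical, essentially linear example data.
x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y = [1.0, 1.4, 2.1, 2.4, 3.1, 3.4, 4.1]

# R² and adjusted R² for a polynomial of order m (Equations 9.100 and 9.101).
function r2_pair(m)
    X     = hcat((x .^ p for p in 0:m)...)   # polynomial design matrix
    yhat  = X * (X \ y)
    SSres = sum((y .- yhat).^2)
    SStot = sum((y .- mean(y)).^2)
    n, q  = size(X)                          # q = m + 1
    R2    = 1 - SSres / SStot
    R2a   = 1 - (SSres / (n - q)) / (SStot / (n - 1))
    return (R2, R2a)
end

[r2_pair(m) for m in 1:3]   # R² rises monotonically; adjusted R² need not
```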

2.3. F-test of significance #

A more sensitive way to evaluate whether a regression model should include an extra term is the F-test of significance, which uses the same concepts as treated in Lesson 7. In our example, the quadratic model fits the retention data better than the straight line because of the added b_2 x^2 term. The F-test checks whether this extra term is statistically meaningful, with H_0 stating that the added term is not significant and H_1 that it is.

The statistic is calculated as

Equation 9.102: F_{\text{obs}} =\frac{\text{MS}_{\text{difference}}}{\text{MS}_{\text{res, full}}}=\frac{\left(\text{SS}_{\text{res, reduced}} - \text{SS}_{\text{res, full}} \right)/ \left( q_{\text{full}} - q_{\text{reduced}} \right)}{\text{SS}_{\text{res, full}}/ \left( n - q_{\text{full}} \right)}

Here, q refers to the total number of parameters in the \text{full} (i.e. including the extra parameter) or \text{reduced} (i.e. without the extra parameter) model, respectively. Note that q=m+1.

Figure 5. Comparison of first-order (A) and second-order (B) polynomial retention models fitted to the same \ln{k} versus \varphi data. The straight-line model in panel A captures the general downward trend but shows systematic deviations, as indicated by the confidence bands widening toward higher \varphi. The quadratic model in panel B more accurately follows the curvature in the data and yields narrower, more symmetric confidence bands, illustrating the improvement gained by adding a second-order term.
				
					% Two Example Models
X_mat_red  = ones(length(x),1);       % REDUCED: y=b0
X_mat_full = [ones(length(x),1) x];   % FULL:    y=b0+b1x

% Regression
b_red      = pinv(X_mat_red'*X_mat_red)*X_mat_red'*y;
b_full     = pinv(X_mat_full'*X_mat_full)*X_mat_full'*y;
y_hat_red  = X_mat_red * b_red;
y_hat_full = X_mat_full * b_full;

% F-Test Of Significance
DoF_red    = size(X_mat_red,1) - size(X_mat_red,2); 
DoF_full   = size(X_mat_full,1) - size(X_mat_full,2); 
MS_diff    = sum((y_hat_full - y_hat_red).^2)./ ...
                (DoF_red - DoF_full); 
MS_full    = sum((y - y_hat_full).^2)/DoF_full;
F_obs      = MS_diff / MS_full;
p          = 1 - cdf('F',F_obs,DoF_red-DoF_full,DoF_full);
				
			


				
					using LinearAlgebra
using Statistics
using Distributions

# Two Example Models
X_mat_red   = ones(length(x), 1)
X_mat_full  = hcat(ones(length(x)), x)

# Regression via the normal equations
b_red       = pinv(X_mat_red' * X_mat_red) * X_mat_red' * y
b_full      = pinv(X_mat_full' * X_mat_full) * X_mat_full' * y

y_hat_red   = X_mat_red * b_red
y_hat_full  = X_mat_full * b_full

# Degrees of freedom
DoF_red     = size(X_mat_red,1)  - size(X_mat_red,2)
DoF_full    = size(X_mat_full,1) - size(X_mat_full,2)

# Mean squares
MS_diff     = sum((y_hat_full .- y_hat_red).^2) / (DoF_red - DoF_full)

MS_full     = sum((y .- y_hat_full).^2) / DoF_full

F_obs       = MS_diff / MS_full

# F-test p-value
p           = 1 - cdf(FDist(DoF_red - DoF_full, DoF_full), F_obs)
				
			

\text{MS}_{\text{difference}} represents the mean square difference between the reduced and full models, i.e. how much the residual sum of squares changes when the extra term is added.

Equation 9.103: \text{MS}_{\text{difference}} =\frac{\sum_{i=1}^{n} (y_i - \hat{y}_{i,\text{reduced}})^2-\sum_{i=1}^{n} (y_i -\hat{y}_{i,\text{full}})^2}{q_{\text{full}} - q_{\text{reduced}}}

Equation 9.103 quantifies the improvement (or lack thereof) gained by moving from the simpler to the more complex model. See Lesson 8 to read again about the mean squares.

EXERCISE 3: MODEL EVALUATION

Use the F-test of significance and the (adjusted) coefficient of determination to compare a constant (\hat{y}=b_0), and a straight line (\hat{y}=b_0+b_1 x) as models. Use your results to decide which of these answers are correct.

Repeat the calculations but now to compare the straight-line model (\hat{y}=b_0+b_1 x) with a second-order (quadratic) polynomial (\hat{y}=b_0+b_1 x+b_2 x^2). Which of the following statements are correct?

Finally, repeat the calculations once more to compare a second-order (quadratic; \hat{y}=b_0+b_1 x+b_2 x^2) and third-order (cubic) polynomial (\hat{y}=b_0+b_1 x+b_2 x^2 + b_3 x^3). Which of the statements is correct?

3. Figures of Merit #

Once a calibration model has been established, we can extract several key performance characteristics, known as analytical figures of merit, that describe the quality of an analytical method. 

3.1. Sensitivity #

A familiar one is sensitivity, defined as the slope b_1 of a straight-line calibration curve. Sensitivity quantifies how strongly the instrument response changes with concentration, and is unrelated to the statistical sensitivity (1-\beta; see Lesson 6). 

3.2. Detection Limits #

In regulated fields, however, the most critical figures of merit are the decision limit, detection limit, and quantification limit, all of which describe how low an analyte concentration can be reliably detected or quantified.

To define these detection-related limits, we start with the blank, a measurement identical to the sample matrix but without analyte. The blank shows a baseline signal \mu_{\text{blank}} and a noise level \sigma_{\text{blank}}, which after enough repeats follows a normal distribution.

The decision limit y_{\text{C}} (also called L_{\text{crit}} or CC_\alpha) is the lowest signal at which we conclude that analyte is present, with a false-positive rate \alpha. It is defined as

Equation 9.105: y_\text{C} = \mu_\text{blank} + k_{\text{C}} \sigma_\text{blank}

typically using k_{\text{C}} = 3, corresponding to \alpha ≈ 0.0013. Because \mu_{\text{blank}} equals the intercept b_0 of the calibration curve, the corresponding concentration can be expressed as x_{\text{C}}=(k_{\text{C}} \cdot \sigma_{\text{bl}})/b_1, or by a regression-based expression that incorporates the variability in the predicted signal. In terms of concentration, this limit can be written as

Equation 9.106: x_{\text{C}} \approx \frac{t_{\alpha,n-2}\, s_e}{b_1} \sqrt{ \frac{1}{g} + \frac{1}{n} +\frac{\bar{x}^{2}}{\sum_{i=1}^n (x_i - \bar{x})^2}}

Figure 6. Graphical explanation of the decision limit (y_\text{C}), detection limit (y_\text{D}), and their corresponding concentration values x_\text{C} and x_\text{D}. Panel D illustrates the blank signal with baseline \mu_\text{bl} and noise \sigma_\text{bl}. Panels B and C show how the distributions shift when analyte is present: at the decision limit, the false-positive rate \alpha is controlled but the false-negative rate \beta remains high, while the detection limit ensures both \alpha and \beta are small. Panel A maps these limits from signal space onto the calibration curve, demonstrating how sensitivity (slope b_1) governs the achievable detection and quantification limits.

3.3. Detection limit #

While the decision limit controls false positives, it still yields a 50% chance of a false negative (\beta = 0.5; see Lesson 6). To address this, the detection limit y_{\text{D}} is defined as

Equation 9.107: y_{\text{D}} = \mu_{\text{bl}} + (k_{\text{C}} + k_{\text{D}})\sigma_{\text{bl}}

with k_{\text{D}} = 3, giving a total multiplier of 6\sigma_\text{bl}. At this level, both \alpha and \beta are ≈ 0.0013, ensuring a statistical power of 0.9987. The corresponding concentration-based detection limit is obtained analogously to the decision limit, but evaluated at the higher signal level. This defines the minimum concentration that can be detected with high confidence, though not necessarily quantified precisely. The limit can also be defined in terms of concentration:

Equation 9.108: x_{\text{D}} \approx \frac{t_{\beta,n-2}\, s_e}{b_1}\sqrt{\frac{1}{g} + \frac{1}{n} + \frac{(2x_{\text{C}} - \bar{x})^{2}}{\sum_{i=1}^n (x_i - \bar{x})^2}}

3.4. Quantification limit #

The quantification limit represents the lowest concentration that can be measured with acceptable precision. It is defined as

Equation 9.109: y_{\text{Q}} = \mu_\text{bl} + 10\sigma_\text{bl}

leading to

Equation 9.110: x_{\text{Q}} = 10 \frac{\sigma_\text{bl}}{b_1}

The multiplier of 10 is chosen because the relative precision at the detection limit is typically around 16%, which is considered too poor for quantitative reporting. The quantification limit instead corresponds to a relative precision of roughly 10%, making it the practical lower boundary for reliable quantification.
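To make the relationships between these limits concrete, the sketch below evaluates Equations 9.105, 9.107, 9.109 and 9.110 for a hypothetical blank and calibration slope (all numbers are invented for illustration):

```julia
# Hypothetical blank statistics and sensitivity (illustrative values only).
μ_bl = 0.020       # blank baseline signal
σ_bl = 0.004       # blank noise
b1   = 0.50        # calibration slope (sensitivity)
kC, kD = 3, 3

y_C = μ_bl + kC * σ_bl          # decision limit (Equation 9.105)
y_D = μ_bl + (kC + kD) * σ_bl   # detection limit (Equation 9.107)
y_Q = μ_bl + 10 * σ_bl          # quantification limit (Equation 9.109)

x_Q = 10 * σ_bl / b1            # concentration-based limit (Equation 9.110)
```

With these numbers, the signal limits come out as y_C = 0.032, y_D = 0.044 and y_Q = 0.060, with x_Q = 0.08 in concentration units.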

Concluding remarks #

In this lesson we explored the essential tools used to judge the quality of a regression model: examination of residuals, detection of patterns such as autocorrelation, assessment of leverage and influence, evaluation of model complexity through R^2 and R^2_{\text{a}}, and the use of statistical tests such as the F-test of significance. Together, these techniques form the core of model validation, the process of ensuring that a chosen mathematical description truly reflects the underlying analytical relationship.

While our focus here remained on ordinary least-squares (OLS) regression using straight-line and polynomial models, it is important to recognize that OLS is only one member of a much larger family of regression approaches. Many analytical situations require models that go beyond the assumptions of constant variance, linearity, or equal weighting of data points.

In future lessons, we will extend these ideas to more advanced regression strategies, including:

  • Weighted regression, used when measurement errors vary across the calibration range.

  • Non-linear regression, needed when relationships cannot be linearized without distorting the error structure.

  • Iterative optimization-based methods, which refine model parameters when no closed-form solutions exist.

  • Multivariate regression techniques, essential when multiple predictors (e.g., spectral intensities, chromatographic features) jointly determine the response.

These approaches build upon the principles introduced here (residual analysis, influence diagnostics, and statistical hypothesis testing) while offering greater flexibility and robustness for complex analytical problems.

Ultimately, mastering model validation allows you not only to fit models, but to trust them, ensuring that your calibration and quantitative measurements rest upon solid statistical foundations.

Extensive Exercise #

EXTENSIVE EXERCISE: VAN DEEMTER

This is an exam-grade question in the MSc. Chemometrics & Statistics course at the University of Amsterdam. It is worth 20 out of 100 pts and should be completed within 35 minutes.

Chromatographic band broadening can be modelled as a function of the flow rate according to the reduced van Deemter equation:

Equation 1.81: h=a+\frac{b}{\nu}+c\cdot\nu

where a, b and c are constants, \nu is the reduced flow velocity and h is the reduced plate height (both \nu and h are dimensionless). Experiments were carried out in a chromatographic system. Plate heights were measured at different reduced flow velocities. The following data was observed:

\nu     h
0.5     3.60
1.0     2.57
1.5     2.48
2.5     2.16
3.0     2.18
5.0     2.37
7.5     2.46
10.0    3.24
				
					x=[0.5, 1.0, 1.5, 2.5, 3.0, 5.0, 7.5, 10.0];
y=[3.60, 2.57, 2.48, 2.16, 2.18, 2.37, 2.46, 3.24];
				
			


				
x = [0.5, 1.0, 1.5, 2.5, 3.0, 5.0, 7.5, 10.0]
y = [3.60, 2.57, 2.48, 2.16, 2.18, 2.37, 2.46, 3.24]
				
			

Fit the reduced van Deemter equation to the data. Consider that all the error is in the measurement of the plate height. Calculate the fitted a, b, and c parameters. Round to two decimals. (5pts)
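Since all the error is in the measurement of h, note that the reduced van Deemter equation is linear in a, b and c, so the ordinary least-squares machinery from this and the previous lessons applies. A minimal sketch of the design matrix (the fit itself and the numerical answers are left to you):

```julia
# The reduced van Deemter model h = a + b/ν + c·ν is linear in its parameters.
ν = [0.5, 1.0, 1.5, 2.5, 3.0, 5.0, 7.5, 10.0]
h = [3.60, 2.57, 2.48, 2.16, 2.18, 2.37, 2.46, 3.24]

X = hcat(ones(length(ν)), 1 ./ ν, ν)   # columns correspond to a, b, c
coeffs = X \ h                          # least-squares estimates [a, b, c]
```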

Calculate the 95% confidence limits of a, b and c. (4pts)

According to the theory, the reduced linear velocity has an optimal reduced linear velocity at \nu_{\text{min}}=\sqrt{b/c}. Calculate the optimal reduced linear velocity. Round to two decimals. (1pt)

What is the expected reduced plate height at this velocity? (2pts) Round to two decimals.

What are the 95% confidence limits of this reduced plate height? (3pts)

Your colleague is afraid that you may be dealing with autocorrelation. Calculate the Durbin–Watson statistic. Is there autocorrelation? (3pts)

Your colleague is not done criticizing your work yet and comments on the spread in the data, suggesting you are likely to have outliers. Conduct an outlier test and plot the residuals. Specify which of the following statements is/are true. (2pts)
