INFORMATION REPOSITORY

11. Least-Squares Regression

Updated on December 4, 2025

In this lecture we introduce students to the foundations of linear regression, significance testing, and model validation in chemometrics. Building on earlier work with ANOVA and calibration, we explore how experimental factors influence analytical responses, how covariance and correlation structure data, and how regression models can be fitted and critically evaluated. Through the synthetic-fiber case study, students learn to construct models, assess their quality using F-tests, residual analysis, confidence intervals, and coefficients of determination, and understand the difference between expected and experimental responses. The session emphasizes correct model interpretation and the statistical thinking required for robust analytical method development.

Learning Goals #

  • Explain the concept of least-squares regression and why it is used for calibration and modelling.
  • Construct a linear regression model \hat{y}=b_0+b_1 \cdot x and describe the role of residuals.
  • Calculate the regression parameters (slope and intercept) manually.
  • Describe what a residual is and how it reflects the difference between measured and modelled values.
  • Fit a log-linear retention model to experimental isocratic retention data.
Analytical Separation Science by B.W.J. Pirok and P.J. Schoenmakers
READ SECTION 9.6.1

INTRODUCTION TO REGRESSION ANALYSIS

1. Modeling & Calibration #

In earlier parts of this course, we focused on describing sample data and drawing conclusions about the underlying population. A further key use of statistics is examining how two or more variables relate to each other. Regression analysis allows us to capture such relationships in a mathematical model.

When we link a measurable response, such as peak area, to a property of interest like analyte concentration, we perform calibration, a fundamental step in quantitative analysis (Figure 1).

Figure 1. Example of a calibration curve relating the property of interest (concentration, x) to the measured response (signal, y). The regression model (\hat{y}; dashed line) is fitted to the data. In classical calibration (A), the measurement error is assumed to lie in the signal, whereas in inverse calibration (B), the error is attributed to the concentration. Data from Section 9.6.2.

In classical calibration, standards of known concentration are measured and the instrument signal is treated as the only source of error; a model \hat{y}=f(x) is fitted and later inverted to predict unknown concentrations. Inverse calibration reverses this perspective by modelling the signal as the dependent variable, which is especially useful when multiple signals contribute to the prediction. In both cases, the analyst must ensure that the measured signal is specific to the analyte and not distorted by interfering species.

Classical calibration assumes errors lie in the measured signal, while inverse calibration assumes they lie in the concentration values. In multivariate spectroscopy and chemometrics (e.g., PLS, PCR), inverse calibration is preferred, because it avoids explicit inversion of large matrices and produces more stable, robust prediction models.
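The distinction can be made concrete with a short sketch. The example below is in Python (assuming NumPy; the standards and signals are hypothetical): classical calibration fits the signal on the concentration and then inverts the fitted line, whereas inverse calibration fits the concentration on the signal directly, so no inversion is needed at prediction time.

```python
import numpy as np

# Hypothetical standards: known concentrations (x) and measured signals (y).
x_std = np.array([0.0, 1.0, 2.0, 4.0, 8.0])   # concentration
y_std = np.array([0.1, 2.1, 3.9, 8.2, 15.9])  # signal

# Classical calibration: fit y-hat = b0 + b1*x, then invert the line
# to predict the concentration of an unknown sample from its signal.
b1, b0 = np.polyfit(x_std, y_std, 1)
y_unknown = 6.0                               # signal of an unknown sample
x_pred = (y_unknown - b0) / b1                # inverted classical model

# Inverse calibration: model concentration directly as x-hat = c0 + c1*y.
c1, c0 = np.polyfit(y_std, x_std, 1)
x_pred_inv = c0 + c1 * y_unknown
```

For well-behaved univariate data the two routes give nearly the same prediction; the practical advantages of the inverse formulation only become decisive in the multivariate case mentioned above.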

CASE: FITTING A RETENTION MODEL

In this lesson we introduce least-squares regression by fitting a retention model. We will return to calibration in the next lesson.

Calibration curves provide the foundation for the regression concepts used in quantitative analysis, but the regression technique used to determine the relationship is also of interest for other applications, such as retention modelling.

EXERCISE 1: ENTERING THE DATA

In this lesson we will learn how to conduct least-squares regression through the example of fitting a retention model. We start by entering the data that has been measured for an organic acid using a series of isocratic measurements in HILIC mode.

Table 1. Isocratic retention data for the organic acid in HILIC mode.

\phi	\ln{k}
0.05	5.97
0.10	4.62
0.15	3.31
0.20	2.56
0.25	1.87
0.30	1.01
0.35	0.94
0.40	0.79

Enter the data into your program of choice. For convenience, several data formats are provided below, already converted to x and y for universal use.

We will then introduce new concepts and apply them step-by-step to this dataset as we progress.

Example data
				
					phi=[0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40];
lnk=[5.97, 4.62, 3.31, 2.56, 1.87, 1.01, 0.94, 0.79];

x = phi'; 
y = lnk';
				
			

We take this moment to learn about transposing vectors. The apostrophe after the phi and lnk variables transposes the vectors (i.e. turns the row vectors into column vectors), which in this case allows for easier calculations later in the lesson.
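For readers working in Python instead of MATLAB or Julia, the same data entry can be sketched with NumPy, where `reshape(-1, 1)` plays the role of the transpose used above (a 1-D NumPy array has no row/column orientation of its own):

```python
import numpy as np

# The same retention data as in the MATLAB/Julia snippets above.
phi = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40])
lnk = np.array([5.97, 4.62, 3.31, 2.56, 1.87, 1.01, 0.94, 0.79])

# Explicit column vectors for the matrix algebra later in the lesson.
x = phi.reshape(-1, 1)
y = lnk.reshape(-1, 1)
```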

An example file can be downloaded here (CS_08_OneWayANOVA, .XLSX). See below for further instructions.

				
#=
Julia does not have a built-in ANOVA function, therefore the following
    function was created.
Copy this function; after running it, you can use it
    as any other regular function.
=#


##############################################
using LinearAlgebra, Distributions

"""
    ANOVA 1 analysis of variances.
    If `group` is not specified, setup groups based on columns of `x`. Otherwise, setup groups based on `group`.
    The input variables can contain missing values for `x`, which will be removed before ANOVA analysis.\n
    Parameters
    ----------
    - x : AbstractMatrix{T<:Real}
        A matrix with columns corresponding to the individual groups.\n
    - group : AbstractVector{T<:Real}, optional
        An equally sized vector as `x`.\n
    Returns
    -------
    Dict{Any,Any}
        A dictionary containing the following keys:\n
        - `"DF"` : A tuple of vectors corresponding to degrees of freedom for between groups, residual, and total.
        - `"SS"` : A tuple of vectors corresponding to the sum of squares for between groups, residual, and total.
        - `"MS"` : A tuple of vectors corresponding to the mean square for between groups and residual.
        - `"F"` : The F-statistic for the ANOVA analysis.
        - `"p-value"` : The p-value for the ANOVA analysis.
"""
function anova1(x,group = [])
    # ANOVA 1 analysis of variances
    # anova1(x), where x is a matrix with columns corresponding to the individual groups
    # anova1(x,group), where group is an equally sized vector as x
    # the input variables can contain missing values for x, which will be removed before anova analysis

    if isempty(group)
        # setup groups based on x columns
        group = ones(size(x)) .* collect(1:size(x,2))'
        group = reshape(group,:,1)
        x = reshape(x,:,1)
    elseif length(x) != length(group)
        println("x and groups contain a different amount of elements")
        return
    else
        if size(group, 1) == 1
            group = group'
        end
        if size(x, 1) == 1
            x = x'
        end
    end

    #remove NaN values
    if any(isnan.(x))
        group = group[isnan.(x).==0]
        x = x[isnan.(x).==0]
    end

    x_ori = x
    x_mc = x .- mean(x)
    gr_n = unique(group)
    gr_m = ones(size(gr_n))
    gr_c = ones(size(gr_n))
    for i = 1:length(gr_n)
        gr_m[i] = mean(x_mc[group.== gr_n[i]])
        gr_c[i] = sum(group.==gr_n[i])
    end

    x_mean_mc = mean(x_mc)
    x_cent = gr_m .- x_mean_mc
    #degrees of freedom
    df1 = length(gr_c) - 1
    df2 = length(x) - df1 - 1

    RSS = dot(gr_c, x_cent.^2)

    TSS = (x_mc .- x_mean_mc)'*(x_mc .- x_mean_mc)

    SSE = TSS[1] - RSS[1]
    if df2 > 0
        mse = SSE/df2
    else
        mse = NaN
    end

    if SSE !=0
        F = (RSS/df1) / mse
        p = 1-cdf(FDist(df1,df2),F)
    elseif RSS==0
        F = NaN;
        p = NaN;
    else
        F = Inf;
        p = 0;
    end

    #print results
    sum_df1 = df1+df2
    MS1 = RSS/df1
    println("")
    println("anova1 results")
    println("----------------------------------------------------------")
    println("Source\t\tDF\tSS\t\t\tMS\t\t\tF\t\t\tp")
    println("Between\t\t$df1\t$RSS\t$MS1\t$F\t$p     ")
    println("Residual\t$df2\t$SSE\t$mse                     ")
    println("Total\t\t$sum_df1\t$TSS                               ")


    stats = Dict("DF" => (["Between","Residual", "Total"],[df1, df2, sum_df1]),
                "SS" => (["RSS", "SSE", "TSS"],[RSS, SSE, TSS]),
                "MS" => (["Between","Residual"],[MS1, mse]),
                "F" => F, "p-value" => p)

    return stats
end



##############################################
# Example Data
x = [3.18 3.47 3.34;
     3.18 3.44 3.06;
     2.96 3.41 3.02;
     3.13 3.58 3.04;
     2.96 3.63 2.83;
     3.01 3.70 3.06]

# Perform ANOVA
anova1(x)
				
			
Analytical Separation Science by B.W.J. Pirok and P.J. Schoenmakers
READ SECTION 9.6.2

COVARIANCE AND CORRELATION

2. Studying data #

Calibration, and modelling in general, relies on a predictable relationship between the variables x and y. However, there is no point in modeling data that bears no relation. Before building a model, it is therefore helpful to examine how the variables co-vary. Let’s look at some tools.

2.1. Covariance #

Covariance quantifies how deviations in x and y co-vary and thus indicates the degree of their linear association.

Equation 9.65: \text{cov}(x,y)=\frac{\Sigma^{n}_{i=1} (x_i-\bar{x})(y_i-\bar{y})}{n-1}

Figure 2A-C illustrates this idea: as the relationship between x and y becomes clearer and more linear, the covariance correspondingly increases.

Figure 2. The covariance and correlation coefficient for different distributions of data. The standard deviations in both x and y, i.e. s_x and s_y, are roughly 0.275.

				
					covariance = (sum((x-mean(x)).*(y-mean(y))))/(length(x)-1);
correlation = corr(x, y);
spearmanCorr = corr(x, y, 'Type', 'Spearman');
				
			

				
			

2.2. Correlation coefficient #

To make the measure comparable across datasets, the covariance is divided by the standard deviations of x and y, resulting in the dimensionless correlation coefficient (r).

Equation 9.66: r=\frac{\text{cov}(x,y)}{s_x \cdot s_y}

Equation 9.66 (also known as the Pearson correlation coefficient) captures the linear association between x and y relative to their variation, making it dimensionless. Applied to the datasets in Figure 2, it increases in line with their covariance trends, including the positive and negative cases in Figure 2B and Figure 3A.

Figure 3. The covariance and correlation coefficient for different distributions of data. For panel A, the standard deviation in both s_x and s_y is roughly 0.275.

Figure 3A shows that covariance can also be negative, whereas Figure 3B illustrates that the covariance depends on the units of the variables, meaning its magnitude can be misleading.

Because it reflects only linear behaviour, it fails to characterize the non-linear pattern in Figure 3C. It is also highly sensitive to leverage points: in Figure 3B, adding a single extreme datapoint to Figure 2A increases the correlation from 0.037 to nearly 1. Overall, correlation must be interpreted cautiously, especially when outliers are present.

Population-based Correlation Coefficient

The Pearson correlation coefficient r is calculated from a finite sample and may differ from the true population correlation \rho (i.e. if we had an infinite number of datapoints).

To assess whether the observed correlation reflects a genuine linear association rather than noise, a dedicated t-test can be applied (with t following a t(n–2) distribution). This test evaluates the hypothesis H_0: \rho=0 against the alternative that a real correlation exists. For details, including the test statistic and one-sided variants, see Section 9.2.3 and Equation 9.67.
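The test statistic itself is given in the book as Equation 9.67; a sketch using its standard textbook form, t = r·√(n−2)/√(1−r²), with example values (not the retention data), could look as follows in Python (assuming NumPy and SciPy):

```python
import numpy as np
from scipy import stats

# Standard t-test for H0: rho = 0 (compare Equation 9.67 in the book):
#   t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom.
r, n = 0.90, 10  # example values, not the retention data

t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
p = 2 * (1 - stats.t.cdf(abs(t), df=n - 2))  # two-sided p-value
```

A small p-value leads us to reject H_0 and conclude that a real linear correlation exists.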

2.3. Spearman correlation #

Spearman’s rank correlation is a non-parametric alternative to Pearson’s correlation that first converts x and y into their respective ranks. It then computes the Pearson correlation of these ranks, allowing it to capture monotonic relationships even when they are non-linear. Because it is based on rank positions rather than raw values, it increases when datapoints occupy similar relative positions in both variables. See also Section 9.6.2.
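The three measures above can be cross-checked in Python (a sketch assuming NumPy and SciPy). A deliberately non-linear but monotonic toy dataset is used here, so that Exercise 2 remains an exercise; it also shows that Spearman's correlation is exactly the Pearson correlation of the ranks:

```python
import numpy as np
from scipy import stats

# Toy dataset: y grows monotonically but non-linearly with x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 4.0, 9.0, 16.0])

# Covariance as in Equation 9.65 (note the n - 1 denominator).
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# Pearson r measures only linear association, so it stays below 1
# for this curved (but perfectly monotonic) trend.
r_pearson = np.corrcoef(x, y)[0, 1]

# Spearman correlation: Pearson correlation of the ranks, so it
# equals 1 for any strictly monotonic relationship.
r_spearman, p_spearman = stats.spearmanr(x, y)
r_of_ranks = np.corrcoef(stats.rankdata(x), stats.rankdata(y))[0, 1]
```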

EXERCISE 2: COVARIANCE & CORRELATION

Calculate the covariance, Pearson correlation and Spearman correlation for the retention data and then answer the following questions.

You computed the covariance for the dataset. What does your value indicate?

Based on your results, what is the best interpretation for the Pearson and Spearman correlation?

Analytical Separation Science by B.W.J. Pirok and P.J. Schoenmakers
READ SECTIONS 9.6.3 - 9.6.3.1

LEAST-SQUARES REGRESSION

3. Least-Squares Regression #

3.1. Formulating the model #

The strong negative correlation observed in Exercise 2 suggests that the relationship between \phi (i.e. x), and \ln{k} (i.e. y) can reasonably be approximated by a straight line. We will now use least-squares regression to formalise this idea. 

Dependent and independent variables

In chromatography, errors in the mobile-phase composition (i.e. \varphi, our x; a property we set in the instrument) are generally small compared with measurement noise. The same is true during calibration, where we use standards with certified concentrations (also x in Figure 1).

We therefore treat x as known and assume all uncertainty lies in y. Thus, x is the independent variable and y the dependent variable.

A model describes this relationship through

Equation 9.69: y_i=\hat{y}_i + e_i=f(x_i)+e_i

meaning each measured point (y_i) equals the model prediction (\hat{y}_i) plus an error term (e_i). When the correlation coefficient indicates a linear trend, we use the linear model

Equation 9.70: y=\beta_0 + \beta_1 \cdot x  + \varepsilon

Here \beta_0 and \beta_1 are the true (population) parameters that we would obtain if we had an infinite number of datapoints. We can now state that our model depends on x and \textbf{β} and can be summarized as f(x,\textbf{β}) with \textbf{β} = \begin{bmatrix}\beta_0 \\ \beta_1 \end{bmatrix}.

In reality we only have a finite sample (8 datapoints for the retention data of the exercise), so these true \textbf{β} parameters are unknown and must be estimated. Their sample-based estimates are denoted b_0 and b_1, yielding our model

Sample and Population-based Variables

As we saw in Lesson 2 with the mean and standard deviation, we continue to distinguish between sample-based estimates (Latin alphabet) and population-based values (Greek alphabet). We now extend this to the model parameters \textbf{β} and their estimates \textbf{b}, and likewise to the model error \textbf{ε} and its estimate \textbf{e}.

Equation 9.71: \hat{y}=b_0+b_1\cdot x

Here, b_1 and b_0 are the model coefficients. Specifically for the straight line, they refer to the slope (b_1) and intercept (b_0) as derived in Section 9.6.3 in the book.

3.2. Least Squares #

We now have an equation for a straight line (Equation 9.71), but how do we fit it through the data as well as possible? This is where least-squares regression comes in. We calculate b_0 and b_1 such that the sum of the squared errors over all datapoints (i.e. \Sigma e^2_i) is minimized. This process is shown in Figure 4.

Similar to Figure 1 we see our experimental data from Table 1 plotted (blue points; \ln{k}; i.e. y) as a function of \varphi (i.e. x).

Figure 4. Schematic depiction of the least-squares concept for a straight line model (A) and a constant as model (B). Several symbols and concepts are explained throughout the figure. See also the text and Section 9.6.3.1 in the book.

We also see two different models fitted through these points: a straight line (left; \hat{y}=b_0 + b_1 \cdot x) and a constant (right; \hat{y}=b_0), each yielding the light-pink dashed line. Also depicted is the meaning of the coefficients b_0 and, if applicable, b_1 as intercept and slope, respectively.

We have already learned earlier in this lesson that the difference between the experimental datapoints (y_i) and the modelled datapoints (\hat{y}_i) represents the residual, or model error, for each datapoint i.

To judge which model fits the datapoints best, it is therefore of interest to assess the total error. However, the large errors in Figure 4B are both positive and negative. If we simply summed them, we would obtain a small number, similar to that of Figure 4A, even though the model of Figure 4A fits the data much better. To solve this, the errors are squared, quite literally yielding squares (i.e. e_i^2), as illustrated in the figure.

The total error can then be calculated as the sum-of-squares error (SSE). The objective of the fitting is therefore to minimise these squares, which explains the term “least-squares” regression.

Vector & Matrix Notation

The concept of a vector that was introduced in Lesson 2 (see also Section 9.1.2) is important in the present lesson. A vector is a one-dimensional series of data. In equations, vectors are typically depicted by a bold, non-italic symbol (e.g. the value y becomes the vector \textbf{y}). A vector can be a row vector (e.g. \textbf{x}=\left[1, 2, 3\right]) or a column vector (e.g. \textbf{b}=\begin{bmatrix} b_0 \\ b_1 \\ b_2 \end{bmatrix}).

Similarly, a special notation exists for a matrix, which is not only formatted bold and non-italic but also denoted by a capital letter, such as \textbf{X}.

3.3. Fitting a straight line #

How do we do this fitting ourselves? If the model is linear, we can write it as

Equation 9.76: \textbf{y}=\left[ \frac{\partial \hat{y}}{\partial b_0}\; \frac{\partial \hat{y}}{\partial b_1}\; \cdots\; \frac{\partial \hat{y}}{\partial b_m} \right] \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_m \end{bmatrix} + \textbf{e}

which becomes for our straight line case (derive it yourself!)

y_i=\begin{bmatrix} 1 & x_i \end{bmatrix} \begin{bmatrix} b_0 \\ b_1  \end{bmatrix} + e_i
Matrix Calculation

In this course, we assume basic familiarity with vectors and matrices. The key idea behind the expression

\begin{bmatrix}1 & x_i \end{bmatrix} \begin{bmatrix}b_0 \\ b_1 \end{bmatrix}

is that a row vector multiplied by a column vector gives a single number obtained by a dot product. First multiply the first entries (1 \cdot b_0), then multiply the second entries (x_i \cdot b_1) and finally add them together. This yields the original model:

y_i=b_0+b_1 x_i

We can also write Equation 9.76 for all x_i by rewriting it as a matrix \textbf{X} (note the capital X to denote that it is a matrix), so that \hat{y} is defined as f(\textbf{X},\textbf{b}).

For our straight line model, this would result in

Equation 9.78: \begin{bmatrix}y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix}1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}

In least-squares regression, we choose the estimates \textbf{b} so that the sum of squared errors, \Sigma e^2_i, is as small as possible. This minimisation is accomplished by

Equation 9.79: \hat{\textbf{b}}=\left( \textbf{X}^{\text{T}} \cdot \textbf{X} \right)^{-1} \cdot \textbf{X}^{\text{T}} \cdot \textbf{y}

Here, the superscript \text{T} denotes the transposed matrix. We already have \textbf{y} as this is our original \textbf{y}-data (the \ln{k} values).
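For readers working in Python rather than MATLAB or Julia, Equation 9.79 can be applied to the retention data directly (a sketch assuming NumPy; a dedicated least-squares solver is shown alongside as the numerically preferred route):

```python
import numpy as np

# Retention data from Exercise 1.
phi = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40])
lnk = np.array([5.97, 4.62, 3.31, 2.56, 1.87, 1.01, 0.94, 0.79])

# Design matrix X: a column of ones (intercept) next to the phi values.
X = np.column_stack([np.ones_like(phi), phi])
y = lnk

# Equation 9.79: b = (X^T X)^{-1} X^T y.
b = np.linalg.inv(X.T @ X) @ X.T @ y
b0, b1 = b

# In practice a least-squares solver is preferred over forming the
# inverse explicitly, as it is numerically more stable.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both routes give the same coefficients for this small, well-conditioned problem.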

EXERCISE 3: FIT THE RETENTION MODEL

We will now fit and plot the retention model. To compute the regression parameters using Equation 9.79, we must first build the \textbf{X}-matrix that represents our model. For our straight line model

\hat{y}_i = b_0 + b_1 \cdot x_i

each datapoint contributes two terms: a 1 (for the intercept b_0) and the corresponding x_i value (for the slope b_1). This means every row of \textbf{X} looks like:

\begin{bmatrix}1 & x_i \end{bmatrix}

For all n datapoints, the full \textbf{X}-matrix is:

\textbf{X}=\begin{bmatrix}1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}

We can construct this directly by stacking a column of ones next to the vector of predictor values.

				
					X_matrix = [ones(length(x),1) x];
				
			
				
				
			

Here, \textbf{x} is our vector of x_i-values (i.e. the \varphi-values), and ones(length(x),1) creates the intercept column.

We can now compute \hat{\textbf{b}} using Equation 9.79:

				
					b=pinv(X_matrix'*X_matrix)*X_matrix'*y;
				
			
				
					#=
Julia does not have a ANOVA funcion,therefore the following
    function was created. 
Copy this function, after running this function you can use it
    as any other regular function.
=#


##############################################
using LinearAlgebra, Distributions

"""
    ANOVA 1 analysis of variances.
    If `group` is not specified, setup groups based on columns of `x`. Otherwise, setup groups based on `group`.
    The input variables can contain missing values for `x`, which will be removed before ANOVA analysis.\n
    Parameters
    ----------
    - x : AbstractMatrix{T<:Real}
        A matrix with columns corresponding to the individual groups.\n
    - group : AbstractVector{T<:Real}, optional
        An equally sized vector as `x`.\n
    Returns
    -------
    Dict{Any,Any}
        A dictionary containing the following keys:\n
        - `"DF"` : A tuple of vectors corresponding to degrees of freedom for between groups, residual, and total.
        - `"SS"` : A tuple of vectors corresponding to the sum of squares for between groups, residual, and total.
        - `"MS"` : A tuple of vectors corresponding to the mean square for between groups and residual.
        - `"F"` : The F-statistic for the ANOVA analysis.
        - `"p-value"` : The p-value for the ANOVA analysis.
"""
function anova1(x,group = [])
    # ANOVA 1 analysis of variances
    # anova1(x), where x is a matrix with columns correpsonding to the induvidual groups
    # anova1(x,group), where x is an equally sized vector as group
    # the input variables can contains missing values for x, which will be removed before anova analysis

    if isempty(group)
        # setup groups based on x columns
        group = ones(size(x)) .* collect(1:size(x,2))'
        group = reshape(group,:,1)
        x = reshape(x,:,1)
    #setup groups based on x columns
    elseif length(x) .!= length(group)
        println("x and groups contain a different amount of elements")
        return
    else
        if size(group, 1) == 1
            group = group'
        end
        if size(x, 1) == 1
            x = x'
        end
    end

    #remove NaN values
    if any(isnan.(x))
        group = group[isnan.(x).==0]
        x = x[isnan.(x).==0]
    end

    x_ori = x
    x_mc = x .- mean(x)
    gr_n = unique(group)
    gr_m = ones(size(gr_n))
    gr_c = ones(size(gr_n))
    for i = 1:length(gr_n)
        gr_m[i] = mean(x_mc[group.== gr_n[i]])
        gr_c[i] = sum(group.==gr_n[i])
    end

    x_mean_mc = mean(x_mc)
    x_cent = gr_m .- x_mean_mc
    #degees of freedom
    df1 = length(gr_c) - 1
    df2 = length(x) - df1 - 1

    RSS = dot(gr_c, x_cent.^2)

    TSS = (x_mc .- x_mean_mc)'*(x_mc .- x_mean_mc)

    SSE = TSS[1] - RSS[1]
    if df2 > 0
        mse = SSE/df2
    else
        mse = NaN
    end

    if SSE !=0
        F = (RSS/df1) / mse
        p = 1-cdf(FDist(df1,df2),F)
    elseif RSS==0
        F = NaN;
        p = NaN;
    else
        F = Inf;
        p = 0;
    end

    #print results
    sum_df1 = df1+df2
    MS1 = RSS/df1
    println("")
    println("anova1 results")
    println("----------------------------------------------------------")
    println("Source\t\tDF\tSS\t\t\tMS\t\t\tF\t\t\tp")
    println("Between\t\t$df1\t$RSS\t$MS1\t$F\t$p     ")
    println("Residual\t$df2\t$SSE\t$mse                     ")
    println("Total\t\t$sum_df1\t$TSS                               ")


    stats = Dict("DF" => (["Between","Residual", "Total"],[df1, df2, sum_df1]),
                "SS" => (["RSS", "SSE", "TSS"],[RSS, SSE, TSS]),
                "MS" => (["Between","Residual"],[MS1, mse]),
                "F" => F, "p-value" => p)

    return stats
end



##############################################
# Example Data
x = [3.18 3.47 3.34;
     3.18 3.44 3.06;
     2.96 3.41 3.02;
     3.13 3.58 3.04;
     2.96 3.63 2.83;
     3.01 3.70 3.06]

# Perform ANOVA
anova1(x)
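As an illustrative cross-check (plain Python, not part of the original MATLAB/Julia material), the F-statistic for the example data above can be computed from first principles:

```python
# One-way ANOVA on the example data, computed from first principles.
# Same measurements as above; each inner list is one group (one column of x).
groups = [
    [3.18, 3.18, 2.96, 3.13, 2.96, 3.01],
    [3.47, 3.44, 3.41, 3.58, 3.63, 3.70],
    [3.34, 3.06, 3.02, 3.04, 2.83, 3.06],
]

all_x = [v for g in groups for v in g]
grand_mean = sum(all_x) / len(all_x)

# Between-group sum of squares (called RSS in the Julia function above)
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
# Within-group (residual) sum of squares (SSE above)
ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)

df_between = len(groups) - 1          # k - 1
df_within = len(all_x) - len(groups)  # N - k

F = (ss_between / df_between) / (ss_within / df_within)
print(F)  # a large F indicates the group means differ significantly
```

The p-value would additionally require the F-distribution CDF, which is what `cdf(FDist(df1, df2), F)` supplies in the Julia version.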
				
			

We obtain coefficients b_0 = 5.9686 and b_1 = -14.8214, which correspond to the \ln{k_0} and S parameters of the straight-line log-linear retention model, respectively. We can now compute the modelled datapoints for all values of x. 

				
					y_hat=X_matrix*b;
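The same matrix product can be sketched in plain Python (illustrative only: the coefficients come from the text, but the φ values here are made up for demonstration and are not the measurements from the case study):

```python
# Predicted ln k values via y_hat = X * b, mirroring the MATLAB line above.
b = [5.9686, -14.8214]          # [b0, b1] from the fit in the text
phi = [0.30, 0.35, 0.40, 0.45]  # hypothetical mobile-phase compositions

# Design matrix X: a column of ones (intercept) and a column of phi values
X = [[1.0, p] for p in phi]

# y_hat = X * b, written out as row-by-row dot products
y_hat = [row[0] * b[0] + row[1] * b[1] for row in X]
print(y_hat)
```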
				
			
				

This creates a vector of predicted \hat{y}-values. Let's now plot our fitted model over the original data. For the specific MATLAB plotting functions, see Tutorial 3.

				
					scatter(x,y)
hold on
plot(x,y_hat)
xlabel('\phi');
ylabel('ln k');
				
			
				

We should now have a figure that closely resembles Figure 4A.

Concluding remarks #

We have learned how to fit a simple straight-line model through experimental data. This will come in handy during calibration.
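The whole fit-and-predict workflow from this lesson can be condensed into a short plain-Python sketch (the data are synthetic, generated from a known line, so the fit recovers the coefficients exactly):

```python
# Straight-line least squares: b1 = S_xy / S_xx, b0 = y_bar - b1 * x_bar
x = [0.30, 0.35, 0.40, 0.45, 0.50]
y = [6.0 - 15.0 * xi for xi in x]   # synthetic, exactly linear "ln k" data

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

b1 = s_xy / s_xx            # slope
b0 = y_bar - b1 * x_bar     # intercept

y_hat = [b0 + b1 * xi for xi in x]
residuals = [yi - yh for yi, yh in zip(y, y_hat)]  # measured minus modelled
print(b0, b1)
```

With real (noisy) measurements the residuals would of course not vanish; inspecting them is exactly the model-evaluation step discussed next.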

In the next lessons we will learn how to evaluate our model. For example, the data that we use clearly affects the resulting model. This also implies that there is variation in almost every aspect of the modelling: variation in the model coefficients, variation in the prediction, and variation in the experimental data itself. This in turn affects the calibration, as we will learn in the next lesson. 

Another aspect to consider is that we could also have fitted a different type of model. How do we know what to use in what scenario? To address this, we will learn about model validation later in the course.

A final concluding remark is that everything we have seen in this lesson applies to linear models, which must meet the condition

Equation 9.75: \frac{\partial \hat{y}}{\partial b_j} \neq f(b_0, b_1, \cdots, b_m)

This type of linearity therefore does not refer to the model being a straight line, but to the model being linear in its coefficients. Examples of such models are shown in Section 9.6.3.1 of the book. If a model is not linear in this sense, a unique solution cannot readily be calculated and iterative techniques are required.
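To make this distinction concrete, here is a minimal numerical sketch (plain Python, with two illustrative models chosen for this demonstration): for a model that is linear in its coefficients, the derivative ∂ŷ/∂b_j does not change when the other coefficients change; for a nonlinear model it does.

```python
import math

def d_db1(model, b0, b1, x, h=1e-6):
    """Central-difference estimate of d(y_hat)/db1 at (b0, b1, x)."""
    return (model(b0, b1 + h, x) - model(b0, b1 - h, x)) / (2 * h)

# Curved, but LINEAR in the coefficients: y_hat = b0 + b1 * x**2
def linear_model(b0, b1, x):
    return b0 + b1 * x**2

# NONLINEAR in the coefficients: y_hat = b0 * exp(b1 * x)
def nonlinear_model(b0, b1, x):
    return b0 * math.exp(b1 * x)

x = 0.5
# Linear model: d(y_hat)/db1 = x**2, independent of b0 and b1
d_lin_a = d_db1(linear_model, 1.0, 2.0, x)
d_lin_b = d_db1(linear_model, 7.0, 2.0, x)
# Nonlinear model: d(y_hat)/db1 = b0 * x * exp(b1 * x), which depends on b0
d_non_a = d_db1(nonlinear_model, 1.0, 2.0, x)
d_non_b = d_db1(nonlinear_model, 7.0, 2.0, x)

print(d_lin_a, d_lin_b)  # equal: Equation 9.75 is satisfied
print(d_non_a, d_non_b)  # different: iterative fitting would be needed
```

Note that the first model is curved in x yet still solvable by ordinary least squares, which is precisely the point of Equation 9.75.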
