13. MODEL OPTIMISATION AND VALIDATION
13.1. Training, optimisation and validation
The determination of the optimal complexity of the model (the number of PCs that should be included in the model) requires the estimation of the prediction error that can be reached. Ideally, a distinction should be made between training, optimisation and validation. Training is the step in which the regression coefficients are determined for a given model. In PCR, this means that the b-coefficients are determined for a model that includes a given set of PCs. Optimisation consists of comparing different models and deciding which one gives the best prediction. In PCR, the usual procedure is to determine the predictive power of models with 1, 2, 3, … PCs and to retain the best one. Validation is the step in which the prediction with the chosen model is tested independently.
In practice, as we will describe later, because of practical constraints on the number of samples and/or the time available, fewer than three steps are often included. In particular, analysts rarely make a distinction between optimisation and validation, and the term validation is then sometimes used for what is essentially an optimisation. While this is acceptable to some extent, in no case should the three steps be reduced to one. In other words, it is not acceptable to draw conclusions about optimal models and/or the quality of prediction using only a training step. The same data should never be used for training, optimising and validating the model. If we do, it is possible and even probable that we will overfit the model, and the prediction error obtained in this way may be over-optimistic.
Overfitting is the result of using too complex a model. Consider a univariate situation in which three samples are measured. The y = f(x) model really is linear (first order), but the experimenter decides to use a quadratic model instead. The training step will yield a perfect result: all points lie exactly on the fitted curve. If, however, new samples are predicted, then the performance of the quadratic model will be worse than that of the linear one.
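This overfitting effect is easy to reproduce numerically. The sketch below (in Python with NumPy; the data values are invented for illustration) fits a first- and a second-order polynomial to three points generated from a linear relationship: the quadratic reproduces the training points exactly, yet predicts a new sample worse than the linear model does.

```python
import numpy as np

# Invented data: the true relationship is linear, y = 2x + 1,
# with a small amount of measurement noise added by hand.
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([3.1, 4.9, 7.1])           # roughly 2x + 1

# Training: fit a first-order (correct) and a second-order (too complex) model.
lin = np.polyfit(x_train, y_train, 1)
quad = np.polyfit(x_train, y_train, 2)

# The quadratic passes exactly through all three training points...
train_err_quad = np.abs(np.polyval(quad, x_train) - y_train).max()

# ...but predicts a new sample worse than the linear model does.
x_new, y_new = 4.0, 9.0                       # true value: 2*4 + 1 = 9
err_lin = abs(np.polyval(lin, x_new) - y_new)
err_quad = abs(np.polyval(quad, x_new) - y_new)
```

With these numbers the quadratic's training error is essentially zero, while its prediction error for the new sample is considerably larger than that of the linear model.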
13.2. Measures of predictive ability
Several statistics are used for measuring the predictive ability of a model. The prediction error sum of squares, PRESS, is computed as:

PRESS = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 = \sum_{i=1}^{n} e_i^2

where y_i is the actual value of y for object i, ŷ_i is the y-value for object i predicted with the model under evaluation, e_i = ŷ_i - y_i is the residual for object i (the difference between the predicted and the actual y-value), and n is the number of objects for which ŷ_i is obtained by prediction.
The mean squared error of prediction (MSEP) is defined as the mean value of PRESS:

MSEP = \frac{PRESS}{n}
Its square root is called the root mean squared error of prediction, RMSEP:

RMSEP = \sqrt{MSEP}
All these quantities give the same information. In the chemometrics literature it seems that RMSEP values are preferred, partly because they are given in the same units as the y-variable.
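As a minimal sketch, the three statistics can be computed as follows (the function names are our own, chosen to match the abbreviations in the text):

```python
import numpy as np

def press(y, y_pred):
    """Prediction error sum of squares: sum of the squared residuals e_i."""
    e = np.asarray(y_pred) - np.asarray(y)
    return float(np.sum(e ** 2))

def msep(y, y_pred):
    """Mean squared error of prediction: PRESS divided by n."""
    return press(y, y_pred) / len(y)

def rmsep(y, y_pred):
    """Root mean squared error of prediction, in the same units as y."""
    return msep(y, y_pred) ** 0.5
```

For example, reference values [1, 2, 3, 4] predicted as [1.1, 1.9, 3.2, 3.8] give PRESS = 0.10, MSEP = 0.025 and RMSEP ≈ 0.158.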
The RMSEP is determined for models of increasing complexity. PCs are included according to the Top-Down or Best Subset Selection procedure (see section 12.1). Usually the result is presented as a plot showing RMSEP as a function of the number of components, called the RMSEP curve. This curve often shows an intermediate minimum, and the number of PCs for which this occurs is then considered to be the optimal complexity of the model. A problem which is sometimes encountered is that the global minimum is reached for a model with a very high complexity. A more parsimonious model is often more robust (the parsimony principle). It has therefore been proposed to use the first local minimum or a deflection point instead of the global minimum. If there is only a small difference between the RMSEP at the minimum and that of a less complex model, the latter is often chosen. The decision on whether the difference is considered to be small is often based on the experience of the analyst. We can also use statistical tests that have been developed to decide whether a more parsimonious model can be considered statistically equivalent; in that case the more parsimonious model should be preferred. An F-test [104, 105] or a randomisation t-test have been proposed for this purpose. The latter requires fewer statistical assumptions about data and model properties, and is to be preferred. However, in practice it does not always seem to yield reliable results.
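The selection rule described above can be sketched as follows. The function name and its `tol` argument are hypothetical, and the rule implemented (first local minimum, with a preference for an almost-equivalent simpler model) is only one possible formalisation of the practice described in the text:

```python
def optimal_complexity(rmsep_curve, tol=0.0):
    """Pick the number of PCs from an RMSEP curve (index 0 -> 1 PC).

    Returns the first local minimum of the curve; if an earlier model
    has an RMSEP within `tol` of that minimum, the more parsimonious
    model is chosen instead.
    """
    # First local minimum: the first point not higher than its successor.
    k = len(rmsep_curve) - 1
    for i in range(len(rmsep_curve) - 1):
        if rmsep_curve[i] <= rmsep_curve[i + 1]:
            k = i
            break
    # Prefer an earlier, almost equivalent model (parsimony principle).
    for i in range(k + 1):
        if rmsep_curve[i] <= rmsep_curve[k] + tol:
            return i + 1          # convert index to number of PCs
    return k + 1
```

For the curve [5.0, 2.0, 1.5, 1.6, 1.0] this selects 3 PCs (the first local minimum) rather than the 5-PC global minimum; with a tolerance of 0.6 it falls back to the 2-PC model.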
The model selected in the optimisation step is applied to an independent set of samples, and the y-values (i.e. the results obtained with the reference method) and ŷ-values (the results obtained with multivariate calibration) are compared. An example is shown in figure 13. The interpretation is usually done visually: does the line with slope 1 and intercept 0 represent the points in the graph sufficiently well? It is necessary to check whether this is true over the whole range of concentrations (non-linearity) and for all meaningful groups of samples, e.g. for different clusters. If most samples of a cluster are found on one side of the line, a more complex modelling method (e.g. locally weighted regression [28, 53]) or a separate model for each cluster of samples may yield better results.
Sometimes a least-squares regression line between y and ŷ is obtained and a test is carried out to verify that the joint confidence interval contains slope = 1 and intercept = 0. Similarly, a paired t-test between the y and ŷ values can be carried out. This does not obviate, however, the need for checking non-linearity or looking at individual clusters.
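The paired t-test mentioned above reduces, for the differences d_i = ŷ_i - y_i, to a one-sample t statistic on the d_i. A minimal sketch (the function name is ours):

```python
import numpy as np

def paired_t_statistic(y, y_hat):
    """t statistic for the paired comparison of reference values y and
    predicted values y_hat; under H0 the mean difference is zero."""
    d = np.asarray(y_hat) - np.asarray(y)
    n = len(d)
    # t = mean(d) / (s_d / sqrt(n)), with the sample standard deviation s_d
    return float(d.mean() / (d.std(ddof=1) / np.sqrt(n)))
```

The resulting |t| is compared with the critical value t(alpha/2, n-1); a significant result indicates a systematic difference (bias) between the two methods.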
An important question is what RMSEP to expect. If the final model is correct, i.e. there is no bias, then the predictions will often be more precise than those obtained with the reference method [108, 7, 109], due to the averaging effect of the regression. However, this cannot be proved from measurements on validation samples, since their reference values were themselves obtained with the reference method. The RMSEP value is therefore limited by the precision (and accuracy) of the reference method. For that reason, the precision of the reference method can serve at the optimisation stage as a kind of target value for RMSEP. An alternative way of deciding on model complexity is therefore to select the lowest complexity which leads to an RMSEP value comparable to the precision of the reference method.
13.5. External validation
As stressed above, the same data should in principle not be used for developing, optimising and validating the model. Terminology in this field is not standardised. We suggest that the samples used in the training step should be called the training set, those used in optimisation the evaluation set, and those used in validation the validation set. Some multivariate calibration methods require three data sets. This is the case when neural nets are applied (the evaluation set is then usually called the monitoring set). In PCR and related methods, often only two data sets are used (external validation) or even only one (internal validation). In the latter case, the existence of a second data set is simulated (see further section 13.6). We suggest that the sum of all sets should be called the calibration set. Thus the calibration set can consist of the sum of training, evaluation and validation sets, or it can be split into a training and a test set, or it can serve as the single set applied in internal validation. Applied with care, external and internal validation methods will warn against overfitting.
External validation uses a completely different group of samples for prediction (sometimes called the test set) from the one used for building the model (the training set). Care should be taken that both sample sets are obtained in such a way that they are representative of the data being investigated. This can be investigated using the measures described for representativity in section 10. One should be aware that with an external test set the prediction error obtained may depend to a large extent on how exactly the objects are situated in space in relation to each other.
It is important to repeat that, in the presence of measurement replicates, all replicates of a sample must be kept together, either in the test set or in the training set, when the data are split. Otherwise there is neither perturbation nor independence of the statistical sample.
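A split that keeps replicates together can be implemented by splitting at the level of physical samples rather than individual measurements. A sketch with hypothetical names (`sample_ids[i]` identifies the physical sample that measurement i is a replicate of):

```python
import random

def split_by_sample(sample_ids, test_fraction=0.25, seed=0):
    """Split object indices into a training and a test set so that all
    replicates of the same physical sample end up on the same side."""
    rng = random.Random(seed)
    groups = sorted(set(sample_ids))
    rng.shuffle(groups)
    n_test = max(1, round(test_fraction * len(groups)))
    test_groups = set(groups[:n_test])
    test = [i for i, g in enumerate(sample_ids) if g in test_groups]
    train = [i for i, g in enumerate(sample_ids) if g not in test_groups]
    return train, test
```

Because whole samples are assigned to one side or the other, no replicate of a test sample can leak into the training set.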
The preceding paragraphs apply when the model is developed from samples taken from a process or a natural population. If a model was created with artificial samples with y-values outside the expected range of y-values to be determined, for the reasons explained in section 10, then the test set should contain only samples with y-values in the expected range.
13.6. Internal validation
One can also apply what is called internal validation. Internal validation uses the same data for developing the model and validating it, but in such a way that external validation is simulated. A comparison of internal validation procedures usually employed in spectrometry is given in . Four different methodologies were employed:
a. Random splitting of the calibration set into a training and a test set. The splitting can then have a large influence on the obtained RMSEP value.
b. Cross-validation (CV), where the data are randomly divided into d so-called cancellation groups. A large number of cancellation groups corresponds to validation with a small perturbation of the statistical sample, whereas a small number of cancellation groups corresponds to a heavy perturbation. The term perturbation is used to indicate that the data set used for developing the model at this stage is not the same as the one developed with all calibration objects, i.e. the one which will be applied in sections 14 to 16. Too small a perturbation means that overfitting is still possible. The validation procedure is repeated as many times as there are cancellation groups; at the end, each object has been in the test set once and in the training set d-1 times. Suppose there are 15 objects and 3 cancellation groups, consisting of objects 1-5, 6-10 and 11-15. We mentioned earlier that the objects should be assigned randomly to the cancellation groups, but for ease of explanation we have used the numbering above. The b-coefficients of the model being evaluated are determined first for the training set consisting of objects 6-15, while objects 1-5 function as the test set, i.e. they are predicted with this model. The PRESS is determined for these 5 objects. Then a model is made with objects 1-5 and 11-15 as the training set and 6-10 as the test set and, finally, a model is made with objects 1-10 in the training set and 11-15 in the test set. Each time the PRESS value is determined, and eventually the three PRESS values are added to give a value representative for the whole data set (PRESS values are more appropriate here than RMSEP values, because sums of squares are additive).
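The cancellation-group logic can be sketched as follows; for brevity an ordinary least-squares fit stands in for the PCR model being evaluated (the function name and the stand-in model are our own choices, not part of the procedure described above):

```python
import numpy as np

def cv_press(X, y, d=3, seed=0):
    """Total PRESS from cross-validation with d cancellation groups."""
    n = len(y)
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)                 # random assignment to groups
    groups = np.array_split(order, d)
    total = 0.0
    for test_idx in groups:
        train_idx = np.setdiff1d(order, test_idx)
        # training step: b-coefficients from the training objects only
        b, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        e = X[test_idx] @ b - y[test_idx]      # predicted residuals
        total += float(np.sum(e ** 2))         # PRESS values are additive
    return total
```

Each object is predicted exactly once, by a model that was built without it, and the d partial PRESS values are summed into one value for the whole data set.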
c. Leave-one-out cross-validation (LOO-CV), in which each test set contains only one object (d = n). Because the perturbation of the model at each step is small (only one object is set aside), this procedure tends to overfit the model. For this reason the leave-more-out approach described above may be preferable. The main drawback of LOO-CV is that the computation is slow, because PCA must be performed on each matrix after object deletion. Fast algorithms have been described that greatly improve the speed of calculation.
Another way to improve the speed is the use of leverage-corrected residuals [95, 61, 111], where the leave-one-out cross-validated residuals are replaced by the fitted residuals from the least-squares model, corrected by the leverage value, using the equation:

\hat{e}_{i(r)} = \frac{e_{i(r)}}{1 - h_{i(r)}}

where e_{i(r)} is the residual obtained for object i after fitting r factors with the least-squares model, h_{i(r)} is the leverage value for object i after fitting r factors, and \hat{e}_{i(r)} is the corresponding predicted residual when r factors are used in the leave-one-out cross-validation of the model. This is fast compared to complete cross-validation, because only one singular value decomposition of the data matrix is needed. In the absence of outliers, the results from leave-one-out cross-validation and leverage correction are similar. However, leverage correction should only be employed as a quick-and-dirty method, and the results must be confirmed later with another method.
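For an ordinary least-squares model the leverage correction reproduces the leave-one-out PRESS exactly; for PCR, where the PCA is normally recomputed after each deletion, it is only an approximation. A sketch for the least-squares case (the function name is ours):

```python
import numpy as np

def leverage_corrected_press(X, y):
    """Approximate the LOO-CV PRESS from a single least-squares fit:
    each fitted residual e_i is inflated to e_i / (1 - h_i), where
    h_i is the leverage (diagonal of the hat matrix)."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    # Hat-matrix diagonal: h_i = x_i (X'X)^{-1} x_i'
    h = np.einsum('ij,jk,ik->i', X, np.linalg.pinv(X.T @ X), X)
    return float(np.sum((e / (1.0 - h)) ** 2))
```

Only one decomposition of X is needed, instead of n separate refits as in full leave-one-out cross-validation.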
d. Repeated random splitting (repeated evaluation set method, RES). The procedure described in (a) is repeated many times, so that by the end of the validation procedure each object has, we hope, been in the test set several times with different companions. Stable results are obtained only after many repetitions (even hundreds). To obtain a good picture of the prediction error, both low and high percentages of objects in the evaluation set should be used.
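The repeated evaluation set method can be sketched as follows, again with an ordinary least-squares fit standing in for the model under evaluation and an averaged RMSEP as the summary statistic (both choices, and the function name, are ours):

```python
import numpy as np

def repeated_split_rmsep(X, y, test_fraction=0.3, n_repeats=200, seed=0):
    """Average RMSEP over many random training/evaluation splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = max(1, int(round(test_fraction * n)))
    rmseps = []
    for _ in range(n_repeats):
        perm = rng.permutation(n)              # a fresh random split each time
        test, train = perm[:n_test], perm[n_test:]
        b, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        e = X[test] @ b - y[test]
        rmseps.append(np.sqrt(np.mean(e ** 2)))
    return float(np.mean(rmseps))
```

Running this for several values of `test_fraction` gives the picture of the prediction error at both low and high evaluation-set percentages that the text recommends.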