Performance Modelling in Forest Operations Through Partial Least Square Regression

Partial Least Square (PLS) regression is a recent soft-modelling technique that generalizes and combines features from principal component analysis (PCA) and multiple regression. It is particularly useful when predicting one or more dependent variables from a large set of independent variables that are often collinear. The authors compared the potential of PLS regression and ordinary linear regression for accurate modelling of forest work, with special reference to wood chipping, wood extraction and the continuous harvesting of short rotation coppice. Compared to linear regression, PLS regression produced models that better fit the original data. What is more, it handled collinear variables, facilitating the extraction of sound models from large amounts of field data obtained from commercial forest operations. On the other hand, PLS regression analysis is not as easy to conduct, and produces models that are less user-friendly. By producing alternative models, PLS regression may provide additional, rather than alternative, ways of reading the data. Ideally, a comprehensive data analysis would include both ordinary and PLS regression and build on both sets of results to gain a better understanding of the phenomenon under examination. Furthermore, the computational complexity of PLS regression may stimulate interdisciplinary team-building, to the greater benefit of scientific research within the field of forest operations.


Introduction
Performance studies in forest operations often produce empirical models used for many purposes, including wood-flow planning, harvesting cost calculation and work rate setting (Björheden 1988). At a more fundamental level, performance studies also help to understand the behaviour of harvesting machines and systems under varying stand and terrain conditions (Visser and Spinelli 2011). That is particularly important when deploying specialised industrial technology (Chiorescu and Grönlund 2001), which is more expensive and less flexible than traditional general-purpose equipment (Spinelli and Magagnotti 2011a).
Empirical performance models are generally developed by collecting field data and testing the statistical significance of any relationships with regression analysis (Samset 1990). The most commonly used regression type is ordinary least square linear regression (OLS). This technique is used to "calculate" an equation capable of representing the relationship between a dependent variable (typically time consumption or productivity) and one or more independent variables (Bergstrand 1987). Indicator (dummy) variables are often used to include influencing factors that assume discrete rather than continuous values (Olsen et al. 1998).
A fundamental assumption of ordinary least square linear regression is that variables are independent, and not collinear (Freedman et al. 2007). That can be obtained through a strict experimental design, carefully planned before data collection and, if necessary, supplemented as work proceeds (Howard 1989). However, a large number of variables can impact the performance of forest machines, including piece size (Nakagawa et al. 2010), stocking density and thinning intensity (Eliasson 1999), type of cut and total volume (Suadicani and Fjeld 2001) and terrain characteristics (Visser and Stampfer 1998). Further variation is introduced by the widely varying skills of both machine operators (Ovaskainen et al. 2004) and researchers (Nuutinen et al. 2008). To overcome such variation, productivity models should be based on large samples (Nurminen et al. 2006). Bergstrand (1987) estimates that about 400 operators should be included in each performance study in order to detect differences between groups at a 95% confidence level.
That explains why it is so difficult and expensive to implement a strict experimental design when developing an empirical performance model (Spinelli et al. 2011). The large samples needed to obtain a reliable general model are often assembled by studying commercial operations, which makes it difficult to implement a controlled study design (Spinelli and Magagnotti 2009). As a result, variables are often collinear and most such studies can estimate primary effects only, while missing secondary effects (Spinelli et al. 2010).
Hence the interest in exploring alternatives to ordinary linear regression (OLS), such as multivariate predictive modelling based on the recombination of principal components (Principal Component Regression, PCR) or latent variables (Partial Least Square, PLS). Different authors (Nsofor 2006) observed that in many cases the PLS approach returns better results than PCR, including a better goodness-of-fit and a more robust model. PCR is a multivariate method where a multiple linear regression is performed on the Principal Component Analysis scores. In contrast, PLS is a soft-modelling technique, i.e. it has "soft" distributional assumptions (Pulos and Rogness 1995) and can be used when distributions are highly skewed (Bagozzi and Yi 1994). PLS finds a linear regression model by projecting the predicted variables and the observed variables to a new space (projection to latent structures) that is component-based rather than covariance-based. PLS is particularly useful when predicting one or more dependent variables from a large set of independent variables that are often collinear. This technique originated within the field of economics (Wold 1966) but became popular first in computational chemistry (Geladi and Kowalski 1986) and then in sensory evaluation (Martens and Naes 1989). Today PLS regression is becoming a tool of choice in the social sciences, as a multivariate technique for non-experimental and experimental data alike (Mcintosh et al. 1996, Costa et al. 2010, Capoccioni et al. 2011). PLS regression was first presented as an algorithm akin to the power method used for computing eigenvectors and was rapidly interpreted in a statistical framework (Frank and Friedman 1993, Helland 1990, Hoskuldsson 1988, Abdi 2003).
The goal of this study was to explore the potential of multivariate approaches other than OLS (i.e. PCR and PLS) when developing forest operation performance models, focusing mainly on PLS regression. In particular, the study aimed at comparing the main statistical significance indicators associated with models calculated with OLS, PCR and PLS regression from the same original datasets, for the purpose of quantifying any improvements obtained with the new techniques.

Materials and Methods
Three datasets were selected for comparing the three regression techniques: OLS, PCR and PLS. These datasets represented a wide variety of forest operations, with clearly different characteristics and influencing factors. The same datasets had already been used for published modelling studies. Dataset 1 concerned chipping whole trees, logs and forest residues with mobile chippers, and was used to estimate chipping time in min ton⁻¹ as a function of eleven independent variables (Spinelli and Hartsough 2001). Dataset 2 concerned the skidding of whole trees, delimbed stems and logs with forestry-fitted farm tractors, and was used to estimate productivity in m³ hour⁻¹ as a function of ten independent variables (Spinelli and Magagnotti 2011b). Dataset 3 concerned harvesting short rotation coppice with modified foragers, and was used to estimate the forager harvesting rate in min km⁻¹ as a function of five independent variables (Spinelli et al. 2009). The complete list of dependent and independent variables is shown in Table 1.
In order to determine the most robust PCR and PLS models in terms of reducing overfitting in prediction, each dataset was partitioned into two subsets: 80% of the observations were used to estimate the model, and 20% were reserved for the independent validation tests. Partitioning is one of the most reliable and advanced approaches to validate models and correct overfitting, and is directly linked with model robustness. The partitioning algorithm used was SPXY (Harrop Galvao et al. 2005, Antonucci et al. 2011). This algorithm accounts for the variability of both the dependent and independent variables, constituting the Y-block and the X-block, respectively. This procedure was not used for the ordinary models, which had been previously published as calculated from the whole dataset without any partitioning. Hence, calculating them again after partitioning would have generated a result inconsistent with the published formulations.
Table 1 (Dataset 3 entries). Y-block variable: Harvest rate (min km⁻¹). X-block variables: Stocking (t ha⁻¹); Forager Power (kW); Header (HS2, GBE); Row System (twin-, single-row); Stocking (t km⁻¹). Note: underlined variables are also significant to the model obtained through ordinary regression.
For the purpose of both PCR and PLS regression analyses, the X-blocks from Datasets 1 and 2 were transformed by column centering (the 'mean center' procedure), while the X-block from Dataset 3 was transformed by column normalization ('autoscale', i.e. mean centering followed by division by the standard deviation). All the Y-blocks were transformed using the 'autoscale' procedure.
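The pre-treatment and partitioning steps described above can be illustrated with a minimal Python sketch. It is not the PLS Toolbox implementation used in the study: the function names (`mean_center`, `autoscale`, `spxy_split`) are hypothetical, the sample standard deviation is an assumption, and the split follows the general SPXY idea of a Kennard-Stone selection on the sum of normalised X and Y distances. It assumes that neither block is constant.

```python
import numpy as np

def mean_center(X):
    """Subtract each column mean ('mean center')."""
    X = np.asarray(X, dtype=float)
    return X - X.mean(axis=0)

def autoscale(X):
    """Mean-center each column, then divide by its sample standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def _pairwise(A):
    """Euclidean distance matrix between the rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def spxy_split(X, y, calibration_fraction=0.8):
    """Return (calibration_idx, validation_idx) as lists of row indices."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(len(X), -1)
    dx, dy = _pairwise(X), _pairwise(y)
    d = dx / dx.max() + dy / dy.max()          # joint normalised X-Y distance
    n = len(X)
    n_cal = int(round(calibration_fraction * n))
    # Start from the two most distant samples.
    i, j = np.unravel_index(np.argmax(d), d.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_cal:
        # Distance from each remaining sample to its nearest selected sample.
        nearest = d[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)
    return selected, remaining
```

With ten observations and `calibration_fraction=0.8`, the function returns eight calibration indices and two validation indices; the samples that span the extremes of the data tend to end up in the calibration set.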
PCR is a three-step multivariate method. In the first step, a Principal Component Analysis (PCA) of the data matrix is performed and the measured variables are converted into new ones (scores on latent variables). In the second step, the Principal Components relevant to the prediction model are selected on the basis of the highest goodness of fit in the validation phase. In the third step, a multiple linear regression is performed between the scores obtained in the PCA (first step) and the response variable to be modelled (De Maesschalck et al. 1999). Because it directly addresses the collinearity problem, PCR can be said to be less susceptible to overfitting than Multiple Linear Regression (MLR).
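The three PCR steps can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the routine used in the study: the function names are hypothetical, the PCA is computed through a singular value decomposition of the mean-centred X-block, and the selection criterion (lowest RMSE on the validation subset) is one assumed reading of "highest goodness of fit in the validation phase".

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Fit PCR and return (coef, intercept) expressed on the original X scale."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    # Step 1: PCA of the centred data matrix via SVD.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Step 2: keep the first n_components principal components.
    V = Vt[:n_components].T            # loadings, m x k
    T = Xc @ V                         # scores, n x k
    # Step 3: multiple linear regression of y on the scores.
    gamma, *_ = np.linalg.lstsq(T, y - y_mean, rcond=None)
    coef = V @ gamma                   # back-transform to the X variables
    intercept = y_mean - x_mean @ coef
    return coef, intercept

def pcr_predict(X, coef, intercept):
    return np.asarray(X, dtype=float) @ coef + intercept

def select_components(X_cal, y_cal, X_val, y_val):
    """Pick the number of components with the lowest validation RMSE."""
    X_cal = np.asarray(X_cal, dtype=float)
    y_val = np.asarray(y_val, dtype=float)
    best_k, best_rmse = 1, np.inf
    for k in range(1, X_cal.shape[1] + 1):
        coef, intercept = pcr_fit(X_cal, y_cal, k)
        resid = pcr_predict(X_val, coef, intercept) - y_val
        rmse = np.sqrt(np.mean(resid ** 2))
        if rmse < best_rmse:
            best_k, best_rmse = k, rmse
    return best_k
```

When all components are retained and X has full column rank, this procedure reproduces the ordinary least squares solution; the benefit with collinear data comes from dropping the low-variance components.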
PLS is used to find the fundamental relations between two matrices (X and Y) and represents a latent variable approach to modelling the covariance structures in these two spaces. A PLS model will try to find the multidimensional direction in the X space that explains the maximum multidimensional variance direction in the Y space.
The general underlying model of multivariate PLS is:
$$X = T P^{\top} + E$$
$$Y = U Q^{\top} + F$$
where X is an n × m matrix of predictors and Y is an n × p matrix of responses; T and U are n × l matrices that are, respectively, projections of X (the X score, component or factor matrix) and projections of Y (the Y scores); P and Q are, respectively, m × l and p × l orthogonal loading matrices; and matrices E and F are the error terms, assumed to be i.i.d. normal. The decompositions of X and Y are made so as to maximize the covariance of T and U.
A number of variants of PLS exist for estimating the factor and loading matrices T, P and Q. In this study we used the SIMPLS algorithm (De Jong 1993), which is equal to PLS1 for univariate y and constructs estimates of the linear regression between X and Y as (B and B_0 are parameters):
$$Y = X B + B_0$$
Both PCR and PLS were computed using PLS Toolbox 6.2 (Eigenvector Research) for Matlab 7.1. The programme also calculated residual error indicators, such as the root mean square errors in calibration (RMSEC) and in validation (RMSECV). The predictive ability of the model was partially dependent on the number of latent vectors used and was assessed through the following statistical indicators: root mean square error (RMSE), standard error of prediction (SEP), correlation coefficient (r) and bias.
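As an illustration of how the parameters B and B_0 of the linear regression between X and Y can be obtained for a single response, the sketch below implements the classical NIPALS form of PLS1 in NumPy. This is a swapped-in technique, not the SIMPLS routine of the PLS Toolbox used in the study. The function name `pls1_fit` is hypothetical, and only mean centering is applied inside the function.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """NIPALS PLS1: return (B, B0) such that y is approximated by X @ B + B0."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E = X - x_mean                      # X residual, deflated at each step
    f = y - y_mean                      # y residual, deflated at each step
    m = X.shape[1]
    W = np.zeros((m, n_components))     # weight vectors
    P = np.zeros((m, n_components))     # X loadings
    q = np.zeros(n_components)          # y loadings
    for a in range(n_components):
        w = E.T @ f                     # direction of maximum covariance with y
        w /= np.linalg.norm(w)
        t = E @ w                       # scores on this latent vector
        tt = t @ t
        p = E.T @ t / tt
        q[a] = f @ t / tt
        E = E - np.outer(t, p)          # remove the explained part of X
        f = f - q[a] * t                # remove the explained part of y
        W[:, a], P[:, a] = w, p
    B = W @ np.linalg.solve(P.T @ W, q)
    B0 = y_mean - x_mean @ B
    return B, B0
```

Predictions for new observations are then `X_new @ B + B0`. With as many latent vectors as there are (non-collinear) X-variables, the result coincides with ordinary least squares; using fewer latent vectors is what stabilises the model when the X-variables are collinear.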
Finally, the programme calculated the ratio of percentage deviation (RPD), which is the ratio of the standard deviation of the measured data to the RMSE (Williams 1987). This represents the factor by which the prediction accuracy has been increased compared with using the mean of the original data. Generally, a good predictive model should have high values for r and low values for RMSE, SEP and bias. The model chosen was the one with the number of LVs (Latent Vectors) that yielded the highest r, the minimum SEP for predicted and known Y-block and the maximum RPD. RPD values were classified as follows: RPD < 1.0 for a very poor model, whose use is not recommended; RPD between 1.0 and 1.4 for a poor model where only high and low values can be separated; RPD between 1.4 and 1.8 for a fair model that may be used for assessment and correlation; RPD between 1.8 and 2.0 for a good model able to produce quantitative predictions; RPD between 2.0 and 2.5 for a very good quantitative model; and RPD > 2.5 for an excellent model, highly accurate and reliable (Viscarra Rossel et al. 2007). The main differences between OLS, PCR and PLS are summarized in Table 2.
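The indicators and the RPD classes listed above can be computed with a short Python sketch. Some details are assumptions rather than statements from the study: SEP is taken here as the bias-corrected standard deviation of the residuals, the sample standard deviation is used for RPD, and values falling exactly on a class boundary are assigned to the upper class. The function names are hypothetical.

```python
import math
import statistics

def fit_indicators(observed, predicted):
    """Return bias, RMSE, SEP, r and RPD for paired observed/predicted values."""
    n = len(observed)
    residuals = [p - o for o, p in zip(observed, predicted)]
    bias = sum(residuals) / n
    rmse = math.sqrt(sum(e * e for e in residuals) / n)
    # Assumption: SEP is the standard deviation of the bias-corrected residuals.
    sep = math.sqrt(sum((e - bias) ** 2 for e in residuals) / (n - 1))
    # Pearson correlation coefficient between observed and predicted values.
    mo, mp = statistics.mean(observed), statistics.mean(predicted)
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    so = math.sqrt(sum((o - mo) ** 2 for o in observed))
    sp = math.sqrt(sum((p - mp) ** 2 for p in predicted))
    r = cov / (so * sp)
    # RPD: standard deviation of the measured data divided by the RMSE.
    rpd = statistics.stdev(observed) / rmse if rmse > 0 else math.inf
    return {"bias": bias, "rmse": rmse, "sep": sep, "r": r, "rpd": rpd}

def classify_rpd(rpd):
    """Map an RPD value to the classes listed in the text."""
    if rpd < 1.0:
        return "very poor"
    if rpd < 1.4:
        return "poor"
    if rpd < 1.8:
        return "fair"
    if rpd < 2.0:
        return "good"
    if rpd <= 2.5:
        return "very good"
    return "excellent"
```

For example, predictions that are uniformly 0.5 units too high for observed values 1 to 5 give a bias of 0.5, an RMSE of 0.5 and an SEP of 0, which shows how SEP and RMSE separate systematic from random error.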

Results
The models generated through ordinary regression analysis are available in the original publications cited above, namely Spinelli and Hartsough (2001), Spinelli and Magagnotti (2011b) and Spinelli et al. (2009), for Datasets 1, 2 and 3, respectively. In contrast, the models generated through PCR and PLS regression analyses are rather complex to write out and are not reported here.
Table 3 shows the main indicators for the OLS, PCR and PLS regression models, for the three datasets tested. The number of independent variables used by the PCR and PLS regressions is 2 to 5 times higher than that used in OLS. Model error indicators (SEP and RMSE) are 20 to 40% lower for the PCR and PLS regression models, compared to the OLS ones. Moreover, r values are always higher for both PCR and PLS models, with an increment between 5 and 20% over OLS models. RPD is always higher for the PLS regression models. Based on the previously mentioned RPD classification, PLS regression allows the systematic upgrading of ordinary regression models: Model 1 goes from "very good" to "excellent", Model 2 from "fair" to "very good" and Model 3 from "poor" to "fair". The indicators for the validation tests are also encouraging, with the predicted values following quite closely the actual values in the subset reserved for independent validation.
PLS is the best performing model for Dataset 1 while for Datasets 2 and 3, with a reduced number of X variables, PCR and PLS converged to the same results.
The observed vs predicted Y values for the OLS and PLS models for the three datasets are reported in Figs. 1, 2 and 3, respectively.
Table 4 shows the relative contribution (loadings) of individual X-variables to each of the first three latent vectors of each PLS model.
Regarding Dataset 1, the variables with the highest contribution to the first LV (x-block 99.97%; y-block 0.08%) are the indicators for [...]
Finally, the first LV (x-block 76.01%; y-block 26.22%) of Dataset 3 receives the highest contribution from Stocking in t ha⁻¹ and Power in kW. Power is also the main contributor to the second LV (x-block 19.63%; y-block 1.96%). Stocking in t ha⁻¹ and t km⁻¹ give the highest contribution to the third LV (x-block 4.00%; y-block 3.44%).
Power and Stocking are the two main variables included in the model calculated with ordinary regression techniques. PCR loadings are not reported, as their values were similar to those of PLS.

Discussion
Although variable selection may reduce the error of a prediction model, it may also inadvertently discard useful redundancy. Using fewer variables to make a prediction means that each variable has a larger influence on the final result. Hence, one should carefully consider the requirements of the final model before variable selection. For this reason, we decided to use full-spectrum PCR and PLS models.
As observed by Nsofor (2006), PLS gives better results than PCR because its latent vectors maximize the correlation between the LVs and the Y variable (Table 2). Moreover, we observed that with fewer variables (Datasets 2 and 3) PCR and PLS models tend to offer the same results.
PLS regression analysis does offer some benefits over ordinary regression analysis (Lipp 1996). The substantial improvement of all goodness-of-fit indicators is probably the most visible benefit. Moreover, the benefit of the PLS regression technique is not merely the increase of a coefficient, but the capacity to detect significant variables otherwise missed with ordinary regression techniques (Costa et al. 2009). This is the advantage of latent vectors, which are capable of integrating the effect of several independent variables. A further advantage of PLS regression over multiple linear regression is in the definition of the new variables, which takes into account not only the values assumed by the X-variables but also their correlation with the dependent variables (Kresta 1992).
In this respect, it is most interesting to compare the X-variables included in the ordinary and PLS regression models obtained from the same datasets. In most cases (Datasets 2 and 3) the balance remains the same: the strongest variables in the ordinary regression model are also the strongest in the PLS regression model. Hence, PLS regression may have the capacity of drawing additional variables into the models, without radically changing their conceptual structure. That is most logical, because both model types still describe one single real-life phenomenon, and the phenomenon is bound to be the driver, not the model. The model describes the phenomenon, and regardless of how it does that, skidding still involves a machine dragging a load over a certain distance. Hence, machine pulling power, load size and distance are bound to have the strongest effect on skidding performance.
On the other hand, the same event can be seen from different angles, and different observers can choose different attributes to describe the same quality. That may explain why the PCR and PLS regression models underestimated the effect of piece size, which the ordinary regression model picked as one of its strongest independent variables. In contrast, PLS regression analysis selected piece attributes other than size. Hence, the new technique still detected the strong effect of piece characteristics, but chose different specific attributes for inclusion into the model. That is likely dependent on the capacity of PLS regression to handle collinear variables. Ordinary regression would pick one or the other, but the use of latent vectors in PLS regression makes it possible to select more than one attribute for the same characteristic, after weighing their contribution through pre-treatment.
When different variables are picked by different models, it is difficult to decide which model best represents the real phenomenon. Direct experience with the phenomenon and convenience should be the best guides, but they are highly subjective. In the specific case of Dataset 1, the choice would be between Size (ordinary regression model) and Species combined with Layout (PLS regression model). There are good reasons for defending both models. The effect of piece size on productivity is generalized and well known (Visser and Spinelli 2011). On the other hand, operator experience often hints at raw material layout as a main driver of chipping productivity. The distinctive effect of a given tree species can be related to different wood characteristics. In our case, poplar wood is indeed the softest wood type among those represented in the dataset. It can be argued that a model choosing size over layout and species is somewhat more flexible, as it may adapt to a wider number of situations. On the other hand, flexibility may tempt users into extrapolation, whereas a model is properly used only within the range set by the original data pool.
The larger number of X-variables included in the PCR and PLS regression models also warrants some comments. While this larger number guarantees a more accurate description of the phenomenon, it also requires a larger effort when gathering input data. Hence, PCR and PLS regression models may be less convenient to use than similar models calculated through ordinary regression. Furthermore, users may be somewhat less careful when collecting many input variables than when they need to collect fewer. Pressed by time constraints, they may settle for approximate values, rather than going all the way and obtaining accurate representative figures. In that case, the choice is between using fewer but better figures or more but approximate figures. Therefore, the larger modelling effort required by PLS regression analysis may be frustrated.
PCR and PLS regression analyses are not as easy to perform as OLS. The latter is easily available within any mainstream software package, including basic Excel. More sophisticated users may scorn the base Excel package and turn to R, or to any commercial statistical software, all of which rightly include comprehensive linear regression programmes. All researchers are familiar with ordinary least square regression analysis, and can quickly adopt the results published by their colleagues. In contrast, PCR and PLS regression analyses require specific packages, algorithms and skills that are not as readily available. The models themselves are somewhat less handy than standard regression equations. Nevertheless, PLS modelling, and more generally the advanced multivariate approach, is getting increasingly popular, because it is very robust and particularly suitable for modelling complex systems.
These very same reasons may justify the introduction of multivariate regression to forest work science. If its merits turned out to be so valuable, PLS regression would spread rapidly, and the sector would evolve from an older established technique to a new one, as has already happened before, when regression analysis was first introduced. At the moment, the most practical way to access PLS regression is probably to team up with researchers who already use it, building multidisciplinary work groups. This way, one may multiply the comparisons, and decide if, when and how PLS regression should replace ordinary least square regression.

Conclusions
Compared to OLS analysis, PCR and PLS regression analyses produce models that better fit the original data. What is more, they can handle collinear variables, facilitating the extraction of sound models from large amounts of field data obtained from commercial forest operations. This could lead to more robust models in terms of both variable oscillations and higher repeatability.
On the other hand, PCR and PLS regression analyses are not as easy to conduct, and produce models that are less user-friendly.
In fact, we believe that PCR and PLS regression analyses offer significant benefits in terms of theory-building, and that these benefits may far outweigh the strictly practical ones. By producing alternative models, PCR and PLS regression may provide additional, rather than alternative, ways of reading the data. Ideally, the analysis would include ordinary, PCR and PLS regression and build on all their results to gain a better understanding of the phenomenon under examination. By comparing the ways and the variables used by the different analyses to mirror the actual phenomenon, researchers could get a better understanding of it, which is the ultimate goal of any field study.
Furthermore, the computational complexity of PCR and PLS regression may stimulate interdisciplinary team-building, to the greater benefit of scientific research within the field of forest operations. Cross-pollination could generate new ideas, improve study methods and eventually accelerate scientific progress in this field.

Fig. 2. Dataset 2: observed vs predicted Y for the ordinary and PLS model.

Table 1. Variables used for the regression analysis.

Table 2. Principal features of OLS, PCR and PLS modelling techniques (modified from Nsofor 2006).

Table 3. Main goodness-of-fit indicators for the regression models.

Table 4. PLS models: X-variable loadings for each of the first 3 LVs (Latent Vectors).