Weighting Alternative Estimates when Using Multi-Source Auxiliary Data for Forest Inventory

Five auxiliary data sources (Landsat TM, IRS-1C, digitized aerial photographs, visual photo-interpretation and old forest compartment information) were examined in three study areas using three estimators: two-phase sampling with stratification, the k nearest neighbors estimator, and a regression estimator. Auxiliary data were given for a high number of sample plots, here called first phase sample plots. The plots were distributed using a systematic grid over the study areas. Some of the plots were then measured in the field for the necessary ground truth. Each auxiliary data source in combination with field sample information was applied to produce a specific estimate for each of five forest stand characteristics: mean diameter, mean height, age, basal area, and volume of the growing stock. When five auxiliary data sources were used, each stand characteristic and each first phase sample plot were supplied with five alternative estimates by each of the three alternative estimators. Mean square errors were then calculated for each alternative estimator using the cross validation method. The final estimates were produced by weighting the alternative estimates inversely according to the mean square errors related to the corresponding estimator. The result was more accurate than any of the single alternative estimates. The improvement over the best single estimate, as measured in mean square error, was 16.9 % on average for all five forest stand characteristics. The improvement was fairly equal for all five forest stand characteristics. Only minor differences among the accuracies of the three alternative estimators were recorded.


Introduction
The application of two phase sampling for combining remote sensing and field measured sample plot information has a history of almost 30 years in Finnish forest inventory (Poso and Kujala 1971, 1978, Mattila 1985, Poso et al. 1987, Kilkki and Päivinen 1987, Peng 1987, Tomppo 1993, Wang 1996). First, a large number of first phase sampling units in the inventory area are demarcated in a coordinate system, usually by systematic sampling with a square grid. Then each first phase unit is supplied with auxiliary data from one or more data sources, such as an aerial photo interpretation or a satellite image. The first phase sample units are stratified into maximally homogeneous strata on the basis of the auxiliary data. The auxiliary data are usually condensed using the principal component technique, and instead of the original auxiliary data, PC-values are used for stratification. K-means stratification has been commonly applied since Poso et al. (1987).
Stratification of the first phase sample units is used for drawing the second phase sample: field plots. Proportional or, if necessary, a more optimal allocation can be applied. Field plots are measured and ground truth is derived for each field plot. The ground truth needed for a field plot is a vector of forest stand characteristics.
There are numerous ways to derive estimates for the first phase sample plots. Poso and Kujala (1971) stratified the first phase plots into small and homogeneous strata of approximately similar size, in accordance with the idea that only one field plot is drawn for each stratum. Thus information on the field plot was easy to generalize for all first phase plots belonging to the same stratum. This resulted in unbiased estimates for each first phase sample plot and for the whole population (see Cochran 1963). However, it created problems in variance estimation. The variances within strata were estimated by grouping the strata whose variances could be expected to be fairly similar and by studying the variances among the field plots as such or as residuals after using regression (Poso and Kujala 1971, 1978).
Later, Poso et al. (1987) examined stratification by satellite imagery and drew some 3-10 field plots for each stratum. The average values of the field plots belonging to a specific stratum were used when generalizing the estimates for first phase plots of the specific stratum. Kilkki and Päivinen (1987) recommended the use of a "reference" plot corresponding to the n-nearest neighbor method where n equals 1. Tomppo (1993) applied the method with five nearest neighbors. Peng (1987) has applied the regression procedure. In this method, a regression model is constructed in which the auxiliary data of field plots are used as independent variables and the ground truth of the respective plots as dependent variables. The model is then applied to the first phase plots for derivation of forest stand variables. All the estimators result in formally complete data for each first phase plot, allowing a high degree of flexibility in calculating information for any desirable sub-population. The larger the area of a sub-population, the better the accuracy. For example, the experiments made at the Helsinki University Field Station, Hyytiälä, showed correlation coefficients of 0.62 between the estimated and field measured forest stand volumes (m³/ha) when the observation unit was a relascope sample plot, and 0.86 when the observation unit was a forest compartment with an average size of 1.2 ha. Accordingly, it can be concluded that root mean square errors for compartment estimates are some 65 % of those of plot estimates.
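The 65 % figure can be reproduced with a short calculation. The sketch below assumes, as an interpretation that the text does not spell out, that the error of an estimate is proportional to sqrt(1 - r²), where r is the correlation between estimated and field measured values, and that the proportionality factor is the same for plots and compartments. The function name `rmse_ratio` is illustrative only.

```python
import math


def rmse_ratio(r_plot: float, r_compartment: float) -> float:
    """Ratio of compartment-level to plot-level error implied by two correlations.

    Assumption: error is proportional to sqrt(1 - r**2) with a common scale.
    """
    return math.sqrt(1.0 - r_compartment ** 2) / math.sqrt(1.0 - r_plot ** 2)


# Correlation coefficients quoted in the text for Hyytiälä.
ratio = rmse_ratio(r_plot=0.62, r_compartment=0.86)
print(f"{ratio:.2f}")  # prints 0.65
```

Under these assumptions the two correlations of 0.62 and 0.86 imply the ratio of about 0.65 stated above.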
The method becomes more promising if there is more than one source of auxiliary data. Wang (1996) developed an expert system to find the best estimates from the many alternatives. A different approach would be to study the accuracy of each auxiliary data source-specific estimate and combine all estimates through a weighting procedure as suggested by Cochran (1963), among others.
The aim of this paper is to develop a methodology of estimating forest variables for conditions in which more than one auxiliary data source is available. The study will be made in the layout of two phase sampling as applied earlier by Poso and others (1971, 1978, 1987). It is supposed that the layout is feasible for both management planning inventories and large area inventories. The specific objective is to test the possibility of improving the inventory accuracy necessitated by forest resource management requirements, and to do this by using many auxiliary data sources independently and then combining the alternative estimates from different sources by a weighting procedure.

How to Use Auxiliary Data Sources
Auxiliary data can be defined as data which are easily available and correlated with the desirable information. Accordingly, auxiliary data for forestry can be obtained by remote sensing, topographic maps, and, if available, existing forest information which is not up to date enough to fulfil the requirements.
It is assumed here that the population of interest is defined by an equidistant grid producing a very large number of plots. Each plot can be supplied with information about desirable forest variables. Properties of forest population such as distributions, means, and standard deviations can be defined on the basis of the population units. It may be interesting to note that the population properties, especially distributions and variances, are not independent of the population definition, e.g., the size of field plot. This is in accordance with Shiver and Borders (1996, p. 7), who state that: "The population and the sample should be defined using the same elements or units".
Here, the first phase sample is defined in a coordinate system by a square grid the intensity of which is relevant to the objectives and forest conditions. The intensity has an effect on computer calculations but very little effect on cost. Intensity should be based on stand or compartment structure. A distance ranging from 20 to 50 m between the successive first phase plots has been used in this study.
First phase sample units are then supplied with first phase data, i.e., auxiliary data. This can be done in several different ways: by taking numerical values from the nearest pixel of satellite imagery, mean and texture derived from a window of nearest pixels from digitized aerial photographs, and ocular interpretation from stereoscopic photo-pairs. The essential steps taken after each first phase plot has been supplied with auxiliary data are as follows:
1. Stratification of the first phase plots into homogeneous strata on the basis of auxiliary data. Stratification was done separately for each auxiliary data source and in combination with various data sources. The original auxiliary data were transformed into principal component values in order to improve the stratification. A K-means program (Peng 1988) was applied. The parameter K can be given by the user and the program is able to divide the first phase plots into K strata. Simple Euclidean distance in the feature or principal component space was used as a minimizing criterion.
2. Drawing a second phase field sample. The objective was to draw those first phase plots which would optimally represent the population in all respects. This meant that proportional allocation was applied on the basis of using a combination of Landsat TM and IRS-1C satellite imagery.
3. Measuring field sample plots and deriving the necessary forest stand variables for each individual plot.
4. Generalizing field sample plot data to all first phase sample plots. This was done separately with each auxiliary data source and three alternative estimators. Thus the total number of estimates of a specific stand variable, e.g. volume, m³/ha, was the number of auxiliary data sources (5) multiplied by the number of estimators (3), i.e. 15 estimates. The estimator refers here to the procedure of generalizing field sample plot data to all first phase sample units.
5. Estimating MSE-values referring to the accuracy of plot estimates by an estimator-auxiliary data combination.
6. Deriving final estimates by weighting the alternative estimates by inverse MSE values.
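Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the K-means program of Peng (1988): it assumes the plot features are already principal component values, uses a plain Lloyd-type K-means with squared Euclidean distance, and rounds the proportional allocation by largest remainders. All function names, the synthetic data and the parameter values are hypothetical.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd-type K-means; returns a stratum label for every plot."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = None
    for _ in range(iters):
        # Assign every plot to its nearest stratum centre.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        # Move every centre to the mean of its member plots.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def proportional_allocation(labels, n_field):
    """Field plots per stratum, proportional to stratum size (largest remainders)."""
    sizes = Counter(labels)
    quotas = {s: n_field * size / len(labels) for s, size in sizes.items()}
    alloc = {s: int(q) for s, q in quotas.items()}
    leftover = n_field - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda s: quotas[s] - alloc[s], reverse=True)
    for s in by_remainder[:leftover]:
        alloc[s] += 1
    return alloc


def draw_field_sample(labels, alloc, seed=0):
    """Randomly draw the allocated number of first phase plots from each stratum."""
    rng = random.Random(seed)
    chosen = []
    for s, n_s in alloc.items():
        members = [i for i, lab in enumerate(labels) if lab == s]
        chosen.extend(rng.sample(members, n_s))
    return sorted(chosen)


# Synthetic first phase plots in a two-dimensional feature space.
rng = random.Random(1)
plots = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(60)]
plots += [(rng.gauss(8, 1), rng.gauss(8, 1)) for _ in range(40)]

labels = kmeans(plots, k=2)
alloc = proportional_allocation(labels, n_field=10)
sample = draw_field_sample(labels, alloc)
```

Here `sample` holds the indices of the first phase plots that would be visited in the field; a real application would replace the synthetic `plots` with the principal component values of each auxiliary data source.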
The abbreviations and descriptions of the three types of estimators applied in this study were:
1. STRAT. Plot i belonging to stratum k was supplied with the average values of the forest stand variables of the field plots belonging to stratum k (e.g. Poso et al. 1987).
2. KNN (k-nearest neighbors). In the feature space of the auxiliary data, usually with principal component values, the five nearest field plots were searched for each first phase plot by applying the equation
$$d_{ij}^2 = \sum_{k=1}^{n} d_{ijk}^2 / n,$$
where $d_{ij}$ refers to the Euclidean distance of plot i to be estimated from plot j, which is a potential nearest neighbor to plot i, and k refers to the dimension in the feature space (k = 1, 2, ..., n). The estimates were calculated as the average values of the forest stand variables of the k-nearest field plots. From 5 to 10 nearest neighbors, corresponding to the size of stratum, were applied. A similar type of method with n = 1 was first suggested by Kilkki and Päivinen (1987). Tomppo (1993) has used the method for national forest inventory purposes by weighting the k-nearest neighbors. Weighting of the field data with the distance in feature space was not regarded as feasible in this study.
3. REGR. The estimator is based on regression as described by Peng (1988) and Wang (1996). The following equation was applied:
$$\hat{y}_{kj} = b_{kj0} + \sum_{z=1}^{q} b_{kjz} x_{kjz},$$
where k = auxiliary data set, j = combination of sample groups (j = 1, ..., g, but not h), $\hat{y}_{kj}$ = estimate of the variable y using auxiliary data set k, $b_{kj0}$ = regression constant, $b_{kjz}$ = regression coefficient (z = 1, ..., q), q = number of auxiliary variables, $x_{kjz}$ = data value of the zth auxiliary variable, and h = the sample group the respective field plot belongs to.
For solving the regression model, the field information on the field plot to be estimated was disregarded. The field sample of m sample plots was divided into g sample groups with m/g field plots in a group. When a field plot was estimated, the regression model was solved on the basis of groups other than the one the field plot belonged to, i.e. on the basis of the field plots of the g - 1 sample groups.
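The STRAT and KNN estimators can be sketched in a few lines. This is a hedged illustration under simple assumptions: the feature vectors stand for principal component values, the neighbors are unweighted as in this study, and the distance is the mean squared coordinate difference given for KNN above. The function names and the small data set are hypothetical.

```python
from statistics import mean


def strat_estimate(stratum, field_strata, field_values):
    """STRAT: mean of the field plot values in the same stratum as the target plot."""
    return mean(v for s, v in zip(field_strata, field_values) if s == stratum)


def knn_estimate(x, field_features, field_values, k=5):
    """KNN: unweighted mean of the k field plots nearest to x in feature space."""
    n = len(x)

    def d2(a, b):
        # Squared distance averaged over the n feature dimensions.
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / n

    order = sorted(range(len(field_features)), key=lambda j: d2(x, field_features[j]))
    return mean(field_values[j] for j in order[:k])


# Hypothetical field plots: feature vectors, stratum labels and volumes (m³/ha).
field_features = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
field_strata = [0, 0, 0, 1, 1]
field_values = [100.0, 120.0, 110.0, 300.0, 320.0]

v_knn = knn_estimate((0.2, 0.1), field_features, field_values, k=3)
v_strat = strat_estimate(1, field_strata, field_values)
print(v_knn, v_strat)  # prints 110.0 310.0
```

Dividing by n does not change which neighbors are nearest; it is kept only to mirror the distance equation as written.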

Material and Methods
The three study areas (
The accuracy of old field data or compartment data was not checked. It can be assumed that the information about the volume of growing stock, m³/ha, is given with some 20-30 % accuracy. The field sample was drawn separately for each of the auxiliary data sources 1a, 1b, and 1c (in Study Area 1), and in all other cases according to proportional allocation based on a combined stratification that used the auxiliary data sources Landsat TM and IRS-1C. The purpose of stratification was to produce strata of about equal size using five to ten second phase sample units per stratum.
Five quantitative forest variables, mean diameter (D, cm), mean height (H, m), age (A, a), basal area (BA, m²/ha), and volume (V, m³/ha), were measured for each field plot and imputed for each first phase sample plot.
The old field data were taken from old (updated) forest compartment map information. The following attribute data were taken for each plot: 1) land use class, 2) forest site type, 3) development class, 4) basal area, 5) mean height, 6) mean diameter, and 7) age. The old field data were based on a compartment database which had been updated. This means that after drastic changes the compartments have been revisited and other compartments have been updated by growth models. The problem with the material is that digitization is based on aerial photos without ortho-projection, often resulting in errors in compartment boundaries. Thus, the attribute data for a sample plot may sometimes be taken from the wrong compartment.

Estimation with Weighting
The accuracy of estimates for first phase plots was examined separately for each combination of data sources, k (k = 1, 2, ..., 16, see Table 3), and estimator, j (j = 1, 2, 3). The equation of mean square error applied was:
$$\mathrm{MSE}_{kj} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_{kji} - y_i)^2 \qquad (1)$$
In the equation, $\hat{y}_{kji}$ refers to the estimate for field plot i, $y_i$ to the ground truth of plot i, and n to the number of field plots used. Cross-validation was applied to eliminate the effect of the ground truth of a plot on the estimate of the respective plot. In the case of the REGR method, the solution was based on the division of the material into eight groups of equal size as described earlier.
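The group-wise cross-validation used with REGR can be sketched as follows. This is a simplified illustration, not the procedure of Peng (1988): it assumes a single auxiliary variable (q = 1), assigns plots to the g groups in round-robin order, and averages the squared errors over all n field plots. The function names and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x with one auxiliary variable."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def grouped_cv_mse(xs, ys, g=8):
    """MSE where each plot is estimated from a model fitted to the other g - 1 groups."""
    n = len(xs)
    groups = [i % g for i in range(n)]  # assumed round-robin group assignment
    sq_err = 0.0
    for h in range(g):
        train = [i for i in range(n) if groups[i] != h]
        b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in range(n):
            if groups[i] == h:
                sq_err += (b0 + b1 * xs[i] - ys[i]) ** 2
    return sq_err / n


# Hypothetical data: an exactly linear relation and a perturbed one.
xs = [float(i) for i in range(16)]
ys_exact = [3.0 + 2.0 * x for x in xs]
ys_noisy = [y + (1.0 if i % 2 else -1.0) for i, y in enumerate(ys_exact)]

mse_exact = grouped_cv_mse(xs, ys_exact, g=8)
mse_noisy = grouped_cv_mse(xs, ys_noisy, g=8)
```

Because the plot being estimated never contributes to the fitted coefficients, `mse_noisy` reflects out-of-sample error, which is the purpose of the cross-validation described above.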
Weighting was applied for three sets of combinations of auxiliary data sources (Study area 1: 1a + 1b + 1c, Study area 2: 1d + 2 + 3 + 4 + 5, and Study area 3: 1d + 2 + 3 + 4 + 5). Thus, weighting resulted in a combination of three estimates in Study area 1, and five estimates in Study areas 2 and 3. The equation used for weighting was:
$$\hat{y}_i = \frac{\sum_k w_k \hat{y}_{ki}}{\sum_k w_k},$$
where $\hat{y}_i$ = combined estimate of plot i, $w_k$ = weight of the estimate obtained by data source k, and $\hat{y}_{ki}$ = estimate obtained by data source k.
It is common to use inverse values of error variances as weights (e.g., Cochran 1963). Here, it was assumed that the errors of the estimates from different auxiliary data sources are uncorrelated, and weights were calculated in two different options:
Option 1: $w_k = 1/\mathrm{MSE}_k$
Option 2: $w_k = 1/(\mathrm{MSE}_k - C)$
In Option 2, the MSE value was reduced by a certain quantity, C. This was included in the test because there are reasons to believe that the MSE values calculated by Equation (1) are too high. This is because there may be substantial random variation in the ground truth as measured for individual sample plots; sometimes "border trees" are included, sometimes excluded, depending on the location of the plot center. The smaller the plot, the higher the proportion of the random element in MSE. The effect of this random variation in the weighting process was studied by using weights 1/(MSE - C) as an alternative to simple 1/MSE. The values of C corresponded to experimental studies on the standard deviation of plot values within a homogeneous forest stand. According to Nyyssönen (1954), a relascope with basal area factor 1 m²/ha produced a coefficient of variation of 16 % when the basal area was measured for a forest stand. Here C was set to correspond to a coefficient of variation ranging from 5 to 16 %.
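The two weighting options can be illustrated with a small sketch. It assumes that the weighted estimate is normalized by the sum of the weights, which is needed when the weights are raw values of 1/MSE or 1/(MSE - C); the function names and all numbers are hypothetical. For uncorrelated, unbiased estimates, inverse-variance weights of this kind give the weighted mean with the smallest variance.

```python
def inverse_mse_weights(mse_values, c=0.0):
    """Weights 1 / (MSE_k - C); C = 0 gives the simple 1/MSE option."""
    if any(m - c <= 0 for m in mse_values):
        raise ValueError("C must be smaller than every MSE value")
    return [1.0 / (m - c) for m in mse_values]


def combine(estimates, weights):
    """Weighted mean of the alternative estimates for one plot."""
    return sum(w * y for w, y in zip(weights, estimates)) / sum(weights)


# Hypothetical case: two data sources estimating volume (m³/ha) for one plot.
mse_values = [400.0, 100.0]   # the second source is the more accurate one
estimates = [150.0, 200.0]

y_option1 = combine(estimates, inverse_mse_weights(mse_values))
y_option2 = combine(estimates, inverse_mse_weights(mse_values, c=50.0))
print(y_option1, y_option2)
```

In this example the simple option gives 190 and the reduced-MSE option about 193.75: subtracting C shifts relatively more weight to the source with the smaller MSE.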

Results
Weighting was tested separately for all three estimators (STRAT, KNN, and REGR) and for all the selected five forest variables. The estimates entering the weighting for plot i, $\hat{y}_{kji}$, are based on auxiliary data source k and estimator j. Accordingly, k × j estimates are produced for a plot. The best auxiliary data source, if cost is not considered, is the one with the smallest MSE. The quality of the auxiliary data sources is compared in Table 4, where a single auxiliary data source is used applying the STRAT method. Figure 1 presents the relative values of the MSEs of five alternative estimates and estimates based on weighting for all five forest variables. The reference value, 100, corresponds to the MSE of the best alternative auxiliary data source. If the best alternative source for a forest variable is not the same in each study area, the average reference value exceeds 100. This is true for mean height, H, and basal area, BA.
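The relative scale, and the reason why an averaged reference value can exceed 100, can be shown with a short sketch. It assumes that relative values are first computed within each study area and then averaged over the areas, which is an interpretation of the text; the source names and MSE values are hypothetical.

```python
def relative_mse(mse_by_source):
    """Scale MSE values so that the best (smallest) one equals 100."""
    best = min(mse_by_source.values())
    return {name: 100.0 * m / best for name, m in mse_by_source.items()}


# Hypothetical MSE values: a different source is best in each study area.
area_a = relative_mse({"source_1": 400.0, "source_2": 480.0})
area_b = relative_mse({"source_1": 550.0, "source_2": 500.0})

# Average the relative values of each source over the two areas.
average = {name: (area_a[name] + area_b[name]) / 2 for name in area_a}
print(average)  # prints {'source_1': 105.0, 'source_2': 110.0}
```

Each area has a source at exactly 100, but because it is not the same source in both, no averaged value stays at 100.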
Auxiliary data sources 1a, 1b, and 1c (see Table 3) correspond to a situation in which the auxiliary data originate from the same type of source but from different dates. Figure 2 illustrates the use of Landsat material from different dates. Value 100 refers to the MSE of the best auxiliary data source when estimator STRAT was applied.
The three alternative estimators were examined with each auxiliary data source separately and in the weighting mode. The relative MSEs are given in Table 5.

Discussion
A comparison of the quality of alternative auxiliary data sources, when costs are not taken into consideration, shows some variability depending on the forest stand variable to be studied. For mean diameter, mean height and age, the two best sources were old compartment field data and visual photo interpretation. The third best were digitized aerial photos (no calibration was applied). The fourth and fifth best were Landsat TM and IRS-1C. IRS was a panchromatic image with one wavelength band, for which the DNs of the nearest pixel, the mean of a 5 by 5 window, and the standard deviation of the window were applied. For age, the best source was old compartment field data.
For basal area and volume the order of the sources was 1. visual interpretation of photos, 2. IRS-1C, 3-4. old forest information together with digitized aerial photos, and 5. Landsat TM.
Visual interpretation of individual plots led to better results than visual interpretation of compartments. As different persons did the interpretation, the differences may also have been due to personal skill. Neither of the photo interpreters had much experience and working conditions were fairly primitive; contact prints of 1:31000 with a digital orthophoto, as seen on a computer monitor, were used in Study areas 2 and 3. In addition, lens stereoscopes and a parallax bar were available and used to some extent for plot interpretation.
Because of the nature of the ground truth, the differences between the MSE-values of separate auxiliary data sources are probably underrated. Often fairly small sample plots with a relatively high number of "border trees" are measured. This means that there may exist substantial random variation in the ground truth, thus increasing the error variance, particularly in Study area 1. To study this effect, a constant referring to random variation within a homogeneous stand and related to the size and type of sampling unit was subtracted from the MSE when weights were calculated. This constant corresponded to a coefficient of variation ranging from 5 to 16 %, as based on the study by Nyyssönen (1954). The reduced MSE-weights improved the accuracy of the final estimates only very little.
Weighting of alternative estimates improved the accuracy in all experiments. This was true for the five different types of auxiliary data sources and multi-temporal Landsat TM material as well as for all five forest stand variables.
The average decrease in MSE obtained by weighting was 17 % when compared with the best estimation based on one alternative auxiliary data source for all five forest stand variables. Basal area and volume showed the highest decrease (21 %) and age the lowest (8.5 %). The corresponding decrease in MSE in the case of multi-temporal Landsat TM material (1985, 1989, and 1992) in Study area 1 averaged 15 % for all five forest stand variables.
The above conclusions are based on the application of Estimator STRAT. If 100 signifies the MSE-values of all alternative estimates by STRAT, the corresponding figure for Estimator
The differences between estimators STRAT, KNN, and REGR are fairly small, and more detailed conclusions would require more detailed analyses. Distinct differences, however, can be recognized between the estimators which are or are not based on weighting.
The forest stand estimates for first phase plots obtained with the best combination of five auxiliary data sources and field sample plot measurements cannot be regarded as fulfilling the common quality requirements of forest management planning in Finland. The relative root mean square errors were 44 % for mean diameter, 39 % for mean height, 48 % for age, 44 % for basal area and 58 % for volume. These percentages were almost equal for Study areas 2 and 3 even though the areas were rather different. The accuracy for compartments, 1-2 ha in size, would be better: some 65 % of the root mean square errors of sample plots. Estimation of forest stand variables such as tree species distribution, timber quality, site and biodiversity is difficult without a geographical or spatial connection of field observations and nearby first phase sample plots.
The quality of estimation would be improved if cross validation were not applied and field plot information were used directly for estimating the respective and nearby plots, plots belonging to the same compartment as the field plot. The dependence of the quality on a number of field observations calls for further studies as well as the methodology to acquire field information.
Other helpful approaches would be: improving the quality of auxiliary data sources, using difference images, and applying expert systems.

Fig. 1. Relative accuracy of average alternative and weighted estimates for five auxiliary data sources in Study Areas 2 and 3.

Fig. 2. Relative accuracy of specific and weighted estimates in the case of three multi-temporal Landsat TM images. The MSE of the best specific estimate is marked by value 100.0. (Study Area 1)

Table 3. Combinations of data sources.

Table 4. Comparison of auxiliary data sources by RMSE of Study areas 2 and 3 when estimator STRAT has been applied. The best auxiliary data sources are indicated in bold.

Table 5. Effect of estimator on MSE. All auxiliary data sources and three study areas are included in the frame of calculating arithmetic means.

The new order in the performance of estimators, however, can be explained by the difference in their nature. Plot errors by STRAT and alternative auxiliary data sources are probably less correlated than in the case of other estimators. The comparison of estimators with all auxiliary data sources leads to the following set up: