Imputing Mean Annual Change to Estimate Current Forest Attributes

When a temporal trend in forest conditions is present, standard estimates from paneled forest inventories can be biased. Thus methods that use more recent remote sensing data to improve estimates are desired. Paneled inventory data from national forests in Oregon and Washington, U.S.A., were used to explore three nearest neighbor imputation methods to estimate mean annual change of four forest attributes (basal area/ha, stems/ha, volume/ha, biomass/ha). The randomForest imputation method outperformed the other imputation approaches in terms of root mean square error. The imputed mean annual change was used to project all panels to a common point in time by multiplying the mean annual change with the length of the growth period between measurements and adding the change estimate to the previously observed measurements of the four forest attributes. The resulting estimates of the mean of the forest attributes at the current point in time outperformed the estimates obtained from the national standard estimator.


Introduction
Keeping national inventories of forests updated to reflect current conditions poses substantial logistical and accuracy issues (Gillis and Leckie 1996).In paneled inventory systems the plot network is systematically divided into panels with rotating panels being measured on an annual basis (Reams et al. 2005); in a 10 year panel inventory, for example, one-tenth of all plots are measured each year and plots of each panel are remeasured every 10 years.In theory, a paneled inventory system can provide current information when only the most recent panel is used (Reams and Van Deusen 1999), but in practice, most forestry applications are at a spatial scale that require combining field plots from multiple years to achieve sufficient information (McRoberts 2001, Tomppo et al. 2008).However, combining field plots from multiple years to estimate current conditions can cause a lag bias when forest conditions are changing over time.
In the United States (US), the national inventory of forests is collected with a panel system by the Forest Inventory and Analysis (FIA) program of the US Forest Service.The FIA default estimator is a moving average (MA) approach (Bechtold and Patterson 2005) which is known to result in biased estimates when trend is present (Van Deusen 2002).In the western US, the problem of lag bias is exacerbated by a relatively long remeasurement interval (10 years), shifts in forest management in response to altered economic and social conditions, changing climate, and a high but variable disturbance rate from wildfire, disease, and insects.Thus there is interest in using remote sensing information to reduce lag bias in estimates of current forest conditions.Combining remote sensing and other ancillary data with field plots has become common for improving forest inventory information (Tomppo et al. 2008).
Combining field and remote sensing data, nearest neighbor (NN) imputation is often used for mapping (Finley and McRoberts 2008, Katila and Tomppo 2002, Ohmann and Gregory 2002), and these methods typically impute point-in-time plot attributes.In contrast, when using imputation to update paneled inventory data, it is possible to impute mean annual change (MAC) for a plot.For example, Arner et al. (2004) estimated mean annual net volume change using MA approaches and sampling with partial replacement approaches and McRoberts (2001) imputed the difference in basal area between two measurements to plots with missing measurements to update basal area for plots measured in previous years to the current point in time.There are few studies in the western US that examine alternatives to the MA (e.g., Eskelson et al. 2009) and none that we are aware of that impute MAC rather than point-intime measurements.
The objectives of this study are to: 1) use paneled data from the Pacific Northwest (PNW) to estimate mean annual change (MAC) of forest attributes using three nearest neighbor imputation methods; and 2) to estimate current forest attributes from paneled inventory data by updating the most recent measurement with imputed MAC.The results are compared with the estimates obtained from the MA estimator and the data from the current panel.

Data
In this study, 618 primary sampling units (PSU) from six national forests that were collected as part of the Pacific Northwest Region's Current Vegetation Survey (CVS) of the US Forest Service were used.The particular national forests, sampled between 1993 and 1997 (occasion 1 measurement) and remeasured in 2000, were the Colville (28), Mt. Hood (111), Ochoco (82), Rogue River (70), , and Winema (128) (Fig. 1).
Panel data is a special case of inventory data with measurements taken at different times.A panel system was imitated with the available data by randomly assigning 25% of the PSUs to panel 4 (P4; 154 PSUs) and the remaining 75% of the PSUs (464 PSUs) to panel 1 (P1), panel 2 (P2), and panel 3 (P3) based on their year of occasion 1 measurement (Table 1).
In the CVS inventory, circular PSUs (1 ha in size) are established on a regular grid with square spacing (5.47 km).Each circular PSU is subsampled by a satellite system of five second-ary sampling units (SSU).Each SSU consists of three fi xed-area circular, nested plots with radii of 3.6 m, 7.3 m, and 15.6 m in which trees with diameter at breast height (DBH in cm) smaller than 12.7 cm, with 12.7 cm ≤ DBH > 33 cm, and DBH greater or equal to 33 cm are measured, respectively.For a detailed description of the CVS inventory see Max et al. (1996).Live trees with DBH of 12.7 cm or larger were used in this study and height models developed in Barrett (2006) were employed to fi ll in missing heights (HT in m).Gross cubic-meter volume and total gross oven dry weight biomass were calculated with volume and biomass equations from the US Forest Service (USDA 2000).For each PSU, basal area in m 2 per ha (BA), stems per ha (SPH), volume in m 3 per ha (VOL), and biomass in tons per ha (BIOT) were calculated and summarized (Table 2).MAC for BA, SPH, VOL, and BIOT were calculated by dividing the difference of the observed values in 2000 and the observed value at the occasion 1 measurement by the growth period  length (GPL) between the two measurements.Thematic Mapper (TM) images from 2000 from the national land cover database 2001 (USDI / USDA 2009) were used as ancillary data.This data set includes normalized imagery bands 1 to 5 and band 7 (TM1, TM2, TM3, TM4, TM5, TM7), three commonly used band ratios (band 4 to band 3, band 5 to band 4, and band 5 to band 7), the normalized difference vegetation index (NDVI), the Tasseled Cap (TC) transformations of the 6 axes (Kauth and Thomas 1976), and a model of percent canopy cover.Normalization methods and development of derived layers are described in Homer et al. (2004).Values for each plot were calculated by overlaying the TM attributes, which had a 30 m resolution, with the center points of the five SSUs and calculating a mean of the five values.
In addition to the satellite data, the following climate and topography variables were used as ancillary data (Table 2): Natural logarithm of annual precipitation (mm) and mean annual temperature (°C) [Data source: DAYMET Daily Surface Weather Data and Climatological Summaries (Thornton et al. 1997, Thornton andRunning 1999)], Elevation (EL in m) and its transformations EL 2 and ln(EL) [Data source: CVS inventory], slope (%), and the transformations cosine(aspect) * slope and sine(aspect) * slope [Data source: 30 m digital elevation model using Arc Workstation GRID surface functions and commands (Environmental Systems Research Institute 1991)].

Nearest Neighbor Imputation
Nearest neighbor (NN) imputation methods are donor-based methods.Variables of interest are those forest attributes that are only measured on a subset of PSUs (e.g., MAC of BA, SPH, VOL, and BIOT).Ancillary variables are the attributes that are measured on all PSUs.In this study, satellite, climate, and topography data as well as the most recent measurements of BA, SPH, VOL, and BIOT, which were taken at measurement occasion 1 (between 1993 and 1997), constitute the available ancillary data.Reference data are the PSUs for which both variables of interest and ancillary variables are available (PSUs in P4).Target data are the PSUs for which only the ancillary variables are available (PSUs in P1, P2, and P3).The reference PSUs constitute the pool of potential PSUs which could be selected to impute the MAC data for the target PSUs.
The most similar neighbor (MSN) method (Moeur and Stage 1995), the gradient nearest neighbor (GNN) method (Ohmann and Gregory 2002), and the randomForest (RF) method (Crookston and Finley 2008) have been shown to provide reasonable imputation results for forest attributes (Moeur and Stage 1995, Temesgen et al. 2003, LeMay and Temesgen 2005, Hudak et al. 2008) and for mapping forest composition and structure (Ohmann andGregory 2002, Ohmann et al. 2007).MSN, GNN, and RF were conducted using the yaImpute R package version 1.0-9 (Crookston and Finley 2008).The similarity between reference and target PSUs is defined using a weighted Euclidean distance for MSN and GNN: where W is the weight matrix, X i is a vector of standardized values of the ancillary variables for the ith target PSU; and X j is a vector of standardized values of ancillary variables for the jth reference PSU.The ancillary variables for both target and reference PSUs were standardized using the mean and variance of the ancillary variables of the reference PSUs resulting in standardized variables with zero mean and unit variance.For MSN, the weight used is W = GL 2 G´, where G is the matrix of standardized canonical coefficients for the ancillary variables and L 2 is the diagonal matrix of squared canonical correlations between ancillary attributes and variables of interest (Moeur and Stage 1995).
For the gradient nearest neighbor method (GNN) the weights are assigned by employing a projected ordination of the ancillary data based on canonical correspondence analysis (CCA) (Ohmann and Gregory 2002).
The RF method is an extension of classification and regression tree (CART) methods (Breiman 2001).The data and variables are randomly and iteratively sampled to generate a forest of classification and regression trees.If two PSUs tend to end up in the same terminal nodes in a forest of classification and regression trees, they are considered to be similar.The RF distance measure is one minus the proportion of trees where a target PSU is in the same terminal node as a reference PSU (Crookston andFinley 2008, Hudak et al. 2008).The 618 PSUs were randomly split into P4 (154 PSUs) and P1, P2, and P3 (remaining 464 PSUs) 500 times.For each of the 500 realizations of randomly splitting the data, the following estimators were used to estimate the mean value of the variables of interest for the year 2000.

Estimation
Using the data from P4 only, the mean value of Y for the year 2000 (SAMPLE25 estimator) was estimated as: where Y i is the observed Y value of the ith PSU, and n 4 is the number of PSUs in P4.
The MA estimator, the FIA default method, was also used to estimate the mean value of Y for the year 2000: . * where n 1 is the number of PSUs in P1), with t-3, t-2, and t-1 being the years 1993/1994, 1995, and 1996/1997, respectively.In the following, this MA(4) estimator will be referred to as MA.Instead of filling in the missing values for P1, P2, and P3 with their previous measurements, as was done in the MA calculation, MSN, GNN, and RF were explored to impute the MAC of the variables of interest for the PSUs in P1, P2, and P3.The R package yaImpute (Crookston and Finley 2008) was used to impute MAC.The PSUs in P4 were the reference data in the imputation process.The imputed MAC was then used to update the variables of interest for each PSU in P1, P2, and P3 to the year 2000 by multiplying it with the GPL and adding it to the occasion 1 measurement as follows: where Y imp,i is the imputed Y value for the ith PSU in 2000 and Y t-x,i is the observed Y value for the ith PSU at time t-3 (1993/1994), t-2 (1995), and t-1 (1996/1997) for P1, P2, and P3, respectively.MAC imp,i is the imputed MAC for the ith PSU with imp referring to the NN imputation method used.GPL i is the growth length period between the occasion 1 measurement and the remeasurement in 2000.
The mean value of Y for the year 2000 was then estimated as follows: where IMP refers to the NN imputation method used and Y imp,i is the imputed Y value for the ith PSU as described in Eq. 5.
Two sets of ancillary variables were tested for the NN imputation methods: The first set included the available climate, topography, and satellite data (Data set A) and the second set consisted of the previous measurements of the variables of interest that were taken at measurement occasion 1 from 1993 to 1997 (Data set B: BAocc1, SPHocc1, VOLocc1, BIOTocc1).SAMPLE25, MA, and the three imputation methods were compared based on the estimated overall means of the variables of interest in 2000 (see Eqs. 3,4,and 6).The basis of evaluation was accuracy, as expressed by the root mean square error (RMSE), and bias calculated as the mean difference between the estimated and the observed (Eq.2) mean values.RMSE and bias were approximated using a random sample of m = 500 realizations of splitting the data.Both RMSE and bias were expressed as percent of the observed mean for each variable of interest: where Ŷj is the estimate of Y _ for the jth iteration of data splitting using either SAMPLE25, MA, or the NN imputation methods.Two-sided t-tests were performed to examine whether the esti-mated bias (in original units) of the SAMPLE25, MA, and the NN imputation methods was significantly different from zero, and the p-values were reported.
The MAC estimates based on MSN, GNN, and RF imputation were compared to the observed MAC also using the approximated RMSE and bias from m = 500 iterations of randomly splitting the data.Two-sided t-tests were performed to examine whether the estimated bias (in original units) of the three NN imputation methods using data sets A and B for imputing MAC was significantly different from zero, and the p-values were reported.

Results
MSN imputation provided similar results for MAC estimates for both sets of ancillary variables with the variance contributing most to the RMSE.When BAocc1, SPHocc1, VOLocc1, and BIOTocc1 were used as ancillary variables, the p-values of the two-sided t-test testing the unbiasedness of the estimator ranged from 0.7041 to 0.9453 and the p-values for the estimates based on climate, topography, and satellite data ranged from 0.1556 to 0.4581 indicating that MSN imputation provided unbiased MAC estimates with both ancillary data sets (Table 3).GNN estimates of MAC were significantly biased (p < 0.0001) and the large bias (> 66%) contributed most to the RMSE.Using climate, topography, and satellite data as ancillary variables resulted in smaller bias and hence, smaller RMSE values than using BAocc1, SPHocc1, VOLocc1, and BIOTocc1 as ancillary variables (Table 3).In terms of RMSE, RF using climate, topography, and satellite data as ancillary variables provided the best estimates of MAC.However, these were biased for BA, VOL, BIOT (p < 0.0007) and suggestive, but inconclusively unbiased for SPH (p = 0.0770).RF using BAocc1, SPHocc1, VOLocc1, and BIOTocc1 as ancillary data provided unbiased MAC estimates (p > 0.1) for all four variables of interest and smaller RMSE values than those achieved by either MSN or GNN imputation (Table 3).
For estimating current BA and SPH the RMSE values of MA were about half the size of those observed for SAMPLE25.For estimating current VOL and BIOT the RMSE values for MA were about a third of those observed for SAMPLE25.SAMPLE25 results were unbiased (p > 0.5).MA results were precise but biased (p < 0.0001) and the bias contributed most to the RMSE.The opposite was true for the unbiased SAMPLE25 estimates, where the variance contributed most to the RMSE (Table 4).
MSN resulted in biased estimates of current BA, SPH, VOL, and BIOT (p < 0.0047) when BAocc1, SPHocc1, VOLocc1, and BIOTocc1 were used as ancillary variables.When climate, topography, and satellite data were used as ancillary variables, estimates were biased for BA (p = 0.0021), unbiased for SPH and VOL (p > 0.1), and suggestive, but inconclusively unbiased for BIOT (p = 0.0460).For both sets of ancillary variables, the MSN estimates were similar in terms of RMSE and outperformed the MA estimates in terms of both bias and RMSE (Table 4).
For the GNN estimates of current forest attributes, bias contributed most to RMSE.When BAocc1, SPHocc1, VOLocc1, and BIOTocc1 were used as ancillary variables, the bias and hence the RMSE was larger than for the ancillary variable set including climate, topography, and satellite data.For both sets of ancillary variables, bias and RMSE were larger than for any of the other estimators (Table 4).
RF estimates of the current variables of interest were biased (p < 0.0016) when BAocc1, SPHocc1, VOLocc1, and BIOTocc1 were used as ancillary variables.When climate, topography, and satellite data were used as ancillary variables, RF provided suggestive, but inconclusively unbiased (p = 0.0421) and unbiased (p = 0.9975) estimates of current BA and SPH, respectively, and biased estimates of VOL and BIOT (p < 0.0001).RMSE values were smallest when climate, topography, and satellite data were used as ancillary variables.For both sets of ancillary variables, RF imputation outperformed the MA estimates both in terms of bias and RMSE (Table 4).
RF using climate, topography, and satellite data as ancillary variables provided the best results overall in terms of bias and RMSE for estimating current BA, SPH, VOL, and BIOT, followed by MSN and RF using BAocc1, SPHocc1, VOLocc1, and BIOTocc1 as ancillary variables in terms of RMSE (Table 4).
In a previous study, RF imputation using BAocc1, SPHocc1, VOLocc1, and BIOTocc1 as ancillary data for directly imputing current BA, SPH, VOL, and BIOT provided the most accurate estimates compared to the SAMPLE25 and MA estimators and MSN and GNN imputation.The results of this PSU-level RF imputation are provided in Table 4. See Eskelson et al. (2009) for details.The RF imputation in this study outperforms the PSU-level RF imputation from the previous study in terms of RMSE for BA, SPH, VOL, and BIOT and in terms of bias for BA and SPH for both sets of ancillary data.For both sets of ancillary data MSN imputation in this study outperforms the PSU-level RF imputation from the previous study in terms of both RMSE and bias for SPH (Table 4).

Discussion
After the start of the second inventory cycle, MAC can be estimated using remeasured plots (Arner et al. 2004).Considering the year 2000 as the start of the second inventory cycle, MAC was estimated in this study.The results of the three NN imputation methods that were explored to impute MAC of forest attributes showed the same pattern that was observed in an earlier study where the same three NN imputation methods were used to impute BA, SPH, VOL, and BIOT (see Eskelson et al. 2009).RF imputation provided the best results in terms of RMSE followed by MSN.The results of this study suggest that GNN PSUlevel imputation is not adequate to impute MAC of forest attributes.This might be due to the fact that the CCA in the GNN procedure requires the use of environmental factors for the ordination (Ohmann and Gregory 2002) which might not be picked up in the ancillary data that was used for the imputation in this study.The performance of using MAC of forest attributes of one inventory cycle to predict MAC for the next inventory cycle could not be tested in this study since only one remeasurement of the PSUs was available for the CVS data.This is an area of research that should be pursued as soon as multiple remeasurements of the FIA annual inventory are available in the western US.
When BA, SPH, VOL, and BIOT were updated to the year 2000 using MAC estimates obtained from MSN and RF imputation, the estimates of the mean BA, SPH, VOL, and BIOT in 2000 outperformed those of the SAMPLE25 and MA estimators in terms of accuracy.These results indicate that updating the variables of interest for unmeasured PSUs to the current point in time using estimated MAC from MSN or RF imputation should be preferred over using the SAMPLE25 or MA estimators for estimating the current forest attributes.
The estimates of mean BA, SPH, VOL, and BIOT in 2000 using MAC estimates obtained from RF imputation also outperformed the estimates from PSU-level RF imputation that was used to directly impute BA, SPH, VOL, and BIOT in 2000 (see Eskelson et al. 2009).This is due to the fact that the approach employing the imputed MAC estimates makes use of the previously observed measurements directly, whereas the PSU-level RF imputation only makes use of the previous measurements indirectly by using them as ancillary data.Adding a multiple of estimated MAC to the previously observed measurement will result in current estimates that will be close to the actual values even if the estimated MAC values were not perfect.If current BA, SPH, VOL, and BIOT are imputed directly as was done in Eskelson et al. (2009), the imputed values can be either close to the actual values or they can be completely different.The results of this study suggest that the approach using imputed MAC values should be preferred over directly imputing current BA, SPH, VOL, and BIOT.

Fig. 1 .
Fig. 1.Geographical distribution of the six national forests in Washington and Oregon.
Procedures BA, SPH, VOL, and BIOT were the variables of interest (Y).The observed mean value of Y in 2000 (Y _ OBS ), based on the observed Y values from all 618 PSUs, was used as the best available estimate of the true mean of Y: is the observed Y value of the ith PSU in 2000 and n = 618.

Table 2 .
Summary of variables at the primary sampling units (n = 618) in 2000.

Table 3 .
Bias with p-value in parentheses and RMSE of mean annual change (MAC) of BA (basal area/ha), SPH (stems/ha), VOL (volume/ha), and BIOT (biomass/ha).Data set A comprises climate, topography, and satellite data.Data set B comprises occasion 1 measurements of the variables of interest.

Table 4 .
Bias with p-value in parentheses and RMSE of mean BA (basal area/ha), SPH (stems/ha), VOL (volume/ ha), and BIOT (biomass/ha) in year 2000.Data set A comprises climate, topography, and satellite data.Data set B comprises occasion 1 measurements of the variables of interest.