Understanding the representativeness of FLUXNET for upscaling carbon flux from eddy covariance measurements

Eddy covariance data from regional flux networks are direct in situ measurement of carbon, water, and energy fluxes and are of vital importance for understanding the spatio-temporal dynamics of the the global carbon cycle. FLUXNET links regional networks of eddy covariance sites across the globe to quantify the spatial and temporal variability of fluxes at regional to global scales and to detect emergent ecosystem properties. This study presents an assessment of the representativeness of FLUXNET based on the recently released FLUXNET2015 data set. We present a detailed high resolution analysis of the evolv5 ing representativeness of FLUXNET through time. Results provide quantitative insights into the extent that various biomes are sampled by the network of networks, the role of the spatial distribution of the sites on the network scale representativeness at any given time, and how that representativeness has changed through time due to changing operational status and data availability at sites in the network. To realize the full potential of FLUXNET observations for understanding emergent ecosystem properties at regional and global scales, we present an approach for upscaling eddy covariance measurements. Informed by the 10 representativeness of observations at the flux sites in the network, the upscaled data reflects the spatio-temporal dynamics of the carbon cycle captured by the in situ measurements. This study presents a method for optimal use of the rich point measurements from FLUXNET to derive an understanding of upscaled carbon fluxes, which can be routinely updated as new data become available, and direct network expansion by identifying regions poorly sampled by the current network.


Introduction
Terrestrial ecosystems are a critical component of the global carbon cycle regulating the climate through a range of biogeophysical and biogeochemical mechanisms.Terrestrial ecosystems exchange about 120 Pg of carbon per year with the atmosphere, through the processes of photosynthesis and respiration (Schlesinger and Bernhardt, 2013).Eddy covariance-based methods are widely used around the world to measure the gas exchange between vegetation and the atmosphere.The eddy covariance method assumes the measurement site is located in flat terrain, experiences steady or stable atmospheric conditions, and is surrounded by uniform vegetation for an extended distance in the upwind direction (Baldocchi, 2003).In practice, however, sites are often located in non-ideal and spatial heterogeneous regions and are thus may be prone to varying degrees of measurement 1 Earth Syst.Sci. Data Discuss., doi:10.5194/essd-2016-36, 2016(Baldocchi et al., 2001).FLUXNET consists of more than ∼750 sites (http://fluxnet.fluxdata.org/sites/historical-site-status/)across the globe that are independently operated by regional networks (http://fluxnet.fluxdata.org/about/regional-networks/).However, the locations of the sites in the network were not formally designed to uniformly and consistently observe global biomes and thus represent a sparse and spatially biased sampling of the global terrestrial ecosystem.Careful synthesis of spatiotemporally sparse flux observations is required to understand the regional to global scale carbon cycle.Upscaling of fluxes to regional to global scales requires understanding and quantification of the spatial representativeness of the global network of flux sites.
A number of studies have analyzed the representativeness of regional networks of flux sites.Hargrove et al. (2003) analyzed the representativeness of the AmeriFlux network using an ecoregion-based approach and found that while most of the continental United States was well represented by the network, the Pacific Northwest and Texas grasslands were poorly represented by the network sites.Yang et al. (2008) conducted a similar independent study of AmeriFlux network repsentativeness using remotely sensed climate and vegetation data products and arrived at conclusions similar to Hargrove et al. (2003).Hargrove and Hoffman (2005) presented a quantitative ecoregion-based approach to assess the representiveness of networks and a method for design of sampling networks.Sulkava et al. (2011) studied the representativeness of the European flux tower network and developed a quantitative network design tool for identifying and prioritizing addition of new flux sites.Chen et al. (2011Chen et al. ( , 2012) ) used remote sensing and footprint analysis to characterize the spatial representativeness of flux sites across the Canadian Carbon Program Network.He et al. (2015) conducted a similar assessment of regional representativeness for the Chinese flux network.They also identified a set of sites that would complement the existing flux network.Most of these assessments of the spatial representativeness of regional flux networks across the globe have been focused on quantifying the coverage by the network and for the design of new sampling sites.However, the quantitative representativeness of network sites has not yet been applied to synthesize and upscale observations beyond the footprints of individual sites to provide landscape-to global-scale estimates of flux measurements.
We present here an analysis of the quantitative represenatitveness of the FLUXNET global network of sites that contributed to the FLUXNET2015 data set.While a large number of global flux sites are affiliated with the FLUXNET network, only a subset of site data are typically available in data collections.Thus, only the sites with data available in the most recent July 12, 2016, release of FLUXNET2015 were included in our study.Independently operated sites in the FLUXNET network often vary significantly in their period of active operation or data availability for logistical reasons.Thus in this study, in addition to quantifying the spatial representativeness of the network, we quantify the temporally variable representativeness of data availability within the FLUXNET2015 collection.Finally, we present an approach for global upscaling of eddy covariance fluxes from FLUXNET network informed by the temporally varying spatial representativeness of the flux sites. 2 Data and Methods

Eddy covariance measurements
FLUXNET provides uniform and high quality data products through coordination among various regional flux networks across the globe (http://fluxnet.fluxdata.org/data/fluxnet2015-dataset/data-processing/).We used the latest release of the FLUXNET2015 eddy covariance data set collection in this study.The time series products in the FLUXNET2015 collection were gap filled (Reichstein et al., 2005) and went through a rigorous QA/QC procedure (Pastorello et al., 2014).The model efficiency method reference GPP product (GPP_DT_CUT_REF) calculated using day time data (Lasslop et al., 2010) was used to create an upscaled data product in this study.
While more than ∼750 sites are registered in the FLUXNET database (http://fluxnet.fluxdata.org/sites/historical-site-status/),indicated by red triangles in Figure 1(a), data from only 165 global sites were available in the FLUXNET2015 data set released July 12, 2016, and shown by filled blue circles in Figure 1(b).This study was limited to data contained in the FLUXNET2015 data set.Environmental data sets used in our study were not available at the permanent wetland site at Adventdalen, Norway (NO-Adv) and thus was excluded from the analysis.Thus, the GPP observations from 164 sites globally (Figure 1(a)) spanning the period 1991-2014 were used.FLUXNET sites are maintained and operated by independent groups and the period of active operation and data availability varies widely from site to site, ranging from a single year to up to 24 years (Figure 1(a)).There also exists some variability in variables and data fields available from any site.Figure 1(b) also shows the number of sites where the GPP_DT_CUT_REF variable used in this study was available.

Environmental variables
Ecosystem structure and functioning are governed by multiple independent control variables including climate, parent material, topography, potential biota and time (Chapin et al., 2012).A set of environmental variables to capture climatic, edaphic and topographic characteristics were selected to characterize the terrestrial ecosystem in this study (Table 1).that are biologically meaningful, representing annual trends, seasonality and extreme or limiting environmental factors.We selected 14 bioclimatic variables (Table 1) to represent the global bioclimatic environment in this study.We also selected four edaphic variables (Global Soil Data Task Group, 2000;Saxon et al., 2005) and one topographic variable (Saxon et al., 2005) Figure 2. Global ecoregions delineated using 20 bioclimatic, edaphic and topographic characteristics (Table 1).This randomly colored map shows 10 distinct ecoregions identified by the k-means clustering algorithm.

Ecoregionalization
FLUXNET classifies the sites based on vegetation type they represent using IGBP classification scheme (IGBP, 1990(IGBP, , 1992) ) among cropland, closed shrubland, deciduous broadleaf forest, evergreen broadleaf forest, evergreen needleleaf forest, grassland, mixed forest, open shrubland, savanna, wetlands and woody savanna.While the IGBP classifications are primarily based on vegetation, a wider range of environmental variables (bioclimatic, edaphic and topographic) were employed in our study.
The multivariate clustering technique has been widely used in ecology and Earth sciences for delineation of ecoregions that are relatively homogeneous with respect to selected environmental characteristics (Hargrove and Hoffman, 2005;Bernert et al., 1997;Lark, 1998;Hessburg et al., 2000).A massively parallel k-means clustering analysis tool developed by Kumar et al. (2011) was used to derive custom ecoregions using a selected set of environmental variables (Table 1).
Figure 2 shows the global land area classified into 10 distinct ecoregions at ∼ 4 km resolution, when defined quantitatively by the k-means clustering algorithm, using 20 bioclimatic, edaphic and topographic characteristics (Table 1).Quantitatively defined ecoregions correspond well with the major global biomes and serves as a reference landcover classification for this study.

Representativeness of flux sites
Representativeness of a measurement site is the extent to which the measurements collected at any given location and time The second was a point-based representativeness approach in which the Euclidean distance of each map pixel containing a site from every other pixel on the map was calculated in the standardized n-dimensional data space.When a network of sites is present, the Euclidean distance for the site closest to the pixel in data space was selected as the representativeness of that pixel.
In this study, we used the point-based representativeness approach (Hoffman et al., 2013) to compute the dissimilarity of environmental conditions between any flux site location and every land pixel on the globe at ∼4 km resolution in 20-dimensional data space (Table 1).At each flux site the environmental characteristics (Table 1) were extracted from the gridded data sets and the Euclidean distance to every land pixel on the globe was computed as resulting in a map showing dissimilarity in environmental conditions sampled by the flux site location to that of other locations on the globe.For example, a representativeness map for the Willow Creek, U.S. site (US-WCr) (Figure 3) located in an upland Deciduous Broadleaf Forest (DBF) shows that the site is representative of temperate environment in central United States and Europe.However, this site does not capture well the dry environment of the southwestern U.S., coastal regions of the southern U.S., evergreen forests in the northwestern U.S. Tropical and high latitude regions of the globe experience significantly different environmental conditions and thus show high dissimilarities when compared to the US-WCr site.Similarly, the Evergreen Broadleaf Forest (EBF) site at Santarum km83, Brazil (BR-Sa3) (Figure 3) represents tropical environments well, but shows high dissimilarities with temperate and high latitude regions.
Environmental conditions at any location on the map may be characteristic of one or many or none of the sites in the sampling network.To understand the best representative from the network as a whole, every pixel was assigned the representativeness corresponding to the site closest to it in the multivariate environmental data space.

Upscaling point measurements
Quantitative representativeness of network sites can be utilized to upscale flux estimates to larger spatial and temporal scales for input to or evaluation of process modeling or for estimating landscape-scale characteristics (Hoffman et al., 2013).Upscaling eddy covariance constrained estimates of gross primary production (GPP) from a global network of flux sites requires 1) identifying all sites that sample environments similar to any location, and 2) developing a statistical model to estimate GPP at all global land regions using estimates from point measurements at flux sites.

Inverse distance weighted interpolation
Interpolation is a process of using measurements about a process at a limited number of point locations to make estimates about the process at other, unmeasured locations.Inverse distance weighting (IDW) is a deterministic, nonlinear interpolation technique that uses a weighted average of the measurements from nearby sample points to estimate the magnitude of variables at non-sampled locations.IDW interpolation is based on Tobler's first law of geography (Tobler, 1970) which states "everything is related to everything else, but near things are more related than distant things".The weight of a measurement at a particular point is assigned in the averaging calculation depending upon the sampled point's distance to the non-sampled location.While traditional spatial interpolation approaches often employ the distance in geographic space between sampled points and the un-sampled location, in this study the distance in multivariate environmental data space (representativeness of the sampling location) was used to calculate the weights.Thus, samples from locations closer in terms of environmental conditions were assigned higher weights in interpolation while information about geographical proximity was never used in the process.
An important aspect of the inverse distance weighting interpolation is the neighborhood of influence that determines the sampling points that are used for interpolating the value at any un-sampled location.In this study we systematically defined the neighborhood of influence (described in Section 2.5.2) and thus identified the flux measurements to include in the interpolation at each spatial location and at each monthly time step during the study period.
High temporal resolution measurements at eddy covariance sites capture the temporal dynamics and variability (seasonal and interannual), allowing us to examine the influences of phenology, drought, heat waves, El Niño, length of growing season, presence/absence of snow on canopy-scale fluxes, etc. (Baldocchi et al., 2001).Statistical models built using all the data, while representing the long term trends and mean climatology well, also tend to lose the fine temporal dynamics captured by measurements.To preserve the rich temporal dynamics in the flux measurements, we built the interpolation at monthly time steps using the data only from that month.This approach, distinct from many past studies, helps capture the temporal variability in the upscaled data set; however, it is also prone to exhibit a bias if a localized phenomena is experienced by a flux site.

Identifying flux sites in similar environments
While global terrestrial ecosystems are highly diverse and heterogeneous, point measurements at flux sites sample the environment in which they are located.These measurements are representative of similar conditions at geographic locations elsewhere.
The network of global flux sites which are managed and operated by independent regional institutions provide a sparse and 10 Earth Syst.Sci. Data Discuss., doi:10.5194/essd-2016-36, 2016 Open Access  non-uniform sampling of the environment.While some regions and environments may be well sampled, others may be undersampled.Thus it is important to identify the flux sites in similar environments where GPP estimates were available for upscaling.We overlaid the locations of sites in FLUXNET over the quantitatively defined ecoregions (Figure 2) and identified the sites in each ecoregion.Figure 6 shows the distribution of 164 FLUXNET sites in the FLUXNET2015 collection globally across various ecoregions.To estimate the GPP at any location, the sites located within the same ecoregion would provide the most relevant and representative measurements.Ecoregions in the mid-latitudes of North America and Europe are well sampled by the global network of flux sites, while sites in many of the other ecoregions are few to none. Figure 4 shows the best possible representativeness while using all 164 sites in the FLUXNET2015 collection.The actual availability of flux site measurements is spatio-temporally variable.
Table 2 shows the number of sites within each ecoregion each year during the period 1991-2014 for which data were available in the FLUXNET2015 collection.We carefully processed and identified the sites at which GPP estimates were available within each ecoregion monthly for use in the upscaled product.However, there were some ecoregions and time periods that were completely unsampled by the available sites.In such cases we identified the ecoregion most similar in multivariate data space (Table 1) that was sampled by at least one or more flux sites and used those data for extrapolation.The accuracy of the GPP estimates in such cases is expected to suffer due to data limitation.

Results and discussions
We used monthly time series of GPP estimates from eddy covariance observations taken at flux towers from FLUXNET2015 to develop upscaled gridded GPP data sets at ∼ 4 km resolution.We compared the upscaled GPP time series data set to 11 Earth Syst.Sci. Data Discuss., doi:10.5194/essd-2016-36, 2016 Open Access    FLUXNET-MTE (Jung et al., 2009).While FLUXNET-MTE GPP was also developed using observations from FLUXNET sites, a different set of sites and data products were used in the current study.The overlapping period of 2000-2008 was used for comparison with FLUXNET-MTE.

Spatial distribution of GPP
Figure 7 shows the global spatial distribution of time integrated mean annual GPP for four years in the comparison period.
The quality and accuracy of the upscaled GPP time series data is strongly affected by the availability and spatial distribution of the flux sites at any given time.Figure 8 shows the comparison of our upscaled GPP estimates with FLUXNET-MTE GPP estimates (Jung et al., 2009).Agreement was fairly good for Northern Hemisphere mid-latitudes regions that were well sampled by FLUXNET2015 sites.However, the data show significant differences in Southern Hemisphere tropical regions, which lack sufficient flux measurements.For example, FLUXNET-MTE employs the data for the flux sites in Matto Grosso (BR-Mtg) in South America, which was not available as part of FLUXNET2015 data set, leading to an underestimate of GPP in that region compared to FLUXNET-MTE.FLUXNET-MTE also suffers from limitations of data in this region, indicated by their high index of extrapolation in tropical areas (Jung et al., 2009).
Extra-tropical regions in the Southern Hemisphere were poorly sampled by the FLUXNET2015 data with only a few sites making data available in that region.In addition, based on the environmental similarity sites in the Northern Hemisphere were used to estimate the fluxes in the region, which have an out-of-phase phenology, leading to an overestimation of GPP magnitude in Northern Hemisphere winter.
Higher estimates of GPP compared to FLUXNET-MTE in our upscaled data set in coastal regions of the Pacific Northwest of North America provide an improved representation of the wet evergreen forests.Coastal regions of the Pacific Northwest possess of high orographic relief and strong rain shadows.FLUXNET-MTE does not capture these highly productive forests, probably due to lower resolution averaging over wet and dry forests, but our approach at a high resolution provides a better estimate of GPP in this orographically complex terrain.

Seasonal patterns of GPP
Figure 9 shows the zonal mean seasonal patterns of monthly GPP estimated from FLUXNET2015 for years 2000, 2006, 2011, and 2014.In all cases, June and July exhibit the large GPP in the Northern Hemisphere extra-tropics.Weak seasonality, interannual variability, and regional under-sampling results in non-persistent ordering of monthly GPP in the tropics.As mentioned above, Southern Hemisphere high latitude regional estimates are strongly influenced by measurements in Northern Hemisphere mid-latitudes because of the similarity in their environmental conditions and the lack of FLUXNET2015 measurement sites in the Southern Hemisphere.Consequently, anomalously high GPP is estimated during June and July in −30 • to −50 • latitude.
The varying list of measurement sites each year also affects upscaled GPP estimates.For example, many fewer sites have made data from year 2014 available in FLUXNET2015, so Figure 9(c) exhibits unusually low tropical GPP. Figure 10(a) shows significant spread in our annual upscaled GPP because of both interannual variability and as a consequence of the continuously evolving list of contributing measurement sites.However, it is likely that interannual variability is strong enough to result in great variability in GPP than is exhibited by the FLUXNET-MTE estimates in Figure 10(b).

Temporal variability of GPP
Eddy covariance observations at flux sites capture changes in biosphere-atmosphere exchange at a high temporal resolution.
Temporal variability is due to long term changes in environmental conditions, vegetation dynamic, disturbance events, etc.To preserve the temporal variability in GPP captured by the instruments in the upscaled product, only the observations from the same month were used in our IDW-based interpolation method.
Thus the pattern of time integrated annual GPP over the years show significant interannual variability for all latitudinal zones (Figure 10 The FLUXNET-MTE model was trained using data across all years, and thus possibly was less susceptible to the strong interannual variability observed at the flux sites and rather is closer to a climatological mean pattern. Due to variability in the number and spatial distribution of flux sites for which data were available in FLUXNET2015, the upscaled GPP product also exhibits year to year variability in accuracy.Figure 12 and     This work highlights the regions that are poorly understood by the existing (and available) networks, like Southern Hemisphere in general and carbon rich tropical forest ecosystems in particular.
This approach can easily be applied to flux data as they become available in order to routinely derive upscaled carbon fluxes.
While the current study was focused on GPP fluxes, the same method can be applied to other carbon, water and energy fluxes.

Figure 1 .
Figure 1.Eddy covariance sites in the FLUXNET2015 collection.
Figure 1(b) shows the number of active sites for which data was available in the FLUXNET2015 collection in any year during the period 1991-2014.Due to extremely limited data availability before year 2000, we limited our study to the period 2000-2014.
represent the conditions at any other location and time.Eddy covariance flux towers measure a range of meteorological and flux variables.While the measurements captured by the sensors are direct measures of a small region in the footprint of the 5 Earth Syst.Sci.Data Discuss., doi:10.5194/essd-2016-36forjournal Earth Syst.Sci.Data Published: 23 August 2016 c Author(s) 2016.CC-BY 3.0 License.tower, understanding the representativeness of the measurements for a broader landscape is critical for upscaling of point measurements to the larger area.There has been a number of attempts to assess the representativeness of flux tower networks.Hargrove et al. (2003) analyzed the representativeness of the AmeriFlux network using an ecoregion based approach.For each ecoregion in the map, they calculated the Euclidean distance in their data space to the single closest ecoregion that contains a site from the network.Yang et al. (2008) studied the representativeness of the AmeriFlux network using MODIS and GOES data and used Euclidean distance to quantify the similarity between any location and the collection of AmeriFlux sites.Using a combination of footprint modeling and moderate-resolution remote sensing, Chen et al. (2011) calculated the representativeness with a sensor location bias metric.Sulkava et al. (2011) developed a tool to evaluate the representativeness of the European eddy covariance network.They calculated representativeness as the Euclidean distance measured in data space between the cluster centers and the established stations.He et al. (2015) studied the ChinaFlux sites and calculated the represenativeness of flux tower sites as the Euclidean distance in a 4-dimensional environmental space.They calculated the Euclidean distance from each pixel to each tower and selected the minimum distance as their network representativeness.Hoffman et al. (2013) utilized climate and permafrost characteristics and derived a Euclidean-based representativeness for a sampling design in the Arctic environment.They developed two approaches for calculating representativeness for sampling network design.First, the ecoregions-based representativeness, where the dissimilarity of any ecoregion containing a sampling site to any other ecoregion, was calculated as the Euclidean distance between the two ecoregion centroids within the standardized n-dimensional data space.
review for journal Earth Syst.Sci.Data Published: 23 August 2016 c Author(s) 2016.CC-BY 3.0 License.

Figure 6 .
Figure 6.Spatial distribution of FLUXNET sites across global ecoregions.
review for journal Earth Syst.Sci.Data Published: 23 August 2016 c Author(s) 2016.CC-BY 3.0 License.
review for journal Earth Syst.Sci.Data Published: 23 August 2016 c Author(s) 2016.CC-BY 3.0 License.
flux sites capture the dynamic terrestrial biosphere-atmosphere gas exchange and provides us insights into terrestrial ecosystem processes.In this study we have presented a method to assess the representativeness of the global network of flux sites, specifically those included in the FLUXNET2015 data set.Due to independent operation of member flux sites in the FLUXNET network, a wider range of variation exists in duration of operation, measurements being conducted and availability of data sets for open research.Thus unlike previous studies, we conducted our assessment to quantify the representativeness of the network through time.The analysis provides a window into the evolution of the global eddy covariance measurements over time, quantify the contribution of any individual sites, identify the regions that are better represented by addition of sites over time, and ones that continue to be under-represented.Upscaling of measured carbon fluxes is key for exploiting the full potential of rich FLUXNET2015 observations beyond 10 the tower footprints to landscape scale understanding of ecosystem processes.We also developed a representativeness-based 18 Earth Syst.Sci.Data Discuss., doi:10.5194/essd-2016-36forjournal Earth Syst.Sci.Data Published: 23 August 2016 c Author(s) 2016.CC-BY 3.0 License.upscalingapproach to develop monthly time series high resolution global gridded estimates of GPP.Our upscaling method utilized the observations only from flux sites that sampled similar environmental conditions.Spatio-temporal variability and accuracy of the upscaled GPP data set strongly reflects the spatio-temporal variability in distribution of flux observations.Thus, the data set has higher accuracy and low uncertainty in the ecoregions that are well sampled by the available flux sites, and lower accuracy and higher uncertainty in the ecoregions where available observations are sparse to none.Study was an attempt to understand the global carbon cycle (specifically GPP) through the lens of eddy covariance measurements across the globe.
While a larger number of sites are part of the full global FLUXNET network, full potential of the network to understand the global carbon cycle can be realized only when data from all the sites are made available for research and analysis.Our future efforts will focus on integration of data sets from other flux and forest inventory networks, like the Global Ecosystem Monitoring (GEM) and RAINFOR networks, especially to improve our understanding of biodiversity and ecosystem function in carbon rich tropical forests.Earth Syst.Sci.Data Discuss., doi:10.5194/essd-2016-36forjournal Earth Syst.Sci.Data Published: 23 August 2016 c Author(s) 2016.CC-BY 3.0 License.

Table 1 .
Environmental variables used for ecoregion delineation, representativeness analysis and upscaling.These data are in the form of ∼4 km raster grids.

Table 2 .
Active flux sites present in each ecoregion during 1991-2014.Data from the period 1991-1999 was not used in the study.Some ecoregions had no flux sites available during the period 2000-2014.

Table 3 .
Comparison statistics for upscaled GPP data set compared to FLUXNET-MTE