How can I preprocess datasets to assign coordinates before opening them with xarray.open_mfdataset?

Question

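The xarray documentation for the open_mfdataset function states that you can use the preprocess argument to apply a function to each dataset before concatenation. My NetCDF datasets have no coordinates assigned when opened one at a time, so I am trying to assign them before they are combined by open_mfdataset with combine='by_coords'.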

This is what one of the datasets looks like when opened on its own:

import xarray as xr

path = 'path/to/my/file/file.nc'
ds = xr.open_dataset(path, decode_times=False)
ds

# <xarray.Dataset> Size: 1GB
# Dimensions:       (comid: 2677612, time_mn: 120, time_yr: 10)
# Dimensions without coordinates: comid, time_mn, time_yr
# Data variables:
#    COMID         (comid) int32 11MB ...
#    Time_mn       (time_mn) int32 480B ...
#    Time_yr       (time_yr) int32 40B ...
#    RAPID_mn_cfs  (comid, time_mn) float32 1GB ...
#    RAPID_yr_cfs  (comid, time_yr) float32 107MB ...

To use open_mfdataset, my code looks like this. The assignCoordinates function works as expected, but the datasets still cannot be opened.

def assignCoordinates(df):
    # fd.calcDatetimes just calculates datetimes for the weird time units
    # used in these files; the function works properly.
    df = df.assign_coords({
        "comid": df['COMID'],
        "time_mn": fd.calcDatetimes(df, 'Time_mn', df.sizes['time_mn']),
        "time_yr": fd.calcDatetimes(df, 'Time_yr', df.sizes['time_yr'])
    })
    return df

path = "path/to/files/*.nc"
ds = xr.open_mfdataset(path, preprocess=assignCoordinates, combine='by_coords', decode_times=False)

ds

This is the error I get:

ValueError: Could not find any dimension coordinates to use to order the datasets for concatenation

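I assume the preprocessed files are not actually being used by open_mfdataset, but then I really don't understand what the point of that argument is. One more thing makes me suspect that the preprocessed datasets are not used: if they were, I should be able to remove decode_times=False, because after assignCoordinates runs the times are computed in a meaningful way and could be decoded. If I remove it, however, I get an error saying the times cannot be decoded.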

Is there a way to do what I want, or do I really have to open each dataset individually?

Minimal reproducible example

Copy this code and fill in the export path. It will create three .nc files in the directory you specify.

import numpy as np
import pandas as pd
import xarray as xr

np.random.seed(0)
temperature = 15 + 8 * np.random.randn(2, 3, 4)
precipitation = 10 * np.random.rand(2, 3, 4)
lon = [-99.83, -99.32]
lat = [42.25, 42.21]
instruments = ["manufac1", "manufac2", "manufac3"]
time = pd.date_range("2014-09-06", periods=4)
reference_time = pd.Timestamp("2014-09-05")
ds = xr.Dataset(
    data_vars=dict(
        temperature=(["loc", "instrument", "time"], temperature),
        precipitation=(["loc", "instrument", "time"], precipitation),
    ),
    attrs=dict(description="Weather related data."),
)
for i in range(1,4):
    ds.to_netcdf(f'yourdirectory/test{i}.nc') #### EDIT HERE #####

Once that is done, run this code (remember to change the directory to wherever you saved the files created above):

def assignCoordinates(df):
    df = df.assign_coords({
        "loc": df['loc'],
        "instrument": df['instrument'],
        "time": df['time']
    })
    return df

ds = xr.open_mfdataset('yourdirectory/*.nc', preprocess=assignCoordinates, combine='by_coords') #### EDIT HERE #####
ds

Answer 1

You are not saving any time coordinate in the dataset. You should write:

ds = xr.Dataset(
    data_vars=dict(
        temperature=(["loc", "instrument", "time"], temperature),
        precipitation=(["loc", "instrument", "time"], precipitation),
    ),
    coords={'time': time},
    attrs=dict(description="Weather related data."),
)
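To make the difference concrete, here is a minimal in-memory sketch (no files involved) that checks whether the time dimension has an index with and without the coords argument. The names values, without_coords and with_coords are made up for this illustration and are not part of the original example.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Placeholder data, only the shape matters for this illustration.
values = np.zeros((2, 3, 4))
time = pd.date_range("2014-09-06", periods=4)

# The dimension "time" exists, but no coordinate variable is attached to it.
without_coords = xr.Dataset(
    data_vars=dict(temperature=(["loc", "instrument", "time"], values)),
)

# Same data, with the time values attached as a dimension coordinate.
with_coords = xr.Dataset(
    data_vars=dict(temperature=(["loc", "instrument", "time"], values)),
    coords={"time": time},
)

# A dimension coordinate shows up as an index on the dataset.
has_index_without = "time" in without_coords.indexes
has_index_with = "time" in with_coords.indexes
print(has_index_without, has_index_with)
```

Only the second dataset carries a time index, which is the kind of dimension coordinate that combine='by_coords' looks for when ordering datasets.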

Also, if you want open_mfdataset to concatenate along this dimension, the time values must be different for each dataset saved in the loop.
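The point about distinct time values can be tried without writing any files. The sketch below assumes that open_mfdataset with combine='by_coords' relies on the same logic as xr.combine_by_coords, as the xarray documentation describes; make_dataset is a hypothetical helper written only for this illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr


def make_dataset(start):
    """Build a small dataset whose time coordinate begins at `start`."""
    time = pd.date_range(start, periods=4)
    data = np.zeros((2, 3, 4))
    return xr.Dataset(
        data_vars=dict(temperature=(["loc", "instrument", "time"], data)),
        coords={"time": time},
    )


# Three datasets with identical time values: there is nothing to order them by.
identical = [make_dataset("2014-09-06") for _ in range(3)]
try:
    xr.combine_by_coords(identical)
    identical_error = None
except ValueError as err:
    identical_error = str(err)

# Three datasets covering consecutive, non-overlapping 4-day windows.
starts = ["2014-09-06", "2014-09-10", "2014-09-14"]
combined = xr.combine_by_coords([make_dataset(s) for s in starts])

print(identical_error)
print(combined.sizes["time"])
```

In the first case the combine step raises a ValueError because every dataset has the same coordinates. In the second case the three 4-step datasets are ordered by their time values and concatenated into a single 12-step time axis.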
