tensorflow 数据验证 tfdv 在 Google Cloud 数据流上失败,并显示“无法获取属性‘NumExamplesStatsGenerator’”

问题描述 投票:0回答:1

我正在关注this“入门”张量流教程,了解如何在谷歌云数据流上的apache beam上运行tfdv。我的代码与教程中的代码非常相似:

import tensorflow_data_validation as tfdv
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions, SetupOptions, WorkerOptions

PROJECT_ID = 'my-project-id'
JOB_NAME = 'my-job-name'
REGION = "europe-west3"
NETWORK = "regions/europe-west3/subnetworks/mysubnet"
GCS_STAGING_LOCATION = 'gs://my-bucket/staging'
GCS_TMP_LOCATION = 'gs://my-bucket/tmp'
GCS_DATA_LOCATION = 'gs://another-bucket/my-data.CSV'
# GCS_STATS_OUTPUT_PATH is the file path to which to output the data statistics
# result.
GCS_STATS_OUTPUT_PATH = 'gs://my-bucket/stats'

# downloaded locally with: pip download tensorflow_data_validation --no-deps --platform manylinux2010_x86_64 --only-binary=:all:
#(would be great to use it have it on cloud storage) PATH_TO_WHL_FILE = 'gs://my-bucket/wheels/tensorflow_data_validation-1.7.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl'
PATH_TO_WHL_FILE = '/Users/myuser/some-folder/tensorflow_data_validation-1.7.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl'


# Create and set your PipelineOptions.
options = PipelineOptions()

# For Cloud execution, set the Cloud Platform project, job_name,
# staging location, temp_location and specify DataflowRunner.
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT_ID
google_cloud_options.job_name = JOB_NAME
google_cloud_options.staging_location = GCS_STAGING_LOCATION
google_cloud_options.temp_location = GCS_TMP_LOCATION
google_cloud_options.region = REGION
options.view_as(StandardOptions).runner = 'DataflowRunner'

setup_options = options.view_as(SetupOptions)
# PATH_TO_WHL_FILE should point to the downloaded tfdv wheel file.
setup_options.extra_packages = [PATH_TO_WHL_FILE]

# Worker options
worker_options = options.view_as(WorkerOptions)
worker_options.subnetwork = NETWORK
worker_options.max_num_workers = 2

print("Generating stats...")
tfdv.generate_statistics_from_tfrecord(GCS_DATA_LOCATION, output_path=GCS_STATS_OUTPUT_PATH, pipeline_options=options)
print("Stats generated!")

上面的代码启动了一个数据流作业,但不幸的是它失败并出现以下错误:

apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/apache_beam/internal/dill_pickler.py", line 285, in loads
    return dill.loads(s)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 275, in loads
    return load(file, ignore, **kwds)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 270, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 472, in load
    obj = StockUnpickler.load(self)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 462, in find_class
    return StockUnpickler.find_class(self, module, name)
AttributeError: Can't get attribute 'NumExamplesStatsGenerator' on <module 'tensorflow_data_validation.statistics.stats_impl' from '/usr/local/lib/python3.8/site-packages/tensorflow_data_validation/statistics/stats_impl.py'>

我在互联网上找不到类似的东西。 如果它可以帮助,在我的本地计算机(MACOS)上我有以下版本:

Apache Beam version: 2.34.0
Tensorflow version: 2.6.2
TensorFlow Transform version: 1.4.0
TFDV version: 1.4.0

云上的 Apache Beam 运行于

Apache Beam Python 3.8 SDK 2.34.0

额外问题:我的另一个问题是围绕

PATH_TO_WHL_FILE
。我试图把它放在一个储物桶上,但光束似乎无法把它捡起来。仅限本地,这实际上是一个问题,因为这会使分发此代码变得更加困难。分发这个轮子文件的好习惯是什么?

tensorflow google-cloud-platform google-cloud-dataflow apache-beam tensorflow-data-validation
1个回答
0
投票

根据属性名称

NumExamplesStatsGenerator
,它是一个不可 pickle 的生成器。

但是我现在无法从模块中找到该属性。 搜索表明在1.4.0这个模块包含这个属性。 所以您可能想尝试更新版本的 TFDV。

PATH_TO_WHL_FILE 表示要暂存/分发到 Dataflow 执行的文件,因此您可以在 GCS 上使用文件。

© www.soinside.com 2019 - 2024. All rights reserved.