How do I add a SparkDFDataset to my Great Expectations validator?


Thanks in advance for any advice on the following problem.

I am testing whether Great Expectations works on my Hive tables. Ideally, I would like to end up with an HTML file that displays my Expectations on a user-friendly page.

I have not initialized a Great Expectations project. Basically, I start pyspark3 and run the following commands, and I end up with an error saying that my SparkDFDataset has no persist attribute:

>>> import great_expectations as ge
>>> from pyspark.sql import SparkSession
>>> sk = SparkSession.builder.appName("GE_TEST").getOrCreate()
>>> sk.sql("use DB1")
>>> hive_table = sk.sql("SELECT * FROM TABLEX")
>>> df_ge = ge.dataset.SparkDFDataset(hive_table)
>>> context = ge.get_context()
>>> datasource = context.sources.add_spark("my_spark_datasource")
>>> name = "my_df_asset"
>>> data_asset = datasource.add_dataframe_asset(name=name)
>>> my_batch_request = data_asset.build_batch_request(dataframe=df_ge)
>>> expectation_suite_name = "test"
>>> context.add_or_update_expectation_suite(expectation_suite_name=expectation_suite_name)
>>> validator = context.get_validator(batch_request=my_batch_request, expectation_suite_name=expectation_suite_name)
23/11/22 17:58:26 WARN  sql.SparkSession: [Thread-3]: Using an existing Spark session; only runtime SQL configurations will take effect.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/GX/testenv/lib64/python3.8/site-packages/great_expectations/data_context/data_context/abstract_data_context.py", line 2388, in get_validator
    return self.get_validator_using_batch_list(
  File "/GX/testenv/lib64/python3.8/site-packages/great_expectations/data_context/data_context/abstract_data_context.py", line 2444, in get_validator_using_batch_list
    validator = Validator(
  File "/GX/testenv/lib64/python3.8/site-packages/great_expectations/validator/validator.py", line 211, in __init__
    self.load_batch_list(batch_list=batches)
  File "/GX/testenv/lib64/python3.8/site-packages/great_expectations/validator/validator.py", line 322, in load_batch_list
    self._execution_engine.batch_manager.load_batch_list(batch_list=batch_list)
  File "/GX/testenv/lib64/python3.8/site-packages/great_expectations/core/batch_manager.py", line 156, in load_batch_list
    self._execution_engine.load_batch_data(
  File "/GX/testenv/lib64/python3.8/site-packages/great_expectations/execution_engine/sparkdf_execution_engine.py", line 248, in load_batch_data
    batch_data.dataframe.persist()
AttributeError: 'SparkDFDataset' object has no attribute 'persist'

Does anyone know how to fix this, or is there an alternative way to achieve what I need? toPandas() does not seem to be an option, because I keep getting Java heap errors from my hive_table, which has at least 4 million rows and many columns.

Thanks for everything!

pyspark apache-spark-sql hive great-expectations apache-spark-3.0
1 Answer

You are getting this error because you are passing a Great Expectations SparkDFDataset when building the batch request, while build_batch_request expects a plain Spark DataFrame, which is what actually has the persist method.

Try this (a SparkDFDataset exposes the underlying Spark DataFrame via its spark_df attribute):

my_batch_request = data_asset.build_batch_request(dataframe=df_ge.spark_df)
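
More generally, with the Fluent Datasources API you do not need the SparkDFDataset wrapper at all: you can pass the raw Spark DataFrame straight to build_batch_request and define Expectations on the validator. Below is a minimal sketch of the full flow, reusing the names from the question; expect_column_values_to_not_be_null and "some_column" are purely illustrative placeholders, not something from the original post:

import great_expectations as ge
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("GE_TEST").getOrCreate()
spark.sql("use DB1")
# Plain Spark DataFrame -- no SparkDFDataset wrapper needed.
hive_table = spark.sql("SELECT * FROM TABLEX")

context = ge.get_context()
datasource = context.sources.add_spark("my_spark_datasource")
data_asset = datasource.add_dataframe_asset(name="my_df_asset")

# Pass the raw DataFrame; it has the persist() method the execution engine calls.
batch_request = data_asset.build_batch_request(dataframe=hive_table)

context.add_or_update_expectation_suite(expectation_suite_name="test")
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="test",
)

# Illustrative Expectation; replace "some_column" with a real column of TABLEX.
validator.expect_column_values_to_not_be_null("some_column")
validator.save_expectation_suite(discard_failed_expectations=False)

# Render the static HTML Data Docs pages the question asks about.
context.build_data_docs()

Note that if you are running with an ephemeral (in-memory) context, you may need to configure a file-backed context or run a checkpoint for validation results to show up in the generated Data Docs.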
