I'm trying to add a Koalas DataFrame to an EntitySet. Here is the code:
subset_kdf_fp_eta_gt_prd.spark.print_schema()
root
|-- booking_code: string (nullable = true)
|-- order_id: string (nullable = true)
|-- restaurant_id: string (nullable = true)
|-- country_id: long (nullable = true)
|-- inferred_prep_time: long (nullable = true)
|-- inferred_wait_time: long (nullable = true)
|-- is_integrated_model: integer (nullable = true)
|-- sub_total: double (nullable = true)
|-- total_quantity: integer (nullable = true)
|-- dish_name: string (nullable = true)
|-- sub_total_in_sgd: double (nullable = true)
|-- city_id: long (nullable = true)
|-- hour: integer (nullable = true)
|-- weekday: integer (nullable = true)
|-- request_time_epoch_utc: timestamp (nullable = true)
|-- year: string (nullable = true)
|-- month: string (nullable = true)
|-- day: string (nullable = true)
|-- is_takeaway: string (nullable = false)
|-- is_scheduled: string (nullable = false)
import featuretools as ft
from woodwork.logical_types import Categorical, Double, Integer, NaturalLanguage, Datetime, Boolean

es = ft.EntitySet(id="koalas_es")
es.add_dataframe(
    dataframe_name="fp_eta_gt_prd",
    dataframe=subset_kdf_fp_eta_gt_prd,
    index="order_id",
    time_index="request_time_epoch_utc",
    already_sorted=False,  # a boolean; the string "false" would be truthy
    logical_types={
        "booking_code": Categorical,
        "order_id": Categorical,
        "restaurant_id": Categorical,
        "country_id": Double,
        "inferred_prep_time": Double,
        "inferred_wait_time": Double,
        "is_integrated_model": Categorical,
        "sub_total": Double,
        "total_quantity": Integer,
        "dish_name": NaturalLanguage,
        "sub_total_in_sgd": Double,
        "city_id": Categorical,
        "hour": Categorical,
        "weekday": Categorical,
        "request_time_epoch_utc": Datetime,
        "year": Categorical,
        "month": Categorical,
        "day": Categorical,
        "is_takeaway": Categorical,
        "is_scheduled": Categorical,
    },
)
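When I run this, I get the error "Index names must be exactly matched currently." I have double-checked all the field names, the uniqueness of the index, and so on. I'm not sure what is causing the error here.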
I ran into a similar situation when trying to add a column to a pyspark.sql.dataframe.DataFrame:
df['new_column'] = df.pandas_api().apply(somefunc)
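This raised ValueError: Index names must be exactly matched currently.
To diagnose it, I looked at the index of the original DataFrame and the index of the result returned by apply: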
result = df.pandas_api().apply(somefunc)
print(df.index)
print(result.index)
The output was:
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object', name='my_index')
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
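Note that the name is missing entirely from the result's index. The ValueError should be read literally: the names on the two indexes must match exactly.
The code that fixed the problem: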
result = df.pandas_api().apply(somefunc)
result.index.name = 'my_index'
df['new_column'] = result
No ValueError!
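If you hit this in more than one place, the check and the fix can be wrapped in a small helper. This is only a minimal sketch: `align_index_name` is a hypothetical name, not part of pandas-on-Spark or Featuretools, and it assumes a single-level index whose `.index.name` is writable, as in the fix above. The stand-in objects are there only so the snippet runs without a Spark session.

```python
from types import SimpleNamespace


def align_index_name(result, target):
    """Copy target's index name onto result if the two names differ.

    Hypothetical helper. Assumes both objects expose a single-level
    index with a writable `.index.name` attribute.
    """
    if result.index.name != target.index.name:
        result.index.name = target.index.name
    return result


# Stand-in objects (not real DataFrames) that mimic the situation above:
# the original index is named 'my_index', the result's index has no name.
df = SimpleNamespace(index=SimpleNamespace(name="my_index"))
result = SimpleNamespace(index=SimpleNamespace(name=None))

align_index_name(result, df)
print(result.index.name)  # my_index
```

With real data you would pass the DataFrame and the result of apply instead of the stand-ins, then do the assignment as before.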