I have two DataFrames created by ingesting JSON data, with the following schemas:
provider:
{
npi: "..."
name: "..."
location: {
address: "...",
insurances: ["...", "..."],
...
},
...
}
insurance:
{
id: ...,
...
}
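I want to join the provider DataFrame to the insurance DataFrame where provider.location.insurances contains insurance.id, and add the matching insurance records as a new array field, insurances. Is this possible?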
The resulting data structure would look like this:
{
npi: "..."
name: "..."
location: {
address: "...",
insurances: [123, ...],
...
},
insurances: [{id: 123, ...}, ...]
}
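To make the intended result concrete, here is a minimal plain-Python sketch of the semantics I'm after. It is not Spark code; the sample records and the helper name attach_insurances are made up for illustration only.

```python
def attach_insurances(provider, insurances_by_id):
    """Return a copy of provider with the matching insurance records under 'insurances'."""
    wanted_ids = provider["location"]["insurances"]
    # Keep only the IDs that actually exist in the insurance data.
    matched = [insurances_by_id[i] for i in wanted_ids if i in insurances_by_id]
    return {**provider, "insurances": matched}


# Hypothetical sample data, shaped like the schemas above.
provider = {
    "npi": "1",
    "name": "A",
    "location": {"address": "x", "insurances": [123, 456]},
}
insurance_rows = [
    {"id": 123, "carrier_name": "Acme"},
    {"id": 789, "carrier_name": "Other"},
]

# Index the insurance records by ID so each lookup is a dict access.
insurances_by_id = {row["id"]: row for row in insurance_rows}
result = attach_insurances(provider, insurances_by_id)
```

The original location.insurances array of IDs stays untouched; the matched records go into a separate top-level insurances field.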
I can accomplish this with the code below, but if you see a more memory-efficient way to do it, please let me know. Thanks.
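from pyspark.sql import functions as F
# Pack the insurance columns into one struct, then join where the provider's ID array contains uuid (the insurance ID column here)
insuranceDF = insuranceDF.withColumn("insurance", F.struct("uuid", "carrier_name"))
joinDF = providerDF.join(insuranceDF, F.expr("array_contains(location.insurances, uuid)")).groupBy("npi", "location").agg(F.collect_list("insurance").alias("insurances"))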