How to join two DataFrames on a nested array attribute

Question  Votes: 0  Answers: 1

I have two DataFrames created by ingesting JSON data, with the following schemas:


provider:

{
    npi: "..."
    name: "..."
    location: {
        address: "...",
        insurances: ["...", "..."],
        ...
    },
    ...
}

insurance:

{
    id: ...,
    ...
}

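I want to join the provider DataFrame to the insurance DataFrame where provider.location.insurances contains insurance.id, and add the matching insurance records as a new array field named insurances. Is this possible?

The resulting data structure would look like this: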

{
    npi: "..."
    name: "..."
    location: {
        address: "...",
        insurances: [123, ...],
        ...
    },
    insurances: [{id: 123, ...}, ...]
}
apache-spark join pyspark nested
1 Answer
0
votes

I was able to do this with the code below, but if you see a more memory-efficient way to do it, please let me know. Thanks.

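from pyspark.sql import functions as F

# Pack the insurance columns into one struct so they can be collected per provider.
insuranceDF = insuranceDF.withColumn("insurance", F.struct("uuid", "carrier_name"))

# Match each provider to the insurances whose uuid appears in its array, then gather them into an "insurances" array.
joinDF = providerDF.join(insuranceDF, F.expr("array_contains(location.insurances, uuid)")).groupBy("npi", "location").agg(F.collect_list("insurance").alias("insurances"))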