Duplicates even though there should be none


I have a data frame that results from several joins. When I check it, it tells me that I have a duplicate, even though from my point of view that should be impossible. Here is an abstracted example:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
import pyspark.sql.functions as f

# Create a Spark session
spark = SparkSession.builder.appName("CreateDataFrame").getOrCreate()

# User input for number of rows
n_a = 10
n_a_c = 5
n_a_c_d = 3
n_a_c_e = 4

# Define the schema for the DataFrame
schema_a = StructType([StructField("id1", StringType(), True)])
schema_a_b = StructType(
    [
        StructField("id1", StringType(), True),
        StructField("id2", StringType(), True),
        StructField("extra", StringType(), True),
    ]
)
schema_a_c = StructType(
    [
        StructField("id1", StringType(), True),
        StructField("id3", StringType(), True),
    ]
)
schema_a_c_d = StructType(
    [
        StructField("id3", StringType(), True),
        StructField("id4", StringType(), True),
    ]
)
schema_a_c_e = StructType(
    [
        StructField("id3", StringType(), True),
        StructField("id5", StringType(), True),
    ]
)

# Rows: increasing integer ids for "id1"; the a_b rows add a constant id2 = "1" and extra = "A"
rows_a = [(str(i),) for i in range(1, n_a + 1)]
rows_a_integers = [str(i) for i in range(1, n_a + 1)]
rows_a_b = [(str(i), str(1), "A") for i in range(1, n_a + 1)]


def get_2d_list(ids_part_1: list, n_new_ids: int):
    # For every parent id, create n_new_ids child ids of the form "<parent>_<j>"
    return [
        (str(i), str(i) + "_" + str(j))
        for i in ids_part_1
        for j in range(1, n_new_ids + 1)
    ]


rows_a_c = get_2d_list(ids_part_1=rows_a_integers, n_new_ids=n_a_c)
rows_a_c_d = get_2d_list(ids_part_1=[i[1] for i in rows_a_c], n_new_ids=n_a_c_d)
rows_a_c_e = get_2d_list(ids_part_1=[i[1] for i in rows_a_c], n_new_ids=n_a_c_e)

# Create the DataFrame
df_a = spark.createDataFrame(rows_a, schema_a)
df_a_b = spark.createDataFrame(rows_a_b, schema_a_b)
df_a_c = spark.createDataFrame(rows_a_c, schema_a_c)
df_a_c_d = spark.createDataFrame(rows_a_c_d, schema_a_c_d)
df_a_c_e = spark.createDataFrame(rows_a_c_e, schema_a_c_e)

# Join everything
df_join = (
    df_a.join(df_a_b, on="id1")
    .join(df_a_c, on="id1")
    .join(df_a_c_d, on="id3")
    .join(df_a_c_e, on="id3")
)

# Nested structure: wrap id3 in a struct so the lower levels can be folded into it
df_nested = df_join.withColumn("id3", f.struct(f.col("id3"))).orderBy("id3")

# Collapse the lowest levels first: collect id5 and id4 into the id3 struct,
# then collect the id3 structs themselves into one array per remaining row
for index in [(5, 3), (4, 3), (3, None)]:
    remaining_columns = list(set(df_nested.columns).difference(set([f"id{index[0]}"])))

    # Group by all other columns and collect the current id level into an array
    df_nested = (
        df_nested.groupby(*remaining_columns)
        .agg(f.collect_list(f.col(f"id{index[0]}")).alias(f"id{index[0]}_tmp"))
        .drop(f"id{index[0]}")
        .withColumnRenamed(
            f"id{index[0]}_tmp",
            f"id{index[0]}",
        )
    )

    # Fold the collected array into its parent struct and drop the flat column
    if index[1]:
        df_nested = df_nested.withColumn(
            f"id{index[1]}",
            f.struct(
                f.col(f"id{index[1]}.*"),
                f.col(f"id{index[0]}"),
            ).alias(f"id{index[1]}"),
        ).drop(f"id{index[0]}")

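For reference, a quick way to inspect the nested structure the loop above produces (a minimal sketch, assuming the example code ran as written):

# Sketch: inspect the nested schema produced by the loop above.
# With the example data, id3 should end up as an array of structs that
# contain the id3 string plus the collected id5 and id4 arrays.
df_nested.printSchema()
df_nested.show(truncate=False)
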
I check for duplicates based on id3, which should be unique across the entire data frame at the second nesting level:

# Investigate for duplicates
df_test = df_nested.select("id2", "extra", f.explode(f.col("id3")["id3"]).alias("id3"))
df_test.groupby("id3").count().filter(f.col("count") > 1).show()

This tells me that id3 == 8_3 exists twice:

+---+-----+
|id3|count|
+---+-----+
|8_3|    2|
+---+-----+

However, id3 is clearly unique in the joined data frame, which can be shown as follows (id4 and id5 are the next level down):

df_join.groupby("id3", "id4", "id5").count().filter(f.col("count") > 1).show()

which results in

+---+---+---+-----+
|id3|id4|id5|count|
+---+---+---+-----+
+---+---+---+-----+

In case it helps: I am using Databricks Runtime Version 11.3 LTS (includes Apache Spark 3.3.0, Scala 2.12).

python apache-spark pyspark apache-spark-sql databricks
1 Answer

0 votes

The code you provided checks for duplicates based only on the id3 column. However, when checking for duplicates you have to take into account all of the columns that make a record unique. In your case, you should check for duplicates based on the combination of id3, id4 and id5.

Here is the updated code for checking duplicates:


# Flatten the nested structure level by level (Spark allows only one
# generator such as explode per select clause)
df_test = (
    df_nested.select("id2", "extra", f.explode(f.col("id3")).alias("id3_struct"))
    .select(
        "id2",
        "extra",
        f.col("id3_struct.id3").alias("id3"),
        f.explode(f.col("id3_struct.id4")).alias("id4"),
        f.col("id3_struct.id5").alias("id5_arr"),
    )
    .select("id2", "extra", "id3", "id4", f.explode(f.col("id5_arr")).alias("id5"))
)

df_test.groupby("id3", "id4", "id5").count().filter(f.col("count") > 1).show()

This should give you the correct output, showing only records that are truly duplicated based on the combination of id3, id4 and id5. If there are no duplicates, you should see an empty result:

+---+---+---+-----+
|id3|id4|id5|count|
+---+---+---+-----+
+---+---+---+-----+

If you still find duplicates, check the data at each join step to make sure that the original data frames contain no duplicates.
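
A minimal sketch of such per-step checks, assuming the DataFrames from the example above are still in scope (the count_duplicate_keys helper is hypothetical, not part of the original code):

# Hypothetical helper: count how many key combinations occur more than once
def count_duplicate_keys(df, key_columns):
    return df.groupby(*key_columns).count().filter(f.col("count") > 1).count()

# Source DataFrames: every key (combination) should be unique
print(count_duplicate_keys(df_a, ["id1"]))
print(count_duplicate_keys(df_a_b, ["id1"]))
print(count_duplicate_keys(df_a_c, ["id3"]))
print(count_duplicate_keys(df_a_c_d, ["id3", "id4"]))
print(count_duplicate_keys(df_a_c_e, ["id3", "id5"]))

# Intermediate joins: the combined key should stay unique after every step
step = df_a.join(df_a_b, on="id1")
print(count_duplicate_keys(step, ["id1"]))
step = step.join(df_a_c, on="id1")
print(count_duplicate_keys(step, ["id1", "id3"]))
step = step.join(df_a_c_d, on="id3")
print(count_duplicate_keys(step, ["id3", "id4"]))
step = step.join(df_a_c_e, on="id3")
print(count_duplicate_keys(step, ["id3", "id4", "id5"]))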
