Suppose I have a Parquet file p1 that is partitioned on column c. I have created a DataFrame on p1 and I join it with some other DataFrame, using column c in the filter condition. Does this help execution time?
# p1 is the path to the Parquet file partitioned on c.
# Note: createOrReplaceTempView returns None, so don't assign its result.
spark.read.parquet(p1).createOrReplaceTempView('t1')
spark.read.parquet(someNonPartitionedFile).createOrReplaceTempView('t2')
spark.sql("SELECT * FROM t1 JOIN t2 ON t1.keycolumn = t2.keycolumn WHERE c = 'somevalue'")
What is the execution order? Will partitioning the data reduce the query time?
Yes, partitioning the data helps reduce query time, because Spark only needs to read the relevant partitions instead of scanning the entire dataset. Since your filter condition is on the partition column, Spark applies partition pruning: the `c = 'somevalue'` predicate is pushed down to the scan of t1 before the join, so only the matching partition directories are read. This will reduce execution time.
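To see why pruning cuts I/O, it helps to remember how a partitioned Parquet table is laid out on disk. The following is a plain-Python illustration (not Spark itself, and the paths are made up for the sketch): one directory per partition value, and pruning simply means listing only the directories whose value matches the filter.

```python
import os
import tempfile

# Hive-style layout: one subdirectory per distinct value of the
# partition column c, e.g. .../c=2024-01-01/part-00000.parquet.
root = tempfile.mkdtemp()
for value in ("2024-01-01", "2024-01-02", "2024-01-03"):
    os.makedirs(os.path.join(root, f"c={value}"))

def pruned_dirs(root, wanted):
    # Partition pruning: keep only the directories whose partition
    # value satisfies the filter c = wanted; the rest are never opened.
    return [d for d in os.listdir(root) if d == f"c={wanted}"]

print(pruned_dirs(root, "2024-01-01"))  # only 1 of the 3 directories is read
```

With a selective filter on c, the amount of data scanned shrinks in proportion to the number of partitions eliminated, which is where the speedup comes from.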
I agree with @Steven. You can verify this with result.explain(True), or by checking the query plan in the Spark UI:
result = spark.sql("SELECT * FROM t1 JOIN t2 ON t1.keycolumn = t2.keycolumn WHERE c = '2024-01-01'")
result.explain(True)
Result:
== Optimized Logical Plan ==
Join Inner, (keycolumn#790L = keycolumn#796L)
:- Filter ((isnotnull(c#792) AND isnotnull(keycolumn#790L)) AND (c#792 = 2024-01-01))
: +- Relation [keycolumn#790L,value#791,c#792] parquet
+- Filter isnotnull(keycolumn#796L)
+- Relation [keycolumn#796L,other_value#797] parquet
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- == Initial Plan ==
ColumnarToRow
+- PhotonResultStage
+- PhotonBroadcastHashJoin [keycolumn#790L], [keycolumn#796L], Inner, BuildLeft, false, true
:- PhotonShuffleExchangeSource
: +- PhotonShuffleMapStage
: +- PhotonShuffleExchangeSink SinglePartition
: +- PhotonScan parquet [keycolumn#790L,value#791,c#792] DataFilters: [isnotnull(keycolumn#790L)], DictionaryFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[dbfs:/tmp/partitioned_data], OptionalDataFilters: [], PartitionFilters: [isnotnull(c#792), (c#792 = 2024-01-01)], ReadSchema: struct<keycolumn:bigint,value:string>, RequiredDataFilters: [isnotnull(keycolumn#790L)]
+- PhotonScan parquet [keycolumn#796L,other_value#797] DataFilters: [isnotnull(keycolumn#796L)], DictionaryFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[dbfs:/tmp/non_partitioned_data], OptionalDataFilters: [hashedrelationcontains(keycolumn#796L)], PartitionFilters: [], ReadSchema: struct<keycolumn:bigint,other_value:string>, RequiredDataFilters: [isnotnull(keycolumn#796L)]
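The key thing to look for in the plan above is the PartitionFilters entry on each scan: the partitioned side shows `(c#792 = 2024-01-01)` (pruning happened), while the non-partitioned side shows `[]` (full scan). As a quick sanity check you can grep the captured plan text instead of eyeballing it; `plan_text` below is an abridged copy of the two scan lines from the plan above:

```python
import re

# Abridged scan lines from the physical plan shown above.
plan_text = """
PhotonScan parquet [keycolumn#790L,value#791,c#792] PartitionFilters: [isnotnull(c#792), (c#792 = 2024-01-01)]
PhotonScan parquet [keycolumn#796L,other_value#797] PartitionFilters: []
"""

# Pull out each scan's PartitionFilters list; an empty list means no pruning.
filters = re.findall(r"PartitionFilters: \[(.*?)\]", plan_text)
for i, f in enumerate(filters, 1):
    status = "pruned" if f else "full scan"
    print(f"scan {i}: {status}")
```

If the partitioned scan ever shows `PartitionFilters: []`, the filter was not pushed down (for example, if c is wrapped in a function call in the WHERE clause) and Spark will read every partition.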