What is the difference between the two DAG scheduler times in a Spark run log?

Question · votes: 1 · answers: 1

I am running a Spark job that logs its progress as it runs. At the end it reports two different times, both of which appear to refer to completion time. What is the difference between these two times?

Is the difference the read/write time, or is some aggregation overhead being added on top?

DAGScheduler:54 - ResultStage 1 (runJob at SparkHadoopWriter.scala:78) finished in 41.988 s
DAGScheduler:54 - Job 0 finished: runJob at SparkHadoopWriter.scala:78, took 67.610115 s
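
The job itself is not shown in the question. As a rough illustration only, a job of the following shape (input/output paths and names are hypothetical) produces exactly this pair of lines: a shuffle stage for the sort, then a result stage whose action is logged as "runJob at SparkHadoopWriter.scala".

import org.apache.spark.sql.SparkSession

object TimingExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dag-scheduler-timing").getOrCreate()
    val sc = spark.sparkContext

    // sortByKey introduces a shuffle, so the job runs as two stages:
    // ShuffleMapStage 0 (sorting) and ResultStage 1 (writing the output).
    sc.textFile("hdfs:///input/numbers.txt")      // hypothetical input path
      .map(line => (line.toLong, line))
      .sortByKey()
      .saveAsTextFile("hdfs:///output/sorted")    // this action shows up as "runJob at SparkHadoopWriter.scala"

    spark.stop()
  }
}

In a job of this shape, the "ResultStage 1 ... finished in 41.988 s" line times only the last stage, from its submission until its final task completes, while "Job 0 finished ... took 67.610115 s" is measured around the entire runJob call, so it also covers the earlier shuffle stage and the scheduling in between, which is why the second number is larger.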

Longer output:

.
.
.
2019-01-15 21:25:32 INFO  TaskSetManager:54 - Finished task 2974.0 in stage 1.0 (TID 5956) in 898 ms on 172.17.6.100 (executor 8) (2982/2982)
2019-01-15 21:25:32 INFO  TaskSchedulerImpl:54 - Removed TaskSet 1.0, whose tasks have all completed, from pool 
2019-01-15 21:25:32 INFO  DAGScheduler:54 - ResultStage 1 (runJob at SparkHadoopWriter.scala:78) finished in 41.988 s
2019-01-15 21:25:32 INFO  DAGScheduler:54 - Job 0 finished: runJob at SparkHadoopWriter.scala:78, took 67.610115 s
2019-01-15 21:25:45 INFO  SparkHadoopWriter:54 - Job job_20190115212425_0001 committed.
2019-01-15 21:25:45 INFO  AbstractConnector:318 - Stopped Spark@4d4d8fcf{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2019-01-15 21:25:45 INFO  SparkUI:54 - Stopped Spark web UI at http://node-100.iris-cluster.uni.lux:4040
2019-01-15 21:25:45 INFO  StandaloneSchedulerBackend:54 - Shutting down all executors
2019-01-15 21:25:45 INFO  CoarseGrainedSchedulerBackend$DriverEndpoint:54 - Asking each executor to shut down
2019-01-15 21:25:45 INFO  MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2019-01-15 21:25:45 INFO  MemoryStore:54 - MemoryStore cleared
2019-01-15 21:25:45 INFO  BlockManager:54 - BlockManager stopped
2019-01-15 21:25:45 INFO  BlockManagerMaster:54 - BlockManagerMaster stopped
2019-01-15 21:25:45 INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2019-01-15 21:25:45 INFO  SparkContext:54 - Successfully stopped SparkContext
2019-01-15 21:25:45 INFO  ShutdownHookManager:54 - Shutdown hook called

What is the correct way to evaluate the timings in such an output log?
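
One way to pull these timings out without scraping the INFO lines is to register a SparkListener and read the same numbers from the scheduler events. A minimal sketch, assuming sc is an existing SparkContext (the names here are illustrative, not from the original post):

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart, SparkListenerStageCompleted}
import scala.collection.mutable

val jobStartTimes = mutable.Map[Int, Long]()   // jobId -> start timestamp in ms

sc.addSparkListener(new SparkListener {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val info = stageCompleted.stageInfo
    // submissionTime and completionTime are Options; both are set once the stage has run.
    for (start <- info.submissionTime; end <- info.completionTime) {
      println(s"Stage ${info.stageId} (${info.name}) ran for ${(end - start) / 1000.0} s")
    }
  }

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    jobStartTimes(jobStart.jobId) = jobStart.time
  }

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    val tookS = jobStartTimes.get(jobEnd.jobId).map(start => (jobEnd.time - start) / 1000.0)
    println(s"Job ${jobEnd.jobId} finished (${jobEnd.jobResult}), took ${tookS.getOrElse(Double.NaN)} s")
  }
})

The stage durations printed here roughly correspond to the "ResultStage ... finished in" figures, and the job duration to the "Job ... took" figure; the Spark web UI and the event log expose the same information.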

apache-spark directed-acyclic-graphs
1 Answer
1 vote

Consider the case where the DAG scheduler maintains a hash map to collect sorted lists from its "n" partitions. The result stage then ends once the list from the last partition has been received. However, the list of numbers from that last partition still has to be inserted into the hash map, which takes about log(total number of elements / number of partitions) time; call it log(nip), where nip is the number of elements in a partition. In addition, reading out the entire sorted list of numbers (to write it to a file) takes another log N. So overall we need roughly an extra "2 log N" of time.
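
Written out as a formula (with N the total number of elements and p the number of partitions, symbols introduced here only for this accounting), the claim above amounts to roughly:

T_{\text{extra}} \approx \log\!\left(\frac{N}{p}\right) + \log N \le 2\log N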

So if you increase the number of partitions (i.e. the number of worker nodes) from 2 to 2^4, the final latency drops from, say, 250 units to roughly 31 units.
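
The quoted figures line up if the per-partition cost is taken to shrink in proportion to the number of elements each partition holds; going from 2 to 2^4 = 16 partitions then divides that cost by 8 (this reading is an assumption, not stated explicitly above):

\frac{250}{2^{4}/2} = \frac{250}{8} \approx 31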

Hope this helps!
