I wrote a Hadoop Streaming job that uses Python code to transform data, but the job keeps failing: when the input file is large (e.g. 70 MB) it hangs in the reduce phase, while with a smaller input (e.g. 700 KB) it runs successfully. Here are some logs:
Reducer container logs:
2024-08-23 10:08:26,080 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl : finalMerge called with 11 in-memory map-outputs and 0 on-disk map-outputs
2024-08-23 10:08:26,086 INFO [main] org.apache.hadoop.mapred.Merger : Merging 11 sorted segments
2024-08-23 10:08:26,087 INFO [main] org.apache.hadoop.mapred.Merger : Down to the last merge-pass, with 10 segments left of total size : 79232287 bytes
2024-08-23 10:08:26,466 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl : Merged 11 segments, 79233409 bytes to disk to satisfy reduce memory limit
2024-08-23 10:08:26,469 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl : Merging 1 files, 13421702 bytes from disk
2024-08-23 10:08:26,472 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl : Merging 0 segments, 0 bytes from memory into reduce
2024-08-23 10:08:26,472 INFO [main] org.apache.hadoop.mapred.Merger : Merging 1 sorted segments
2024-08-23 10:08:26,480 INFO [main] org.apache.hadoop.mapred.Merger : Down to the last merge-pass, with 1 segments left of total size : 79233279 bytes
Application master logs:
24/08/05 10:08:51 INFO mapreduce.Job: map 100% reduce 100%
24/08/05 10:29:19 INFO mapreduce.Job: Task Id attempt_XXXXXX, Status: FAILED
AttemptID : attempt_XXXXXX Timed out after 1200 secs
I also checked the counters:
Map input records: 703,640 (this is correct)
Map output records: 685,583
Reduce input records: 685,583 (not correct)
Custom Counter From Code, ReduceInputRecords: 685,489 (counted in the reducer code itself)
This suggests that the true number of records received on the reduce side is 685,489, not the 685,583 that Hadoop's counter reports. The code appears to hang on the line that reads from sys.stdin. Does anyone know why?
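For reference, the reducer follows the standard Hadoop Streaming shape: it reads tab-separated key/value lines from sys.stdin and reports the custom counter through reporter lines on stderr. Below is a simplified sketch of that structure, not the actual job code; the counter group/name split is an assumption:

```python
#!/usr/bin/env python
import sys

def main():
    seen = 0
    for line in sys.stdin:  # the read loop the task appears to hang in
        key, _, value = line.rstrip("\n").partition("\t")
        seen += 1
        # ... actual transform logic omitted ...
    # Hadoop Streaming counter protocol: "reporter:counter:<group>,<name>,<amount>"
    # written to stderr. Group/name below assumed to match the counter listed above.
    sys.stderr.write(
        "reporter:counter:Custom Counter From Code,ReduceInputRecords,%d\n" % seen)

if __name__ == "__main__":
    main()
```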
Update: I found the root cause. The reduce task was simply too expensive and took too long to run. After adding some status and counter reporting to my reducer, I could see that a for loop in the reducer with O(N*N) complexity was consuming all the time, so the attempt was killed by the 1200-second task timeout before it could finish.
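What made this visible (and what keeps long-running attempts alive) is Hadoop Streaming's reporter protocol: a `reporter:status:<message>` or `reporter:counter:<group>,<name>,<amount>` line written to stderr counts as task progress, so the framework resets the mapreduce.task.timeout clock (1200 s here) instead of killing the attempt. A minimal sketch of the reporting pattern; the helper names, counter names, and the `process_key`/`values` structure are hypothetical, since the real loop body isn't shown:

```python
import sys

def report_status(msg):
    # Hadoop Streaming treats "reporter:status:<message>" lines on stderr as
    # task progress, which resets the task-timeout clock.
    sys.stderr.write("reporter:status:%s\n" % msg)

def report_counter(group, name, amount):
    # Same protocol for counters: "reporter:counter:<group>,<name>,<amount>".
    sys.stderr.write("reporter:counter:%s,%s,%d\n" % (group, name, amount))

def process_key(key, values):
    # Hypothetical stand-in for the O(N*N) loop that made the reducer look hung.
    n = len(values)
    for i, a in enumerate(values):
        if i % 1000 == 0:  # heartbeat every 1000 outer iterations
            report_status("key=%s: outer row %d of %d" % (key, i, n))
        for b in values:
            pass  # pairwise work on (a, b) would go here
    report_counter("Custom Counter From Code", "KeysProcessed", 1)
```

With heartbeats like this in place, the attempt survives long enough for the counters to pinpoint the slow loop; the durable fix is still to rewrite the quadratic pass itself.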