Splittable DoFn causes "Shuffle key too large" error

Question (Votes: 0, Answers: 1)

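I am trying to implement a ListFlatten function. I first implemented it as a simple DoFn, which works fine, but I want it to be parallelized, so I am converting it to a splittable DoFn. I managed to get a unit test with 5000 elements running locally using the DirectRunner, but running the same code on Dataflow fails with the error below.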

Error Details: 
java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.lang.RuntimeException: java.io.IOException: INVALID_ARGUMENT: Shuffle key too large:3749653 > 1572864
at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output (GroupAlsoByWindowsParDoFn.java:184)
at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner$1.outputWindowedValue (GroupAlsoByWindowFnRunner.java:102)
at org.apache.beam.runners.dataflow.worker.util.BatchGroupAlsoByWindowViaIteratorsFn.processElement (BatchGroupAlsoByWindowViaIteratorsFn.java:126)
at org.apache.beam.runners.dataflow.worker.util.BatchGroupAlsoByWindowViaIteratorsFn.processElement (BatchGroupAlsoByWindowViaIteratorsFn.java:54)
at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.invokeProcessElement (GroupAlsoByWindowFnRunner.java:115)
at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.processElement (GroupAlsoByWindowFnRunner.java:73)
at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn.processElement (GroupAlsoByWindowsParDoFn.java:114)
at org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process (ParDoOperation.java:44)
at org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process (OutputReceiver.java:49)
at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop (ReadOperation.java:201)
Caused by: org.apache.beam.sdk.util.UserCodeException: java.lang.RuntimeException: java.io.IOException: INVALID_ARGUMENT: Shuffle key too large:3749653 > 1572864
at com.abc.common.batch.functions.AbcListFlattenFn.splitRestriction (AbcListFlattenFn.java:68)

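The differences in data between the local DirectRunner run and the Cloud Dataflow run are given below.

Local DirectRunner:

  1. The sample input PCollection element has 5000 abcs

DataflowRunner in the cloud:

  1. 600 input PCollection elements, each with a different number of abcs
  2. A few input elements have 50000 abcs to flatten

    public class AbcList implements Serializable {
        private List<Abc> abcs;
        private List<Xyz> xyzs;
        // Getters getAbcs() and getXyzs() are used below but were omitted from the post.
    }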
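Since `AbcList` is `Serializable`, one way to get a feel for how large a single element becomes is to measure its Java-serialized size and compare it with the limit quoted in the error message. The sketch below is only an illustration under stated assumptions: it assumes the element is encoded with plain Java serialization, and it uses a hypothetical stand-in payload (distinct 75-character strings) because the real `Abc` and `Xyz` classes are not shown. The class and method names are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;

public class SerializedSizeCheck {
    // Limit quoted in the error message ("... > 1572864").
    public static final long SHUFFLE_KEY_LIMIT_BYTES = 1_572_864L;

    /** Returns the number of bytes Java serialization produces for the given object. */
    public static long serializedSize(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    /** Hypothetical stand-in payload: `count` distinct strings of 75 characters each. */
    public static ArrayList<String> sampleList(int count) {
        ArrayList<String> list = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            list.add(String.format("%075d", i)); // zero-padded to 75 digits
        }
        return list;
    }

    public static void main(String[] args) {
        for (int count : new int[] {5_000, 50_000}) {
            long size = serializedSize(sampleList(count));
            System.out.println(count + " entries -> " + size + " bytes, over limit: "
                + (size > SHUFFLE_KEY_LIMIT_BYTES));
        }
    }
}
```

With this stand-in payload, the 5000-entry list stays under the limit while the 50000-entry list exceeds it, which is consistent with the job passing locally on small data and failing on the larger elements. The real numbers depend on what `Abc` and `Xyz` contain and on the coder actually in use.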

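    import java.util.List;
    import org.apache.beam.sdk.io.range.OffsetRange;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.splittabledofn.OffsetRangeTracker;
    import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;
    import org.apache.beam.sdk.values.KV;

    public class AbcListFlattenFn extends DoFn<AbcList, KV<Abc, List<Xyz>>> {

        // `log` is not declared in the posted code; a logger field is assumed to exist.

        @ProcessElement
        public void process(@Element AbcList input,
            ProcessContext context, RestrictionTracker<OffsetRange, Long> tracker) {

            try {
                /* Non-splittable version, kept for reference:
                   input.getAbcs().stream().forEach(abc -> {
                       context.output(KV.of(abc, input.getXyzs()));
                   }); */

                // Claim each index in the restriction before emitting its element.
                for (long index = tracker.currentRestriction().getFrom();
                    tracker.tryClaim(index);
                    ++index) {
                    context.output(
                        KV.of(input.getAbcs().get(Math.toIntExact(index)), input.getXyzs()));
                }
            } catch (Exception e) {
                log.error("Flattening AbcList has failed ", e);
            }
        }

        // The initial restriction covers every index of the abcs list.
        @GetInitialRestriction
        public OffsetRange getInitialRestriction(AbcList input) {
            return new OffsetRange(0, input.getAbcs().size());
        }

        // Split into ranges of about 5000 offsets (the whole list if it is smaller),
        // with a minimum of 2000 offsets per range.
        @SplitRestriction
        public void splitRestriction(final AbcList input,
            final OffsetRange range, final OutputReceiver<OffsetRange> receiver) {
            List<OffsetRange> ranges =
                range.split(input.getAbcs().size() > 5000 ? 5000
                    : input.getAbcs().size(), 2000);
            for (final OffsetRange p : ranges) {
                receiver.output(p);
            }
        }

        @NewTracker
        public OffsetRangeTracker newTracker(OffsetRange range) {
            return new OffsetRangeTracker(range);
        }
    }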

Can someone suggest what is going wrong with the ListFlatten function here? Is splitRestriction causing the problem? How can I resolve the shuffle key size issue?
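To make the split sizes in the question concrete, here is a plain-Java sketch of chunking an offset range with a desired chunk size and a minimum chunk size. It is a simplified stand-in written for illustration, not Beam's `OffsetRange.split` implementation, and the class and method names are hypothetical. The `desiredSize` helper mirrors the `size > 5000 ? 5000 : size` expression from `splitRestriction`.

```java
import java.util.ArrayList;
import java.util.List;

public class RangeChunking {
    /** Splits [from, to) into consecutive {start, end} pairs of about `desired` offsets each. */
    public static List<long[]> split(long from, long to, long desired, long min) {
        if (from < to && desired <= 0) {
            throw new IllegalArgumentException("desired must be positive for a non-empty range");
        }
        List<long[]> ranges = new ArrayList<>();
        long start = from;
        while (start < to) {
            long end = Math.min(start + desired, to);
            // Fold a too-small tail into the current chunk instead of emitting it on its own.
            if (to - end < min) {
                end = to;
            }
            ranges.add(new long[] {start, end});
            start = end;
        }
        return ranges;
    }

    /** Mirrors the sizing expression used in splitRestriction in the question. */
    public static long desiredSize(long listSize) {
        return listSize > 5000 ? 5000 : listSize;
    }

    public static void main(String[] args) {
        for (long size : new long[] {5_000, 50_000}) {
            List<long[]> ranges = split(0, size, desiredSize(size), 2000);
            System.out.println(size + " offsets -> " + ranges.size() + " range(s)");
        }
    }
}
```

Under this simplified rule, a 5000-entry list yields a single range and a 50000-entry list yields ten ranges of 5000 offsets. If the real split behaves similarly, the local 5000-element test would never have produced more than one restriction, so it would not have exercised the same path as the large elements on Dataflow.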

google-cloud-dataflow apache-beam apache-beam-io
1 Answer

Votes: 1

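The shuffle key size limit is being hit because of the size of the original element. To get around this, you may want to add a Reshuffle before the SDF (splittable DoFn). The Reshuffle will help with the first round of distribution.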
