在火花的rdd中找到最少的子集

问题描述 投票:0回答:1

任何人都可以帮助我,我有一个像BitSets的RDD

Array[scala.collection.mutable.BitSet] = Array(BitSet(1, 2), BitSet(1, 7), BitSet(8, 9, 10, 11), BitSet(1, 2, 3, 4),BitSet(8,9,10),BitSet(1,2,3))

我想要(BitSet(1,2),BitSet(1,7),BitSet(8,9,10))的rdd,我需要最少的子集或BitSet,它不是任何BitSet的子集

说明:这里(1,2),(1,2,3)(1,2,3,4) -----这里最少的子集是(1,2)(1,7)不是其他BitSet的子集---所以(1,7)也出现在结果中

scala apache-spark
1个回答
4
投票

可以使用聚合来解决此问题

val rdd = sc.parallelize(
  Seq(BitSet(1, 2),
      BitSet(1, 7),
      BitSet(8, 9, 10, 11),
      BitSet(1, 2, 3, 4),
      BitSet(8, 9, 10),
      BitSet(1, 2, 3)))
val result = rdd.aggregate(Seq[BitSet]())(accumulate, merge)
println(result)

输出:

List(BitSet(8, 9, 10), BitSet(1, 7), BitSet(1, 2))

accumulate是一个收集每个spark分区数据的函数

def accumulate(acc: Seq[BitSet], elem: BitSet): Seq[BitSet] = {
  val subsets = acc.filterNot(elem.subsetOf)
  if (notSuperSet(subsets)(elem)) elem +: subsets else subsets
}

merge结合了分区的结果

def merge(left: Seq[BitSet], right: Seq[BitSet]): Seq[BitSet] = {
  val leftSubsets = left.filter(notSuperSet(right))
  val rightSubsets = right.filter(notSuperSet(leftSubsets))
  leftSubsets ++ rightSubsets
}

和辅助函数notSuperSet,它是一个谓词,用于检查新集是否是其他集的超集

def notSuperSet(subsets: Seq[BitSet])(set: BitSet): Boolean =
  subsets.forall(!_.subsetOf(set))
© www.soinside.com 2019 - 2024. All rights reserved.