I have a dataset with a Sequences column that contains 0s and 1s.
Category Value Sequences
1 10 0
1 11 1
1 13 1
1 16 1
1 20 0
1 21 0
1 22 1
1 25 1
1 27 1
1 29 1
1 30 0
1 32 1
1 34 1
1 35 1
1 38 0
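Here, runs of consecutive 1s occur three times in the Sequences column. I need to sum Value separately for each of those runs.
I am trying the following code: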
%livy2.spark
import org.apache.spark.rdd.RDD
val df = df.select( $"Category", $"Value", $"Sequences").rdd.groupBy(x =>
(x.getInt(0))
).map(
x => {
val Category= x(0).getInt(0)
val Value= x(0).getInt(1)
val Sequences = x(0).getInt(2)
for (i <- x.indices){
val vi = x(i).getFloat(4)
if (vi(0) >0 )
{
summing+ = Value//
}
(Category, summing)
}
}
)
df_new.take(10).foreach(println)
When I run this code I get an error saying the statement is incomplete. The value df is the dataset shown above.
The expected output is:
Category summing
1 40
1 103
1 101
I can't tell where I'm going wrong. It would be great if someone could help me learn this.
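This can be done by assigning each row a unique id, and then placing each unit (a row with Sequences = 1) in the group bounded by the id of one zero row and the id of the next zero row: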
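// Imports used below; spark.implicits._ assumes a SparkSession named `spark` (as in spark-shell or a notebook)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(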
(1, 10, 0),
(1, 11, 1),
(1, 13, 1),
(1, 16, 1),
(1, 20, 0),
(1, 21, 0),
(1, 22, 1),
(1, 25, 1),
(1, 27, 1),
(1, 29, 1),
(1, 30, 0),
(1, 32, 1),
(1, 34, 1),
(1, 35, 1),
(1, 38, 0)
).toDF("Category", "Value", "Sequences")
// Assign each row a unique, increasing id (ids are not consecutive across partitions)
val zipped = df.withColumn("zip", monotonically_increasing_id())
// Build a range from each zero row to the next zero row
val categoryWindow = Window.partitionBy("Category").orderBy($"zip")
val groups = zipped
.filter($"Sequences" === 0)
.withColumn("rangeEnd", lead($"zip", 1).over(categoryWindow))
.withColumnRenamed("zip", "rangeStart")
println("Groups:")
groups.show(false)
// Assign each unit (row with Sequences = 1) to the range that contains it
val joinCondition = ($"units.zip" > $"groups.rangeStart").and($"units.zip" < $"groups.rangeEnd")
val unitsByRange = zipped
.filter($"Sequences" === 1).alias("units")
.join(groups.alias("groups"), joinCondition, "left")
.select("units.Category", "units.Value", "groups.rangeStart")
println("Units in groups:")
unitsByRange.show(false)
// Sum the values within each range
val result = unitsByRange
.groupBy($"Category", $"rangeStart")
.agg(sum("Value").alias("summing"))
.orderBy("rangeStart")
.drop("rangeStart")
println("Result:")
result.show(false)
Output:
Groups:
+--------+-----+---------+----------+----------+
|Category|Value|Sequences|rangeStart|rangeEnd |
+--------+-----+---------+----------+----------+
|1 |10 |0 |0 |4 |
|1 |20 |0 |4 |5 |
|1 |21 |0 |5 |8589934595|
|1 |30 |0 |8589934595|8589934599|
|1 |38 |0 |8589934599|null |
+--------+-----+---------+----------+----------+
Units in groups:
+--------+-----+----------+
|Category|Value|rangeStart|
+--------+-----+----------+
|1 |11 |0 |
|1 |13 |0 |
|1 |16 |0 |
|1 |22 |5 |
|1 |25 |5 |
|1 |27 |5 |
|1 |29 |5 |
|1 |32 |8589934595|
|1 |34 |8589934595|
|1 |35 |8589934595|
+--------+-----+----------+
Result:
+--------+-------+
|Category|summing|
+--------+-------+
|1 |40 |
|1 |103 |
|1 |101 |
+--------+-------+
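To check the grouping logic without a Spark session, the same idea can be sketched on plain Scala collections. This is only an illustration, not part of the answer above: the object name RunSums and the helper sumRuns are hypothetical, the rows are assumed to be already in their original order, and, unlike the join above, a run of 1s is summed even when it is not enclosed by zeros on both sides.

```scala
// Hypothetical helper for illustration: sums each run of consecutive 1-rows
// using plain Scala collections (no Spark required).
object RunSums {
  // Each row is (Category, Value, Sequences), in the original row order.
  def sumRuns(rows: Seq[(Int, Int, Int)]): Seq[(Int, Int)] = {
    // State: the completed (Category, sum) pairs, plus the run currently open, if any.
    val start = (Vector.empty[(Int, Int)], Option.empty[(Int, Int)])
    val (finished, last) = rows.foldLeft(start) {
      case ((done, open), (category, value, 1)) =>
        open match {
          // Same category: extend the open run.
          case Some((c, s)) if c == category => (done, Some((c, s + value)))
          // Category changed: close the old run and open a new one.
          case Some(run) => (done :+ run, Some((category, value)))
          // No run open yet: start one.
          case None => (done, Some((category, value)))
        }
      case ((done, open), _) =>
        // Any row with Sequences != 1 closes the open run.
        (done ++ open.toList, None)
    }
    // Close a run that reaches the end of the data.
    finished ++ last.toList
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      (1, 10, 0), (1, 11, 1), (1, 13, 1), (1, 16, 1), (1, 20, 0),
      (1, 21, 0), (1, 22, 1), (1, 25, 1), (1, 27, 1), (1, 29, 1),
      (1, 30, 0), (1, 32, 1), (1, 34, 1), (1, 35, 1), (1, 38, 0)
    )
    // Prints (1,40), (1,103) and (1,101), one per line.
    sumRuns(rows).foreach(println)
  }
}
```

A single pass like this only works when all rows fit in memory on one machine; for a distributed DataFrame the window-and-join approach shown earlier is the relevant one.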