I am trying to generate character-level trigrams (3-letter n-grams), but Spark's NGram
inserts a space between the characters. I want to remove (or never produce) this whitespace. I could explode the array, strip the spaces, and reassemble the array, but that would be a very expensive operation (sketched below, after the tables). I would also prefer to avoid creating a UDF, given the performance problems of PySpark UDFs. Is there a cheaper way using PySpark's built-in functions?
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, NGram

wordDataFrame = spark.createDataFrame([
    (0, "Hello I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic regression models are neat")
], ["id", "words"])

pipeline = Pipeline(stages=[
    # pattern="" makes RegexTokenizer emit one token per character
    RegexTokenizer(pattern="", inputCol="words", outputCol="tokens", minTokenLength=1),
    NGram(n=3, inputCol="tokens", outputCol="ngrams")
])

model = pipeline.fit(wordDataFrame).transform(wordDataFrame)
model.show()
The current output is:
+---+--------------------+--------------------+--------------------+
| id|               words|              tokens|              ngrams|
+---+--------------------+--------------------+--------------------+
|  0|Hello I heard abo...| [h, e, l, l, o, ...|  [h e l, e l l, ...|
+---+--------------------+--------------------+--------------------+
But what is needed is:
+---+--------------------+--------------------+--------------------+
| id|               words|              tokens|              ngrams|
+---+--------------------+--------------------+--------------------+
|  0|Hello I heard abo...| [h, e, l, l, o, ...| [hel, ell, llo, ...|
+---+--------------------+--------------------+--------------------+
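For reference, the explode-and-regroup route mentioned above would look roughly like the sketch below (hypothetical code, reusing the model dataframe from the pipeline above). The groupBy forces a shuffle, which is what makes it expensive:

from pyspark.sql import functions as F

# Explode, clean each element, regroup. posexplode keeps each element's
# index so the original order can be restored afterwards, because
# collect_list alone does not guarantee element order.
exploded = (model
    .select("id", F.posexplode("ngrams").alias("pos", "ngram"))
    .withColumn("ngram", F.regexp_replace("ngram", " ", "")))

regrouped = (exploded
    .groupBy("id")  # shuffle: this is the costly step
    .agg(F.sort_array(F.collect_list(F.struct("pos", "ngram"))).alias("tmp"))
    .select("id", F.col("tmp.ngram").alias("ngrams")))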
You can achieve this with the higher-order function transform and regexp_replace (Spark 2.4+, assuming the ngrams column is an ArrayType of StringType).
# sample dataframe
df.show()
+---+----------------+---------------+--------------+
| id|           words|         tokens|        ngrams|
+---+----------------+---------------+--------------+
|  0|Hi I heard about|[h, e, l, l, o]|[h e l, e l l]|
+---+----------------+---------------+--------------+
from pyspark.sql import functions as F

# transform applies the lambda to every element of the array, avoiding a UDF
df.withColumn("ngrams", F.expr("transform(ngrams, x -> regexp_replace(x, ' ', ''))")).show()
+---+----------------+---------------+----------+
| id|           words|         tokens|    ngrams|
+---+----------------+---------------+----------+
|  0|Hi I heard about|[h, e, l, l, o]|[hel, ell]|
+---+----------------+---------------+----------+
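On Spark versions earlier than 2.4, where higher-order functions like transform are not available, one possible workaround (an untested sketch, not part of the original answer) is to join the array into a single string, strip the spaces, and split it back:

from pyspark.sql import functions as F

# Pre-2.4 sketch: flatten the array to one string, drop the spaces,
# then split back into an array. Assumes the tokens themselves contain
# no commas; an empty ngrams array round-trips to [""] rather than [].
df.withColumn(
    "ngrams",
    F.split(F.regexp_replace(F.concat_ws(",", "ngrams"), " ", ""), ",")
).show()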