MLlib RegexTokenizer drops accented characters


I am testing the MLlib tokenizers with pySpark (Python 3):

# -*- coding: utf-8 -*-

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
from pyspark.ml.feature import RegexTokenizer

spark = SparkSession.builder.getOrCreate()

# Creating the DataFrame
sentenceData = spark.createDataFrame([
    ("Eu acho que MLlib é incrível!",),
    ("Muito mais legal do que scikit-learn",)
], ["raw"])

# Adding a sequential index column to the DataFrame
w = Window.orderBy("raw")
sentenceData = sentenceData.withColumn("id", row_number().over(w))

# Configuring the RegexTokenizer (splits on non-word characters)
regexTokenizer = RegexTokenizer(inputCol="raw", outputCol="words", pattern="\\W")

# Applying the tokenizer to the dataset
sentenceData = regexTokenizer.transform(sentenceData)

sentenceData.select(
    "id", "raw", "words"
).show(truncate=False)

The result looks like this:

+---+------------------------------------+--------------------------------------------+
|id |raw                                 |words                                       |
+---+------------------------------------+--------------------------------------------+
|1  |Eu acho que MLlib é incrível!       |[eu, acho, que, mllib, incr, vel]           |
|2  |Muito mais legal do que scikit-learn|[muito, mais, legal, do, que, scikit, learn]|
+---+------------------------------------+--------------------------------------------+

As you can see, because of the character "í", "incrível" (Portuguese for "incredible") is split into two "new words". I couldn't find any documentation addressing this, so I'm lost here!
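The split happens because `RegexTokenizer` uses Java regex, where `\w` defaults to the ASCII class `[a-zA-Z0-9_]`, so an accented letter like "í" counts as a non-word character and matches the `\W` delimiter pattern. A minimal sketch with Python's `re` module, using the `re.ASCII` flag to mimic Java's default behavior (Python's own `\w` is Unicode-aware):

```python
import re

text = "Eu acho que MLlib é incrível!"

# ASCII \W (Java's default): accented letters count as delimiters
ascii_tokens = [t for t in re.split(r"\W+", text, flags=re.ASCII) if t]
# -> ['Eu', 'acho', 'que', 'MLlib', 'incr', 'vel']  (é dropped, incrível split)

# Unicode-aware \w keeps accented letters inside tokens
unicode_tokens = re.findall(r"\w+", text)
# -> ['Eu', 'acho', 'que', 'MLlib', 'é', 'incrível']
```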

I tried changing the 'pattern' directly in the regexTokenizer configuration, including 'í' and other variations, such as putting the '\w' char inside a character class (e.g. pattern="[\\Wí\\w]+"), but nothing worked! Is there some way to set a "Portuguese" locale, or to somehow force Spark not to drop the accented characters?

Thanks!

regex tokenize apache-spark-mllib
1 Answer

Try

pattern="[\\p{L}\\w]+"

It worked for me using the following Scala code:

// gaps=false makes the pattern match the tokens themselves, not the delimiters
val tokenizer = new RegexTokenizer().setGaps(false)
                .setPattern("[\\p{L}\\w]+")
                .setInputCol("raw")
                .setOutputCol("words")