How can I use a regex in PySpark to capture the last occurrence of a specific character from a UDF?

Question

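I am trying to use a regular expression (regex) to extract the last occurrence of a character from a PySpark DataFrame, so that I can do some data cleaning and parse the result into columns.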

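At the moment I am using a UDF (user-defined function) in which one of the fields will be user input. I am specifically looking for a way to capture the last closing square bracket, ]. I have tried a few different approaches but keep getting stuck, so I would appreciate any help you can offer. Thanks in advance!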

Sample text:

* the cat in the hat is fat\r\n
* synnopsis -- hey the cat who is wearing the hat is getting to big for its outfit as evidenced by this photo [img123123.png] \r\n
* please get a new cat, or a new hat I would suggest something like this [newhatforcat.jpg]
signed
bob <bob@somecompany.com>

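These body texts are put into a Python list, and I am running a PySpark function to split on the last ]. The problem is that if someone uses the ] character, or Outlook decides to put [imgblala.jpg] into the original text, all of my columns get overwritten.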

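I tried plugging in a regex that works on regex101 to capture the last ], which should be the ] that closes the Python list, but I get undefined. If I simply split on ] instead, the split happens inside the body, at img123123.png.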

When putting a regex into split, do I need to put some specific characters before the delimiter? Here is an example of what I am trying to do.

from pyspark.sql.functions import split

one_value = "['[email protected]', '[* the cat in the hat is fat\r\n* synnopsis -- hey the cat who is wearing the hat is getting to big for its outfit as evidenced by this photo [img123123.png] \r\n* please get a new cat, or a new hat I would suggest something like this [newhatforcat.jpg]\r\nsigned\r\nbob <bob@somecompany.com>]', '[email protected]']"


df = spark.createDataFrame(data=[(one_value,),], schema='emailobj: string')

df = df.withColumn("to", split(df["emailobj"], "]").getItem(0)) \
                     .withColumn("from", split(df["emailobj"], "]").getItem(2)) \
                     .withColumn("body", split(df["emailobj"], r"[^]]*]$").getItem(3)) \
                     .withColumn("messageid", split(df["emailobj"], "]").getItem(4)) \
                     .withColumn("subject", split(df["emailobj"], "]").getItem(6))

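I either get an undefined result, or the split on the last ] does not split at all. The regex seems to work as a Java regex on regex101; here is the link: https://regex101.com/r/1ZbVG3/2. Is there something I need to do to get Spark to see it correctly?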
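The symptom can be reproduced outside Spark. The sketch below uses Python's re module as a stand-in for the Java regex engine behind split (the two should agree for this particular pattern, but that is an assumption), and the sample string with its example.com addresses is a shortened, hypothetical version of the emailobj value.

```python
import re

# Shortened, hypothetical stand-in for the emailobj value in the question.
sample = "['to@example.com', '[see this photo [img123123.png] signed bob]', 'from@example.com']"

# Splitting on every "]" also cuts inside the body, at "[img123123.png]".
plain_parts = sample.split("]")

# The pattern is anchored with "$", so it can match only once: the run of
# non-"]" characters that ends at the final "]". Splitting on that match
# leaves the text before it plus one empty trailing string.
regex_parts = re.split(r"[^]]*]$", sample)

print(len(plain_parts))  # pieces produced by the plain split
print(len(regex_parts))  # pieces produced by the anchored regex
```

Because the anchored pattern yields so few pieces, there is no element at index 3 to fetch, which would be consistent with the empty result described above.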

regex apache-spark pyspark apache-spark-sql
1 Answer

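Assuming the value in that column is always a well-formed array, you can use eval() to convert it into an array. Note the small trick with encode('unicode_escape'), which works around the fact that the \n\r in the body is not escaped: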

>>> from pyspark.sql import functions as F
>>> from pyspark.sql.types import ArrayType, StringType, StructField, StructType
>>>
>>> return_type = StructType([
...     StructField("to", StringType(), False),
...     StructField("body", StringType(), False),
...     StructField("from", StringType(), False),
... ])
>>>
>>> email_split = F.udf(lambda s: eval(s.encode('unicode_escape')), return_type)
>>> one_value = "['[email protected]', '[* the cat in the hat is fat\r\n* synnopsis -- hey the cat who is wearing the hat is getting to big for its outfit as evidenced by this photo [img123123.png] \r\n* please get a new cat, or a new hat I would suggest something like this [newhatforcat.jpg]\r\nsigned\r\nbob <bob@somecompany.com>]', '[email protected]']"
>>>
>>> df = spark.createDataFrame(data=[(one_value,),], schema='emailobj: string')
>>> df = df.withColumn('emailobj_split', email_split(df.emailobj))
>>> df.printSchema()
root
 |-- emailobj: string (nullable = true)
 |-- emailobj_split: struct (nullable = true)
 |    |-- to: string (nullable = false)
 |    |-- body: string (nullable = false)
 |    |-- from: string (nullable = false)

>>> df2 = df.select('emailobj_split.*')
>>> df2.show()
+----------+--------------------+------------+
|        to|                body|        from|
+----------+--------------------+------------+
|[email protected]|[* the cat in the...|[email protected]|
+----------+--------------------+------------+

>>> df2.show(truncate=False)
+----------+-------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------------------------------
---------------+------------+
|to        |body

               |from        |
+----------+-------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------------------------------
---------------+------------+
|[email protected]|[* the cat in the hat is fat\r\n* synnopsis -- hey the cat who is wearing the hat is getting to big for its outfit as evidenced by th
is photo [img123123.png] \r\n* please get a new cat, or a new hat I would suggest something like this [newhatforcat.jpg]\r\nsigned\r\nbob <bob@so
mecompany.com>]|[email protected]|
+----------+-------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------------------------------
---------------+------------+
>>>
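
Since one of the fields is user input, a possible variation is to swap eval() for ast.literal_eval(), which only accepts Python literals. This is a sketch rather than part of the answer above: the helper name parse_email_list and the example.com addresses are made up for illustration, and it still assumes the value is a well-formed list of quoted strings.

```python
import ast

def parse_email_list(raw):
    """Parse a string such as "['a', '[body]', 'b']" into a Python list."""
    # Re-escape raw control characters (for example a real CR/LF) so that
    # they are legal inside the quoted string literals.
    escaped = raw.encode("unicode_escape").decode("ascii")
    # literal_eval accepts only Python literals, so anything else in the
    # user-supplied text raises an error instead of being executed.
    return ast.literal_eval(escaped)

# Hypothetical sample with real carriage returns and newlines in the body.
sample = "['to@example.com', '[line one\r\nsee [img123123.png]\r\nbob]', 'from@example.com']"
to_addr, body, from_addr = parse_email_list(sample)
print(repr(body))
```

In the transcript above, the function could presumably be passed to the UDF in place of the lambda, as in F.udf(parse_email_list, return_type).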