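I'm trying to use a regular expression (regex) to extract the last character from a PySpark DataFrame so that I can do some data cleaning and parse the result into columns.
Right now I'm using a UDF (user-defined function), and one of the fields will be user input. Specifically, I'm looking for a way to capture the last closing square bracket ]. I've tried a few different approaches but keep getting stuck, so I'd appreciate any help you can offer. Thanks in advance!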
* the cat in the hat is fat\r\n
* synnopsis -- hey the cat who is wearing the hat is getting to big for its outfit as evidenced by this photo [img123123.png] \r\n
* please get a new cat, or a new hat I would suggest something like this [newhatforcat.jpg]
bob <[email protected]>
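These body texts are put into a Python list, and I'm running a PySpark function to split on the last ]. If someone uses a ] character, or Outlook decides to put [imgblala.jpg] into the original text, all of my columns get overwritten, which is a problem.
I tried plugging in a regex that works on regex101 to capture the last ], which should be the ] that closes the Python list, but I get undefined. Otherwise, if I just split on ], the split happens at img123123.jpg.
When putting a regex into split, is there some specific character I need to use before the delimiter? Here is an example of what I'm trying to do.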
from pyspark.sql.functions import split
one_value = "['[email protected]', '[* the cat in the hat is fat\r\n* synnopsis -- hey the cat who is wearing the hat is getting to big for its outfit as evidenced by this photo [img123123.png] \r\n* please get a new cat, or a new hat I would suggest something like this [newhatforcat.jpg]\r\nsigned\r\nbob <[email protected]>]', '[email protected]']"
df = spark.createDataFrame(data=[(one_value,),], schema='emailobj: string')
df = df.withColumn("to", split(df["emailobj"], "]").getItem(0)) \
.withColumn("from", split(df["emailobj"], "]").getItem(2)) \
.withColumn("body", split(df["emailobj"], r"[^]]*]$").getItem(3)) \
.withColumn("messageid", split(df["emailobj"], "]").getItem(4)) \
.withColumn("subject", split(df["emailobj"], "]").getItem(6))
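I either get an undefined result, or the split on the last ] doesn't split at all. The regex appears to work as a Java regex on regex101. Here is a link: (https://regex101.com/r/1ZbVG3/2). Is there something I need to do to get Spark to see it correctly?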
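One way to narrow this down is to try the pattern in plain Python before involving Spark. The sketch below uses the standard re module on a made-up string (sample is hypothetical, not the real email data). It shows that the pattern from the question only matches when the string ends in ], and that when it does match it consumes the whole tail rather than just the bracket. A lookahead pattern, by contrast, matches only the final ]. Spark's split takes a Java regex, where the same lookahead syntax should be valid, but that has not been verified here.

```python
import re

# Hypothetical sample: bracketed image names inside the body, a closing "]",
# and some text after it.
sample = "[body [img1.png] more [img2.jpg] signed bob] rest"

# Pattern from the question: non-"]" characters, then a "]" at the very end.
question_pattern = r"[^]]*]$"

# Alternative: a "]" that is followed only by non-"]" characters up to the end,
# i.e. the last "]" in the string. The lookahead keeps the tail out of the match.
last_bracket = r"\](?=[^\]]*$)"

# The string does not end in "]", so the question's pattern never matches
# and the split returns the input unchanged.
parts_question = re.split(question_pattern, sample)

# The lookahead version splits once, at the last "]".
parts_last = re.split(last_bracket, sample)

# Outside of regex, str.rsplit with maxsplit=1 gives the same two pieces.
parts_rsplit = sample.rsplit("]", 1)

print(parts_question)
print(parts_last)
print(parts_rsplit)
```

If the lookahead behaves the same way in Spark, the body would be item 0 of the split and everything after the last ] would be item 1.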
>>> from pyspark.sql import functions as F
>>> from pyspark.sql.types import ArrayType, StringType, StructField, StructType
>>> return_type = StructType([
... StructField("to", StringType(), False),
... StructField("body", StringType(), False),
... StructField("from", StringType(), False),
... ])
>>> email_split = F.udf(lambda s: eval(s.encode('unicode_escape')), return_type)
>>> one_value = "['[email protected]', '[* the cat in the hat is fat\r\n* synnopsis -- hey the cat who is wearing the hat is getting to big for its outfit as evidenced by this photo [img123123.png] \r\n* please get a new cat, or a new hat I would suggest something like this [newhatforcat.jpg]\r\nsigned\r\nbob <[email protected]>]', '[email protected]']"
>>> df = spark.createDataFrame(data=[(one_value,),], schema='emailobj: string')
>>> df = df.withColumn('emailobj_split', email_split(df.emailobj))
>>> df.printSchema()
|-- emailobj: string (nullable = true)
|-- emailobj_split: struct (nullable = true)
| |-- to: string (nullable = false)
| |-- body: string (nullable = false)
| |-- from: string (nullable = false)
>>> df2 = df.select('emailobj_split.*')
>>> df2.show()
| to| body| from|
|[email protected]|[* the cat in the...|[email protected]|
>>> df2.show(truncate=False)
|to |body |from |
|[email protected]|[* the cat in the hat is fat\r\n* synnopsis -- hey the cat who is wearing the hat is getting to big for its outfit as evidenced by this photo [img123123.png] \r\n* please get a new cat, or a new hat I would suggest something like this [newhatforcat.jpg]\r\nsigned\r\nbob <bob@somecompany.com>]|[email protected]|
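Since one of the fields is user input, calling eval on it can execute arbitrary code. A possible alternative, sketched below in plain Python, is ast.literal_eval, which only accepts Python literals. This assumes the string is always a well-formed list literal of exactly three strings. The name parse_email_obj and the example.com addresses are hypothetical and used only for illustration. The unicode_escape step mirrors the one in the UDF above, so that raw \r\n characters become valid inside the quoted literals.

```python
import ast


def parse_email_obj(s):
    """Parse a stringified list "[to, body, from]" without using eval.

    Hypothetical helper: assumes s is a Python list literal of three strings.
    """
    # Turn raw control characters (such as CR and LF) into backslash escapes,
    # because a single-quoted literal cannot contain a raw newline.
    escaped = s.encode("unicode_escape").decode("ascii")
    # literal_eval accepts only literals, so arbitrary code is rejected.
    to, body, sender = ast.literal_eval(escaped)
    return to, body, sender


# Made-up input with real CR/LF characters and brackets inside the body.
sample = "['to@example.com', '[line one\r\nline two [img.png]]', 'from@example.com']"
result = parse_email_obj(sample)
print(result)
```

The same function could probably be passed to F.udf in place of the lambda, though that has not been tested here. A body containing an unescaped apostrophe would still fail to parse, so this is not a complete solution.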