PySpark < 3.3: extract all regex matches and replace with values from another DataFrame


Using PySpark 3.3 (which has no regexp_extract_all), I want to take a column of values

+-------------------------+
|          value          |
+-------------------------+
|  MONOCYTES 1511|A5905.5 |
+-------------------------+

The data format is not fixed; i.e., the value could also be

1511;MONO->A5905.5

I want to extract every part that matches the regex

r'\w?\d+\.?\d*'

and then replace each extracted part with the corresponding value from another DataFrame:

+-----------+--------------+
|    code   |     value    |
+-----------+--------------+
|    1511   |  monocytes1  |
+-----------+--------------+
|  A5905.5  |  monocytes2  |
+-----------+--------------+

so that I can somehow arrive at the mapping

{"MONOCYTES 1511|A5905.5": ["monocytes1", "monocytes2"]}

Given the version constraint, what is the fastest way to do this?

Thanks :)
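For reference, a quick check with plain Python's re module (outside Spark) shows the pattern pulls the same two tokens out of both sample formats:

```python
import re

PATTERN = r"\w?\d+\.?\d*"  # the extraction regex from the question

for sample in ("MONOCYTES 1511|A5905.5", "1511;MONO->A5905.5"):
    print(re.findall(PATTERN, sample))
    # → ['1511', 'A5905.5'] for both samples
```

One caveat: because the trailing \.?\d* is fully optional, a value like "1511." would be captured with its trailing dot; tightening the tail to (?:\.\d+)? avoids that.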

python apache-spark pyspark
1 Answer

You can use a UDF:

import re

import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.appName("ReplaceValues").getOrCreate()
df = spark.createDataFrame([("MONOCYTES 1511|A5905.5",)], ["value"])

# Lookup table: code -> replacement value
codes = [("1511", "monocytes1"), ("A5905.5", "monocytes2")]
code_df = spark.createDataFrame(codes, ["code", "value"])

# Collect the (small) lookup table to the driver as a dict; it is
# shipped to the executors inside the UDF closure.
D = {row["code"]: row["value"] for row in code_df.collect()}


def _find(s):
    # Slightly tightened version of the question's regex:
    # (?:\.\d+)? avoids capturing a trailing dot with no digits after it.
    return re.findall(r"\w?\d+(?:\.\d+)?", s)


def _replace(parts):
    # Map each extracted token through the lookup dict,
    # keeping tokens that have no mapping.
    return [D.get(part, part) for part in parts]


find_udf = F.udf(_find, ArrayType(StringType()))
replace_udf = F.udf(_replace, ArrayType(StringType()))

df = df.withColumn("extracted", find_udf(F.col("value")))
df = df.withColumn("final_values", replace_udf(F.col("extracted")))
df.select("value", "final_values").show(truncate=False)
