Using pyspark 3.3 (no regexp_extract_all), I want to take a column of values
+-------------------------+
| value |
+-------------------------+
| MONOCYTES 1511|A5905.5 |
+-------------------------+
The data format is not fixed, i.e. the value could also be
1511;MONO->A5905.5
and extract all parts that match the regex
r'\w?\d+\.?\d*'
Then I want to replace each extracted value with the corresponding value from another dataframe:
+-----------+--------------+
| code | value |
+-----------+--------------+
| 1511 | monocytes1 |
+-----------+--------------+
| A5905.5 | monocytes2 |
+-----------+--------------+
so that I end up with some kind of mapping such as
{"MONOCYTES 1511|A5905.5": ["monocytes1", "monocytes2"]}
Given the version constraint, what is the fastest way to do this?
Thanks :)
You can use a UDF:
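import re

import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType

def _find(s):
    # Return every code-like token, e.g. "1511" or "A5905.5"; pass nulls through.
    return re.findall(r'\w?\d+(?:\.\d+)?', s) if s is not None else None

def _replace(parts):
    # Swap each extracted code for its mapped value; unmapped codes are kept as-is.
    return [D.get(part, part) for part in parts] if parts is not None else None

spark = SparkSession.builder.appName("ReplaceValues").getOrCreate()
df = spark.createDataFrame([("MONOCYTES 1511|A5905.5",)], ["value"])

# Lookup dataframe, collected to the driver as a plain dict (assumes it is small).
TP = [("1511", "monocytes1"), ("A5905.5", "monocytes2")]
CDF = spark.createDataFrame(TP, ["code", "value"])
D = {row['code']: row['value'] for row in CDF.collect()}

# Step 1: extract all matching codes into an array column.
UDF = F.udf(_find, ArrayType(StringType()))
df = df.withColumn("udf vals", UDF(F.col("value")))

# Step 2: replace each extracted code via the lookup dict.
REP = F.udf(_replace, ArrayType(StringType()))
df = df.withColumn("final values", REP(F.col("udf vals")))
df.select("value", "final values").show(truncate=False)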
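
The extraction and replacement logic is plain Python, so it can be checked without a Spark session. The following is only a sketch under assumptions: `lookup` is a hypothetical stand-in for the dictionary collected from the code/value dataframe, and `find_codes` / `replace_codes` are illustrative names that mirror the two UDF bodies. It also shows one difference between the regex in the question and the one in the answer: the question's pattern keeps a trailing dot, the answer's does not.

```python
import re

QUESTION_RE = re.compile(r'\w?\d+\.?\d*')    # pattern from the question
ANSWER_RE = re.compile(r'\w?\d+(?:\.\d+)?')  # pattern used in the UDF

# Hypothetical stand-in for the dict built from the code/value dataframe.
lookup = {"1511": "monocytes1", "A5905.5": "monocytes2"}

def find_codes(s, pattern=ANSWER_RE):
    # All non-overlapping matches, in order of appearance.
    return pattern.findall(s)

def replace_codes(parts, mapping):
    # Map each code to its value; leave unknown codes unchanged.
    return [mapping.get(p, p) for p in parts]

# Both input formats mentioned in the question.
samples = ["MONOCYTES 1511|A5905.5", "1511;MONO->A5905.5"]
result = {s: replace_codes(find_codes(s), lookup) for s in samples}
print(result)

# A code followed by a literal dot: the two patterns disagree here.
trailing = "1511. A5905.5"
print(find_codes(trailing, QUESTION_RE))
print(find_codes(trailing, ANSWER_RE))
```

With the question's pattern the token "1511." would miss the lookup key "1511", which is one reason to prefer the stricter `(?:\.\d+)?` form if trailing punctuation can occur in the data.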