Adding double quotes at a specific position with a regular expression in PySpark (Databricks)

Question

I have the DataFrame below, with a single column holding these values:

abc,1,2,345,765,876,Kumar r,Raghvan ,04041996

abc,1,2,345,765,876,"sam Bailey,20541789 #here double quote already present after 6th comma

abc,1011,2,32,678,,,,,

I am looking for a regular expression in PySpark that wraps the text between the 6th comma and the trailing number in double quotes.

The expected output for the values above is:

abc,1,2,345,765,876,"Kumar r,Raghvan" ,04041996 

abc,1,2,345,765,876,"sam Bailey",20541789 

abc,1011,2,32,678,,,,,

I tried the code below, but it does not produce the expected result:

# Use a regular expression to add quotes around the 6th column if they are not already present

df_with_quotes = df.withColumn("data_with_quotes",regexp_replace(col("data"), r"((?:[^,],){6})([^"].[^"$])(,[^,]+$)", r'\1"\2"\3'))

Any code snippet would be appreciated.

pyspark databricks
1 Answer

You can create a UDF to achieve the desired result:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    ("abc,1,2,345,765,876,Kumar r,Raghvan ,04041996",),
    ("abc,1,2,345,765,876,\"sam Bailey,20541789",),
    ("abc,1011,2,32,678,,,,,",),
]
df = spark.createDataFrame(data, ["value"])

def add_quotes(value):
    # Spark passes nulls to a Python UDF as None; return them unchanged
    if value is None:
        return None
    parts = value.split(',')
    # Walk forward from the part after the sixth comma and stop at the first
    # part that is empty or starts with a digit; parts[6:end] get quoted
    end = 6
    while end < len(parts) and parts[end] and not parts[end][0].isdigit():
        end += 1
    if end > 6:
        # Trim whitespace, drop any quote that is already there, then re-quote
        combined = ','.join(part.strip() for part in parts[6:end]).strip('"')
        parts[6:end] = [f'"{combined}"']
    return ','.join(parts)

add_quotes_udf = F.udf(add_quotes)
df_with_quotes = df.withColumn("updated_value", add_quotes_udf(F.col("value")))
df_with_quotes.select("value", "updated_value").show(truncate=False)

Output:

+---------------------------------------------+----------------------------------------------+
|value                                        |updated_value                                 |
+---------------------------------------------+----------------------------------------------+
|abc,1,2,345,765,876,Kumar r,Raghvan ,04041996|abc,1,2,345,765,876,"Kumar r,Raghvan",04041996|
|abc,1,2,345,765,876,"sam Bailey,20541789     |abc,1,2,345,765,876,"sam Bailey",20541789     |
|abc,1011,2,32,678,,,,,                       |abc,1011,2,32,678,,,,,                        |
+---------------------------------------------+----------------------------------------------+
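Since the question asks for a regular expression, here is a minimal sketch of a pattern that should give the same result for the three sample rows. It is shown with Python's re module so it can be run without a cluster. The pattern, the PATTERN constant and the add_quotes_re helper are my own assumptions, not something from the question or from any library, and the pattern assumes the quoted text starts with a non-digit and is followed by exactly one all-digit field at the end of the line.

```python
import re

# Hypothetical pattern (an assumption, not taken from the question):
#   group 1: the first six comma-terminated fields
#   group 2: the text to quote; starts with a non-digit and may contain commas
#   group 3: the trailing all-digit field
# An existing quote around group 2 and whitespace before the last comma are dropped.
PATTERN = r'^((?:[^,]*,){6})"?([^",\d][^"]*?)"?\s*,(\d+)$'


def add_quotes_re(value):
    # Rows that do not match the pattern are returned unchanged
    return re.sub(PATTERN, r'\1"\2",\3', value)


rows = [
    'abc,1,2,345,765,876,Kumar r,Raghvan ,04041996',
    'abc,1,2,345,765,876,"sam Bailey,20541789',
    'abc,1011,2,32,678,,,,,',
]
for row in rows:
    print(add_quotes_re(row))
```

In PySpark the same pattern should carry over to F.regexp_replace(F.col("value"), PATTERN, '$1"$2",$3'), which avoids the UDF. I have not run that on Databricks, so treat it as a starting point. Note that Spark uses Java regular expressions, where group references in the replacement are written $1 rather than \1.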