I have the following dataframe with only a single column holding the values:
abc,1,2,345,765,876,Kumar r,Raghvan ,04041996
abc,1,2,345,765,876,"sam Bailey,20541789  # here a double quote is already present after the 6th comma
abc,1011,2,32,678,,,,,
I am looking for a regex in PySpark that adds quotes after the 6th comma and before the trailing number.
The expected output for the values above is:
abc,1,2,345,765,876,"Kumar r,Raghvan" ,04041996
abc,1,2,345,765,876,"sam Bailey",20541789
abc,1011,2,32,678,,,,,
I have tried the code below, but it does not produce the expected result:
df_with_quotes = df.withColumn("data_with_quotes", regexp_replace(col("data"), r'((?:[^,],){6})([^"].[^"$])(,[^,]+$)', r'\1"\2"\3'))
Any code snippet would be appreciated.
You can use a UDF to achieve the desired result:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
data = [
("abc,1,2,345,765,876,Kumar r,Raghvan ,04041996",),
("abc,1,2,345,765,876,\"sam Bailey,20541789",),
("abc,1011,2,32,678,,,,,",),
]
df = spark.createDataFrame(data, ["value"])
def add_quotes(value):
    parts = value.split(',')
    if len(parts) > 6:
        # fetch the parts that need to be quoted (parts from the sixth comma until a part with digits is reached)
        quoted_parts = [part.strip() for part in parts[6:] if part and not part[0].isdigit()]
        combined = ','.join(quoted_parts)
        # add quotes, tolerating a quote that is already present
        if combined:
            combined = f'"{combined}"' if not combined.startswith('"') else combined
            combined += '"' if not combined.endswith('"') else ''
            parts[6:6 + len(quoted_parts)] = [combined]
    return ','.join(parts)
add_quotes_udf = F.udf(add_quotes)
df_with_quotes = df.withColumn("updated_value", add_quotes_udf(F.col("value")))
df_with_quotes.select("value", "updated_value").show(truncate=False)
Output:
+---------------------------------------------+----------------------------------------------+
|value |updated_value |
+---------------------------------------------+----------------------------------------------+
|abc,1,2,345,765,876,Kumar r,Raghvan ,04041996|abc,1,2,345,765,876,"Kumar r,Raghvan",04041996|
|abc,1,2,345,765,876,"sam Bailey,20541789 |abc,1,2,345,765,876,"sam Bailey",20541789 |
|abc,1011,2,32,678,,,,, |abc,1011,2,32,678,,,,, |
+---------------------------------------------+----------------------------------------------+
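Since the question asked for a regex rather than a UDF, here is a possible single-pattern alternative. It is only a sketch validated against the three sample rows (the helper name quote_name_fields is mine, not from the question): it quotes everything between the 6th comma and the trailing numeric field, tolerates an already-present opening quote, and leaves rows with only empty trailing fields untouched. In PySpark you would apply the same pattern with F.regexp_replace and the Java-style replacement '$1"$2",$3'.

```python
import re

# After the first 6 comma-separated fields: optionally skip an existing
# opening quote, capture the name part (must not start with a comma or
# quote), then require a comma followed by a purely numeric final field.
pattern = r'^((?:[^,]*,){6})"?([^,"][^"]*?)"?\s*,(\d+)$'

def quote_name_fields(line: str) -> str:
    # \1 = first six fields, \2 = name part, \3 = trailing number.
    # Rows that do not match (e.g. only empty trailing fields) pass through.
    return re.sub(pattern, r'\1"\2",\3', line)

for line in [
    'abc,1,2,345,765,876,Kumar r,Raghvan ,04041996',
    'abc,1,2,345,765,876,"sam Bailey,20541789',
    'abc,1011,2,32,678,,,,,',
]:
    print(quote_name_fields(line))
```

In Spark the equivalent would be df.withColumn("updated_value", F.regexp_replace("value", pattern, '$1"$2",$3')), since Spark's regexp_replace uses Java's $-style backreferences instead of Python's \1 style.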