I have two files: orders_renamed.csv and customers.csv. I join them with a full outer join and then drop the duplicate column (customer_id). I want to replace the null values in the "order_id" column with -1.
I tried this:
from pyspark.sql.functions import regexp_extract, monotonically_increasing_id, unix_timestamp, from_unixtime, coalesce
from pyspark.sql.types import IntegerType, StructField, StructType, StringType
ordersDf = spark.read.format("csv").option("header", True).option("inferSchema", True).option("path", "C:/Users/Lenovo/Desktop/week12/week 12 dataset/orders_renamed.csv").load()
customersDf = spark.read.format("csv").option("header", True).option("inferSchema", True).option("path", "C:/Users/Lenovo/Desktop/week12/week 12 dataset/customers.csv").load()
joinCondition1 = ordersDf.customer_id == customersDf.customer_id
joinType1 = "outer"
joinenullreplace = ordersDf.join(customersDf, joinCondition1, joinType1) \
    .drop(ordersDf.customer_id) \
    .select("order_id", "customer_id", "customer_fname") \
    .sort("order_id") \
    .withColumn("order_id", coalesce("order_id", -1))
joinenullreplace.show(50)
On the last line I used coalesce, but it gives me an error. I have tried several approaches, such as wrapping the coalesce in an expression and applying expr, but it does not work. I have also used lit, but that did not help either. Please reply with a solution.
from pyspark.sql.functions import lit
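For reference, here is a minimal sketch of the variant I would expect to work, assuming the error comes from passing the bare integer -1: coalesce only accepts column names or Column objects, so the literal has to be wrapped in lit().

from pyspark.sql.functions import coalesce, col, lit

# coalesce(*cols) takes column names or Column objects; a raw Python int
# is neither, so -1 must be turned into a literal Column with lit(-1)
joinenullreplace = ordersDf.join(customersDf, joinCondition1, joinType1) \
    .drop(ordersDf.customer_id) \
    .select("order_id", "customer_id", "customer_fname") \
    .sort("order_id") \
    .withColumn("order_id", coalesce(col("order_id"), lit(-1)))

An alternative, if this is indeed the problem, would be to skip coalesce entirely and use DataFrame.fillna, e.g. .fillna(-1, subset=["order_id"]), which replaces nulls in the named column with the given value.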