Reading a CSV with the correct data types in PySpark

Question

When I try to import a local CSV with Spark, every column is read in as a string by default. However, my columns only contain integers and a timestamp type. More specifically, the CSV looks like this:

"Customer","TransDate","Quantity","PurchAmount","Cost","TransID","TransKey"
149332,"15.11.2005",1,199.95,107,127998739,100000

I found code that should work in this question, but when I execute it, all the entries come back as NULL.

I create a custom schema with the following:

from pyspark.sql.types import LongType, StringType, StructField, StructType, BooleanType, ArrayType, IntegerType, TimestampType

customSchema = StructType([
        StructField("Customer", IntegerType(), True),
        StructField("TransDate", TimestampType(), True),
        StructField("Quantity", IntegerType(), True),
        StructField("Cost", IntegerType(), True),
        StructField("TransKey", IntegerType(), True)])

Then I read in the CSV:

myData = spark.read.load('myData.csv', format="csv", header="true", sep=',', schema=customSchema)

This returns:

+--------+---------+--------+----+--------+
|Customer|TransDate|Quantity|Cost|Transkey|
+--------+---------+--------+----+--------+
|    null|     null|    null|null|    null|
+--------+---------+--------+----+--------+

Am I missing a crucial step? I suspect the date column is the root of the problem. Note: I am running this in Google Colab.

csv pyspark apache-spark-sql
3 Answers

4 votes

Here you go!

"Customer","TransDate","Quantity","PurchAmount","Cost","TransID","TransKey"
149332,"15.11.2005",1,199.95,107,127998739,100000
PATH_TO_FILE="file:///u/vikrant/LocalTestDateFile"
Loading above file to dataframe:
df = spark.read.format("com.databricks.spark.csv") \
  .option("mode", "DROPMALFORMED") \
  .option("header", "true") \
  .option("inferschema", "true") \
  .option("delimiter", ",").load(PATH_TO_FILE)

Your date will be loaded as a string column type, but as soon as you change it to a date type, Spark treats this date format as NULL:

df = df.withColumn('TransDate', col('TransDate').cast('date'))

+--------+---------+--------+-----------+----+---------+--------+
|Customer|TransDate|Quantity|PurchAmount|Cost|  TransID|TransKey|
+--------+---------+--------+-----------+----+---------+--------+
|  149332|     null|       1|     199.95| 107|127998739|  100000|
+--------+---------+--------+-----------+----+---------+--------+

So we need to change the date format from dd.MM.yyyy to yyyy-MM-dd.

from datetime import datetime
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DateType

A Python function to change the date format:

change_dateformat_func = udf(lambda x: datetime.strptime(x, '%d.%m.%Y').strftime('%Y-%m-%d'))

Now call this function for your dataframe column:

newdf = df.withColumn('TransDate', change_dateformat_func(col('TransDate')).cast(DateType()))

+--------+----------+--------+-----------+----+---------+--------+
|Customer| TransDate|Quantity|PurchAmount|Cost|  TransID|TransKey|
+--------+----------+--------+-----------+----+---------+--------+
|  149332|2005-11-15|       1|     199.95| 107|127998739|  100000|
+--------+----------+--------+-----------+----+---------+--------+

And below is the schema:

 |-- Customer: integer (nullable = true)
 |-- TransDate: date (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- PurchAmount: double (nullable = true)
 |-- Cost: integer (nullable = true)
 |-- TransID: integer (nullable = true)
 |-- TransKey: integer (nullable = true)

Let me know if it works for you.
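
The same conversion can also be done without a Python UDF, using Spark's built-in to_date with an explicit format string (a minimal sketch, assuming the same dataframe and column name as above):

from pyspark.sql.functions import col, to_date

# to_date parses 'dd.MM.yyyy' strings such as '15.11.2005' directly,
# so no round trip through a Python UDF is needed
newdf = df.withColumn('TransDate', to_date(col('TransDate'), 'dd.MM.yyyy'))
newdf.printSchema()   # TransDate: date (nullable = true)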


2 votes

You can specify the option ('dateFormat', 'd.M.y') on the DataFrameReader to parse dates in a specific format.

df = spark.read.format("csv").option("header","true").option("dateFormat","d.M.y").schema(my_schema).load("path_to_csv")
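
The snippet above leaves my_schema undefined; a minimal sketch of what it could look like for this CSV is below (the column types are assumptions based on the sample row; if TransDate were declared as TimestampType instead of DateType, the matching reader option would be timestampFormat rather than dateFormat):

from pyspark.sql.types import StructType, StructField, IntegerType, DateType, DoubleType, LongType

# assumed types for the sample row; adjust to your data
my_schema = StructType([
    StructField("Customer", IntegerType(), True),
    StructField("TransDate", DateType(), True),    # parsed via the dateFormat option
    StructField("Quantity", IntegerType(), True),
    StructField("PurchAmount", DoubleType(), True),
    StructField("Cost", IntegerType(), True),
    StructField("TransID", LongType(), True),
    StructField("TransKey", IntegerType(), True)])

df = (spark.read.format("csv")
      .option("header", "true")
      .option("dateFormat", "d.M.y")
      .schema(my_schema)
      .load("myData.csv"))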

Reference


0 votes

I got this because the same column name was repeated in my csv file.

I wasted a day on this because my csv has 220 columns.

$ cat x.csv
aa,bb,cc,aa,dd
0,1,2,3,4
5,6,7,8,9

$
>>> spark.read.csv('./x.csv', header=True, inferSchema=True).show()
24/10/30 17:10:16 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: aa, bb, cc, aa, dd
 Schema: aa0, bb, cc, aa3, dd
Expected: aa0 but found: aa
CSV file: file:///c:/My/workspaces/dlh-databricks/fxc/src/fxc_dbk/x.csv
+---+---+---+---+---+
|aa0| bb| cc|aa3| dd|
+---+---+---+---+---+
|  0|  1|  2|  3|  4|
|  5|  6|  7|  8|  9|
+---+---+---+---+---+

>>>
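
With 220 columns a duplicate header is easy to miss by eye, so a quick pre-check in plain Python can flag it before reading with Spark (a minimal sketch, assuming the file is accessible locally):

import csv

with open('x.csv', newline='') as f:
    header = next(csv.reader(f))

# Spark renames duplicated header names when inferring the schema
# (aa -> aa0 and aa3 in the session above), so list them up front
dupes = sorted({name for name in header if header.count(name) > 1})
print(dupes)   # ['aa'] for the sample file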