PySpark: How to specify a column that uses a comma as the decimal separator

Question

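I am working with PySpark and loading a CSV file. I have a column of numbers in European format, meaning a comma is used as the decimal separator and a dot as the thousands separator, the reverse of the usual convention.

For example: I have 2.416,67 instead of 2,416.67.

My data in the .csv file looks like this: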
ID;    Revenue
21;    2.645,45
23;   31.147,05
.
.
55;    1.009,11

In pandas, such a file can easily be read in European format by specifying decimal=',' and thousands='.' in pd.read_csv().

pandas code:

import pandas as pd
df=pd.read_csv("filepath/revenues.csv",sep=';',decimal=',',thousands='.')

I don't know how this can be done in PySpark.

PySpark code:

from pyspark.sql.types import StructType, StructField, FloatType, StringType
schema = StructType([
            StructField("ID", StringType(), True),
            StructField("Revenue", FloatType(), True)
                    ])
df=spark.read.csv("filepath/revenues.csv",sep=';',encoding='UTF-8', schema=schema, header=True)

Can anyone suggest how to load such a file in PySpark using the .csv() function shown above?

Answer 1

You won't be able to read it as a float because of the format of the data. You need to read it as a string, clean it up, and then cast it to float:

from pyspark.sql.functions import regexp_replace
from pyspark.sql.types import FloatType

# The reader option is "header" (singular); "headers" would be silently ignored.
df = spark.read.option("header", "true").option("inferSchema", "true").csv("my_csv.csv", sep=";")
df = df.withColumn('revenue', regexp_replace('revenue', '\\.', ''))
df = df.withColumn('revenue', regexp_replace('revenue', ',', '.'))
df = df.withColumn('revenue', df['revenue'].cast("float"))

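You can probably just chain these all together too:

from pyspark.sql.functions import col, regexp_replace

# The reader option is "header" (singular); "headers" would be silently ignored.
df = spark.read.option("header", "true").option("inferSchema", "true").csv("my_csv.csv", sep=";")
df = (
         df
         .withColumn('revenue', regexp_replace('revenue', '\\.', ''))
         .withColumn('revenue', regexp_replace('revenue', ',', '.'))
         # col() refers to the cleaned column; df['revenue'] would point at the original one
         .withColumn('revenue', col('revenue').cast("float"))
     )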

Note that I haven't tested this, so there may be a typo or two in there.
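To check what the two replacements do before running them on a cluster, the same string cleanup can be tried on plain Python strings. This is only a sketch: `european_to_float` is a hypothetical helper name, not part of PySpark, and the sample values are the rows from the question. It also strips whitespace, on the assumption that the leading spaces shown in the sample file are really present in the data.

```python
import re

def european_to_float(text):
    """Convert a European-formatted number string such as '2.645,45' to a float.

    Hypothetical helper for illustration; it mirrors the two regexp_replace
    steps used in the answer above.
    """
    cleaned = text.strip()                # assumption: values may carry padding spaces
    cleaned = re.sub(r"\.", "", cleaned)  # drop the '.' thousands separators
    cleaned = cleaned.replace(",", ".")   # turn the decimal comma into a dot
    return float(cleaned)

# Sample values taken from the question's CSV excerpt
samples = ["    2.645,45", "   31.147,05", "    1.009,11"]
parsed = [european_to_float(s) for s in samples]
print(parsed)
```

The order of the steps matters: the dots have to be removed before the comma is turned into a dot, otherwise the new decimal point would be stripped as well.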

Answer 2

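Make sure your SQL table is set up to accept NUMERIC values rather than INTEGER. I had a lot of trouble trying to work out everything about encodings and the different formats of dots, commas and so on, and in the end the problem was more basic: the table was defined to accept only integers, so no decimals would ever be accepted, whether written with commas or dots. I then had to change my SQL table to accept real numbers (NUMERIC) instead.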
