How to read xlsx or xls files as a Spark DataFrame

Question

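Can anyone tell me how to read xlsx or xls files as a Spark DataFrame without converting them first?

I have already tried reading the file with pandas and then converting the result to a Spark DataFrame, but I get the following error.

Error:

Cannot merge type <class 'pyspark.sql.types.DoubleType'> and <class 'pyspark.sql.types.StringType'>

Code: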

import pandas
import os
df = pandas.read_excel('/dbfs/FileStore/tables/BSE.xlsx', sheet_name='Sheet1',inferSchema='')
sdf = spark.createDataFrame(df)

Answer 1

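Based on the answers by @matkurek and @Peter Pan, I will try to give a general, updated version as of April 2021.

Spark

You should install the following 2 libraries on your Databricks cluster:

Clusters -> select your cluster -> Libraries -> Install New -> Maven -> in Coordinates: com.crealytics:spark-excel_2.12:0.13.5
Clusters -> select your cluster -> Libraries -> Install New -> PyPI -> in Package: xlrd

Then you will be able to read your Excel file as follows: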

sparkDF = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "'NameOfYourExcelSheet'!A1") \
    .load(filePath)
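The `dataAddress` option above combines a quoted sheet name with a start cell. As a minimal sketch, the string can be assembled by a small helper; `excel_data_address` and its parameters are hypothetical names introduced here for illustration, not part of spark-excel, and the helper assumes the sheet name itself contains no single quotes.

```python
from typing import Optional


def excel_data_address(sheet_name: str, start_cell: str = "A1",
                       end_cell: Optional[str] = None) -> str:
    """Build a dataAddress string such as 'Sheet1'!A1 or 'Sheet1'!B3:C35.

    Hypothetical helper: it only formats the string and does not check
    that the sheet or the cells exist in the workbook.
    """
    if "'" in sheet_name:
        # Assumption: quote escaping is out of scope for this sketch.
        raise ValueError("sheet names containing single quotes are not handled")
    address = f"'{sheet_name}'!{start_cell}"
    if end_cell is not None:
        # An end cell turns the address into a bounded cell range.
        address += f":{end_cell}"
    return address


print(excel_data_address("NameOfYourExcelSheet"))
print(excel_data_address("My Sheet", "B3", "C35"))
```

The result can be passed directly to `.option("dataAddress", ...)` in the reader call shown above.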

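Pandas

You should install the following 2 libraries on your Databricks cluster:

Clusters -> select your cluster -> Libraries -> Install New -> PyPI -> in Package: xlrd
Clusters -> select your cluster -> Libraries -> Install New -> PyPI -> in Package: openpyxl

Then you will be able to read your Excel file as follows:

import pandas as pd
pandasDF = pd.read_excel(io=filePath, engine='openpyxl', sheet_name='NameOfYourExcelSheet')

Note that you end up with two different kinds of object: a Spark DataFrame in the first scenario and a pandas DataFrame in the second.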

Answer 2

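As @matkurek mentioned, you can read the file directly from Excel. In fact, this is better practice than going through pandas, because with pandas you lose the benefits of Spark.

You can run the same code sample as above; you only need to add the required package to the configuration of your SparkSession.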

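from pyspark.sql import SparkSession

# spark.jars.packages downloads the spark-excel package when the session starts
spark = SparkSession.builder \
    .master("local") \
    .appName("Word Count") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.11:0.12.2") \
    .getOrCreate()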
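The `_2.11` and `_2.12` suffixes in the coordinates quoted in this thread are the Scala binary version the package was built for, and that suffix has to match the Scala version of your Spark build. A minimal sketch of assembling the coordinate follows; `spark_excel_coordinate` is a hypothetical helper name, and the version numbers are simply the ones that appear in the answers here.

```python
def spark_excel_coordinate(scala_binary_version: str, library_version: str) -> str:
    """Return a Maven coordinate for spark-excel.

    Hypothetical helper: it only formats the string; it does not verify
    that this combination is actually published on Maven Central.
    """
    return f"com.crealytics:spark-excel_{scala_binary_version}:{library_version}"


# Coordinate used with Scala 2.11 in the SparkSession config above
print(spark_excel_coordinate("2.11", "0.12.2"))
# Coordinate used elsewhere in this thread for Scala 2.12 clusters
print(spark_excel_coordinate("2.12", "0.13.5"))
```

The returned string is what goes into `spark.jars.packages` or into the Maven coordinates field of the cluster library dialog.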

Then you can read your Excel file.

df = spark.read.format("com.crealytics.spark.excel") \
    .option("useHeader", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "'NameOfYourExcelSheet'!A1") \
    .load("your_file")

Answer 3

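Your post does not show your Excel data, but I was able to reproduce the same issue.

Here is the data in my sample Excel file, test.xlsx.

You can see that column B contains different data types: a double value, 2.2, and a string value.

So if I run the code below,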

import pandas

df = pandas.read_excel('test.xlsx', sheet_name='Sheet1',inferSchema='')
sdf = spark.createDataFrame(df)

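it returns the same error as yours.

TypeError: field B: Can not merge type <class 'pyspark.sql.types.DoubleType'> and <class 'pyspark.sql.types.StringType'>

If we check the dtypes of the df columns with df.dtypes, we see that the dtype of column B is object. The spark.createDataFrame function cannot infer a single data type for column B from the actual data. To fix this, pass a schema that settles the data type of column B, as in the code below.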

from pyspark.sql.types import StructType, StructField, DoubleType, StringType
schema = StructType([StructField("A", DoubleType(), True), StructField("B", StringType(), True)])
sdf = spark.createDataFrame(df, schema=schema)

This forces column B to StringType, which resolves the data type conflict.
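When the column names are not known in advance, writing the schema by hand is awkward. As a rough sketch of an alternative under the same idea, the mixed-type columns can be detected on the pandas side and cast to strings before calling spark.createDataFrame. The helper names `mixed_type_columns` and `cast_mixed_to_str` are hypothetical, and the sample frame only assumes a column B that holds a double and a string, as in the example above.

```python
import pandas as pd


def mixed_type_columns(df: pd.DataFrame) -> list:
    """Names of object columns whose non-null values have more than one Python type."""
    mixed = []
    for name in df.columns:
        column = df[name]
        if column.dtype != object:
            continue  # numeric, datetime, etc. already have a single type
        value_types = {type(value) for value in column.dropna()}
        if len(value_types) > 1:
            mixed.append(name)
    return mixed


def cast_mixed_to_str(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which every mixed-type column holds strings (nulls are kept)."""
    result = df.copy()
    for name in mixed_type_columns(result):
        result[name] = result[name].map(
            lambda value: value if pd.isna(value) else str(value)
        )
    return result


# Illustrative data: column B mixes a double and a string
sample = pd.DataFrame({"A": [1.0, 2.0], "B": [2.2, "C"]})
print(mixed_type_columns(sample))
print(cast_mixed_to_str(sample)["B"].tolist())
```

The cleaned frame can then be handed to `spark.createDataFrame(cast_mixed_to_str(df))`, which no longer has to reconcile a double with a string in the same column.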

Answer 4

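Steps to read .xls / .xlsx files from Azure Blob Storage into a Spark DataFrame

You can read Excel files located in Azure Blob Storage into a PySpark DataFrame with the help of a library called spark-excel (also referred to as com.crealytics.spark.excel).

Install the library using either the UI or the Databricks CLI (cluster settings page > Libraries > Install New; make sure to select Maven).

Once the library is installed, you need proper credentials to access Azure Blob Storage. You can provide the access key under cluster settings page > Advanced Options > Spark config.

Example: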

spark.hadoop.fs.azure.account.key.<storage-account>.blob.core.windows.net <access key>

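Note: if you are the cluster owner, you can provide the access key as a secret instead of as plain text, as described in the docs.

Restart the cluster. You can then use the code below to read an Excel file located in Blob Storage.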

filePath = "wasbs://<container-name>@<storage-account>.blob.core.windows.net/MyFile1.xls"

DF = spark.read.format("excel").option("header", "true").option("inferSchema", "true").load(filePath)

display(DF)
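The configuration key and the `wasbs://` path in this answer follow a fixed pattern built from the storage account and container names. The sketch below assembles both; `account_key_conf` and `wasbs_url` are hypothetical helper names, and the account, container, scope, and key names in the comments are placeholders, not values from the original post.

```python
def account_key_conf(storage_account: str) -> str:
    """Session-level config key that holds the access key for one storage account.

    In the cluster's Spark config box the same key is written with a
    "spark.hadoop." prefix, as in the example above.
    """
    return f"fs.azure.account.key.{storage_account}.blob.core.windows.net"


def wasbs_url(container: str, storage_account: str, blob_path: str) -> str:
    """Build a wasbs:// URL for a blob inside a container."""
    return (
        f"wasbs://{container}@{storage_account}.blob.core.windows.net/"
        f"{blob_path.lstrip('/')}"
    )


print(account_key_conf("mystorageaccount"))
print(wasbs_url("mycontainer", "mystorageaccount", "MyFile1.xls"))

# In a Databricks notebook, where `spark` and `dbutils` are predefined,
# the key could be set per session from a secret (placeholder names):
# spark.conf.set(
#     account_key_conf("mystorageaccount"),
#     dbutils.secrets.get(scope="my-scope", key="my-storage-key"),
# )
```

Setting the key per session this way is an alternative to the cluster-level Spark config and keeps the access key out of plain text.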

PS: spark.read.format("excel") is the V2 approach, whereas spark.read.format("com.crealytics.spark.excel") is V1. You can read more here.

Answer 5

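You can read Excel files with Spark's read function. This requires a Spark plugin. To install it on Databricks, go to:

Clusters > your cluster > Libraries > Install New > select Maven and paste com.crealytics:spark-excel_2.12:0.13.5 into "Coordinates"

After that, you can read the file like this: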

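# Note: with spark-excel 0.13.x the option is named "header";
# older releases (for example 0.12.x) used "useHeader" instead.
df = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "'NameOfYourExcelSheet'!A1") \
    .load(filePath)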

Answer 6

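Just open the xlsx or xlsm file in pandas first, and then load the result into Spark:

import pandas as pd

df = pd.read_excel('file.xlsx', engine='openpyxl')

# spark_session is your existing SparkSession; casting every column to str
# gives Spark a single type per column
df = spark_session.createDataFrame(df.astype(str))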

Answer 7

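The configuration and code below work for me to read an Excel file into a PySpark DataFrame. Prerequisites before running the Python code:

Install the Maven library on your Databricks cluster.

Maven library name and version: com.crealytics:spark-excel_2.12:0.13.5

Databricks Runtime: 9.0 (includes Apache Spark 3.1.2, Scala 2.12)

Run the following code in a Python notebook to load the Excel file into a PySpark DataFrame: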

sheetAddress = "'<enter sheetname>'!A1"
filePath = "<enter excel file full path>"
df = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("dataAddress", sheetAddress) \
    .option("treatEmptyValuesAsNulls", "false") \
    .option("inferSchema", "true") \
    .load(filePath)
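The same options can also be collected in a dictionary and applied in one call with `DataFrameReader.options(**...)`, which makes them easier to reuse across files. This is only a sketch: `excel_reader_options` is a hypothetical helper, and the option names are the ones used in the snippet above.

```python
def excel_reader_options(data_address: str, header: bool = True,
                         infer_schema: bool = True,
                         treat_empty_as_nulls: bool = False) -> dict:
    """Collect the reader options from the snippet above as string values."""

    def flag(value: bool) -> str:
        # The options are passed as the strings "true" / "false"
        return "true" if value else "false"

    return {
        "header": flag(header),
        "dataAddress": data_address,
        "treatEmptyValuesAsNulls": flag(treat_empty_as_nulls),
        "inferSchema": flag(infer_schema),
    }


options = excel_reader_options("'Sheet1'!A1")
print(options)

# With a SparkSession named `spark` and the library installed:
# df = spark.read.format("com.crealytics.spark.excel") \
#     .options(**options) \
#     .load(filePath)
```

Keeping the flags as Python booleans until the last step avoids typos such as "True" versus "true" in the option strings.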

Answer 8

A simple one-line way to read Excel data into a Spark DataFrame is to read it with the Pandas API on Spark and immediately convert it to a Spark DataFrame. It looks like this:

import pyspark.pandas as ps
# inferSchema is not a read_excel argument, so it is dropped here
spark_df = ps.read_excel('<excel file path>', sheet_name='Sheet1').to_spark()

Answer 9

We can also try this, after installing this library: com.crealytics:spark-excel_2.12:0.13.5

df = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "<SHEETNAME>!A1") \
    .load("FILEPATH")
display(df)
