I'm new to ADB and I'm trying to load data from a parquet file into a table in Databricks. I run the following command:

load data local inpath '/FileStore/tables/Subsidiary__1_-2.parquet' into table Subsidiary

but it throws the error below:

Error in SQL statement: AnalysisException: LOAD DATA is not supported for datasource tables: `default`.`subsidiary`;

Can anyone explain why this happens?
From the official Databricks documentation on LOAD DATA (highlighting mine):

    Loads the data into a Hive SerDe table from the user specified directory or file.

Per the exception message (highlighting mine), you are using a Spark SQL table (a datasource table):

    AnalysisException: LOAD DATA is not supported for datasource tables: `default`.`subsidiary`;

The easiest way to confirm this is DESCRIBE EXTENDED: verify that the Provider of the table is not hive but something else (e.g. parquet).
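In a Databricks notebook this check is a single SQL statement; a minimal sketch against the table from the question (the exact output layout may differ slightly between runtimes):

-- Inspect the table metadata and look at the "Provider" row
DESCRIBE EXTENDED Subsidiary;
-- Provider = parquet / csv / delta / ...  => datasource table, LOAD DATA fails
-- Provider = hive                         => Hive SerDe table, LOAD DATA works

Here is a complete demo in spark-shell: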
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.9)
scala> spark.range(5).write.saveAsTable("demo")
scala> sql("DESCRIBE EXTENDED demo").show(truncate = false)
20/12/29 21:57:35 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
+----------------------------+--------------------------------------------------------------+-------+
|col_name |data_type |comment|
+----------------------------+--------------------------------------------------------------+-------+
|id |bigint |null |
| | | |
|# Detailed Table Information| | |
|Database |default | |
|Table |demo | |
|Owner |jacek | |
|Created Time |Tue Dec 29 21:57:09 CET 2020 | |
|Last Access |UNKNOWN | |
|Created By |Spark 3.0.1 | |
|Type |MANAGED | |
|Provider |parquet | |
|Statistics |2582 bytes | |
|Location |file:/Users/jacek/dev/oss/spark/spark-warehouse/demo | |
|Serde Library |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | |
|InputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | |
|OutputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| |
+----------------------------+--------------------------------------------------------------+-------+
scala> sql("load data local inpath 'NOTICE' into table demo")
org.apache.spark.sql.AnalysisException: LOAD DATA is not supported for datasource tables: `default`.`demo`;
at org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:317)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
... 47 elided
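For completeness: if LOAD DATA is really required, the table has to be a Hive SerDe table, i.e. one created with STORED AS rather than USING. A minimal sketch, assuming Hive support is enabled; the column names are hypothetical and the schema must match the Parquet file:

-- Hypothetical Hive SerDe table; STORED AS makes the Provider "hive"
CREATE TABLE subsidiary_hive (id BIGINT, name STRING)
STORED AS PARQUET;

-- DESCRIBE EXTENDED subsidiary_hive should now report Provider = hive,
-- so LOAD DATA is accepted for this table
LOAD DATA LOCAL INPATH '/FileStore/tables/Subsidiary__1_-2.parquet'
INTO TABLE subsidiary_hive;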
Update: see Jacek's answer above, but it only works in specific cases.

There is no load data command as such in Spark SQL. The closest thing is INSERT INTO, which lets you insert data into a table from other tables. But if what you really want is to access the data in a given file, you can use CREATE TABLE instead. Something like this:
CREATE TABLE IF NOT EXISTS Subsidiary
USING PARQUET
LOCATION '/FileStore/tables/Subsidiary__1_-2.parquet'
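Once the table is defined over the Parquet location it can be queried directly, and the INSERT INTO route mentioned above becomes available as well. A hypothetical sketch (the target table subsidiary_final is assumed to exist with a matching schema):

-- Verify the table picks up the Parquet data
SELECT * FROM Subsidiary LIMIT 10;

-- Hypothetical: copy the rows into another, pre-existing table
INSERT INTO subsidiary_final
SELECT * FROM Subsidiary;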
An addition to Alex Ott's answer:

First, create a temporary view from the CSV file, with all field types set to string:
CREATE OR REPLACE TEMPORARY VIEW temp_csv_table
USING csv
OPTIONS (path "hdfs://your_file.csv", header "true");
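Optionally, sanity-check the view before casting; every column comes back as string at this point:

SELECT * FROM temp_csv_table LIMIT 5;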
Then cast the columns to the proper types and insert them into the target table:
INSERT INTO TABLE your_table
SELECT
CAST(id AS BIGINT) AS id,
CAST(dtype AS INT) AS dtype,
...
FROM temp_csv_table;
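The temporary view is session-scoped and disappears when the session ends; it can also be dropped explicitly once the insert is done:

DROP VIEW IF EXISTS temp_csv_table;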