I'm not sure how to handle some errors I'm running into when converting a pandas DataFrame to a PySpark DataFrame. My pandas DataFrame has a column, 'array_output', containing arrays created from images with OpenCV. It looks like this:
array([[[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]],

       [[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]],

       ...
Converting the DataFrame to PySpark raises the following error:
UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
Can only convert 1-dimensional array values
I tried to flatten the array with this:
.apply(lambda x: [item for sublist in x for item in sublist])
but that doesn't seem to work either, since I get the same error.
Any ideas on how I can achieve this would be appreciated. Thanks in advance.
There is no need to flatten. Just define the matching schema and convert directly, like this:
import pandas as pd
import pyspark.sql.types as T
from pyspark.sql import SparkSession

# Sample pandas DataFrame whose column holds nested (2D) lists,
# mirroring the structure of the image arrays in the question
data = {
    'array_output': [
        [[255, 255, 255], [255, 255, 255], [255, 255, 255]],
        [[255, 255, 255], [255, 255, 255], [255, 255, 255]],
        [[255, 255, 255], [255, 255, 255], [255, 255, 255]]
    ]
}
pdf = pd.DataFrame(data)

spark = SparkSession.builder.getOrCreate()

# An array of arrays of integers matches the 2D structure of each cell
array_schema = T.StructType([
    T.StructField("array_output", T.ArrayType(T.ArrayType(T.IntegerType())), True)
])

spark_df = spark.createDataFrame(pdf, schema=array_schema)
spark_df.printSchema()
spark_df.show(truncate=False)
Output:
root
|-- array_output: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: integer (containsNull = true)
+---------------------------------------------------+
|array_output |
+---------------------------------------------------+
|[[255, 255, 255], [255, 255, 255], [255, 255, 255]]|
|[[255, 255, 255], [255, 255, 255], [255, 255, 255]]|
|[[255, 255, 255], [255, 255, 255], [255, 255, 255]]|
+---------------------------------------------------+
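One caveat: the 'array_output' column in the question holds 2D NumPy arrays produced by OpenCV, not nested Python lists, and Arrow's "Can only convert 1-dimensional array values" error is triggered by exactly those multi-dimensional ndarray values. If that's your case, a minimal sketch of the extra step, reusing pdf, spark, and array_schema from the snippet above (the column name is taken from the question):

# Assumption: each cell is a 2D NumPy ndarray (e.g. an OpenCV image).
# Arrow can only convert 1-D ndarray values, so turn each array into
# nested Python lists before handing the frame to Spark.
pdf['array_output'] = pdf['array_output'].apply(lambda a: a.tolist())

spark_df = spark.createDataFrame(pdf, schema=array_schema)

After this conversion the explicit nested-array schema applies exactly as shown above.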