(Spark 3.3.2, OpenJDK 19, PySpark, Pandas_UDF, Python 3.10, Ubuntu 22.04, Dockerized) Test script produces TypeError: 'JavaPackage' object is not callable


I created a Docker container that installs Ubuntu 22.04, Python 3.10, Spark 3.3.2, Hadoop 3, Scala 2.13, and OpenJDK 19.

I am currently using it as a test environment before deploying the code to AWS.

The container had been working fine for the past 2-3 months, until a recent rebuild started producing the following error:

TypeError: 'JavaPackage' object is not callable

The Dockerfile has not changed, and I have verified that rebuilding on different machines/environments produces the same error.
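
As a quick sanity check (my working assumption, not confirmed: an unpinned rebuild can pull a newer pip pyspark/py4j that no longer matches the Spark 3.3.2 distribution under SPARK_HOME), a short snippet like this shows which installation the interpreter is actually loading:

# Diagnostic sketch, assuming the usual cause of "'JavaPackage' object is not
# callable" is a mismatch between the pip-installed pyspark/py4j and the Spark
# jars under SPARK_HOME.
import os
import py4j
import pyspark

print("SPARK_HOME      :", os.environ.get("SPARK_HOME"))
print("pyspark version :", pyspark.__version__, "from", pyspark.__file__)
print("py4j version    :", py4j.__version__, "from", py4j.__file__)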

Dockerfile

# Container for Apache Spark, PySpark-Pandas, RasterFrames, GeoParquet
# FROM ubuntu:23.04 (note: Spark 3.4.0 will support Python 3.11)
FROM ubuntu:22.04

# Setup Spark Version Requirements
ENV SPARK_HOME=/usr/local/spark
ARG SPARK_V="3.3.2"
ARG HADOOP_V="3"
ARG SCALA_V="13"
ARG OPENJDK_V="19"

# Apache Spark 3.3.2 with Scala 2.13; checksum from https://spark.apache.org/downloads.html
ARG spark_checksum="3ce800ca3e0569ccb8736e4fcdb8146ec6d3070da7622dcc9d0edbeb2dc9524224f3a082a70a0faff91306369a837caa13291a09f3ad0d2b0b51548365f90ead"

# Handle user-prompt for Ubuntu installation time zone selection
ENV TZ=America
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && \
    echo $TZ > /etc/timezone

# Update Ubuntu
RUN apt-get update && apt-get upgrade -y && \
    apt-get install -y apt-utils

# Install python
RUN apt-get update --fix-missing && apt-get upgrade -y && \
    apt-get install -y sudo dialog git openssh-server wget \
    curl cmake nano python3 python3-pip python3-setuptools \
    build-essential libpq-dev gdal-bin libgdal-dev

RUN apt-get update --yes && \
    apt-get install --yes --no-install-recommends \
    "openjdk-${OPENJDK_V}-jre-headless" \
    ca-certificates-java && \
    apt-get clean && rm -rf /var/lib/apt/lists/*
  
RUN apt-get autoclean && apt-get autoremove && \
    apt-get update --fix-missing && apt-get upgrade -y && \
    dpkg --configure -a && apt-get install -f
  
# Install Non-Spark Python Packages
RUN pip install numpy GDAL pandas \
    geopandas geoparquet rtree \
    plotly rasterio folium descartes 

# Spark installation
WORKDIR /tmp
RUN wget -qO "spark.tgz" "https://archive.apache.org/dist/spark/spark-${SPARK_V}/spark-${SPARK_V}-bin-hadoop${HADOOP_V}-scala2.${SCALA_V}.tgz"
RUN echo "${spark_checksum} *spark.tgz" | sha512sum -c -
RUN tar xzf "spark.tgz" -C /usr/local --owner root --group root --no-same-owner
RUN rm "spark.tgz"

# Handle spark home error for /usr/local/spark
#RUN rm /usr/local/spark
RUN mv /usr/local/spark-3.3.2-bin-hadoop3-scala2.13 /usr/local/spark

# Configure Spark
ENV SPARK_OPTS="--driver-java-options=-Xms1024M --driver-java-options=-Xmx4096M --driver-java-options=-Dlog4j.logLevel=info" PATH="${PATH}:${SPARK_HOME}/bin"
RUN mkdir -p /usr/local/bin/before-notebook.d
RUN ln -s "/usr/local/spark/sbin/spark-config.sh" /usr/local/bin/before-notebook.d/spark-config.sh

# Install Spark Python Packages
RUN pip install pyarrow py4j scipy \
    pyspark pyspark[sql] pyspark[pandas_on_spark]

WORKDIR /home
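
For reference, the unpinned pip install in the last layer is the part of the Dockerfile most likely to drift between rebuilds. A pinned variant could look like the sketch below; the versions are my assumptions for matching Spark 3.3.2 and should be verified against the py4j zip shipped in /usr/local/spark/python/lib, not taken as confirmed.

# Hypothetical pinned variant of the Spark Python packages layer
# (versions are assumptions for Spark 3.3.2; verify before use)
RUN pip install pyarrow py4j==0.10.9.5 scipy \
    "pyspark[sql,pandas_on_spark]==3.3.2"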

Below is the part of the script that produces the error; the offending function generates 10 million random points that are then used for a spatial join.

test-pyspark.py

import os, sys, json, io, rtree, fiona, math, findspark

import plotly.express as px
import plotly.graph_objects as go
from random import random
from scipy.special import expit as logistic

import numpy as np
import pandas as pd
import geopandas as gpd
import pyspark.pandas as ps

import pyarrow as pa
import pyarrow.parquet as pq

import shapely
from shapely import wkb, wkt
from shapely.geometry.base import BaseGeometry
from shapely.geometry.collection import GeometryCollection
from shapely.geometry import \
    LineString, MultiLineString, MultiPoint, MultiPolygon, Point, Polygon

import pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import \
    col, lag, first, last, count, udf, lit, pandas_udf, split, array_contains
from pyspark.sql.types import \
    UserDefinedType, StructField, BinaryType, StructType, \
    StringType, IntegerType, FloatType, DoubleType, DecimalType, LongType

from ast import literal_eval as make_tuple

from IPython.display import display

os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

findspark.init() 

job_name = "can300-june-pothole"
spark = SparkSession.builder.master("local[*]") \
    .appName(job_name) \
    .config("spark.config.option", "true") \
    .config("spark.executor.memory", "32g") \
    .config("spark.driver.memory", "32g") \
    .config("spark.sql.debug.maxToStringFields",1000000) \
    .getOrCreate()

spark_context_few_slices = 64
spark_context_some_slices = 256
spark_context_many_slices = 16384
plot_size = 800

def test_para():
    # Generate Number of Random Samples for Spatial Join: 10 million
    num_datapoints_to_match = int(10e6) # 10m datapoints

    data = spark.sparkContext.parallelize(
        [[random()*3, random()*3 ] for x in range(num_datapoints_to_match) ],
        numSlices=spark_context_some_slices)
    sdf0 = data \
        .map(lambda x: str(x)) \
        .map(lambda w: w[1:-1].split(',')) \
        .toDF()
    sdf0 = sdf0 \
        .withColumnRenamed('_1', 'x') \
        .withColumnRenamed('_2', 'y') \

if __name__ == "__main__":
    print("Start!")
    test_para()
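
For what it's worth, the traceback below fails inside pyspark's generic _wrap_function path before any geospatial code is reached, so a stripped-down check like this (my own reduction, not part of the original script) exercises the same RDD-to-JVM code path:

# Minimal reproduction sketch: any action on a mapped RDD goes through
# _wrap_function, the same call that raises the TypeError below.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("repro").getOrCreate()
print(spark.sparkContext.parallelize(range(10)).map(lambda x: x * 2).count())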

Error:

root@b2924c37ab5a:/home/bl300# python3 test-pyspark.py 
/usr/local/lib/python3.10/dist-packages/pyspark/pandas/__init__.py:50: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.
  warnings.warn(
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/05/18 20:14:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Start!
Traceback (most recent call last):
  File "/home/bl300/test-pyspark.py", line 107, in <module>
    test_join()
  File "/home/bl300/test-pyspark.py", line 100, in test_join
    .toDF()
  File "/usr/local/lib/python3.10/dist-packages/pyspark/sql/session.py", line 115, in toDF
    return sparkSession.createDataFrame(self, schema, sampleRatio)
  File "/usr/local/lib/python3.10/dist-packages/pyspark/sql/session.py", line 1276, in createDataFrame
    return self._create_dataframe(
  File "/usr/local/lib/python3.10/dist-packages/pyspark/sql/session.py", line 1316, in _create_dataframe
    rdd, struct = self._createFromRDD(data.map(prepare), schema, samplingRatio)
  File "/usr/local/lib/python3.10/dist-packages/pyspark/sql/session.py", line 931, in _createFromRDD
    struct = self._inferSchema(rdd, samplingRatio, names=schema)
  File "/usr/local/lib/python3.10/dist-packages/pyspark/sql/session.py", line 874, in _inferSchema
    first = rdd.first()
  File "/usr/local/lib/python3.10/dist-packages/pyspark/rdd.py", line 2869, in first
    rs = self.take(1)
  File "/usr/local/lib/python3.10/dist-packages/pyspark/rdd.py", line 2836, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
  File "/usr/local/lib/python3.10/dist-packages/pyspark/context.py", line 2319, in runJob
    sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
  File "/usr/local/lib/python3.10/dist-packages/pyspark/rdd.py", line 5441, in _jrdd
    wrapped_func = _wrap_function(
  File "/usr/local/lib/python3.10/dist-packages/pyspark/rdd.py", line 5243, in _wrap_function
    return sc._jvm.SimplePythonFunction(
TypeError: 'JavaPackage' object is not callable
Tags: docker, pyspark, apache-spark-sql, rdd, pandas-udf