When trying to read XML with pyspark, it works fine on Databricks - but fails on a local installation:
spark.read.format("xml").option("rowTag","result").load("ingestion/sap_sf/compound_employee/test/soap_metadata.xml")
spark.read.format("xml").option("rowTag", "result").load("ingestion/sap_sf/compound_employee/test/soap_metadata.xml")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/mnt/d/data/github/dp-elt/.dp-venv/lib64/python3.11/site-packages/pyspark/sql/readwriter.py", line 307, in load
return self._df(self._jreader.load(path))
^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/d/data/github/dp-elt/.dp-venv/lib64/python3.11/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
File "/mnt/d/data/github/dp-elt/.dp-venv/lib64/python3.11/site-packages/pyspark/errors/exceptions/captured.py", line 179, in deco
return f(*a, **kw)
^^^^^^^^^^^
File "/mnt/d/data/github/dp-elt/.dp-venv/lib64/python3.11/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o39.load.
: org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find the data source: xml. Please find packages at `https://spark.apache.org/third-party-projects.html`.
at org.apache.spark.sql.errors.QueryExecutionErrors$.dataSourceNotFoundError(QueryExecutionErrors.scala:725)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:647)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:697)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:208)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:186)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.lang.ClassNotFoundException: xml.DefaultSource
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:592)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:633)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:633)
at scala.util.Failure.orElse(Try.scala:224)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:633)
... 15 more
When I start pyspark manually from the terminal with the additional jar:
pyspark --jars ~/dp-elt/python/libs/spark-xml_2.13-0.18.0.jar
then I get the following error:
spark.read.format("xml").option("rowTag", "result").load("ingestion/sap_sf/compound_employee/test/soap_metadata.xml")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/mnt/d/data/github/dp-elt/.dp-venv/lib64/python3.11/site-packages/pyspark/sql/readwriter.py", line 307, in load
return self._df(self._jreader.load(path))
^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/d/data/github/dp-elt/.dp-venv/lib64/python3.11/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
File "/mnt/d/data/github/dp-elt/.dp-venv/lib64/python3.11/site-packages/pyspark/errors/exceptions/captured.py", line 179, in deco
return f(*a, **kw)
^^^^^^^^^^^
File "/mnt/d/data/github/dp-elt/.dp-venv/lib64/python3.11/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o41.load.
: java.lang.NoClassDefFoundError: scala/$less$colon$less
at com.databricks.spark.xml.XmlOptions$.apply(XmlOptions.scala:82)
at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:66)
at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:52)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:346)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:229)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:211)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:186)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.lang.ClassNotFoundException: scala.$less$colon$less
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:592)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
... 21 more
Some sources suggest this is a version mix-up - any idea how to analyze this further, or even how to fix it?
You shouldn't (have to) specify the dependencies (commons-io-2.11.0.jar, xmlschema-core-2.3.0.jar, ...) of the library you're trying to use (spark-xml_2.12-0.18.0.jar) yourself, unless you're doing something very custom.
The Spark installation in the example below lives in a poetry virtual environment. You can do the same with virtualenv/conda etc.; the official instructions cover the conda setup. If you're on Windows, you will also need winutils and hadoop.dll.
Using a virtualenv is (IMO) the best way to avoid version conflicts. It's especially useful for a local/dev setup on your own laptop.
As you can see below, it automatically installs spark-xml and all of its dependencies. If you create the SparkSession in your own code, you can use .config('spark.jars.packages', 'com.databricks:spark-xml_2.12:0.18.0') instead.
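For reference, a minimal sketch of that in-code variant, assuming the same rowTag and file path as in the question (both are taken from the question, not values I can verify):

from pyspark.sql import SparkSession

# Let Spark resolve spark-xml and its transitive dependencies from Maven Central
# at session startup, instead of passing individual jars by hand.
spark = (
    SparkSession.builder
    .appName("local-xml-read")
    .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.18.0")
    .getOrCreate()
)

df = (
    spark.read.format("xml")
    .option("rowTag", "result")
    .load("ingestion/sap_sf/compound_employee/test/soap_metadata.xml")
)
df.printSchema()

Note that spark.jars.packages only takes effect if it is set before the JVM starts, i.e. on the very first SparkSession created by the process.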
(venv) C:\My\workspaces\project> set PYSPARK_DRIVER_PYTHON=C:\My\workspaces\project\fxc\.venv\Scripts\python3
(venv) C:\My\workspaces\project> set PYSPARK_PYTHON=C:\My\workspaces\project\fxc\.venv\Scripts\python3
(venv) C:\My\workspaces\project> set SPARK_HOME=C:\My\workspaces\project\fxc\.venv\Lib\site-packages\pyspark\
(venv) C:\My\workspaces\project> set HADOOP_HOME=C:\My\workspaces\project\fxc\.venv\Lib\site-packages\pyspark\
(venv) C:\My\workspaces\project> set JAVA_HOME="C:\ProgFiles\jdk-1.8"
(venv) C:\My\workspaces\project>
(venv) C:\My\workspaces\project> pyspark --packages com.databricks:spark-xml_2.12:0.18.0
Python 3.10.12 (main, Jun 8 2023, 17:32:40) [MSC v.1936 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
:: loading settings :: url = jar:file:/C:/My/workspaces/project/.venv/Lib/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: C:\Users\kash\.ivy2\cache
The jars for the packages stored in: C:\Users\kash\.ivy2\jars
com.databricks#spark-xml_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-5128ab4d-ce73-4022-9bf5-02f47538e923;1.0
confs: [default]
found com.databricks#spark-xml_2.12;0.18.0 in central
found commons-io#commons-io;2.11.0 in local-m2-cache
found org.glassfish.jaxb#txw2;3.0.2 in central
found org.apache.ws.xmlschema#xmlschema-core;2.3.0 in central
found org.scala-lang.modules#scala-collection-compat_2.12;2.9.0 in central
downloading https://repo1.maven.org/maven2/com/databricks/spark-xml_2.12/0.18.0/spark-xml_2.12-0.18.0.jar ...
[SUCCESSFUL ] com.databricks#spark-xml_2.12;0.18.0!spark-xml_2.12.jar (70ms)
downloading file:/C:/Users/kash/.m2/repository/commons-io/commons-io/2.11.0/commons-io-2.11.0.jar ...
[SUCCESSFUL ] commons-io#commons-io;2.11.0!commons-io.jar (10ms)
downloading https://repo1.maven.org/maven2/org/glassfish/jaxb/txw2/3.0.2/txw2-3.0.2.jar ...
[SUCCESSFUL ] org.glassfish.jaxb#txw2;3.0.2!txw2.jar (60ms)
downloading https://repo1.maven.org/maven2/org/apache/ws/xmlschema/xmlschema-core/2.3.0/xmlschema-core-2.3.0.jar ...
[SUCCESSFUL ] org.apache.ws.xmlschema#xmlschema-core;2.3.0!xmlschema-core.jar(bundle) (61ms)
downloading https://repo1.maven.org/maven2/org/scala-lang/modules/scala-collection-compat_2.12/2.9.0/scala-collection-compat_2.12-2.9.0.jar ...
[SUCCESSFUL ] org.scala-lang.modules#scala-collection-compat_2.12;2.9.0!scala-collection-compat_2.12.jar (77ms)
:: resolution report :: resolve 2973ms :: artifacts dl 287ms
:: modules in use:
com.databricks#spark-xml_2.12;0.18.0 from central in [default]
commons-io#commons-io;2.11.0 from local-m2-cache in [default]
org.apache.ws.xmlschema#xmlschema-core;2.3.0 from central in [default]
org.glassfish.jaxb#txw2;3.0.2 from central in [default]
org.scala-lang.modules#scala-collection-compat_2.12;2.9.0 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 5 | 5 | 5 | 0 || 5 | 5 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-5128ab4d-ce73-4022-9bf5-02f47538e923
confs: [default]
5 artifacts copied, 0 already retrieved (989kB/24ms)
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.0
      /_/
Using Python version 3.10.12 (main, Jun 8 2023 17:32:40)
Spark context Web UI available at http://xxxx:4040
Spark context available as 'sc' (master = local[*], app id = local-1736972509514).
SparkSession available as 'spark'.
>>>
>>> spark.read.format('xml').option('rowTag', 'dependency').load('../../pom.xml').show()
+--------------------+----------+--------------------+--------------------+--------+----+------------------+
| artifactId|classifier| exclusions| groupId| scope|type| version|
+--------------------+----------+--------------------+--------------------+--------+----+------------------+
| netty-all| NULL| NULL| io.netty| NULL|NULL| NULL|
| commons-net| NULL| NULL| commons-net| NULL|NULL| NULL|
| quickfixj-all| NULL| NULL| org.quickfixj|provided|NULL| NULL|
| forms| NULL| NULL| jgoodies| NULL|NULL| NULL|
| jxlayer| NULL| NULL| org.swinglabs| NULL|NULL| NULL|
| proxy-vole| NULL| NULL|com.github.markus...| NULL|NULL| 20131209|
| jbusycomponent| NULL| NULL| org.divxdede| NULL|NULL| NULL|
|timingframework-s...| NULL| NULL|net.java.timingfr...| NULL|NULL| NULL|
| miglayout| NULL| NULL| com.miglayout| NULL|NULL| NULL|
| jasperreports| NULL| NULL|net.sf.jasperreports| NULL|NULL| NULL|
| font| NULL| NULL|net.sf.jasperreports| runtime|NULL| NULL|
| mina-core| NULL| NULL| org.apache.mina| NULL|NULL| NULL|
| jna| NULL| NULL| net.java.dev.jna| NULL|NULL| NULL|
| jna-platform| NULL| NULL| net.java.dev.jna| NULL|NULL| NULL|
| CertUtils| NULL| NULL| net.java.dev.jna| NULL|NULL| NULL|
| castor| NULL| NULL| castor| NULL|NULL| NULL|
+--------------------+----------+--------------------+--------------------+--------+----+------------------+
only showing top 20 rows
>>>
If this doesn't work for you and you have to fix your own Spark setup instead, then post all the details of your environment.
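As a first step for that analysis, here is a small sketch that prints the Spark and Scala versions of the local installation; the scala/$less$colon$less error in the question is what you typically get when a Scala 2.13 build of spark-xml is loaded into a Spark runtime compiled against Scala 2.12 (the pyspark 3.5.0 wheel from PyPI ships Scala 2.12 jars). The py4j access path to scala.util.Properties below is an assumption and may vary between versions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark version of the local installation, e.g. 3.5.0
print(spark.version)

# Scala version the JVM side was built with, e.g. "version 2.12.18";
# reaching scala.util.Properties through py4j is an assumption here.
print(spark.sparkContext._jvm.scala.util.Properties.versionString())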