First of all, I don't understand why people are downvoting this question without explaining how I can improve it. I am happy to elaborate further, so please leave feedback. Although I am new here, I am not trying to ask without putting in effort.
I am trying to run a Spark job, written in Scala and using the jep interpreter, on a Google Cloud Platform Dataproc cluster.
What is a complete, short way to get jep working with Scala on Google Cloud Platform Dataproc?
I have added jep as a dependency in my build:

"black.ninia" % "jep" % "3.9.0"
and in the install.sh script I wrote:

sudo -E pip install jep
export JEP_PATH=$(pip show jep | grep "^Location:" | cut -d ':' -f 2,3 | cut -d ' ' -f 2)
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JEP_PATH/jep
I still get the following error (no jep in java.library.path):
20/01/07 09:07:23 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 4.0 in stage 9.0 (TID 74, fs-xxxx-xxx-xxxx-test-w-1.c.xx-xxxx.internal, executor 1): java.lang.UnsatisfiedLinkError: no jep in java.library.path
        at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
        at java.lang.Runtime.loadLibrary0(Runtime.java:870)
        at java.lang.System.loadLibrary(System.java:1122)
        at jep.MainInterpreter.initialize(MainInterpreter.java:128)
        at jep.MainInterpreter.getMainInterpreter(MainInterpreter.java:101)
        at jep.Jep.<init>(Jep.java:256)
        at jep.SharedInterpreter.<init>(SharedInterpreter.java:56)
        at dunnhumby.sciencebank.SubsCommons$$anonfun$getUnitVecEmbeddings$1.apply(SubsCommons.scala:33)
        at dunnhumby.sciencebank.SubsCommons$$anonfun$getUnitVecEmbeddings$1.apply(SubsCommons.scala:31)
        at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:196)
        at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:193)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
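For context, the UnsatisfiedLinkError above means the executor JVM cannot find the native libjep.so, which pip installs into the jep/ subdirectory of the Python site-packages directory. As a sketch of what the Location-parsing pipeline in my script is meant to extract (the sample pip output below is illustrative, not from my cluster):

```shell
# `pip show jep` prints a "Location:" line naming the site-packages directory;
# the native libjep.so lives in the jep/ subdirectory under it, and that
# directory is what must end up on the JVM's java.library.path.
sample_pip_show='Name: jep
Version: 3.9.0
Location: /usr/local/lib/python2.7/dist-packages'   # illustrative output only

# Extract the path after "Location: " (second space-separated field).
JEP_PATH=$(printf '%s\n' "$sample_pip_show" | grep "^Location:" | cut -d ' ' -f 2)
echo "$JEP_PATH/jep"   # -> /usr/local/lib/python2.7/dist-packages/jep
```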
(Edit:)
1.) I have seen specific answers for a local machine, but nothing for Google Cloud Platform.
2.) I found https://github.com/ninia/jep/issues/141, but it did not help.
3.) I also found an answer, but it is unanswered and does not cover Google Cloud Platform. I even performed all the steps from there.
4.) If the question is missing some screenshots, I will attach them. But please leave some comments.
(Edit 08-01-2020: I am adding the install.sh used.)
#!/bin/bash
set -x -e
# Disable ipv6 since it seems to cause intermittent SocketTimeoutException when collecting data
# See CENG-1268 in Jira
printf "\nnet.ipv6.conf.default.disable_ipv6=1\nnet.ipv6.conf.all.disable_ipv6=1\n" >> /etc/sysctl.conf
sysctl -p
if [[ $(/usr/share/google/get_metadata_value attributes/dataproc-role) == Master ]]; then
config_bucket="$(/usr/share/google/get_metadata_value attributes/dataproc-cluster-configuration-directory | cut -d'/' -f3)"
dataproc_cluster_name="$(/usr/share/google/get_metadata_value attributes/dataproc-cluster-name)"
hdfs dfs -mkdir -p gs://${config_bucket}/${dataproc_cluster_name}/spark_events
systemctl restart spark-history-server.service
fi
tee -a /etc/hosts << EOM
$(/usr/share/google/get_metadata_value attributes/preprod-mjr-dataplatform-metrics-mig-ip) influxdb
EOM
echo "[global]
index-url = https://cs-anonymous:[email protected]/artifactory/api/pypi/pypi-remote/simple" >/etc/pip.conf
PIP_REQUIREMENTS_FILE=gs://preprod-xxx-dpl-artif/dataproc/requirements.txt
PIP_TRANSITIVE_REQUIREMENTS_FILE=gs://preprod-xxx-dpl-artif/dataproc/transitive-requirements.txt
gsutil cp ${PIP_REQUIREMENTS_FILE} .
gsutil cp ${PIP_TRANSITIVE_REQUIREMENTS_FILE} .
gsutil -q cp gs://preprod-xxx-dpl-artif/dataproc/apt-transport-https_1.4.8_amd64.deb /tmp/apt-transport-https_1.4.8_amd64.deb
export http_proxy=http://preprod-xxx-securecomms.preprod-xxx-securecomms.il4.us-east1.lb.dh-xxxxx-media-55595.internal:3128
export https_proxy=http://preprod-xxx-securecomms.preprod-xxx-securecomms.il4.us-east1.lb.dh-xxxxx-media-55595.internal:3128
export no_proxy=google.com,googleapis.com,localhost
echo "deb https://cs-anonymous:[email protected]/artifactory/debian-main-remote stretch main" >/etc/apt/sources.list.d/main.list
echo "deb https://cs-anonymous:[email protected]/artifactory/maria-db-debian stretch main" >>/etc/apt/sources.list.d/main.list
echo 'Acquire::CompressionTypes::Order:: "gz";' > /etc/apt/apt.conf.d/02update
echo 'Acquire::http::Timeout "10";' > /etc/apt/apt.conf.d/99timeout
echo 'Acquire::ftp::Timeout "10";' >> /etc/apt/apt.conf.d/99timeout
sudo dpkg -i /tmp/apt-transport-https_1.4.8_amd64.deb
sudo apt-get install --allow-unauthenticated -y /tmp/apt-transport-https_1.4.8_amd64.deb
sudo -E apt-get update --allow-unauthenticated -y -o Dir::Etc::sourcelist="sources.list.d/main.list" -o Dir::Etc::sourceparts="-" -o APT::Get::List-Cleanup="0"
sudo -E apt-get --allow-unauthenticated -y install python-pip gcc python-dev python-tk curl
#requires index-url specifying because the version of pip installed by previous command
#installs an old version that doesn't seem to recognise pip.conf
sudo -E pip install --index-url https://cs-anonymous:[email protected]/artifactory/api/pypi/pypi-remote/simple --ignore-installed pip setuptools wheel
sudo -E pip install jep
sudo -E pip install gensim
JEP_PATH=$(pip show jep | grep "^Location:" | cut -d ':' -f 2,3 | cut -d ' ' -f 2)
cat << EOF >> /etc/spark/conf/spark-env.sh
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JEP_PATH/jep
export LD_PRELOAD=$LD_PRELOAD:$JEP_PATH/jep
EOF
tee -a /etc/spark/conf/spark-defaults.conf << EOM
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JEP_PATH/jep
export LD_PRELOAD=$LD_PRELOAD:$JEP_PATH/jep
EOM
tee -a /etc/*bashrc << EOM
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JEP_PATH/jep
export LD_PRELOAD=$LD_PRELOAD:$JEP_PATH/jep
EOM
source /etc/*bashrc
sudo -E apt-get install --allow-unauthenticated -y \
pkg-config \
freetype* \
python-matplotlib \
libpq-dev \
libssl-dev \
libcrypto* \
python-dev \
libtext-csv-xs-perl \
libmysqlclient-dev \
libfreetype* \
libzmq3-dev \
libzmq3*
sudo -E pip install -r ./requirements.txt
Assuming you are using install.sh as a Dataproc initialization action, your export
commands only export those environment variables into the local shell session that runs the init action; they are not persisted for the Spark processes that run afterwards.
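One way to make the path reach the executor JVMs themselves is to set it through Spark properties rather than shell exports, for example in /etc/spark/conf/spark-defaults.conf. A sketch, assuming the standard Spark properties; the dist-packages path below is an example, substitute the JEP_PATH your script computes:

```
spark.executorEnv.LD_LIBRARY_PATH  /usr/local/lib/python2.7/dist-packages/jep
spark.executor.extraLibraryPath    /usr/local/lib/python2.7/dist-packages/jep
spark.driver.extraLibraryPath      /usr/local/lib/python2.7/dist-packages/jep
```

spark.executorEnv.<NAME> sets an environment variable in each executor process, and the extraLibraryPath properties add to the native library search path the JVM consults in System.loadLibrary, which is where the stack trace above fails.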