How to link PyCharm with PySpark?

Question · Votes: 66 · Answers: 11

I'm new to Apache Spark, and apparently I installed apache-spark with Homebrew on my MacBook:

Last login: Fri Jan  8 12:52:04 on console
user@MacBook-Pro-de-User-2:~$ pyspark
Python 2.7.10 (default, Jul 13 2015, 12:05:58)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/01/08 14:46:44 INFO SparkContext: Running Spark version 1.5.1
16/01/08 14:46:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/01/08 14:46:47 INFO SecurityManager: Changing view acls to: user
16/01/08 14:46:47 INFO SecurityManager: Changing modify acls to: user
16/01/08 14:46:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(user); users with modify permissions: Set(user)
16/01/08 14:46:50 INFO Slf4jLogger: Slf4jLogger started
16/01/08 14:46:50 INFO Remoting: Starting remoting
16/01/08 14:46:51 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:50199]
16/01/08 14:46:51 INFO Utils: Successfully started service 'sparkDriver' on port 50199.
16/01/08 14:46:51 INFO SparkEnv: Registering MapOutputTracker
16/01/08 14:46:51 INFO SparkEnv: Registering BlockManagerMaster
16/01/08 14:46:51 INFO DiskBlockManager: Created local directory at /private/var/folders/5x/k7n54drn1csc7w0j7vchjnmc0000gn/T/blockmgr-769e6f91-f0e7-49f9-b45d-1b6382637c95
16/01/08 14:46:51 INFO MemoryStore: MemoryStore started with capacity 530.0 MB
16/01/08 14:46:52 INFO HttpFileServer: HTTP File server directory is /private/var/folders/5x/k7n54drn1csc7w0j7vchjnmc0000gn/T/spark-8e4749ea-9ae7-4137-a0e1-52e410a8e4c5/httpd-1adcd424-c8e9-4e54-a45a-a735ade00393
16/01/08 14:46:52 INFO HttpServer: Starting HTTP Server
16/01/08 14:46:52 INFO Utils: Successfully started service 'HTTP file server' on port 50200.
16/01/08 14:46:52 INFO SparkEnv: Registering OutputCommitCoordinator
16/01/08 14:46:52 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/01/08 14:46:52 INFO SparkUI: Started SparkUI at http://192.168.1.64:4040
16/01/08 14:46:53 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
16/01/08 14:46:53 INFO Executor: Starting executor ID driver on host localhost
16/01/08 14:46:53 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 50201.
16/01/08 14:46:53 INFO NettyBlockTransferService: Server created on 50201
16/01/08 14:46:53 INFO BlockManagerMaster: Trying to register BlockManager
16/01/08 14:46:53 INFO BlockManagerMasterEndpoint: Registering block manager localhost:50201 with 530.0 MB RAM, BlockManagerId(driver, localhost, 50201)
16/01/08 14:46:53 INFO BlockManagerMaster: Registered BlockManager
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Python version 2.7.10 (default, Jul 13 2015 12:05:58)
SparkContext available as sc, HiveContext available as sqlContext.
>>>

I'd like to start playing around with it to learn more about MLlib. However, I write my Python scripts in PyCharm. The problem: when I go to PyCharm and try to call pyspark, PyCharm can't find the module. I tried adding the path to PyCharm as follows:

[screenshot: can't link PyCharm with Spark]

Then, following a blog post, I tried this:

import os
import sys

# Path for spark source folder
os.environ['SPARK_HOME']="/Users/user/Apps/spark-1.5.2-bin-hadoop2.4"

# Append pyspark  to Python Path
sys.path.append("/Users/user/Apps/spark-1.5.2-bin-hadoop2.4/python/pyspark")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print ("Successfully imported Spark Modules")

except ImportError as e:
    print ("Can not import Spark Modules", e)
    sys.exit(1)

And I still can't get PySpark working with PyCharm. Any idea how to "link" PyCharm with apache-pyspark?

Update:

Then I looked up the apache-spark and python paths in order to set PyCharm's environment variables:

apache-spark path:

user@MacBook-Pro-User-2:~$ brew info apache-spark
apache-spark: stable 1.6.0, HEAD
Engine for large-scale data processing
https://spark.apache.org/
/usr/local/Cellar/apache-spark/1.5.1 (649 files, 302.9M) *
  Poured from bottle
From: https://github.com/Homebrew/homebrew/blob/master/Library/Formula/apache-spark.rb

python path:

user@MacBook-Pro-User-2:~$ brew info python
python: stable 2.7.11 (bottled), HEAD
Interpreted, interactive, object-oriented programming language
https://www.python.org
/usr/local/Cellar/python/2.7.10_2 (4,965 files, 66.9M) *

Then, with the information above, I tried to set the environment variables as follows:

[screenshot: configuration 1]

Any idea how to correctly link PyCharm with pyspark?

Then, when I run a python script with the above configuration, I get this exception:

/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/user/PycharmProjects/spark_examples/test_1.py
Traceback (most recent call last):
  File "/Users/user/PycharmProjects/spark_examples/test_1.py", line 1, in <module>
    from pyspark import SparkContext
ImportError: No module named pyspark

Update: Then I tried the configurations proposed by @zero323

Configuration 1:

/usr/local/Cellar/apache-spark/1.5.1/ 

[screenshot: configuration 1]

Out:

 user@MacBook-Pro-de-User-2:/usr/local/Cellar/apache-spark/1.5.1$ ls
CHANGES.txt           NOTICE                libexec/
INSTALL_RECEIPT.json  README.md
LICENSE               bin/

Configuration 2:

/usr/local/Cellar/apache-spark/1.5.1/libexec 

[screenshot: configuration 2]

Out:

user@MacBook-Pro-de-User-2:/usr/local/Cellar/apache-spark/1.5.1/libexec$ ls
R/        bin/      data/     examples/ python/
RELEASE   conf/     ec2/      lib/      sbin/
Tags: apache-spark · pyspark · pycharm · homebrew
11 Answers
96 votes

With the PySpark package (Spark 2.2.0 and later)

With SPARK-1267 being merged, you should be able to simplify the process by pip-installing Spark in the environment you use for PyCharm development (a quick smoke test follows the steps below).

  1. Go to File -> Settings -> Project Interpreter
  2. Click the install button and search for PySpark
  3. Click the Install Package button.
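
A minimal smoke test, run with the same interpreter you just configured, might look like the sketch below (the local master, app name, and sample data are arbitrary choices, not part of the original answer):

from pyspark.sql import SparkSession

# Build a local SparkSession; this works with the pip-installed PySpark (2.2.0+).
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("pycharm-smoke-test") \
    .getOrCreate()

# A tiny DataFrame just to confirm the installation works end to end.
df = spark.createDataFrame([(1, "spark"), (2, "pycharm")], ["id", "tool"])
df.show()

spark.stop()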

Manually, with a user-provided Spark installation

Create a run configuration:

  1. Go to Run -> Edit Configurations
  2. Add a new Python configuration
  3. Set the Script path so it points to the script you want to execute
  4. Edit the Environment variables field so it contains at least (example values are shown after this list):
     - SPARK_HOME - it should point to the directory with the Spark installation. It should contain directories such as bin (with spark-submit, spark-shell, etc.) and conf (with spark-defaults.conf, spark-env.sh, etc.)
     - PYTHONPATH - it should contain $SPARK_HOME/python and optionally $SPARK_HOME/python/lib/py4j-some-version.src.zip if it is not available otherwise. some-version should match the Py4J version used by the given Spark installation (0.8.2.1 - 1.5, 0.9 - 1.6, 0.10.3 - 2.0, 0.10.4 - 2.1, 0.10.4 - 2.2, 0.10.6 - 2.3)
  5. Apply the settings
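
For reference, with the Homebrew layout from the question (Spark 1.5.1, which ships py4j-0.8.2.1 according to the version table above), the Environment variables field might hold values like the ones below. These are illustrative; adjust them to your own installation, and note that PyCharm does not expand shell variables here, so spell the paths out in full:

SPARK_HOME=/usr/local/Cellar/apache-spark/1.5.1/libexec
PYTHONPATH=/usr/local/Cellar/apache-spark/1.5.1/libexec/python:/usr/local/Cellar/apache-spark/1.5.1/libexec/python/lib/py4j-0.8.2.1-src.zip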

Add the PySpark library to the interpreter path (required for code completion):

  1. Go to File -> Settings -> Project Interpreter
  2. Open the settings for the interpreter you want to use with Spark
  3. Edit the interpreter paths so they include the path to $SPARK_HOME/python (and the Py4J zip if required)
  4. Save the settings

Optionally

  1. Install or add to the path type annotations matching the installed Spark version to get better completion and static error detection (disclaimer - I am an author of the project).

Finally

Use the newly created configuration to run your script.
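
As an end-to-end check, a script along these lines (a minimal sketch; the numbers are arbitrary) should run cleanly under that configuration, since the run configuration supplies SPARK_HOME and PYTHONPATH:

from pyspark import SparkConf, SparkContext

# The import succeeds because PYTHONPATH points at $SPARK_HOME/python;
# the context starts because SPARK_HOME points at the Spark installation.
conf = SparkConf().setMaster("local[*]").setAppName("pycharm-test")
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(100))
print(rdd.sum())  # expected output: 4950

sc.stop()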


0 votes

I followed an online tutorial and added the env variables to .bashrc:

# add pyspark to python
export SPARK_HOME=/home/lolo/spark-1.6.1
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH

Then I took the values of SPARK_HOME and PYTHONPATH to give to pycharm:

(srz-reco)lolo@K:~$ echo $SPARK_HOME 
/home/lolo/spark-1.6.1
(srz-reco)lolo@K:~$ echo $PYTHONPATH
/home/lolo/spark-1.6.1/python/lib/py4j-0.9-src.zip:/home/lolo/spark-1.6.1/python/:/home/lolo/spark-1.6.1/python/lib/py4j-0.9-src.zip:/home/lolo/spark-1.6.1/python/:/python/lib/py4j-0.8.2.1-src.zip:/python/:

Then I copied them into Run/Debug Configurations -> Environment variables for the script.


-1 votes

The easiest way is:

Go to the site-packages folder of your anaconda/python installation and copy-paste the pyspark and pyspark.egg-info folders there.

Restart pycharm to update the index. The two folders mentioned above can be found in the spark/python folder of your spark installation. This way you'll also get code completion suggestions from pycharm.

The site-packages folder is easy to find in your python installation. Under anaconda it's at anaconda/lib/pythonx.x/site-packages.


33 votes

Here's how I solved this on mac osx.

  1. brew install apache-spark
  2. Add this to ~/.bash_profile:
     export SPARK_VERSION=`ls /usr/local/Cellar/apache-spark/ | sort | tail -1`
     export SPARK_HOME="/usr/local/Cellar/apache-spark/$SPARK_VERSION/libexec"
     export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
     export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
  3. Add pyspark and py4j to the content root (use the correct Spark version):
     /usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/py4j-0.9-src.zip
     /usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip



12 votes

Here is the setup that works for me (Win7 64bit, PyCharm 2017.3 CE).

Set up IntelliSense:

  1. Click File -> Settings -> Project: -> Project Interpreter
  2. Click the gear icon to the right of the Project Interpreter dropdown
  3. Click More... from the context menu
  4. Choose the interpreter, then click the "Show Paths" icon (bottom right)
  5. Click the + icon to add the following two paths:
     \python\lib\py4j-0.9-src.zip
     \bin\python\lib\pyspark.zip
  6. Click OK, OK, OK

Go ahead and test your new intellisense capabilities.


5 votes

Configuring pyspark in pycharm (windows):

File menu - settings - project interpreter - (gear icon) - more - (tree-below-funnel icon) - (+) - [add the python folder from the spark installation, then py4j-*.zip] - click ok

Make sure SPARK_HOME is set in the windows environment; pycharm will take it from there. To confirm:

Run menu - edit configurations - environment variables - [...] - show

Optionally, set SPARK_CONF_DIR in the environment variables.
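
As a quick sanity check (a sketch, not part of the original answer), you can print the variables from inside a script run by PyCharm to confirm they were picked up:

import os

# Should print your Spark installation directory, not None
print(os.environ.get("SPARK_HOME"))
# Optional; prints None if you did not set it
print(os.environ.get("SPARK_CONF_DIR"))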


4 votes

I used the following page as a reference and was able to get pyspark/Spark 1.6.1 (installed via homebrew) imported in PyCharm 5:

http://renien.com/blog/accessing-pyspark-pycharm/

import os
import sys

# Path for spark source folder
os.environ['SPARK_HOME']="/usr/local/Cellar/apache-spark/1.6.1"

# Append pyspark  to Python Path
sys.path.append("/usr/local/Cellar/apache-spark/1.6.1/libexec/python")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print ("Successfully imported Spark Modules")
except ImportError as e:
    print ("Can not import Spark Modules", e)
    sys.exit(1)

With the above, pyspark loads, but I get a gateway error when I try to create a SparkContext. There is some issue with the Spark from homebrew, so I just grabbed Spark from the Spark website (download the build pre-built for Hadoop 2.6 and later) and pointed to the spark and py4j directories under it. Here is the code in pycharm that works!

import os
import sys

# Path for spark source folder
os.environ['SPARK_HOME']="/Users/myUser/Downloads/spark-1.6.1-bin-hadoop2.6"

# Need to Explicitly point to python3 if you are using Python 3.x
os.environ['PYSPARK_PYTHON']="/usr/local/Cellar/python3/3.5.1/bin/python3"

#You might need to enter your local IP
#os.environ['SPARK_LOCAL_IP']="192.168.2.138"

#Path for pyspark and py4j
sys.path.append("/Users/myUser/Downloads/spark-1.6.1-bin-hadoop2.6/python")
sys.path.append("/Users/myUser/Downloads/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print ("Successfully imported Spark Modules")
except ImportError as e:
    print ("Can not import Spark Modules", e)
    sys.exit(1)

sc = SparkContext('local')
words = sc.parallelize(["scala","java","hadoop","spark","akka"])
print(words.count())

I got a lot of help from these instructions, which helped me troubleshoot in PyDev and then get it working in PyCharm - https://enahwe.wordpress.com/2015/11/25/how-to-configure-eclipse-for-developing-with-python-and-spark-on-hadoop/

I'm sure somebody has spent a few hours bashing their head against their monitor trying to get this working, so hopefully this helps save their sanity!


2 votes

I use conda to manage my Python packages. So all I did, in a terminal outside of PyCharm, was:

conda install pyspark

or, if you want an earlier version, say 2.2.0, then:

conda install pyspark=2.2.0

This also pulls in py4j automatically. PyCharm then no longer complained about import pyspark..., and code completion also worked. Note that my PyCharm project was already configured to use the Python interpreter that comes with Anaconda.
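
A short check that both packages landed in the conda environment PyCharm is using (a sketch; the version string in the comment is just an example):

import pyspark
import py4j  # installed automatically as a dependency of pyspark

print(pyspark.__version__)  # e.g. 2.2.0 if you pinned pyspark=2.2.0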


1 vote

Take a look at this video.

Assume your spark python directory is: /home/user/spark/python

Assume your Py4j source is: /home/user/spark/python/lib/py4j-0.9-src.zip

Basically, you add the spark python directory and the py4j directory within it to the interpreter paths. I don't have enough reputation to post a screenshot or I would.

In the video, the user creates a virtual environment within pycharm itself; however, you can create the virtual environment outside of pycharm, or activate a pre-existing one, then start pycharm with it and add those paths to the virtual environment's interpreter paths from within pycharm.

I had used other methods of adding spark via the bash environment variables, which works great outside of pycharm, but for some reason they weren't recognized within pycharm; this method, however, worked perfectly.


1 vote

You need to set PYTHONPATH and SPARK_HOME before you launch the IDE or Python.

On Windows, edit the environment variables and add the spark python and py4j paths:

PYTHONPATH=%PYTHONPATH%;{py4j};{spark python}

On Unix:

export PYTHONPATH=${PYTHONPATH}:{py4j}:{spark/python}
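
For example, assuming Spark is unpacked under /opt/spark (a hypothetical path) and ships py4j-0.10.7 under python/lib, as Spark 2.4 does (check the actual file name in your installation), the Unix version would read:

export SPARK_HOME=/opt/spark
export PYTHONPATH=${PYTHONPATH}:${SPARK_HOME}/python:${SPARK_HOME}/python/lib/py4j-0.10.7-src.zip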

0 votes

From the documentation:

To run Spark applications in Python, use the bin/spark-submit script located in the Spark directory. This script will load Spark's Java/Scala libraries and allow you to submit applications to a cluster. You can also use bin/pyspark to launch an interactive Python shell.

You are invoking your script directly with the CPython interpreter, which I think is what causes the problem.

Try running your script with:

"${SPARK_HOME}"/bin/spark-submit test_1.py

If that works, you should be able to make it work in PyCharm by setting the project's interpreter to spark-submit.
