I'm running a word-count task with PySpark on Windows. Here is my script:
from pyspark import SparkConf, SparkContext
import os

def main(input_file, output_dir):
    # Spark configuration
    conf = SparkConf().setAppName("WordCountTask").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    # Read the input file
    text_file = sc.textFile(input_file)

    # Word count
    counts = (
        text_file.flatMap(lambda line: line.split(" "))
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
    )

    # Save the results
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    counts.saveAsTextFile(output_dir)
    print(f"Results saved in the directory: {output_dir}")

if __name__ == "__main__":
    input_file = r"C:\Users\Documents\pyspark_python\wordcount\input\loremipsum.txt"
    output_dir = r"C:\Users\Documents\pyspark_python\wordcount\output"
    main(input_file, output_dir)
When I execute this script with Python 3.12 using the following command:
PS C:\Users\> & "C:/Program Files/Python312/python.exe" c:/Users/Documents/pyspark_python/wordcount/output/wordcount_task.py
I get the following error:
Traceback (most recent call last):
  File "c:\Users\Documents\pyspark_python\wordcount\output\wordcount_task.py", line 22, in <module>
    main(input_file, output_dir)
  File "c:\Users\Documents\pyspark_python\wordcount\output\wordcount_task.py", line 12, in main
    text_file = sc.textFile(input_file)
  File "path/to/pyspark/context.py", line XYZ, in textFile
    ...
PermissionError: [Errno 13] Permission denied: 'C:\\Users\\Documents\\pyspark_python\\wordcount\\input\\loremipsum.txt'
What could cause this PermissionError in PySpark?
Are specific permissions or Spark configuration settings required to handle this on Windows?
Many of us run Spark on Unix/Linux. On Windows, the file-permission model is different, so check the file's permissions there and adjust them based on guides you find online. You should try the following steps.
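As a first check (a minimal sketch, not specific to your setup; `check_readable` is a hypothetical helper, and the demo uses a temporary file in place of your real input path): confirm that plain Python can read the file at all. If this already fails, the problem is Windows file permissions, not Spark.

```python
import os
import tempfile

def check_readable(path):
    """Return True if the path is an existing file the current user can read."""
    return os.path.isfile(path) and os.access(path, os.R_OK)

# Demo with a temporary file; in your case, pass your actual input path,
# e.g. r"C:\Users\Documents\pyspark_python\wordcount\input\loremipsum.txt".
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("lorem ipsum")
    demo_path = f.name

print(check_readable(demo_path))  # True while the file exists and is readable
os.remove(demo_path)
```

If this prints False for your real path, fix the NTFS permissions (e.g. via the file's Properties > Security tab) before looking at Spark settings.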
Also, PySpark relies on a Hadoop-compatible file system layer to access local files. Make sure the Spark configuration for local file paths is set correctly:
conf = SparkConf().setAppName("WordCountTask").setMaster("local[*]")
conf.set("spark.hadoop.validateOutputSpecs", "false")
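Separately (a sketch, and not guaranteed to be the fix here): on Windows, Hadoop's local file system layer is often happier with an explicit `file:///` URI than with a bare drive-letter path. `pathlib` can build one without touching the OS; `to_file_uri` below is a hypothetical helper, not part of PySpark.

```python
from pathlib import PureWindowsPath

def to_file_uri(win_path):
    # Convert an absolute Windows path into a file:/// URI that
    # sc.textFile() / saveAsTextFile() can consume.
    return PureWindowsPath(win_path).as_uri()

input_file = to_file_uri(
    r"C:\Users\Documents\pyspark_python\wordcount\input\loremipsum.txt"
)
print(input_file)
# file:///C:/Users/Documents/pyspark_python/wordcount/input/loremipsum.txt
```

You would then pass the resulting URI to `sc.textFile(input_file)` exactly as before.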