I'm running a local Dockerized Spark 3.5.3 with Hadoop 3.3.4. I want to download binary files from a publicly available AWS S3 bucket, so I'm trying the following Python script:
import findspark
from pyspark.sql import SparkSession

findspark.init()

spark = SparkSession.builder \
    .appName("ReadBinaryFilesFromPublicS3") \
    .config("spark.hadoop.fs.s3a.access.key", "none") \
    .config("spark.hadoop.fs.s3a.secret.key", "none") \
    .config("spark.hadoop.fs.s3a.signature", "none") \
    .config("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com") \
    .getOrCreate()

binary_files_df = spark.read.format("binaryFile").load("s3a://my-bucket-name/path/to/files/*")
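Side note: I tried setting the signature to none because I had previously downloaded the files one by one successfully with the boto3 library, using the following code: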
s3_client = boto3.client("s3", config=Config(signature_version=UNSIGNED))
s3_client.download_file("my-bucket-name", "path/to/files/filename", my_local_filename)
But Spark returns the following error:
java.nio.file.AccessDeniedException: s3a://my-bucket-name/path/to/files/*: getFileStatus on s3a://my-bucket-name/path/to/files/*: com.amazonaws.services.s3.model.AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records. (Service: Amazon S3; Status Code: 403; Error Code: InvalidAccessKeyId; Request ID: DPKPZWH7HY8YDX61; S3 Extended Request ID: YoV3ws/nFtE06tg/9oU2SGHp5gabb2WDbOD/+UkK4h95QCjHGNhn1mnCNOl38btZePNFlsl00AInFBbVdC/Fkde+3W4Ne+DIJGawfY9DPS4=; Proxy: null), S3 Extended Request ID: YoV3ws/nFtE06tg/9oU2SGHp5gabb2WDbOD/+UkK4h95QCjHGNhn1mnCNOl38btZePNFlsl00AInFBbVdC/Fkde+3W4Ne+DIJGawfY9DPS4=:InvalidAccessKeyId
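Aside from the fact that I simply can't find anything in any documentation about how to connect to an open bucket: why do I get these errors if I can even access the files directly from a browser using an HTTPS URL? If I remove the access and secret key configuration lines: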
.config("spark.hadoop.fs.s3a.access.key", "none") \
.config("spark.hadoop.fs.s3a.secret.key", "none") \
or change the value "none" to "", I get an error saying they are missing, as follows:
py4j.protocol.Py4JJavaError: An error occurred while calling o53.load.
: java.nio.file.AccessDeniedException: s3a://my-bucket-name/path/to/files/*: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by TemporaryAWSCredentialsProvider SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider IAMInstanceCredentialsProvider : com.amazonaws.SdkClientException: Unable to load AWS credentials from environment variables (AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY))
Thanks in advance for any hints on how to solve this.
Use AnonymousAWSCredentialsProvider:
spark = (
    SparkSession.builder
    .appName("ReadBinaryFilesFromPublicS3")
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
    )
    .config("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com")
    .getOrCreate()
)
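If only one bucket is public and other buckets in the same job still need real credentials, S3A's per-bucket configuration (the `fs.s3a.bucket.<name>.<option>` pattern from the Hadoop S3A documentation) can scope the anonymous provider to that bucket. A minimal sketch, not a definitive setup: `anonymous_s3a_conf` and `build_session` are hypothetical helper names, and `my-bucket-name` is the placeholder bucket from the question.

```python
ANONYMOUS_PROVIDER = "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"


def anonymous_s3a_conf(bucket=None):
    """Build Spark config entries for unsigned S3A access.

    Hypothetical helper: with no bucket it sets the provider globally,
    otherwise it uses S3A per-bucket configuration for that bucket only.
    """
    if bucket is None:
        key = "spark.hadoop.fs.s3a.aws.credentials.provider"
    else:
        key = f"spark.hadoop.fs.s3a.bucket.{bucket}.aws.credentials.provider"
    return {key: ANONYMOUS_PROVIDER}


def build_session(app_name, conf):
    """Create a SparkSession with the given config entries applied."""
    # Imported lazily so the config helper above works without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Usage (requires pyspark and the hadoop-aws jars on the classpath):
# spark = build_session("ReadBinaryFilesFromPublicS3",
#                       anonymous_s3a_conf("my-bucket-name"))
# df = spark.read.format("binaryFile").load("s3a://my-bucket-name/path/to/files/*")
```

Keeping the config in a plain dict makes it easy to check which keys are being set before a session is created.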
And yes, from an authentication point of view, HTTP and S3 access are different.
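To make that difference concrete: a browser sends a plain, unsigned HTTPS GET, whereas S3A signs each request with whatever credentials it is given, and S3 rejects a request signed with an unknown key (the InvalidAccessKeyId error in the question) even when the bucket is public. The sketch below maps an `s3a://` URI to the virtual-hosted HTTPS URL a browser would typically request. `s3a_to_https` is a hypothetical helper name, and the default `s3.amazonaws.com` endpoint is an assumption.

```python
from urllib.parse import quote, urlparse


def s3a_to_https(uri, endpoint="s3.amazonaws.com"):
    """Map an s3a:// (or s3://) URI to a virtual-hosted-style HTTPS URL.

    Hypothetical helper for illustration; it does not contact S3.
    """
    parsed = urlparse(uri)
    if parsed.scheme not in ("s3", "s3a"):
        raise ValueError(f"not an S3 URI: {uri!r}")
    # Percent-encode the object key but keep the path separators.
    key = quote(parsed.path.lstrip("/"), safe="/")
    return f"https://{parsed.netloc}.{endpoint}/{key}"


print(s3a_to_https("s3a://my-bucket-name/path/to/files/filename"))
```

Fetching that URL carries no signature at all, which is what AnonymousAWSCredentialsProvider reproduces on the S3A side.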