I am trying to connect Amazon DynamoDB with PySpark, following this documentation: https://github.com/audienceproject/spark-dynamodb, but in the output I am not getting any data from the table. Here is my code for connecting DynamoDB with PySpark:
import os
from pyspark.sql import SparkSession
os.environ["AWS_ACCESS_KEY"] ="myAccessKey"
os.environ["AWS_SECRET_ACCESS_KEY"] = "mySecretKey"
print(os.environ["AWS_ACCESS_KEY"])
print(os.environ["AWS_SECRET_ACCESS_KEY"])
spark = SparkSession.builder \
    .appName("SQL Server to PySpark") \
    .config("spark.jars.packages", "com.audienceproject:spark-dynamodb_2.12:1.1.2") \
    .getOrCreate()
# Load a DataFrame from a Dynamo table. Only incurs the cost of a single scan for schema inference.
df = spark.read.option("tableName", "test_db") \
    .format("dynamodb") \
    .load()
df.show()
In the output I am getting this:
24/10/07 12:05:19 WARN Utils: Your hostname, arvind-ThinkPad-T480s resolves to a loopback address: 127.0.1.1; using 192.168.1.77 instead (on interface wlp61s0)
24/10/07 12:05:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
:: loading settings :: url = jar:file:/home/arvind/Desktop/pyspark/spark-3.5.0-bin-hadoop3/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/arvind/.ivy2/cache
The jars for the packages stored in: /home/arvind/.ivy2/jars
com.audienceproject#spark-dynamodb_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-5377cf76-fb30-4e81-9f86-88e9a7474777;1.0
confs: [default]
found com.audienceproject#spark-dynamodb_2.12;1.1.2 in central
found com.amazonaws#aws-java-sdk-sts;1.11.678 in central
found com.amazonaws#aws-java-sdk-core;1.11.678 in central
found commons-logging#commons-logging;1.1.3 in central
found org.apache.httpcomponents#httpclient;4.5.9 in central
found org.apache.httpcomponents#httpcore;4.4.11 in central
found commons-codec#commons-codec;1.11 in central
found software.amazon.ion#ion-java;1.0.2 in central
found com.fasterxml.jackson.core#jackson-databind;2.6.7.3 in central
found com.fasterxml.jackson.core#jackson-annotations;2.6.0 in central
found com.fasterxml.jackson.core#jackson-core;2.6.7 in central
found com.fasterxml.jackson.dataformat#jackson-dataformat-cbor;2.6.7 in central
found joda-time#joda-time;2.8.1 in central
found com.amazonaws#jmespath-java;1.11.678 in central
found com.amazonaws#aws-java-sdk-dynamodb;1.11.678 in central
found com.amazonaws#aws-java-sdk-s3;1.11.678 in central
found com.amazonaws#aws-java-sdk-kms;1.11.678 in central
found org.slf4j#slf4j-api;1.7.25 in central
:: resolution report :: resolve 602ms :: artifacts dl 28ms
:: modules in use:
com.amazonaws#aws-java-sdk-core;1.11.678 from central in [default]
com.amazonaws#aws-java-sdk-dynamodb;1.11.678 from central in [default]
com.amazonaws#aws-java-sdk-kms;1.11.678 from central in [default]
com.amazonaws#aws-java-sdk-s3;1.11.678 from central in [default]
com.amazonaws#aws-java-sdk-sts;1.11.678 from central in [default]
com.amazonaws#jmespath-java;1.11.678 from central in [default]
com.audienceproject#spark-dynamodb_2.12;1.1.2 from central in [default]
com.fasterxml.jackson.core#jackson-annotations;2.6.0 from central in [default]
com.fasterxml.jackson.core#jackson-core;2.6.7 from central in [default]
com.fasterxml.jackson.core#jackson-databind;2.6.7.3 from central in [default]
com.fasterxml.jackson.dataformat#jackson-dataformat-cbor;2.6.7 from central in [default]
commons-codec#commons-codec;1.11 from central in [default]
commons-logging#commons-logging;1.1.3 from central in [default]
joda-time#joda-time;2.8.1 from central in [default]
org.apache.httpcomponents#httpclient;4.5.9 from central in [default]
org.apache.httpcomponents#httpcore;4.4.11 from central in [default]
org.slf4j#slf4j-api;1.7.25 from central in [default]
software.amazon.ion#ion-java;1.0.2 from central in [default]
:: evicted modules:
commons-logging#commons-logging;1.2 by [commons-logging#commons-logging;1.1.3] in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 19 | 0 | 0 | 1 || 18 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-5377cf76-fb30-4e81-9f86-88e9a7474777
confs: [default]
0 artifacts copied, 18 already retrieved (0kB/16ms)
24/10/07 12:05:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/10/07 12:05:31 WARN V2ScanPartitioningAndOrdering: Spark ignores the partitioning OutputPartitioning. Please use KeyGroupedPartitioning for better performance
++
| |
++
++
Why am I running into this problem? Is it a driver issue? Also, if there is another way to connect, please explain it.
Try adding the S3 access key and S3 secret key to the Spark session configuration:
# Configure the Spark session
conf = SparkSession.builder \
    .appName("SQL Server to PySpark") \
    .config("spark.jars.packages", "com.audienceproject:spark-dynamodb_2.12:1.1.2") \
    .config("spark.hadoop.fs.s3a.access.key", os.environ["AWS_ACCESS_KEY"]) \
    .config("spark.hadoop.fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])

spark = conf.getOrCreate()
Also check that the AWS IAM role attached to the service you are running has all the required access permissions.
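As for another way to connect: below is a minimal sketch using boto3 to scan the table from the driver and hand the items to Spark. This is my own suggestion, not part of the spark-dynamodb library, and it is only practical for tables small enough to scan in a single process:

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DynamoDB via boto3").getOrCreate()

# Scan the whole table, following pagination via LastEvaluatedKey.
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")  # assumption: your region
table = dynamodb.Table("test_db")

items = []
response = table.scan()
items.extend(response["Items"])
while "LastEvaluatedKey" in response:
    response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
    items.extend(response["Items"])

# Let Spark infer a schema from the returned dictionaries.
df = spark.createDataFrame(items)
df.show()

This also helps separate credential problems from connector problems: if the boto3 scan comes back empty too, the issue is the table, the region, or the IAM permissions rather than the Spark connector.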