How do I integrate DynamoDB with PySpark or Databricks?


I am trying to connect Amazon DynamoDB with PySpark, and I am following this documentation to do it: https://github.com/audienceproject/spark-dynamodb. But in the output I am not getting any data from the table. Here is my code for connecting DynamoDB with PySpark:

import os

from pyspark.sql import SparkSession

# Set AWS credentials as environment variables for the AWS SDK
os.environ["AWS_ACCESS_KEY"] = "myAccessKey"
os.environ["AWS_SECRET_ACCESS_KEY"] = "mySecretKey"
print(os.environ["AWS_ACCESS_KEY"])
print(os.environ["AWS_SECRET_ACCESS_KEY"])
spark = SparkSession.builder \
    .appName("SQL Server to PySpark") \
    .config("spark.jars.packages","com.audienceproject:spark-dynamodb_2.12:1.1.2") \
    .getOrCreate()

# Load a DataFrame from a Dynamo table. Only incurs the cost of a single scan for schema inference.
df = spark.read.option("tableName", "test_db") \
    .format("dynamodb") \
    .load()
df.show()

In the output I am getting this:
24/10/07 12:05:19 WARN Utils: Your hostname, arvind-ThinkPad-T480s resolves to a loopback address: 127.0.1.1; using 192.168.1.77 instead (on interface wlp61s0)
24/10/07 12:05:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
:: loading settings :: url = jar:file:/home/arvind/Desktop/pyspark/spark-3.5.0-bin-hadoop3/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/arvind/.ivy2/cache
The jars for the packages stored in: /home/arvind/.ivy2/jars
com.audienceproject#spark-dynamodb_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-5377cf76-fb30-4e81-9f86-88e9a7474777;1.0
        confs: [default]
        found com.audienceproject#spark-dynamodb_2.12;1.1.2 in central
        found com.amazonaws#aws-java-sdk-sts;1.11.678 in central
        found com.amazonaws#aws-java-sdk-core;1.11.678 in central
        found commons-logging#commons-logging;1.1.3 in central
        found org.apache.httpcomponents#httpclient;4.5.9 in central
        found org.apache.httpcomponents#httpcore;4.4.11 in central
        found commons-codec#commons-codec;1.11 in central
        found software.amazon.ion#ion-java;1.0.2 in central
        found com.fasterxml.jackson.core#jackson-databind;2.6.7.3 in central
        found com.fasterxml.jackson.core#jackson-annotations;2.6.0 in central
        found com.fasterxml.jackson.core#jackson-core;2.6.7 in central
        found com.fasterxml.jackson.dataformat#jackson-dataformat-cbor;2.6.7 in central
        found joda-time#joda-time;2.8.1 in central
        found com.amazonaws#jmespath-java;1.11.678 in central
        found com.amazonaws#aws-java-sdk-dynamodb;1.11.678 in central
        found com.amazonaws#aws-java-sdk-s3;1.11.678 in central
        found com.amazonaws#aws-java-sdk-kms;1.11.678 in central
        found org.slf4j#slf4j-api;1.7.25 in central
:: resolution report :: resolve 602ms :: artifacts dl 28ms
        :: modules in use:
        com.amazonaws#aws-java-sdk-core;1.11.678 from central in [default]
        com.amazonaws#aws-java-sdk-dynamodb;1.11.678 from central in [default]
        com.amazonaws#aws-java-sdk-kms;1.11.678 from central in [default]
        com.amazonaws#aws-java-sdk-s3;1.11.678 from central in [default]
        com.amazonaws#aws-java-sdk-sts;1.11.678 from central in [default]
        com.amazonaws#jmespath-java;1.11.678 from central in [default]
        com.audienceproject#spark-dynamodb_2.12;1.1.2 from central in [default]
        com.fasterxml.jackson.core#jackson-annotations;2.6.0 from central in [default]
        com.fasterxml.jackson.core#jackson-core;2.6.7 from central in [default]
        com.fasterxml.jackson.core#jackson-databind;2.6.7.3 from central in [default]
        com.fasterxml.jackson.dataformat#jackson-dataformat-cbor;2.6.7 from central in [default]
        commons-codec#commons-codec;1.11 from central in [default]
        commons-logging#commons-logging;1.1.3 from central in [default]
        joda-time#joda-time;2.8.1 from central in [default]
        org.apache.httpcomponents#httpclient;4.5.9 from central in [default]
        org.apache.httpcomponents#httpcore;4.4.11 from central in [default]
        org.slf4j#slf4j-api;1.7.25 from central in [default]
        software.amazon.ion#ion-java;1.0.2 from central in [default]
        :: evicted modules:
        commons-logging#commons-logging;1.2 by [commons-logging#commons-logging;1.1.3] in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   19  |   0   |   0   |   1   ||   18  |   0   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-5377cf76-fb30-4e81-9f86-88e9a7474777
        confs: [default]
        0 artifacts copied, 18 already retrieved (0kB/16ms)
24/10/07 12:05:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/10/07 12:05:31 WARN V2ScanPartitioningAndOrdering: Spark ignores the partitioning OutputPartitioning. Please use KeyGroupedPartitioning for better performance

++
||
++
++
Why am I facing this issue? Is it a driver problem? Also, if there is another way to make this connection, please explain it.

1 Answer

Try adding the S3 access key and S3 secret key in the Spark session configuration:

import os
from pyspark.sql import SparkSession

# Configure the Spark session with the AWS credentials
spark = SparkSession.builder \
    .appName("SQL Server to PySpark") \
    .config("spark.jars.packages", "com.audienceproject:spark-dynamodb_2.12:1.1.2") \
    .config("spark.hadoop.fs.s3a.access.key", os.environ["AWS_ACCESS_KEY"]) \
    .config("spark.hadoop.fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"]) \
    .getOrCreate()

Also check whether the AWS IAM role attached to the service you are running has all the required permissions.
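
One way to separate a permissions or region problem from a connector problem is to scan the same table with plain boto3 under the same credentials; a quick sanity-check sketch, with the region name again a placeholder:

import boto3

# If this scan also returns nothing, or raises AccessDeniedException /
# ResourceNotFoundException, the problem lies with the credentials,
# region, or table name rather than with the Spark connector.
client = boto3.client("dynamodb", region_name="us-east-1")
resp = client.scan(TableName="test_db", Limit=5)
print(resp["Count"], resp["Items"])

If the boto3 scan returns items while the Spark read stays empty, it is also worth checking version compatibility: spark-dynamodb 1.1.2 was built against an older Spark 3.x release, so a mismatch with Spark 3.5 is plausible as well.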
