Reading a config file in AWS Glue


I created a Glue Dev Endpoint to test my code before deploying it to AWS Glue. Below is a screenshot of the project layout; config.ini lives inside gluelibrary. I was able to debug the code successfully and have it run to completion. This is how I call the library in the DEV environment:

Dev ENV

import sys
import os
import time
from configobj import ConfigObj
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import boto3

config = ConfigObj('/home/glue/scripts/gluelibrary/config.ini')

This run successfully finds all the variables defined in the config file and finishes with exit code 0.

Console

Note: the library I developed is packaged as a .zip, added to an S3 bucket, and the Glue Job is told where to find the .zip.
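For reference, a minimal sketch of how a zipped library like this can be attached to a Glue job with boto3. The role and job names are placeholders, not taken from the original setup; only the bucket/key and region echo what appears in the job-run output further down.

import boto3

glue = boto3.client('glue', region_name='us-east-2')

# Placeholder names: adjust the role, job name, and script location to your setup.
glue.create_job(
    Name='JOB_NAME',
    Role='MyGlueServiceRole',
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://BUCKET_NAME/admin/JOB_NAME',
    },
    DefaultArguments={
        # Glue downloads the zip and puts it on the Python path of the job.
        '--extra-py-files': 's3://BUCKET_NAME/Python/gluelibrary.zip',
    },
)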

However, when I run the same code (apart from the file path) from the Glue Console, I get an error:

import sys
import os
import time
from configobj import ConfigObj
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import boto3

from gluelibrary.helpers import get_date
from gluelibrary import
from gluelibrary.boto3_.s3_utils import delete_data_in_sub_directories, check_for_empty_bucket
from gluelibrary.boto3_.s3_utils import replace_data_in_sub_directories, check_bucket_existence
print('starting job.')

print(os.getcwd())

config = ConfigObj('/home/glue/gluelibrary/config.ini')

--conf spark.hadoop.yarn.resourcemanager.connect.max-wait.ms=60000 --conf spark.hadoop.fs.defaultFS=hdfs://IP_ADDRESS.internal:8020 --conf spark.hadoop.yarn.resourcemanager.address=IP_ADDRESS.internal:8032 --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.minExecutors=1 --conf spark.dynamicAllocation.maxExecutors=18 --conf spark.executor.memory=5g --conf spark.executor.cores=4 --JOB_ID j_26c2ab188a2d8b7567006809c549f5894333cd38f191f58ae1f2258475ed03d1 --enable-metrics --extra-py-files s3://BUCKET_NAME/Python/gluelibrary.zip --JOB_RUN_ID jr_0292d34a8b82dad6872f5ee0cae5b3e6d0b1fbc503dca8a62993ea0f3b38a2ae --scriptLocation s3://BUCKET_NAME/admin/JOB_NAME --job-bookmark-option job-bookmark-enable --job-language python --TempDir s3://BUCKET_NAME/admin --JOB_NAME JOB_NAME

YARN_RM_DNS=IP_ADDRESS.internal
Detected region us-east-2
JOB_NAME=JOB_NAME
Specifying us-east-2 while copying script.
Completed 6.6 KiB/6.6 KiB (70.9 KiB/s) with 1 file(s) remaining
download: s3://BUCKET_NAME/admin/JOB_NAME to ./script_2018-10-12-14-57-20.py
SCRIPT_URL = /tmp/g-6cad80fb460992d2c24a6f476b12275d2a9bc164-362894612904031505/script_2018-10-12-14-57-20.py

aws-glue
1 Answer

If you need to access extra files from within a Glue job, you have to:

  1. Copy each file to a location on S3 that Glue has access to
  2. Include the full S3 key of each file, comma-separated, in your job's extra-files special parameter

Glue then adds these files to the --files spark-submit parameter, and you should be able to access them from your Spark job as if they were in the working directory.
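As a hedged illustration of those two steps, the config file could be uploaded and the job started like this. The bucket, key, and local path are assumptions, not values from the question:

import boto3

s3 = boto3.client('s3')
glue = boto3.client('glue')

# Step 1: copy config.ini to an S3 location the Glue job's role can read.
s3.upload_file('gluelibrary/config.ini', 'BUCKET_NAME', 'Python/config.ini')

# Step 2: pass its full S3 key through the --extra-files special parameter.
glue.start_job_run(
    JobName='JOB_NAME',
    Arguments={
        '--extra-files': 's3://BUCKET_NAME/Python/config.ini',
    },
)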

In your case, you should be able to simply do:

config = ConfigObj("config.ini")
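If it helps, here is a small sketch of what the job script can then do once Glue has copied config.ini into the working directory; the section and key names are invented for illustration only:

from configobj import ConfigObj

# config.ini is resolved relative to the job's working directory,
# where Glue placed it thanks to --extra-files.
config = ConfigObj("config.ini")

# 'paths' and 'output_bucket' are placeholder names, not from the original file.
output_bucket = config['paths']['output_bucket']
print('writing results to', output_bucket)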