I'm building an AWS Lambda function in Python that simply converts a JSON file to Parquet. I set a 5-minute timeout and invoked the function. The Parquet file is created successfully, but the function still times out. When I debug it, everything looks fine until the to_parquet line, where the program gets stuck. It runs perfectly in my local environment. Does anyone have suggestions on how to fix this?
Python function
import awswrangler as wr
import pandas as pd
import urllib.parse
import os

# Temporary hard-coded AWS settings; to be set as environment variables in Lambda
os_input_s3_cleansed_layer = os.environ['s3_cleansed_layer']
os_input_glue_catalog_db_name = os.environ['glue_catalog_db_name']
os_input_glue_catalog_table_name = os.environ['glue_catalog_table_name']
os_input_write_data_operation = os.environ['write_data_operation']


def lambda_handler(event, context):
    print('## EVENT')
    print(event)

    # Get the object from the event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')

    try:
        # Create a DataFrame from the file content
        df_raw = wr.s3.read_json('s3://{}/{}'.format(bucket, key))
        print('## DF RAW')
        print(df_raw.shape)

        # Extract the required columns
        df_step_1 = pd.json_normalize(df_raw['items'])
        print('## DF CLEANED')
        print(df_step_1.shape)

        # Write to S3 as a Glue-cataloged Parquet dataset
        wr_response = wr.s3.to_parquet(
            df=df_step_1,
            path=os_input_s3_cleansed_layer,
            dataset=True,
            database=os_input_glue_catalog_db_name,
            table=os_input_glue_catalog_table_name,
            mode=os_input_write_data_operation
        )

        print('## RESPONSE')
        return wr_response
    except Exception as e:
        print(e)
        print('Error getting object {} from bucket {}. Make sure they exist and your bucket is in the same region as this function.'.format(key, bucket))
        raise e
AWS Lambda output
Test Event Name
s3-put
Response
{
"errorMessage": "2024-07-31T07:11:36.885Z 7a24ab3f-13c8-4284-980f-6b705d64273a Task timed out after 307.11 seconds"
}
Function Logs
START RequestId: 7a24ab3f-13c8-4284-980f-6b705d64273a Version: $LATEST
## EVENT
{'Records': [{'eventVersion': '2.0', 'eventSource': 'aws:s3', 'awsRegion': 'us-east-1', 'eventTime': '1970-01-01T00:00:00.000Z', 'eventName': 'ObjectCreated:Put', 'userIdentity': {'principalId': 'EXAMPLE'}, 'requestParameters': {'sourceIPAddress': '127.0.0.1'}, 'responseElements': {'x-amz-request-id': 'EXAMPLE123456789', 'x-amz-id-2': 'EXAMPLE123/5678abcdefghijklambdaisawesome/mnopqrstuvwxyzABCDEFGH'}, 's3': {'s3SchemaVersion': '1.0', 'configurationId': 'testConfigRule', 'bucket': {'name': 'youtubepipeline-raw-useast1-dev', 'ownerIdentity': {'principalId': 'EXAMPLE'}, 'arn': 'arn:aws:s3:::youtubepipeline-raw-useast1-dev'}, 'object': {'key': 'youtube/raw_statistics_reference_data/CA_category_id.json', 'size': 1024, 'eTag': '0123456789abcdef0123456789abcdef', 'sequencer': '0A1B2C3D4E5F678901'}}}]}
## DF RAW
(31, 3)
## DF CLEANED
(31, 6)
2024-07-31T07:11:36.885Z 7a24ab3f-13c8-4284-980f-6b705d64273a Task timed out after 307.11 seconds
END RequestId: 7a24ab3f-13c8-4284-980f-6b705d64273a
REPORT RequestId: 7a24ab3f-13c8-4284-980f-6b705d64273a Duration: 307106.33 ms Billed Duration: 300000 ms Memory Size: 128 MB Max Memory Used: 128 MB Init Duration: 4043.97 ms
Request ID
7a24ab3f-13c8-4284-980f-6b705d64273a
I'm looking for suggestions on how to resolve this issue.
You likely need to increase the memory size, because the allocated memory equals the maximum memory used shown in the output:

Memory Size: 128 MB Max Memory Used: 128 MB

When a Lambda exhausts its allocation, memory-heavy steps like the pyarrow serialization inside to_parquet can slow to a crawl rather than fail outright, which matches the hang you're seeing. Note also that Lambda CPU share scales with memory, so a bigger allocation speeds the conversion up as well.
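
You can raise the limit in the console (Configuration > General configuration), or, if you prefer to script it, here is a minimal sketch using boto3; the function name below is a placeholder, substitute your own:

import boto3

lambda_client = boto3.client('lambda')

# Raise the memory allocation; CPU share scales with memory on Lambda,
# so this also speeds up the pandas/pyarrow work inside to_parquet.
response = lambda_client.update_function_configuration(
    FunctionName='json-to-parquet',  # placeholder: use your function's real name
    MemorySize=512                   # up from 128 MB; valid range is 128-10240 MB
)
print(response['MemorySize'])

Start with 512 MB, re-run your test event, and check the REPORT line again; if Max Memory Used still equals Memory Size, keep increasing it.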