I have a Lambda function that uses word_tokenize, which needs the punkt data files to run. This is the error thrown when I sam build and sam local invoke my Lambda function:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt/PY3/english.pickle
Searched in:
- '/root/nltk_data'
- '/var/lang/nltk_data'
- '/var/lang/share/nltk_data'
- '/var/lang/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
**********************************************************************
...more errors (see below for details)
Please help!
I have already run sam init.
What I have done:
sam build
sam local invoke plag_check --event events/event.json
For plag_check, I created a folder "plag_check" containing app.py.
plag_check/app.py:
import json
from nltk.tokenize import word_tokenize

def lambda_handler(event, context):
    text = "Hello, how are you doing?"
    tokens = word_tokenize(text)
    print(tokens)
    return {
        "statusCode": 200,
        "body": json.dumps({
            "message": "inbound",
            "text": text,
            "tokens": tokens,
        }),
    }
My SAM project has a folder "nltkDataFilesLayer". Its contents:
- nltkDataFilesLayer/
  - nltk_data/
    - corpora/
      - wordnet.zip
    - taggers/
      - averaged_perceptron_tagger/
        - averaged_perceptron_tagger.zip
    - tokenizers/
      - punkt/
        - punkt.zip
It contains only the nltk_data generated by nltk.download on my local machine.
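For reference, this is roughly how a layer folder like that can be populated locally (a minimal sketch; the paths are illustrative and assume the layout above):

import nltk

# Download the data sets straight into the layer's nltk_data folder
# so they ship inside the layer artifact.
nltk.download("punkt", download_dir="nltkDataFilesLayer/nltk_data")
nltk.download("wordnet", download_dir="nltkDataFilesLayer/nltk_data")
nltk.download("averaged_perceptron_tagger", download_dir="nltkDataFilesLayer/nltk_data")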
My template.yaml:
AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31
Description: >
  plag1
  Sample SAM Template for plag1

Globals:
  Function:
    Timeout: 60
    MemorySize: 512

Resources:
  nltkDataFilesLayer:
    Type: AWS::Serverless::LayerVersion
    Properties:
      LayerName: nltkDataFilesLayer
      Description: Data files for nltk
      ContentUri: ./nltkDataFilesLayer
      CompatibleRuntimes:
        - python3.9
  plag_check:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: plag_check/
      Handler: app.lambda_handler
      Runtime: python3.9
      Layers:
        - !Ref nltkDataFilesLayer
      Architectures:
        - x86_64
sam build
No problems here.
Then I ran
sam local invoke plag_check --event events/event.json
Error!
[ERROR] LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt/PY3/english.pickle
Searched in:
- '/root/nltk_data'
- '/var/lang/nltk_data'
- '/var/lang/share/nltk_data'
- '/var/lang/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
**********************************************************************
  File "/var/task/nltk/data.py", line 583, in find
    raise LookupError(resource_not_found)
END RequestId: 188fc3e0-f33c-4f40-be51-a6a631ac53b7
REPORT RequestId: 188fc3e0-f33c-4f40-be51-a6a631ac53b7 Init Duration: 0.56 ms Duration: 6051.22 ms Billed Duration: 6052 ms Memory Size: 512 MB Max Memory Used: 512 MB
{"errorMessage": "\n**********************************************************************\n Resource \u001b[93mpunkt\u001b[0m not found.\n Please use the NLTK Downloader to obtain the resource:\n\n \u001b[31m>>> import nltk\n >>> nltk.download('punkt')\n \u001b[0m\n For more information see: https://www.nltk.org/data.html\n\n Attempted to load \u001b[93mtokenizers/punkt/PY3/english.pickle\u001b[0m\n\n Searched in:\n - '/root/nltk_data'\n - '/var/lang/nltk_data'\n - '/var/lang/share/nltk_data'\n - '/var/lang/lib/nltk_data'\n - '/usr/share/nltk_data'\n - '/usr/local/share/nltk_data'\n - '/usr/lib/nltk_data'\n - '/usr/local/lib/nltk_data'\n - ''\n**********************************************************************\n", "errorType": "LookupError", "requestId": "188fc3e0-f33c-4f40-be51-a6a631ac53b7", "stackTrace": [" File \"/var/task/app.py\", line 8, in lambda_handler\n tokens = word_tokenize(text)\n", " File \"/var/task/nltk/tokenize/__init__.py\", line 129, in word_tokenize\n sentences = [text] if preserve_line else sent_tokenize(text, language)\n", " File \"/var/task/nltk/tokenize/__init__.py\", line 106, in sent_tokenize\n tokenizer = load(f\"tokenizers/punkt/{language}.pickle\")\n", " File \"/var/task/nltk/data.py\", line 750, in load\n opened_resource = _open(resource_url)\n", " File \"/var/task/nltk/data.py\", line 876, in _open\n return find(path_, path + [\"\"]).open()\n", " File \"/var/task/nltk/data.py\", line 583, in find\n raise LookupError(resource_not_found)\n"]}
Finally found the solution. It was that simple!
Add
nltk.data.path.append("/opt/nltk_data")
to the lambda_handler of my plag_check function.
New plag_check code:
import json
import nltk  # needed for nltk.data.path
from nltk.tokenize import word_tokenize

def lambda_handler(event, context):
    nltk.data.path.append("/opt/nltk_data")  # this one line fixed it!
    text = "Hello, how are you doing?"
    tokens = word_tokenize(text)
    print(tokens)
    return {
        "statusCode": 200,
        "body": json.dumps({
            "message": "inbound",
            "text": text,
            "tokens": tokens,
        }),
    }
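A small variant (my own refinement, not strictly needed): since nltk.data.path is only consulted when a resource is loaded, the append can also live at module level, so it runs once per cold start instead of appending a duplicate entry on every invocation:

import json
import nltk
from nltk.tokenize import word_tokenize

# Register the layer mount point once, at import time (cold start).
nltk.data.path.append("/opt/nltk_data")

def lambda_handler(event, context):
    text = "Hello, how are you doing?"
    tokens = word_tokenize(text)
    return {
        "statusCode": 200,
        "body": json.dumps({"message": "inbound", "text": text, "tokens": tokens}),
    }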
Note: I put "/opt/" in front of nltk_data.
I figured this out after reading Accessing layer content from your function (an AWS article), which says:
If your Lambda function includes layers, Lambda extracts the layer contents into the /opt directory in the function execution environment.
This is one of the important things to know when using layers: all layer content ends up under the /opt folder. That is just how AWS does things in Lambda.
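If you want to see this for yourself, a throwaway debug snippet (my own addition, not from the article) can be dropped into the handler to print what actually landed under /opt:

import os

# Walk the layer mount point and list every directory with its files;
# tokenizers/punkt should show up here if the layer was attached.
for root, dirs, files in os.walk("/opt/nltk_data"):
    print(root, files)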
From the article:
Lambda extracts the layers in the order (low to high) listed by the function. Lambda merges folders with the same name, so if the same file appears in multiple layers, the function uses the version from the last extracted layer.
Another approach: redirect NLTK to download the missing data into the /tmp directory, which is writable in Lambda (note that the download then happens on every cold start, adding some latency):

import os
import nltk

# /tmp is writable in the Lambda execution environment
nltk.data.path.append("/tmp/nltk_data")

# Download the necessary packages on first use
if not os.path.exists("/tmp/nltk_data/"):
    nltk.download("punkt", download_dir="/tmp/nltk_data")
    nltk.download("averaged_perceptron_tagger", download_dir="/tmp/nltk_data")

# Proceed with tokenization
from nltk.tokenize import sent_tokenize

text = "This is a sample text. It has multiple sentences."
sentences = sent_tokenize(text)
print(sentences)
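For what it's worth, the two approaches can be combined. A hypothetical helper (my own sketch, not from either answer) can check the read-only layer under /opt first and only download into /tmp when the resource is genuinely missing, which keeps warm starts fast:

import nltk

def ensure_punkt():
    # Search the layer first, then the writable fallback location.
    nltk.data.path.append("/opt/nltk_data")
    nltk.data.path.append("/tmp/nltk_data")
    try:
        nltk.data.find("tokenizers/punkt")  # raises LookupError if absent
    except LookupError:
        # Cold-start download into /tmp, the only writable path in Lambda.
        nltk.download("punkt", download_dir="/tmp/nltk_data")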