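I'm developing a Django application, part of which transcribes audio with timestamps. When a user clicks a button in the web interface, the Django server launches a Python script that performs the transcription.
Here is what I've tried so far: I have a separate transcribe.py file. When the user clicks the transcribe button on the web page, the request hits a view in the project's app. However, after the script has partially run, the Django server terminates in the terminal.
Since the Python script is a long-running process, I figured I should run it in the background so that the Django server wouldn't terminate. So I set up Celery and Redis. First, when I run the transcribe.py script from the Django shell, it runs perfectly. But when I try to execute it from the view/web page, it terminates again.
python manage.py shell
Now that I've implemented the Celery worker part, the server no longer terminates, but the worker throws the following error.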
[tasks]
. transcribeApp.tasks.run_transcription
[2024-11-25 03:26:04,500: INFO/MainProcess] Connected to redis://localhost:6379/0
[2024-11-25 03:26:04,514: INFO/MainProcess] mingle: searching for neighbors
[2024-11-25 03:26:05,520: INFO/MainProcess] mingle: all alone
[2024-11-25 03:26:05,544: INFO/MainProcess] [email protected] ready.
[2024-11-25 03:26:16,253: INFO/MainProcess] Task searchApp.tasks.run_transcription[c684bdfa-ec21-4b4e-9542-0ca1f7729682] received
[2024-11-25 03:26:16,255: INFO/ForkPoolWorker-15] Starting transcription process.
[2024-11-25 03:26:16,509: WARNING/ForkPoolWorker-15] /Users/user/Desktop/project/django_app/django_venv/lib/python3.12/site-packages/whisper/__init__.py:150: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
checkpoint = torch.load(fp, map_location=device)
[2024-11-25 03:26:16,670: ERROR/MainProcess] Process 'ForkPoolWorker-15' pid:38956 exited with 'signal 11 (SIGSEGV)'
[2024-11-25 03:26:16,683: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 11 (SIGSEGV) Job: 0.')
Traceback (most recent call last):
File "/Users/user/Desktop/project/django_app/django_venv/lib/python3.12/site-packages/billiard/pool.py", line 1265, in mark_as_worker_lost
raise WorkerLostError(
billiard.einfo.ExceptionWithTraceback:
"""
Traceback (most recent call last):
File "/Users/user/Desktop/project/django_app/django_venv/lib/python3.12/site-packages/billiard/pool.py", line 1265, in mark_as_worker_lost
raise WorkerLostError(
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 11 (SIGSEGV) Job: 0.
"""
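The implementation looks like this:
# views.py
from . import tasks
from django.shortcuts import render
from django.http import HttpResponse, JsonResponse

def trainVideos(request):
    try:
        # Queue the transcription on the Celery worker and return immediately.
        tasks.run_transcription.delay()
        return JsonResponse({"status": "success", "message": "Transcription has started check back later."})
        # return render(request, 'embed.html', {'data': data})
    except Exception as e:
        # A view must return a response on every path.
        return JsonResponse({"status": "error", "message": str(e)})
This is what the transcription function looks like; the Celery worker throws the "Worker exited prematurely" error when running it.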
# Add one or two audio files (e.g. .wav or .mp3) to a folder,
# and provide the folder path here.
# transcribe.py
import whisper_timestamped as whisper
import os

def transcribeTexts(model_id, filePath):
    result = []
    # List the files in the given folder, not the current working directory.
    fileNames = os.listdir(filePath)
    model = whisper.load_model(model_id)
    for files in fileNames:
        audioPath = filePath + "/" + files
        audio = whisper.load_audio(audioPath)
        result.append(model.transcribe(audio, language="en"))
    return result

model_id = "tiny"
audioFilePath = "path/to/audio"
transcribeTexts(model_id, audioFilePath)
Install the following libraries to reproduce the problem:
pip install openai-whisper
pip3 install whisper-timestamped
pip install Django
pip install celery redis
pip install redis-server
Celery implementation (celery.py in the project's main_app directory):
from __future__ import absolute_import, unicode_literals
import os
from celery import Celery

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'main_app.settings')
app = Celery('main_app')
# Read every CELERY_* setting from Django's settings.py.
app.config_from_object('django.conf:settings', namespace='CELERY')
app.autodiscover_tasks()

# bind=True passes the task instance as `self`.
@app.task(bind=True)
def debug_tasks(self):
    print(f"Request: {self.request!r}")
tasks.py in the transcribe_app directory:
from __future__ import absolute_import, unicode_literals
from . import transcribe
from celery import shared_task

@shared_task
def run_transcription():
    transcribe.transcribe()
    return "Transcription Completed..."
settings.py was also updated with the following:
CELERY_BROKER_URL = 'redis://localhost:6379/0'
CELERY_BROKER_CONNECTION_RETRY_ON_STARTUP = True
I also modified the __init__.py file in django_app:
from __future__ import absolute_import, unicode_literals
from .celery import app as celery_app
__all__ = ('celery_app',)
Some libraries in this application depend on specific versions. All installed libraries and packages are listed below:
Package Version
-------------------- -----------
amqp 5.3.1
asgiref 3.8.1
billiard 4.2.1
celery 5.4.0
certifi 2024.8.30
charset-normalizer 3.3.2
click 8.1.7
click-didyoumean 0.3.1
click-plugins 1.1.1
click-repl 0.3.0
Cython 3.0.11
Django 5.1.2
django-widget-tweaks 1.5.0
dtw-python 1.5.3
faiss-cpu 1.9.0
ffmpeg 1.4
filelock 3.16.1
fsspec 2024.9.0
huggingface-hub 0.25.2
idna 3.10
Jinja2 3.1.4
kombu 5.4.2
lfs 0.2
llvmlite 0.43.0
MarkupSafe 3.0.1
more-itertools 10.5.0
mpmath 1.3.0
msgpack 1.1.0
networkx 3.3
numba 0.60.0
numpy 2.0.2
packaging 24.1
panda 0.3.1
pillow 10.4.0
pip 24.3.1
prompt_toolkit 3.0.48
pydub 0.25.1
python-dateutil 2.9.0.post0
PyYAML 6.0.2
redis 5.2.0
regex 2024.9.11
requests 2.32.3
safetensors 0.4.5
scipy 1.14.1
semantic-version 2.10.0
setuptools 75.1.0
setuptools-rust 1.10.2
six 1.16.0
sqlparse 0.5.1
sympy 1.13.3
tiktoken 0.8.0
tokenizers 0.20.1
torch 2.4.1
torchaudio 2.4.1
torchvision 0.19.1
tqdm 4.66.5
transformers 4.45.2
txtai 7.4.0
typing_extensions 4.12.2
tzdata 2024.2
urllib3 2.2.3
vine 5.1.0
wcwidth 0.2.13
whisper-timestamped 1.15.4
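Overall, the program runs fine when I run it on its own. But inside Django, it terminates no matter how I execute it. I thought one reason might be that I was loading long audio files, so I split them into chunks and tried running transcribe.py from the user interface; however, that also ended with "Worker exited prematurely: signal 11 (SIGSEGV) Job: 0." I tried increasing the worker's pool size, but that didn't help. I'm not sure what is needed to run the transcribe.py file from Django, since most of the known approaches haven't worked for me. I may be missing something, so please help me solve this. Thank you for your time.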
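I was able to recreate your code, and it works fine on my side. I'll walk you through how I recreated it; one possible reason for the SIGSEGV error you mentioned is a typo or a small mistake somewhere in your setup.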
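To make the "signal 11 (SIGSEGV)" line in the worker log more concrete, here is a minimal, POSIX-only sketch of how a parent process sees a child that dies from a segmentation fault. It is not Celery's code: the child script and the run_crashing_child helper are made up for illustration, and the crash is simulated by the child sending SIGSEGV to itself. The child also enables the standard-library faulthandler module, which writes a Python traceback to stderr when such a signal arrives.

```python
import signal
import subprocess
import sys

# Hypothetical child program: enable faulthandler, then send SIGSEGV to
# itself to simulate a crash inside native code (illustration only).
CHILD_CODE = """
import faulthandler, os, signal
faulthandler.enable()
os.kill(os.getpid(), signal.SIGSEGV)
"""


def run_crashing_child():
    """Run the child in a separate interpreter; return (returncode, stderr)."""
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        capture_output=True,
        text=True,
    )
    return completed.returncode, completed.stderr


returncode, stderr = run_crashing_child()

# On POSIX, a return code of -N means the child was killed by signal N.
if returncode < 0:
    print("child was killed by", signal.Signals(-returncode).name)

# Whatever faulthandler wrote before the process died.
print(stderr)
```

Calling faulthandler.enable() at the top of tasks.py might likewise show which call the worker was executing when it crashed, although I have not verified that against your setup.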
django-admin startproject project101
cd project101
python3 manage.py startapp app101
project101/urls.py:
from django.contrib import admin
from django.urls import path, include

urlpatterns = [
    path('admin/', admin.site.urls),
    path('', include("app101.urls"))
]
project101/settings.py:
INSTALLED_APPS = [
    # ...
    'app101'
]

# put this at the end of settings.py
CELERY_BROKER_URL = 'redis://localhost:6379/0'
CELERY_BROKER_CONNECTION_RETRY_ON_STARTUP = True
project101/celery.py:
from __future__ import absolute_import, unicode_literals
import os
from celery import Celery

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'project101.settings')
app = Celery('project101')
# Read every CELERY_* setting from Django's settings.py.
app.config_from_object('django.conf:settings', namespace='CELERY')
app.autodiscover_tasks()

# bind=True passes the task instance as `self`.
@app.task(bind=True)
def debug_tasks(self):
    print(f"Request: {self.request!r}")
project101/__init__.py:
from __future__ import absolute_import, unicode_literals
from .celery import app as celery_app

__all__ = ('celery_app',)
app101/views.py:
from . import tasks
from django.shortcuts import render
from django.http import HttpResponse, JsonResponse

def trainVideos(request):
    try:
        # Queue the transcription on the Celery worker and return immediately.
        tasks.run_transcription.delay()
        return JsonResponse({"status": "success", "message": "Transcription has started check back later."})
        # return render(request, 'embed.html', {'data': data})
    except Exception as e:
        # A view must return a response on every path.
        return JsonResponse({"status": "error", "message": str(e)})
app101/urls.py:
from django.urls import path, include
from . import views

urlpatterns = [
    path('transcribe', views.trainVideos)
]
app101/tasks.py:
from __future__ import absolute_import, unicode_literals
from . import transcribe
from celery import shared_task

@shared_task
def run_transcription():
    transcribe.transcribe()
    return "Transcription Completed..."
app101/transcribe.py:
import whisper_timestamped as whisper
import os

def transcribeTexts(model_id, audio_directory_path):
    result = []
    fileNames = os.listdir(audio_directory_path)
    model = whisper.load_model(model_id)
    for files in fileNames:
        print(files)
        audioPath = audio_directory_path + "/" + files
        audio = whisper.load_audio(audioPath)
        result.append(model.transcribe(audio, language="en"))
    print(result)
    return result

def transcribe():
    model_id = "tiny"
    audio_directory_path = 'audio_sample'
    transcribeTexts(model_id, audio_directory_path)
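Note that audio_sample is a folder outside app101, at the same level as app101 and project101. You can put it in another folder, but make sure you specify the correct directory path. I've added the directory structure below.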
.
├── app101
│   ├── admin.py
│   ├── apps.py
│   ├── __init__.py
│   ├── migrations
│   ├── models.py
│   ├── __pycache__
│   ├── tasks.py
│   ├── tests.py
│   ├── transcribe.py
│   ├── urls.py
│   └── views.py
├── audio_sample
│   └── some_audio.mp3
├── db.sqlite3
├── manage.py
└── project101
    ├── asgi.py
    ├── celery.py
    ├── __init__.py
    ├── __pycache__
    ├── settings.py
    ├── urls.py
    └── wsgi.py
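Because 'audio_sample' is a relative path, it is resolved against whatever directory the worker was started from. A minimal sketch of a stricter lookup follows. The list_audio_files helper and the AUDIO_EXTENSIONS set are hypothetical names of my own, not part of whisper_timestamped. The helper fails loudly if the folder is missing and skips anything that is not an audio file.

```python
from pathlib import Path

# Extensions treated as audio in this sketch (an assumption; extend as needed).
AUDIO_EXTENSIONS = {".wav", ".mp3"}


def list_audio_files(audio_directory):
    """Return the audio files in audio_directory as sorted absolute paths."""
    directory = Path(audio_directory).resolve()
    if not directory.is_dir():
        # Fail early with a clear message instead of deep inside the task.
        raise FileNotFoundError(f"Audio directory not found: {directory}")
    return sorted(
        path
        for path in directory.iterdir()
        if path.is_file() and path.suffix.lower() in AUDIO_EXTENSIONS
    )
```

Inside app101/transcribe.py, the folder could then be located relative to the file itself, for example list_audio_files(Path(__file__).resolve().parent.parent / "audio_sample"), so the result no longer depends on the directory the worker was launched from.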
After this, run the following commands in separate terminals:
python3 manage.py runserver
celery -A project101 worker --pool=solo -l info
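The --pool=solo flag makes the worker run each task in its own main process, one at a time, instead of in forked child processes such as the ForkPoolWorker-15 that crashed in the question's log. If you would rather not pass the flag every time, the same choice can probably be made in settings.py. This is a sketch that assumes the namespace='CELERY' configuration shown earlier, under which Celery's worker_pool setting is spelled CELERY_WORKER_POOL:

```python
# project101/settings.py (sketch): default to the solo pool so that
# `celery -A project101 worker -l info` behaves like `--pool=solo`.
CELERY_WORKER_POOL = "solo"
```

A pool given on the command line should still take precedence over this setting.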
This will get your project up and running. But note the following points: