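I am building an Azure Data Lake storage integration for Label Studio. The Django backend resolves cloud storage URIs into HTTP addresses of presigned objects such as images, which lets the backend read every file regardless of the cloud storage provider. However, I can't manage to create presigned URLs for Azure Data Lake. Any suggestions?
The test that should pass looks like this: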

    - name: stage
      request:
        method: GET
        url: '{django_live_url}/api/projects/{project_pk}/next'
      response:
        json:
          data:
            image_url: !re_match "/tasks/\\d+/presign/\\?fileuri=YXp1cmUtYmxvYjovL3B5dGVzdC1henVyZS1pbWFnZXMvYWJj"
        status_code: 200

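As a side note on what the `image_url` pattern encodes: the `fileuri` value looks like Base64, so decoding it shows which storage URI the backend is expected to wrap in its presign path. The sketch below uses only the standard library; `presign_path` is a hypothetical helper written for illustration and is not Label Studio's own function.

```python
import base64
import re

# Pattern copied from the test's image_url matcher, with the YAML escapes removed.
EXPECTED = r"/tasks/\d+/presign/\?fileuri=YXp1cmUtYmxvYjovL3B5dGVzdC1henVyZS1pbWFnZXMvYWJj"


def decode_fileuri(encoded: str) -> str:
    """Decode the Base64 payload carried in the fileuri query parameter."""
    return base64.b64decode(encoded).decode("utf-8")


def presign_path(task_id: int, storage_uri: str) -> str:
    """Hypothetical helper: build a path of the shape the test expects."""
    encoded = base64.b64encode(storage_uri.encode("utf-8")).decode("ascii")
    return f"/tasks/{task_id}/presign/?fileuri={encoded}"


decoded = decode_fileuri(EXPECTED.split("fileuri=", 1)[1])
print(decoded)  # azure-blob://pytest-azure-images/abc
print(re.fullmatch(EXPECTED, presign_path(42, decoded)) is not None)  # True
```

So the test expects a relative presign path whose `fileuri` is the Base64 of an `azure-blob://` URI, not a raw storage URI in the response.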
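The test above fails: the endpoint returns "azure-spi://
This means I am not resolving the URI the way the backend expects. The problem is that I don't know whether Azure Data Lake supports HTTP URIs with service principal authentication. See the resolve_uri function I tried in the class below: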

    class AzureServicePrincipalImportStorageBase(AzureServicePrincipalStorageMixin, ImportStorage):
        url_scheme = 'azure_spi'

        presign = models.BooleanField(_('presign'), default=True, help_text='Generate presigned URLs')
        presign_ttl = models.PositiveSmallIntegerField(
            _('presign_ttl'), default=1, help_text='Presigned URLs TTL (in minutes)'
        )

        (…)

        def generate_http_url(self, url):
            match = re.match(AZURE_URL_PATTERN, url)
            if match:
                match_dict = match.groupdict()
                sas_token = self.get_sas_token(match_dict['blob_name'])
                url = f"{self.get_account_url()}/{self.container}/{match_dict['blob_name']}?{sas_token}"
            return url

        (…)

        def resolve_uri(self, uri, task=None):
            # list of objects
            if isinstance(uri, list):
                resolved = []
                for item in uri:
                    result = self.resolve_uri(item, task)
                    resolved.append(result if result else item)
                return resolved

            # dict of objects
            elif isinstance(uri, dict):
                resolved = {}
                for key in uri.keys():
                    result = self.resolve_uri(uri[key], task)
                    resolved[key] = result if result else uri[key]
                return resolved

            elif isinstance(uri, str):
                try:
                    # extract uri first from task data
                    if self.presign and task is not None:
                        sig = urlparse(uri)
                        if sig.query != '':
                            return uri

                    # resolve uri to url using storages
                    http_url = self.generate_http_url(uri)
                    return http_url
                except Exception:
                    logger.info(f"Can't resolve URI={uri}", exc_info=True)

        class Meta:
            abstract = True

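To resolve HTTP URIs for Azure Data Lake in your Label Studio integration with Django, you need to make sure that presigned URLs carrying a SAS token are generated correctly for the files stored in Azure Data Lake. Follow these steps:
Set up an Azure Data Lake client by creating a client class that talks to Azure Data Lake from Python: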
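
    from datetime import datetime, timedelta, timezone

    from azure.storage.filedatalake import DataLakeServiceClient, generate_file_sas


    class AzureDataLakeClient:
        def __init__(self, account_name, credential):
            self.account_name = account_name
            self.service_client = DataLakeServiceClient(
                account_url=f"https://{account_name}.dfs.core.windows.net", credential=credential
            )

        def get_file_url(self, file_system_name, file_path):
            file_system_client = self.service_client.get_file_system_client(file_system_name)
            file_client = file_system_client.get_file_client(file_path)
            return file_client.url  # URL of the file, without a SAS token

        def generate_sas_token(self, file_system_name, file_path, permission="r", expiry=None):  # add required permissions
            # With a service principal there is no account key, so the SAS is signed with a user
            # delegation key (the principal needs a role that allows requesting one, e.g. Storage Blob Delegator).
            start = datetime.now(timezone.utc)
            expiry = expiry or start + timedelta(minutes=1)
            delegation_key = self.service_client.get_user_delegation_key(key_start_time=start, key_expiry_time=expiry)
            directory_name, _, file_name = file_path.rpartition("/")
            sas_token = generate_file_sas(
                self.account_name, file_system_name, directory_name, file_name,
                credential=delegation_key, permission=permission, expiry=expiry,
            )
            return f"{self.get_file_url(file_system_name, file_path)}?{sas_token}"
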
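Before a client like the one above can sign anything, the storage URI from the task data has to be split into a file system and a file path. Here is a minimal sketch of that step, assuming URIs of the form `azure-spi://<file_system>/<path>`. `URI_PATTERN`, `split_storage_uri` and `build_signed_url` are hypothetical names, since the real `AZURE_URL_PATTERN` is not shown in the question, and the SAS token is passed in as a plain string so the example runs without the Azure SDK.

```python
import re
from typing import Optional, Tuple
from urllib.parse import quote

# Hypothetical pattern; accepts both the azure-spi and azure_spi spellings seen in the question.
URI_PATTERN = re.compile(r"^azure[-_]spi://(?P<file_system>[^/]+)/(?P<file_path>.+)$")


def split_storage_uri(uri: str) -> Optional[Tuple[str, str, str]]:
    """Return (file_system, directory_name, file_name), or None if the URI does not match."""
    match = URI_PATTERN.match(uri)
    if match is None:
        return None
    directory_name, _, file_name = match["file_path"].rpartition("/")
    return match["file_system"], directory_name, file_name


def build_signed_url(account_name: str, uri: str, sas_token: str) -> Optional[str]:
    """Join account URL, file path and an already generated SAS token into one HTTP URL."""
    parts = split_storage_uri(uri)
    if parts is None:
        return None
    path = "/".join(part for part in parts if part)  # skip an empty directory name
    return f"https://{account_name}.dfs.core.windows.net/{quote(path)}?{sas_token}"


print(build_signed_url("myaccount", "azure-spi://images/cats/1.jpg", "sv=placeholder"))
# https://myaccount.dfs.core.windows.net/images/cats/1.jpg?sv=placeholder
```

Returning `None` for a URI that does not match makes the failure visible to the caller instead of silently handing back the original `azure-spi://` string.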
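Integrate the Azure Data Lake client into the Django model. resolve_uri can keep its structure; generate_http_url is the place where the SAS URL from the client would be returned: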

    def resolve_uri(self, uri, task=None):
        if isinstance(uri, list):
            resolved = []
            for item in uri:
                result = self.resolve_uri(item, task)
                resolved.append(result if result else item)
            return resolved
        elif isinstance(uri, dict):
            resolved = {}
            for key in uri.keys():
                result = self.resolve_uri(uri[key], task)
                resolved[key] = result if result else uri[key]
            return resolved
        elif isinstance(uri, str):
            try:
                if self.presign and task is not None:
                    sig = urlparse(uri)
                    if sig.query != '':
                        return uri
                http_url = self.generate_http_url(uri)
                return http_url
            except Exception as e:
                logger.info(f"Can't resolve URI={uri}", exc_info=True)
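
To see how the recursion behaves without a Django model or an Azure account, here is a simplified standalone variant of resolve_uri. It drops the `self.presign` and `task` checks and takes the URL generator as a parameter; `fake_signer` is a hypothetical stand-in for generate_http_url that returns a placeholder signature, so treat this as a sketch and not as the storage class itself.

```python
import logging
from typing import Any, Callable, Optional
from urllib.parse import urlparse

logger = logging.getLogger(__name__)


def resolve_uri(uri: Any, to_http_url: Callable[[str], Optional[str]]) -> Any:
    """Recursively resolve storage URIs inside lists, dicts and strings."""
    if isinstance(uri, list):
        # Keep the original item whenever it cannot be resolved.
        return [resolve_uri(item, to_http_url) or item for item in uri]
    if isinstance(uri, dict):
        return {key: resolve_uri(value, to_http_url) or value for key, value in uri.items()}
    if isinstance(uri, str):
        try:
            if urlparse(uri).query != "":
                return uri  # already carries a query string, so treat it as signed
            return to_http_url(uri)
        except Exception:
            logger.info("Can't resolve URI=%s", uri, exc_info=True)
    return None


def fake_signer(uri: str) -> Optional[str]:
    """Hypothetical stand-in for generate_http_url."""
    prefix = "azure-spi://"
    if not uri.startswith(prefix):
        return None
    return f"https://example.invalid/{uri[len(prefix):]}?sig=placeholder"


task_data = {"image": "azure-spi://c/a.jpg", "tags": ["azure-spi://c/b.jpg", "plain text"]}
print(resolve_uri(task_data, fake_signer))
```

Values that the signer does not recognise, such as `"plain text"`, come back unchanged, which is the same fallback the `result if result else item` lines implement.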