Resolving HTTP URIs for Azure Data Lake


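I am building an Azure Data Lake storage integration for Label Studio. The Django backend resolves cloud storage URIs into presigned HTTP URLs for objects such as images, which lets it read every file regardless of the cloud storage provider. However, I cannot manage to generate presigned URLs for Azure Data Lake. Any suggestions?

The test that should pass looks like this: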

- name: stage
  request:
    method: GET
    url: '{django_live_url}/api/projects/{project_pk}/next'
  response:
    json:
      data:
        image_url: !re_match "/tasks/\\d+/presign/\\?fileuri=YXp1cmUtYmxvYjovL3B5dGVzdC1henVyZS1pbWFnZXMvYWJj"
    status_code: 200

The test above fails: the endpoint returns "azure-spi://" instead of "/tasks/\d+/presign/?fileuri=YX..."
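
For reference, the `fileuri` value in the expected pattern is plain URL-safe base64. Below is a minimal sketch that decodes it and rebuilds a path of the shape the test matches. The names `decode_fileuri` and `build_presign_path` and the task id are illustrative assumptions, not Label Studio's implementation:

```python
import base64
import re

# Pattern copied from the test (YAML "\\d" and "\\?" unescape to "\d" and "\?").
EXPECTED = r"/tasks/\d+/presign/\?fileuri=YXp1cmUtYmxvYjovL3B5dGVzdC1henVyZS1pbWFnZXMvYWJj"


def decode_fileuri(encoded: str) -> str:
    """Decode the base64 payload carried in the fileuri query parameter."""
    return base64.urlsafe_b64decode(encoded).decode("utf-8")


def build_presign_path(task_id: int, uri: str) -> str:
    """Hypothetical helper: build a path of the shape the test expects."""
    encoded = base64.urlsafe_b64encode(uri.encode("utf-8")).decode("ascii")
    return f"/tasks/{task_id}/presign/?fileuri={encoded}"


storage_uri = decode_fileuri("YXp1cmUtYmxvYjovL3B5dGVzdC1henVyZS1pbWFnZXMvYWJj")
print(storage_uri)  # azure-blob://pytest-azure-images/abc
print(bool(re.match(EXPECTED, build_presign_path(42, storage_uri))))  # True
```

The decoded value uses the `azure-blob://` scheme, so the test as written expects an `azure-blob://` storage URI wrapped in a presign path, not an `azure-spi://` one.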

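This means I am not resolving the URI the way the backend expects. The problem is that I do not know whether Azure Data Lake supports HTTP URIs with service principal authentication. See the resolve_uri function I tried in the class below: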

 
class AzureServicePrincipalImportStorageBase(AzureServicePrincipalStorageMixin, ImportStorage):
    url_scheme = 'azure_spi'

    presign = models.BooleanField(_('presign'), default=True, help_text='Generate presigned URLs')
    presign_ttl = models.PositiveSmallIntegerField(
        _('presign_ttl'), default=1, help_text='Presigned URLs TTL (in minutes)'
    )
(…)
    def generate_http_url(self, url):
        match = re.match(AZURE_URL_PATTERN, url)
        if match:
            match_dict = match.groupdict()
            sas_token = self.get_sas_token(match_dict['blob_name'])
            url = f"{self.get_account_url()}/{self.container}/{match_dict['blob_name']}?{sas_token}"
        return url
(…)
    def resolve_uri(self, uri, task=None):
        # list of objects
        if isinstance(uri, list):
            resolved = []
            for item in uri:
                result = self.resolve_uri(item, task)
                resolved.append(result if result else item)
            return resolved

        # dict of objects
        elif isinstance(uri, dict):
            resolved = {}
            for key in uri.keys():
                result = self.resolve_uri(uri[key], task)
                resolved[key] = result if result else uri[key]
            return resolved
        elif isinstance(uri, str):
            try:
                # extract uri first from task data
                if self.presign and task is not None:
                    sig = urlparse(uri)
                    if sig.query != '':
                        return uri
                # resolve uri to url using storages
                http_url = self.generate_http_url(uri)
                return http_url

            except Exception:
                logger.info(f"Can't resolve URI={uri}", exc_info=True)

    class Meta:
        abstract = True

django pytest azure-data-lake label-studio
1 Answer

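To resolve HTTP URIs for Azure Data Lake in a Label Studio integration with Django, you need to make sure that presigned URLs carrying a SAS token are generated correctly for the files stored in Azure Data Lake. Follow these steps: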

Set up an Azure Data Lake client by creating a client class that talks to Azure Data Lake from Python:


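from datetime import datetime, timedelta, timezone

from azure.storage.filedatalake import DataLakeServiceClient, FileSasPermissions, generate_file_sas


class AzureDataLakeClient:
    def __init__(self, account_name, credential):
        # credential: a token credential, e.g. azure.identity.ClientSecretCredential
        self.account_name = account_name
        self.service_client = DataLakeServiceClient(
            account_url=f"https://{account_name}.dfs.core.windows.net", credential=credential
        )

    def get_file_url(self, file_system_name, file_path):
        file_system_client = self.service_client.get_file_system_client(file_system_name)
        return file_system_client.get_file_client(file_path).url  # file URL without a SAS token

    def generate_sas_token(self, file_system_name, file_path, expiry=None):
        # Returns the file URL with a read-only SAS appended. With a token credential there is
        # no account key, so the SAS is signed with a user delegation key (this assumes the
        # service principal has an RBAC role that allows requesting one).
        now = datetime.now(timezone.utc)
        expiry = expiry or now + timedelta(minutes=1)
        delegation_key = self.service_client.get_user_delegation_key(now, expiry)
        directory_name, _, file_name = file_path.rpartition("/")
        sas_token = generate_file_sas(
            self.account_name, file_system_name, directory_name, file_name,
            credential=delegation_key, permission=FileSasPermissions(read=True), expiry=expiry,
        )
        return f"{self.get_file_url(file_system_name, file_path)}?{sas_token}"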
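
The `expiry` argument above is left to the caller. One way to derive it, assuming the `presign_ttl` field from the question's model (a TTL in minutes), is sketched below. `sas_expiry` is a hypothetical helper, not part of any SDK:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def sas_expiry(presign_ttl_minutes: int, now: Optional[datetime] = None) -> datetime:
    """Return the UTC timestamp at which a presigned URL should stop working."""
    if presign_ttl_minutes <= 0:
        raise ValueError("presign_ttl_minutes must be positive")
    # Use a timezone-aware "now" so the expiry is unambiguous UTC.
    now = now or datetime.now(timezone.utc)
    return now + timedelta(minutes=presign_ttl_minutes)


# Example: a 1-minute TTL, matching the default in the question's model.
print(sas_expiry(1).isoformat())
```

Passing `now` explicitly keeps the helper easy to test. In the model it would be called with `self.presign_ttl`.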



Integrate the Azure Data Lake client into the Django model:

    def resolve_uri(self, uri, task=None):
        # list of URIs: resolve each item, keeping the original when resolution fails
        if isinstance(uri, list):
            resolved = []
            for item in uri:
                result = self.resolve_uri(item, task)
                resolved.append(result if result else item)
            return resolved
        # dict of URIs: resolve each value the same way
        elif isinstance(uri, dict):
            resolved = {}
            for key in uri.keys():
                result = self.resolve_uri(uri[key], task)
                resolved[key] = result if result else uri[key]
            return resolved
        elif isinstance(uri, str):
            try:
                if self.presign and task is not None:
                    sig = urlparse(uri)
                    # the URI already carries a query string (for example a SAS token)
                    if sig.query != '':
                        return uri
                # resolve the storage URI to an HTTP URL
                http_url = self.generate_http_url(uri)
                return http_url
            except Exception:
                logger.info(f"Can't resolve URI={uri}", exc_info=True)
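
The method above still depends on `generate_http_url`. A minimal sketch of that step follows, assuming storage URIs of the form `azure-spi://<file system>/<path>`. The names `URL_SCHEME`, `split_storage_uri`, `make_sas_url`, and `fake_signer` are illustrative, and the signing function is injected so the sketch runs without Azure credentials:

```python
from typing import Callable, Optional, Tuple
from urllib.parse import urlsplit

URL_SCHEME = "azure-spi"  # assumption: hyphenated, as in the URI the endpoint returned


def split_storage_uri(uri: str) -> Optional[Tuple[str, str]]:
    """Split 'azure-spi://<file system>/<path>' into (file_system, path)."""
    parts = urlsplit(uri)
    if parts.scheme != URL_SCHEME or not parts.netloc:
        return None
    return parts.netloc, parts.path.lstrip("/")


def generate_http_url(uri: str, make_sas_url: Callable[[str, str], str]) -> Optional[str]:
    """Resolve a storage URI to a signed HTTP URL, or None if it is not ours."""
    parsed = split_storage_uri(uri)
    if parsed is None:
        return None
    file_system, path = parsed
    return make_sas_url(file_system, path)


def fake_signer(file_system: str, path: str) -> str:
    # Stand-in for a real SAS generator; the account name and token are placeholders.
    return f"https://example.dfs.core.windows.net/{file_system}/{path}?sig=placeholder"


print(generate_http_url("azure-spi://images/cats/1.png", fake_signer))
print(urlsplit("azure_spi://images/cats/1.png").scheme == "")  # True
```

The last line shows a detail worth checking in the question's class: `urllib.parse` only accepts letters, digits, `+`, `-`, and `.` in a scheme, so a URI written with the underscore form `azure_spi://` parses with an empty scheme, netloc, and query.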

