Azure .NET SDK - Querying paths with DataLakeFileSystemClient using a regular expression

Question

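I want to use DataLakeFileSystemClient to query paths by regular expression. Unfortunately, the only approach I have found so far is to enumerate every path under a prefix and then check each item against the regex afterwards. Is there a better way?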

private static async IAsyncEnumerable<string> TraverseDirectories(DataLakeFileSystemClient fileSystemClient, 
    string directoryPath, string filePattern, [EnumeratorCancellation] CancellationToken cancellationToken)
{
    cancellationToken.ThrowIfCancellationRequested();
    // List all paths (files and directories) under directoryPath, recursively
    await foreach (PathItem pathItem in fileSystemClient.GetPathsAsync(directoryPath, recursive: true, cancellationToken: cancellationToken))
    {
        cancellationToken.ThrowIfCancellationRequested();
        // Skip directories; only files are matched
        if (pathItem.IsDirectory.HasValue && pathItem.IsDirectory.Value)
            continue;

        // Match the file's full path against the regex pattern
        if (Regex.IsMatch(pathItem.Name, filePattern))
        {
            yield return pathItem.Name;
        }
    }
}
c# .net azure azure-data-lake
1 Answer


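According to the Microsoft documentation, the DataLakeFileSystemClient API does not currently support querying paths by regular expression directly. Instead, the API narrows a listing through parameters such as a path prefix (the directory passed to GetPathsAsync).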

However, traversing the directory and filtering the results client-side with a regular expression is a valid workaround.
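Since the listing can only be scoped by a path, one way to reduce the amount of client-side work is to derive the narrowest directory that every match must live in and pass that to GetPathsAsync. The following is a minimal sketch, not part of the SDK: ListingScope and LiteralDirectoryPrefix are hypothetical names, and it assumes the pattern is used with default RegexOptions (case-sensitive, no IgnorePatternWhitespace). It is deliberately conservative and returns an empty string whenever it cannot be sure.

```csharp
using System;

public static class ListingScope
{
    // Characters that end the run of literal text at the start of a pattern.
    private const string MetaChars = @"\.[](){}*+?|^$";

    // Returns the directory part of the literal text that every match of an
    // anchored pattern must start with, or "" when nothing can be derived.
    public static string LiteralDirectoryPrefix(string pattern)
    {
        // Only a pattern anchored with ^ and free of alternation pins the start of the path.
        if (string.IsNullOrEmpty(pattern) || pattern[0] != '^' || pattern.IndexOf('|') >= 0)
            return string.Empty;

        // Advance past plain literal characters following the anchor.
        int end = 1;
        while (end < pattern.Length && MetaChars.IndexOf(pattern[end]) < 0)
            end++;

        // A quantifier applies to the character before it, so that character is not guaranteed.
        if (end < pattern.Length && "*+?{".IndexOf(pattern[end]) >= 0 && end > 1)
            end--;

        // Keep only whole directory segments of the literal text.
        string literal = pattern.Substring(1, end - 1);
        int lastSlash = literal.LastIndexOf('/');
        return lastSlash <= 0 ? string.Empty : literal.Substring(0, lastSlash);
    }
}
```

For a pattern such as ^backup/2024/.*\.csv$ this yields backup/2024, which could be used as the directoryPath argument while the full regex is still applied to each listed item. For an unanchored pattern, or one like ^.*\.csv$, it yields an empty string and the listing falls back to whatever directory the caller already had.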

Code:

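using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Threading;
using System.Threading.Tasks;
using Azure.Identity;
using Azure.Storage.Files.DataLake;
using Azure.Storage.Files.DataLake.Models;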


namespace DataLakeRegexExample
{
    public class DataLakeFileSearcher
    {
        private readonly DataLakeFileSystemClient _fileSystemClient;

        public DataLakeFileSearcher(DataLakeFileSystemClient fileSystemClient)
        {
            _fileSystemClient = fileSystemClient;
        }

        public async IAsyncEnumerable<string> TraverseDirectoriesAsync(
            string directoryPath,
            string filePattern,
            [System.Runtime.CompilerServices.EnumeratorCancellation] CancellationToken cancellationToken = default)
        {
            cancellationToken.ThrowIfCancellationRequested();

            // List all paths (files and directories) under directoryPath, recursively
            await foreach (PathItem pathItem in _fileSystemClient.GetPathsAsync(directoryPath, recursive: true, cancellationToken: cancellationToken))
            {
                cancellationToken.ThrowIfCancellationRequested();
                if (pathItem.IsDirectory.HasValue && pathItem.IsDirectory.Value)
                    continue;
                if (Regex.IsMatch(pathItem.Name, filePattern))
                {
                    yield return pathItem.Name; // Yield matching file names
                }
            }
        }

        public static async Task Main(string[] args)
        {
            string accountName = "xxxx"; // Replace with your account name
            string fileSystemName = "xxx"; // Replace with your file system name
            string directoryPath = "xxxx"; // Replace with your directory path
            string filePattern = @"^.*\.csv$"; // Sample regex that matches only .csv files
            var serviceClient = new DataLakeServiceClient(
                new Uri($"https://{accountName}.dfs.core.windows.net"),
                new DefaultAzureCredential()
            );
            var fileSystemClient = serviceClient.GetFileSystemClient(fileSystemName);
            var searcher = new DataLakeFileSearcher(fileSystemClient);

            try
            {
                await foreach (var fileName in searcher.TraverseDirectoriesAsync(directoryPath, filePattern))
                {
                    Console.WriteLine(fileName);
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine($"Error occurred: {ex.Message}");
            }
        }
    }
}

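The code above defines a DataLakeFileSearcher class that uses DataLakeFileSystemClient to traverse a directory recursively and match each file's full path (PathItem.Name) against a regular expression. Because regex queries are not supported on the service side, this approach filters the paths on the client after they have been listed.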
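The client-side filtering step can also be separated from the Azure call, which makes it testable without a storage account. The sketch below is one possible shape, with PathFilter as a hypothetical helper name: it parses the pattern once in the constructor, so an invalid regex fails before any listing starts, and it sets a match timeout, which is worth having if the pattern comes from user input.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public sealed class PathFilter
{
    private readonly Regex _regex;

    public PathFilter(string pattern, TimeSpan matchTimeout)
    {
        // Parsing here surfaces an invalid pattern as an ArgumentException up front.
        // A match that exceeds matchTimeout throws RegexMatchTimeoutException.
        _regex = new Regex(pattern, RegexOptions.CultureInvariant, matchTimeout);
    }

    // Tests a single path, e.g. PathItem.Name inside the await foreach loop.
    public bool IsMatch(string path) => _regex.IsMatch(path);

    // Lazily yields only the paths that match the pattern.
    public IEnumerable<string> Filter(IEnumerable<string> paths)
    {
        foreach (string path in paths)
        {
            if (IsMatch(path))
                yield return path;
        }
    }
}
```

In TraverseDirectoriesAsync, a PathFilter instance could replace the Regex.IsMatch(pathItem.Name, filePattern) call by checking filter.IsMatch(pathItem.Name) instead; the listing logic itself stays unchanged.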

Output:

backup/file1.csv
backup/large_file.csv

