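I want to use DataLakeFileSystemClient to query paths with a regular expression. Unfortunately, the only way I have found so far is to iterate over every path under a prefix and then check each item against the regular expression afterwards. Is there a better way?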
private static async IAsyncEnumerable<string> TraverseDirectories(DataLakeFileSystemClient fileSystemClient,
    string directoryPath, string filePattern, [EnumeratorCancellation] CancellationToken cancellationToken)
{
    cancellationToken.ThrowIfCancellationRequested();
    // List all paths (files and directories) under the given directory, recursively
    await foreach (PathItem pathItem in fileSystemClient.GetPathsAsync(directoryPath, recursive: true, cancellationToken: cancellationToken))
    {
        cancellationToken.ThrowIfCancellationRequested();
        // Skip directories; only files are of interest
        if (pathItem.IsDirectory.HasValue && pathItem.IsDirectory.Value)
            continue;
        // Match files against the regular-expression pattern
        if (Regex.IsMatch(pathItem.Name, filePattern))
        {
            yield return pathItem.Name;
        }
    }
}
Azure .NET SDK - Query paths with DataLakeFileSystemClient using a regular expression.
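According to this MS-Document, the DataLakeFileSystemClient API does not currently support querying paths by regular-expression pattern directly. Instead, the API lists paths filtered by parameters such as a path prefix.
However, traversing the directory and filtering the results client-side with a regular expression, as you are doing, is a valid workaround.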
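Since the filter runs once per listed path, it may help to build the Regex object once and reuse it rather than passing the pattern string to the static Regex.IsMatch on every item. A minimal sketch of just the filtering step, assuming the path names have already been listed; PathFilter and FilterPaths are hypothetical names and not part of the Azure SDK:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of Azure.Storage.Files.DataLake.
public static class PathFilter
{
    // Lazily yields the path names that match filePattern.
    // The Regex is constructed once per enumeration and reused for every name.
    public static IEnumerable<string> FilterPaths(IEnumerable<string> pathNames, string filePattern)
    {
        var regex = new Regex(filePattern, RegexOptions.Compiled | RegexOptions.CultureInvariant);
        foreach (string name in pathNames)
        {
            if (regex.IsMatch(name))
                yield return name;
        }
    }
}
```

In the listing loop, the same idea would be to create the Regex before the await foreach and call its IsMatch method on pathItem.Name.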
Code:
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Threading;
using System.Threading.Tasks;
using Azure.Identity;
using Azure.Storage.Files.DataLake;
using Azure.Storage.Files.DataLake.Models;

namespace DataLakeRegexExample
{
    public class DataLakeFileSearcher
    {
        private readonly DataLakeFileSystemClient _fileSystemClient;

        public DataLakeFileSearcher(DataLakeFileSystemClient fileSystemClient)
        {
            _fileSystemClient = fileSystemClient;
        }

        public async IAsyncEnumerable<string> TraverseDirectoriesAsync(
            string directoryPath,
            string filePattern,
            [System.Runtime.CompilerServices.EnumeratorCancellation] CancellationToken cancellationToken = default)
        {
            cancellationToken.ThrowIfCancellationRequested();
            // List all paths (files and directories) under the given directory, recursively
            await foreach (PathItem pathItem in _fileSystemClient.GetPathsAsync(directoryPath, recursive: true, cancellationToken: cancellationToken))
            {
                cancellationToken.ThrowIfCancellationRequested();
                // Skip directories; only files are matched
                if (pathItem.IsDirectory.HasValue && pathItem.IsDirectory.Value)
                    continue;
                if (Regex.IsMatch(pathItem.Name, filePattern))
                {
                    yield return pathItem.Name; // Yield matching file paths
                }
            }
        }

        public static async Task Main(string[] args)
        {
            string accountName = "xxxx"; // Replace with your account name
            string fileSystemName = "xxx"; // Replace with your file system name
            string directoryPath = "xxxx"; // Replace with your directory path
            string filePattern = @"^.*\.csv$"; // Sample regex pattern that matches only .csv files

            var serviceClient = new DataLakeServiceClient(
                new Uri($"https://{accountName}.dfs.core.windows.net"),
                new DefaultAzureCredential()
            );
            var fileSystemClient = serviceClient.GetFileSystemClient(fileSystemName);
            var searcher = new DataLakeFileSearcher(fileSystemClient);

            try
            {
                await foreach (var fileName in searcher.TraverseDirectoriesAsync(directoryPath, filePattern))
                {
                    Console.WriteLine(fileName);
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine($"Error occurred: {ex.Message}");
            }
        }
    }
}
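The code above defines a DataLakeFileSearcher class that uses DataLakeFileSystemClient to traverse a directory recursively and match each file path against a regular-expression pattern. Because direct regex queries are not supported, the filtering happens on the client after the paths have been listed.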
Output:
backup/file1.csv
backup/large_file.csv
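As the output shows, pathItem.Name includes the directory part, so a pattern anchored to the file name alone (for example ^file.*\.csv$) would not match backup/file1.csv. If you only want to test the final path segment, something like the following sketch could be used; FileNameMatcher and MatchesFileName are hypothetical names, and it assumes "/" is the path separator, as in the output above:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the Azure SDK.
public static class FileNameMatcher
{
    // Applies the regex to the last "/"-separated segment of fullPath only.
    public static bool MatchesFileName(string fullPath, Regex fileNameRegex)
    {
        int lastSlash = fullPath.LastIndexOf('/');
        string fileName = lastSlash >= 0 ? fullPath.Substring(lastSlash + 1) : fullPath;
        return fileNameRegex.IsMatch(fileName);
    }
}
```

With new Regex(@"^file.*\.csv$"), this matches backup/file1.csv but not backup/large_file.csv, whereas applying the same pattern to the full path matches neither.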