Sort blobs in Google Cloud Storage by the last modified date field using the Python API

Question

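I have a scenario where I want to list blobs and then sort them by last modified time.

I am trying to do this with the Python API.

I want to run this script n times, and on each run list 10 files and perform some operation on them (for example, a copy). I want to save the date of the last file in a config file, and on the next run list only the files after that saved date. I need some suggestions, because the Google API does not appear to offer a way to sort the files it lists.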

blobs = storage_client.list_blobs(bucket_name, prefix=prefix, max_results=10)

Answer 1

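A few solutions I can think of:

Get a Pub/Sub notification each time a file is created. Read 10 messages at a time, or save the topic data to BigQuery.
After using a file, move it to another folder together with a metadata file, or update the metadata of the processed file.
Use a Storage trigger to invoke a function and save the event data to a database.
If you control the file names and paths, save the files under paths that are easy to query with the prefix parameter.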
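
The last suggestion can be sketched as follows. This is only an illustration under assumptions: the helper names `daily_prefix` and `prefixes_since` and the `incoming` base path are hypothetical, and it only helps if whatever uploads the files writes them under date-based paths in the first place.

```python
from datetime import date, timedelta


def daily_prefix(base: str, day: date) -> str:
    """Build a date-partitioned prefix such as 'incoming/2024/05/17/'."""
    return f"{base.rstrip('/')}/{day:%Y/%m/%d}/"


def prefixes_since(base: str, start: date, end: date) -> list[str]:
    """Return one prefix per day from start to end inclusive, oldest first."""
    days = (end - start).days
    return [daily_prefix(base, start + timedelta(days=i)) for i in range(days + 1)]
```

Each returned prefix can then be passed as `prefix=` to `storage_client.list_blobs(...)`, one day at a time. Since GCS returns listings in lexicographic order of object name, zero-padded date paths come back in chronological order at day granularity.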

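I think the database solution is the most flexible: it gives you the most control over your data and lets you build dashboards on top of it.

Knowing more about your process would help in suggesting a more fine-grained solution.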

Answer 2

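For anyone arriving here in 2024 or later, like me.

If your goal is to process Blobs in batches in date order, and you are able to list the available Blobs up front, you can do the following: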

from google.cloud import storage

storage_client = storage.Client()

bucket_name = "MY_GCS_BUCKET"
prefix = "the/folder/path"

# Collect the Blobs in a list so they can be sorted
my_blobs = []

# Create the GCS iterator. Do not set max_results, so every matching Blob is listed
my_blobs_iter = storage_client.list_blobs(bucket_name, prefix=prefix)

# Iterate page by page, e.g. if there are hundreds or thousands of Blobs
for page in my_blobs_iter.pages:
    for blob in page:
        my_blobs.append(blob)

# Sort the list by the Blob.generation property, which is a 16-digit microsecond timestamp.
# NOTE: generation changes when an object is successfully overwritten by a
# new version with the same name, based on GCS rules.
# That means the generation values held here may be stale if any of the Blobs
# have been updated since they were listed above.
# (Blob.updated is the metadata modification time, if that is the field you need.)
# reverse=True puts the newest Blobs first; drop it to process oldest first.
my_blobs = sorted(my_blobs, key=lambda b: b.generation, reverse=True)

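From here you can iterate over all the available Blobs in descending order, and so keep track of the "current" timestamp however you like (if you actually still need it).

By listing all the Blobs and sorting them up front, you do not need to track the current datetime purely for pagination, though you can of course still access it for other reasons.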
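
If the script does have to resume across separate runs, as the question describes, one possible approach is to keep a checkpoint in a small JSON file. The sketch below is an assumption-laden illustration, not part of the original answer: `load_checkpoint`, `save_checkpoint`, `next_batch` and the `last_updated` key are hypothetical names, and `fake_blobs` stands in for real Blob objects by exposing only `name` and a timezone-aware `updated` datetime.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from types import SimpleNamespace


def load_checkpoint(path):
    """Return the saved timestamp, or None if no checkpoint file exists yet."""
    p = Path(path)
    if not p.exists():
        return None
    return datetime.fromisoformat(json.loads(p.read_text())["last_updated"])


def save_checkpoint(path, when):
    """Persist the timestamp of the last processed blob as ISO 8601 text."""
    Path(path).write_text(json.dumps({"last_updated": when.isoformat()}))


def next_batch(blobs, checkpoint, batch_size=10):
    """Return up to batch_size blobs modified after checkpoint, oldest first."""
    newer = [b for b in blobs if checkpoint is None or b.updated > checkpoint]
    return sorted(newer, key=lambda b: (b.updated, b.name))[:batch_size]


# Stand-ins for listed Blobs (assumption: real ones come from list_blobs)
fake_blobs = [
    SimpleNamespace(name="c.txt", updated=datetime(2024, 1, 3, tzinfo=timezone.utc)),
    SimpleNamespace(name="a.txt", updated=datetime(2024, 1, 1, tzinfo=timezone.utc)),
    SimpleNamespace(name="b.txt", updated=datetime(2024, 1, 2, tzinfo=timezone.utc)),
]
```

On each run you would call `next_batch(my_blobs, load_checkpoint(path))`, process the result, and then pass the `updated` value of the last blob to `save_checkpoint`. Two caveats: blobs that share exactly the same timestamp as the checkpoint are skipped by the strict `>` comparison, and `updated` also changes on metadata edits, so `time_created` or `generation` may be a better key depending on what "last modified" means for your data.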

You can still process the data in batches, for example with a generator:

HOW_MANY_BLOBS_AT_A_TIME = 10

def create_chunks_of_blobs(mylist, n):
    # Yield n-sized chunks from mylist
    for i in range(0, len(mylist), n):
        yield mylist[i:i + n]

my_blob_generator = create_chunks_of_blobs(my_blobs, HOW_MANY_BLOBS_AT_A_TIME)

# Set to True inside the loop to stop before the next chunk is processed
some_other_condition_to_stop = False
while True:
    chunk_of_up_to_10_blobs = next(my_blob_generator, None)
    if chunk_of_up_to_10_blobs is None:
        print("Finished processing all the Blobs")
        break

    if some_other_condition_to_stop:
        print("Exiting generator")
        break

    for blob in chunk_of_up_to_10_blobs:
        name = blob.name
        print(f"Processing Blob {name}")

        if name != "what-i-wanted.txt":
            print("Breaking the Blob for loop...")
            some_other_condition_to_stop = True
            break
