When using Apache Beam Python with GCP Dataflow, is there a downside to materializing the result of a GroupByKey, for example to count the number of elements? For example:
def consume_group_by_key(element):
    season, fruits = element
    for fruit in fruits:
        yield f"{fruit} grows in {season}"

def consume_group_by_key_materialize(element):
    season, fruits = element
    num_fruits = len(list(fruits))
    print(f"There are {num_fruits} fruits grown in {season}")
    for fruit in fruits:
        yield f"{fruit} grows in {season}"
(
    pipeline
    | 'Create produce counts' >> beam.Create([
        ('spring', 'strawberry'),
        ('spring', 'carrot'),
        ('spring', 'eggplant'),
        ('spring', 'tomato'),
        ('summer', 'carrot'),
        ('summer', 'tomato'),
        ('summer', 'corn'),
        ('fall', 'carrot'),
        ('fall', 'tomato'),
        ('winter', 'eggplant'),
    ])
    | 'Group counts per produce' >> beam.GroupByKey()
    | beam.ParDo(consume_group_by_key_materialize)
)
Is the grouped value `fruits` passed to my DoFn as a generator? Does using `consume_group_by_key_materialize` instead of `consume_group_by_key` hurt performance? In other words, does materializing something like `fruits` with `len(list(fruits))` risk exhausting all my memory if there are billions of fruits?
You're right: `len(list(fruits))` materializes the entire list in memory before taking its size, whereas your `consume_group_by_key` function iterates over the iterable once and (on a distributed runner like Dataflow) never needs to hold the whole set of values in memory at once.