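I want to use a queryset iterator to iterate over a large dataset. Django provides iterator() for this, but it hits the database on every iteration. I found the following code for iterating in chunks: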
import gc

def queryset_iterator(queryset, chunksize=1000):
    '''
    Iterate over a Django Queryset ordered by the primary key

    This method loads a maximum of chunksize (default: 1000) rows in its
    memory at the same time while django normally would load all rows in its
    memory. Using the iterator() method only causes it to not preload all the
    classes.

    Note that the implementation of the iterator
    does not support ordered query sets.
    '''
    pk = 0
    # Highest primary key at the time iteration starts.
    last_pk = queryset.order_by('-pk').values_list('pk', flat=True).first()
    if last_pk is not None:
        queryset = queryset.order_by('pk')
        while pk < last_pk:
            # Fetch the next chunk of rows whose pk is above the last one seen.
            for row in queryset.filter(pk__gt=pk)[:chunksize]:
                pk = row.pk
                yield row
            gc.collect()
This works for unordered querysets. Is there any solution or workaround for doing this on an ordered queryset?
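Here's mine, with sorting support.
By the way, the iterator you are using can loop forever if the queryset's rows are modified while you iterate: deleting or adding even a single item is enough.
The iterator below also avoids the extra query for last_pk.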
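import gc

def queryset_iterator(queryset, chunksize=10000, key=None):
    # Accept a single field name or a list of them; default to the primary key.
    # (`str` here; on Python 2 this check would use `basestring`.)
    key = [key] if isinstance(key, str) else (key or ['pk'])
    counter = 0
    count = chunksize
    # A chunk shorter than chunksize means the end of the queryset was reached.
    while count == chunksize:
        offset = counter - counter % chunksize
        count = 0
        for item in queryset.all().order_by(*key)[offset:offset + chunksize]:
            count += 1
            yield item
        counter += count
        gc.collect()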
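
To make the "loop forever" point concrete, here is a small database-free sketch. It is not Django code: `fetch_after` is a hypothetical stand-in for something like `queryset.order_by('pk').filter(pk__gt=pk)[:limit]`, and the "rows" are just integer primary keys in a set. It shows a pk-based loop that stops on the first empty chunk instead of comparing against a precomputed last_pk, so deleting the highest row mid-iteration does not stall it.

```python
def iterate_by_pk(fetch_after, chunksize=1000):
    """Yield rows chunk by chunk, stopping at the first empty chunk.

    fetch_after(pk, limit) is a hypothetical callable that returns up to
    `limit` rows with a primary key greater than `pk`, in ascending order
    (pk=None means "start from the beginning").
    """
    pk = None
    while True:
        chunk = fetch_after(pk, chunksize)
        if not chunk:
            # Nothing left above the last pk we saw, so we are done.
            break
        for row in chunk:
            pk = row  # in this sketch a "row" is simply its own pk
            yield row


# In-memory stand-in for a table with primary keys 1..10 (an assumption
# made only for this illustration).
table = set(range(1, 11))

def fetch_after(pk, limit):
    rows = sorted(p for p in table if pk is None or p > pk)
    return rows[:limit]

seen = []
for pk in iterate_by_pk(fetch_after, chunksize=3):
    seen.append(pk)
    if pk == 5:
        # Delete the highest row while iterating. A loop of the form
        # `while pk < last_pk` would never finish here, because pk can
        # no longer reach the old last_pk of 10.
        table.discard(10)

print(seen)
```

With a real queryset the same idea would presumably mean dropping the last_pk query and breaking out of the loop when a chunk yields no rows. Note that this only works when ordering by a unique, increasing column such as the pk, which is why the offset-based version above is what you want for arbitrary orderings.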