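I'm trying to build a batch generator that takes a large Pandas DataFrame as input and outputs a given number of rows (batch_size). I've been experimenting on a smaller DataFrame of 10 rows, and I'm having trouble with the generator function. The for loop below works fine on the practice DataFrame and prints batches of the specified size: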
for i in range(0, len(df), 3):
    lower = i
    upper = i+3
    print(df.iloc[lower:upper])
However, I'm struggling to build this into a generator function:
def Generator(batch_size, seed = None):
    num_items = len(df)
    x = df.sample(frac = 1, replace = False, random_state = seed)
    for offset in range(0, num_items, batch_size):
        lower_limit = offset
        upper_limit = offset+batch_size
        batch = x.iloc[lower_limit:upper_limit]
        yield batch
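Unfortunately:
next(Generator(1))  # e.g. batch_size = 1
returns the same row over and over.
I'm fairly new to this, and I feel I must be missing something, but I can't spot what. I'd be grateful if someone could point out the problem.
Edit: the DataFrame is predefined; it is: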
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy', 'Sarah', 'Gueniva', 'Know', 'Sara', 'Cat'],
            'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Mornig', 'Jaker', 'Alom', 'Ormon', 'Koozer'],
            'age': [42, 52, 36, 24, 73, 53, 26, 72, 73, 24],
            'preTestScore': [4, 24, 31, 2, 3, 13, 52, 72, 26, 26],
            'postTestScore': [25, 94, 57, 62, 70, 82, 52, 56, 234, 254]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df
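Create an iterator from the result of calling Generator once, and call next() on that iterator. Otherwise, every call to Generator creates a fresh generator "state" that starts from the beginning, and if a seed is supplied, each one shuffles identically and so yields the same "first row".
With the indentation fixed, it should work: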
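To see the difference in isolation, here is a minimal sketch using a plain Python generator with no pandas involved. The `count_up` function is a hypothetical stand-in for Generator, not something from the question:

```python
def count_up(limit):
    """Yield 0, 1, ..., limit - 1 (hypothetical stand-in for Generator)."""
    for value in range(limit):
        yield value

# Calling the function again builds a brand-new generator each time,
# so next() always returns the first item.
fresh_calls = [next(count_up(3)) for _ in range(3)]

# Keeping one generator object and calling next() on it advances its state.
gen = count_up(3)
same_object = [next(gen) for _ in range(3)]

print(fresh_calls)   # [0, 0, 0]
print(same_object)   # [0, 1, 2]
```

A generator object is already an iterator, so wrapping it in iter() is harmless but not required; what matters is that the same object is reused between next() calls.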
import pandas as pd
# I dislike variable scope bleeding into the function, provide df explicitly
def Generator(df, batch_size, seed = None):
    num_items = len(df)
    x = df.sample(frac = 1, replace = False, random_state = seed)
    for offset in range(0, num_items, batch_size):
        lower_limit = offset
        upper_limit = offset+batch_size
        batch = x.iloc[lower_limit:upper_limit]
        yield batch
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy', 'Sarah',
                           'Gueniva', 'Know', 'Sara', 'Cat'],
            'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Mornig',
                          'Jaker', 'Alom', 'Ormon', 'Koozer'],
            'age': [42, 52, 36, 24, 73, 53, 26, 72, 73, 24],
            'preTestScore': [4, 24, 31, 2, 3, 13, 52, 72, 26, 26],
            'postTestScore': [25, 94, 57, 62, 70, 82, 52, 56, 234, 254]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age',
                                       'preTestScore', 'postTestScore'])
# capture a "state" for the generator function
i = iter(Generator(df, 2))
# get the next states from the iterator and print
print(next(i))
print(next(i))
print(next(i))
Output:
  first_name last_name  age  preTestScore  postTestScore
8       Sara     Ormon   73            26            234
6    Gueniva     Jaker   26            52             52
  first_name last_name  age  preTestScore  postTestScore
5      Sarah    Mornig   53            13             82
9        Cat    Koozer   24            26            254
  first_name last_name  age  preTestScore  postTestScore
1      Molly  Jacobson   52            24             94
2       Tina       Ali   36            31             57
If you instead do this:
print(next(Generator(df, 2)))
print(next(Generator(df, 2)))
print(next(Generator(df, 2)))
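you create three separately shuffled copies of df, which may show the same rows, because you only print the first "iteration" of each one and then discard it.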
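In practice a generator like this is usually consumed with a for loop (or list()), which creates it once and runs it to exhaustion. The following is a rough sketch under that assumption; `batch_generator` and the `demo` frame are hypothetical names used for illustration, and the seed value is arbitrary:

```python
import pandas as pd

def batch_generator(df, batch_size, seed=None):
    """Shuffle df once, then yield consecutive slices of batch_size rows."""
    shuffled = df.sample(frac=1, replace=False, random_state=seed)
    for offset in range(0, len(shuffled), batch_size):
        yield shuffled.iloc[offset:offset + batch_size]

# Small hypothetical frame standing in for the question's df.
demo = pd.DataFrame({"value": range(10)})

# list() creates the generator once and exhausts it, just as a for loop would.
batches = list(batch_generator(demo, 3, seed=0))
sizes = [len(b) for b in batches]
print(sizes)  # [3, 3, 3, 1]

# Every row appears exactly once across the batches.
seen = sorted(pd.concat(batches)["value"].tolist())
print(seen == list(range(10)))  # True
```

Note that the last batch is shorter when the number of rows is not a multiple of batch_size, since iloc slicing past the end simply returns the rows that remain.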