I want to split the lists in a dataframe into sublists such that, within each sublist, the only value above the cutoff is the first one.
For example, with cutoff = 3:
[4,2,3,5,2,1,6,7] => [4,2,3], [5,2,1], [6], [7]
I also need to keep the other fields in the dataframe in sync with the splits.
Given this df, I should get the result below:
data = {
    "uid": ["Alice", "Bob", "Charlie"],
    "time_deltas": [
        [4, 2, 3],
        [1, 1, 4, 8, 3],
        [1, 1, 7, 3, 2],
    ],
    "other_field": [["x", "y", "z"], ["x", "y", "z", "x", "y"], ["x", "y", "z", "x", "y"]],
}
df = pl.DataFrame(data)
cutoff = 3
# Split the time_deltas lists at every value greater than the cutoff, so that each value above the cutoff starts a new sublist. Ensure that the other_field column is split accordingly.
# Expected Output
# +--------+----------------------+----------------------+
# | uid | time_deltas | other_field |
# | --- | --- | --- |
# | str | list[duration[ms]] | list[str] |
# +--------+----------------------+----------------------+
# | Alice | [4, 2, 3] | ["x", "y", "z"] |
# | Bob | [1, 1] | ["x", "y"] |
# | Bob | [4] | ["z"] |
# | Bob | [8, 3] | ["x", "y"] |
# | Charlie| [1, 1] | ["x", "y"] |
# | Charlie| [7, 3, 2]            | ["z", "x", "y"]      |
You can write a function that splits time_deltas and other_field at the given cutoff: start a new sublist whenever a value in time_deltas exceeds the cutoff, and keep the associated other_field elements aligned with those splits.
For example (note this builds the frame with pandas rather than polars):
import pandas as pd
# Your data
data = {
    "uid": ["Alice", "Bob", "Charlie"],
    "time_deltas": [[4, 2, 3], [1, 1, 4, 8, 3], [1, 1, 7, 3, 2]],
    "other_field": [["x", "y", "z"], ["x", "y", "z", "x", "y"], ["x", "y", "z", "x", "y"]],
}
df = pd.DataFrame(data)
cutoff = 3
# Function to split one row into multiple rows at values above the cutoff
def split_by_cutoff(uid, time_deltas, other_field, cutoff):
    result = []
    temp_time = []
    temp_other = []
    for i, (t, o) in enumerate(zip(time_deltas, other_field)):
        if i == 0 or t > cutoff:
            # A value above the cutoff (or the very first value) starts a new sublist
            if temp_time:
                result.append({"uid": uid, "time_deltas": temp_time, "other_field": temp_other})
            temp_time = [t]
            temp_other = [o]
        else:
            temp_time.append(t)
            temp_other.append(o)
    # Flush the final sublist
    if temp_time:
        result.append({"uid": uid, "time_deltas": temp_time, "other_field": temp_other})
    return result
split_rows = []
for _, row in df.iterrows():
    split_rows.extend(split_by_cutoff(row["uid"], row["time_deltas"], row["other_field"], cutoff))
split_df = pd.DataFrame(split_rows)
print(split_df)
Hope this helps!