Splitting a DataFrame into sublists based on a cutoff value

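I want to split a dataframe based on the sublists obtained by dividing a list into parts, such that the only value above the cutoff in each part is the first one.

For example, with cutoff = 3: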

[4,2,3,5,2,1,6,7] => [4,2,3], [5,2,1], [6], [7]

I still need to keep track of the other fields in the dataframe.

From this df, I should get the result shown below:

import polars as pl

data = {
    "uid": ["Alice", "Bob", "Charlie"],
    "time_deltas": [
        [4,2, 3],
        [1,1, 4, 8, 3],
        [1,1, 7, 3, 2],
    ],
    "other_field": [["x", "y", "z"], ["x", "y", "z", "x", "y"], ["x", "y", "z", "x", "y"]]
}

df = pl.DataFrame(data)
cutoff = 3

# Split the time_deltas column into sublists so that no value other than the first one in each sublist is greater than the cutoff. The other_field column must be split at the same positions.

# Expected Output
# +--------+----------------------+----------------------+
# | uid    | time_deltas          | other_field          |
# | ---    | ---                  | ---                  |
# | str    | list[duration[ms]]   | list[str]            |
# +--------+----------------------+----------------------+
# | Alice  | [4, 2, 3]            | ["x", "y", "z"]      |
# | Bob    | [1, 1]               | ["x", "y"]           |
# | Bob    | [4]                  | ["z"]                |
# | Bob    | [8, 3]               | ["x", "y"]           |
# | Charlie| [1, 1]               | ["x", "y"]           |
# | Charlie| [7,3,2]              | ["z", "x", "y"]      |
python dataframe python-polars
1 Answer

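You can use a function that splits time_deltas and other_field according to the specified cutoff. Whenever a value exceeds the cutoff, it starts a new time_deltas sublist (so only the first value of a sublist can be above the cutoff), and the associated other_field elements are split at the same positions.

For example (note that this uses pandas rather than Polars):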

import pandas as pd

# Your data
data = {
    "uid": ["Alice", "Bob", "Charlie"],
    "time_deltas": [[4, 2, 3], [1, 1, 4, 8, 3], [1, 1, 7, 3, 2]],
    "other_field": [["x", "y", "z"], ["x", "y", "z", "x", "y"], ["x", "y", "z", "x", "y"]],
}
df = pd.DataFrame(data)
cutoff = 3

# Split one row's lists into chunks; a new chunk starts at every value above the cutoff
def split_by_cutoff(uid, time_deltas, other_field, cutoff):
    result = []
    temp_time = []
    temp_other = []

    for i, (t, o) in enumerate(zip(time_deltas, other_field)):
        if i == 0 or t > cutoff:
            if temp_time:  
                result.append({"uid": uid, "time_deltas": temp_time, "other_field": temp_other})
            temp_time = [t]  
            temp_other = [o]
        else:
            temp_time.append(t)  
            temp_other.append(o)

    if temp_time:
        result.append({"uid": uid, "time_deltas": temp_time, "other_field": temp_other})
    return result

split_rows = []
for _, row in df.iterrows():
    split_rows.extend(split_by_cutoff(row['uid'], row['time_deltas'], row['other_field'], cutoff))

split_df = pd.DataFrame(split_rows)
print(split_df)

Hope this helps.