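I have a data transformation problem where the original data consists of "blocks" of three rows: the first row represents a "parent" and the other two are its related children. A minimal working example looks like this: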
import polars as pl
df_original = pl.DataFrame(
{
'Order ID': ['A', 'foo', 'bar'],
'Parent Order ID': [None, 'A', 'A'],
'Direction': ["Buy", "Buy", "Sell"],
'Price': [1.21003, None, 1.21003],
'Some Value': [4, 4, 4],
'Name Provider 1': ['P8', 'P8', 'P8'],
'Quote Provider 1': [None, 1.1, 1.3],
'Name Provider 2': ['P2', 'P2', 'P2'],
'Quote Provider 2': [None, 1.15, 1.25],
'Name Provider 3': ['P1', 'P1', 'P1'],
'Quote Provider 3': [None, 1.0, 1.4],
'Name Provider 4': ['P5', 'P5', 'P5'],
'Quote Provider 4': [None, 1.0, 1.4]
}
)
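In reality there are up to 15 providers (so up to 30 columns), but they are not needed for this example.
We would like to transform this into a format where each row holds both the Buy and the Sell quote of a single provider for that parent. The desired result is: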
df_desired = pl.DataFrame(
{
'Order ID': ['A', 'A', 'A', 'A'],
'Parent Direction': ['Buy', 'Buy', 'Buy', 'Buy'],
'Price': [1.21003, 1.21003, 1.21003, 1.21003],
'Some Value': [4, 4, 4, 4],
'Name Provider': ['P8', 'P2', 'P1', 'P5'],
'Quote Buy': [1.1, 1.15, 1.0, 1.0],
'Quote Sell': [1.3, 1.25, 1.4, 1.4],
}
)
df_desired
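However, I am having a hard time doing this in Polars.
My first approach was to split the data into parents and children and then join them on their respective IDs: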
df_parents = (
df_original
.filter(pl.col("Parent Order ID").is_null())
.drop('Parent Order ID')
)
df_ch = (
df_original
.filter(pl.col("Parent Order ID").is_not_null())
.drop(['Price', 'Some Value'])
)
ch_buy = df_ch.filter(pl.col("Direction") == 'Buy').drop('Direction')
ch_sell = df_ch.filter(pl.col("Direction") == 'Sell').drop('Direction')
df_joined = (
df_parents
.join(ch_buy, left_on='Order ID', right_on='Parent Order ID', suffix="_Buy")
.join(ch_sell, left_on='Order ID', right_on='Parent Order ID', suffix="_Sell")
# The Name and Quote columns in the parent are all empty, so they can go, but they had to be there for the suffix to work for the first join
.drop([f'Name Provider {i}' for i in range(1, 5)])
.drop([f'Quote Provider {i}' for i in range(1, 5)])
)
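But this still leaves you with a mess that you somehow have to split into four rows, not the eight you would easily get with .unpivot(). Any suggestions on how best to approach this? Am I missing some obvious method here?
EDIT: Added a slightly larger example dataframe with two parent orders and their children (the real dataset has roughly 50k+ of these):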
df_original_two_orders = pl.DataFrame(
{
'Order ID': ['A', 'foo', 'bar', 'B', 'baz', 'rar'], # Two parent orders
'Parent Order ID': [None, 'A', 'A', None, 'B', 'B'],
'Direction': ["Buy", "Buy", "Sell", "Sell", "Sell", "Buy"], # Second parent has different direction
'Price': [1.21003, None, 1.21003, 1.1384, None, 1.1384],
'Some Value': [4, 4, 4, 42, 42, 42],
'Name Provider 1': ['P8', 'P8', 'P8', 'P2', 'P2', 'P2'],
'Quote Provider 1': [None, 1.1, 1.3, None, 1.10, 1.40],
# Above, 1.10 corresponds to Buy for order A but to Sell for order B - depends on Direction
'Name Provider 2': ['P2', 'P2', 'P2', 'P1', 'P1', 'P1'],
'Quote Provider 2': [None, 1.15, 1.25, None, 1.11, 1.39],
'Name Provider 3': ['P1', 'P1', 'P1', 'P3', 'P3', 'P3'],
'Quote Provider 3': [None, 1.0, 1.4, None, 1.05, 1.55],
'Name Provider 4': ['P5', 'P5', 'P5', None, None, None],
'Quote Provider 4': [None, 1.0, 1.4, None, None, None]
}
)
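I think this is more representative of the real world, since it has multiple parent orders and not every provider column is populated for every order, while still staying clear of the annoying business logic.
The correct output for this example is: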
df_desired_two_parents = pl.DataFrame(
{
'Order ID': ['A']*4 + ['B'] * 3,
'Parent Direction': ['Buy']*4 + ['Sell'] * 3,
'Price': [1.21003] * 4 + [1.1384] * 3,
'Some Value': [4] * 4 + [42] * 3,
'Name Provider': ['P8', 'P2', 'P1', 'P5', 'P2', 'P1', 'P3'],
'Quote Buy': [1.1, 1.15, 1.0, 1.0, 1.40, 1.39, 1.55], # Note the last three values are the "second" values in the original column now because the parent order was 'Sell'
'Quote Sell': [1.3, 1.25, 1.4, 1.4, 1.10, 1.11, 1.05],
}
)
Here is my attempt:
Fill the null values in the Parent Order ID column and use that column to .group_by():
columns = ["Order ID", "Direction", "Price", "Some Value"]
names = pl.col("^Name .*$") # All name columns
quotes = pl.col("^Quote .*$") # All quote columns
(
df_original_two_orders
.with_columns(pl.col("Parent Order ID").backward_fill())
.group_by("Parent Order ID")
.agg(
pl.col(columns).first(),
pl.concat_list(names.first()).alias("Name"), # Put all names into single column: ["Name1", "Name2", ...]
pl.col("^Quote .*$").slice(1), # Create list for each quote column (skip first row): [1.1, 1.3], [1.15, 1.25], ...
)
.with_columns(
pl.concat_list( # Create list of Buy values
pl.when(pl.col("Direction") == "Buy")
.then(quotes.list.first())
.otherwise(quotes.list.last())
.alias("Buy")
),
pl.concat_list( # Create list of Sell values
pl.when(pl.col("Direction") == "Sell")
.then(quotes.list.first())
.otherwise(quotes.list.last())
.alias("Sell")
)
)
.select(columns + ["Name", "Buy", "Sell"]) # Remove Name/Quote [1234..] columns
.explode("Name", "Buy", "Sell") # Turn into rows
)
shape: (8, 7)
┌──────────┬───────────┬─────────┬────────────┬──────┬──────┬──────┐
│ Order ID ┆ Direction ┆ Price ┆ Some Value ┆ Name ┆ Buy ┆ Sell │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ i64 ┆ str ┆ f64 ┆ f64 │
╞══════════╪═══════════╪═════════╪════════════╪══════╪══════╪══════╡
│ A ┆ Buy ┆ 1.21003 ┆ 4 ┆ P8 ┆ 1.1 ┆ 1.3 │
│ A ┆ Buy ┆ 1.21003 ┆ 4 ┆ P2 ┆ 1.15 ┆ 1.25 │
│ A ┆ Buy ┆ 1.21003 ┆ 4 ┆ P1 ┆ 1.0 ┆ 1.4 │
│ A ┆ Buy ┆ 1.21003 ┆ 4 ┆ P5 ┆ 1.0 ┆ 1.4 │
│ B ┆ Sell ┆ 1.1384 ┆ 42 ┆ P2 ┆ 1.4 ┆ 1.1 │
│ B ┆ Sell ┆ 1.1384 ┆ 42 ┆ P1 ┆ 1.39 ┆ 1.11 │
│ B ┆ Sell ┆ 1.1384 ┆ 42 ┆ P3 ┆ 1.55 ┆ 1.05 │
│ B ┆ Sell ┆ 1.1384 ┆ 42 ┆ null ┆ null ┆ null │
└──────────┴───────────┴─────────┴────────────┴──────┴──────┴──────┘
Step 1 collects the provider names into a single list column and turns each quote column into a list:
agg = (
df_original_two_orders
.with_columns(pl.col("Parent Order ID").backward_fill())
.group_by("Parent Order ID")
.agg(
pl.col(columns).first(),
pl.concat_list(names.first()).alias("Name"), # Put all names into single column: ["Name1", "Name2", ...]
pl.col("^Quote .*$").slice(1), # Create list for each quote column (skip first row): [1.1, 1.3], [1.15, 1.25], ...
)
)
shape: (2, 10)
┌────────────────┬──────────┬───────────┬─────────┬───┬────────────────┬────────────────┬────────────────┬────────────────┐
│ Parent Order ┆ Order ID ┆ Direction ┆ Price ┆ … ┆ Quote Provider ┆ Quote Provider ┆ Quote Provider ┆ Quote Provider │
│ ID ┆ --- ┆ --- ┆ --- ┆ ┆ 1 ┆ 2 ┆ 3 ┆ 4 │
│ --- ┆ str ┆ str ┆ f64 ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ ┆ ┆ ┆ ┆ list[f64] ┆ list[f64] ┆ list[f64] ┆ list[f64] │
╞════════════════╪══════════╪═══════════╪═════════╪═══╪════════════════╪════════════════╪════════════════╪════════════════╡
│ B ┆ B ┆ Sell ┆ 1.1384 ┆ … ┆ [1.1, 1.4] ┆ [1.11, 1.39] ┆ [1.05, 1.55] ┆ [null, null] │
│ A ┆ A ┆ Buy ┆ 1.21003 ┆ … ┆ [1.1, 1.3] ┆ [1.15, 1.25] ┆ [1.0, 1.4] ┆ [1.0, 1.4] │
└────────────────┴──────────┴───────────┴─────────┴───┴────────────────┴────────────────┴────────────────┴────────────────┘
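Step 2 creates separate Buy/Sell lists from the quote columns.
We can use pl.when().then().otherwise() to decide whether to take the first or the last value of each quote list, depending on whether Direction is Buy or Sell.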
(
agg
.with_columns(
pl.concat_list( # Create list of Buy values
pl.when(pl.col("Direction") == "Buy")
.then(quotes.list.first())
.otherwise(quotes.list.last())
.alias("Buy")
),
pl.concat_list( # Create list of Sell values
pl.when(pl.col("Direction") == "Sell")
.then(quotes.list.first())
.otherwise(quotes.list.last())
.alias("Sell")
)
)
.select(columns + ["Name", "Buy", "Sell"])
)
shape: (2, 7)
┌──────────┬───────────┬─────────┬────────────┬──────────────────────┬─────────────────────┬─────────────────────┐
│ Order ID ┆ Direction ┆ Price ┆ Some Value ┆ Name ┆ Buy ┆ Sell │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ i64 ┆ list[str] ┆ list[f64] ┆ list[f64] │
╞══════════╪═══════════╪═════════╪════════════╪══════════════════════╪═════════════════════╪═════════════════════╡
│ B ┆ Sell ┆ 1.1384 ┆ 42 ┆ ["P2", "P1", … null] ┆ [1.4, 1.39, … null] ┆ [1.1, 1.11, … null] │
│ A ┆ Buy ┆ 1.21003 ┆ 4 ┆ ["P8", "P2", … "P5"] ┆ [1.1, 1.15, … 1.0] ┆ [1.3, 1.25, … 1.4] │
└──────────┴───────────┴─────────┴────────────┴──────────────────────┴─────────────────────┴─────────────────────┘
Finally, we .explode() the lists into rows.
If needed, you can add .drop_nulls() afterwards to remove the null rows.