Suppose I have a CSV file:
date,value
2020-01-01,1
2020-01-02,4
2020-01-03,5
2020-01-04,9
2020-01-05,2
I want to read it with DuckDB, do some preprocessing, and end up with a training set and a validation set as Polars DataFrames.
I can do:
import duckdb

# Rolling mean over the current row and the two before it; keep rows before the cutoff
train = duckdb.sql("""
    select *, avg(value) over (order by date rows between 2 preceding and current row)
    from read_csv('my_data.csv') qualify date < make_date(2020,1,4)
""").pl()
# Same query; keep rows from the cutoff onward
val = duckdb.sql("""
    select *, avg(value) over (order by date rows between 2 preceding and current row)
    from read_csv('my_data.csv') qualify date >= make_date(2020,1,4)
""").pl()
This works, but doesn't it risk computing the window twice?
Is there a way to get both DataFrames without the double computation? Or should I just do
from datetime import date
import polars as pl

data = duckdb.sql("select *, avg(value) over (order by date rows between 2 preceding and current row) from read_csv('my_data.csv')").pl()
train = data.filter(pl.col('date') < date(2020, 1, 4))
val = data.filter(pl.col('date') >= date(2020, 1, 4))
?
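For reference, here is a minimal pure-Python sketch of what the window expression computes on the sample data. Because `qualify` filters after the window function has been evaluated, the averages on the validation rows still include the preceding training rows, so computing once and splitting afterwards should give the same numbers as the two separate queries. The helper name `rolling_avg` and the in-memory `rows` list are illustrative assumptions, not part of DuckDB or Polars.

```python
from datetime import date

# Sample rows from the CSV above, held in memory for illustration.
rows = [
    (date(2020, 1, 1), 1),
    (date(2020, 1, 2), 4),
    (date(2020, 1, 3), 5),
    (date(2020, 1, 4), 9),
    (date(2020, 1, 5), 2),
]


def rolling_avg(values, preceding=2):
    """Mean of each value together with up to `preceding` values before it."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - preceding): i + 1]
        out.append(sum(window) / len(window))
    return out


rows.sort(key=lambda r: r[0])  # mirrors ORDER BY date
avgs = rolling_avg([v for _, v in rows])
featured = [(d, v, a) for (d, v), a in zip(rows, avgs)]

# Split after the window has been computed, as QUALIFY does.
cutoff = date(2020, 1, 4)
train = [r for r in featured if r[0] < cutoff]
val = [r for r in featured if r[0] >= cutoff]
```

The first validation row averages 4, 5 and 9, two of which come from the training range. If that overlap is unwanted, the filter would have to run before the window (a `where` clause instead of `qualify`), which is a different computation.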
I see you're reading the data into a Polars DataFrame anyway, so you can use partition_by:
from datetime import date

train, val = (
    data
    # p is False for training rows and True for validation rows
    .with_columns(p = pl.col.date >= date(2020, 1, 4))
    .partition_by("p", include_key = False)
)
shape: (3, 2)
┌────────────┬───────┐
│ date       ┆ value │
│ ---        ┆ ---   │
│ date       ┆ i64   │
╞════════════╪═══════╡
│ 2020-01-01 ┆ 1     │
│ 2020-01-02 ┆ 4     │
│ 2020-01-03 ┆ 5     │
└────────────┴───────┘
shape: (2, 2)
┌────────────┬───────┐
│ date       ┆ value │
│ ---        ┆ ---   │
│ date       ┆ i64   │
╞════════════╪═══════╡
│ 2020-01-04 ┆ 9     │
│ 2020-01-05 ┆ 2     │
└────────────┴───────┘