Finding the unique intersection of a series of lists of strings

Question

How can I find the intersection of a column of lists?

[dependencies]
polars = { version = "*", features = ["lazy"] }

use polars::df;
use polars::prelude::*;

fn main() {
    let df = df![
        "bar" => ["a", "b", "c", "a", "b", "c", "a", "c"],
        "ham" => ["foo", "foo", "foo", "bar", "bar", "bar", "bing", "bang"]
    ]
    .unwrap();

    let df_grp = df
        .lazy()
        .groupby(["bar"])
        .agg([col("ham").list()])
        .collect()
        .unwrap();

    println!("{:?}", df_grp);
}

This prints:

┌─────┬────────────────────────┐
│ bar ┆ ham                    │
│ --- ┆ ---                    │
│ str ┆ list[str]              │
╞═════╪════════════════════════╡
│ c   ┆ ["foo", "bar", "bang"] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b   ┆ ["foo", "bar"]         │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a   ┆ ["foo", "bar", "bing"] │
└─────┴────────────────────────┘

What I want is the intersection across rows a/b/c ⇒ ["foo","bar"], i.e. the strings common to all rows.

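My idea is to convert the column of string lists into a column of hash sets and then fold/reduce with intersection. How do I get from Series<list<String>> to Series<HashSet>? It would be great if this were possible inside a lazy-frame fold expression, but how would I define the accumulator? lit(HashSet)?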

Answer 1

I found a way to do this, though not with expressions.

use polars::df;
use polars::prelude::*;
use std::collections::HashSet;

fn main() -> Result<(), PolarsError> {
    let df = df![
        "bar" => ["a", "b", "c",
                  "a", "b", "c",
                  "a", "b", "c"],
        "ham" => ["foo", "foo", "foo",
                  "bar", "bar", "bar",
                  "bing", "bang", "bing"]
    ]
    .unwrap();

    // Group by "bar" and collect each group's "ham" values into a list.
    let df_grp = df
        .lazy()
        .groupby(["bar"])
        .agg([col("ham").list()])
        .sort("bar", Default::default())
        .collect()?;

    println!("{:?}", df_grp);

    // Copy each list into a Vec<String>; a null list becomes [""].
    let mut s_sets: Vec<Vec<String>> = Vec::new();
    df_grp
        .column("ham")?
        .list()?
        .into_iter()
        .for_each(|opt_lst| match opt_lst {
            None => s_sets.push(vec!["".to_string()]),
            Some(lst) => s_sets.push(
                lst.clone()
                    .utf8()
                    .unwrap()
                    .into_no_null_iter()
                    .map(|s: &str| s.to_string())
                    .collect::<Vec<String>>(),
            ),
        });

    let common = find_common_ids(s_sets);
    println!("{:?}", common);

    Ok(())
}

fn find_common_ids(callset: Vec<Vec<String>>) -> HashSet<String> {
    let init = HashSet::from_iter(callset[0].iter().cloned());
    callset[1..].iter().fold(init, |common, new| {
        let new = HashSet::from_iter(new.iter().cloned());
        &common & &new
    })
}
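
Two caveats about find_common_ids: it indexes callset[0], so it panics when the input is empty, and a HashSet has no defined iteration order, so the printed result can vary between runs. Below is a minimal std-only sketch of an alternative that returns an empty result for empty input and keeps the order of the first list. The name common_strings is hypothetical and not part of polars; it assumes the lists have already been extracted into Vec<String> values as above.

```rust
use std::collections::HashSet;

/// Returns the strings present in every list, in the order they first
/// appear in the first list. Returns an empty Vec when `lists` is empty.
/// (`common_strings` is a hypothetical helper, not a polars function.)
fn common_strings(lists: &[Vec<String>]) -> Vec<String> {
    let Some((first, rest)) = lists.split_first() else {
        return Vec::new();
    };

    // Build one lookup set per remaining list.
    let sets: Vec<HashSet<&str>> = rest
        .iter()
        .map(|l| l.iter().map(String::as_str).collect())
        .collect();

    let mut seen = HashSet::new();
    first
        .iter()
        // Keep a value only if every other list contains it ...
        .filter(|s| sets.iter().all(|set| set.contains(s.as_str())))
        // ... and drop repeats so each common value appears once.
        .filter(|s| seen.insert(s.as_str()))
        .cloned()
        .collect()
}

fn main() {
    let lists = vec![
        vec!["foo".to_string(), "bar".to_string(), "bing".to_string()],
        vec!["foo".to_string(), "bar".to_string()],
        vec!["foo".to_string(), "bar".to_string(), "bang".to_string()],
    ];
    println!("{:?}", common_strings(&lists)); // ["foo", "bar"]
}
```

In the program above, this could be called as common_strings(&s_sets) in place of find_common_ids(s_sets).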

Answer 2

Here is a way to do it with expressions, using the list_sets feature of the polars-plan crate:

use polars::df;
use polars::prelude::*;

fn main() -> PolarsResult<()> {
    let df = df![
        "bar" => ["a", "b", "c", "a", "b", "c", "a", "c"],
        "ham" => ["foo", "foo", "foo", "bar", "bar", "bar", "bing", "bang"]
    ]
    .unwrap();

    let df_grp = df
        .lazy()
        .group_by(["bar"])
        .agg([col("ham").alias("aggregated")])
        .with_column(col("aggregated"))
        .drop_columns(["bar"])
        .collect()?;

    println!("{:?}", df_grp);
    // ┌────────────────────────┐
    // │ aggregated             │
    // │ ---                    │
    // │ list[str]              │
    // ╞════════════════════════╡
    // │ ["foo", "bar"]         │
    // │ ["foo", "bar", "bang"] │
    // │ ["foo", "bar", "bing"] │
    // └────────────────────────┘

    let df_w_all_hams = df_grp.transpose(None, None)?;

    println!("{:?}", df_w_all_hams);
    // ┌────────────────┬────────────────────────┬────────────────────────┐
    // │ column_0       ┆ column_1               ┆ column_2               │
    // │ ---            ┆ ---                    ┆ ---                    │
    // │ list[str]      ┆ list[str]              ┆ list[str]              │
    // ╞════════════════╪════════════════════════╪════════════════════════╡
    // │ ["foo", "bar"] ┆ ["foo", "bar", "bang"] ┆ ["foo", "bar", "bing"] │
    // └────────────────┴────────────────────────┴────────────────────────┘

    let common_vals = df_w_all_hams
        .lazy()
        .select([col("*").list().set_intersection("*").alias("common_vals")])
        .collect()?;

    println!("{:?}", common_vals);
    // ┌────────────────┐
    // │ common_vals    │
    // │ ---            │
    // │ list[str]      │
    // ╞════════════════╡
    // │ ["foo", "bar"] │
    // └────────────────┘

    Ok(())
}
