How to efficiently match text between two DataFrames

Question

I have some text data:

data1 = pl.from_repr("""
┌─────────┬──────────────────────────────────────────────────┬───────────────────────────┐
│ id      ┆ comment                                          ┆ title                     │
│ ---     ┆ ---                                              ┆ ---                       │
│ str     ┆ str                                              ┆ str                       │
╞═════════╪══════════════════════════════════════════════════╪═══════════════════════════╡
│ user_A  ┆ good                                             ┆ a file name               │
│ user_B  ┆ a better way is…                                 ┆ is there some good sugg？ │
│ user_C  ┆ a another way is…                                ┆ is there some good sugg？ │
│ user_C  ┆ I have been using Pandas for a long time, so I…  ┆ a book                    │
└─────────┴──────────────────────────────────────────────────┴───────────────────────────┘
""")

And another DataFrame with user IDs:

data2 = pl.from_repr("""
┌─────────┬───────────────────────────┐
│ userid  ┆ title                     │
│ ---     ┆ ---                       │
│ str     ┆ str                       │
╞═════════╪═══════════════════════════╡
│ user_X  ┆ is there some good sugg？ │
│ user_Y  ┆ a great idea…             │
│ user_Z  ┆ a file name               │
│ user_W  ┆ a book                    │
└─────────┴───────────────────────────┘
""")

Desired output:

shape: (4, 4)
┌────────┬─────────────────────────────────────────────────┬───────────────────────────┬────────┐
│ id     ┆ comment                                         ┆ title                     ┆ userid │
│ ---    ┆ ---                                             ┆ ---                       ┆ ---    │
│ str    ┆ str                                             ┆ str                       ┆ str    │
╞════════╪═════════════════════════════════════════════════╪═══════════════════════════╪════════╡
│ user_A ┆ good                                            ┆ a file name               ┆ user_Z │
│ user_B ┆ a better way is…                                ┆ is there some good sugg？ ┆ user_X │
│ user_C ┆ a another way is…                               ┆ is there some good sugg？ ┆ user_X │
│ user_C ┆ I have been using Pandas for a long time, so I… ┆ a book                    ┆ user_W │
└────────┴─────────────────────────────────────────────────┴───────────────────────────┴────────┘

A simple approach is to merge on `title` in `pandas`:

dataall = pd.merge(
    data1,data2,
    on = 'title',
    how ='left'
)

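But this is expensive in memory. data1 has shape (2942087, 7) (and can sometimes have more than 3 times that many rows), and data2 has shape (47516640, 4). My machine has 32 GB of RAM, which is not enough.

I also tried the same join in `polars`: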

dataall = data1.join(
    data2,
    on = 'title',
    how ='left'
)

This fails with an error:

Canceled future for execute_request message before replies were done

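I tried the `is_in` function in `polars` and also encoding the text as numbers. Both are fast, but I don't know how to build the match with them.
Is there an efficient, workable way to do this in pandas/polars/numpy?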

Edit (2022-5-24 16:00:10), following @ritchie46's suggestion:

import polars as pl
pl.Config.set_global_string_cache()

data1 = pl.read_parquet('data1.parquet.gzip').lazy()
data2 = pl.read_parquet('data2.parquet.gzip').lazy()

data1 = data1.with_columns(pl.col('source_post_title').cast(pl.Categorical))
data2 = data2.with_columns(pl.col('source_post_title').cast(pl.Categorical))


dataall = data1.join(
    data2,
    on = 'source_post_title',
    how ='left'
).collect()

The code seems to run for a while, and then:

Canceled future for execute_request message before replies were done
The Kernel crashed while executing code in the the current cell or a previous cell. Please review the code in the cell(s) to identify a possible cause of the failure. Click here for more info. View Jupyter log for further details.

Is this because my processor is too weak? My CPU is an i7-10850H.

Answer 1

If your join keys contain many duplicates, the output table can be much larger than either of the tables you are joining.
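To make the blow-up concrete, here is a minimal pure-Python sketch that counts how many rows a left join would produce from the key columns alone. The helper name `left_join_row_count` and the shortened key `"sugg"` are made up for illustration, and the sketch ignores null keys.

```python
from collections import Counter


def left_join_row_count(left_keys, right_keys):
    """Count the rows a left join on these keys would produce."""
    right_counts = Counter(right_keys)
    # Each left row is repeated once per matching right row,
    # and kept once (with nulls on the right) when nothing matches.
    return sum(max(right_counts[key], 1) for key in left_keys)


# Key columns shaped like the question's sample data ("sugg" is shortened).
left = ["a file name", "sugg", "sugg", "a book"]
right_unique = ["sugg", "a great idea", "a file name", "a book"]
# Hypothetical variant where "sugg" appears three times on the right.
right_dupes = right_unique + ["sugg", "sugg"]

rows_unique = left_join_row_count(left, right_unique)
rows_dupes = left_join_row_count(left, right_dupes)
print(rows_unique, rows_dupes)
```

With unique keys on the right the output has as many rows as the left table. Every duplicate on the right multiplies the matching left rows, which is how a join of millions of rows can outgrow 32 GB.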

Things that may help in `polars`:

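- Use the `Categorical` data type, so that duplicated strings are cached.
- Deduplicate the join keys, so that the output table does not explode (if correctness allows it).
- Use the Polars lazy API right from the `scan` level. That way intermediate results are cleared and do not stay in RAM. On top of that, Polars may apply other optimizations that reduce memory pressure.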

If you do not need all of the output, only the first x million rows of the join result, you can use the Polars lazy API:

import polars as pl

lf_a = pl.scan_parquet("data1")
# ... some more work on lf_a

lf_b = pl.scan_parquet("data2")
# ... some more work on lf_b

# take only the first million rows
N = int(1e6)

# because of the head operation, the join will not materialize a full output table
# ("title" is the join key used in the question)
lf_a.join(lf_b, on="title").head(N).collect()
