Cartesian product in pandas

Question

I have two pandas DataFrames:

from pandas import DataFrame
df1 = DataFrame({'col1':[1,2],'col2':[3,4]})
df2 = DataFrame({'col3':[5,6]})

What is the best practice for getting their Cartesian product (without writing it out explicitly as I did, of course)?

#df1, df2 cartesian product
df_cartesian = DataFrame({'col1':[1,2,1,2],'col2':[3,4,3,4],'col3':[5,5,6,6]})

Answer 1

In recent versions of pandas (>= 1.2) this is built into merge, so you can do the following:

from pandas import DataFrame
df1 = DataFrame({'col1':[1,2],'col2':[3,4]})
df2 = DataFrame({'col3':[5,6]})    

df1.merge(df2, how='cross')

This is equivalent to the earlier pandas < 1.2 answer below, but is easier to read.

For pandas < 1.2:

If you have a key that is repeated on every row, you can produce a Cartesian product using merge (as you would in SQL).

from pandas import DataFrame, merge
df1 = DataFrame({'key':[1,1], 'col1':[1,2],'col2':[3,4]})
df2 = DataFrame({'key':[1,1], 'col3':[5,6]})

merge(df1, df2,on='key')[['col1', 'col2', 'col3']]

Output:

   col1  col2  col3
0     1     3     5
1     1     3     6
2     2     4     5
3     2     4     6

See the documentation here: http://pandas.pydata.org/pandas-docs/stable/merging.html

Answer 2

Use pd.MultiIndex.from_product as the index of an otherwise empty DataFrame, then reset its index, and you're done.

import pandas as pd

a = [1, 2, 3]
b = ["a", "b", "c"]

index = pd.MultiIndex.from_product([a, b], names = ["a", "b"])

pd.DataFrame(index = index).reset_index()

Output:

   a  b
0  1  a
1  1  b
2  1  c
3  2  a
4  2  b
5  2  c
6  3  a
7  3  b
8  3  c

Answer 3

This takes the least code. Create a common 'key' column on both frames and merge on it to get the Cartesian product:

df1['key'] = 0
df2['key'] = 0

df_cartesian = df1.merge(df2, how='outer')

Answer 4

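This won't win a code golf competition, and it borrows from the previous answers, but it clearly shows how the key is added and how the join works. It creates two new DataFrames from lists, then adds the key on which to perform the Cartesian product.

My use case was that I needed a list of all store IDs for each week in my list. So I created a list of all the weeks I wanted, then a list of all the store IDs I wanted to map them against.

I chose a left merge, but in this setup it is semantically the same as an inner merge. You can see this in the documentation on merging, which states that a Cartesian product is performed when a key combination appears more than once in both tables, which is exactly what we set up.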

days = pd.DataFrame({'date':list_of_days})
stores = pd.DataFrame({'store_id':list_of_stores})
stores['key'] = 0
days['key'] = 0
days_and_stores = days.merge(stores, how='left', on = 'key')
# The positional axis argument to drop() was removed in pandas 2.0; name the column instead.
days_and_stores.drop(columns='key', inplace=True)
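
To check the claim that left and inner merges coincide here, the following is a minimal sketch. The sample values for list_of_days and list_of_stores are made up for illustration, since the answer does not show them.

```python
import pandas as pd

# Hypothetical sample data standing in for list_of_days / list_of_stores.
list_of_days = ["2021-01-04", "2021-01-11"]
list_of_stores = [101, 102, 103]

days = pd.DataFrame({"date": list_of_days}).assign(key=0)
stores = pd.DataFrame({"store_id": list_of_stores}).assign(key=0)

# Every row on both sides carries key == 0, so each left row matches
# every right row and no left row is ever left unmatched.
left_join = days.merge(stores, how="left", on="key").drop(columns="key")
inner_join = days.merge(stores, how="inner", on="key").drop(columns="key")

print(len(left_join))   # 2 days * 3 stores = 6 rows
print(len(inner_join))  # same row count as the left merge
```

The two join types would only differ if some key on the left had no match on the right, which cannot happen when the key is a constant on both sides.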

Answer 5

Using method chaining:

product = (
    df1.assign(key=1)
    .merge(df2.assign(key=1), on="key")
    .drop("key", axis=1)
)

Answer 6

Presenting to you:

pandas >= 1.2

left.merge(right, how='cross')

import pandas as pd 

pd.__version__
# '1.2.0'

left = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
right = pd.DataFrame({'col3': [5, 6]}) 

left.merge(right, how='cross')

   col1  col2  col3
0     1     3     5
1     1     3     6
2     2     4     5
3     2     4     6

Indexes are ignored in the result.
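
As a small illustration of what happens to the indexes, here is a sketch with made-up index labels (the labels are an assumption for the example, and pandas >= 1.2 is required for how='cross'):

```python
import pandas as pd

# Hypothetical frames with non-default index labels.
left = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]}, index=["x", "y"])
right = pd.DataFrame({"col3": [5, 6]}, index=[10, 20])

crossed = left.merge(right, how="cross")

# The original labels ("x", "y", 10, 20) do not survive the merge;
# the result gets a fresh default integer index.
print(crossed.index.tolist())  # [0, 1, 2, 3]
```

If the original labels matter, one option is to call reset_index() on each frame first so they travel along as ordinary columns.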

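Implementation-wise, this uses the join-on-a-common-key-column method described in the accepted answer. The upside of using the API is that it saves you a lot of typing and handles some corner cases pretty well. I almost always recommend this syntax as my first choice for a Cartesian product in pandas, unless you are looking for something more performant.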
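
For the "more performant" case, one idea that is sometimes used is to build the row positions with NumPy and index into both frames, skipping the merge machinery. The sketch below is an assumption about what such an approach could look like, not a pandas API; cross_join_numpy is a hypothetical helper name, it has not been benchmarked here, and it assumes the two frames have no overlapping column names.

```python
import numpy as np
import pandas as pd

def cross_join_numpy(left, right):
    """Cross join via positional indexing (hypothetical helper, a sketch only)."""
    # Repeat each left position len(right) times: 0, 0, ..., 1, 1, ...
    left_pos = np.repeat(np.arange(len(left)), len(right))
    # Cycle through the right positions len(left) times: 0, 1, ..., 0, 1, ...
    right_pos = np.tile(np.arange(len(right)), len(left))
    return pd.concat(
        [
            left.iloc[left_pos].reset_index(drop=True),
            right.iloc[right_pos].reset_index(drop=True),
        ],
        axis=1,
    )

left = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})
right = pd.DataFrame({"col3": [5, 6]})
result = cross_join_numpy(left, right)
print(result)
```

Whether this is actually faster than how='cross' depends on the data and the pandas version, so it is worth timing on your own frames before adopting it.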

Answer 7

As an alternative, you can rely on the Cartesian product provided by itertools, itertools.product, which avoids creating a temporary key or modifying the index:

import numpy as np 
import pandas as pd 
import itertools

def cartesian(df1, df2):
    rows = itertools.product(df1.iterrows(), df2.iterrows())

    # Series.append was removed in pandas 2.0; pd.concat joins the two rows instead.
    df = pd.DataFrame(pd.concat([left, right]) for (_, left), (_, right) in rows)
    return df.reset_index(drop=True)

Quick test:

In [46]: a = pd.DataFrame(np.random.rand(5, 3), columns=["a", "b", "c"])

In [47]: b = pd.DataFrame(np.random.rand(5, 3), columns=["d", "e", "f"])    

In [48]: cartesian(a,b)
Out[48]:
           a         b         c         d         e         f
0   0.436480  0.068491  0.260292  0.991311  0.064167  0.715142
1   0.436480  0.068491  0.260292  0.101777  0.840464  0.760616
2   0.436480  0.068491  0.260292  0.655391  0.289537  0.391893
3   0.436480  0.068491  0.260292  0.383729  0.061811  0.773627
4   0.436480  0.068491  0.260292  0.575711  0.995151  0.804567
5   0.469578  0.052932  0.633394  0.991311  0.064167  0.715142
6   0.469578  0.052932  0.633394  0.101777  0.840464  0.760616
7   0.469578  0.052932  0.633394  0.655391  0.289537  0.391893
8   0.469578  0.052932  0.633394  0.383729  0.061811  0.773627
9   0.469578  0.052932  0.633394  0.575711  0.995151  0.804567
10  0.466813  0.224062  0.218994  0.991311  0.064167  0.715142
11  0.466813  0.224062  0.218994  0.101777  0.840464  0.760616
12  0.466813  0.224062  0.218994  0.655391  0.289537  0.391893
13  0.466813  0.224062  0.218994  0.383729  0.061811  0.773627
14  0.466813  0.224062  0.218994  0.575711  0.995151  0.804567
15  0.831365  0.273890  0.130410  0.991311  0.064167  0.715142
16  0.831365  0.273890  0.130410  0.101777  0.840464  0.760616
17  0.831365  0.273890  0.130410  0.655391  0.289537  0.391893
18  0.831365  0.273890  0.130410  0.383729  0.061811  0.773627
19  0.831365  0.273890  0.130410  0.575711  0.995151  0.804567
20  0.447640  0.848283  0.627224  0.991311  0.064167  0.715142
21  0.447640  0.848283  0.627224  0.101777  0.840464  0.760616
22  0.447640  0.848283  0.627224  0.655391  0.289537  0.391893
23  0.447640  0.848283  0.627224  0.383729  0.061811  0.773627
24  0.447640  0.848283  0.627224  0.575711  0.995151  0.804567

Answer 8

If you have no overlapping columns, don't want to add a column, and the indices of the DataFrames can be discarded, this may be easier:

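# Index objects are immutable, so assign a whole new index rather than writing into a slice.
df1.index = [0] * len(df1)
df2.index = [0] * len(df2)
df_cartesian = df1.join(df2, how='outer')
df_cartesian.index = range(len(df_cartesian))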

Answer 9

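Here is a helper function that performs a simple Cartesian product of two DataFrames. It joins on an internal key column and avoids clobbering any column on either side that happens to be named 'key'.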

import pandas as pd

def cartesian(df1, df2):
    """Determine Cartesian product of two data frames."""
    key = 'key'
    while key in df1.columns or key in df2.columns:
        key = '_' + key
    key_d = {key: 0}
    return pd.merge(
        df1.assign(**key_d), df2.assign(**key_d), on=key).drop(key, axis=1)

# Two data frames, where the first happens to have a 'key' column
df1 = pd.DataFrame({'number':[1, 2], 'key':[3, 4]})
df2 = pd.DataFrame({'digit': [5, 6]})
cartesian(df1, df2)

This shows:

   number  key  digit
0       1    3      5
1       1    3      6
2       2    4      5
3       2    4      6

Answer 10

You could start by taking the Cartesian product of df1.col1 and df2.col3, then merge back to df1 to get col2.

Here is a general Cartesian product function that takes a dictionary of lists:

def cartesian_product(d):
    index = pd.MultiIndex.from_product(d.values(), names=d.keys())
    return pd.DataFrame(index=index).reset_index()

Apply it as:

res = cartesian_product({'col1': df1.col1, 'col3': df2.col3})
pd.merge(res, df1, on='col1')
#  col1 col3 col2
# 0   1    5    3
# 1   1    6    3
# 2   2    5    4
# 3   2    6    4

Answer 11

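Another workaround for the current version of pandas (1.1.5): this one is particularly useful if you are starting from a non-DataFrame sequence. I haven't timed it. It doesn't require any artificial index manipulation, but it does require you to repeat the second sequence. It relies on a special property of explode, namely that the right-hand index is repeated.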

from pandas import DataFrame, Series

df1 = DataFrame({'col1': [1,2], 'col2': [3,4]})

series2 = Series(
    [[5, 6]]*len(df1),
    name='col3',
    index=df1.index,
)

df_cartesian = df1.join(series2.explode())

This outputs:

   col1  col2 col3
0     1     3    5
0     1     3    6
1     2     4    5
1     2     4    6
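
Note that the output above keeps the repeated index labels 0, 0, 1, 1, and explode leaves col3 with object dtype. If a clean integer index and an integer column are wanted, a possible follow-up is sketched below; the int64 cast is an assumption that only holds when every exploded value really is an integer.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})
series2 = pd.Series([[5, 6]] * len(df1), name="col3", index=df1.index)

df_cartesian = (
    df1.join(series2.explode())
    .reset_index(drop=True)      # replace the repeated 0, 0, 1, 1 labels
    .astype({"col3": "int64"})   # explode yields an object-dtype column
)
print(df_cartesian)
```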

Answer 12

You can use expand_grid from pyjanitor to replicate a cross join; it offers some speed gains on larger datasets (it uses np.meshgrid under the hood):

# Install first (shell command): pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor as jn
jn.expand_grid(others = {"df1":df1, "df2":df2})

   df1       df2
  col1 col2 col3
0    1    3    5
1    1    3    6
2    2    4    5
3    2    4    6

Answer 13

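If you want the cross product of two Series such that the result is properly indexed by the cross product of their respective indices, you can do this: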

import pandas as pd

def indexed_cross_product(s1: pd.Series, s2: pd.Series):
    assert s1.index.name is not None
    assert s2.index.name is not None
    assert s1.name is not None
    assert s2.name is not None
    idx = pd.MultiIndex.from_product([s1.index, s2.index], names=[s1.index.name, s2.index.name])
    return pd.DataFrame([[s1.loc[a], s2.loc[b]] for a, b in idx] , index=idx, columns=[s1.name, s2.name])
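
To make the shape of the result concrete, here is a sketch with two made-up Series. It builds the same kind of MultiIndex-ed frame, but looks the values up per index level with .loc instead of row by row; that substitution is my assumption, not part of the function above, and it assumes both indexes contain unique labels.

```python
import pandas as pd

# Hypothetical inputs: both the Series and their indexes are named.
s1 = pd.Series([1, 2], index=pd.Index(["a", "b"], name="letter"), name="x")
s2 = pd.Series([10, 20, 30], index=pd.Index([0, 1, 2], name="num"), name="y")

idx = pd.MultiIndex.from_product(
    [s1.index, s2.index], names=[s1.index.name, s2.index.name]
)

# Look up each Series once per level of the product index.
result = pd.DataFrame(
    {
        s1.name: s1.loc[idx.get_level_values(0)].to_numpy(),
        s2.name: s2.loc[idx.get_level_values(1)].to_numpy(),
    },
    index=idx,
)
print(result)
print(result.loc[("b", 1)])  # the row pairing s1["b"] with s2[1]
```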

Answer 14

I find that pandas MultiIndex is the best tool for the job. If you have a list of lists lists_list, call pd.MultiIndex.from_product(lists_list) and iterate over the result (or use it as a DataFrame index).
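
A short sketch of both uses follows. The contents of lists_list are made up for illustration; any list of lists should work the same way.

```python
import pandas as pd

# Hypothetical input.
lists_list = [[1, 2], ["a", "b"], ["x", "y", "z"]]

index = pd.MultiIndex.from_product(lists_list)

# Iterating over the MultiIndex yields one tuple per combination.
combos = list(index)
print(len(combos))  # 2 * 2 * 3 = 12

# Alternatively, turn the combinations into ordinary DataFrame columns.
frame = index.to_frame(index=False)
print(frame.shape)  # (12, 3)
```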
