Finding the remaining rows in a pandas DataFrame


If I use this approach to detect a Y in specific columns:

thing1 = df[df['column1'] == 'Y']

thing2 = df[df['column2'] == 'Y']

thing3 = df[df['column3'] == 'Y']

thing4 = df[df['column4'] == 'Y']

how do I get all the rows that have no Y in any of those columns? I tried something like this:

none_of_the_things = df[(df['column1'] != 'Y') & (df['column2'] != 'Y') & (df[df['column3'] != 'Y']) & (df[df['column4'] != 'Y'])]

But that doesn't work. It fails with an error like this:

numpy.core._exceptions._ArrayMemoryError: Unable to allocate 23.3 TiB for an array with shape (1855862, 1724897) and data type float64
python ./deepseek-v2.5-chart.py  11.91s user 2.81s system 105% cpu 13.996 total

Did I create some kind of monster with this approach? 23.3 TiB sounds like a lot.

Example csv:

RECVDATE,column1,column2,column3,column4
01/01/2024,Y,N,N,N
01/04/2024,N,N,N,N
02/02/2024,N,Y,N,N
02/02/2024,N,Y,N,N
02/04/2024,N,N,N,N
03/03/2024,N,N,Y,N
03/03/2024,N,N,Y,N
03/03/2024,N,N,Y,N
03/04/2024,N,N,N,N
04/04/2024,N,N,N,Y
04/04/2024,N,N,N,Y
04/04/2024,N,N,N,Y
04/04/2024,N,N,N,Y
04/04/2024,N,N,N,Y
04/04/2024,N,N,N,Y
04/04/2024,N,N,N,N
04/04/2024,N,N,N,N

Example Python, with the failing part commented out:

#!/usr/bin/env python3
date_field = 'RECVDATE'
import glob, os
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
dir_path = 'data'
dfs = [pd.read_csv(f, encoding='latin-1', low_memory=False, quotechar='"') for f in glob.glob(os.path.join(dir_path, 'example.csv'))]
df = pd.concat(dfs, axis=0, ignore_index=True)
df[date_field] = pd.to_datetime(df[date_field], errors='coerce')

thing1 = df[df['column1'] == 'Y']
thing1['YearMonth'] = thing1[date_field].dt.to_period('M')
monthly_counts1 = thing1.groupby('YearMonth').size()
monthly_counts1.index = monthly_counts1.index.astype(str)

thing2 = df[df['column2'] == 'Y']
thing2['YearMonth'] = thing2[date_field].dt.to_period('M')
monthly_counts2 = thing2.groupby('YearMonth').size()
monthly_counts2.index = monthly_counts2.index.astype(str)

thing3 = df[df['column3'] == 'Y']
thing3['YearMonth'] = thing3[date_field].dt.to_period('M')
monthly_counts3 = thing3.groupby('YearMonth').size()
monthly_counts3.index = monthly_counts3.index.astype(str)

thing4 = df[df['column4'] == 'Y']
thing4['YearMonth'] = thing4[date_field].dt.to_period('M')
monthly_counts4 = thing4.groupby('YearMonth').size()
monthly_counts4.index = monthly_counts4.index.astype(str)

# none_of_the_things = df[(df['column1'] != 'Y') & (df['column2'] != 'Y') & (df[df['column3'] != 'Y']) & (df[df['column4'] != 'Y'])]
# none_of_the_things['YearMonth'] = none_of_the_things[date_field].dt.to_period('M')
# non_monthly_counts = none_of_the_things.groupby('YearMonth').size()
# non_monthly_counts.index = non_monthly_counts.index.astype(str)

# Plotting
plt.figure(figsize=(12, 6))
monthly_counts1.plot(kind='line', marker='o', label='count1', color='red')
monthly_counts2.plot(kind='line', marker='o', label='count2', color='blue')
monthly_counts3.plot(kind='line', marker='o', label='count3', color='green')
monthly_counts4.plot(kind='line', marker='o', label='count4', color='purple')
# non_monthly_counts.plot(kind='line', marker='o', label='none', color='black')
plt.title('Number of Deaths/LifeThreating/Hospitalizations per Month')
plt.xlabel('Month')
plt.ylabel('Count')
plt.grid(True)
plt.xticks(rotation=45)
plt.legend(title='Legend', loc='upper left')
plt.show()

Tags: python pandas dataframe

1 Answer

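Select the relevant columns (here with df.filter), check for equality with 'Y' using df.eq, chain df.any with axis=1, invert the resulting boolean Series with ~, and use it for boolean indexing: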

out = df[~df.filter(like='column').eq('Y').any(axis=1)]

Output:

      RECVDATE column1 column2 column3 column4
1   01/04/2024       N       N       N       N
4   02/04/2024       N       N       N       N
8   03/04/2024       N       N       N       N
15  04/04/2024       N       N       N       N
16  04/04/2024       N       N       N       N
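
To make the chain easier to follow, here is a minimal sketch that splits it into named steps and runs it on the sample csv from the question. The intermediate names (cols, is_y, any_y) are only illustrative and are not part of the original one-liner.

```python
import io
import pandas as pd

# Sample data copied from the question
csv_text = """RECVDATE,column1,column2,column3,column4
01/01/2024,Y,N,N,N
01/04/2024,N,N,N,N
02/02/2024,N,Y,N,N
02/02/2024,N,Y,N,N
02/04/2024,N,N,N,N
03/03/2024,N,N,Y,N
03/03/2024,N,N,Y,N
03/03/2024,N,N,Y,N
03/04/2024,N,N,N,N
04/04/2024,N,N,N,Y
04/04/2024,N,N,N,Y
04/04/2024,N,N,N,Y
04/04/2024,N,N,N,Y
04/04/2024,N,N,N,Y
04/04/2024,N,N,N,Y
04/04/2024,N,N,N,N
04/04/2024,N,N,N,N
"""
df = pd.read_csv(io.StringIO(csv_text))

cols = df.filter(like='column')  # keep only column1..column4
is_y = cols.eq('Y')              # boolean DataFrame, True where the cell is 'Y'
any_y = is_y.any(axis=1)         # boolean Series, True if the row has at least one 'Y'
out = df[~any_y]                 # keep the rows where no column is 'Y'

print(out.index.tolist())  # [1, 4, 8, 15, 16]
```

The mask any_y has one entry per row of df, so the final indexing step only ever works with a one-dimensional boolean Series.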

Based on your example, you could alternatively use df.all with equality to 'N':

out2 = df[df.filter(like='column').eq('N').all(axis=1)]

out2.equals(out)
# True
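
For comparison, the attempt from the question can also be repaired directly. Its third and fourth terms are df[df['column3'] != 'Y'], which is a filtered DataFrame rather than a boolean Series. Combining a Series and a DataFrame with & makes pandas align the Series index against the DataFrame's columns, which most likely explains the enormous array it tried to allocate. The sketch below uses a few rows taken from the sample csv and assumes that every term should be a plain column comparison.

```python
import io
import pandas as pd

# A few rows taken from the sample csv in the question
csv_text = """RECVDATE,column1,column2,column3,column4
01/01/2024,Y,N,N,N
01/04/2024,N,N,N,N
02/02/2024,N,Y,N,N
03/03/2024,N,N,Y,N
04/04/2024,N,N,N,Y
04/04/2024,N,N,N,N
"""
df = pd.read_csv(io.StringIO(csv_text))

# Every term is a boolean Series (no extra df[...] wrapped around it)
mask = (
    (df['column1'] != 'Y')
    & (df['column2'] != 'Y')
    & (df['column3'] != 'Y')
    & (df['column4'] != 'Y')
)
none_of_the_things = df[mask]

# Same result as the filter/eq/any approach
out = df[~df.filter(like='column').eq('Y').any(axis=1)]
print(none_of_the_things.equals(out))  # True
```

The explicit & chain is fine for a handful of columns; the filter/eq/any form avoids repeating the comparison once per column.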

If your actual df doesn't have uniformly named columns, select them explicitly with df[['column1', 'column2', ...]]
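
As a rough sketch of that variant, the example below uses an explicit column list and then computes the per-month counts that were commented out in the question. The column names died and hospital, the toy data, and the %m/%d/%Y date format are assumptions made for illustration only. The .copy() call is there so that adding the YearMonth column does not trigger a SettingWithCopyWarning.

```python
import pandas as pd

# Toy data; 'died' and 'hospital' are hypothetical column names
df = pd.DataFrame({
    'RECVDATE': ['01/04/2024', '02/02/2024', '02/04/2024',
                 '03/03/2024', '04/04/2024', '04/04/2024'],
    'died':     ['N', 'Y', 'N', 'N', 'N', 'N'],
    'hospital': ['N', 'N', 'N', 'Y', 'N', 'N'],
})
# Assumes month/day/year dates; unparseable values become NaT
df['RECVDATE'] = pd.to_datetime(df['RECVDATE'], format='%m/%d/%Y', errors='coerce')

flag_cols = ['died', 'hospital']  # explicit list instead of df.filter(like=...)

# Rows with no 'Y' in any flag column; copy before adding a new column
rest = df[~df[flag_cols].eq('Y').any(axis=1)].copy()
rest['YearMonth'] = rest['RECVDATE'].dt.to_period('M')

# Count the remaining rows per month, with string labels for plotting
non_monthly_counts = rest.groupby('YearMonth').size()
non_monthly_counts.index = non_monthly_counts.index.astype(str)
print(non_monthly_counts.to_dict())  # {'2024-01': 1, '2024-02': 1, '2024-04': 2}
```

The resulting Series can be plotted in the same way as monthly_counts1 to monthly_counts4 in the question.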
