If I use this approach to detect Y in specific columns:
thing1 = df[df['column1'] == 'Y']
thing2 = df[df['column2'] == 'Y']
thing3 = df[df['column3'] == 'Y']
thing4 = df[df['column4'] == 'Y']
How do I get all the rows that have no Y in any of those columns? I tried something like this:
none_of_the_things = df[(df['column1'] != 'Y') & (df['column2'] != 'Y') & (df[df['column3'] != 'Y']) & (df[df['column4'] != 'Y'])]
But that doesn't work. I get this kind of error:
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 23.3 TiB for an array with shape (1855862, 1724897) and data type float64
python ./deepseek-v2.5-chart.py 11.91s user 2.81s system 105% cpu 13.996 total
Have I created some kind of monster with this approach? 23.3 TiB sounds like a lot.
Example csv:
RECVDATE,column1,column2,column3,column4
01/01/2024,Y,N,N,N
01/04/2024,N,N,N,N
02/02/2024,N,Y,N,N
02/02/2024,N,Y,N,N
02/04/2024,N,N,N,N
03/03/2024,N,N,Y,N
03/03/2024,N,N,Y,N
03/03/2024,N,N,Y,N
03/04/2024,N,N,N,N
04/04/2024,N,N,N,Y
04/04/2024,N,N,N,Y
04/04/2024,N,N,N,Y
04/04/2024,N,N,N,Y
04/04/2024,N,N,N,Y
04/04/2024,N,N,N,Y
04/04/2024,N,N,N,N
04/04/2024,N,N,N,N
Example python, with the failing part commented out:
#!/usr/bin/env python3
date_field = 'RECVDATE'
import glob, os
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
dir_path = 'data'
dfs = [pd.read_csv(f, encoding='latin-1', low_memory=False, quotechar='"') for f in glob.glob(os.path.join(dir_path, 'example.csv'))]
df = pd.concat(dfs, axis=0, ignore_index=True)
df[date_field] = pd.to_datetime(df[date_field], errors='coerce')
thing1 = df[df['column1'] == 'Y']
thing1['YearMonth'] = thing1[date_field].dt.to_period('M')
monthly_counts1 = thing1.groupby('YearMonth').size()
monthly_counts1.index = monthly_counts1.index.astype(str)
thing2 = df[df['column2'] == 'Y']
thing2['YearMonth'] = thing2[date_field].dt.to_period('M')
monthly_counts2 = thing2.groupby('YearMonth').size()
monthly_counts2.index = monthly_counts2.index.astype(str)
thing3 = df[df['column3'] == 'Y']
thing3['YearMonth'] = thing3[date_field].dt.to_period('M')
monthly_counts3 = thing3.groupby('YearMonth').size()
monthly_counts3.index = monthly_counts3.index.astype(str)
thing4 = df[df['column4'] == 'Y']
thing4['YearMonth'] = thing4[date_field].dt.to_period('M')
monthly_counts4 = thing4.groupby('YearMonth').size()
monthly_counts4.index = monthly_counts4.index.astype(str)
# none_of_the_things = df[(df['column1'] != 'Y') & (df['column2'] != 'Y') & (df[df['column3'] != 'Y']) & (df[df['column4'] != 'Y'])]
# none_of_the_things['YearMonth'] = none_of_the_things[date_field].dt.to_period('M')
# non_monthly_counts = none_of_the_things.groupby('YearMonth').size()
# non_monthly_counts.index = non_monthly_counts.index.astype(str)
# Plotting
plt.figure(figsize=(12, 6))
monthly_counts1.plot(kind='line', marker='o', label='count1', color='red')
monthly_counts2.plot(kind='line', marker='o', label='count2', color='blue')
monthly_counts3.plot(kind='line', marker='o', label='count3', color='green')
monthly_counts4.plot(kind='line', marker='o', label='count4', color='purple')
# non_monthly_counts.plot(kind='line', marker='o', label='none', color='black')
plt.title('Number of Deaths/Life-Threatening/Hospitalizations per Month')
plt.xlabel('Month')
plt.ylabel('Count')
plt.grid(True)
plt.xticks(rotation=45)
plt.legend(title='Legend', loc='upper left')
plt.show()
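Select the columns of interest (with df.filter), check for equality with 'Y' using df.eq, chain df.any with axis=1, invert the resulting boolean Series with ~, and use it for boolean indexing: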
out = df[~df.filter(like='column').eq('Y').any(axis=1)]
Output:
RECVDATE column1 column2 column3 column4
1 01/04/2024 N N N N
4 02/04/2024 N N N N
8 03/04/2024 N N N N
15 04/04/2024 N N N N
16 04/04/2024 N N N N
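To make the chain in the one-liner above easier to follow, here is a minimal sketch that splits it into named steps on the first four rows of the example csv. The intermediate names (flags, is_y, any_y, same) are made up for illustration. It also includes the question's attempt rewritten so that every term is a boolean Series; in the original, the inner df[...] around the column3 and column4 comparisons returns a DataFrame rather than a Series, which is presumably what made the & operation try to allocate such a huge array.

```python
import pandas as pd

# First four rows of the example csv.
df = pd.DataFrame({
    'RECVDATE': ['01/01/2024', '01/04/2024', '02/02/2024', '02/04/2024'],
    'column1': ['Y', 'N', 'N', 'N'],
    'column2': ['N', 'N', 'Y', 'N'],
    'column3': ['N', 'N', 'N', 'N'],
    'column4': ['N', 'N', 'N', 'N'],
})

flags = df.filter(like='column')  # keep only columns whose name contains 'column'
is_y = flags.eq('Y')              # boolean DataFrame, same shape as flags
any_y = is_y.any(axis=1)          # boolean Series: True where a row has at least one Y
out = df[~any_y]                  # boolean indexing: rows with no Y at all

# The attempt from the question, with each term a plain boolean Series
# (no extra df[...] wrapped around the column3 and column4 comparisons).
same = df[(df['column1'] != 'Y') & (df['column2'] != 'Y')
          & (df['column3'] != 'Y') & (df['column4'] != 'Y')]
```

Both out and same should contain the same rows; the filter/eq/any version simply scales better when there are many columns.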
Alternatively, check with df.all that every value equals 'N':
out2 = df[df.filter(like='column').eq('N').all(axis=1)]
out2.equals(out)
# True
If your actual df doesn't have uniform column names, select them explicitly with df[['column1', 'column2', ...]].
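As a rough sketch of the explicit-list variant, the snippet below plugs it into the commented-out part of the question's script. It reads a shortened copy of the example csv from a string instead of from data/example.csv, and it assumes the dates are month/day/year (the format argument is my assumption, not something the question states). The .copy() call is there so that adding the YearMonth column does not trigger pandas' SettingWithCopyWarning.

```python
import io
import pandas as pd

# Shortened copy of the example csv, inlined so the snippet is self-contained.
csv_text = """RECVDATE,column1,column2,column3,column4
01/01/2024,Y,N,N,N
01/04/2024,N,N,N,N
02/02/2024,N,Y,N,N
02/04/2024,N,N,N,N
03/03/2024,N,N,Y,N
03/04/2024,N,N,N,N
"""
date_field = 'RECVDATE'
df = pd.read_csv(io.StringIO(csv_text))
# Assumption: dates are month/day/year.
df[date_field] = pd.to_datetime(df[date_field], format='%m/%d/%Y', errors='coerce')

# Explicit column list, for when the names share no common pattern.
cols = ['column1', 'column2', 'column3', 'column4']
none_of_the_things = df[~df[cols].eq('Y').any(axis=1)].copy()

# Same monthly aggregation the question uses for thing1..thing4.
none_of_the_things['YearMonth'] = none_of_the_things[date_field].dt.to_period('M')
non_monthly_counts = none_of_the_things.groupby('YearMonth').size()
non_monthly_counts.index = non_monthly_counts.index.astype(str)
```

After this, non_monthly_counts can be plotted with the same .plot(...) call as the other four series.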