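As far as finding the duplicates goes, I have that figured out. I have a column that flags them as True or False, and then I remove the ones with a specific value. At this point, I just need to keep anything where one column is within a given range of another row's value.
For example: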
       Status  Height  Object  Store
0        Here    100'     ABC    EFG
1  Maybe here     99'     ABC    EFG
2  Maybe here    102'     ABC    JKL
3  Maybe here     99'     ABC    QRS
4        Here     80'     XYZ    QRS
5  Maybe here     78'     XYZ    JKL
Desired output:
       Status  Height  Object  Store
0        Here    100'     ABC    EFG
2  Maybe here    102'     ABC    JKL
3  Maybe here     99'     ABC    QRS
4        Here     80'     XYZ    QRS
5  Maybe here     78'     XYZ    JKL
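The "Maybe here" row at index 1 should be removed because its height is within +/- 4 feet of the "Here" row that has the same Object and Store. Can anyone point me in the right direction?
Thanks.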
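To decide whether to remove a row based on its height, check whether at least one integer in the range [height-threshold, height+threshold] already exists as a key in dictionary (the dict of heights kept so far). If one does, remove the row.
For example, if height=80 and threshold=4, check whether at least one of 76, 77, 78, 79, 80, 81, 82, 83, 84 exists in dictionary. If so, remove that row.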
dictionary = {}  # heights that have been kept so far

def can_i_remove(item, threshold):
    """Return True if a kept height lies within +/- threshold of item."""
    # Scan every integer in [item - threshold, item + threshold].
    for key in range(item - threshold, item + threshold + 1):
        if key in dictionary:
            return True
    # No neighbour found: remember this height and keep the row.
    dictionary[item] = False
    return False

def main():
    threshold = 4
    for height in (100, 96, 95, 104, 105):
        ret = can_i_remove(height, threshold)
        print(str(dictionary) + " -> " + str(height) + " - " + str(ret))

main()
Output:
{100: False} -> 100 - False
{100: False} -> 96 - True
{100: False, 95: False} -> 95 - False
{100: False, 95: False} -> 104 - True
{100: False, 95: False, 105: False} -> 105 - False
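The same neighbourhood check can be applied to the heights from the question. The following is only a minimal sketch: `filter_heights` is a hypothetical helper (not part of the answer above), it assumes the heights are already integers in feet, and it uses a local set instead of a module-level dict. It looks at heights only and ignores the Object and Store columns, so on the question's data it keeps fewer rows than the desired output.

```python
def filter_heights(heights, threshold):
    """Return the positions of heights to keep.

    A height is dropped when an already-kept height lies within
    +/- threshold of it (hypothetical helper, heights only).
    """
    kept = set()      # heights accepted so far
    keep_idx = []     # positions of the accepted heights
    for i, h in enumerate(heights):
        # Look for a kept height in [h - threshold, h + threshold].
        if any(k in kept for k in range(h - threshold, h + threshold + 1)):
            continue
        kept.add(h)
        keep_idx.append(i)
    return keep_idx


heights = [100, 99, 102, 99, 80, 78]
print(filter_heights(heights, 4))
```

Because rows are processed in order, the first height seen in each neighbourhood is the one that survives.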
You can use a NumPy solution: specify the reference values for the +/- 4 range and filter with boolean indexing:
print(df)
       Status  Height  Object
0        Here    100'     ABC
1  Maybe here     99'     ABC
2  Maybe here    102'     ABC
3  Maybe here     99'     ABC
4        Here     80'     XYZ
5  Maybe here     78'     XYZ
import numpy as np

# specify the reference values to check ranges against
vals = [100, 80]
# remove the trailing ' and convert to integer
a = df['Height'].str.strip("'").astype(int)
# convert to a NumPy array, compare by broadcasting, and take absolute differences
arr = np.abs(np.array(vals) - a.values[:, None])
print(arr)
[[ 0 20]
 [ 1 19]
 [ 2 22]
 [ 1 19]
 [20  0]
 [22  2]]
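# create a boolean mask: True where a row is in range of at least one reference value
# (note: `< 4` excludes a difference of exactly 4; use `<= 4` for an inclusive range)
mask = np.any((arr > 0) & (arr < 4), axis=1)
print(mask)
[False  True  True  True False  True]
# invert the condition with ~
print(df[~mask])
   Status  Height  Object
0    Here    100'     ABC
4    Here     80'     XYZ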
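For anyone who wants to run the steps above end to end, here is a self-contained sketch. The DataFrame is rebuilt by hand from the printed table, and the reference values are taken from the rows whose Status is "Here"; that derivation is an assumption made for illustration (the answer hard-codes `vals = [100, 80]`, which is the same pair of values for this data).

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the printed table.
df = pd.DataFrame({
    "Status": ["Here", "Maybe here", "Maybe here", "Maybe here", "Here", "Maybe here"],
    "Height": ["100'", "99'", "102'", "99'", "80'", "78'"],
    "Object": ["ABC", "ABC", "ABC", "ABC", "XYZ", "XYZ"],
})

heights = df["Height"].str.strip("'").astype(int).to_numpy()

# Assumption: reference heights are those of the "Here" rows -> [100, 80].
vals = heights[(df["Status"] == "Here").to_numpy()]

# heights[:, None] has shape (6, 1) and vals has shape (2,), so
# broadcasting gives a (6, 2) matrix of |reference - height|.
arr = np.abs(vals - heights[:, None])

# True where a row is close to, but not equal to, some reference height.
mask = np.any((arr > 0) & (arr < 4), axis=1)

result = df[~mask]
print(result)
```

The `arr > 0` condition is what stops each "Here" row from matching its own height.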
Similarly:
# invert the conditions and check that all values in each row are True
mask = np.all((arr <= 0) | (arr >= 4), axis=1)
print(mask)
[ True False False False  True False]
print(df[mask])
   Status  Height  Object
0    Here    100'     ABC
4    Here     80'     XYZ
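The two masks are complements of each other by De Morgan's laws: "not any(in range)" is the same as "all(not in range)". A small check, using the difference matrix printed above copied in as a literal (the names `drop` and `keep` are just illustrative):

```python
import numpy as np

# Absolute differences copied from the printed arr above.
arr = np.array([[0, 20], [1, 19], [2, 22], [1, 19], [20, 0], [22, 2]])

drop = np.any((arr > 0) & (arr < 4), axis=1)    # rows to remove
keep = np.all((arr <= 0) | (arr >= 4), axis=1)  # rows to keep

# keep should be exactly the element-wise negation of drop.
print(np.array_equal(keep, ~drop))
```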
EDIT:
The solution is similar; just chain in a new boolean mask created by DataFrame.duplicated:
# specify the reference values to check ranges against
vals = [100, 80]
# remove the trailing ' and convert to integer
a = df['Height'].str.strip("'").astype(int)
# convert to a NumPy array, compare by broadcasting, and take absolute differences
arr = np.abs(np.array(vals) - a.values[:, None])
print(arr)
[[ 0 20]
 [ 1 19]
 [ 2 22]
 [ 1 19]
 [20  0]
 [22  2]]
# create a boolean mask: True where a row is in range of at least one reference value
mask1 = np.any((arr > 0) & (arr < 4), axis=1)
print(mask1)
[False  True  True  True False  True]
# flag every row whose (Object, Store) pair occurs more than once
mask2 = df.duplicated(subset=['Object','Store'], keep=False)
print(mask2)
0     True
1     True
2    False
3    False
4    False
5    False
dtype: bool
# a row is dropped only if it is both in range and a duplicate
mask = mask1 & mask2
# invert the condition with ~
print(df[~mask])
       Status  Height  Object  Store
0        Here    100'     ABC    EFG
2  Maybe here    102'     ABC    JKL
3  Maybe here     99'     ABC    QRS
4        Here     80'     XYZ    QRS
5  Maybe here     78'     XYZ    JKL
# invert the conditions and check that all values in each row are True
mask3 = np.all((arr <= 0) | (arr >= 4), axis=1)
print(mask3)
[ True False False False  True False]
# keep a row if it is out of range or not a duplicate
mask = mask3 | ~mask2
print(df[mask])
       Status  Height  Object  Store
0        Here    100'     ABC    EFG
2  Maybe here    102'     ABC    JKL
3  Maybe here     99'     ABC    QRS
4        Here     80'     XYZ    QRS
5  Maybe here     78'     XYZ    JKL
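The chained-mask approach can be wrapped into a single function for reuse. This is only a sketch under stated assumptions: `drop_near_duplicates` and its parameters `tol` and `keys` are hypothetical names introduced here, the Height column is assumed to hold strings with a trailing `'`, and the reference values are still passed in by hand as in the answer.

```python
import numpy as np
import pandas as pd


def drop_near_duplicates(df, vals, tol=4, keys=("Object", "Store")):
    """Drop rows that are duplicated on `keys` AND whose height differs
    from some reference value by more than 0 and less than `tol`."""
    heights = df["Height"].str.strip("'").astype(int).to_numpy()
    # (n_rows, n_vals) matrix of absolute differences via broadcasting.
    diff = np.abs(np.asarray(vals) - heights[:, None])
    near = np.any((diff > 0) & (diff < tol), axis=1)
    # keep=False marks every member of a duplicated group.
    dup = df.duplicated(subset=list(keys), keep=False).to_numpy()
    return df[~(near & dup)]


df = pd.DataFrame({
    "Status": ["Here", "Maybe here", "Maybe here", "Maybe here", "Here", "Maybe here"],
    "Height": ["100'", "99'", "102'", "99'", "80'", "78'"],
    "Object": ["ABC", "ABC", "ABC", "ABC", "XYZ", "XYZ"],
    "Store": ["EFG", "EFG", "JKL", "QRS", "QRS", "JKL"],
})

result = drop_near_duplicates(df, vals=[100, 80])
print(result)
```

On the sample data only row 1 is removed, which matches the desired output in the question.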