How can I add another column with percentages to a Pandas DataFrame? The size of the dictionary can change.
>>> import pandas as pd
>>> a = {'Test 1': 4, 'Test 2': 1, 'Test 3': 1, 'Test 4': 9}
>>> p = pd.DataFrame(list(a.items()))
>>> p
        0  1
0  Test 2  1
1  Test 3  1
2  Test 1  4
3  Test 4  9

[4 rows x 2 columns]
If a percentage out of 10 really is what you want, the simplest way is to adjust how you load the data slightly:
>>> p = pd.DataFrame(list(a.items()), columns=['item', 'score'])
>>> p['perc'] = p['score']/10
>>> p
     item  score  perc
0  Test 2      1   0.1
1  Test 3      1   0.1
2  Test 1      4   0.4
3  Test 4      9   0.9
For actual percentages (each score as a share of the total), use this instead:
>>> p['perc']= p['score']/p['score'].sum()
>>> p
     item  score      perc
0  Test 2      1  0.066667
1  Test 3      1  0.066667
2  Test 1      4  0.266667
3  Test 4      9  0.600000
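Because the dictionary can grow or shrink, it may help to wrap these steps in a small function. The following is only a sketch: `with_percentages` is a hypothetical helper name, not part of pandas, and it assumes the scores do not sum to zero.

```python
import pandas as pd


def with_percentages(scores):
    """Return a frame with one row per dictionary entry plus a 'perc' column.

    Hypothetical helper for illustration; assumes the values do not sum to zero.
    """
    # list(...) turns the items view into an ordinary list of (key, value) pairs
    frame = pd.DataFrame(list(scores.items()), columns=['item', 'score'])
    # Each score divided by the column total gives its share of the whole
    frame['perc'] = frame['score'] / frame['score'].sum()
    return frame


a = {'Test 1': 4, 'Test 2': 1, 'Test 3': 1, 'Test 4': 9}
p = with_percentages(a)
print(p)

# The same call works after the dictionary changes size
a['Test 5'] = 5
p_bigger = with_percentages(a)
print(p_bigger)
```

The shares are recomputed from whatever the dictionary holds at call time, so nothing in the function depends on the number of entries.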
First, make the keys of your dictionary the index of your dataframe:
import pandas as pd
a = {'Test 1': 4, 'Test 2': 1, 'Test 3': 1, 'Test 4': 9}
p = pd.DataFrame([a])
p = p.T  # transpose so the dictionary keys become the index
p.columns = ['score']
Then, compute the percentage and assign it to a new column.
def compute_percentage(x):
    # x is one row; take its score as a share of the column total, in percent
    pct = x['score'] / p['score'].sum() * 100
    return round(pct, 2)

p['percentage'] = p.apply(compute_percentage, axis=1)
This gives you:
        score  percentage
Test 1      4       26.67
Test 2      1        6.67
Test 3      1        6.67
Test 4      9       60.00

[4 rows x 2 columns]
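The row-by-row `apply` can usually be replaced by plain column arithmetic, which pandas evaluates for the whole column at once. A minimal sketch, assuming the same dictionary `a` and building the frame from a `Series` rather than by transposing:

```python
import pandas as pd

a = {'Test 1': 4, 'Test 2': 1, 'Test 3': 1, 'Test 4': 9}

# A Series built from a dict uses the dictionary keys as its index
p = pd.Series(a, name='score').to_frame()

# Divide the whole column by its total, scale to percent, round to 2 places
p['percentage'] = (p['score'] / p['score'].sum() * 100).round(2)
print(p)
```

The column total is computed once here, whereas a function passed to `apply` recomputes it for every row.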
import pandas as pd
data = {'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
# calculate percentage using apply() method and lambda function
df['B_Percentage'] = df['B'].apply(lambda x: (x / df['B'].sum()) * 100)
print(df)
Using a lambda can be useful here, though there are other ways to do this as well. This may help: http://www.pythonpandas.com/how-to-calculate-the-percentage-of-a-column-in-pandas/
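One such alternative is to divide the whole frame by its column sums, which gives the percentages for every column in one step without `apply`. This is a sketch under the assumption that all columns are numeric; the `_Percentage` suffix simply follows the naming used above.

```python
import pandas as pd

data = {'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# df.sum() returns one total per column; the division aligns on column names
pct = (df / df.sum() * 100).round(2).add_suffix('_Percentage')

# Put the new percentage columns next to the original ones
result = df.join(pct)
print(result)
```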
import pandas as pd

df = pd.read_excel("regional cases.xlsx")
df.head()
REGION    CUMILATIVE COUNTS  POPULATION
GREATER               12948     4943075
ASHANTI                4972     5792187
WESTERN                2051     2165241
CENTRAL                1071     2563228
# Cases as a percentage of each region's population (ratio * 100), rounded to 2 places
df['Percentage'] = round(df['CUMILATIVE COUNTS'] / df['POPULATION'] * 100, 2)
df.head()
REGION    CUMILATIVE COUNTS  POPULATION  Percentage
GREATER               12948     4943075        0.26
ASHANTI                4972     5792187        0.09
WESTERN                2051     2165241        0.09
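The spreadsheet itself is not included here, so the calculation can be tried on a small frame built from the rows shown above. This is a sketch with assumed column names (`region`, `cases`, `population` and `pct_of_population` are illustrative, not the spreadsheet's headers); the ratio is multiplied by 100 once to express it as a percentage.

```python
import pandas as pd

# Counts and populations copied from the rows shown above
regions = pd.DataFrame({
    'region': ['GREATER', 'ASHANTI', 'WESTERN', 'CENTRAL'],
    'cases': [12948, 4972, 2051, 1071],
    'population': [4943075, 5792187, 2165241, 2563228],
})

# Cases as a percentage of each region's own population
regions['pct_of_population'] = (
    regions['cases'] / regions['population'] * 100
).round(2)
print(regions)
```

Unlike the earlier answers, the denominator here is a second column rather than the column's own total, so the percentages are not expected to add up to 100.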
When exploring model training data, I use the following approach.
import pandas as pd
d = {"set1":[59268, 6166, 115], "set2":[12700, 9892, 238]}
idx_labels = ["Train", "Validation", "Test"]
df = pd.DataFrame(data=d, index=idx_labels)
df
             set1   set2
Train       59268  12700
Validation   6166   9892
Test          115    238
# Grand total across both sets and all splits, repeated on every row
df["sub-totals"] = df[["set1", "set2"]].sum().sum()
df
             set1   set2  sub-totals
Train       59268  12700       88379
Validation   6166   9892       88379
Test          115    238       88379
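def compute_ratio(df, target, num_decimal: int = 2) -> pd.Series:
    """Return each value's share of the total for a column or row label."""
    if target in df.columns:
        # target is a column: divide it by the column total
        divider = df.loc[:, target].sum()
        pct = df.loc[:, target] / divider
    elif target in df.index:
        # target is a row label: divide the row by its own total
        divider = df.loc[target].sum()
        pct = df.loc[target, :] / divider
    else:
        raise KeyError(f"{target!r} is neither a column nor an index label")
    return pct.round(num_decimal)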
df["set1_ratio"] = compute_ratio(df, "set1")
df["set2_ratio"] = compute_ratio(df, "set2")
df.loc["totals"] = df.loc[:, :].sum()
df
               set1     set2  sub-totals  set1_ratio  set2_ratio
Train       59268.0  12700.0     88379.0        0.90        0.56
Validation   6166.0   9892.0     88379.0        0.09        0.43
Test          115.0    238.0     88379.0        0.00        0.01
totals      65549.0  22830.0    265137.0        0.99        1.00
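The ratios above are taken down each column. If the share of each set within a row is wanted instead, `DataFrame.div` with `axis=0` divides every row by its own total. A minimal sketch using the same counts; the `_share` suffix is an illustrative choice:

```python
import pandas as pd

d = {"set1": [59268, 6166, 115], "set2": [12700, 9892, 238]}
df = pd.DataFrame(data=d, index=["Train", "Validation", "Test"])

# Row totals: one value per split (Train, Validation, Test)
row_totals = df.sum(axis=1)

# axis=0 aligns row_totals with the row index, so each row is
# divided by its own total rather than by a column total
row_shares = df.div(row_totals, axis=0).round(2).add_suffix("_share")
print(df.join(row_shares))
```

Each row of `row_shares` adds up to roughly 1 (up to rounding), which is a quick check that the division ran along the intended axis.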