Getting the Lorenz curve and Gini coefficient in pandas

Question

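How can I get the Lorenz curve and the Gini coefficient using the pandas Python package? Similar posts about the Gini coefficient and the Lorenz curve mostly deal with numpy or R.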

Answer 1

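Here is an example that uses one function to prepare the Lorenz curve and another to compute the Gini coefficient. I use data drawn from a Pareto II (also known as Lomax) distribution, which gives a suitable shape for a Lorenz curve.

Calculation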

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
  

def lorenz_prep(df, x, y):
    # Note: DataFrame.map requires pandas >= 2.1 (use DataFrame.applymap on older versions).
    df_lorenz_curve = (pd.concat([df, df.head(1).map(lambda _: 0)], ignore_index=True)    # Add the origin (0, 0) of the Lorenz curve as its own row.
                       .loc[lambda df_: (df_[y] / df_[x]).fillna(0).sort_values().index]  # Sort rows by income per person (0/0 at the origin becomes 0).
                       .assign(equal_line=lambda df_: df_[x].cumsum() / df_[x].sum(),     # Cumulative population shares.
                               lorenz_curve=lambda df_: df_[y].cumsum() / df_[y].sum(),   # Cumulative income shares.
                               )
                       .set_index("equal_line", drop=False)
                       .rename_axis(None)
                       )
    return df_lorenz_curve


def gini(df):
    """
    The following section was consulted to create this formula:
    https://de.wikipedia.org/wiki/Gini-Koeffizient#Beispiel
    """
    df_g = df.assign(pop_share=df["equal_line"].diff(),
                     income_share=df["lorenz_curve"].diff(),
                     )
    g = 1 - 2 * ((df_g["lorenz_curve"] - df_g["income_share"] / 2) * df_g["pop_share"]).sum()
    return g


# Create an example dataframe
df = pd.DataFrame({"income": (np.random.default_rng(seed=42).pareto(a=1.2, size=200) + 1) * 1500,
                   "number_of_people": 1,
                   })

# Prepare dataframe.
df_res = df.pipe(lorenz_prep, x="number_of_people", y="income")

# Plot lorenz curve.
df_res[["equal_line", "lorenz_curve"]].plot()
plt.show()

# Get Gini coefficient.
print(df_res.pipe(gini))

Result

Lorenz curve: (figure)

Gini coefficient: 0.5471224899542815

Use this to check the gini function.

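Note

The dataset used here is structured so that each row holds the income of one person, so "number_of_people" is always 1. However, the gini formula provided above can also handle data in which one row holds the total income of several people (for example, for income ranges), such as: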

# 5 people earn together 2000 and so on.
df = pd.DataFrame({"income": [2000, 4000, 6000, 15000],
                   "number_of_people": [5, 3, 2, 1,],
                   })
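
As a rough sketch of how such grouped rows could be evaluated, the snippet below condenses the sorting, cumulative shares, and trapezoid sum from lorenz_prep() and gini() into one step. The helper name gini_grouped and the variable names are hypothetical and not part of the answer's code.

```python
import numpy as np
import pandas as pd


def gini_grouped(df, x, y):
    """Trapezoid-rule Gini for rows holding a total income y shared by x people.

    Hypothetical helper: a condensed version of lorenz_prep() followed by gini().
    """
    # Sort the groups by income per person, as lorenz_prep() does.
    order = (df[y] / df[x]).sort_values().index
    d = df.loc[order]
    # Cumulative population and income shares, with the origin (0, 0) prepended.
    pop = np.concatenate([[0.0], d[x].cumsum() / d[x].sum()])
    inc = np.concatenate([[0.0], d[y].cumsum() / d[y].sum()])
    # Area under the Lorenz curve via trapezoids; Gini = 1 - 2 * area.
    area = ((inc[1:] + inc[:-1]) / 2 * np.diff(pop)).sum()
    return 1 - 2 * area


# 5 people earn 2000 together, 3 people earn 4000 together, and so on.
df_grouped = pd.DataFrame({"income": [2000, 4000, 6000, 15000],
                           "number_of_people": [5, 3, 2, 1]})
g_grouped = gini_grouped(df_grouped, x="number_of_people", y="income")
print(g_grouped)
```

For these four rows the trapezoid sum works out to 188/297, roughly 0.633.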

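Alternative Gini function

Here is an alternative function for computing the Gini coefficient, in a form that is more common in the literature. However, this function can only handle datasets of the first kind described above (one income per entity).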

def gini_alternative(df, y):
    """
    Use the raw data for this function, 
    do not use lorenz_prep() on the dataset before running this function.
    This function only works when data is of the form of 1 income per entity or similar.
    """
    dfx = df.sort_values(y)
    dfx.index = pd.RangeIndex(start=1, stop=dfx.index.size + 1)
    return ((2 * dfx.index - dfx.index.size - 1) * dfx[y]).sum() / (dfx.index.size**2 * dfx[y].mean())

print(df.pipe(gini_alternative, y="income"))
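
For one income per entity, the trapezoid-based formula and the rank-weighted formula should give the same value up to floating-point error. The sketch below checks this with plain numpy; the function names gini_trapezoid and gini_rank are hypothetical stand-ins for gini() and gini_alternative(), not code from the answer.

```python
import numpy as np


def gini_trapezoid(incomes):
    """Gini from the Lorenz curve via the trapezoid rule (one income per entity)."""
    x = np.sort(np.asarray(incomes, dtype=float))
    # Cumulative income shares with the origin prepended.
    lorenz = np.concatenate([[0.0], np.cumsum(x) / x.sum()])
    # Each entity covers a population share of 1/n.
    return 1 - ((lorenz[1:] + lorenz[:-1]) / x.size).sum()


def gini_rank(incomes):
    """Gini from the rank-weighted sum, the same expression as gini_alternative()."""
    x = np.sort(np.asarray(incomes, dtype=float))
    n = x.size
    i = np.arange(1, n + 1)  # Ranks 1..n of the sorted incomes.
    return ((2 * i - n - 1) * x).sum() / (n**2 * x.mean())


# Same sample as in the example dataframe above.
incomes = (np.random.default_rng(seed=42).pareto(a=1.2, size=200) + 1) * 1500
g_trap = gini_trapezoid(incomes)
g_rank = gini_rank(incomes)
print(g_trap, g_rank)
```

Expanding the trapezoid sum term by term gives the rank-weighted expression, which is why the two agree here. With grouped rows the ranks no longer correspond to single persons, so only the trapezoid form applies.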

Mathematical notation¹,²:
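
Written out to match the expression computed by gini_alternative(), and assuming x_i denotes the incomes sorted in ascending order, n the number of entities, and x̄ their mean, the formula reads:

```latex
G = \frac{\sum_{i=1}^{n} (2i - n - 1)\, x_i}{n^{2}\, \bar{x}}
```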

¹https://mathworld.wolfram.com/GiniCoefficient.html
²http://dx.doi.org/10.2307/177185
