Comparing DataFrame columns: TypeError: Cannot interpret 'StringDtype' as a data type


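I am trying to compare the columns and dtypes of two DataFrames to check that they are equal. The rows are expected to differ.

I am using pandas version 1.1.2: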

pd.__version__
'1.1.2'

if (df1.columns.difference(df2.columns).empty
        and (df1.dtypes == df2.dtypes).all()):
    ...

But this line raises an error:

(df1.dtypes == df2.dtypes).all()


Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/vfrank/dev-working/venv/lib/python3.7/site-packages/pandas/core/ops/common.py", line 65, in new_method
    return method(self, other)
  File "/Users/vfrank/dev-working/venv/lib/python3.7/site-packages/pandas/core/ops/__init__.py", line 370, in wrapper
    res_values = comparison_op(lvalues, rvalues, op)
  File "/Users/vfrank/dev-working/venv/lib/python3.7/site-packages/pandas/core/ops/array_ops.py", line 239, in comparison_op
    res_values = comp_method_OBJECT_ARRAY(op, lvalues, rvalues)
  File "/Users/vfrank/dev-working/venv/lib/python3.7/site-packages/pandas/core/ops/array_ops.py", line 53, in comp_method_OBJECT_ARRAY
    result = libops.vec_compare(x.ravel(), y.ravel(), op)
  File "pandas/_libs/ops.pyx", line 169, in pandas._libs.ops.vec_compare
TypeError: Cannot interpret 'StringDtype' as a data type

What is going on, and how can I fix it?

Update: the dtypes of the first DataFrame:

df1.dtypes
update_timestamp                 string
issue_type_desc                  string
id                                Int64
project_key                      string
teamname                         string
issue_key                        string
summary                          string
created_at                       string
updated_at                       string
updated_at_utc                   string
status_updated_at                string
resolved_at                      string
closed_at                        string
due_date                         string
age_ts                            Int64
resolution_seconds               string
reporter_name                    string
assignee                         string
issue_type_id                     Int64
priority_id                       Int64
status_id                         Int64
status_detail                    string
status_category                  string
resolution_id                    string
security_level_id                 Int64
parent_issue_key                 string
subtask_parent_key               string
subtask_parent_id                 Int64
subtask_key                      string
epickey                          string
epic_id                           Int64
epic_parent_key                  string
customfield_fversion             string
customfield_10004                string
epicdesc                         string
customfield_10005                string
sprint                           string
customfield_10003_original       string
customfield_10003_status         string
storypoint                      float64
storyhours                      float64
subtaskhours                     string
subcategory                      string
customfield_feature              string
zephyr_last_executed_at          string
zephyr_last_executed_by          string
zephyr_last_execution_status     string
fix_versions                     string
deleted                           Int64
lastupdateddatetime              string
dtype: object

And the dtypes of the second DataFrame:

df2.dtypes
update_timestamp                string
issue_type_desc                 string
id                               Int64
project_key                     string
teamname                        string
issue_key                       string
summary                         string
created_at                      string
updated_at                      string
updated_at_utc                  string
status_updated_at               string
resolved_at                     string
closed_at                       string
due_date                        string
age_ts                           Int64
resolution_seconds              string
reporter_name                   string
assignee                        string
issue_type_id                    Int64
priority_id                      Int64
status_id                        Int64
status_detail                   string
status_category                 string
resolution_id                   string
security_level_id                Int64
parent_issue_key                string
subtask_parent_key              string
subtask_parent_id                Int64
subtask_key                     string
epickey                         string
epic_id                          Int64
epic_parent_key                 string
customfield_fversion            string
customfield_10004               string
epicdesc                        string
customfield_10005               string
sprint                          string
customfield_10003_original      string
customfield_10003_status        string
storypoint                      string
storyhours                      string
subtaskhours                    string
subcategory                     string
customfield_feature             string
zephyr_last_executed_at         string
zephyr_last_executed_by         string
zephyr_last_execution_status    string
fix_versions                    string
deleted                          Int64
lastupdateddatetime             string
dtype: object

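Update: So far I have been able to work around the error by writing a function that does the comparison, but this does not seem like an ideal solution. I am hoping there is a way to compare dtypes that works with these new extension types: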

import logging

import pandas as pd

logger = logging.getLogger(__name__)


def compare_dataframe_column_dtypes(df_compare1, df_compare2):
    """Return True if both DataFrames have the same column dtypes.

    Since pandas 1.0.0 introduced StringDtype, `df1.dtypes == df2.dtypes`
    can fail on StringDtype, so catch the error, convert StringDtype
    columns to the pre-1.0.0 str (object) dtype, and compare again.
    """
    # Work on copies so the callers' DataFrames are left untouched.
    df1 = df_compare1.copy()
    df2 = df_compare2.copy()
    try:
        return bool((df1.dtypes == df2.dtypes).all())
    except TypeError as ex:
        logger.error("%s. Converting StringDtype columns to str and comparing again", ex)
        for df in (df1, df2):
            for column in df:
                if pd.StringDtype.is_dtype(df[column]):
                    df[column] = df[column].astype(str)
        return bool((df1.dtypes == df2.dtypes).all())

1 Answer

I stumbled on this quite late, but you may be able to convert the dtypes to dictionaries and compare those:

if dict(df1.dtypes) == dict(df2.dtypes):
    return True
return False
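
A variation on the dict idea, sketched under the assumption that matching dtype names is a good enough notion of equality: comparing the string names of the dtypes means a NumPy dtype is never compared directly against an extension dtype. The helper name `dtype_names` and the sample frames are made up for illustration and are not the data from the question.

```python
import pandas as pd


def dtype_names(df: pd.DataFrame) -> dict:
    """Map each column name to the string name of its dtype."""
    return {column: str(dtype) for column, dtype in df.dtypes.items()}


# Hypothetical sample frames: same columns, but "storypoint" differs in dtype.
df1 = pd.DataFrame({
    "id": pd.array([1, 2], dtype="Int64"),
    "summary": pd.array(["a", "b"], dtype="string"),
    "storypoint": [1.0, 2.0],
})
df2 = pd.DataFrame({
    "id": pd.array([3, 4], dtype="Int64"),
    "summary": pd.array(["c", "d"], dtype="string"),
    "storypoint": pd.array(["1.0", "2.0"], dtype="string"),
})

# Plain dict-of-strings comparison; only str == str is evaluated.
same_dtypes = dtype_names(df1) == dtype_names(df2)
```

Because the dicts are keyed by column name, this also comes out unequal when the two frames do not have the same set of columns.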

https://docs.python.org/3/library/stdtypes.html#mapping-types-dict
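
Another option that may be worth trying is `pandas.api.types.is_dtype_equal`, which compares a single pair of dtypes and, as far as I can tell, returns False instead of raising when the two cannot be compared. The sketch below assumes unique column names; `mismatched_dtypes`, `same_schema`, and the sample data are hypothetical. It also uses `symmetric_difference` for the column check, because `df1.columns.difference(df2.columns)` only reports columns that are missing from `df2`, not extra columns that `df2` has.

```python
import pandas as pd
from pandas.api.types import is_dtype_equal


def mismatched_dtypes(left: pd.DataFrame, right: pd.DataFrame) -> dict:
    """Return {column: (left dtype, right dtype)} for shared columns that differ."""
    shared = left.columns.intersection(right.columns)
    return {
        column: (left[column].dtype, right[column].dtype)
        for column in shared
        if not is_dtype_equal(left[column].dtype, right[column].dtype)
    }


def same_schema(left: pd.DataFrame, right: pd.DataFrame) -> bool:
    """True when both frames have the same column names and matching dtypes."""
    same_columns = left.columns.symmetric_difference(right.columns).empty
    return same_columns and not mismatched_dtypes(left, right)


# Hypothetical sample frames, not the data from the question.
df1 = pd.DataFrame({
    "id": pd.array([1, 2], dtype="Int64"),
    "storypoint": [1.0, 2.0],
})
df2 = pd.DataFrame({
    "id": pd.array([3, 4], dtype="Int64"),
    "storypoint": pd.array(["1.0", "2.0"], dtype="string"),
})

print(same_schema(df1, df2))
print(mismatched_dtypes(df1, df2))
```

Returning the mismatched columns rather than a bare boolean makes it easier to see which columns (here `storypoint`) caused the check to fail.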
