How can I classify items in a dataframe based on their titles?


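I have a dataframe, and I want to classify energy-related projects into 4 different topics based on their titles. To do this, I want to use predefined keywords to identify which topic a project relates to and classify it accordingly.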

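The topics and keywords are:

  1. Energy efficiency: efficiency AND (procedure* OR process* OR equipment* OR system* OR motor*)

  2. Fossil fuels: oil OR (natural AND gas) OR LNG OR carbon

  3. Renewable energy: (solar AND energy) OR (wind AND energy) OR (renewable AND energy) OR biofuel* OR biomass*

  4. Nuclear energy: nuclear AND power AND plant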

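Using the keywords, I want to create a new column that states the project's topic. A project can be classified under more than one topic.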

I have df['title'], and I want to create df['topic'] with values based on my predefined keywords.

EXAMPLE

Project 1:

INPUT 
df['title'] = 'Integral Use of Biomass in Electric Power Generation and Use of Heat to Increase Energy Efficiency'
OUTPUT 
df['topic']= [1,3]


Project 2:

INPUT 
df['title'] = 'simulation of power plant based on nuclear fusion'
OUTPUT 
df['topic'] = [4]


Project 3:
INPUT 
df['title'] =  'new industrial processes in the oil sector to increase energy efficiency'
OUTPUT
df['topic']=[1,2]

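I think regular expressions would be a good way to identify the keywords in each project title.

I tried using isin, but I don't know how to set up the regular expressions or how to combine them.

I'm also open to other approaches to classifying the projects!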

python pandas dataframe classification
1 Answer

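This looks mainly like an implementation problem. We can create two helper functions, _and and _or: _and checks whether all of the keywords passed to it appear in the title, and _or checks whether any of them do.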

import pandas as pd


def get_topic(title: str) -> list[int]:
    """
    RULES:
    energy efficiency : efficiency AND (procedure* OR process* OR equipment* OR system* OR motor*)

    fossil fuels: oil OR (natural AND gas) OR LNG OR carbon

    renewable energy: (solar AND energy) OR (wind AND energy) OR (renewable AND energy) OR biofuel* OR biomass*

    nuclear energy: nuclear AND power AND plant
    """

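    # Both helpers do plain substring checks against the lowercased title, so
    # "process" also matches "processes" (the * wildcard in the rules), but a
    # short keyword such as "oil" would also match inside "boiler".
    def _and(keywords: list[str]) -> bool:
        return all(keyword in title for keyword in keywords)

    def _or(keywords: list[str]) -> bool:
        return any(keyword in title for keyword in keywords)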

    title = title.lower()
    suggestions = []

    if _and(["efficiency"]) and _or(
        ["procedure", "process", "equipment", "system", "motor"]
    ):
        suggestions.append(1)

    if _or(["oil", "lng", "carbon"]) or _and(["natural", "gas"]):
        suggestions.append(2)

    if (
        _and(["solar", "energy"])
        or _and(["wind", "energy"])
        or _and(["renewable", "energy"])
        or _and(["biofuel"])
        or _and(["biomass"])
    ):
        suggestions.append(3)

    if _and(["nuclear", "power", "plant"]):
        suggestions.append(4)

    return suggestions


df = pd.DataFrame(
    {
        "title": [
            "Integral Use of Biomass in Electric Power Generation and Use of Heat to Increase Energy Efficiency",
            "simulation of power plant based on nuclear fusion",
            "new industrial processes in the oil sector to increase energy efficiency",
        ]
    }
)

df["topic"] = df["title"].apply(get_topic)

print(df)
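
Since the question asks about regular expressions, here is a minimal sketch of the same rules using whole-word regex matching instead of substring checks. The names TOPIC_RULES, has_word and get_topics_regex are hypothetical, not part of the answer above, and the trailing \w* is my assumption about how the * wildcard in the keyword list should behave.

```python
import re

# Hypothetical rule table: each topic maps to a list of alternatives (OR);
# each alternative is a list of patterns that must all match (AND).
# A trailing \w* stands in for the * wildcard from the question.
TOPIC_RULES = {
    1: [[r"efficiency", r"(?:procedure|process|equipment|system|motor)\w*"]],
    2: [[r"oil"], [r"natural", r"gas"], [r"lng"], [r"carbon"]],
    3: [
        [r"solar", r"energy"],
        [r"wind", r"energy"],
        [r"renewable", r"energy"],
        [r"biofuel\w*"],
        [r"biomass\w*"],
    ],
    4: [[r"nuclear", r"power", r"plant"]],
}


def has_word(pattern: str, title: str) -> bool:
    """Return True if the pattern matches a whole word, ignoring case."""
    return re.search(rf"\b{pattern}\b", title, flags=re.IGNORECASE) is not None


def get_topics_regex(title: str) -> list[int]:
    """Return the topic numbers whose rule matches the title."""
    return [
        topic
        for topic, alternatives in TOPIC_RULES.items()
        if any(all(has_word(p, title) for p in group) for group in alternatives)
    ]


titles = [
    "Integral Use of Biomass in Electric Power Generation and Use of Heat to Increase Energy Efficiency",
    "simulation of power plant based on nuclear fusion",
    "new industrial processes in the oil sector to increase energy efficiency",
]
for t in titles:
    print(get_topics_regex(t), t)
```

It can be applied to the dataframe the same way as get_topic, with df["title"].apply(get_topics_regex). Note that under the rules as written, the first example title matches only topic 3, because it contains none of procedure*, process*, equipment*, system* or motor*; the rule for topic 1 would need to be loosened if that title is meant to count as energy efficiency.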