Expected string or bytes-like object, got 'float'

Question (votes: 0, answers: 1)

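I am trying to build an ETL (extract, transform, load) pipeline in Python. I have a dataset of Amazon reviews, but when I use the DataFrame.apply() method to apply a function that uses a regular expression, I get this error: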

expected string or bytes-like object, got 'float'

The code I am using is as follows:

import pandas as pd
import pathlib
#from sqlalchemy import create_engine
import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

#Create the pattern for regex ETL process
pattern = re.compile(r"[\u0041-\u1EFF\s]+\s?")

def iterator_func (x):
    match = pattern.search(x[1])
    return "".join(i for i in match.groups() if i not in stop_words)


try:
    #Open the database, create a connection and upload the data to a database after the ETL process.
    with open(pathlib.Path("database\\test.csv"), encoding="utf-8") as f:
        csv_table = pd.read_csv(f, header=None)

    #Remove incorrect values from the first index, stop words and punctuation characters using regex and nltk
    csv_table[1] = csv_table.apply(iterator_func)
    csv_table[2] = csv_table[2].apply(iterator_func)

You can download and view the dataset here: Amazon reviews on Kaggle

I tried iterating over each row manually, and that worked, but I noticed it causes serious performance problems.

for x in csv_table.index:
    # Drop rows whose label is neither 1 nor 2, then move on to the next row.
    if str(csv_table.loc[x, 0]) not in ("1", "2"):
        csv_table.drop(x, inplace=True, errors="ignore")
        continue
    #TODO: Create a regex function to avoid numbers, punctuation and stop words.
    temp_phrase = "".join(i for i in pattern.findall(csv_table.loc[x, 1]) if i not in stop_words)

    temp_phrase_two = "".join(i for i in pattern.findall(csv_table.loc[x, 2]) if i not in stop_words)

    csv_table.loc[x, 1] = temp_phrase

    csv_table.loc[x, 2] = temp_phrase_two
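The row-by-row loop above could probably be replaced by a boolean mask plus a per-column map, which avoids repeated .loc lookups and in-place drops. The following is only a minimal sketch under several assumptions: the labels in column 0 are the integers 1 and 2, STOP_WORDS is a tiny hypothetical stand-in for the nltk stop-word set, and WORD is a simplified word pattern rather than the pattern from the question (it splits the text into words, which is a different technique from matching long character runs).

```python
import re

import pandas as pd

# Hypothetical stand-in for set(stopwords.words('english')).
STOP_WORDS = {"the", "a", "is", "and"}
# Simplified pattern (an assumption): one match per run of ASCII letters.
WORD = re.compile(r"[A-Za-z]+")


def clean_text(text):
    """Keep only words that are not stop words, joined by single spaces."""
    words = WORD.findall(text)
    return " ".join(w for w in words if w.lower() not in STOP_WORDS)


# Small made-up frame shaped like the review data: label, title, body.
df = pd.DataFrame({
    0: [1, 2, 3],
    1: ["The best!", None, "Bad"],
    2: ["It is a great phone", "Broke and died", "n/a"],
})

# Keep only rows labelled 1 or 2 using one boolean mask instead of drop().
df = df[df[0].isin([1, 2])].copy()

# Replace missing values with "" so the regex always receives a str.
for col in (1, 2):
    df[col] = df[col].fillna("").map(clean_text)

print(df)
```

The function still runs once per cell, so this is not fully vectorized, but filtering with a mask and assigning whole columns is usually much cheaper than modifying the frame one row at a time.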
python pandas dataframe csv etl
1 answer
Votes: 0

I simply converted the column to the correct type, and that worked.

csv_table[1] = csv_table[1].astype("str")
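A likely explanation, assuming some review titles are empty in the CSV: pandas reads empty cells as NaN, which is a float, and the functions in re only accept str or bytes. The sketch below reproduces the error on a made-up two-row column and shows fillna("") as an alternative to the cast. The pattern and the keep_letters helper are simplified and hypothetical, not the ones from the question. Note that, depending on the pandas version, astype("str") may turn missing values into the literal text "nan", which fillna("") avoids.

```python
import re

import numpy as np
import pandas as pd

# Simplified, hypothetical pattern: runs of ASCII letters and whitespace.
pattern = re.compile(r"[A-Za-z\s]+")


def keep_letters(text):
    """Join every run of letters/whitespace found in the text."""
    return "".join(pattern.findall(text))


# Passing a float (which is what NaN is) to a regex function raises TypeError.
try:
    keep_letters(float("nan"))
except TypeError as exc:
    print("TypeError:", exc)

# Made-up column with one missing title, as pandas would read an empty cell.
titles = pd.Series(["Great product", np.nan])
print(titles.isna().tolist())

# Replace missing values with an empty string before applying the function.
cleaned = titles.fillna("").apply(keep_letters)
print(cleaned.tolist())
```

Checking csv_table[1].isna().sum() before cleaning is a quick way to confirm whether missing values are actually the cause in the real data.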