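I'm trying to build an ETL (extract, transform, load) pipeline in Python. I have a dataset of Amazon reviews, but when I use DataFrame.apply() to run a regex-based function over it, I get this error: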
expected string or bytes-like object, got 'float'
The code I'm using is:
import pandas as pd
import pathlib
#from sqlalchemy import create_engine
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
# Create the regex pattern for the ETL process
pattern = re.compile(r"[\u0041-\u1EFF\s]+\s?")
def iterator_func(x):
    match = pattern.search(x[1])
    return "".join(i for i in match.groups() if i not in stop_words)
try:
    # Open the database, create a connection and upload the data to a database after the ETL process.
    with open(pathlib.Path("database\\test.csv"), encoding="utf-8") as f:
        csv_table = pd.read_csv(f, header=None)
    # Remove incorrect values from the first index, stop words and punctuation characters using regex and nltk
    csv_table[1] = csv_table.apply(iterator_func)
    csv_table[2] = csv_table[2].apply(iterator_func)
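A minimal sketch of where the error most likely comes from, assuming the CSV contains empty cells: pandas reads those as NaN, which is a float, and the `re` functions only accept strings or bytes. It also shows that `match.groups()` is always empty for this pattern, because the pattern has no capture groups. The `clean_text` helper and the tiny `STOP_WORDS` set are hypothetical stand-ins, not part of my original code.

```python
import re

pattern = re.compile(r"[\u0041-\u1EFF\s]+\s?")

# A missing CSV cell arrives as NaN, which is a float, not a str.
missing = float("nan")
try:
    pattern.search(missing)
except TypeError as exc:
    error_message = str(exc)  # exact wording varies by Python version

# The pattern has no capture groups, so groups() is an empty tuple.
match = pattern.search("Great product")
groups = match.groups()

# Hypothetical stand-in for the NLTK stop-word set.
STOP_WORDS = {"the", "is", "a"}

def clean_text(value):
    """Drop stop words from a string; any non-string value becomes ''."""
    if not isinstance(value, str):
        return ""
    # Collect every matching run, then split the runs into single words.
    words = " ".join(pattern.findall(value)).split()
    return " ".join(w for w in words if w.lower() not in STOP_WORDS)
```

Guarding with `isinstance` is only one option; it keeps the row and stores an empty string instead of raising.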
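You can download and view the dataset here: Amazon Reviews on Kaggle
I tried iterating over each row manually, which worked, but I noticed it causes serious performance problems.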
for x in csv_table.index:
    # Drop rows whose label is neither "1" nor "2", then skip them
    if csv_table.loc[x, 0] not in ("1", "2"):
        csv_table.drop(x, inplace=True, errors="ignore")
        continue
    # TODO: Create a regex function to avoid numbers, punctuation and stop words.
    temp_phrase = "".join(i for i in pattern.findall(csv_table.loc[x, 1]) if i not in stop_words)
    temp_phrase_two = "".join(i for i in pattern.findall(csv_table.loc[x, 2]) if i not in stop_words)
    csv_table.loc[x, 1] = temp_phrase
    csv_table.loc[x, 2] = temp_phrase_two
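The row-by-row loop above could probably be replaced by a boolean mask plus a per-column `Series.apply`, which avoids repeated `drop` and `.loc` calls. This is only a sketch: the sample rows are made up, `clean_text` and `STOP_WORDS` are hypothetical names, and it assumes the labels in column 0 were parsed as integers.

```python
import re
import pandas as pd

pattern = re.compile(r"[\u0041-\u1EFF\s]+\s?")
STOP_WORDS = {"the", "is", "a"}  # hypothetical stand-in for the NLTK set

def clean_text(value):
    """Drop stop words from a string; any non-string value becomes ''."""
    if not isinstance(value, str):
        return ""
    words = " ".join(pattern.findall(value)).split()
    return " ".join(w for w in words if w.lower() not in STOP_WORDS)

# Made-up sample rows: label, review title, review body.
csv_table = pd.DataFrame({
    0: [1, 2, 7],
    1: ["The best!", None, "Bad row"],
    2: ["Is a great buy", "Broke in 2 days", "ignored"],
})

# Keep only rows labelled 1 or 2 in a single vectorised step.
cleaned = csv_table[csv_table[0].isin([1, 2])].copy()

# Clean each text column once, instead of cell by cell.
for col in (1, 2):
    cleaned[col] = cleaned[col].apply(clean_text)
```

If the labels are actually strings in the file, the mask would need `isin(["1", "2"])` instead.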
In the end I simply converted the column to the correct type, and that worked:
csv_table[1] = csv_table[1].astype("str")
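One caveat with the cast, as far as I can tell: depending on the pandas version, `astype("str")` may turn missing values into the literal text "nan", which would then pass through the regex as if it were a real word. A small sketch with made-up data, filling the gaps with empty strings before casting:

```python
import pandas as pd

# Made-up column with one missing title, similar to what read_csv produces.
titles = pd.Series(["Great", float("nan"), "Poor"])

# Replace missing values with empty strings first, then cast to str.
safe_titles = titles.fillna("").astype("str")

# Every value is now a real string, so regex functions accept it.
all_strings = all(isinstance(v, str) for v in safe_titles)
```

Whether an empty string or dropping the row is the better choice depends on what the later load step expects.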