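I am trying to separate two team names that are joined together in each cell of a column, and I would appreciate some help coming up with a way to split them.
As you can see from the code below, I am importing data from a website and cleaning the DataFrame.
What I want to achieve is a new column, which is the
df_games_2023['text_only_away']
column with the df_games_2023['text_only']
column stripped out of it. So the new column df_games_2023['new_text_column']
would be "Mississippi Valley St.", "Brescia", "Pacific", etc.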
import pandas as pd
import re
# URL of the CSV file on the website
url = "https://www.barttorvik.com/2023_results.csv" # Replace with the actual URL
# Name the columns
colnames = ['matchup', 'date','home_team', 'xyz', 'xyz1','away_score','home_score','xyz2','xyz3','xyz4','xyz5']
# Read CSV data into pandas DataFrame
df_games_2023 = pd.read_csv(url, names = colnames)
#eliminate columns from dataframe
df_games_2023 = df_games_2023[['matchup','date', 'home_team', 'away_score','home_score']]
#name the dataframe columns
#df_games_2023.columns = ['matchup', 'date','home_team', 'xyz', 'xyz1','away_score','home_score','xyz2','xyz3','xyz4','xyz5']
#clean up the home_team data
# Extract only text using regex
df_games_2023['text_only'] = df_games_2023['home_team'].apply(lambda x: re.sub(r'\d+', '', x))
# Define the phrases to drop
phrases_to_drop = [',','-','.','(',')','%']
# Drop the specified phrases from the column
for phrase in phrases_to_drop:
    df_games_2023['text_only'] = df_games_2023['text_only'].str.replace(phrase, '', regex=True)
#Clean up away team
df_games_2023['text_only_away'] = df_games_2023['matchup'].apply(lambda x: re.sub(r'\d+', '', x))
#we are removing a random '-' with this string of code
df_games_2023['text_only_away'] = df_games_2023['text_only_away'].apply(lambda x: x.rstrip('-'))
# Now you have your DataFrame ready
df_games_2023
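The code above works fine, but the problem comes when I try to use logic to isolate a single team name from the
df_games_2023['text_only_away']
column. Below is the code I am using to create the new column by stripping ['text_only'] out of ['text_only_away']: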
def remove_data(row):
    text_column = row['text_only_away']
    phrase = row['text_only']
    if text_column.endswith(phrase):
        return text_column[:-len(phrase)].rstrip()
    else:
        return text_column
# Apply the function to each row and create a new column
df_games_2023['new_text_column'] = df_games_2023.apply(remove_data, axis=1)
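For illustration, here is how this function behaves on a small hand-made frame. The rows are hypothetical and are not taken from the CSV. The sketch shows that the suffix is removed only when it matches exactly, so a stray trailing space in 'text_only' leaves the row unchanged.

```python
import pandas as pd

def remove_data(row):
    text_column = row['text_only_away']
    phrase = row['text_only']
    # Strip the phrase only when it is an exact suffix of the combined text
    if text_column.endswith(phrase):
        return text_column[:-len(phrase)].rstrip()
    return text_column

# Hypothetical rows, not taken from the CSV
toy = pd.DataFrame({
    'text_only_away': ['BresciaKentucky', 'PacificStanford'],
    'text_only': ['Kentucky', 'Stanford '],  # second value has a trailing space
})

toy['new_text_column'] = toy.apply(remove_data, axis=1)
print(toy['new_text_column'].tolist())
```

If the real 'text_only' values carry leading or trailing whitespace after cleaning, normalising them first (for example with .str.strip()) may be needed before the suffix comparison can succeed.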
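Any help on how to create a new column with the team other than the one listed in ['text_only'] would be greatly appreciated. Thank you in advance!
I am expecting a new column on df_games_2023 = pd.DataFrame({'new_text_column': ['Mississippi Valley St.', 'Brescia', 'Pacific', etc..]})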
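You did not state the actual symptom of the failure, but the code that strips punctuation from each string raises an exception, because '(' on its own is not a valid regular expression.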
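A minimal sketch of that failure, using only the standard re module; the sample string is made up for illustration. It also shows a second pitfall in the same list of phrases: an unescaped '.' is a regex wildcard, so as a pattern it matches every character rather than a literal period. Escaping each phrase with re.escape is one way (an assumption here, not something the question uses) to have them treated literally.

```python
import re

sample = "Baylor (100%)."  # hypothetical value, not taken from the CSV

# A lone "(" opens a group that is never closed, so the pattern fails to compile.
try:
    re.sub("(", "", sample)
    paren_error = None
except re.error as exc:
    paren_error = exc

# An unescaped "." matches any character (except newline), so everything is removed.
wiped = re.sub(".", "", sample)

# Escaping turns each phrase into a literal pattern.
literal = sample
for phrase in [",", "-", ".", "(", ")", "%"]:
    literal = re.sub(re.escape(phrase), "", literal)

print(repr(paren_error), repr(wiped), repr(literal))
```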
In any case, here is a more compact regex that does all of the replacements in one pass:
df_games_2023['text_only'].str.replace(r'[-,\.()%]', '', regex=True)
                                                         ^^^^^^^^^^
Note that we have to pass regex=True explicitly to override the default, regex=False (the default since pandas 2.0).
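As a quick check, the character class can be tried on a small hand-made Series; the values below are hypothetical stand-ins for df_games_2023['text_only']. Inside the square brackets the parentheses, period and percent sign are treated as literals, and a leading '-' is literal too, so the backslash before the period is harmless but not strictly required.

```python
import pandas as pd

# Hypothetical values standing in for df_games_2023['text_only']
s = pd.Series(["Baylor (100%)", "St. John's, NY-1"])

# One pass: remove every '-', ',', '.', '(', ')' and '%' character
cleaned = s.str.replace(r"[-,\.()%]", "", regex=True)
print(cleaned.tolist())
```

Chaining .str.strip() afterwards would also drop any leading or trailing spaces left behind by the removed punctuation.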