I am working with a pandas DataFrame that has two columns, "tweet_text" and "cyberbullying_type". It was created from this dataset as follows:
import pandas as pd
df = pd.read_csv('data/cyberbullying_tweets.csv')
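I am trying to count the total number of hashtags used in each "cyberbullying_type" with two different methods, both of which I believe count duplicates. However, the two methods give me different answers: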
import re

# Define the pattern for valid hashtags
hashtag_pattern = r'#[A-Za-z0-9]+'

# Function to count the total number of hashtags in a dataframe
def count_total_hashtags(dataframe):
    return dataframe['tweet_text'].str.findall(hashtag_pattern).apply(len).sum()

for category in df['cyberbullying_type'].unique():
    count = count_total_hashtags(df[df['cyberbullying_type'] == category])
    print(f"Number of hashtags in all tweets for the '{category}' category: {count}")
Output:
'not_cyberbullying': 3265, 'gender': 2691, 'religion': 1798, 'other_cyberbullying': 1625, 'age': 728, 'ethnicity': 1112,
The second method is more manual:
def count_hashtags_by_category(dataframe):
    hashtag_counts = {}
    for category in dataframe['cyberbullying_type'].unique():
        # Filter tweets by category
        category_tweets = dataframe[dataframe['cyberbullying_type'] == category]
        # Count hashtags in each tweet
        hashtag_counts[category] = category_tweets['tweet_text'].apply(
            lambda text: sum(1 for word in text.split() if word.startswith('#') and word[1:].isalnum())
        ).sum()
    return hashtag_counts

# Count hashtags for each category
hashtags_per_category = count_hashtags_by_category(df)
print(hashtags_per_category)
Output:
{'not_cyberbullying': 3018, 'gender': 2416, 'religion': 1511, 'other_cyberbullying': 1465, 'age': 679, 'ethnicity': 956}
Why are the answers different?
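Your two approaches are not fully identical. For instance, the regex will not match all of #YolsuzlukVeRüşvetYılı2014 (it stops at the first non-ASCII character and matches only #YolsuzlukVeR), whereas the split + isalnum approach accepts the whole token. The regex also finds hashtags anywhere in the text, including ones followed by punctuation or glued to another word, while the split + isalnum approach only counts whitespace-delimited tokens that are entirely alphanumeric after the #. In addition, hashtags containing _, although valid, are handled correctly by neither approach: the regex stops at the underscore, and the isalnum check rejects the token altogether.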
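To make the divergence concrete, here is a minimal sketch that runs both counting rules on a few made-up strings. The sample strings and the helper names count_regex and count_split are illustrative assumptions, not taken from the dataset; the two rules themselves are copied from the question.

```python
import re

# Pattern from the first method in the question (ASCII letters and digits only)
ASCII_PATTERN = r'#[A-Za-z0-9]+'

def count_regex(text):
    # Counts every regex match, wherever it occurs in the text
    return len(re.findall(ASCII_PATTERN, text))

def count_split(text):
    # Counts whitespace-delimited tokens that start with '#' and are
    # fully alphanumeric afterwards (str.isalnum is Unicode-aware)
    return sum(1 for word in text.split()
               if word.startswith('#') and word[1:].isalnum())

# Hypothetical sample strings chosen to exercise the edge cases
samples = [
    "#YolsuzlukVeRüşvetYılı2014",  # non-ASCII letters
    "#stop_bullying",              # underscore
    "so true #bullying!",          # trailing punctuation
    "#one#two",                    # two hashtags without a space
    "plain #tag here",             # the simple case
]

for text in samples:
    print(f"{text!r}: regex={count_regex(text)}, split={count_split(text)}")
```

Only the last sample and the non-ASCII one get the same count from both rules. For the underscore, punctuation, and glued-hashtag samples the regex counts something while the split rule counts nothing, which is consistent with the regex totals in the question being higher for every category.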
One way to count them consistently is to use str.count followed by groupby.sum:
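# \w already matches letters, digits and the underscore (Unicode-aware for str patterns)
hashtag_pattern = r'#\w+'
# Example run on a different dataset with 'Text' and 'Annotation' columns
df = pd.read_csv('twitter_parsed_dataset.csv')
df['Text'].str.count(hashtag_pattern).groupby(df['Annotation']).sum()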
Example output:
Annotation
none 6402.0
racism 287.0
sexism 2103.0
Name: Text, dtype: float64
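For the column names used in the question, the same idea might look like the sketch below. The four rows are invented stand-ins for cyberbullying_tweets.csv, so the numbers mean nothing beyond illustrating the call chain.

```python
import pandas as pd

# Hypothetical rows standing in for the real CSV from the question
df = pd.DataFrame({
    'tweet_text': ['#a #b_c text', 'no tags', '#x! #y', 'end #z'],
    'cyberbullying_type': ['age', 'age', 'gender', 'gender'],
})

# '#' followed by one or more word characters (letters, digits, underscore)
hashtag_pattern = r'#\w+'

# Count matches per tweet, then sum the per-tweet counts within each category
counts = (
    df['tweet_text']
    .str.count(hashtag_pattern)
    .groupby(df['cyberbullying_type'])
    .sum()
)
print(counts)
```

The float64 dtype in the example output above is most likely due to missing values in the text column, since str.count returns NaN for them. If integer counts are preferred, filling missing text with an empty string before counting is one option.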