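I am using LinearSVC to predict the sentiment of tweets. The LSVM classifies each tweet as negative or positive. I use a pipeline to clean, vectorize, and classify the tweets, in that order. But when I predict the sentiment, I only ever get 0 (for neg) or 4 (for pos). I would like to predict a decimal score between -1 and 1, to better gauge "how" positive or negative a tweet is:
Code: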
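import re
import string

import pandas as pd
from bs4 import BeautifulSoup
from nltk.tokenize import WordPunctTokenizer
from sklearn.base import TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS

tok = WordPunctTokenizer()  # assumed: `tok` is used below but never defined in the post

# read in influential Twitter users on the stock market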
twitter_users = pd.read_csv('core/infl_users.csv', encoding = "ISO-8859-1")
twitter_users.columns = ['users']
df = pd.DataFrame()
#MODEL TRAINING
#read trainingset for model : csv to dataframe
df = pd.read_csv("../trainingset.csv", encoding='latin-1')
# label the training set's columns
df.columns = ["target", "id", "data", "query", "user", "text"]
# remove unnecessary columns (the positional axis argument, drop("id", 1), was removed in pandas 2.0)
df = df.drop(columns=["id", "data", "query", "user"])
pat1 = r'@[A-Za-z0-9_]+'  # matches @mentions in tweets
pat2 = r'https?://[^ ]+'  # matches http(s) URLs in tweets
combined_pat = r'|'.join((pat1, pat2))  # matches either pat1 or pat2
www_pat = r'www\.[^ ]+'  # matches URLs that start with www.
negations_dic = {"isn't":"is not", "aren't":"are not", "wasn't":"was not", "weren't":"were not", # converting words like isn't to is not
"haven't":"have not","hasn't":"has not","hadn't":"had not","won't":"will not",
"wouldn't":"would not", "don't":"do not", "doesn't":"does not","didn't":"did not",
"can't":"can not","couldn't":"could not","shouldn't":"should not","mightn't":"might not",
"mustn't":"must not"}
neg_pattern = re.compile(r'\b(' + '|'.join(negations_dic.keys()) + r')\b')
def tweet_cleaner(text):  # clean a single tweet
    soup = BeautifulSoup(text, 'lxml')  # parse the tweet as HTML
    souped = soup.get_text()  # keep only the text, dropping any markup
    try:
        # strip the UTF-8 BOM; only applies when souped is a byte string
        bom_removed = souped.decode("utf-8-sig").replace(u"\ufffd", "?")
    except (AttributeError, UnicodeError):
        # a Python 3 str has no .decode(), so keep the text as it is
        bom_removed = souped
    stripped = re.sub(combined_pat, '', bom_removed)  # remove @mentions and http(s) URLs
    stripped = re.sub(www_pat, '', stripped)  # remove www. URLs
    lower_case = stripped.lower()  # convert everything to lower case
    neg_handled = neg_pattern.sub(lambda x: negations_dic[x.group()], lower_case)  # expand contractions such as isn't -> is not
    letters_only = re.sub("[^a-zA-Z]", " ", neg_handled)  # replace every non-letter character with a space
    words = [x for x in tok.tokenize(letters_only) if len(x) > 1]  # tokenize and keep only words longer than one character
    return (" ".join(words)).strip()  # join the words back into one string
# Build a list of stopwords to use to filter
stopwords = list(STOP_WORDS)
# Use the punctuations of string module
punctuations = string.punctuation
# Creating a Spacy Parser
parser = English()
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

# Basic function to clean the text
def clean_text(text):
    return text.strip().lower()

def spacy_tokenizer(sentence):
    mytokens = parser(sentence)
    mytokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens]
    #mytokens = [word.lemma_.lower().strip() for word in mytokens]
    mytokens = [word for word in mytokens if word not in stopwords and word not in punctuations]
    #mytokens = preprocess2(mytokens)
    return mytokens
# Vectorization
# Convert a collection of text documents to a matrix of token counts
# ngrams : extension of the unigram model by taking n words together
# big advantage: it preserves context. -> words that appear together in the text will also appear together in a n-gram
# n-grams can increase the accuracy in classifying pos & neg
vectorizer = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))
# Linear Support Vector Classification.
# "Similar" to SVC with parameter kernel=’linear’
# more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.
# LinearSVC take as input two arrays: an array X of size [n_samples, n_features] holding the training samples, and an array y of class labels (strings or integers), size [n_samples]:
classifier = LinearSVC(C=0.5)
# Using Tfidf
tfvectorizer = TfidfVectorizer(tokenizer=spacy_tokenizer)
#put tweet-text in X and target in ylabels to train model
X = df['text']
ylabels = df['target']
# The next step is to split the data into training and test datasets. In this example,
# we use 80% of the dataset to train the model. That 80% is then split again 80-20: 80% to train the model, 20% to test the results.
# The remaining 20% is kept to train the final model.
X_tr, X_kast, y_tr, y_kast = train_test_split(X, ylabels, test_size=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_tr, y_tr, test_size=0.2, random_state=42)
# Create the pipeline to clean, tokenize, vectorize, and classify
# Tying together different pieces of the ML process is known as a pipeline.
# Each stage of a pipeline is fed data processed from its preceding stage
# Pipelines only transform the observed data (X).
# Pipeline can be used to chain multiple estimators into one.
# The pipeline object is in the form of (key, value) pairs.
# Key is a string that has the name for a particular step
# value is the name of the function or actual method.
#Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.
pipe_tfid = Pipeline([("cleaner", predictors()),
                      ('vectorizer', tfvectorizer),
                      ('classifier', classifier)])
# Fit our data, fit = training the model
pipe_tfid.fit(X_train,y_train)
# Predicting with a test dataset
#sample_prediction1 = pipe_tfid.predict(X_test)
accur = pipe_tfid.score(X_test,y_test)
When I predict the sentiment score:
pipe_tfid.predict(['textoftweet'])  # pass a list: a bare string would be iterated character by character
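During training, an SVM computes the weights w so that the classification margin is maximal. Prediction then uses the following function (in the case of a binary classifier):
if w^T x + bias > 0, choose C1; otherwise choose C2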
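To make the rule concrete, here is a minimal sketch with scikit-learn's LinearSVC. The four toy tweets are invented for illustration, and the labels follow the question's convention (0 = negative, 4 = positive). It computes w^T x + bias by hand from `coef_` and `intercept_`, compares it with `decision_function`, and applies the sign rule.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy data, invented for illustration; 0 = negative, 4 = positive as in the question.
texts = ["great stock love it", "good earnings happy",
         "awful loss hate it", "bad results sad"]
labels = [4, 4, 0, 0]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
clf = LinearSVC(C=0.5).fit(X, labels)

# Signed score w^T x + bias, once by hand and once through the library.
manual = X.toarray() @ clf.coef_.ravel() + clf.intercept_[0]
margin = clf.decision_function(X)

# The decision rule: score > 0 -> classes_[1] (here 4), otherwise classes_[0] (here 0).
by_rule = np.where(margin > 0, clf.classes_[1], clf.classes_[0])

print(margin)
print(by_rule)
```

The value returned by `decision_function` is a signed score whose magnitude grows with the distance from the separating hyperplane. It is unbounded, so it orders tweets from "more negative" to "more positive" but is not confined to [-1, 1].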
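An SVM cannot return probabilities, because it is not a probabilistic model. There are some probabilistic interpretations of SVMs, but if you want to know how confident a prediction is, you are better off using a standard probabilistic model (such as NaiveBayes or LogisticRegression).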
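As a rough sketch of that suggestion, the classifier step can be swapped for LogisticRegression and the class probabilities turned into a score in [-1, 1]. The toy tweets and the helper `sentiment_score` are hypothetical and only for illustration. The mapping P(positive) - P(negative) is one possible choice, not the only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration; 0 = negative, 4 = positive as in the question.
texts = ["great stock love it", "good earnings happy",
         "awful loss hate it", "bad results sad"]
labels = [4, 4, 0, 0]

pipe = Pipeline([("vectorizer", TfidfVectorizer()),
                 ("classifier", LogisticRegression())])
pipe.fit(texts, labels)

def sentiment_score(pipeline, tweets, positive_label=4):
    """Hypothetical helper: P(positive) - P(negative), a value in [-1, 1]."""
    proba = pipeline.predict_proba(tweets)
    classes = list(pipeline.named_steps["classifier"].classes_)
    pos_col = classes.index(positive_label)
    # With two classes, P(neg) = 1 - P(pos), so the difference is 2 * P(pos) - 1.
    return 2.0 * proba[:, pos_col] - 1.0

scores = sentiment_score(pipe, ["love this great stock", "hate this awful loss"])
print(scores)
```

A score near 1 means the model is confident the tweet is positive, a score near -1 that it is negative, and a score near 0 that it is undecided. In the question's pipeline, this would amount to replacing `LinearSVC(C=0.5)` with `LogisticRegression()` and calling `predict_proba` instead of `predict`.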