Problems converting a .py file to a .exe file; the converted .exe files are HUGE


I am having trouble converting PROGRAM 1 from .py to .exe, and I hope someone can help me. In short, PROGRAM 1 takes an "email.eml" file, scans it, and classifies it as spam or not; to do that, it uses a neural network (PROGRAM 2). PROGRAM 2 trains the NN model and produces the models, in .pkl format, that PROGRAM 1 consumes. The first problem I ran into was that when I generated the .exe file with "!pyinstaller --onefile Antispam.py", I got this error:

"A RecursionError (maximum recursion depth exceeded) occurred. For working around please follow these instructions"

Then, as a workaround, I modified the generated .spec file to increase the recursion limit:

 import sys ; sys.setrecursionlimit(sys.getrecursionlimit() * 5)
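For reference, this is roughly what the top of the generated Antispam.spec looks like after that edit (a sketch of a default --onefile spec; only the sys lines are the workaround, everything PyInstaller generated below them is left untouched):

# -*- mode: python ; coding: utf-8 -*-
import sys ; sys.setrecursionlimit(sys.getrecursionlimit() * 5)

block_cipher = None
# ... the Analysis/PYZ/EXE sections generated by PyInstaller follow unchanged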

After this the build no longer fails, but the generated .exe program is HUGE. I am doing this in Google Colab because my PC does not have all the dependencies, and the output files add up to about 3.8 GB in total, which of course is not what I want.

Do you know why my program is so huge after converting it to .exe, and how I could convert it in a better way to get a file measured in MB instead? Is there anything in my code I could change? I hope you can guide me.

Note: I train my antispam with a neural network, but I also tried some other ML classifiers and got a similar result.

PROGRAM 1, the one converted to a .EXE file:

import pickle
import numpy as np
import tensorflow as tf
import pandas as pd
import re
import email
import email.header
from langdetect import detect
import argparse

keywords = ['Full refund', 'Cashcashcash', 'Compare rates', 'Billion dollars', 'Call free', 'Only $', 'Xanax', 'Lower monthly payment', 'Best price', 'Cash bonus', 'Hidden assets', 'Order now', 'No obligation', "They keep your money - no refund!", 'Explode your business', 'Increase sales', "This isn't junk", 'Order today', 'Reserves the right', 'Unsecured credit', 'More Internet Traffic', 'Buying judgments', 'Free leads', "This isn't spam", 'Lower interest rate', 'Time limited', 'Get it today', 'Undisclosed recipient', 'No-obligation', 'Lowest price', 'Once in lifetime', 'Cures baldness', 'At no cost', 'Join millions', 'Free grant money', 'Laser printer', 'Real thing', 'Mail in order form', 'Info you requested', 'Get started now', 'Marketing', 'Free DVD', 'Steamy', 'Access now', 'Home employment', 'No medical exams', 'Copy DVDs', 'Visit our website', 'Brand new pager', 'For instant access', 'Your income', 'Free money', 'While supplies last', 'Online business opportunity', 'Be your own boss', 'Not intended', 'Luxury car', 'Increase Income', 'Message contains', 'As seen on', 'Consolidate your debt', 'Apply now', 'Cents on the dollar', 'Risk-free', 'Homebased business', 'No medical exams ', 'Confidentially on all orders', 'Addresses on CD', 'No questions asked', 'Special promotion', 'Free access', 'Satisfaction guaranteed', 'Growth hormone', 'Big bucks', 'Fast cash', 'Month trial offer', 'XXX', 'Make $', 'Save up to', 'One hundred percent guaranteed', 'No investment', 'Work from home', 'Unsubscribe', 'See for yourself', 'All-natural', 'Email harvest', 'Save big', 'The following form', 'Please read', 'Marketing solutions', "Don't hesitate", 'Cable converter', 'Dear friend', '50% off', 'Cheap meds', 'Online biz opportunity', 'Why pay more?', 'Calling creditors', 'Eliminate bad credit', 'No refunds', 'Click Here', "Don't hesitate", 'Removal instructions', 'We honor all', 'Internet market', 'Orders shipped by shopper', 'Score with babes', 'Celebrity', 'Free info', "This isn't spam", 'Compete for your business', 'Multi level marketing', '100% satisfaction', 'Very Cheap', 'Subscribe', 'Save $', 'Print out and fax', 'Free cell phone', 'Increase your sales', 'Stock pick', 'Limited time', 'New customers only', 'Financially independent', 'Auto email removal', 'Avoice bankruptcy', "Don't delete", 'Offer expires', 'No inventory', 'Drastically reduced', 'Instant weight loss', 'This is an ad', 'All natural', 'Click to remove', 'No catch', 'Act now!', 'Once in a lifetime', 'Billing address', 'Free investment', 'Potential earnings', 'Join millions of Americans', 'Accept credit cards', 'Free Instant', 'Meet women', 'Dig up dirt on friends', 'Print form signature', 'Unlimited', 'No claim forms', 'Avoid bankruptcy', 'Fast money', 'Get it now', 'You are a winner!', 'Shopping spree', 'Weight loss', 'Free installation', 'Investment decision', 'Free hosting', 'One time', 'For you', 'Click here', 'Stock disclaimer statement', 'Act now', 'Check or money order', '100% satisfied', 'Get started now.', 'Get paid', 'Cutie', 'Reverses aging', 'What are you waiting for?', 'We hate spam', 'No hidden costs', 'Requires initial investment', 'Weekend getaway', '0% risk', 'Meet girls', 'Extra income', 'Subject to credit', 'Increase traffic', 'Accordingly', 'http://', 'Do it today', 'Risk free', "Don't delete", 'Cannot be combined with any other offer', 'Incredible deal', 'Fast Viagra delivery', "It's effective", 'Free quote', 'Multi-level marketing', 'If only it were that easy', 'Million dollars', 'Lose weight spam', 'Digital marketing', 'Unsolicited', 'hidden charges', 'No credit check', 'No purchase necessary', 'For just $XXX', 'Guarantee!', 'Join thousands', 'Lowest insurance rates', 'For free', 'Email marketing', 'Fantastic deal', 'Buy now', 'Give it away', 'Search engines', "This won't last", 'Sexy babes', 'Expect to earn', 'Action Required', 'University diplomas', 'No age restrictions', 'Lose weight', 'Cards accepted', 'All new', 'For Only', 'Free sample', 'Opportunity', 'Order status', "Can't live without", "This isn't junk", 'Do it now', "You've been selected!", 'Online pharmacy', 'Sign up free today', 'Home based', 'No experience', 'Income from home', 'Act Immediately', 'Have you been turned down?', 'Refinance home', 'Direct marketing', 'Home-based', 'Information you requested', 'Near you', 'Lower your mortgage rate', 'Bulk email', 'New domain extensions', 'You have been selected', "Can't live without", 'Click below', 'Earn $', 'Dear [email/friend/somebody]', 'F r e e', 'Save big money', 'While you sleep', 'Free website', 'Free membership', 'Double your', '$$$', 'Pennies a day', 'Being a member', 'Who really wins?', 'Card accepted', 'Get rid of debt', 'Will not believe your eyes', 'Free gift', 'Meet singles', 'You are a winner', 'Unsecured debt', 'Financial freedom', 'Human growth hormone', 'Serious cash', 'Make money', 'Money making', 'Deal ending soon', 'No purchase required', 'Priority mail', 'No fees', 'Call now', "They're just giving it away", 'Stainless steel', 'No middleman', 'Pure profit', 'Earn per week', 'Credit bureaus', 'Money back', 'One time mailing', 'Free priority mail', 'US dollars', 'Removes wrinkles', 'No cost', 'Stock alert', 'Produced and sent out', '100% more', 'Free preview', 'Now only', 'Free trial', 'No selling', 'Sent in compliance', 'Vacation offers', 'No hidden Costs', 'Earn cash', 'Online degree', 'The best rates', 'In accordance with laws', "You're a Winner!", 'Apply online', 'Stops snoring', 'Acceptance', 'Earn extra cash', 'All-new', 'Terms and conditions', 'Social security number', 'Gift certificate', 'Long distance phone offer', 'Stuff on sale', 'Work at home', 'Hottie', 'Copy accurately', 'Important information regarding', 'Get out of debt', 'No gimmick', 'Free offer', 'No strings attached', 'Supplies are limited', 'Name brand', 'Outstanding values', 'Mortgage rates', 'Cancel at any time', 'Take action now', 'Free consultation', 'Easy terms', 'Internet marketing', 'Credit card offers', '100% free', 'Consolidate debt and credit', 'Hurry up', 'Additional income', 'No disappointment', 'Take action', 'Mass email', 'While stocks last', 'Hot babes', 'Double your income', 'Web traffic', 'Great offer', 'Kinky', 'One hundred percent free', 'Exclusive deal', 'Collect child support', 'Eliminate debt', 'Online marketing', 'Sex', 'Search engine listings', 'Life insurance', 'Stop snoring', 'Giving away', 'Promise you', 'Buy direct']


def clean_subject(subject):
    # Decode the subject using the email header parser
    decoded_subject = email.header.decode_header(subject)
    # Join any decoded subject fragments
    subject = ''.join([data[0].decode(data[1]) if data[1] else str(data[0]) for data in decoded_subject])
    # Remove all non-alphanumeric characters except for $ % : ; , and space
    subject = re.sub(r'[^a-zA-Z0-9$%:;, ]', '', subject)
    # Remove any leading or trailing whitespace
    subject = subject.strip()
    return subject
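
As a quick illustration of what clean_subject does, here is a hypothetical MIME "encoded-word" subject, like the ones found in raw .eml headers, run through it (the input string is made up for the example):

raw_subject = '=?utf-8?q?Save_50%_today!?='
print(clean_subject(raw_subject))  # prints 'Save 50% today': decoded from the encoded-word, with '!' removed by the regex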


def extract_email_data(eml_file_path):
    try:
        # Open the .eml file
        with open(eml_file_path, 'rb') as f:
            message = email.message_from_binary_file(f)

            # Extract basic email information
            sender = message.get('From')
            sender_domain = sender.split('@')[-1].strip('> ')
            recipient = message['To']
            subject = message.get('Subject')
            if subject:
                if detect(subject) != 'en':
                    print("The file is not in English; sorry, for now we only process emails in English:", eml_file_path)
                    return None
                subject = clean_subject(subject)
            else:
                subject = "No subject"
            date_sent = message['Date']

            # Extract the body of the email and remove any newlines
            if message.is_multipart():
                body = ''
                for part in message.walk():
                    if part.get_content_type() == 'text/plain':
                        body += part.get_payload()
            else:
                body = message.get_payload()
            body = body.replace('\n', ' ')

            # Extract authentication results
            if 'Authentication-Results' in message:
                auth_results = message['Authentication-Results']
                spf_match = re.search(r"spf=([\w/]+)", auth_results)
                dkim_match = re.search(r"dkim=([\w/]+)", auth_results)
                dmarc_match = re.search(r"dmarc=([\w/]+)", auth_results)

                if spf_match:
                    spf_result = spf_match.group(1)
                else:
                    spf_result = "unknown"

                if dkim_match:
                    dkim_result = dkim_match.group(1)
                else:
                    dkim_result = "unknown"

                if dmarc_match:
                    dmarc_result = dmarc_match.group(1)
                else:
                    dmarc_result = "unknown"
            else:
                spf_result = "unknown"
                dkim_result = "unknown"
                dmarc_result = "unknown"

            # Extract URLs from the body
            url_pattern = r"(?P<url>https?://[^\s]+)"
            url_matches = re.findall(url_pattern, body)
            urls = ', '.join(set(url_matches))

            if message.get_all('Received', []):
                hop_pattern = r"from\s([\w\.]+)\s\("
                hop_matches = re.findall(hop_pattern, ''.join(message.get_all('Received', [])))
                hops = ', '.join(hop_matches)
            else:
                hops = "unknown"

            # Check if email body has spam keywords
            body_has_keywords = False
            for keyword in keywords:
                if keyword in body:
                    body_has_keywords = True
                    break

            # Categorize email as spam or non-spam
            if body_has_keywords:
                word_pattern = "Has spam patterns"
            else:
                word_pattern = "Non-spam patterns"

            # Write the extracted data into a dictionary
            my_dict = [{'Subject': subject,
                       'SPF': spf_result.lower(),
                       'DKIM': dkim_result.lower(),
                       'DMARC': dmarc_result.lower(),
                       'Word pattern': word_pattern}]

            return my_dict
    except Exception as e:
        # report which file failed and why, instead of swallowing the error with a bare except
        print("Error reading file:", eml_file_path, "-", e)
        return None
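
For context, when extraction succeeds the function returns a one-element list holding a dict with the five features; the values below are just an illustration, not real output:

[{'Subject': 'Free money now', 'SPF': 'pass', 'DKIM': 'fail', 'DMARC': 'fail', 'Word pattern': 'Has spam patterns'}]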

def preprocess_and_predict(x_evaluate, le_path, onehot_enc_path, cv_path, ann_path):
    # load label encoder model
    with open(le_path, 'rb') as f:
        le = pickle.load(f)
    # encode 'Word pattern' column using label encoder
    x_evaluate['Word pattern'] = le.transform(x_evaluate['Word pattern'])
    # load one-hot encoder model
    with open(onehot_enc_path, 'rb') as f:
        onehot_enc = pickle.load(f)
    # encode categorical columns using one-hot encoder
    cat_cols = ['SPF', 'DKIM', 'DMARC']
    x_evaluate_cat = onehot_enc.transform(x_evaluate[cat_cols])
    # remove categorical columns from the input data and concatenate one-hot encoded columns
    x_evaluate = x_evaluate.drop(cat_cols, axis=1)
    x_evaluate = np.hstack((x_evaluate.values, x_evaluate_cat.toarray()))
    # load count vectorizer model
    with open(cv_path, 'rb') as f:
        cv = pickle.load(f)
    # vectorize 'Subject' column using count vectorizer
    x_evaluate_subject = cv.transform(x_evaluate[:,0])  # index 0 corresponds to 'Subject' column
    x_evaluate_subject = x_evaluate_subject.toarray()
    # concatenate the vectorized 'Subject' column with the rest of the preprocessed data
    x_evaluate = np.hstack((x_evaluate_subject, x_evaluate[:, 1:]))
    # convert the preprocessed data to a tensor
    x_evaluate = tf.convert_to_tensor(x_evaluate, dtype=tf.float32)
    # load ANN model
    with open(ann_path, 'rb') as f:
        ann_model = pickle.load(f)
    # make predictions on the preprocessed data using the ANN model
    y_pred = ann_model.predict(x_evaluate)
    # compare the predicted probabilities to a threshold of 0.5 and return the predicted label
    if y_pred > 0.5:
        return "ham"
    else:
        return "Unwanted email"

try:
  parser = argparse.ArgumentParser()
  parser.add_argument("-input",help="Input email file path")
  args = parser.parse_args()
  if args.input:
    x_evaluation=extract_email_data(args.input)
    # convert the input data to a pandas DataFrame
    x_evaluate = pd.DataFrame(x_evaluation)
    # preprocess the input data and make predictions using the function
    predicted_label = preprocess_and_predict(x_evaluate,'labelencoder_model.pkl','onehotencoder_model.pkl','countvectorizer_model.pkl','ann_model.pkl')
    # print the predicted label
    print("\n")
    print("We are going to evaluate the email you just input")
    print(x_evaluation)
    print("\n")
    print("The email is: ")
    print(predicted_label)
  else:
    print("No input file specified")
except Exception as e:
  print("An error occurred:", e)

PROGRAM 2, the one that generates the .PKL files:

import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from tensorflow.keras.callbacks import EarlyStopping
import pickle
from sklearn.metrics import confusion_matrix,precision_score, recall_score, f1_score, accuracy_score

def label_encoding(x_train,x_test,col):
    le = LabelEncoder()
    le.fit(x_train[col])
    with open("labelencoder_model.pkl", "wb") as f:
      pickle.dump(le, f)
    x_train[col] = le.transform(x_train[col])
    x_test[col] = le.transform(x_test[col])
    return x_train, x_test
    
def onehot_encoding(x_train, x_test,cols):
    onehot_enc = OneHotEncoder(drop='first')
    onehot_enc.fit(x_train[cols])
    with open("onehotencoder_model.pkl", "wb") as f:
      pickle.dump(onehot_enc, f)
    x_train_cat = onehot_enc.transform(x_train[cols])
    x_train = x_train.drop(cols, axis=1)
    x_train = np.hstack((x_train.values, x_train_cat.toarray()))
    x_test_cat = onehot_enc.transform(x_test[cols])
    x_test = x_test.drop(cols, axis=1)
    x_test = np.hstack((x_test.values, x_test_cat.toarray()))
    return x_train, x_test

def count_vectorizer(x_train, x_test):
    cv = CountVectorizer(max_features=30000, ngram_range=(1, 2))
    cv.fit(x_train[:,0])
    with open("countvectorizer_model.pkl", "wb") as f:
      pickle.dump(cv, f)
    x_train_subject = cv.transform(x_train[:,0])
    x_train_subject=x_train_subject.toarray()
    x_train = np.hstack((x_train_subject, x_train[:, 1:]))
    x_test_subject = cv.transform(x_test[:,0])
    x_test_subject=x_test_subject.toarray()
    x_test = np.hstack((x_test_subject, x_test[:, 1:]))
    x_train = tf.convert_to_tensor(x_train, dtype=tf.float32)
    x_test = tf.convert_to_tensor(x_test, dtype=tf.float32)
    return x_train, x_test

def ann_model_func(x_train, y_train):
    ann = tf.keras.models.Sequential()
    ann.add(tf.keras.layers.Dense(units=12, activation='relu'))
    ann.add(tf.keras.layers.Dense(units=12, activation='relu'))
    ann.add(tf.keras.layers.Dense(units=12, activation='relu'))
    ann.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))
    ann.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    early_stop = EarlyStopping(monitor='loss', patience=7)
    ann.fit(x_train, y_train, batch_size=32, epochs=20, callbacks=[early_stop])
    with open("ann_model.pkl", "wb") as f:
        pickle.dump(ann, f)
    return ann

def testing(x_test,y_test,ann):
    y_pred = ann.predict(x_test)
    y_pred = (y_pred > 0.5)
    print("\n")   
    print("Confusion Matrix for the operation: ")
    cm = confusion_matrix(y_test, y_pred)
    print(cm)
    print("\n")
    print(f'Precision: {precision_score(y_test, y_pred)}')
    print(f'Recall: {recall_score(y_test, y_pred)}')
    print(f"The F1 score is: {f1_score(y_test, y_pred)}")
    print(f"The accuracy: {accuracy_score(y_test, y_pred)}")
    
def main():
    spam = pd.read_csv('Dataset_final.csv')
    x = spam.iloc[:,:-1]
    y = spam.iloc[:,-1]
    cols = ['SPF', 'DKIM', 'DMARC']    

    y = y.map({'ham': 1, 'Unwanted email': 0})

    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

    x_train, x_test=label_encoding(x_train,x_test,'Word pattern')
    x_train, x_test=onehot_encoding(x_train, x_test,cols)
    print(x_test)
    x_train, x_test=count_vectorizer(x_train, x_test)
    print(x_test)
    ann = ann_model_func(x_train, y_train)
    testing(x_test,y_test,ann)

if __name__ == "__main__":
    main()
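
For clarity, these are the artifacts PROGRAM 2 writes, which PROGRAM 1 loads by relative path, so they have to sit in the working directory next to the PROGRAM 1 executable:

labelencoder_model.pkl     # LabelEncoder for the 'Word pattern' column
onehotencoder_model.pkl    # OneHotEncoder for the SPF/DKIM/DMARC columns
countvectorizer_model.pkl  # CountVectorizer for the 'Subject' column
ann_model.pkl              # the trained Keras model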

python deep-learning google-colaboratory pyinstaller