Multiprocessing Logging-如何在joblib并行中使用loguru

问题描述 投票:1回答:1

我有一堆Python脚本来运行某些数据科学模型。这需要花费相当长的时间,并且加快速度的唯一方法是使用多处理。为此,我使用了joblib库,它确实运行良好。但是,不幸的是,这使日志混乱,并且控制台的输出也出现了乱码(不过如此),因为所有进程同时转储了各自的输出。

我对使用logging库是陌生的,并遵循了一些其他的答案来尝试使其工作。我正在使用8个核心进行处理。使用SO上的答案,我写出了日志文件,并希望每次迭代有8个新文件。但是,它在第​​一次迭代时创建了8个文件,并且每个循环仅写入/追加到这8个文件中。这有点不方便,因此我进行了更多探索,发现了logurulogzero。虽然它们都涵盖了使用multiprocessing的示例,但都没有显示如何在joblib中使用它。这是我到目前为止的内容:

run_models.py

import math
import multiprocessing
import time
from datetime import datetime
from loguru import logger

import pandas as pd
import psutil
from joblib import Parallel, delayed

import helper
import log
import prep_data
import stock_subscriber_data
import train_model


def get_pred(cust_df, stock_id, logger):

    logger.info('--------------------------------Stock loop {}-------------------------------'.format(stock_id))

    cust_stockid_df = stock_subscriber_data.get_stockid_data(cust_df, stock_id)
    weekly_timeseries, last_date, abn_df = prep_data.prep(cust_stockid_df, logger)  
    single_row_df = stock_subscriber_data.get_single_row(cust_df, stock_id)

    stock_subscriber_data.write_data(abn_df, 't1')
    test_y, prd = train_model.read_train_write(cust_df, stock_id, weekly_timeseries, last_date, logger)

    return True


def main():

    cust_df = stock_subscriber_data.get_data()
    cust_df = helper.clean_data(cust_df)
    stock_list = cust_df['intStockID'].unique()

    max_proc = max(math.ceil(((psutil.virtual_memory().total >> 30) - 100) / 50), 1)
    num_cores = min(multiprocessing.cpu_count(), max_proc)

    logger.add("test_loguru.log", format="{time} {level}: ({file}:{module} - {line}) >> {message}", level="INFO", enqueue=True)

    Parallel(n_jobs=num_cores)(delayed(get_pred)(cust_df, s, logger) for s in stock_list)


if __name__ == "__main__":
    main()

train_model.py

import math
from datetime import datetime
from itertools import product
from math import sqrt

import pandas as pd
from keras import backend
from keras.layers import Dense
from keras.layers import LSTM
from keras.models import Sequential
from numpy import array
from numpy import mean
from pandas import DataFrame
from pandas import concat
from sklearn.metrics import mean_squared_error

import helper
import stock_subscriber_data

# bunch of functions here that don't need logging...

# walk-forward validation for univariate data
def walk_forward_validation(logger, data, n_test, cfg):
    #... do stuff here ...
    #... and here ...
    logger.info('{0:.3f}'.format(error))
    return error, model


# score a model, return None on failure
def repeat_evaluate(logger, data, config, n_test, n_repeats=10):
    #... do stuff here ...
    #... and here ...
    logger.info('> Model{0} {1:.3f}'.format(key, result))
    return key, result, best_model



def read_train_write(data_df, stock_id, series, last_date, logger):
    #... do stuff here ...
    #... and here ...
    logger.info('done')

    #... do stuff here ...
    #... and here ...

    # bunch of logger.info() statements here... 
    #
    #
    #
    #

    #... do stuff here ...
    #... and here ...

    return test_y, prd

当一次只有一个过程时,这很好用。但是,在多进程模式下运行时,出现_pickle.PicklingError: Could not pickle the task to send it to the workers.错误。我究竟做错了什么?我该如何补救?我不介意切换到logurulogzero以外的任何东西,只要我可以创建一个具有连贯日志的文件,甚至创建n文件,每个文件都包含joblib每次迭代的日志。

python logging python-multiprocessing joblib
1个回答
0
投票

我通过修改run_models.py使它起作用。现在,每个循环有一个日志文件。这会创建很多日志文件,但是它们都与每个循环相关,而不是混乱或其他任何内容。我想一次只一步。这是我所做的:

run_models.py

import math
import multiprocessing
import time
from datetime import datetime
from loguru import logger

import pandas as pd
import psutil
from joblib import Parallel, delayed

import helper
import log
import prep_data
import stock_subscriber_data
import train_model


def get_pred(cust_df, stock_id):

    log_file_name = "log_file_{}".format(stock_id)

    logger.add(log_file_name, format="{time} {level}: ({file}:{module} - {line}) >> {message}", level="INFO", enqueue=True)

    logger.info('--------------------------------Stock loop {}-------------------------------'.format(stock_id))

    cust_stockid_df = stock_subscriber_data.get_stockid_data(cust_df, stock_id)
    weekly_timeseries, last_date, abn_df = prep_data.prep(cust_stockid_df, logger)  
    single_row_df = stock_subscriber_data.get_single_row(cust_df, stock_id)

    stock_subscriber_data.write_data(abn_df, 't1')
    test_y, prd = train_model.read_train_write(cust_df, stock_id, weekly_timeseries, last_date, logger)

    return True


def main():

    cust_df = stock_subscriber_data.get_data()
    cust_df = helper.clean_data(cust_df)
    stock_list = cust_df['intStockID'].unique()

    max_proc = max(math.ceil(((psutil.virtual_memory().total >> 30) - 100) / 50), 1)
    num_cores = min(multiprocessing.cpu_count(), max_proc)

    Parallel(n_jobs=num_cores)(delayed(get_pred)(cust_df, s) for s in stock_list)


if __name__ == "__main__":
    main()
© www.soinside.com 2019 - 2024. All rights reserved.