Problem running LinearRegression() from scikit-learn on multiple cores


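I want to run LinearRegression() from the sklearn library on 5 cores. Since the documentation says that the n_jobs parameter does not cause multiprocessing unless n_targets > 1, I created random data with two y values and tried running the program. However, the graph of CPU cores shows that only 1 core has usage above 50%. Is the graph normal, or is there a problem with the code?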

The code I tried:

import os
# Set environment variables to limit the number of threads
for env_var in ["OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS", "VECLIB_MAXIMUM_THREADS", "NUMEXPR_NUM_THREADS"]:
    os.environ[env_var] = "5"

# Importing the necessary libraries
import time
import psutil
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
from multiprocessing import Process
from sklearn.model_selection import train_test_split
from joblib import parallel_backend, Parallel, delayed
from sklearn.metrics import mean_squared_error, r2_score

# Function to plot CPU usage for all cores
def plot_cpu_usage(usage_data):
    plt.figure(figsize=(15, 8))  # Adjust the size as needed
    # print(usage_data)
    for core, usage in enumerate(usage_data):
        ax = plt.subplot(4, 4, core + 1)  # 4x4 grid for 16 cores
        
        # Determine the color based on the maximum usage for the core
        max_usage = max(usage)
        if max_usage > 90:
            line_color = 'red'
        elif max_usage > 75:
            line_color = 'orange'
        elif max_usage < 50:
            line_color = 'blue'
        else:
            line_color = 'purple'
        
        # Plot the usage with the determined color
        ax.plot(usage, color=line_color)

        ax.set_title(f'Core {core}')
        ax.set_xlabel('Time (s)')
        ax.set_ylabel('Usage (%)')
    plt.tight_layout()
    plt.savefig('cpu_cores_usage [test#4].png')
    plt.show()  # This will display the graph in a window

# Function to monitor CPU usage
def monitor_cpu_usage(duration, interval):
    # Record the start time
    start_time = time.time()
    # Initialize usage data
    usage_data = [[] for _ in range(psutil.cpu_count())]
    
    while (time.time() - start_time) < duration:
        # Get per-core CPU usage
        cores_usage = psutil.cpu_percent(percpu=True)
        # Append usage data for each core
        for i, usage in enumerate(cores_usage):
            usage_data[i].append(usage)
        # Wait for the specified interval
        time.sleep(interval)
    
    # Call the plot function
    plot_cpu_usage(usage_data)

# Function to create a dataset using a seed and random generation
def create_data(seed, sample, all_f, real_f):
    # Seed the global NumPy RNG so the generated data is reproducible
    np.random.seed(seed)

    # Set the no. of samples and features and create a sxf matrix
    n_samples, n_features = sample, all_f
    X = np.random.randn(n_samples, n_features)

    # Let only the first real_f features actually affect the targets.
    # We create Y1 as the sum of the first real_f features plus random noise
    real_p = real_f
    Y1 = np.sum(X[:, :real_p], axis=1) + np.random.normal(size=(n_samples,))

    # Create Y2 similar to Y1
    Y2 = np.sum(X[:, :real_p], axis=1) + np.random.normal(size=(n_samples,))

    # Combine Y1 and Y2 into a single matrix Y
    Y = np.column_stack((Y1, Y2))

    print(X[0:5])
    print(Y[0:5])

    return X, Y

# Function to run your lr.py program
def run_lr_program():
    X, Y = create_data(42, 10000, 5000, 2500)
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33)

    # Fit the model in parallel
    with parallel_backend('loky', n_jobs=5):
        reg = linear_model.LinearRegression(n_jobs=5)
        reg.fit(X_train, Y_train)

    # Make predictions using the testing set
    Y_pred = reg.predict(X_test)

    # The coefficients
    print("Coefficients: \n", reg.coef_)

    # The intercepts
    print("Intercepts: \n", reg.intercept_)

    # The mean squared error
    print("Mean squared error: %.2f" % mean_squared_error(Y_test, Y_pred))

    # The coefficient of determination: 1 is perfect prediction
    print("Coefficient of determination: %.2f" % r2_score(Y_test, Y_pred))

if __name__ == "__main__":
    # Guard the entry point so child processes can import this module safely
    lr_process = Process(target=run_lr_program)
    try:
        # Run the lr.py program in a separate process
        lr_process.start()

        # Monitor CPU usage while lr.py is running
        monitor_cpu_usage(duration=60, interval=1)  # Monitor for 60 seconds with 1-second intervals

        # Wait for the lr.py program to finish
        lr_process.join()
    finally:
        # Ensure proper cleanup; terminate only if the child is still running
        if lr_process.is_alive():
            lr_process.terminate()

The graph I got: Plot of CPU core utilization captured during program execution

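Expected result:
The graphs for 5 cores show higher activity than the other cores. (I am a beginner, and I have been tasked with learning to train my model on a specified number of cores before running anything on the server.)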

python scikit-learn multiprocessing linear-regression joblib
1 Answer

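"Since the documentation says that the n_jobs parameter does not cause multiprocessing unless n_targets > 1, I created random data with two y values and tried running the program. However, the graph of CPU cores shows that only 1 core has usage above 50%. Is the graph normal, or is there a problem with the code?"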

No, this setup will not attempt process-level parallelism.

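The documentation here is a little confusing:

"The number of jobs to use for the computation. This will only provide speedup in case of sufficiently large problems, that is if firstly n_targets > 1 and secondly X is sparse or if positive is set to True. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. [...]"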

In pseudocode, that sentence means:

if (n_targets > 1) and (issparse(X) or positive == True):
    use parallelism
else:
    ignore n_jobs

(You can read the source code if you want to check this yourself.)

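Since X is not sparse and you did not pass positive=True, process-level parallelism is not attempted. Even if it were attempted, the level of parallelism would be limited to the number of targets. Since you have 2 targets, it could use at most 2 processes to do the work.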
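The rule and the per-target cap described above can be sketched as a small helper. This is only an illustration of the reasoning, not scikit-learn code: the function name `effective_parallel_tasks` and its boolean `x_is_sparse` argument are hypothetical, and it assumes `n_jobs` is a positive integer (it does not handle `None` or `-1`).

```python
def effective_parallel_tasks(n_jobs, n_targets, x_is_sparse, positive):
    """Upper bound on per-target solves that could run concurrently.

    Hypothetical helper mirroring the pseudocode above; assumes n_jobs >= 1.
    """
    # n_jobs only matters when there are several targets AND the solver
    # handles them one at a time (sparse X, or positive=True).
    uses_joblib = n_targets > 1 and (x_is_sparse or positive)
    if not uses_joblib:
        return 1  # n_jobs is ignored; a single dense solve handles all targets
    # One task per target, so extra workers beyond n_targets sit idle.
    return min(n_jobs, n_targets)


# The setup from the question: dense X, positive not set, 2 targets, n_jobs=5.
print(effective_parallel_tasks(n_jobs=5, n_targets=2, x_is_sparse=False, positive=False))
# The same data with positive=True would be capped by the 2 targets.
print(effective_parallel_tasks(n_jobs=5, n_targets=2, x_is_sparse=False, positive=True))
```

Under these assumptions the first call yields 1 and the second yields 2, which matches the observation that asking for 5 jobs cannot spread the work across 5 processes here.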

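It does show some amount of parallelism anyway, which is probably the result of BLAS-level parallelism. NumPy can parallelize some operations, depending on which BLAS implementation you have.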
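One way to see and cap BLAS-level threading from inside Python is the threadpoolctl package, which is installed alongside scikit-learn. The sketch below is an assumption-laden illustration rather than what LinearRegression does internally: it uses `np.linalg.lstsq` as a stand-in for the dense solver, and the helper names `blas_thread_counts` and `fit_with_blas_limit` are made up for this example.

```python
import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits


def blas_thread_counts():
    """List (internal_api, num_threads) for each BLAS library loaded in this process."""
    return [
        (lib["internal_api"], lib["num_threads"])
        for lib in threadpool_info()
        if lib["user_api"] == "blas"
    ]


def fit_with_blas_limit(X, Y, max_threads):
    """Dense least-squares fit with BLAS capped at max_threads (a cap, not a guarantee)."""
    with threadpool_limits(limits=max_threads, user_api="blas"):
        coef, _residuals, _rank, _singular = np.linalg.lstsq(X, Y, rcond=None)
    # Transpose to (n_targets, n_features), the layout used by LinearRegression.coef_
    return coef.T


rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
# Two noise-free targets that depend only on the first 10 features
Y = np.column_stack([X[:, :10].sum(axis=1), X[:, :10].sum(axis=1)])

coef = fit_with_blas_limit(X, Y, max_threads=5)
print(blas_thread_counts())  # depends on the BLAS build NumPy was linked against
print(coef.shape)
```

The limit only sets an upper bound. Whether 5 cores actually become busy depends on the BLAS implementation and on how large the problem is.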

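Note that the maximum parallelism when combining process-level parallelism with BLAS parallelism can be much higher than 5. If you have 5 processes, each with 5 threads, you may have 25 concurrent threads running.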
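The arithmetic behind that warning, and one simple way to stay within a core budget, can be sketched as follows. Both function names are hypothetical, and the even split is only one possible policy.

```python
def total_threads(n_processes, blas_threads_per_process):
    """Worst-case number of concurrently running threads."""
    return n_processes * blas_threads_per_process


def blas_threads_for_budget(core_budget, n_processes):
    """BLAS threads to allow per process so the total stays within core_budget."""
    # Integer division splits the budget evenly; each process keeps at least 1 thread.
    return max(1, core_budget // n_processes)


# 5 worker processes, each letting BLAS use 5 threads
print(total_threads(5, 5))
# With a 5-core budget and 2 worker processes, allow 2 BLAS threads each
print(blas_threads_for_budget(core_budget=5, n_processes=2))
```

The per-process value is what you would set through variables such as OMP_NUM_THREADS or OPENBLAS_NUM_THREADS, as the question's script already does, so that the product of processes and threads stays at or below the number of cores you are allowed to use.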
