Why is writing a large file to an SSD in chunks from multiple threads slower than from a single thread?


(I bet your intuitive answer is "synchronization" - bear with me while I explain why that is not necessarily the answer.)

The code below compares writing the exact same data to a file: essentially the same 1 MB chunk N times, until a 10 GB target file size is reached. It does this with a varying number of threads (1 to 56), taking care to start all threads simultaneously and measuring the time spent in the filebuf::sputn calls.

Below you can also find the output of the code, generated with a Samsung MZWL63T8HFLT-00AW7 SSD and an Intel Xeon w9-3495X CPU (hence the 56-thread limit), which I plotted using https://www.desmos.com/calculator. Essentially, you can see that the time spent writing to the file depends on the number of threads calling filebuf::sputn, which I cannot explain: locking the mutex is excluded from the write-time measurement, and comparing the file-write time with the total duration shows that locking accounts for no more than 2% of the total time.

Is this expected? Besides the obvious strategy (have all threads dump their data into a queue that a separate thread drains and writes to the file - sketched below), what strategies are there to avoid this kind of performance degradation when multiple threads are involved?

[Figure: time spent in filebuf::sputn calls vs. number of threads, showing a linear relationship]
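
For reference, the "obvious" strategy mentioned above could look roughly like the following minimal sketch (not part of the benchmark; the unbounded deque, the chunk counts and the shutdown handling are arbitrary choices made for illustration):

// Minimal sketch of the single-writer-queue strategy: producer threads only
// push chunks into a mutex-protected queue, and one dedicated thread performs
// all filebuf::sputn calls, so the file is only ever touched by a single thread.
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <fstream>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

using namespace std;

int main() {
    constexpr auto nProducers = 8;
    constexpr auto chunksPerProducer = 100;
    constexpr size_t chunkSize = 1'000'000;

    filebuf file;
    file.open("out.tmp", ios::out | ios::binary);

    mutex queueMutex;
    condition_variable queueCv;
    deque<vector<char>> chunkQueue;
    int producersLeft = nProducers;

    // The single consumer: drains the queue and performs all file writes.
    jthread writer([&] {
        for (;;) {
            unique_lock lock(queueMutex);
            queueCv.wait(lock, [&] { return !chunkQueue.empty() || producersLeft == 0; });
            if (chunkQueue.empty())
                break;                          // all producers finished and queue drained
            auto chunk = move(chunkQueue.front());
            chunkQueue.pop_front();
            lock.unlock();                      // write outside the lock
            file.sputn(chunk.data(), static_cast<streamsize>(chunk.size()));
        }
    });

    // Producers never touch the file, only the queue.
    vector<jthread> producers;
    for (int i = 0; i < nProducers; i++) {
        producers.emplace_back([&] {
            for (int c = 0; c < chunksPerProducer; c++) {
                vector<char> chunk(chunkSize);  // "produce" a 1 MB chunk
                {
                    const lock_guard lock(queueMutex);
                    chunkQueue.push_back(move(chunk));
                }
                queueCv.notify_one();
            }
            {
                const lock_guard lock(queueMutex);
                --producersLeft;
            }
            queueCv.notify_one();
        });
    }
}   // producers join first, then the writer drains what is left and joins

This keeps all sputn calls on one thread; the producers only contend on a cheap queue push instead of on a full 1 MB write.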

Code

#include <chrono>
#include <format>
#include <fstream>
#include <iosfwd>
#include <iostream>
#include <latch>
#include <mutex>
#include <ratio>
#include <thread>
#include <vector>

using namespace std;
using namespace chrono;
using double_milliseconds = duration<long double, milli>;

int main() {
    constexpr auto maxNThreads = 56;
    constexpr auto fileSize = 10'000'000'000;
    constexpr auto chunkSize = 1'000'000;

    const vector<char> chunk(chunkSize);
    mutex fileMutex;
    vector<jthread> threads;
    for (int nThreads = 1; nThreads <= maxNThreads; nThreads++) {
        filebuf file;
        file.open("out.tmp", ios::out | ios::binary);

        latch commonStart(nThreads + 1);

        streamsize written = 0;
        nanoseconds writing{0};
        for (int i = 0; i < nThreads; i++) {
            threads.emplace_back([&] {
                streamsize writtenThread = 0;
                commonStart.arrive_and_wait();
                while (writtenThread < fileSize / nThreads) {
                    const lock_guard lock(fileMutex);
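                    // lock acquisition happens above, so its cost is excluded
                    // from the write-time measurement below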
                    const auto write_start = high_resolution_clock::now();
                    const auto writtenIteration = file.sputn(chunk.data(), chunk.size());
                    const auto write_stop = high_resolution_clock::now();

                    writtenThread += writtenIteration;
                    written += writtenIteration;
                    writing += write_stop - write_start;
                }
            });
        }

        commonStart.arrive_and_wait();
        const auto start = high_resolution_clock::now();
        threads.clear();
        const auto stop = high_resolution_clock::now();

        const auto written_gb = static_cast<double>(written) / 1'000'000'000;
        const auto duration = duration_cast<milliseconds>(stop - start);
        const auto rate_mb_s = static_cast<int>(static_cast<double>(written) / (double)duration.count() / 1000);
        const auto writing_ms = duration_cast<milliseconds>(writing);
        const auto writing_pct = static_cast<int>(duration_cast<double_milliseconds>(writing) / duration * 100);
        cout << format(
            "{:2d} threads(s): {:.03f} GB / {} = {} MB/s ({} or {}% writing)",
            nThreads,
            written_gb,
            duration,
            rate_mb_s,
            writing_ms,
            writing_pct
        ) << endl;
    }
}

Output

 1 threads(s): 10.000 GB / 1796ms = 5567 MB/s (1794ms or 99% writing)
 2 threads(s): 10.000 GB / 1855ms = 5390 MB/s (1848ms or 99% writing)
 3 threads(s): 10.002 GB / 1815ms = 5510 MB/s (1804ms or 99% writing)
 4 threads(s): 10.000 GB / 1858ms = 5382 MB/s (1844ms or 99% writing)
 5 threads(s): 10.000 GB / 1886ms = 5302 MB/s (1873ms or 99% writing)
 6 threads(s): 10.002 GB / 1875ms = 5334 MB/s (1861ms or 99% writing)
 7 threads(s): 10.003 GB / 1896ms = 5275 MB/s (1882ms or 99% writing)
 8 threads(s): 10.000 GB / 1999ms = 5002 MB/s (1983ms or 99% writing)
 9 threads(s): 10.008 GB / 1962ms = 5100 MB/s (1947ms or 99% writing)
10 threads(s): 10.000 GB / 1940ms = 5154 MB/s (1924ms or 99% writing)
11 threads(s): 10.010 GB / 2024ms = 4945 MB/s (2006ms or 99% writing)
12 threads(s): 10.008 GB / 1925ms = 5198 MB/s (1908ms or 99% writing)
13 threads(s): 10.010 GB / 2057ms = 4866 MB/s (2041ms or 99% writing)
14 threads(s): 10.010 GB / 2046ms = 4892 MB/s (2030ms or 99% writing)
15 threads(s): 10.005 GB / 2097ms = 4771 MB/s (2079ms or 99% writing)
16 threads(s): 10.000 GB / 2036ms = 4911 MB/s (2019ms or 99% writing)
17 threads(s): 10.013 GB / 2110ms = 4745 MB/s (2092ms or 99% writing)
18 threads(s): 10.008 GB / 2112ms = 4738 MB/s (2095ms or 99% writing)
19 threads(s): 10.013 GB / 2118ms = 4727 MB/s (2100ms or 99% writing)
20 threads(s): 10.000 GB / 2058ms = 4859 MB/s (2040ms or 99% writing)
21 threads(s): 10.017 GB / 2197ms = 4559 MB/s (2180ms or 99% writing)
22 threads(s): 10.010 GB / 2186ms = 4579 MB/s (2168ms or 99% writing)
23 threads(s): 10.005 GB / 2198ms = 4551 MB/s (2181ms or 99% writing)
24 threads(s): 10.008 GB / 2219ms = 4510 MB/s (2201ms or 99% writing)
25 threads(s): 10.000 GB / 2313ms = 4323 MB/s (2294ms or 99% writing)
26 threads(s): 10.010 GB / 2270ms = 4409 MB/s (2251ms or 99% writing)
27 threads(s): 10.017 GB / 2244ms = 4463 MB/s (2225ms or 99% writing)
28 threads(s): 10.024 GB / 2374ms = 4222 MB/s (2354ms or 99% writing)
29 threads(s): 10.005 GB / 2249ms = 4448 MB/s (2230ms or 99% writing)
30 threads(s): 10.020 GB / 2224ms = 4505 MB/s (2205ms or 99% writing)
31 threads(s): 10.013 GB / 2228ms = 4494 MB/s (2209ms or 99% writing)
32 threads(s): 10.016 GB / 2268ms = 4416 MB/s (2248ms or 99% writing)
33 threads(s): 10.032 GB / 2305ms = 4352 MB/s (2286ms or 99% writing)
34 threads(s): 10.030 GB / 2274ms = 4410 MB/s (2254ms or 99% writing)
35 threads(s): 10.010 GB / 2347ms = 4265 MB/s (2325ms or 99% writing)
36 threads(s): 10.008 GB / 2377ms = 4210 MB/s (2357ms or 99% writing)
37 threads(s): 10.027 GB / 2416ms = 4150 MB/s (2395ms or 99% writing)
38 threads(s): 10.032 GB / 2483ms = 4040 MB/s (2461ms or 99% writing)
39 threads(s): 10.023 GB / 2512ms = 3990 MB/s (2492ms or 99% writing)
40 threads(s): 10.000 GB / 2561ms = 3904 MB/s (2539ms or 99% writing)
41 threads(s): 10.004 GB / 2658ms = 3763 MB/s (2635ms or 99% writing)
42 threads(s): 10.038 GB / 2606ms = 3851 MB/s (2583ms or 99% writing)
43 threads(s): 10.019 GB / 2626ms = 3815 MB/s (2603ms or 99% writing)
44 threads(s): 10.032 GB / 2727ms = 3678 MB/s (2704ms or 99% writing)
45 threads(s): 10.035 GB / 2722ms = 3686 MB/s (2697ms or 99% writing)
46 threads(s): 10.028 GB / 2721ms = 3685 MB/s (2698ms or 99% writing)
47 threads(s): 10.011 GB / 2939ms = 3406 MB/s (2914ms or 99% writing)
48 threads(s): 10.032 GB / 2789ms = 3596 MB/s (2763ms or 99% writing)
49 threads(s): 10.045 GB / 2843ms = 3533 MB/s (2815ms or 99% writing)
50 threads(s): 10.000 GB / 2858ms = 3498 MB/s (2832ms or 99% writing)
51 threads(s): 10.047 GB / 2847ms = 3528 MB/s (2819ms or 99% writing)
52 threads(s): 10.036 GB / 2919ms = 3438 MB/s (2890ms or 99% writing)
53 threads(s): 10.017 GB / 2829ms = 3540 MB/s (2802ms or 99% writing)
54 threads(s): 10.044 GB / 2846ms = 3529 MB/s (2818ms or 99% writing)
55 threads(s): 10.010 GB / 2743ms = 3649 MB/s (2714ms or 98% writing)
56 threads(s): 10.024 GB / 2847ms = 3520 MB/s (2817ms or 98% writing)
Tags: c++, multithreading, performance, file-writing, solid-state-drive
1 Answer

One real problem I see is that you are using a mutex shared by all of the threads, which essentially turns the writes into a queue instead of letting the threads write independently. Secondly, even without the mutex, having multiple threads write to the same file creates a bottleneck. So of course your times grow with the number of threads: in reality, the threads are not writing at the same time.

A real test, and a real solution, for multithreaded writing of a single data source would look like this:

  1. Split the data source across the N threads.
  2. Have each thread write its share of the data to its own temporary file.
  3. Finally, concatenate the files together in order in the spawning thread (a rough sketch follows).
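
A rough sketch of that approach (just an illustration of the idea, not measured against the benchmark above; the part-file names, the chunk counts and the rdbuf-based concatenation are arbitrary choices):

// Sketch of the suggestion above: each thread writes its share of the data to
// its own temporary file, and the spawning thread concatenates the parts in order.
#include <cstddef>
#include <filesystem>
#include <format>
#include <fstream>
#include <thread>
#include <vector>

using namespace std;

int main() {
    constexpr auto nThreads = 8;
    constexpr auto chunksPerThread = 100;
    constexpr size_t chunkSize = 1'000'000;

    // 1. + 2. Each thread writes its partition of the data to its own temp file.
    {
        vector<jthread> threads;
        for (int i = 0; i < nThreads; i++) {
            threads.emplace_back([i] {
                const vector<char> chunk(chunkSize);
                ofstream part(format("out.part{}.tmp", i), ios::binary);
                for (int c = 0; c < chunksPerThread; c++)
                    part.write(chunk.data(), static_cast<streamsize>(chunk.size()));
            });
        }
    }   // all jthreads join here

    // 3. The spawning thread concatenates the parts in order into the final file.
    ofstream out("out.tmp", ios::binary);
    for (int i = 0; i < nThreads; i++) {
        const auto name = format("out.part{}.tmp", i);
        ifstream part(name, ios::binary);
        out << part.rdbuf();   // stream the whole part file into the output
        part.close();
        filesystem::remove(name);
    }
}

Here the threads never share a stream position while writing, and the remaining sequential cost is the final concatenation pass.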