Why is multithreaded, chunked writing of a large file to an SSD slower from multiple cores than from a single core?


Note: given the many suggestions and findings in the comments, this question has been edited and may by now be somewhat outdated. It originally focused on the number of threads, while the actual issue appears to be the threads' core affinity.

(I bet your intuitive answer is "synchronization". Please bear with me while I explain why that is not necessarily the case.)

The code below compares writing exactly the same data to a file: essentially, the same 1 MB chunk N times until a target file size of 10 GB is reached. It does so with a varying number of threads (1 to 56), takes care to start all threads simultaneously, and measures the time spent in the filebuf::sputn calls.

(Since this has come up several times in the comments: the goal is not to write the data faster than a single thread could. The goal is to write data that is produced by several independent data-generating threads, ideally without needing yet another dedicated writer thread.)

Below you will also find the code's output, generated on Windows with MSVC, a Samsung MZWL63T8HFLT-00AW7 SSD and an Intel Xeon w9-3495X CPU (hyper-threading disabled, hence the limit of 56 threads), which I plotted using https://www.desmos.com/calculator. Essentially, you can see that if each thread is pinned to its own core, the time spent writing the file depends on the number of threads calling filebuf::sputn. I cannot explain this: locking the mutex is excluded from the measured write time, and comparing the write time with the overall duration shows that locking accounts for no more than 2% of the total time.

If all threads are pinned to the same core, the problem does not occur. Unfortunately, while that may be a fix for this toy example, it is not applicable to the real-world scenario in which each thread generates its own data using expensive CPU operations.

Is this expected behaviour? And, apart from the obvious strategy (have all threads dump their data into a queue that a separate thread drains and writes to the file), what strategies are there to avoid this performance penalty when writing from multiple cores?
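For reference, a minimal sketch of that obvious strategy (producer threads push chunks into a shared queue, and a single dedicated writer thread drains it and is the only thread that touches the file) could look like the code below. The queue type and the producer/chunk counts (ChunkQueue, producerCount, chunksPerProducer) are illustrative assumptions, not part of the benchmark above.

#include <condition_variable>
#include <cstddef>
#include <deque>
#include <fstream>
#include <mutex>
#include <thread>
#include <vector>

// Very small multi-producer, single-consumer chunk queue.
struct ChunkQueue {
    std::mutex m;
    std::condition_variable cv;
    std::deque<std::vector<char>> chunks;
    bool done = false;

    void push(std::vector<char> chunk) {
        { const std::lock_guard lock(m); chunks.push_back(std::move(chunk)); }
        cv.notify_one();
    }
    // Returns false only after finish() was called and the queue has been drained.
    bool pop(std::vector<char>& out) {
        std::unique_lock lock(m);
        cv.wait(lock, [&] { return !chunks.empty() || done; });
        if (chunks.empty()) return false;
        out = std::move(chunks.front());
        chunks.pop_front();
        return true;
    }
    void finish() {
        { const std::lock_guard lock(m); done = true; }
        cv.notify_all();
    }
};

int main() {
    constexpr int producerCount = 8;              // illustrative
    constexpr std::size_t chunkSize = 1'000'000;  // 1 MB, as in the benchmark
    constexpr std::size_t chunksPerProducer = 100;

    ChunkQueue queue;
    std::ofstream file("out.tmp", std::ios::binary);

    // The writer is the only thread that ever touches the file.
    std::thread writer([&] {
        std::vector<char> chunk;
        while (queue.pop(chunk))
            file.write(chunk.data(), static_cast<std::streamsize>(chunk.size()));
    });

    // Producers generate data on their own cores and never touch the file.
    std::vector<std::thread> producers;
    for (int i = 0; i < producerCount; ++i)
        producers.emplace_back([&] {
            for (std::size_t n = 0; n < chunksPerProducer; ++n)
                queue.push(std::vector<char>(chunkSize, 'x'));
        });

    for (auto& p : producers) p.join();
    queue.finish();
    writer.join();
}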

[Plot] Time spent in filebuf::sputn calls vs. number of threads, showing roughly linear relationships in both cases.
Blue: each thread pinned to its own core. Red: all threads pinned to the same core.

Code

#include <Windows.h>

#include <chrono>
#include <format>
#include <fstream>
#include <iosfwd>
#include <iostream>
#include <latch>
#include <mutex>
#include <ranges>
#include <ratio>
#include <thread>
#include <vector>

using namespace std;
using namespace chrono;
using double_milliseconds = duration<long double, milli>;

int main() {
    constexpr auto maxNThreads = 56;
    constexpr auto fileSize = 10'000'000'000;
    constexpr auto chunkSize = 1'000'000;

    mutex mutex;
    vector<jthread> threads;
    for (const auto sameCore : {false, true}) {
        for (const auto nThreads : ranges::iota_view(1, maxNThreads + 1)) {
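            // All worker threads of this run share this one filebuf (and the mutex guarding it).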
            filebuf file;
            file.open("out.tmp", ios::out | ios::binary);

            latch commonStart(nThreads + 1);

            streamsize written = 0;
            nanoseconds writing{0};
            for (const auto i : ranges::iota_view(0, nThreads)) {
                const auto threadSize = fileSize / nThreads;
                threads.emplace_back([&commonStart, &mutex, &file, &written, &writing, threadSize, sameCore, i] {
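                    // Pin this worker either to core 0 (sameCore == true) or to its own core i.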
                    const auto mask = static_cast<DWORD_PTR>(1) << (sameCore ? 0 : i);
                    if (!::SetThreadAffinityMask(::GetCurrentThread(), mask)) return;

                    const vector<char> chunk(chunkSize);
                    streamsize writtenThread = 0;
                    commonStart.arrive_and_wait();
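                    // Write 1 MB chunks under the shared mutex; only the sputn call itself is timed.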
                    while (writtenThread < threadSize) {
                        const lock_guard lock(mutex);
                        const auto write_start = steady_clock::now();
                        const auto writtenIteration = file.sputn(chunk.data(), chunk.size());
                        const auto write_stop = steady_clock::now();

                        writtenThread += writtenIteration;
                        written += writtenIteration;
                        writing += write_stop - write_start;
                    }
                });
            }

            commonStart.arrive_and_wait();
            const auto start = steady_clock::now();
            threads.clear();
            file.close();
            const auto stop = steady_clock::now();

            const auto cores = sameCore ? "same core" : "diff. cores";
            const auto written_gb = static_cast<double>(written) / 1'000'000'000;
            const auto duration = duration_cast<milliseconds>(stop - start);
            const auto rate_mb_s = static_cast<int>(static_cast<double>(written) / (double)duration.count() / 1000);
            const auto writing_ms = duration_cast<milliseconds>(writing);
            const auto writing_pct = static_cast<int>(duration_cast<double_milliseconds>(writing) / duration * 100);
            cout << format(
                "{:2d} threads(s), {}: {:.03f} GB / {} = {} MB/s ({} or {}% writing)",
                nThreads,
                cores,
                written_gb,
                duration,
                rate_mb_s,
                writing_ms,
                writing_pct
            ) << endl;
        }
    }
}

Output

 1 threads(s), diff. cores: 10.000 GB / 1825ms = 5479 MB/s (1824ms or 99% writing)
 2 threads(s), diff. cores: 10.000 GB / 1826ms = 5476 MB/s (1819ms or 99% writing)
 3 threads(s), diff. cores: 10.002 GB / 1897ms = 5272 MB/s (1887ms or 99% writing)
 4 threads(s), diff. cores: 10.000 GB / 1838ms = 5440 MB/s (1826ms or 99% writing)
 5 threads(s), diff. cores: 10.000 GB / 1893ms = 5282 MB/s (1880ms or 99% writing)
 6 threads(s), diff. cores: 10.002 GB / 1999ms = 5003 MB/s (1885ms or 94% writing)
 7 threads(s), diff. cores: 10.003 GB / 1919ms = 5212 MB/s (1903ms or 99% writing)
 8 threads(s), diff. cores: 10.000 GB / 2013ms = 4967 MB/s (1927ms or 95% writing)
 9 threads(s), diff. cores: 10.008 GB / 1969ms = 5082 MB/s (1953ms or 99% writing)
10 threads(s), diff. cores: 10.000 GB / 1972ms = 5070 MB/s (1956ms or 99% writing)
11 threads(s), diff. cores: 10.010 GB / 1982ms = 5050 MB/s (1966ms or 99% writing)
12 threads(s), diff. cores: 10.008 GB / 1986ms = 5039 MB/s (1969ms or 99% writing)
13 threads(s), diff. cores: 10.010 GB / 2116ms = 4730 MB/s (2099ms or 99% writing)
14 threads(s), diff. cores: 10.010 GB / 2086ms = 4798 MB/s (2055ms or 98% writing)
15 threads(s), diff. cores: 10.005 GB / 2080ms = 4810 MB/s (1997ms or 96% writing)
16 threads(s), diff. cores: 10.000 GB / 2185ms = 4576 MB/s (2095ms or 95% writing)
17 threads(s), diff. cores: 10.013 GB / 2126ms = 4709 MB/s (2109ms or 99% writing)
18 threads(s), diff. cores: 10.008 GB / 2236ms = 4475 MB/s (2181ms or 97% writing)
19 threads(s), diff. cores: 10.013 GB / 2212ms = 4526 MB/s (2133ms or 96% writing)
20 threads(s), diff. cores: 10.000 GB / 2185ms = 4576 MB/s (2168ms or 99% writing)
21 threads(s), diff. cores: 10.017 GB / 2192ms = 4569 MB/s (2174ms or 99% writing)
22 threads(s), diff. cores: 10.010 GB / 2171ms = 4610 MB/s (2152ms or 99% writing)
23 threads(s), diff. cores: 10.005 GB / 2172ms = 4606 MB/s (2154ms or 99% writing)
24 threads(s), diff. cores: 10.008 GB / 2290ms = 4370 MB/s (2271ms or 99% writing)
25 threads(s), diff. cores: 10.000 GB / 2281ms = 4384 MB/s (2262ms or 99% writing)
26 threads(s), diff. cores: 10.010 GB / 2372ms = 4220 MB/s (2352ms or 99% writing)
27 threads(s), diff. cores: 10.017 GB / 2368ms = 4230 MB/s (2349ms or 99% writing)
28 threads(s), diff. cores: 10.024 GB / 2362ms = 4243 MB/s (2343ms or 99% writing)
29 threads(s), diff. cores: 10.005 GB / 2361ms = 4237 MB/s (2341ms or 99% writing)
30 threads(s), diff. cores: 10.020 GB / 2388ms = 4195 MB/s (2369ms or 99% writing)
31 threads(s), diff. cores: 10.013 GB / 2297ms = 4359 MB/s (2277ms or 99% writing)
32 threads(s), diff. cores: 10.016 GB / 2274ms = 4404 MB/s (2255ms or 99% writing)
33 threads(s), diff. cores: 10.032 GB / 2306ms = 4350 MB/s (2286ms or 99% writing)
34 threads(s), diff. cores: 10.030 GB / 2341ms = 4284 MB/s (2321ms or 99% writing)
35 threads(s), diff. cores: 10.010 GB / 2404ms = 4163 MB/s (2383ms or 99% writing)
36 threads(s), diff. cores: 10.008 GB / 2555ms = 3917 MB/s (2446ms or 95% writing)
37 threads(s), diff. cores: 10.027 GB / 2461ms = 4074 MB/s (2440ms or 99% writing)
38 threads(s), diff. cores: 10.032 GB / 2508ms = 4000 MB/s (2488ms or 99% writing)
39 threads(s), diff. cores: 10.023 GB / 2470ms = 4057 MB/s (2449ms or 99% writing)
40 threads(s), diff. cores: 10.000 GB / 2543ms = 3932 MB/s (2521ms or 99% writing)
41 threads(s), diff. cores: 10.004 GB / 2549ms = 3924 MB/s (2528ms or 99% writing)
42 threads(s), diff. cores: 10.038 GB / 2592ms = 3872 MB/s (2570ms or 99% writing)
43 threads(s), diff. cores: 10.019 GB / 2672ms = 3749 MB/s (2651ms or 99% writing)
44 threads(s), diff. cores: 10.032 GB / 2790ms = 3595 MB/s (2700ms or 96% writing)
45 threads(s), diff. cores: 10.035 GB / 2701ms = 3715 MB/s (2676ms or 99% writing)
46 threads(s), diff. cores: 10.028 GB / 2746ms = 3651 MB/s (2720ms or 99% writing)
47 threads(s), diff. cores: 10.011 GB / 2794ms = 3583 MB/s (2770ms or 99% writing)
48 threads(s), diff. cores: 10.032 GB / 2850ms = 3520 MB/s (2823ms or 99% writing)
49 threads(s), diff. cores: 10.045 GB / 2983ms = 3367 MB/s (2936ms or 98% writing)
50 threads(s), diff. cores: 10.000 GB / 2965ms = 3372 MB/s (2939ms or 99% writing)
51 threads(s), diff. cores: 10.047 GB / 2904ms = 3459 MB/s (2879ms or 99% writing)
52 threads(s), diff. cores: 10.036 GB / 2893ms = 3469 MB/s (2866ms or 99% writing)
53 threads(s), diff. cores: 10.017 GB / 3011ms = 3326 MB/s (2981ms or 99% writing)
54 threads(s), diff. cores: 10.044 GB / 2856ms = 3516 MB/s (2830ms or 99% writing)
55 threads(s), diff. cores: 10.010 GB / 2917ms = 3431 MB/s (2891ms or 99% writing)
56 threads(s), diff. cores: 10.024 GB / 2857ms = 3508 MB/s (2829ms or 99% writing)
 1 threads(s), same core: 10.000 GB / 1805ms = 5540 MB/s (1804ms or 99% writing)
 2 threads(s), same core: 10.000 GB / 1796ms = 5567 MB/s (1817ms or 101% writing)
 3 threads(s), same core: 10.002 GB / 1814ms = 5513 MB/s (1813ms or 99% writing)
 4 threads(s), same core: 10.000 GB / 1800ms = 5555 MB/s (1798ms or 99% writing)
 5 threads(s), same core: 10.000 GB / 1798ms = 5561 MB/s (1822ms or 101% writing)
 6 threads(s), same core: 10.002 GB / 1777ms = 5628 MB/s (1814ms or 102% writing)
 7 threads(s), same core: 10.003 GB / 1789ms = 5591 MB/s (1786ms or 99% writing)
 8 threads(s), same core: 10.000 GB / 1789ms = 5589 MB/s (1830ms or 102% writing)
 9 threads(s), same core: 10.008 GB / 1825ms = 5483 MB/s (1809ms or 99% writing)
10 threads(s), same core: 10.000 GB / 1810ms = 5524 MB/s (1804ms or 99% writing)
11 threads(s), same core: 10.010 GB / 1797ms = 5570 MB/s (1795ms or 99% writing)
12 threads(s), same core: 10.008 GB / 1848ms = 5415 MB/s (1845ms or 99% writing)
13 threads(s), same core: 10.010 GB / 1779ms = 5626 MB/s (1806ms or 101% writing)
14 threads(s), same core: 10.010 GB / 1786ms = 5604 MB/s (1816ms or 101% writing)
15 threads(s), same core: 10.005 GB / 1833ms = 5458 MB/s (1830ms or 99% writing)
16 threads(s), same core: 10.000 GB / 1829ms = 5467 MB/s (1826ms or 99% writing)
17 threads(s), same core: 10.013 GB / 1785ms = 5609 MB/s (1815ms or 101% writing)
18 threads(s), same core: 10.008 GB / 1789ms = 5594 MB/s (1825ms or 102% writing)
19 threads(s), same core: 10.013 GB / 1781ms = 5622 MB/s (1814ms or 101% writing)
20 threads(s), same core: 10.000 GB / 1768ms = 5656 MB/s (1803ms or 101% writing)
21 threads(s), same core: 10.017 GB / 1844ms = 5432 MB/s (1834ms or 99% writing)
22 threads(s), same core: 10.010 GB / 1822ms = 5493 MB/s (1818ms or 99% writing)
23 threads(s), same core: 10.005 GB / 1801ms = 5555 MB/s (1797ms or 99% writing)
24 threads(s), same core: 10.008 GB / 1796ms = 5572 MB/s (1832ms or 102% writing)
25 threads(s), same core: 10.000 GB / 1859ms = 5379 MB/s (1806ms or 97% writing)
26 threads(s), same core: 10.010 GB / 1791ms = 5589 MB/s (1827ms or 102% writing)
27 threads(s), same core: 10.017 GB / 1775ms = 5643 MB/s (1813ms or 102% writing)
28 threads(s), same core: 10.024 GB / 1798ms = 5575 MB/s (1830ms or 101% writing)
29 threads(s), same core: 10.005 GB / 1890ms = 5293 MB/s (1850ms or 97% writing)
30 threads(s), same core: 10.020 GB / 1755ms = 5709 MB/s (1785ms or 101% writing)
31 threads(s), same core: 10.013 GB / 1806ms = 5544 MB/s (1844ms or 102% writing)
32 threads(s), same core: 10.016 GB / 1799ms = 5567 MB/s (1826ms or 101% writing)
33 threads(s), same core: 10.032 GB / 1762ms = 5693 MB/s (1815ms or 103% writing)
34 threads(s), same core: 10.030 GB / 1776ms = 5647 MB/s (1813ms or 102% writing)
35 threads(s), same core: 10.010 GB / 1773ms = 5645 MB/s (1812ms or 102% writing)
36 threads(s), same core: 10.008 GB / 1826ms = 5480 MB/s (1863ms or 102% writing)
37 threads(s), same core: 10.027 GB / 1815ms = 5524 MB/s (1846ms or 101% writing)
38 threads(s), same core: 10.032 GB / 1823ms = 5503 MB/s (1830ms or 100% writing)
39 threads(s), same core: 10.023 GB / 1776ms = 5643 MB/s (1811ms or 102% writing)
40 threads(s), same core: 10.000 GB / 1769ms = 5652 MB/s (1807ms or 102% writing)
41 threads(s), same core: 10.004 GB / 1803ms = 5548 MB/s (1841ms or 102% writing)
42 threads(s), same core: 10.038 GB / 1852ms = 5420 MB/s (1841ms or 99% writing)
43 threads(s), same core: 10.019 GB / 1827ms = 5483 MB/s (1844ms or 100% writing)
44 threads(s), same core: 10.032 GB / 1787ms = 5613 MB/s (1817ms or 101% writing)
45 threads(s), same core: 10.035 GB / 1821ms = 5510 MB/s (1852ms or 101% writing)
46 threads(s), same core: 10.028 GB / 1814ms = 5528 MB/s (1842ms or 101% writing)
47 threads(s), same core: 10.011 GB / 1788ms = 5598 MB/s (1816ms or 101% writing)
48 threads(s), same core: 10.032 GB / 1794ms = 5591 MB/s (1820ms or 101% writing)
49 threads(s), same core: 10.045 GB / 1780ms = 5643 MB/s (1809ms or 101% writing)
50 threads(s), same core: 10.000 GB / 1776ms = 5630 MB/s (1841ms or 103% writing)
51 threads(s), same core: 10.047 GB / 1836ms = 5472 MB/s (1824ms or 99% writing)
52 threads(s), same core: 10.036 GB / 1890ms = 5310 MB/s (1835ms or 97% writing)
53 threads(s), same core: 10.017 GB / 1810ms = 5534 MB/s (1836ms or 101% writing)
54 threads(s), same core: 10.044 GB / 1783ms = 5633 MB/s (1815ms or 101% writing)
55 threads(s), same core: 10.010 GB / 1771ms = 5652 MB/s (1831ms or 103% writing)
56 threads(s), same core: 10.024 GB / 1793ms = 5590 MB/s (1820ms or 101% writing)
c++ multithreading performance file-writing solid-state-drive
1 Answer (score: -2)

One of the real problems I see is that you use a mutex shared between all threads, essentially creating a queue that the threads write through one at a time. Second, even without the mutex, multiple threads writing to the same file creates a bottleneck. So of course your times grow with the number of threads, because in reality the threads are not writing simultaneously.

A real test of, and solution for, multithreaded writing of a single data source would look like this (a sketch follows the list):

  1. Split the data source across the N threads.
  2. Have each thread write its data to its own temporary file.
  3. Finally, concatenate the temporary files in thread order to produce the final output.
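A minimal sketch of that approach, with illustrative thread and chunk counts and a trivial data generator standing in for the real per-thread work:

#include <cstddef>
#include <filesystem>
#include <format>
#include <fstream>
#include <thread>
#include <vector>

int main() {
    constexpr int nThreads = 8;                   // illustrative
    constexpr std::size_t chunkSize = 1'000'000;  // 1 MB chunks
    constexpr std::size_t chunksPerThread = 100;

    // Steps 1 + 2: each thread writes its share of the data to its own temporary file.
    {
        std::vector<std::jthread> threads;
        for (int i = 0; i < nThreads; ++i)
            threads.emplace_back([i] {
                std::ofstream part(std::format("out.part{}.tmp", i), std::ios::binary);
                const std::vector<char> chunk(chunkSize, 'x');
                for (std::size_t n = 0; n < chunksPerThread; ++n)
                    part.write(chunk.data(), static_cast<std::streamsize>(chunk.size()));
            });
    }  // jthreads join here

    // Step 3: concatenate the parts in thread order into the final file.
    std::ofstream out("out.tmp", std::ios::binary);
    for (int i = 0; i < nThreads; ++i) {
        const auto partPath = std::format("out.part{}.tmp", i);
        std::ifstream part(partPath, std::ios::binary);
        out << part.rdbuf();
        part.close();
        std::filesystem::remove(partPath);
    }
}

Whether this actually helps depends on whether the final concatenation (or a filesystem-level equivalent) ends up cheaper than the serialization it removes.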