Batch processing / splitting a PostgreSQL database

Question

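I am working on a project that processes data in batches and fills a PostgreSQL database (9.6, but I could upgrade). The way it currently works is that processing happens in distinct steps, and each step adds data to a table that it owns (two processes rarely write to the same table, and if they do, they write to different columns).

The way the data is structured, it tends to become more and more granular with each step. As a simplified example, I have one table that defines the data sources. There are very few of them (tens or hundreds), but each data source generates batches of data samples (batches and samples are separate tables that store metadata). Each batch typically yields about 50k samples. Each of these data points is then processed step by step, and each data sample generates more data points in the next table.

This worked fine until we reached 1.5 million rows in the sample table (which is not a lot of data from our point of view). Now, filtering for a batch has become slow (about 10 ms for each sample we retrieve). It is becoming a major bottleneck, because obtaining the data for one batch takes 5-10 minutes (the fetch itself takes only milliseconds).

We have b-tree indexes on all foreign keys involved in these queries.

Since our computations target one batch at a time, I usually do not need to query across batches during a computation (which is when query time hurts most at the moment). However, for data analysis, ad-hoc queries across batches need to remain possible.

So one very simple solution would be to create a separate database for each batch and somehow query across these databases when needed. If each database held only one batch, filtering for a single batch would obviously be instant and my problem would be solved (for now). However, I would then end up with thousands of databases, and data analysis would be painful.

Within PostgreSQL, is there a way to pretend that I have separate databases for some queries? Ideally, I would set this up for each batch when I "register" a new batch.

Outside the world of PostgreSQL, is there another database I should try for my use case?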

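Edit: DDL / schema

In our current implementation, sample_representation is the table that all processing results depend on. A batch is really defined by the tuple (batch.id, representation.id). The query I tried, and described above as slow (10 ms per sample, which adds up to about 5 minutes for 50k samples), is: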

SELECT sample_representation.id, sample.sample_pos
FROM sample_representation
JOIN sample ON sample.id = sample_representation.id_sample
WHERE sample_representation.id_representation = 'representation-uuid'
  AND sample.id_batch = 'batch-uuid'

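We currently have about 1.5 million samples, 2 representations and 460 batches (49 of which have been processed; the others have no samples associated with them), which means each batch has about 30k samples on average. Some have around 50k.

The schema is below. There is some metadata associated with all tables, but I am not querying it in this case. The actual sample data is stored separately on disk, not in the database, in case that matters.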

-- uuid_generate_v1mc() is provided by the uuid-ossp extension
create extension if not exists "uuid-ossp";

create table batch
(
    id uuid default uuid_generate_v1mc() not null
        constraint batch_pk
            primary key,
    path text not null
        constraint unique_batch_path
            unique,
    id_data_source uuid
);

create table sample
(
    id uuid default uuid_generate_v1mc() not null
        constraint sample_pk
            primary key,
    sample_pos integer,
    id_batch uuid
        constraint batch_fk
            references batch
                on update cascade on delete set null
);

create index sample_sample_pos_index
    on sample (sample_pos);

create index sample_id_batch_sample_pos_index
    on sample (id_batch, sample_pos);

create table representation
(
    id uuid default uuid_generate_v1mc() not null
        constraint representation_pk
            primary key,
    id_data_source uuid
);

create table data_source
(
    id uuid default uuid_generate_v1mc() not null
        constraint data_source_pk
            primary key
);

-- foreign keys to data_source are added after that table exists
alter table batch
    add constraint data_source_fk
        foreign key (id_data_source) references data_source
            on update cascade on delete set null;

alter table representation
    add constraint data_source_fk
        foreign key (id_data_source) references data_source
            on update cascade on delete set null;

create table sample_representation
(
    id uuid default uuid_generate_v1mc() not null
        constraint sample_representation_pk
            primary key,
    id_sample uuid
        constraint sample_fk
            references sample
                on update cascade on delete set null,
    id_representation uuid
        constraint representation_fk
            references representation
                on update cascade on delete set null
);

create unique index sample_representation_id_sample_id_representation_uindex
    on sample_representation (id_sample, id_representation);

create index sample_representation_id_sample_index
    on sample_representation (id_sample);

create index sample_representation_id_representation_index
    on sample_representation (id_representation);

Answer 1

After some fiddling, I found a solution. However, I am still not sure why the original query took that much time:

SELECT sample_representation.id, sample.sample_pos
FROM sample_representation
JOIN sample ON sample.id = sample_representation.id_sample
WHERE sample_representation.id_representation = 'representation-uuid'
  AND sample.id_batch = 'batch-uuid'

Everything is indexed, but the tables are relatively large: sample and sample_representation each have 1.5 million rows. My guess is that the tables are joined first and the result is then filtered by the WHERE clause. But even if the join produces a large intermediate result, it should not take that long?!
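One way to check that guess, rather than assume the join order, is to ask PostgreSQL for the actual execution plan. The following is a minimal sketch, assuming the schema from the question; the two UUID literals are placeholders and must be replaced with real values before running it.

```sql
-- Refresh planner statistics first, in case they are stale after bulk inserts.
ANALYZE sample;
ANALYZE sample_representation;

-- Run the query and print the plan that was actually executed, with timings
-- and buffer usage per plan node. Replace the placeholder UUIDs with real ones.
EXPLAIN (ANALYZE, BUFFERS)
SELECT sample_representation.id, sample.sample_pos
FROM sample_representation
JOIN sample ON sample.id = sample_representation.id_sample
WHERE sample_representation.id_representation = '00000000-0000-0000-0000-000000000000'
  AND sample.id_batch = '00000000-0000-0000-0000-000000000000';
```

In the output, comparing the estimated row counts with the actual ones for each node usually shows where the planner picked an expensive strategy, for example a nested loop that performs one index lookup per sample.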

Anyway, I tried using CTEs instead of joining the two "massive" tables. The idea was to filter early and join afterwards:

WITH sel_samplerepresentation AS (
    SELECT *
    FROM sample_representation
    WHERE id_representation = '1437a5da-e4b1-11e7-a254-7fff1955d16a'
), sel_samples AS (
    SELECT *
    FROM sample
    WHERE id_batch = '75c04b9c-e4b9-11e7-a93f-132baa27ac91'
)
SELECT sel_samples.sample_pos, sel_samplerepresentation.id
FROM sel_samplerepresentation
JOIN sel_samples ON sel_samples.id = sel_samplerepresentation.id_sample

This query also takes forever. Here the reason is clear: sel_samples and sel_samplerepresentation have 50k records each, and the join happens on non-indexed columns of the CTEs.

Since CTEs cannot be indexed, I rewrote them as materialized views, to which I can add indexes.
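The exact statements are not shown here. A minimal sketch of the materialized-view approach might look like the following, assuming the view names are carried over from the CTEs, the batch filter uses the id_batch column from the schema, and the index names (which are hypothetical) are free to choose:

```sql
-- Sketch only: view names reuse the CTE names, index names are made up.
-- Filter each large table once and store the result.
CREATE MATERIALIZED VIEW sel_samplerepresentation AS
SELECT id, id_sample
FROM sample_representation
WHERE id_representation = '1437a5da-e4b1-11e7-a254-7fff1955d16a';

CREATE MATERIALIZED VIEW sel_samples AS
SELECT id, sample_pos
FROM sample
WHERE id_batch = '75c04b9c-e4b9-11e7-a93f-132baa27ac91';

-- Unlike CTEs, materialized views can be indexed on the join columns.
CREATE INDEX sel_samplerepresentation_id_sample_idx
    ON sel_samplerepresentation (id_sample);
CREATE INDEX sel_samples_id_idx
    ON sel_samples (id);

-- Join the two pre-filtered, indexed result sets.
SELECT sel_samples.sample_pos, sel_samplerepresentation.id
FROM sel_samplerepresentation
JOIN sel_samples ON sel_samples.id = sel_samplerepresentation.id_sample;

-- Clean up once the batch has been processed.
DROP MATERIALIZED VIEW sel_samplerepresentation;
DROP MATERIALIZED VIEW sel_samples;
```

Because the UUIDs are baked into the view definitions, a pair of views like this would have to be created per (batch, representation) combination, which is part of why the approach is closer to a workaround than to a general solution.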

This is more of a hack than a solution, but executing these queries now takes 1 second (down from 8 minutes)!
