How to improve date-based query performance on a large table?

Question

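This is related to 2 other questions I posted (it sounds like I should post this as a new question) - the feedback helped, but I think the same problem will come back the next time I need to insert data. Things were still running slowly, which forced me to temporarily remove some of the older data so that only 2 months' worth remains in the table I'm querying.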

Indexing strategy for different combinations of WHERE clauses incl. text patterns

How to get date_part query to hit index?

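Giving more detail this time - hopefully it helps pinpoint the issue:

PG version 10.7 (running on Heroku)
Total DB size: 18.4GB (this contains 2 months of data, and it will grow at roughly the same rate each month)
15GB RAM
Total available storage: 512GB
The largest table (the one the slowest query acts on) is 9.6GB (it's the biggest chunk of the total DB) - about 10 million records

Schema of the largest table: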

-- Table Definition ----------------------------------------------

CREATE TABLE reportimpression (
    datelocal timestamp without time zone,
    devicename text,
    network text,
    sitecode text,
    advertisername text,
    mediafilename text,
    gender text,
    agegroup text,
    views integer,
    impressions integer,
    dwelltime numeric
);

-- Indices -------------------------------------------------------

CREATE INDEX reportimpression_feb2019_index ON reportimpression(datelocal timestamp_ops) WHERE datelocal >= '2019-02-01 00:00:00'::timestamp without time zone AND datelocal < '2019-03-01 00:00:00'::timestamp without time zone;
CREATE INDEX reportimpression_mar2019_index ON reportimpression(datelocal timestamp_ops) WHERE datelocal >= '2019-03-01 00:00:00'::timestamp without time zone AND datelocal < '2019-04-01 00:00:00'::timestamp without time zone;
CREATE INDEX reportimpression_jan2019_index ON reportimpression(datelocal timestamp_ops) WHERE datelocal >= '2019-01-01 00:00:00'::timestamp without time zone AND datelocal < '2019-02-01 00:00:00'::timestamp without time zone;

The slow query:

SELECT
    date_part('hour', datelocal) AS hour,
    SUM(CASE WHEN gender = 'male' THEN views ELSE 0 END) AS male,
    SUM(CASE WHEN gender = 'female' THEN views ELSE 0 END) AS female
FROM reportimpression
WHERE
    datelocal >= '3-1-2019' AND
    datelocal < '4-1-2019'
GROUP BY date_part('hour', datelocal)
ORDER BY date_part('hour', datelocal)

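The date range in this query will generally be an entire month (it accepts user input from a web-based report) - as you can see, I've tried creating an index for each month's worth of data. That helped, but as far as I can tell, unless the query has been run recently (putting the results in cache), it can still take up to a minute to run.

EXPLAIN ANALYZE results: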

Finalize GroupAggregate  (cost=1035890.38..1035897.86 rows=1361 width=24) (actual time=3536.089..3536.108 rows=24 loops=1)
  Group Key: (date_part('hour'::text, datelocal))
  ->  Sort  (cost=1035890.38..1035891.06 rows=1361 width=24) (actual time=3536.083..3536.087 rows=48 loops=1)
        Sort Key: (date_part('hour'::text, datelocal))
        Sort Method: quicksort  Memory: 28kB
        ->  Gather  (cost=1035735.34..1035876.21 rows=1361 width=24) (actual time=3535.926..3579.818 rows=48 loops=1)
              Workers Planned: 1
              Workers Launched: 1
              ->  Partial HashAggregate  (cost=1034735.34..1034740.11 rows=1361 width=24) (actual time=3532.917..3532.933 rows=24 loops=2)
                    Group Key: date_part('hour'::text, datelocal)
                    ->  Parallel Index Scan using reportimpression_mar2019_index on reportimpression  (cost=0.09..1026482.42 rows=3301168 width=17) (actual time=0.045..2132.174 rows=2801158 loops=2)
Planning time: 0.517 ms
Execution time: 3579.965 ms

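I wouldn't think that 10 million records is too much to handle, especially given that I recently bumped up the PG plan I'm on to throw more resources at it, so I assume the issue is still either my indexes or my queries not being very efficient.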

Answer 1

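A materialized view is the way to go for what you outlined. Querying past months of read-only data works without refreshing it. You may want to special-case the current month if you need to cover that, too.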
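A minimal sketch of such a view, assuming hourly totals per gender are all the report needs. The names reportimpression_hourly and hour_start are hypothetical, and the ISO date literals are just one unambiguous way to write the month boundaries:

```sql
-- Hypothetical view: one row per (hour, gender) instead of one row per impression.
CREATE MATERIALIZED VIEW reportimpression_hourly AS
SELECT date_trunc('hour', datelocal) AS hour_start,
       gender,
       sum(views) AS views
FROM   reportimpression
GROUP  BY 1, 2;

-- A plain b-tree index supports the month-range filter on the view.
CREATE INDEX ON reportimpression_hourly (hour_start);

-- The report then reads the small pre-aggregated view.
SELECT date_part('hour', hour_start) AS hour,
       sum(CASE WHEN gender = 'male'   THEN views ELSE 0 END) AS male,
       sum(CASE WHEN gender = 'female' THEN views ELSE 0 END) AS female
FROM   reportimpression_hourly
WHERE  hour_start >= '2019-03-01'
AND    hour_start <  '2019-04-01'
GROUP  BY 1
ORDER  BY 1;

-- Run after loading new data; past months do not change, so this can be infrequent.
REFRESH MATERIALIZED VIEW reportimpression_hourly;
```

Because month boundaries fall on whole hours, filtering on hour_start selects the same rows as filtering on datelocal in the base table.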

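The underlying query can still benefit from an index, and there are two directions you might take:

First off, partial indexes like the ones you have now won't buy much in your scenario; they are not worth it. If you collect many more months of data and mostly query by month (and add / drop rows by month), table partitioning might be an idea; then your indexes are partitioned automatically, too. I'd consider Postgres 11 or even the upcoming Postgres 12 for this, though.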
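A rough sketch of monthly range partitioning, assuming Postgres 11 or later (where an index created on the partitioned parent is created on every partition as well). The table names reportimpression_part, reportimpression_2019_02 and reportimpression_2019_03 are hypothetical:

```sql
-- Hypothetical partitioned copy of the table, split by month on datelocal.
CREATE TABLE reportimpression_part (
    LIKE reportimpression INCLUDING DEFAULTS
) PARTITION BY RANGE (datelocal);

CREATE TABLE reportimpression_2019_02 PARTITION OF reportimpression_part
    FOR VALUES FROM ('2019-02-01') TO ('2019-03-01');

CREATE TABLE reportimpression_2019_03 PARTITION OF reportimpression_part
    FOR VALUES FROM ('2019-03-01') TO ('2019-04-01');

-- Postgres 11+: one statement, one matching index per partition.
CREATE INDEX ON reportimpression_part (datelocal);

-- Removing an old month becomes a cheap metadata operation instead of a big DELETE.
ALTER TABLE reportimpression_part DETACH PARTITION reportimpression_2019_02;
DROP TABLE reportimpression_2019_02;
```

A query with a month range on datelocal can then skip the partitions that cannot contain matching rows.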

If your rows are wide, create an index that allows index-only scans. Like:

CREATE INDEX reportimpression_covering_idx ON reportimpression(datelocal, views, gender);

Related:

How does PostgreSQL perform ORDER BY if a b-tree index is built on that field?

Or INCLUDE additional columns in Postgres 11 or later:

CREATE INDEX reportimpression_covering_idx ON reportimpression(datelocal) INCLUDE (views, gender);
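One way to check whether a covering index actually produces an index-only scan is sketched below. It assumes one of the covering indexes above exists; index-only scans also depend on the visibility map, which VACUUM maintains, so the table is vacuumed first:

```sql
-- Update the visibility map and planner statistics.
VACUUM (ANALYZE) reportimpression;

-- Look for "Index Only Scan" in the plan and a low "Heap Fetches" count.
EXPLAIN (ANALYZE, BUFFERS)
SELECT date_part('hour', datelocal) AS hour,
       sum(CASE WHEN gender = 'male'   THEN views ELSE 0 END) AS male,
       sum(CASE WHEN gender = 'female' THEN views ELSE 0 END) AS female
FROM   reportimpression
WHERE  datelocal >= '2019-03-01'
AND    datelocal <  '2019-04-01'
GROUP  BY 1
ORDER  BY 1;
```

If the plan still shows a plain Index Scan or many heap fetches, the index is not saving the table visits it was built to avoid.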

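Else, if your rows are physically sorted by datelocal, consider a BRIN index. It's extremely small and probably about as fast as a B-tree index for your case. (And being so small, it stays cached much more easily and pushes out less other data.)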

CREATE INDEX reportimpression_brin_idx ON reportimpression USING BRIN (datelocal);
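Whether the rows are physically sorted by datelocal can be estimated from the planner statistics. This is only a quick check, and it assumes the table has been analyzed recently:

```sql
-- correlation close to 1 (or -1) means the physical row order follows datelocal,
-- which is the situation where a BRIN index works well.
SELECT tablename, attname, correlation
FROM   pg_stats
WHERE  tablename = 'reportimpression'
AND    attname   = 'datelocal';
```

A value near 0 suggests the rows are scattered, in which case BRIN would have to read most of the table.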

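You may be interested in CLUSTER or pg_repack to physically sort table rows. pg_repack can do it without exclusive locks on the table and even without a btree index (which CLUSTER requires). But it's an additional module that is not shipped with the standard distribution of Postgres.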
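A sketch of the CLUSTER variant, with a hypothetical index name reportimpression_datelocal_idx. CLUSTER takes an exclusive lock for the duration of the rewrite and is a one-time operation, so rows inserted later are not kept in order:

```sql
-- CLUSTER needs a b-tree index to define the target order.
CREATE INDEX IF NOT EXISTS reportimpression_datelocal_idx
    ON reportimpression (datelocal);

-- Rewrites the table in index order; blocks reads and writes while it runs.
CLUSTER reportimpression USING reportimpression_datelocal_idx;

-- Refresh statistics so the planner sees the new physical order.
ANALYZE reportimpression;
```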


Answer 2

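Your execution plan seems to be doing the right thing.

Things you can do to improve, in descending order of effectiveness:

Use a materialized view that pre-aggregates the data.
Don't use a hosted database; use your own iron with good local storage and lots of RAM.
Use only one index instead of several partial ones. This is not primarily performance advice (the query will probably not be measurably slower unless you have a lot of indexes), but it will ease the management burden.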
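The last point could look like the following sketch, where reportimpression_datelocal_idx is a hypothetical name for the single replacement index:

```sql
-- Build one index over all rows without blocking writes.
CREATE INDEX CONCURRENTLY IF NOT EXISTS reportimpression_datelocal_idx
    ON reportimpression (datelocal);

-- Then drop the per-month partial indexes from the question.
DROP INDEX reportimpression_jan2019_index,
           reportimpression_feb2019_index,
           reportimpression_mar2019_index;
```

With a single index there is no need to remember to create a new partial index at the start of every month.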
