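I'm currently developing a web service that supports multiple databases. I'm trying to optimize my tables and fix missing indexes. Here is the MySQL query: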
SELECT 'UTC' AS timezone, pak.id AS package_id, rel.unique_id AS relay, sns.unique_id AS sensor, pak.rtime AS time,
sns.units AS sensor_units, typ.name AS sensor_type, dat.data AS sensor_data,
loc.altitude AS altitude, Y(loc.location) AS latitude, X(loc.location) as longitude,
loc.speed as speed, loc.climb as climb, loc.track as track,
loc.longitude_error as longitude_error, loc.latitude_error as latitude_error, loc.altitude_error as altitude_error,
loc.speed_error as speed_error, loc.climb_error as climb_error, loc.track_error as track_error
FROM sensor_data dat
LEFT OUTER JOIN package_location loc on dat.package_id = loc.package_id
LEFT OUTER JOIN data_package pak ON dat.package_id = pak.id
LEFT OUTER JOIN relays rel ON pak.relay_id = rel.id
LEFT OUTER JOIN sensors sns ON dat.sensor_id = sns.id
LEFT OUTER JOIN sensor_types typ ON sns.sensor_type = typ.id
WHERE typ.name='Temperature'
AND rel.unique_id='OneWireTester'
AND pak.rtime > '2015-01-01'
AND pak.rtime < '2016-01-01'
and its EXPLAIN output...
+----+-------------+-------+--------+------------------------------------------+----------------------+---------+------------------------+------+----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+------------------------------------------+----------------------+---------+------------------------+------+----------------------------------------------------+
| 1 | SIMPLE | rel | ALL | PRIMARY | NULL | NULL | NULL | 5 | Using where |
| 1 | SIMPLE | pak | ref | PRIMARY,fk_package_relay_id | fk_package_relay_id | 9 | BigSense.rel.id | 1 | Using index condition; Using where |
| 1 | SIMPLE | dat | ref | fk_sensor_package_id,fk_sensor_sensor_id | fk_sensor_package_id | 9 | BigSense.pak.id | 1 | NULL |
| 1 | SIMPLE | sns | eq_ref | PRIMARY,fk_sensors_type_id | PRIMARY | 8 | BigSense.dat.sensor_id | 1 | NULL |
| 1 | SIMPLE | loc | eq_ref | PRIMARY | PRIMARY | 8 | BigSense.pak.id | 1 | NULL |
| 1 | SIMPLE | typ | ALL | PRIMARY | NULL | NULL | NULL | 5 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+-------+--------+------------------------------------------+----------------------+---------+------------------------+------+----------------------------------------------------+
...which looks pretty simple. I need to add indexes on the relays and sensor_types tables to optimize the query.
The PostgreSQL version of the tables is nearly identical. However, when I use the following query:
SELECT 'UTC' AS timezone, pak.id AS package_id, rel.unique_id AS relay, sns.unique_id AS sensor, pak.rtime AS time,
sns.units AS sensor_units, typ.name AS sensor_type, dat.data AS sensor_data,
loc.altitude AS altitude, ST_Y(loc.location::geometry) AS latitude, ST_X(loc.location::geometry) as longitude,
loc.speed as speed, loc.climb as climb, loc.track as track,
loc.longitude_error as longitude_error, loc.latitude_error as latitude_error, loc.altitude_error as altitude_error,
loc.speed_error as speed_error, loc.climb_error as climb_error, loc.track_error as track_error
FROM sensor_data dat
LEFT OUTER JOIN package_location loc on dat.package_id = loc.package_id
LEFT OUTER JOIN data_package pak ON dat.package_id = pak.id
LEFT OUTER JOIN relays rel ON pak.relay_id = rel.id
LEFT OUTER JOIN sensors sns ON dat.sensor_id = sns.id
LEFT OUTER JOIN sensor_types typ ON sns.sensor_type = typ.id
WHERE typ.name='Temperature'
AND rel.unique_id='OneWireTester'
AND pak.rtime > '2015-01-01'
AND pak.rtime < '2016-01-01';
and run EXPLAIN ANALYZE on it, I get the following:
                                                                         QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=36.23..131.80 rows=1 width=477) (actual time=0.074..3.933 rows=76 loops=1)
   ->  Nested Loop  (cost=36.09..131.60 rows=1 width=349) (actual time=0.068..3.782 rows=76 loops=1)
         ->  Nested Loop  (cost=35.94..130.58 rows=4 width=267) (actual time=0.062..2.472 rows=620 loops=1)
               ->  Hash Join  (cost=35.67..128.73 rows=4 width=247) (actual time=0.053..0.611 rows=620 loops=1)
                     Hash Cond: (dat.sensor_id = sns.id)
                     ->  Seq Scan on sensor_data dat  (cost=0.00..89.46 rows=946 width=21) (actual time=0.007..0.178 rows=1006 loops=1)
                     ->  Hash  (cost=35.64..35.64 rows=2 width=238) (actual time=0.037..0.037 rows=11 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 1kB
                           ->  Hash Join  (cost=20.68..35.64 rows=2 width=238) (actual time=0.019..0.035 rows=11 loops=1)
                                 Hash Cond: (sns.sensor_type = typ.id)
                                 ->  Seq Scan on sensors sns  (cost=0.00..13.60 rows=360 width=188) (actual time=0.002..0.005 rows=31 loops=1)
                                 ->  Hash  (cost=20.62..20.62 rows=4 width=66) (actual time=0.010..0.010 rows=1 loops=1)
                                       Buckets: 1024  Batches: 1  Memory Usage: 1kB
                                       ->  Seq Scan on sensor_types typ  (cost=0.00..20.62 rows=4 width=66) (actual time=0.006..0.008 rows=1 loops=1)
                                             Filter: ((name)::text = 'Temperature'::text)
                                             Rows Removed by Filter: 4
               ->  Index Scan using data_package_pkey on data_package pak  (cost=0.28..0.45 rows=1 width=20) (actual time=0.002..0.002 rows=1 loops=620)
                     Index Cond: (id = dat.package_id)
                     Filter: ((rtime > '2015-01-01 00:00:00'::timestamp without time zone) AND (rtime < '2016-01-01 00:00:00'::timestamp without time zone))
         ->  Index Scan using relays_pkey on relays rel  (cost=0.14..0.24 rows=1 width=94) (actual time=0.002..0.002 rows=0 loops=620)
               Index Cond: (id = pak.relay_id)
               Filter: ((unique_id)::text = 'OneWireTester'::text)
               Rows Removed by Filter: 1
   ->  Index Scan using package_location_pkey on package_location loc  (cost=0.14..0.18 rows=1 width=140) (actual time=0.001..0.001 rows=0 loops=76)
         Index Cond: (dat.package_id = package_id)
 Planning time: 0.959 ms
 Execution time: 4.030 ms
(27 rows)
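The table schemas have the same foreign keys and general structure, so I expected to see the same indexes needed. However, I've been going through several guides on reading pgsql EXPLAIN output, and from what I gather, a Seq Scan is an indicator of a missing index, which would mean I'm missing indexes on sensors, sensor_data and sensor_types.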
Am I interpreting the EXPLAIN output correctly? What should I be looking for to optimize both databases?
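In PostgreSQL (and probably in MySQL too), indexes are not used simply because they are defined; they are used when they can speed up the query.
In the EXPLAIN ANALYZE output you can see, in parentheses, a section about cost followed by a similar section about actual time. The query planner looks at cost, which is defined by a number of parameters listed in the configuration file. These costs cover IO and CPU time, among other things, and the former is typically weighted much higher than the latter (often by a factor of 100). That means the planner tries to minimize the amount of data that has to be read from disk. Data is read in pages of a pre-determined size (8kB by default in PostgreSQL) rather than as individual rows, because the physical characteristics of hard drives make that faster. Both the table itself and its indexes are stored on disk. If the table is small, it fits in a few pages, possibly just one. Since CPU time is cheaper than IO time, sequentially scanning a few pages is much faster than paying the extra IO of reading disk pages through an index.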
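To make the cost discussion concrete, here is a minimal sketch (PostgreSQL only) of how one might inspect the planner's cost constants and its idea of each table's size. The catalog views and setting names are standard; the table names come from the question, and the defaults quoted in the comments assume an untuned configuration.

```sql
-- Planner cost constants; stock defaults are noted on the right.
SELECT name, setting
FROM pg_settings
WHERE name IN ('seq_page_cost',         -- 1.0
               'random_page_cost',      -- 4.0
               'cpu_tuple_cost',        -- 0.01
               'cpu_index_tuple_cost',  -- 0.005
               'cpu_operator_cost');    -- 0.0025

-- Page size the server was built with (8192 bytes unless changed at build time).
SHOW block_size;

-- How many pages and rows the planner believes each table has.
-- These figures are only refreshed by ANALYZE / autovacuum.
SELECT relname, relpages, reltuples
FROM pg_class
WHERE relname IN ('sensor_data', 'sensors', 'sensor_types',
                  'relays', 'data_package');
```

Under those defaults, an unfiltered sequential scan is costed at roughly relpages * seq_page_cost + reltuples * cpu_tuple_cost. The plan's estimate for sensors (cost 13.60, rows=360) is consistent with about 10 pages: 10 * 1.0 + 360 * 0.01 = 13.60. Likewise sensor_data (cost 89.46, rows=946) is consistent with about 80 pages. The page counts are inferred from the plan, not taken from the question.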
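As the EXPLAIN ANALYZE output shows, most of your tables are small and fit in just a few pages. If you really want to test what your indexes do, load the tables with a million or so rows of random data and then test.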
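One way to try this without touching the real schema is a throwaway table. The sketch below is PostgreSQL-specific (it relies on generate_series), and the table, column and index names are hypothetical, chosen only to mimic the rtime range filter from the question.

```sql
-- Hypothetical scratch table; not part of the question's schema.
CREATE TABLE scratch_readings (
    id      bigserial PRIMARY KEY,
    rtime   timestamp NOT NULL,
    payload double precision
);

-- One million rows with timestamps spread randomly over about ten years.
INSERT INTO scratch_readings (rtime, payload)
SELECT timestamp '2010-01-01' + random() * interval '3650 days',
       random()
FROM generate_series(1, 1000000);

CREATE INDEX scratch_readings_rtime_idx ON scratch_readings (rtime);

-- Refresh statistics so the planner sees the new row counts.
ANALYZE scratch_readings;

-- A narrow range touches a small fraction of the table, so an index
-- (or bitmap) scan becomes attractive; widen the range and compare.
EXPLAIN ANALYZE
SELECT *
FROM scratch_readings
WHERE rtime > '2015-01-01'
  AND rtime < '2015-01-08';

DROP TABLE scratch_readings;
```

Comparing the plan for a one-week range against a multi-year range should show where the planner switches between index access and a sequential scan on a given machine.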
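There is a nice tool called Explain Depesz (https://explain.depesz.com) where you can paste your EXPLAIN ANALYZE output and it presents what the query is doing visually. For example, I ran your output through it.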
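So in your example, as we can see, the main problem with the query is the index scans on lines 11 and 12 of that view (the scans on data_package and relays). In my opinion you should create better indexes, because even though the indexes are used, each of those scans ran 620 times.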
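Maybe you need an index on the rtime column of the data_package table; at the moment the plan looks rows up through data_package_pkey and applies the rtime range only as a filter. That is not the only thing, though. Optimizing a query involves several steps, and I have run into many cases where the answer was to plan the query better.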
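A minimal sketch of that suggestion follows. The table and column names are taken from the question; the index name is made up for illustration. The CREATE INDEX statement has the same form in MySQL and PostgreSQL.

```sql
-- Hypothetical index name; data_package and rtime come from the question.
CREATE INDEX data_package_rtime_idx ON data_package (rtime);

-- PostgreSQL: refresh planner statistics afterwards.
-- (The MySQL counterpart is ANALYZE TABLE data_package;)
ANALYZE data_package;
```

With tables as small as the ones in the question, the planner may well keep its current plan even with this index in place, so it is worth re-running EXPLAIN ANALYZE rather than assuming the index is used.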
For example, you could first fetch all the rows for the year you are looking for. Then, with less data to go through, you can look up the unique_id.
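As a rough sketch of that restructuring, the date range can be applied to data_package first and the remaining tables joined to the reduced set. The CTE name pak_2015 is hypothetical, the location columns are left out for brevity, and inner joins are used on the assumption that rows without a matching relay or sensor type are not wanted (the WHERE clause discards them anyway).

```sql
WITH pak_2015 AS (
    -- Step 1: only the packages received during 2015.
    SELECT id, relay_id, rtime
    FROM data_package
    WHERE rtime > '2015-01-01'
      AND rtime < '2016-01-01'
)
-- Step 2: join the reduced set and filter on the relay and sensor type.
SELECT pak.id        AS package_id,
       rel.unique_id AS relay,
       sns.unique_id AS sensor,
       pak.rtime     AS time,
       typ.name      AS sensor_type,
       dat.data      AS sensor_data
FROM pak_2015 pak
JOIN relays rel       ON rel.id = pak.relay_id
JOIN sensor_data dat  ON dat.package_id = pak.id
JOIN sensors sns      ON sns.id = dat.sensor_id
JOIN sensor_types typ ON typ.id = sns.sensor_type
WHERE rel.unique_id = 'OneWireTester'
  AND typ.name = 'Temperature';
```

Note that PostgreSQL 12 and later normally inline a CTE like this, so the planner may end up with the same plan as the original query; comparing the two with EXPLAIN ANALYZE is the only way to know whether the rewrite helps.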
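That said, your question is 9 years old. You have probably solved all of this by now and may well be a PostgreSQL EXPLAIN master. Still, I hope my answer can shed some light for others.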