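I wrote a relatively simple join query that selects one hour of data over 3 values from a large table (2+ million rows), but the fetch time increases dramatically depending on which time range is selected.
The first query selects data from the beginning of the table, the second from a little after the beginning, and the last ones from near the end of the table.
Create statement: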
CREATE TABLE `session_id_table` (
`session_id` bigint NOT NULL,
`session_start_time` datetime DEFAULT NULL,
`session_end_time` datetime DEFAULT NULL,
PRIMARY KEY (`session_id`),
UNIQUE KEY `session_id_UNIQUE` (`session_id`),
KEY `idx_session_id_table_session_id` (`session_id`),
KEY `time_1` (`session_start_time`,`session_end_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3 COLLATE=utf8mb3_bin;
insert into session_id_table(session_id, session_start_time, session_end_time) values
(1, '2024-09-05 10:00:23', '2024-09-05 10:15:54'),
(2, '2024-09-05 10:00:29', '2024-09-05 10:22:54'),
(3, '2024-09-05 10:01:23', '2024-09-05 10:26:54'),
(4, '2024-09-05 10:02:20', '2024-09-05 10:21:54'),
(5, '2024-09-05 10:02:23', '2024-09-05 10:20:54'),
(6, '2024-09-05 10:21:00', '2024-09-05 10:35:54');
The result for this data should be:
'2024-09-05 10:01:00', 1
'2024-09-05 10:01:00', 2
'2024-09-05 10:02:00', 1
'2024-09-05 10:02:00', 2
'2024-09-05 10:02:00', 3
'2024-09-05 10:21:00', 2
'2024-09-05 10:21:00', 3
'2024-09-05 10:21:00', 4
'2024-09-05 10:21:00', 6
with
dates as
(select distinct convert(date_format(session_start_time, '%Y-%c-%e %H:%i'), datetime) as date_time
from session_id_table
order by convert(date_format(session_start_time, '%Y-%c-%e %H:%i'), datetime))
select date_time, session_id from dates as a
inner join session_id_table as b
on (date_time>=session_start_time) and (date_time<=session_end_time)
#where date_time>='2023-05-25 00:00:00' and date_time<='2023-05-25 01:00:00'
#where date_time>='2023-10-25 00:00:00' and date_time<='2023-10-25 01:00:00'
#where date_time>='2023-12-10 00:00:00' and date_time<='2023-12-10 01:00:00'
#where date_time>='2024-09-10 00:00:00' and date_time<='2024-09-10 01:00:00'
where date_time>='2024-09-19 00:00:00' and date_time<='2024-09-19 01:00:00'
I created single indexes on session_start_time and session_end_time and got the following results:
Is this a MySQL configuration issue? An index issue? Or a query issue?
EXPLAIN
idx_time_1 is an index on (session_start_time, session_end_time)
idx_time is an index on (session_id, session_start_time, session_end_time)
idx_convert_time is an index on ((convert(date_format(session_start_time, '%Y-%c-%e %H:%i'), datetime)))
idx_start is an index on (session_start_time)
idx_end is an index on (session_end_time)
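For reference, the indexes listed above could be created roughly as follows. This is only a sketch: the index names and column lists come from the list, but the exact statements that were run are not shown in the post, and the functional index assumes MySQL 8.0.13 or later.

```sql
-- Sketch only: names and columns are taken from the list above;
-- the actual DDL used is not shown in the post.
CREATE INDEX idx_time_1 ON session_id_table (session_start_time, session_end_time);
CREATE INDEX idx_time   ON session_id_table (session_id, session_start_time, session_end_time);

-- Functional index (needs MySQL 8.0.13+); the expression must be wrapped
-- in its own pair of parentheses inside the key list.
CREATE INDEX idx_convert_time ON session_id_table
  ((convert(date_format(session_start_time, '%Y-%c-%e %H:%i'), datetime)));

CREATE INDEX idx_start ON session_id_table (session_start_time);
CREATE INDEX idx_end   ON session_id_table (session_end_time);
```

Note that the create statement above already defines `time_1` on the same two columns as idx_time_1.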
Here is the EXPLAIN JSON for the fastest query:
{
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "208534400.08"
},
"nested_loop": [
{
"table": {
"table_name": "a",
"access_type": "ALL",
"rows_examined_per_scan": 973,
"rows_produced_per_join": 973,
"filtered": "100.00",
"cost_info": {
"read_cost": "14.66",
"eval_cost": "97.30",
"prefix_cost": "111.96",
"data_read_per_join": "15K"
},
"used_columns": [
"date_time"
],
"materialized_from_subquery": {
"using_temporary_table": true,
"dependent": false,
"cacheable": true,
"query_block": {
"select_id": 2,
"cost_info": {
"query_cost": "1411.11"
},
"ordering_operation": {
"using_filesort": false,
"duplicates_removal": {
"using_filesort": true,
"cost_info": {
"sort_cost": "973.00"
},
"table": {
"table_name": "session_id_table",
"access_type": "range",
"possible_keys": [
"idx_time",
"idx_time_1",
"idx_start",
"idx_convert_time"
],
"key": "idx_convert_time",
"used_key_parts": [
"cast(date_format(`session_start_time`,_utf8mb4'%Y-%c-%e %H:%i') as datetime)"
],
"key_length": "6",
"rows_examined_per_scan": 973,
"rows_produced_per_join": 973,
"filtered": "100.00",
"cost_info": {
"read_cost": "340.81",
"eval_cost": "97.30",
"prefix_cost": "438.11",
"data_read_per_join": "2M"
},
"used_columns": [
"session_id",
"session_start_time",
"cast(date_format(`session_start_time`,_utf8mb4'%Y-%c-%e %H:%i') as datetime)"
],
"attached_condition": "((cast(date_format(`session_start_time`,_utf8mb4'%Y-%c-%e %H:%i') as datetime) >= '2023-05-25 00:00:00') and (cast(date_format(`session_start_time`,_utf8mb4'%Y-%c-%e %H:%i') as datetime) <= '2023-05-25 01:00:00'))"
}
}
}
}
}
}
},
{
"table": {
"table_name": "b",
"access_type": "ALL",
"possible_keys": [
"idx_time_1",
"idx_end",
"idx_start"
],
"rows_examined_per_scan": 2143170,
"rows_produced_per_join": 231654145,
"filtered": "11.11",
"range_checked_for_each_record": "index map: 0x70",
"cost_info": {
"read_cost": "3847.12",
"eval_cost": "23165414.57",
"prefix_cost": "208534400.08",
"data_read_per_join": "503G"
},
"used_columns": [
"session_id",
"session_start_time",
"session_end_time"
]
}
}
]
}
}
And here is the EXPLAIN for the slowest one:
{
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "200390348.15"
},
"nested_loop": [
{
"table": {
"table_name": "a",
"access_type": "ALL",
"rows_examined_per_scan": 935,
"rows_produced_per_join": 935,
"filtered": "100.00",
"cost_info": {
"read_cost": "14.19",
"eval_cost": "93.50",
"prefix_cost": "107.69",
"data_read_per_join": "14K"
},
"used_columns": [
"date_time"
],
"materialized_from_subquery": {
"using_temporary_table": true,
"dependent": false,
"cacheable": true,
"query_block": {
"select_id": 2,
"cost_info": {
"query_cost": "1356.01"
},
"ordering_operation": {
"using_filesort": false,
"duplicates_removal": {
"using_filesort": true,
"cost_info": {
"sort_cost": "935.00"
},
"table": {
"table_name": "session_id_table",
"access_type": "range",
"possible_keys": [
"idx_time",
"idx_time_1",
"idx_start",
"idx_convert_time"
],
"key": "idx_convert_time",
"used_key_parts": [
"cast(date_format(`session_start_time`,_utf8mb4'%Y-%c-%e %H:%i') as datetime)"
],
"key_length": "6",
"rows_examined_per_scan": 935,
"rows_produced_per_join": 935,
"filtered": "100.00",
"cost_info": {
"read_cost": "327.51",
"eval_cost": "93.50",
"prefix_cost": "421.01",
"data_read_per_join": "2M"
},
"used_columns": [
"session_id",
"session_start_time",
"cast(date_format(`session_start_time`,_utf8mb4'%Y-%c-%e %H:%i') as datetime)"
],
"attached_condition": "((cast(date_format(`session_start_time`,_utf8mb4'%Y-%c-%e %H:%i') as datetime) >= '2024-09-10 00:00:00') and (cast(date_format(`session_start_time`,_utf8mb4'%Y-%c-%e %H:%i') as datetime) <= '2024-09-10 01:00:00'))"
}
}
}
}
}
}
},
{
"table": {
"table_name": "b",
"access_type": "ALL",
"possible_keys": [
"idx_time_1",
"idx_end",
"idx_start"
],
"rows_examined_per_scan": 2143170,
"rows_produced_per_join": 222607015,
"filtered": "11.11",
"range_checked_for_each_record": "index map: 0x70",
"cost_info": {
"read_cost": "3845.46",
"eval_cost": "22260701.56",
"prefix_cost": "200390348.15",
"data_read_per_join": "484G"
},
"used_columns": [
"session_id",
"session_start_time",
"session_end_time"
]
}
}
]
}
}
Here is also the EXPLAIN ANALYZE for the fastest query:
-> Nested loop inner join (cost=6.69e+6 rows=7.43e+6) (actual time=15.4..346 rows=20827 loops=1)
-> Table scan on a (cost=539..541 rows=31.2) (actual time=8.61..8.64 rows=61 loops=1)
-> Materialize CTE dates (cost=539..539 rows=31.2) (actual time=8.61..8.61 rows=61 loops=1)
-> Group (no aggregates) (cost=535 rows=31.2) (actual time=7.7..8.5 rows=61 loops=1)
-> Sort: date_time (cost=438 rows=973) (actual time=7.69..7.76 rows=973 loops=1)
-> Filter: ((cast(date_format(session_start_time,_utf8mb4'%Y-%c-%e %H:%i') as datetime) >= '2023-05-25 00:00:00') and (cast(date_format(session_start_time,_utf8mb4'%Y-%c-%e %H:%i') as datetime) <= '2023-05-25 01:00:00')) (cost=438 rows=973) (actual time=0.191..6.67 rows=973 loops=1)
-> Index range scan on session_id_table using idx_convert_time over ('2023-05-25 00:00:00' <= cast(date_format(`session_start_time`,_utf8mb4'%Y-%c-%e %H:%i') as datetime) <= '2023-05-25 01:00:00') (cost=438 rows=973) (actual time=0.189..6.44 rows=973 loops=1)
-> Filter: ((a.date_time >= b.session_start_time) and (a.date_time <= b.session_end_time)) (cost=28.4 rows=238082) (actual time=5.14..5.52 rows=341 loops=61)
-> Covering index range scan on b (re-planned for each iteration) (cost=28.4 rows=2.14e+6) (actual time=0.0733..4.28 rows=8709 loops=61)
And for the slowest query:
-> Nested loop inner join (cost=6.56e+6 rows=7.28e+6) (actual time=1375..76261 rows=21064 loops=1)
-> Table scan on a (cost=518..520 rows=30.6) (actual time=25.6..25.7 rows=61 loops=1)
-> Materialize CTE dates (cost=518..518 rows=30.6) (actual time=25.6..25.6 rows=61 loops=1)
-> Group (no aggregates) (cost=515 rows=30.6) (actual time=24.7..25.4 rows=61 loops=1)
-> Sort: date_time (cost=421 rows=935) (actual time=24.7..24.8 rows=935 loops=1)
-> Filter: ((cast(date_format(session_start_time,_utf8mb4'%Y-%c-%e %H:%i') as datetime) >= '2024-09-10 00:00:00') and (cast(date_format(session_start_time,_utf8mb4'%Y-%c-%e %H:%i') as datetime) <= '2024-09-10 01:00:00')) (cost=421 rows=935) (actual time=1.2..23 rows=935 loops=1)
-> Index range scan on session_id_table using idx_convert_time over ('2024-09-10 00:00:00' <= cast(date_format(`session_start_time`,_utf8mb4'%Y-%c-%e %H:%i') as datetime) <= '2024-09-10 01:00:00') (cost=421 rows=935) (actual time=0.477..22 rows=935 loops=1)
-> Filter: ((a.date_time >= b.session_start_time) and (a.date_time <= b.session_end_time)) (cost=29.6 rows=238082) (actual time=1249..1250 rows=345 loops=61)
-> Covering index range scan on b (re-planned for each iteration) (cost=29.6 rows=2.14e+6) (actual time=0.216..968 rows=1.89e+6 loops=61)
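The two EXPLAIN ANALYZE outputs differ mainly in the innermost step: the range scan on b reads about 8709 rows per loop in the fast case and about 1.89e+6 rows per loop in the slow case, out of roughly 2.14 million rows. That would be consistent with the range being bounded on one side only (for example session_start_time <= date_time), so the later the selected hour, the more of the table falls inside the range. The following is a minimal sketch of one way to bound the range on both sides. It rests on an assumption that is not stated in the post: that no session lasts longer than some known maximum. The 1-day limit is a hypothetical value.

```sql
-- Assumption (not from the post): no session lasts longer than 1 day.
-- With that bound, session_start_time is limited from both sides,
-- so a range on it no longer has to reach back to the start of the table.
with dates as (
  select distinct
         convert(date_format(session_start_time, '%Y-%c-%e %H:%i'), datetime) as date_time
  from session_id_table
)
select a.date_time, b.session_id
from dates as a
inner join session_id_table as b
   on b.session_start_time <= a.date_time
  and b.session_start_time >= a.date_time - interval 1 day  -- hypothetical bound
  and b.session_end_time   >= a.date_time
where a.date_time >= '2024-09-10 00:00:00'
  and a.date_time <= '2024-09-10 01:00:00';
```

If the assumed maximum duration is wrong, sessions longer than the bound would be silently dropped from the result, so the limit has to be checked against the real data first.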
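OK, I got an answer, although it leaves me baffled.
TL;DR: when the join is on multiple attributes, the indexes slow the select down.
I decided to check whether timestamps would be a faster solution, so I moved all datetimes and ts to a new table and added start_timestamp and end_timestamp to session_id_table. The CTE from the OP and the datetimes from the new table gave the same results. Then I ran the same query with timestamps and got consistent results for all 5 queries, with improved results for most of the table (not the edges), compared with the 100x or 200x differences in duration/fetch time I was getting with the CTE/datetime version. I then created indexes on the timestamps, and the results were comparable to those in the OP. So I dropped the indexes and got an overall improvement in fetch time.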
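The timestamp variant described above might look roughly like this. It is a sketch under several assumptions: the table name minute_table is hypothetical (the post only mentions the columns dt, ts, start_timestamp and end_timestamp), and the timestamps are assumed to be Unix seconds stored as BIGINT, which the post does not confirm.

```sql
-- Hypothetical table name; dt and ts are the column names mentioned in the post.
-- Assumption: timestamps are Unix seconds stored as BIGINT.
CREATE TABLE minute_table (
  dt datetime NOT NULL,
  ts bigint   NOT NULL
);

-- Fill it with the same distinct minute values the CTE in the question produces.
INSERT INTO minute_table (dt, ts)
SELECT d.date_time, UNIX_TIMESTAMP(d.date_time)
FROM (
  SELECT DISTINCT
         convert(date_format(session_start_time, '%Y-%c-%e %H:%i'), datetime) AS date_time
  FROM session_id_table
) AS d;

-- Add the numeric copies of the session boundaries (no indexes on them).
ALTER TABLE session_id_table
  ADD COLUMN start_timestamp bigint NULL,
  ADD COLUMN end_timestamp   bigint NULL;

UPDATE session_id_table
SET start_timestamp = UNIX_TIMESTAMP(session_start_time),
    end_timestamp   = UNIX_TIMESTAMP(session_end_time);

-- Same join as in the question, but comparing the numeric timestamps.
SELECT a.dt, b.session_id
FROM minute_table AS a
INNER JOIN session_id_table AS b
   ON a.ts >= b.start_timestamp
  AND a.ts <= b.end_timestamp
WHERE a.dt >= '2024-09-10 00:00:00'
  AND a.dt <= '2024-09-10 01:00:00';
```

UNIX_TIMESTAMP() interprets the datetime in the session time zone, so the conversion should be done with the same time zone setting for both tables.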
Just a disclaimer: I changed the fifth query from the OP from
where date_time>='2024-09-19 00:00:00' and date_time<='2024-09-19 01:00:00'
to
#where dt>='2024-09-20 12:00:00' and dt<='2024-09-20 13:00:00'
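Here is the table data
Here are the 5 queries with only the session_start_time index
Here are the 5 queries with separate session_start_time and session_end_time indexes
Here are the 5 queries with both session_start_time and session_end_time in one index
You can also see some improvement when joining on timestamps without indexes compared with datetimes without indexes
Here is more data, from selecting a whole day: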
#where date(dt)='2023-05-25'
where date(dt)='2024-09-10'