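I have a table that stores people's address details. Assume the column name is Address. What I want to do is find the list of addresses that are similar to an input address string (assume a matching address must score above a certain similarity threshold).
Database: Postgres (v16)
Row count: 10 million+
We tried the pg_trgm module, but we are not getting results quickly. When a large number of records match, the query takes a long time.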
Query:
select * from WHERE ARRAY['val'] <@ pentagram AND similarity(address,
Index: pentagram: GIN index
Note
Query plan execution:
explain(costs,buffers,verbose,analyze) select similarity('no 25 nethaji cly 9th cross st wst velachery velchry chennai tn 600042',address) from address where ARRAY['vel'] <@ pentagram;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.address (cost=2296.03..416822.75 rows=398205 width=4) (actual time=136.191..13181.323 rows=398415 loops=1)
   Output: similarity('no 25 nethaji cly 9th cross st wst velachery velchry chennai tn 600042'::text, complete_address)
   Recheck Cond: ('{vel}'::text[] <@ address.pentagram)
   Heap Blocks: exact=357144
   Buffers: shared hit=357254
   -> Bitmap Index Scan on pentgram_idx (cost=0.00..2196.48 rows=398205 width=0) (actual time=57.975..57.976 rows=398415 loops=1)
         Index Cond: (address.pentagram @> '{vel}'::text[])
         Buffers: shared hit=110
 Query Identifier: 7037675894901615674
 Planning:
   Buffers: shared hit=1
 Planning Time: 0.158 ms
 JIT:
   Functions: 4
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 0.370 ms, Inlining 0.000 ms, Optimization 0.573 ms, Emission 6.775 ms, Total 7.718 ms
 Execution Time: 13216.714 ms
(17 rows)
explain(costs,buffers,verbose,analyze) select similarity('no 25 nethaji cly 9th cross st wst velachery velchry chennai tn 600042', complete_address) from address where ARRAY['vel'] <@ pentagram;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.address (cost=2296.03..416822.75 rows=398205 width=4) (actual time=140.682..6224.179 rows=398415 loops=1)
   Output: similarity('no 25 nethaji cly 9th cross st wst velachery velchry chennai tn 600042'::text, complete_address)
   Recheck Cond: ('{vel}'::text[] <@ address.pentagram)
   Heap Blocks: exact=357144
   Buffers: shared hit=357254
   -> Bitmap Index Scan on pentgram_idx (cost=0.00..2196.48 rows=398205 width=0) (actual time=64.489..64.490 rows=398415 loops=1)
         Index Cond: (address.pentagram @> '{vel}'::text[])
         Buffers: shared hit=110
 Query Identifier: 7037675894901615674
 Planning:
   Buffers: shared hit=1
 Planning Time: 0.225 ms
 JIT:
   Functions: 4
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 0.629 ms, Inlining 0.000 ms, Optimization 0.525 ms, Emission 5.670 ms, Total 6.824 ms
 Execution Time: 6237.051 ms
(17 rows)
I need help finding all similar addresses in the shortest possible time. Any suggestions would be helpful.
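What you are doing is unusual. Normally you would just build the index directly on the text column, rather than explicitly storing the individual trigrams anywhere. show_trgm() is basically just a debugging tool; I've never used it in production code.
The query you show is probably equivalent to: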
select similarity('no 25 nethaji cly 9th cross st wst velachery velchry chennai tn 600042',address)
from address where address like '%vel%';
which should be able to use an index created with
create index on address using gin (address gin_trgm_ops);
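That said, I think the more common approach is to choose the threshold you want and then use the % operator, rather than picking a substring out of the query string.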
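As a rough sketch of that threshold-based approach, something like the following might work. It assumes the text column is called address (the plans in the question show complete_address, so adjust as needed); the index name address_trgm_idx and the threshold value 0.5 are arbitrary choices for illustration, not values taken from the question.

```sql
-- pg_trgm provides the % operator, similarity(), and gin_trgm_ops.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Trigram GIN index built directly on the text column
-- (index name is hypothetical).
CREATE INDEX IF NOT EXISTS address_trgm_idx
    ON address USING gin (address gin_trgm_ops);

-- Threshold used by the % operator for this session.
-- 0.5 is only an example; tune it for your data.
SET pg_trgm.similarity_threshold = 0.5;

-- % is true when similarity(address, input) exceeds the threshold,
-- and it can use the trigram index.
SELECT address,
       similarity(address, 'no 25 nethaji cly 9th cross st wst velachery velchry chennai tn 600042') AS sim
FROM address
WHERE address % 'no 25 nethaji cly 9th cross st wst velachery velchry chennai tn 600042'
ORDER BY sim DESC;
```

A higher threshold generally means fewer candidate rows have to be fetched from the heap and rechecked, which is where most of the time goes in the plans above. Whether this is fast enough on 10 million+ rows would need to be checked with EXPLAIN (ANALYZE, BUFFERS) on the real data.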