Postgres query to find similar strings

Question (votes: 0, answers: 1)

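I have a table that stores a person's address details. Assume the column name is address. What I want to do is find the list of addresses that are similar to an input address string (assume a matching address must exceed a specific similarity threshold).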

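Database: Postgres (v16)

Number of rows: 10 million+

We tried the trgm module (pg_trgm), but we are not getting results quickly. When a large number of records match, the query takes a long time.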

Query: select * from <table> where ARRAY['val'] <@ pentagram and similarity(address, <input address>) > <threshold>

Index: a GIN index on the pentagram column

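Notes:

  • pentagram: a column holding the length-3 tokens (trigrams) of the address, generated using show_trgm().
  • In ARRAY['val'], we put one length-3 token generated from the input string.
  • We run the above query once for each token generated from the input string.
  • We then collate the addresses returned by the per-token queries.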

Query plans from two executions:

    explain(costs,buffers,verbose,analyze) select similarity('no 25 nethaji cly 9th cross st wst velachery velchry chennai tn 600042',address) from address where ARRAY['vel'] <@ pentagram;
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.address  (cost=2296.03..416822.75 rows=398205 width=4) (actual time=136.191..13181.323 rows=398415 loops=1)
   Output: similarity('no 25 nethaji cly 9th cross st wst velachery velchry chennai tn 600042'::text, complete_address)
   Recheck Cond: ('{vel}'::text[] <@ address.pentagram)
   Heap Blocks: exact=357144
   Buffers: shared hit=357254
   ->  Bitmap Index Scan on pentgram_idx  (cost=0.00..2196.48 rows=398205 width=0) (actual time=57.975..57.976 rows=398415 loops=1)
         Index Cond: (address.pentagram @> '{vel}'::text[])
         Buffers: shared hit=110
Query Identifier: 7037675894901615674
Planning:
   Buffers: shared hit=1
Planning Time: 0.158 ms
JIT:
   Functions: 4
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 0.370 ms, Inlining 0.000 ms, Optimization 0.573 ms, Emission 6.775 ms, Total 7.718 ms
Execution Time: 13216.714 ms
(17 rows)
 


explain(costs,buffers,verbose,analyze) select similarity('no 25 nethaji cly 9th cross st wst velachery velchry chennai tn 600042', complete_address) from address where ARRAY['vel'] <@ pentagram;
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.address  (cost=2296.03..416822.75 rows=398205 width=4) (actual time=140.682..6224.179 rows=398415 loops=1)
   Output: similarity('no 25 nethaji cly 9th cross st wst velachery velchry chennai tn 600042'::text, complete_address)
   Recheck Cond: ('{vel}'::text[] <@ address.pentagram)
   Heap Blocks: exact=357144
   Buffers: shared hit=357254
   ->  Bitmap Index Scan on pentgram_idx  (cost=0.00..2196.48 rows=398205 width=0) (actual time=64.489..64.490 rows=398415 loops=1)
         Index Cond: (address.pentagram @> '{vel}'::text[])
         Buffers: shared hit=110
Query Identifier: 7037675894901615674
Planning:
   Buffers: shared hit=1
Planning Time: 0.225 ms
JIT:
   Functions: 4
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 0.629 ms, Inlining 0.000 ms, Optimization 0.525 ms, Emission 5.670 ms, Total 6.824 ms
Execution Time: 6237.051 ms
(17 rows)

I need help finding all similar addresses in the shortest possible time. Any suggestions would be helpful.

sql database postgresql database-design data-engineering
1 Answer

0 votes

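What you are doing is unusual. Normally you would just build the index directly on the text column rather than explicitly storing the individual trigrams anywhere. show_trgm() is basically just a debugging tool; I have never used it in production code.

The query you show is probably equivalent to: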

select similarity('no 25 nethaji cly 9th cross st wst velachery velchry chennai tn 600042',address) 
  from address where address like '%vel%';

which should be able to use an index created like this:

create index on address using gin (address gin_trgm_ops);

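That said, I think the more common approach is to pick the desired threshold and then use the % operator, rather than picking a substring out of the query string.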
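As a rough sketch of the threshold-plus-% approach, assuming the table and text column are both named address (as in the query above; the plans suggest the column may really be complete_address). The index name address_trgm_idx, the threshold 0.4, and the LIMIT are illustrative choices, not values from the question:

-- pg_trgm provides similarity(), the % operator and gin_trgm_ops
create extension if not exists pg_trgm;

-- trigram index directly on the text column (hypothetical index name)
create index if not exists address_trgm_idx
    on address using gin (address gin_trgm_ops);

-- % is true when similarity() exceeds this setting (default 0.3);
-- 0.4 is only an example value
set pg_trgm.similarity_threshold = 0.4;

-- one query for the whole input string, instead of one query per trigram
select address,
       similarity(address, 'no 25 nethaji cly 9th cross st wst velachery velchry chennai tn 600042') as sim
  from address
 where address % 'no 25 nethaji cly 9th cross st wst velachery velchry chennai tn 600042'
 order by sim desc
 limit 50;

A higher threshold generally lets the index discard more candidate rows before the heap is visited, so it is worth checking the plan with explain (analyze, buffers) at the threshold you actually need.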
