How do I find almost-similar records in SQL?

Question (votes: 1, answers: 2)

Here is the search record:

A = {
    field1: value1,
    field2: value2,
    ...
    fieldN: valueN
}

I have many records like this in my database.

Another record (B) almost matches record A if at least N-M of their fields are equal. Here is an example with M = 2:

B = {
    field1: OTHER_value1,
    field2: OTHER_value2,
    field3: value3,
    ...
    fieldN: valueN
}

The differing fields can be any of the fields, not just the first ones.

I could write one very large combined SQL query, but there may be a more elegant solution.

P.S.: My database is PostgreSQL.

sql algorithm postgresql similarity
2 Answers
3 votes

A search condition like this won't be able to use any indexes, but it can be done...

SELECT
  *
FROM
  yourTable
WHERE
  N-M <= CASE WHEN yourTable.field1 = searchValue1 THEN 1 ELSE 0 END
       + CASE WHEN yourTable.field2 = searchValue2 THEN 1 ELSE 0 END
       + CASE WHEN yourTable.field3 = searchValue3 THEN 1 ELSE 0 END
       ...
       + CASE WHEN yourTable.fieldN = searchValueN THEN 1 ELSE 0 END

Similarly, if your search criteria are in another table...

SELECT
  *
FROM
  yourTable
INNER JOIN
  search
    ON N-M <= CASE WHEN yourTable.field1 = search.field1 THEN 1 ELSE 0 END
            + CASE WHEN yourTable.field2 = search.field2 THEN 1 ELSE 0 END
            + CASE WHEN yourTable.field3 = search.field3 THEN 1 ELSE 0 END
            ...
            + CASE WHEN yourTable.fieldN = search.fieldN THEN 1 ELSE 0 END

(You need to fill in the value of N-M yourself.)
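As a concrete illustration of the counting condition above, here is a minimal sketch with N = 4 and M = 2. It is an assumption-laden stand-in, not the answer's own code: it runs the same `CASE WHEN` sum through Python's built-in `sqlite3` module instead of PostgreSQL, and the sample rows and search values are made up for the example.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yourTable ("
    "id INTEGER PRIMARY KEY, field1 TEXT, field2 TEXT, field3 TEXT, field4 TEXT)"
)
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?, ?, ?, ?)",
    [
        (1, "a", "b", "c", "d"),  # 4 of 4 fields match the search record
        (2, "x", "y", "c", "d"),  # 2 of 4 match
        (3, "x", "y", "z", "d"),  # 1 of 4 matches
        (4, "a", "y", "c", "w"),  # 2 of 4 match (not the leading fields)
    ],
)

N, M = 4, 2                      # hypothetical values; N-M = 2 fields must match
search = ("a", "b", "c", "d")    # hypothetical search record

query = """
SELECT id
FROM yourTable
WHERE ? <= CASE WHEN field1 = ? THEN 1 ELSE 0 END
         + CASE WHEN field2 = ? THEN 1 ELSE 0 END
         + CASE WHEN field3 = ? THEN 1 ELSE 0 END
         + CASE WHEN field4 = ? THEN 1 ELSE 0 END
ORDER BY id
"""
near_matches = [row[0] for row in conn.execute(query, (N - M, *search))]
print(near_matches)
```

Row 3 is excluded because only one of its fields matches, which is below the N-M threshold of 2.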

EDIT:

A more long-winded approach that can make use of indexes...

SELECT
    id,  -- your table would need to have a primary key / identity column
    MAX(field1)   AS field1,
    MAX(field2)   AS field2,
    MAX(field3)   AS field3,
    ...
    MAX(fieldN)   AS fieldN
FROM
(
    SELECT * FROM yourTable WHERE field1 = searchValue1
    UNION ALL
    SELECT * FROM yourTable WHERE field2 = searchValue2
    UNION ALL
    SELECT * FROM yourTable WHERE field3 = searchValue3
    UNION ALL
    ...
    UNION ALL
    SELECT * FROM yourTable WHERE fieldN = searchValueN
)
    AS unioned_seeks
GROUP BY
    id
HAVING
    COUNT(*) >= N-M

If every field has an index and you expect relatively few matches per field, this may outperform the first option, at the cost of very repetitive code.
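The UNION ALL idea can be tried out end to end with a small sketch. The assumptions here are mine, not the answer's: SQLite (via Python's `sqlite3`) stands in for PostgreSQL, there are only N = 3 fields, M = 1, and the inner queries select just `id` rather than `*` to keep the example short. Whether the per-field indexes are actually used depends on the planner and the data.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yourTable ("
    "id INTEGER PRIMARY KEY, field1 TEXT, field2 TEXT, field3 TEXT)"
)
# One index per field, which is what this approach relies on.
for col in ("field1", "field2", "field3"):
    conn.execute(f"CREATE INDEX idx_{col} ON yourTable ({col})")

conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?, ?, ?)",
    [
        (1, "a", "b", "c"),  # matches on 3 fields
        (2, "a", "b", "z"),  # matches on 2 fields
        (3, "a", "y", "z"),  # matches on 1 field
        (4, "x", "y", "z"),  # matches on 0 fields, never appears in the union
    ],
)

N, M = 3, 1                # hypothetical values; at least N-M = 2 fields must match
search = ("a", "b", "c")   # hypothetical search record

query = """
SELECT id, COUNT(*) AS matched
FROM (
    SELECT id FROM yourTable WHERE field1 = ?
    UNION ALL
    SELECT id FROM yourTable WHERE field2 = ?
    UNION ALL
    SELECT id FROM yourTable WHERE field3 = ?
) AS unioned_seeks
GROUP BY id
HAVING COUNT(*) >= ?
ORDER BY id
"""
# Each row shows up once per matching field, so COUNT(*) is the match count.
matches = conn.execute(query, (*search, N - M)).fetchall()
print(matches)
```

The primary key matters: grouping by `id` is what turns "one copy of the row per matching field" into a per-row match count.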


3 votes

I would use is not distinct from to handle NULL values.

You can also use Postgres shorthand to simplify the logic. One approach is:

where ( (a.field1 is not distinct from b.field1)::int +
        (a.field2 is not distinct from b.field2)::int +
        . . .
        (a.fieldn is not distinct from b.fieldn)::int
      ) >= N - M

I think this is easier to express in terms of M. So, just count the fields that differ:

where ( (a.field1 is distinct from b.field1)::int +
        (a.field2 is distinct from b.field2)::int +
        . . .
        (a.fieldn is distinct from b.fieldn)::int
      ) <= M

Doing this across the data requires a cross join, which is very expensive.
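To see why NULL-safe comparison matters for the mismatch count, here is a rough sketch under explicit assumptions: it uses Python's `sqlite3` rather than PostgreSQL, so SQLite's `IS NOT` operator stands in for `is distinct from`, and no `::int` cast is needed because SQLite comparisons already yield 0 or 1. The table `t` and its rows are invented for the example.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")

# Plain '=' yields NULL when comparing NULLs; the NULL-safe operator yields true (1).
eq_vs_is = conn.execute("SELECT NULL = NULL, NULL IS NULL").fetchone()

conn.execute(
    "CREATE TABLE t (id INTEGER PRIMARY KEY, field1 TEXT, field2 TEXT, field3 TEXT)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        (1, "a", None, "c"),
        (2, "a", None, "z"),  # differs from row 1 in one field; the NULLs count as equal
        (3, "x", "b", "z"),   # differs from row 1 in three fields, from row 2 in two
    ],
)

M = 1  # hypothetical threshold: at most one differing field

query = """
SELECT a.id, b.id
FROM t AS a
CROSS JOIN t AS b
WHERE a.id < b.id                      -- each unordered pair once, no self-pairs
  AND ( (a.field1 IS NOT b.field1)     -- SQLite spelling of a NULL-safe "differs"
      + (a.field2 IS NOT b.field2)
      + (a.field3 IS NOT b.field3)
      ) <= ?
ORDER BY a.id, b.id
"""
pairs = conn.execute(query, (M,)).fetchall()
print(eq_vs_is, pairs)
```

The cross join compares every row with every other row, so the work grows quadratically with the table size, which is the cost the answer warns about.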
