How can I generate the possible permutations of a given string in T-SQL when specific characters require multiple REPLACE() calls?


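I have a column containing license plates. Some of this plate data comes from multiple sources and is not fully accurate. I have to assume that a zero may have been misread as the letter "O" and vice versa. This includes several different outcomes within the same plate (OCR reads 00OABC when the actual plate is 0O0ABC). To handle this, I need to build a WHERE clause that uses every possible replacement value for each instance of zero and the letter "O". How can this be done in T-SQL? Examples: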

Plate, (where clause needed)

ABC012, (ABC012, ABCO12)
0OABC, (00ABC, O0ABC, 0OABC, OOABC)
000XYZ, (000XYZ,00OXYZ,0OOXYZ,0O0XYZ,O00XYZ,O0OXYZ,OOOXYZ,OO0XYZ)

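Edit: I need to build a WHERE ... IN clause because I can't do the replacement in the SELECT statement (e.g. select * from plates where replace(plate,'O','0') = '000XYZ'). I can't do the replacement because there is too much data. If this were a small table, or if this were Hadoop, it would be fine. What I have to work with is an MSSQL database with a 20-billion-row table.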

t-sql
1 Answer

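The solution to this problem revolves around defining character equivalence.

Just as a case-insensitive comparison treats uppercase letters as equivalent to the corresponding lowercase letters, and an accent-insensitive comparison treats "à", "á", "â", etc. as "a", we need to define character equivalences for similar-looking characters.

Once that is done, we can define functions that compare similar-looking plate numbers and that generate the set of similar-looking plates.

For comparisons, I believe most case-insensitive and accent-insensitive comparisons map equivalent characters to a preferred form (such as all lowercase) before comparing. We can do the same by mapping every alternate character to its equivalent base, or preferred, form before comparing.

To generate a set of similar plates, we can use a recursive CTE (common table expression) that steps through the plate number one character at a time, mapping each character to all of its defined equivalents (including itself). Each character is mapped independently, so the final result includes every combination (cross product) of the per-character mappings.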
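As a rough, self-contained illustration of that cross-product idea, here is a minimal sketch that handles only the 0/O pair from the question. The names Alt, Walk, SrcChar, and @Plate are made up for this sketch and are not part of the sample code in this answer. Two recursive members joined with inner joins are used because SQL Server does not allow an outer join in the recursive part of a CTE.

```sql
-- Hypothetical sketch: only the 0/O look-alike pair is mapped here.
DECLARE @Plate VARCHAR(10) = '0OABC';

WITH Alt AS (
    -- Each row reads "SrcChar may also be read as AltChar".
    SELECT * FROM (VALUES ('0', 'O'), ('O', '0')) V(SrcChar, AltChar)
),
Walk AS (
    -- Anchor: nothing consumed yet.
    SELECT CAST('' AS VARCHAR(10)) AS Result,
           CAST(@Plate AS VARCHAR(10)) AS Remaining
    UNION ALL
    -- Branch 1: keep the next character as it is.
    SELECT CAST(W.Result + LEFT(W.Remaining, 1) AS VARCHAR(10)),
           CAST(STUFF(W.Remaining, 1, 1, '') AS VARCHAR(10))
    FROM Walk W
    WHERE LEN(W.Remaining) > 0
    UNION ALL
    -- Branch 2: swap the next character for its look-alike, if it has one.
    SELECT CAST(W.Result + A.AltChar AS VARCHAR(10)),
           CAST(STUFF(W.Remaining, 1, 1, '') AS VARCHAR(10))
    FROM Walk W
    JOIN Alt A ON A.SrcChar = LEFT(W.Remaining, 1)
    WHERE LEN(W.Remaining) > 0
)
-- Only rows where every character has been consumed are complete plates.
SELECT Result
FROM Walk
WHERE LEN(Remaining) = 0
ORDER BY Result;
```

For '0OABC' this should return four rows (00ABC, 0OABC, O0ABC, OOABC), since two characters each have two possible readings.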

I put together some sample code, shown below:

-- This table defines character mappings for similar-looking characters.
-- The BaseChar value is considered the primary for each equivalence set,
-- while the AltChar values are the alternate characters that should be
-- considered equivalent to the base character and to any alternates sharing
-- the same base character.
CREATE TABLE CharacterMap (
    BaseChar CHAR(1),  -- Preferred (canonical) character of the equivalence set
    AltChar CHAR(1)    -- Look-alike character as read from the reader or database
);
INSERT CharacterMap
VALUES
    ('O', '0'),
    ('I', '1'),
    ('S', '5'),
    ('O', 'Q');

-- Also map any defined base characters to themselves
-- to support later processing logic.
INSERT CharacterMap
SELECT DISTINCT BaseChar, BaseChar
FROM CharacterMap;

To support direct comparison of two possibly similar plate numbers, we need to define the TranslateFrom and TranslateTo strings that can be used with the TRANSLATE() function.

For the mappings above, these evaluate to:

TranslateFrom = '015Q'
TranslateTo = 'OISO'
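The effect of these two strings can be checked directly with literal values before any tables or functions exist. This is just a quick sanity check, assuming SQL Server 2017 or later (where TRANSLATE() is available); the column alias CompareString is chosen here for illustration.

```sql
-- Literal strings copied from the two values quoted above.
-- Every alternate character is replaced by its base character.
SELECT TRANSLATE('PA55W0RD', '015Q', 'OISO') AS CompareString;  -- PASSWORD
SELECT TRANSLATE('ABC-015',  '015Q', 'OISO') AS CompareString;  -- ABC-OIS
```

Characters that do not appear in the first string, such as the letters and the hyphen, pass through unchanged.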

-- This single-row table is used to define translations for comparing any two
-- plate numbers. All alternate characters will be mapped to their associated
-- base characters.
CREATE TABLE TranslateStrings (
    TranslateFrom VARCHAR(1000),
    TranslateTo VARCHAR(1000)
)
INSERT TranslateStrings
SELECT
    STRING_AGG(AltChar, '') WITHIN GROUP(ORDER BY AltChar) AS TranslateFrom,
    STRING_AGG(BaseChar, '') WITHIN GROUP(ORDER BY AltChar) AS TranslateTo
FROM CharacterMap
WHERE AltChar <> BaseChar

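Functions for (1) mapping a plate number to its comparison form, (2) comparing plate numbers, and (3) generating plate permutations: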

-- Each CREATE FUNCTION must be the first statement in its batch;
-- GO is the batch separator used by SSMS and sqlcmd.
GO
-- Map a plate number to a string suitable for fuzzy comparisons
CREATE FUNCTION PlateCompareString(@PlateNumber VARCHAR(10))
RETURNS VARCHAR(10)
AS
BEGIN
    RETURN (
        SELECT TRANSLATE(@PlateNumber, T.TranslateFrom, T.TranslateTo)
        FROM TranslateStrings T
    )
END
GO
-- Perform a fuzzy comparison of plate numbers
CREATE FUNCTION PlateCompare(@Left VARCHAR(10), @Right VARCHAR(10))
RETURNS BIT
AS
BEGIN
    RETURN CASE WHEN dbo.PlateCompareString(@Left) = dbo.PlateCompareString(@Right)
           THEN 1 ELSE 0 END
END
GO
-- Generate a set of similar plates for a given plate number.
CREATE FUNCTION PlatePermutations(@PlateNumber VARCHAR(10))
RETURNS TABLE
AS
RETURN (
    WITH CTE_Permute AS (
        -- Anchor - Seed with empty result.
        SELECT
            CAST('' AS VARCHAR(10)) AS Result,
            @PlateNumber AS Remaining
        UNION ALL
        -- Map next character to itself.
        SELECT
            CAST(P.Result + LEFT(P.Remaining, 1) AS VARCHAR(10)) AS Result,
            STUFF(P.Remaining, 1, 1, '') AS Remaining
        FROM CTE_Permute P
        WHERE LEN(P.Remaining) > 0
        UNION ALL
        -- Also map next character to all characters sharing the same base character
        -- except itself.
        SELECT
            CAST(P.Result + M2.AltChar AS VARCHAR(10)) AS Result,
            STUFF(P.Remaining, 1, 1, '') AS Remaining
        FROM CTE_Permute P
        JOIN CharacterMap M1 ON M1.AltChar = LEFT(P.Remaining, 1)
        JOIN CharacterMap M2 ON M2.BaseChar = M1.BaseChar
        WHERE LEN(P.Remaining) > 0
        AND M2.AltChar <> M1.AltChar -- Exclude self (already included above)
    )
    SELECT P.Result AS PlatePermutation
    FROM CTE_Permute P
    WHERE LEN(P.Remaining) = 0  -- Only include the completed mappings
)
GO
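To tie this back to the question, the permutation function can feed the IN list so that the large table is probed by exact values instead of having a function applied to every row. The following is only a sketch: dbo.Plates, its Plate column, and the assumption that Plate is indexed are hypothetical and do not come from the question or the sample code.

```sql
-- Hypothetical names: dbo.Plates and its (assumed indexed) Plate column.
DECLARE @Search VARCHAR(10) = '000XYZ';

SELECT P.*
FROM dbo.Plates P
WHERE P.Plate IN (
    -- Expand the search value into every look-alike spelling first,
    -- so the big table is matched on plain equality.
    SELECT PERM.PlatePermutation
    FROM dbo.PlatePermutations(@Search) PERM
);
```

Note that with the sample character map, 0 is also equivalent to Q through the shared base character O, so '000XYZ' expands to 3 x 3 x 3 = 27 candidates rather than the 8 listed in the question. Removing the ('O', 'Q') row from the map would bring it back to 8.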

Test code:

-- Plate comparison test
WITH CTE_TestData AS (
    SELECT *
    FROM (
        VALUES
            ('ABC-XYZ'),
            ('ABC-OIS'), ('ABC-O15'), ('ABC-015'),
            ('PASSWORD'), ('PA55W0RD')
    ) V(PlateNumber)
)
SELECT
    P1.PlateNumber AS PlateNumber1,
    P2.PlateNumber AS PlateNumber2,
    dbo.PlateCompareString(P1.PlateNumber) AS CompareString1,
    dbo.PlateCompareString(P2.PlateNumber) AS CompareString2,
    CASE WHEN dbo.PlateCompare(P1.PlateNumber, P2.PlateNumber) = 1
         THEN 'Match' ELSE '' END AS Match 
FROM CTE_TestData P1
JOIN CTE_TestData P2
    ON P2.PlateNumber > P1.PlateNumber
ORDER BY P1.PlateNumber, P2.PlateNumber

Results:

PlateNumber1  PlateNumber2  CompareString1  CompareString2  Match
ABC-015       ABC-O15       ABC-OIS         ABC-OIS         Match
ABC-015       ABC-OIS       ABC-OIS         ABC-OIS         Match
ABC-015       ABC-XYZ       ABC-OIS         ABC-XYZ
ABC-015       PA55W0RD      ABC-OIS         PASSWORD
ABC-015       PASSWORD      ABC-OIS         PASSWORD
ABC-O15       ABC-OIS       ABC-OIS         ABC-OIS         Match
...           ...           ...             ...             ...
PA55W0RD      PASSWORD      PASSWORD        PASSWORD        Match

Plate permutation test:

-- Permutation demo
SELECT P.PlateNumber, PERM.PlatePermutation,
    COUNT(*) OVER(PARTITION BY P.PlateNumber) AS PermutationCount
FROM (
    VALUES
        ('A-ZZZ'),  -- just this one
        ('B-1'),    -- 2 permutations
        ('C-S'),    -- 2 permutations
        ('D-Q'),    -- 3 permutations
        ('E-15'),   -- 2 x 2 = 4 permutations
        ('F-ISQ'),  -- 2 x 2 x 3 = 12 permutations
        ('G-OQ0'),  -- 3 x 3 x 3 = 27 permutations
        ('H-O0Q15') -- 3 x 3 x 3 x 2 x 2 = 108 permutations
) P(PlateNumber)
CROSS APPLY dbo.PlatePermutations(P.PlateNumber) PERM
ORDER BY P.PlateNumber, PERM.PlatePermutation

Partial results:

PlateNumber  PlatePermutation  PermutationCount
A-ZZZ        A-ZZZ             1
B-1          B-1               2
B-1          B-I               2
C-S          C-5               2
C-S          C-S               2
D-Q          D-0               3
D-Q          D-O               3
D-Q          D-Q               3
E-15         E-15              4
E-15         E-1S              4
E-15         E-I5              4
E-15         E-IS              4
...          ...               ...

See this db<>fiddle for a demo.

This may not be the best or final answer, but it should get you pretty close.
