I'm trying to optimize some really ugly queries. Here is one that looks up a state's abbreviation, because we only work with abbreviations.
UPDATE [data_log]
SET [h_data] = COALESCE((SELECT TOP(1) [state_abbr] FROM [CityStateInfo] WHERE [state_long] = [h_data]), [h_data])
WHERE [field] = '[MailingState]'
AND LEN([h_data]) > 3
AND [h_data] IS NOT NULL
data_log is just a table where I track changes that need to be made or reviewed.
CREATE TABLE [data_log] (
[id] int identity(1,1),
[dataID] bigint,
[field] varchar(128),
[sf_data] varchar(500),
[h_data] varchar(500),
[score] float,
[action] varchar(128)
);
INSERT INTO [Data_log] VALUES
(3605013844, '[MailingCity]', 'Flat', 'Flatt', NULL, NULL),
(3605013844, '[MailingState]', 'KY', 'Kentucky', NULL, NULL),
(3605013844, '[MailingZIP]', '41301', '41301', NULL, NULL),
(1874281127, '[MailingCity]', 'EDMONTON', 'Edmonton', NULL, NULL),
(1874281127, '[MailingState]', 'AB', 'Alberta', NULL, NULL),
(1874281127, '[MailingZIP]', 'T6M 2K1', 'T6M 2K1', NULL, NULL),
(2077170855, '[MailingCity]', 'Van Buren Point', 'Van Buren Point', NULL, NULL),
(2077170855, '[MailingState]', 'NY', 'New York', NULL, NULL),
(2077170855, '[MailingZIP]', '14166', '14166', NULL, NULL),
(1874281127, '[MailingState]', 'PA', 'Ontario', NULL, NULL),
(1874281127, '[MailingState]', 'IL', 'Missouri', NULL, NULL)
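[CityStateInfo] holds a large amount of information about the United States, Canada, Mexico, and Europe. It has 3,764,649 rows. It contains every city/state/postal code combination, along with other information.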
CREATE TABLE [dbo].[CityStateInfo](
[City] [varchar](255) NULL,
[State_abbr] [varchar](10) NULL,
[State_long] [varchar](50) NULL,
[Zip] [varchar](20) NULL,
[County] [varchar](50) NULL,
[Country] [varchar](50) NULL,
[Longitude] [varchar](15) NULL,
[Latitude] [varchar](15) NULL,
[StateFIPS] [varchar](10) NULL,
[CountryFIPS] [varchar](10) NULL,
[TimeZone] [int] NULL,
[cleanCity] [varchar](255) NULL,
[Country_abbr] [varchar](10) NULL,
[foreignCity] [varchar](255) NULL,
[foreignState] [varchar](255) NULL
)
INSERT INTO [CityStateInfo] VALUES
('AARON','KY','Kentucky','42602','RUSSELL','United States','-85.121708','36.751734','21','207','6','AARON','US',NULL,NULL),
('AARON','KY','Kentucky','42602','CLINTON','United States','-85.121708','36.751734','21','053','6','AARON','US',NULL,NULL),
('ADRIAN','MO','Missouri','64720','BATES','United States','-94.398772','38.433513','29','013','6','ADRIAN','US',NULL,NULL),
('ADVANCE','MO','Missouri','63730','CAPE GIRARDEAU','United States','-89.911055','37.058424','29','031','6','ADVANCE','US',NULL,NULL),
('SHIRLEY','NY','New York','11967','SUFFOLK','United States','-72.880184','40.794219','36','103','5','SHIRLEY','US',NULL,NULL),
('SHOKAN','NY','New York','12481','ULSTER','United States','-74.214799','41.982148','36','111','5','SHOKAN','US',NULL,NULL),
('KANATA','ON','Ontario','K2M 0A8',NULL,'Canada',NULL,NULL,NULL,NULL,'5','KANATA','CA',NULL,NULL),
('KANATA','ON','Ontario','K2M 0A9',NULL,'Canada',NULL,NULL,NULL,NULL,'5','KANATA','CA',NULL,NULL),
('EDMONTON','AB','Alberta','T6H 0J5',NULL,'Canada',NULL,NULL,NULL,NULL,'7','EDMONTON','CA',NULL,NULL),
('EDMONTON','AB','Alberta','T6H 0J6',NULL,'Canada',NULL,NULL,NULL,NULL,'7','EDMONTON','CA',NULL,NULL)
I thought I could do something like this:
UPDATE [data_log]
SET [data_log].[h_data] = c.[state_abbr]
FROM [data_log] d
INNER JOIN [CityStateInfo] c
ON d.[h_data] = c.[state_long]
WHERE d.[field] = '[MailingState]'
AND LEN([h_data]) > 3
AND [h_data] IS NOT NULL
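However, when I run a SELECT with a similar setup, I get millions of rows back, because each state I'm looking up can have dozens or even thousands of rows in [CityStateInfo]. The UPDATE above does seem to give me the result I want, but I want to make sure I'm not touching millions of rows just to edit a few dozen, which would defeat the purpose of cleaning up the query.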
SELECT * FROM [data_log] d
INNER JOIN [CityStateInfo] c
ON d.[h_data] = c.[state_long]
WHERE d.[field] = '[MailingState]'
AND LEN([h_data]) > 3
AND [h_data] IS NOT NULL
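As a rough illustration of where the extra rows come from, the following sketch counts how many [CityStateInfo] rows share each state name. The column aliases rows_per_state and distinct_abbrs are illustrative names, not part of the original schema.

```sql
-- Each [data_log] row that matches a state name is repeated once per
-- [CityStateInfo] row carrying that name when a plain INNER JOIN is used.
SELECT c.[State_long],
       COUNT(*)                       AS rows_per_state,  -- join fan-out per state name
       COUNT(DISTINCT c.[State_abbr]) AS distinct_abbrs   -- more than 1 means an ambiguous mapping
FROM [CityStateInfo] c
GROUP BY c.[State_long]
ORDER BY rows_per_state DESC;
```

If distinct_abbrs is greater than 1 for any state name, a join-based UPDATE has more than one candidate value for that row, so it is worth checking before relying on the result.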
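So, how can I change the UPDATE query so that it matches only a single row for each edit, instead of spinning through more cycles than necessary?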
Since you only need state_long and state_abbr from CityStateInfo:
SELECT *
FROM [data_log] d
INNER JOIN (
    SELECT DISTINCT csi.[state_long],
                    csi.[state_abbr]
    FROM [CityStateInfo] csi
) c
    ON d.[h_data] = c.[state_long]
WHERE d.[field] = '[MailingState]'
  AND LEN(d.[h_data]) > 3
...produces a much smaller result set, so...
UPDATE d
SET d.[h_data] = c.[state_abbr]
FROM [data_log] d
INNER JOIN (
    SELECT DISTINCT csi.[state_long],
                    csi.[state_abbr]
    FROM [CityStateInfo] csi
) c
    ON d.[h_data] = c.[state_long]
WHERE d.[field] = '[MailingState]'
  AND LEN(d.[h_data]) > 3
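...will involve less data in the join. However, it also adds an extra processing step (the DISTINCT over CityStateInfo). You should measure the performance of both approaches to determine which one works best in your environment.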
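One way to compare the two approaches is sketched below, assuming SQL Server and a test copy of the data. Each UPDATE runs inside a transaction that is rolled back, so nothing is changed, and SET STATISTICS IO / SET STATISTICS TIME report logical reads and CPU/elapsed time in the Messages output. The labels "Variant A" and "Variant B" are just names used here.

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Variant A: join directly to CityStateInfo
BEGIN TRANSACTION;

UPDATE d
SET d.[h_data] = c.[State_abbr]
FROM [data_log] d
INNER JOIN [CityStateInfo] c
    ON d.[h_data] = c.[State_long]
WHERE d.[field] = '[MailingState]'
  AND LEN(d.[h_data]) > 3;

ROLLBACK TRANSACTION;  -- undo the change; only the measurements are wanted

-- Variant B: join to the de-duplicated state list
BEGIN TRANSACTION;

UPDATE d
SET d.[h_data] = c.[State_abbr]
FROM [data_log] d
INNER JOIN (
    SELECT DISTINCT csi.[State_long],
                    csi.[State_abbr]
    FROM [CityStateInfo] csi
) c
    ON d.[h_data] = c.[State_long]
WHERE d.[field] = '[MailingState]'
  AND LEN(d.[h_data]) > 3;

ROLLBACK TRANSACTION;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

Running each variant several times gives a fairer comparison, because the first run may also pay for reading pages into memory. Note that if a state name maps to more than one abbreviation, an UPDATE with a FROM clause has several candidate values for the same row, and which one is applied is not defined.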