It only loads some of the data, even though I want it to load everything in a single run (a.k.a. the 43 iterations of the loop); I have to run this script multiple times to get all the records. I have a hunch it is related to the FETCH NEXT, OFFSET, and/or OPTIMIZE FOR statements.
I need to load about 2 million records from a CSV file. The code below should do that, but it only loads part of the data in a full run, and I have to run the script several times to pick up all the records.
For brevity I have removed some other code here that tracks the migration history; it shows that each time this code runs it captures more records and adds them to the database. Honestly, I don't know why this happens. I have tried logging and investigating, but I am stuck. Below you can see the successive runs, with the number of records/rows in the CSV (as ChecksumMigration), the number in the table at startup (as ChecksumTableBefore), and the number in the table after the run added rows to the database (as ChecksumTableAfter).
AppliedDateTime | ChecksumMigration | ChecksumTableBefore | ChecksumTableAfter |
---|---|---|---|
2024-05-06 00:20:05 | 2036473 | 1986473 | 2036473 |
2024-05-06 00:06:27 | 2036473 | 1936473 | 1986473 |
2024-05-05 23:57:51 | 2036473 | 1786473 | 1936473 |
2024-05-05 23:54:07 | 2036473 | 1536473 | 1786473 |
2024-05-05 23:49:35 | 2036473 | 1036473 | 1536473 |
2024-05-05 23:42:20 | 2036473 | 0 | 1036473 |
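The per-run deltas in this table (1,036,473 then 500,000 then 250,000 then 150,000 then 50,000 then 50,000, reading bottom-up) can be reproduced exactly with a count-level simulation of the loop. This is a Python sketch of the arithmetic, not the actual T-SQL: because the duplicate filter runs before OFFSET/FETCH, every inserted page shrinks the filtered set while the offset keeps advancing, so each iteration also skips a page of never-inserted rows.

```python
CSV_ROWS = 2_036_473   # rows in food.csv (ChecksumMigration above)
PAGE = 50_000          # the 50,000-row step used by the loop

def rows_loaded_in_one_run(rows_missing: int, max_iters: int = 42) -> int:
    """Simulate one execution of the WHILE loop, tracking only counts.

    rows_missing = CSV rows not yet present in the food table. The
    WHERE NOT EXISTS filter removes already-inserted rows *before*
    OFFSET/FETCH is applied, so inserted pages shrink the filtered
    set while the offset still grows by 50,000 each iteration.
    """
    loaded = 0
    offset = 0
    for _ in range(max_iters):
        if offset >= rows_missing:
            break                      # OFFSET is past the end: empty page
        page = min(PAGE, rows_missing - offset)
        loaded += page
        rows_missing -= page           # inserted rows leave the filtered set
        offset += PAGE                 # but the offset marches on regardless
    return loaded

# Replay the migration-history table by running the script repeatedly:
missing, per_run = CSV_ROWS, []
while missing:
    n = rows_loaded_in_one_run(missing)
    per_run.append(n)
    missing -= n

print(per_run)  # [1036473, 500000, 250000, 150000, 50000, 50000]
```

Each simulated run loads roughly half of the rows still missing, which matches the table row for row.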
USE FoodData_Central;
DECLARE @beforeChecksum INT = 0;
DECLARE @afterChecksum INT = 0;
DECLARE @migrationName NVARCHAR(40) = N'2024 April Full from 2021 - ';
DECLARE @pathToInputFolder NVARCHAR(40) = N'C:\FoodData_Central_csv_2024-04-18\';
DECLARE @tableName NVARCHAR(40);
DECLARE @startTime DATETIME2 = GETDATE();
BEGIN TRY
SET @tableName = 'food';
-- CHECKSUM
DECLARE @SQL NVARCHAR(MAX) = 'SELECT @ResultVariable = count(*) FROM ' + @tableName;
EXEC sp_executesql @SQL, N'@ResultVariable INT OUTPUT', @ResultVariable = @beforeChecksum output;
-- TRUNCATE TABLE food; --only truncated the first time and then use fetch next to get through the data, solving problem data along the way
-- Mapping
DROP TABLE IF EXISTS #tmp;
CREATE TABLE #tmp(
fdc_id NVARCHAR(MAX) NOT NULL,
data_type NVARCHAR(MAX) NULL,
description NVARCHAR(MAX) NULL,
food_category_id NVARCHAR(MAX) NULL,
publication_date NVARCHAR(MAX) NULL
);
BULK INSERT #tmp
FROM 'C:\FoodData_Central_csv_2024-04-18\food.csv' -- update file name (!sometimes the file names are different <crosses eyes>)
WITH
(
CODEPAGE = '65001'
,FIRSTROW = 2
,FIELDTERMINATOR = '\",\"'
,ROWTERMINATOR = '0x0A' -- rows end with a bare line feed
,BATCHSIZE = 500000
,TABLOCK
);
DECLARE @i INT = 1;
DECLARE @offsetCount INT = 1;
DECLARE @nextCount INT = 50000;
WHILE @i < 43
BEGIN
SET @i = @i + 1;
-- DDL
insert into food(fdc_id, data_type, description, food_category_id, publication_date) -- UPDATE file name, and columns
select
CAST(REPLACE(t.fdc_id,'"','') AS INT) AS fdc_id
, t.data_type
, t.description
, CAST(t.food_category_id AS SMALLINT) AS food_category_id
, CAST(REPLACE(REPLACE(t.publication_date, '"', ''), CHAR(13), '') AS DATETIME2) AS publication_date
from #tmp t
WHERE NOT EXISTS ( -- skip duplicates
SELECT 1 FROM food AS d --UPDATE
WHERE d.fdc_id = CAST(REPLACE(t.fdc_id,'"','') AS INT)
)
ORDER BY fdc_id DESC
OFFSET @offsetCount - 1 ROWS
FETCH NEXT @nextCount - @offsetCount + 1 ROWS ONLY
OPTION ( OPTIMIZE FOR (@offsetCount = 1, @nextCount = 2036474) );
SET @offsetCount = @offsetCount + 50000;
SET @nextCount = @nextCount + 50000;
END
-- CLEANUP
DROP TABLE IF EXISTS #tmp;
END TRY
BEGIN CATCH
PRINT 'Error Number: ' + CAST(ERROR_NUMBER() AS NVARCHAR(10));
PRINT 'Error Message: ' + ERROR_MESSAGE();
-- CLEANUP
DROP TABLE IF EXISTS #tmp;
END CATCH;
GO
OK... this one is a bit strange, but I believe OFFSET/FETCH is applied after the WHERE clause, which throws the offsets off. For example:
Round 1: insert 50,000 rows (the WHERE clause filters nothing out), with an offset of 0 and a nextCount of 50,000.
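The rounds after that can be made concrete with a small Python stand-in for the paged query (hypothetical row ids, ascending rather than the query's DESC order; the arithmetic is the same either way): the WHERE NOT EXISTS filter removes the rows inserted in earlier rounds before OFFSET is applied, so every later round skips a page of rows that were never inserted at all.

```python
PAGE = 50_000
rows = list(range(1, 400_001))   # stand-in for #tmp, in query order
inserted: set[int] = set()       # rows already in the food table

pages = []
for i in range(3):
    # WHERE NOT EXISTS: drop rows inserted in earlier rounds...
    filtered = [r for r in rows if r not in inserted]
    # ...and only then apply OFFSET/FETCH to the shrunken set
    page = filtered[i * PAGE:(i + 1) * PAGE]
    inserted.update(page)
    pages.append((page[0], page[-1]))
    print(f"round {i + 1}: inserted rows {page[0]}..{page[-1]}")

# round 1: rows 1..50000        (offset 0, nothing filtered yet)
# round 2: rows 100001..150000  (rows 50001..100000 skipped, never inserted)
# round 3: rows 200001..250000  (rows 150001..200000 skipped as well)
```

Paging over the unfiltered #tmp instead (or deleting rows from #tmp as they are inserted, so the offset can stay at 0) keeps the window aligned with the remaining data, and the 42 iterations of 50,000 rows then cover all 2,036,473 rows in a single run.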