How to improve bulk inserts into a Postgres DB

Question

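I currently have a C# service that uses Dapper to call a stored procedure which does 2 things: if the customer exists, it fetches the customer GUID and adds it to the CustomerInformations table; if the customer does not exist, it inserts the customer, then returns the GUID and adds it to the CustomerInformations table.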

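Inserts used to run at about 1.75 million records per hour. Now they barely reach 200,000 records per hour. My CustomerInformations table holds about 75 million records, and I am looking to resolve the bottleneck.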

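The service calls the stored procedure once for each customer property. Each call can result in 2 inserts into the database: first the customer is added to the Customers table, then the property is added to the CustomerInformations table. I know this may not be the ideal way to store the data, but it is not something I can change.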

C# service:

foreach (var info in request.Data)
{
    string sql = "add_one_by_customer";
    object parameters = new
    {
        p_customer_first_name = info.FirstName,
        p_customer_last_name = info.LastName,
        p_customer_property_name = info.PropertyName,
        p_customer_property_value = info.PropertyValue
    };

    try
    {
        await db.ExecuteAsync(sql, parameters, transaction: transaction, commandType: CommandType.StoredProcedure);
    }
    catch (Exception e)
    {
        // Keep the original exception as InnerException so the cause is not lost.
        throw new Exception("Failed to insert", e);
    }
}

Postgres stored procedure:

CREATE OR REPLACE PROCEDURE add_one_by_customer(
    p_customer_first_name  VARCHAR,
    p_customer_last_name  VARCHAR,
    p_customer_property_name  VARCHAR,
    p_customer_property_value  VARCHAR
    )
    LANGUAGE plpgsql
AS $procedure$
DECLARE p_customer_id uuid;
        p_current_item_value varchar;
begin
    -- Look up the customer by name.
    SELECT customer_id
    INTO p_customer_id
    FROM customers
    WHERE customer_first_name = p_customer_first_name AND
          customer_last_name = p_customer_last_name
    limit 1;

    IF (p_customer_id IS NULL) THEN
        begin
            INSERT INTO customers(customer_first_name, customer_last_name)
            VALUES (p_customer_first_name, p_customer_last_name) RETURNING customer_id into p_customer_id;
        EXCEPTION WHEN unique_violation THEN
            -- A parallel call inserted the same customer first: fetch its id instead.
            p_customer_id := (SELECT customer_id
                              FROM  customers
                              WHERE customer_first_name = p_customer_first_name AND
                                    customer_last_name = p_customer_last_name);
        END;
    end if;

    -- Current value of this property for the customer, if any.
    p_current_item_value := (select customer_property_value
                             from customer_informations
                             where customer_id = p_customer_id AND
                                   customer_property_name = p_customer_property_name);

    if (p_current_item_value is NULL) THEN
        INSERT INTO customer_informations(customer_id, customer_property_name, customer_property_value)
        VALUES (p_customer_id, p_customer_property_name, p_customer_property_value);
    elseif (p_current_item_value is not null AND p_current_item_value != p_customer_property_value) then
        -- Overwrite only this property with the new value.
        UPDATE customer_informations
        SET customer_property_value = p_customer_property_value
        WHERE customer_id = p_customer_id AND
              customer_property_name = p_customer_property_name;
    end if;
end; $procedure$;

Currently my CustomerInformations table has a unique constraint on Customer_Id, Customer_property_name.

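Things I have tried or considered:

Parallelizing in the service (which is why you see the unique_violation exception handler in the stored procedure). This did speed things up, but not enough.
I am considering dropping the unique constraint and index, but I am not sure how easy it would be to clean up duplicates (other people interact with the database).

Any tips or suggestions would be greatly appreciated.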

CustomerInformations unique constraint:

CONSTRAINT ux_customer_informations UNIQUE (customer_id, customer_property_name)

Customers unique constraint:

CONSTRAINT ux_customers UNIQUE (customer_firstname, customer_lastname)

Answer 1

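Your current procedure is extremely inefficient. See:

Elegant way of handling PostgreSQL exceptions?

Avoid nested code blocks with error handling; they are very expensive. The lookup can be done properly with the "SELECT or INSERT" technique I use. See:

Is SELECT or INSERT in a function prone to race conditions?

The second part is an UPSERT in disguise. That also gets a lot cheaper: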

CREATE OR REPLACE PROCEDURE add_one_by_customer(
      p_customer_first_name      text
    , p_customer_last_name       text
    , p_customer_property_name   text
    , p_customer_property_value  text
      )
  LANGUAGE plpgsql AS
$proc$
DECLARE
   p_customer_id uuid;
BEGIN
   LOOP
      -- Common case first: the customer already exists.
      SELECT customer_id
      FROM   customers
      WHERE  customer_first_name = p_customer_first_name
      AND    customer_last_name = p_customer_last_name
      INTO   p_customer_id;

      EXIT WHEN FOUND;

      -- Not found: try to insert. If a concurrent call inserted the same name,
      -- DO NOTHING returns no row and the loop retries the SELECT.
      INSERT INTO customers
             (  customer_first_name,   customer_last_name)
      VALUES (p_customer_first_name, p_customer_last_name)
      ON     CONFLICT (customer_first_name, customer_last_name) DO NOTHING
      RETURNING customer_id
      INTO   p_customer_id;

      EXIT WHEN FOUND;
   END LOOP;

   -- UPSERT the property; skip the write when the stored value is unchanged.
   INSERT INTO customer_informations AS ci
          (  customer_id,   customer_property_name,   customer_property_value)
   VALUES (p_customer_id, p_customer_property_name, p_customer_property_value)
   ON     CONFLICT (customer_id, customer_property_name) DO UPDATE
   SET    customer_property_value = EXCLUDED.customer_property_value
   WHERE  ci.customer_property_value IS DISTINCT FROM EXCLUDED.customer_property_value;
END
$proc$;

This requires a UNIQUE constraint on each of the two tables, exactly the ones you declared (ux_customer_informations and ux_customers).

In the final WHERE clause the target column is qualified with the table alias ci, because an unqualified customer_property_value would be ambiguous between the existing row and EXCLUDED.

If customer_property_value can never be null, neither in the stored row nor in the value passed in, simplify the final WHERE clause to:

...
WHERE  ci.customer_property_value <> EXCLUDED.customer_property_value;
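
To try the procedure on a scratch database, something like the following sketch may help. The column types, the NOT NULL choices, and the gen_random_uuid() default (built in from PostgreSQL 13, otherwise provided by pgcrypto) are assumptions for illustration; only the table, column, and constraint names come from the question. The last query shows why IS DISTINCT FROM is the safe default when nulls are possible.

```sql
-- Hypothetical schema: types and defaults are assumed, not taken from the question.
CREATE TABLE customers (
    customer_id         uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    customer_first_name text NOT NULL,
    customer_last_name  text NOT NULL,
    CONSTRAINT ux_customers UNIQUE (customer_first_name, customer_last_name)
);

CREATE TABLE customer_informations (
    customer_id             uuid NOT NULL REFERENCES customers,
    customer_property_name  text NOT NULL,
    customer_property_value text,
    CONSTRAINT ux_customer_informations UNIQUE (customer_id, customer_property_name)
);

-- (Create the procedure here, then exercise both code paths.)
CALL add_one_by_customer('Jane', 'Doe', 'favorite_color', 'blue');   -- new customer, new property
CALL add_one_by_customer('Jane', 'Doe', 'favorite_color', 'green');  -- existing customer, ON CONFLICT path

-- One row per (customer, property); the value should now be 'green'.
SELECT c.customer_first_name, i.customer_property_name, i.customer_property_value
FROM   customers c
JOIN   customer_informations i USING (customer_id);

-- Null handling: <> yields NULL when either side is NULL, so a WHERE using it
-- filters the row out; IS DISTINCT FROM treats NULL as a comparable value.
SELECT NULL::text <> 'b'::text                  AS ne_with_null,        -- NULL
       NULL::text IS DISTINCT FROM 'b'::text    AS distinct_with_null,  -- true
       NULL::text IS DISTINCT FROM NULL::text   AS distinct_both_null;  -- false
```

With a nullable customer_property_value, the <> variant would never overwrite a stored NULL, which is why the simplification only applies when nulls are ruled out.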
