I'm trying to batch-insert a few million entities. The batch insert sort of works, but my program executes some JDBC statements in the background that I don't want:
List<IceCream> iceList = new ArrayList<IceCream>();

for (CSVRecord record : records) {
    if (flushCounter > 40000) {
        iceCreamRepository.saveAll(iceList);
        iceList = new ArrayList<IceCream>();
        flushCounter = 0;
    }
    flushCounter++;

    IceCream iceCream = new IceCream();
    int id = getIdFromCSV();
    iceCream.setId(id);
    iceCream.set...
    ...
    iceList.add(iceCream);
}
My repository:
public interface IceCreamRepository extends JpaRepository<IceCream, Long>
{
}
My entity:
@Entity
@Table(name = "IceCream")
public class IceCream
{
    private static final long serialVersionUID = 1L;

    @OneToMany(targetEntity = entity.OtherEntity.class, mappedBy = "IceCream")
    private Set<OtherEntity> otherEntitys = new HashSet<OtherEntity>();

    @Id
    private int id;

    @Basic
    @Column(name = "import_tstamp")
    @Temporal(TemporalType.TIMESTAMP)
    private Date importTstamp;

    @Basic
    @Column(name = "import_type", length = 2147483647)
    private String importType;

    @Basic
    @Column(length = 2147483647)
    private String text;

    ...
}
My JPA settings:
spring.jpa.properties.hibernate.batch_versioned_data: true
spring.jpa.properties.hibernate.order_updates: true
spring.jpa.properties.hibernate.order_inserts: true
spring.jpa.properties.hibernate.generate_statistics: true
spring.jpa.properties.hibernate.jdbc.format_sql: true
spring.jpa.properties.hibernate.jdbc.batch_size: 1000
The batch insert does work, but if I try to upload 100 entities, I get 33 JDBC statements that check the IDs.
Here is the output for 33 entities:
2020-03-25 09:25:50.172 [scheduling-1] INFO net.ttddyy.dsproxy.listener.logging.SLF4JQueryLoggingListener - Name:, Connection:4, Time:1, Success:True, Type:Prepared, Batch:False, QuerySize:1, BatchSize:0, Query:["select ice0_.id as id1_4_0_, ice0_.text as text6_4_0_, ice0_.import_tstamp as import_10_4_0_, ice0_.import_type as import_11_4_0_, from iceCream ice0 where ice0_.id=?"], Params:[(1)]
2020-03-25 09:25:50.172 [scheduling-1] INFO net.ttddyy.dsproxy.listener.logging.SLF4JQueryLoggingListener - Name:, Connection:4, Time:1, Success:True, Type:Prepared, Batch:False, QuerySize:1, BatchSize:0, Query:["select ice0_.id as id1_4_0_, ice0_.text as text6_4_0_, ice0_.import_tstamp as import_10_4_0_, ice0_.import_type as import_11_4_0_, from iceCream ice0 where ice0_.id=?"], Params:[(2)]
2020-03-25 09:25:50.172 [scheduling-1] INFO net.ttddyy.dsproxy.listener.logging.SLF4JQueryLoggingListener - Name:, Connection:4, Time:1, Success:True, Type:Prepared, Batch:False, QuerySize:1, BatchSize:0, Query:["select ice0_.id as id1_4_0_, ice0_.text as text6_4_0_, ice0_.import_tstamp as import_10_4_0_, ice0_.import_type as import_11_4_0_, from iceCream ice0 where ice0_.id=?"], Params:[(3)]
2020-03-25 09:25:50.172 [scheduling-1] INFO net.ttddyy.dsproxy.listener.logging.SLF4JQueryLoggingListener - Name:, Connection:4, Time:1, Success:True, Type:Prepared, Batch:False, QuerySize:1, BatchSize:0, Query:["select ice0_.id as id1_4_0_, ice0_.text as text6_4_0_, ice0_.import_tstamp as import_10_4_0_, ice0_.import_type as import_11_4_0_, from iceCream ice0 where ice0_.id=?"], Params:[(4)]
2020-03-25 09:25:50.172 [scheduling-1] INFO net.ttddyy.dsproxy.listener.logging.SLF4JQueryLoggingListener - Name:, Connection:4, Time:1, Success:True, Type:Prepared, Batch:False, QuerySize:1, BatchSize:0, Query:["select ice0_.id as id1_4_0_, ice0_.text as text6_4_0_, ice0_.import_tstamp as import_10_4_0_, ice0_.import_type as import_11_4_0_, from iceCream ice0 where ice0_.id=?"], Params:[(5)]
...my program is trying to load the entities, but I don't know why; I haven't inserted them yet. It does this for 32 of the IDs, i.e. for every ID except the first one (0). After that output there is a single batch insert for all 33 entities:
2020-03-25 09:25:50.334 [scheduling-1] INFO net.ttddyy.dsproxy.listener.logging.SLF4JQueryLoggingListener - Name:, Connection:4, Time:11, Success:True, Type:Prepared, Batch:True, QuerySize:1, BatchSize:33, Query:["insert into iceCream(import_tstamp, import_type, text, id) values (?, ?, ?, ?)"], Params:[(2020-03-25 09:25:50.127,ice,teext,0),(2020-03-25 09:25:50.127,ice,teext,1),(2020-03-25 09:25:50.127,ice,teext,2)...]
...and after that I get this summary:
2020-03-25 09:25:50.359 [scheduling-1] INFO org.hibernate.engine.internal.StatisticalLoggingSessionEventListener - Session Metrics {
2222222 nanoseconds spent acquiring 1 JDBC connections;
0 nanoseconds spent releasing 0 JDBC connections;
21234400 nanoseconds spent preparing 33 JDBC statements;
40600005 nanoseconds spent executing 32 JDBC statements;
27859771 nanoseconds spent executing 1 JDBC batches;
0 nanoseconds spent performing 0 L2C puts;
0 nanoseconds spent performing 0 L2C hits;
0 nanoseconds spent performing 0 L2C misses;
100978099 nanoseconds spent executing 1 flushes (flushing a total of 34 entities and 33 collections);
0 nanoseconds spent executing 0 partial-flushes (flushing a total of 0 entities and 0 collections)
}
If I use only 1 entity, the output is:
2020-03-25 11:17:40.119 [scheduling-1] INFO org.hibernate.engine.internal.StatisticalLoggingSessionEventListener - Session Metrics {
1375995 nanoseconds spent acquiring 1 JDBC connections;
0 nanoseconds spent releasing 0 JDBC connections;
12024409 nanoseconds spent preparing 1 JDBC statements;
0 nanoseconds spent executing 0 JDBC statements;
5597005 nanoseconds spent executing 1 JDBC batches;
0 nanoseconds spent performing 0 L2C puts;
0 nanoseconds spent performing 0 L2C hits;
0 nanoseconds spent performing 0 L2C misses;
38446070 nanoseconds spent executing 1 flushes (flushing a total of 1 entities and 1 collections);
0 nanoseconds spent executing 0 partial-flushes (flushing a total of 0 entities and 0 collections)
}
For 2 entities it shows the following (my IDs start at 0, so it only runs the extra JDBC statement for the second entity):
2020-03-25 09:25:50.172 [scheduling-1] INFO net.ttddyy.dsproxy.listener.logging.SLF4JQueryLoggingListener - Name:, Connection:4, Time:1, Success:True, Type:Prepared, Batch:False, QuerySize:1, BatchSize:0, Query:["select ice0_.id as id1_4_0_, ice0_.text as text6_4_0_, ice0_.import_tstamp as import_10_4_0_, ice0_.import_type as import_11_4_0_, from iceCream ice0 where ice0_.id=?"], Params:[(1)]
2020-03-25 11:25:00.180 [scheduling-1] INFO org.hibernate.engine.internal.StatisticalLoggingSessionEventListener - Session Metrics {
1446363 nanoseconds spent acquiring 1 JDBC connections;
0 nanoseconds spent releasing 0 JDBC connections;
13101435 nanoseconds spent preparing 2 JDBC statements;
11427142 nanoseconds spent executing 1 JDBC statements;
3762785 nanoseconds spent executing 1 JDBC batches;
0 nanoseconds spent performing 0 L2C puts;
0 nanoseconds spent performing 0 L2C hits;
0 nanoseconds spent performing 0 L2C misses;
22309803 nanoseconds spent executing 1 flushes (flushing a total of 2 entities and 2 collections);
0 nanoseconds spent executing 0 partial-flushes (flushing a total of 0 entities and 0 collections)
}
The output for 3 is:
2020-03-25 11:47:00.277 [scheduling-1] INFO org.hibernate.engine.internal.StatisticalLoggingSessionEventListener - Session Metrics {
1010843 nanoseconds spent acquiring 1 JDBC connections;
0 nanoseconds spent releasing 0 JDBC connections;
31706133 nanoseconds spent preparing 3 JDBC statements;
57180996 nanoseconds spent executing 2 JDBC statements;
3839505 nanoseconds spent executing 1 JDBC batches;
0 nanoseconds spent performing 0 L2C puts;
0 nanoseconds spent performing 0 L2C hits;
0 nanoseconds spent performing 0 L2C misses;
23923340 nanoseconds spent executing 1 flushes (flushing a total of 3 entities and 3 collections);
0 nanoseconds spent executing 0 partial-flushes (flushing a total of 0 entities and 0 collections)
}
...so I have two questions:

Why do I get all these JDBC statements when all I want is a single batch insert? (And how do I fix this?)

I tried this with millions of entities, but I don't see any updates in the database before the program finishes. I do call iceCreamRepository.saveAll(iceList); every 40,000 rows, and I thought that would write those entities to the database. My RAM is not huge: I have a 10 GB data file and only 2 GB of RAM. If the program really waits until the very end to write all the data, why don't I run out of memory?
The answer is going to be a bit convoluted, but bear with me.

> I do call iceCreamRepository.saveAll(iceList)

From the above, I assume you are using Spring Data with JPA.

> Why do I get all these JDBC statements when all I want is a single batch insert? (And how do I fix this?)
JpaRepository.saveAll() is implemented by calling save() on each entity in the list, and save() is implemented as follows:
if (entityInformation.isNew(entity)) {
    em.persist(entity);
    return entity;
} else {
    return em.merge(entity);
}
The default implementation of EntityInformation 'considers an entity to be new whenever EntityInformation.getId(Object) returns null', which means your entities fall into the second branch of the if ... else statement above.

Effectively, Spring Data tells JPA to merge each entity with its existing version in the database. JPA therefore has to load that existing version first, and that is why you see all the additional queries.
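As a plain-Java illustration of that dispatch (a simplified sketch, not Spring's actual implementation): for a wrapper-type id, "new" means null, while for a primitive id such as your int, Spring Data's AbstractEntityInformation treats the zero value as "new", which appears to be why id 0 skipped the extra select in your log.

```java
// Simplified model of Spring Data's new-entity check and the
// resulting save() dispatch. Not the real Spring classes.
class SaveDispatchDemo {

    static String dispatch(Integer id, boolean primitiveId) {
        boolean isNew = primitiveId
                ? id.longValue() == 0L   // primitive id: the zero value means "new"
                : id == null;            // wrapper id: null means "new"
        // Mirrors SimpleJpaRepository.save(): persist new entities,
        // merge (load first, then update) everything else.
        return isNew ? "persist" : "merge";
    }

    public static void main(String[] args) {
        System.out.println(dispatch(0, true)); // "persist" -> no extra select
        System.out.println(dispatch(1, true)); // "merge"   -> select by id first
    }
}
```

In your run, ids 1 through 32 hit the merge branch, producing one select each, which matches the 32 executed statements in the session metrics.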
To fix this, either:

- have your entity implement Persistable and return true from isNew(), so that save() takes the persist() branch of the if ... else above (note that this may affect persistence logic in other places; see the Spring Data reference documentation on entity-state detection for details), or
- interact with the EntityManager directly, calling persist() instead of merge().
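A minimal sketch of the first option, assuming Spring Data JPA on the classpath (outline only, with the remaining fields elided):

```java
import org.springframework.data.domain.Persistable;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

@Entity
@Table(name = "IceCream")
public class IceCream implements Persistable<Integer> {

    @Id
    private int id;

    @Override
    public Integer getId() {
        return id;
    }

    // Always report "new" so SimpleJpaRepository.save() calls
    // persist() instead of merge(), skipping the select-by-id.
    @Override
    public boolean isNew() {
        return true;
    }

    // ... remaining fields as before ...
}
```

The trade-off: with isNew() hard-wired to true, calling save() on an entity that already exists in the database will attempt an insert and fail, so this only suits pure-import workloads.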
> I tried this with millions of entities, but I don't see any updates in the database before the program finishes
For the actual queries to be executed, you need to call EntityManager.flush() after each batch (if you would rather not interact with the EntityManager directly, use JpaRepository.flush() or saveAndFlush() instead).
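Applied to the import loop from the question, that could look roughly like this (a sketch; the batch size and the toIceCream() mapping helper are illustrative, not from your code). Clearing the persistence context after each flush also matters for the memory question: without it, every saved entity stays managed in the first-level cache for the whole run.

```java
int batchSize = 1000;  // align with hibernate.jdbc.batch_size
List<IceCream> iceList = new ArrayList<IceCream>();

for (CSVRecord record : records) {
    iceList.add(toIceCream(record));      // toIceCream(): your CSV-to-entity mapping
    if (iceList.size() >= batchSize) {
        iceCreamRepository.saveAll(iceList);
        entityManager.flush();            // push the batched inserts to the DB now
        entityManager.clear();            // detach the entities, freeing heap
        iceList.clear();
    }
}
iceCreamRepository.saveAll(iceList);      // remainder
entityManager.flush();
```

The EntityManager here would be injected with @PersistenceContext, and the whole method must run inside a transaction for the flushes to commit as expected.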
(As a side note: JPA comes with a lot of overhead for caching, conversion, and so on, and is generally a poor choice for bulk operations. If I were you, I would consider switching to Spring Batch with plain JDBC.)
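For comparison, a bare-JDBC version of the same insert avoids the persistence context entirely; this is a sketch in which the connection URL, credentials, and loadFromCsv() helper are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Timestamp;

public class JdbcBatchInsert {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:...", "user", "pass")) {
            con.setAutoCommit(false);  // batch everything into one transaction
            String sql = "insert into IceCream (id, import_tstamp, import_type, text) "
                       + "values (?, ?, ?, ?)";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                int count = 0;
                for (IceCream ice : loadFromCsv()) {   // loadFromCsv(): your CSV parsing
                    ps.setInt(1, ice.getId());
                    ps.setTimestamp(2, new Timestamp(System.currentTimeMillis()));
                    ps.setString(3, ice.getImportType());
                    ps.setString(4, ice.getText());
                    ps.addBatch();
                    if (++count % 1000 == 0) {
                        ps.executeBatch();             // one round trip per 1000 rows
                    }
                }
                ps.executeBatch();                     // flush the final partial batch
            }
            con.commit();
        }
    }
}
```

No dirty checking, no first-level cache, no select-before-merge: just batched inserts.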