df.rdd.foreachPartition(write_partition)
where write_partition仅迭代行,并使用psycopg2进行批处理插入。 我的问题是,我在数据库上看到分区查询加倍。
SELECT "receiptid","itemindex","barcode","productnumberraw","itemdescription","itemdescriptionraw","itemextendedprice","itemdiscountedextendedprice","itemquantity","barcodemanufacturer","barcodebrand","barcodecategory1","barcodecategory2","barcodecategory3","isfetchpromo","ispartnerbrand","subscribeandsave","soldby","yyyymm","retailerid","partition_id" FROM (select receiptid, itemindex, barcode, productnumberraw, itemdescription, itemdescriptionraw, itemextendedprice, itemdiscountedextendedprice, itemquantity, barcodemanufacturer, barcodebrand, barcodecategory1, barcodecategory2, barcodecategory3, isfetchpromo, ispartnerbrand, subscribeandsave, soldby, yyyymm, retailerid,
MOD(ABS(CAST('x' || md5(receiptid) AS bit(32))::int), 16) AS partition_id from mytable) as subquery WHERE "partition_id" >= 10 AND "partition_id" < 11
是什么导致了对数据的双重读取?
如果您看到它们在pg_stat_activity
中重复,其中一些可能会显示出其他人的指向其他的,这意味着查询正在通过多个工作过程来处理。
在分区表上查看分布在多个工人之间的查询。您专门将其范围缩小到单个分区的事实:
leader_pid
pid
WHERE "partition_id" >= 10 AND "partition_id" < 11
QueryPlan
-gather(成本= 1000.00..8766.49行= 2652宽度= 36)(实际时间= 0.329..188.652行= 500000 loops = 1)
|
|
|
|
|
|
|
|
< 2)) |
|
计划时间:0.333ms |
|
|
QueryPlan< 2)) |
输出:test.partition_id,test.payload |
过滤器:( test.partition_id = 1) |
工人0:实际时间= 0.023..2.136行= 13440 loops =1 |
工人1:实际时间= 0.036..2.131行= 13440 loops =1 |
计划时间:0.212ms
|
|