I have a Databricks Delta table, about 400 GB and unpartitioned (Databricks recommends not partitioning tables smaller than 1 TB). The table is the target of a streaming pipeline. I have Z-ordered it on occurence_dttm (the data type is string, but it holds a timestamp; sample value: 2022-03-15T22:41:30.011Z). The Z-order command I use is: '''
OPTIMIZE tablename
ZORDER BY(occurence_dttm)
''' I run this OPTIMIZE once a day. occurence_dttm is a column we take directly from the source; it may contain timestamps from today and yesterday, but nothing earlier. The problem is that when I run the OPTIMIZE command with Z-ordering, most of the time it ends up doing a full optimize and takes 4 hours to complete; only once did it finish an incremental optimize in 7 minutes. Below is the output from a run where OPTIMIZE performed an incremental operation. '''
{
  "numFilesAdded": 1,
  "numFilesRemoved": 100,
  "filesAdded": {"min": 55277570, "max": 55277570, "avg": 55277570, "totalFiles": 1, "totalSize": 55277570},
  "filesRemoved": {"min": 428703, "max": 4891829, "avg": 957992.17, "totalFiles": 100, "totalSize": 95799217},
  "partitionsOptimized": 0,
  "zOrderStats": {
    "strategyName": "minCubeSize(107374182400)",
    "inputCubeFiles": {"num": 514, "size": 108596707661},
    "inputOtherFiles": {"num": 100, "size": 95799217},
    "inputNumCubes": 1,
    "mergedFiles": {"num": 100, "size": 95799217},
    "numOutputCubes": 1,
    "mergedNumCubes": null
  },
  "numBatches": 1,
  "totalConsideredFiles": 614,
  "totalFilesSkipped": 514,
  "preserveInsertionOrder": false
}
'''
Below is the output from a full optimize. '''
{
  "numFilesAdded": 396,
  "numFilesRemoved": 739,
  "filesAdded": {"min": 57810288, "max": 442186574, "avg": 268744949.3055556, "totalFiles": 396, "totalSize": 106422999925},
  "filesRemoved": {"min": 398923, "max": 430759983, "avg": 143900314.78213802, "totalFiles": 739, "totalSize": 106342332624},
  "partitionsOptimized": 0,
  "zOrderStats": {
    "strategyName": "minCubeSize(107374182400)",
    "inputCubeFiles": {"num": 0, "size": 0},
    "inputOtherFiles": {"num": 739, "size": 106342332624},
    "inputNumCubes": 0,
    "mergedFiles": {"num": 739, "size": 106342332624},
    "numOutputCubes": 1,
    "mergedNumCubes": null
  },
  "numBatches": 1,
  "totalConsideredFiles": 739,
  "totalFilesSkipped": 0,
  "preserveInsertionOrder": false
}
''' I would like to understand: under what circumstances does Databricks decide to perform a full versus an incremental Z-order operation?
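For what it's worth, the two outputs above do differ in a measurable way: the incremental run reports a non-empty `inputCubeFiles` set (514 files already belonging to an existing Z-cube) and `totalFilesSkipped` equal to that count, while the full run reports zero cube files and skips nothing. Here is a minimal sketch (the `classify_optimize_run` helper and its heuristic are my own, not a Databricks API) that parses the metrics JSON returned by OPTIMIZE and labels which kind of run happened:

```python
import json

def classify_optimize_run(metrics_json: str) -> str:
    """Label an OPTIMIZE ZORDER run as 'incremental' or 'full'.

    Heuristic based on the two outputs in the question: an incremental
    run skips the files already laid out in an existing Z-cube
    (totalFilesSkipped > 0 and inputCubeFiles non-empty), while a full
    run rewrites everything it considered and skips no files.
    """
    m = json.loads(metrics_json)
    z = m.get("zOrderStats") or {}
    cube_files = (z.get("inputCubeFiles") or {}).get("num", 0)
    if m.get("totalFilesSkipped", 0) > 0 and cube_files > 0:
        return "incremental"
    return "full"

# The two outputs from the question, abbreviated to the relevant fields.
incremental = json.dumps({
    "totalConsideredFiles": 614, "totalFilesSkipped": 514,
    "zOrderStats": {"inputCubeFiles": {"num": 514, "size": 108596707661}},
})
full = json.dumps({
    "totalConsideredFiles": 739, "totalFilesSkipped": 0,
    "zOrderStats": {"inputCubeFiles": {"num": 0, "size": 0}},
})

print(classify_optimize_run(incremental))  # incremental
print(classify_optimize_run(full))         # full
```

Logging this label after each daily run would at least make it easy to correlate the 4-hour runs with whatever else changed that day (e.g. volume of new files written by the stream since the last OPTIMIZE).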