I am trying to query the Delta Lake table history as described in the following link:
https://learn.microsoft.com/en-us/azure/databricks/delta/history
When I describe the Delta table as follows
describe history '/mnt/lake/BASE/SQLClassification/cdcTest/dbo/cdcmergetest/1'
I get the following output:
version | timestamp | userId | userName | operation | operationParameters | job | notebook | clusterId | readVersion | isolationLevel | isBlindAppend | operationMetrics | userMetadata | engineInfo |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 18/03/2023 12:25:54.0000000 | 615257000000000 | [email protected] | MERGE | {"predicate":"(s.primary_key_hash = t.primary_key_hash)","matchedPredicates":"[{"predicate":"(NOT (s.change_key_hash = t.change_key_hash))","actionType":"update"}]","notMatchedPredicates":"[{"actionType":"insert"}]","notMatchedBySourcePredicates":"[{"actionType":"delete"}]"} | (null) | {"notebookId":"3807690121522291"} | 0318-105603-oyrrx3xc | 1 | Serializable | false | {"numTargetRowsCopied":"0","numTargetRowsDeleted":"1","numTargetFilesAdded":"1","numTargetBytesAdded":"9070","numTargetBytesRemoved":"9176","numTargetDeletionVectorsAdded":"0","numTargetRowsMatchedUpdated":"27","executionTimeMs":"13999","numTargetRowsInserted":"0","numTargetRowsMatchedDeleted":"0","scanTimeMs":"4276","numTargetRowsUpdated":"27","numOutputRows":"27","numTargetDeletionVectorsRemoved":"0","numTargetRowsNotMatchedBySourceUpdated":"0","numTargetChangeFilesAdded":"0","numSourceRows":"27","numTargetFilesRemoved":"1","numTargetRowsNotMatchedBySourceDeleted":"1","rewriteTimeMs":"9012"} | (null) | Databricks-Runtime/12.2.x-scala2.12 |
1 | 18/03/2023 12:14:43.0000000 | 615257000000000 | [email protected] | MERGE | {"predicate":"(s.primary_key_hash = t.primary_key_hash)","matchedPredicates":"[{"predicate":"(NOT (s.change_key_hash = t.change_key_hash))","actionType":"update"}]","notMatchedPredicates":"[{"actionType":"insert"}]","notMatchedBySourcePredicates":"[{"actionType":"delete"}]"} | (null) | {"notebookId":"3807690121522291"} | 0318-105603-oyrrx3xc | 0 | Serializable | false | {"numTargetRowsCopied":"0","numTargetRowsDeleted":"0","numTargetFilesAdded":"1","numTargetBytesAdded":"9176","numTargetBytesRemoved":"0","numTargetDeletionVectorsAdded":"0","numTargetRowsMatchedUpdated":"0","executionTimeMs":"6222","numTargetRowsInserted":"28","numTargetRowsMatchedDeleted":"0","scanTimeMs":"2280","numTargetRowsUpdated":"0","numOutputRows":"28","numTargetDeletionVectorsRemoved":"0","numTargetRowsNotMatchedBySourceUpdated":"0","numTargetChangeFilesAdded":"0","numSourceRows":"28","numTargetFilesRemoved":"0","numTargetRowsNotMatchedBySourceDeleted":"0","rewriteTimeMs":"3593"} | (null) | Databricks-Runtime/12.2.x-scala2.12 |
0 | 18/03/2023 12:14:23.0000000 | 615257000000000 | [email protected] | CREATE OR REPLACE TABLE | {"isManaged":"false","description":null,"partitionBy":"[]","properties":"{}"} | (null) | {"notebookId":"3807690121522291"} | 0318-105603-oyrrx3xc | (null) | Serializable | true | {} | (null) | Databricks-Runtime/12.2.x-scala2.12 |
I have assigned the path to the following variable:
saveloc = '/mnt/lake/BASE/SQLClassification/cdcTest/dbo/cdcmergetest/1'
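As you can see from the history output above, there are fields called version and operationParameters.
It is easy to get the latest version from the history table with the following code: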
df4 = spark.read.option("versionAsOf", 3).load(saveloc)
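For what it's worth, the version number does not have to be hard-coded; it can be taken from the history itself. A minimal sketch, assuming the `delta.tables.DeltaTable` Python API is available on the cluster (the helper names `latest_version` and `read_latest` are made up for illustration):

```python
def latest_version(history_rows):
    """Return the highest version among history rows.

    Each row only needs to support row["version"], which holds for both
    plain dicts and pyspark Row objects.
    """
    return max(row["version"] for row in history_rows)


def read_latest(spark, path):
    """Load the table at `path` as of its most recent version (hypothetical helper)."""
    # Imported lazily so latest_version() can be used without Spark installed.
    from delta.tables import DeltaTable

    rows = DeltaTable.forPath(spark, path).history().select("version").collect()
    return (
        spark.read.format("delta")
        .option("versionAsOf", latest_version(rows))
        .load(path)
    )
```

In a notebook this would be called as `read_latest(spark, saveloc)`.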
There are several ways to get the latest version, for example:
df5 = spark.read.load("/mnt/lake/BASE/SQLClassification/cdcTest/dbo/cdcmergetest/1@v3")
Or
df6 = spark.read.load(saveloc + "@v3")
Or in SQL it would be something similar to:
SELECT * FROM saveloc@v3
Can someone tell me whether it is possible to write a WHERE clause on the version field, for example
Select * From saveloc
where version > 2
select * from deltatable version as of 9
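That is not possible. Suppose you have a table with 5 versions. If you use a query like
Select * From saveloc
where version > 2
which version would you expect to see: 3, 4, or 5? You need to specify a single version.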
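If the goal is really "every version after 2", one workaround is to apply the filter to the history, where `version` is an actual column, and then read each matching version separately. A rough sketch under that assumption; `versions_greater_than` and `read_versions` are hypothetical helper names, the `_version` column is added only for illustration, and `unionByName` assumes the schema did not change between versions:

```python
from functools import reduce


def versions_greater_than(history_rows, min_version):
    """Return the sorted version numbers strictly greater than min_version."""
    return sorted(
        row["version"] for row in history_rows if row["version"] > min_version
    )


def read_versions(spark, path, versions):
    """Read each requested version and stack them, tagged with a _version column."""
    # Imported lazily so versions_greater_than() works without Spark installed.
    from pyspark.sql import functions as F

    frames = [
        spark.read.format("delta")
        .option("versionAsOf", v)
        .load(path)
        .withColumn("_version", F.lit(v))
        for v in versions
    ]
    if not frames:
        raise ValueError("no versions to read")
    return reduce(lambda left, right: left.unionByName(right), frames)
```

The history rows could come from `spark.sql("DESCRIBE HISTORY delta.`" + saveloc + "`").collect()`, after which `read_versions(spark, saveloc, versions_greater_than(rows, 2))` returns one DataFrame that can be filtered on `_version`.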