I have multiple tables in Unity Catalog, and I want the record counts of a specified set of tables in a single result. In SQL we can do this easily with the following command:
%sql
Select * From (
Select 'Table1' as TableName, count(*) As RecordCount From CatalogName.SchemaName.Table1
Union All
Select 'Table2' as TableName, count(*) As RecordCount From CatalogName.SchemaName.Table2
Union All
Select 'Table3' as TableName, count(*) As RecordCount From CatalogName.SchemaName.Table3
) a
Order By RecordCount Desc
Result:
---------------------------
TableName | RecordCount
---------------------------
Table3 | 500
Table1 | 300
Table2 | 100
I want to get the same result set with PySpark, if possible with just one command, as above.
df = spark.sql('''
Select * From (
Select 'Table1' as TableName, count(*) As RecordCount From CatalogName.SchemaName.Table1
Union All
Select 'Table2' as TableName, count(*) As RecordCount From CatalogName.SchemaName.Table2
Union All
Select 'Table3' as TableName, count(*) As RecordCount From CatalogName.SchemaName.Table3
) a
Order By RecordCount Desc
''')
df.show()
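If the list of tables changes over time, the same single-command approach still works by generating the UNION ALL string instead of hard-coding it. A minimal sketch, assuming the catalog and schema names from the question; `build_count_query` is a hypothetical helper, not a Spark API:

```python
def build_count_query(catalog, schema, tables):
    """Build one UNION ALL query that counts rows in each table."""
    selects = [
        f"Select '{t}' as TableName, count(*) As RecordCount "
        f"From {catalog}.{schema}.{t}"
        for t in tables
    ]
    body = "\nUnion All\n".join(selects)
    return f"Select * From (\n{body}\n) a\nOrder By RecordCount Desc"

sql = build_count_query('CatalogName', 'SchemaName', ['Table1', 'Table2', 'Table3'])
# df = spark.sql(sql)
# df.show()
```

The generated string is identical in shape to the hard-coded query above, so `spark.sql(sql)` returns the same result set.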
Or:
from pyspark.sql import functions as F

# Start from an empty DataFrame with the target schema
df = spark.createDataFrame([], schema='TableName string, RecordCount long')
for i in range(1, 4):
    tn = f'Table{i}'
    # DataFrames are immutable, so reassign the union result
    df = df.unionAll(
        spark.table(f'CatalogName.SchemaName.{tn}')
        .agg(F.count('*').alias('RecordCount'))
        .withColumn('TableName', F.lit(tn))
        .select('TableName', 'RecordCount')  # match df's column order
    )
df.orderBy(F.desc('RecordCount')).show()
Or:
for i in range(1, 4):
    tn = f'Table{i}'
    print(f'{tn}: {spark.table(f"CatalogName.SchemaName.{tn}").count()}')