In Unity Catalog, using PySpark, is it possible to get the record counts of several tables in a single DataFrame (ideally with one command), the way it can be done in SQL?


I have multiple tables in Unity Catalog and I want the record count of each specified table in a single result set. In SQL this is easy to do with the following query:

%sql
Select * From (
    Select 'Table1' as TableName, count(*) As RecordCount From CatalogName.SchemaName.Table1
    Union All
    Select 'Table2' as TableName, count(*) As RecordCount From CatalogName.SchemaName.Table2
    Union All
    Select 'Table3' as TableName, count(*) As RecordCount From CatalogName.SchemaName.Table3
) a
Order By RecordCount Desc

Result:

---------------------------
TableName   |   RecordCount
---------------------------
Table3      |   500
Table1      |   300
Table2      |   100

I would like to get the same result set using PySpark, ideally with a single command like the one above.

sql pyspark union databricks-unity-catalog multiple-resultsets
1 Answer
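You can run exactly the same UNION ALL query through spark.sql, which returns the whole result as one DataFrame:
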
df = spark.sql('''
Select * From (
    Select 'Table1' as TableName, count(*) As RecordCount From CatalogName.SchemaName.Table1
    Union All
    Select 'Table2' as TableName, count(*) As RecordCount From CatalogName.SchemaName.Table2
    Union All
    Select 'Table3' as TableName, count(*) As RecordCount From CatalogName.SchemaName.Table3
) a
Order By RecordCount Desc
''')

df.show()
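
If the list of tables is long or changes often, the same UNION ALL statement can be assembled from a Python list instead of being written by hand. This is only a sketch; CatalogName.SchemaName and the table names are placeholders for your own objects:

# Hypothetical list of table names to count; qualify them with your own catalog and schema.
tables = ['Table1', 'Table2', 'Table3']

# One "Select '<name>' as TableName, count(*) ..." block per table, joined with Union All.
union_sql = '\nUnion All\n'.join(
    f"Select '{t}' as TableName, count(*) As RecordCount From CatalogName.SchemaName.{t}"
    for t in tables
)

df = spark.sql(f'Select * From ({union_sql}) a Order By RecordCount Desc')
df.show()

Alternatively, the counts can be collected with the DataFrame API by unioning one aggregate per table in a loop: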

from pyspark.sql import functions as F

# Seed with an empty DataFrame whose schema matches the per-table aggregates.
df = spark.createDataFrame(data=[], schema='TableName: string, RecordCount: long')
for i in range(1, 4):
  tn = f'Table{i}'
  # unionAll returns a new DataFrame, so reassign the result instead of discarding it.
  df = df.unionAll(
     spark.table(f'CatalogName.SchemaName.{tn}').agg({'*': 'count'})
     .withColumnRenamed('count(1)', 'RecordCount')
     .withColumn('TableName', F.lit(tn))
     .select('TableName', 'RecordCount')  # match the seed's column order for the positional union
  )

df.orderBy(F.desc('RecordCount')).show()
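
If a single DataFrame is not strictly required, a plain loop over the tables also gives the counts: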

# Simplest check: print each table's count (no combined DataFrame is produced).
for i in range(1, 4):
  tn = f'Table{i}'
  cnt = spark.table(f'CatalogName.SchemaName.{tn}').count()
  print(f'{tn}: {cnt}')
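
The printed counts can also be collected on the driver and turned into the same ordered DataFrame as the SQL version (again a sketch with placeholder catalog, schema and table names):

# Collect (TableName, RecordCount) tuples, then build one DataFrame and sort it.
rows = [(f'Table{i}', spark.table(f'CatalogName.SchemaName.Table{i}').count())
        for i in range(1, 4)]

df = spark.createDataFrame(rows, schema='TableName: string, RecordCount: long')
df.orderBy('RecordCount', ascending=False).show()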