I am working in a Databricks notebook. I want to programmatically find the list of DataFrames that have actually been cached through an action, i.e. `.cache()` followed by an action such as `.show()`.
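That is, the pattern I care about looks like this (a minimal illustration; `spark.range` is just a stand-in for my real data):

```python
df = spark.range(10).cache()  # lazily marks the DataFrame for caching
df.show()                     # the action that actually materializes the cache
```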
This post provides a solution for the simpler case: finding the objects on which `.cache()` or `.persist()` has been called, so that the `is_cached` attribute is set to `True`. But that gives no guarantee that the cache has actually been triggered by an action.
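For example, this quick check (a minimal sketch; `spark.range` is again placeholder data) shows the flag flipping without any action having run:

```python
# is_cached flips to True as soon as .cache() is called,
# before any action has computed (and therefore cached) anything
df = spark.range(10)
df.cache()
print(df.is_cached)  # True, yet nothing has been materialized
```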
I tried `sc._jsc.getPersistentRDDs()`, but it also returns Java objects that I don't want to see; they are unrelated to my actual code, and it is hard to figure out which entry, if any, is my cached DataFrame (see the toy example below). How should I proceed?
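For reference, this is roughly how I inspected the persistent RDDs (a minimal sketch, assuming py4j's standard conversion of the returned `java.util.Map`):

```python
# Each entry is a py4j proxy for a JVM-side RDD; its id and string
# representation do not map cleanly back to a PySpark DataFrame variable
persistent_rdds = spark.sparkContext._jsc.getPersistentRDDs()
for rdd_id, java_rdd in persistent_rdds.items():
    print(rdd_id, java_rdd.toString())
```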
Toy example:

```python
from pyspark.sql import DataFrame
from pyspark import StorageLevel

# List that stores references to the cached DataFrames
cached_dfs = []

def cache_and_track(df: DataFrame, name: str):
    """
    Caches a DataFrame and tracks its name.
    This function ensures the DataFrame is cached and marked as "used".
    """
    df.cache()
    cached_dfs.append((name, df))
    return df

df1 = spark.createDataFrame([('a', 1), ('b', 2)], ["letter", "number"])
df2 = spark.createDataFrame([('x', 3), ('y', 4)], ["char", "num"])

df1 = cache_and_track(df1, "df1")
df2 = cache_and_track(df2, "df2")

# Trigger actions
df1.show()
df2.count()

# storageLevel is already set by .cache() itself, so this filter cannot
# tell whether an action has actually materialized the cache
actual_cached_dfs = [name for name, df in cached_dfs
                     if df.storageLevel != StorageLevel.NONE]

print("DataFrames that were actually cached and triggered:")
print(actual_cached_dfs)
```
Result:

```
+------+------+
|letter|number|
+------+------+
|     a|     1|
|     b|     2|
+------+------+

DataFrames that were actually cached and triggered:
['df1', 'df2']
```
In the code above I:

- create a list to store references to the cached DataFrames,
- mark each DataFrame as cached,
- cache the DataFrames and track them by name,
- trigger actions and check which DataFrames have been cached and accessed.