Find the output cells in a Jupyter notebook that cause a large file size

Question

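I have a Jupyter notebook with about 400 cells. The total file size is 8 MB, so I would like to suppress the larger output cells in order to reduce the overall file size.

Quite a few output cells could be responsible (mainly matplotlib and seaborn plots), so to avoid spending time on trial and error, is there a way to find the size of each output cell? I would like to keep as many of the output plots as possible, since I will be pushing the work out for others to see.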

Answer 1

############### Get test notebook ########################################
import os
notebook_example = "matplotlib3d-scatter-plots.ipynb"
if not os.path.isfile(notebook_example):
    !curl -OL https://raw.githubusercontent.com/fomightez/3Dscatter_plot-binder/master/matplotlib3d-scatter-plots.ipynb

### Use nbformat to get estimate of output size from code cells. #########
import nbformat as nbf
ntbk = nbf.read(notebook_example, nbf.NO_CONVERT)
size_estimate_dict = {}
for cell in ntbk.cells:
    if cell.cell_type == 'code':
        size_estimate_dict[cell.execution_count] = len(str(cell.outputs))
out_size_info = [k for k, v in sorted(size_estimate_dict.items(), key=lambda item: item[1], reverse=True)]
out_size_info

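(For a place to easily run that code, click the launch binder button. When the session spins up, open a new notebook, paste in the code, and run it. A static form of the notebook is here.)

The example I tried there didn't include plots, but it worked similarly on a notebook whose outputs were all plots. I don't know how it will handle a mix of output types; it may not be perfect when the outputs are of different kinds. Hopefully this gives you an idea of how to do what you were asking about, though. The code example could be extended to use the retrieved size estimates to have nbformat make a copy of the input notebook with the outputs removed from, say, the ten largest code cells.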
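The extension suggested above could be sketched roughly as follows. This is only a sketch under some assumptions: it reads the .ipynb file with the standard-library json module rather than nbformat (the file is plain JSON, with cells under the top-level "cells" key in notebook format 4), and the function names strip_largest_outputs and write_stripped_copy are made up for illustration.

```python
import copy
import json


def strip_largest_outputs(nb: dict, top_n: int = 10) -> dict:
    """Return a copy of a notebook dict with the top_n largest outputs cleared."""
    nb_copy = copy.deepcopy(nb)  # leave the caller's notebook untouched
    # Only code cells that actually have outputs are candidates.
    candidates = [
        cell
        for cell in nb_copy.get("cells", [])
        if cell.get("cell_type") == "code" and cell.get("outputs")
    ]
    # Rank by the length of the JSON-serialized outputs (a size estimate).
    ranked = sorted(
        candidates,
        key=lambda cell: len(json.dumps(cell["outputs"])),
        reverse=True,
    )
    for cell in ranked[:top_n]:
        cell["outputs"] = []
    return nb_copy


def write_stripped_copy(src_path: str, dst_path: str, top_n: int = 10) -> None:
    """Read src_path, clear the largest outputs, and write the result to dst_path."""
    with open(src_path, encoding="utf-8") as f:
        nb = json.load(f)
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(strip_largest_outputs(nb, top_n), f, indent=1, ensure_ascii=False)
        f.write("\n")
```

Calling write_stripped_copy("big.ipynb", "big-trimmed.ipynb") would leave the original file alone and write the trimmed copy next to it (both file names are placeholders). Writing to a new path rather than overwriting seems the safer default, since cleared outputs can only be recovered by re-running the cells.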

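Answer 2

I had a similar problem and wrote my own script based on Wayne's answer. You pass it the path to a Jupyter notebook, and it lists the code cells with the largest outputs, ordered by size.

For easier reference, the cell number, the size of the output it produced, and the first few lines of its code are all printed. You can step through the code cells from largest output to smallest by hitting Enter :)

Note that you need to run this script from the command line (otherwise the input() part will not work properly).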

import nbformat as nbf
from typing import TypedDict


class CodeCellMeta(TypedDict):
    cell_num: int
    output_size_bytes: int
    first_lines: list[str]


def get_code_cell_metadata(nb_path: str):
    ntbk = nbf.read(nb_path, nbf.NO_CONVERT)
    cell_metas: list[CodeCellMeta] = []
    for i, cell in enumerate(ntbk.cells):
        cell_num = i + 1
        if cell.cell_type == "code":
            meta: CodeCellMeta = {
                "output_size_bytes": len(str(cell.outputs)),
                "cell_num": cell_num,
                "first_lines": cell.source.split("\n")[:5],
            }
            cell_metas.append(meta)

    return cell_metas


def human_readable_size(size_bytes: int) -> str:
    size_current_unit: float = size_bytes
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if size_current_unit < 1024:
            return f"{size_current_unit:.2f} {unit}"
        size_current_unit /= 1024.0
    return f"{size_current_unit:.2f} PB"


def show_large_cells(nb_path: str):
    code_cell_meta = get_code_cell_metadata(nb_path)

    cell_meta_by_size_est = sorted(
        code_cell_meta, key=lambda x: x["output_size_bytes"], reverse=True
    )

    bytes_remaining = sum([el["output_size_bytes"] for el in cell_meta_by_size_est])

    for i, el in enumerate(cell_meta_by_size_est):
        print(f"Cell #{el['cell_num']}: {human_readable_size(el['output_size_bytes'])}")
        print("\n".join(el["first_lines"]))
        print("\n")
        bytes_remaining -= el["output_size_bytes"]

        if i != len(cell_meta_by_size_est) - 1:
            input(
                f"Remaining cell outputs account for {human_readable_size(bytes_remaining)} total. Hit enter to view info for next cell."
            )
        else:
            print("No more cells to view.")


if __name__ == "__main__":
    import sys

    try:
        nb_path = sys.argv[1]
        if not nb_path.endswith(".ipynb"):
            raise ValueError("Please provide a path to a Jupyter notebook file.")
    except IndexError:
        raise ValueError("Please provide a path to a Jupyter notebook file.")

    show_large_cells(nb_path)
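One caveat on the numbers: len(str(cell.outputs)) counts characters in Python's text representation of the outputs, so the "bytes" figure is an estimate. A possibly closer estimate, sketched below, is to serialize each cell's outputs to JSON and count the UTF-8 bytes. This assumes the notebook is loaded as a plain dict with the json module (notebook format 4, cells under "cells"); output_sizes is a hypothetical helper name, and the result still ignores the indentation added when the file is written.

```python
import json


def output_sizes(nb: dict) -> list[tuple[int, int]]:
    """Return (cell_number, size_in_bytes) pairs for code cells, largest first."""
    sizes = []
    # Cell numbers are 1-based and count all cells, as in the script above.
    for cell_num, cell in enumerate(nb.get("cells", []), start=1):
        if cell.get("cell_type") == "code":
            raw = json.dumps(cell.get("outputs", []), ensure_ascii=False)
            sizes.append((cell_num, len(raw.encode("utf-8"))))
    return sorted(sizes, key=lambda pair: pair[1], reverse=True)
```

To use it non-interactively, load the file with json.load and print the first few pairs, for example `for num, size in output_sizes(nb)[:10]: print(num, size)`.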
