How do I pass a binary file via stdin to a Dockerized Python script that uses argparse?

Question

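I re-implemented his solution to simplify the problem. Let's leave Docker and Django out of it. The goal is to read an Excel file with Pandas using either of the following two approaches: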

1. python example.py - < /path/to/file.xlsx
2. cat /path/to/file.xlsx | python example.py -

where example.py is reproduced below:

import argparse
import contextlib
from typing import IO
import sys
import pandas as pd


@contextlib.contextmanager
def file_ctx(filename: str) -> IO[bytes]:
    if filename == '-':
        yield sys.stdin.buffer
    else:
        with open(filename, 'rb') as f:
            yield f


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('FILE')
    args = parser.parse_args()

    with file_ctx(args.FILE) as input_file:
        print(input_file.read())
        df = pd.read_excel(input_file)
        print(df)


if __name__ == "__main__":
    main()

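The problem is that Pandas (see the traceback below) fails with approach 2 (piping through cat), although it works fine with approach 1 (redirecting with <).

Merely printing the raw bytes of the Excel file, on the other hand, works with both approaches.

If you want to reproduce the Docker environment easily:

First build a Docker image named pandas: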

docker build --pull -t pandas - <<EOF
FROM python:latest
RUN pip install pandas xlrd
EOF

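Then run the script using the pandas Docker image: docker run --rm -i -v /path/to/example.py:/example.py pandas python example.py - < /path/to/file.xlsx

Note how it correctly prints the raw contents of the Excel file, but Pandas cannot read it.

A more condensed traceback, similar to the one further below: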

Traceback (most recent call last):
  File "example.py", line 29, in <module>
    main()
  File "example.py", line 24, in main
    df = pd.read_excel(input_file)
  File "/usr/local/lib/python3.8/site-packages/pandas/util/_decorators.py", line 208, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 310, in read_excel
    io = ExcelFile(io, engine=engine)
  File "/usr/local/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 819, in __init__
    self._reader = self._engines[engine](self._io)
  File "/usr/local/lib/python3.8/site-packages/pandas/io/excel/_xlrd.py", line 21, in __init__
    super().__init__(filepath_or_buffer)
  File "/usr/local/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 356, in __init__
    filepath_or_buffer.seek(0)
io.UnsupportedOperation: File or stream is not seekable.

To show that the code works when the Excel file is mounted into the container (i.e. not passed via stdin):

docker run --rm -i -v /path/to/example.py:/example.py -v /path/to/file.xlsx:/file.xlsx pandas python example.py file.xlsx

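Original problem description (for additional context)

On the host system, you have a file at /tmp/test.txt and you want to run head on it, but inside a Docker container (echo 'Hello World!' > /tmp/test.txt reproduces the sample data I have).

You can run:

docker run -i busybox head -1 - < /tmp/test.txt

to print the first line to the screen, or:

cat /tmp/test.txt | docker run -i busybox head -1 -

In both cases the output is: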

Hello World!

The above also works with a binary format such as .xlsx instead of plain text, in which case you get some strange-looking output similar to the following:

�Oxl/_rels/workbook.xml.rels���j�0
                                  ��}

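The point of the above is that head works with both binary and text formats, even through the Docker abstraction.

But in my own argparse-based CLI (actually a custom Django management command, which I believe uses argparse), I get the following error when trying to use Pandas' read_excel in a Docker context.

The error printed is as follows: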

Traceback (most recent call last):
  File "./manage.py", line 15, in <module>
    execute_from_command_line(sys.argv)
  File "/opt/conda/lib/python3.7/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
    utility.execute()
  File "/opt/conda/lib/python3.7/site-packages/django/core/management/__init__.py", line 375, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/opt/conda/lib/python3.7/site-packages/django/core/management/base.py", line 323, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/opt/conda/lib/python3.7/site-packages/django/core/management/base.py", line 364, in execute
    output = self.handle(*args, **options)
  File "/home/jovyan/sequence_databaseApp/management/commands/seq_db.py", line 54, in handle
    df_snapshot = pd.read_excel(options['FILE'].buffer, sheet_name='Snapshot', header=0, dtype=dtype)
  File "/opt/conda/lib/python3.7/site-packages/pandas/util/_decorators.py", line 208, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 310, in read_excel
    io = ExcelFile(io, engine=engine)
  File "/opt/conda/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 819, in __init__
    self._reader = self._engines[engine](self._io)
  File "/opt/conda/lib/python3.7/site-packages/pandas/io/excel/_xlrd.py", line 21, in __init__
    super().__init__(filepath_or_buffer)
  File "/opt/conda/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 356, in __init__
    filepath_or_buffer.seek(0)
io.UnsupportedOperation: File or stream is not seekable.

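Specifically,

docker run -i <IMAGE> ./manage.py my_cli import - < /path/to/file.xlsx does not work,

but ./manage.py my_cli import - < /path/to/file.xlsx works!

Somehow things are different in the Docker context.

However, I also noticed that, even leaving Docker out of it:

cat /path/to/file.xlsx | ./manage.py my_cli import - does not work,

whereas:

./manage.py my_cli import - < /path/to/file.xlsx works (as mentioned above).

Finally, here is the code I am using (you should be able to save it as my_cli.py under management/commands to get it working in a Django project):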

import argparse
import sys

from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = 'my_cli help'

    def add_arguments(self, parser):
        subparsers = parser.add_subparsers(
            title='commands', dest='command', help='command help')
        subparsers.required = True
        parser_import = subparsers.add_parser('import', help='import help')
        parser_import.add_argument('FILE', type=argparse.FileType('r'), default=sys.stdin)

    def handle(self, *args, **options):
        import pandas as pd
        df = pd.read_excel(options['FILE'].buffer, header=0)
        print(df)

Answer 1

Heavily based on Anthony Sottile's answer, but slightly modified so that it fully solves the problem:

import argparse
import contextlib
import io
from typing import IO
import sys

import pandas as pd


@contextlib.contextmanager
def file_ctx(filename: str) -> IO[bytes]:
    if filename == '-':
        yield io.BytesIO(sys.stdin.buffer.read())
    else:
        with open(filename, 'rb') as f:
            yield f


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('FILE')
    args = parser.parse_args()

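    with file_ctx(args.FILE) as input_file:
        # Debug output: dump the raw bytes, then rewind so that
        # read_excel starts from the beginning of the stream.
        print(input_file.read())
        input_file.seek(0)
        df = pd.read_excel(input_file)
        print(df)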


if __name__ == "__main__":
    main()

I got the idea after reading this answer to "Pandas 0.25.0 and xlsx from response content stream".
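The reason the in-memory copy helps can be checked directly: a pipe does not support seek(), while an io.BytesIO built from its contents does. The following is a minimal sketch, not part of the original answer; it assumes that os.pipe() is a fair stand-in for a piped stdin, and the payload bytes are made up for illustration.

```python
import io
import os

# os.pipe() stands in for a piped stdin (an assumption for illustration).
read_fd, write_fd = os.pipe()
os.write(write_fd, b'fake xlsx bytes')  # made-up payload
os.close(write_fd)

with os.fdopen(read_fd, 'rb') as pipe_reader:
    # A pipe cannot be rewound, which is what read_excel trips over.
    pipe_is_seekable = pipe_reader.seekable()
    # Copying the whole stream into memory yields a seekable buffer.
    buffered = io.BytesIO(pipe_reader.read())

buffered_is_seekable = buffered.seekable()
buffered.seek(0)
print(pipe_is_seekable, buffered_is_seekable)
```

The trade-off is that the whole file is held in memory, which is usually acceptable for spreadsheets.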

Here is how it looks in the Django-based context of the original question:

import contextlib
import io
import sys
from typing import IO

import pandas as pd

from django.core.management.base import BaseCommand


@contextlib.contextmanager
def file_ctx(filename: str) -> IO[bytes]:
    if filename == '-':
        yield io.BytesIO(sys.stdin.buffer.read())
    else:
        with open(filename, 'rb') as f:
            yield f


class Command(BaseCommand):
    help = 'my_cli help'

    def add_arguments(self, parser):
        subparsers = parser.add_subparsers(
            title='commands', dest='command', help='command help')
        subparsers.required = True
        parser_import = subparsers.add_parser('import', help='import help')
        parser_import.add_argument('FILE')

    def handle(self, *args, **options):
        with file_ctx(options['FILE']) as input_file:
            df = pd.read_excel(input_file)
            print(df)
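One possible refinement, which is not part of the original answer, is to copy stdin into memory only when it is actually not seekable. With a redirect such as < /path/to/file.xlsx the stream is typically a regular file and can be used as-is. The helper name ensure_seekable below is hypothetical.

```python
import io
from typing import IO


def ensure_seekable(stream: IO[bytes]) -> IO[bytes]:
    """Return the stream unchanged if it is seekable, else an in-memory copy.

    Hypothetical helper for illustration; not taken from the answers.
    """
    if stream.seekable():
        return stream
    # Pipes cannot seek, so read everything into a BytesIO buffer.
    return io.BytesIO(stream.read())
```

Inside file_ctx, the '-' branch could then yield ensure_seekable(sys.stdin.buffer) instead of always building a BytesIO.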

Answer 2

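It looks like you are reading the file in text mode (FileType('r') / sys.stdin).

According to this bpo issue, argparse does not support opening binary files directly.

I suggest handling the file yourself with code similar to this (I am not familiar with how django / pandas do things, so I have simplified it to pure Python):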

import argparse
import contextlib
import io
import sys
from typing import IO


@contextlib.contextmanager
def file_ctx(filename: str) -> IO[bytes]:
    if filename == '-':
        # stdin may be a non-seekable pipe, so buffer it in memory
        yield io.BytesIO(sys.stdin.buffer.read())
    else:
        with open(filename, 'rb') as f:
            yield f


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument('FILE')
    args = parser.parse_args()

    with file_ctx(args.FILE) as input_file:
        # do whatever you need with that input file
        pass
    return 0


if __name__ == '__main__':
    raise SystemExit(main())
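As an alternative to the context manager, the same logic can be placed in a custom callable passed to argparse via type=, so that args.FILE is already a binary stream after parsing. This is only a sketch under that assumption; the names binary_input and build_parser are hypothetical, and the caller is responsible for closing the returned file.

```python
import argparse
import io
import sys
from typing import IO


def binary_input(path: str) -> IO[bytes]:
    """Hypothetical argparse type= callable returning a seekable binary stream."""
    if path == '-':
        # stdin may be a pipe, so copy it into memory to make it seekable.
        return io.BytesIO(sys.stdin.buffer.read())
    try:
        return open(path, 'rb')
    except OSError as exc:
        # argparse reports ArgumentTypeError as a regular usage error.
        raise argparse.ArgumentTypeError(f"cannot open {path!r}: {exc}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument('FILE', type=binary_input)
    return parser
```

With this in place, build_parser().parse_args() returns a namespace whose FILE attribute can be handed to pd.read_excel directly.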
