Avoid bulk data export to CSV when the headers are dynamic

Question (votes: 4, answers: 4)

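I've stumbled upon a situation that looks very simple, yet I can't seem to find a solution for it.

What I want to do is simple: write some data to a .csv file that contains:

  • dynamic headers
  • some data

The way I'm doing it now seems to be the only solution I can think of:

  • store the data I need in a list of dictionaries
  • get the keys() of every dict in that list and add them to a set() (these become the headers)
  • write the data to the file using writer.writerows(data)

Basically, a simple MCVE might look like this: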

from csv import DictWriter

RESULT_FILE = 'test_result.csv'


def get_fieldnames(data):
    fieldnames = set()
    for item in data:
        fieldnames.update(item.keys())
    return fieldnames


def main(data):
    fieldnames = get_fieldnames(data)

    with open(RESULT_FILE, 'a', newline='', encoding='utf-8') as f:
        writer = DictWriter(f, fieldnames=fieldnames, delimiter=',')
        writer.writeheader()
        writer.writerows(data)


if __name__ == '__main__': 
    data_ = [
        {
            'a': '1',
            'b': '2',
            'c': '3',
        },
        {
            'a': '6',
            'd': '1',
            'b': '3',
        },
        {
            'c': '2',
            'e': '1',
            'f': '9',
        }
    ]
    main(data_)

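Now, what I don't like about this:

  • The list might become really large (~100k dicts, each with about 10 fields).
  • If the program crashes while adding dict number 66666 to the list, everything is lost and I have no data at all in the csv. Because I have to wait until all the data is in the list to know every possible header, I can't avoid this.

How can I avoid exporting all the data to the csv in one go when the headers are dynamic?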


As requested, the real data looks like this:

{'Catalog link': '',
 'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
             'Accessories / Sander Accessories',
 'Description': 'Exclusive single-piece hub design reduces pad vibration and '
                'ensures smooth performance.',
 'Each': '$ 24.70',
 'Info': '',
 'Line art': '',
 'Name': '(5") Non-Vacuum Disc Pad Vinyl-Face',
 'Product number': '91456106T',
 'Technical specifications': '',
 'image_1': 'https://www.richelieu.com/documents/docsGr/120/107/6/1201076/1419675_700.jpg'}

{'Catalog link': '',
 'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
             'Accessories / Sander Accessories',
 'Description': '',
 'Each': '$ 8.19',
 'Info': '<p><strong>material: </strong>Cork</p>',
 'Line art': '',
 'Name': 'Replacement Plate for MKT9924DB Belt Sander',
 'Product number': 'MKT4230358',
 'Technical specifications': '<p><strong>brand: </strong>Makita</p>',
 'image_1': 'https://www.richelieu.com/documents/docsGr/116/631/4/1166314/1281513_700.jpg',
 '\xa0': '$ 257.80'}

{'Catalog link': '',
 'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
             'Accessories / Sander Accessories',
 'Description': '',
 'Each': '$ 8.19',
 'Info': '<p><strong>material: </strong>Graphite</p>',
 'Line art': '',
 'Name': 'Replacement Plate for MKT9924DB Belt Sander',
 'Product number': 'MKT4230366',
 'Technical specifications': '<p><strong>brand: </strong>Makita</p>',
 'image_1': 'https://www.richelieu.com/documents/docsPr/MK/T4/23/03/66/MKT4230366/1281514_700.jpg',
 '\xa0': '$ 257.80'}

{'Catalog link': '',
 'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
             'Accessories / Sander Accessories',
 'Description': '- Exclusive single-piece hub design reduces pad vibration and '
                'ensures smooth performance.',
 'Each': '$ 38.47',
 'Info': '',
 'Line art': '',
 'Name': 'Non-Grip Vacuum Pads',
 'Product number': '9154325',
 'Technical specifications': '<p><strong>thickness: </strong>3/8 '
                             'in</p><p><strong>density: '
                             '</strong>Medium</p><p><strong>nap: '
                             '</strong>Short</p>',
 'image_1': 'https://www.richelieu.com/documents/docsPr/91/54/32/5/9154325/1213330_700.jpg',
 'image_2': 'https://www.richelieu.com/documents/docsPr/91/54/32/5/9154325/1213331_700.jpg'}

{'Catalog link': '',
 'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
             'Accessories / Sander Accessories',
 'Description': '- Exclusive single-piece hub design reduces pad vibration and '
                'ensures smooth performance.',
 'Each': '$ 52.92',
 'Info': '',
 'Line art': '',
 'Name': 'Non-Grip Vacuum Pads',
 'Product number': '9154327',
 'Technical specifications': '<p><strong>thickness: </strong>3/8 '
                             'in</p><p><strong>density: '
                             '</strong>Medium</p><p><strong>nap: '
                             '</strong>Short</p>',
 'image_1': 'https://www.richelieu.com/documents/docsGr/105/122/1/1051221/1213328_700.jpg',
 'image_2': 'https://www.richelieu.com/documents/docsPr/91/54/32/7/9154327/1213332_700.jpg'}

{'Catalog link': '',
 'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
             'Accessories / Sander Accessories',
 'Description': '- Unique one-piece hub design reduces pad vibration and '
                'ensures smooth performance.',
 'Each': '$ 26.84',
 'Info': '',
 'Line art': '',
 'Name': 'Stick-on Non-Vacuum Pads',
 'Product number': '9156106',
 'Technical specifications': '<p><strong>thickness: </strong>3/8 '
                             'in</p><p><strong>density: </strong>Medium</p>',
 'image_1': 'https://www.richelieu.com/documents/docsGr/105/122/4/1051224/1213343_700.jpg',
 'image_2': 'https://www.richelieu.com/documents/docsPr/91/56/10/6/9156106/1213345_700.jpg'}

{'Catalog link': '',
 'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
             'Accessories / Sander Accessories',
 'Description': '- Unique one-piece hub design reduces pad vibration and '
                'ensures smooth performance.',
 'Each': '$ 51.70',
 'Info': '',
 'Line art': '',
 'Name': 'Stick-on Non-Vacuum Pads',
 'Product number': '9156107',
 'Technical specifications': '<p><strong>thickness: </strong>3/8 '
                             'in</p><p><strong>density: </strong>Medium</p>',
 'image_1': 'https://www.richelieu.com/documents/docsPr/91/56/10/7/9156107/1213344_700.jpg',
 'image_2': 'https://www.richelieu.com/documents/docsPr/91/56/10/7/9156107/1213346_700.jpg'}

{'Catalog link': '',
 'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
             'Accessories / Sander Accessories',
 'Description': 'Size: 2-1/2" x 14".',
 'Each': '$ 12.36',
 'Info': '',
 'Line art': '',
 'Name': 'Sandpaper Belt 2½ " x 14" for Compact Belt Sander PC371 or PC371K',
 'Product number': 'PC371K060',
 'Technical specifications': '',
 'image_1': 'https://www.richelieu.com/documents/docsPr/PC/37/1K/06/0/PC371K060/1263523_700.jpg',
 '\xa0': '$ 148.18'}

{'Catalog link': '',
 'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
             'Accessories / Sander Accessories',
 'Description': 'Size: 2-1/2" x 14".',
 'Each': '$ 12.36',
 'Info': '',
 'Line art': '',
 'Name': 'Sandpaper Belt 2½ " x 14" for Compact Belt Sander PC371 or PC371K',
 'Product number': 'PC371K080',
 'Technical specifications': '',
 'image_1': 'https://www.richelieu.com/documents/docsPr/PC/37/1K/08/0/PC371K080/1263524_700.jpg',
 '\xa0': '$ 148.18'}

{'Catalog link': '',
 'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
             'Accessories / Sander Accessories',
 'Description': 'Size: 2-1/2" x 14".',
 'Each': '$ 12.36',
 'Info': '',
 'Line art': '',
 'Name': 'Sandpaper Belt 2½ " x 14" for Compact Belt Sander PC371 or PC371K',
 'Product number': 'PC371K120',
 'Technical specifications': '',
 'image_1': 'https://www.richelieu.com/documents/docsPr/PC/37/1K/12/0/PC371K120/1263526_700.jpg',
 '\xa0': '$ 148.18'}

{'Catalog link': '',
 'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
             'Accessories / Sander Accessories',
 'Description': 'Size: 2-1/2" x 14".',
 'Each': '$ 12.36',
 'Info': '',
 'Line art': '',
 'Name': 'Sandpaper Belt 2½ " x 14" for Compact Belt Sander PC371 or PC371K',
 'Product number': 'PC371K100',
 'Technical specifications': '',
 'image_1': 'https://www.richelieu.com/documents/docsPr/PC/37/1K/10/0/PC371K100/1263525_700.jpg',
 '\xa0': '$ 148.18'}

{'Catalog link': '',
 'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
             'Accessories / Sander Accessories',
 'Description': 'Exclusive single-piece hub design reduces pad vibration and '
                'ensures smooth performance.',
 'Each': '$ 25.22',
 'Info': '',
 'Line art': '',
 'Name': '5" Non-Vacuum Disc Pad Hook-Face',
 'Product number': '91454325T',
 'Technical specifications': '',
 'image_1': 'https://www.richelieu.com/documents/docsGr/120/107/7/1201077/1419678_700.jpg'}

{'Catalog link': '',
 'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
             'Accessories / Sander Accessories',
 'Description': '- Pads mount with screws.',
 'Each': '$ 31.80',
 'Info': '',
 'Line art': '',
 'Name': 'Plates for Non-Vacuum (Grip-On) Dynabug II Disc Pads - 7.62 cm x '
         '10.79 cm (3" x 4-1/4")',
 'Product number': '9156315',
 'Technical specifications': '<p><strong>thickness: </strong>3/8 '
                             'in</p><p><strong>density: </strong>Medium</p>',
 'image_1': 'https://www.richelieu.com/documents/docsGr/116/625/4/1166254/1280825_700.jpg',
 '\xa0': '$ 179.95'}
python csv stream python-3.6
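4 Answers
2 votes

Save the data temporarily

Since your data comes from scraping, it can be treated as a stream. To mimic a stream, I use data_.pop() to get one item at a time (pop() takes items from the end of the list, which is why the output below starts with the last dict). The solution below appends each item from the stream as it arrives. The header and the body of the csv are stored in separate files. The header may grow in length over time. Rows saved before such a growth step naturally cannot know about it, so they may lack some trailing commas for the missing items.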

import csv
import os

class StreamCSV:  # Python 3
    def __init__(self, header_file_name, body_file_name):
        self.header_file_name = header_file_name
        self.fbody = open(body_file_name, 'a', newline='', encoding='utf-8')
        self.csv_body = csv.writer(self.fbody)

    def add_item(self, item):
        if os.path.exists(self.header_file_name):
            with open(self.header_file_name, 'r', newline='', encoding='utf-8') as fobj:
                reader = csv.reader(fobj)
                try:
                    current_header = next(reader)
                except StopIteration:
                    current_header = []
        else:
            current_header = []
        header_set = set(current_header)
        for key in item:
            if key not in header_set:
                current_header.append(key)
        if len(header_set) < len(current_header):
            with open(self.header_file_name, 'w', newline='', encoding='utf-8') as fobj:
                writer = csv.writer(fobj)
                writer.writerow(current_header)
        item_data = [item.get(head, '') for head in current_header]
        self.csv_body.writerow(item_data)
        self.fbody.flush()  # allows peeking into the file


if __name__ == '__main__':

    data_ = [
        {
            'a': '1',
            'b': '2',
            'c': '3',
        },
        {
            'a': '6',
            'd': '1',
            'b': '3',
        },
        {
            'c': '2',
            'e': '1',
            'f': '9',
        }
    ]

    def show_saved(file_names):
        for name in file_names:
            with open(name) as fobj:
                print(name)
                print(fobj.read())

    header_file_name, body_file_name = 'header.csv', 'body.csv'
    stream_writer = StreamCSV(header_file_name, body_file_name)

    for x in range(1, 4):
        print('step:', x)
        stream_writer.add_item(data_.pop())
        show_saved([header_file_name, body_file_name])

Output, showing how the files grow over time:

step: 1
header.csv
c,e,f

body.csv
2,1,9

step: 2
header.csv
c,e,f,a,d,b

body.csv
2,1,9
,,,6,1,3

step: 3
header.csv
c,e,f,a,d,b

body.csv
2,1,9
,,,6,1,3
3,,,1,,2

Merge the final result

You may want to merge the header and the body in an additional step, adding the missing trailing commas:

def merge_header_body(header_file_name, body_file_name, out_file_name):
    with open(header_file_name, 'r', newline='', encoding='utf-8') as fobj:
        reader = csv.reader(fobj)
        header = next(reader)

    with open(out_file_name, 'w', newline='', encoding='utf-8') as fobj_out, \
    open(body_file_name, 'r', newline='', encoding='utf-8') as fobj_in:
        reader = csv.reader(fobj_in)
        writer = csv.writer(fobj_out)
        writer.writerow(header)
        target_length = len(header)
        for row in reader:
            diff = target_length - len(row)
            row.extend([''] * diff)
            writer.writerow(row)

out_file_name = 'merged.csv'
merge_header_body(header_file_name, body_file_name, out_file_name)

Content of merged.csv:

c,e,f,a,d,b
2,1,9,,,
,,,6,1,3
3,,,1,,2

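Crash recovery

If the program crashes in between, it can pick up where it left off, because the header and the rows written so far are already on disk. Let's take the same data as before and add more rows: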

for x in range(1, 4):
    print('step:', x)
    stream_writer.add_item(data_.pop())
    show_saved([header_file_name, body_file_name])

Output:

step: 1
header.csv
c,e,f,a,d,b

body.csv
2,1,9
,,,6,1,3
3,,,1,,2
2,1,9,,,

step: 2
header.csv
c,e,f,a,d,b

body.csv
2,1,9
,,,6,1,3
3,,,1,,2
2,1,9,,,
,,,6,1,3

step: 3
header.csv
c,e,f,a,d,b

body.csv
2,1,9
,,,6,1,3
3,,,1,,2
2,1,9,,,
,,,6,1,3
3,,,1,,2

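4 votes

Edit-1 (26 Dec): updated the code to use data based on yours

Based on your requirements, I would suggest the following:

  • Write the headers to a headers.csv file
  • Write the data to a data.csv file
  • If you want to read or send the file, just combine the two files into one
  • At program start, read the existing headers.csv file and create a field-to-index mapping
  • When you come across a new key in the data, update the header mapping with a new index and rewrite headers.csv
  • When writing a dict's data, use the header mapping to build the row

Below is a quick and dirty POC that works fine for me: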

import csv

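try:
    f = open("headers.csv", mode="r+", newline="", encoding="utf-8")
except FileNotFoundError:
    f = open("headers.csv", mode="w+", newline="", encoding="utf-8")

# newline="" lets the csv module control the line endings itself
f2 = open("data.csv", mode="a+", newline="", encoding="utf-8")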
f.seek(0)
headers = f.readline().strip().split(",")
if headers == ['']:
    headers = []

headers_map = {}

for index, field in enumerate(headers):
    headers_map[field] = index


def update_header_dict(data):
    updated_headers = False
    for key in data.keys():
        if key not in headers_map:
            new_index = len(headers_map)
            headers_map[key] = new_index
            updated_headers = True

    if updated_headers:
        f.seek(0)
        csv.DictWriter(f, headers_map.keys()).writeheader()
        f.flush()


def get_row_data_dict(data):
    row_data = [""] * len(headers_map)

    for k, v in data.items():
        # if v and v[0] in ('=', '-'):
        #     # Mark the value as text, only needed if you want to display data in excel
        #     # else should be commented out
        #     v = "'" + v
        row_data[headers_map[k]] = v

    return row_data


def main(data):
    data_writer = csv.writer(f2)
    for row in data:
        update_header_dict(row)
        data_writer.writerow(get_row_data_dict(row))


data_ = [
    {'Catalog link': '',
     'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
                 'Accessories / Sander Accessories',
     'Description': 'Exclusive single-piece hub design reduces pad vibration and '
                    'ensures smooth performance.',
     'Each': '$ 24.70',
     'Info': '',
     'Line art': '',
     'Name': '(5") Non-Vacuum Disc Pad Vinyl-Face',
     'Product number': '91456106T',
     'Technical specifications': '',
     'image_1': 'https://www.richelieu.com/documents/docsGr/120/107/6/1201076/1419675_700.jpg'},

    {'Catalog link': '',
     'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
                 'Accessories / Sander Accessories',
     'Description': '',
     'Each': '$ 8.19',
     'Info': '<p><strong>material: </strong>Cork</p>',
     'Line art': '',
     'Name': 'Replacement Plate for MKT9924DB Belt Sander',
     'Product number': 'MKT4230358',
     'Technical specifications': '<p><strong>brand: </strong>Makita</p>',
     'image_1': 'https://www.richelieu.com/documents/docsGr/116/631/4/1166314/1281513_700.jpg',
     '\xa0': '$ 257.80'},

    {'Catalog link': '',
     'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
                 'Accessories / Sander Accessories',
     'Description': '',
     'Each': '$ 8.19',
     'Info': '<p><strong>material: </strong>Graphite</p>',
     'Line art': '',
     'Name': 'Replacement Plate for MKT9924DB Belt Sander',
     'Product number': 'MKT4230366',
     'Technical specifications': '<p><strong>brand: </strong>Makita</p>',
     'image_1': 'https://www.richelieu.com/documents/docsPr/MK/T4/23/03/66/MKT4230366/1281514_700.jpg',
     '\xa0': '$ 257.80'},

    {'Catalog link': '',
     'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
                 'Accessories / Sander Accessories',
     'Description': '- Exclusive single-piece hub design reduces pad vibration and '
                    'ensures smooth performance.',
     'Each': '$ 38.47',
     'Info': '',
     'Line art': '',
     'Name': 'Non-Grip Vacuum Pads',
     'Product number': '9154325',
     'Technical specifications': '<p><strong>thickness: </strong>3/8 '
                                 'in</p><p><strong>density: '
                                 '</strong>Medium</p><p><strong>nap: '
                                 '</strong>Short</p>',
     'image_1': 'https://www.richelieu.com/documents/docsPr/91/54/32/5/9154325/1213330_700.jpg',
     'image_2': 'https://www.richelieu.com/documents/docsPr/91/54/32/5/9154325/1213331_700.jpg'},

    {'Catalog link': '',
     'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
                 'Accessories / Sander Accessories',
     'Description': '- Exclusive single-piece hub design reduces pad vibration and '
                    'ensures smooth performance.',
     'Each': '$ 52.92',
     'Info': '',
     'Line art': '',
     'Name': 'Non-Grip Vacuum Pads',
     'Product number': '9154327',
     'Technical specifications': '<p><strong>thickness: </strong>3/8 '
                                 'in</p><p><strong>density: '
                                 '</strong>Medium</p><p><strong>nap: '
                                 '</strong>Short</p>',
     'image_1': 'https://www.richelieu.com/documents/docsGr/105/122/1/1051221/1213328_700.jpg',
     'image_2': 'https://www.richelieu.com/documents/docsPr/91/54/32/7/9154327/1213332_700.jpg'},

    {'Catalog link': '',
     'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
                 'Accessories / Sander Accessories',
     'Description': '- Unique one-piece hub design reduces pad vibration and '
                    'ensures smooth performance.',
     'Each': '$ 26.84',
     'Info': '',
     'Line art': '',
     'Name': 'Stick-on Non-Vacuum Pads',
     'Product number': '9156106',
     'Technical specifications': '<p><strong>thickness: </strong>3/8 '
                                 'in</p><p><strong>density: </strong>Medium</p>',
     'image_1': 'https://www.richelieu.com/documents/docsGr/105/122/4/1051224/1213343_700.jpg',
     'image_2': 'https://www.richelieu.com/documents/docsPr/91/56/10/6/9156106/1213345_700.jpg'},

    {'Catalog link': '',
     'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
                 'Accessories / Sander Accessories',
     'Description': '- Unique one-piece hub design reduces pad vibration and '
                    'ensures smooth performance.',
     'Each': '$ 51.70',
     'Info': '',
     'Line art': '',
     'Name': 'Stick-on Non-Vacuum Pads',
     'Product number': '9156107',
     'Technical specifications': '<p><strong>thickness: </strong>3/8 '
                                 'in</p><p><strong>density: </strong>Medium</p>',
     'image_1': 'https://www.richelieu.com/documents/docsPr/91/56/10/7/9156107/1213344_700.jpg',
     'image_2': 'https://www.richelieu.com/documents/docsPr/91/56/10/7/9156107/1213346_700.jpg'},

    {'Catalog link': '',
     'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
                 'Accessories / Sander Accessories',
     'Description': 'Size: 2-1/2" x 14".',
     'Each': '$ 12.36',
     'Info': '',
     'Line art': '',
     'Name': 'Sandpaper Belt 2½ " x 14" for Compact Belt Sander PC371 or PC371K',
     'Product number': 'PC371K060',
     'Technical specifications': '',
     'image_1': 'https://www.richelieu.com/documents/docsPr/PC/37/1K/06/0/PC371K060/1263523_700.jpg',
     '\xa0': '$ 148.18'},

    {'Catalog link': '',
     'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
                 'Accessories / Sander Accessories',
     'Description': 'Size: 2-1/2" x 14".',
     'Each': '$ 12.36',
     'Info': '',
     'Line art': '',
     'Name': 'Sandpaper Belt 2½ " x 14" for Compact Belt Sander PC371 or PC371K',
     'Product number': 'PC371K080',
     'Technical specifications': '',
     'image_1': 'https://www.richelieu.com/documents/docsPr/PC/37/1K/08/0/PC371K080/1263524_700.jpg',
     '\xa0': '$ 148.18'},

    {'Catalog link': '',
     'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
                 'Accessories / Sander Accessories',
     'Description': 'Size: 2-1/2" x 14".',
     'Each': '$ 12.36',
     'Info': '',
     'Line art': '',
     'Name': 'Sandpaper Belt 2½ " x 14" for Compact Belt Sander PC371 or PC371K',
     'Product number': 'PC371K120',
     'Technical specifications': '',
     'image_1': 'https://www.richelieu.com/documents/docsPr/PC/37/1K/12/0/PC371K120/1263526_700.jpg',
     '\xa0': '$ 148.18'},

    {'Catalog link': '',
     'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
                 'Accessories / Sander Accessories',
     'Description': 'Size: 2-1/2" x 14".',
     'Each': '$ 12.36',
     'Info': '',
     'Line art': '',
     'Name': 'Sandpaper Belt 2½ " x 14" for Compact Belt Sander PC371 or PC371K',
     'Product number': 'PC371K100',
     'Technical specifications': '',
     'image_1': 'https://www.richelieu.com/documents/docsPr/PC/37/1K/10/0/PC371K100/1263525_700.jpg',
     '\xa0': '$ 148.18'},

    {'Catalog link': '',
     'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
                 'Accessories / Sander Accessories',
     'Description': 'Exclusive single-piece hub design reduces pad vibration and '
                    'ensures smooth performance.',
     'Each': '$ 25.22',
     'Info': '',
     'Line art': '',
     'Name': '5" Non-Vacuum Disc Pad Hook-Face',
     'Product number': '91454325T',
     'Technical specifications': '',
     'image_1': 'https://www.richelieu.com/documents/docsGr/120/107/7/1201077/1419678_700.jpg'},

    {'Catalog link': '',
     'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
                 'Accessories / Sander Accessories',
     'Description': '- Pads mount with screws.',
     'Each': '$ 31.80',
     'Info': '',
     'Line art': '',
     'Name': 'Plates for Non-Vacuum (Grip-On) Dynabug II Disc Pads - 7.62 cm x '
             '10.79 cm (3" x 4-1/4")',
     'Product number': '9156315',
     'Technical specifications': '<p><strong>thickness: </strong>3/8 '
                                 'in</p><p><strong>density: </strong>Medium</p>',
     'image_1': 'https://www.richelieu.com/documents/docsGr/116/625/4/1166254/1280825_700.jpg',
     '\xa0': '$ 179.95'}
]

data2_ = [
    {
        'a': '2',
        'f': '1',
        'z': '9',
    },
]

main(data_)
# main(data2_)

f.close()
f2.close()

Running the above generates the two files; I then ran this in the terminal:

cat headers.csv data.csv > output.csv

and then opened output.csv in Excel.
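
Where cat is not available, or when rows written before the header grew should be padded to the full width, the two files could also be combined in Python. This is only a minimal sketch: it assumes the headers.csv and data.csv files produced above, and combine_files is a hypothetical name that is not part of the POC.

```python
import csv


def combine_files(header_path, data_path, out_path):
    """Write the header row followed by all data rows, padded to header width."""
    with open(header_path, newline='', encoding='utf-8') as f:
        header = next(csv.reader(f), [])

    with open(data_path, newline='', encoding='utf-8') as f_in, \
            open(out_path, 'w', newline='', encoding='utf-8') as f_out:
        writer = csv.writer(f_out)
        writer.writerow(header)
        for row in csv.reader(f_in):
            # Rows written before a new column appeared are shorter than the header.
            row.extend([''] * (len(header) - len(row)))
            writer.writerow(row)
```

It would be called as combine_files('headers.csv', 'data.csv', 'output.csv').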

[Screenshot: Excel Data]

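The only issue you may see is #NAME? cells; these appear because Excel tries to evaluate text that starts with - as a formula. If you want such text to be handled properly, uncomment the following part of the code: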

    # if v and v[0] in ('=', '-'):
    #     # Mark the value as text, only needed if you want to display data in excel
    #     # else should be commented out
    #     v = "'" + v
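
The same check could also be pulled out into a small helper, so it can be switched on or off in one place. A sketch only: mark_as_text is a hypothetical name, and the set of prefixes simply follows the commented-out snippet above.

```python
def mark_as_text(value):
    """Prefix values that Excel would try to evaluate with an apostrophe."""
    if value and value[0] in ('=', '-'):
        return "'" + value
    return value


print(mark_as_text('- Pads mount with screws.'))
print(mark_as_text('$ 8.19'))
```

Values that do not start with = or -, including empty strings, are returned unchanged.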

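2 votes

As it was said, I would use a "write to file" system with a flushing mechanism. If you don't mind using pandas, the simplest way, IMHO, is to change your main function as follows: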

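import pandas as pd


def main(data):
    df = pd.DataFrame()
    for item in data:
        df_current = pd.DataFrame.from_dict(item, orient='index').T
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        df = pd.concat([df, df_current])
        df.to_csv(RESULT_FILE, index=False)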

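This way RESULT_FILE is rewritten from the updated DataFrame after every item, without needing to know the complete header in advance.

To improve performance, you can add a condition so that the file is written only once every n items: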

import pandas as pd


def main(data):
    df = pd.DataFrame()
    chunksize = 5
    for i, item in enumerate(data):
        df_c = pd.DataFrame.from_dict(item, orient='index').T
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        df = pd.concat([df, df_c])
        if ((i%chunksize)==0 or i==(len(data)-1)):
            df.to_csv(RESULT_FILE, index=False)

As for memory issues, I'd suggest passing the originally scraped data to this function as iterators rather than lists, to reduce memory consumption.
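
As a rough illustration of the iterator idea, the scraped items could come from a generator and be grouped into chunks as they arrive. This is a sketch under assumptions: scrape_items stands in for the real scraper and in_chunks is a hypothetical helper; neither appears in the answer above.

```python
from itertools import islice


def scrape_items():
    """Hypothetical stand-in for a scraper: yields one dict at a time."""
    yield {'a': '1', 'b': '2', 'c': '3'}
    yield {'a': '6', 'd': '1', 'b': '3'}
    yield {'c': '2', 'e': '1', 'f': '9'}


def in_chunks(items, size):
    """Yield lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


for chunk in in_chunks(scrape_items(), 2):
    print(len(chunk), 'items ready to be written')
```

Note that a generator has no len(), so the i==(len(data)-1) check in the chunked version would not work on it; grouping like this yields the last, possibly shorter, chunk without that check.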


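1 vote

Maybe it's a bit over the top, but for me it is the simplest way to solve the problem. It makes use of sqlite and its ability to add columns to a table at any time. Note that I did not test it exhaustively.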

#!/bin/env python
from os import path

import sqlite3
import atexit

how_many = 0


class DB(object):
    db_file = "data.db"

    def __init__(self):
        self._fieldnames = set(["ignore_field"])
        self._cursor = None
        self._db_conn = None
        create = False

        if not path.isfile(self.db_file):
            create = True
        self._db_conn = sqlite3.connect(self.db_file)
        self._cursor = self._db_conn.cursor()

        if create:
            self._cursor.execute("""CREATE TABLE data (ignore_field integer)""")
        else:
            # retrieve already existing fieldnames so we can continue
            pragma = self._db_conn.execute("pragma table_info('data')").fetchall()
            self._fieldnames = set([x[1] for x in pragma])

    def _add_fields(self, field_list):
        for field in field_list:
            if field not in self._fieldnames:
                self._cursor.execute("alter table data add column '%s' 'TEXT'" % field)
                self._fieldnames.add(field)

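    def _insert_data(self, data):
        # Quote the column names (they may contain spaces) and bind the values
        # as parameters instead of formatting them into the SQL string.
        # dict.items()/values() replace the Python 2 only iteritems().
        fields = ['"{}"'.format(f.replace('"', '""')) for f in data]
        placeholders = ", ".join("?" for _ in data)
        sql = """insert into data ({}) values ({})""".format(", ".join(fields), placeholders)
        self._db_conn.execute(sql, list(data.values()))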

    def consume(self, one_dict):
        self._add_fields(one_dict.keys())
        self._insert_data(one_dict)
        self._db_conn.commit()

    def csv_out(self):
        self._cursor.execute("select * from data")
        header = [x[0] for x in self._cursor.description]
        print(",".join(header))
        for row in self._cursor:
            out = []
            for field in row:
                out.append(field if field else "")
            print(",".join(out))


def cleanup(total):
    print("Ended after record {}/{}".format(how_many, total))


def main(data):
    global how_many
    atexit.register(cleanup, len(data))

    db = DB()

    skip = False
    if how_many:
        skip = how_many

    for each in data:
        if not skip:
            db.consume(each)
        else:
            skip -= 1
            if not skip:
                print("Finished skipping {} records.".format(how_many))

        how_many += 1

    print("Completed loading available data.")

    db.csv_out()


if __name__ == "__main__":
    data_ = [
        {
            'a': '1',
            'b': '2',
            'c': '3',
        },
        {
            'a': '6',
            'd': '1',
            'b': '3',
        },
        {
            'c': '2',
            'e': '1',
            'f': '9',
        }
    ]

    main(data_)

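If you modify how_many, the main loop skips that many records. This lets you recover from a crash, because the atexit hook should tell you how far the program got.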
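
The skipping itself could be written more compactly with itertools.islice. This is just a sketch: remaining is a hypothetical helper, and how_many is taken to be the number of records already stored.

```python
from itertools import islice


def remaining(data, how_many):
    """Return an iterator over `data` that skips the first `how_many` records."""
    return islice(data, how_many, None)


records = ['r0', 'r1', 'r2', 'r3']
print(list(remaining(records, 2)))  # ['r2', 'r3']
```

With how_many set to 0 nothing is skipped, so the same call covers a fresh start as well.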

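There is also a bogus column/field name, because you can't create an empty table and I got lazy and didn't tie the table creation to the first iteration of DB.consume(). You can always replace "ignore_field" with one of your existing fields.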
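
One way the table creation might be tied to the first record, so that no placeholder column is needed, is sketched below. The helper names quote and ensure_columns are hypothetical, and the example uses an in-memory database rather than data.db.

```python
import sqlite3


def quote(name):
    """Quote a column name for SQLite, doubling embedded double quotes."""
    return '"{}"'.format(name.replace('"', '""'))


def ensure_columns(conn, known, record):
    """Create the table from the first record, then add columns as they appear."""
    new = [key for key in record if key not in known]
    if not known and new:
        columns = ", ".join(quote(key) + " TEXT" for key in new)
        conn.execute("CREATE TABLE data ({})".format(columns))
    else:
        for key in new:
            conn.execute("ALTER TABLE data ADD COLUMN {} TEXT".format(quote(key)))
    known.extend(new)


conn = sqlite3.connect(":memory:")
known = []
ensure_columns(conn, known, {'a': '1', 'b': '2'})
ensure_columns(conn, known, {'a': '6', 'd': '1'})
print(known)  # ['a', 'b', 'd']
```

Here known is a plain list of the column names seen so far; in the class above it would take the place of self._fieldnames.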

Even lazier, I didn't do any file IO; I just print the CSV.
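
Writing the table to a file instead of printing it might look like the sketch below. It assumes an open sqlite3 connection with a data table as created above; export_csv is a hypothetical name. Using csv.writer rather than ",".join also takes care of quoting values that contain commas.

```python
import csv
import sqlite3


def export_csv(conn, out_path):
    """Dump the `data` table to a CSV file, writing NULL as an empty field."""
    cursor = conn.execute("SELECT * FROM data")
    header = [column[0] for column in cursor.description]
    with open(out_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for row in cursor:
            writer.writerow(['' if field is None else field for field in row])
```

It would be called as export_csv(conn, 'test_result.csv') once all records have been consumed.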
