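I need to read a CSV into a list of tuples, while filtering the rows by value (>= 0.75) and converting the columns to different types. Please note that you can't use pandas. NO pandas!
I'm trying to find the fastest way to do this.
This is how I did it (I think it's inefficient):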
def load_csv_to_list(path):
    with open(path) as csv_file:
        table = list(reader(csv_file))
    lst = [table[0]]
    count = 0
    for row in table[1:]:
        if float(row[2]) >= 0.75:
            date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
            row = (date,int(row[1]),float(row[2]))
            lst.append(row)
    return (lst)
start = timeit.timeit()
load_csv_to_list(path)
end = timeit.timeit()
print(start - end)
Output: 0.00013872199997422285
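The original code performs the same float(row[2]) conversion twice. In my tests, assigning the converted value to a variable and reusing it later gave a slight performance improvement. Using the walrus operator :=, introduced in Python 3.8, improved it a little further. Batch processing and memory-mapping the data file gave the best performance of the variants I tried.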
def load_variable(path):
    with open(path) as csv_file:
        table = list(reader(csv_file))
    lst = [table[0]]
    for row in table[1:]:
        float_two = float(row[2])
        if float_two >= 0.75:
            date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
            row = (date, int(row[1]), float_two)
            lst.append(row)
    return lst
def load_walrus(path):
    with open(path) as csv_file:
        table = list(reader(csv_file))
    lst = [table[0]]
    for row in table[1:]:
        if (float_two := float(row[2])) >= 0.75:
            date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
            row = (date, int(row[1]), float_two)
            lst.append(row)
    return lst
Timings for loading a 1,000,000-row CSV file:
Function Name | Fastest | Slowest | Average |
load_csv_to_list | 6.36s | 6.69s | 6.47s |
load_variable | 6.10s | 6.65s | 6.44s |
load_walrus | 5.95s | 6.57s | 6.29s |
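The harness used to collect these numbers is not shown. As a rough sketch, fastest/slowest/average figures of this kind could be gathered with `timeit.repeat`. The helper names `benchmark` and `summarize`, the repeat count, and the stand-in workload are assumptions for illustration, not the setup actually used. Note also that calling `timeit.timeit()` with no arguments, as in the question, times an empty `pass` statement rather than the code between the two calls, so the value printed there says nothing about the loader.

```python
import timeit


def benchmark(func, *args, repeat=5):
    """Time func(*args) `repeat` times; return (fastest, slowest, average) in seconds."""
    # number=1 makes every measurement a single call, which suits multi-second loaders
    timings = timeit.repeat(lambda: func(*args), number=1, repeat=repeat)
    return min(timings), max(timings), sum(timings) / len(timings)


def summarize(name, stats):
    """Format one row in the same style as the timing table."""
    fastest, slowest, average = stats
    return f"{name} | {fastest:.2f}s | {slowest:.2f}s | {average:.2f}s |"


if __name__ == "__main__":
    # Hypothetical stand-in workload; replace with e.g. benchmark(load_walrus, "sample_data.csv")
    print(summarize("sum_range", benchmark(sum, range(100_000))))
```

Passing each loader and the CSV path to `benchmark` would yield one table row per function. Absolute numbers will differ between machines.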
As a further experiment, I implemented a function that processes the data in batches.
def batch_walrus(path, batch_size=1000):
    lst = []
    with open(path) as csv_file:
        csv_reader = reader(csv_file)
        header = next(csv_reader)  # Read the header
        lst.append(header)  # Add the header to the result list
        batch = []
        for row in csv_reader:
            # Check the condition and convert the date
            if (two := float(row[2])) >= 0.75:
                date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
                batch.append((date, int(row[1]), two))
            # Once the batch is full, move it into the result list
            if len(batch) == batch_size:
                lst.extend(batch)
                batch = []
        # Flush the rows left over in the final, partially filled batch
        lst.extend(batch)
    return lst
Updated timing information:
Function Name | Fastest | Slowest | Average |
load_csv_to_list | 6.36s | 6.69s | 6.47s |
load_variable | 6.10s | 6.65s | 6.44s |
load_walrus | 5.95s | 6.57s | 6.29s |
batch_walrus | 5.69s | 5.89s | 5.79s |
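Python's mmap module provides memory-mapped file I/O. It uses lower-level operating system facilities to read a file as if it were one large string/array. This version of the function uses decode("utf-8") to decode the contents of mmapped_file into a string before creating the csv.reader.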
from csv import reader
from datetime import datetime
import mmap

def load_mmap_walrus(path):
    lst = []
    with open(path, "r") as csv_file:
        # Memory-map the file, size 0 means the entire file
        with mmap.mmap(csv_file.fileno(), 0, access=mmap.ACCESS_READ) as mmapped_file:
            # Decode the bytes-like object to a string
            content = mmapped_file.read().decode("utf-8")
            # Create a CSV reader from the decoded string
            csv_reader = reader(content.splitlines())
            header = next(csv_reader)  # Read the header
            lst.append(header)  # Add the header to the result list
            for row in csv_reader:
                # Check the condition and convert the date
                if (two := float(row[2])) >= 0.75:
                    date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
                    lst.append((date, int(row[1]), two))
    # The with block closes the memory-mapped file, so no explicit close() is needed
    return lst
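The function above reads and decodes the whole mapping in one go. As a hedged alternative, the sketch below walks the mapping one line at a time, so only a single decoded line is held in addition to the result list. The name `load_mmap_lines` and the `threshold` parameter are hypothetical, and this variant was not part of the timings, so whether it is faster or slower is untested.

```python
import csv
import mmap
from datetime import datetime


def load_mmap_lines(path, threshold=0.75):
    """Hypothetical variant: decode the memory-mapped file line by line."""
    lst = []
    # Open in binary mode for mapping; note that an empty file cannot be mapped
    with open(path, "rb") as csv_file:
        with mmap.mmap(csv_file.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # iter(callable, sentinel) keeps calling mm.readline() until it returns b""
            lines = (raw.decode("utf-8") for raw in iter(mm.readline, b""))
            csv_reader = csv.reader(lines)
            lst.append(next(csv_reader))  # Header row
            for row in csv_reader:
                # Keep only rows whose third column meets the threshold
                if (two := float(row[2])) >= threshold:
                    date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
                    lst.append((date, int(row[1]), two))
    return lst
```

Like the version above, this assumes no quoted field contains an embedded newline, since the file is split on physical lines before the CSV parser sees it.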
Updated timing information:
Function Name | Fastest | Slowest | Average |
load_csv_to_list | 6.36s | 6.69s | 6.47s |
load_variable | 6.10s | 6.65s | 6.44s |
load_walrus | 5.95s | 6.57s | 6.29s |
batch_walrus | 5.69s | 5.89s | 5.79s |
load_mmap_walrus | 5.49s | 5.68s | 5.57s |
Code used to generate the 1,000,000 rows of CSV data:
import csv
import random
from datetime import datetime, timedelta

# Function to generate a random date within a range
def random_date(start_date, end_date):
    delta = end_date - start_date
    random_days = random.randint(0, delta.days)
    return start_date + timedelta(days=random_days)

# Generate sample data
start_date = datetime(2000, 1, 1)
end_date = datetime(2023, 12, 31)
with open("sample_data.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["Date", "Integer", "Float"])
    for _ in range(1_000_000):
        date = random_date(start_date, end_date).strftime("%d/%m/%Y")
        integer = random.randint(0, 100)
        float_num = round(random.uniform(0, 1), 2)
        writer.writerow([date, integer, float_num])