How can I keep extracting data from a PDF table in Python using pdfplumber?

Question

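I am currently using Python to extract data from tables in a PDF, specifically F1 lap time data, which is provided as a PDF that looks like this: f1 laptimedata. I use pdfplumber to extract the table data and then process it in Python to build, for each driver, a list of dictionaries mapping lap number to lap time, so that I can do further processing on this information.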

Currently, my code looks like this:

import pdfplumber
import re

# Predefined list of drivers
drivers_list = ["Max VERSTAPPEN", "Daniel RICCIARDO", "Nicholas LATIFI", "Lewis HAMILTON", "Lando NORRIS", "Sebastian VETTEL", "Nicholas LATIFI", "Pierre GASLY", "Sergio PEREZ", "Fernando ALONSO", "Charles LECLERC", "George RUSSELL", "Alexander ALBON", "Lance STROLL", "Kevin MAGNUSSEN", "Yuki TSUNODA", "ZHOU Guanyu", "Esteban OCON"]

# Initialize a dict for lap times
driver_lap_times = {driver: [] for driver in drivers_list}

# Define a pattern to detect the start of a lap time section
pattern = re.compile(r'(\d+)\s+(\d{1,2}:\d{2}:\d{2}|\d{1,2}:\d{2}\.\d{3})')

# Extract text from the PDF
pdf_path = "Lap Analysis - SIN.pdf"
with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        
        # Process text to split by driver
        lines = text.split('\n')
        current_driver = None
        current_lap_times = []
        in_lap_time_section = False

        for line in lines:
            line = line.strip()
            
            # Check if the line contains a driver's name
            driver_found = False
            for driver in drivers_list:
                if driver in line:
                    if current_driver:
                        # Save lap times for the previous driver
                        driver_lap_times[current_driver].extend(current_lap_times)

                    current_driver = driver
                    current_lap_times = []
                    driver_found = True
                    in_lap_time_section = False
                    break
            
            if driver_found:
                continue

            # Determine if the line is part of a lap time section
            if "LAP TIME" in line:
                in_lap_time_section = True
                continue 

            # Collect lap times if in a lap time section and a driver is currently identified
            if in_lap_time_section and current_driver:
                match = pattern.match(line)
                if match:
                    lap_number, lap_time = match.groups()
                    if lap_number.isdigit() or 'P' in lap_number:  # Handle 'P' (pit) laps
                        current_lap_times.append({lap_number: lap_time})

        # Save lap times for the last driver in the current page
        if current_driver:
            driver_lap_times[current_driver].extend(current_lap_times)
#Print by driver
for driver, laps in driver_lap_times.items():
    formatted_laps = ', '.join(f"{{'{lap_number}': '{lap_time}'}}" for lap in laps for lap_number, lap_time in lap.items())
    print(f"{driver}: [{formatted_laps}]")

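In a slightly hit-and-miss way, this produces output that almost works. It does not find every driver. For the drivers it does find, the information appears to be correct, but it stops at lap 30 for each of them. The output looks like this: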

Max VERSTAPPEN: [{'1': '21:11:14'}, {'2': '2:04.389'}, {'3': '2:03.369'}, {'4': '2:03.238'}, {'5': '2:02.703'}, {'6': '2:03.027'}, {'7': '2:03.289'}, {'8': '2:23.240'}, {'9': '2:42.690'}, {'10': '2:38.596'}, {'11': '2:01.612'}, {'12': '2:00.967'}, {'13': '2:01.842'}, {'14': '2:01.558'}, {'15': '2:01.407'}, {'16': '2:01.138'}, {'17': '2:00.909'}, {'18': '2:00.807'}, {'19': '2:00.520'}, {'20': '2:00.559'}, {'21': '2:09.641'}, {'22': '2:35.288'}, {'23': '1:58.377'}, {'24': '1:58.784'}, {'25': '1:59.689'}, {'26': '2:22.695'}, {'27': '2:00.464'}, {'28': '2:13.500'}, {'29': '2:40.814'}, {'30': '2:08.953'}]
Daniel RICCIARDO: []
Nicholas LATIFI: [{'1': '21:11:11'}, {'2': '2:05.790'}, {'3': '2:04.098'}, {'4': '2:03.184'}, {'5': '2:03.366'}, {'6': '2:03.052'}, {'7': '2:03.297'}, {'8': '2:21.496'}, {'9': '2:43.215'}, {'10': '2:39.865'}, {'11': '2:04.900'}, {'12': '2:02.910'}, {'13': '2:02.938'}, {'14': '2:02.701'}, {'15': '2:02.067'}, {'16': '2:01.664'}, {'17': '2:01.690'}, {'18': '2:01.285'}, {'19': '2:01.333'}, {'20': '2:01.365'}, {'21': '2:14.128'}, {'22': '2:30.735'}, {'23': '1:59.802'}, {'24': '2:00.070'}, {'25': '1:59.714'}, {'26': '2:21.882'}, {'27': '1:59.805'}, {'28': '2:17.203'}, {'29': '2:37.487'}, {'30': '2:07.022'}]
Lando NORRIS: []

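Setting aside for now that it does not find every driver and append their data, how do I get it to go beyond lap 30 for the drivers it does find? Am I missing something very obvious? Also, if anyone has suggestions as to why only the first driver on each page of data is found, I would greatly appreciate them!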

I am keen to keep using pdfplumber because it is compatible with Python 3.12 and has been highly accurate for the data I have managed to extract.

python python-3.x pdfplumber
1 Answer

I improved your code; it now finds all drivers and all lap times.
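A likely reason the original stopped at lap 30 (an inference from the code, since the PDF itself is not shown here): `pattern.match(line)` only matches at the very start of each line, so if a text line holds several (lap, time) pairs side by side, only the first pair is captured. `re.findall` returns every pair on the line. The sketch below illustrates this on a made-up line. The line layout, the `chunk_pairs` helper, and the driver order are assumptions for illustration, and it stores laps in a dict keyed by lap number instead of using `list.insert`, so the result does not depend on the order in which laps are encountered.

```python
import re
from itertools import islice

# Same timing pattern as in the script: lap number, whitespace, then HH:MM:SS or M:SS.mmm
timing_pattern = re.compile(r'(\d+)\s+(\d\d?:\d\d:\d\d|\d\d?:\d\d\.\d{3})')

# Hypothetical text line: two drivers side by side, two (lap, time) columns each
line = "1 21:11:14 31 1:59.035 1 21:11:15 31 1:59.291"

# match() only looks at the start of the line, so it yields a single pair
first_only = timing_pattern.match(line).groups()

# findall() returns every (lap, time) pair found on the line
all_pairs = timing_pattern.findall(line)


def chunk_pairs(pairs, size=2):
    """Yield consecutive groups of `size` pairs, one group per driver (hypothetical helper)."""
    iterator = iter(pairs)
    while chunk := list(islice(iterator, size)):
        yield chunk


# Hypothetical driver order for this line
drivers = ["Max VERSTAPPEN", "Daniel RICCIARDO"]
laps = {driver: {} for driver in drivers}
for driver, chunk in zip(drivers, chunk_pairs(all_pairs)):
    for lap_number, lap_time in chunk:
        # Key by lap number so ordering does not matter
        laps[driver][int(lap_number)] = lap_time

print(first_only)
print(laps)
```

The full script below applies the same `findall` and `islice` idea to every line of every page.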

import pdfplumber
import re
from itertools import groupby, islice

# Pattern to match timing formats (HH:MM:SS or MM:SS.sss)
timing_pattern = r'(\d+)\s+(\d\d?:\d\d:\d\d|\d\d?:\d\d\.\d{3})'

# Predefined list of drivers
drivers_list = ["Max VERSTAPPEN", "Daniel RICCIARDO", "Nicholas LATIFI", "Lewis HAMILTON", "Lando NORRIS", "Sebastian VETTEL", "Nicholas LATIFI", "Pierre GASLY", "Sergio PEREZ", "Fernando ALONSO", "Charles LECLERC", "George RUSSELL", "Alexander ALBON", "Lance STROLL", "Kevin MAGNUSSEN", "Yuki TSUNODA", "ZHOU Guanyu", "Esteban OCON", "Mick SCHUMACHER", "Carlos SAINZ", "George RUSSELL", "Valtteri BOTTAS"]

# Initialize a dict for lap times
driver_lap_times = {driver: [] for driver in drivers_list}

# Extract text from the PDF
pdf_path = "Lap Analysis - SIN.pdf"
with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        lines = text.split('\n')

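        current_drivers = []
        for line in lines:
            line = line.strip()

            if any(driver_name in line for driver_name in drivers_list):
                # Header line: assumed to alternate car numbers and driver names.
                # Split it into runs of digits and non-digits and keep the non-digit runs.
                current_drivers = ["".join(g).strip() for _, g in groupby(line, key=str.isdigit)][1::2]
                continue
            elif ":" in line:
                # Drop the pit marker "P" so each lap number is followed directly by its time
                iterator = iter(re.findall(timing_pattern, line.replace("P", "")))
                cnt = 0
                # Take the (lap, time) pairs two at a time; each pair of columns belongs to one driver
                while chunk := list(islice(iterator, 2)):
                    for lap_number, lap_time in chunk:
                        driver_lap_times[current_drivers[cnt]].insert(int(lap_number) - 1, lap_time)
                    cnt += 1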

Result:

"Max VERSTAPPEN": ["21:11:14", "2:04.389", "2:03.369", "2:03.238", "2:02.703", "2:03.027", "2:03.289", "2:23.240", "2:42.690", "2:38.596", "2:01.612", "2:00.967", "2:01.842", "2:01.558", "2:01.407", "2:01.138", "2:00.909", "2:00.807", "2:00.520", "2:00.559", "2:09.641", "2:35.288", "1:58.377", "1:58.784", "1:59.689", "2:22.695", "2:00.464", "2:13.500", "2:40.814", "2:01.087", "2:08.953", "1:59.035", "1:58.925", "1:59.987", "1:59.730", "2:05.082", "2:55.179", "2:37.134", "2:47.754", "2:28.080", "2:13.541", "2:18.778", "1:53.170", "1:51.370", "1:50.616", "1:50.049", "1:51.824", "1:51.455", "1:50.508", "1:49.944", "1:50.250", "1:49.846", "1:49.142", "1:49.979", "1:50.890", "1:50.878", "1:50.652", "1:50.934", "1:50.597", "1:51.068"]
"Daniel RICCIARDO": ["21:11:15", "2:08.435", "2:05.767", "2:04.551", "2:04.656", "2:03.973", "2:03.816", "2:22.957", "2:38.839", "2:36.441", "2:03.032", "2:02.862", "2:03.083", "2:03.193", "2:02.900", "2:02.186", "2:02.632", "2:02.034", "2:01.884", "2:01.471", "2:16.073", "2:29.699", "2:01.076", "2:00.897", "2:00.839", "2:23.917", "2:00.418", "2:20.556", "2:37.841", "2:08.887", "1:59.291", "1:59.098", "1:59.323", "1:58.690", "1:59.070", "2:33.600", "2:56.617", "2:38.596", "2:24.180", "1:58.500", "1:56.357", "1:55.744", "1:55.095", "1:54.728", "1:53.419", "1:53.508", "1:51.975", "1:52.198", "1:52.162", "1:51.589", "1:51.688", "1:51.621", "1:51.076", "1:51.061", "1:51.197", "1:51.176", "1:51.006", "1:52.526", "1:52.080"]
"Nicholas LATIFI": ["21:11:20", "2:09.155", "2:07.657", "2:03.402", "2:03.659", "2:03.653", "2:03.128", "2:19.609", "2:43.943", "2:41.360", "2:04.843", "2:03.028", "2:02.941", "2:02.308", "2:01.994", "2:01.484", "2:01.533", "2:01.284", "2:01.020", "2:00.747", "2:12.248", "2:31.864", "2:00.568", "2:00.254", "1:59.829", "2:22.811", "2:00.028", "2:16.203", "2:38.033", "2:06.887", "2:05.585", "2:06.200", "3:14.303", "2:01.565", "2:35.635", "2:38.164", "2:37.571", "2:24.004", "2:00.093", "1:56.624", "1:55.031", "1:54.342", "1:53.713", "1:55.876", "1:53.497", "1:52.934", "1:52.678", "1:54.531", "1:52.307", "1:54.699", "1:51.753", "1:51.349", "1:50.728", "1:50.836", "1:50.569", "1:51.662", "1:50.894", "1:52.655"]
"Lewis HAMILTON": ["21:11:06", "2:02.920", "2:01.712", "2:01.585", "2:01.454", "2:00.981", "2:00.460", "2:13.316", "2:59.421", "2:45.752", "2:02.192", "2:00.260", "2:00.603", "2:00.810", "2:00.576", "2:00.436", "2:00.320", "2:00.161", "1:59.706", "1:59.735", "2:06.741", "2:36.072", "2:01.904", "1:59.451", "1:59.195", "2:17.933", "2:04.014", "2:09.534", "2:38.755", "2:12.223", "2:13.306", "1:57.601", "1:57.973", "2:14.994", "2:07.369", "2:38.829", "2:36.164", "2:36.898", "2:37.773", "2:23.575", "2:00.250", "1:56.067", "1:54.858", "1:55.027", "1:54.632", "1:55.941", "1:53.348", "1:52.403", "1:52.736", "1:51.903", "1:51.363", "1:51.935", "1:50.994", "1:51.249", "1:50.798", "1:50.794", "1:50.750", "1:54.064", "1:50.622", "1:51.101"]
"Lando NORRIS": ["21:11:08", "2:04.065", "2:02.717", "2:02.313", "2:02.322", "2:02.022", "2:02.009", "2:18.617", "2:49.938", "2:43.014", "2:03.436", "2:01.307", "2:01.110", "2:01.304", "2:01.074", "2:01.026", "2:00.715", "2:00.524", "2:00.932", "2:00.536", "2:08.476", "2:36.906", "1:59.786", "1:59.302", "2:00.145", "2:22.891", "1:59.951", "2:13.595", "2:41.231", "1:58.834", "1:58.753", "1:59.385", "1:58.369", "1:58.746", "2:26.728", "3:00.514", "3:00.513", "2:29.021", "1:59.591", "1:54.999", "1:53.336", "1:53.478", "1:52.396", "1:51.109", "1:51.071", "1:50.560", "1:51.165", "1:50.139", "1:49.684", "1:49.993", "1:50.472", "1:50.253", "1:49.929", "1:50.427", "1:49.212", "1:49.749", "1:50.014", "1:50.751"]
"Sebastian VETTEL": ["21:11:11", "2:05.790", "2:04.098", "2:03.184", "2:03.366", "2:03.052", "2:03.297", "2:21.496", "2:43.215", "2:39.865", "2:04.900", "2:02.910", "2:02.938", "2:02.701", "2:02.067", "2:01.664", "2:01.690", "2:01.285", "2:01.333", "2:01.365", "2:14.128", "2:30.735", "1:59.802", "2:00.070", "1:59.714", "2:21.882", "1:59.805", "2:17.203", "2:37.487", "2:06.509", "2:07.022", "1:58.680", "1:58.943", "1:58.895", "2:06.308", "2:28.180", "2:35.422", "2:37.784", "2:38.552", "2:23.272", "1:59.228", "1:57.462", "1:55.313", "1:54.864", "1:54.654", "1:55.825", "1:53.467", "1:52.259", "1:52.233", "1:51.999", "1:51.319", "1:51.799", "1:51.396", "1:51.311", "1:51.040", "1:51.022", "1:50.759", "1:51.449", "1:50.669", "1:52.728"]
"Pierre GASLY": ["21:11:10", "2:05.132", "2:03.844", "1:59.033", "1:58.769", "2:07.555", "2:29.129"]
"Sergio PEREZ": ["21:11:01", "2:01.358", "2:00.875", "2:00.310", "2:00.267", "2:00.094", "2:00.714", "2:15.730", "3:02.306", "2:49.136", "1:59.580", "1:59.473", "1:59.434", "1:59.429", "1:59.018", "1:59.358", "1:59.238", "1:58.905", "1:58.717", "1:58.519", "1:58.780", "2:39.777", "2:03.986", "1:58.332", "1:58.161", "2:14.801", "2:06.003", "2:02.874", "2:39.800", "2:15.377", "2:17.858", "1:57.451", "1:57.603", "1:56.945", "1:56.267", "2:06.352", "2:43.256", "3:06.473", "3:02.549", "2:35.490", "1:56.340", "1:53.693", "1:52.701", "1:51.903", "1:51.355", "1:50.538", "1:50.363", "1:50.501", "1:49.500", "1:49.189", "1:49.285", "1:49.565", "1:48.841", "1:48.578", "1:48.576", "1:48.645", "1:48.251", "1:48.165", "1:49.009", "1:49.652"]
"Fernando ALONSO": ["21:11:09", "2:04.847", "2:03.306", "2:02.935", "2:02.858", "2:02.447", "2:02.399", "2:17.774", "2:48.597", "2:42.370", "2:00.177", "1:59.328", "1:59.653", "1:59.603", "1:59.357", "1:59.359", "1:58.983", "1:59.115", "1:59.176", "1:59.107", "1:59.446", "2:39.789", "2:03.112", "1:58.443", "1:58.576", "2:15.018", "2:06.478", "2:04.950", "2:40.754", "2:02.788", "2:01.994", "2:01.844", "2:01.474", "2:01.425", "2:00.911", "2:00.964", "2:00.709", "2:00.463", "2:00.640", "1:53.302", "1:52.533", "1:51.724", "1:51.582", "1:50.328", "1:50.798", "1:50.151", "1:50.469", "1:49.177", "1:49.336", "1:50.099", "1:49.557", "1:48.839", "1:48.753", "1:49.016", "1:49.012", "1:49.069", "1:51.181", "1:49.913"]
"Charles LECLERC": ["21:11:02", "2:01.214", "2:00.939", "2:00.734", "2:00.219", "2:00.229", "2:00.138", "2:16.598", "3:02.958", "2:47.301", "1:56.259", "1:57.226", "1:56.811", "2:05.037", "2:27.541", "2:17.673", "3:01.111", "3:03.916", "2:33.178", "1:56.709"]
"George RUSSELL": ["21:11:20", "2:08.743", "2:07.696", "2:04.604", "2:02.566", "2:03.817", "2:10.503", "2:23.222", "2:38.861", "2:28.349", "2:03.247", "2:02.443", "2:02.647", "2:03.310", "2:02.758", "2:02.750", "2:02.744", "2:02.454", "2:02.174", "2:02.244", "2:27.932", "2:45.912", "2:09.869", "2:05.338", "2:13.650", "2:20.493", "2:05.212", "2:37.252", "2:36.017", "1:59.285", "2:29.586", "2:02.097", "1:59.791", "1:57.392", "1:56.177", "1:54.795", "2:14.468", "2:52.303", "2:02.336", "2:11.341", "2:19.959", "1:59.931", "3:12.235", "2:25.466", "1:55.808", "2:04.133", "1:55.047", "1:53.640", "1:50.505", "1:51.603", "1:58.702", "1:49.567", "2:01.663", "2:16.484", "1:50.349", "1:46.458", "1:51.674", "1:55.950", "1:51.903"]
"Alexander ALBON": ["21:11:21", "2:09.436", "2:07.765", "2:06.625", "2:05.513", "2:05.889", "2:04.953", "2:27.674", "2:36.241", "2:27.761", "2:03.609", "2:03.271", "2:02.665", "2:02.652", "2:03.222", "2:03.057", "2:03.125", "2:03.446", "2:03.364", "2:02.806", "2:03.093", "2:20.921", "2:29.359", "2:03.038", "2:02.121", "2:50.661"]
"Lance STROLL": ["21:11:13", "2:08.661", "2:05.367", "2:04.710", "2:04.403", "2:03.964", "2:03.323", "2:24.030", "2:37.465", "2:37.318", "2:03.997", "2:03.049", "2:03.193", "2:02.999", "2:02.272", "2:02.335", "2:02.808", "2:02.165", "2:01.428", "2:01.250", "2:16.746", "2:29.236", "1:59.592", "1:59.199", "1:59.695", "2:22.271", "1:59.002", "2:19.004", "2:37.291", "1:58.710", "2:04.724", "1:59.260", "1:58.733", "1:58.974", "1:59.151", "2:06.733", "2:55.690", "2:37.344", "2:38.824", "2:22.974", "1:59.611", "1:57.095", "1:56.169", "1:55.502", "1:54.765", "1:55.740", "1:52.756", "1:52.270", "1:51.564", "1:51.854", "1:51.786", "1:52.587", "1:51.511", "1:50.823", "1:51.337", "1:51.074", "1:50.708", "1:50.420", "1:50.283", "1:51.958"]
"Kevin MAGNUSSEN": ["21:11:14", "2:08.617", "2:05.679", "2:04.881", "2:04.360", "2:04.122", "2:11.492", "3:06.525", "2:35.888", "2:04.183", "2:01.895", "2:02.043", "2:02.299", "2:02.769", "2:02.496", "2:02.777", "2:02.798", "2:02.811", "2:02.761", "2:03.088", "2:20.564", "2:27.787", "2:00.849", "2:00.292", "2:00.860", "2:25.279", "1:59.948", "2:26.924", "2:33.043", "1:58.604", "1:58.696", "2:07.629", "2:31.683", "2:10.970", "2:36.740", "2:09.626", "2:25.490", "2:22.108", "2:01.971", "1:58.852", "1:55.979", "1:56.500", "1:55.982", "1:53.827", "1:54.003", "1:55.269", "1:53.596", "1:52.081", "1:52.540", "1:53.138", "1:52.514", "1:52.228", "1:53.317", "1:52.660", "1:53.041", "1:53.585", "1:54.259", "1:52.067"]
"Yuki TSUNODA": ["21:11:12", "2:06.485", "2:05.476", "2:04.333", "2:03.862", "2:03.803", "2:03.162", "2:22.256", "2:41.673", "2:38.012", "2:03.513", "2:03.167", "2:03.129", "2:03.167", "2:02.582", "2:02.443", "2:02.801", "2:01.737", "2:01.240", "2:01.304", "2:28.641", "2:28.707", "2:01.296", "2:00.864", "2:00.402", "2:25.390", "2:00.193", "2:23.205", "2:38.491", "1:59.934", "1:59.622", "1:58.716", "2:06.961", "2:30.183"]
"ZHOU Guanyu": ["21:11:20", "2:09.958", "2:07.541", "2:04.482", "2:04.592", "2:04.209", "2:04.065", "2:24.525", "2:38.253", "2:35.211", "2:03.222", "2:02.811", "2:06.289", "2:05.556", "2:06.105", "2:02.641", "2:02.501", "2:02.158", "2:02.038", "2:21.871", "2:28.090", "2:01.732", "2:01.105", "2:01.171", "2:24.936"]
"Esteban OCON": ["21:11:17", "2:08.739", "2:06.079", "2:03.245", "2:02.831", "2:02.983"]
"Mick SCHUMACHER": ["21:11:16", "2:08.586", "2:05.628", "2:04.531", "2:04.601", "2:03.975", "2:04.039", "2:23.615", "2:37.698", "2:36.977", "2:02.944", "2:02.854", "2:03.170", "2:03.100", "2:02.944", "2:03.052", "2:02.455", "2:02.454", "2:02.038", "2:01.780", "2:20.120", "2:28.565", "2:00.362", "2:00.225", "2:00.487", "2:23.638", "2:00.664", "2:22.204", "2:36.902", "2:01.745", "1:58.777", "1:58.692", "1:59.069", "2:07.583", "2:34.532", "2:36.563", "2:20.553", "2:32.841", "2:24.621", "2:00.769", "3:01.053", "2:32.828", "1:59.502", "2:01.846", "1:57.253", "1:55.624", "1:55.088", "1:52.651", "1:52.416", "1:51.132", "1:52.195", "1:50.731", "1:51.607", "1:52.194", "1:51.917", "1:52.865", "1:51.193", "1:50.290"]
"Carlos SAINZ": ["21:11:04", "2:02.702", "2:01.872", "2:01.488", "2:01.111", "2:00.428", "2:00.833", "2:14.044", "3:00.236", "2:46.483", "2:01.252", "2:00.463", "2:00.631", "2:00.917", "2:00.439", "2:00.389", "2:00.144", "2:00.061", "1:59.869", "2:00.003", "2:06.780", "2:36.659", "2:01.685", "1:58.940", "1:59.192", "2:17.465", "2:04.708", "2:10.081", "2:38.981", "1:57.456", "1:58.538", "1:58.780", "1:58.296", "2:07.162", "2:49.952", "2:40.946", "3:00.885", "2:31.655", "1:58.724", "1:54.980", "1:53.471", "1:54.241", "1:52.063", "1:50.988", "1:51.048", "1:50.340", "1:50.101", "1:50.005", "1:50.096", "1:49.424", "1:49.683", "1:49.420", "1:49.626", "1:49.346", "1:49.013", "1:48.712", "1:48.746", "1:48.414"]
"Valtteri BOTTAS": ["21:11:18", "2:09.037", "2:06.307", "2:05.300", "2:04.635", "2:04.208", "2:05.534", "2:25.828", "2:37.459", "2:32.122", "2:03.254", "2:02.395", "2:02.810", "2:03.152", "2:02.841", "2:02.980", "2:02.744", "2:02.374", "2:02.142", "2:02.065", "2:21.775", "2:28.554", "2:01.548", "2:01.628", "2:00.992", "2:25.028", "2:01.219", "2:24.225", "1:59.572", "1:59.484", "2:07.048", "2:32.611", "2:11.632", "2:36.790", "2:09.167", "2:25.762", "2:23.782", "2:01.914", "1:57.085", "1:54.236", "1:54.129", "1:53.723", "1:54.985", "1:53.728", "1:52.736", "1:52.579", "1:53.071", "1:54.301", "1:53.656", "1:51.864", "1:52.899", "1:52.228", "1:52.335", "1:52.262", "1:53.276", "1:56.931", "1:56.059"]
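For the further processing mentioned in the question, the extracted strings usually need to become numbers. A minimal sketch, assuming lap times have the form `M:SS.mmm` and that values with two colons (such as `21:11:14`) are clock times rather than lap durations and should be skipped. The function name `lap_time_to_seconds` is hypothetical and not part of pdfplumber.

```python
from typing import Optional


def lap_time_to_seconds(lap_time: str) -> Optional[float]:
    """Convert 'M:SS.mmm' to seconds; return None for other formats."""
    parts = lap_time.split(":")
    if len(parts) != 2:
        # e.g. "21:11:14": assumed to be HH:MM:SS, not a lap duration
        return None
    minutes, seconds = parts
    return int(minutes) * 60 + float(seconds)


# First three entries from one driver's list in the result above
sample = ["21:11:14", "2:04.389", "2:03.369"]
seconds = [s for s in map(lap_time_to_seconds, sample) if s is not None]
fastest = min(seconds)
print(seconds, fastest)
```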