Fine-tuning LayoutLMv3 with the CORD-v2 dataset


I am fine-tuning LayoutLMv3 on the CORD-v2 dataset. I am struggling with the data-preprocessing part, specifically with how to correctly extract the total amount (TTC) for each image. The examples I have found online all seem to use the old CORD dataset, which has a different format; the new CORD-v2 dataset only contains the images and the ground-truth labels.
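For context, this is roughly what the new format gives you. A minimal sketch, assuming the Hugging Face dataset `naver-clova-ix/cord-v2`, whose `ground_truth` column is a JSON string; the `gt_parse` / `total.total_price` keys follow the CORD annotation schema used in the Donut examples:

```python
# Sketch: load CORD-v2 and pull the total amount out of the
# ground-truth JSON string (keys per the CORD annotation schema).
import json
from datasets import load_dataset

dataset = load_dataset("naver-clova-ix/cord-v2", split="train")

sample = dataset[0]
gt = json.loads(sample["ground_truth"])   # "ground_truth" is a JSON string
parse = gt["gt_parse"]                    # structured receipt annotation
total_price = parse.get("total", {}).get("total_price")
print(total_price)                        # the receipt's total amount, if present
```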

How can I solve this?

I have tried examples from YouTube and Hugging Face, but without any success.

machine-learning dataset fine-tuning
1 Answer

I have a similar problem. I have been working on creating a custom dataset: I labeled 6 delivery receipts with 3 labels in Label Studio (just for testing purposes, to set up the pipeline), exported the JSON from it, and programmatically created one .json file per exported data entry (see the sketch after the dataset code below). I followed this YouTube video all the way to the end, but unfortunately I am also stuck on an "index out of range" problem :/ If I print my datasets, they are not empty, so my problem is understanding where the index-out-of-range error actually comes from! Here is the stack trace after running trainer.train():

```
Currently training with a batch size of: 1
The following columns in the training set don't have a corresponding argument in `VisionEncoderDecoderModel.forward` and have been ignored: image_path, ground_truth_string, annotation_path. If image_path, ground_truth_string, annotation_path are not expected by `VisionEncoderDecoderModel.forward`, you can safely ignore this message.
***** Running training *****
  Num examples = 5
  Num Epochs = 1
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 5
  Number of trainable parameters = 201,121,912
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-160-3435b262f1ae> in <cell line: 1>()
----> 1 trainer.train()

13 frames
/usr/local/lib/python3.10/dist-packages/datasets/table.py in <listcomp>(.0)
    122         return pa.Table.from_batches(
    123             [
--> 124                 self._batches[batch_idx].slice(i - self._offsets[batch_idx], 1)
    125                 for batch_idx, i in zip(batch_indices, indices)
    126             ],

IndexError: list index out of range
```
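One way to narrow this down (a debugging sketch, with `train_dataset` as a placeholder for whatever dataset object was passed to the Trainer) is to index every example directly, bypassing the Trainer and the data collator:

```python
# Debugging sketch: index each example directly to check whether the
# dataset itself raises, independent of the Trainer/collator.
# `train_dataset` is a placeholder for the dataset passed to Trainer.
for i in range(len(train_dataset)):
    try:
        _ = train_dataset[i]
    except IndexError as exc:
        print(f"IndexError at index {i}: {exc}")
        break
else:
    print(f"All {len(train_dataset)} examples indexed fine; the problem "
          "is likely a length mismatch between columns produced by the "
          "preprocessing map(), or something in the collator.")
```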

Here is my dataset code, which follows the CORD-v2 format:

```python
import json
from pathlib import Path  # needed for Path(filename).stem below

KEY_START = "<key_start>"
KEY_END = "<key_end>"
VALUE_START = "<value_start>"
VALUE_END = "<value_end>"

def parse_json(batch):
    batch_data = []
    image_path = []
    for filename in batch["annotation_path"]:
        with open(filename) as fp:
            data = json.load(fp)

        x_values, y_values, width_values, height_values, labels, texts = [], [], [], [], [], []

        # Extract the values from the JSON structure
        for annotation in data['annotations']:
            for result in annotation['result']:
                value = result['value']
                x_values.append(value['x'])
                y_values.append(value['y'])
                width_values.append(value['width'])
                height_values.append(value['height'])
                labels.append(value['rectanglelabels'][0])  # Assuming each label list contains one item
                texts.append(value.get('text', ''))  # Extract the associated text if available

        x_values = list(map(str, x_values))
        y_values = list(map(str, y_values))
        width_values = list(map(str, width_values))
        height_values = list(map(str, height_values))
        labels = list(map(str, labels))
        texts = list(map(str, texts))

        # Concatenate label/text pairs into one target string
        ground_truth_string = ""
        for label, text in zip(labels, texts):
            ground_truth_string += f"{KEY_START}{label}{KEY_END}{VALUE_START}{text}{VALUE_END}"
        batch_data.append(ground_truth_string)
        print(f"batch_data: {batch_data}")

        file_id = Path(filename).stem
        print(Path(filename))
        print(file_id)
        # images_dir must be defined elsewhere as a pathlib.Path to the image folder
        image_path.append(str(images_dir / f"{file_id}.png"))
        print(f"image_path len: {len(image_path)}")

    return {"ground_truth_string": batch_data, "image_path": image_path}
```
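For reference, this is roughly how the per-entry .json files read by `parse_json` were produced from the Label Studio export. A sketch with placeholder paths; a Label Studio JSON export is a list of task dicts, each with an `id`:

```python
# Sketch: split a Label Studio JSON export into one .json file per task.
# export_file and annotations_dir are placeholder paths.
import json
from pathlib import Path

export_file = Path("label_studio_export.json")  # full export from Label Studio
annotations_dir = Path("annotations")           # one file per task goes here
annotations_dir.mkdir(exist_ok=True)

tasks = json.loads(export_file.read_text())
for task in tasks:
    # Use the task id as the filename so it can be matched to its image later.
    out_path = annotations_dir / f"{task['id']}.json"
    out_path.write_text(json.dumps(task, indent=2))
```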
