I'm fine-tuning LayoutLMv3 on the CORD-v2 dataset. I'm stuck on the data-preprocessing step, specifically on how to correctly extract the total amount (TTC) from an image. The examples I found online seem to use the old CORD dataset, which has a different format; the new CORD-v2 dataset contains only the images and the ground-truth labels.
How can I solve this?
I've tried examples from YouTube and Hugging Face, but without any success.
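To make the format concrete, here is my understanding of the CORD-v2 layout: each example has an "image" and a "ground_truth" column, where the latter is a JSON string whose "gt_parse" key holds the structured receipt fields. A minimal sketch of pulling the total out of one entry (the field names "total" and "total_price" are my assumption from the CORD annotation schema, so verify them against your copy of the dataset):

```python
import json

def extract_total(ground_truth):
    """Extract the grand total (TTC) from one CORD-v2 "ground_truth" string.

    Assumes the CORD-v2 convention: ground_truth is a JSON string whose
    "gt_parse" key holds the structured receipt fields, with the grand
    total under gt_parse["total"]["total_price"].
    """
    gt_parse = json.loads(ground_truth)["gt_parse"]
    return gt_parse.get("total", {}).get("total_price")

# Minimal example mimicking one CORD-v2 ground_truth string
sample = json.dumps({
    "gt_parse": {
        "menu": [{"nm": "Americano", "price": "4,500"}],
        "total": {"total_price": "4,500"},
    }
})
print(extract_total(sample))  # → 4,500
```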
I have a similar problem. I've been working on creating a custom dataset. I labeled 6 delivery receipts with 3 labels in Label Studio (just for testing, to set up the pipeline), exported the JSON from it, and programmatically created one .json file per exported data entry. I followed this YouTube video all the way to the end, but unfortunately I'm also stuck on an "index out of range" problem :/ If I print my datasets they are not empty, so my question is where the index-out-of-range error actually comes from! Here is the stack trace after running trainer.train():
Currently training with a batch size of: 1
The following columns in the training set don't have a corresponding argument in `VisionEncoderDecoderModel.forward` and have been ignored: image_path, ground_truth_string, annotation_path. If image_path, ground_truth_string, annotation_path are not expected by `VisionEncoderDecoderModel.forward`, you can safely ignore this message.
***** Running training *****
Num examples = 5
Num Epochs = 1
Instantaneous batch size per device = 1
Total train batch size (w. parallel, distributed & accumulation) = 1
Gradient Accumulation steps = 1
Total optimization steps = 5
Number of trainable parameters = 201,121,912
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-160-3435b262f1ae> in <cell line: 1>()
----> 1 trainer.train()
13 frames
/usr/local/lib/python3.10/dist-packages/datasets/table.py in <listcomp>(.0)
122 return pa.Table.from_batches(
123 [
--> 124 self._batches[batch_idx].slice(i - self._offsets[batch_idx], 1)
125 for batch_idx, i in zip(batch_indices, indices)
126 ],
IndexError: list index out of range
Here is my dataset code, adapted for CORD-v2:
import json
from pathlib import Path  # needed for Path(filename).stem below

KEY_START = "<key_start>"
KEY_END = "<key_end>"
VALUE_START = "<value_start>"
VALUE_END = "<value_end>"

def parse_json(batch):
    batch_data = []
    image_path = []
    for filename in batch["annotation_path"]:
        with open(filename) as fp:
            data = json.load(fp)
        x_values, y_values, width_values, height_values, labels, texts = [], [], [], [], [], []
        # Walk the Label Studio export and collect one entry per bounding box
        for annotation in data['annotations']:
            for result in annotation['result']:
                value = result['value']
                x_values.append(value['x'])
                y_values.append(value['y'])
                width_values.append(value['width'])
                height_values.append(value['height'])
                labels.append(value['rectanglelabels'][0])  # assuming each label list contains one item
                texts.append(value.get('text', ''))  # the associated text, if available
        # Coordinates are collected but not yet used in the target string
        x_values = list(map(str, x_values))
        y_values = list(map(str, y_values))
        width_values = list(map(str, width_values))
        height_values = list(map(str, height_values))
        labels = list(map(str, labels))
        texts = list(map(str, texts))
        # Serialize the labels and texts into one target string per receipt
        ground_truth_string = ""
        for label, text in zip(labels, texts):
            ground_truth_string += f"{KEY_START}{label}{KEY_END}{VALUE_START}{text}{VALUE_END}"
        batch_data.append(ground_truth_string)
        file_id = Path(filename).stem
        image_path.append(str(images_dir / f"{file_id}.png"))  # images_dir is defined elsewhere in my notebook
    return {
        "ground_truth_string": batch_data,
        "image_path": image_path,
    }
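One thing I plan to check: in datasets.map(..., batched=True), every returned column must contain exactly one entry per input example, and a ragged result can surface much later as exactly this kind of opaque IndexError inside datasets/table.py. A small sanity check I wrote for that (check_batched_output is my own helper, not a datasets API):

```python
def check_batched_output(batch, out):
    """Raise early if a batched-map result is ragged.

    Every column returned from a batched datasets.map must have one
    entry per input example; this verifies that before Arrow does.
    """
    n = len(next(iter(batch.values())))
    for col, values in out.items():
        if len(values) != n:
            raise ValueError(f"column {col!r} has {len(values)} entries for {n} inputs")
    return out

# A well-formed result (one entry per input example) passes silently
batch = {"annotation_path": ["a.json", "b.json"]}
out = {"ground_truth_string": ["<key_start>total<key_end><value_start>9.99<value_end>", ""],
       "image_path": ["a.png", "b.png"]}
check_batched_output(batch, out)
```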