PyTorchVideo ResNet model input shape

Question (votes: 0, answers: 2)

I am loading a resnet50 with the code below, but because it is a video model I am not sure what input it expects. Is it

([batch_size, channels, frames, img1, img2])

Any help would be great.

import torch.nn as nn
import pytorchvideo.models.resnet

def resnet():
  return pytorchvideo.models.resnet.create_resnet(
      input_channel=3,     # RGB input from Kinetics
      model_depth=50,      # For the tutorial let's just use a 50 layer network
      model_num_class=400, # Kinetics has 400 classes so we need our final head to align
      norm=nn.BatchNorm3d,
      activation=nn.ReLU,
  )
python deep-learning pytorch model vision
2 answers
0
votes

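The input tensor should have shape (B, C, T, H, W): batch size, channels, frames (time), height, width.

Source: https://pytorchvideo.readthedocs.io/en/latest/models.html#resnet-models-for-video-classification

A usage example from the documentation: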

import torch
import pytorchvideo.models as models

resnet = models.create_resnet()
# Batch size, channels, frames, height, width
B, C, T, H, W = 2, 3, 8, 224, 224
input_tensor = torch.zeros(B, C, T, H, W)
output = resnet(input_tensor)

0
votes

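The input tensor shape should be [batch_size, channels, frames, height, width], where:

  • channels: 3 for RGB images,
  • frames: the number of frames in each video clip,
  • height, width: the spatial dimensions of each frame.

In your case (Kinetics 400), the expected input tensor shape is [batch_size, 3, frames, height, width].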
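
Many video decoders return frames with the channel dimension last, so the layout often has to be rearranged before the clip is passed to the model. The following is a minimal sketch of that step, assuming the decoded frames arrive as a uint8 tensor shaped (T, H, W, C). The helper name to_model_input is hypothetical and not part of PyTorchVideo.

```python
import torch


def to_model_input(frames_thwc: torch.Tensor) -> torch.Tensor:
    """Convert frames shaped (T, H, W, C) into a float batch shaped (1, C, T, H, W).

    Hypothetical helper for illustration; assumes uint8 pixel values in [0, 255].
    """
    if frames_thwc.ndim != 4:
        raise ValueError(f"expected 4 dimensions (T, H, W, C), got {frames_thwc.ndim}")
    # Move channels first: (T, H, W, C) -> (C, T, H, W), then scale to [0, 1]
    clip = frames_thwc.permute(3, 0, 1, 2).float() / 255.0
    # Add the batch dimension: (C, T, H, W) -> (1, C, T, H, W)
    return clip.unsqueeze(0)


# Dummy clip: 8 RGB frames of 224x224 pixels
dummy_frames = torch.zeros(8, 224, 224, 3, dtype=torch.uint8)
batch = to_model_input(dummy_frames)
print(batch.shape)
```

Several clips with the same number of frames and the same spatial size can then be combined with torch.cat along dimension 0 to form a larger batch.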

Here is a small example of how to load a video:

import torch
import torch.nn.functional as F
from pytorchvideo.data.encoded_video import EncodedVideo

def load_video_clip(video_path, frames, height, width):
    video = EncodedVideo.from_path(video_path)

    # get_clip returns a dict; its "video" entry is a (C, T, H, W) tensor,
    # typically with pixel values in [0, 255]
    clip = video.get_clip(start_sec=0, end_sec=video.duration)
    video_frames = clip["video"].float() / 255.0

    # Uniformly sample `frames` indices along the time dimension
    indices = torch.linspace(0, video_frames.shape[1] - 1, steps=frames).long()
    video_frames = video_frames[:, indices]

    # Resize every frame to (height, width); F.interpolate is used here in place
    # of the per-frame torchvision transform and resizes the last two dimensions
    video_frames = F.interpolate(
        video_frames, size=(height, width), mode="bilinear", align_corners=False
    )

    # Add the batch dimension: (C, T, H, W) -> (1, C, T, H, W)
    video_clip = video_frames.unsqueeze(0)
    return video_clip

video_path = 'video.mp4'
frames = 16
height = 224
width = 224

video_clip = load_video_clip(video_path, frames, height, width)

Here we load the video, resize its frames, uniformly sample the requested number of frames from it, and add a batch dimension.
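
To make the uniform sampling step concrete, here is a plain-Python sketch of how evenly spaced frame indices can be chosen. The function uniform_frame_indices is a hypothetical stand-in written for illustration, not a PyTorchVideo API. It is intended to mirror the torch.linspace(...).long() call above, though the two may differ in rounding edge cases.

```python
from typing import List


def uniform_frame_indices(num_frames_total: int, num_samples: int) -> List[int]:
    """Return `num_samples` evenly spaced indices into a clip of `num_frames_total` frames.

    Hypothetical helper for illustration; uses integer floor division.
    """
    if num_frames_total <= 0 or num_samples <= 0:
        raise ValueError("both arguments must be positive")
    if num_samples == 1:
        return [0]
    last = num_frames_total - 1
    # Spread the samples from the first frame (0) to the last frame (last)
    return [(i * last) // (num_samples - 1) for i in range(num_samples)]


print(uniform_frame_indices(32, 4))  # [0, 10, 20, 31]
print(uniform_frame_indices(4, 8))   # [0, 0, 0, 1, 1, 2, 2, 3]
```

Note that when the video has fewer frames than requested, some indices repeat, so the same frame is used more than once rather than the clip coming out shorter.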
