PyTorchVideo ResNet model input shape

Question

I'm loading a resnet50 with the code below, but because this is a video model I'm not sure what input it expects. Is it

([batch_size, channels, frames, img1, img2])?

Any help would be great.

import torch.nn as nn
import pytorchvideo.models.resnet

def resnet():
  return pytorchvideo.models.resnet.create_resnet(
      input_channel=3,     # RGB input from Kinetics
      model_depth=50,      # For the tutorial let's just use a 50 layer network
      model_num_class=400, # Kinetics has 400 classes so we need our final head to align
      norm=nn.BatchNorm3d,
      activation=nn.ReLU,
  )

Answer 1

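The input tensor should have shape (B, C, T, H, W): batch size, channels, number of frames, height, width.

Source: https://pytorchvideo.readthedocs.io/en/latest/models.html#resnet-models-for-video-classification

A usage example from the documentation: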

import torch
import pytorchvideo.models as models

resnet = models.create_resnet()
B, C, T, H, W = 2, 3, 8, 224, 224
input_tensor = torch.zeros(B, C, T, H, W)
output = resnet(input_tensor)

Answer 2

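The input tensor shape should be [batch_size, channels, frames, height, width], where:

channels: 3 for RGB images,
frames: the number of frames in each video clip,
height and width: the spatial dimensions of each frame.

In your case (Kinetics 400), the expected input tensor shape is [batch_size, 3, frames, height, width].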
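A common pitfall with this layout is that stacking per-frame image tensors gives (T, C, H, W), with frames first, so the channel and frame axes have to be swapped before adding the batch dimension. The following is a minimal sketch of that conversion using dummy frames; `frames_to_clip` is a hypothetical helper name, not part of PyTorchVideo.

```python
import torch

def frames_to_clip(frames):
    """Stack per-frame (C, H, W) tensors into a (1, C, T, H, W) clip.

    Hypothetical helper for illustration only.
    """
    stacked = torch.stack(frames)        # (T, C, H, W)
    clip = stacked.permute(1, 0, 2, 3)   # (C, T, H, W)
    return clip.unsqueeze(0)             # (1, C, T, H, W)

# Eight dummy RGB frames of size 224x224
frames = [torch.zeros(3, 224, 224) for _ in range(8)]
clip = frames_to_clip(frames)
print(clip.shape)  # torch.Size([1, 3, 8, 224, 224])
```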

Here is a small example of how to load a video:

import torch
import torch.nn.functional as F
from pytorchvideo.data.encoded_video import EncodedVideo

def load_video_clip(video_path, frames, height, width):
    video = EncodedVideo.from_path(video_path)

    # get_clip returns a dict; its "video" entry is a (C, T, H, W) tensor
    # with pixel values in [0, 255]
    clip = video.get_clip(start_sec=0, end_sec=video.duration)
    video_frames = clip["video"].float() / 255.0

    # Uniformly sample `frames` indices along the temporal dimension
    indices = torch.linspace(0, video_frames.shape[1] - 1, steps=frames).long()
    video_frames = video_frames[:, indices]

    # Resize every frame; interpolate resizes the last two dimensions (H, W)
    video_frames = F.interpolate(
        video_frames, size=(height, width), mode="bilinear", align_corners=False
    )

    # Add the batch dimension: (1, C, T, H, W)
    video_clip = video_frames.unsqueeze(0)
    return video_clip

video_path = 'video.mp4'
frames = 16
height = 224
width = 224

video_clip = load_video_clip(video_path, frames, height, width)

Here we load the video, resize its frames, uniformly sample the requested number of frames from it, and add a batch dimension.
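To make the uniform sampling step concrete, here is a small pure-Python sketch of how evenly spaced frame indices can be picked. `uniform_indices` is a hypothetical name, and the integer arithmetic is an assumption chosen for illustration; it follows the same idea as the `torch.linspace(...).long()` call but is not guaranteed to match it index for index.

```python
def uniform_indices(num_available, num_samples):
    """Return num_samples indices spread evenly over [0, num_available - 1].

    Hypothetical helper for illustration only.
    """
    if num_available < 1 or num_samples < 1:
        raise ValueError("both arguments must be at least 1")
    if num_samples == 1:
        return [0]
    # Integer arithmetic keeps the first and last indices exact
    return [i * (num_available - 1) // (num_samples - 1)
            for i in range(num_samples)]

print(uniform_indices(100, 5))  # [0, 24, 49, 74, 99]
```

If the video has fewer frames than requested, some indices repeat, so the same frame is used more than once.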
