I'm loading resnet50 with the code below, but since this is a video model, I'm not sure what the expected input is. Is it
([batch_size, channels, frames, img1, img2])?
Any help would be great.
import torch.nn as nn
import pytorchvideo.models.resnet

def resnet():
    return pytorchvideo.models.resnet.create_resnet(
        input_channel=3,      # RGB input from Kinetics
        model_depth=50,       # For the tutorial let's just use a 50 layer network
        model_num_class=400,  # Kinetics has 400 classes so we need our final head to align
        norm=nn.BatchNorm3d,
        activation=nn.ReLU,
    )
The input tensor should have the shape (B, C, T, H, W).
Source: https://pytorchvideo.readthedocs.io/en/latest/models.html#resnet-models-for-video-classification
A usage example from the documentation:
import torch
import pytorchvideo.models as models

resnet = models.create_resnet()
# Batch size, channels, frames (time), height, width
B, C, T, H, W = 2, 3, 8, 224, 224
input_tensor = torch.zeros(B, C, T, H, W)
output = resnet(input_tensor)
The input tensor shape should be [batch_size, channels, frames, height, width].
In your case (Kinetics 400, RGB input), the expected input tensor shape is [batch_size, 3, frames, height, width].
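If your frames come out of a decoder in a channels-last layout, they need to be rearranged before they match the shape above. The following is only a minimal sketch: the (T, H, W, C) input layout and the helper name `to_model_input` are assumptions made for illustration, not part of PyTorchVideo.

```python
import torch


def to_model_input(frames_thwc: torch.Tensor) -> torch.Tensor:
    """Convert one clip from (T, H, W, C) to a (1, C, T, H, W) float batch.

    Hypothetical helper: it assumes the clip is a single 4-D tensor
    with the channel dimension last.
    """
    if frames_thwc.ndim != 4:
        raise ValueError(f"expected 4 dims (T, H, W, C), got {frames_thwc.ndim}")
    clip = frames_thwc.permute(3, 0, 1, 2)  # (T, H, W, C) -> (C, T, H, W)
    return clip.unsqueeze(0).float()        # add batch dim -> (1, C, T, H, W)


# Dummy clip: 8 RGB frames of 224x224 pixels
dummy = torch.zeros(8, 224, 224, 3, dtype=torch.uint8)
batch = to_model_input(dummy)
print(tuple(batch.shape))  # (1, 3, 8, 224, 224)
```

Several clips with the same number of frames and the same resolution can then be concatenated along dimension 0 with torch.cat to form a larger batch.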
Here is a small example of how to load a video:
import torch
import torchvision.transforms as transform
from pytorchvideo.data.encoded_video import EncodedVideo

def load_video_clip(video_path, frames, height, width):
    video = EncodedVideo.from_path(video_path)
    # get_clip returns a dict; its "video" entry is a float tensor of shape (C, T, H, W)
    clip = video.get_clip(start_sec=0, end_sec=video.duration)["video"]
    # Uniformly sample `frames` indices along the time dimension
    indices = torch.linspace(0, clip.shape[1] - 1, steps=frames).long()
    clip = clip[:, indices]
    # Scale pixel values from the 0-255 range to [0, 1], then resize height and width
    clip = transform.Resize((height, width))(clip / 255.0)
    # Add the batch dimension: (1, C, frames, height, width)
    return clip.unsqueeze(0)

video_path = 'video.mp4'
frames = 16
height = 224
width = 224
video_clip = load_video_clip(video_path, frames, height, width)
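Here we load the video, apply the transforms to its frames, uniformly sample the requested number of frames from the clip, and finally add the batch dimension.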
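To see what the uniform sampling step picks, here is a rough pure-Python sketch of the same idea. The function name `uniform_indices` is hypothetical, and because it uses Python floats the result may differ from torch.linspace(...).long() in rare rounding edge cases.

```python
def uniform_indices(num_available: int, num_wanted: int) -> list[int]:
    """Pick num_wanted evenly spaced frame indices from 0..num_available-1.

    Illustrative helper only: the first and last frames are always
    included when num_wanted > 1, and indices are truncated to ints.
    """
    if num_available < 1 or num_wanted < 1:
        raise ValueError("both arguments must be at least 1")
    if num_wanted == 1:
        return [0]
    step = (num_available - 1) / (num_wanted - 1)
    return [int(i * step) for i in range(num_wanted)]


print(uniform_indices(10, 4))  # [0, 3, 6, 9]
```

If the video has fewer frames than requested, some indices repeat, so the clip still ends up with the requested number of frames.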