I have a relatively simple requirement, but surprisingly it does not seem easy to implement in pytorch. Given a neural network with $P$ parameters that outputs a vector of length $Y$, and a batch of $B$ data inputs, I want to compute the gradients of the outputs with respect to the model parameters.
In other words, I want the following function:
def calculate_gradients(model, X):
    """
    Args:
        model: nn.Module with P parameters in total that outputs a tensor of shape (B, Y).
        X: torch tensor of shape (B, ...).
    Returns:
        torch tensor of shape (B, Y, P)
    """
    # function logic here
Unfortunately, I currently see no obvious way to compute this efficiently, in particular without aggregating over the data or output dimensions. The minimal working example below loops over both the input and the output dimensions, but surely there is a more efficient way?
import torch
from torchvision import datasets, transforms
import torch.nn as nn

###### SETUP ######

class MLP(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(MLP, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        h = self.fc1(x)
        pred = self.fc2(self.relu(h))
        return pred

train_dataset = datasets.MNIST(root='./data', train=True, download=True,
                               transform=transforms.Compose(
                                   [transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (0.5,))]))
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=2, shuffle=False)

X, y = next(iter(train_dataloader))  # take a batch of data
net = MLP(28*28, 20, 10)             # define a network

###### CALCULATE GRADIENTS ######
def calculate_gradients(model, X):
    # Create a tensor to hold the gradients
    gradients = torch.zeros(X.shape[0], 10, sum(p.numel() for p in model.parameters()))
    # Calculate the gradients for each input and output dimension
    for i in range(X.shape[0]):
        for j in range(10):
            model.zero_grad()
            output = model(X[i])
            # Calculate the gradients of one output entry w.r.t. all parameters
            grads = torch.autograd.grad(output[j], model.parameters())
            # Flatten the gradients and store them
            gradients[i, j, :] = torch.cat([g.view(-1) for g in grads])
    return gradients

grads = calculate_gradients(net, X.view(X.shape[0], -1))
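For the setup above this returns a tensor of shape (B, Y, P); a quick check, just restating the sizes already defined (fc1 has 784*20 + 20 parameters, fc2 has 20*10 + 10):

P = sum(p.numel() for p in net.parameters())
print(P, grads.shape)  # 15910 torch.Size([2, 10, 15910])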
To solve this we need three ideas:

1. The gradient of the outputs with respect to the parameters is the Jacobian of the network with respect to the parameters: https://pytorch.org/functorch/stable/generated/functorch.jacrev.html
2. We can functionalize a pytorch model, i.e. turn the model into a function of its parameters: https://pytorch.org/functorch/nightly/generated/functorch.make_functional.html
3. Pytorch can vectorize many operations with vmap: https://pytorch.org/functorch/stable/generated/functorch.vmap.html
All of this lives in functorch / torch.func (each primitive is sketched in isolation below).
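A minimal sketch of each primitive on toy inputs (the names f, w, x, lin, xs here are illustrative, not part of the code above):

import torch

# 1. jacrev: Jacobian of a function via reverse-mode autodiff
# (toy map, not the model above; differentiates w.r.t. the first argument w)
f = lambda w, x: torch.tanh(w @ x)
w, x = torch.randn(3, 4), torch.randn(4)
print(torch.func.jacrev(f)(w, x).shape)  # torch.Size([3, 3, 4]): output dim x shape of w

# 2. functional_call: run an nn.Module with an explicitly supplied parameter dict
lin = torch.nn.Linear(4, 3)
params = dict(lin.named_parameters())
print(torch.func.functional_call(lin, params, (x,)).shape)  # torch.Size([3])

# 3. vmap: vectorize a function written for one sample over a batch dimension
xs = torch.randn(8, 4)
print(torch.vmap(lambda s: lin(s))(xs).shape)  # torch.Size([8, 3]), same as lin(xs)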
Putting these together, this does the same as your code:
# extract the parameters and buffers for a functional call
params = {k: v.detach() for k, v in net.named_parameters()}
buffers = {k: v.detach() for k, v in net.named_buffers()}

def one_sample(sample):
    # this calculates the gradients for a single sample
    # we want the gradient of each output w.r.t. the parameters,
    # which is the same as the jacobian of the network w.r.t. the parameters

    # define a function that takes the parameters as input and returns the output of the network
    call = lambda p: torch.func.functional_call(net, (p, buffers), sample)
    # calculate the jacobian of the network w.r.t. the parameters
    J = torch.func.jacrev(call)(params)
    # J is a dictionary mapping parameter names to gradients; we want a single tensor
    grads = torch.cat([v.flatten(1) for v in J.values()], -1)
    return grads

# now we can use vmap to calculate the gradients for all samples at once
grads2 = torch.vmap(one_sample)(X.flatten(1))
print(torch.allclose(grads, grads2))
It should run in parallel; you should try it with larger models etc., as I have not benchmarked it.
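If you want to measure it yourself, here is a minimal timing sketch with Python's timeit, reusing the definitions above (a rough comparison only, not a proper benchmark):

import timeit

# rough timing only: wall-clock over 10 repetitions, numbers depend on your hardware
t_loop = timeit.timeit(lambda: calculate_gradients(net, X.view(X.shape[0], -1)), number=10)
t_vmap = timeit.timeit(lambda: torch.vmap(one_sample)(X.flatten(1)), number=10)
print(f"loop: {t_loop:.3f}s, vmap: {t_vmap:.3f}s")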
This is also related to e.g. Pytorch: Gradient of output w.r.t parameters (which honestly doesn't have a great answer), and to pytorch.org/tutorials/intermediate/per_sample_grads.html, which shows some of the torch.func functionality for computing per-sample gradients.
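For reference, that tutorial's approach looks roughly like this, reusing net, params, buffers, X, y from above (a sketch along the lines of the tutorial; the cross-entropy loss and the compute_loss helper are illustrative, not from the original):

import torch.nn.functional as F

def compute_loss(params, buffers, sample, target):
    # run the functionalized model on one sample and reduce to a scalar loss
    # (illustrative loss choice; MNIST targets are integer class labels)
    pred = torch.func.functional_call(net, (params, buffers), sample.unsqueeze(0))
    return F.cross_entropy(pred, target.unsqueeze(0))

# differentiate w.r.t. the first argument (params), then vmap over samples/targets
per_sample_grads = torch.vmap(
    torch.func.grad(compute_loss), in_dims=(None, None, 0, 0)
)(params, buffers, X.flatten(1), y)
# per_sample_grads is a dict of per-parameter tensors with a leading batch dimension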