Coding attention mechanisms

In this chapter, we move on to the second stage of building an LLM: attention mechanisms.

The problem with modeling long sequences

Suppose we want to develop a language translation model that translates text from one language into another.

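To tackle this problem, it is common to use a deep neural network with two submodules, an encoder and a decoder. The encoder first reads and processes the entire text, and the decoder then produces the translated text.

A major limitation of encoder-decoder recurrent neural networks (RNNs) is that, during decoding, the RNN cannot directly access earlier hidden states from the encoder. It relies solely on the current hidden state, which has to encapsulate all relevant information. This can lead to a loss of context, especially in complex sentences where dependencies may span long distances.

This shortcoming motivated the design of attention mechanisms.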

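Capturing data dependencies with attention mechanisms

Attending to different parts of the input with self-attention

In self-attention, the "self" refers to the mechanism's ability to compute attention weights by relating different positions within a single input sequence. It assesses and learns the relationships and dependencies between the various parts of the input itself, such as the words in a sentence or the pixels in an image.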

A simple self-attention mechanism without trainable weights

The goal of self-attention is to compute, for each input element, a context vector that combines information from all input elements. In this example, we compute the context vector $z^{(2)}$. The importance or contribution of each input element to $z^{(2)}$ is determined by the attention weights $a_{21}$ to $a_{2T}$, which are computed with respect to the input element $x^{(2)}$ and all other inputs.

Example:

Your journey starts with one step

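In this case, each element of the sequence, such as $x^{(1)}$, corresponds to a d-dimensional embedding vector representing a specific token, such as "Your". In self-attention, our goal is to compute a context vector $z^{(i)}$ for each element $x^{(i)}$ of the input sequence. A context vector can be interpreted as an enriched embedding vector.

For example, suppose we have the following embeddings: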

import torch
inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

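The first step in implementing self-attention is to compute the intermediate values $\omega$, referred to as attention scores.

query = inputs[1]  # the second input token serves as the query
attn_scores_2 = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):  # dot product of the query with every input token
    attn_scores_2[i] = torch.dot(x_i, query)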
print(attn_scores_2)

> tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])

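In the context of self-attention, the dot product determines how much each element of a sequence focuses on, or "attends to", any other element: the higher the dot product, the higher the similarity and the attention score between the two elements.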
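
A dot product is simply the sum of the element-wise products of two vectors. The following minimal sketch checks this for the first input token and the query; it restates the two vectors so that it runs on its own:

```python
import torch

x_1 = torch.tensor([0.43, 0.15, 0.89])    # "Your"
query = torch.tensor([0.55, 0.87, 0.66])  # "journey", used as the query

# Multiply element by element and accumulate the products
res = 0.0
for idx in range(len(x_1)):
    res += x_1[idx] * query[idx]

print(res)                    # manual dot product
print(torch.dot(x_1, query))  # built-in dot product, same value
```

Both lines print the first attention score, 0.9544.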

Next, we normalize the attention scores so that they sum to 1:

attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()
print("Attention weights:", attn_weights_2_tmp)
print("Sum:", attn_weights_2_tmp.sum())

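In practice, it is more common and advisable to normalize with the softmax function, which copes better with extreme values and guarantees that all attention weights are positive. A naive implementation looks like this: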

def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

attn_weights_2_naive = softmax_naive(attn_scores_2)
print("Attention weights:", attn_weights_2_naive)
print("Sum:", attn_weights_2_naive.sum())
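
The naive implementation can run into numerical problems for large scores, because `torch.exp` of a large number overflows to `inf`. Below is a minimal sketch of the issue and of the usual remedy, which is to subtract the maximum before exponentiating. The helper name `softmax_stable` and the example values are illustrative assumptions; in practice `torch.softmax` is the function to reach for.

```python
import torch

def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

def softmax_stable(x):
    # Subtracting the max leaves the result mathematically unchanged
    # but keeps every exponent <= 0, so exp() cannot overflow
    shifted = x - x.max()
    return torch.exp(shifted) / torch.exp(shifted).sum(dim=0)

large_scores = torch.tensor([1000.0, 1001.0, 1002.0])

print(softmax_naive(large_scores))         # nan values (inf / inf)
print(softmax_stable(large_scores))        # valid probabilities
print(torch.softmax(large_scores, dim=0))  # PyTorch's built-in version
```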

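We then compute the context vector $z^{(2)}$ by multiplying each embedded input token $x^{(i)}$ by its attention weight and summing the resulting vectors. In other words, $z^{(2)}$ is the weighted sum of all input vectors. The attention weights are obtained here with PyTorch's built-in softmax:

query = inputs[1]
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)  # normalize the scores with PyTorch's softmax
context_vec_2 = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2[i] * x_i  # accumulate weight * input vector
print(context_vec_2)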
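
The weighted sum in the loop is the same operation as a vector-matrix product. A minimal, self-contained sketch that compares the two (the variable names `context_loop` and `context_matmul` are introduced here for illustration):

```python
import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],
     [0.55, 0.87, 0.66],
     [0.57, 0.85, 0.64],
     [0.22, 0.58, 0.33],
     [0.77, 0.25, 0.10],
     [0.05, 0.80, 0.55]]
)

query = inputs[1]
attn_weights_2 = torch.softmax(inputs @ query, dim=0)

# Loop version: accumulate weight * vector
context_loop = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
    context_loop += attn_weights_2[i] * x_i

# Vectorized version: (6,) @ (6, 3) -> (3,)
context_matmul = attn_weights_2 @ inputs

print(context_loop)
print(context_matmul)
```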

Next, we compute the attention scores for all pairs of inputs:

attn_scores = torch.empty(6, 6)
for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        attn_scores[i, j] = torch.dot(x_i, x_j)
print(attn_scores)

The result:

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])

Matrix multiplication gives the same result faster. We then normalize each row with softmax:

attn_scores = inputs @ inputs.T
print(attn_scores)

attn_weights = torch.softmax(attn_scores, dim=-1)
print(attn_weights)
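
The `dim=-1` argument tells `torch.softmax` to normalize along the last dimension, so that the weights in each row sum to 1. The sketch below checks this and, for contrast, shows what happens with `dim=0` (the name `wrong` is just an illustrative label):

```python
import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],
     [0.55, 0.87, 0.66],
     [0.57, 0.85, 0.64],
     [0.22, 0.58, 0.33],
     [0.77, 0.25, 0.10],
     [0.05, 0.80, 0.55]]
)

attn_scores = inputs @ inputs.T
attn_weights = torch.softmax(attn_scores, dim=-1)

# dim=-1 normalizes across the columns, so every row sums to 1
print(attn_weights.sum(dim=-1))

# dim=0 normalizes down the rows instead: columns sum to 1, rows do not
wrong = torch.softmax(attn_scores, dim=0)
print(wrong.sum(dim=-1))
```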

Putting everything so far together:

import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your     (x^1)
     [0.55, 0.87, 0.66],  # journey  (x^2)
     [0.57, 0.85, 0.64],  # starts   (x^3)
     [0.22, 0.58, 0.33],  # with     (x^4)
     [0.77, 0.25, 0.10],  # one      (x^5)
     [0.05, 0.80, 0.55]]  # step     (x^6)
)

attn_scores = inputs @ inputs.T
attn_weights = torch.softmax(attn_scores, dim=-1)
all_context_vecs = attn_weights @ inputs
print(all_context_vecs)

Implementing self-attention with trainable weights

We now implement the self-attention mechanism step by step by introducing three trainable weight matrices, $W_q$, $W_k$, and $W_v$.

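Query ($W_q$): A query is analogous to a search query in a database. It represents the current item (for example, a word or token in a sentence) that the model focuses on or tries to understand. The query is used to probe the other parts of the input sequence to determine how much attention to pay to them.
Key ($W_k$): A key is like a database key used for indexing and searching. In the attention mechanism, each item in the input sequence (for example, each word in a sentence) has an associated key. These keys are matched against the query.
Value ($W_v$): A value is similar to the value in a key-value pair in a database. It represents the actual content or representation of an input item. Once the model has determined which keys (and thus which parts of the input) are most relevant to the query (the current focus item), it retrieves the corresponding values.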
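
The database analogy can be sketched in a few lines. A regular lookup returns the value of the single best-matching key, whereas attention returns a blend of all values, weighted by how well each key matches the query. The toy keys, values, and query below are hypothetical numbers chosen only for illustration:

```python
import torch

# Hypothetical toy "database": three keys, each with an associated value
keys = torch.tensor([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
values = torch.tensor([[10.0], [20.0], [30.0]])

query = torch.tensor([0.2, 1.0])

# Dot-product similarity between the query and every key
scores = keys @ query

# Hard lookup: take the value of the single best-matching key
hard_result = values[scores.argmax()]

# Soft lookup (attention): blend all values, weighted by similarity
weights = torch.softmax(scores, dim=0)
soft_result = weights @ values

print(hard_result)   # value of the best-matching key
print(soft_result)   # weighted mixture of all values
```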

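As a first step, we pick the second input element and define the input and output embedding sizes:

x_2 = inputs[1]         # the second input element
d_in = inputs.shape[1]  # input embedding size, d_in=3
d_out = 2               # output embedding size, d_out=2

Next, we initialize the three weight matrices: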

torch.manual_seed(123)
W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_key   = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)

Then we compute the query, key, and value vectors for the second element:

query_2 = x_2 @ W_query 
key_2 = x_2 @ W_key 
value_2 = x_2 @ W_value
print(query_2)

We can obtain the keys and values for all inputs via matrix multiplication:

keys = inputs @ W_key 
values = inputs @ W_value
print("keys.shape:", keys.shape)
print("values.shape:", values.shape)

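As the output shows, we have projected the six input tokens from a three-dimensional onto a two-dimensional embedding space. Next, we compute the attention scores for the second token as the dot product between its query vector and every key vector:

attn_scores_2 = query_2 @ keys.T  # attention scores for the second query

We then scale the attention scores by dividing them by the square root of the embedding dimension of the keys (taking the square root is mathematically the same as raising to the power of 0.5) and normalize them with softmax:

d_k = keys.shape[-1]
attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)
print(attn_weights_2)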
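
The reason for the scaling is that dot products tend to grow with the embedding dimension, and softmax applied to large values becomes sharply peaked, which in turn leads to very small gradients during training. A minimal sketch with randomly generated vectors illustrates the effect (the dimension of 512 and the seed are arbitrary assumptions):

```python
import torch

torch.manual_seed(0)
d_k = 512
query = torch.randn(d_k)
keys = torch.randn(6, d_k)

scores = keys @ query  # magnitude grows with d_k
unscaled = torch.softmax(scores, dim=-1)
scaled = torch.softmax(scores / d_k**0.5, dim=-1)

print(unscaled)  # sharply peaked
print(scaled)    # smoother distribution
```

Dividing by the square root of `d_k` always flattens the distribution, so the largest weight after scaling is smaller than before.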

Finally, we compute the context vector as the weighted sum of the value vectors:

context_vec_2 = attn_weights_2 @ values
print(context_vec_2)
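
The same steps can be applied to all queries at once. The self-contained sketch below computes every context vector in one pass and checks that the second row matches the single-query computation. It uses plain tensors instead of `nn.Parameter` for brevity, which is a simplification made for this example:

```python
import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],
     [0.55, 0.87, 0.66],
     [0.57, 0.85, 0.64],
     [0.22, 0.58, 0.33],
     [0.77, 0.25, 0.10],
     [0.05, 0.80, 0.55]]
)

torch.manual_seed(123)
d_in, d_out = 3, 2
W_query = torch.rand(d_in, d_out)
W_key = torch.rand(d_in, d_out)
W_value = torch.rand(d_in, d_out)

queries = inputs @ W_query
keys = inputs @ W_key
values = inputs @ W_value
d_k = keys.shape[-1]

# All queries at once: (6, 2) @ (2, 6) -> (6, 6)
attn_scores = queries @ keys.T
attn_weights = torch.softmax(attn_scores / d_k**0.5, dim=-1)
all_context_vecs = attn_weights @ values  # shape (6, 2)

# Single query for the second token only
attn_weights_2 = torch.softmax(queries[1] @ keys.T / d_k**0.5, dim=-1)
context_vec_2 = attn_weights_2 @ values

print(all_context_vecs[1])
print(context_vec_2)
```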

Implementing a compact self-attention Python class

import torch.nn as nn
class SelfAttention_v2(nn.Module):
    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)
        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        context_vec = attn_weights @ values
        return context_vec
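
One detail worth keeping in mind when switching from manually defined weight matrices to `nn.Linear`: the layer stores its weight in transposed form, with shape `(d_out, d_in)`, and computes `x @ weight.T`. A small sketch that checks this, using an arbitrary random input:

```python
import torch
import torch.nn as nn

torch.manual_seed(123)
d_in, d_out = 3, 2
linear = nn.Linear(d_in, d_out, bias=False)
x = torch.rand(6, d_in)

print(linear.weight.shape)  # (d_out, d_in), not (d_in, d_out)

# The layer's output equals a matmul with the transposed weight
manual = x @ linear.weight.T
print(torch.allclose(linear(x), manual))
```

This matters when copying weights from a `(d_in, d_out)` parameter matrix into an `nn.Linear` layer, because the matrix has to be transposed first.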

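Hiding future words with causal attention

For many LLM tasks, you want the self-attention mechanism to consider only the tokens that appear at or before the current position when predicting the next token in a sequence. Causal attention, also known as masked attention, is a specialized form of self-attention. It restricts the model to considering only the previous and current inputs in a sequence when it processes a given token and computes attention scores.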

We mask out the attention weights above the diagonal and normalize the unmasked attention weights so that the weights in each row sum to 1.

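Applying a causal attention mask

First, we compute the attention weights with softmax, as before. This reuses d_in, d_out, and the SelfAttention_v2 class defined earlier:

torch.manual_seed(789)  # seed assumed here for reproducibility
sa_v2 = SelfAttention_v2(d_in, d_out)
queries = sa_v2.W_query(inputs)
keys = sa_v2.W_key(inputs)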
attn_scores = queries @ keys.T
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
print(attn_weights)

> tensor([[0.1921, 0.1646, 0.1652, 0.1550, 0.1721, 0.1510],
        [0.2041, 0.1659, 0.1662, 0.1496, 0.1665, 0.1477],
        [0.2036, 0.1659, 0.1662, 0.1498, 0.1664, 0.1480],
        [0.1869, 0.1667, 0.1668, 0.1571, 0.1661, 0.1564],
        [0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.1585],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<SoftmaxBackward0>)

The second step uses PyTorch's tril function to create a mask in which the values above the diagonal are zero:

context_length = attn_scores.shape[0]
mask_simple = torch.tril(torch.ones(context_length, context_length))
print(mask_simple)

> tensor([[1., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1.]])

We multiply this mask element-wise with the attention weights to zero out the values above the diagonal:

masked_simple = attn_weights*mask_simple
print(masked_simple)

> tensor([[0.1921, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2041, 0.1659, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2036, 0.1659, 0.1662, 0.0000, 0.0000, 0.0000],
        [0.1869, 0.1667, 0.1668, 0.1571, 0.0000, 0.0000],
        [0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.0000],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<MulBackward0>)

We then renormalize the attention weights so that each row sums to 1 again:

row_sums = masked_simple.sum(dim=-1, keepdim=True)
masked_simple_norm = masked_simple / row_sums
print(masked_simple_norm)

> tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],
        [0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],
        [0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<DivBackward0>)

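A more efficient approach is to mask the attention scores with negative infinity before applying softmax, so that a single softmax call produces rows that already sum to 1: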

context_length = attn_scores.shape[0]
mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
masked = attn_scores.masked_fill(mask.bool(), -torch.inf)
attn_weights = torch.softmax(masked / keys.shape[-1] ** 0.5, dim=1)
print(attn_weights)

> tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],
        [0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],
        [0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<SoftmaxBackward0>)
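
The two masking strategies are mathematically equivalent: zeroing out the masked weights and renormalizing gives the same rows as filling the masked scores with negative infinity and applying softmax once. A minimal sketch that checks this on random stand-in scores (the random scores and the seed are assumptions made for this example):

```python
import torch

torch.manual_seed(0)
attn_scores = torch.randn(6, 6)  # stand-in scores, treated as already scaled
n = attn_scores.shape[0]

# Approach 1: softmax, zero out future positions, renormalize each row
weights = torch.softmax(attn_scores, dim=-1)
masked = weights * torch.tril(torch.ones(n, n))
renormalized = masked / masked.sum(dim=-1, keepdim=True)

# Approach 2: set future positions to -inf, then apply softmax once
future = torch.triu(torch.ones(n, n), diagonal=1).bool()
neg_inf_masked = torch.softmax(
    attn_scores.masked_fill(future, -torch.inf), dim=-1
)

print(torch.allclose(renormalized, neg_inf_masked, atol=1e-6))
```

The second approach works because softmax maps negative infinity to a probability of exactly zero.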

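Masking additional attention weights with dropout

Dropout in deep learning is a technique in which randomly selected hidden-layer units are ignored during training, effectively "dropping" them out. This helps prevent overfitting by ensuring that the model does not become overly reliant on any specific set of hidden-layer units. Note that dropout is used only during training and is disabled afterward.

torch.manual_seed(123)
dropout = torch.nn.Dropout(0.5)  # drop 50% of the attention weights
print(dropout(attn_weights))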
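
Two properties of `torch.nn.Dropout` are easy to verify on a matrix of ones: the surviving entries are scaled up by 1 / (1 - p) to keep the expected sum unchanged, and in evaluation mode the layer passes its input through untouched. A minimal sketch:

```python
import torch

torch.manual_seed(123)
dropout = torch.nn.Dropout(0.5)
example = torch.ones(6, 6)

dropout.train()  # training mode (the default for a new module)
dropped = dropout(example)
print(dropped)   # entries are either 0 or 2 (= 1 / (1 - 0.5))

dropout.eval()   # evaluation mode: dropout is a no-op
kept = dropout(example)
print(kept)      # identical to the input
```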

Implementing a compact causal attention class

class CausalAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length,
                dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)  # added dropout layer
        # Buffers move to the appropriate device (CPU or GPU) together with
        # the model, so we do not have to keep this tensor on the same
        # device as the model parameters by hand when training the LLM.
        self.register_buffer(
           'mask',
           torch.triu(torch.ones(context_length, context_length),
           diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        # Transpose dimensions 1 and 2, keeping the batch dimension first (0)
        attn_scores = queries @ keys.transpose(1, 2)
        # Operations with a trailing underscore run in place,
        # avoiding unnecessary memory copies
        attn_scores.masked_fill_(
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        attn_weights = self.dropout(attn_weights)

        context_vec = attn_weights @ values
        return context_vec
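
A quick way to convince yourself that the mask works is to change a future token and confirm that the outputs at earlier positions do not move. The sketch below does this with a hypothetical functional helper, `causal_attention`, that mirrors the forward pass above without dropout; the random batch and weights are assumptions made for this example:

```python
import torch

torch.manual_seed(123)
b, num_tokens, d_in, d_out = 2, 6, 3, 2
x = torch.rand(b, num_tokens, d_in)  # hypothetical batch of 2 sequences
W_q, W_k, W_v = (torch.rand(d_in, d_out) for _ in range(3))

def causal_attention(x):
    # Hypothetical helper mirroring the class's forward pass (no dropout)
    queries, keys, values = x @ W_q, x @ W_k, x @ W_v
    scores = queries @ keys.transpose(1, 2)  # (b, T, T)
    T = x.shape[1]
    mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
    scores = scores.masked_fill(mask, -torch.inf)
    weights = torch.softmax(scores / keys.shape[-1]**0.5, dim=-1)
    return weights @ values  # (b, T, d_out)

out = causal_attention(x)

# Overwrite only the last token and recompute
x_changed = x.clone()
x_changed[:, -1, :] = 0.0
out_changed = causal_attention(x_changed)

print(out.shape)
print(torch.allclose(out[:, :-1], out_changed[:, :-1]))  # earlier positions unchanged
```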

Extending single-head attention to multi-head attention

Stacking multiple single-head attention layers

In practice, implementing multi-head attention involves creating multiple instances of the self-attention mechanism, each with its own weights, and then combining their outputs.

class MultiHeadAttentionWrapper(nn.Module):
    def __init__(self, d_in, d_out, context_length,
                 dropout, num_heads, qkv_bias=False):
        super().__init__()
        self.heads = nn.ModuleList(
            [CausalAttention(
                 d_in, d_out, context_length, dropout, qkv_bias
             ) 
             for _ in range(num_heads)]
        )

    def forward(self, x):
        return torch.cat([head(x) for head in self.heads], dim=-1)
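
Because the head outputs are concatenated along the last dimension, the resulting context vectors have `num_heads * d_out` features. A minimal sketch with random stand-in tensors in place of real head outputs (an assumption made to keep the example short):

```python
import torch

torch.manual_seed(123)
b, num_tokens, d_out, num_heads = 2, 6, 2, 2

# Stand-ins for the per-head outputs, each of shape (b, num_tokens, d_out)
head_outputs = [torch.rand(b, num_tokens, d_out) for _ in range(num_heads)]

# Concatenate along the feature dimension
combined = torch.cat(head_outputs, dim=-1)
print(combined.shape)  # torch.Size([2, 6, 4])
```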

Implementing multi-head attention with weight splits

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, 
                 context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert (d_out % num_heads == 0), \
            "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        # Each head works on an equal slice of the output dimension
        self.head_dim = d_out // num_heads
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        # Linear layer that combines the head outputs
        self.out_proj = nn.Linear(d_out, d_out)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length),
                       diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        # Each projection has shape (b, num_tokens, d_out)
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        # Split the last dimension into heads:
        # (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(
            b, num_tokens, self.num_heads, self.head_dim
        )

        # (b, num_tokens, num_heads, head_dim)
        # -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Dot product per head: (b, num_heads, num_tokens, num_tokens)
        attn_scores = queries @ keys.transpose(2, 3)
        # Truncate the mask to the number of tokens in the input
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Fill the masked (future) positions with -inf, in place
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Back to shape (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2)
        # Merge the heads: d_out = num_heads * head_dim
        context_vec = context_vec.contiguous().view(
            b, num_tokens, self.d_out
        )
        # Output projection applied to the merged heads
        context_vec = self.out_proj(context_vec)
        return context_vec
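
The key trick in this class is that a single batched matrix multiplication over a four-dimensional tensor handles every head at once. The sketch below walks through the `view` and `transpose` steps on small random tensors (the sizes and the seed are arbitrary assumptions) and checks the batched result against a head-by-head computation:

```python
import torch

torch.manual_seed(123)
b, num_tokens, num_heads, head_dim = 1, 3, 2, 4
d_out = num_heads * head_dim

queries = torch.rand(b, num_tokens, d_out)
keys = torch.rand(b, num_tokens, d_out)

# Split the last dimension into heads, then move the head axis forward
q = queries.view(b, num_tokens, num_heads, head_dim).transpose(1, 2)
k = keys.view(b, num_tokens, num_heads, head_dim).transpose(1, 2)
print(q.shape)  # torch.Size([1, 2, 3, 4])

# One batched matmul covers every head at once
scores = q @ k.transpose(2, 3)  # (b, num_heads, num_tokens, num_tokens)

# Same result, computed one head at a time
first_head = q[0, 0] @ k[0, 0].T
second_head = q[0, 1] @ k[0, 1].T

print(torch.allclose(scores[0, 0], first_head))
print(torch.allclose(scores[0, 1], second_head))
```

Note that head 0 simply sees the first `head_dim` columns of the projected queries and keys, head 1 the next `head_dim` columns, and so on.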