STEP 11: The Attention Mechanism

🔴 1. The Problem with Seq2Seq (Review)

The Seq2Seq model from STEP 10 has one major weakness: an information bottleneck. Before looking at Attention, let's review that problem.

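1-1. What is the information bottleneck?

【Problem with conventional Seq2Seq】

Input: "The cat, which was very cute and fluffy,
        was sitting on the mat near the window."
        (a long sentence of 16 words)

Processing flow:
┌─────────────────────────────────────────┐
│ "The" → "cat" → "which" → … → "window"  │
│                  ↓                      │
│         Encoder (LSTM)                  │
│                  ↓                      │
│      Context Vector (512 dims)          │  ← the problem is here!
│                  ↓                      │
│         Decoder (LSTM)                  │
│                  ↓                      │
│         Translated sentence             │
└─────────────────────────────────────────┘

■ What is the problem?
・The meaning of 16 words is compressed into a single 512-dimensional vector
・Much of the information is lost
・Information from the beginning of the sentence is diluted the most

■ Result
Short sentences (around 5 words): good accuracy
Long sentences (around 50 words): accuracy drops sharply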

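1-2. A concrete example

【Example of the problem in translation】

Input (English):
"The keys to the car that I bought last week are on the table."

■ Conventional Seq2Seq
Context Vector = [0.2, -0.5, 0.8, …, 0.3]
                 (all information compressed into just 512 dimensions)

Problems:
・"keys" is the subject, but its verb "are" appears much later
・"car" matters mainly through its relation to "bought"
・The information in the opening "The keys" is diluted

Translation results (typical failures):
× "先週買った車の鍵がテーブルにある" (which noun is the subject becomes ambiguous)
× "車を買った鍵が…" ("the keys that bought the car…": the relation has collapsed)

■ Ideal translation process
Translating "鍵" → should focus on "keys"
Translating "車" → should focus on "car"
Translating "買った" → should focus on "bought"

→ Attention is what makes this possible!

⚠️ Limits of the Context Vector

The essence of the problem:

However long the sentence is, it is compressed into one fixed-length vector
The Decoder translates relying on that vector alone
Information about what sits where in the input is lost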

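💡 2. The Basic Idea of Attention

Attention is a mechanism that learns which parts of the input to focus on when generating each output word.

2-1. Comparison with how humans translate

🧠 How do humans translate?

A human translator does not read the input once, memorize all of it, and only then translate. They glance back at the relevant part of the source while writing.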

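【The human translation process】

English: "I love you"
Japanese: "私はあなたを愛しています"

■ Step 1: Write "私"
Human eye: looks at "I"
                     ^
→ confirms that "I" is "私"

■ Step 2: Write "は"
Human eye: a grammatical particle (no direct English counterpart)
→ does not look at any particular word

■ Step 3: Write "あなた"
Human eye: looks at "you"
                     ^^^
→ confirms that "you" is "あなた"

■ Step 4: Write "を"
Human eye: a grammatical particle
→ does not look at any particular word

■ Step 5: Write "愛しています"
Human eye: looks at "love"
                     ^^^^
→ confirms that "love" is "愛する"

【The core of Attention】
Attention imitates this human process:
・When generating each output word, it "attends" to a specific part of the input
・The attended position changes dynamically
・There is no need to memorize the whole input at once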

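2-2. How Attention works (overview)

【The basic mechanism】

Input: "The   cat   is   on   the   mat"
        ↓     ↓     ↓    ↓    ↓     ↓
        h1    h2    h3   h4   h5    h6  (Encoder state at each position)

■ Conventional Seq2Seq
The Decoder can use only h6 (the final state)
→ h1 to h5 are not directly available

■ Seq2Seq with Attention
The Decoder can access all of h1, h2, h3, h4, h5, h6!
→ It can pull out the information it needs directly

【What happens while generating output】

Time 1: generate "その"
        Relevance to each state:
        h1("The"):  0.8  ← strong focus!
        h2("cat"):  0.1
        h3("is"):   0.0
        h4("on"):   0.0
        h5("the"):  0.1
        h6("mat"):  0.0
        Total:      1.0

        → "その" is generated mainly from the information in "The"

Time 2: generate "猫"
        Relevance to each state:
        h1("The"):  0.0
        h2("cat"):  0.9  ← strong focus!
        h3("is"):   0.0
        h4("on"):   0.0
        h5("the"):  0.0
        h6("mat"):  0.1
        Total:      1.0

        → "猫" is generated mainly from the information in "cat"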

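2-3. Differences from the conventional approach

Aspect	Conventional (Context Vector)	Attention
Information source	Final Encoder state only	All Encoder states
Compression	Everything into one vector	All states are kept
Focus	Whole input, undifferentiated	Selective focus on relevant parts
Dynamics	Fixed (same context at every step)	Changes at every step
Long sentences	Information is lost	Information is retained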

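🔢 3. How Attention Weights Are Computed

Attention is computed in three steps. Understanding these steps is the key to understanding Attention.

3-1. Notation

【Meaning of the symbols】

■ Encoder outputs
h₁, h₂, …, hₙ: the Encoder's hidden state at each time step
                 (one per input word)

Example: for "I love you"
h₁: state after processing "I"
h₂: state after processing "love"
h₃: state after processing "you"

■ Decoder state
s_t: the Decoder's hidden state at time t
     (corresponds to the word currently being generated)

Example: while generating "私はあなたを愛しています"
s₁: state when about to generate "私"
s₂: state when about to generate "は"
…

■ Attention outputs
α_ti: the Attention Weight on input word i at time t
c_t: the Context Vector at time t (after applying Attention)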

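3-2. Step 1: Compute the scores

📊 Score

Compute how strongly the Decoder's current state is related to each Encoder state.

【Example of score computation】

Situation: about to generate "猫" (state s₂)

Input: "The cat is on the mat"
        h₁   h₂  h₃  h₄  h₅   h₆

■ Relevance (score) to each input
score(s₂, h₁) = 0.5   ← relevance to "The" (low)
score(s₂, h₂) = 3.8   ← relevance to "cat" (high!)
score(s₂, h₃) = 0.2   ← relevance to "is" (low)
score(s₂, h₄) = 0.1   ← relevance to "on" (low)
score(s₂, h₅) = 0.3   ← relevance to "the" (low)
score(s₂, h₆) = 1.5   ← relevance to "mat" (somewhat high)

→ "cat" has the highest score
→ The model judges "cat" to be important for generating "猫"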

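3-3. Step 2: Compute the Attention Weights

📊 Attention Weight (α)

Normalize the scores with Softmax so that they form a probability distribution summing to 1.0.

【Normalization with Softmax】

■ Scores (result of Step 1)
score = [0.5, 3.8, 0.2, 0.1, 0.3, 1.5]

■ Softmax
α_ti = exp(score_i) / Σ_j exp(score_j)

exp(0.5) = 1.65
exp(3.8) = 44.70  ← largest
exp(0.2) = 1.22
exp(0.1) = 1.11
exp(0.3) = 1.35
exp(1.5) = 4.48
Total = 54.51

■ Attention Weights
α = [1.65/54.51, 44.70/54.51, 1.22/54.51, …]
  = [0.03,       0.82,        0.02,     0.02, 0.02, 0.08]
      ↑          ↑
    "The"      82% of the attention goes to "cat"!

■ Check: total ≈ 1.0
0.03 + 0.82 + 0.02 + 0.02 + 0.02 + 0.08 = 0.99
(The unrounded weights sum to exactly 1.0; the 0.01 gap comes from rounding to two decimals.)

【Important】
・Attention Weights form a probability distribution
・They always sum to 1.0
・A high value means strong focus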
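
The Softmax step above can be checked with a few lines of plain Python. This is a minimal sketch using only the standard library; the helper name `softmax` is chosen here for illustration and is not part of the course code.

```python
import math

def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    # Subtracting the max leaves the result unchanged but avoids overflow in exp().
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.5, 3.8, 0.2, 0.1, 0.3, 1.5]  # example scores from Step 1
weights = softmax(scores)

print([round(w, 2) for w in weights])  # [0.03, 0.82, 0.02, 0.02, 0.02, 0.08]
print(round(sum(weights), 6))          # 1.0
```

The unrounded weights sum to 1.0; only the two-decimal display values fall slightly short.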

3-4. Step 3: Compute the Context Vector

📊 Context Vector (c_t)

Compute the average of the Encoder states, weighted by the Attention Weights.

【Computing the Context Vector】

■ Attention Weights
α = [0.03, 0.82, 0.02, 0.02, 0.02, 0.08]

■ Encoder states (example: 3 dimensions)
h₁ = [0.1, 0.5, 0.3]  ← "The"
h₂ = [0.8, 0.2, 0.9]  ← "cat"
h₃ = [0.4, 0.6, 0.1]  ← "is"
h₄ = [0.2, 0.3, 0.4]  ← "on"
h₅ = [0.1, 0.4, 0.2]  ← "the"
h₆ = [0.6, 0.1, 0.8]  ← "mat"

■ Context Vector
c_t = Σ_i α_ti × h_i

c_t = 0.03×[0.1, 0.5, 0.3]  ← contribution of "The" (3%)
    + 0.82×[0.8, 0.2, 0.9]  ← contribution of "cat" (82%)
    + 0.02×[0.4, 0.6, 0.1]  ← contribution of "is" (2%)
    + 0.02×[0.2, 0.3, 0.4]  ← contribution of "on" (2%)
    + 0.02×[0.1, 0.4, 0.2]  ← contribution of "the" (2%)
    + 0.08×[0.6, 0.1, 0.8]  ← contribution of "mat" (8%)

c_t = [0.003, 0.015, 0.009]
    + [0.656, 0.164, 0.738]  ← largest contribution!
    + [0.008, 0.012, 0.002]
    + [0.004, 0.006, 0.008]
    + [0.002, 0.008, 0.004]
    + [0.048, 0.008, 0.064]

c_t ≈ [0.72, 0.21, 0.83]
       ↑
    close to h₂ = [0.8, 0.2, 0.9]: the information in "cat" dominates!

【Result】
82% of the weight in the Context Vector comes from "cat"
→ "猫" can be generated accurately
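
The weighted average can be recomputed directly from the example numbers. The sketch below uses the two-decimal weights and the six 3-dimensional example states; `context_vector` is an illustrative helper name, not something defined elsewhere in this step.

```python
def context_vector(weights, states):
    """Weighted sum of encoder states: c_t = sum_i alpha_ti * h_i."""
    dim = len(states[0])
    return [sum(w * h[d] for w, h in zip(weights, states)) for d in range(dim)]

alpha = [0.03, 0.82, 0.02, 0.02, 0.02, 0.08]  # rounded Attention Weights
H = [
    [0.1, 0.5, 0.3],  # "The"
    [0.8, 0.2, 0.9],  # "cat"
    [0.4, 0.6, 0.1],  # "is"
    [0.2, 0.3, 0.4],  # "on"
    [0.1, 0.4, 0.2],  # "the"
    [0.6, 0.1, 0.8],  # "mat"
]

c = context_vector(alpha, H)
print([round(x, 3) for x in c])  # [0.721, 0.213, 0.825]
```

The result sits close to the state for "cat", [0.8, 0.2, 0.9], which is what an 82% weight on that state should produce.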

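3-5. Summary of the three steps

💡 The three steps of Attention

Step 1: Compute the scores

score(s_t, h_i) = "how related s_t and h_i are"

Step 2: Normalize with Softmax

α_ti = exp(score(s_t, h_i)) / Σ_j exp(score(s_t, h_j)) → a probability distribution summing to 1.0

Step 3: Weighted average

c_t = Σ_i α_ti × h_i → a Context Vector containing the attended information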
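
Put together, the three steps fit into one small function. This is a sketch under stated assumptions: it uses a plain dot product as the score (one of several possible score functions), and the 2-dimensional vectors are hypothetical values chosen only to make the behaviour easy to see.

```python
import math

def attention(s_t, encoder_states):
    """Return (context vector, attention weights) for one decoder step."""
    # Step 1: score each encoder state against the decoder state (dot product).
    scores = [sum(a * b for a, b in zip(s_t, h)) for h in encoder_states]

    # Step 2: Softmax turns the scores into weights that sum to 1.
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]

    # Step 3: the context vector is the weighted average of the encoder states.
    dim = len(encoder_states[0])
    c_t = [sum(a * h[d] for a, h in zip(alpha, encoder_states)) for d in range(dim)]
    return c_t, alpha

s_t = [1.0, 0.0]                                         # hypothetical decoder state
encoder_states = [[0.1, 0.9], [2.0, 0.1], [0.3, 0.3]]   # hypothetical h_1..h_3
c_t, alpha = attention(s_t, encoder_states)

print([round(a, 2) for a in alpha])
print([round(x, 2) for x in c_t])
```

The second encoder state points in the same direction as `s_t`, so it receives the largest weight and dominates the context vector.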

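📐 4. Types of Score Function (Bahdanau vs Luong)

There are two main families of score function for Step 1.

4-1. Bahdanau Attention (Additive Attention)

📊 Bahdanau Attention (2015)

Also known as: Additive Attention

【Score function of Bahdanau Attention】

score(s_t, h_i) = v^T × tanh(W₁ × s_t + W₂ × h_i)

■ Meaning of each element
・s_t: the Decoder's current hidden state (the original paper scores against the previous state s_{t-1}; this step writes s_t for simplicity)
・h_i: the i-th hidden state of the Encoder
・W₁: weight matrix that transforms s_t (learned)
・W₂: weight matrix that transforms h_i (learned)
・v: vector that reduces the result to a scalar (learned)
・tanh: activation function

■ Computation (example: hidden dimension = 3)
s_t = [1.0, 2.0, 3.0]  (Decoder state)
h_i = [0.5, 1.5, 2.5]  (Encoder state)

Step 1: Linear transformations
W₁ × s_t = [2.0, 3.0, 4.0]
W₂ × h_i = [1.0, 2.5, 3.5]

Step 2: Addition (← the origin of the name "Additive")
sum = [2.0+1.0, 3.0+2.5, 4.0+3.5]
    = [3.0, 5.5, 7.5]

Step 3: tanh
tanh([3.0, 5.5, 7.5]) = [0.995, 1.000, 1.000]

Step 4: Inner product with v^T
score = v^T × [0.995, 1.000, 1.000]
      = 2.5 (example; the value depends on v)

■ Characteristics
・s_t and h_i are transformed separately and then added
・Parameters: W₁, W₂, v
・Somewhat more complex, but expressive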
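
The additive score can be traced in plain Python. In the sketch below, `W1`, `W2`, and `v` are hypothetical values picked so that the intermediate results match the worked example (W₁ × s_t = [2.0, 3.0, 4.0], W₂ × h_i = [1.0, 2.5, 3.5]); in a real model they are learned.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def additive_score(s_t, h_i, W1, W2, v):
    """score = v^T tanh(W1 s_t + W2 h_i)"""
    summed = [a + b for a, b in zip(matvec(W1, s_t), matvec(W2, h_i))]
    return sum(vi * math.tanh(x) for vi, x in zip(v, summed))

s_t = [1.0, 2.0, 3.0]
h_i = [0.5, 1.5, 2.5]

# Hypothetical parameters (assumptions for illustration only).
W1 = [[2.0, 0.0, 0.0], [0.0, 1.5, 0.0], [1.0, 0.0, 1.0]]
W2 = [[2.0, 0.0, 0.0], [0.0, 0.0, 1.0], [2.0, 0.0, 1.0]]
v = [1.0, 1.0, 0.5]

print(matvec(W1, s_t))  # [2.0, 3.0, 4.0]
print(matvec(W2, h_i))  # [1.0, 2.5, 3.5]
score = additive_score(s_t, h_i, W1, W2, v)
print(round(score, 2))
```

Because tanh saturates near 1 for inputs this large, the score here is almost entirely determined by the entries of `v`.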

4-2. Luong Attention (Multiplicative Attention)

📊 Luong Attention (2015)

Also known as: Multiplicative Attention

【Score functions of Luong Attention (3 variants)】

■ 1. Dot (inner product): the simplest
score(s_t, h_i) = s_t^T × h_i

Example:
s_t = [1.0, 2.0, 3.0]
h_i = [0.5, 1.5, 2.5]

score = 1.0×0.5 + 2.0×1.5 + 3.0×2.5
      = 0.5 + 3.0 + 7.5
      = 11.0

Characteristics: no parameters, very fast

■ 2. General: a commonly used variant
score(s_t, h_i) = s_t^T × W × h_i

Example:
W = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6],
     [0.7, 0.8, 0.9]]

W × h_i = [0.1×0.5 + 0.2×1.5 + 0.3×2.5,
           0.4×0.5 + 0.5×1.5 + 0.6×2.5,
           0.7×0.5 + 0.8×1.5 + 0.9×2.5]
        = [1.10, 2.45, 3.80]
score = s_t^T × [1.10, 2.45, 3.80]
      = 1.0×1.10 + 2.0×2.45 + 3.0×3.80
      = 17.4

Characteristics: the learnable W adjusts how relevance is measured

■ 3. Concat: close to Bahdanau
score(s_t, h_i) = v^T × tanh(W × [s_t; h_i])

[s_t; h_i]: the concatenation of s_t and h_i
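
The Dot and General scores are easy to recompute by hand or in code. The following sketch evaluates both for the example vectors and the example matrix W, using plain Python lists; the helper names `dot` and `matvec` are illustrative.

```python
def dot(a, b):
    """Inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [dot(row, x) for row in W]

s_t = [1.0, 2.0, 3.0]
h_i = [0.5, 1.5, 2.5]
W = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6],
     [0.7, 0.8, 0.9]]

dot_score = dot(s_t, h_i)                 # Dot: s_t^T h_i
Wh = matvec(W, h_i)                       # intermediate W h_i
general_score = dot(s_t, Wh)              # General: s_t^T W h_i

print(dot_score)                          # 11.0
print([round(x, 2) for x in Wh])          # [1.1, 2.45, 3.8]
print(round(general_score, 2))            # 17.4
```

With an identity matrix for W, the General score reduces to the Dot score, which is one way to see Dot as a parameter-free special case.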

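4-3. Bahdanau vs Luong

Aspect	Bahdanau	Luong (General)
Also known as	Additive	Multiplicative
Formula	v^T × tanh(W₁s + W₂h)	s^T × W × h
Complexity	Somewhat complex	Simple
Speed	Somewhat slower	Fast
Accuracy	Comparable	Comparable
Suggested use in this course	Understanding the idea	Implementation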

💻 5. Implementing Attention in PyTorch

This section implements the Attention computation in PyTorch, one step at a time.

5-1. Implementing Luong Attention (General)

Step 1: Import the required libraries

import torch
import torch.nn as nn
import torch.nn.functional as F

# torch: core PyTorch functionality
# torch.nn: neural network modules
# torch.nn.functional: functions such as softmax

Step 2: Define the Attention class

class LuongAttention(nn.Module):
    """Luong Attention (General)"""

    def __init__(self, hidden_dim):
        super(LuongAttention, self).__init__()

        # hidden_dim: dimension of the hidden state (e.g. 512)

        # W matrix: weights used for the score
        # score = s^T × W × h
        self.W = nn.Linear(hidden_dim, hidden_dim, bias=False)
        # nn.Linear(input dim, output dim, no bias)
        # → creates a hidden_dim × hidden_dim matrix
Step 3: Implement the forward method

    def forward(self, decoder_hidden, encoder_outputs):
        """
        Compute Attention.

        Args:
            decoder_hidden: (batch_size, hidden_dim)
                → current hidden state of the Decoder
            encoder_outputs: (batch_size, src_length, hidden_dim)
                → all hidden states of the Encoder

        Returns:
            context: (batch_size, hidden_dim)
                → Context Vector after applying Attention
            attention_weights: (batch_size, src_length)
                → attention on each input word (for visualization)
        """

        # Step 1: compute the scores
        # Transform decoder_hidden: (batch_size, hidden_dim)
        transformed = self.W(decoder_hidden)

        # Score = inner product with each encoder output
        # (batch_size, 1, hidden_dim) × (batch_size, hidden_dim, src_length)
        attention_scores = torch.bmm(
            transformed.unsqueeze(1),           # (batch, 1, hidden)
            encoder_outputs.transpose(1, 2)     # (batch, hidden, src_len)
        ).squeeze(1)  # (batch_size, src_length)

        # Step 2: normalize with Softmax
        attention_weights = F.softmax(attention_scores, dim=1)
        # attention_weights: (batch_size, src_length)
        # a probability distribution that sums to 1.0 over src_length

        # Step 3: compute the Context Vector (weighted average)
        # (batch_size, 1, src_length) × (batch_size, src_length, hidden_dim)
        context = torch.bmm(
            attention_weights.unsqueeze(1),     # (batch, 1, src_len)
            encoder_outputs                      # (batch, src_len, hidden)
        ).squeeze(1)  # (batch_size, hidden_dim)

        return context, attention_weights

5-2. Complete code and test

import torch
import torch.nn as nn
import torch.nn.functional as F

class LuongAttention(nn.Module):
    """Luong Attention (General)"""

    def __init__(self, hidden_dim):
        super(LuongAttention, self).__init__()
        self.W = nn.Linear(hidden_dim, hidden_dim, bias=False)

    def forward(self, decoder_hidden, encoder_outputs):
        # Step 1: compute the scores
        transformed = self.W(decoder_hidden)
        attention_scores = torch.bmm(
            transformed.unsqueeze(1),
            encoder_outputs.transpose(1, 2)
        ).squeeze(1)

        # Step 2: normalize with Softmax
        attention_weights = F.softmax(attention_scores, dim=1)

        # Step 3: compute the Context Vector
        context = torch.bmm(
            attention_weights.unsqueeze(1),
            encoder_outputs
        ).squeeze(1)

        return context, attention_weights

# Test
hidden_dim = 512
attention = LuongAttention(hidden_dim)

# Dummy data
batch_size = 4
src_length = 6  # "The cat is on the mat" (6 words)
decoder_hidden = torch.randn(batch_size, hidden_dim)
encoder_outputs = torch.randn(batch_size, src_length, hidden_dim)

# Compute Attention
context, weights = attention(decoder_hidden, encoder_outputs)

print(f"Context Vector shape: {context.shape}")
print(f"Attention Weights shape: {weights.shape}")
print(f"Attention Weights sum: {weights[0].sum().item():.4f}")
print("\nAttention Weights of the first sample:")
print(weights[0].detach().numpy().round(3))

Output (the individual weight values differ from run to run because the inputs and W are random):

Context Vector shape: torch.Size([4, 512])
Attention Weights shape: torch.Size([4, 6])
Attention Weights sum: 1.0000

Attention Weights of the first sample:
[0.152 0.089 0.312 0.098 0.201 0.148]

5-3. Implementing Bahdanau Attention

class BahdanauAttention(nn.Module):
    """Bahdanau Attention (Additive)"""

    def __init__(self, hidden_dim):
        super(BahdanauAttention, self).__init__()

        # W₁: transforms the Decoder state
        self.W1 = nn.Linear(hidden_dim, hidden_dim)
        # W₂: transforms the Encoder states
        self.W2 = nn.Linear(hidden_dim, hidden_dim)
        # v: reduces each position to a scalar score
        # (no bias: a constant added to every score is cancelled by Softmax)
        self.V = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, decoder_hidden, encoder_outputs):
        src_length = encoder_outputs.shape[1]

        # Expand decoder_hidden and repeat it along the source length
        # (batch_size, hidden_dim) → (batch_size, src_length, hidden_dim)
        decoder_hidden = decoder_hidden.unsqueeze(1).repeat(1, src_length, 1)

        # Score: v^T × tanh(W₁×s + W₂×h)
        energy = torch.tanh(
            self.W1(decoder_hidden) + self.W2(encoder_outputs)
        )
        attention_scores = self.V(energy).squeeze(2)  # (batch_size, src_length)

        # Softmax
        attention_weights = F.softmax(attention_scores, dim=1)

        # Context Vector
        context = torch.bmm(
            attention_weights.unsqueeze(1),
            encoder_outputs
        ).squeeze(1)

        return context, attention_weights

# Test (reuses hidden_dim, decoder_hidden, encoder_outputs from 5-2)
bahdanau = BahdanauAttention(hidden_dim)
context_b, weights_b = bahdanau(decoder_hidden, encoder_outputs)
print(f"Bahdanau Context shape: {context_b.shape}")
print(f"Bahdanau Weights sum: {weights_b[0].sum().item():.4f}")

Output:

Bahdanau Context shape: torch.Size([4, 512])
Bahdanau Weights sum: 1.0000

🔄 6. Seq2Seq Model with Attention

This section plugs Attention into the Seq2Seq model.

6-1. Implementing a Decoder with Attention

class AttentionDecoder(nn.Module):
    """Decoder with Attention"""

    def __init__(self, output_dim, embedding_dim, hidden_dim, n_layers=1, dropout=0.5):
        super(AttentionDecoder, self).__init__()

        self.output_dim = output_dim
        self.hidden_dim = hidden_dim

        # Embedding layer
        self.embedding = nn.Embedding(output_dim, embedding_dim)

        # Attention layer
        self.attention = LuongAttention(hidden_dim)

        # LSTM layer
        # Input: embedding + context (the result of Attention)
        self.lstm = nn.LSTM(
            embedding_dim + hidden_dim,  # +hidden_dim because the context is concatenated
            hidden_dim,
            n_layers,
            dropout=dropout if n_layers > 1 else 0,
            batch_first=True
        )

        # Output layer
        self.fc_out = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden, cell, encoder_outputs):
        """
        Generate one word.

        Args:
            input: current input word (batch_size, 1)
            hidden: previous hidden state (n_layers, batch_size, hidden_dim)
            cell: previous cell state (n_layers, batch_size, hidden_dim)
            encoder_outputs: all Encoder outputs (batch_size, src_length, hidden_dim)
        """
        # Embedding
        embedded = self.dropout(self.embedding(input))
        # (batch_size, 1, embedding_dim)

        # Attention (uses the hidden state of the top layer)
        decoder_state = hidden[-1]  # (batch_size, hidden_dim)
        context, attention_weights = self.attention(decoder_state, encoder_outputs)
        # context: (batch_size, hidden_dim)

        # Concatenate embedded and context
        context = context.unsqueeze(1)  # (batch_size, 1, hidden_dim)
        lstm_input = torch.cat([embedded, context], dim=2)
        # (batch_size, 1, embedding_dim + hidden_dim)

        # LSTM
        output, (hidden, cell) = self.lstm(lstm_input, (hidden, cell))

        # Prediction
        prediction = self.fc_out(output.squeeze(1))
        # (batch_size, output_dim)

        return prediction, hidden, cell, attention_weights

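6-2. Checking the effect

💡 The effect of Attention

The BLEU ranges below are rough, illustrative figures; actual scores depend on the dataset and the model.

Conventional Seq2Seq:

Short sentences (10 words or fewer): BLEU 25 to 30
Long sentences (50 words or more): BLEU 10 to 15

Seq2Seq with Attention:

Short sentences: BLEU 30 to 35 (about +5 points)
Long sentences: BLEU 25 to 30 (about +15 points!)

The improvement is largest on long sentences.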

📊 7. Visualizing Attention Weights

Visualizing the Attention Weights lets you see where the model is focusing.

7-1. Visualization code

import matplotlib.pyplot as plt
import numpy as np
import torch

# Japanese font setup (for Google Colab)
# !pip install japanize-matplotlib
# import japanize_matplotlib

def visualize_attention(src_words, trg_words, attention_weights):
    """
    Visualize Attention Weights as a heatmap.

    Args:
        src_words: list of input words (e.g. ["I", "love", "you"])
        trg_words: list of output words (e.g. ["私", "は", "愛し", "ます"])
        attention_weights: tensor or array of shape (trg_length, src_length)
    """
    # Convert to NumPy
    if torch.is_tensor(attention_weights):
        attention_weights = attention_weights.cpu().detach().numpy()

    # Create the heatmap
    fig, ax = plt.subplots(figsize=(8, 6))

    im = ax.imshow(attention_weights, cmap='YlOrRd')

    # Tick labels
    ax.set_xticks(range(len(src_words)))
    ax.set_yticks(range(len(trg_words)))
    ax.set_xticklabels(src_words)
    ax.set_yticklabels(trg_words)

    # Axis descriptions
    ax.set_xlabel('Input (English)')
    ax.set_ylabel('Output (Japanese)')
    ax.set_title('Attention Weights')

    # Colorbar
    plt.colorbar(im)

    # Print the value in each cell
    for i in range(len(trg_words)):
        for j in range(len(src_words)):
            value = attention_weights[i, j]
            color = 'white' if value > 0.5 else 'black'
            ax.text(j, i, f'{value:.2f}', ha='center', va='center', color=color)

    plt.tight_layout()
    plt.show()

# Test with dummy data
src_words = ['I', 'love', 'you', '<eos>']
trg_words = ['私', 'は', 'あなた', 'を', '愛します', '<eos>']

# Idealized Attention Weights (hand-written example)
attention_weights = np.array([
    [0.85, 0.05, 0.05, 0.05],  # "私" → "I"
    [0.10, 0.10, 0.10, 0.70],  # "は" → grammar
    [0.05, 0.05, 0.85, 0.05],  # "あなた" → "you"
    [0.10, 0.10, 0.10, 0.70],  # "を" → grammar
    [0.05, 0.85, 0.05, 0.05],  # "愛します" → "love"
    [0.05, 0.05, 0.05, 0.85],  # "<eos>" → "<eos>"
])

visualize_attention(src_words, trg_words, attention_weights)

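7-2. What the visualization tells us

【How to read the heatmap】

Vertical axis: output sentence (Japanese)
Horizontal axis: input sentence (English)

Colors (YlOrRd colormap):
・Dark red: strong focus (high Attention Weight)
・Pale yellow: little focus (low weight)

【Interpreting the example】

Generating "私" → attends to "I" with 0.85
→ Correct! "私" corresponds to "I"

Generating "あなた" → attends to "you" with 0.85
→ Correct! "あなた" corresponds to "you"

Generating "愛します" → attends to "love" with 0.85
→ Correct! "愛します" corresponds to "love"

Generating "は" and "を" → 0.70 on "<eos>", with the remaining 0.30 spread evenly over the content words
→ This is also reasonable: Japanese particles have no direct English counterpart, so no content word is singled out

【Key takeaways】
(The matrix above is hand-written ideal data; trained models typically show a similar but noisier pattern.)
・The model learns the translation alignment automatically
・The attention pattern matches human intuition
・Word correspondences are learned without being taught explicitly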
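
One simple way to read an alignment off the heatmap is to take, for each output word, the input word with the highest weight. The sketch below does this for the hand-written example matrix; `hard_alignment` is a hypothetical helper name introduced here, and picking the argmax is only one possible reading of the weights.

```python
def hard_alignment(src_words, trg_words, attention):
    """For each target word, return (target, most-attended source word, weight)."""
    pairs = []
    for trg, row in zip(trg_words, attention):
        best = max(range(len(row)), key=lambda j: row[j])  # index of the largest weight
        pairs.append((trg, src_words[best], row[best]))
    return pairs

src_words = ['I', 'love', 'you', '<eos>']
trg_words = ['私', 'は', 'あなた', 'を', '愛します', '<eos>']
attention = [
    [0.85, 0.05, 0.05, 0.05],
    [0.10, 0.10, 0.10, 0.70],
    [0.05, 0.05, 0.85, 0.05],
    [0.10, 0.10, 0.10, 0.70],
    [0.05, 0.85, 0.05, 0.05],
    [0.05, 0.05, 0.05, 0.85],
]

alignment = hard_alignment(src_words, trg_words, attention)
for trg, src, weight in alignment:
    print(f"{trg} -> {src} ({weight:.2f})")
```

In this example the particles "は" and "を" map to "<eos>" rather than to a content word, which matches the rows marked as grammar in the matrix.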

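🚀 8. A Preview of the Transformer

The Attention learned in this step leads directly to the Transformer, the foundation of modern NLP.

8-1. The history of Attention

【The evolution of NLP techniques】

■ 2014: Seq2Seq appears
・Encoder-Decoder architecture
・Great success in machine translation
・Problem: the information bottleneck

        ↓

■ 2015: Attention appears (this step)
・Bahdanau Attention
・Luong Attention
・Addresses the information bottleneck
・Translation accuracy improves substantially

        ↓

■ 2017: The Transformer appears
・Paper: "Attention Is All You Need"
・Innovation: RNN/LSTM removed entirely
・Sequences are processed with Attention alone
・Parallel processing becomes possible
・The standard architecture of modern NLP

        ↓

■ 2018 onward: the era of pretrained models
・BERT (2018)
・GPT-2 (2019)
・GPT-3 (2020)
・ChatGPT (2022)
・All based on the Transformer!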

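8-2. From RNN + Attention to the Transformer

💡 The Transformer's key idea

"Can we process sequences with Attention alone, without an RNN?"

Problems with RNN + Attention:

An RNN processes tokens sequentially (t=1 → t=2 → t=3 → …)
Time steps cannot be computed in parallel → the GPU is underused
Vanishing gradients remain an issue on long sequences

The Transformer's solution:

Self-Attention: computes the relevance between all word pairs at once
Fully parallelizable across positions
Captures long-range dependencies directly
Faster to train than RNN-based models, and more accurate on the translation benchmarks in the original paper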
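
As a small preview, the "all word pairs at once" idea can be sketched in plain Python. This is a deliberately simplified version: it skips the learned Query/Key/Value projections covered in later steps and uses the word vectors themselves, with hypothetical 2-dimensional values. The division by the square root of the dimension follows the scaled dot-product form used in the Transformer paper.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """X: list of n word vectors of length d. Returns (n x n weights, n outputs)."""
    d = len(X[0])
    weights = []
    for q in X:  # each word is compared against every word, including itself
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights.append(softmax(scores))
    # Each output is a weighted average of all word vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, X)) for j in range(d)]
        for row in weights
    ]
    return weights, outputs

X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # hypothetical vectors for 3 words
weights, outputs = self_attention(X)
for row in weights:
    print([round(w, 2) for w in row])
```

The loop here is only for readability. No row depends on another, so in practice the whole weight matrix is computed as one matrix product, which is what makes the computation parallel.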

8-3. What the next Part covers

STEP	Content
STEP 12	Overview of the Transformer, "Attention Is All You Need"
STEP 13	Self-Attention, Query/Key/Value, Multi-Head
STEP 14	Position Encoding, Feed-Forward Network
STEP 15	Implementing the full Transformer

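📝 Exercises

Check what you learned in this step.

Question 1: The purpose of Attention

What is the main purpose of the Attention mechanism?

a. Reduce the number of model parameters
b. Speed up training
c. Resolve the information bottleneck
d. Reduce memory usage

Answer: c

The main purpose of Attention is to resolve the information bottleneck.

Conventional Seq2Seq:

Compresses the whole input sentence into one Context Vector
Loses information on long sentences

Attention:

Keeps all Encoder states
Selectively attends to the relevant part at each step
Avoids forcing everything through a single vector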

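Question 2: Properties of Attention Weights

Which condition must Attention Weights always satisfy?

a. Every value is 1.0 or greater
b. They sum to 1.0 (a probability distribution)
c. The maximum value is always 1.0
d. Values can be negative

Answer: b

Attention Weights must sum to 1.0.

Reason: they are normalized by the Softmax function.

How Attention Weights are computed:

Score: score(s_t, h_i)
Softmax: α_ti = exp(score_i) / Σ_j exp(score_j)

Properties of Softmax:

Every value lies between 0 and 1
The values always sum to 1.0
The result can be read as a probability distribution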

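Question 3: Bahdanau vs Luong

Which of the following correctly describes Luong Attention?

a. It is more complex than Bahdanau
b. It is Additive
c. It is Multiplicative
d. It is always more accurate than Bahdanau

Answer: c

Luong Attention is Multiplicative.

Comparison:

Bahdanau: v^T × tanh(W₁s + W₂h) → additive
Luong: s^T × W × h → multiplicative

Each option:

a: Luong is the simpler one ❌
b: Additive describes Bahdanau ❌
c: Multiplicative ✅
d: Accuracy is roughly the same ❌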

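Question 4: From Attention to the Transformer

What was the innovative idea of the Transformer (2017)?

a. It added Attention
b. It was built with Attention alone, without an RNN
c. It improved the LSTM
d. It enlarged the Context Vector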