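🔀 STEP 24: Introduction to Bidirectional RNNs & Attention

Learn from both the past and the future, and learn where to focus.

📋 What you will learn in this step

What a bidirectional RNN is
Why processing in both directions helps
Implementing bidirectional RNNs in Keras
The basic idea behind the attention mechanism
The concept of self-attention
Hands-on sentiment analysis of IMDB reviews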

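🔀 1. What Is a Bidirectional RNN?

1-1. Limits of a standard RNN

A standard RNN processes a sequence in one direction only, from left to right (past to future).
In many tasks, however, the information that comes later in the sequence matters just as much.

📖 A fill-in-the-blank analogy

"Yesterday I had ＿＿＿ for lunch"

Reading only from the left:
"Yesterday I had" → had what? There is almost nothing to go on.

Reading from the right as well:
"for lunch" → it must be food!

Combining both:
"had" + "for lunch" → a word for a meal (ramen, sushi, and so on)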

1-2. How a bidirectional RNN works

[Structure of a bidirectional RNN]

■ Standard RNN (unidirectional)
x₁ → x₂ → x₃ → x₄
 ↓    ↓    ↓    ↓
h₁ → h₂ → h₃ → h₄ → output


■ Bidirectional RNN

Forward: past → future
x₁ → x₂ → x₃ → x₄
 ↓    ↓    ↓    ↓
h₁→ → h₂→ → h₃→ → h₄→
                         ↘
                          → concatenate → output
                         ↗
h₁← ← h₂← ← h₃← ← h₄←
 ↑    ↑    ↑    ↑
x₁ ← x₂ ← x₃ ← x₄
Backward: future → past

Output at each time step = [h→; h←] (the forward and backward hidden states, concatenated)
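
The structure above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the hidden state is a single number, the helper names `rnn_pass` and `bidirectional_pass` are made up for this example, and the weights are arbitrary constants rather than learned values. A real layer uses weight matrices and vector states.

```python
import math

def rnn_pass(xs, w_x, w_h):
    """Run a tiny scalar tanh RNN over xs and return the hidden state at every step."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def bidirectional_pass(xs, fwd=(0.5, 0.8), bwd=(0.3, 0.6)):
    """Return [h_forward, h_backward] for every time step."""
    forward = rnn_pass(xs, *fwd)
    # The backward RNN reads the reversed sequence; its outputs are flipped back
    # so that index t lines up with input x_t again.
    backward = rnn_pass(xs[::-1], *bwd)[::-1]
    return [[f, b] for f, b in zip(forward, backward)]

xs = [1.0, -0.5, 0.25, 2.0]
outputs = bidirectional_pass(xs)

# Changing only the last input alters the backward half at step 0,
# while the forward half at step 0 stays exactly the same.
changed = bidirectional_pass([1.0, -0.5, 0.25, -2.0])
print(outputs[0], changed[0])
```

The first time step already "sees" the end of the sequence through its backward half, which is exactly what a one-directional RNN cannot do.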

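1-3. Output of a bidirectional RNN

✅ Output shape

Wrapping LSTM(64) in Bidirectional yields a 128-dimensional output (with the default merge_mode='concat').

・Forward hidden state: 64 dimensions
・Backward hidden state: 64 dimensions
・After concatenation: 64 + 64 = 128 dimensions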

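💻 2. Implementing Bidirectional RNNs in Keras

2-1. The Bidirectional wrapper

from tensorflow.keras.layers import Bidirectional, LSTM, GRU

# Bidirectional LSTM (returns the output at every time step)
Bidirectional(LSTM(64, return_sequences=True))

# Bidirectional GRU (returns only the final output)
Bidirectional(GRU(64, return_sequences=False))

# merge_mode option: how the forward and backward outputs are combined
Bidirectional(LSTM(64), merge_mode='concat')  # default: concatenation
Bidirectional(LSTM(64), merge_mode='sum')     # element-wise sum
Bidirectional(LSTM(64), merge_mode='ave')     # element-wise average
Bidirectional(LSTM(64), merge_mode='mul')     # element-wise product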
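
To see what the four merge modes do to the two outputs, here is a small plain-Python sketch. The `merge` helper and the example vectors are hypothetical and only mimic the arithmetic; they are not the Keras implementation.

```python
def merge(forward, backward, mode="concat"):
    """Combine two equal-length vectors the way each merge_mode name suggests."""
    if mode == "concat":
        return forward + backward          # list concatenation: size doubles
    if mode == "sum":
        return [f + b for f, b in zip(forward, backward)]
    if mode == "ave":
        return [(f + b) / 2 for f, b in zip(forward, backward)]
    if mode == "mul":
        return [f * b for f, b in zip(forward, backward)]
    raise ValueError(f"unknown merge mode: {mode}")

h_fwd = [1.0, 2.0, 3.0]   # hypothetical forward output
h_bwd = [4.0, 5.0, 6.0]   # hypothetical backward output

for mode in ("concat", "sum", "ave", "mul"):
    print(mode, merge(h_fwd, h_bwd, mode))
```

Only 'concat' doubles the output size; the other three modes keep the size of a single direction.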

2-2. Sentiment analysis model

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Dense, Dropout
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

# ===== Data preparation =====
max_features = 10000  # vocabulary size: keep the 10,000 most frequent words
maxlen = 200          # pad or truncate every review to 200 tokens

(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)
X_train = pad_sequences(X_train, maxlen=maxlen)
X_test = pad_sequences(X_test, maxlen=maxlen)

# ===== Bidirectional LSTM model =====
model = Sequential([
    Input(shape=(maxlen,)),  # declare the input shape so summary() can report output shapes
    Embedding(max_features, 128),
    Bidirectional(LSTM(64, dropout=0.2, recurrent_dropout=0.2)),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')  # probability that the review is positive
])

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

model.summary()

Output:

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)       (None, 200, 128)          1280000   
 bidirectional (Bidirection) (None, 128)               98816     
                             ↑ 64 × 2 = 128 dimensions
 dense (Dense)               (None, 64)                8256      
 dropout (Dropout)           (None, 64)                0         
 dense_1 (Dense)             (None, 1)                 65        
=================================================================
Total params: 1,387,137
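
The parameter counts in this summary can be checked by hand. The sketch below uses the standard LSTM count (four gates, each with input weights, recurrent weights, and a bias); the helper names `lstm_params` and `dense_params` are made up for this example.

```python
def lstm_params(input_dim, units):
    """4 gates, each with input weights, recurrent weights and a bias."""
    return 4 * (input_dim + units + 1) * units

def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units

embedding = 10000 * 128                      # vocabulary size x embedding size
bidirectional = 2 * lstm_params(128, 64)     # forward LSTM + backward LSTM
dense = dense_params(128, 64)                # input is the 128-dim concatenation
dense_1 = dense_params(64, 1)

total = embedding + bidirectional + dense + dense_1
print(embedding, bidirectional, dense, dense_1, total)
# 1280000 98816 8256 65 1387137
```

Making the LSTM bidirectional doubles its parameters (49,408 → 98,816), but the embedding table still accounts for most of the model.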

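👁️ 3. The Basic Idea Behind Attention

3-1. What is attention?

Attention is a mechanism that learns where to focus.
It came to prominence with the 2017 paper "Attention Is All You Need" and is now a foundation of modern AI systems.

📚 A reading analogy

When summarizing a long passage, people do not weigh every word equally.

"Yesterday, I ate delicious ramen in Tokyo."

Question: "Where did you eat?"
→ Focus on "Tokyo" (high attention weight)

Question: "What did you eat?"
→ Focus on "ramen" (high attention weight)

Attention learns these degrees of focus.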

3-2. Computing attention

[Basic flow of attention]

Input sequence: [h₁, h₂, h₃, h₄, h₅]

1. Compute an importance score for each element
   score = [0.1, 0.7, 2.0, 0.65, 0.2]
            ↑ low     ↑↑ high!   ↑ low

2. Normalize with softmax (the weights sum to 1)
   attention_weights = softmax(score)
                     = [0.08, 0.15, 0.54, 0.14, 0.09]

3. Compute the weighted sum
   context = 0.08×h₁ + 0.15×h₂ + 0.54×h₃ + 0.14×h₄ + 0.09×h₅

Result: an output that strongly reflects h₃, the most important element
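
The three steps can be reproduced in plain Python. In this sketch the raw scores [0.1, 0.7, 2.0, 0.65, 0.2] are chosen so that softmax yields roughly the weights listed above, and the 2-dimensional hidden states are hypothetical values picked only for illustration.

```python
import math

def softmax(scores):
    """Exponentiate and normalize so the weights are positive and sum to 1."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Step 1: raw importance scores (any real numbers, not yet normalized).
scores = [0.1, 0.7, 2.0, 0.65, 0.2]

# Step 2: softmax turns them into attention weights.
weights = softmax(scores)
print([round(w, 2) for w in weights])     # [0.08, 0.15, 0.54, 0.14, 0.09]

# Hypothetical 2-dimensional hidden states h1..h5.
hidden = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [1.0, 1.0], [0.0, 0.0]]

# Step 3: the weighted sum over time steps gives one context vector.
context = [sum(w * h[i] for w, h in zip(weights, hidden)) for i in range(2)]
print(context)
```

Note that softmax exaggerates differences: the third score is only a few times larger than the others, yet it receives more than half of the total weight.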

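3-3. Self-Attention

📌 What is self-attention?

In self-attention, every element of a sequence learns how it relates to every other element of the same sequence.

"I live near the bank of the river"

To work out what "bank" means here:
・Attend to "river" → "bank" means a riverbank (not a financial institution)

This kind of contextual understanding is the strength of self-attention.
(It is the core technology of Transformer and BERT.)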
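
A minimal sketch of the idea, assuming scaled dot-product scoring and using the input vectors directly as queries, keys, and values. Real Transformer layers first apply learned projections to obtain those three roles; that step is omitted here, and the token embeddings are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def self_attention(X):
    """Scaled dot-product self-attention with queries = keys = values = X."""
    d = len(X[0])
    outputs, all_weights = [], []
    for query in X:
        # Score this element against every element of the same sequence.
        scores = [dot(query, key) / math.sqrt(d) for key in X]
        weights = softmax(scores)
        # New representation: a weighted mix of all elements.
        mixed = [sum(w * value[i] for w, value in zip(weights, X)) for i in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional embeddings for a 3-token sequence.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(X)
for row in weights:
    print([round(w, 2) for w in row])
```

Each row of `weights` belongs to one token and shows how much it attends to every token in the sequence, itself included. Tokens with similar embeddings (the first two here) attend to each other more strongly than to the third.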

💻 4. A Simple Attention Implementation

4-1. Custom attention layer

import tensorflow as tf
from tensorflow.keras.layers import Layer

class SimpleAttention(Layer):
    """A simple attention layer that pools a sequence into a single vector."""

    def __init__(self, **kwargs):
        super(SimpleAttention, self).__init__(**kwargs)

    def build(self, input_shape):
        # input_shape = (batch, timesteps, features)
        self.W = self.add_weight(
            name='attention_weight',
            shape=(input_shape[-1], 1),
            initializer='glorot_uniform',
            trainable=True
        )
        # One bias per time step, so the sequence length must be fixed
        self.b = self.add_weight(
            name='attention_bias',
            shape=(input_shape[1], 1),
            initializer='zeros',
            trainable=True
        )
        super(SimpleAttention, self).build(input_shape)

    def call(self, x):
        # Score each time step: (batch, timesteps, 1)
        e = tf.nn.tanh(tf.matmul(x, self.W) + self.b)

        # Normalize across time steps with softmax
        a = tf.nn.softmax(e, axis=1)

        # Weighted sum over time steps: (batch, features)
        output = tf.reduce_sum(x * a, axis=1)

        return output
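
The same computation for a single sequence can be traced in plain Python, which makes the shapes easier to follow. This is a sketch that mirrors the `call` method above, not a replacement for it; the function name `simple_attention` and the input values are hypothetical.

```python
import math

def simple_attention(x, w, b):
    """x: T time steps of D features; w: D weights; b: T biases (one per step)."""
    # Score per time step, squashed into (-1, 1) by tanh.
    e = [math.tanh(sum(xi * wi for xi, wi in zip(x_t, w)) + b_t)
         for x_t, b_t in zip(x, b)]
    # Softmax over the time axis.
    exps = [math.exp(s) for s in e]
    total = sum(exps)
    a = [v / total for v in exps]
    # Weighted sum over time steps: T x D collapses to D.
    output = [sum(a_t * x_t[i] for a_t, x_t in zip(a, x)) for i in range(len(w))]
    return output, a

x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]     # hypothetical sequence: T=3, D=2

# With all-zero parameters every score is tanh(0) = 0, so attention is uniform
# and the layer reduces to average pooling over time.
out_uniform, a_uniform = simple_attention(x, w=[0.0, 0.0], b=[0.0, 0.0, 0.0])
print(a_uniform, out_uniform)

# Non-zero weights shift attention toward time steps with higher scores.
out, a = simple_attention(x, w=[0.2, -0.1], b=[0.0, 0.0, 0.0])
print(a, out)
```

Because tanh bounds every score to (-1, 1), no time step can receive more than e² ≈ 7.4 times the weight of another. This keeps the layer stable but also limits how sharply it can focus.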

4-2. Model with attention

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Dense, Dropout

# ===== Sentiment analysis model with attention =====
max_features = 10000
maxlen = 200

# Input
inputs = Input(shape=(maxlen,))

# Embedding
x = Embedding(max_features, 128)(inputs)

# Bidirectional LSTM (returns the output at every time step)
x = Bidirectional(LSTM(64, return_sequences=True, dropout=0.2))(x)

# Attention: pools (timesteps, 128) into a single 128-dim vector
x = SimpleAttention()(x)

# Classification
x = Dense(64, activation='relu')(x)
x = Dropout(0.5)(x)
outputs = Dense(1, activation='sigmoid')(x)

model = Model(inputs, outputs)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

Output:

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_1 (InputLayer)        [(None, 200)]             0         
 embedding (Embedding)       (None, 200, 128)          1280000   
 bidirectional (Bidirection) (None, 200, 128)          98816     
 simple_attention (SimpleAt) (None, 128)               328       
 dense (Dense)               (None, 64)                8256      
 dropout (Dropout)           (None, 64)                0         
 dense_1 (Dense)             (None, 1)                 65        
=================================================================
Total params: 1,387,465
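
The 328 parameters of the attention layer follow directly from the shapes of its two weights. A quick arithmetic check, assuming the layer receives the (200, 128) output of the bidirectional LSTM:

```python
features = 2 * 64      # Bidirectional(LSTM(64)) with merge_mode='concat'
timesteps = 200        # maxlen

w_params = features * 1     # W has shape (features, 1)
b_params = timesteps * 1    # b has shape (timesteps, 1)
attention_params = w_params + b_params
print(attention_params)     # 328

# Adding the other layers reproduces the total in the summary.
total = 1280000 + 98816 + attention_params + 8256 + 65
print(total)                # 1387465
```

Because the bias has one entry per time step, the layer's size depends on maxlen, and a model built this way expects sequences of exactly that length.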

📝 5. Complete Code (IMDB Sentiment Analysis)

"""
IMDB sentiment analysis with a bidirectional LSTM + attention
Runs on Google Colab
"""
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import (
    Input, Embedding, Bidirectional, LSTM,
    Dense, Dropout, Layer
)
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping

# ===== Custom attention layer =====
class SimpleAttention(Layer):
    def __init__(self, **kwargs):
        super(SimpleAttention, self).__init__(**kwargs)

    def build(self, input_shape):
        # input_shape = (batch, timesteps, features)
        self.W = self.add_weight(
            shape=(input_shape[-1], 1),
            initializer='glorot_uniform',
            trainable=True
        )
        self.b = self.add_weight(
            shape=(input_shape[1], 1),
            initializer='zeros',
            trainable=True
        )
        super(SimpleAttention, self).build(input_shape)

    def call(self, x):
        e = tf.nn.tanh(tf.matmul(x, self.W) + self.b)  # score per time step
        a = tf.nn.softmax(e, axis=1)                   # attention weights over time
        return tf.reduce_sum(x * a, axis=1)            # weighted sum over time

# ===== Data preparation =====
max_features = 10000
maxlen = 200

print("Loading data...")
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)
X_train = pad_sequences(X_train, maxlen=maxlen)
X_test = pad_sequences(X_test, maxlen=maxlen)
print(f"Training data: {X_train.shape}, test data: {X_test.shape}")

# ===== Model construction =====
inputs = Input(shape=(maxlen,))
x = Embedding(max_features, 128)(inputs)
x = Bidirectional(LSTM(64, return_sequences=True, dropout=0.2))(x)
x = SimpleAttention()(x)
x = Dense(64, activation='relu')(x)
x = Dropout(0.5)(x)
outputs = Dense(1, activation='sigmoid')(x)

model = Model(inputs, outputs)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# ===== Training =====
# Stop when val_loss has not improved for 3 epochs and restore the best weights
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

history = model.fit(
    X_train, y_train,
    epochs=10,
    batch_size=64,
    validation_split=0.2,
    callbacks=[early_stop]
)

# ===== Evaluation =====
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"\n🎯 Test accuracy: {test_acc:.4f}")

Output:

Loading data...
Training data: (25000, 200), test data: (25000, 200)

Epoch 1/10 - accuracy: 0.7812 - val_accuracy: 0.8523
Epoch 2/10 - accuracy: 0.8934 - val_accuracy: 0.8678
Epoch 3/10 - accuracy: 0.9234 - val_accuracy: 0.8712
Epoch 4/10 - accuracy: 0.9456 - val_accuracy: 0.8698
Early stopping...

🎯 Test accuracy: 0.8734

→ The bidirectional LSTM with attention reaches about 87% test accuracy.
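
The early-stopping behavior used during training can be understood with a simplified patience rule. The sketch below is not the Keras callback itself (which has further options such as min_delta); the function name and the validation-loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 1-indexed, for a simple patience rule."""
    best_loss, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

val_losses = [0.40, 0.35, 0.36, 0.38, 0.41, 0.45]  # hypothetical values
stop, best = early_stopping_epoch(val_losses, patience=3)
print(stop, best)  # 5 2
```

Training stops at epoch 5, after three epochs without improvement, and restore_best_weights=True corresponds to rolling the model back to the weights from epoch 2.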

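📝 STEP 24 Summary

✅ What you learned in this step

Bidirectional RNN: draws on information from both the past and the future
Bidirectional: makes a recurrent layer bidirectional in Keras with a single wrapper
Output dimension: doubles when the layer is bidirectional (with the default merge_mode='concat')
Attention: learns where to focus
Self-attention: learns the relationships between elements within a sequence
Transformer: an attention-based architecture that underpins modern AI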

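💡 Code worth remembering

# Bidirectional LSTM basics
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

vocab_size = 10000   # example value: size of the vocabulary
num_classes = 3      # example value: number of output classes

model = Sequential([
    Embedding(vocab_size, 128),
    Bidirectional(LSTM(64, return_sequences=True)),  # pass the full sequence to the next RNN layer
    Bidirectional(LSTM(32)),
    Dense(num_classes, activation='softmax')
])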

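🚀 On to the next step

That completes the RNN part! Starting with STEP 25, we move on to Part 6, "Practice and Tuning."

You will learn techniques for hyperparameter tuning and model evaluation.

🔮 Advanced topics

The attention mechanism covered in this course is a foundation of modern NLP.

・Transformer: built around self-attention, with no RNN required
・BERT: contextual understanding with a bidirectional Transformer
・GPT: the basis of large language models

These topics are covered in detail in the Natural Language Processing (NLP) course.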

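📝 Exercises

Question 1 (Easy)

Advantages of bidirectional RNNs

Choose the correct statement about the advantage of a bidirectional RNN.

A. It has fewer parameters
B. It can use information from both the past and the future
C. It computes faster
D. It uses less memory

Answer: B

A bidirectional RNN runs two RNNs, one forward and one backward, so it can draw on both the past and the future of the sequence. The parameter count and the amount of computation both increase.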

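Question 2 (Easy)

Output dimension of a bidirectional layer

What is the output dimension of Bidirectional(LSTM(64))? (Assume merge_mode='concat'.)

A. 32
B. 64
C. 128
D. 256

Answer: C

With merge_mode='concat' (the default), the 64-dimensional forward output and the 64-dimensional backward output are concatenated into 128 dimensions.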

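Question 3 (Medium)

The role of attention

Choose the most appropriate description of the main role of the attention mechanism.

A. It shortens the sequence
B. It focuses on the important parts and aggregates their information
C. It completely solves the vanishing gradient problem
D. It reduces the number of parameters

Answer: B

Attention computes a degree of focus (a weight) for each element of the sequence, then aggregates the sequence with the important parts emphasized.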

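Question 4 (Medium)

Setting return_sequences

Choose the setting that an LSTM needs when attention is applied to its output.

A. return_sequences=False (final output only)
B. return_sequences=True (output at every time step)
C. return_state=True
D. units=1

Answer: B

Attention computes a weight for the output at each time step and aggregates them, so the LSTM must return the output at every time step with return_sequences=True.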

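Question 5 (Hard)

Characteristics of self-attention

Choose the correct description of self-attention.

A. A mechanism that attends to an external source of information
B. Each element of a sequence learns its relationship to every other element
C. A mechanism that attends only to itself
D. It cannot be used without an RNN

Answer: B

In self-attention, each element of a sequence (for example, each word in a sentence) learns how it relates to every other element of the same sequence. This is the main mechanism of the Transformer, which does not need an RNN.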
