【Autoregressive Generation】
Definition:
Generate text one word at a time, conditioning each new word on all the words before it.
Formula:
P(entire sentence) = P(w₁) × P(w₂|w₁) × P(w₃|w₁,w₂) × …
Meaning:
・(probability of w₁) × (probability of w₂ given w₁) × (probability of w₃ given w₁, w₂) × …
【Example of the Generation Process】
Input prompt: "Once upon a time"
Step 1:
Input: "Once upon a time"
Prediction: compute the probability of each candidate next word
"there" (30%), "in" (20%), "the" (15%), …
Selection: "there" (highest probability)
Output: "Once upon a time there"
Step 2:
Input: "Once upon a time there"
Prediction: "was" (40%), "lived" (25%), …
Selection: "was"
Output: "Once upon a time there was"
Step 3:
Input: "Once upon a time there was"
Prediction: "a" (35%), "an" (20%), …
Selection: "a"
Output: "Once upon a time there was a"
… (and so on)
Termination conditions:
・An <EOS> (End of Sequence) token is generated
・The specified maximum length is reached
【Diagram】
"Once upon a time" → [GPT] → "there" (0.30)
                              "in" (0.20)
                              "the" (0.15)
                              …
Select "there" (highest probability)
↓
"Once upon a time there" → [GPT] → "was" (0.40)
                                    "lived" (0.25)
                                    …
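The loop above maps directly onto a few lines of Hugging Face transformers code. Below is a minimal sketch, assuming a GPT-2 checkpoint ("gpt2" and the 3-step cutoff are illustrative choices, not from the original); later code cells in this section reuse the tokenizer, model, and device defined here.

# ========================================
# The autoregressive loop by hand (illustrative GPT-2 sketch)
# ========================================
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
model.eval()

input_ids = tokenizer.encode("Once upon a time", return_tensors="pt").to(device)

for step in range(3):  # 3 steps, mirroring the walkthrough above
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]  # logits for the next token
    probs = torch.softmax(logits, dim=-1)
    top_probs, top_ids = probs.topk(3)  # inspect the top 3 candidates
    candidates = [(tokenizer.decode(i.item()), round(p.item(), 2))
                  for i, p in zip(top_ids, top_probs)]
    print(f"Step {step + 1}: {candidates}")
    next_id = top_ids[:1].unsqueeze(0)  # greedy: append the most likely token
    input_ids = torch.cat([input_ids, next_id], dim=1)

print(tokenizer.decode(input_ids[0]))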
=== Greedy Search ===
Once upon a time, in a kingdom far away, there was a young man who
was very good at his job. He was a very good worker. He was a very
good worker. He was a very good worker. He was
⚠️ The problem with greedy search
"He was a very good worker." is repeated over and over.
This is the drawback of greedy search: always picking the single most likely token makes it easy to fall into repetitive loops.
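For reference, here is a minimal sketch of a generate call that produces this kind of repetitive output, reusing the model and tokenizer loaded in the sketch above (do_sample=False is what makes the decoding greedy):

# ========================================
# Greedy search: do_sample=False always takes the argmax token
# ========================================
prompt = "Once upon a time, in a kingdom far away,"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

output = model.generate(
    input_ids,
    max_length=60,
    do_sample=False,  # greedy decoding
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))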
=== Top-p Sampling ===
Once upon a time, in a kingdom far away, there lived a brave knight
named Sir Arthur. He embarked on a quest to find the legendary sword
that was hidden deep in the enchanted forest. Along the way, he
encountered mystical creatures and faced many challenges.
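The corresponding top-p (nucleus) sampling call, as a sketch with illustrative parameter values: instead of always taking the argmax, it samples from the smallest set of tokens whose cumulative probability exceeds p, which breaks the repetition seen with greedy search.

# ========================================
# Top-p (nucleus) sampling
# ========================================
output = model.generate(
    input_ids,
    max_length=60,
    do_sample=True,  # sample instead of taking the argmax
    top_p=0.9,       # sample only from the top 90% probability mass
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))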
# ========================================
# Temperature comparison
# ========================================
prompt = "The future of artificial intelligence is"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

temperatures = [0.3, 0.7, 1.0, 1.5]

print("=== Temperature comparison ===\n")
for temp in temperatures:
    output = model.generate(
        input_ids,
        max_length=50,
        do_sample=True,
        top_p=0.9,
        temperature=temp,
        pad_token_id=tokenizer.eos_token_id,
    )
    text = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"Temperature = {temp}:")
    print(f"  {text}")
    print()
Example output:
=== Temperature comparison ===
Temperature = 0.3:
The future of artificial intelligence is likely to be dominated
by machine learning and deep learning algorithms. These technologies
are expected to revolutionize various industries.
Temperature = 0.7:
The future of artificial intelligence is exciting and full of
possibilities. We can expect AI to play an increasingly important
role in our daily lives.
Temperature = 1.0:
The future of artificial intelligence is uncertain yet fascinating.
Some experts predict rapid advancements while others remain cautious
about potential risks.
Temperature = 1.5:
The future of artificial intelligence is quantum raspberry cosmic
synthesizers dancing through nebulous datastreams while telepathic
algorithms whisper secrets…
(← temperature too high: the output becomes incoherent)
【Interpreting Temperature】
T = 0.3 (low):
・Confident and conservative
・Predictable text
・Suited to factual content
T = 0.7 (medium):
・Good balance
・Suitable for most applications
・Recommended setting
T = 1.0 (original distribution):
・The model's unmodified probabilities
・Somewhat creative
T = 1.5 (high):
・Very random
・Often incoherent
・Rarely used in practice
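Under the hood, temperature rescales the logits before the softmax: P(wᵢ) = exp(zᵢ/T) / Σⱼ exp(zⱼ/T). The toy sketch below shows the effect (the logit values are made up for illustration):

# ========================================
# Effect of temperature on a toy next-token distribution
# ========================================
import torch

logits = torch.tensor([2.0, 1.0, 0.5, 0.1])  # made-up next-token logits

for T in [0.3, 0.7, 1.0, 1.5]:
    probs = torch.softmax(logits / T, dim=-1)
    print(f"T={T}: {[round(p.item(), 3) for p in probs]}")
# Low T sharpens the distribution toward the top token;
# high T flattens it toward uniform, hence the incoherent output above.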
2-6. Generating Multiple Candidates
# ========================================
# Generating multiple candidates
# ========================================
prompt = "Once upon a time, in a kingdom far away,"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

# num_return_sequences: how many candidate sequences to generate
output = model.generate(
    input_ids,
    max_length=60,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
    num_return_sequences=3,  # generate 3 candidates
    pad_token_id=tokenizer.eos_token_id,
)

print("=== Multiple candidates ===\n")
for i, seq in enumerate(output, 1):
    text = tokenizer.decode(seq, skip_special_tokens=True)
    print(f"Candidate {i}:")
    print(f"  {text}")
    print()
Example output:
=== Multiple candidates ===
Candidate 1:
Once upon a time, in a kingdom far away, there was a magical
forest where fairies danced under the moonlight…
Candidate 2:
Once upon a time, in a kingdom far away, ruled a wise queen who
made decisions that brought prosperity to her people…
Candidate 3:
Once upon a time, in a kingdom far away, lived a curious child
who dreamed of exploring the world beyond the mountains…
# ========================================
# Article to summarize
# ========================================
article = """
The Amazon rainforest, also known as Amazonia, is a moist broadleaf
tropical rainforest in the Amazon biome that covers most of the Amazon
basin of South America. This basin encompasses 7 million square kilometers,
of which 5.5 million square kilometers are covered by the rainforest.
The majority of the forest is contained within Brazil, with 60% of the
rainforest, followed by Peru with 13%, Colombia with 10%, and with minor
amounts in Venezuela, Ecuador, Bolivia, Guyana, Suriname, and French Guiana.
The Amazon represents over half of the planet's remaining rainforests,
and comprises the largest and most biodiverse tract of tropical rainforest
in the world.
"""

print("=== Original article ===")
print(article)
print(f"Character count: {len(article)}")
Output:
=== Original article ===
The Amazon rainforest, also known as Amazonia, is a moist broadleaf…
Character count: 723
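This excerpt does not include the cell that produced the summaries below. The following is a minimal sketch of how they could be generated with T5 (the t5-small checkpoint and the length settings are assumptions):

# ========================================
# T5 summarization (illustrative sketch; checkpoint is an assumption)
# ========================================
from transformers import T5ForConditionalGeneration, T5Tokenizer

t5_tokenizer = T5Tokenizer.from_pretrained("t5-small")
t5_model = T5ForConditionalGeneration.from_pretrained("t5-small").to(device)

# T5 is a text-to-text model and expects a task prefix ("summarize: ")
inputs = t5_tokenizer.encode(
    "summarize: " + article,
    return_tensors="pt", max_length=512, truncation=True,
).to(device)

summary_ids = t5_model.generate(
    inputs,
    max_length=100,  # adjust min/max length for shorter or longer summaries
    min_length=30,
    num_beams=4,
    early_stopping=True,
)
summary = t5_tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("=== T5 summary ===")
print(summary)
print(f"Character count: {len(summary)}")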
=== T5 summary ===
the Amazon rainforest is a tropical rainforest in the Amazon biome.
it covers most of the Amazon basin of South America. the majority
of the forest is in Brazil. the Amazon represents over half of the
planet’s remaining rainforests.
Character count: 234
=== Short summary (30-50 tokens) ===
the Amazon rainforest covers most of the Amazon basin. it is the
largest tropical rainforest in the world.
=== Long summary (100-150 tokens) ===
the Amazon rainforest, also known as Amazonia, is a moist broadleaf
tropical rainforest in the Amazon biome. the basin encompasses 7
million square kilometers. the majority of the forest is contained
within Brazil, with 60% of the rainforest. the Amazon represents
over half of the planet’s remaining rainforests.
=== BART summary ===
The Amazon rainforest covers most of the Amazon basin of South America.
The basin encompasses 7 million square kilometers. The majority of
the forest is in Brazil, followed by Peru, Colombia, and Venezuela.
💡 Advantages of the pipeline API
Simple: summarization in just a few lines of code
High quality: uses a model already fine-tuned for the task
BART: the checkpoint used here is specialized for news summarization
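As a sketch of that pipeline approach (assuming the facebook/bart-large-cnn checkpoint, a common choice for news summarization):

# ========================================
# Summarization via the pipeline API (checkpoint is an assumption)
# ========================================
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

result = summarizer(article, max_length=80, min_length=30, do_sample=False)
print("=== BART summary ===")
print(result[0]["summary_text"])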
📊 5. Evaluation Metrics
This section covers metrics for measuring the quality of generated text and summaries.
5-1. ROUGE Score
【ROUGE (Recall-Oriented Understudy for Gisting Evaluation)】
Definition:
Measures the word overlap between a generated summary and a reference summary.
Three variants:
ROUGE-1: overlap of single words (unigrams)
ROUGE-2: overlap of word pairs (bigrams)
ROUGE-L: longest common subsequence (LCS)
【ROUGE-1 Calculation Example】
Reference: "The cat sat on the mat"
Generated: "The cat is on the mat"
Reference tokens: {The, cat, sat, on, the, mat} → 6 tokens
Generated tokens: {The, cat, is, on, the, mat} → 6 tokens
Overlapping tokens: {The, cat, on, the, mat} → 5 tokens
Precision = overlap / generated = 5/6 = 0.833
Recall = overlap / reference = 5/6 = 0.833
F1 = 2 × P × R / (P + R) = 0.833
【ROUGE-2 Calculation Example】
Reference bigrams: {The cat, cat sat, sat on, on the, the mat}
Generated bigrams: {The cat, cat is, is on, on the, the mat}
Overlap: {The cat, on the, the mat} → 3 bigrams
Precision = 3/5 = 0.60
Recall = 3/5 = 0.60
F1 = 0.60
【ROUGE-L】
Uses the longest common subsequence (LCS).
Reference: "The cat sat on the mat"
Generated: "The cat is on the mat"
LCS: "The cat on the mat" (5 words)
Precision = Recall = 5/6 = 0.833, so F1 = 0.833
ROUGE-L respects word order, but the matching words do not need to be contiguous.
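A quick by-hand check of the ROUGE-1 numbers above (lowercasing so that "The" matches "the"; a sketch, not the official scorer):

# ========================================
# By-hand check of the ROUGE-1 example above
# ========================================
from collections import Counter

ref = "The cat sat on the mat".lower().split()
gen = "The cat is on the mat".lower().split()

overlap = sum((Counter(ref) & Counter(gen)).values())  # 5 overlapping tokens
precision = overlap / len(gen)  # 5/6
recall = overlap / len(ref)     # 5/6
f1 = 2 * precision * recall / (precision + recall)
print(f"ROUGE-1  P={precision:.3f}  R={recall:.3f}  F1={f1:.3f}")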
# ========================================
# Computing ROUGE scores
# ========================================
# Requires: pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)

# Reference summary (human-written "ground truth")
reference = """
The Amazon rainforest is a tropical rainforest covering most of
the Amazon basin in South America. It is the largest rainforest
in the world and contains incredible biodiversity.
"""

# Generated summary (model output)
generated = """
The Amazon rainforest covers the Amazon basin of South America.
It is the world's largest rainforest with remarkable biodiversity.
"""

# Compute the scores (score(target, prediction): reference comes first)
scores = scorer.score(reference, generated)

print("=== ROUGE scores ===\n")
for metric, score in scores.items():
    print(f"{metric.upper()}:")
    print(f"  Precision: {score.precision:.4f}")
    print(f"  Recall:    {score.recall:.4f}")
    print(f"  F1:        {score.fmeasure:.4f}")
    print()