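🚀 STEP 15: Advanced Ensemble Methods

Learn XGBoost and LightGBM, two of the strongest and most frequently used methods on Kaggle.

📋 What You Will Learn in This Step

Gradient Boosting in detail
XGBoost: features and usage
LightGBM and its speed
An introduction to CatBoost
Hyperparameter tuning
Hands-on: a Kaggle-style competition

Exercises: 6 questions

🎯 1. Gradient Boosting in Detail

What Is Gradient Boosting?

📈 Gradient descent + boosting

Gradient Boosting sequentially adds models, each trained to predict the residuals (errors) of the ensemble built so far.

How it differs from AdaBoost:
・AdaBoost: increases the weights of misclassified samples
・Gradient Boosting: adds a model that predicts the residuals themselves
→ More flexible (it works with any differentiable loss function) and often more accurate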
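
The "gradient descent" part of the name can be made concrete for squared-error loss. The short derivation below is a sketch in notation chosen here for illustration (F is the current ensemble prediction, h_m the m-th tree, η the learning rate); none of these symbols appear in the lesson itself. It shows that the residual is exactly the negative gradient of the loss, so fitting a tree to the residuals is one step of gradient descent in prediction space.

```latex
\begin{aligned}
L(y, F) &= \tfrac{1}{2}\,(y - F)^2 \\
-\frac{\partial L}{\partial F} &= y - F \quad \text{(the residual)} \\
F_m(x) &= F_{m-1}(x) + \eta\, h_m(x),
\qquad h_m \approx -\left.\frac{\partial L}{\partial F}\right|_{F = F_{m-1}}
\end{aligned}
```

For other losses (for example log loss in classification) the same update applies, with the negative gradient taking the place of the plain residual.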

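The Gradient Boosting Process

[The Gradient Boosting process]

Round 1:
  Initial prediction: mean of all targets (here: 160)
  
  Actual values: [100, 200, 180]
  Residuals:     [-60,  40,  20]  ← this is what we want to predict
  
  Train model 1 (to predict the residuals)
  Model 1's predictions: [-50, 35, 15]
  
  New prediction = 160 + [-50, 35, 15]
                 = [110, 195, 175]

Round 2:
  Actual values:      [100, 200, 180]
  Current prediction: [110, 195, 175]
  New residuals:      [-10, 5, 5]  ← some error remains
  
  Train model 2 (to predict the new residuals)
  Model 2's predictions: [-8, 4, 4]
  New prediction = [110, 195, 175] + [-8, 4, 4]
                 = [102, 199, 179]

→ The error keeps shrinking

Final prediction = initial prediction + model 1 + model 2 + model 3 + …
(This walkthrough adds each model's output in full, i.e. learning_rate = 1.0; in practice each term is scaled by learning_rate.)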
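
The loop above can be sketched in a few lines of plain Python. This is a minimal illustration, not how scikit-learn or XGBoost implement it: the weak learner is a depth-1 regression tree ("stump") on a single feature, the toy data is made up, and the helper names `fit_stump` and `gradient_boost` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree: one threshold, one mean per side."""
    best = None
    # Candidate thresholds: every distinct x except the largest,
    # so both sides of the split are always non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Squared-error gradient boosting with stumps as weak learners."""
    base = sum(ys) / len(ys)              # initial prediction: the mean
    preds = [base] * len(ys)
    stumps, mse_history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)  # the new model predicts the residuals
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, mse_history


# Toy data (made up for illustration)
xs = [1, 2, 3, 4, 5, 6]
ys = [100, 120, 150, 180, 200, 210]
predict, mse_history = gradient_boost(xs, ys)
print(f"MSE after round 1:  {mse_history[0]:.2f}")
print(f"MSE after round 20: {mse_history[-1]:.2f}")
```

With squared error and a learning rate between 0 and 1, the training MSE never increases from one round to the next, which is the "error keeps shrinking" behavior described above.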

Implementation with scikit-learn

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Generate data
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Gradient Boosting model
gb = GradientBoostingClassifier(
    n_estimators=100,   # Number of trees
    learning_rate=0.1,  # Learning rate
    max_depth=3,        # Depth of each tree
    random_state=42
)

gb.fit(X_train, y_train)

print("=== Gradient Boosting results ===")
print(f"Train accuracy: {gb.score(X_train, y_train):.4f}")
print(f"Test accuracy: {gb.score(X_test, y_test):.4f}")

Example output (exact numbers may vary with library versions):

=== Gradient Boosting results ===
Train accuracy: 0.9714
Test accuracy: 0.9567

The Effect of learning_rate

# Check the effect of learning_rate
learning_rates = [0.01, 0.05, 0.1, 0.5, 1.0]

print("=== Performance by learning_rate ===")
for lr in learning_rates:
    model = GradientBoostingClassifier(
        n_estimators=100,
        learning_rate=lr,
        max_depth=3,
        random_state=42
    )
    model.fit(X_train, y_train)
    
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    
    print(f"learning_rate={lr:.2f}: Train={train_score:.4f}, Test={test_score:.4f}")

Example output (exact numbers may vary with library versions):

=== Performance by learning_rate ===
learning_rate=0.01: Train=0.8571, Test=0.8533
learning_rate=0.05: Train=0.9357, Test=0.9267
learning_rate=0.10: Train=0.9714, Test=0.9567
learning_rate=0.50: Train=0.9929, Test=0.9500
learning_rate=1.00: Train=1.0000, Test=0.9400

🔍 The role of learning_rate

learning_rate scales how much each tree contributes to the final prediction.

Small (0.01): learns slowly, needs more trees, less prone to overfitting

Large (1.0): learns quickly, needs fewer trees, more prone to overfitting

Recommended: 0.05 to 0.1 is a common starting range
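
To see why a smaller learning_rate needs more trees, here is a back-of-the-envelope sketch. It assumes an idealized weak learner that predicts the current residual perfectly every round, which real trees do not; under that assumption the remaining error shrinks by a factor of (1 - learning_rate) per round. The function name `rounds_to_shrink` is made up for this illustration.

```python
def rounds_to_shrink(learning_rate, target_fraction=0.01):
    """Rounds needed until the remaining residual falls to target_fraction
    of its initial size, assuming each tree predicts the residual perfectly."""
    remaining = 1.0
    rounds = 0
    while remaining > target_fraction:
        remaining *= (1.0 - learning_rate)  # only a fraction of the correction is applied
        rounds += 1
    return rounds


for lr in [0.01, 0.05, 0.1, 0.5, 1.0]:
    print(f"learning_rate={lr:.2f}: {rounds_to_shrink(lr)} rounds to reach 1% of the initial residual")
```

Under this idealization, learning_rate=0.1 needs 44 rounds and learning_rate=0.01 needs 459, roughly ten times as many. That is the intuition behind pairing a small learning_rate with a large n_estimators.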

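⚡ 2. XGBoost (eXtreme Gradient Boosting)

What Is XGBoost?

🏆 One of the most widely used algorithms on Kaggle

XGBoost is a faster, improved implementation of Gradient Boosting.

Key features:
・Fast: parallelized split finding and optimized algorithms
・Accurate: built-in L1/L2 regularization and native handling of missing values
・Flexible: many tunable parameters
・Popular: a staple among top Kaggle finishers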

Using XGBoost

from xgboost import XGBClassifier

# XGBoost model
xgb_model = XGBClassifier(
    n_estimators=100,      # Number of trees
    learning_rate=0.1,     # Learning rate
    max_depth=3,           # Tree depth
    subsample=0.8,         # Fraction of rows sampled per tree
    colsample_bytree=0.8,  # Fraction of features sampled per tree
    random_state=42,
    eval_metric='logloss'  # Evaluation metric
)

# Train
xgb_model.fit(X_train, y_train)

print("=== XGBoost results ===")
print(f"Train accuracy: {xgb_model.score(X_train, y_train):.4f}")
print(f"Test accuracy: {xgb_model.score(X_test, y_test):.4f}")

# Compare
print("\n=== Gradient Boosting vs XGBoost ===")
print(f"Scikit-learn GB: {gb.score(X_test, y_test):.4f}")
print(f"XGBoost:         {xgb_model.score(X_test, y_test):.4f}")

Example output (exact numbers may vary with library versions):

=== XGBoost results ===
Train accuracy: 0.9929
Test accuracy: 0.9633

=== Gradient Boosting vs XGBoost ===
Scikit-learn GB: 0.9567
XGBoost:         0.9633

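💡 Important XGBoost parameters

Parameter	Description
n_estimators	Number of trees (100 to 1000 is typical)
learning_rate	Learning rate (0.01 to 0.3; smaller is more conservative)
max_depth	Tree depth (3 to 10; deeper means more complex)
subsample	Fraction of rows sampled per tree (0.8 is a common choice)
colsample_bytree	Fraction of features sampled per tree (0.8 is a common choice)
min_child_weight	Minimum sum of instance weights (Hessian) required in a child node; larger values curb overfitting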
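
To make the regularization parameters less abstract, the sketch below evaluates one candidate split the way XGBoost's documented structure score does: G and H are the sums of first and second derivatives of the loss over the rows in a node, the optimal leaf weight is -G / (H + reg_lambda), and a split is only worth making if its gain stays positive after subtracting gamma. The function names are illustrative and not part of the xgboost API; the gradient values are made up.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0):
    """Optimal leaf value for a node; a larger reg_lambda shrinks it toward 0."""
    return -grad_sum / (hess_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right,
               reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """Gain of splitting a node into left/right children.

    Returns None when a child would violate min_child_weight."""
    if h_left < min_child_weight or h_right < min_child_weight:
        return None

    def score(g, h):
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


# For squared error, each row has gradient = prediction - target and Hessian = 1.
# Two rows on the left (gradients -2, -2) and two on the right (gradients 3, 3):
gain = split_gain(g_left=-4.0, h_left=2.0, g_right=6.0, h_right=2.0)
print(f"gain = {gain:.4f}")
print(f"left leaf weight = {leaf_weight(-4.0, 2.0):.4f}")
print(split_gain(-4.0, 2.0, 6.0, 2.0, min_child_weight=3.0))  # split rejected
```

Raising gamma or reg_lambda lowers the gain of every candidate split, so fewer splits survive and the trees stay simpler.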

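💨 3. LightGBM (Light Gradient Boosting Machine)

What Is LightGBM?

🚄 A very fast boosting library developed by Microsoft

LightGBM is a gradient boosting library that often trains faster than XGBoost.

Key features:
・Very fast: often faster than XGBoost on large datasets
・Memory-efficient: runs with a small memory footprint
・Accurate: performance on par with XGBoost, sometimes better
・Categorical feature support: no one-hot encoding needed (columns must be integer-coded or of pandas category dtype)

XGBoost vs LightGBM: The Difference

[How the trees grow]

XGBoost: level-wise (the default)
  → Splits every node at the current depth before going deeper
  → Produces balanced trees, but spends time on low-gain splits

LightGBM: leaf-wise
  → Splits the leaf that reduces the loss the most first
  → Reaches a lower loss with the same number of leaves, but can overfit small datasets unless num_leaves or max_depth is limited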
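
The difference in growth order can be illustrated with a toy simulation. The sketch below assumes the loss reduction ("gain") of splitting each node is already known and stored in a dict, with node i having children 2i+1 and 2i+2; the gain values and function names are made up. Real libraries compute gains from the data as the tree grows.

```python
import heapq
from collections import deque


def level_wise(gains, n_splits):
    """Split nodes in breadth-first order, one full level at a time."""
    order, queue = [], deque([0])
    while queue and len(order) < n_splits:
        node = queue.popleft()
        if node not in gains:
            continue
        order.append(node)
        queue.extend([2 * node + 1, 2 * node + 2])
    return order


def leaf_wise(gains, n_splits):
    """Always split the current leaf with the largest gain."""
    order, heap = [], [(-gains[0], 0)]  # max-heap via negated gains
    while heap and len(order) < n_splits:
        _, node = heapq.heappop(heap)
        order.append(node)
        for child in (2 * node + 1, 2 * node + 2):
            if child in gains:
                heapq.heappush(heap, (-gains[child], child))
    return order


# Hypothetical gains per node (node 0 is the root)
gains = {0: 10, 1: 8, 2: 1, 3: 6, 4: 5, 5: 0.5, 6: 0.2}

lw = level_wise(gains, n_splits=3)
lf = leaf_wise(gains, n_splits=3)
print("level-wise:", lw, "total gain", sum(gains[n] for n in lw))
print("leaf-wise: ", lf, "total gain", sum(gains[n] for n in lf))
```

With the same budget of three splits, the leaf-wise order skips the low-gain node 2 and goes deeper on the left instead, which is why it tends to reduce the loss more per split and also why it can produce deep, unbalanced trees.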

Implementing LightGBM

from lightgbm import LGBMClassifier

# LightGBM model
lgbm_model = LGBMClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    num_leaves=8,         # Max leaves per tree (LightGBM-specific); a depth-3 tree has at most 2**3 = 8
    subsample=0.8,
    subsample_freq=1,     # Row subsampling only takes effect when subsample_freq > 0
    colsample_bytree=0.8,
    random_state=42,
    verbose=-1            # Suppress log output
)

# Train
lgbm_model.fit(X_train, y_train)

print("=== LightGBM results ===")
print(f"Train accuracy: {lgbm_model.score(X_train, y_train):.4f}")
print(f"Test accuracy: {lgbm_model.score(X_test, y_test):.4f}")

# Compare the three methods
print("\n=== Comparison of the three methods ===")
print(f"Gradient Boosting: {gb.score(X_test, y_test):.4f}")
print(f"XGBoost:           {xgb_model.score(X_test, y_test):.4f}")
print(f"LightGBM:          {lgbm_model.score(X_test, y_test):.4f}")

Example output (exact numbers may vary with library versions and settings):

=== LightGBM results ===
Train accuracy: 0.9929
Test accuracy: 0.9600

=== Comparison of the three methods ===
Gradient Boosting: 0.9567
XGBoost:           0.9633
LightGBM:          0.9600

Speed Comparison

import time

# Compare speed on a larger dataset
X_large, y_large = make_classification(
    n_samples=10000,
    n_features=50,
    n_informative=30,
    random_state=42
)

X_train_l, X_test_l, y_train_l, y_test_l = train_test_split(
    X_large, y_large, test_size=0.3, random_state=42
)

print("=== Training speed comparison (10,000 samples, 50 features) ===\n")

speed_models = [
    ('XGBoost', XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss')),
    ('LightGBM', LGBMClassifier(n_estimators=100, random_state=42, verbose=-1)),
]

for name, model in speed_models:
    start_time = time.perf_counter()
    model.fit(X_train_l, y_train_l)
    elapsed = time.perf_counter() - start_time
    accuracy = model.score(X_test_l, y_test_l)
    
    print(f"{name:15s}: {elapsed:.3f} s, accuracy: {accuracy:.4f}")

Example output (timings depend on hardware and library versions):

=== Training speed comparison (10,000 samples, 50 features) ===

XGBoost        : 2.456 s, accuracy: 0.9567
LightGBM       : 0.789 s, accuracy: 0.9600

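🚀 Why LightGBM is fast

Leaf-wise growth: builds trees efficiently
Histogram-based splitting: buckets continuous values into bins
Parallelism: parallelizes across features
GOSS: gradient-based one-side sampling (available as an option; keeps rows with large gradients and subsamples the rest)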
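
The histogram idea is the easiest of these to show in code. The sketch below buckets one feature into equal-width bins and accumulates the gradient sum and row count per bin; split candidates are then evaluated only at bin boundaries rather than at every distinct value. Equal-width binning and the function name `build_histogram` are simplifications chosen for this example, and LightGBM's actual binning strategy differs.

```python
def build_histogram(values, gradients, n_bins):
    """Accumulate gradient sums and counts per equal-width bin of one feature."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    grad_sum = [0.0] * n_bins
    count = [0] * n_bins
    for v, g in zip(values, gradients):
        b = min(int((v - lo) / width), n_bins - 1)  # the max value goes in the last bin
        grad_sum[b] += g
        count[b] += 1
    return grad_sum, count


# Made-up feature values and per-row gradients
values = [0.1, 0.4, 0.35, 0.8, 0.95, 0.6, 0.2, 0.7]
gradients = [-1.0, 0.5, 0.5, 2.0, 1.0, -0.5, -1.0, 0.0]

grad_sum, count = build_histogram(values, gradients, n_bins=4)
print("gradient sum per bin:", grad_sum)
print("rows per bin:        ", count)
print("split candidates:    ", len(count) - 1, "instead of", len(set(values)) - 1)
```

Once the histogram is built, the cost of scanning for the best split depends on the number of bins, not on the number of rows, which is where most of the speedup on large datasets comes from.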

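🐱 4. XGBoost vs LightGBM vs CatBoost

💡 Summary of the three

Item	XGBoost	LightGBM	CatBoost
Speed	Fast	Fastest	Fast
Accuracy	High	High	High
Tuning	Required	Required	Little
Categorical features	Encoding required (recent versions add optional native support via enable_categorical)	Partial support	Handled automatically
Best for	General purpose	Large datasets	Many categorical features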

🎛️ 5. Hyperparameter Tuning

Tuning with GridSearchCV

from sklearn.model_selection import GridSearchCV

# Parameter grid for XGBoost
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.3],
    'subsample': [0.8, 1.0]
}

# GridSearch
xgb_grid = XGBClassifier(random_state=42, eval_metric='logloss')

grid_search = GridSearchCV(
    xgb_grid,
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

print("=== Best parameters ===")
print(grid_search.best_params_)
print(f"\nBest CV score: {grid_search.best_score_:.4f}")
print(f"Test score: {grid_search.score(X_test, y_test):.4f}")

Example output (exact numbers may vary with library versions):

=== Best parameters ===
{'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 200, 'subsample': 0.8}

Best CV score: 0.9643
Test score: 0.9700

Searching Efficiently with RandomizedSearchCV

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Parameter distributions for random search
# uniform(loc, scale) samples from [loc, loc + scale]; randint(low, high) excludes high
param_dist = {
    'n_estimators': randint(50, 300),
    'max_depth': randint(3, 10),
    'learning_rate': uniform(0.01, 0.29),  # 0.01 to 0.3
    'subsample': uniform(0.7, 0.3),        # 0.7 to 1.0
    'colsample_bytree': uniform(0.7, 0.3)  # 0.7 to 1.0
}

# RandomizedSearch
random_search = RandomizedSearchCV(
    XGBClassifier(random_state=42, eval_metric='logloss'),
    param_distributions=param_dist,
    n_iter=20,  # Try 20 sampled combinations
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42
)

random_search.fit(X_train, y_train)

print("=== Best parameters ===")
print(random_search.best_params_)
print(f"\nBest CV score: {random_search.best_score_:.4f}")
print(f"Test score: {random_search.score(X_test, y_test):.4f}")

Example output (exact numbers may vary with library versions):

=== Best parameters ===
{'colsample_bytree': 0.8234, 'learning_rate': 0.1567, 'max_depth': 6, 'n_estimators': 187, 'subsample': 0.8912}

Best CV score: 0.9671
Test score: 0.9733

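💡 GridSearch vs RandomizedSearch

GridSearch: tries every combination. Exhaustive over the grid, but slow.

RandomizedSearch: tries n_iter randomly sampled combinations. Faster, but may miss the best one. Recommended when there are many parameters.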
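
The cost difference is easy to quantify. The sketch below counts model fits for the grid and the random search configured in this section (5-fold cross-validation, n_iter=20), using only the standard library; the final refit on the full training set is left out of the count.

```python
from itertools import product

# Same grid as in the GridSearchCV example
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.3],
    "subsample": [0.8, 1.0],
}
cv_folds = 5
n_iter = 20  # same as in the RandomizedSearchCV example

grid_candidates = len(list(product(*param_grid.values())))
grid_fits = grid_candidates * cv_folds   # every combination, once per fold
random_fits = n_iter * cv_folds          # only the sampled combinations

print(f"Grid search:   {grid_candidates} combinations x {cv_folds} folds = {grid_fits} fits")
print(f"Random search: {n_iter} samples x {cv_folds} folds = {random_fits} fits")
```

Adding one more parameter with three values would triple the grid to 162 combinations, while the random search budget stays at whatever n_iter is set to.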

🏆 6. Hands-On Project: A Kaggle-Style Competition

Task: Customer Churn Prediction

from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Generate imbalanced data (30% churn)
X_full, y_full = make_classification(
    n_samples=5000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=2,
    weights=[0.7, 0.3],
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X_full, y_full, test_size=0.2, random_state=42, stratify=y_full
)

# Prepare several models
models_compete = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'XGBoost': XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42, eval_metric='logloss'),
    'LightGBM': LGBMClassifier(n_estimators=100, learning_rate=0.1, random_state=42, verbose=-1),
}

print("=== Model training and evaluation (AUC) ===\n")

for name, model in models_compete.items():
    model.fit(X_train, y_train)
    y_pred_proba = model.predict_proba(X_test)[:, 1]  # Probability of the positive (churn) class
    auc = roc_auc_score(y_test, y_pred_proba)
    accuracy = model.score(X_test, y_test)
    
    print(f"{name:25s}: AUC={auc:.4f}, Accuracy={accuracy:.4f}")

Example output (exact numbers may vary with library versions):

=== Model training and evaluation (AUC) ===

Logistic Regression      : AUC=0.9234, Accuracy=0.8560
Random Forest            : AUC=0.9756, Accuracy=0.9120
XGBoost                  : AUC=0.9823, Accuracy=0.9280
LightGBM                 : AUC=0.9812, Accuracy=0.9250

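🏆 Analyzing the results

1st: XGBoost (AUC 0.9823)
2nd: LightGBM (AUC 0.9812)
3rd: Random Forest (AUC 0.9756)

Key findings:
・XGBoost and LightGBM achieve the highest AUC, narrowly ahead of Random Forest and well ahead of Logistic Regression
・Hyperparameter tuning can improve them further
・On Kaggle, small gains like these add up and decide top placements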
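
The project ranks models by AUC rather than accuracy, and the reason is worth a small example. AUC can be read as the probability that a randomly chosen positive (churned) customer receives a higher score than a randomly chosen negative one. The sketch below computes it by brute force over all pairs and compares it with accuracy for a useless model on 70/30 data. The function `pairwise_auc` is written for this illustration; in practice `roc_auc_score` does the same job far more efficiently.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Ties count as half a win."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# A small ranking example: one positive is scored below one negative
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))

# A "model" that predicts no churn for everyone, on 70/30 data
y_imbalanced = [0] * 7 + [1] * 3
constant_scores = [0.0] * 10
accuracy = sum(1 for y in y_imbalanced if y == 0) / len(y_imbalanced)
print(f"accuracy = {accuracy:.2f}, AUC = {pairwise_auc(y_imbalanced, constant_scores):.2f}")
```

The constant model reaches 70% accuracy without learning anything, while its AUC stays at 0.5, the value of random guessing. On imbalanced data, AUC is therefore the more informative number of the two.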

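📝 Practice Questions

Question 1 (Easy)

Gradient Boosting basics

Select all correct statements about Gradient Boosting.

A. It sequentially adds models that predict the residuals (errors)
B. Each model is trained in parallel
C. A larger learning_rate makes overfitting more likely
D. It is a kind of random forest
E. XGBoost is an improved version of Gradient Boosting

Answer: A, C, E

A (correct): Gradient Boosting keeps adding models that predict the residuals.
B (incorrect): The models are trained sequentially, because each one depends on the previous ones. Parallel training is a characteristic of bagging.
C (correct): With a large learning_rate each tree contributes more, which makes overfitting more likely.
D (incorrect): Random forest is a bagging method and Gradient Boosting is a boosting method; they are different techniques.
E (correct): XGBoost is a faster, improved implementation of Gradient Boosting.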

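Question 2 (Easy)

XGBoost vs LightGBM

For each feature below, decide whether it describes XGBoost or LightGBM.

1. Grows trees leaf-wise
2. Grows trees level-wise
3. Faster on large datasets
4. Most popular on Kaggle

Answer: 1. LightGBM, 2. XGBoost, 3. LightGBM, 4. XGBoost

1. LightGBM: the leaf-wise strategy splits the leaf that reduces the loss the most first.
2. XGBoost: the level-wise strategy (its default) splits every node at the current depth before going deeper.
3. LightGBM: histogram-based optimization makes it fast on large datasets.
4. XGBoost: the more widely adopted of the two, with extensive documentation and a large community.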

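Question 3 (Medium)

The relationship between learning_rate and n_estimators

When using learning_rate=0.01 versus learning_rate=0.1, explain how n_estimators (the number of trees) should be adjusted to reach comparable performance.

Answer

The smaller the learning_rate, the larger n_estimators needs to be.

learning_rate=0.01: each tree contributes little, so many trees are needed (e.g. 1000)
learning_rate=0.1: each tree contributes more, so fewer trees suffice (e.g. 100)

Trade-off:
・Small learning_rate + large n_estimators → less prone to overfitting, but slower to train
・Large learning_rate + small n_estimators → fast, but more prone to overfitting

Recommended: start with learning_rate=0.05 to 0.1 and use Early Stopping to find the best n_estimators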

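Question 4 (Medium)

Priorities in parameter tuning

When tuning XGBoost's hyperparameters, explain which parameters should be adjusted first and in what order.

Answer

Priority (largest impact first):

1. n_estimators + learning_rate: the largest impact. Start with a large n_estimators and a small learning_rate
2. max_depth: controls tree complexity. Try values from 3 to 10
3. subsample, colsample_bytree: reduce overfitting. Try values from 0.6 to 1.0
4. min_child_weight, gamma: regularization parameters. Fine-tune
5. reg_alpha, reg_lambda: L1/L2 regularization. Final adjustments

A practical approach:
1. Establish a baseline with default parameters
2. Set n_estimators=1000, learning_rate=0.01 and use Early Stopping
3. Tune max_depth
4. Tune subsample and colsample_bytree
5. Fine-tune the remaining parameters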

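Question 5 (Hard)

Choosing between XGBoost and LightGBM

For each situation below, explain which of XGBoost and LightGBM you would choose, and why.

1. The data has 1 million rows and 100 features
2. You want to maximize accuracy and do not care about compute time
3. There are many categorical variables (50 or more)
4. Model interpretability is a priority

Answer

1. LightGBM: on large datasets LightGBM is considerably faster, and it is also memory-efficient.
2. Either (or try both): accuracy is comparable. The best approach is to tune both and compare. On Kaggle, ensembling the two is also effective.
3. CatBoost (or LightGBM): CatBoost handles categorical variables automatically. LightGBM can also handle them through its categorical_feature option. XGBoost generally needs encoding.
4. XGBoost (or a simpler model): XGBoost has more extensive documentation and many interpretation tools. If interpretability matters most, however, a decision tree or logistic regression is the better choice.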

Question 6 (Hard)

Implementing XGBoost tuning

Use GridSearchCV with XGBoost to find the best parameters on the breast cancer dataset.

Answer

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier

# Data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.05, 0.1, 0.2]
}

# GridSearch
grid_search = GridSearchCV(
    XGBClassifier(random_state=42, eval_metric='logloss'),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

print("=== Results ===")
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
print(f"Test score: {grid_search.score(X_test, y_test):.4f}")

Example output (exact numbers may vary with library versions):

=== Results ===
Best parameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}
Best CV score: 0.9724
Test score: 0.9766

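📝 STEP 15 Summary

✅ What you learned in this step
Gradient Boosting: sequentially predicts residuals
XGBoost: one of the most popular methods on Kaggle
LightGBM: very fast and strong on large datasets
CatBoost: strong with categorical features
The importance of hyperparameter tuning
A hands-on Kaggle-style competition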

🎯 Recommended workflow in practice

1. Baseline (logistic regression, decision tree)
   ↓
2. Random forest
   ↓
3. XGBoost / LightGBM (default parameters)
   ↓
4. Hyperparameter tuning
   ↓
5. Ensembling (combining several models)
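
Step 5 can be as simple as averaging the predicted probabilities of several models, often called soft voting. The sketch below uses made-up probabilities standing in for the `predict_proba(X)[:, 1]` output of three already-trained models; the model names in the dict and the helper `soft_vote` are hypothetical.

```python
def soft_vote(probabilities_per_model):
    """Average the positive-class probabilities of several models, row by row."""
    n_models = len(probabilities_per_model)
    return [sum(row) / n_models for row in zip(*probabilities_per_model)]


# Made-up positive-class probabilities for three test rows
model_probs = {
    "xgboost":       [0.9, 0.2, 0.7],
    "lightgbm":      [0.7, 0.4, 0.5],
    "random_forest": [0.8, 0.3, 0.6],
}

averaged = soft_vote(list(model_probs.values()))
labels = [1 if p >= 0.5 else 0 for p in averaged]
print("averaged probabilities:", [round(p, 3) for p in averaged])
print("predicted labels:      ", labels)
```

Averaging helps most when the models make different kinds of errors, so combining a boosting model with a random forest or a linear model is usually more useful than combining two near-identical boosters.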

🚀 On to the Next Step

The next lesson, STEP 16, covers support vector machines (SVM). They capture non-linear patterns using a powerful technique called the kernel trick.

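❓ Frequently Asked Questions

Q1. Which should I use, XGBoost or LightGBM?

It depends on the situation:
・XGBoost: general purpose, extensive documentation, large community
・LightGBM: large datasets, or when fast training is required

Recommendation: start with XGBoost, and consider LightGBM if speed becomes a problem

Q2. What is Early Stopping?

It automatically stops training once the score on the validation data stops improving.

Usage (XGBoost 1.6 and later; older versions passed early_stopping_rounds to fit() instead):
model = XGBClassifier(n_estimators=1000, early_stopping_rounds=10, eval_metric='logloss')
model.fit(X_train, y_train, eval_set=[(X_val, y_val)])

→ Training stops if the validation score does not improve for 10 consecutive rounds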
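
The stopping rule itself is simple enough to write out. The sketch below applies it to a made-up list of per-round validation losses; the function name `early_stopping_round` is hypothetical, and real libraries evaluate the loss after each tree is added instead of receiving a precomputed list.

```python
def early_stopping_round(val_losses, patience):
    """Return (best_round, stopped_at), both 1-based.

    Training stops once `patience` rounds have passed without a new best loss."""
    best_loss = float("inf")
    best_round = 0
    for round_no, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_round = loss, round_no
        elif round_no - best_round >= patience:
            return best_round, round_no
    return best_round, len(val_losses)


# Made-up validation losses: improvement stalls after round 4
val_losses = [0.60, 0.50, 0.45, 0.44, 0.46, 0.47, 0.45, 0.48]
best_round, stopped_at = early_stopping_round(val_losses, patience=3)
print(f"best round: {best_round}, training stopped at round: {stopped_at}")
```

Here round 4 has the lowest validation loss, and training stops at round 7 after three rounds without improvement. The model from the best round is the one to keep, which is why early stopping doubles as a way to choose n_estimators.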

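Q3. Can training be accelerated with a GPU?

Yes. Both XGBoost and LightGBM support it:
・XGBoost: device='cuda' with tree_method='hist' (XGBoost 2.0 and later; older versions used tree_method='gpu_hist')
・LightGBM: device='gpu'

On large datasets the speedup can exceed 10x.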

Q4. How can I prevent overfitting?

Main countermeasures:
・Lower learning_rate (0.01 to 0.1)
・Use a shallower max_depth (3 to 6)
・Lower subsample and colsample_bytree (0.7 to 0.8)
・Increase min_child_weight and gamma
・Use Early Stopping

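Q5. When should I use CatBoost?

Situations where CatBoost fits well:
・Many categorical variables (handled automatically)
・You do not want to spend time on tuning (strong performance with defaults)
・You work with text data (text features are supported)

Note: installation can take a while, and it may not work in some environments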
