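🎯 STEP 9: Object Detection Fundamentals

This step covers what object detection is, Bounding Boxes, IoU, NMS,
the major datasets, and annotation tools.

📋 What you will learn in this step

What object detection is (and how it differs from image classification)
The three ways to represent a Bounding Box
How IoU (Intersection over Union) is computed and what it means
How Non-Maximum Suppression (NMS) works and how to implement it
The major object detection datasets (PASCAL VOC, MS COCO)
Annotation tools (LabelImg, CVAT, Roboflow)
Building an object detection dataset in PyTorch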

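🎯 1. What is object detection?

Object Detection is the task of identifying both what is in an image and where it is. Image classification answers "this image contains a cat". Object detection answers "there is a cat in the region from (100, 50) to (300, 200)", so its output includes position information as well.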

1-1. How the three CV tasks differ

【Image Classification】

Input:   one image
Output:  one class label
Example: "cat"

         ┌─────────────┐
  Image  │     🐱      │  →  "cat" (confidence: 0.95)
         └─────────────┘

Use:        decide what the main object in the image is
Limitation: assumes one object per image; no position information

─────────────────────────────────────────────────

【Object Detection】

Input:   one image
Output:  multiple [class + Bounding Box + confidence] entries
Example: [("cat", [100, 50, 300, 200], 0.95),
          ("dog", [400, 100, 550, 280], 0.87)]

         ┌─────────────────────────────┐
  Image  │  ┌─────┐      ┌─────┐      │
         │  │ 🐱  │      │ 🐕  │      │
         │  └─────┘      └─────┘      │
         └─────────────────────────────┘
              ↑              ↑
           cat 0.95       dog 0.87

Use:    detect multiple objects and locate each one
Output: a rectangle enclosing each object

─────────────────────────────────────────────────

【Semantic Segmentation】

Input:   one image
Output:  a class label for every pixel
Example: each pixel is assigned "cat", "dog", "background", and so on

         ┌─────────────────────────────┐
  Image  │  ████████      ████████    │
         │  ████████      ████████    │
         │  ████████      ████████    │
         └─────────────────────────────┘
           cat pixels     dog pixels

Use:    recover the precise outline of each object region
Output: a pixel-level mask

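1-2. The two tasks that object detection solves

💡 Object detection = classification + localization

Object detection solves the following two sub-tasks at the same time:

① Classification
Decide what each object is.
Example: this is a cat, this is a dog, this is a car, ...

② Localization
Determine where each object is.
Example: the cat is in the upper left of the image (expressed as coordinates)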
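As a rough illustration of the two sub-tasks, the sketch below models a detector's output as a list of records, each pairing a class label (classification) with a box (localization) and a confidence. The `Detection` class, its field names, and `filter_by_score` are hypothetical names chosen for this example only; the values are the cat and dog example from section 1-1.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """One detected object (hypothetical container, for illustration only)."""
    label: str                              # classification: what the object is
    box: Tuple[float, float, float, float]  # localization: (x1, y1, x2, y2) in pixels
    score: float                            # confidence in [0, 1]


def filter_by_score(detections: List[Detection], min_score: float) -> List[Detection]:
    """Keep only detections whose confidence is at least min_score."""
    return [d for d in detections if d.score >= min_score]


detections = [
    Detection("cat", (100, 50, 300, 200), 0.95),
    Detection("dog", (400, 100, 550, 280), 0.87),
]

# Each record answers both questions: "what" (label) and "where" (box)
for d in filter_by_score(detections, min_score=0.9):
    print(d.label, d.box, d.score)
```

Keeping the label, box, and score together in one record makes later steps such as score filtering and NMS easier to write than juggling three parallel lists.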

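1-3. Application areas of object detection

Field	Use	Detection targets	Requirements
Autonomous driving	Perceiving surroundings, collision avoidance	Pedestrians, cars, traffic lights, signs	Real-time, high accuracy
Surveillance cameras	Anomaly detection, person tracking	People, cars, suspicious objects	24-hour operation, few false alarms
Manufacturing	Quality inspection, defect detection	Scratches, defects, foreign objects	High accuracy, small-object detection
Medicine	Lesion detection, diagnostic support	Tumors, lesion sites	High accuracy, minimal misses
Retail	Product recognition, shelf analysis	Products, price tags	Many classes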

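📦 2. Bounding Box

A Bounding Box is a rectangle that encloses an object in an image. An object detector expresses the position of an object by predicting the coordinates of this rectangle.

2-1. Three representations

There are three main ways to write a Bounding Box, and which one is used depends on the tool or model. It is important to be able to convert between them.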

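【Three Bounding Box representations】

■ Format 1: (x, y, w, h), used by MS COCO
  x:  x-coordinate of the top-left corner (pixels)
  y:  y-coordinate of the top-left corner (pixels)
  w:  width (pixels)
  h:  height (pixels)

  Example: (100, 80, 200, 150)
      → a Box with top-left corner (100, 80), width 200, height 150

         (100,80)
            ↓
            ┌──────────────────┐
            │                  │ height
            │      object      │ 150px
            │                  │
            └──────────────────┘
                 width 200px

■ Format 2: (x1, y1, x2, y2), used by PASCAL VOC (as xmin, ymin, xmax, ymax) and by PyTorch (torchvision)
  x1, y1: top-left corner
  x2, y2: bottom-right corner

  Example: (100, 80, 300, 230)
      → top-left (100, 80), bottom-right (300, 230)

         (100,80)
            ↓
            ┌──────────────────┐
            │                  │
            │      object      │
            │                  │
            └──────────────────┘
                              ↑
                         (300,230)

■ Format 3: YOLO format, used by the YOLO family
  x_center: x-coordinate of the center (normalized to 0-1)
  y_center: y-coordinate of the center (normalized to 0-1)
  width:    width (normalized to 0-1)
  height:   height (normalized to 0-1)

  Example: (0.3125, 0.3229, 0.3125, 0.3125)
      → the same Box as above in a 640×480 image:
        the center is at (31.25%, 32.29%) of the image,
        and the width and height are each 31.25% of the image

  Note: these are relative coordinates, independent of the image size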

2-2. Implementing the conversions

Let's implement functions that convert between the formats.

# ===================================================
# Bounding Box format conversion functions
# ===================================================

def xywh_to_xyxy(bbox):
    """
    Convert (x, y, w, h) format to (x1, y1, x2, y2) format.

    Args:
        bbox: [x, y, w, h] - top-left corner plus width and height

    Returns:
        [x1, y1, x2, y2] - top-left and bottom-right corners
    """
    x, y, w, h = bbox
    x1 = x          # top-left x is unchanged
    y1 = y          # top-left y is unchanged
    x2 = x + w      # bottom-right x = top-left x + width
    y2 = y + h      # bottom-right y = top-left y + height
    return [x1, y1, x2, y2]


def xyxy_to_xywh(bbox):
    """
    Convert (x1, y1, x2, y2) format to (x, y, w, h) format.

    Args:
        bbox: [x1, y1, x2, y2] - top-left and bottom-right corners

    Returns:
        [x, y, w, h] - top-left corner plus width and height
    """
    x1, y1, x2, y2 = bbox
    x = x1              # top-left x is unchanged
    y = y1              # top-left y is unchanged
    w = x2 - x1         # width = bottom-right x - top-left x
    h = y2 - y1         # height = bottom-right y - top-left y
    return [x, y, w, h]


def xywh_to_yolo(bbox, image_width, image_height):
    """
    Convert (x, y, w, h) format to YOLO format (normalized).

    Args:
        bbox: [x, y, w, h] - top-left corner plus width and height
        image_width: image width in pixels
        image_height: image height in pixels

    Returns:
        [x_center, y_center, width, height] - normalized center and size
    """
    x, y, w, h = bbox

    # Center coordinates in pixels
    x_center_pixel = x + w / 2
    y_center_pixel = y + h / 2

    # Normalize to the 0-1 range
    x_center = x_center_pixel / image_width
    y_center = y_center_pixel / image_height
    width = w / image_width
    height = h / image_height

    return [x_center, y_center, width, height]


def yolo_to_xywh(bbox, image_width, image_height):
    """
    Convert YOLO format to (x, y, w, h) format.

    Args:
        bbox: [x_center, y_center, width, height] - normalized values
        image_width: image width in pixels
        image_height: image height in pixels

    Returns:
        [x, y, w, h] - coordinates in pixels
    """
    x_center, y_center, width, height = bbox

    # Undo the normalization to get pixel values
    w = width * image_width
    h = height * image_height
    x_center_pixel = x_center * image_width
    y_center_pixel = y_center * image_height

    # Top-left corner from the center
    x = x_center_pixel - w / 2
    y = y_center_pixel - h / 2

    return [x, y, w, h]


# ===================================================
# Testing the conversions
# ===================================================

# Original Bounding Box (x, y, w, h)
original_bbox = [100, 80, 200, 150]
image_width, image_height = 640, 480

print("Original Box (x, y, w, h):", original_bbox)

# (x, y, w, h) → (x1, y1, x2, y2)
xyxy = xywh_to_xyxy(original_bbox)
print("(x1, y1, x2, y2) format:", xyxy)

# (x, y, w, h) → YOLO format
yolo = xywh_to_yolo(original_bbox, image_width, image_height)
print("YOLO format:", [f"{v:.4f}" for v in yolo])

# YOLO format → (x, y, w, h) (checking the round trip)
back_to_xywh = yolo_to_xywh(yolo, image_width, image_height)
print("Round trip (x, y, w, h):", [f"{v:.1f}" for v in back_to_xywh])

Output:

Original Box (x, y, w, h): [100, 80, 200, 150]
(x1, y1, x2, y2) format: [100, 80, 300, 230]
YOLO format: ['0.3125', '0.3229', '0.3125', '0.3125']
Round trip (x, y, w, h): ['100.0', '80.0', '200.0', '150.0']
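The conversions above go through (x, y, w, h). In practice, boxes often arrive in (x1, y1, x2, y2) form and may extend slightly past the image border, which would produce normalized values outside 0-1. The following is a minimal sketch of a direct corner-to-YOLO conversion that clamps to the image first. The function names `xyxy_to_yolo` and `yolo_to_xyxy` and the clamping policy are assumptions made for this example, not part of the lesson's code.

```python
def xyxy_to_yolo(bbox, image_width, image_height):
    """(x1, y1, x2, y2) in pixels -> normalized [x_center, y_center, width, height]."""
    x1, y1, x2, y2 = bbox
    # Clamp to the image so the normalized values stay within 0..1
    x1, x2 = max(0, x1), min(image_width, x2)
    y1, y2 = max(0, y1), min(image_height, y2)
    # Reject boxes that have no area left after clamping
    if x2 <= x1 or y2 <= y1:
        raise ValueError(f"box has no area inside the image: {bbox}")
    return [
        (x1 + x2) / 2 / image_width,
        (y1 + y2) / 2 / image_height,
        (x2 - x1) / image_width,
        (y2 - y1) / image_height,
    ]


def yolo_to_xyxy(bbox, image_width, image_height):
    """Normalized [x_center, y_center, width, height] -> (x1, y1, x2, y2) in pixels."""
    xc, yc, w, h = bbox
    return [
        (xc - w / 2) * image_width,
        (yc - h / 2) * image_height,
        (xc + w / 2) * image_width,
        (yc + h / 2) * image_height,
    ]


yolo_box = xyxy_to_yolo([100, 80, 300, 230], 640, 480)
print([f"{v:.4f}" for v in yolo_box])

# A box that sticks out of a 640x480 image is clamped before normalizing
clamped = xyxy_to_yolo([-20, 400, 100, 500], 640, 480)
print([f"{v:.4f}" for v in clamped])
```

Clamping silently changes the box, so whether to clamp or to reject out-of-range boxes is a design choice that depends on how the annotations were produced.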

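📐 3. IoU (Intersection over Union)

IoU measures how much two Bounding Boxes overlap. In object detection it is used to evaluate how well a predicted Box matches the ground-truth Box.

3-1. The IoU formula

【IoU formula】

          area of intersection
IoU = ─────────────────────────
            area of union

area of union = area of Box A + area of Box B - area of intersection

【Diagram】
     ┌─────────────┐
     │   Box A     │
     │      ┌──────┼─────────┐
     │      │inter.│         │
     └──────┼──────┘         │
            │      Box B     │
            └────────────────┘

Intersection = the region covered by both Boxes
Union        = the whole region covered by at least one of the Boxes

【Interpreting IoU values】
IoU = 1.0      : perfect match (identical Boxes)
IoU = 0.7-0.9  : large overlap (a good detection)
IoU = 0.5-0.7  : moderate overlap (acceptable)
IoU = 0.3-0.5  : small overlap
IoU = 0.0      : no overlap at all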

3-2. Implementing IoU

def compute_iou(box1, box2):
    """
    Compute the IoU of two Bounding Boxes.

    Args:
        box1: Box in [x1, y1, x2, y2] format
        box2: Box in [x1, y1, x2, y2] format

    Returns:
        IoU value (0.0 to 1.0)
    """
    # Step 1: coordinates of the intersection
    # Top-left of the intersection = max of the two top-left corners
    x1_inter = max(box1[0], box2[0])
    y1_inter = max(box1[1], box2[1])

    # Bottom-right of the intersection = min of the two bottom-right corners
    x2_inter = min(box1[2], box2[2])
    y2_inter = min(box1[3], box2[3])

    # Step 2: area of the intersection
    # A negative width or height means the Boxes do not intersect (use 0)
    inter_width = max(0, x2_inter - x1_inter)
    inter_height = max(0, y2_inter - y1_inter)
    inter_area = inter_width * inter_height

    # Step 3: area of each Box
    box1_area = (box1[2] - box1[0]) * (box1[3] - box1[1])
    box2_area = (box2[2] - box2[0]) * (box2[3] - box2[1])

    # Step 4: area of the union
    # Union = Box1 + Box2 - intersection (so the overlap is not counted twice)
    union_area = box1_area + box2_area - inter_area

    # Step 5: IoU (guard against division by zero for degenerate Boxes)
    if union_area > 0:
        iou = inter_area / union_area
    else:
        iou = 0.0

    return iou


# ===================================================
# Testing the IoU computation
# ===================================================

# Test case 1: partially overlapping Boxes
box_a = [50, 30, 200, 150]   # top-left (50,30), bottom-right (200,150)
box_b = [100, 80, 250, 200]  # top-left (100,80), bottom-right (250,200)

iou1 = compute_iou(box_a, box_b)
print(f"Test 1 - partial overlap: IoU = {iou1:.4f}")

# Show the calculation
print("\nCalculation:")
print(f"  Box A: {box_a}")
print(f"  Box B: {box_b}")
print(f"  Intersection: x={max(50,100)}-{min(200,250)}, y={max(30,80)}-{min(150,200)}")
print(f"           = [100, 80, 200, 150]")
print(f"  Intersection area: {100*70} = 7,000")
print(f"  Box A area: {150*120} = 18,000")
print(f"  Box B area: {150*120} = 18,000")
print(f"  Union: 18,000 + 18,000 - 7,000 = 29,000")
print(f"  IoU: 7,000 / 29,000 = {7000/29000:.4f}")

# Test case 2: identical Boxes
box_c = [100, 100, 200, 200]
box_d = [100, 100, 200, 200]
iou2 = compute_iou(box_c, box_d)
print(f"\nTest 2 - perfect match: IoU = {iou2:.4f}")

# Test case 3: no overlap
box_e = [0, 0, 100, 100]
box_f = [200, 200, 300, 300]
iou3 = compute_iou(box_e, box_f)
print(f"Test 3 - no overlap: IoU = {iou3:.4f}")

Output:

Test 1 - partial overlap: IoU = 0.2414

Calculation:
  Box A: [50, 30, 200, 150]
  Box B: [100, 80, 250, 200]
  Intersection: x=100-200, y=80-150
           = [100, 80, 200, 150]
  Intersection area: 7000 = 7,000
  Box A area: 18000 = 18,000
  Box B area: 18000 = 18,000
  Union: 18,000 + 18,000 - 7,000 = 29,000
  IoU: 7,000 / 29,000 = 0.2414

Test 2 - perfect match: IoU = 1.0000
Test 3 - no overlap: IoU = 0.0000

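3-3. What IoU is used for

💡 Three main uses of IoU

① As an evaluation metric
Compute the IoU between a predicted Box and the ground truth to decide whether the detection is correct.
Under the PASCAL VOC criterion, IoU ≥ 0.5 counts as a correct detection.
MS COCO evaluates at several thresholds: IoU = 0.5, 0.55, ..., 0.95.

② In NMS (Non-Maximum Suppression)
When removing duplicate detections, Boxes whose IoU with a selected Box is at or above a threshold are deleted.
The threshold is the criterion for deciding that two Boxes point at the same object.

③ As a loss function
Some detectors, including YOLO variants, optimize Box positions with an IoU-based loss.
IoU Loss = 1 - IoU (the higher the IoU, the lower the loss)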
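To make uses ① and ③ concrete, here is a small sketch that checks one prediction against the ten COCO-style thresholds and computes the plain IoU loss. It only illustrates the threshold test: a full mAP evaluation also involves score ranking, matching each ground truth at most once, and precision-recall curves, none of which are shown. `thresholds_passed` and `iou_loss` are hypothetical helper names, and `compute_iou` is restated compactly so the block runs on its own.

```python
def compute_iou(box1, box2):
    """IoU of two [x1, y1, x2, y2] boxes."""
    inter_w = max(0, min(box1[2], box2[2]) - max(box1[0], box2[0]))
    inter_h = max(0, min(box1[3], box2[3]) - max(box1[1], box2[1]))
    inter = inter_w * inter_h
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - inter
    return inter / union if union > 0 else 0.0


# COCO-style thresholds: 0.50, 0.55, ..., 0.95 (10 values)
COCO_THRESHOLDS = [round(0.5 + 0.05 * k, 2) for k in range(10)]


def thresholds_passed(pred_box, gt_box, thresholds=COCO_THRESHOLDS):
    """Return the thresholds at which this prediction would count as a match."""
    iou = compute_iou(pred_box, gt_box)
    return [t for t in thresholds if iou >= t]


def iou_loss(pred_box, gt_box):
    """Plain IoU loss: 1 - IoU (0 for a perfect match, 1 for no overlap)."""
    return 1.0 - compute_iou(pred_box, gt_box)


gt = [50, 30, 200, 150]
pred = [55, 35, 205, 155]   # shifted by 5 px in both directions

print(f"IoU      = {compute_iou(pred, gt):.4f}")
print(f"passes   = {thresholds_passed(pred, gt)}")
print(f"IoU loss = {iou_loss(pred, gt):.4f}")
```

A prediction that is "correct" at IoU ≥ 0.5 can still fail the stricter thresholds, which is why COCO-style evaluation rewards tighter localization than the single PASCAL VOC threshold does.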

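🧹 4. Non-Maximum Suppression (NMS)

NMS (Non-Maximum Suppression) is an algorithm that removes duplicate detections. A detector often outputs several Bounding Boxes for a single object, so NMS discards the duplicates and keeps only the most confident Box for each object.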

4-1. Why NMS is needed

【Problem: several Boxes are output for one object】

Example detector output:
  Box 1: cat [50, 30, 200, 150] confidence=0.95
  Box 2: cat [55, 35, 205, 155] confidence=0.92  ← almost the same as Box 1
  Box 3: cat [52, 32, 198, 148] confidence=0.88  ← almost the same as Box 1
  Box 4: dog [300, 100, 450, 250] confidence=0.85

         ┌─────────────────────────────────┐
         │  ┌─────┐                        │
         │  │┌────┼┐    ┌─────┐           │
         │  ││ 🐱 ││    │ 🐕  │           │
         │  │└────┼┘    └─────┘           │
         │  └─────┘                        │
         └─────────────────────────────────┘
           ↑
        three overlapping Boxes

【Solution: remove the duplicates with NMS】
Result:
  Box 1: cat [50, 30, 200, 150] confidence=0.95  ← kept
  Box 4: dog [300, 100, 450, 250] confidence=0.85 ← kept

4-2. The NMS algorithm

💡 NMS steps

Step 1: Sort all Boxes by confidence (score) in descending order
Step 2: Select the Box with the highest confidence and add it to the result list
Step 3: Compute the IoU between the selected Box and every remaining Box
Step 4: Delete the Boxes with IoU ≥ threshold (they are judged to point at the same object)
Step 5: Repeat Steps 2 to 4 until no Boxes remain

【A worked NMS example】

Input:
  Box 1: [50, 30, 200, 150]   score=0.95
  Box 2: [55, 35, 205, 155]   score=0.92
  Box 3: [300, 100, 450, 250] score=0.85

IoU threshold = 0.5

─────────────────────────────────────────────────

Round 1:
  After sorting: Box 1(0.95) → Box 2(0.92) → Box 3(0.85)
  Select Box 1 → result: [Box 1]

  IoU of Box 1 and Box 2 → IoU ≈ 0.86 ≥ 0.5 → delete Box 2
  IoU of Box 1 and Box 3 → IoU = 0.00 < 0.5 → keep Box 3

  Remaining: [Box 3]

─────────────────────────────────────────────────

Round 2:
  Remaining: [Box 3]
  Select Box 3 → result: [Box 1, Box 3]

  Nothing remains → done

─────────────────────────────────────────────────

Final result: [Box 1, Box 3]

4-3. Implementing NMS

import numpy as np

def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    """
    Run Non-Maximum Suppression (NMS).

    Args:
        boxes: list of Bounding Boxes [[x1,y1,x2,y2], ...]
        scores: confidence score of each Box [0.95, 0.92, ...]
        iou_threshold: Boxes with IoU at or above this value are deleted

    Returns:
        keep: list of indices of the Boxes to keep
    """
    boxes = np.array(boxes, dtype=float)
    scores = np.array(scores, dtype=float)

    # Extract the Box coordinates
    x1 = boxes[:, 0]
    y1 = boxes[:, 1]
    x2 = boxes[:, 2]
    y2 = boxes[:, 3]

    # Area of each Box
    areas = (x2 - x1) * (y2 - y1)

    # Indices sorted by score in descending order
    order = scores.argsort()[::-1]  # argsort() is ascending; [::-1] reverses it

    keep = []  # indices of the Boxes to keep

    while len(order) > 0:
        # Index of the Box with the highest remaining score
        i = order[0]
        keep.append(int(i))  # int() so the result is a plain Python int

        # Stop when only one Box is left
        if len(order) == 1:
            break

        # Intersection of the selected Box with each remaining Box
        # np.maximum/minimum take the element-wise max/min
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])

        # Width and height of the intersection (0 if negative)
        w = np.maximum(0.0, xx2 - xx1)
        h = np.maximum(0.0, yy2 - yy1)

        # Intersection area
        inter = w * h

        # Union area
        union = areas[i] + areas[order[1:]] - inter

        # IoU
        iou = inter / union

        # Keep only the Boxes whose IoU is below the threshold
        # (IoU >= threshold is deleted, matching Step 4 of the algorithm)
        # np.where returns the indices that satisfy the condition
        inds = np.where(iou < iou_threshold)[0]

        # Update order (+1 corrects for indices being relative to order[1:])
        order = order[inds + 1]

    return keep


# ===================================================
# Testing NMS
# ===================================================

# Test data
boxes = [
    [50, 30, 200, 150],    # Box 0: cat 1
    [55, 35, 205, 155],    # Box 1: cat 1 (duplicate)
    [52, 32, 198, 148],    # Box 2: cat 1 (duplicate)
    [300, 100, 450, 250],  # Box 3: dog
]
scores = [0.95, 0.92, 0.88, 0.85]

print("Before NMS:")
for i, (box, score) in enumerate(zip(boxes, scores)):
    print(f"  Box {i}: {box}, score={score}")

# Run NMS
keep_indices = non_maximum_suppression(boxes, scores, iou_threshold=0.5)

print(f"\nIndices of kept Boxes: {keep_indices}")

print("\nAfter NMS:")
for i in keep_indices:
    print(f"  Box {i}: {boxes[i]}, score={scores[i]}")

Output:

Before NMS:
  Box 0: [50, 30, 200, 150], score=0.95
  Box 1: [55, 35, 205, 155], score=0.92
  Box 2: [52, 32, 198, 148], score=0.88
  Box 3: [300, 100, 450, 250], score=0.85

Indices of kept Boxes: [0, 3]

After NMS:
  Box 0: [50, 30, 200, 150], score=0.95
  Box 3: [300, 100, 450, 250], score=0.85
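The NMS function above takes only boxes and scores, so it would also suppress a dog Box that heavily overlaps a higher-scoring cat Box. A common remedy is to run NMS separately for each class. The sketch below shows one way to do that in plain Python; `class_aware_nms` and the test data are hypothetical and written for this example only.

```python
from collections import defaultdict


def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Class-agnostic NMS; returns kept indices in descending score order."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the selected one too much
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep


def class_aware_nms(boxes, scores, labels, iou_threshold=0.5):
    """Run NMS independently within each class label."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    keep = []
    for indices in by_class.values():
        kept_local = nms([boxes[i] for i in indices],
                         [scores[i] for i in indices], iou_threshold)
        keep.extend(indices[k] for k in kept_local)  # map back to global indices
    return sorted(keep, key=lambda i: scores[i], reverse=True)


# A cat and a dog sitting almost on top of each other
boxes = [[50, 30, 200, 150], [55, 35, 205, 155], [60, 40, 210, 160]]
scores = [0.95, 0.92, 0.90]
labels = ["cat", "cat", "dog"]

print("class-agnostic:", nms(boxes, scores))
print("class-aware:   ", class_aware_nms(boxes, scores, labels))
```

With PyTorch available, `torchvision.ops.batched_nms` provides per-class suppression on tensors; the version here is only meant to show the idea.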

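4-4. Tuning the NMS parameter

🎯 Guide to choosing the IoU threshold

High IoU threshold (0.7-0.9):
・More Boxes are kept
・Densely packed objects can be detected individually
・Duplicate detections (several Boxes on the same object) may increase

Low IoU threshold (0.3-0.5):
・Fewer Boxes are kept
・Duplicate detections are removed reliably
・Objects that are close together may be missed

Typical settings (rough guide):
・PASCAL VOC: 0.5
・MS COCO: 0.5-0.7
・Crowded scenes (crowds, etc.): 0.7-0.9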
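The trade-off described above can be seen on a tiny hand-made scene: two people standing close together (Boxes 0 and 1) plus a near-duplicate of the first person (Box 2). This is a minimal sketch with made-up coordinates; the `nms` helper is a plain-Python restatement written for this example.

```python
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold):
    """Greedy NMS; boxes with IoU >= iou_threshold to a kept box are removed."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep


boxes = [
    [100, 100, 200, 300],  # Box 0: person A
    [130, 100, 230, 300],  # Box 1: person B, standing close (IoU with A ~ 0.54)
    [102, 102, 202, 302],  # Box 2: duplicate of person A (IoU with A ~ 0.94)
]
scores = [0.95, 0.90, 0.80]

for threshold in (0.5, 0.7, 0.95):
    print(f"threshold={threshold}: kept {nms(boxes, scores, threshold)}")
```

In this scene, 0.5 removes person B along with the duplicate, 0.7 keeps both people and drops the duplicate, and 0.95 lets the duplicate through, so the threshold has to be chosen for the expected object density.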

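📊 5. Object detection datasets

Training and evaluating an object detector requires pairs of images and annotations (Bounding Boxes and class labels). This section introduces the major datasets.

5-1. Comparison of major datasets

Dataset	Images (approx.)	Classes	Format	Notes
PASCAL VOC	11,500	20	XML	Beginner-friendly, relatively small
MS COCO	330,000	80	JSON	The current standard benchmark
Open Images	9 million	600	CSV	Very large scale, diverse classes
KITTI	15,000	8	TXT	Aimed at autonomous driving, includes 3D information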

5-2. PASCAL VOC classes

【The 20 PASCAL VOC classes】

Animals:   bird, cat, cow, dog, horse, sheep
Vehicles:  aeroplane, bicycle, boat, bus, car, motorbike, train
Indoor:    bottle, chair, diningtable, pottedplant, sofa, tvmonitor
Person:    person

【Annotation format (XML)】
<annotation>
    <filename>000001.jpg</filename>
    <size>
        <width>640</width>
        <height>480</height>
    </size>
    <object>
        <name>dog</name>
        <bndbox>
            <xmin>48</xmin>
            <ymin>240</ymin>
            <xmax>195</xmax>
            <ymax>371</ymax>
        </bndbox>
    </object>
</annotation>
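An annotation like the one above can be read with Python's standard `xml.etree.ElementTree` module. The sketch below parses the same XML from a string and returns each object's name and (xmin, ymin, xmax, ymax) Box. `parse_voc_annotation` and the shape of the returned dictionary are assumptions made for this example; real VOC files contain additional fields (such as difficulty flags) that are ignored here.

```python
import xml.etree.ElementTree as ET

VOC_XML = """<annotation>
    <filename>000001.jpg</filename>
    <size>
        <width>640</width>
        <height>480</height>
    </size>
    <object>
        <name>dog</name>
        <bndbox>
            <xmin>48</xmin>
            <ymin>240</ymin>
            <xmax>195</xmax>
            <ymax>371</ymax>
        </bndbox>
    </object>
</annotation>"""


def parse_voc_annotation(xml_text):
    """Parse one PASCAL VOC style annotation into a plain dictionary."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    result = {
        "filename": root.findtext("filename"),
        "width": int(size.findtext("width")),
        "height": int(size.findtext("height")),
        "objects": [],
    }
    # One <object> element per annotated instance
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        # float() first, in case a file stores coordinates such as "48.0"
        box = [int(float(bndbox.findtext(tag)))
               for tag in ("xmin", "ymin", "xmax", "ymax")]
        result["objects"].append({"name": obj.findtext("name"), "box": box})
    return result


annotation = parse_voc_annotation(VOC_XML)
print(annotation["filename"], annotation["width"], annotation["height"])
for obj in annotation["objects"]:
    print(obj["name"], obj["box"])
```

Because VOC already stores corner coordinates, the parsed Box can be used directly wherever the (x1, y1, x2, y2) format is expected.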

5-3. MS COCO classes

【The 80 MS COCO classes (excerpt)】

People and accessories: person, backpack, umbrella, handbag, tie, suitcase
Animals:                bird, cat, dog, horse, sheep, cow, elephant, bear, zebra, giraffe
Vehicles:               bicycle, car, motorcycle, airplane, bus, train, truck, boat
Food:                   banana, apple, sandwich, orange, pizza, donut, cake
Furniture and devices:  chair, couch, bed, dining table, toilet, tv, laptop, cell phone
Sports:                 frisbee, skis, snowboard, sports ball, kite, baseball bat

【Annotation format (JSON)】
Note: "bbox" is [x, y, width, height], with (x, y) the top-left corner in pixels.
{
    "images": [{"id": 1, "file_name": "000001.jpg", "width": 640, "height": 480}],
    "annotations": [{
        "id": 1,
        "image_id": 1,
        "category_id": 18,
        "bbox": [100, 80, 200, 150],
        "area": 30000,
        "iscrowd": 0
    }],
    "categories": [{"id": 18, "name": "dog"}]
}
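COCO keeps images, annotations, and categories in separate lists linked by ids, so reading it usually means joining those lists. The sketch below does that with the standard `json` module for the example above and converts each bbox from [x, y, width, height] to corner form. `load_coco_boxes` and the output layout are hypothetical choices for this example; the official COCO tooling offers a richer API that is not shown here.

```python
import json
from collections import defaultdict

COCO_JSON = """
{
    "images": [{"id": 1, "file_name": "000001.jpg", "width": 640, "height": 480}],
    "annotations": [{
        "id": 1,
        "image_id": 1,
        "category_id": 18,
        "bbox": [100, 80, 200, 150],
        "area": 30000,
        "iscrowd": 0
    }],
    "categories": [{"id": 18, "name": "dog"}]
}
"""


def load_coco_boxes(json_text):
    """Group COCO annotations by image file name, with boxes as [x1, y1, x2, y2]."""
    data = json.loads(json_text)
    id_to_name = {c["id"]: c["name"] for c in data["categories"]}
    id_to_file = {img["id"]: img["file_name"] for img in data["images"]}

    per_image = defaultdict(list)
    for ann in data["annotations"]:
        x, y, w, h = ann["bbox"]  # COCO stores top-left corner plus size
        per_image[id_to_file[ann["image_id"]]].append({
            "name": id_to_name[ann["category_id"]],
            "box_xyxy": [x, y, x + w, y + h],
        })
    return dict(per_image)


boxes_by_image = load_coco_boxes(COCO_JSON)
for file_name, objects in boxes_by_image.items():
    print(file_name, objects)
```

Note that COCO category ids are not contiguous, so a training pipeline typically remaps them to 0..N-1 before use.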

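🖊️ 6. Annotation tools

When you build your own dataset, you draw the Bounding Boxes with an annotation tool.

6-1. Comparison of major tools

Tool	Type	Supported tasks	Output formats	Recommended for
LabelImg	Desktop	Object detection	PASCAL VOC, YOLO	Individuals, simple jobs
CVAT	Web-based	Detection, segmentation, video	Many formats	Teams, advanced work
Roboflow	Cloud	Detection, segmentation, classification	Converts to various formats	Projects that also need data management
Label Studio	Web/local	CV, NLP, audio, etc.	Many formats	Multimodal work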

6-2. Using LabelImg

# Installing and launching LabelImg

# Option 1: install with pip
pip install labelImg

# Launch
labelImg

# Option 2: clone from GitHub
git clone https://github.com/tzutalin/labelImg.git
cd labelImg
pip install -r requirements/requirements-linux-python3.txt
python labelImg.py

【Basic LabelImg workflow】

1. Open the image folder with "Open Dir"
2. Set the output folder with "Change Save Dir"
3. Drag on the image to draw a Box
4. Enter the class name
5. Save with Ctrl+S
6. Press D to move to the next image

【Keyboard shortcuts】
W: create a new Bounding Box
D: next image
A: previous image
Ctrl+S: save
Del: delete the selected Box

【Switching the output format】
Click the "PascalVOC" button in the toolbar to switch to YOLO format.
6-3. YOLO-format annotation files

【File layout for YOLO format】

project/
├── images/
│   ├── train/
│   │   ├── image001.jpg
│   │   └── image002.jpg
│   └── val/
│       └── image003.jpg
└── labels/
    ├── train/
    │   ├── image001.txt    ← .txt file with the same name as the image
    │   └── image002.txt
    └── val/
        └── image003.txt

【Contents of an annotation file (image001.txt)】
Columns: class_id x_center y_center width height
0 0.5 0.4 0.3 0.5
1 0.2 0.3 0.1 0.2

Notes:
・One line = one object
・class_id: index of the class (starting at 0)
・x_center, y_center: center coordinates (normalized to 0-1)
・width, height: width and height (normalized to 0-1)
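Hand-edited or converted label files can contain malformed lines, so it can help to validate them before training. The following is a minimal sketch of a parser for the format described above that checks the field count, the class index, and the 0-1 range. `parse_yolo_labels` and its `num_classes` parameter are hypothetical names introduced for this example.

```python
def parse_yolo_labels(text, num_classes):
    """Parse YOLO label text into (class_id, x_center, y_center, width, height) tuples.

    Raises ValueError on malformed lines instead of silently skipping them.
    """
    objects = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        parts = line.split()
        if not parts:
            continue  # ignore blank lines
        if len(parts) != 5:
            raise ValueError(f"line {line_no}: expected 5 fields, got {len(parts)}")

        class_id = int(parts[0])
        values = [float(p) for p in parts[1:]]

        if not 0 <= class_id < num_classes:
            raise ValueError(f"line {line_no}: class_id {class_id} out of range")
        if not all(0.0 <= v <= 1.0 for v in values):
            raise ValueError(f"line {line_no}: values must be normalized to 0-1")

        objects.append((class_id, *values))
    return objects


label_text = "0 0.5 0.4 0.3 0.5\n1 0.2 0.3 0.1 0.2\n"
for obj in parse_yolo_labels(label_text, num_classes=2):
    print(obj)
```

Failing loudly on a bad line is a deliberate choice here: a silently skipped annotation becomes a missing ground-truth Box, which is much harder to notice later.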

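🔧 7. Object detection datasets in PyTorch

To work with object detection data in PyTorch, you write a Dataset class that returns each image together with its target (Boxes and labels).

7-1. Implementing a YOLO-format dataset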

import torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image
import os

class YOLODataset(Dataset):
    """
    Dataset that reads YOLO-format annotations.

    YOLO format:
    - one line = one object
    - class_id x_center y_center width height (all normalized to 0-1)
    """

    def __init__(self, image_dir, label_dir, class_names, transform=None):
        """
        Args:
            image_dir: path to the image folder
            label_dir: path to the label (.txt) folder
            class_names: list of class names ['cat', 'dog', ...]
            transform: image transform (data augmentation, etc.)
        """
        self.image_dir = image_dir
        self.label_dir = label_dir
        self.class_names = class_names
        self.transform = transform

        # Collect the image files (sorted so the index order is reproducible)
        self.image_files = sorted(f for f in os.listdir(image_dir)
                                  if f.lower().endswith(('.jpg', '.jpeg', '.png')))

    def __len__(self):
        return len(self.image_files)

    def __getitem__(self, idx):
        # Load the image
        image_name = self.image_files[idx]
        image_path = os.path.join(self.image_dir, image_name)
        image = Image.open(image_path).convert('RGB')

        # Original image size
        img_width, img_height = image.size

        # Load the matching label file
        label_name = os.path.splitext(image_name)[0] + '.txt'
        label_path = os.path.join(self.label_dir, label_name)

        boxes = []
        labels = []

        if os.path.exists(label_path):
            with open(label_path, 'r') as f:
                for line in f:
                    parts = line.strip().split()
                    if len(parts) == 5:
                        class_id = int(parts[0])
                        x_center = float(parts[1])
                        y_center = float(parts[2])
                        width = float(parts[3])
                        height = float(parts[4])

                        # YOLO format → (x1, y1, x2, y2) in pixels
                        x1 = (x_center - width / 2) * img_width
                        y1 = (y_center - height / 2) * img_height
                        x2 = (x_center + width / 2) * img_width
                        y2 = (y_center + height / 2) * img_height

                        boxes.append([x1, y1, x2, y2])
                        labels.append(class_id)

        # Convert to PyTorch tensors (empty tensors when there are no objects)
        boxes = torch.tensor(boxes, dtype=torch.float32) if boxes else torch.zeros((0, 4))
        labels = torch.tensor(labels, dtype=torch.int64) if labels else torch.zeros((0,), dtype=torch.int64)

        # Target dictionary (the form expected by PyTorch detection models)
        target = {
            'boxes': boxes,
            'labels': labels,
            'image_id': torch.tensor([idx])
        }

        # Apply the image transform
        # Note: this transforms only the image; a resize or flip here would
        # leave the Boxes out of sync with the image.
        if self.transform:
            image = self.transform(image)

        return image, target


# ===================================================
# Usage example (when real data is available)
# ===================================================

# Class names
class_names = ['cat', 'dog']

# Create the dataset (requires: from torchvision import transforms)
# dataset = YOLODataset(
#     image_dir='data/images/train',
#     label_dir='data/labels/train',
#     class_names=class_names,
#     transform=transforms.ToTensor()
# )

# Create the DataLoader (collate_fn is defined in section 7-2)
# train_loader = DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=collate_fn)

print("Defined the YOLODataset class")
print(f"Classes: {class_names}")

Output:

Defined the YOLODataset class
Classes: ['cat', 'dog']

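7-2. collate_fn (batching function)

In object detection the number of objects differs from image to image, so the DataLoader's default batching, which tries to stack every field into a single tensor, does not work for the targets. A custom collate_fn is used instead.

def collate_fn(batch):
    """
    Collate function for object detection.

    The number of objects differs per image, so targets are returned
    as a list (they cannot be stacked into one tensor).

    Args:
        batch: list of [(image, target), (image, target), ...]

    Returns:
        images: batched image tensor, or a list if the sizes differ
        targets: list of target dictionaries
    """
    images = []
    targets = []

    for image, target in batch:
        images.append(image)
        targets.append(target)

    # Stack the images into one batch tensor when possible.
    # torch.stack raises RuntimeError for mismatched sizes and TypeError
    # for non-tensor inputs; in either case keep the images as a list.
    try:
        images = torch.stack(images, dim=0)
    except (RuntimeError, TypeError):
        pass

    return images, targets


# Usage example
# train_loader = DataLoader(
#     dataset,
#     batch_size=4,
#     shuffle=True,
#     collate_fn=collate_fn
# )

print("Defined collate_fn")
print("Pass collate_fn=collate_fn to the DataLoader to use it")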
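An alternative idiom often seen in detection training scripts skips stacking altogether and simply transposes the batch into a tuple of images and a tuple of targets. The sketch below demonstrates the idea with plain Python stand-ins (strings for images, lists for Boxes) so that it runs without PyTorch; `collate_as_tuples` is a hypothetical name, and with real data the elements would be tensors.

```python
def collate_as_tuples(batch):
    """Transpose [(image, target), ...] into (images, targets) without stacking."""
    images, targets = zip(*batch)
    return images, targets


# Stand-in data: strings instead of image tensors, lists instead of Box tensors
batch = [
    ("image_a", {"boxes": [[10, 10, 50, 50]], "labels": [0]}),
    ("image_b", {"boxes": [[5, 5, 20, 20], [30, 30, 60, 60]], "labels": [1, 0]}),
    ("image_c", {"boxes": [], "labels": []}),
]

images, targets = collate_as_tuples(batch)
print(images)
print([len(t["boxes"]) for t in targets])  # number of objects per image
```

The return type is always the same (two tuples), which avoids the situation where `images` is sometimes a tensor and sometimes a list depending on the image sizes in the batch.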

📝 Exercises

Exercise 1: Converting Bounding Box formats (basic)

Convert the following Bounding Box into each format.

Original Box (x, y, w, h) = (100, 80, 200, 150)
Image size: 640×480

Find: the (x1, y1, x2, y2) format and the YOLO format

Answer:

(x1, y1, x2, y2) format:

x1 = x = 100
y1 = y = 80
x2 = x + w = 100 + 200 = 300
y2 = y + h = 80 + 150 = 230

Result: (100, 80, 300, 230)

YOLO format:

x_center = (x + w/2) / image_width
        = (100 + 100) / 640
        = 200 / 640 = 0.3125

y_center = (y + h/2) / image_height
        = (80 + 75) / 480
        = 155 / 480 ≈ 0.3229

width = w / image_width = 200 / 640 = 0.3125
height = h / image_height = 150 / 480 = 0.3125

Result: (0.3125, 0.3229, 0.3125, 0.3125)

Exercise 2: Computing IoU (intermediate)

Compute the IoU of the following two Bounding Boxes.

Box A: [50, 30, 200, 150] (x1, y1, x2, y2 format)
Box B: [100, 80, 250, 200]

Answer:

Step 1: Coordinates of the intersection
  x1_inter = max(50, 100) = 100
  y1_inter = max(30, 80) = 80
  x2_inter = min(200, 250) = 200
  y2_inter = min(150, 200) = 150

  Intersection: [100, 80, 200, 150]

Step 2: Intersection area
  width = 200 - 100 = 100
  height = 150 - 80 = 70
  intersection area = 100 × 70 = 7,000

Step 3: Area of each Box
  Box A = (200 - 50) × (150 - 30) = 150 × 120 = 18,000
  Box B = (250 - 100) × (200 - 80) = 150 × 120 = 18,000

Step 4: Union area
  = 18,000 + 18,000 - 7,000 = 29,000

Step 5: IoU
  = 7,000 / 29,000 ≈ 0.2414

IoU ≈ 0.24 (little overlap)

Exercise 3: Applying NMS (applied)

Give the result of applying NMS (IoU threshold = 0.5) to the following Bounding Boxes.

boxes = [
    [50, 30, 200, 150],    # Box 0
    [55, 35, 205, 155],    # Box 1
    [300, 100, 450, 250]   # Box 2
]
scores = [0.95, 0.92, 0.85]

Answer:

Step 1: Sort by score
  Order: Box 0 (0.95) → Box 1 (0.92) → Box 2 (0.85)

Step 2: Select Box 0
  Result list: [Box 0]

  IoU of Box 0 and Box 1:
    Box 0: [50, 30, 200, 150]
    Box 1: [55, 35, 205, 155]
    Intersection: [55, 35, 200, 150]
    Intersection area: 145 × 115 = 16,675
    Box 0 area: 150 × 120 = 18,000
    Box 1 area: 150 × 120 = 18,000
    Union: 18,000 + 18,000 - 16,675 = 19,325
    IoU = 16,675 / 19,325 ≈ 0.86

  IoU = 0.86 ≥ 0.5 → delete Box 1

  IoU of Box 0 and Box 2:
    No overlap → IoU = 0

  IoU = 0 < 0.5 → keep Box 2

Step 3: Select Box 2
  Result list: [Box 0, Box 2]

Final result:
  Box 0: [50, 30, 200, 150], score=0.95
  Box 2: [300, 100, 450, 250], score=0.85

Exercise 4: Extending the dataset class (comprehensive)

Add the following features to the YOLODataset class:

Skip images that have no annotations
Exclude objects whose Box area is too small (with a configurable minimum area)

Answer:

class YOLODatasetExtended(Dataset):
    """Extended version of YOLODataset."""

    def __init__(self, image_dir, label_dir, class_names,
                 transform=None, min_box_area=100, skip_empty=True):
        """
        Args:
            min_box_area: minimum Box area (square pixels)
            skip_empty: whether to skip images without annotations
        """
        self.image_dir = image_dir
        self.label_dir = label_dir
        self.class_names = class_names
        self.transform = transform
        self.min_box_area = min_box_area

        # Collect the image files
        all_images = sorted(f for f in os.listdir(image_dir)
                            if f.lower().endswith(('.jpg', '.jpeg', '.png')))

        # Keep only the images that have a non-empty label file
        if skip_empty:
            self.image_files = []
            for img_name in all_images:
                label_name = os.path.splitext(img_name)[0] + '.txt'
                label_path = os.path.join(label_dir, label_name)
                if os.path.exists(label_path):
                    with open(label_path, 'r') as f:
                        if f.read().strip():  # file is not empty
                            self.image_files.append(img_name)
        else:
            self.image_files = all_images

    def __getitem__(self, idx):
        # Load the image
        image_name = self.image_files[idx]
        image_path = os.path.join(self.image_dir, image_name)
        image = Image.open(image_path).convert('RGB')
        img_width, img_height = image.size

        # Load the labels
        label_name = os.path.splitext(image_name)[0] + '.txt'
        label_path = os.path.join(self.label_dir, label_name)

        boxes = []
        labels = []

        # With skip_empty=False the label file may be missing, so check first
        if os.path.exists(label_path):
            with open(label_path, 'r') as f:
                for line in f:
                    parts = line.strip().split()
                    if len(parts) == 5:
                        class_id = int(parts[0])
                        x_center = float(parts[1])
                        y_center = float(parts[2])
                        width = float(parts[3])
                        height = float(parts[4])

                        # Size in pixels
                        w_pixel = width * img_width
                        h_pixel = height * img_height

                        # Filter by minimum area
                        area = w_pixel * h_pixel
                        if area < self.min_box_area:
                            continue  # skip Boxes that are too small

                        x1 = (x_center - width / 2) * img_width
                        y1 = (y_center - height / 2) * img_height
                        x2 = (x_center + width / 2) * img_width
                        y2 = (y_center + height / 2) * img_height

                        boxes.append([x1, y1, x2, y2])
                        labels.append(class_id)

        # Convert to tensors
        boxes = torch.tensor(boxes, dtype=torch.float32) if boxes else torch.zeros((0, 4))
        labels = torch.tensor(labels, dtype=torch.int64) if labels else torch.zeros((0,), dtype=torch.int64)

        target = {'boxes': boxes, 'labels': labels, 'image_id': torch.tensor([idx])}

        if self.transform:
            image = self.transform(image)

        return image, target

    def __len__(self):
        return len(self.image_files)

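📝 STEP 9 summary

✅ What you learned in this step

1. What object detection is
・The task of identifying what is where
・Solves two sub-tasks at once: classification + localization

2. Bounding Box
・(x, y, w, h): top-left corner plus width and height
・(x1, y1, x2, y2): top-left and bottom-right corners
・YOLO format: normalized center coordinates and size

3. IoU (Intersection over Union)
・A measure of how much two Boxes overlap
・Used in evaluation, NMS, and loss functions
・IoU ≥ 0.5 counts as a correct detection (PASCAL VOC criterion)

4. NMS (Non-Maximum Suppression)
・An algorithm that removes duplicate detections
・Selects Boxes in score order and deletes Boxes at or above the IoU threshold

5. Datasets and tools
・PASCAL VOC (20 classes), MS COCO (80 classes)
・LabelImg, CVAT, Roboflow

💡 Key point

The basic concepts of object detection (Bounding Box, IoU, NMS) are shared by all object detection models. A solid grasp of them will make the specific models covered from STEP 10 onward, such as the R-CNN family and YOLO, much easier to understand.

The next step, STEP 10, covers the R-CNN family of models. You will learn how R-CNN, Fast R-CNN, and Faster R-CNN, the models that changed the history of object detection, work, and then implement them.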
