Let's Optimize the Threshold!

June 11, 2024

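The code below optimizes the threshold used to turn a LightGBM model's predicted probabilities into binary labels. It optimizes three metrics: the F1, F2, and F0.5 scores. Concretely, it computes each score over a grid of candidate thresholds and, for each metric, picks the threshold that gives the highest score.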
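
For reference, the three metrics are instances of the F-beta score. The formula below is the standard definition in terms of precision P and recall R (the same quantity scikit-learn's `fbeta_score` computes); the symbols P and R are notation introduced here, not taken from the original post.

```latex
F_\beta = (1 + \beta^2)\,\frac{P \cdot R}{\beta^2 P + R}
```

With β = 1 this is the F1 score. β = 2 weights recall more heavily than precision, and β = 0.5 weights precision more heavily than recall.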

Here is the Python code:

python
import numpy as np
import lightgbm as lgb
from sklearn.metrics import f1_score, fbeta_score
from sklearn.model_selection import train_test_split

# Create dummy data (replace this with your real data)
X, y = np.random.rand(1000, 10), np.random.randint(0, 2, 1000)

# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the LightGBM model
train_data = lgb.Dataset(X_train, label=y_train)
params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'boosting_type': 'gbdt'
}
model = lgb.train(params, train_data, num_boost_round=100)

# Get predicted probabilities for the test data
y_prob = model.predict(X_test)

# Candidate thresholds: 100 evenly spaced values from 0.01 to 0.99
thresholds = np.linspace(0.01, 0.99, 100)

# Lists that hold each score, one entry per threshold
f1_scores = []
f2_scores = []
f0_5_scores = []

# Compute the scores at each threshold
# (zero_division=0 scores a threshold as 0 when nothing is predicted positive, without a warning)
for threshold in thresholds:
    y_pred = (y_prob >= threshold).astype(int)
    f1_scores.append(f1_score(y_test, y_pred, zero_division=0))
    f2_scores.append(fbeta_score(y_test, y_pred, beta=2, zero_division=0))
    f0_5_scores.append(fbeta_score(y_test, y_pred, beta=0.5, zero_division=0))

# Find the threshold that gives the highest value of each score
optimal_threshold_f1 = thresholds[np.argmax(f1_scores)]
optimal_threshold_f2 = thresholds[np.argmax(f2_scores)]
optimal_threshold_f0_5 = thresholds[np.argmax(f0_5_scores)]

print(f'Optimal threshold for F1 score: {optimal_threshold_f1}')
print(f'Optimal threshold for F2 score: {optimal_threshold_f2}')
print(f'Optimal threshold for F0.5 score: {optimal_threshold_f0_5}')

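Explanation

Data preparation:
- Create dummy data and split it into training and test sets. When working with real data, prepare that dataset instead.
Training the LightGBM model:
- Train a LightGBM model on the training data.
Getting predicted probabilities:
- Use the trained model to get predicted probabilities for the test data.
Setting the threshold range and computing each score:
- Set up 100 evenly spaced thresholds from 0.01 to 0.99 (a step of roughly 0.0099) and compute the F1, F2, and F0.5 scores at each one.
Selecting the optimal thresholds:
- For each score, find the threshold at which it is highest.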

Running this code gives the optimal threshold for each score. Use your real data to evaluate the model's performance.
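
To make the effect of β concrete, here is a minimal sketch on a hand-made toy example that needs only NumPy. The helper names `fbeta_at` and `best_threshold`, the toy labels and probabilities, and the 0.05-step grid are all illustrative assumptions, not part of the code above. The F-beta score is computed directly from true-positive, false-positive, and false-negative counts.

```python
import numpy as np

def fbeta_at(y_true, y_prob, threshold, beta):
    """F-beta of the labels obtained by thresholding y_prob (hypothetical helper)."""
    y_pred = y_prob >= threshold
    positives = y_true == 1
    tp = np.sum(y_pred & positives)    # true positives
    fp = np.sum(y_pred & ~positives)   # false positives
    fn = np.sum(~y_pred & positives)   # false negatives
    denom = (1 + beta**2) * tp + beta**2 * fn + fp
    return 0.0 if denom == 0 else float((1 + beta**2) * tp / denom)

def best_threshold(y_true, y_prob, beta, thresholds):
    """First threshold on the grid with the highest F-beta (hypothetical helper)."""
    scores = [fbeta_at(y_true, y_prob, t, beta) for t in thresholds]
    return float(thresholds[int(np.argmax(scores))])

# Toy data (made up for illustration)
y_true = np.array([0, 0, 1, 0, 1, 0, 1, 1])
y_prob = np.array([0.12, 0.36, 0.37, 0.42, 0.61, 0.66, 0.72, 0.91])

# Grid of 0.05, 0.10, ..., 0.95
thresholds = np.round(np.linspace(0.05, 0.95, 19), 2)

for beta in (0.5, 1, 2):
    print(f"beta={beta}: best threshold = {best_threshold(y_true, y_prob, beta, thresholds):.2f}")
```

On this toy data the recall-heavy F2 score picks the lowest threshold of the three and the precision-heavy F0.5 score picks the highest, which is the general tendency to expect when moving β.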

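Below is Python code, broken into steps, for optimizing the binary classification threshold based on LightGBM's predicted probabilities. The threshold is tuned using the F1, F2, and F0.5 scores as metrics.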

Importing libraries

First, import the required libraries.

Python
import lightgbm as lgb
import numpy as np
# scikit-learn has no f2_score; F2 and F0.5 are computed with fbeta_score(beta=2) and fbeta_score(beta=0.5)
from sklearn.metrics import f1_score, fbeta_score

Preparing the data

Next, prepare the trained model, the ground-truth labels, and the predicted probabilities.

Python
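# Trained model, loaded from a saved model file
model = lgb.Booster(model_file='model.txt')

# Ground-truth labels (placeholder values; they must line up row for row with X_test)
y_true = np.array([0, 1, 0, 1, 1, 0])

# Predicted probabilities of the positive class.
# lgb.Booster has no predict_proba; with a binary objective, predict() returns the probabilities directly.
# X_test is assumed to be the feature matrix for the rows labeled in y_true.
y_pred_proba = model.predict(X_test)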

A function to compute the F1, F2, and F0.5 scores

The following function computes the F1, F2, and F0.5 scores at a given threshold.

Python
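from sklearn.metrics import f1_score, fbeta_score  # fbeta_score covers both F2 and F0.5

def calculate_scores(y_true, y_pred_proba, threshold):
    # Binarize the probabilities at the given threshold
    y_pred = (y_pred_proba >= threshold).astype(int)
    f1 = f1_score(y_true, y_pred)
    f2 = fbeta_score(y_true, y_pred, beta=2)     # recall-weighted
    f05 = fbeta_score(y_true, y_pred, beta=0.5)  # precision-weighted
    return f1, f2, f05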
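
As an alternative to scoring one threshold at a time, the sketch below derives the F-beta score at every distinct candidate threshold in a single pass from scikit-learn's `precision_recall_curve`. The helper name `best_threshold_from_pr_curve` and the toy arrays are assumptions made for illustration; this is one possible approach, not the method used elsewhere in this post.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_threshold_from_pr_curve(y_true, y_pred_proba, beta=1.0):
    """Return (threshold, F-beta) maximizing F-beta over the PR-curve thresholds (hypothetical helper)."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_pred_proba)
    # The final (precision=1, recall=0) point has no associated threshold, so drop it
    precision, recall = precision[:-1], recall[:-1]
    denom = beta**2 * precision + recall
    # Where precision and recall are both 0, define F-beta as 0 instead of dividing by zero
    fbeta = np.divide((1 + beta**2) * precision * recall, denom,
                      out=np.zeros_like(denom), where=denom > 0)
    best = int(np.argmax(fbeta))
    return float(thresholds[best]), float(fbeta[best])

# Toy data (made up for illustration)
y_true = np.array([0, 0, 1, 0, 1, 0, 1, 1])
y_pred_proba = np.array([0.12, 0.36, 0.37, 0.42, 0.61, 0.66, 0.72, 0.91])

for beta in (0.5, 1.0, 2.0):
    threshold, score = best_threshold_from_pr_curve(y_true, y_pred_proba, beta)
    print(f"beta={beta}: threshold={threshold:.2f}, F-beta={score:.3f}")
```

Because the candidate thresholds here are the predicted probabilities themselves, this search cannot miss a cut point that falls between two values of a fixed grid.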

Threshold search and optimization

The following code searches for the thresholds that maximize the F1, F2, and F0.5 scores.

Python
thresholds = np.arange(0.01, 1.0, 0.01)  # candidate thresholds from 0.01 to 0.99 in steps of 0.01

# Best (score, threshold) found so far, tracked separately for each metric
names = ('F1', 'F2', 'F0.5')
best = {name: (0.0, None) for name in names}

for threshold in thresholds:
    scores = calculate_scores(y_true, y_pred_proba, threshold)
    for name, score in zip(names, scores):
        if score > best[name][0]:
            best[name] = (score, threshold)

for name in names:
    score, threshold = best[name]
    print(f"Best threshold for {name}: {threshold} (score: {score})")
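
One caveat when reporting results: a threshold tuned on the same data used to report the score tends to look better than it will on new data. The sketch below picks the threshold on one split and reports the score on another. Everything in it is an assumption made for illustration: the synthetic probabilities stand in for real model output, and `pick_threshold` is a hypothetical helper.

```python
import numpy as np
from sklearn.metrics import fbeta_score

rng = np.random.default_rng(0)

# Synthetic stand-in for model output: positives tend to receive higher scores
y_all = rng.integers(0, 2, size=2000)
p_all = np.clip(0.35 + 0.3 * y_all + rng.normal(0, 0.2, size=2000), 0, 1)

# The first half plays the role of a validation set, the second half the test set
y_val, p_val = y_all[:1000], p_all[:1000]
y_test, p_test = y_all[1000:], p_all[1000:]

def pick_threshold(y_true, y_prob, beta, thresholds):
    """Grid threshold with the highest F-beta on the given data (hypothetical helper)."""
    scores = [fbeta_score(y_true, (y_prob >= t).astype(int), beta=beta, zero_division=0)
              for t in thresholds]
    return float(thresholds[int(np.argmax(scores))])

thresholds = np.linspace(0.01, 0.99, 99)

# Choose the threshold on the validation split only
chosen = pick_threshold(y_val, p_val, beta=1, thresholds=thresholds)

# Report the score on the held-out test split using that fixed threshold
test_f1 = fbeta_score(y_test, (p_test >= chosen).astype(int), beta=1, zero_division=0)
print(f"Threshold chosen on validation: {chosen:.2f}, F1 on test: {test_f1:.3f}")
```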

あしとみ独習メモ
