Using Class Weights to Address Imbalanced Data (Part 2)

import pandas as pd
from sklearn.metrics import recall_score

def assess_test_set_performance(model):
    # create predictions
    yhat = model.predict(X_test[feats])
    # evaluate performance using counts
    model_recall = recall_score(y_test, yhat) * 100
    print("Classification Test Performance:")
    print(f"Recall: {model_recall:.2f}%")

    # calculate financial performance using the dollar amount associated with each transaction
    performance_df = pd.DataFrame({'Amount': X_test['Amount'], 'Actual': y_test, 'Pred': yhat})
    performance_df['fraud_amount'] = performance_df['Amount'] * performance_df['Actual']
    performance_df['fraud_prevented'] = performance_df['fraud_amount'] * performance_df['Pred']
    performance_df['fraud_realized'] = performance_df['fraud_amount'] - performance_df['fraud_prevented']

    # share of the fraudulent dollar amount that the model caught
    financial_recall = (performance_df['fraud_prevented'].sum() / (performance_df['fraud_prevented'].sum() + performance_df['fraud_realized'].sum())) * 100

    print()
    print("Financial Test Performance:")
    print(f"Financial Recall: {financial_recall:.2f}%")
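To make the difference between count-based recall and financial recall concrete, here is a minimal sketch on a made-up table of five transactions. The `toy` DataFrame and its values are hypothetical and are not taken from the article's dataset; only the column names follow the evaluation function above.

```python
import pandas as pd

# Hypothetical toy data: five transactions, three of them fraudulent.
toy = pd.DataFrame({
    "Amount": [100.0, 50.0, 200.0, 30.0, 20.0],
    "Actual": [1, 0, 1, 1, 0],   # 1 = fraud, 0 = legitimate
    "Pred":   [1, 0, 0, 1, 1],   # hypothetical model predictions
})

# Same column logic as the evaluation function
toy["fraud_amount"] = toy["Amount"] * toy["Actual"]
toy["fraud_prevented"] = toy["fraud_amount"] * toy["Pred"]
toy["fraud_realized"] = toy["fraud_amount"] - toy["fraud_prevented"]

# Recall by count: caught fraud cases / all fraud cases
count_recall = (toy["Actual"] * toy["Pred"]).sum() / toy["Actual"].sum() * 100

# Recall by dollar amount: prevented fraud amount / total fraud amount
financial_recall = (
    toy["fraud_prevented"].sum()
    / (toy["fraud_prevented"].sum() + toy["fraud_realized"].sum())
    * 100
)

print(f"Recall: {count_recall:.2f}%")
print(f"Financial Recall: {financial_recall:.2f}%")
```

In this toy case the model catches two of the three fraud cases but misses the largest one, so the financial recall comes out lower than the count-based recall. The two metrics can diverge in either direction depending on which transactions are missed.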
Now that we have settled on an evaluation function, let's train the baseline model and evaluate its performance:
from sklearn.linear_model import LogisticRegression

baseline_model = LogisticRegression()
baseline_model.fit(X_train[feats], y_train)

assess_test_set_performance(baseline_model)
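Our baseline model achieves 59% recall and 61% financial recall, which is actually better than expected. That is still not good enough for real-world use, though, so we can do better.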
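Improving the Model with Class Weights
The baseline model treats both classes as equally important. Because the model does not know that we care more about the fraud cases, we need to redefine our loss function. The sklearn API provides a way to tell the model that we prefer correctly identifying fraud: the class_weight parameter.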
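As a rough illustration of what "redefining the loss function" means, the sketch below computes a log loss in which each sample is scaled by the weight of its true class. The function `weighted_log_loss` is a hypothetical helper written for this example; it is not sklearn's internal objective, which also includes regularization and may scale terms differently.

```python
import numpy as np

def weighted_log_loss(y_true, p_pred, class_weight):
    """Mean log loss where each sample is scaled by the weight of its true class.

    Hypothetical helper for illustration only, not sklearn's actual objective.
    """
    y_true = np.asarray(y_true)
    # Clip probabilities to avoid log(0)
    p_pred = np.clip(np.asarray(p_pred, dtype=float), 1e-15, 1 - 1e-15)
    weights = np.where(y_true == 1, class_weight[1], class_weight[0])
    per_sample = -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
    return float(np.mean(weights * per_sample))

# Three legitimate transactions and one fraud; the model assigns a low
# fraud probability to all of them, so it misses the fraud case.
y = [0, 0, 0, 1]
p = [0.1, 0.1, 0.1, 0.1]

equal = weighted_log_loss(y, p, {0: 1, 1: 1})
weighted = weighted_log_loss(y, p, {0: 1, 1: 10})

print(f"Loss with equal weights:     {equal:.4f}")
print(f"Loss with fraud weighted 10x: {weighted:.4f}")
```

With the heavier weight on class 1, the same missed fraud case contributes far more to the loss, which pushes an optimizer toward parameters that catch fraud.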
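When using class_weight, the model takes a dictionary with one key per class, where each value is the weight of that class. We have two classes, and for the sake of illustration let's assume that a fraud case is 10 times as important as a non-fraud case. In that situation, we can pass a dictionary to class_weight like this: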
fraud_class_weights = {0: 1, 1: 10}
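A minimal sketch of how such a dictionary might be passed to the model follows. It uses synthetic data from `make_classification` as a stand-in for the article's fraud dataset, so the variable names `X_tr`, `X_te`, `y_tr`, `y_te` and all resulting numbers are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced data (about 2% positives); a stand-in only.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.98, 0.02], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Class 1 (the rare class) counts 10 times as much as class 0
fraud_class_weights = {0: 1, 1: 10}

unweighted = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(
    class_weight=fraud_class_weights, max_iter=1000
).fit(X_tr, y_tr)

recall_unweighted = recall_score(y_te, unweighted.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))

print(f"Recall without class weights: {recall_unweighted:.2f}")
print(f"Recall with class weights:    {recall_weighted:.2f}")
```

Up-weighting the rare class typically raises its recall, usually at the cost of more false positives, so the right weight depends on how expensive each kind of error is.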
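The sklearn API actually makes this even easier. The class_weight parameter can also take the value 'balanced'. In that case sklearn builds the dictionary for us using the formula below, so that each class's weight is inversely proportional to its frequency in the data: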
len(X_train) / (2 * numpy.bincount(y_train))
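Applying the formula above to our data, we find that the positive case is effectively treated as 575 times as important as the negative case.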
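The formula can be checked on a small example. The sketch below applies it to a hypothetical label array (`y_toy`, not the article's data) and compares the result with sklearn's `compute_class_weight` helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 990 negatives and 10 positives
y_toy = np.array([0] * 990 + [1] * 10)

# The 'balanced' formula: n_samples / (n_classes * count of each class)
manual = len(y_toy) / (2 * np.bincount(y_toy))

# sklearn's helper, which should produce the same weights
sk = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_toy)

# How many times heavier the positive class is than the negative class
ratio = manual[1] / manual[0]

print("Manual weights: ", manual)
print("sklearn weights:", sk)
print("Positive/negative weight ratio:", ratio)
```

Note that the ratio of the two weights equals the ratio of the class counts (990 / 10 in this toy case), so the rarer the positive class, the larger its relative weight.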
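When we plug this setting into the logistic regression model, it will focus more on classifying our fraudulent transactions correctly. That is exactly the result we want!
Let's set it up and look at the performance: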
lr_class_weights = LogisticRegression(class_weight='balanced')
lr_class_weights.fit(X_train[feats], y_train)

assess_test_set_performance(lr_class_weights)
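This simple change raises our recall to 90% and our financial recall to 93%! Because the model is now more sensitive to fraud, it correctly identifies more fraud cases than the baseline model does.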
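Summary
Using class_weight in your model will certainly help you get more relevant performance out of it. It is also easy to set up, so it is at least worth a try. The approach described in this article is a very simple way to address class imbalance, and there are many other methods in this area that could be discussed, but setting class weights is a very good place to start.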
https://avoid.overfit.cn/post/13e8cb84f1e1480eb62d9f029647ed3a
Author: Greg J