Data Preprocessing: Sample Selection and Cross-Validation

Author: violalal_134 | Source: Internet | 2023-09-23 23:38

1. Sample selection by undersampling

import numpy as np
import pandas as pd

# Undersample the data.
# `data` is assumed to be an already loaded pandas DataFrame with a binary 'Class' column (1 = fraud).
X = data.loc[:, data.columns != 'Class']
y = data.loc[:, data.columns == 'Class']

# Number of data points in the minority class
number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)

# Picking the indices of the normal classes
normal_indices = data[data.Class == 0].index

# Out of the indices we picked, randomly select "x" number (number_records_fraud)
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace=False)
random_normal_indices = np.array(random_normal_indices)

# Appending the 2 indices
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])

# Under sample dataset (the indices are index labels, so select with .loc)
under_sample_data = data.loc[under_sample_indices, :]

X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']

# Showing ratio
print("Percentage of normal transactions: ", len(under_sample_data[under_sample_data.Class == 0]) / len(under_sample_data))
print("Percentage of fraud transactions: ", len(under_sample_data[under_sample_data.Class == 1]) / len(under_sample_data))
print("Total number of transactions in resampled data: ", len(under_sample_data))

# Split both the full data and the undersampled data into training and test sets
# (train_test_split lives in sklearn.model_selection; sklearn.cross_validation has been removed)
from sklearn.model_selection import train_test_split

# Whole dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

print("Number transactions train dataset: ", len(X_train))
print("Number transactions test dataset: ", len(X_test))
print("Total number of transactions: ", len(X_train) + len(X_test))

# Undersampled dataset
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(X_undersample, y_undersample, test_size=0.3, random_state=0)
print("")
print("Number transactions train dataset: ", len(X_train_undersample))
print("Number transactions test dataset: ", len(X_test_undersample))
print("Total number of transactions: ", len(X_train_undersample) + len(X_test_undersample))
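The undersampling step can also be tried on its own, without the original dataset. The sketch below is only an illustration: the `toy` frame, its column values, and the `undersample` helper are made up for this example, and it uses `DataFrame.sample` instead of `np.random.choice`, which should give an equivalent 50/50 split.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the article's `data` frame:
# 1000 rows, roughly 5% of them labelled as the minority class (Class == 1).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "V1": rng.normal(size=1000),
    "Amount": rng.exponential(scale=50.0, size=1000),
    "Class": (rng.random(1000) < 0.05).astype(int),
})


def undersample(df, label="Class", seed=0):
    """Keep every minority row plus an equally sized random sample of majority rows.

    Assumes the minority class is labelled 1 and is smaller than class 0.
    """
    minority = df[df[label] == 1]
    majority = df[df[label] == 0].sample(n=len(minority), random_state=seed)
    # Shuffle so the two classes do not sit in two contiguous runs.
    return pd.concat([minority, majority]).sample(frac=1, random_state=seed)


balanced = undersample(toy)
print(balanced["Class"].value_counts())
```

Both classes should show the same count. Note that undersampling discards most of the majority rows, which is why the code above keeps a split of the whole dataset alongside the undersampled one.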

Selecting the best parameter with cross-validation:

# Recall = TP / (TP + FN)
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import confusion_matrix, recall_score, classification_report


def printing_Kfold_scores(x_train_data, y_train_data):
    fold = KFold(n_splits=5, shuffle=False)

    # Different C parameters
    c_param_range = [0.01, 0.1, 1, 10, 100]

    results_table = pd.DataFrame(index=range(len(c_param_range)), columns=['C_parameter', 'Mean recall score'])
    results_table['C_parameter'] = c_param_range

    # Each split gives 2 arrays: train_indices = indices[0], test_indices = indices[1]
    j = 0
    for c_param in c_param_range:
        print('-------------------------------------------')
        print('C parameter: ', c_param)
        print('-------------------------------------------')
        print('')

        recall_accs = []
        for iteration, indices in enumerate(fold.split(x_train_data), start=1):
            # Call the logistic regression model with a certain C parameter
            # (the L1 penalty needs a solver that supports it, such as liblinear)
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')

            # Use the training data to fit the model. In this case, we use the portion of the fold to train the model
            # with indices[0]. We then predict on the portion assigned as the 'test cross validation' with indices[1]
            lr.fit(x_train_data.iloc[indices[0], :], y_train_data.iloc[indices[0], :].values.ravel())

            # Predict values using the test indices in the training data
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1], :])

            # Calculate the recall score and append it to a list for recall scores representing the current c_parameter
            recall_acc = recall_score(y_train_data.iloc[indices[1], :].values, y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration ', iteration, ': recall score = ', recall_acc)

        # The mean value of those recall scores is the metric we want to save and get hold of.
        results_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')

    # The column starts out with object dtype; convert it so idxmax() works
    results_table['Mean recall score'] = results_table['Mean recall score'].astype('float64')
    best_c = results_table.loc[results_table['Mean recall score'].idxmax()]['C_parameter']

    # Finally, we can check which C parameter is the best amongst the chosen.
    print('*********************************************************************************')
    print('Best model to choose from cross validation is with C parameter = ', best_c)
    print('*********************************************************************************')

    return best_c


best_c = printing_Kfold_scores(X_train_undersample, y_train_undersample)
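The code above imports `cross_val_score` but never calls it. As a rough sketch of how the same search over C could be written with it, the example below runs on synthetic data; `X_toy`, `y_toy`, and `mean_recalls` are hypothetical names used only here, and `solver="liblinear"` is chosen on the assumption that a solver supporting the L1 penalty is wanted.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical balanced toy data in place of X_train_undersample / y_train_undersample.
X_toy, y_toy = make_classification(n_samples=400, n_features=10, random_state=0)

c_param_range = [0.01, 0.1, 1, 10, 100]
folds = KFold(n_splits=5, shuffle=False)

mean_recalls = {}
for c_param in c_param_range:
    model = LogisticRegression(C=c_param, penalty="l1", solver="liblinear")
    # scoring="recall" makes each fold report TP / (TP + FN) for the positive class.
    scores = cross_val_score(model, X_toy, y_toy, cv=folds, scoring="recall")
    mean_recalls[c_param] = scores.mean()

# Pick the C value with the highest mean recall across the 5 folds.
best_c_toy = max(mean_recalls, key=mean_recalls.get)
print(mean_recalls)
print("Best C on the toy data:", best_c_toy)
```

This loses the per-fold printout of the hand-written loop but keeps the same selection rule: the C with the highest mean recall wins.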

Plotting the confusion matrix

import itertools
import numpy as np
import matplotlib.pyplot as plt


def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
    """This function prints and plots the confusion matrix."""
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    # Use white text on dark cells and black text on light cells
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')


# Refit with the best C found by cross-validation (liblinear supports the L1 penalty)
lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample, y_pred_undersample)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()
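The recall printed above is read straight from the confusion matrix as `cnf_matrix[1,1] / (cnf_matrix[1,0] + cnf_matrix[1,1])`. A small check with made-up labels (the `y_true` and `y_pred` arrays are hypothetical) shows how the matrix cells map to TN, FP, FN, and TP, and that this ratio agrees with `recall_score`:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 1 marks the positive (fraud) class.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

# Rows are true labels and columns are predicted labels, so with labels [0, 1]
# the layout is [[TN, FP], [FN, TP]] and ravel() returns them in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall_from_matrix = tp / (tp + fn)

print("TN, FP, FN, TP:", tn, fp, fn, tp)   # 4 2 1 3
print("Recall from matrix:", recall_from_matrix)          # 0.75
print("recall_score:      ", recall_score(y_true, y_pred))  # 0.75
```

Three of the four positives are caught, so the recall is 3 / (3 + 1) = 0.75 regardless of the two false positives.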

Recall at different thresholds

lr = LogisticRegression(C=0.01, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred_undersample_proba = lr.predict_proba(X_test_undersample)

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

plt.figure(figsize=(10, 10))

j = 1
for i in thresholds:
    # Predict the positive class whenever its probability exceeds the threshold
    y_test_predictions_high_recall = y_pred_undersample_proba[:, 1] > i

    plt.subplot(3, 3, j)
    j += 1

    # Compute confusion matrix
    cnf_matrix = confusion_matrix(y_test_undersample, y_test_predictions_high_recall)
    np.set_printoptions(precision=2)

    print("Recall metric in the testing dataset: ", cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))

    # Plot non-normalized confusion matrix
    class_names = [0, 1]
    plot_confusion_matrix(cnf_matrix, classes=class_names, title='Threshold > %s' % i)

plt.show()
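The effect of the threshold can also be read as numbers instead of nine plots. The sketch below is an illustration on synthetic data, not the article's results: `X_toy`, `y_toy`, and the choice of C=1.0 are assumptions made for this example. It prints recall next to precision, since raising recall by lowering the threshold usually comes at the cost of more false positives.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Hypothetical toy data in place of the undersampled transactions.
X_toy, y_toy = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

model = LogisticRegression(C=1.0, penalty="l1", solver="liblinear").fit(X_tr, y_tr)
proba_pos = model.predict_proba(X_te)[:, 1]  # probability of the positive class

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
recalls = []
for t in thresholds:
    # Same rule as the plotting loop: positive when the probability exceeds t.
    y_hat = (proba_pos > t).astype(int)
    recall = recall_score(y_te, y_hat)
    precision = precision_score(y_te, y_hat, zero_division=0)
    recalls.append(recall)
    print(f"threshold > {t:.1f}: recall = {recall:.3f}, precision = {precision:.3f}")
```

Recall can only stay flat or fall as the threshold rises, because a higher threshold never turns a predicted negative into a positive.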

Reposted from: https://www.cnblogs.com/itbuyixiaogong/p/9850128.html
