Logistic Regression: A Classic Example

Author: 帅帅考拉_955 | Source: Internet | 2024-10-21 19:30


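The complete collection of machine learning algorithms is available at fenghaootong-github.

House Price Prediction

Dataset description

The data has 81 columns in total, including: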

SalePrice - the property’s sale price in dollars. This is the target variable that you’re trying to predict.
MSSubClass: The building class
MSZoning: The general zoning classification
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access
Alley: Type of alley access
LotShape: General shape of property
LandContour: Flatness of the property
Utilities: Type of utilities available
LotConfig: Lot configuration
LandSlope: Slope of property
….

Import the required modules

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import math as mat
from scipy import stats
from scipy.stats import norm
from sklearn import preprocessing
import statsmodels.api as sm
from patsy import dmatrices
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
import sklearn.linear_model as LinReg
import sklearn.metrics as metrics

Load the data

# loading the data
data_train = pd.read_csv('../DATA/SalePrice_train.csv')
data_test = pd.read_csv('../DATA/SalePrice_test.csv')

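The data has 81 columns in total. To keep the walkthrough simple, only a few features are used. Seven candidates are listed here, although the code below keeps six of them (1stFlrSF is left out, as the (1459, 6) shape shows):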
OverallQual
GrLivArea
GarageCars
TotalBsmtSF
1stFlrSF
FullBath
YearBuilt
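These features are chosen because they correlate relatively strongly with the sale price of the house.

For details on how the features are selected, see the data cleaning notes.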
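The selection rule mentioned here (keep features whose correlation with SalePrice exceeds 0.5, as the code comment further down states) can be sketched as follows. This is only an illustration on synthetic data: the generated values, the `Unrelated` column, and the coefficients are assumptions, not the real housing data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data (hypothetical values, not the real CSV).
rng = np.random.default_rng(0)
n = 500
quality = rng.integers(1, 11, size=n)          # 1..10, like an overall quality score
area = rng.normal(1500, 400, size=n)           # living area in square feet
unrelated = rng.normal(0, 1, size=n)           # a column with no link to the price
price = 15000 * quality + 120 * area + rng.normal(0, 15000, size=n)

df = pd.DataFrame({
    "OverallQual": quality,
    "GrLivArea": area,
    "Unrelated": unrelated,
    "SalePrice": price,
})

# Pearson correlation of every column with the target, target itself excluded.
corr_with_target = df.corr()["SalePrice"].drop("SalePrice")

# Keep only the columns whose correlation exceeds the 0.5 threshold.
selected = corr_with_target[corr_with_target > 0.5].index.tolist()

print(corr_with_target.sort_values(ascending=False))
print(selected)
```

On the real data the same two lines (`df.corr()` and the threshold filter) would be applied to `data_train`; whether that reproduces exactly the list above is not confirmed by this post.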

Data preprocessing

data_train.shape

(1460, 81)

vars = ['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
Y = data_train[['SalePrice']]     # dim (1460, 1)
ID_train = data_train[['Id']]     # dim (1460, 1)
ID_test = data_test[['Id']]       # dim (1459, 1)
# extract only the relevant features with cross correlation > 0.5 with respect to SalePrice
X_matrix = data_train[vars]
X_matrix.shape                    # dim (1460, 6)
X_test = data_test[vars].copy()   # copy so that the later fillna assignments act on an independent frame
X_test.shape                      # dim (1459, 6)

(1459, 6)

Check for missing data

# check for missing data
total = X_matrix.isnull().sum().sort_values(ascending=False)
# note: count() excludes NaN, so this ratio is missing / non-missing rather than missing / all rows
percent = (X_matrix.isnull().sum() / X_matrix.count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
# no missing data in this training set

	Total	Percent
YearBuilt	0	0.0
FullBath	0	0.0
TotalBsmtSF	0	0.0
GarageCars	0	0.0
GrLivArea	0	0.0
OverallQual	0	0.0

total = X_test.isnull().sum().sort_values(ascending=False)
percent = (X_test.isnull().sum() / X_test.count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
# missing data in this test set

	Total	Percent
TotalBsmtSF	1	0.000686
GarageCars	1	0.000686
YearBuilt	0	0.000000
FullBath	0	0.000000
GrLivArea	0	0.000000
OverallQual	0	0.000000

#help(mat.ceil)  # ceil rounds up to the nearest integer

Replace missing values with the mean

# replace missing values with the column mean (rounded up for the integer-valued GarageCars)
X_test['TotalBsmtSF'] = X_test['TotalBsmtSF'].fillna(X_test['TotalBsmtSF'].mean())
X_test['GarageCars'] = X_test['GarageCars'].fillna(mat.ceil(X_test['GarageCars'].mean()))

total = X_test.isnull().sum().sort_values(ascending=False)
percent = (X_test.isnull().sum() / X_test.count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)

	Total	Percent
YearBuilt	0	0.0
FullBath	0	0.0
TotalBsmtSF	0	0.0
GarageCars	0	0.0
GrLivArea	0	0.0
OverallQual	0	0.0

X_test.shape

(1459, 6)
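The code above fills the gaps with the mean of the test set itself. A common variant, shown here only as a sketch and not as the approach used in this post, takes the fill value from the training set so that the test data never influences preprocessing. The toy frames and their values are assumptions standing in for `X_matrix` and `X_test`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for X_matrix / X_test (hypothetical values).
train = pd.DataFrame({"TotalBsmtSF": [800.0, 1000.0, 1200.0],
                      "GarageCars": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"TotalBsmtSF": [900.0, np.nan],
                     "GarageCars": [np.nan, 2.0]})

# Learn the column means on the training data only.
imputer = SimpleImputer(strategy="mean")
imputer.fit(train)

# Fill the test-set gaps with those training means.
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)
print(test_filled)
```

With only one or two missing cells, as in this dataset, either choice gives nearly the same numbers; the difference matters more when many values are missing.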

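The next step is feature scaling with the preprocessing module. scikit-learn also provides the utility class StandardScaler, which computes the mean and standard deviation on the training set so that the same transformation can later be reapplied to the test set. The code here uses MaxAbsScaler instead, which divides each feature by its maximum absolute value.

max_abs_scaler = preprocessing.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_matrix)
print(X_train_maxabs)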

[[ 0.7 0.30308401 0.5 0.1400982 0.66666667 0.99651741]
 [ 0.6 0.22367955 0.5 0.20654664 0.66666667 0.98308458]
 [ 0.7 0.31655441 0.5 0.15057283 0.66666667 0.99552239]
 ...,
 [ 0.7 0.41474654 0.25 0.18854337 0.66666667 0.96567164]
 [ 0.5 0.191067 0.25 0.17643208 0.33333333 0.97014925]
 [ 0.5 0.22261609 0.25 0.20556465 0.33333333 0.97761194]]

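# note: fit_transform refits the scaler on the test set, so the test features are scaled by the test-set maxima;
# max_abs_scaler.transform(X_test) would reuse the training-set maxima instead
X_test_maxabs = max_abs_scaler.fit_transform(X_test)
print(X_test_maxabs)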

[[ 0.5 0.17585868 0.2 0.17311089 0.25 0.97562189]
 [ 0.6 0.26084396 0.2 0.26084396 0.25 0.97412935]
 [ 0.5 0.31972522 0.4 0.18213935 0.5 0.99353234]
 ...,
 [ 0.5 0.24023553 0.4 0.24023553 0.25 0.97512438]
 [ 0.5 0.19038273 0. 0.17899902 0.25 0.99104478]
 [ 0.7 0.39254171 0.6 0.19548577 0.5 0.99154229]]
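The scaling description above speaks of learning the statistics on the training set and reapplying the same transformation to the test set. A minimal sketch of that pattern with MaxAbsScaler follows; the two small arrays are made-up numbers, not the housing data.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Toy feature matrices (hypothetical values): columns are a quality score and an area.
X_train_toy = np.array([[7.0, 1710.0],
                        [6.0, 1262.0],
                        [10.0, 2000.0]])
X_test_toy = np.array([[5.0, 896.0],
                       [8.0, 2500.0]])

scaler = MaxAbsScaler()
X_train_scaled = scaler.fit_transform(X_train_toy)  # learns max |x| per column from the training set
X_test_scaled = scaler.transform(X_test_toy)        # reuses those training maxima, no refit

print(scaler.max_abs_)
print(X_test_scaled)
```

Because the test set is divided by the training maxima, a test value larger than anything seen in training ends up above 1 (the 2500 in the second column here). That is expected and keeps both sets on the same scale as the fitted model.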

Model training

lr = LinReg.LinearRegression().fit(X_train_maxabs, Y)

Model prediction

Y_pred_train = lr.predict(X_train_maxabs)
print("Lin Reg performance evaluation on Y_pred_train")
print("R-squared =", metrics.r2_score(Y, Y_pred_train))

Lin Reg performance evaluation on Y_pred_train
R-squared = 0.768647335422

Y_pred_test = lr.predict(X_test_maxabs)
print("Lin Reg performance evaluation on X_test")
# left commented out: Y holds the 1460 training targets, while Y_pred_test has 1459 predictions
# print("R-squared =", metrics.r2_score(Y, Y_pred_test))
print("Coefficients =", lr.coef_)

Lin Reg performance evaluation on X_test
Coefficients = [[ 205199.68775757 305095.8264889 58585.26325362 178302.68126933 -16511.92112734 676458.9666186 ]]
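The R-squared value reported above is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand and compares it with `r2_score`; the data is randomly generated for illustration and has nothing to do with the housing set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic regression problem (hypothetical coefficients and noise level).
rng = np.random.default_rng(1)
X_toy = rng.uniform(0, 1, size=(100, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(0, 0.1, size=100)

model = LinearRegression().fit(X_toy, y_toy)
y_hat = model.predict(X_toy)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_toy - y_hat) ** 2)          # unexplained variation
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)   # total variation around the mean
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y_toy, y_hat))
```

Note that this, like the figure above, is measured on the training data, so it says how well the model fits what it has seen rather than how well it generalizes.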

Logistic Regression

Import modules

# import modules
import pandas as pd
import numpy as np

Data preprocessing

# build the list of column names
column_names = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class']
# read the dataset with pandas.read_csv (here from a local file path)
data = pd.read_csv('DATA/data.csv', names=column_names)
# replace '?' with the standard missing-value marker
data = data.replace(to_replace='?', value=np.nan)
# drop rows with missing values (a row is dropped if any column is missing)
data = data.dropna(how='any')
# check the number of rows and columns
data.shape

(683, 11)
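One detail worth checking after this cleaning step: a column that contained '?' was read as text, and it stays text after the rows are dropped. The sketch below shows the same replace-and-drop pattern on a three-row toy frame (hypothetical values) and adds an explicit numeric conversion, which is an addition of this sketch rather than something the post does.

```python
import numpy as np
import pandas as pd

# Toy frame with one '?' placeholder (hypothetical values).
raw = pd.DataFrame({
    "Clump Thickness": [5, 3, 8],
    "Bare Nuclei": ["1", "?", "10"],
    "Class": [2, 2, 4],
})

# Same pattern as above: mark '?' as missing, then drop incomplete rows.
clean = raw.replace(to_replace="?", value=np.nan).dropna(how="any")

# The column still holds strings at this point; convert it explicitly.
clean["Bare Nuclei"] = pd.to_numeric(clean["Bare Nuclei"])

print(clean.dtypes)
print(clean.shape)
```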

data.head(10)

	Sample code number	Clump Thickness	Uniformity of Cell Size	Uniformity of Cell Shape	Marginal Adhesion	Single Epithelial Cell Size	Bare Nuclei	Bland Chromatin	Normal Nucleoli	Mitoses	Class
0	1000025	5	1	1	1	2	1	3	1	1	2
1	1002945	5	4	4	5	7	10	3	2	1	2
2	1015425	3	1	1	1	2	2	3	1	1	2
3	1016277	6	8	8	1	3	4	3	7	1	2
4	1017023	4	1	1	3	2	1	3	1	1	2
5	1017122	8	10	10	8	7	10	9	7	1	4
6	1018099	1	1	1	1	2	10	3	1	1	2
7	1018561	2	1	2	1	2	1	3	1	1	2
8	1033078	2	1	1	1	2	1	1	1	5	2
9	1033078	4	2	1	1	2	1	2	1	1	2

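The original data does not come with a separate test set for evaluating model performance, so the labeled data is split: 25% is held out as the test set and the rest is used for training.

# use train_test_split from sklearn.model_selection (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
# randomly sample 25% of the data for testing; the remaining 75% forms the training set
X_train, X_test, y_train, y_test = train_test_split(data[column_names[1:10]], data[column_names[10]], test_size=0.25, random_state=33)
# check the number and class distribution of the training samples
y_train.value_counts()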

2    344
4    168
Name: Class, dtype: int64

# check the number and class distribution of the test samples
y_test.value_counts()

2    100
4     71
Name: Class, dtype: int64

Build the model and predict

# import StandardScaler from sklearn.preprocessing
from sklearn.preprocessing import StandardScaler
# import LogisticRegression from sklearn.linear_model
from sklearn.linear_model import LogisticRegression
# import SGDClassifier (stochastic gradient descent classifier) from sklearn.linear_model
from sklearn.linear_model import SGDClassifier

# standardize the features: fit on the training set, then apply the same transformation to the test set
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

lr = LogisticRegression()
# train the logistic regression parameters with fit
lr.fit(X_train, y_train)
lr_y_predict = lr.predict(X_test)
# the SGDClassifier imported above would be trained through its own fit function (not shown in this post)

lr_y_predict

array([2, 2, 4, 4, 2, 2, 2, 4, 2, 2, 2, 2, 4, 2, 4, 4, 4, 4, 4, 2, 2, 4, 4,
       2, 4, 4, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 2, 4, 4, 4, 4, 4, 2, 4, 2, 2,
       4, 2, 2, 4, 4, 2, 2, 2, 4, 2, 2, 2, 2, 2, 4, 4, 2, 2, 2, 4, 2, 2, 2,
       2, 4, 2, 2, 2, 2, 2, 2, 4, 4, 2, 2, 2, 4, 2, 2, 2, 4, 2, 4, 2, 4, 4,
       2, 2, 2, 2, 4, 4, 2, 2, 2, 4, 2, 2, 4, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2,
       2, 2, 4, 2, 2, 4, 4, 2, 4, 2, 2, 2, 4, 2, 2, 4, 4, 2, 4, 4, 2, 2, 2,
       2, 4, 2, 4, 2, 4, 2, 2, 2, 2, 2, 4, 4, 2, 4, 4, 2, 4, 2, 2, 2, 2, 4,
       4, 4, 2, 4, 2, 2, 4, 2, 4, 4], dtype=int64)
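The predicted labels come from a probability: for a two-class problem, logistic regression passes a linear score through the sigmoid function and predicts the second class when the result is at least 0.5. The sketch below reproduces that by hand on synthetic two-class data (an assumption for illustration; the labels 2 and 4 merely mirror the Class coding used here).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two synthetic clusters, labeled 2 and 4 like the Class column (hypothetical data).
rng = np.random.default_rng(2)
X_toy = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
                   rng.normal(+1.0, 1.0, size=(50, 2))])
y_toy = np.array([2] * 50 + [4] * 50)

clf = LogisticRegression().fit(X_toy, y_toy)

# Linear score z = w . x + b, then the sigmoid turns it into P(class 4 | x).
z = X_toy @ clf.coef_.ravel() + clf.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

# Column 1 of predict_proba corresponds to clf.classes_[1], i.e. label 4.
p_sklearn = clf.predict_proba(X_toy)[:, 1]

# Thresholding the probability at 0.5 gives the predicted label.
pred_manual = np.where(p_manual >= 0.5, 4, 2)

print(np.allclose(p_manual, p_sklearn))
print(np.array_equal(pred_manual, clf.predict(X_toy)))
```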

Performance analysis of the linear classifier on the benign/malignant tumor prediction task

# import classification_report from sklearn.metrics
from sklearn.metrics import classification_report
# use the model's built-in score function to get the accuracy on the test set
print('Accuracy of LR Classifier:', lr.score(X_test, y_test))
# use classification_report to get the other three metrics (recall, precision, F1 score)
print(classification_report(y_test, lr_y_predict, target_names=['Benign', 'Malignant']))

Accuracy of LR Classifier: 0.988304093567
             precision    recall  f1-score   support

     Benign       0.99      0.99      0.99       100
  Malignant       0.99      0.99      0.99        71

avg / total       0.99      0.99      0.99       171
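The three metrics in the report can be derived from the confusion matrix. The following sketch does so for the malignant class (label 4) on ten made-up labels and checks the result against scikit-learn; the label vectors are illustrative assumptions, not the model's real predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up true and predicted labels (2 = benign, 4 = malignant).
y_true = np.array([2, 2, 2, 2, 4, 4, 4, 2, 4, 2])
y_pred = np.array([2, 2, 4, 2, 4, 4, 2, 2, 4, 2])

# With labels=[2, 4], rows are true classes and columns are predicted classes,
# so ravel() yields: true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[2, 4]).ravel()

precision = tp / (tp + fp)   # of the samples predicted malignant, how many really are
recall = tp / (tp + fn)      # of the truly malignant samples, how many were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(precision, recall, f1)
print(precision_score(y_true, y_pred, pos_label=4),
      recall_score(y_true, y_pred, pos_label=4),
      f1_score(y_true, y_pred, pos_label=4))
```

For a medical screening task such as this one, recall on the malignant class is usually the number to watch, since a missed malignant case is costlier than a false alarm.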

Reposted from: https://www.cnblogs.com/htfeng/p/9931758.html
