A Complete Project Workflow: The Full Process of a Machine Learning Project

Author: 钟钟来了_960 | Source: Internet | 2023-09-18 19:44

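Preparing the data

Training and test data can come from many places, such as databases, CSV files, or other kinds of storage. For convenience, you can write small scripts to download and parse the data. In this article, we start with a script that demonstrates this: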

    import os
    import tarfile
    import urllib.request

    DOWNLOAD_ROOT = 'https://raw.githubusercontent.com/ageron/handson-ml/master/'
    HOUSING_PATH = 'chapter02/datasets/housing'
    HOUSING_URL = DOWNLOAD_ROOT + 'datasets/housing' + '/housing.tgz'


    def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
        """Download housing.tgz and extract it into housing_path."""
        print(housing_url)
        # Create the target directory if it does not exist yet
        if not os.path.isdir(housing_path):
            os.makedirs(housing_path)
        tgz_path = os.path.join(housing_path, 'housing.tgz')
        urllib.request.urlretrieve(housing_url, tgz_path)
        print(tgz_path)
        housing_tgz = tarfile.open(tgz_path)
        housing_tgz.extractall(path=housing_path)
        housing_tgz.close()


    fetch_housing_data()

After running the code above, the data has been downloaded to the local disk. Next, we load it with pandas:

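    import pandas as pd


    def load_housing_data(housing_path=HOUSING_PATH):
        """Read housing.csv from housing_path into a DataFrame."""
        print(housing_path)
        csv_path = os.path.join(housing_path, "housing.csv")
        print(csv_path)
        return pd.read_csv(csv_path)


    # The rest of the article refers to the loaded data as `housing`
    housing = load_housing_data()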

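Previewing the data

The data parsed by pandas is a DataFrame. Calling its head() method returns the first 5 rows by default.

There are 10 attributes in total. These 5 rows look complete, with no empty values. With info(), we can get an overview of the whole dataset:

There are 20,640 rows in total, which is small by ML standards. Only the total_bedrooms attribute has missing values.

Looking at the data, every attribute except ocean_proximity is numeric, and numeric values are easy to feed into ML algorithms. Looking again at the ocean_proximity values in the 5 rows above, we can infer that it takes one of a few values, much like an enumeration. The value_counts() method shows how many rows have each value:

In addition, describe() shows summary statistics for each numeric attribute: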

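Glossary:

Name | Meaning
---|---
count | number of non-null values
mean | mean
min | minimum
max | maximum
std | standard deviation
25% / 50% / 75% | percentiles: the value below which that percentage of the observations fall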
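
As a small illustration of the percentile rows, the sketch below runs describe() on a made-up five-value series. The values and the name `ages` are hypothetical and are not taken from the housing data.

```python
import pandas as pd

# Hypothetical toy values, only to show what describe() reports.
ages = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="ages")
summary = ages.describe()

# 25% of the values are at or below the "25%" entry, and so on.
print(summary[["count", "mean", "std", "min", "25%", "50%", "75%", "max"]])
```

Here the 50% entry is the median of the series, so half of the values lie at or below it.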

To look at each attribute in more detail, we can use the hist() method to plot a histogram of every attribute:

    # Jupyter notebook magic; omit this line outside a notebook
    %matplotlib inline
    import matplotlib.pyplot as plt

    housing.hist(bins=50, figsize=(20, 15))
    plt.show()

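The histograms make the distribution of values easy to see: the x-axis is the value and the y-axis is the number of rows. For this dataset, we find the following:

- median_income does not represent actual income. It is a computed value ranging from about 0.5 to 15. Understanding how a value was computed is also important.
- Some data is capped. housing_median_age and median_house_value have obvious limits: the tall bar at the right edge of their histograms shows that the values were capped. This affects ML algorithms. For example, should predictions be subject to the same cap? If the answer is no, the capped data needs further handling:
  - Collect the true values for the capped rows
  - Remove the capped rows
- The attributes have very different ranges. This is addressed later in the article.
- Some histograms are tail-heavy. This is also addressed later.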
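
One rough way to spot a cap without a plot is to check how many values sit exactly at the maximum. The sketch below uses hypothetical ages and a hypothetical helper named `capped_fraction`; the cap of 52 is chosen only for illustration.

```python
import numpy as np

def capped_fraction(values):
    """Return the share of entries that sit exactly at the maximum value."""
    values = np.asarray(values, dtype=float)
    return float(np.mean(values == values.max()))

# Hypothetical house ages, before and after everything above 52 is recorded as 52.
raw_ages = np.array([10, 25, 31, 47, 52, 60, 75, 18, 52, 90])
capped_ages = np.minimum(raw_ages, 52)

print(capped_fraction(raw_ages))     # a single entry holds the maximum
print(capped_fraction(capped_ages))  # many entries pile up at the cap
```

A large share at the maximum corresponds to the tall bar at the right edge of the histogram.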

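Creating a test set

Creating the test set also helps us understand the data further, which is useful when choosing a machine learning algorithm. The simplest approach is to randomly pick about 20% of the data as the test set.

The drawback of a plain random function is that every run produces a different split. To deal with this, we give each row a unique identifier, hash the identifier, and put the row in the test set if the last byte of the hash is less than or equal to 51 (about 20% of 256).

The original data has no such identifier, so we call reset_index() to add a row index and use that as the identifier.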

    import hashlib
    import numpy as np


    def test_set_check(identifier, test_ratio, hash):
        # A row goes to the test set when the last byte of its hash is below 256 * test_ratio
        return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio


    def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
        ids = data[id_column]
        in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
        return data.loc[~in_test_set], data.loc[in_test_set]


    # Add an index column to housing
    housing_with_id = housing.reset_index()
    train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")
    print(len(train_set), 'train +', len(test_set), "test")

    # An id can also be built this way
    # housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
    # train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")

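Using the index as the identifier in the code above has a drawback: new data must be appended to the end of the dataset, and rows in the middle must never be deleted. The fix is to compute the identifier from a combination of other attributes.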
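
The reason a hash-based split stays stable can be shown with the standard library alone. This is a minimal sketch, assuming identifiers are encoded as 8-byte little-endian integers (a stand-in for the np.int64 conversion used in the article); the helper name `in_test_set` and the id ranges are hypothetical.

```python
import hashlib

def in_test_set(identifier, test_ratio=0.2):
    """Decide membership from the last byte of the MD5 hash of the id."""
    raw = int(identifier).to_bytes(8, "little", signed=True)
    return hashlib.md5(raw).digest()[-1] < 256 * test_ratio

old_ids = range(1000)
before = {i: in_test_set(i) for i in old_ids}

# Simulate new rows being appended later with fresh ids.
all_ids = range(1500)
after = {i: in_test_set(i) for i in all_ids}

# Every old row keeps its original assignment, and about 20% land in the test set.
unchanged = all(before[i] == after[i] for i in old_ids)
share = sum(after.values()) / len(after)
print(unchanged, round(share, 3))
```

Because membership depends only on each row's own id, adding rows never moves an existing row between the training set and the test set.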

sklearn also provides a function for creating a test set:

    from sklearn.model_selection import train_test_split

    train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

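Purely random sampling works well when the dataset is large. If it is not large enough, random sampling can introduce significant sampling bias. For the current data we use stratified sampling instead. Suppose you ask an expert which attribute matters most and the answer is median_income; then we should consider stratifying on median_income.

Looking at the median_income histogram above, the values cluster in a few strata. Because there are only a few strata, this also suggests that purely random sampling is a poor fit.

We add a new attribute that marks which stratum each row belongs to. Every category above 5 is merged into category 5.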

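    # Random sampling can be biased in some cases, so consider stratified sampling;
    # each stratum needs enough instances, and there should not be too many strata
    housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
    # Merge every category above 5 into category 5
    housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)
    print(housing.head(10))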
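
To make the binning arithmetic concrete, the sketch below applies the same ceil-and-cap rule to a handful of hypothetical median_income values (the numbers are made up for illustration).

```python
import numpy as np

# Hypothetical median_income values (same units as the dataset column).
median_income = np.array([0.5, 1.5, 3.2, 4.6, 6.1, 8.3, 15.0])

# ceil(x / 1.5) gives categories 1, 2, 3, ...; anything above 5 is merged into 5.
income_cat = np.minimum(np.ceil(median_income / 1.5), 5.0)
print(income_cat)
```

For example, 3.2 / 1.5 is about 2.13, which rounds up to category 3, while 8.3 / 1.5 is about 5.53, which rounds up to 6 and is then merged into category 5.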

Next, we use sklearn to run stratified sampling based on income_cat.

    # Stratified sampling with sklearn's StratifiedShuffleSplit class
    from sklearn.model_selection import StratifiedShuffleSplit

    split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    for train_index, test_index in split.split(housing, housing["income_cat"]):
        strat_train_set = housing.loc[train_index]
        strat_test_set = housing.loc[test_index]

    print(housing["income_cat"].value_counts() / len(housing))

    # Drop income_cat once the training and test sets exist
    for s in (strat_train_set, strat_test_set):
        s.drop(["income_cat"], axis=1, inplace=True)

    print(strat_train_set.head(10))

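After the sampling succeeds, the code above drops the income_cat attribute. The result is as follows:

If we compute the error between each test set and the original data, we get a comparison table, and it shows that stratified sampling has a noticeably smaller error than random sampling.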
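
A comparison of this kind can be sketched on synthetic data. The category counts below are hypothetical, and `train_test_split` with `stratify=` is used here as a shorthand for stratified sampling; it is not the article's exact code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical category counts, chosen only for illustration.
counts = {1.0: 40, 2.0: 320, 3.0: 350, 4.0: 180, 5.0: 110}
income_cat = pd.Series(np.repeat(list(counts.keys()), list(counts.values())))

def proportions(series):
    """Share of rows in each category, ordered by category."""
    return series.value_counts(normalize=True).sort_index()

_, random_test = train_test_split(income_cat, test_size=0.2, random_state=42)
_, strat_test = train_test_split(
    income_cat, test_size=0.2, random_state=42, stratify=income_cat
)

overall = proportions(income_cat)
comparison = pd.DataFrame({
    "overall": overall,
    "random": proportions(random_test).reindex(overall.index, fill_value=0.0),
    "stratified": proportions(strat_test).reindex(overall.index, fill_value=0.0),
})
# Relative error of each test set's proportions against the full data, in percent.
comparison["random_err_%"] = 100 * (comparison["random"] / comparison["overall"] - 1)
comparison["strat_err_%"] = 100 * (comparison["stratified"] / comparison["overall"] - 1)
print(comparison)
```

The stratified column tracks the overall proportions closely by construction, while the random column depends on the luck of the draw.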

Discovering more about the data

To find the information hidden in the data, we need to visualize it. The housing data includes latitude and longitude, so geographic location should be a good place to start.

    housing = strat_train_set.copy()
    housing.plot(kind="scatter", x="longitude", y="latitude", figsize=(20, 12))

Drawn like this, the plot makes it hard to spot any pattern. Let's try adjusting the transparency of the points.

    housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1, figsize=(20, 12))

Now our brains quickly pick out the high-density areas as special. Are they related to price? Going one step further, we let the radius of each point represent the population and the color represent the price, then plot again:

    housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
                 s=housing["population"] / 100, label="population",
                 c="median_house_value", cmap=plt.get_cmap("jet"),
                 colorbar=True, figsize=(20, 12))
    plt.legend()

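From this plot we can see that price is closely related to location and population density, and also to ocean_proximity. Intuitively, then, we could consider using a clustering algorithm.

Attribute combinations

In the data, a single attribute may not be very useful on its own, but recombining attributes can yield useful information.

In this example, total_rooms and total_bedrooms mean little by themselves, but combining them with population and households produces new, meaningful attributes.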

    # Some attributes may not be needed; here the total number of bedrooms is not what we care about,
    # so we derive new combined attributes from the existing ones
    housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
    housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
    housing["population_per_household"] = housing["population"] / housing["households"]
    # numeric_only=True skips the text column ocean_proximity (available in pandas 1.5 or later)
    corr_matrix = housing.corr(numeric_only=True)
    corr_matrix["median_house_value"].sort_values(ascending=False)

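bedrooms_per_room correlates more strongly (in absolute value) with median_house_value than either total_rooms or total_bedrooms, which shows that the attribute combination paid off.

Working with data is an iterative, step-by-step process.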

Data cleaning

Before cleaning the data, we first set the data aside, separating the labels from the features.

    # Separate the labels
    housing = strat_train_set.drop("median_house_value", axis=1)
    housing_labels = strat_train_set["median_house_value"].copy()

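Earlier in this article we noted that total_bedrooms has some missing values. There are generally three ways to handle this:

1. Drop the rows that contain the missing value
2. Drop the attribute entirely
3. Fill in a replacement value

The third option is the usual choice: assign a new value to the missing entries, for example the mean.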
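
The three options map onto three pandas calls. The sketch below uses a hypothetical four-row frame and fills with the median rather than the mean, purely as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with one missing total_bedrooms value.
df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

option1 = df.dropna(subset=["total_bedrooms"])   # 1. drop the incomplete rows
option2 = df.drop("total_bedrooms", axis=1)      # 2. drop the whole attribute
median = df["total_bedrooms"].median()           # 3. fill with a computed value
option3 = df.fillna({"total_bedrooms": median})

print(len(option1), list(option2.columns), median)
```

With option 3, the computed value should be kept so that the same value can be applied to the test set and to new data later.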

sklearn provides SimpleImputer specifically for this problem:

    # ML algorithms cannot run on missing values, so the gaps need handling:
    # 1. drop those rows  2. drop the whole attribute  3. fill in the missing values
    from sklearn.impute import SimpleImputer

    # Use the median as the strategy
    imputer = SimpleImputer(strategy="median")
    # Remove the non-numeric attribute
    housing_num = housing.drop("ocean_proximity", axis=1)
    # fit only computes the statistic for each attribute
    imputer.fit(housing_num)
    print(imputer.statistics_)
    # transform fills in the missing values
    X = imputer.transform(housing_num)

The imputer's fit() only computes the median of each attribute and does nothing else; this is like 'training' the imputer. Calling transform() then converts the data.

The medians are as follows:

Handling text attributes

In this example, ocean_proximity is a text attribute and needs to be converted to numbers. sklearn provides LabelEncoder to convert such text values into numeric values.

    # sklearn also provides a way to convert non-numeric attribute values
    from sklearn.preprocessing import LabelEncoder

    encoder = LabelEncoder()
    housing_cat = housing["ocean_proximity"]
    housing_cat_encoded = encoder.fit_transform(housing_cat)
    print(housing_cat_encoded)
    print(encoder.classes_)

    '''
    [3 3 3 ... 1 1 1]
    ['<1H OCEAN' 'INLAND' 'ISLAND' 'NEAR BAY' 'NEAR OCEAN']
    '''

The problem with this approach is that ML algorithms tend to assume that nearby numbers are more similar. To solve this, sklearn provides OneHotEncoder, which maps each integer to a vector of 0s and 1s in which only the position for that category is 1 and all others are 0:

    # The example above has a serious problem: an ML algorithm would treat 0 and 1 as close,
    # yet <1H OCEAN and NEAR OCEAN are actually the more similar pair
    # One-hot encoding solves this by setting only the category's own position to 1
    from sklearn.preprocessing import OneHotEncoder

    encoder = OneHotEncoder()
    housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1, 1))
    print(housing_cat_1hot.toarray())

    '''
    [[1. 0. 0. 0. 0.]
     [1. 0. 0. 0. 0.]
     [0. 0. 0. 0. 1.]
     ...
     [0. 1. 0. 0. 0.]
     [1. 0. 0. 0. 0.]
     [0. 0. 0. 1. 0.]]
    '''

sklearn also provides LabelBinarizer, which merges the two steps above into one:

    # The label and one-hot steps can also be combined into one
    from sklearn.preprocessing import LabelBinarizer

    encoder = LabelBinarizer()
    housing_cat_1hot = encoder.fit_transform(housing_cat)
    print(housing_cat_1hot)
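
As an alternative sketch, in scikit-learn 0.20 and later OneHotEncoder accepts string categories directly, so the integer-encoding step can be skipped. The sample values below are a hypothetical subset of ocean_proximity, and this is not the approach used in the article's code.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample of ocean_proximity values; the encoder needs a 2-D input.
categories = np.array([["INLAND"], ["NEAR BAY"], ["INLAND"], ["<1H OCEAN"]])

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(categories).toarray()

print(encoder.categories_[0])  # the learned categories, in sorted order
print(one_hot)
```

Each row contains a single 1 in the column of its category, and the column order follows `encoder.categories_`.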

Custom transformers

Although sklearn provides many useful transformers, we still want to be able to define our own, and ideally they should be as easy to use as the built-in ones. The code below implements a simple data transformation:

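Previously, the combined attributes were computed directly on the DataFrame, as in the attribute combinations section. Now, the same logic as a transformer: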

    # Custom transformer
    from sklearn.base import BaseEstimator, TransformerMixin

    # Column positions in housing.values
    rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6


    class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
        def __init__(self, add_bedrooms_per_room=True):
            self.add_bedrooms_per_room = add_bedrooms_per_room

        def fit(self, X, y=None):
            return self  # nothing to learn

        def transform(self, X, y=None):
            rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
            population_per_household = X[:, population_ix] / X[:, household_ix]
            if self.add_bedrooms_per_room:
                bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
                return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
            else:
                return np.c_[X, rooms_per_household, population_per_household]


    attr_adder = CombinedAttributesAdder()
    housing_extra_attribs = attr_adder.transform(housing.values)
    print(len(housing_extra_attribs[0]))
    print(housing_extra_attribs)  # three values were appended to every row

    '''
    [[-121.89 37.29 38.0 ... 4.625368731563422 2.094395280235988
      0.22385204081632654]
     [-121.93 37.05 14.0 ... 6.008849557522124 2.7079646017699117
      0.15905743740795286]
     [-117.2 32.77 31.0 ... 4.225108225108225 2.0259740259740258
      0.24129098360655737]
     ...
     [-116.4 34.09 9.0 ... 6.34640522875817 2.742483660130719
      0.1796086508753862]
     [-118.01 33.82 31.0 ... 5.50561797752809 3.808988764044944
      0.19387755102040816]
     [-122.45 37.77 52.0 ... 4.843505477308295 1.9859154929577465
      0.22035541195476574]]
    '''

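Another benefit of this transformer is that it plugs easily into a pipeline, as described below.

Feature scaling

Scaling matters for machine learning: features on different scales produce different results. Our data has exactly this problem of inconsistent scales. There are generally two ways to solve it:

- Min-max scaling, also called normalization, squeezes values into the range 0 to 1: subtract the minimum, then divide by the difference between the maximum and the minimum.
- Standardization subtracts the mean and then divides by the standard deviation. Unlike normalization, the result is not bounded to the range 0 to 1, and it is much less affected by extreme values in the data.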
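
Both formulas can be written out in a few lines of NumPy. The feature values below are hypothetical and include one deliberately large value to show how each method reacts to it.

```python
import numpy as np

# Hypothetical feature column with one very large value.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-max scaling: (x - min) / (max - min), so the result lies in [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / standard deviation, so the result has mean 0 and std 1.
standardized = (x - x.mean()) / x.std()

print(min_max)
print(standardized)
```

With min-max scaling the single large value pushes all other values close to 0, which is the sensitivity to extreme values mentioned above.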

sklearn provides StandardScaler for feature scaling; we use the second approach, standardization.

Transformation Pipelines

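The steps above (data cleaning, attribute combination, feature scaling, and text conversion) can all be chained into a single process with sklearn's Pipeline. FeatureUnion additionally applies several pipelines to the same input and concatenates their outputs.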

    # Combining the numeric and categorical pipelines
    from sklearn.pipeline import FeatureUnion
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler


    class DataFrameSelector(BaseEstimator, TransformerMixin):
        """Select the given DataFrame columns and return them as a NumPy array."""

        def __init__(self, attribute_names):
            self.attribute_names = attribute_names

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            return X[self.attribute_names].values


    class CustomLabelBinarizer(BaseEstimator, TransformerMixin):
        """Wrap LabelBinarizer so that it accepts the (X, y) signature used by Pipeline."""

        # No *args/**kwargs: scikit-learn estimators must list their parameters explicitly
        def __init__(self):
            self.encoder = LabelBinarizer()

        def fit(self, x, y=None):
            self.encoder.fit(x)
            return self

        def transform(self, x, y=None):
            return self.encoder.transform(x)


    num_attribs = list(housing_num)
    cat_attribs = ["ocean_proximity"]

    num_pipeline = Pipeline([
        ("selector", DataFrameSelector(num_attribs)),
        ("imputer", SimpleImputer(strategy="median")),
        ("attribs_adder", CombinedAttributesAdder()),
        ("std_scaler", StandardScaler()),
    ])

    cat_pipeline = Pipeline([
        ("selector", DataFrameSelector(cat_attribs)),
        ("label_binarizer", CustomLabelBinarizer()),
    ])

    full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])

    housing_prepared = full_pipeline.fit_transform(housing)
    print(housing_prepared[0])

The code above implements the whole process from data cleaning to feature scaling.
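
In scikit-learn 0.20 and later, ColumnTransformer offers another way to route columns to different transformers without a custom selector class. The sketch below is an alternative to the FeatureUnion approach, not the article's method; the frame `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature frame standing in for the housing data.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "INLAND", "NEAR BAY"],
})

# Numeric columns: fill missing values with the median, then standardize.
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Route numeric and text columns to their own transformers by column name.
preprocess = ColumnTransformer([
    ("num", numeric_pipeline, ["total_rooms", "total_bedrooms"]),
    ("cat", OneHotEncoder(), ["ocean_proximity"]),
])

prepared = preprocess.fit_transform(toy)
print(prepared.shape)
```

The output has two scaled numeric columns followed by one column per ocean_proximity category.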

Selecting and training a model

With data preparation finished, we should have a clear understanding of the data. Next we need to choose a model to train, which is itself a process of repeated selection.

We first try a linear regression model:

    # Try a linear regression model first
    from sklearn.linear_model import LinearRegression

    lin_reg = LinearRegression()
    lin_reg.fit(housing_prepared, housing_labels)

    # Prepare a few rows to try the model on
    some_data = housing.iloc[:5]
    some_labels = housing_labels.iloc[:5]
    some_data_prepared = full_pipeline.transform(some_data)
    print(some_data_prepared)
    print("Predictions:\t", lin_reg.predict(some_data_prepared))
    print("Labels:\t\t", list(some_labels))

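Writing a model with sklearn is quite simple. The printed output shows that the predictions still differ from the observed values, so we need an error measure to track how far off the model is.

mean_squared_error computes the mean squared error (MSE). For $m$ samples with labels $y_i$ and predictions $\hat{y}_i$:

$$\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)^2$$

RMSE is generally used for evaluation (it is one of the most common metrics for regression models):

$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)^2}$$

In code: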

    # Measure the error with RMSE
    from sklearn.metrics import mean_squared_error

    housing_predictions = lin_reg.predict(housing_prepared)
    lin_mse = mean_squared_error(housing_labels, housing_predictions)
    lin_rmse = np.sqrt(lin_mse)
    lin_rmse
    # An error this large means the current features do not carry enough
    # information for prediction, or the model is not powerful enough

    '''
    68628.19819848923
    '''
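
The arithmetic behind MSE and RMSE can be checked by hand on a tiny example. The labels and predictions below are hypothetical.

```python
import numpy as np

# Hypothetical labels and predictions, only to show the arithmetic.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 9.0, 9.0])

# Squared errors are 1, 0, 4, 0, so MSE = (1 + 0 + 4 + 0) / 4 = 1.25.
mse = np.mean((y_pred - y_true) ** 2)
rmse = np.sqrt(mse)

print(mse, rmse)
```

Because RMSE is the square root of MSE, it is expressed in the same unit as the label, which makes it easier to interpret for house prices.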

Given the distributions seen earlier in this article, it is not surprising that linear regression has a large error. Going a step further, let's try training a decision tree model.

    # Train a decision tree on the data
    from sklearn.tree import DecisionTreeRegressor

    tree_reg = DecisionTreeRegressor()
    tree_reg.fit(housing_prepared, housing_labels)

    tree_predictions = tree_reg.predict(housing_prepared)
    tree_mse = mean_squared_error(housing_labels, tree_predictions)
    tree_rmse = np.sqrt(tree_mse)
    tree_rmse

    '''
    0.0
    '''

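An error of 0 on the training data means the model has overfit. Overfitting is not a good thing. To get an honest estimate of the error, we can apply cross-validation to the training data. In essence, it splits the data into n parts and produces n error values.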
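
The gap between training error and error on unseen rows can be reproduced on synthetic data. This is a minimal sketch with hypothetical noisy data, not the housing dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy data: y depends on x plus random noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 2.0, size=200)

# Hold out the last 50 rows so the model never sees them during training.
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
train_rmse = np.sqrt(mean_squared_error(y_train, tree.predict(X_train)))
val_rmse = np.sqrt(mean_squared_error(y_val, tree.predict(X_val)))

# The tree memorizes the training rows but does worse on rows it has not seen.
print(train_rmse, val_rmse)
```

An unrestricted tree fits the noise in the training rows, so only the held-out error says anything about how it will behave on new data.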

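Here we use K-fold cross-validation. Unlike LOOCV, each test fold contains several samples rather than just one; the exact number depends on the choice of K. For example, with K = 5, five-fold cross-validation works as follows:

1. Split the dataset into 5 parts
2. Take each part in turn, without repetition, as the test set, train the model on the other four parts, and compute the model's MSE_i on that test set
3. Average the 5 values of MSE_i to get the final MSE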
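
The steps above can be sketched with plain NumPy. The helper `k_fold_indices` and the toy data are hypothetical, and the "model" is just the training mean so that the fold mechanics stay visible.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Hypothetical data: "train" a constant predictor (the training mean) per fold.
y = np.arange(20, dtype=float)
fold_mse = []
for train_idx, test_idx in k_fold_indices(len(y), k=5):
    prediction = y[train_idx].mean()
    fold_mse.append(np.mean((y[test_idx] - prediction) ** 2))

# One MSE per fold, then the average over all folds.
print(len(fold_mse), np.mean(fold_mse))
```

sklearn's cross_val_score performs the same splitting, fitting, and scoring loop for a real estimator.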

    # The error of 0 above indicates overfitting, so use sklearn's cross-validation:
    # split the training data into a number of folds that validate each other
    from sklearn.model_selection import cross_val_score

    scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
    # The scores are negative MSE values, so negate them before taking the root
    tree_rmse_scores = np.sqrt(-scores)


    def display_scores(scores):
        print("Scores:", scores)
        print("Mean:", scores.mean())
        print("Standard deviation:", scores.std())


    display_scores(tree_rmse_scores)

The decision tree's error is also high. Next we cross-validate the linear regression model:

    # Use cross-validation to check the error of the linear regression
    line_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                                  scoring="neg_mean_squared_error", cv=10)
    line_rmse_scores = np.sqrt(-line_scores)

    display_scores(line_rmse_scores)

Finally, we train a random forest model:

    # Random forest
    from sklearn.ensemble import RandomForestRegressor

    random_forest = RandomForestRegressor()
    random_forest.fit(housing_prepared, housing_labels)

    forest_predictions = random_forest.predict(housing_prepared)
    forest_mse = mean_squared_error(housing_labels, forest_predictions)
    forest_rmse = np.sqrt(forest_mse)
    forest_rmse

    '''
    22100.915917968654
    '''

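This time the error looks much smaller, so this model is the most promising so far. Note that this figure is measured on the training set itself.

After model selection we usually end up with a shortlist of models and simply pick the best one.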

Fine-tuning the model

Machine learning algorithms generally have hyperparameters that affect the results, and optimizing a model includes finding the best values for them.

sklearn's GridSearchCV makes it easy to try combinations of parameters, for example:

    # With a shortlist of usable models, the next step is to fine-tune them
    # Grid search: use sklearn to train on different parameter combinations and find the best one
    from sklearn.model_selection import GridSearchCV

    param_grid = [
        {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
        {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
    ]
    forest_reg = RandomForestRegressor()
    grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error')
    grid_search.fit(housing_prepared, housing_labels)

    grid_search.best_params_

    '''
    {'max_features': 8, 'n_estimators': 30}
    '''

The code above tries 3 × 4 + 2 × 3 = 18 combinations in total.
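
The count can be verified by expanding the grid by hand with the standard library. The variable names `combinations` and `cv_folds` are introduced here only for illustration.

```python
from itertools import product

param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]

# Each dictionary expands to the Cartesian product of its value lists.
combinations = [
    dict(zip(grid.keys(), values))
    for grid in param_grid
    for values in product(*grid.values())
]

# With 5-fold cross-validation, every combination is trained once per fold.
cv_folds = 5
print(len(combinations), len(combinations) * cv_folds)
```

With cv=5, the 18 combinations therefore amount to 90 training runs, which is why grid search becomes expensive as the grid grows.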

    # Get the best estimator
    grid_search.best_estimator_

    '''
    RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                          max_features=8, max_leaf_nodes=None, min_impurity_decrease=0.0,
                          min_impurity_split=None, min_samples_leaf=1,
                          min_samples_split=2, min_weight_fraction_leaf=0.0,
                          n_estimators=30, n_jobs=None, oob_score=False,
                          random_state=None, verbose=0, warm_start=False)
    '''

    # Print the cross-validated RMSE for every parameter combination
    cvres = grid_search.cv_results_
    for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
        print(np.sqrt(-mean_score), params)

This makes the error for each parameter combination easy to see.

Evaluating on the test set

Finally, once we have a usable model, we can evaluate it on the test set. First, though, the test set must be transformed with the pipeline from above:

    # Evaluate the final model on the test data
    final_model = grid_search.best_estimator_

    X_test = strat_test_set.drop("median_house_value", axis=1)
    y_test = strat_test_set["median_house_value"].copy()

    # transform only: the pipeline was already fitted on the training set
    X_test_prepared = full_pipeline.transform(X_test)

    final_predictions = final_model.predict(X_test_prepared)

    final_mse = mean_squared_error(y_test, final_predictions)
    final_rmse = np.sqrt(final_mse)
    final_rmse

    '''
    47732.7520382174
    '''