Classifying MNIST

Author: 手机用户2502910101 | Source: Internet | 2023-09-24 19:49

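These are my notes from studying "Hands-On Machine Learning with Scikit-Learn and TensorFlow". If anything in these notes infringes on the book's copyright, please contact me and I will remove it.

This post comes from https://www.cnblogs.com/endlesscoding/p/9901539.html. Please do not repost it without the author's permission.

The material here is not yet very well organized. I will update and tidy it up when I have time.

The code below was run in Jupyter with Python 3.6.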

Getting the data

    # from sklearn.datasets import fetch_mldata
    # from sklearn import datasets
    # mnist = fetch_mldata('MNIST original')
    # mnist

The download did not seem to work, so I found the data file online and put it in the `datasets` directory under the current directory.

    # Note: fetch_mldata was removed in scikit-learn 0.22, so this needs an older version
    from sklearn.datasets import fetch_mldata
    import numpy as np

    mnist = fetch_mldata('mnist-original', data_home='./datasets/')
    mnist

`{'DESCR': 'mldata.org dataset: mnist-original', 'COL_NAMES': ['label', 'data'], 'target': array([0., 0., 0., ..., 9., 9., 9.]), 'data': array([[0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], ..., [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)}`
Many of the instructions I found online did not work for me. The approach above is the one that did.

    X, y = mnist['data'], mnist['target']
    print(X.shape)
    print(y.shape)

`(70000, 784) (70000,)`
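As shown above, X is a \(70000\times784\) matrix. The 70000 rows are the 70000 samples, and the 784 columns are the 784 features of each sample (one per pixel of a 28×28 image).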
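
To make the row layout concrete, here is a minimal sketch of how one 784-value row maps onto a 28×28 image. The array `row` is a made-up stand-in for a real MNIST row, and the sketch assumes the row-major pixel order that `reshape(28, 28)` implies.

```python
import numpy as np

# Hypothetical stand-in for one row of X: 784 values numbered 0..783
row = np.arange(784)

# Row-major reshape: the first 28 values become the first image row, and so on
image = row.reshape(28, 28)

r, c = 3, 5
# The pixel at image position (r, c) sits at flat index r * 28 + c
flat_index = r * 28 + c
print(image.shape, image[r, c], row[flat_index])
```
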

    %matplotlib inline
    import matplotlib
    import matplotlib.pyplot as plt

    some_digit = X[36000]
    some_digit_image = some_digit.reshape(28, 28)
    plt.imshow(some_digit_image, cmap=matplotlib.cm.binary, interpolation="nearest")
    plt.axis('off')
    plt.show()

This digit looks like a 5, although to me it looks more like a 6. Let's check its label.
`y[36000]`
`5.0`

    # EXTRA
    def plot_digits(instances, images_per_row=10, **options):
        size = 28
        images_per_row = min(len(instances), images_per_row)
        images = [instance.reshape(size, size) for instance in instances]
        n_rows = (len(instances) - 1) // images_per_row + 1
        row_images = []
        n_empty = n_rows * images_per_row - len(instances)
        images.append(np.zeros((size, size * n_empty)))
        for row in range(n_rows):
            rimages = images[row * images_per_row : (row + 1) * images_per_row]
            row_images.append(np.concatenate(rimages, axis=1))
        image = np.concatenate(row_images, axis=0)
        plt.imshow(image, cmap=matplotlib.cm.binary, **options)
        plt.axis("off")

    plt.figure(figsize=(9, 9))
    example_images = np.r_[X[:12000:600], X[13000:30600:600], X[30600:60000:590]]
    plot_digits(example_images, images_per_row=10)
    # save_fig("more_digits_plot")
    plt.show()

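Maybe this label is simply wrong. We could relabel the data and perhaps improve the model's accuracy. I am only joking here, so don't take it seriously.
Before training on the data, we should set aside a test set. MNIST has already done this for us: the first 60000 images are the training set and the last 10000 are the test set.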
`X_train, X_test, y_train, y_test = X[:60000],X[60000:],y[:60000],y[60000:]`
The MNIST data is sorted by digit, so we first need to shuffle it. This ensures that all cross-validation folds are similar.

    import numpy as np

    shuffle_index = np.random.permutation(60000)
    shuffle_index

`array([52603, 56601, 42625, ..., 17778, 24267, 29358])`
np.random.permutation returns a random permutation of a sequence. In the example above it is a random ordering of the integers from 0 to 59999.

`X_train, y_train = X_train[shuffle_index],y_train[shuffle_index]`
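
The reason the shuffle is safe is that both arrays are indexed with the same permutation, so each image keeps its own label. A minimal sketch with toy arrays (`X_toy` and `y_toy` are hypothetical stand-ins, not MNIST):

```python
import numpy as np

# Toy stand-ins for X_train and y_train: row i holds the value 10 * i, its label is i
X_toy = np.arange(6).reshape(6, 1) * 10
y_toy = np.arange(6)

np.random.seed(0)  # fixed seed so the run is repeatable
shuffle_index = np.random.permutation(6)

# Index both arrays with the same permutation
X_shuffled, y_shuffled = X_toy[shuffle_index], y_toy[shuffle_index]

# Every feature row still sits next to its own label
still_aligned = bool(np.all(X_shuffled[:, 0] == 10 * y_shuffled))
print(shuffle_index, still_aligned)
```
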

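Training a binary classifier
Rather than building a multiclass classifier right away, that is, one that identifies which of the digits 0 to 9 an image shows, let's start with the simplest task: deciding whether a digit is a 5. This splits the data into two classes, "5" and "not 5".

    # Boolean arrays: True for every 5, False for all other digits
    y_train_5 = (y_train == 5)
    y_test_5 = (y_test == 5)
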
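Now let's train a classifier on it, using stochastic gradient descent (SGD) through Scikit-Learn's SGDClassifier class. One advantage of this classifier is that it handles very large datasets efficiently, in part because it processes training instances one at a time.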
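
To illustrate what "one instance at a time" means, here is a rough pure-Python sketch of SGD with a hinge loss and an L2 penalty. It is a simplified illustration under my own assumptions (fixed learning rate `eta`, labels coded as +1/-1, hypothetical helper names and toy data), not the actual implementation inside SGDClassifier.

```python
def sgd_hinge_fit(X, y, eta=0.1, alpha=1e-4, n_epochs=5):
    """Train a linear classifier by visiting one instance at a time.

    y must contain +1 / -1 labels. Simplified illustration only.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            # The L2 penalty shrinks the weights a little at every step
            w = [wj * (1 - eta * alpha) for wj in w]
            if yi * score < 1:  # inside the margin: the hinge loss is active
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b


def predict(w, b, X):
    """Return True where the linear score is positive."""
    return [sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 for xi in X]


# Tiny, linearly separable toy data (hypothetical, not MNIST)
X_toy = [(2, 2), (3, 2), (2, 3), (-2, -2), (-3, -2), (-2, -3)]
y_toy = [1, 1, 1, -1, -1, -1]

w, b = sgd_hinge_fit(X_toy, y_toy)
predictions = predict(w, b, X_toy)
print(predictions)
```

Each update only needs the current instance, which is why this style of training scales to datasets that are too large to process all at once.
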

    from sklearn.linear_model import SGDClassifier

    sgd_clf = SGDClassifier(random_state=32)
    sgd_clf.fit(X_train, y_train_5)

`SGDClassifier(alpha=0.0001, average=False, class_weight=None, early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=None, n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2', power_t=0.5, random_state=32, shuffle=True, tol=None, validation_fraction=0.1, verbose=0, warm_start=False)`
`sgd_clf.predict([some_digit])`
`array([ True])`
The model's result seems to depend quite a bit on the random seed. Let's change the seed to 42 and check whether the prediction is still correct.

    sgd_clf = SGDClassifier(random_state=42)
    sgd_clf.fit(X_train, y_train_5)
    sgd_clf.predict([some_digit])

`array([ True])`

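Evaluating performance
Both outputs shown above are `True`, which agrees with the label, but a single prediction says very little about a classifier. Let's evaluate its performance as a whole.

Measuring accuracy with cross-validation

During cross-validation you sometimes need more control than cross_val_score() and similar functions provide. The code below does roughly the same thing as cross_val_score().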

    from sklearn.model_selection import StratifiedKFold
    from sklearn.base import clone

    # No random_state here: newer Scikit-Learn versions reject it unless shuffle=True
    skfolds = StratifiedKFold(n_splits=3)

    for train_index, test_index in skfolds.split(X_train, y_train_5):
        clone_clf = clone(sgd_clf)  # fresh, untrained copy for every fold
        X_train_folds = X_train[train_index]
        y_train_folds = y_train_5[train_index]
        X_test_fold = X_train[test_index]
        y_test_fold = y_train_5[test_index]

        clone_clf.fit(X_train_folds, y_train_folds)
        y_pred = clone_clf.predict(X_test_fold)
        n_correct = sum(y_pred == y_test_fold)
        print(n_correct / len(y_pred))

0.9612 0.9531 0.9688

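The StratifiedKFold class performs stratified sampling, so each fold contains a representative proportion of each class. At each iteration the code creates a clone of the classifier, trains the clone on the training folds, and makes predictions on the test fold.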
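
The idea of stratification can be sketched in a few lines of plain Python. This is only an illustration of the principle: the helper `stratified_folds` and the toy labels are hypothetical, and the round-robin dealing used here is a simplification, not how StratifiedKFold assigns samples internally.

```python
from collections import defaultdict


def stratified_folds(labels, n_splits):
    """Deal the indices of each class round-robin into n_splits folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


# Hypothetical imbalanced labels: 6 positives ("5") and 54 negatives ("not 5")
labels = [True] * 6 + [False] * 54

folds = stratified_folds(labels, n_splits=3)
fold_sizes = [len(fold) for fold in folds]
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(fold_sizes, positives_per_fold)
```

Every fold ends up with the same 10% share of positives as the full label list, which is the property that makes the per-fold scores comparable.
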

Now let's run the cross-validation directly with Scikit-Learn, using the cross_val_score function to evaluate the SGDClassifier model.

    from sklearn.model_selection import cross_val_score

    cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")

array([0.9612, 0.9531, 0.9688])

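This accuracy looks pretty good: above 95% on every fold. It is a little exciting, and it makes classification feel easy.

Before getting too excited, let's look at a very simple classifier that assigns every image to the "not 5" class, and see how it performs.

    from sklearn.base import BaseEstimator

    # This model's strategy is to predict "not 5" for every input
    class Never5Classifier(BaseEstimator):
        def fit(self, X, y=None):
            pass
        def predict(self, X):
            return np.zeros((len(X), 1), dtype=bool)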

np.zeros((2,1), dtype=bool)

array([[False], [False]])

    never_5_clf = Never5Classifier()
    cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring="accuracy")

array([0.90815, 0.9124 , 0.9084 ])

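Even a classifier this simple reaches about 90% accuracy. That is because only about 10% of the samples are 5s and the rest are not, so always guessing that an image is not a 5 is right about 90% of the time. This is called class imbalance. It is like standing on a street in Japan and guessing that everyone who walks by is Japanese: we would almost always be right.

So accuracy is generally not a good performance measure, especially when the data is imbalanced.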
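
The 90% figure can be checked with simple arithmetic. The sketch below takes the class counts from the confusion-matrix output that appears later in this post (54306 + 273 images that are not 5s, 2065 + 3356 images of 5s) and the three fold scores printed above; the variable names are my own.

```python
# Class counts read off the confusion-matrix output shown later in this post
n_not_5 = 54306 + 273  # images that are not a 5
n_5 = 2065 + 3356      # images that are a 5
total = n_not_5 + n_5

# Accuracy of a classifier that always answers "not 5"
always_not_5_accuracy = n_not_5 / total

# Mean of the three fold accuracies printed for Never5Classifier
fold_scores = [0.90815, 0.9124, 0.9084]
mean_fold_score = sum(fold_scores) / len(fold_scores)

print(total, always_not_5_accuracy, mean_fold_score)
```

The two numbers agree, so the score of Never5Classifier is fully explained by the class ratio and says nothing about any ability to recognize digits.
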

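Confusion matrix

A much better way to evaluate a classifier is the confusion matrix. The general idea is to count the number of times instances of class A are classified as class B.

To compute the confusion matrix you first need a set of predictions to compare against the true values. You might be tempted to predict on the test set, but it is better to leave it untouched for now. cross_val_predict() instead returns predictions for the training set, each made by a model that did not see that instance during training.

    from sklearn.model_selection import cross_val_predict

    y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

    from sklearn.metrics import confusion_matrix

    confusion_matrix(y_train_5, y_train_pred)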

array([[54306, 273], [ 2065, 3356]], dtype=int64)

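Each row of the confusion matrix represents an actual class, and each column represents a predicted class. The first row covers the "not 5" images: 54306 of them were correctly classified as not 5 (true negatives), while the remaining 273 were wrongly classified as 5 (false positives). The second row covers the images of 5s: 2065 were wrongly classified as not 5 (false negatives), while the remaining 3356 were correctly classified as 5 (true positives). A perfect classifier would have only true negatives and true positives, so its confusion matrix would have nonzero values only on its main diagonal (top left to bottom right).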

# confusion_matrix(y_train_5, y_train_perfect_predictions)
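
As a small sketch of how the four cells are counted, the hypothetical helper below builds the same [[TN, FP], [FN, TP]] layout from two lists of booleans. The labels and predictions are made up for illustration.

```python
def binary_confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for two lists of booleans."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        # Row = actual class, column = predicted class
        matrix[int(actual)][int(predicted)] += 1
    return matrix


# Hypothetical labels and predictions for eight images
y_true = [False, False, False, False, True, True, True, False]
y_pred = [False, True, False, False, True, False, True, False]

matrix = binary_confusion_matrix(y_true, y_pred)
perfect = binary_confusion_matrix(y_true, y_true)
print(matrix, perfect)
```

Comparing the labels with themselves plays the role of perfect predictions, and only the main diagonal is nonzero in that case.
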

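The confusion matrix gives a lot of information, but sometimes you want a more concise metric. An interesting one is the accuracy of the positive predictions, called the precision of the classifier.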

\[
precision = \frac{TP}{TP + FP}
\tag{3-1}
\]

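where \(TP\) is the number of true positives and \(FP\) is the number of false positives.

Precision is usually used together with another metric called recall, also known as sensitivity or the true positive rate (TPR). It is the proportion of positive instances that the classifier correctly detects.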

\[
recall = \frac{TP}{TP+FN}
\tag{3-2}
\]

\(FN\) is the number of false negatives.

    from sklearn.metrics import precision_score, recall_score

    print(precision_score(y_train_5, y_train_pred))
    print(recall_score(y_train_5, y_train_pred))

0.924772664645908 0.6190739715919572

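Seen this way, the classifier is less impressive than its accuracy suggested. When it says an image is a 5 it is right about 92.5% of the time, but it only detects about 61.9% of the 5s.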
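
Both scores can be reproduced by hand from the confusion-matrix output shown earlier, using equations (3-1) and (3-2). The variable names in this short sketch are my own.

```python
# Cells of the confusion matrix printed earlier: [[TN, FP], [FN, TP]]
tn, fp = 54306, 273
fn, tp = 2065, 3356

precision = tp / (tp + fp)  # equation (3-1)
recall = tp / (tp + fn)     # equation (3-2)
print(precision, recall)
```
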

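It is often convenient to combine precision and recall into a single metric called the F1 score, especially when you need a simple way to compare two classifiers. The F1 score is the harmonic mean of precision and recall.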

\[
F1 = \frac{2}{\frac{1}{precision}+\frac{1}{recall}} = 2 \times \frac{precision \times recall}{precision + recall} = \frac{TP}{TP + \frac{FN+FP}{2}}
\tag{3-3}
\]

To compute the F1 score, simply call f1_score().

    from sklearn.metrics import f1_score

    f1_score(y_train_5, y_train_pred)

0.7416574585635358

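The F1 score favors classifiers whose precision and recall are similar: because it is a harmonic mean, it is high only when both are high. But the F1 score is not always what we care about. Sometimes we mostly care about precision, and sometimes we mostly care about recall.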
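
The three forms of equation (3-3) can be checked against each other with the counts from the confusion matrix above. The sketch also compares a balanced and a lopsided pair of scores (made-up values, chosen so that both pairs have the same arithmetic mean) to show how the harmonic mean penalizes imbalance.

```python
tp, fp, fn = 3356, 273, 2065
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# The three equivalent forms of equation (3-3)
f1_harmonic = 2 / (1 / precision + 1 / recall)
f1_product = 2 * precision * recall / (precision + recall)
f1_counts = tp / (tp + (fn + fp) / 2)


def f1(p, r):
    """Harmonic mean of a precision p and a recall r."""
    return 2 * p * r / (p + r)


# Hypothetical score pairs with the same arithmetic mean (0.5)
balanced = f1(0.5, 0.5)
lopsided = f1(0.9, 0.1)
print(f1_counts, balanced, lopsided)
```
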

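Let's go over the meaning of precision again. If nearly every instance the classifier assigns to our target class really belongs to it, the classifier has high precision. Precision is low when the classifier assigns instances to the target class that do not belong there. In our example we predict whether a handwritten image is a 5. If an image really is a 5 and the classifier says 5, the prediction is correct, and correct predictions are not what separates precision from recall. If an image that is not a 5 is predicted as a 5, we say the classifier is not very precise: it has low precision.
What about recall? When an image that is a 5 is predicted as not a 5, the classifier is being rather strict, and it has lower recall.
In short, precision is low when things that only look like a 5 (but are not) get predicted as 5. Recall is low when real 5s that do not look like one (perhaps because they are badly written) get predicted as not 5.

Here are two examples based on my own understanding. Suppose a company is looking for a general manager and a group of people apply. The goal is that whoever we pick is certainly capable of the job. Some applicants look as if they might be capable, but to be safe we only pick someone who looks extremely capable. We then have high precision, because the person we pick is almost certainly able to do the job. But we make another kind of mistake: some applicants really are capable and we simply fail to see it (you cannot judge a book by its cover), so we reject them and end up with low recall. These rejected candidates are false negatives. With the usual convention that the null hypothesis is "not capable", statistics calls this a Type II error. The trade-off is reasonable here: we turn away some good candidates, but the one we accept is reliable.

The other case: the police want to find a few criminals in a crowd. Here we cannot insist on very high precision, because that might let a real criminal go. The usual principle is that anyone who looks suspicious should be investigated, even if it later turns out that they are innocent. The saying "better to wrongly accuse a thousand than to let one go" expresses this idea. The result is lower precision but high recall. The innocent people who get investigated are false positives, which under the same convention is a Type I error.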

Precision/recall trade-off

    y_scores = sgd_clf.decision_function([some_digit])
    y_scores

array([15905.22111141])

    threshold = 0
    y_some_digit_pred = (y_scores > threshold)
    y_some_digit_pred

array([ True])

y_scores = cross_val_predict(sgd_clf, X_train,y_train_5,cv=3, method = "decision_function")

    from sklearn.metrics import precision_recall_curve

    precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

    def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
        plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
        plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
        plt.xlabel("Threshold")
        plt.legend(loc="upper left")
        plt.ylim([0, 1.1])

    plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
    plt.grid()
    plt.show()
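
The trade-off itself can be seen on a handful of made-up decision scores. This is a minimal sketch: the scores, the labels and the helper `precision_recall_at` are hypothetical, and the thresholds are chosen so that at least one instance is always predicted positive.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when predicting positive for score > threshold."""
    predicted = [s > threshold for s in scores]
    tp = sum(p and l for p, l in zip(predicted, labels))
    fp = sum(p and not l for p, l in zip(predicted, labels))
    fn = sum((not p) and l for p, l in zip(predicted, labels))
    return tp / (tp + fp), tp / (tp + fn)


# Hypothetical decision scores with their true labels (True means "5")
scores = [-3, -2, -1, 0.5, 1, 2, 3, 4]
labels = [False, False, True, False, True, False, True, True]

results = {t: precision_recall_at(t, scores, labels) for t in (-2.5, 0, 2.5)}
for t, (p, r) in results.items():
    print(f"threshold={t}: precision={p:.3f}, recall={r:.3f}")
```

Raising the threshold makes the classifier stricter, so precision goes up while recall goes down. On this toy data the relation is monotonic, though on real data precision can occasionally dip as the threshold rises.
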

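The ROC curve

The receiver operating characteristic (ROC) curve is another common tool for binary classifiers. It is very similar to the precision/recall curve, but instead of plotting precision versus recall, the ROC curve plots the true positive rate (another name for recall) against the false positive rate (FPR). The FPR is the proportion of negative instances that are incorrectly classified as positive. It equals 1 minus the true negative rate (TNR), which is the proportion of negative instances that are correctly classified as negative. The TNR is also called specificity.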
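
These rates can be computed from the confusion-matrix output shown earlier, which corresponds to one point on the ROC curve (the default decision threshold). The variable names in this sketch are my own.

```python
# Cells of the confusion matrix printed earlier: [[TN, FP], [FN, TP]]
tn, fp = 54306, 273
fn, tp = 2065, 3356

tpr = tp / (tp + fn)  # true positive rate, i.e. recall
fpr = fp / (fp + tn)  # false positive rate
tnr = tn / (tn + fp)  # true negative rate, i.e. specificity

print(tpr, fpr, 1 - tnr)
```
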

To plot the ROC curve, you first need to compute the TPR and FPR at various threshold values, using the roc_curve() function:

    from sklearn.metrics import roc_curve

    fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

    def plot_roc_curve(fpr, tpr, label=None):
        plt.plot(fpr, tpr, linewidth=2, label=label)
        plt.plot([0, 1], [0, 1], 'k--')
        plt.axis([0, 1, 0, 1])
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')

    plot_roc_curve(fpr, tpr)
    plt.show()

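One way to compare classifiers is to measure the area under the ROC curve (AUC). A perfect classifier has a ROC AUC equal to 1, whereas a purely random classifier has a ROC AUC equal to 0.5. Scikit-Learn provides a function to compute the ROC AUC: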
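
To see where these values come from, here is a rough pure-Python sketch that builds ROC points from scores and integrates them with the trapezoid rule. The helpers `roc_points` and `auc` and the toy data are hypothetical, and the sketch assumes there are no tied scores; it is not how roc_auc_score is implemented.

```python
def roc_points(labels, scores):
    """ROC points (FPR, TPR) from lowering the threshold past one score at a time.

    Assumes all scores are distinct.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve, by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores: the first set is imperfect, the second separates perfectly
labels = [False, False, True, False, True, False, True, True]
scores = [-3, -2, -1, 0.5, 1, 2, 3, 4]
perfect_scores = [-4, -3, 1, -2, 2, -1, 3, 4]

toy_auc = auc(roc_points(labels, scores))
perfect_auc = auc(roc_points(labels, perfect_scores))
diagonal_auc = auc([(0.0, 0.0), (1.0, 1.0)])  # the dashed "random" line
print(toy_auc, perfect_auc, diagonal_auc)
```
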

    from sklearn.metrics import roc_auc_score

    roc_auc_score(y_train_5, y_scores)

0.9623990527630832

    from sklearn.ensemble import RandomForestClassifier

    forest_clf = RandomForestClassifier(random_state=42)
    y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
                                        method="predict_proba")

    # Use the probability of the positive class as the score
    y_scores_forest = y_probas_forest[:, 1]
    fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, y_scores_forest)

    plt.plot(fpr, tpr, "b:", label="SGD")
    plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
    plt.legend(loc="lower right")  # "bottom right" is not a valid legend location
    plt.show()