
TensorFlow 2: Preprocessing the Titanic Survival CSV Dataset


This example uses TensorFlow 2 to load CSV data into a tf.data.Dataset, working with a classic dataset: the Titanic passenger data.

1. Import the required libraries

import tensorflow as tf
import numpy as np
import pandas as pd
import functools

for i in [tf, np, pd]:
    print(i.__name__, ": ", i.__version__, sep="")

Output:

tensorflow: 2.2.0
numpy: 1.17.4
pandas: 0.25.3

2. Download and import the data


2.1 Download the data to the local machine

trainDataUrl = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
testDataUrl = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

trainFilePath = tf.keras.utils.get_file("trainTitanic.csv", trainDataUrl)
testFilePath = tf.keras.utils.get_file("testTitanic.csv", testDataUrl)

Output:

Downloading data from https://storage.googleapis.com/tf-datasets/titanic/train.csv
32768/30874 [===============================] - 1s 29us/step
Downloading data from https://storage.googleapis.com/tf-datasets/titanic/eval.csv
16384/13049 [=====================================] - 0s 28us/step

On Windows, the downloaded files are saved under <system drive>:\Users\<username>\.keras\datasets.
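
You can confirm the location by printing the path that get_file returned (the path in the comment below is only an example):

print(trainFilePath)
# e.g. C:\Users\<username>\.keras\datasets\trainTitanic.csv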

2.2 Load the data

labelColumn = "survived"  # name of the label column
labels = [0, 1]

def getDataset(filePath, **kwargs):
    dataset = tf.data.experimental.make_csv_dataset(
        filePath,
        batch_size=5,
        label_name=labelColumn,
        na_value="?",
        num_epochs=1,
        ignore_errors=True,
        **kwargs)
    return dataset

rawTrainData = getDataset(trainFilePath)
rawTestData = getDataset(testFilePath)

def showBatch(dataset):
    for batch, label in dataset.take(1):
        for key, value in batch.items():
            print("{:20s}:{}".format(key, value.numpy()))
        print("{:20s}:{}".format("label", label.numpy()))

showBatch(rawTrainData)

Output:

sex :[b'male' b'female' b'male' b'male' b'male']
age :[50. 30. 28. 31. 27.]
n_siblings_spouses :[0 0 0 1 0]
parch :[0 0 0 0 0]
fare :[ 13. 106.425 8.4583 52. 8.6625]
class :[b'Second' b'First' b'Third' b'First' b'Third']
deck :[b'unknown' b'unknown' b'unknown' b'B' b'unknown']
embark_town :[b'Southampton' b'Cherbourg' b'Queenstown' b'Southampton' b'Southampton']
alone :[b'y' b'y' b'y' b'n' b'y']
label :[0 1 0 0 1]

3. Data preprocessing

The columns imported from a CSV file may have different data types, so the data must be preprocessed before it is fed to the model. You can preprocess with tools such as sklearn and then pass the result to TensorFlow, or use TensorFlow's built-in tf.feature_column utilities. The advantage of the latter is that if the trained model is saved or shared with others, the preprocessing steps are saved along with it.
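
For comparison, here is a minimal sketch of the sklearn route (assuming scikit-learn is installed; the two columns chosen here are only an illustration, and note that the scaler is fitted outside the model, so it would not be exported with it):

from sklearn.preprocessing import StandardScaler
import pandas as pd

df = pd.read_csv(trainFilePath)   # load the training CSV into a DataFrame
scaler = StandardScaler()         # z-score scaling: (x - mean) / std
df[["age", "fare"]] = scaler.fit_transform(df[["age", "fare"]])
print(df[["age", "fare"]].head())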

3.1 Feature selection: continuous data

selectColumns = ['survived', 'age', 'n_siblings_spouses', 'parch', 'fare']  # analyze only these columns
defaults = [0, 0.0, 0.0, 0.0, 0.0]

tempDataset = getDataset(trainFilePath,
                         select_columns=selectColumns,
                         column_defaults=defaults)
showBatch(tempDataset)

Output:

age :[25. 43. 18. 55.5 47. ]
n_siblings_spouses :[1. 0. 0. 0. 1.]
parch :[2. 0. 0. 0. 1.]
fare :[151.55 8.05 7.7958 8.05 52.5542]
label :[0 0 0 0 1]

example_batch, labels_batch = next(iter(tempDataset))

# Pack all columns into a single tensor
def pack(features, label):
    return tf.stack(list(features.values()), axis=1), label

packed_dataset = tempDataset.map(pack)

for features, labels in packed_dataset.take(1):
    print(features.numpy(), labels.numpy(), sep="\n\n")

Output:

[[28.      1.      0.     14.4542]
 [28.      0.      0.      7.2292]
 [52.      1.      0.     78.2667]
 [48.      1.      2.     65.    ]
 [30.5     0.      0.      8.05  ]]

[0 0 1 1 0]

# A generic preprocessing class: select the given features and pack them into a single column
class PackNumericFeatures(object):
    def __init__(self, names):
        self.names = names

    def __call__(self, features, labels):
        numeric_features = [features.pop(name) for name in self.names]
        numeric_features = [tf.cast(feat, tf.float32) for feat in numeric_features]
        numeric_features = tf.stack(numeric_features, axis=1)
        features["numeric"] = numeric_features
        return features, labels

numericFeatures = ["age", "n_siblings_spouses", "parch", "fare"]

packed_train_data = rawTrainData.map(PackNumericFeatures(numericFeatures))
packed_test_data = rawTestData.map(PackNumericFeatures(numericFeatures))

showBatch(packed_train_data)

Output:

sex :[b'male' b'male' b'male' b'male' b'male']
class :[b'Third' b'Third' b'Third' b'Third' b'Second']
deck :[b'unknown' b'unknown' b'unknown' b'unknown' b'unknown']
embark_town :[b'Cherbourg' b'Southampton' b'Queenstown' b'Cherbourg' b'Southampton']
alone :[b'n' b'y' b'n' b'y' b'y']
numeric             :[[15.      1.      1.      7.2292]
 [25.      0.      0.      7.05  ]
 [ 7.      4.      1.     29.125 ]
 [28.      0.      0.      7.8958]
 [28.      0.      0.     13.    ]]
label :[0 0 0 0 1]

example_batch, labels_batch = next(iter(packed_train_data))

3.2 Normalizing continuous data

Continuous data usually needs to be normalized.

desc = pd.read_csv(trainFilePath)[numericFeatures].describe()
desc

Output: the describe() summary table (count, mean, std, min, 25%, 50%, 75%, max for each of the four numeric features).

mean = np.array(desc.T["mean"])
std = np.array(desc.T["std"])

print(mean, type(mean))
print(std, type(std))

Output:

[29.63130781  0.54545455  0.37958533 34.38539856] <class 'numpy.ndarray'>
[12.51181763  1.1510896   0.79299921 54.5977305 ] <class 'numpy.ndarray'>

def normalization(data, mean, std):
    return (data - mean) / std

normalizer = functools.partial(normalization, mean=mean, std=std)

numeric_column = tf.feature_column.numeric_column(
    "numeric", normalizer_fn=normalizer, shape=[len(numericFeatures)])
numeric_columns = [numeric_column]
numeric_column

Output:

NumericColumn(key='numeric', shape=(4,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function normalization at 0x...>, mean=array([29.63130781,  0.54545455,  0.37958533, 34.38539856]), std=array([12.51181763,  1.1510896 ,  0.79299921, 54.5977305 ])))

example_batch["numeric"]

Output:

<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[28.  ,  1.  ,  0.  , 24.15],
       [51.  ,  0.  ,  0.  ,  8.05],
       [ 6.  ,  0.  ,  1.  , 33.  ],
       [26.  ,  0.  ,  0.  , 10.5 ],
       [16.  ,  0.  ,  0.  , 26.  ]], dtype=float32)>

numeric_layer = tf.keras.layers.DenseFeatures(numeric_columns)
numeric_layer(example_batch).numpy()

Output:

array([[-0.13038135,  0.39488277, -0.4786705 , -0.18746932],
       [ 1.7078807 , -0.47385937, -0.4786705 , -0.4823534 ],
       [-1.888719  , -0.47385937,  0.7823648 , -0.02537466],
       [-0.2902302 , -0.47385937, -0.4786705 , -0.4374797 ],
       [-1.0894746 , -0.47385937, -0.4786705 , -0.15358512]], dtype=float32)
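
As a sanity check, the first passenger's age is 28, and (28 - 29.6313) / 12.5118 ≈ -0.1304, which matches the first value above.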

3.3 Feature selection: categorical data

CATEGORIES = {
    'sex': ['male', 'female'],
    'class': ['First', 'Second', 'Third'],
    'deck': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'embark_town': ['Cherbourg', 'Southampton', 'Queenstown'],  # spelling must match the CSV values
    'alone': ['y', 'n']
}

categorical_columns = []
for feature, vocab in CATEGORIES.items():
    cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
        key=feature, vocabulary_list=vocab)
    categorical_columns.append(tf.feature_column.indicator_column(cat_col))

categorical_columns

Output:

[IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='sex', vocabulary_list=('male', 'female'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='class', vocabulary_list=('First', 'Second', 'Third'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='deck', vocabulary_list=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='embark_town', vocabulary_list=('Cherbourg', 'Southampton', 'Queenstown'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='alone', vocabulary_list=('y', 'n'), dtype=tf.string, default_value=-1, num_oov_buckets=0))]

categorical_layer = tf.keras.layers.DenseFeatures(categorical_columns)
print(categorical_layer(example_batch).numpy()[0])

Output:

[0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0.]
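
DenseFeatures concatenates the feature columns in alphabetical order of their keys, so these 20 values break down as alone (2) + class (3) + deck (10) + embark_town (3) + sex (2). For this passenger they decode to alone='n', class='Third', deck unknown (all zeros, since 'unknown' is not in the vocabulary), embark_town='Queenstown', sex='male'.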

3.4 Merge the continuous and categorical feature layers

preprocessing_layer = tf.keras.layers.DenseFeatures(categorical_columns + numeric_columns)

print(preprocessing_layer(example_batch).numpy()[0])

Output:

[ 0.          1.          0.          0.          1.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          1.
 -0.13038135  0.39488277 -0.4786705  -0.18746932  1.          0.        ]
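
The same alphabetical ordering applies here, with the packed 'numeric' column slotted between embark_town and sex: alone (2) + class (3) + deck (10) + embark_town (3) + numeric (4) + sex (2) = 24 values.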

4. Build the model

model = tf.keras.Sequential([
    preprocessing_layer,
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1)
])

model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer="adam",
              metrics=["accuracy"])
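
Note that the final Dense(1) layer has no activation, so the model outputs logits; that is why the loss is built with from_logits=True, and why tf.sigmoid is applied to the raw predictions in section 5 to turn them into probabilities.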

5. Train and evaluate the model

train_data = packed_train_data.shuffle(500)
test_data = packed_test_data

model.fit(train_data, epochs=20)

Output:

Epoch 1/20
126/126 [==============================] - 0s 3ms/step - loss: 0.4894 - accuracy: 0.7544
Epoch 2/20
126/126 [==============================] - 0s 984us/step - loss: 0.4147 - accuracy: 0.8230
Epoch 3/20
126/126 [==============================] - 0s 921us/step - loss: 0.4002 - accuracy: 0.8309
Epoch 4/20
126/126 [==============================] - 0s 849us/step - loss: 0.3894 - accuracy: 0.8325
Epoch 5/20
126/126 [==============================] - 0s 841us/step - loss: 0.3812 - accuracy: 0.8485
Epoch 6/20
126/126 [==============================] - 0s 833us/step - loss: 0.3729 - accuracy: 0.8341
Epoch 7/20
126/126 [==============================] - 0s 849us/step - loss: 0.3716 - accuracy: 0.8421
Epoch 8/20
126/126 [==============================] - 0s 984us/step - loss: 0.3647 - accuracy: 0.8453
Epoch 9/20
126/126 [==============================] - 0s 770us/step - loss: 0.3472 - accuracy: 0.8501
Epoch 10/20
126/126 [==============================] - 0s 794us/step - loss: 0.3470 - accuracy: 0.8533
Epoch 11/20
126/126 [==============================] - 0s 841us/step - loss: 0.3449 - accuracy: 0.8421
Epoch 12/20
126/126 [==============================] - 0s 873us/step - loss: 0.3360 - accuracy: 0.8485
Epoch 13/20
126/126 [==============================] - 0s 857us/step - loss: 0.3313 - accuracy: 0.8565
Epoch 14/20
126/126 [==============================] - 0s 857us/step - loss: 0.3293 - accuracy: 0.8533
Epoch 15/20
126/126 [==============================] - 0s 873us/step - loss: 0.3236 - accuracy: 0.8644
Epoch 16/20
126/126 [==============================] - 0s 897us/step - loss: 0.3336 - accuracy: 0.8581
Epoch 17/20
126/126 [==============================] - 0s 770us/step - loss: 0.3185 - accuracy: 0.8565
Epoch 18/20
126/126 [==============================] - 0s 778us/step - loss: 0.3118 - accuracy: 0.8596
Epoch 19/20
126/126 [==============================] - 0s 794us/step - loss: 0.3130 - accuracy: 0.8581
Epoch 20/20
126/126 [==============================] - 0s 929us/step - loss: 0.3099 - accuracy: 0.8644

test_loss, test_accuracy = model.evaluate(test_data)

print("\nTest Loss: {}, Test Accuracy: {}".format(test_loss, test_accuracy))

Output:

53/53 [==============================] - 0s 906us/step - loss: 0.4588 - accuracy: 0.8561

Test Loss: 0.45877906680107117, Test Accuracy: 0.8560606241226196

predictions = model.predict(test_data)

for prediction, survived in zip(predictions[:10], list(test_data)[0][1][:10]):
    prediction = tf.sigmoid(prediction).numpy()
    print("Predicted survival: {:.2%}".format(prediction[0]),
          "| Actual outcome:",
          ("survived" if bool(survived) else "died"))

Output:

Predicted survival: 39.23% | Actual outcome: survived
Predicted survival: 99.98% | Actual outcome: survived
Predicted survival: 86.77% | Actual outcome: died
Predicted survival: 79.44% | Actual outcome: survived
Predicted survival: 30.34% | Actual outcome: died
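
As mentioned in section 3, an advantage of tf.feature_column is that the preprocessing lives inside the model, so exporting the model keeps it. A minimal sketch (the directory name is arbitrary; this assumes the standard TF 2 SavedModel workflow applies to this feature-column model):

model.save("titanic_model")  # exports the model together with the preprocessing layer
reloaded = tf.keras.models.load_model("titanic_model")
print(reloaded.evaluate(test_data))  # should reproduce the evaluation above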
